# File I/O: Demo

The raw data contains all users' activity logs from March 30 to May 12, 2017. The data set is too large to read into memory on a PC, and most of it is not needed to decide whether a user has churned. For churn labeling and user activity analysis, only the user, device, and date/time of each activity log need to be kept.

This demo walks through the file operations, from unzipping the original .tar.gz archives to saving the reduced information from all logs into a new file.

### Procedure:

    1. Unzip all the "play" activity data in a batch.
    
    2. Choose a cut-off date for the churn labeling: Any user who is active before this cut-off date but has no activity after the date shall be labeled as a churn.
    
    3. Read each line, keep only "user_id", "device", and "date/time" of each activity log, and save the logs before and after the cut-off date in two separate new files.

##### 1. Unzip the 400+ .tar.gz compressed files of raw data

7z can only unzip the .tar.gz files to .tar, so the complete unzip takes two steps:
    
    1. Batch unzip from .tar.gz --> .tar
    2. Batch unzip from .tar --> log files

In [1]:
## In Windows PowerShell, run the following loop commands in the raw data directory:
'''
$files = Get-ChildItem "../data/raw/" -Filter *_play.log.tar.gz

foreach ($f in $files) {7z e $f -o"../data/raw/unzip"}

cd unzip

$files = Get-ChildItem "../raw/unzip" -Filter *_play.log.tar

foreach ($f in $files) {7z e $f}
'''

'\n$files = Get-ChildItem "../data/raw/" -Filter *_play.log.tar.gz\n\nforeach ($f in $files) {7z e $f -so | 7z e -aoa -si -ttar -o"../data/raw/unzip"}\n'

References:

__[for loop in windows powershell](https://stackoverflow.com/questions/18847145/loop-through-files-in-a-directory-using-powershell)__


__[Unzip .tar.gz files using 7z commands](https://stackoverflow.com/questions/1359793/programmatically-extract-tar-gz-in-a-single-step-on-windows-with-7zip)__
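
As an alternative to the 7z route, Python's standard `tarfile` module can remove the gzip and tar layers in one pass. This is only a sketch and not what was run for this demo: the function name `extract_play_logs` and its two directory arguments are illustrative assumptions, and `extractall` should only be used on archives you trust.

```python
import glob
import os
import tarfile


def extract_play_logs(raw_dir, out_dir):
    """Extract every *_play.log.tar.gz archive in raw_dir into out_dir.

    Returns the number of archives processed.
    (Hypothetical helper, not part of the original notebook.)
    """
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    archives = sorted(glob.glob(os.path.join(raw_dir, '*_play.log.tar.gz')))
    for archive in archives:
        # Mode 'r:gz' reads through the gzip layer and the tar layer together.
        with tarfile.open(archive, 'r:gz') as tar:
            tar.extractall(path=out_dir)
    return len(archives)
```

A call such as `extract_play_logs('../data/raw', '../data/raw/unzip')` would mirror the directory layout used in the PowerShell commands.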

##### 2. Cut-off date = April 21st.

In this demo, the first file (3/31) falls before the cut-off date and the second file (5/12) falls after it. The full data set covers about three weeks before the cut-off and three weeks after. The cut-off may need to be changed later, depending on the model's performance.
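
Because each log file name starts with its date, the before/after split can be decided from the name alone. The sketch below assumes that `YYYYMMDD` prefix and assumes that the cut-off day itself counts as "after"; `file_date` and `split_by_cutoff` are hypothetical helper names, and the paths are expected to use the operating system's own separator.

```python
import os
from datetime import datetime

CUTOFF = datetime(2017, 4, 21)  # cut-off date chosen for this demo


def file_date(path):
    """Parse the leading YYYYMMDD of a name like 20170331_1_play.log."""
    return datetime.strptime(os.path.basename(path)[:8], '%Y%m%d')


def split_by_cutoff(paths, cutoff=CUTOFF):
    """Return (before, after) lists of paths.

    Assumption: a file dated exactly on the cut-off goes into 'after'.
    """
    before = [p for p in paths if file_date(p) < cutoff]
    after = [p for p in paths if file_date(p) >= cutoff]
    return before, after
```

Keeping the cut-off as a parameter makes it cheap to try a different date later if the model's performance calls for it.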

##### 3. File operations on the play logs

    1. Open *play.log files one by one
    
    2. Read a line, and save ONLY the user id and device type info. 
    
    3. Do this for all lines in all files. Write the saved info into a file called 'play_lite.log'


    1. Open all play logs, using * wildcard

In [1]:
import glob

filepath = 'C:\\Users\\Sean\\Documents\\BitTiger\\Capston_music_player_python\\*play.log'
files = glob.glob(filepath)
len(files)

2

In [2]:
files

['C:\\Users\\Sean\\Documents\\BitTiger\\Capston_music_player_python\\20170331_1_play.log',
 'C:\\Users\\Sean\\Documents\\BitTiger\\Capston_music_player_python\\20170512_3_play.log']

In [3]:
log_amounts = []

In [4]:
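for the_file in files:
    # The with-block closes the file even if reading fails.
    with open(the_file, 'r') as f:
        # Count lines lazily instead of loading the whole file into memory.
        log_amounts.append(sum(1 for _ in f))
log_amounts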

[1865402, 821673]

    2. Read the files, append the date to each line

In [5]:
with open('C:\\Users\\Sean\\Documents\\BitTiger\\Capston_music_player_python\\20170512_3_play.log','r') as f:
    content = f.readlines()
len(content)

821673

In [6]:
first_line = content[0]
first_line

'42159585\tar\t106148\t1\t\xe6\x98\x8e\xe5\xa4\xa9\xe4\xbd\xa0\xe6\x98\xaf\xe5\x90\xa6\xe4\xbe\x9d\xe7\x84\xb6\xe7\x88\xb1\xe6\x88\x91\t\xe7\xab\xa5\xe5\xae\x89\xe6\xa0\xbc\t256128\t0\t0\n'

In [53]:
first_line_fields = content[0].strip('\n').split('\t')
#first_line_fields.append(f.name.split('\\')[-1][8])
first_line_fields

['42159585',
 'ar',
 '106148',
 '1',
 '\xe6\x98\x8e\xe5\xa4\xa9\xe4\xbd\xa0\xe6\x98\xaf\xe5\x90\xa6\xe4\xbe\x9d\xe7\x84\xb6\xe7\x88\xb1\xe6\x88\x91',
 '\xe7\xab\xa5\xe5\xae\x89\xe6\xa0\xbc',
 '256128',
 '0',
 '0']

In [58]:
reduced_fields = first_line_fields[:2]

reduced_fields

['42159585', 'ar']

In [59]:
filename = f.name.split('\\')[-1]

reduced_fields.append(filename)

reduced_fields

['42159585', 'ar', '20170331_1_play.log']

In [60]:
new_first_line = '\t'.join(reduced_fields)+'\n'
new_first_line

'42159585\tar\t20170331_1_play.log\n'
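
The interactive steps above can be folded into one small function. This is a minimal sketch: `reduce_line` is a hypothetical name, it appends the 8-digit date rather than the full file name, and the song title and artist in the sample line are replaced by placeholders.

```python
def reduce_line(line, date):
    """Keep user id and device from a tab-separated log line, then append the date.

    Raises ValueError if the line has fewer than two fields.
    """
    user_id, device = line.rstrip('\n').split('\t')[:2]
    return '\t'.join([user_id, device, date]) + '\n'


# Sample line shaped like the play log; 'title' and 'artist' are placeholders.
sample = '42159585\tar\t106148\t1\ttitle\tartist\t256128\t0\t0\n'
print(repr(reduce_line(sample, '20170512')))
# prints '42159585\tar\t20170512\n'
```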

    3. Write the new lines into a new file

        Using the f.write() method

In [61]:
output = open('C:\\Users\\Sean\\Documents\\BitTiger\\Capston_music_player_python\\output\\all.log','a')

In [62]:
import time


for the_file in files:
    start_time = time.time()  # time.clock() was removed in Python 3.8

    with open(the_file, 'r') as f:
        lines = f.readlines()
        filename = f.name.split('\\')[-1]
        print('processing file: %s' % filename)
        for line in lines:
            # Keep only user id and device, then tag the line with its source file.
            contents_to_keep = line.split('\t')[:2]
            contents_to_keep.append(filename)
            output.write('\t'.join(contents_to_keep)+'\n')
    print('...costs %.2f seconds' % (time.time()-start_time))


processing file: 20170331_1_play.log
...costs 5.75 seconds
processing file: 20170512_3_play.log
...costs 2.43 seconds


In [63]:
output.close()
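
Since the full data set does not fit in memory, the same reduction can be done by iterating over each file object instead of calling `readlines()`, so only one line is held at a time. The sketch below is one way to do that; `write_reduced` is a hypothetical name, the date is assumed to be the first 8 characters of the file name, and on Python 3 the `open` calls may need `encoding='utf-8'` for these logs.

```python
import os


def write_reduced(in_paths, out_path):
    """Stream each log into out_path, keeping user id, device and file date.

    Returns the number of lines written.
    """
    written = 0
    with open(out_path, 'a') as out:
        for path in in_paths:
            date = os.path.basename(path)[:8]  # assumes a YYYYMMDD name prefix
            with open(path, 'r') as f:
                for line in f:  # the file object yields one line at a time
                    fields = line.rstrip('\n').split('\t')[:2]
                    out.write('\t'.join(fields + [date]) + '\n')
                    written += 1
    return written
```

The returned count can be compared with `sum(log_amounts)` as a quick check that no line was dropped.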

In [64]:
with open('C:\\Users\\Sean\\Documents\\BitTiger\\Capston_music_player_python\\output\\all.log','r') as output:
    lines = output.readlines()
len(lines)

2687075

In [13]:
sum(log_amounts)

2687075

    4. Save the user_ids into sets
    
        Delete the all.log created above first: the cell below rebuilds it in append mode while collecting the user_ids in the same pass.

In [14]:
list_of_sets = []
# For each day's data, collect the active users' user_ids into a set.
# The built-in set replaces the deprecated sets.Set.

In [15]:
with open('C:\\Users\\Sean\\Documents\\BitTiger\\Capston_music_player_python\\output\\all.log','a') as output:
    for the_file in files:
        with open(the_file, 'r') as f:
            lines = f.readlines()
            list_of_sets.append(set(line.split('\t')[0] for line in lines))
            date = f.name.split('\\')[-1][:8]  # YYYYMMDD prefix of the file name
            for line in lines:
                contents = line.strip('\n').split('\t')
                contents.append(date)
                output.write('\t'.join(contents)+'\n')

In [16]:
[len(each_set) for each_set in list_of_sets]

[64004, 24520]

    5. Churn labeling and file saving
    
        Save the user_id of churns into a new file.

In [17]:
active_before, active_after = list_of_sets[0],list_of_sets[1]

In [18]:
# set method: s.intersection(t) returns a new set with the items common to s and t
loyal_users = active_before.intersection(active_after)
len(loyal_users)

13

In [19]:
# set method: s.difference(t) returns a new set with the items in s but not in t
churn = active_before.difference(active_after)
len(churn)

63991

In [20]:
new_users = active_after.difference(active_before)
len(new_users)

24507
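
The three groups can be produced together, and their sizes give a quick sanity check: loyal plus churn must equal the number of users active before the cut-off (13 + 63991 = 64004), and loyal plus new must equal the number active after it (13 + 24507 = 24520). The sketch below uses toy user ids, and `label_users` is a hypothetical helper name.

```python
def label_users(active_before, active_after):
    """Split users into loyal, churn and new groups with set operations."""
    before, after = set(active_before), set(active_after)
    return {
        'loyal': before & after,   # active in both periods
        'churn': before - after,   # active before, silent after
        'new': after - before,     # first seen after the cut-off
    }


# Toy ids only; the real sets come from the play logs.
groups = label_users(['1', '2', '3'], ['3', '4'])
# loyal and churn together account for every "before" user
assert len(groups['loyal']) + len(groups['churn']) == 3
# loyal and new together account for every "after" user
assert len(groups['loyal']) + len(groups['new']) == 2
```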

In [21]:
# Use loyal_users as an example

with open('C:\\Users\\Sean\\Documents\\BitTiger\\Capston_music_player_python\\loyal.log','a') as loyal_file:
    loyal_file.write('\n'.join(list(loyal_users))+'\n')

In [22]:
# check loyal_file

with open('C:\\Users\\Sean\\Documents\\BitTiger\\Capston_music_player_python\\loyal.log','r') as loyal_file:
    lines = loyal_file.readlines()

lines

['1749320\n',
 '1685126\n',
 '533817\n',
 '167986594\n',
 '167748297\n',
 '167892678\n',
 '37025504\n',
 '167826179\n',
 '0\n',
 '155948236\n',
 '20090948\n',
 '751824\n',
 '168042334\n']

In [24]:
the_file = files[1]
the_file

'C:\\Users\\Sean\\Documents\\BitTiger\\Capston_music_player_python\\20170512_3_play.log'

In [28]:
import time

start_time = time.time()  # time.clock() was removed in Python 3.8
date = the_file.split('\\')[-1][:8]
print("processing "+date)
with open(the_file, 'r') as f:
    lines = f.readlines()
    # Flag each line whose user id (first field) belongs to a loyal user.
    loyal_flags = [line.strip('\n').split('\t')[0] in loyal_users for line in lines]
    loyal_indices = [i for i, flag in enumerate(loyal_flags) if flag]
    # Append the date and a '1' label to every loyal user's line.
    loyal_lines = [lines[i].strip('\n')+'\t'+date+'\t'+'1'+'\n' for i in loyal_indices]
    # Write to a separate output file
    with open('C:\\Users\\Sean\\Documents\\BitTiger\\Capston_music_player_python\\output\\loyal.log', 'a') as output:
        output.writelines(loyal_lines)
print('...costs %.2f minutes' % ((time.time()-start_time)/60.0))

processing 20170512
...costs 0.03 minutes


In [29]:
with open('C:\\Users\\Sean\\Documents\\BitTiger\\Capston_music_player_python\\output\\loyal.log','r') as loyal_file:
    lines = loyal_file.readlines()

len(lines)

287286
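
The flag, index and line lists in the cell above can also be collapsed into a single streaming pass. The generator below is a sketch under the same assumptions as that cell (user id in the first tab-separated field, date and a `'1'` label appended); `iter_loyal_lines` is a hypothetical name.

```python
def iter_loyal_lines(path, loyal_users, date):
    """Yield the lines of path whose user id is in loyal_users.

    Each yielded line gets the date and a '1' label appended.
    """
    with open(path, 'r') as f:
        for line in f:
            body = line.rstrip('\n')
            # The user id is the first tab-separated field.
            if body.split('\t', 1)[0] in loyal_users:
                yield body + '\t' + date + '\t1\n'
```

Passing the generator straight to `output.writelines(...)` avoids building the intermediate lists.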

In [30]:
my_dict = {'a':1,'b':2, 'c':3}

In [31]:
my_dict['b']

2

In [36]:
my_dict.get('a'),my_dict.get('d')

(1, None)

In [37]:
alpha = ['c','d','a','f']

In [38]:
[my_dict.get(char) for char in alpha]

[3, None, 1, None]