# Data Preparation: Churn labeling and Downsampling

    1. Read the *play.log files line by line, and write only the user ID, device and date of log into a separate
    file.
    
    2. Label churn users: those who played more than three times before the cutoff day but had no acitivity after
    the cutoff.
    
    3. Down sampling is necessary. There are more than 500,000 users and 15 GB log data, which is not necessary for
    the churn prediction and which costs too much time to process. I used  a down sampling ratio of 1/10, to only
    include 1/10 users from the active and churn users for the churn prediction model.

### 1. Churn labeling

In [1]:
import glob

In [2]:
filepath = 'C:\\Users\\Sean\\Documents\\BitTiger\\Capston_music_player_python\\data\\play\\*play.log'
files = glob.glob(filepath)
# amount of files
len(files)

127

In [3]:
# take a look at one of the files
files[0]

'C:\\Users\\Sean\\Documents\\BitTiger\\Capston_music_player_python\\data\\play\\20170330_3_play.log'

In [7]:
# get an idea how many lines are in one .log file
with open(files[0],'r') as f:
    lines = f.readlines()
    log_lines = len(lines)
log_lines

4760394

In [13]:
lines[1]

'168540455\tar\t298250\t0\tSee You Again\tMiley Cyrus\t189\t190\t0\n'

In [14]:
test_list = lines[1].strip('\n').split('\t')
test_list.append(files[0].split('\\')[-1])
test_list

['168540455',
 'ar',
 '298250',
 '0',
 'See You Again',
 'Miley Cyrus',
 '189',
 '190',
 '0',
 '20170330_3_play.log']

In [17]:
import pandas as pd

schema = ['uid','device','song_id','song_type','song_name','singer','play_time','song_length','paid_flag','file_name']
df = pd.read_csv('C:\\Users\\Sean\\Documents\\BitTiger\\Capston_music_player_python\\data\\play\\20170330_3_play.log',\
                 delimiter='\t',header=None,index_col=None,names=schema)
df.head()

  interactivity=interactivity, compiler=compiler, result=result)


Unnamed: 0,uid,device,song_id,song_type,song_name,singer,play_time,song_length,paid_flag,file_name
0,168550892.0,ar,23491655,0.0,裂心,林志炫,254,254,0.0,
1,168540455.0,ar,298250,0.0,See You Again,Miley Cyrus,189,190,0.0,
2,168551247.0,ar,11881432,0.0,一梦浮生,颜淡妖妖,78,149,0.0,
3,168549788.0,ip,295469,0.0,凡人歌,高胜美,16,242,0.0,
4,168551248.0,ip,21393368,0.0,重生（Intro）,幼稚园杀手,87,87,0.0,


In [22]:
df.file_name = files[0].split('\\')[-1]
df.head()

Unnamed: 0,uid,device,song_id,song_type,song_name,singer,play_time,song_length,paid_flag,file_name
0,168550892.0,ar,23491655,0.0,裂心,林志炫,254,254,0.0,20170330_3_play.log
1,168540455.0,ar,298250,0.0,See You Again,Miley Cyrus,189,190,0.0,20170330_3_play.log
2,168551247.0,ar,11881432,0.0,一梦浮生,颜淡妖妖,78,149,0.0,20170330_3_play.log
3,168549788.0,ip,295469,0.0,凡人歌,高胜美,16,242,0.0,20170330_3_play.log
4,168551248.0,ip,21393368,0.0,重生（Intro）,幼稚园杀手,87,87,0.0,20170330_3_play.log


In [37]:
df.loc[df.paid_flag>0]

Unnamed: 0,uid,device,song_id,song_type,song_name,singer,play_time,song_length,paid_flag,file_name
25285,37025504.0,ar,13068919.0,0.0,A Theme of ONE PUNCH MAN~正義執行~,"宮崎誠""",\t,\t,128.0,20170330_3_play.log
238591,168625846.0,ar,5989230.0,0.0,153.,●生活习惯,分级加字幕轻松练听力,2,3.0,20170330_3_play.log
238754,168625846.0,ar,5989230.0,0.0,153.,●生活习惯,分级加字幕轻松练听力,2,3.0,20170330_3_play.log
238850,168625846.0,ar,5989230.0,0.0,153.,●生活习惯,分级加字幕轻松练听力,2,3.0,20170330_3_play.log
238879,168625846.0,ar,5989230.0,0.0,153.,●生活习惯,分级加字幕轻松练听力,0,3.0,20170330_3_play.log
239279,168625846.0,ar,5989080.0,0.0,2.,●从起床到出门,分级加字幕轻松练听力,10,9.0,20170330_3_play.log
239422,168625846.0,ar,5989230.0,0.0,153.,●生活习惯,分级加字幕轻松练听力,2,3.0,20170330_3_play.log
239588,168625846.0,ar,5989080.0,0.0,2.,●从起床到出门,分级加字幕轻松练听力,2,9.0,20170330_3_play.log
288398,168667632.0,ar,6916310.0,0.0,多少的爱都不要(???? ????????,),Ten Nararak,168,415.0,20170330_3_play.log
289645,168667632.0,ar,6916310.0,0.0,多少的爱都不要(???? ????????,),Ten Nararak,0,415.0,20170330_3_play.log


In [50]:
def to_num(s):
    try:
        return float(s)
    except ValueError:
        return s
    
[to_num(string) for string in [12.0,'1.234','I-love-coding']]

[12.0, 1.234, 'I-love-coding']

In [52]:
def is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        return False
    
def not_number(s):
    try:
        float(s)
        return False
    except ValueError:
        return True
    
df.loc[df.play_time.apply(not_number)]

Unnamed: 0,uid,device,song_id,song_type,song_name,singer,play_time,song_length,paid_flag,file_name
22,168551408.0,ar,15750838,2.0,惊雷,MC王小五,7>(123.138.230.80)TM,0,0.0,20170330_3_play.log
199,168546522.0,ar,5820473,2.0,Out Of The Woods,Taylor Swift,23>(113.68.113.130)TM,256,0.0,20170330_3_play.log
243,168544206.0,ar,23134997,2.0,从嘲笑到震撼!不能更赞!有梦想就要坚持,,71>=(223.104.12.211)TM,157,0.0,20170330_3_play.log
1223,168547035.0,ar,2835290,2.0,江南Style(纽约版),Psy,111>=(223.85.150.140)TM,0,0.0,20170330_3_play.log
1639,168551256.0,ar,1952511,2.0,Feels So Good (Tristan Garner Remix) (feat. Na...,Armin van Buuren,4>}(113.82.176.86)TM,371,0.0,20170330_3_play.log
2027,168544206.0,ar,22399294,2.0,一群火烈鸟走迷路了，仿佛看到了路痴的我。。。这特么是哪儿？,热门短片,43>}(223.104.12.211)TM,80,0.0,20170330_3_play.log
2476,168551256.0,ar,3462716,2.0,Feat Sophi,Armin van Buuren,0>}(113.82.176.86)TM,219,0.0,20170330_3_play.log
2724,168550027.0,ar,2834716,2.0,城门,陈浩民,0>}(113.200.107.141)TM,240,0.0,20170330_3_play.log
3082,168544206.0,ar,22943011,2.0,哈哈哈谁都别拦我!这个泰坦尼克的COS我要给满分!,,5>(223.104.12.211)TM,32,0.0,20170330_3_play.log
3746,168551619.0,ar,6661512,2.0,男神版《从前的我》回归那份感动,,5>}(113.82.176.86)TM,0,0.0,20170330_3_play.log


In [4]:
# 04/22 is the cutoff date for labeling churns
cutoff = '20170422'

##### Save reduced play logs to two log files.

Only the first two items of each line, user id and device, and the date of the log are saved, so it's called reduced play logs.

In [5]:
# destination file names to save the reduced logs.
first_period_log = 'C:\\Users\\Sean\\Documents\\BitTiger\\Capston_music_player_python\\data\\play_till_0421.log'
second_period_log = 'C:\\Users\\Sean\\Documents\\BitTiger\\Capston_music_player_python\\data\\play_till_0512.log'

In [34]:
import time

for each_file in files:
    current_time = time.clock()

    with open(each_file, 'r') as f:
        lines = f.readlines()
        filename = f.name.split('\\')[-1]
        print('processing file: %s' % filename)
        #choose the output path
        if filename < cutoff:
            output_path = first_period_log
        else:
            output_path = second_period_log
        # write to the output file
        with open(output_path, 'a') as output:
            for line in lines:
                contents_to_keep = line.split('\t')[:2]
                contents_to_keep.append(filename)
                output.write('\t'.join(contents_to_keep)+'\n')
    print('...costs %.2f seconds' % (time.clock()-current_time))
    current_time = time.clock()

processing file: 20170330_3_play.log
...costs 10.36 seconds
processing file: 20170331_1_play.log
...costs 5.20 seconds
processing file: 20170331_2_play.log
...costs 3.71 seconds
processing file: 20170331_3_play.log
...costs 8.27 seconds
processing file: 20170401_1_play.log
...costs 5.08 seconds
processing file: 20170401_2_play.log
...costs 5.43 seconds
processing file: 20170401_3_play.log
...costs 6.07 seconds
processing file: 20170402_1_play.log
...costs 5.55 seconds
processing file: 20170402_2_play.log
...costs 4.91 seconds
processing file: 20170402_3_play.log
...costs 5.78 seconds
processing file: 20170403_1_play.log
...costs 5.41 seconds
processing file: 20170403_2_play.log
...costs 5.11 seconds
processing file: 20170404_1_play.log
...costs 4.93 seconds
processing file: 20170404_2_play.log
...costs 4.10 seconds
processing file: 20170404_3_play.log
...costs 0.65 seconds
processing file: 20170405_1_play.log
...costs 0.13 seconds
processing file: 20170405_2_play.log
...costs 3.70 seco

In [52]:
import pandas as pd

schema = ['uid','device','file_name']
df_1 = pd.read_csv(first_period_log,delimiter='\t',header=None,index_col=None,names=schema, dtype = {'uid':'str'})
df_1.head()

Unnamed: 0,uid,device,file_name
0,168550892,ar,20170330_3_play.log
1,168540455,ar,20170330_3_play.log
2,168551247,ar,20170330_3_play.log
3,168549788,ip,20170330_3_play.log
4,168551248,ip,20170330_3_play.log


In [100]:
df_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82001784 entries, 0 to 82001783
Data columns (total 3 columns):
uid          object
device       object
file_name    object
dtypes: object(3)
memory usage: 1.8+ GB


##### Label active users in the first period.

    Criteria: number of activities before the cutoff > 3

In [60]:
df_1.uid.value_counts()

1685126      3617414
37025504     2613945
751824       2097443
1791497      1888301
497685       1393958
1062806      1164170
736305        856422
0             568498
1749320       372206
1679121       244088
46532274      233379
28638487      200651
637650        108488
155948236      97851
32166203       74229
533817         74064
64268006       65829
168127634      35003
22730453       31866
16517426       31675
32104145       29951
924065         29727
168922214      29260
167982849      28483
27954505       24917
1710083        23176
165313253      21655
168674332      17807
168751924      17717
168334631      16474
              ...   
167731479          1
168763247          1
168859231          1
168804426          1
168761136          1
169022327          1
168969919          1
168546198          1
168444376          1
168546199          1
168801780          1
168801781          1
168001689          1
167931183          1
168791070          1
168934216          1
168791075    

In [101]:
# total number of active users and inactive users
active = df_1.uid.value_counts()>3
sum(active),sum(active==0)

(472632, 91992)

In [65]:
active_users = [active.index[i] for i in xrange(len(active)) if active[i]]
active_users[:5]

['1685126', '37025504', '751824', '1791497', '497685']

In [67]:
active_set_1 = set(active_users)

In [68]:
df_2 = pd.read_csv(second_period_log,delimiter='\t',header=None,index_col=None,names=schema, dtype = {'uid':'str'})
df_2.head()

Unnamed: 0,uid,device,file_name
0,751824,ar,20170422_1_play.log
1,167806444,ar,20170422_1_play.log
2,28638487,ar,20170422_1_play.log
3,167854540,ar,20170422_1_play.log
4,1685126,ar,20170422_1_play.log


In [102]:
df_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56753407 entries, 0 to 56753406
Data columns (total 3 columns):
uid          object
device       object
file_name    object
dtypes: object(3)
memory usage: 1.3+ GB


In [70]:
active_2 = df_2.uid.value_counts()
active_2[:5]

1685126     3531544
37025504    2570137
751824      2221580
497685      1518085
1791497     1325354
Name: uid, dtype: int64

In [71]:
active_set_2 = set(active_2.index)
len(active_set_2)

272501

In [73]:
churn_set = active_set_1 - active_set_2
len(churn_set)

238889

In [74]:
good_set = active_set_1 & active_set_2
len(good_set)

233743

### Down Sampling

In [80]:
import random

random.seed(42)

In [81]:
good_sample = random.sample(good_set,len(good_set)/10)
len(good_sample)

23374

In [82]:
churn_sample = random.sample(churn_set,len(churn_set)/10)
len(churn_sample)

23888

In [144]:
df_churn_log = df_1.loc[df_1.uid.isin(churn_sample),:]

In [90]:
df_churn_log.shape

(1507639, 3)

In [91]:
df_good_log = df_1.loc[df_1.uid.isin(good_sample),:]

In [92]:
df_good_log.shape

(7577773, 3)

In [145]:
df_churn_log.head()

Unnamed: 0,uid,device,file_name
23,168551179,ar,20170330_3_play.log
34,168551221,ar,20170330_3_play.log
45,168551397,ar,20170330_3_play.log
66,168549013,ip,20170330_3_play.log
76,168543049,ar,20170330_3_play.log


#####  play with the logs, and create features for churn prediction
    
    just a few examples

In [103]:
df_good_log.groupby(['uid','file_name']).size()

uid        file_name          
100415077  20170331_1_play.log     39
           20170402_1_play.log      3
           20170403_1_play.log     13
           20170408_1_play.log     31
           20170411_1_play.log     13
           20170415_1_play.log     32
           20170417_1_play.log      7
           20170418_1_play.log     66
           20170419_1_play.log      1
           20170420_1_play.log     27
100549339  20170331_2_play.log     62
           20170401_2_play.log     19
           20170402_2_play.log     13
           20170403_2_play.log     54
           20170404_2_play.log     14
           20170405_2_play.log     23
           20170406_2_play.log     23
           20170407_2_play.log     27
           20170408_2_play.log      7
           20170409_2_play.log      9
           20170410_2_play.log      2
           20170412_2_play.log      9
           20170413_2_play.log      9
           20170414_2_play.log      2
           20170415_2_play.log     21
           20170416

In [105]:
df_daily_good = df_good_log.groupby(['uid','file_name']).size().to_frame(name = 'count').reset_index()
df_daily_good.head()

Unnamed: 0,uid,file_name,count
0,100415077,20170331_1_play.log,39
1,100415077,20170402_1_play.log,3
2,100415077,20170403_1_play.log,13
3,100415077,20170408_1_play.log,31
4,100415077,20170411_1_play.log,13


In [107]:
df_device_good = df_good_log.groupby(['uid','device']).size().to_frame(name = 'count').reset_index()
df_device_good.head()

Unnamed: 0,uid,device,count
0,100415077,ar,232
1,100549339,ar,534
2,101372876,ar,328
3,102311569,ar,359
4,102393511,ar,356


In [157]:
import numpy as np

df_churn_log['date'] = df_churn_log.file_name.str.split('_').apply(lambda x: x[0])
df_churn_log.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until


Unnamed: 0,uid,device,file_name,date
23,168551179,ar,20170330_3_play.log,20170330
34,168551221,ar,20170330_3_play.log,20170330
45,168551397,ar,20170330_3_play.log,20170330
66,168549013,ip,20170330_3_play.log,20170330
76,168543049,ar,20170330_3_play.log,20170330


In [161]:
df_churn_log.loc[:,'label'] = 1
df_churn_log.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = _infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


Unnamed: 0,uid,device,file_name,date,label
23,168551179,ar,20170330_3_play.log,20170330,1
34,168551221,ar,20170330_3_play.log,20170330,1
45,168551397,ar,20170330_3_play.log,20170330,1
66,168549013,ip,20170330_3_play.log,20170330,1
76,168543049,ar,20170330_3_play.log,20170330,1


In [167]:
df_good_log['date'] = df_good_log.file_name.str.split('_').apply(lambda x: x[0])
df_good_log.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


Unnamed: 0,uid,device,file_name,date
61,168534518,ar,20170330_3_play.log,20170330
80,751824,ar,20170330_3_play.log,20170330
99,168551186,ar,20170330_3_play.log,20170330
129,168548840,ip,20170330_3_play.log,20170330
135,168547123,ar,20170330_3_play.log,20170330


In [168]:
df_good_log.loc[:,'label'] = 0
df_good_log.head()

Unnamed: 0,uid,device,file_name,date,label
61,168534518,ar,20170330_3_play.log,20170330,0
80,751824,ar,20170330_3_play.log,20170330,0
99,168551186,ar,20170330_3_play.log,20170330,0
129,168548840,ip,20170330_3_play.log,20170330,0
135,168547123,ar,20170330_3_play.log,20170330,0


In [169]:
df_complete = pd.concat([df_good_log,df_churn_log])
df_complete.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9085412 entries, 61 to 82001380
Data columns (total 5 columns):
uid          object
device       object
file_name    object
date         object
label        int64
dtypes: int64(1), object(4)
memory usage: 415.9+ MB


In [170]:
df_complete.label.unique()

array([0, 1], dtype=int64)

In [171]:
df_complete.date.unique()

array(['20170330', '20170331', '20170401', '20170402', '20170403',
       '20170404', '20170405', '20170406', '20170407', '20170408',
       '20170409', '20170410', '20170411', '20170412', '20170413',
       '20170414', '20170415', '20170416', '20170417', '20170418',
       '20170419', '20170420', '20170421'], dtype=object)

##### paid_flag
    Take care of user who PAY !!!

In [75]:
import numpy as np
import pandas as pd
schema = ['uid','device','song_id','song_type','song_name','singer','play_time','song_length','paid_flag','file_name']
df_paid_user = pd.DataFrame(columns = schema)
df_paid_user.head()

Unnamed: 0,uid,device,song_id,song_type,song_name,singer,play_time,song_length,paid_flag,file_name


In [None]:
with open('t.ini') as f:
    for line in f:

In [76]:
import time

current_time = time.clock()
for each_file in files:
    filename = each_file.split('\\')[-1]
    print('processing file: %s' % filename)
    df_temp = pd.read_csv(each_file,delimiter='\t',header=None,index_col=None,names=schema)
    df_temp.file_name = filename
    df_temp = df_temp.loc[df_temp.paid_flag>0]
    df_paid_user = df_paid_user.append(df_temp)
    print('...costs %.2f seconds' % (time.clock()-current_time))
    current_time = time.clock()
    
df_paid_user.shape

processing file: 20170330_3_play.log
...costs 13.74 seconds
processing file: 20170331_1_play.log
...costs 5.27 seconds
processing file: 20170331_2_play.log
...costs 3.61 seconds
processing file: 20170331_3_play.log
...costs 8.49 seconds
processing file: 20170401_1_play.log
...costs 7.32 seconds
processing file: 20170401_2_play.log
...costs 7.08 seconds
processing file: 20170401_3_play.log
...costs 17.15 seconds
processing file: 20170402_1_play.log
...costs 13.81 seconds
processing file: 20170402_2_play.log
...costs 12.03 seconds
processing file: 20170402_3_play.log
...costs 25.76 seconds
processing file: 20170403_1_play.log
...costs 25.05 seconds
processing file: 20170403_2_play.log
...costs 31.32 seconds
processing file: 20170404_1_play.log
...costs 28.06 seconds
processing file: 20170404_2_play.log
...costs 10.32 seconds
processing file: 20170404_3_play.log
...costs 1.33 seconds
processing file: 20170405_1_play.log
...costs 0.69 seconds
processing file: 20170405_2_play.log
...costs 1

(133439, 10)

In [59]:
df_temp = df_temp.loc[df_temp.paid_flag>0]
df_temp.shape

(10, 10)

In [65]:
df_paid_user.head()

Unnamed: 0,uid,device,song_id,song_type,song_name,singer,play_time,song_length,paid_flag,file_name
52804,168533000.0,ar,6916310.0,0,多少的爱都不要(???? ????????,),Ten Nararak,35,415,0
285061,168905000.0,ip,6916310.0,0,多少的爱都不要（???? ????????,）,Ten Nararak,415,415,0
285083,168905000.0,ip,6916310.0,0,多少的爱都不要（???? ????????,）,Ten Nararak,415,415,0
315778,168681000.0,ip,6916310.0,0,多少的爱都不要（???? ????????,）,Ten Nararak,415,415,0
324711,168757000.0,ar,6916310.0,0,多少的爱都不要(???? ????????,),Ten Nararak,414,415,0


In [95]:
df_check = df_paid_user.loc[df_paid_user.play_time.apply(is_number) & df_paid_user.play_time.notnull()]
df_check.paid_flag.value_counts()

0    129843
Name: paid_flag, dtype: int64

In [97]:
df_paid_user.device.value_counts()

ar           116180
ip            17258
168589573         1
Name: device, dtype: int64

In [98]:
del df_paid_user

##### Conclusion: no real paid users

##### check non-iphone or android users

In [102]:
df_temp.loc[df_temp.device.apply(lambda x: x!= 'ar' and x != 'ip')]

Unnamed: 0,uid,device,song_id,song_type,song_name,singer,play_time,song_length,paid_flag,file_name


In [5]:
import numpy as np
import pandas as pd
import time

schema = ['uid','device','song_id','song_type','song_name','singer','play_time','song_length','paid_flag','file_name']

df_device = pd.DataFrame(columns = schema)
current_time = time.clock()
for each_file in files:
    filename = each_file.split('\\')[-1]
    print('processing file: %s' % filename)
    df_temp = pd.read_csv(each_file,delimiter='\t',header=None,index_col=None,names=schema)
    df_temp.file_name = filename
    df_temp = df_temp.loc[df_temp.device.apply(lambda x: x!= 'ar' and x != 'ip')]
    df_device = df_device.append(df_temp)
    print('...costs %.2f seconds' % (time.clock()-current_time))
    current_time = time.clock()
    
df_device.shape

processing file: 20170330_3_play.log


  interactivity=interactivity, compiler=compiler, result=result)


...costs 11.21 seconds
processing file: 20170331_1_play.log


  interactivity=interactivity, compiler=compiler, result=result)


...costs 4.42 seconds
processing file: 20170331_2_play.log
...costs 3.28 seconds
processing file: 20170331_3_play.log
...costs 7.24 seconds
processing file: 20170401_1_play.log
...costs 7.14 seconds
processing file: 20170401_2_play.log


  interactivity=interactivity, compiler=compiler, result=result)


...costs 4.49 seconds
processing file: 20170401_3_play.log
...costs 5.18 seconds
processing file: 20170402_1_play.log
...costs 17.19 seconds
processing file: 20170402_2_play.log
...costs 3.75 seconds
processing file: 20170402_3_play.log
...costs 7.00 seconds
processing file: 20170403_1_play.log
...costs 5.27 seconds
processing file: 20170403_2_play.log
...costs 4.86 seconds
processing file: 20170404_1_play.log
...costs 14.03 seconds
processing file: 20170404_2_play.log
...costs 78.49 seconds
processing file: 20170404_3_play.log
...costs 1.20 seconds
processing file: 20170405_1_play.log
...costs 0.38 seconds
processing file: 20170405_2_play.log


  interactivity=interactivity, compiler=compiler, result=result)


...costs 4.12 seconds
processing file: 20170405_3_play.log
...costs 4.13 seconds
processing file: 20170406_1_play.log
...costs 0.23 seconds
processing file: 20170406_2_play.log
...costs 3.75 seconds
processing file: 20170406_3_play.log
...costs 3.93 seconds
processing file: 20170407_1_play.log
...costs 4.34 seconds
processing file: 20170407_2_play.log
...costs 4.34 seconds
processing file: 20170407_3_play.log


  interactivity=interactivity, compiler=compiler, result=result)


...costs 4.88 seconds
processing file: 20170408_1_play.log
...costs 4.27 seconds
processing file: 20170408_2_play.log


  interactivity=interactivity, compiler=compiler, result=result)


...costs 6.08 seconds
processing file: 20170408_3_play.log
...costs 4.21 seconds
processing file: 20170409_1_play.log
...costs 4.35 seconds
processing file: 20170409_2_play.log


  interactivity=interactivity, compiler=compiler, result=result)


...costs 4.88 seconds
processing file: 20170409_3_play.log
...costs 4.70 seconds
processing file: 20170410_1_play.log
...costs 4.10 seconds
processing file: 20170410_2_play.log


  interactivity=interactivity, compiler=compiler, result=result)


...costs 8.70 seconds
processing file: 20170410_3_play.log
...costs 3.78 seconds
processing file: 20170411_1_play.log
...costs 4.10 seconds
processing file: 20170411_2_play.log
...costs 5.17 seconds
processing file: 20170412_1_play.log
...costs 13.37 seconds
processing file: 20170412_2_play.log
...costs 5.51 seconds
processing file: 20170412_3_play.log
...costs 4.05 seconds
processing file: 20170413_1_play.log
...costs 3.53 seconds
processing file: 20170413_2_play.log
...costs 4.07 seconds
processing file: 20170413_3_play.log
...costs 3.78 seconds
processing file: 20170414_1_play.log


  interactivity=interactivity, compiler=compiler, result=result)


...costs 1.90 seconds
processing file: 20170414_2_play.log


  interactivity=interactivity, compiler=compiler, result=result)


...costs 2.40 seconds
processing file: 20170414_3_play.log
...costs 2.11 seconds
processing file: 20170415_1_play.log
...costs 4.41 seconds
processing file: 20170415_2_play.log
...costs 4.08 seconds
processing file: 20170415_3_play.log
...costs 4.75 seconds
processing file: 20170416_1_play.log
...costs 3.79 seconds
processing file: 20170416_2_play.log
...costs 3.88 seconds
processing file: 20170416_3_play.log
...costs 4.57 seconds
processing file: 20170417_1_play.log
...costs 3.31 seconds
processing file: 20170417_2_play.log
...costs 3.16 seconds
processing file: 20170418_1_play.log
...costs 2.74 seconds
processing file: 20170418_2_play.log
...costs 3.32 seconds
processing file: 20170418_3_play.log
...costs 3.31 seconds
processing file: 20170419_1_play.log
...costs 3.67 seconds
processing file: 20170419_2_play.log
...costs 3.61 seconds
processing file: 20170419_3_play.log
...costs 3.21 seconds
processing file: 20170420_1_play.log
...costs 8.02 seconds
processing file: 20170420_2_play.l

  interactivity=interactivity, compiler=compiler, result=result)


...costs 2.74 seconds
processing file: 20170427_2_play.log
...costs 3.15 seconds
processing file: 20170427_3_play.log
...costs 4.91 seconds
processing file: 20170428_1_play.log
...costs 4.57 seconds
processing file: 20170428_2_play.log
...costs 2.64 seconds
processing file: 20170428_3_play.log
...costs 2.55 seconds
processing file: 20170429_1_play.log
...costs 3.35 seconds
processing file: 20170429_2_play.log
...costs 2.79 seconds
processing file: 20170429_3_play.log
...costs 3.17 seconds
processing file: 20170430_1_play.log
...costs 3.28 seconds
processing file: 20170430_2_play.log
...costs 2.77 seconds
processing file: 20170430_3_play.log
...costs 2.77 seconds
processing file: 20170501_1_play.log
...costs 3.18 seconds
processing file: 20170501_2_play.log
...costs 5.09 seconds
processing file: 20170501_3_play.log
...costs 2.62 seconds
processing file: 20170502_1_play.log
...costs 2.39 seconds
processing file: 20170502_2_play.log
...costs 2.23 seconds
processing file: 20170502_3_play.l

(40197, 10)

In [11]:
df_device.device.value_counts()

薛之谦                                     2053
梦然                                      1521
mc                                      1309
云菲菲                                     1170
苏勒亚其其格                                  1016
庄心妍                                      880
王冕                                       861
经典老歌                                     584
汪苏泷                                      548
铃声                                       527
乌兰图雅                                     521
TFBOYS                                   507
儿童歌曲                                     459
张碧晨                                      446
Beyond                                   434
杨洋                                       412
网络歌手                                     359
那英                                       354
朴树                                       335
Delacey                                  292
G.E.M.邓紫棋                                289
李荣浩                                      275
任妙音&amp;何鹏

In [13]:
df_mc_log = df_device.loc[df_device.device.apply(lambda x: x== 'mc')]
df_mc_log.head()

Unnamed: 0,uid,device,song_id,song_type,song_name,singer,play_time,song_length,paid_flag,file_name
95600,168577000.0,mc,12598500.0,2,Love You Right,Caius,,0,0,20170330_3_play.log
95705,168577000.0,mc,12598500.0,2,Love You Right,Caius,,0,0,20170330_3_play.log
97451,168577000.0,mc,616808.0,2,勿忘心安,张杰,,249,0,20170330_3_play.log
97475,168577000.0,mc,2692870.0,2,Today,G-Dragon&Nell,,221,0,20170330_3_play.log
97508,168577000.0,mc,5078260.0,2,情歌,群星,,260,0,20170330_3_play.log


In [15]:
df_mc_log.shape

(1309, 10)

In [17]:
df_mc_log.groupby('uid').size()

uid
47342301.0     652
168165825.0     92
168166684.0    100
168170738.0    242
168171292.0     17
168171514.0     16
168576820.0      2
168577045.0     20
168577224.0     27
dtype: int64

In [19]:
# active and churns among them
cutoff = '20170422'
df_mc_log['before_cutoff'] = df_mc_log.file_name.apply(lambda x: x[:8]<cutoff)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until


In [20]:
df_mc_log.head()

Unnamed: 0,uid,device,song_id,song_type,song_name,singer,play_time,song_length,paid_flag,file_name,before_cut_off,before_cutoff
95600,168577000.0,mc,12598500.0,2,Love You Right,Caius,,0,0,20170330_3_play.log,True,True
95705,168577000.0,mc,12598500.0,2,Love You Right,Caius,,0,0,20170330_3_play.log,True,True
97451,168577000.0,mc,616808.0,2,勿忘心安,张杰,,249,0,20170330_3_play.log,True,True
97475,168577000.0,mc,2692870.0,2,Today,G-Dragon&Nell,,221,0,20170330_3_play.log,True,True
97508,168577000.0,mc,5078260.0,2,情歌,群星,,260,0,20170330_3_play.log,True,True


In [23]:
sum(df_mc_log.groupby('uid').sum()['before_cutoff']<=3)

1

In [24]:
df_mc_log['after_cutoff'] = df_mc_log.file_name.apply(lambda x: x[:8]>=cutoff)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [25]:
sum(df_mc_log.groupby('uid').sum()['before_cutoff']>0)

9

    There is no churn in mc users.

In [26]:
df_mc_log.to_csv('C:\\Users\\Sean\\Documents\\BitTiger\\Capston_music_player_python\\mc_users_log.csv')
df_mc_log.to_pickle('C:\\Users\\Sean\\Documents\\BitTiger\\Capston_music_player_python\\mc_users_log.pkl')