# Introduction: Partition Pipeline

In this notebook, we will work with a single partition to develop a pipeline for processing the data. The end goal is code that can take a partition on disk and generate a feature matrix from it. That code can then be run in parallel across all partitions with Spark, using PySpark.

In [34]:
import pandas as pd 
import numpy as np

import featuretools as ft

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [35]:
import os

directory = '/data/churn/partitions/p0'
os.listdir(directory)

['transactions.csv', 'members.csv', 'test.csv', 'train.csv', 'logs.csv']

In [36]:
all_partitions = os.listdir('/data/churn/partitions/')
len(all_partitions)

1000

In [37]:
members = pd.read_csv(f'{directory}/members.csv', 
                      parse_dates=['registration_init_time'], infer_datetime_format = True)
trans = pd.read_csv(f'{directory}/transactions.csv',
                   parse_dates=['transaction_date', 'membership_expire_date'], infer_datetime_format = True)
logs = pd.read_csv(f'{directory}/logs.csv', parse_dates = ['date'])
train = pd.read_csv(f'{directory}/train.csv')
test = pd.read_csv(f'{directory}/test.csv')
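
Since the end goal is code that takes a partition on disk, the reads above could be wrapped in one function. The following is only a minimal sketch: `read_partition` and `DATE_COLUMNS` are hypothetical names that do not appear in the original notebook, and the file and column names are assumed to match the cell above.

```python
import os
import pandas as pd

# Hypothetical helper, not part of the original notebook.
# Date columns per table, taken from the read_csv calls above.
DATE_COLUMNS = {
    'members': ['registration_init_time'],
    'transactions': ['transaction_date', 'membership_expire_date'],
    'logs': ['date'],
}

def read_partition(directory):
    """Read the five csv files of one partition into a dict of dataframes."""
    tables = {}
    for name in ['members', 'transactions', 'logs', 'train', 'test']:
        path = os.path.join(directory, f'{name}.csv')
        date_cols = DATE_COLUMNS.get(name)
        if date_cols:
            tables[name] = pd.read_csv(path, parse_dates=date_cols)
        else:
            tables[name] = pd.read_csv(path)
    return tables
```

Under these assumptions, `read_partition(directory)['members']` would hold the same data as `members` above, and the same call could later be mapped over every partition directory.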

In [38]:
members.head()
trans.head()
logs.head()
train.head()
test.head()

Unnamed: 0,msno,city,bd,gender,registered_via,registration_init_time
0,jk6WQA2qSx3az+A3sZwCtQDDP/Lsw6kd0UdQ4gjbyOY=,1,0,,4,2016-12-24
1,eRZ8pH3tR5Ss9rn5dkJpOs4q07b72+pjOfuiVwHsEyw=,3,28,male,9,2006-09-23
2,uj8Fs7lyFg8c1iOGV1eFSlBcV7Y1FziPQS62GLh23J0=,15,33,male,9,2009-03-07
3,Kv9V2xGzAZyOUlD0dudmj3bgYamLuPKlJ2hieIEhOo8=,11,23,female,9,2007-06-17
4,IUQ6diSNvqj+YMwRwiZ7tGv83H61pKz+pEn6p2U0jAI=,15,50,female,9,2011-01-11


Unnamed: 0,msno,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel


Unnamed: 0,msno,date,num_25,num_50,num_75,num_985,num_100,num_unq,total_secs


Unnamed: 0,msno,is_churn


Unnamed: 0,msno,is_churn


## How Many Unique Members are There? 

Which members do we need data for? The best choice is probably only the customers who appear in the transactions dataframe, since those are the ones we can make labels for.

The definition of a label will be: within 30 days of cancelling, does a customer resubscribe? Given this definition, we can write a function to generate labels.
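
One possible reading of the label definition above, as a rough sketch rather than the final labeling function: for every cancellation (`is_cancel == 1`), check whether the same member has a non-cancel transaction in the following 30 days. The function name `label_cancellations`, the `cancel_date` output column, and the choice to ignore same-day transactions are all assumptions made for illustration.

```python
import pandas as pd

def label_cancellations(trans, window_days=30):
    """Label each cancellation: is_churn = 1 if no resubscription within the window.

    Assumption: a "resubscription" is any non-cancel transaction dated strictly
    after the cancellation and at most `window_days` days later.
    """
    window = pd.Timedelta(days=window_days)
    records = []
    for msno, group in trans.groupby('msno'):
        cancels = group.loc[group['is_cancel'] == 1, 'transaction_date']
        renewals = group.loc[group['is_cancel'] == 0, 'transaction_date']
        for cancel_date in cancels.sort_values():
            resubscribed = ((renewals > cancel_date) &
                            (renewals <= cancel_date + window)).any()
            records.append({'msno': msno,
                            'cancel_date': cancel_date,
                            'is_churn': int(not resubscribed)})
    return pd.DataFrame(records, columns=['msno', 'cancel_date', 'is_churn'])

# Toy data: member 'a' comes back two weeks after cancelling, member 'b' never does.
toy = pd.DataFrame({
    'msno': ['a', 'a', 'b'],
    'transaction_date': pd.to_datetime(['2016-01-01', '2016-01-15', '2016-01-01']),
    'is_cancel': [1, 0, 1],
})
toy_labels = label_cancellations(toy)
```

Members who never cancel get no row here; whether they should receive labels some other way is left open in this sketch.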

In [21]:
train.groupby('msno')['is_churn'].nunique().head()

msno
+/w1UrZwyka4C9oNH3+Q8fUf3fD8R3EwWrx57ODIsqk=    2
+2rgJpEKJWYFwVkHKnSzQUnieMwfLMrHiJzCxK9AhGo=    2
+3KltBa/1dUuXwOzDKksw11Nwdwf7/pXv47sDv4mInY=    2
+4lC2x3ltrVTmmT3CgS+vuFD/1yzi97C6icTr7hFuRY=    2
+6KgKovFigr5lk3+G8srZUoUHhPS8a+rTa/N2Vg1wsg=    2
Name: is_churn, dtype: int64

In [15]:
trans.groupby('msno').count().sort_values('plan_list_price').tail()

msno,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel
prvwhLDR5WjSgche9r0XTDuF2lbr9KpPKyqBPAAUwio=,41,41,41,41,41,41,41,41
8O9UXcStlZzME2YTyJDJf4m4WnfBV1FNOOBI1hoeueE=,42,42,42,42,42,42,42,42
AKZNktiVbiDxV6Un0C84th9hKQPn/NwGXNffbVbR9Ps=,42,42,42,42,42,42,42,42
zwhR3q2j/NM4e56g3ekekoyx/8s0Ghomij/3/BsSBWs=,44,44,44,44,44,44,44,44
fItJlEs671EQOapBdqMZ/9zJe0/Mzzt6A5wCp4Iu/wA=,54,54,54,54,54,54,54,54


In [24]:
ex = trans[trans['msno'] == 'fItJlEs671EQOapBdqMZ/9zJe0/Mzzt6A5wCp4Iu/wA='].copy().\
     sort_values(['transaction_date', 'membership_expire_date'])
ex.head(10)

Unnamed: 0,msno,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel
23433,fItJlEs671EQOapBdqMZ/9zJe0/Mzzt6A5wCp4Iu/wA=,41,30,149,149,1,2015-01-08,2016-08-10,0
537,fItJlEs671EQOapBdqMZ/9zJe0/Mzzt6A5wCp4Iu/wA=,41,30,149,119,1,2015-01-08,2016-09-10,0
23080,fItJlEs671EQOapBdqMZ/9zJe0/Mzzt6A5wCp4Iu/wA=,41,30,149,149,1,2015-01-11,2016-10-11,0
8397,fItJlEs671EQOapBdqMZ/9zJe0/Mzzt6A5wCp4Iu/wA=,41,30,149,149,1,2015-02-08,2016-11-08,0
10817,fItJlEs671EQOapBdqMZ/9zJe0/Mzzt6A5wCp4Iu/wA=,41,30,149,119,1,2015-02-08,2016-12-06,0
7868,fItJlEs671EQOapBdqMZ/9zJe0/Mzzt6A5wCp4Iu/wA=,41,30,149,149,1,2015-02-11,2017-01-03,0
11836,fItJlEs671EQOapBdqMZ/9zJe0/Mzzt6A5wCp4Iu/wA=,41,30,149,149,1,2015-03-08,2017-02-03,0
24714,fItJlEs671EQOapBdqMZ/9zJe0/Mzzt6A5wCp4Iu/wA=,41,30,149,119,1,2015-03-08,2017-03-06,0
27308,fItJlEs671EQOapBdqMZ/9zJe0/Mzzt6A5wCp4Iu/wA=,41,30,149,149,1,2015-03-11,2017-04-06,0
26136,fItJlEs671EQOapBdqMZ/9zJe0/Mzzt6A5wCp4Iu/wA=,41,30,149,149,1,2015-04-08,2017-05-06,0


In [25]:
ex_start = members.loc[members['msno'] == 'fItJlEs671EQOapBdqMZ/9zJe0/Mzzt6A5wCp4Iu/wA=', 'registration_init_time']
ex_start

1692   2013-10-09
Name: registration_init_time, dtype: datetime64[ns]

In [26]:
# ex_start is a one-row Series, so pull out the scalar Timestamp first
start = ex_start.iloc[0]
months = pd.date_range(pd.Timestamp(start.year, start.month, 1),
                       pd.Timestamp(2018, 1, 1), freq = 'M')
len(months)

51

In [27]:
months[1]

Timestamp('2013-11-30 00:00:00', freq='M')

In [30]:
statuses = []
is_subscribed = True

for month in months:
    if month < (ex['transaction_date'].min() - pd.Timedelta(30, 'D')):
        statuses.append(np.nan)
    else:
        status = 0 
        subset = ex.loc[(ex['transaction_date'].dt.year == month.year) & (ex['transaction_date'].dt.month == month.month)].copy()
        
        if (subset['is_cancel'] == 1).any():
            is_subscribed = False
                