<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Imports-and-loading" data-toc-modified-id="Imports-and-loading-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Imports and loading</a></span><ul class="toc-item"><li><span><a href="#Library-imports" data-toc-modified-id="Library-imports-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Library imports</a></span></li><li><span><a href="#Data-Loading" data-toc-modified-id="Data-Loading-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Data Loading</a></span></li></ul></li><li><span><a href="#Data-Cleaning" data-toc-modified-id="Data-Cleaning-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data Cleaning</a></span><ul class="toc-item"><li><span><a href="#Null-Values-Handling" data-toc-modified-id="Null-Values-Handling-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Null Values Handling</a></span><ul class="toc-item"><li><span><a href="#Members-data" data-toc-modified-id="Members-data-2.1.1"><span class="toc-item-num">2.1.1&nbsp;&nbsp;</span>Members data</a></span></li><li><span><a href="#Transactions-Data" data-toc-modified-id="Transactions-Data-2.1.2"><span class="toc-item-num">2.1.2&nbsp;&nbsp;</span>Transactions Data</a></span></li><li><span><a href="#User-Logs-Data" data-toc-modified-id="User-Logs-Data-2.1.3"><span class="toc-item-num">2.1.3&nbsp;&nbsp;</span>User Logs Data</a></span></li></ul></li></ul></li></ul></div>

# Imports and loading

## Library imports

In [1]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import altair as alt
import warnings
from IPython.core.interactiveshell import InteractiveShell
import numpy as np
import dask.dataframe as dd
import os
alt.renderers.enable('notebook')
InteractiveShell.ast_node_interactivity = "all"
sns.set_theme(style="darkgrid")
warnings.filterwarnings('ignore')

## Data Loading

In [2]:
uid = 'msno'

In [3]:
data_dir = './data'

# Training data for January: two columns, the user id and the binary churn target variable
train = pd.read_csv(os.path.join(data_dir, 'train.csv'))

train.head()

Unnamed: 0,msno,is_churn
0,waLDQMmcOu2jLDaV1ddDkgCrB/jl6sD66Xzs0Vqax1Y=,1
1,QA7uiXy8vIbUSPOkCf9RwQ3FsT8jVq2OxDr8zqa7bRQ=,1
2,fGwBva6hikQmTJzrbz/2Ezjm5Cth5jZUNvXigKK2AFA=,1
3,mT5V8rEpa+8wuqi6x0DoVd3H5icMKkE9Prt49UlmK+4=,1
4,XaPhtGLk/5UvvOYHcONTwsnH97P4eGECeq+BARGItRw=,1


The previous dataset is the train set, containing the user ids and whether they have churned.

- **msno**: user id
- **is_churn**: the target variable. A user is considered to have churned if they do not start a new subscription within 30 days after their membership expires. is_churn = 1 means that the user churned, is_churn = 0 means that the user renewed.
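
As a quick sanity check on the target, the class balance can be inspected directly. The sketch below uses a small hypothetical frame (`toy_train`) in place of `train.csv`; the ids and labels are made up for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for train.csv: ids and labels are made up.
toy_train = pd.DataFrame({
    "msno": ["u1", "u2", "u3", "u4", "u5"],
    "is_churn": [1, 0, 0, 0, 0],
})

# Because is_churn is binary, its mean is the share of churned users.
churn_rate = toy_train["is_churn"].mean()

# Counts per class, useful for spotting class imbalance.
class_counts = toy_train["is_churn"].value_counts()

print(f"churn rate: {churn_rate:.1%}")
print(class_counts)
```

The same two calls can be applied to `train` once it is loaded.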

In [4]:
# A quick summary of the train dataset using .info() and .describe()
train.info()
train.describe().T

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 992931 entries, 0 to 992930
Data columns (total 2 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   msno      992931 non-null  object
 1   is_churn  992931 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 15.2+ MB


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
is_churn,992931.0,0.063923,0.244616,0.0,0.0,0.0,0.0,1.0


In [5]:
# Members info : city, age, registration method and date of first subscription
members = pd.read_csv(os.path.join(data_dir, 'members_v3.csv'),
                      dtype={'gender': str, "registered_via": str, 'city': str, 'registration_init_time': str})
members = members[members[uid].isin(train[uid])]

members.head()

Unnamed: 0,msno,city,bd,gender,registered_via,registration_init_time
1,+tJonkh+O1CA796Fm5X60UMOtB6POHAwPjbTRVl/EuU=,1,0,,7,20110914
5,yLkV2gbZ4GLFwqTOXLVHz0VGrMYcgBGgKZ3kj9RiYu8=,4,30,male,9,20110916
7,WH5Jq4mgtfUFXh2yz+HrcTXKS4Oess4k4W3qKolAeb0=,5,34,male,9,20110916
9,I0yFvqMoNkM8ZNHb617e1RBzIS/YRKemHO7Wj13EtA0=,13,63,male,9,20110918
10,OoDwiKZM+ZGr9P3fRivavgOtglTEaNfWJO4KaJcTTts=,1,0,,7,20110918


The previous dataset holds the member information. It contains the following columns:

- **msno**: user id 
- **city**
- **bd**: age
- **gender**
- **registered_via**: registration method
- **registration_init_time**: format %Y%m%d
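
Since `registration_init_time` is loaded as a `%Y%m%d` string, it can be parsed into a datetime before deriving features from it. This is a minimal sketch on a hypothetical two-row frame (`toy_members`); the derived column names `registration_date` and `registration_year` are assumptions, not columns of the original data.

```python
import pandas as pd

# Hypothetical sample mirroring the members schema; ids are made up.
toy_members = pd.DataFrame({
    "msno": ["u1", "u2"],
    "registration_init_time": ["20110914", "20110916"],
})

# Parse the %Y%m%d strings into proper datetimes.
toy_members["registration_date"] = pd.to_datetime(
    toy_members["registration_init_time"], format="%Y%m%d"
)

# Datetime columns expose calendar parts through the .dt accessor.
toy_members["registration_year"] = toy_members["registration_date"].dt.year
print(toy_members)
```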

In [6]:
# A quick summary of the members data using .info() and .describe()
members.info()
members.describe().T

<class 'pandas.core.frame.DataFrame'>
Int64Index: 877161 entries, 1 to 6769469
Data columns (total 6 columns):
 #   Column                  Non-Null Count   Dtype 
---  ------                  --------------   ----- 
 0   msno                    877161 non-null  object
 1   city                    877161 non-null  object
 2   bd                      877161 non-null  int64 
 3   gender                  391692 non-null  object
 4   registered_via          877161 non-null  object
 5   registration_init_time  877161 non-null  object
dtypes: int64(1), object(5)
memory usage: 46.8+ MB


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
bd,877161.0,13.453973,20.226865,-3152.0,0.0,0.0,27.0,2016.0


In [7]:
# User logs : number and percentage of listened tracks, total seconds listened aggregated daily
user_logs = dd.read_csv(os.path.join(data_dir, 'user_logs_final.csv'))
user_logs = user_logs[user_logs[uid].isin(train[uid])]

user_logs.head()

Unnamed: 0,msno,date,num_25,num_50,num_75,num_985,num_100,num_unq,total_secs
2,yxiEWwE9VR5utpUecLxVdQ5B7NysUPfrNtGINaM2zA8=,20150105,3,3,0,0,68,36,17364.956
3,yxiEWwE9VR5utpUecLxVdQ5B7NysUPfrNtGINaM2zA8=,20150306,1,0,1,1,97,27,24667.317
4,yxiEWwE9VR5utpUecLxVdQ5B7NysUPfrNtGINaM2zA8=,20150501,3,0,0,0,38,38,9649.029
5,yxiEWwE9VR5utpUecLxVdQ5B7NysUPfrNtGINaM2zA8=,20150702,4,0,1,1,33,10,10021.52
6,yxiEWwE9VR5utpUecLxVdQ5B7NysUPfrNtGINaM2zA8=,20150830,3,1,0,0,4,7,1119.555


The previous dataset holds the daily user logs, which describe the listening behavior of each user.

- **msno**: user id
- **date**: format %Y%m%d
- **num_25**: number of songs played for less than 25% of the song length
- **num_50**: number of songs played for between 25% and 50% of the song length
- **num_75**: number of songs played for between 50% and 75% of the song length
- **num_985**: number of songs played for between 75% and 98.5% of the song length
- **num_100**: number of songs played for over 98.5% of the song length
- **num_unq**: number of unique songs played
- **total_secs**: total seconds played
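
The completion buckets above can be combined into a per-day share of fully played songs. The following is a sketch on hypothetical rows (`toy_logs`); the column names `num_total` and `pct_100` are illustrative assumptions rather than part of the dataset.

```python
import pandas as pd

# Hypothetical daily log rows following the user logs schema.
toy_logs = pd.DataFrame({
    "msno": ["u1", "u1", "u2"],
    "num_25": [3, 1, 0],
    "num_50": [3, 0, 0],
    "num_75": [0, 1, 0],
    "num_985": [0, 1, 0],
    "num_100": [68, 97, 0],
})

play_cols = ["num_25", "num_50", "num_75", "num_985", "num_100"]

# Total plays per day across all completion buckets.
toy_logs["num_total"] = toy_logs[play_cols].sum(axis=1)

# Share of plays listened to over 98.5% of the song length.
# Days with no plays are masked so the ratio is left as NaN.
nonzero_total = toy_logs["num_total"].where(toy_logs["num_total"] > 0)
toy_logs["pct_100"] = toy_logs["num_100"] / nonzero_total
print(toy_logs[["msno", "num_total", "pct_100"]])
```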

In [None]:
summary_user_logs = user_logs.describe().compute()

In [8]:
# Transaction data : subscription and cancel data, amount paid and length of membership in days
transactions = pd.concat([pd.read_csv(os.path.join(data_dir, 'transactions.csv')), pd.read_csv(os.path.join(data_dir, 'transactions_v2.csv'))])

# The training data covers January, so we keep only transactions up to 28 February 2017 for exploration
transactions = transactions[(transactions.transaction_date <= 20170228) & (transactions[uid].isin(train[uid]))]
transactions.sort_values('transaction_date', inplace=True)
transactions.reset_index(drop=True, inplace=True)

transactions.head()

Unnamed: 0,msno,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel
0,ilnvctpFS1yQreLvFtfwzcy61RCqp1lqMjE+Fa3Makk=,41,30,149,149,1,20150101,20150201,0
1,mcQZRmsTjR/tzXhmCjRzTF8lOrP9X0WtfkK9FMs+EHU=,41,30,149,119,1,20150101,20150209,0
2,Vz6kFWULqt09djZYvvNTLPnulpBlulo2vXJ73sZGZmM=,40,31,149,149,1,20150101,20150131,0
3,KfY3ZJnGibtcGyENN6rIoezc4iX5iYJDLCwhkDL2IZw=,40,31,149,149,1,20150101,20150201,0
4,WTPUh0+3NKZc6Ey8S8bYHfYIdWpRdLGNKVq2ae98nP0=,40,31,149,149,1,20150101,20150202,0


The previous dataset contains the transaction history (both subscription and cancellation transactions):

- **msno**: user id
- **payment_method_id**: payment method
- **payment_plan_days**: length of membership plan in days
- **plan_list_price**: in New Taiwan Dollar (NTD)
- **actual_amount_paid**: in New Taiwan Dollar (NTD)
- **is_auto_renew**
- **transaction_date**: format %Y%m%d
- **membership_expire_date**: format %Y%m%d
- **is_cancel**: whether or not the user canceled the membership in this transaction.
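
Two simple quantities follow directly from these columns: the discount a user received and the number of days between a transaction and the membership expiry. The sketch below works on hypothetical rows (`toy_tx`); `discount` and `days_to_expiry` are illustrative names, not columns of the original files.

```python
import pandas as pd

# Hypothetical rows following the transactions schema.
toy_tx = pd.DataFrame({
    "msno": ["u1", "u2"],
    "plan_list_price": [149, 149],
    "actual_amount_paid": [149, 119],
    "transaction_date": [20150101, 20150101],
    "membership_expire_date": [20150201, 20150209],
})

# Discount in NTD: list price minus what the user actually paid.
toy_tx["discount"] = toy_tx["plan_list_price"] - toy_tx["actual_amount_paid"]

# The dates are stored as integers such as 20150101, so cast to str before parsing.
for col in ["transaction_date", "membership_expire_date"]:
    toy_tx[col] = pd.to_datetime(toy_tx[col].astype(str), format="%Y%m%d")

# Days between the transaction and the membership expiry.
toy_tx["days_to_expiry"] = (
    toy_tx["membership_expire_date"] - toy_tx["transaction_date"]
).dt.days
print(toy_tx)
```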

In [9]:
# A quick summary of the transactions data using .info() and .describe()
transactions.info()
transactions[['plan_list_price', 'actual_amount_paid']].describe().T

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16072542 entries, 0 to 16072541
Data columns (total 9 columns):
 #   Column                  Dtype 
---  ------                  ----- 
 0   msno                    object
 1   payment_method_id       int64 
 2   payment_plan_days       int64 
 3   plan_list_price         int64 
 4   actual_amount_paid      int64 
 5   is_auto_renew           int64 
 6   transaction_date        int64 
 7   membership_expire_date  int64 
 8   is_cancel               int64 
dtypes: int64(8), object(1)
memory usage: 1.1+ GB


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
plan_list_price,16072542.0,132.478276,79.543314,0.0,99.0,149.0,149.0,2000.0
actual_amount_paid,16072542.0,136.998733,76.347247,0.0,99.0,149.0,149.0,2000.0


# Data Cleaning

## Null Values Handling

### Members data

In [10]:
members.isnull().agg(['mean', 'sum']).T.add_prefix('null_')

Unnamed: 0,null_mean,null_sum
msno,0.0,0.0
city,0.0,0.0
bd,0.0,0.0
gender,0.553455,485469.0
registered_via,0.0,0.0
registration_init_time,0.0,0.0


We can see that **more than half** (about 55%) of the '**gender**' column is null. Since gender might be a useful feature for our model, we fill the missing values rather than dropping the column.

Since it is a <ins>categorical</ins> variable, we'll use the **mode** as a *filling value*.
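
Mode imputation can be illustrated on a small hypothetical Series (the values below are made up):

```python
import pandas as pd

# Hypothetical gender column with missing entries.
gender = pd.Series(["male", None, "female", "male", None], name="gender")

# mode() ignores nulls and returns a Series (there may be ties); take the first value.
most_common = gender.mode()[0]

# Fill the nulls and keep the result in a new Series.
gender_filled = gender.fillna(most_common)

print(most_common)
print(gender_filled.value_counts())
```

Note that when most values are missing, this makes the most frequent category dominate the column. Filling with an explicit placeholder such as `'unknown'` would be a possible alternative, although it is not what this notebook does.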

In [11]:
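# Assign the result back: inplace=True on a selected column may not update the frame in recent pandas versions
members['gender'] = members['gender'].fillna(members['gender'].mode()[0])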

### Transactions Data

In [12]:
transactions.isnull().agg(['mean', 'sum']).T.add_prefix('null_')

Unnamed: 0,null_mean,null_sum
msno,0.0,0.0
payment_method_id,0.0,0.0
payment_plan_days,0.0,0.0
plan_list_price,0.0,0.0
actual_amount_paid,0.0,0.0
is_auto_renew,0.0,0.0
transaction_date,0.0,0.0
membership_expire_date,0.0,0.0
is_cancel,0.0,0.0


There are no null values in the transactions dataset.

### User Logs Data

Since the user logs dataset is quite large, I load it in chunks and check the number of null values in each chunk.

In [13]:
user_logs_iterator = pd.read_csv(os.path.join(data_dir, 'user_logs_final.csv'), chunksize=200000)

In [14]:
first_chunk = next(user_logs_iterator)

In [17]:
first_chunk.isnull().sum()

msno          0
date          0
num_25        0
num_50        0
num_75        0
num_985       0
num_100       0
num_unq       0
total_secs    0
dtype: int64

In [16]:
for chunk in user_logs_iterator:
    print(chunk.isnull().mean())
    break

RangeIndex(start=200000, stop=400000, step=1)
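
To cover the whole file rather than a single chunk, the per-chunk null counts can be accumulated over the iterator. The sketch below uses a hypothetical in-memory CSV (`csv_text`) with a reduced set of columns and a tiny `chunksize` in place of `user_logs_final.csv`; the rows and missing values are made up.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for user_logs_final.csv.
csv_text = """msno,date,num_100,total_secs
u1,20150105,68,17364.956
u1,20150306,,24667.317
u2,20150501,38,
u2,20150702,33,10021.52
u3,20150830,,1119.555
"""

null_counts = None
n_rows = 0

# Read a few rows at a time and accumulate per-column null counts.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    chunk_nulls = chunk.isnull().sum()
    null_counts = chunk_nulls if null_counts is None else null_counts + chunk_nulls
    n_rows += len(chunk)

print(null_counts)
print(null_counts / n_rows)  # fraction of nulls per column
```

With the real file, the `io.StringIO(csv_text)` argument would be replaced by the path to the CSV and `chunksize` by a larger value such as the 200000 used above.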
