<h1> Capstone 2 Data Wrangling </h1><a id='Capstone_2_Data_Wrangling'></a>

## Table of Contents<a id='Table_of_Contents'></a>
* [1 Data Wrangling](#Capstone_2_Data_Wrangling)
    * [1.1. The Data Science Problem](#The_Data_Science_Problem)
        *  [1.1.1 Context](#Context)
        *  [1.1.2 Data Source Citation](#Data_Source_Citation)
 
    * [1.2 Imports](#Data_Imports)
        * [1.2.1 Import Libraries](#Import_Libraries)
        * [1.2.2 Import Data](#Import_Data)
            * [1.2.2.1 Read in train_v2.csv](#Read_in_train_v2.csv)
            * [1.2.2.2 Read in members_v3.csv](#Read_in_members_v3.csv)
            * [1.2.2.3 Read in transactions_v2.csv](#Read_in_transactions_v2.csv)
            * [1.2.2.4 Read in user_logs_v2.csv](#Read_in_user_logs_v2.csv)
            
    
    * [1.3 Merging Dataframes](#Merging_Dataframes)
        * [1.3.1 Merge train_df and transactions_df: train_trans_df](#train_trans_df)
        * [1.3.2 Merge train_trans_df and members_df: tt_mem_df](#tt_mem_df)
        * [1.3.3 Merge tt_mem_df and user_logs_df: full_df](#full_df)
    * [1.4 Data Exploration](#Data_Exploration)
    


## 1.1 The Data Science Problem<a id='The_Data_Science_Problem'></a>

### 1.1.1 Context<a id='Context'></a>

KKBox is a music streaming service popular in Southeast Asia with over 10 million users. It runs on a subscription-based business model, with the majority of subscriptions lasting 30 days. An account is marked as churned if there are no new transactions within 30 days after its subscription has expired. KKBox would like to predict which subscribers are likely to renew within a month of their membership ending and which ones will churn.

By using customers’ demographic information, listening history, and transaction history, we will train a classification algorithm to predict if a particular user will renew their subscription or churn.
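The churn rule described above can be written as a small labelling function. This is only a sketch: `label_churn` and `CHURN_WINDOW_DAYS` are hypothetical names, and counting the 30th day as still inside the window is an assumption rather than something the competition text quoted here specifies.

```python
from datetime import date, timedelta
from typing import Optional

CHURN_WINDOW_DAYS = 30  # window length taken from the churn definition above


def label_churn(expire_date: date, next_transaction: Optional[date]) -> int:
    """Return 1 (churn) if no transaction occurs within 30 days of expiry, else 0.

    Hypothetical helper: treating the 30th day as inside the window
    is an assumption made for this sketch.
    """
    if next_transaction is None:
        return 1  # the user never transacted again
    deadline = expire_date + timedelta(days=CHURN_WINDOW_DAYS)
    return 0 if next_transaction <= deadline else 1


print(label_churn(date(2017, 3, 1), date(2017, 3, 20)))  # renewed inside the window
print(label_churn(date(2017, 3, 1), date(2017, 4, 15)))  # renewed too late
print(label_churn(date(2017, 3, 1), None))               # never renewed
```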

### 1.1.2 Data Source Citation<a id='Data_Source_Citation'></a>

KKBOX Group. (2017, September). WSDM - KKBox's Churn Prediction Challenge, Version 2. Retrieved March 3, 2021 from https://www.kaggle.com/c/kkbox-churn-prediction-challenge/overview/evaluation.

## 1.2 Imports<a id='Data_Imports'></a>

### 1.2.1 Import Libraries<a id='Import_Libraries'></a>

In [1]:
# import necessary libraries and packages
# we will be using dask to read data into dataframes as we are dealing with large files 
import dask.dataframe as dd
import numpy as np

### 1.2.2 Import Data<a id='Import_Data'></a>

#### 1.2.2.1 Read in train_v2.csv<a id='Read_in_train_v2.csv'></a>

In [2]:
# read csv file with target data: train_df
train_df = dd.read_csv("../data/raw/train_v2.csv")

In [3]:
train_df.head()

Unnamed: 0,msno,is_churn
0,ugx0CjOMzazClkFzU2xasmDZaoIqOUAZPsH1q0teWCg=,1
1,f/NmvEzHfhINFEYZTR05prUdr+E+3+oewvweYz9cCQE=,1
2,zLo9f73nGGT1p21ltZC3ChiRnAVvgibMyazbCxvWPcg=,1
3,8iF/+8HY8lJKFrTc7iR9ZYGCG2Ecrogbc2Vy5YhsfhQ=,1
4,K6fja4+jmoZ5xG6BypqX80Uw/XKpMgrEMdG2edFOxnA=,1


This dataset has our target value, is_churn, and a unique identifier for each customer, msno.

In [4]:
# Let's explore our target column, is_churn:
print("Number of missing values: ", train_df.is_churn.isna().sum().compute())
train_df.is_churn.value_counts().compute()

Number of missing values:  0


0    883630
1     87330
Name: is_churn, dtype: int64
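The counts above show that the two classes are far from balanced. As a quick check, the churn rate can be computed directly from the printed counts; the dictionary below simply restates that output and is not read from the data.

```python
# Class counts copied from the value_counts() output above
counts = {0: 883_630, 1: 87_330}

total = sum(counts.values())
churn_rate = counts[1] / total

print(f"total customers: {total:,}")
print(f"churn rate: {churn_rate:.2%}")
```

Roughly 9% of the labelled customers churned, which is worth keeping in mind when choosing an evaluation metric later.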

We want to end up with a dataframe in which each row represents a unique customer. The following function, all_unique_values, checks whether a given column in a dataframe contains only unique values.

In [5]:
def all_unique_values(df, column):
    """
    Parameters:
        df:     a dask dataframe
        column: column to check if all the values are unique
        
    Returns True if all the values in the given column are unique, False otherwise
    """
    return len(df) == df[column].nunique().compute()

In [6]:
all_unique_values(train_df, "msno")

True

#### 1.2.2.2 Read in members_v3.csv<a id='Read_in_members_v3.csv'></a>

This dataset has a unique identifier for each member, msno, as well as some demographic information. 

msno: unique identifier

city: user's city

bd: user's age

gender: user's gender

registered_via: registration method

registration_init_time: date the user registered, format %Y%m%d
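Dates such as registration_init_time are stored as integers in %Y%m%d form. A minimal pandas sketch of converting them to real datetimes is shown below; the sample values are copied from the first rows of members_v3.csv, and applying the same step to the dask dataframe is left as an assumption that is not verified here.

```python
import pandas as pd

# Sample values in %Y%m%d form, as they appear in members_v3.csv
raw = pd.Series([20110911, 20110914, 20110915], name="registration_init_time")

# Convert to strings first so the format string can be applied
parsed = pd.to_datetime(raw.astype(str), format="%Y%m%d")

print(parsed.dt.year.tolist())
```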

In [7]:
# read in the members dataframe: members_df
members_df = dd.read_csv("../data/raw/members_v3.csv")

In [8]:
members_df.head()

Unnamed: 0,msno,city,bd,gender,registered_via,registration_init_time
0,Rb9UwLQTrxzBVwCB6+bCcSQWZ9JiNLC9dXtM1oEsZA8=,1,0,,11,20110911
1,+tJonkh+O1CA796Fm5X60UMOtB6POHAwPjbTRVl/EuU=,1,0,,7,20110914
2,cV358ssn7a0f7jZOwGNWS07wCKVqxyiImJUX6xcIwKw=,1,0,,11,20110915
3,9bzDeJP6sQodK73K5CBlJ6fgIQzPeLnRl0p5B77XP+g=,1,0,,11,20110915
4,WFLY3s7z4EZsieHCt63XrsdtfTEmJ+2PnnKLH5GY4Tk=,6,32,female,9,20110915


In [9]:
members_df.describe()

Unnamed: 0_level_0,city,bd,registered_via,registration_init_time
npartitions=1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
,float64,float64,float64,float64
,...,...,...,...


In [10]:
# Rename the "bd" column as "age" for clarity
members_df = members_df.rename(columns={'bd':'age'})
members_df.head()

Unnamed: 0,msno,city,age,gender,registered_via,registration_init_time
0,Rb9UwLQTrxzBVwCB6+bCcSQWZ9JiNLC9dXtM1oEsZA8=,1,0,,11,20110911
1,+tJonkh+O1CA796Fm5X60UMOtB6POHAwPjbTRVl/EuU=,1,0,,7,20110914
2,cV358ssn7a0f7jZOwGNWS07wCKVqxyiImJUX6xcIwKw=,1,0,,11,20110915
3,9bzDeJP6sQodK73K5CBlJ6fgIQzPeLnRl0p5B77XP+g=,1,0,,11,20110915
4,WFLY3s7z4EZsieHCt63XrsdtfTEmJ+2PnnKLH5GY4Tk=,6,32,female,9,20110915


In [11]:
all_unique_values(members_df, "msno")

True

#### 1.2.2.3 Read in transactions_v2.csv<a id='Read_in_transactions_v2.csv'></a>

This dataset is a record of each customer's transactions. 

msno: user id

payment_method_id: payment method

payment_plan_days: length of membership plan in days

plan_list_price: in New Taiwan Dollar (NTD)

actual_amount_paid: in New Taiwan Dollar (NTD)

is_auto_renew: whether or not the user signed up to have their membership renew automatically

transaction_date: format %Y%m%d

membership_expire_date: format %Y%m%d

is_cancel: whether or not the user canceled the membership in this transaction

In [12]:
# read in the transactions csv file: transactions_df
transactions_df = dd.read_csv("../data/raw/transactions_v2.csv")

In [13]:
transactions_df.head()

Unnamed: 0,msno,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel
0,++6eU4LsQ3UQ20ILS7d99XK8WbiVgbyYL4FUgzZR134=,32,90,298,298,0,20170131,20170504,0
1,++lvGPJOinuin/8esghpnqdljm6NXS8m8Zwchc7gOeA=,41,30,149,149,1,20150809,20190412,0
2,+/GXNtXWQVfKrEDqYAzcSw2xSPYMKWNj22m+5XkVQZc=,36,30,180,180,1,20170303,20170422,0
3,+/w1UrZwyka4C9oNH3+Q8fUf3fD8R3EwWrx57ODIsqk=,36,30,180,180,1,20170329,20170331,1
4,+00PGzKTYqtnb65mPKPyeHXcZEwqiEzktpQksaaSC3c=,41,30,99,99,1,20170323,20170423,0


In [14]:
# describe() is lazy in dask: without .compute() it only shows the structure of the summary, not the statistics
transactions_df.describe()

Unnamed: 0_level_0,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel
npartitions=1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
,float64,float64,float64,float64,float64,float64,float64,float64
,...,...,...,...,...,...,...,...


In [15]:
all_unique_values(transactions_df, "msno")

False

#### 1.2.2.4 Read in user_logs_v2.csv<a id='Read_in_user_logs_v2.csv'></a>

This dataset is a log of a user's activity.

msno: user id

date: format %Y%m%d

num_25: number of songs played less than 25% of the song length

num_50: number of songs played between 25% to 50% of the song length

num_75: number of songs played between 50% to 75% of the song length

num_985: number of songs played between 75% to 98.5% of the song length

num_100: number of songs played over 98.5% of the song length

num_unq: number of unique songs played

total_secs: total seconds played

In [16]:
# read in the user_logs csv file: user_logs_df
user_logs_df = dd.read_csv("../data/raw/user_logs_v2.csv")

In [17]:
user_logs_df.head()

Unnamed: 0,msno,date,num_25,num_50,num_75,num_985,num_100,num_unq,total_secs
0,u9E91QDTvHLq6NXjEaWv8u4QIqhrHk72kE+w31Gnhdg=,20170331,8,4,0,1,21,18,6309.273
1,nTeWW/eOZA/UHKdD5L7DEqKKFTjaAj3ALLPoAWsU8n0=,20170330,2,2,1,0,9,11,2390.699
2,2UqkWXwZbIjs03dHLU9KHJNNEvEkZVzm69f3jCS+uLI=,20170331,52,3,5,3,84,110,23203.337
3,ycwLc+m2O0a85jSLALtr941AaZt9ai8Qwlg9n0Nql5U=,20170331,176,4,2,2,19,191,7100.454
4,EGcbTofOSOkMmQyN1NMLxHEXJ1yV3t/JdhGwQ9wXjnI=,20170331,2,1,0,1,112,93,28401.558


In [18]:
# describe() is lazy in dask: without .compute() it only shows the structure of the summary, not the statistics
user_logs_df.describe()

Unnamed: 0_level_0,date,num_25,num_50,num_75,num_985,num_100,num_unq,total_secs
npartitions=1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
,float64,float64,float64,float64,float64,float64,float64,float64
,...,...,...,...,...,...,...,...


In [19]:
all_unique_values(user_logs_df, "msno")

False

## 1.3 Merging Dataframes<a id='Merging_Dataframes'></a>

### 1.3.1 Merge train_df and transactions_df: train_trans_df<a id='train_trans_df'></a>

In [20]:
# Each dataframe has an 'msno' column which we will use to join the dataframes.
# Let's start with the inner join of train_df and transactions_df, join_train_trans:
join_train_trans = dd.merge(train_df, transactions_df, on='msno', how='inner')

In [21]:
join_train_trans.head()

Unnamed: 0,msno,is_churn,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel
0,f/NmvEzHfhINFEYZTR05prUdr+E+3+oewvweYz9cCQE=,1,36,30,180,180,0,20170311,20170411,0
1,zLo9f73nGGT1p21ltZC3ChiRnAVvgibMyazbCxvWPcg=,1,17,60,0,0,0,20170311,20170314,0
2,zLo9f73nGGT1p21ltZC3ChiRnAVvgibMyazbCxvWPcg=,1,15,90,300,300,0,20170314,20170615,0
3,8iF/+8HY8lJKFrTc7iR9ZYGCG2Ecrogbc2Vy5YhsfhQ=,1,41,30,149,149,1,20150908,20170608,0
4,8iF/+8HY8lJKFrTc7iR9ZYGCG2Ecrogbc2Vy5YhsfhQ=,1,41,30,149,149,1,20150908,20170708,0


In [22]:
# let's see if we have unique msno ids:
all_unique_values(join_train_trans, "msno")

False

Since msno is not unique here, some customers must have multiple transactions. The most recent transaction is the most informative one for predicting churn, so we will keep only that transaction for each customer.

In [23]:
# Group join_train_trans by msno and choose the most recent transaction_date.
# Since transaction_date is stored as a number in YYYYMMDD form, the maximum
# value of transaction_date is the most recent transaction
most_recent_trans = join_train_trans.groupby("msno")["transaction_date"].max()
most_recent_trans = most_recent_trans.to_frame().reset_index()
train_trans_df = dd.merge(most_recent_trans, join_train_trans, on=["msno", "transaction_date"], how="inner")

In [24]:
all_unique_values(train_trans_df, "msno")

False

There are still duplicates, so some customers must have multiple transactions on the same day. We keep one row per customer and date and drop the rest; note that with keep="first" the row that survives depends on row order:

In [27]:
train_trans_df = train_trans_df.drop_duplicates(subset=["msno", "transaction_date"], keep="first")
all_unique_values(train_trans_df, "msno")

True
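When a customer has several transactions on the same day, it can be useful to decide explicitly which one survives. The pandas sketch below sorts before dropping duplicates so the choice is deterministic. Breaking ties by the latest membership_expire_date is an assumption made for illustration, not the rule used in this notebook, and the toy rows are invented.

```python
import pandas as pd

# Toy data: customer "a" has two transactions on 20170331
transactions = pd.DataFrame({
    "msno": ["a", "a", "a", "b"],
    "transaction_date": [20170301, 20170331, 20170331, 20170315],
    "membership_expire_date": [20170331, 20170430, 20170531, 20170415],
})

# Sort so the most recent transaction (ties broken by latest expiry) is last,
# then keep only that last row for each customer
latest = (
    transactions
    .sort_values(["msno", "transaction_date", "membership_expire_date"])
    .drop_duplicates(subset="msno", keep="last")
    .reset_index(drop=True)
)

print(latest)
```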

### 1.3.2 Merge train_trans_df and members_df: tt_mem_df<a id='tt_mem_df'></a>

When we read in the members_df dataframe, we saw that every msno in members_df is unique, so this join is more straightforward than the previous one. We are simply adding the demographic data in members_df to each customer's row.

In [28]:
tt_mem_df = dd.merge(train_trans_df, members_df, on='msno', how='inner')
tt_mem_df.head()

Unnamed: 0,msno,transaction_date,is_churn,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,membership_expire_date,is_cancel,city,age,gender,registered_via,registration_init_time
0,++0nOC7BmrUTtcSboRORfg6ZXTajnBDt1f/SEgH6ONo=,20170306,0,40,30,149,149,1,20170414,0,13,25,male,9,20100203
1,++6xEqu4JANaRY4GjEfEFtLtqOvZvYPyP3uk/PW9Ces=,20170331,0,41,30,99,99,1,20170430,0,1,0,,7,20160501
2,++95tJZADNg8U8HKbYdxbbXIRsO6pw1zBK4tHI7BtZo=,20170331,0,39,30,149,149,1,20170524,0,14,35,female,3,20120603
3,++A8p4GrsTnMjI6hAZEtlRsaz6s6O9ddUoH0fmS4s7s=,20170326,0,30,30,149,149,1,20170426,0,5,43,female,9,20141118
4,++EcAZQCSSJQMx37/+/QqjiVQq3cS/hGug6JlzCufig=,20170331,0,39,30,149,149,1,20170518,0,4,28,male,9,20110205
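The assumption that both sides of this join have unique keys can also be enforced at merge time. The sketch below uses the `validate` argument of pandas `merge` on invented toy frames; whether the dask merge accepts the same argument is not checked here.

```python
import pandas as pd

left = pd.DataFrame({"msno": ["a", "b", "c"], "is_churn": [0, 1, 0]})
right = pd.DataFrame({"msno": ["a", "b", "d"], "city": [1, 13, 5]})

# Succeeds because msno is unique on both sides
merged = pd.merge(left, right, on="msno", how="inner", validate="one_to_one")

# Duplicate one key on the right side to show the check failing
dup_right = pd.concat([right, right.iloc[[0]]], ignore_index=True)
try:
    pd.merge(left, dup_right, on="msno", how="inner", validate="one_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True

print(len(merged), raised)
```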


### 1.3.3 Merge tt_mem_df and user_logs_df: full_df<a id='full_df'></a>

Before we merge the datasets, we want to sum up the data in user_logs per customer so that we get a picture of each customer's habits over the whole period covered by the logs, not just on a particular day.

In [29]:
# First, set every value in the date column to 1 so that summing it gives the total number of logged days
all_time_df = user_logs_df.copy()
all_time_df["date"] = 1

# rename columns to reflect that we are looking at the totals for each feature
all_time_df = all_time_df.rename(columns={"date"   :"total_days", 
                                          "num_25" :"total_num_25", 
                                          "num_50" :"total_num_50", 
                                          "num_75" :"total_num_75", 
                                          "num_985":"total_num_985",
                                          "num_100":"total_num_100",
                                          "num_unq":"total_num_unique"})

In [30]:
# Next, we will group by msno and aggregate using sum
all_time_df = all_time_df.groupby('msno').sum()
all_time_df = all_time_df.reset_index()
all_time_df.head()

Unnamed: 0,msno,total_days,total_num_25,total_num_50,total_num_75,total_num_985,total_num_100,total_num_unique,total_secs
0,+++IZseRRiQS9aaSkH6cMYU6bGDcxUieAi/tH67sC5s=,26,86,11,10,5,472,530,117907.425
1,+++hVY1rZox/33YtvDgmKA2Frg/2qhkz12B9ylCvh8o=,31,191,90,75,144,589,885,192527.892
2,+++l/EXNMLTijfLBa8p2TUVVVp2aFGSuUI/h7mLmthw=,28,43,12,15,12,485,468,115411.26
3,++0+IdHga8fCSioOVpU8K7y4Asw8AveIApVH2r9q9yY=,25,190,34,21,20,331,582,90177.554
4,++0/NopttBsaAn6qHZA2AWWrDg7Me7UOMs1vsyo4tSI=,8,21,8,17,7,104,115,28450.268
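The same per-customer totals can be sketched in pandas with named aggregation, which counts log rows directly instead of overwriting the date column with 1. This is an alternative formulation on a few toy rows, not the code that produced the output above, and only a subset of the columns is shown.

```python
import pandas as pd

# Toy log rows: customer "a" was active on two days, "b" on one
logs = pd.DataFrame({
    "msno": ["a", "a", "b"],
    "date": [20170330, 20170331, 20170331],
    "num_100": [9, 21, 84],
    "total_secs": [2390.699, 6309.273, 23203.337],
})

totals = (
    logs.groupby("msno")
    .agg(
        total_days=("date", "count"),        # number of log rows per customer
        total_num_100=("num_100", "sum"),    # songs played almost to the end
        total_secs=("total_secs", "sum"),    # total listening time in seconds
    )
    .reset_index()
)

print(totals)
```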


In [31]:
# Finally, we can merge tt_mem_df and all_time_df to get full_df:
full_df = dd.merge(tt_mem_df, all_time_df, on='msno', how='inner')

In [32]:
# Let's get the shape of our final dataframe:
print("rows:", f'{len(full_df):,}')
print("columns:", len(full_df.columns))

rows: 725,722
columns: 23
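train_v2.csv holds 883,630 + 87,330 = 970,960 labelled customers, while full_df has 725,722 rows, so the inner joins dropped 245,238 customers who are missing from at least one of the other tables. A minimal pandas sketch of how such unmatched keys could be inspected with `indicator=True` follows; the frames are invented toy data.

```python
import pandas as pd

labels = pd.DataFrame({"msno": ["a", "b", "c"], "is_churn": [0, 1, 0]})
logs = pd.DataFrame({"msno": ["a", "c", "d"], "total_days": [26, 8, 31]})

# A left join with indicator=True records where each row found a match
check = pd.merge(labels, logs, on="msno", how="left", indicator=True)

# Customers with a label but no activity log
unmatched = check.loc[check["_merge"] == "left_only", "msno"].tolist()
print(unmatched)
```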


In [33]:
# Confirm we still have all unique msno ids
all_unique_values(full_df, "msno")

True

In [34]:
# Let's take a look at all our features
full_df.head(1).T

Unnamed: 0,0
msno,++6xEqu4JANaRY4GjEfEFtLtqOvZvYPyP3uk/PW9Ces=
transaction_date,20170331
is_churn,0
payment_method_id,41
payment_plan_days,30
plan_list_price,99
actual_amount_paid,99
is_auto_renew,1
membership_expire_date,20170430
is_cancel,0


## 1.4 Data Exploration<a id='Data_Exploration'></a>