# KKBox Customer Churn - Supervised Learning Capstone - Data Loading Methods 

Kaggle Competition: https://www.kaggle.com/c/kkbox-churn-prediction-challenge/data

**Scope:** I am trying to see if a user who is active in February 2017, will still be a user in March 2017

## Import and Preview Data

#### - <font color=blue>Import Libraries</font> -

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import missingno as msno
import sqlite3 as sql

#### - <font color=blue>Summary of Datasets</font> -
For this project we are given several massive datasets totaling over 30 GB. In general the datasets are divided into two versions: ***v1*** and ***v2***.
_____
**train_v1:** containing the user ids and whether they churned until ***2/28/2017***.

**train_v2:** containing the user ids and whether they churned for the month of ***March 2017***.

Features:
    - msno: user id
    - is_churn: This is the target variable. Churn is defined as whether the user did not continue the subscription within 30 days of expiration. is_churn = 1 means churn,is_churn = 0 means renewal.

_____
**transactions_v1:** transactions of users up until ***2/28/2017***.

**transactions_v2:** transactions of users up until ***3/31/2017***.

Features:
    - msno: user id (***Repeated as a user can have various Transactions***)
    - payment_method_id: payment method
    - payment_plan_days: length of membership plan in days
    - plan_list_price: in New Taiwan Dollar (NTD)
    - actual_amount_paid: in New Taiwan Dollar (NTD)
    - is_auto_renew
    - transaction_date: format %Y%m%d
    - membership_expire_date: format %Y%m%d
    - is_cancel: whether or not the user canceled the membership in this transaction.

_____
**user_log_v1:** transactions of users up until ***2/28/2017***.

**user_log_v2:** transactions of users for the month of ***March 2017***.

Features:
    - msno: user id (***Repeated as a user can have various Logins***)
    - date: format %Y%m%d
    - num_25: # of songs played less than 25% of the song length
    - num_50: # of songs played between 25% to 50% of the song length
    - num_75: # of songs played between 50% to 75% of of the song length
    - num_985: # of songs played between 75% to 98.5% of the song length
    - num_100: # of songs played over 98.5% of the song length
    - num_unq: # of unique songs played
    - total_secs: total seconds played

_____
**members_v3:** All user information data.

Features:
    - msno: user id
    - city
    - bd: age. Note: this column has outlier values ranging from -7000 to 2015, please use your judgement.
    - gender
    - registered_via: registration method
    - registration_init_time: format %Y%m%d

_____

#### - <font color=blue>Dataset Statistics</font> -
**# of Observations**: > 300,000,000

**Dataset Sizes**

- ***train_v1 Dataset:*** 45.56 MB 
- ***train_v2 Dataset:*** 44.56 MB   
- ***transactions_v1 Dataset:*** 1.68 GB     
- ***transactions_v2 Dataset:*** 112.69 MB     
- ***user_log_v1 Dataset:*** 29.78 GB     
- ***user_log_v2 Dataset:*** 1.40 GB 
- ***members_v3 Dataset:*** 417.89 MB
- ***<font color=red>Total:  33.48 GB</font>***


For the most part, ***January 2017 - Train Set*** , ***February 2017 - Validation Set*** data will be coming from the ***v1*** files and ***March 2017 - Test Set*** data will be coming from the ***v2*** files. Although the initial sets will be somewhat limited in features, once we have them imported we will make various queries to create new features.

### - <font color=blue>Import Data into Database</font> -
As described in the 'KKBox Data Loading Methods' Notebook, after working out a succefully lengthy solution completely in Python, I decided to utilize SQLite3 for a more 'Persistent' solution. Postgre was all so considered but this was more practical for the task.

Since the data in each dataset are in different timeframes, the initial Train, Validation, and Test Sets will contain general information for each member. For example:
- The Transaction datset has recorded every single transaction made by a user.
- The User Log dataset has recorded every single time a user has logged onto the platform.

Since these datasets capture different types of user behaviors over different timeframes we can't just join them all together immediately. However since they do capture behavior over time, I believe that there would be a ton of value if we are able to get creative on how we capture this ***retrospective data***. As we go through EDA and Feature Creation we will create these new features through additional queries and python commands.

In [3]:
# Create Connection to SQLite
cnx = sql.connect("C:\J-5 Local SSD\Projects\KKBox Customer Churn\Database\KKBox_DB.db")  # Opens file if exists, else creates file
cur = cnx.cursor()  # This object lets us actually send messages to our DB and receive results
print("Opened database successfully")

Opened database successfully


In [None]:
# # Set file path for all Data
# path = 'C:/J-5 Local SSD/Projects/KKBox Customer Churn/Datasets/'

# # Create list of all dataset names
# data_list = ['train_v1', 'train_v2', "transactions_v1", 'transactions_v2', 'user_logs_v1', 'user_logs_v2', 'members_v3']

# for dset in data_list:
#     for chunk in pd.read_csv(path+dset+'.csv', chunksize=1000000):
#         chunk.to_sql(name=dset, con=cnx, if_exists="append", index=False)  #"name" is name of table 

Now that our datasets are loaded, we will create our Train, Validation, and Test sets respective to their indiviudal months and write them as ***train_jan, valid_feb, test_march.***
    
By creating these sets in our SQLite Database we avoid having to do this in-memory here in this JupyterNB. Also, this is mainly done to simulate working with a massive dataset stored in a database as is expected to happen in practice. Not toy datasets here :) 

Let's get started!

### - <font color=blue>Build out our Datasets</font> -

**How we will build them:**
    
    1) Get all unique users from transactions v1 and v2 with expirations dates falling within each timeframe.
       - Now that we have the basis for each set we will merge continuous merge on msno for each set.

    2) Join members_v3 and all sets on msno

    3.a) Join train_v1 and train_jan on msno

    3.b) Join train_v1 and val_feb on msno 

    4) Join train_v2 and test_mar on msno

This will be all done using SQL commands from this notebook.

**Build all the basis for all three Main Datasets**

In [56]:
# Build all three sets with respect to their respective months and all the users who have Memebership Expirations falling in those months
sets = {'train_jan': ['transactions_v1','20170101','20170101'], 'valid_feb': ['transactions_v1','20170201','20170228'], 'test_mar': ['transactions_v2','20170301','20170331']}

for setname, info in sets.items():  
    cur.execute(f'''CREATE TABLE IF NOT EXISTS {setname} AS
                    SELECT *
                    FROM {info[0]}
                    WHERE membership_expire_date >= {info[1]} AND membership_expire_date <= {info[2]}
    ''')
    
cnx.commit()

In [4]:
# Create list of all table names in DB
alltable_names = [name[0] for name in cur.execute("SELECT name FROM sqlite_master WHERE type='table';")]

# Create a list of all dataset names
datasets = alltable_names[-3:]

**Index msno columns on all tables**

In [19]:
# Index msno columns in all tables, to help with performance moving forward
for table in alltable_names:
    cur.execute(f"CREATE INDEX IF NOT EXISTS msno_idx ON {table}(msno);")
    cnx.commit
    print(f'{table} index created!')


train_v1 index created!
train_v2 index created!
transactions_v1 index created!
transactions_v2 index created!
user_logs_v1 index created!
user_logs_v2 index created!
members_v3 index created!
train_jan index created!
valid_feb index created!
test_mar index created!


**Join members_v3 data onto Main Datasets**

In [36]:
# Make list of member_v3 columns
member_columns = [column[0] for column in cur.execute('select * from members_v3 limit 10').description]
member_columns.remove('msno')

# Create dictionary of member_columns and their datatypes
datatypes = ['INTEGER','INTEGER','TEXT','INTEGER','INTEGER']
member_dict = dict(zip(member_columns, datatypes))

In [34]:
# Create new columns in each Main Dataset to add members_v3 data
for dataset in datasets:
    for column, coltype in member_dict.items():
        cur.execute(f'ALTER TABLE {dataset} ADD COLUMN {column} {coltype};')
        cnx.commit()
        
    print(f'{dataset} Columns have been added')

train_jan Columns have been added
valid_feb Columns have been added
test_mar Columns have been added


In [None]:
# Join members_v3 and all sets on msno
for dataset in datasets:
    for column, coltype in member_dict.items():
        cur.execute(f'UPDATE {dataset} SET {column} = (SELECT {column} FROM members_v3 WHERE msno = {dataset}.msno)')
        cnx.commit()
    
    print(f'{dataset} Has joined successfully')

**Join train_v1 and train_v2 onto all Main Datasets**

In [None]:
# Make list of train_v1/train_v2 columns
train_columns = [column[0] for column in cur.execute('select * from train_v1 limit 1').description]
train_columns.remove('msno')

# Create dictionary of member_columns and their datatypes
datatypes = ['INTEGER']
train_dict = dict(zip(train_columns, datatypes))

In [None]:
# Create new columns in each Main Dataset to add train_v1/train_v2 data
for dataset in datasets:
    for column, coltype in train_dict.items():
        cur.execute(f'ALTER TABLE {dataset} ADD COLUMN {column} {coltype};')
        cnx.commit()
        
    print(f'{dataset} Columns have been added')

In [None]:
# Join train_v1 and train_jan,valid_feb on msno
for dataset in datasets[:2]:
    for column, coltype in train_dict.items():
        cur.execute(f'UPDATE {dataset} SET {column} = (SELECT {column} FROM train_v1 WHERE msno = {dataset}.msno)')
        cnx.commit()
    
    print(f'{dataset} Has joined successfully')

# Join train_v2 and test_mmar on msno
for dataset in datasets[2]:
    for column, coltype in train_dict.items():
        cur.execute(f'UPDATE {dataset} SET {column} = (SELECT {column} FROM train_v2 WHERE msno = {dataset}.msno)')
        cnx.commit()
    
    print(f'{dataset} Has joined successfully')

### - <font color=blue>Pull Training Data from Database</font> -
Our Train

In [8]:
# Create Connection to SQLite
df = pd.read_sql_query("SELECT * FROM train_jan", cnx)

### Clean Data

#### - <font color=blue>Detect Missing Values in Dataset</font> -

In [79]:
# Table of Features and their respective totals percentages of missing data in both Test and Train sets
total_missing_train = train_jan.isnull().sum().sort_values(ascending=False)
percent_missing_train = (train_jan.isnull().sum()/train_jan.isnull().count())

total_missing_valid = valid_feb.isnull().sum().sort_values(ascending=False)
percent_missing_valid = (valid_feb.isnull().sum()/valid_feb.isnull().count())

total_missing_test = test_mar.isnull().sum().sort_values(ascending=False)
percent_missing_test = (test_mar.isnull().sum()/test_mar.isnull().count())

columns = [total_missing_train, percent_missing_train, total_missing_valid, percent_missing_valid, total_missing_test, percent_missing_test]

missing_data = pd.concat(columns, axis=1, keys=['Total Missing Train', 'Percent Missing Train', 'Total Missing Validation', 'Percent Missing Validation', 'Total Missing Test', 'Percent Missing Test'], sort=False).sort_values(by='Percent Train',ascending = False)
missing_data

Unnamed: 0,Total Train,Percent Train,Total Test,Percent Test
gender,601525,0.605428,582129,0.599324
num_25_sum,229963,0.231455,216427,0.22282
num_50_sum,229963,0.231455,216427,0.22282
date_count,229963,0.231455,216427,0.22282
total_secs_sum,229963,0.231455,216427,0.22282
num_unq_sum,229963,0.231455,216427,0.22282
num_100_sum,229963,0.231455,216427,0.22282
num_985_sum,229963,0.231455,216427,0.22282
num_75_sum,229963,0.231455,216427,0.22282
bd,115800,0.116551,110000,0.113249
