Data from https://www.kaggle.com/blastchar/telco-customer-churn

## Notes

This session covered obtaining the data and some data preparation steps. 

**Commands, functions, and methods:** 

* `!wget` - Linux shell command for downloading data 
* `pd.read_csv()` - read CSV files 
* `df.head()` - take a look at the first rows of the dataframe 
* `df.head().T` - take a look at the first rows of the transposed dataframe 
* `df.columns` - retrieve column names of a dataframe 
* `df.columns.str.lower()` - lowercase all the letters 
* `df.columns.str.replace(' ', '_')` - replace the space separator 
* `df.dtypes` - retrieve data types of all series 
* `df.index` - retrieve the index of a dataframe
* `pd.to_numeric()` - convert the values of a series to numbers. With the `errors='coerce'` argument, values that cannot be parsed become `NaN` instead of raising an error. 
* `df.fillna()` - replace NAs with some value 
* `(df.x == "yes").astype(int)` - convert x series of yes-no values to numerical values.
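The preparation steps listed above can be sketched on a tiny, made-up dataframe. The column names and values below are hypothetical stand-ins for the Telco file, chosen only to show how `errors='coerce'`, `fillna`, and the yes/no conversion behave.

```python
import pandas as pd

# Hypothetical miniature of the Telco data; the real file has many more columns.
df = pd.DataFrame({
    'Customer ID': ['0001-A', '0002-B', '0003-C'],
    'TotalCharges': ['29.85', ' ', '108.15'],  # a blank string stands in for a missing value
    'Churn': ['No', 'Yes', 'No'],
})

# Normalize column names: lowercase, underscores instead of spaces
df.columns = df.columns.str.lower().str.replace(' ', '_')

# Strings that cannot be parsed become NaN with errors='coerce', then are filled with 0
df['totalcharges'] = pd.to_numeric(df['totalcharges'], errors='coerce').fillna(0)

# Encode the yes/no target as 1/0
df['churn'] = (df['churn'].str.lower() == 'yes').astype(int)

print(df.dtypes)
print(df)
```

Filling with 0 mirrors what this notebook does for `TotalCharges`; whether 0 is the right replacement depends on why the value is missing.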

In [None]:
import pandas as pd
import numpy as np

import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline

In [None]:
df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')

In [None]:
len(df)

## Initial data preparation

## Notes

Splitting the dataset with **Scikit-Learn**. 

**Classes, functions, and methods:** 

* `train_test_split` - Scikit-Learn function for splitting a dataset into two parts. The `test_size` argument sets the share of rows that goes into the second part, and the `random_state` argument sets a random seed for reproducibility purposes.  
* `df.reset_index(drop=True)` - reset the indices of a dataframe and delete the previous ones. 
* `df.x.values` - extract the values from x series
* `del df['x']` - delete x series from a dataframe 
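As a rough sketch of how these pieces fit together, the following splits a made-up 100-row dataframe into train, validation, and test parts. The dataframe is hypothetical; the split sizes and seeds are the ones used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the churn data
df = pd.DataFrame({'tenure': np.arange(100), 'churn': np.arange(100) % 2})

# First split off the test set, then carve the validation set out of the remainder
df_train_full, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_train_full, test_size=0.33, random_state=11)

# Shuffling keeps the original row labels; reset them to 0..n-1
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)

# Pull the target out as NumPy arrays and drop it from the feature frames
y_train = df_train.churn.values
y_val = df_val.churn.values
del df_train['churn']
del df_val['churn']

print(len(df_train), len(df_val), len(df_test))
```

Because the second split takes 33% of the remaining 80%, the validation set ends up at roughly a quarter of the full data rather than a third.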

In [None]:
df.head()

In [None]:
df.head().T

In [None]:
df.dtypes

In [None]:
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df['TotalCharges'] = df['TotalCharges'].fillna(0)

In [None]:
df.columns = df.columns.str.lower().str.replace(' ', '_')

string_columns = list(df.dtypes[df.dtypes == 'object'].index)

for col in string_columns:
    df[col] = df[col].str.lower().str.replace(' ', '_')

In [None]:
df.churn = (df.churn == 'yes').astype(int)

In [None]:
df.head().T

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
df_train_full, df_test = train_test_split(df, test_size=0.2, random_state=1)

In [None]:
df_train, df_val = train_test_split(df_train_full, test_size=0.33, random_state=11)

In [None]:
y_train = df_train.churn.values
y_val = df_val.churn.values

In [None]:
del df_train['churn']
del df_val['churn']

## Exploratory data analysis

## Notes

The EDA for this project consisted of: 
* Checking missing values 
* Looking at the distribution of the target variable (churn)
* Looking at numerical and categorical variables 

**Functions and methods:** 

* `df.isnull().sum()` - returns the number of null values in each column of the dataframe.  
* `df.x.value_counts()` - returns the number of rows for each category in x series. The `normalize=True` argument returns the proportion of each category instead. In this project, the mean of churn is equal to the churn rate obtained with the value_counts method, because churn is encoded as 0 and 1. 
* `round(x, y)` - round an x number with y decimal places
* `df[x].nunique()` - returns the number of unique values in x series 
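The link between `value_counts(normalize=True)` and the mean of a 0/1 column can be checked on a small example. The series below is made up for illustration.

```python
import pandas as pd

# Hypothetical target column: 3 churned customers out of 10
churn = pd.Series([0, 0, 1, 0, 1, 0, 0, 0, 1, 0], name='churn')

counts = churn.value_counts()                # rows per class
shares = churn.value_counts(normalize=True)  # proportion per class
churn_rate = churn.mean()                    # mean of a 0/1 column = share of 1s

print(counts)
print(shares)
print(round(churn_rate, 3))
```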


In [None]:
df_train_full.isnull().sum()

In [None]:
df_train_full.churn.value_counts()

In [None]:
global_mean = df_train_full.churn.mean()
round(global_mean, 3)

In [None]:
categorical = ['gender', 'seniorcitizen', 'partner', 'dependents',
               'phoneservice', 'multiplelines', 'internetservice',
               'onlinesecurity', 'onlinebackup', 'deviceprotection',
               'techsupport', 'streamingtv', 'streamingmovies',
               'contract', 'paperlessbilling', 'paymentmethod']
numerical = ['tenure', 'monthlycharges', 'totalcharges']

In [None]:
df_train_full[categorical].nunique()

## Feature importance

## Notes

1. **Churn rate difference:** Difference between the mean of the target variable (the global churn rate) and the mean of the target within each category of a feature (the group churn rate). If global minus group is greater than 0, the category is less likely to churn, and if it is lower than 0, the category is more likely to churn. Larger absolute differences indicate that a variable is more important than others. Note that the code in this section computes the `diff` column as group minus global, so the signs there are reversed: a positive `diff` marks a group that churns more than average. 

2. **Risk ratio:** Ratio between mean of categories for a feature and mean of the target variable. If this ratio is greater than 1, the category is more likely to churn, and if the ratio is lower than 1, the category is less likely to churn. It expresses the feature importance in relative terms. 

**Functions and methods:** 

* `df.groupby('x').y.agg(['mean'])` - returns a dataframe with the mean of y series grouped by x series 
* `display(x)` - displays an output in the cell of a Jupyter notebook. 
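Both measures are easy to verify by hand on a small example. The data below is invented (10 customers, 4 of whom churned) and uses the same group-minus-global order for `diff` as the code in this notebook.

```python
import pandas as pd

# Hypothetical data: 10 customers, 4 of whom churned
df = pd.DataFrame({
    'contract': ['monthly'] * 4 + ['yearly'] * 6,
    'churn':    [1, 1, 1, 0] + [1, 0, 0, 0, 0, 0],
})

global_mean = df.churn.mean()  # 4 / 10

df_group = df.groupby(by='contract').churn.agg(['mean'])
df_group['diff'] = df_group['mean'] - global_mean  # group minus global
df_group['risk'] = df_group['mean'] / global_mean  # group relative to global

print(df_group)
```

Here the `monthly` group churns at 3/4 against a global rate of 4/10, so its risk ratio is above 1, while the `yearly` group churns at 1/6 and its risk ratio is below 1.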

In [None]:
female_mean = df_train_full[df_train_full.gender == 'female'].churn.mean()
print('gender == female:', round(female_mean, 3))

male_mean = df_train_full[df_train_full.gender == 'male'].churn.mean()
print('gender == male:  ', round(male_mean, 3))

In [None]:
female_mean / global_mean

In [None]:
male_mean / global_mean

In [None]:
partner_yes = df_train_full[df_train_full.partner == 'yes'].churn.mean()
print('partner == yes:', round(partner_yes, 3))

partner_no = df_train_full[df_train_full.partner == 'no'].churn.mean()
print('partner == no :', round(partner_no, 3))

In [None]:
partner_yes / global_mean

In [None]:
partner_no / global_mean

In [None]:
df_group = df_train_full.groupby(by='gender').churn.agg(['mean'])
df_group['diff'] = df_group['mean'] - global_mean
df_group['risk'] = df_group['mean'] / global_mean
df_group

In [None]:
from IPython.display import display

In [None]:
global_mean = df_train_full.churn.mean()
global_mean

In [None]:
for col in categorical:
    df_group = df_train_full.groupby(by=col).churn.agg(['mean'])
    df_group['diff'] = df_group['mean'] - global_mean
    df_group['risk'] = df_group['mean'] / global_mean
    display(df_group)

## Notes

Mutual information is a concept from information theory that measures how much we can learn about one variable if we know the value of another. In this project, we can think of it as how much we learn about churn if we know the value of a particular feature. So, it is a measure of the importance of a categorical variable. 

**Classes, functions, and methods:** 

* `mutual_info_score(x, y)` - Scikit-Learn function for calculating the mutual information between the x and y series (here, a feature and the target variable). 
* `df[x].apply(y)` - apply a y function to each of the x columns of the df dataframe. 
* `series.sort_values(ascending=False).to_frame(name='x')` - sort the values of a series in descending order and convert the result to a dataframe with a single column named x. 
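To get a feel for the scale of the score, here is a minimal sketch with two made-up features: one that determines the target exactly and one that carries no information about it. `mutual_info_score` works in natural-log units, so the first case should give ln 2 for a balanced binary target.

```python
import math
from sklearn.metrics import mutual_info_score

# Hypothetical balanced target
churn = [0, 0, 1, 1]

# A feature that determines churn exactly
informative = ['a', 'a', 'b', 'b']
# A feature whose values are spread evenly across both churn classes
uninformative = ['a', 'b', 'a', 'b']

mi_high = mutual_info_score(informative, churn)
mi_low = mutual_info_score(uninformative, churn)

print(mi_high, math.log(2))
print(mi_low)
```

Scores for real features fall between these two extremes, which is why the notebook sorts them rather than reading the absolute values.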

In [None]:
from sklearn.metrics import mutual_info_score

In [None]:
def calculate_mi(series):
    return mutual_info_score(series, df_train_full.churn)

df_mi = df_train_full[categorical].apply(calculate_mi)
df_mi = df_mi.sort_values(ascending=False).to_frame(name='MI')


display(df_mi.head())
display(df_mi.tail())

## Notes

**Correlation coefficient** measures the degree of linear dependency between two variables. This value is negative if one variable grows while the other decreases, and it is positive if both variables increase together. Depending on its absolute value, the dependency between both variables could be low, moderate, or strong. It allows measuring the importance of numerical variables. 

**Functions and methods:** 

* `df[x].corrwith(y)` - returns the correlation between each of the x columns and the y series. 
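A small sketch of the sign convention, using invented columns that are perfectly linearly related to a made-up target (so the coefficients sit at the two extremes):

```python
import pandas as pd

# Hypothetical data: one column rises with the target, the other falls
df = pd.DataFrame({
    'rising':  [1, 2, 3, 4, 5],
    'falling': [50, 40, 30, 20, 10],
    'target':  [2, 4, 6, 8, 10],
})

# One correlation per selected column, each computed against the target series
corr = df[['rising', 'falling']].corrwith(df['target'])
print(corr)
```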


In [None]:
df_train_full[numerical].corrwith(df_train_full.churn).to_frame('correlation')

In [None]:
df_train_full.groupby(by='churn')[numerical].mean()

## One-hot encoding

## Notes

One-Hot Encoding allows encoding categorical variables as numerical ones. This method represents each category of a variable as one column, and a 1 is assigned if the value belongs to the category or 0 otherwise. 

**Classes, functions, and methods:** 

* `df[x].to_dict(orient='records')` - convert the x columns to a list of dictionaries, one per row. 
* `DictVectorizer().fit_transform(x)` - Scikit-Learn class for converting a list of x dictionaries into a matrix (sparse by default; `sparse=False` returns a NumPy array), and in this way doing the one-hot encoding. It leaves numerical values unchanged. 
* `DictVectorizer().get_feature_names()` - returns the names of the columns in the matrix. In Scikit-Learn 1.0 and later the method is called `get_feature_names_out()`; the old name was removed in version 1.2.  

In [None]:
from sklearn.feature_extraction import DictVectorizer

In [None]:
train_dict = df_train[categorical + numerical].to_dict(orient='records')

In [None]:
train_dict[0]

In [None]:
dv = DictVectorizer(sparse=False)
dv.fit(train_dict)

In [None]:
X_train = dv.transform(train_dict)

In [None]:
X_train.shape

In [None]:
# get_feature_names() was removed in scikit-learn 1.2
dv.get_feature_names_out()

## Training logistic regression

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
model = LogisticRegression(solver='liblinear', random_state=1)
model.fit(X_train, y_train)

In [None]:
val_dict = df_val[categorical + numerical].to_dict(orient='records')
X_val = dv.transform(val_dict)

In [None]:
model.predict_proba(X_val)

In [None]:
y_pred = model.predict_proba(X_val)[:, 1]

In [None]:
y_pred

In [None]:
churn = y_pred > 0.5

In [None]:
(y_val == churn).mean()

## Model interpretation

In [None]:
model.intercept_[0]

In [None]:
dict(zip(dv.get_feature_names_out(), model.coef_[0].round(3)))

In [None]:
subset = ['contract', 'tenure', 'totalcharges']
train_dict_small = df_train[subset].to_dict(orient='records')
dv_small = DictVectorizer(sparse=False)
dv_small.fit(train_dict_small)

X_small_train = dv_small.transform(train_dict_small)

dv_small.get_feature_names_out()

In [None]:
model_small = LogisticRegression(solver='liblinear', random_state=1)
model_small.fit(X_small_train, y_train)

In [None]:
model_small.intercept_[0]

In [None]:
dict(zip(dv_small.get_feature_names_out(), model_small.coef_[0].round(3)))

In [None]:
val_dict_small = df_val[subset].to_dict(orient='records')
X_small_val = dv_small.transform(val_dict_small)

In [None]:
y_pred_small = model_small.predict_proba(X_small_val)[:, 1]
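One way to read the intercept and coefficients printed above: logistic regression turns the linear score `w0 + x · w` into a probability with the sigmoid function. The sketch below checks this on randomly generated data (the data and the `sigmoid` helper are assumptions made for illustration, not part of the notebook).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: 200 rows, 3 numerical features, binary target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression(solver='liblinear', random_state=1)
model.fit(X, y)

def sigmoid(z):
    """Map a linear score to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

w0 = model.intercept_[0]
w = model.coef_[0]

# Probability of the positive class computed by hand from the weights
manual = sigmoid(w0 + X @ w)
from_sklearn = model.predict_proba(X)[:, 1]

print(np.allclose(manual, from_sklearn))
```

With this in mind, a positive coefficient pushes the predicted churn probability up and a negative one pushes it down, which is how the small model can be interpreted feature by feature.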

## Using the model

In [None]:
customer = {
    'customerid': '8879-zkjof',
    'gender': 'female',
    'seniorcitizen': 0,
    'partner': 'no',
    'dependents': 'no',
    'tenure': 41,
    'phoneservice': 'yes',
    'multiplelines': 'no',
    'internetservice': 'dsl',
    'onlinesecurity': 'yes',
    'onlinebackup': 'no',
    'deviceprotection': 'yes',
    'techsupport': 'yes',
    'streamingtv': 'yes',
    'streamingmovies': 'yes',
    'contract': 'one_year',
    'paperlessbilling': 'yes',
    'paymentmethod': 'bank_transfer_(automatic)',
    'monthlycharges': 79.85,
    'totalcharges': 3320.75,
}

In [None]:
X_test = dv.transform([customer])
model.predict_proba(X_test)[0, 1]

In [None]:
print(list(X_test[0]))

In [None]:
customer = {
    'gender': 'female',
    'seniorcitizen': 1,
    'partner': 'no',
    'dependents': 'no',
    'phoneservice': 'yes',
    'multiplelines': 'yes',
    'internetservice': 'fiber_optic',
    'onlinesecurity': 'no',
    'onlinebackup': 'no',
    'deviceprotection': 'no',
    'techsupport': 'no',
    'streamingtv': 'yes',
    'streamingmovies': 'no',
    'contract': 'month-to-month',
    'paperlessbilling': 'yes',
    'paymentmethod': 'electronic_check',
    'tenure': 1,
    'monthlycharges': 85.7,
    'totalcharges': 85.7
}

In [None]:
X_test = dv.transform([customer])
model.predict_proba(X_test)[0, 1]

## Explore more

More things

* Try to exclude least useful features

Use scikit-learn in the project from last week

* Re-implement the train/val/test split using scikit-learn in the project from last week
* Also, instead of our own linear regression, use `LinearRegression` (not regularized) and `Ridge` (regularized). Find the best regularization parameter for Ridge
* There are other ways to implement one-hot encoding, e.g. using the `OneHotEncoder` class. Check how to use it [here](notebook-scaling-ohe.ipynb).
* Sometimes numerical features require scaling, especially for iterative solvers like "lbfgs". Check how to use `StandardScaler` for that [here](notebook-scaling-ohe.ipynb).
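As a starting point for the last two items, here is a minimal sketch of `OneHotEncoder` and `StandardScaler` on made-up values. It is not taken from the linked notebook; the arrays are hypothetical and only show the shape of the API.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical categorical and numerical columns (each as a 2-D array)
contract = np.array([['one_year'], ['two_year'], ['one_year']])
tenure = np.array([[1.0], [12.0], [41.0]])

# OneHotEncoder returns a sparse matrix by default; convert it for display
ohe = OneHotEncoder()
X_cat = ohe.fit_transform(contract).toarray()

# StandardScaler subtracts the mean and divides by the standard deviation
scaler = StandardScaler()
X_num = scaler.fit_transform(tenure)

# Stack encoded categories and scaled numbers into one feature matrix
X = np.hstack([X_cat, X_num])

print(ohe.categories_)
print(X)
```

As with `DictVectorizer`, both objects should be fitted on the training data and then reused with `transform` on the validation and test data.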


Other projects

* Lead scoring - https://www.kaggle.com/ashydv/leads-dataset
* Default prediction - https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients
