# 3. Machine learning for classification: Churn prediction

Start of video: [ML Zoomcamp 3.1 - Churn Prediction Project](https://www.youtube.com/watch?v=0Zw04wdeTQo&list=PL3MmuxUbc_hIhxl5Ji8t4O6lPAOpHaCLR&index=29&t=34s)

## Problem description:

This week we will talk about machine learning for **classification**. The project we will work on is **churn prediction**. Imagine we work at a telecom company. Our customers use services such as phone, internet, and TV. Not all of them are happy with these services, and some are thinking of ending their contract and moving to another company. We want to identify the customers who are likely to leave (churn) and assign each one a score, such as 20%, 30%, or 85%, that tells us how likely that customer is to leave. The closer the score is to 1, the higher the likelihood of churning. We then classify customers as 0 or 1: 0 means the customer will stay with us, and 1 means the customer is likely to leave. To customers who are likely to churn we can send promotional emails, for example a 25% discount, so that they decide to stay. We approach this problem as **binary classification**.

There are several kinds of supervised machine learning problems: regression and classification, where classification can be binary or multiclass. This one is a binary classification problem: the target variable takes the values 0 and 1.
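
The score-then-classify step described above can be sketched as follows. This is only an illustration, not the course's model: the scores are made up, and the 0.5 cut-off is an assumption chosen for the example.

```python
import numpy as np

# Hypothetical churn scores for five customers (made-up values).
scores = np.array([0.20, 0.30, 0.85, 0.55, 0.05])

# Assumed decision threshold; the right value depends on the business case.
threshold = 0.5

# 1 = likely to churn, 0 = likely to stay.
churn_decision = (scores >= threshold).astype(int)
print(churn_decision)
```

Customers flagged with 1 would be the ones to receive the promotional email.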

## 3.1 Churn prediction project

* Link to Dataset: https://www.kaggle.com/datasets/blastchar/telco-customer-churn

* https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-03-churn-prediction/WA_Fn-UseC_-Telco-Customer-Churn.csv

End of video: [ML Zoomcamp 3.1 - Churn Prediction Project](https://www.youtube.com/watch?v=0Zw04wdeTQo&list=PL3MmuxUbc_hIhxl5Ji8t4O6lPAOpHaCLR&index=29&t=34s)

Start of video: [ML Zoomcamp 3.2 - Data Preparation](https://www.youtube.com/watch?v=VSGGU9gYvdg&list=PL3MmuxUbc_hIhxl5Ji8t4O6lPAOpHaCLR&index=30)

## 3.2 Data Preparation

* Download the data, read it with pandas
* Look at the data
* Make column names and values look uniform
* Check if all the columns read correctly
* Check if the churn variable needs any preparation

In [1]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
#data = 'https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-03-churn-prediction/WA_Fn-UseC_-Telco-Customer-Churn.csv'

In [3]:
#!wget $data -O data-week-3.csv 

In [4]:
ls

data-week-3.csv  WK03-classification-notebook.ipynb


In [5]:
# ! before wget means that we are executing a shell command.

In [6]:
df = pd.read_csv('data-week-3.csv')
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [7]:
df.head().T #to see the columns well

Unnamed: 0,0,1,2,3,4
customerID,7590-VHVEG,5575-GNVDE,3668-QPYBK,7795-CFOCW,9237-HQITU
gender,Female,Male,Male,Male,Female
SeniorCitizen,0,0,0,0,0
Partner,Yes,No,No,No,No
Dependents,No,No,No,No,No
tenure,1,34,2,45,2
PhoneService,No,Yes,Yes,No,Yes
MultipleLines,No phone service,No,No,No phone service,No
InternetService,DSL,DSL,DSL,DSL,Fiber optic
OnlineSecurity,No,Yes,Yes,Yes,No


In [8]:
df.columns = df.columns.str.lower().str.replace(' ', '_')
categorical_columns = list(df.dtypes[df.dtypes == 'object'].index)

for c in categorical_columns:
    df[c] = df[c].str.lower().str.replace(' ', '_')  #same approach as last week; review it if this is unclear
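
To see what this normalization does in isolation, here is a small sketch on a toy frame. The column names and values are hypothetical, and the string columns are listed by hand rather than detected from `dtypes`.

```python
import pandas as pd

# Toy frame with hypothetical names and values that contain spaces and capitals.
toy = pd.DataFrame({
    'Customer ID': ['7590-VHVEG', '5575-GNVDE'],
    'Payment Method': ['Electronic check', 'Mailed check'],
    'tenure': [1, 34],
})

# Normalize the column names: lowercase, spaces replaced by underscores.
toy.columns = toy.columns.str.lower().str.replace(' ', '_')

# Normalize the values of the string columns only; 'tenure' stays numeric.
string_columns = ['customer_id', 'payment_method']
for c in string_columns:
    toy[c] = toy[c].str.lower().str.replace(' ', '_')

print(toy)
```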

In [9]:
df.head().T

Unnamed: 0,0,1,2,3,4
customerid,7590-vhveg,5575-gnvde,3668-qpybk,7795-cfocw,9237-hqitu
gender,female,male,male,male,female
seniorcitizen,0,0,0,0,0
partner,yes,no,no,no,no
dependents,no,no,no,no,no
tenure,1,34,2,45,2
phoneservice,no,yes,yes,no,yes
multiplelines,no_phone_service,no,no,no_phone_service,no
internetservice,dsl,dsl,dsl,dsl,fiber_optic
onlinesecurity,no,yes,yes,yes,no


In [10]:
df.dtypes

customerid           object
gender               object
seniorcitizen         int64
partner              object
dependents           object
tenure                int64
phoneservice         object
multiplelines        object
internetservice      object
onlinesecurity       object
onlinebackup         object
deviceprotection     object
techsupport          object
streamingtv          object
streamingmovies      object
contract             object
paperlessbilling     object
paymentmethod        object
monthlycharges      float64
totalcharges         object
churn                object
dtype: object

In [11]:
df.totalcharges

0         29.85
1        1889.5
2        108.15
3       1840.75
4        151.65
         ...   
7038     1990.5
7039     7362.9
7040     346.45
7041      306.6
7042     6844.5
Name: totalcharges, Length: 7043, dtype: object

In [12]:
#we need to convert total charges into a number
tc = pd.to_numeric(df.totalcharges)

ValueError: Unable to parse string "_" at position 488

In [None]:
tc = pd.to_numeric(df.totalcharges, errors = 'coerce')

In [None]:
df.totalcharges = pd.to_numeric(df.totalcharges, errors = 'coerce')

In [None]:
tc.isnull().sum()

In [None]:
df[tc.isnull()][['customerid', 'totalcharges']] 

In [None]:
df.totalcharges=df.totalcharges.fillna(0)

# filling with zero is not always the most sensible choice, but in practice it is okay
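
The effect of `errors='coerce'` followed by `fillna(0)` can be checked on a tiny series. The values below are hypothetical. The `_` entry mimics the string from the error message, which is presumably a blank value that the earlier cleaning step turned into an underscore.

```python
import pandas as pd

# Hypothetical column: two valid numbers and one placeholder string.
raw = pd.Series(['29.85', '_', '108.15'])

# errors='coerce' turns unparseable entries into NaN instead of raising.
parsed = pd.to_numeric(raw, errors='coerce')
n_missing = int(parsed.isnull().sum())

# Replace the missing values with 0, as in the cell above.
filled = parsed.fillna(0)
print(n_missing)
print(filled)
```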

In [None]:
df[tc.isnull()][['customerid', 'totalcharges']] 

In [None]:
df.churn.head()

In [None]:
(df.churn == 'yes').head()

In [None]:
(df.churn == 'yes').astype(int).head()

In [None]:
df.churn = (df.churn == 'yes').astype(int)
df.churn

End of video: [ML Zoomcamp 3.2 - Data Preparation](https://www.youtube.com/watch?v=VSGGU9gYvdg&list=PL3MmuxUbc_hIhxl5Ji8t4O6lPAOpHaCLR&index=30)

## 3.3 Validation framework

start of video: [ML Zoomcamp 3.3 - Setting Up The Validation Framework](https://www.youtube.com/watch?v=_lwz34sOnSE&list=PL3MmuxUbc_hIhxl5Ji8t4O6lPAOpHaCLR&index=31)


* Perform the train/validation/test split with Scikit-Learn

In [None]:
from sklearn.model_selection import train_test_split

`train_test_split?` shows the documentation for this function.

`train_test_split` splits a dataset into two parts, as we see below.
`random_state` is the random seed, which makes the split reproducible.

In [None]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)

In [None]:
len(df_full_train), len(df_test)

In [None]:
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

In [None]:
len(df_full_train), len(df_train),len(df_val), len(df_test)
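
The second call uses `test_size=0.25` because it is applied to `df_full_train`, which holds 80% of the data: 25% of 80% is 20% of the original dataset, so the result is a 60/20/20 split. The sketch below only reproduces that arithmetic for the 7043 rows reported earlier. It assumes the held-out size is rounded up, so the exact counts should be checked against the output of the `len(...)` cell.

```python
import math

n_total = 7043  # number of rows reported for df.totalcharges earlier

# First split: 20% held out for testing (assumed to be rounded up).
n_test = math.ceil(0.2 * n_total)
n_full_train = n_total - n_test

# Second split: 25% of the remaining 80% is 20% of the original data.
n_val = math.ceil(0.25 * n_full_train)
n_train = n_full_train - n_val

print(n_full_train, n_train, n_val, n_test)
```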

Now we need to extract the target variable `y`.

In [None]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [None]:
y_train = df_train.churn.values
y_val = df_val.churn.values
y_test = df_test.churn.values

In [None]:
del df_train['churn']
del df_val['churn']
del df_test['churn']

We haven't deleted the `churn` column from `df_full_train`, because we'll take a closer look at the target variable in the next lesson.

end of video: [ML Zoomcamp 3.3 - Setting Up The Validation Framework](https://www.youtube.com/watch?v=_lwz34sOnSE&list=PL3MmuxUbc_hIhxl5Ji8t4O6lPAOpHaCLR&index=31)

start of video: [ML Zoomcamp 3.4 - EDA](https://www.youtube.com/watch?v=BNF1wjBwTQA&list=PL3MmuxUbc_hIhxl5Ji8t4O6lPAOpHaCLR&index=32)

## 3.4 EDA

In [None]:
df_full_train = df_full_train.reset_index(drop=True)

In [None]:
df_full_train.head()

In [None]:
df_full_train.isnull().sum()

In [None]:
df_full_train.churn.value_counts()

In [None]:
df_full_train.churn.value_counts(normalize=True)

The share of 1s above is called the churn rate: the rate at which users churn. Our global churn rate is 26.99%, or about 27%.

In [None]:
df_full_train.churn.mean()

In [None]:
global_churn_rate = df_full_train.churn.mean()
round(global_churn_rate,2)
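
The mean works as a churn rate because the target only contains 0s and 1s: the sum counts the churned customers, and dividing by the number of rows gives their share. A minimal check on made-up data:

```python
import pandas as pd

# Hypothetical 0/1 target: 1 = churned, 0 = stayed.
churn = pd.Series([1, 0, 0, 1, 0, 0, 0, 0])

# Share of ones from the normalized value counts (label-based lookup of 1).
share_of_ones = churn.value_counts(normalize=True).loc[1]

# The mean of a 0/1 column is the same number.
mean_value = churn.mean()

print(share_of_ones, mean_value)
```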

Now we'll look at other variables: categorical and numerical.

In [None]:
df_full_train.dtypes

In [None]:
numerical = ['tenure', 'monthlycharges', 'totalcharges']  #list of numerical variables

In [None]:
df_full_train.columns

In [None]:
categorical = ['gender', 'seniorcitizen', 'partner', 'dependents','phoneservice', 'multiplelines', 'internetservice',
       'onlinesecurity', 'onlinebackup', 'deviceprotection', 'techsupport',
       'streamingtv', 'streamingmovies', 'contract', 'paperlessbilling',
       'paymentmethod'] # list of categorical variables

In [None]:
df_full_train[categorical].nunique()

end of video: [ML Zoomcamp 3.4 - EDA](https://www.youtube.com/watch?v=BNF1wjBwTQA&list=PL3MmuxUbc_hIhxl5Ji8t4O6lPAOpHaCLR&index=32)

start of video: [ML Zoomcamp 3.5 - Feature Importance: Churn Rate And Risk Ratio](https://www.youtube.com/watch?v=fzdzPLlvs40&list=PL3MmuxUbc_hIhxl5Ji8t4O6lPAOpHaCLR&index=33)

## 3.5 Feature Importance: Churn rate and risk ratio

Feature importance analysis (part of EDA) - identifying which features affect our target variable

* Churn rate
* Risk ratio
* Mutual information - later

In [None]:
df_full_train.head()

Now let's look at the churn rate within different groups, instead of looking at the global churn rate.

In [None]:
df_full_train[df_full_train.gender=='female'].head()

In [None]:
churn_female = df_full_train[df_full_train.gender=='female'].churn.mean()
churn_female

In [None]:
churn_male = df_full_train[df_full_train.gender=='male'].churn.mean()
churn_male

In [None]:
global_churn = df_full_train.churn.mean()
global_churn

We can do this for other categories. Instead of looking at gender, we can look at partners.

In [None]:
df_full_train.partner.value_counts()

We can do the same as above.

In [None]:
churn_partner = df_full_train[df_full_train.partner=='yes'].churn.mean()
churn_partner

This is below global rate.

In [None]:
global_churn-churn_partner

In [None]:
churn_no_partner = df_full_train[df_full_train.partner=='no'].churn.mean()
churn_no_partner

The churn rate for customers without a partner is about 5% higher than the global churn rate.

In [None]:
global_churn-churn_no_partner

This suggests that the partner variable is more important than the gender variable. Gender does not seem to matter much.

#### Risk ratio

Instead of subtracting (difference), we can divide one rate by the other. If the ratio is greater than 1, the group is more likely to churn; if it is less than 1, the group is less likely to churn. Both measures give us the same information, but in different ways.
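
As a toy illustration of the two measures, the sketch below uses made-up rates. It computes the difference as group rate minus global rate, the same convention as the `diff` column later in this section. The group names are hypothetical.

```python
# Hypothetical churn rates, chosen only for illustration.
global_rate = 0.25
group_rates = {'group_a': 0.30, 'group_b': 0.20}

results = {}
for name, rate in group_rates.items():
    diff = round(rate - global_rate, 2)  # > 0: the group churns more than average
    risk = round(rate / global_rate, 2)  # > 1: the group churns more than average
    results[name] = (diff, risk)

print(results)
```

A positive difference always goes together with a risk ratio above 1, which is why both measures point in the same direction.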

In [None]:
churn_no_partner/global_churn

In [None]:
churn_partner/global_churn

Thus people without a partner are more likely to churn.<br>
With this **difference** or **risk ratio** analysis we can determine which variables are important.<br>
We can do this for every variable.
This can be implemented directly in SQL, as shown below.

```
SELECT
    gender,
    AVG(churn),
    AVG(churn) - global_churn AS diff,
    AVG(churn) / global_churn AS risk
FROM
    data
GROUP BY
    gender;
```

Now we'll translate this SQL query into Pandas.

In [None]:
df_full_train.groupby('gender').churn.mean()

In [None]:
df_full_train.groupby('gender').churn.agg(['mean', 'count'])

In [None]:
df_group = df_full_train.groupby('gender').churn.agg(['mean', 'count'])
df_group['diff'] = df_group['mean'] - global_churn
df_group['risk'] = df_group['mean'] / global_churn
df_group

We can repeat this for all the variables we have. 

In [None]:
from IPython.display import display

In [None]:
for c in categorical:
    print(c)
    df_group = df_full_train.groupby(c).churn.agg(['mean', 'count'])
    df_group['diff'] = df_group['mean'] - global_churn
    df_group['risk'] = df_group['mean'] / global_churn
    display(df_group)  #a bare expression is not rendered inside a loop
    print()

In [None]:
df_group
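
The per-variable analysis can also be wrapped in a small helper. This is a sketch on toy data with made-up values, not the course's code: `churn_stats` is a hypothetical name, and `print` is used instead of `display` so that it also runs outside a notebook.

```python
import pandas as pd

# Toy data with made-up values; 'churn' is already encoded as 0/1.
toy = pd.DataFrame({
    'partner':  ['yes', 'yes', 'no', 'no', 'no', 'yes', 'no', 'yes'],
    'contract': ['two_year', 'month-to-month', 'month-to-month', 'one_year',
                 'month-to-month', 'two_year', 'month-to-month', 'one_year'],
    'churn':    [0, 0, 1, 0, 1, 0, 1, 0],
})

def churn_stats(df, column, target='churn'):
    """Per-group churn rate with difference and risk ratio vs. the global rate."""
    global_rate = df[target].mean()
    stats = df.groupby(column)[target].agg(['mean', 'count'])
    stats['diff'] = stats['mean'] - global_rate
    stats['risk'] = stats['mean'] / global_rate
    return stats

for c in ['partner', 'contract']:
    print(c)
    print(churn_stats(toy, c))
    print()
```

In this toy data, customers without a partner churn at twice the global rate, so their risk ratio is 2.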