# 3. Machine learning for classification: Churn prediction

Start of video: [ML Zoomcamp 3.1 - Churn Prediction Project](https://www.youtube.com/watch?v=0Zw04wdeTQo&list=PL3MmuxUbc_hIhxl5Ji8t4O6lPAOpHaCLR&index=29&t=34s)

## Problem description:

This week we will talk about machine learning for **classification**. The project we will work on is **churn prediction**. Let's imagine we work at a telecom company. We have customers who use our services, such as phone, internet, and TV. Not all customers are happy with our services, and some of them are thinking of ending their contracts and leaving for other companies. We want to identify the customers who are likely to leave, or churn, and assign each one a score (20%, 30%, 85%, etc.) that tells how likely that customer is to leave. The closer this score is to 1 (100%), the higher the likelihood of churning. We then classify customers into 0s and 1s: 0 means the customer will stay with us and 1 means the customer is likely to leave. To customers who are likely to churn we will send promotional emails, for example with a 25% discount, so that they decide to stay with us. We approach this problem as **binary classification**.

Supervised machine learning covers several kinds of problems: regression and classification, where classification can be binary or multiclass. This one is a binary classification problem: the target variable takes the values 0 and 1.
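
As a minimal sketch of the scoring idea described above: the scores below are made up, and the 0.5 cutoff is an assumption (a common default, not a value given in these notes).

```python
# Hypothetical churn scores for five customers (made-up values, not from the dataset)
scores = [0.20, 0.30, 0.85, 0.55, 0.05]

# Assumed decision threshold; 0.5 is a common default, not a value from the notes
threshold = 0.5

# 1 = likely to churn (send a promotional email), 0 = likely to stay
decisions = [int(score >= threshold) for score in scores]

print(decisions)
```

A real model would produce the scores; the thresholding step that turns them into 0s and 1s would look roughly like this.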

## 3.1 Churn prediction project

* Link to Dataset: https://www.kaggle.com/datasets/blastchar/telco-customer-churn

* https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-03-churn-prediction/WA_Fn-UseC_-Telco-Customer-Churn.csv

End of video: [ML Zoomcamp 3.1 - Churn Prediction Project](https://www.youtube.com/watch?v=0Zw04wdeTQo&list=PL3MmuxUbc_hIhxl5Ji8t4O6lPAOpHaCLR&index=29&t=34s)

Start of video: [ML Zoomcamp 3.2 - Data Preparation](https://www.youtube.com/watch?v=VSGGU9gYvdg&list=PL3MmuxUbc_hIhxl5Ji8t4O6lPAOpHaCLR&index=30)

## 3.2 Data Preparation

* Download the data, read it with pandas
* Look at the data
* Make column names and values look uniform
* Check if all the columns read correctly
* Check if the churn variable needs any preparation

In [124]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [137]:
#data = 'https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-03-churn-prediction/WA_Fn-UseC_-Telco-Customer-Churn.csv'

In [138]:
#!wget $data -O data-week-3.csv 

In [139]:
ls

data-week-3.csv  WK03-classification-notebook.ipynb


In [140]:
# ! before wget means that we are executing a shell command.

In [141]:
df = pd.read_csv('data-week-3.csv')
df.head()

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


In [142]:
df.head().T  # transpose so that each column is shown as a row

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| customerID | 7590-VHVEG | 5575-GNVDE | 3668-QPYBK | 7795-CFOCW | 9237-HQITU |
| gender | Female | Male | Male | Male | Female |
| SeniorCitizen | 0 | 0 | 0 | 0 | 0 |
| Partner | Yes | No | No | No | No |
| Dependents | No | No | No | No | No |
| tenure | 1 | 34 | 2 | 45 | 2 |
| PhoneService | No | Yes | Yes | No | Yes |
| MultipleLines | No phone service | No | No | No phone service | No |
| InternetService | DSL | DSL | DSL | DSL | Fiber optic |
| OnlineSecurity | No | Yes | Yes | Yes | No |


In [143]:
df.columns = df.columns.str.lower().str.replace(' ', '_')
categorical_columns = list(df.dtypes[df.dtypes == 'object'].index)

for c in categorical_columns:
    df[c] = df[c].str.lower().str.replace(' ', '_')  # same cleanup we used in the previous week's notebook

In [144]:
df.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| customerid | 7590-vhveg | 5575-gnvde | 3668-qpybk | 7795-cfocw | 9237-hqitu |
| gender | female | male | male | male | female |
| seniorcitizen | 0 | 0 | 0 | 0 | 0 |
| partner | yes | no | no | no | no |
| dependents | no | no | no | no | no |
| tenure | 1 | 34 | 2 | 45 | 2 |
| phoneservice | no | yes | yes | no | yes |
| multiplelines | no_phone_service | no | no | no_phone_service | no |
| internetservice | dsl | dsl | dsl | dsl | fiber_optic |
| onlinesecurity | no | yes | yes | yes | no |
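
The cleanup step can be tried in isolation on a toy frame. This is a sketch with made-up rows; the column name `Payment Method` (with a space) is hypothetical and only there to show the underscore replacement. It also selects string columns with `pd.api.types.is_string_dtype` rather than comparing against `'object'`, on the assumption that some pandas versions may store text under a dedicated string dtype.

```python
import pandas as pd

# Toy frame with made-up rows; 'Payment Method' is a hypothetical column name
toy = pd.DataFrame({
    'customerID': ['7590-VHVEG', '5575-GNVDE'],
    'Payment Method': ['Electronic check', 'Mailed check'],
    'MonthlyCharges': [29.85, 56.95],
})

# Lowercase the column names and replace spaces with underscores
toy.columns = toy.columns.str.lower().str.replace(' ', '_')

# Apply the same cleanup to every string-valued column
string_columns = [c for c in toy.columns if pd.api.types.is_string_dtype(toy[c])]
for c in string_columns:
    toy[c] = toy[c].str.lower().str.replace(' ', '_')

print(toy)
```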


In [145]:
df.dtypes

customerid           object
gender               object
seniorcitizen         int64
partner              object
dependents           object
tenure                int64
phoneservice         object
multiplelines        object
internetservice      object
onlinesecurity       object
onlinebackup         object
deviceprotection     object
techsupport          object
streamingtv          object
streamingmovies      object
contract             object
paperlessbilling     object
paymentmethod        object
monthlycharges      float64
totalcharges         object
churn                object
dtype: object

In [146]:
df.totalcharges

0         29.85
1        1889.5
2        108.15
3       1840.75
4        151.65
         ...   
7038     1990.5
7039     7362.9
7040     346.45
7041      306.6
7042     6844.5
Name: totalcharges, Length: 7043, dtype: object

In [147]:
#we need to convert total charges into a number
tc = pd.to_numeric(df.totalcharges)

ValueError: Unable to parse string "_" at position 488

In [148]:
tc = pd.to_numeric(df.totalcharges, errors = 'coerce')

In [149]:
df.totalcharges = pd.to_numeric(df.totalcharges, errors = 'coerce')

In [150]:
tc.isnull().sum()

11
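
The difference between the two `pd.to_numeric` calls can be seen on a tiny made-up series. The `"_"` in the error message is presumably a blank value that the earlier cleanup turned into an underscore; the sketch below assumes exactly that.

```python
import pandas as pd

# Made-up values; '_' stands in for a blank that became an underscore during cleanup
raw = pd.Series(['29.85', '_', '108.15'])

# Default behaviour: a value that cannot be parsed raises ValueError
try:
    pd.to_numeric(raw)
    strict_failed = False
except ValueError:
    strict_failed = True

# errors='coerce' turns unparseable values into NaN instead of raising
parsed = pd.to_numeric(raw, errors='coerce')
n_missing = int(parsed.isnull().sum())

# Fill the gaps with 0, as the notebook does for totalcharges
filled = parsed.fillna(0)

print(strict_failed, n_missing, filled.tolist())
```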

In [151]:
df[tc.isnull()][['customerid', 'totalcharges']] 

| | customerid | totalcharges |
|---|---|---|
| 488 | 4472-lvygi | |
| 753 | 3115-czmzd | |
| 936 | 5709-lvoeq | |
| 1082 | 4367-nuyao | |
| 1340 | 1371-dwpaz | |
| 3331 | 7644-omvmy | |
| 3826 | 3213-vvolg | |
| 4380 | 2520-sgtta | |
| 5218 | 2923-arzlg | |
| 6670 | 4075-wkniu | |


In [152]:
df.totalcharges=df.totalcharges.fillna(0)

# filling with zero is not always the most sensible choice, but in practice it is okay

In [153]:
df[tc.isnull()][['customerid', 'totalcharges']] 

| | customerid | totalcharges |
|---|---|---|
| 488 | 4472-lvygi | 0.0 |
| 753 | 3115-czmzd | 0.0 |
| 936 | 5709-lvoeq | 0.0 |
| 1082 | 4367-nuyao | 0.0 |
| 1340 | 1371-dwpaz | 0.0 |
| 3331 | 7644-omvmy | 0.0 |
| 3826 | 3213-vvolg | 0.0 |
| 4380 | 2520-sgtta | 0.0 |
| 5218 | 2923-arzlg | 0.0 |
| 6670 | 4075-wkniu | 0.0 |


In [154]:
df.churn.head()

0     no
1     no
2    yes
3     no
4    yes
Name: churn, dtype: object

In [155]:
(df.churn == 'yes').head()

0    False
1    False
2     True
3    False
4     True
Name: churn, dtype: bool

In [156]:
(df.churn == 'yes').astype(int).head()

0    0
1    0
2    1
3    0
4    1
Name: churn, dtype: int64

In [157]:
df.churn = (df.churn == 'yes').astype(int)
df.churn

0       0
1       0
2       1
3       0
4       1
       ..
7038    0
7039    0
7040    0
7041    1
7042    0
Name: churn, Length: 7043, dtype: int64
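
One thing to keep in mind with this encoding: comparing against `'yes'` maps every other value to 0, expected or not. A small sketch with made-up labels (the value `'maybe'` is invented for illustration) shows a check that could be run before encoding:

```python
import pandas as pd

# Made-up labels, including one unexpected value to show the pitfall
labels = pd.Series(['no', 'yes', 'no', 'maybe', 'yes'])

# Check which distinct values are present before encoding
unexpected = sorted(set(labels.unique()) - {'yes', 'no'})

# Comparing against 'yes' maps everything else, expected or not, to 0
encoded = (labels == 'yes').astype(int)

print(unexpected, encoded.tolist())
```

In the notebook's data the column appears to contain only `yes` and `no`, so the direct comparison is fine there.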

End of video: [ML Zoomcamp 3.2 - Data Preparation](https://www.youtube.com/watch?v=VSGGU9gYvdg&list=PL3MmuxUbc_hIhxl5Ji8t4O6lPAOpHaCLR&index=30)

## 3.3 Validation framework

Start of video: [ML Zoomcamp 3.3 - Setting Up The Validation Framework](https://www.youtube.com/watch?v=_lwz34sOnSE&list=PL3MmuxUbc_hIhxl5Ji8t4O6lPAOpHaCLR&index=31)


* Perform the train/validation/test split with Scikit-Learn

In [158]:
from sklearn.model_selection import train_test_split

Running `train_test_split?` in a notebook cell shows the documentation for this function.

`train_test_split` splits a dataset into two parts, as we see below. `random_state` is the random seed; fixing it makes the split reproducible.

In [159]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)

In [160]:
len(df_full_train), len(df_test)

(5634, 1409)

In [161]:
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

In [162]:
len(df_full_train), len(df_train),len(df_val), len(df_test)

(5634, 4225, 1409, 1409)
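
The `test_size=0.25` in the second split follows from simple arithmetic: the validation set should be 20% of the whole dataset, and 20% out of the remaining 80% is 0.2 / 0.8 = 0.25. A minimal sketch of that calculation, assuming the held-out part is rounded up (an assumption chosen because it reproduces the sizes printed above):

```python
import math

n_total = 7043  # number of rows in the dataset, from the output above

# First split: hold out 20% of all rows for testing
n_test = math.ceil(0.2 * n_total)
n_full_train = n_total - n_test

# Second split: validation should be 20% of the original data,
# which is 0.2 / 0.8 = 0.25 of the remaining 80%
val_fraction = 0.2 / 0.8
n_val = math.ceil(val_fraction * n_full_train)
n_train = n_full_train - n_val

print(n_full_train, n_train, n_val, n_test)
```

This gives a 60%/20%/20% split for train, validation, and test.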

Now we need to get our y variable (the target).

In [164]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [165]:
y_train = df_train.churn.values
y_val = df_val.churn.values
y_test = df_test.churn.values

In [166]:
del df_train['churn']
del df_val['churn']
del df_test['churn']

We haven't deleted the `churn` variable from the dataframe `df_full_train`, because we'll look at the target variable a little in the next lesson.
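
The steps of this section can be collected into one function. This is only a sketch: `split_dataset` is a hypothetical helper name, the toy data is made up, and it uses `drop(columns=...)` instead of `del` so that the input frames are not modified.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_dataset(df, target='churn', seed=1):
    """Split df 60/20/20 and separate the target (hypothetical helper)."""
    df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=seed)
    df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=seed)

    parts = []
    for part in (df_train, df_val, df_test):
        part = part.reset_index(drop=True)
        y = part[target].values          # target as a NumPy array
        X = part.drop(columns=[target])  # features without the target
        parts.append((X, y))

    # df_full_train keeps the target column, as in the notebook
    return df_full_train.reset_index(drop=True), parts


# Toy data: 100 made-up rows with a binary target
toy = pd.DataFrame({'tenure': np.arange(100), 'churn': np.arange(100) % 2})
df_full_train, ((X_train, y_train), (X_val, y_val), (X_test, y_test)) = split_dataset(toy)

print(len(df_full_train), len(X_train), len(X_val), len(X_test))
```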

End of video: [ML Zoomcamp 3.3 - Setting Up The Validation Framework](https://www.youtube.com/watch?v=_lwz34sOnSE&list=PL3MmuxUbc_hIhxl5Ji8t4O6lPAOpHaCLR&index=31)

Start of video: [ML Zoomcamp 3.4 - EDA](https://www.youtube.com/watch?v=BNF1wjBwTQA&list=PL3MmuxUbc_hIhxl5Ji8t4O6lPAOpHaCLR&index=32)

## 3.4 EDA

In [167]:
df_full_train = df_full_train.reset_index(drop=True)

In [168]:
df_full_train.head()

| | customerid | gender | seniorcitizen | partner | dependents | tenure | phoneservice | multiplelines | internetservice | onlinesecurity | ... | deviceprotection | techsupport | streamingtv | streamingmovies | contract | paperlessbilling | paymentmethod | monthlycharges | totalcharges | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5442-pptjy | male | 0 | yes | yes | 12 | yes | no | no | no_internet_service | ... | no_internet_service | no_internet_service | no_internet_service | no_internet_service | two_year | no | mailed_check | 19.7 | 258.35 | 0 |
| 1 | 6261-rcvns | female | 0 | no | no | 42 | yes | no | dsl | yes | ... | yes | yes | no | yes | one_year | no | credit_card_(automatic) | 73.9 | 3160.55 | 1 |
| 2 | 2176-osjuv | male | 0 | yes | no | 71 | yes | yes | dsl | yes | ... | no | yes | no | no | two_year | no | bank_transfer_(automatic) | 65.15 | 4681.75 | 0 |
| 3 | 6161-erdgd | male | 0 | yes | yes | 71 | yes | yes | dsl | yes | ... | yes | yes | yes | yes | one_year | no | electronic_check | 85.45 | 6300.85 | 0 |
| 4 | 2364-ufrom | male | 0 | no | no | 30 | yes | no | dsl | yes | ... | no | yes | yes | no | one_year | no | electronic_check | 70.4 | 2044.75 | 0 |


In [169]:
df_full_train.isnull().sum()

customerid          0
gender              0
seniorcitizen       0
partner             0
dependents          0
tenure              0
phoneservice        0
multiplelines       0
internetservice     0
onlinesecurity      0
onlinebackup        0
deviceprotection    0
techsupport         0
streamingtv         0
streamingmovies     0
contract            0
paperlessbilling    0
paymentmethod       0
monthlycharges      0
totalcharges        0
churn               0
dtype: int64

In [170]:
df_full_train.churn.value_counts()

0    4113
1    1521
Name: churn, dtype: int64

In [171]:
df_full_train.churn.value_counts(normalize=True)

0    0.730032
1    0.269968
Name: churn, dtype: float64

The share of 1s above, about 27%, is called the churn rate: the rate at which users churn. 26.99% is our global churn rate.

In [172]:
df_full_train.churn.mean()

0.26996805111821087

In [177]:
global_churn_rate = df_full_train.churn.mean()
round(global_churn_rate,2)

0.27
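
The mean works as a churn rate because the column contains only 0s and 1s, so the mean is the number of 1s divided by the number of rows. A short sketch that rebuilds a column from the class counts reported above (4113 and 1521) and checks this:

```python
import pandas as pd

# Rebuild a 0/1 column with the class counts reported above (4113 stay, 1521 churn)
churn = pd.Series([0] * 4113 + [1] * 1521)

# The mean of a 0/1 column is (number of ones) / (number of rows)
rate_from_mean = churn.mean()
rate_from_counts = 1521 / (4113 + 1521)

print(round(rate_from_mean, 2))
```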

Now we'll look at other variables: categorical and numerical.

In [178]:
df_full_train.dtypes

customerid           object
gender               object
seniorcitizen         int64
partner              object
dependents           object
tenure                int64
phoneservice         object
multiplelines        object
internetservice      object
onlinesecurity       object
onlinebackup         object
deviceprotection     object
techsupport          object
streamingtv          object
streamingmovies      object
contract             object
paperlessbilling     object
paymentmethod        object
monthlycharges      float64
totalcharges        float64
churn                 int64
dtype: object

In [185]:
numerical = ['tenure', 'monthlycharges', 'totalcharges']  #list of numerical variables

In [186]:
df_full_train.columns

Index(['customerid', 'gender', 'seniorcitizen', 'partner', 'dependents',
       'tenure', 'phoneservice', 'multiplelines', 'internetservice',
       'onlinesecurity', 'onlinebackup', 'deviceprotection', 'techsupport',
       'streamingtv', 'streamingmovies', 'contract', 'paperlessbilling',
       'paymentmethod', 'monthlycharges', 'totalcharges', 'churn'],
      dtype='object')

In [188]:
categorical = ['gender', 'seniorcitizen', 'partner', 'dependents','phoneservice', 'multiplelines', 'internetservice',
       'onlinesecurity', 'onlinebackup', 'deviceprotection', 'techsupport',
       'streamingtv', 'streamingmovies', 'contract', 'paperlessbilling',
       'paymentmethod'] # list of categorical variables
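
Note that `seniorcitizen` is stored as an integer (0/1) but is listed as categorical, so the two lists cannot be derived from `dtypes` alone. As a rough sketch on a toy frame with made-up rows, the categorical list can be taken as everything that is not numerical, and `nunique()` gives the number of distinct values per column:

```python
import pandas as pd

# Toy frame with made-up rows that mimics a few of the dataset's columns
toy = pd.DataFrame({
    'gender': ['female', 'male', 'male', 'female'],
    'seniorcitizen': [0, 0, 1, 0],
    'contract': ['month-to-month', 'one_year', 'two_year', 'one_year'],
    'tenure': [1, 34, 2, 45],
    'monthlycharges': [29.85, 56.95, 53.85, 42.30],
})

numerical = ['tenure', 'monthlycharges']

# Everything that is not numerical is treated as categorical here,
# including seniorcitizen, which is stored as an integer 0/1 flag
categorical = [c for c in toy.columns if c not in numerical]

# Count distinct values per categorical column
unique_counts = toy[categorical].nunique()

print(unique_counts)
```

On the real data, identifier columns such as `customerid` and the target `churn` would also need to be excluded from the categorical list, as the hand-written list above does.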