# Customer Churn Prediction 
## Problem Statement:
Churn occurs when individuals leave a collective group, here customers leaving a business. It is one of the main factors that determine the steady-state number of customers in any type of business.
Recently, a large number of customers have left Telecom. To address this problem, Telecom has provided customer data for two important tasks:


### Descriptive task: 
Characterize loyal and churn customers and propose a focused customer retention program. (This can be done through visualization, descriptive models, etc.)
 

### Predictive task: 
Find a model that identifies churn customers. 

Then:
- Select 300 customers using that model from a separate test set and report the number of true churn customers among them.
- Calculate the expected cost for Telecom for one month when using your model on the test set, given that every customer predicted as churn receives a gift worth 10 euros and every true churn customer predicted as loyal causes a loss of 64 euros (an average monthly subscription).
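
The two evaluation steps above can be sketched as follows. This is a minimal illustration on made-up labels and scores, not the notebook's actual model output. The function names `true_churners_in_top_k` and `expected_monthly_cost` are hypothetical helpers. Selecting the 300 customers by highest predicted churn score is an assumption about how "using that model" is meant. Only the 10 euro gift and the 64 euro loss come from the task statement.

```python
import numpy as np

GIFT_COST = 10          # euros per customer predicted as churn (task statement)
MISSED_CHURN_COST = 64  # euros per true churner predicted as loyal (task statement)


def true_churners_in_top_k(y_true, churn_scores, k=300):
    """Count actual churners among the k customers with the highest scores.

    Assumption: "select 300 customers using the model" is read as
    "take the 300 highest predicted churn scores".
    """
    y_true = np.asarray(y_true)
    churn_scores = np.asarray(churn_scores)
    top_k = np.argsort(-churn_scores, kind="stable")[:k]
    return int(y_true[top_k].sum())


def expected_monthly_cost(y_true, y_pred, gift=GIFT_COST, loss=MISSED_CHURN_COST):
    """Gift cost for every predicted churner plus loss for every missed churner."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    predicted_churn = int((y_pred == 1).sum())               # TP + FP
    missed_churn = int(((y_true == 1) & (y_pred == 0)).sum())  # FN
    return gift * predicted_churn + loss * missed_churn


# Toy data (made up for illustration only)
y_true = np.array([1, 0, 1, 1, 0, 0])
scores = np.array([0.9, 0.8, 0.3, 0.7, 0.2, 0.1])
y_pred = (scores >= 0.5).astype(int)  # 0.5 threshold is an arbitrary choice

toy_top3 = true_churners_in_top_k(y_true, scores, k=3)
toy_cost = expected_monthly_cost(y_true, y_pred)
print(toy_top3, toy_cost)
```

In this toy case three customers are predicted as churn (30 euros in gifts) and one true churner is missed (64 euros), so the cost is 94 euros. With a real model, `y_pred` and `scores` would come from the classifier's predictions on the separate test set.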

In [2]:
import pandas as pd 
import numpy as np 

# Data visualization
import seaborn as sns 
import matplotlib.pyplot as plt 
import plotly.express as px 

# Modeling
from sklearn.model_selection import train_test_split,cross_val_score,GridSearchCV #splitting the dataset into test-train
from imblearn.over_sampling import SMOTE #SMOTE technique to deal with unbalanced data problem
from sklearn.metrics import accuracy_score,f1_score,recall_score,precision_score,confusion_matrix,roc_curve,roc_auc_score,classification_report # performance metrics
from sklearn.preprocessing import MinMaxScaler # to scale the numeric features
from scipy import stats

# Feature Selection, XAI, Feature Importance
import shap #!pip install shap
from sklearn.inspection import permutation_importance
import eli5
from eli5.sklearn import PermutationImportance
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
from sklearn.feature_selection import SelectFromModel

# Algorithms for supervised learning methods
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Filtering future warnings
import warnings
warnings.filterwarnings('ignore')



# Data Exploration

In [29]:
# Import the dataset
data = pd.read_csv('../data/churn-train.csv')

In [30]:
data.head(10)

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,Male,0,Yes,Yes,61,No,'No phone service',DSL,Yes,No,Yes,No,No,No,Month-to-month,No,'Bank transfer (automatic)',33.6,2117.2,No
1,Male,0,Yes,Yes,72,Yes,Yes,'Fiber optic',No,Yes,Yes,Yes,No,No,'Two year',No,'Bank transfer (automatic)',90.45,6565.85,No
2,Female,0,No,No,5,Yes,Yes,'Fiber optic',No,No,No,No,Yes,No,Month-to-month,Yes,'Electronic check',84.0,424.75,No
3,Female,0,No,No,49,Yes,No,DSL,Yes,Yes,Yes,Yes,No,No,'Two year',No,'Bank transfer (automatic)',67.4,3306.85,No
4,Male,0,No,No,8,Yes,No,No,'No internet service','No internet service','No internet service','No internet service','No internet service','No internet service',Month-to-month,Yes,'Bank transfer (automatic)',19.7,168.9,No
5,Male,0,No,No,3,Yes,No,'Fiber optic',No,No,No,No,No,Yes,Month-to-month,Yes,'Electronic check',80.35,253.8,No
6,Male,0,Yes,Yes,9,Yes,No,No,'No internet service','No internet service','No internet service','No internet service','No internet service','No internet service','Two year',No,'Mailed check',19.6,197.4,No
7,Male,0,Yes,Yes,67,Yes,No,DSL,Yes,Yes,No,No,No,No,'Two year',Yes,'Bank transfer (automatic)',54.2,3838.2,No
8,Male,0,No,No,46,Yes,No,DSL,No,No,No,No,No,No,Month-to-month,Yes,'Credit card (automatic)',45.2,2065.15,No
9,Female,0,Yes,No,67,Yes,Yes,DSL,Yes,Yes,Yes,No,No,Yes,'One year',No,'Mailed check',75.1,5064.45,No


In [31]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4695 entries, 0 to 4694
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            4695 non-null   object 
 1   SeniorCitizen     4695 non-null   int64  
 2   Partner           4695 non-null   object 
 3   Dependents        4695 non-null   object 
 4   tenure            4695 non-null   int64  
 5   PhoneService      4695 non-null   object 
 6   MultipleLines     4695 non-null   object 
 7   InternetService   4695 non-null   object 
 8   OnlineSecurity    4695 non-null   object 
 9   OnlineBackup      4695 non-null   object 
 10  DeviceProtection  4695 non-null   object 
 11  TechSupport       4695 non-null   object 
 12  StreamingTV       4695 non-null   object 
 13  StreamingMovies   4695 non-null   object 
 14  Contract          4695 non-null   object 
 15  PaperlessBilling  4695 non-null   object 
 16  PaymentMethod     4695 non-null   object 


In [32]:
data.nunique()

gender                 2
SeniorCitizen          2
Partner                2
Dependents             2
tenure                73
PhoneService           2
MultipleLines          3
InternetService        3
OnlineSecurity         3
OnlineBackup           3
DeviceProtection       3
TechSupport            3
StreamingTV            3
StreamingMovies        3
Contract               3
PaperlessBilling       2
PaymentMethod          4
MonthlyCharges      1423
TotalCharges        4429
Churn                  2
dtype: int64

In [33]:
data.isna().sum()

gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

In [36]:
for col in data.columns:
    print(f'\n{col} : {data[col].unique()}')
    # print the data type of the column
    print(f'{col} : {data[col].dtype}')



gender : ['Male' 'Female']
gender : object

SeniorCitizen : ['No' 'Yes']
SeniorCitizen : object

Partner : ['Yes' 'No']
Partner : object

Dependents : ['Yes' 'No']
Dependents : object

tenure : [61 72  5 49  8  3  9 67 46 55 33 62  1 14 18 64 69 71 66  2 11 47 35 32
 60 29 21 48 43 20 31 38 12  6 42 45 28  7 25 40 27 10  4 68 57 26 17 59
 30 50 15 70 53 56 24 39 13 41 44 34 23 52 16 36 65 58 37 63 22 19 51 54
  0]
tenure : int64

PhoneService : ['No' 'Yes']
PhoneService : object

MultipleLines : ["'No phone service'" 'Yes' 'No']
MultipleLines : object

InternetService : ['DSL' "'Fiber optic'" 'No']
InternetService : object

OnlineSecurity : ['Yes' 'No' "'No internet service'"]
OnlineSecurity : object

OnlineBackup : ['No' 'Yes' "'No internet service'"]
OnlineBackup : object

DeviceProtection : ['Yes' 'No' "'No internet service'"]
DeviceProtection : object

TechSupport : ['No' 'Yes' "'No internet service'"]
TechSupport : object

StreamingTV : ['No' 'Yes' "'No internet service'"]
Stream

# Data Preprocessing
* `TotalCharges` is stored as object type; convert it to a numeric type. The column contains 6 values equal to '?', so a plain conversion fails with `ValueError: Unable to parse string "?" at position 983`. We convert these values to NaN and then drop those rows, since they make up only 6 rows of the dataset.
* `Churn` is object type; convert it to a numeric type (0, 1).
* `SeniorCitizen` is numeric type; convert it to object type (replace 0, 1 with No, Yes).
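
To make the effect of these steps concrete, here is a small sketch on a made-up three-row frame (the `toy` DataFrame and its values are invented for illustration; only the column names and the '?' placeholder come from the dataset). It uses `map` for the recoding, which is one of several equivalent ways to do it.

```python
import pandas as pd

# Made-up rows; the '?' mimics the placeholder found in TotalCharges
toy = pd.DataFrame({
    "SeniorCitizen": [0, 1, 0],
    "TotalCharges": ["2117.2", "?", "424.75"],
    "Churn": ["No", "Yes", "Yes"],
})

# errors="coerce" turns unparseable strings such as '?' into NaN
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
n_missing = int(toy["TotalCharges"].isna().sum())

# Drop only the rows where the coerced column is missing
toy = toy.dropna(subset=["TotalCharges"])

# Recode the target to 0/1 and SeniorCitizen to No/Yes
toy["Churn"] = toy["Churn"].map({"Yes": 1, "No": 0})
toy["SeniorCitizen"] = toy["SeniorCitizen"].map({0: "No", 1: "Yes"})

print(n_missing)
print(toy.dtypes)
```

Counting the NaN values right after the coercion is a cheap check that only the expected placeholder rows were affected before they are dropped.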


In [35]:
# Convert TotalCharges to numeric
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')

# Convert Churn values and datatype from categorical to binary. Replacing Yes with 1 and No with 0.
data['Churn'] = data['Churn'].replace({'Yes':1,'No':0})

# SeniorCitizen: 0 (age < 67) becomes 'No', 1 becomes 'Yes'
data['SeniorCitizen'] = data['SeniorCitizen'].replace({0:'No',1:'Yes'})
data['SeniorCitizen'] = data['SeniorCitizen'].astype('object')


In [26]:
data.isna().sum()

gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        6
Churn               0
dtype: int64

In [27]:
# Print the rows where TotalCharges is null
data[data['TotalCharges'].isnull()]

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
983,Male,No,Yes,Yes,0,Yes,No,No,'No internet service','No internet service','No internet service','No internet service','No internet service','No internet service','Two year',No,'Mailed check',19.85,,0
1478,Male,No,Yes,Yes,0,Yes,Yes,No,'No internet service','No internet service','No internet service','No internet service','No internet service','No internet service','Two year',No,'Mailed check',25.35,,0
2032,Female,No,Yes,Yes,0,Yes,No,No,'No internet service','No internet service','No internet service','No internet service','No internet service','No internet service','Two year',No,'Mailed check',20.0,,0
2870,Male,No,Yes,Yes,0,Yes,No,No,'No internet service','No internet service','No internet service','No internet service','No internet service','No internet service','One year',Yes,'Mailed check',19.7,,0
4322,Female,No,Yes,Yes,0,Yes,Yes,DSL,No,Yes,Yes,Yes,Yes,No,'Two year',No,'Mailed check',73.35,,0
4406,Male,No,No,Yes,0,Yes,Yes,DSL,Yes,Yes,No,Yes,No,No,'Two year',Yes,'Bank transfer (automatic)',61.9,,0


In [40]:
# drop nan values
data.dropna(inplace = True)

In [41]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4689 entries, 0 to 4694
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            4689 non-null   object 
 1   SeniorCitizen     4689 non-null   object 
 2   Partner           4689 non-null   object 
 3   Dependents        4689 non-null   object 
 4   tenure            4689 non-null   int64  
 5   PhoneService      4689 non-null   object 
 6   MultipleLines     4689 non-null   object 
 7   InternetService   4689 non-null   object 
 8   OnlineSecurity    4689 non-null   object 
 9   OnlineBackup      4689 non-null   object 
 10  DeviceProtection  4689 non-null   object 
 11  TechSupport       4689 non-null   object 
 12  StreamingTV       4689 non-null   object 
 13  StreamingMovies   4689 non-null   object 
 14  Contract          4689 non-null   object 
 15  PaperlessBilling  4689 non-null   object 
 16  PaymentMethod     4689 non-null   object 
 17  

In [None]:
# Now data types are corrected:
