<a href="https://www.kaggle.com/code/ankitkumar2635/churn-prediction-of-telecom-consumers?scriptVersionId=116588990" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Introduction

In this notebook, I build a machine learning model in Python that predicts whether a customer will churn (leave the company) or not. Logistic Regression and XGBClassifier are implemented and compared, and the XGB model is then fine-tuned using GridSearchCV. (The conclusion is worth reading.)

**Dataset:** [Telco customer churn: IBM dataset](https://www.kaggle.com/datasets/yeanzc/telco-customer-churn-ibm-dataset)

**Steps and modules:**
1. pandas - data manipulation and EDA, including one-hot encoding of categorical variables
2. matplotlib & seaborn - visualization
3. MinMaxScaler - to scale continuous features
4. sklearn's resample - to treat imbalance in the target variable using upsampling
5. sklearn's LogisticRegression & xgboost's XGBClassifier - to train the models
6. sklearn's metrics - to build the confusion matrix
7. GridSearchCV - to fine-tune parameters with k-fold cross-validation

Other usuals: sklearn's recall and precision scores and train_test_split

**Column Description:**

[Please head here](https://www.kaggle.com/datasets/yeanzc/telco-customer-churn-ibm-dataset), as describing all the columns here would make the introduction too long, and there is a high chance that you would skip it anyway :)

### Importing the libraries 

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Load the dataset
churn_df = pd.read_excel("/kaggle/input/telco-customer-churn-ibm-dataset/Telco_customer_churn.xlsx")

In [None]:
churn_df.head()

# EDA & Feature Selection

In [None]:
# Shape of the dataset
churn_df.shape

In [None]:
churn_df.info()

In [None]:
# Check for null
churn_df.isna().sum()

**We do not need the "Churn Reason" column for a predictive analysis. Also, "Churn Value" and "Churn Label" carry the same information and differ only in data type, i.e., (0 or 1) versus (Yes or No).**
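
Before dropping one of the two target columns, it can be reassuring to confirm that they really agree. The sketch below does this on a small made-up frame (the values are illustrative, not taken from the dataset); the same comparison should work on `churn_df`, assuming the labels are spelled exactly "Yes" and "No".

```python
import pandas as pd

# Toy stand-in for the two target columns (illustrative values only)
toy = pd.DataFrame({
    "Churn Label": ["Yes", "No", "No", "Yes", "No"],
    "Churn Value": [1, 0, 0, 1, 0],
})

# Map the text label to 0/1 and compare it with the numeric column
label_as_int = toy["Churn Label"].map({"Yes": 1, "No": 0})
columns_agree = bool((label_as_int == toy["Churn Value"]).all())
print(columns_agree)
```

If the check prints `True`, keeping only "Churn Value" loses no information.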

In [None]:
#Drop Churn Reason and Churn Label cols
churn_df.drop("Churn Reason", inplace = True, axis = 1)
churn_df.drop("Churn Label", inplace = True, axis = 1)

In [None]:
# Let's explore our target variable "Churn Value"
churn_df['Churn Value'].value_counts()

In [None]:
sns.countplot(x = "Churn Value", data = churn_df)
plt.title("Distribution of Churn Value")

**We have a significant imbalance in the dataset, but before we treat the imbalance, let's explore our explanatory variables.**

Let's look into the categorical columns first:

In [None]:
# Create list of categorical cols
Cat_cols = []

for col in churn_df.columns:
    if churn_df[col].dtype == "object":
        Cat_cols.append(col)

print("We have {} categorical columns:".format(len(Cat_cols)))
Cat_cols

**Let's find out the number of binary columns**

In [None]:
# Collect the columns that take exactly two distinct values
binary_cols = []

for col in churn_df.columns:
    if churn_df[col].value_counts().shape[0] ==2:
        binary_cols.append(col)

print("We have {} binary columns:".format(len(binary_cols)))
binary_cols

Let's examine the distribution of these variables:

In [None]:
fig, axes = plt.subplots(2,3, figsize = (12,8), sharey = True)
plt.suptitle("Distribution of binary features")
sns.countplot(x = "Gender", data = churn_df, ax=axes[0,0])
sns.countplot(x = "Senior Citizen", data = churn_df, ax=axes[0,1])
sns.countplot(x = "Partner", data = churn_df, ax=axes[0,2])
sns.countplot(x = "Dependents", data = churn_df, ax=axes[1,0])
sns.countplot(x = "Paperless Billing", data = churn_df, ax=axes[1,1])
sns.countplot(x = "Phone Service", data = churn_df, ax=axes[1,2])

#### We can observe high imbalance in:

* Senior Citizen - Most of the customers are below the age of 65 years
* Dependents - Majority do not live with any dependent (kids, parents etc) 
* Phone Service - Most use phone service

#### Let's explore how these variables affect the average churn rate

In [None]:
churn_df[['Gender', 'Churn Value']].groupby(['Gender']).mean()

In [None]:
churn_df[['Senior Citizen', 'Churn Value']].groupby('Senior Citizen').mean()

In [None]:
churn_df[['Partner', 'Churn Value']].groupby('Partner').mean()

In [None]:
churn_df[['Dependents', 'Churn Value']].groupby('Dependents').mean()

In [None]:
churn_df[['Phone Service', 'Churn Value']].groupby('Phone Service').mean()

In [None]:
churn_df[['Paperless Billing', 'Churn Value']].groupby('Paperless Billing').mean()

In [None]:
# Get all non-binary categorical variables
non_binary_cat_cols = [i for i in Cat_cols if i not in binary_cols ]
non_binary_cat_cols

The Country and State columns do not provide any variability, as all the observations are from California, U.S. Also, the observations are spread across 1,129 cities, which averages out to only about six observations per city.

So we will omit all the geographical columns from our model.
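
The reasoning above (constant columns carry no signal, and columns with very many levels have too few rows per level) can be checked with `nunique`. This is a minimal sketch on a made-up frame; the column values are invented for illustration, and the idea is simply to apply the same two lines to `churn_df`.

```python
import pandas as pd

# Toy frame: two constant columns, one high-cardinality column, one useful column
toy = pd.DataFrame({
    "Country": ["United States"] * 6,
    "State": ["California"] * 6,
    "City": ["Los Angeles", "San Diego", "Fresno", "Oakland", "Irvine", "Chico"],
    "Contract": ["Month-to-month", "One year", "Two year"] * 2,
})

n_unique = toy.nunique()

# Columns with a single distinct value cannot help a model
constant_cols = n_unique[n_unique == 1].index.tolist()

# Average number of rows behind each distinct value of a column
rows_per_level = len(toy) / n_unique

print(constant_cols)
print(rows_per_level.round(1))
```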

### Inspecting other categorical variables

In [None]:
# Examine Multiple lines 
sns.countplot(x = "Multiple Lines", data = churn_df)

In [None]:
# Impact on churn value
churn_df[['Multiple Lines', 'Churn Value']].groupby('Multiple Lines').mean().sort_values(by = 'Churn Value', ascending=False)

Customers with multiple lines have a higher churn rate.

In [None]:
# Examine Internet Services
sns.countplot(x = "Internet Service", data = churn_df, 
              order = churn_df['Internet Service'].value_counts().index)

In [None]:
# Impact on churn value
churn_df[['Internet Service', 'Churn Value']].groupby('Internet Service').mean().sort_values(by= 'Churn Value', ascending = False)

People with Fiber Optic have a much higher churn rate. Let's explore the factor behind it.

In [None]:
churn_df[['Monthly Charges', 'Internet Service']].groupby('Internet Service').mean().sort_values(by = 'Monthly Charges')

The monthly charges for a fiber optic connection are much higher than for the other two.

### Let's explore the internet-related features

In [None]:
fig, axes = plt.subplots(2,3, figsize = (12,10), sharey = True)
plt.suptitle('Customer Distribution Across Internet Services')
sns.countplot(x='Online Security', data = churn_df, ax=axes[0,0], order = churn_df['Online Security'].value_counts().index)
sns.countplot(x='Online Backup', data = churn_df, ax=axes[0,1], order = churn_df['Online Backup'].value_counts().index)
sns.countplot(x='Device Protection', data = churn_df, ax=axes[0,2], order = churn_df['Device Protection'].value_counts().index)
sns.countplot(x='Tech Support', data = churn_df, ax=axes[1,0], order = churn_df['Tech Support'].value_counts().index)
sns.countplot(x='Streaming TV', data = churn_df, ax=axes[1,1], order = churn_df['Streaming TV'].value_counts().index)
sns.countplot(x='Streaming Movies', data = churn_df, ax=axes[1,2], order = churn_df['Streaming Movies'].value_counts().index)

Though these services are offered for free by the company, most people do not use them. Let's see if these features impact the churn rate.

In [None]:
churn_df[['Online Security', 'Churn Value']].groupby('Online Security').mean().sort_values(by='Online Security')

In [None]:
churn_df[['Device Protection', 'Churn Value']].groupby('Device Protection').mean().sort_values(by='Device Protection')

In [None]:
churn_df[['Online Backup', 'Churn Value']].groupby('Online Backup').mean().sort_values(by='Online Backup')

In [None]:
churn_df[['Tech Support', 'Churn Value']].groupby('Tech Support').mean().sort_values(by='Tech Support')

In [None]:
churn_df[['Streaming TV', 'Churn Value']].groupby('Streaming TV').mean().sort_values(by='Streaming TV')

In [None]:
churn_df[['Streaming Movies', 'Churn Value']].groupby('Streaming Movies').mean().sort_values(by='Streaming Movies')

So the conclusion is that people who don't opt for these internet services have a higher churn rate. However, for Streaming TV and Streaming Movies, the difference in churn rate between customers who use them and those who do not is quite small.
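
The six group-by cells can also be collapsed into one loop that puts the churn rates side by side. This is a sketch on invented toy data with only two of the service columns; `service_cols` and `summary` are names introduced here for illustration.

```python
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "Online Security": ["No", "Yes", "No", "No internet service", "Yes", "No"],
    "Tech Support":    ["No", "No", "Yes", "No internet service", "Yes", "No"],
    "Churn Value":     [1, 0, 1, 0, 0, 1],
})

service_cols = ["Online Security", "Tech Support"]

# Mean of the 0/1 target per category is the churn rate of that category
churn_by_service = {
    col: toy.groupby(col)["Churn Value"].mean() for col in service_cols
}

# One column per service, one row per answer (No / No internet service / Yes)
summary = pd.DataFrame(churn_by_service)
print(summary)
```

Reading across a row then makes it easy to compare, for example, the "No" churn rate of every service at once.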

### Exploring Contract and Payment Method

In [None]:
sns.countplot(x = 'Contract', data = churn_df)
plt.title('Customers by Contract Type')

In [None]:
churn_df[['Contract', 'Churn Value']].groupby('Contract').mean()

No surprises here - customers with shorter contracts tend to churn more.

In [None]:
plt.figure(figsize = (10,6))
sns.countplot(x = 'Payment Method', data = churn_df, order = churn_df['Payment Method'].value_counts().index)
plt.title('Customers by Payment Method')

In [None]:
churn_df[['Payment Method', 'Churn Value']].groupby('Payment Method').mean().sort_values(by = 'Churn Value')

Here is something interesting: customers who pay by electronic check are more likely to churn, and this payment method is also the most common among customers.

### Exploring Continuous Features

In [None]:
num_cols = []

for col in churn_df.columns:
    if churn_df[col].dtype.kind in 'iufc':
        num_cols.append(col)
num_cols

### Here we have three numerical variables to deal with:

* Tenure Months - How long the customer has been with the company
* Monthly Charges
* CLTV - Customer Lifetime Value

Note: We are omitting Total Charges because it is roughly Tenure Months times Monthly Charges.
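
The redundancy claim can be tested rather than assumed. The sketch below uses invented numbers; it also assumes "Total Charges" may be stored as text, which is why it goes through `pd.to_numeric` first. `rel_gap` is a name introduced here for illustration.

```python
import pandas as pd

# Toy data (illustrative values only); Total Charges deliberately stored as text
toy = pd.DataFrame({
    "Tenure Months":   [2, 10, 24, 60],
    "Monthly Charges": [50.0, 70.0, 20.0, 100.0],
    "Total Charges":   ["101.5", "690.0", "485.0", "6010.0"],
})

# Convert to numbers; anything unparseable becomes NaN instead of raising
total = pd.to_numeric(toy["Total Charges"], errors="coerce")

# The product that Total Charges is supposed to be close to
approx_total = toy["Tenure Months"] * toy["Monthly Charges"]

# Relative gap per row and overall correlation between the two
rel_gap = (total - approx_total).abs() / total
print(rel_gap.round(3))
print(total.corr(approx_total))
```

A small relative gap and a correlation near 1 would support dropping the column.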


In [None]:
# Exploring Tenure
sns.displot(data = churn_df, x = "Tenure Months", hue = "Churn Value", kind = "kde")
plt.title('Tenure vs Churn Value')

**Customers with lower tenure tend to churn more and vice-versa.**

In [None]:
# Exploring Monthly Charges
sns.displot(data = churn_df, x= 'Monthly Charges', hue = 'Churn Value', kind = "kde")
plt.title('Monthly Charges vs Churn Value')

**As the monthly charges go up customers tend to churn more.**

In [None]:
churn_df[['Monthly Charges', 'Churn Value', 'Tenure Months']].groupby('Churn Value').mean()

The average monthly charge for churned customers is about 13 dollars higher, and their average tenure is about 20 months shorter than that of customers who stayed.

In [None]:
# Explore CLTV
sns.displot(data=churn_df, x='CLTV', hue='Churn Value', kind='kde')
plt.title('CLTV vs Churn Value')

In [None]:
churn_df[['CLTV', 'Churn Value']].groupby('Churn Value').mean()

In [None]:
churn_df.columns

Based on the EDA, I have decided to drop:
1. The first 9 columns
2. Gender
3. Total Charges

In [None]:
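# Filter features: keep everything after the first 9 columns, then drop
# Gender and Total Charges (done on a new frame to avoid SettingWithCopyWarning)
filtered_df = churn_df.iloc[:, 9:].drop(columns=['Gender', 'Total Charges'])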

In [None]:
# Extract categorical features from selected features
cat_features = [i for i in filtered_df.columns if filtered_df[i].dtype == 'object']
cat_features

# Data Pre-processing 

### One-hot encoding the categorical features
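
As a quick illustration of what `drop_first = True` does, here is a toy example (made-up rows): a column with three levels becomes two indicator columns, and the alphabetically first level is the implicit baseline.

```python
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "Tenure Months": [1, 14, 40, 20],
})

# Three contract levels -> two dummy columns; "Month-to-month" is dropped
encoded = pd.get_dummies(toy, columns=["Contract"], drop_first=True)
print(encoded.columns.tolist())
# ['Tenure Months', 'Contract_One year', 'Contract_Two year']
```

A row with zeros in both dummy columns is therefore a month-to-month customer.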

In [None]:
# Encode cat_features
encoded_df = pd.get_dummies(filtered_df, columns = cat_features, drop_first = True)

### Scaling the continuous features
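
Min-max scaling maps each column to the range 0 to 1 via (x - min) / (max - min), column by column. The sketch below, on invented numbers, scales several columns in a single `fit_transform` call and compares the result with the formula written out by hand; `num_features` is a name introduced here.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "Tenure Months":   [0, 18, 72],
    "Monthly Charges": [20.0, 70.0, 120.0],
    "CLTV":            [2000, 4500, 6500],
})
num_features = ["Tenure Months", "Monthly Charges", "CLTV"]

# One scaler handles all columns; each column gets its own min and max
scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(toy[num_features]),
    columns=num_features,
    index=toy.index,
)

# The same transformation written out explicitly
col_min = toy[num_features].min()
col_max = toy[num_features].max()
manual = (toy[num_features] - col_min) / (col_max - col_min)

print(scaled)
```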


In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
# MinMax scaling of continuous features
scaler = MinMaxScaler()
temp_1 = scaler.fit_transform(filtered_df[["Tenure Months"]])
temp_2 = scaler.fit_transform(filtered_df[["Monthly Charges"]])
temp_3 = scaler.fit_transform(filtered_df[['CLTV']])

In [None]:
# Replacing the original cols with scaled ones 
encoded_df['Tenure Months'] = temp_1
encoded_df['Monthly Charges'] = temp_2
encoded_df['CLTV'] = temp_3

## Treating the imbalance using upsampling technique

In [None]:
sns.countplot(x = "Churn Value", data = churn_df)
plt.title('Distribution of Target Before Upsampling')

In [None]:
churned = encoded_df[encoded_df['Churn Value']==1]
not_churned = encoded_df[encoded_df['Churn Value'] == 0]

In [None]:
from sklearn.utils import resample

In [None]:
churned_upsampled = resample(churned, 
                             replace = True, 
                             n_samples = len(not_churned),
                             random_state = 1)

In [None]:
# Combining the upsampled data
final_df = pd.concat([churned_upsampled, not_churned])
sns.countplot(x = "Churn Value", data = final_df)
plt.title("Distribution of Target After Upsampling")

# Model Building and Model Selection
1. Logistic Regression
2. XGB Classifier 
3. Parameter tuning using GridSearchCV
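
One variation that may be worth trying, and is not what this notebook does: split first and upsample only the training portion, so that no duplicated churner can land in both the training and the test set. The sketch below uses an invented toy frame; the `stratify` argument is also an addition of this sketch.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy imbalanced frame: 30 non-churners, 10 churners (illustrative values only)
toy = pd.DataFrame({
    "Monthly Charges": list(range(40)),
    "Churn Value": [0] * 30 + [1] * 10,
})

# Split first, keeping the class ratio the same in both parts
train_df, test_df = train_test_split(
    toy, test_size=0.2, random_state=1, stratify=toy["Churn Value"]
)

# Upsample the minority class inside the training part only
churned = train_df[train_df["Churn Value"] == 1]
not_churned = train_df[train_df["Churn Value"] == 0]
churned_up = resample(churned, replace=True,
                      n_samples=len(not_churned), random_state=1)
train_balanced = pd.concat([churned_up, not_churned])

# No original row appears on both sides of the split
overlap = set(train_balanced.index) & set(test_df.index)
print(train_balanced["Churn Value"].value_counts().to_dict(), len(overlap))
```

With this ordering the test set keeps its original imbalance, so the reported precision and recall would reflect the real class mix.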

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Explanatory vars
X = final_df.drop('Churn Value', axis = 1)

# Target var
Y = final_df['Churn Value']

In [None]:
# Separating the dataset into train and test set
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, 
                                                    test_size = 0.2, 
                                                    random_state = 1)

## 1. Using LogisticRegression Model

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [None]:
Logi_model = LogisticRegression(max_iter = 500)

In [None]:
Logi_model.fit(X_train, Y_train)

In [None]:
# Accuracy score on training data (sklearn convention: y_true first, then y_pred)
logi_train_pred = Logi_model.predict(X_train)
logi_acc_train = accuracy_score(Y_train, logi_train_pred)
print("Accuracy score on training data:", logi_acc_train)

# Accuracy score on test data
logi_test_pred = Logi_model.predict(X_test)
logi_acc_test = accuracy_score(Y_test, logi_test_pred)
print("Accuracy score on test data:", logi_acc_test)

That's a very good accuracy score. Also, accuracy score on training and test data are very close, so we can say that our model does not suffer from over-fitting.

Let's look into precision and recall scores:

In [None]:
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn import metrics

In [None]:
logi_recall = recall_score(Y_test,logi_test_pred)
logi_precision = precision_score(Y_test, logi_test_pred)
print("LogisticRegression model's metrics:\n")
print("Accuracy on Training Data:", round(logi_acc_train, 2))
print("Accuracy on Test Data:", round(logi_acc_test,2))
print("Recall Score:", round(logi_recall,2))
print("Precision Score:", round(logi_precision,2))

### The logistic model seems very good. Let's build a confusion matrix:

In [None]:
confusion_matrix = metrics.confusion_matrix(Y_test, logi_test_pred)
cm_display = metrics.ConfusionMatrixDisplay(confusion_matrix = confusion_matrix, 
                                            display_labels = ['Negative', 'Positive'])
cm_display.plot()
plt.title('Confusion Matrix: LogisticRegression')
plt.show()

Let's see if XGBClassifier can deliver a better result:

## 2. Using XGBClassifier

In [None]:
from xgboost import XGBClassifier

In [None]:
xgb_model = XGBClassifier()

In [None]:
xgb_model.fit(X_train, Y_train)

In [None]:
# Accuracy score on training data
xgb_train_pred = xgb_model.predict(X_train)
xgb_acc_train = accuracy_score(xgb_train_pred, Y_train)


# Accuracy score on test data
xgb_test_pred = xgb_model.predict(X_test)
xgb_acc_test = accuracy_score(xgb_test_pred, Y_test)

In [None]:
xgb_recall = recall_score(Y_test,xgb_test_pred)
xgb_precision = precision_score(Y_test, xgb_test_pred)
print("XGBClassification model's metrics:\n")
print("Accuracy on Training Data:", round(xgb_acc_train, 2))
print("Accuracy on Test Data:", round(xgb_acc_test,2))
print("Recall Score:", round(xgb_recall,2))
print("Precision Score:", round(xgb_precision,2))

### I suspect slight over-fitting here.
Let's try k-fold cross-validation and parameter tuning.

# Parameter tuning 

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
parameters = {'n_estimators':[150,200,250,300], 
              'max_depth':[5,10,15,20,25], 
              'learning_rate': [0.1,0.2,0.3,0.4]}
gscv = GridSearchCV(estimator = xgb_model, param_grid = parameters, cv = 5, n_jobs = -1)

In [None]:
gscv.fit(X,Y)

In [None]:
gscv.best_params_

In [None]:
gscv.best_score_

### Model with best parameters
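
Two small points about the grid search, shown as a hedged sketch. First, the grid above has 4 x 5 x 4 = 80 combinations, so 5-fold CV means 400 fits. Second, instead of retyping the winning values, a fitted search exposes the refit model as `best_estimator_`. To keep the example quick and free of the xgboost dependency, it uses LogisticRegression on synthetic data as a stand-in; `toy_grid` and the other names are invented here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, ParameterGrid, train_test_split

# Size of the grid used in this notebook: combinations x folds = number of fits
parameters = {'n_estimators': [150, 200, 250, 300],
              'max_depth': [5, 10, 15, 20, 25],
              'learning_rate': [0.1, 0.2, 0.3, 0.4]}
n_candidates = len(ParameterGrid(parameters))
n_fits = n_candidates * 5
print(n_candidates, n_fits)

# Stand-in search on synthetic data (not the churn data, not XGBoost)
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy,
                                          test_size=0.2, random_state=1)

toy_grid = {"C": [0.1, 1.0, 10.0]}
search = GridSearchCV(LogisticRegression(max_iter=500), toy_grid, cv=5)
search.fit(X_tr, y_tr)  # tuned on the training split only

# Already refit on all of X_tr with the best parameters
tuned_model = search.best_estimator_
print(search.best_params_, round(tuned_model.score(X_te, y_te), 2))
```

Fitting the search on the training split keeps the test set untouched until the final evaluation.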

In [None]:
best_xgb_model = XGBClassifier(max_depth = 20, n_estimators = 250, learning_rate = 0.4)

In [None]:
best_xgb_model.fit(X_train, Y_train)

In [None]:
# Accuracy score on training data
train_pred = best_xgb_model.predict(X_train)
acc_train = accuracy_score(train_pred, Y_train)

# Accuracy score on test data
test_pred = best_xgb_model.predict(X_test)
acc_test = accuracy_score(test_pred, Y_test)

In [None]:
xgb2_recall = recall_score(Y_test,test_pred)
xgb2_precision = precision_score(Y_test, test_pred)
print("Tuned XGBClassification model's metrics:\n")
print("Accuracy on Training Data:", round(acc_train, 2))
print("Accuracy on Test Data:", round(acc_test,2))
print("Recall Score:", round(xgb2_recall,2))
print("Precision Score:", round(xgb2_precision,2))

In [None]:
# CM for initial XGB model
confusion_matrix= metrics.confusion_matrix(Y_test, xgb_test_pred)
cm_display = metrics.ConfusionMatrixDisplay(confusion_matrix = confusion_matrix, 
                                            display_labels = ['Negative', 'Positive'])
cm_display.plot()
plt.title('Confusion Matrix: XGBClassifier')
plt.show()

# CM for fine-tuned XGB model
confusion_matrix= metrics.confusion_matrix(Y_test, test_pred)
cm_display = metrics.ConfusionMatrixDisplay(confusion_matrix = confusion_matrix, 
                                            display_labels = ['Negative', 'Positive'])
cm_display.plot()
plt.title('Confusion Matrix: Fine-tuned XGB')
plt.show()

# Conclusion:
Well, that's not much of an improvement over the initial XGB model. We have a 1% trade-off between the precision and recall scores: the tuned model is 1% better at avoiding false positives (1% higher precision), while it gives up 1% of its ability to avoid false negatives (1% lower recall).


In this case the XGBClassifier works better than the logistic regression model. 

#### Deciding among the two XGB models:

For our use case, we don't want customers to churn (leave), so when the model predicts that a specific customer will churn, the company proposes better offers. We are therefore most concerned with identifying the customers who are about to churn **(true positives)**. So I recommend the ***first XGB model***.

* 1034 true positives vs 1027
* 64 false positives vs 59
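
The counts above are enough to recompute precision for both models, and, since both are scored on the same test set, to see how many churners the tuned model misses in addition. A small sketch using only those four numbers:

```python
# Counts quoted above (first number: initial XGB model, second: fine-tuned)
tp_initial, fp_initial = 1034, 64
tp_tuned, fp_tuned = 1027, 59

# Precision = TP / (TP + FP): share of predicted churners who really churn
precision_initial = tp_initial / (tp_initial + fp_initial)
precision_tuned = tp_tuned / (tp_tuned + fp_tuned)

# Same test set, so the number of actual churners is fixed:
# every true positive the tuned model loses becomes a false negative
extra_false_negatives_tuned = tp_initial - tp_tuned

print(round(precision_initial, 3), round(precision_tuned, 3),
      extra_false_negatives_tuned)
# 0.942 0.946 7
```

So the initial model catches 7 more churners at the cost of 5 more unnecessary offers.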

Downside: the first model has more **false positives** than the fine-tuned one, so we end up giving better offers to those customers as well. **No major harm done**: those false-positive customers will tend to stick with us for a longer term. Business!!!!

So I would stick with the initial XGB model.


Note: The issue of slight over-fitting still persists. I will try to improve the model in the next take.

Thank you!