---
# Customer Churn Prediction 
---

| Column Name       | Description                                  | Data Type   |
|-------------------|----------------------------------------------|-------------|
| customerID        | Unique customer identifier                   | Object      |
| gender            | Customer's gender (Male or Female)          | Object      |
| SeniorCitizen     | Senior citizen status (1 for Yes, 0 for No) | Integer     |
| Partner           | Whether the customer has a partner (Yes/No) | Object      |
| Dependents        | Whether the customer has dependents (Yes/No) | Object      |
| tenure            | Number of months customer stayed            | Integer     |
| PhoneService      | Phone service status (Yes or No)            | Object      |
| MultipleLines     | Multiple phone lines status (Yes/No/No svc) | Object      |
| InternetService   | Internet service provider (DSL, Fiber, No)  | Object      |
| OnlineSecurity    | Online security status (Yes/No/No svc)      | Object      |
| OnlineBackup      | Online backup status (Yes/No/No svc)        | Object      |
| DeviceProtection  | Device protection status (Yes/No/No svc)    | Object      |
| TechSupport       | Tech support status (Yes/No/No svc)         | Object      |
| StreamingTV       | Streaming TV status (Yes/No/No svc)         | Object      |
| StreamingMovies   | Streaming movies status (Yes/No/No svc)     | Object      |
| Contract          | Contract type (Month-to-month, 1 yr, 2 yr)  | Object      |
| PaperlessBilling  | Paperless billing status (Yes/No)           | Object      |
| PaymentMethod     | Payment method chosen by customer           | Object      |
| MonthlyCharges    | Monthly charges incurred by the customer   | Float       |
| TotalCharges      | Total charges incurred by the customer     | Object      |
| Churn             | Whether the customer left the business (Yes/No) | Object      |

### Import Libraries

In [None]:
!pip install xgboost

In [1]:
# Data manipulation and analysis
import pandas as pd
# Linear algebra/numerical operations  
import numpy as np
# Visualization    
import seaborn as sns  
import matplotlib.pyplot as plt     
%matplotlib inline
# For Machine or Deep Learning
import tensorflow as tf
# High-level api for tensorflow   
from tensorflow import keras  
# Pre-processing tool
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, cross_val_score
# Classifiers  
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
# Evaluation metrics  
from sklearn.metrics import f1_score, precision_score, recall_score, r2_score   
from scipy.stats import randint, boxcox
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


In [None]:
cc = pd.read_csv('telco_customer_churn.csv')
cc.info()

In [None]:
# cc.isnull().sum()
# No null values

In [None]:
cc.shape

In [None]:
# Tenure means the number of months the customer has stayed with the company
plt.figure(figsize=(10, 5))
sns.histplot(data=cc, x='tenure', hue='Churn', bins=30, kde=False, common_norm=False)

plt.xlabel('Tenure (Number of Months)')
plt.ylabel('Count')
plt.title('Distribution of Tenure by Churn')

plt.show()

### Data Cleaning

In [None]:
# pd.to_numeric(cc['TotalCharges'])
# ValueError: Unable to parse string " " at position 488
pd.to_numeric(cc['TotalCharges'], errors = 'coerce') # It works but...

In [None]:
# ...we got null values to deal with.
print("Null values in TotalCharges:", pd.to_numeric(cc['TotalCharges'], errors = 'coerce').isnull().sum())

In [None]:
cc[pd.to_numeric(cc['TotalCharges'], errors = 'coerce').isnull()]
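
The effect of `errors='coerce'` is easier to see on a tiny example. The sketch below uses a made-up three-value series (not the real data) in which one entry is a blank string, as in the `TotalCharges` column.

```python
import pandas as pd

# Hypothetical stand-in for TotalCharges: the middle entry is a blank string.
raw = pd.Series(["29.85", " ", "1889.5"])

# errors="coerce" converts anything that cannot be parsed into NaN
# instead of raising a ValueError.
parsed = pd.to_numeric(raw, errors="coerce")

print(parsed)
print("Unparseable values:", parsed.isnull().sum())
```

The parsed series is numeric, so the blank entry can then be handled like any other missing value.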

In [None]:
# Converting dtype of TotalCharges from string to integer/float 
cc.TotalCharges = pd.to_numeric(cc['TotalCharges'], errors = 'coerce')

In [None]:
# Filling null values with mean of TotalCharges
mean_value = cc['TotalCharges'].mean()
cc['TotalCharges'] = cc['TotalCharges'].fillna(mean_value)

In [None]:
cc.TotalCharges.isnull().sum()

In [None]:
# Dropping unwanted columns
cc.drop('customerID', axis='columns', inplace = True)

In [None]:
cc.dtypes

In [None]:
cc.sample(10)

In [None]:
cc.TotalCharges

### Dealing with Outliers

In [None]:
def var_summary(x):
    # Lower/upper cut-offs: mean -/+ 2 standard deviations
    uc = x.mean() + (2 * x.std())
    lc = x.mean() - (2 * x.std())

    # 1 if any value in the column falls outside [lc, uc], else 0
    outlier_flag = int(((x < lc) | (x > uc)).any())
    return pd.Series([x.count(), x.isnull().sum(), x.sum(), x.mean(), x.median(),  x.std(), 
                      x.var(), x.min(), x.quantile(0.01), x.quantile(0.05),x.quantile(0.10),
                      x.quantile(0.25),x.quantile(0.50),x.quantile(0.75), 
                      x.quantile(0.90),x.quantile(0.95), x.quantile(0.99),x.max() , 
                      lc , uc,outlier_flag],
                  index=['N', 'NMISS', 'SUM', 'MEAN','MEDIAN', 'STD', 'VAR', 'MIN', 
                         'P1' , 'P5' ,'P10' ,'P25' ,'P50' ,'P75' ,'P90' ,'P95' ,'P99' ,
                         'MAX','LC','UC','outlier_flag'])

In [None]:
numeric_columns = ['tenure', 'MonthlyCharges', 'TotalCharges']

In [None]:
cc[numeric_columns].apply(lambda x: var_summary(x))
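
The cut-off rule in `var_summary` (flag a column when any value lies outside mean ± 2 standard deviations) can be checked in isolation. This is a minimal sketch on invented data; the helper name `has_outlier` and the toy columns are hypothetical and not part of the notebook.

```python
import pandas as pd

def has_outlier(x, k=2):
    """Return 1 if any value lies outside mean +/- k * std, else 0."""
    lc = x.mean() - k * x.std()
    uc = x.mean() + k * x.std()
    return int(((x < lc) | (x > uc)).any())

# Two hypothetical columns: one tightly clustered, one with a single extreme value.
toy = pd.DataFrame({
    "flat": [10, 11, 9, 10, 10, 11, 9, 10],
    "spiky": [10, 11, 9, 10, 10, 11, 9, 100],
})

flags = toy.apply(has_outlier)
print(flags)
```

Only the column containing the extreme value is flagged.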

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(cc['TotalCharges'], kde=True, bins = 10)
plt.title('Histogram of TotalCharges')
plt.xlabel('TotalCharges')
plt.show()

#### With Imputation 
Replacing the outliers in the 'TotalCharges' column with the median value

In [None]:
# mtc = cc['TotalCharges'].median()
# z = (cc['TotalCharges'] - cc['TotalCharges'].mean()) / cc['TotalCharges'].std()
# cc['TotalCharges'] = cc['TotalCharges'].where(z.abs() <= 2, mtc)

#### Removing the Outliers (Not Recommended)

In [None]:
# cc.shape

In [None]:
# from scipy import stats
# Define a Z-Score threshold for outliers (e.g., 2 or 3 standard deviations)
# threshold = 2
# z_scores = stats.zscore(cc['TotalCharges'])
# Create a new column with Z-Score values
# cc['Z_Score'] = z_scores

In [None]:
# Finding rows with outliers
# outlier_rows = cc[abs(z_scores) > threshold]
# outlier_rows

In [None]:
# cc = cc[abs(z_scores) <= threshold]

In [None]:
# cc.shape

In [None]:
# cc.drop('Z_Score', axis='columns', inplace = True)

#### Log Transformation 
Instead of imputing or removing outliers, we can also use log transformations to make our model less sensitive to extreme values.

In [None]:
# Apply log transformation to the feature with outliers
# cc['TotalCharges'] = np.log1p(cc['TotalCharges'])

#### Winsorization (Recommended)
Winsorization caps extreme values at a specified threshold.

In [None]:
lower_bound = cc['TotalCharges'].quantile(0.01)
upper_bound = cc['TotalCharges'].quantile(0.99)
cc['TotalCharges'] = cc['TotalCharges'].clip(lower_bound, upper_bound)
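
To see what the clipping does, here is a small sketch on an invented series with one extreme value. The data are hypothetical; the percentile bounds mirror the 1% and 99% choice above.

```python
import pandas as pd

# Hypothetical data: 1.0 ... 99.0 plus one extreme value.
s = pd.Series(range(1, 101), dtype=float)
s.iloc[-1] = 10_000.0

lower = s.quantile(0.01)
upper = s.quantile(0.99)

# Values below `lower` are raised to it, values above `upper` are lowered to it.
capped = s.clip(lower, upper)

print("max before:", s.max(), "max after:", capped.max())
print("rows kept:", len(capped) == len(s))
```

Unlike row removal, no observations are dropped; only the tails are pulled in to the chosen percentiles.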

#### Box-Cox Transformation
The Box-Cox transformation is a family of power transformations that includes logarithm (when the power is 0) and square root (when the power is 0.5) transformations.
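
For reference, the one-parameter Box-Cox transform of a strictly positive value $x$ with power $\lambda$ is usually written as follows. At $\lambda = 0.5$ it is a shifted and rescaled square root, $2(\sqrt{x} - 1)$, rather than the plain square root.

$$
y^{(\lambda)} =
\begin{cases}
\dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[6pt]
\ln x, & \lambda = 0
\end{cases}
$$

When `scipy.stats.boxcox` is called without an explicit `lmbda`, it estimates $\lambda$ by maximum likelihood and returns the transformed data together with that estimate.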

In [None]:
# Adding 1 to handle zero values
# cc['TotalCharges_boxcox'], _ = boxcox(cc['TotalCharges'] + 1) 

#### Square Root Transformation
Taking the square root is another option, particularly effective for data with moderate positive skewness.

In [None]:
# cc['TotalCharges_sqrt'] = np.sqrt(cc['TotalCharges'])

In [None]:
cc[numeric_columns].apply(lambda x: var_summary(x))

In [None]:
cc.sample(5)

### Data Scaling

In [None]:
# Inspecting the unique values of each categorical column
for col in cc:
    if cc[col].dtypes == 'object':
        print(col ,cc[col].unique())

In [None]:
cc.replace('No phone service', 'No', inplace = True)
cc.replace('No internet service', 'No', inplace = True)

In [None]:
for col in cc:
    if cc[col].dtypes == 'object':
        print(col ,cc[col].unique())

In [None]:
# Store categorical columns containing 'Yes' and 'No'
yes_no_cols = []
for col in cc:
    unique_cols = cc[col].unique()
    if sorted(unique_cols) == ['No', 'Yes']:
        yes_no_cols.append(col)

print(yes_no_cols)

In [None]:
# Replacing categorical data with binary values
for col in yes_no_cols:
    cc[col] = cc[col].map({'Yes': 1, 'No': 0})

In [None]:
for col in cc:
    if cc[col].dtypes == 'object':
        print(col ,cc[col].unique())

In [None]:
cc['gender'] = cc['gender'].map({'Male': 0, 'Female': 1})

In [None]:
for col in cc:
    if cc[col].dtypes == 'object':
        print(col ,cc[col].unique())

One-hot encoding adds one column per category, so with many categorical features or high-cardinality features (many unique values) it can significantly increase the number of columns and the data size.

Here the remaining categorical columns have at most four unique values each, so the increase is small.

In [None]:
cc1 = pd.get_dummies(data = cc, columns=['InternetService', 'Contract', 'PaymentMethod'])
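
As a quick illustration of what `pd.get_dummies` produces, the sketch below encodes one column of a made-up four-row frame. The rows are invented for the example.

```python
import pandas as pd

# Hypothetical rows; only the Contract column is one-hot encoded.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "tenure": [1, 12, 24, 13],
})

encoded = pd.get_dummies(toy, columns=["Contract"])

print(encoded.columns.tolist())
print(encoded)
```

Each row has exactly one of the new `Contract_*` columns set, and the original `Contract` column is gone.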

In [None]:
cc1.head()

In [None]:
# Finding bool columns
bool_cols = []
for col in cc1:
    if cc1[col].dtypes == 'bool':
        bool_cols.append(col)

print(bool_cols)

In [None]:
# Converting booleans into integer dtypes
for col in bool_cols:
    cc1[col] = cc1[col].astype(int)

In [None]:
# Scaling the remaining columns 
cols_to_scale = ['tenure', 'MonthlyCharges', 'TotalCharges']
scaler = MinMaxScaler()
cc1[cols_to_scale] = scaler.fit_transform(cc1[cols_to_scale])
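
With its default settings, `MinMaxScaler` rescales each column to the range 0 to 1 using `(x - min) / (max - min)`. The sketch below applies the same formula by hand to a few invented tenure values, purely to show the arithmetic.

```python
import pandas as pd

# Hypothetical tenure values in months.
tenure = pd.Series([0, 12, 36, 72], dtype=float)

# Min-max scaling by hand: the smallest value maps to 0, the largest to 1.
scaled = (tenure - tenure.min()) / (tenure.max() - tenure.min())

print(scaled.tolist())
```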

In [None]:
for col in cc1:
    print(col ,cc1[col].unique())

In [None]:
correlation_matrix = cc1.corr()
churn_correlations = correlation_matrix["Churn"].sort_values(ascending=False)
print(churn_correlations)

- A positive correlation (e.g., values close to 1) suggests that as the values in the other column increase, the likelihood of churn increases. For example, "Contract_Month-to-month" and "InternetService_Fiber optic" have positive correlations with churn, indicating that customers with month-to-month contracts or fiber optic internet service are more likely to churn.
<br><br>
- A negative correlation (e.g., values close to -1) suggests that as the values in the other column increase, the likelihood of churn decreases. For example, "tenure" has a negative correlation with churn, suggesting that customers with longer tenure are less likely to churn.
<br><br>
- Correlation values close to 0 indicate a weak linear relationship between the columns. For example, "gender" and "PhoneService" have correlation values close to 0, suggesting that these factors have little linear association with churn. Correlation alone does not show that a factor causes churn.
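
The sign conventions above can be reproduced on a small invented frame. The six rows below are hypothetical and chosen only so that one feature rises with churn and another falls with it.

```python
import pandas as pd

# Hypothetical data: short-tenure customers churn, long-tenure customers stay.
toy = pd.DataFrame({
    "tenure": [1, 2, 3, 40, 50, 60],
    "month_to_month": [1, 1, 0, 1, 0, 0],
    "Churn": [1, 1, 1, 0, 0, 0],
})

# Pearson correlation of every column with Churn, strongest positive first.
corr = toy.corr()["Churn"].sort_values(ascending=False)
print(corr)
```

In this toy frame `month_to_month` correlates positively with `Churn` and `tenure` correlates negatively, matching the reading of the signs described above.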

In [None]:
corr_matrix = cc1.corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix)
plt.title('Correlation Heatmap')
plt.show()

## Building Model

In [None]:
X = cc1.drop('Churn', axis = 'columns')
y = cc1['Churn']

In [None]:
models = [
    ('Logistic Regression', LogisticRegression()),
    ('Random Forest', RandomForestClassifier()),
    ('SVM', SVC()),
    ('Gradient Boosting', GradientBoostingClassifier()),
    ('Decision Tree', DecisionTreeClassifier())
]

# Iterating through each model and performing cross-validation
for name, model in models:
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(f'{name}: Average accuracy: {scores.mean():.2f} (+/- {scores.std() * 2:.2f})')

In [None]:
best_acc = 0
best_state = 0
model = GradientBoostingClassifier()

# Loop over random states from 0 to 200
for state in range(201):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=state)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    if acc > best_acc:
        best_acc = acc
        best_state = state

print(f'Best random state {model}: {best_state} with accuracy: {best_acc:.2f}')

In [None]:
# Best random state for GradientBoostingClassifier(): 158 with accuracy: 0.83
# Best random state LogisticRegression(): 150 with accuracy: 0.82
# Best random state SVC(): 35 with accuracy: 0.82

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=best_state)

In [None]:
model = GradientBoostingClassifier()
model.fit(X_train, y_train)

### Prediction

In [None]:
yp = model.predict(X_test)
yp[:10]

In [None]:
y_test[:10]

In [None]:
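# model.predict already returns class labels (0/1); the threshold below
# leaves them unchanged and only converts them to a plain Python list
y_pred = [1 if ele > 0.5 else 0 for ele in yp]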
        
y_pred[:10]

In [None]:
# R² is a regression metric; on 0/1 class labels it is only a rough indicator of fit.
r_squared = r2_score(y_test, y_pred)
print("R-squared (R²):", round(r_squared, 4))

Adjusted R² penalizes the inclusion of unnecessary predictors and provides a more realistic assessment of a model's fit.

In [None]:
n = len(y_test)  # Number of data points
p = X_test.shape[1]  # Number of predictors (features)

adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
print("Adjusted R-squared (Adjusted R²):", round(adjusted_r_squared, 4))

In [None]:
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy (%):", round(accuracy * 100, 2))

print("Classification Report:")
print(classification_report(y_test, y_pred))

In [None]:
print(tf.math.confusion_matrix(labels = y_test, predictions= y_pred))

In [None]:
cm = tf.math.confusion_matrix(labels = y_test, predictions= y_pred)
sns.heatmap(cm, annot=True, fmt='d')
plt.xlabel("Predicted")
plt.ylabel("Truth")

- **True Positives (TP):** These are cases where the model correctly predicted the positive class. In our matrix, there are approximately 200 true positives (bottom-right corner). These are customers who were correctly predicted to churn.
<br><br>
- **True Negatives (TN):** These are cases where the model correctly predicted the negative class. In our matrix, there are approximately 967 true negatives (top-left corner). These are customers who were correctly predicted not to churn.
<br><br>
- **False Positives (FP):** These are cases where our model incorrectly predicted the positive class when it should have predicted the negative class. There are 66 false positives (top-right corner). These are customers who were incorrectly predicted to churn when they didn't (Type I error).
<br><br>
- **False Negatives (FN):** These are cases where the model incorrectly predicted the negative class when it should have predicted the positive class. There are 176 false negatives (bottom-left corner). These are customers who were incorrectly predicted not to churn when they did (Type II error).
<br><br>
Accuracy can be calculated as (TP + TN) / (TP + TN + FP + FN), and it represents the overall correctness of predictions. 
<br><br>
So, in our case, accuracy is approximately (200 + 967) / (200 + 967 + 66 + 176) = 1167 / 1409 ≈ 0.83.
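
The same four counts also give precision, recall and F1 for the churn class. The sketch below simply plugs in the approximate counts quoted above; it does not recompute them from the model.

```python
# Approximate counts read from the confusion matrix described above.
tp, tn, fp, fn = 200, 967, 66, 176

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the customers predicted to churn, how many did
recall = tp / (tp + fn)      # of the customers who churned, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

With these counts, recall is noticeably lower than accuracy: roughly half of the customers who actually churned are missed, which the overall accuracy figure hides.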

### Hyper-Parameter Tuning

In [None]:
# Default parameters of the model
default_params = model.get_params()
print("Default Parameters:")
for param, value in default_params.items():
    print(f"{param}: {value}")

In [None]:
param_dist = {
    'n_estimators': randint(10, 200),
    'learning_rate': [0.001, 0.01, 0.1, 0.2, 0.3],
    'max_depth': [3, 4, 5, 6, 7, 8],
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 20),
    'subsample': [0.8, 0.9, 1.0],
    'max_features': [None]  
}
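
Entries such as `randint(10, 200)` are distributions rather than fixed lists, and the randomized search draws a fresh value from them for each candidate. A minimal sketch of what such draws look like (the seed and sample size are arbitrary choices for illustration):

```python
from scipy.stats import randint

# Discrete uniform distribution over the integers 10, 11, ..., 199
# (the upper bound is exclusive).
dist = randint(10, 200)

# Draw five candidate values; random_state only makes the draw repeatable.
draws = dist.rvs(size=5, random_state=0)
print(draws)
```

So `n_estimators` can take any integer from 10 to 199, whereas the list-valued entries such as `learning_rate` are sampled only from the values listed.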

In [None]:
classifier = GradientBoostingClassifier()
# We'll use RandomizedSearchCV for this problem
# It's more efficient than grid search, especially when the search space is large.
rand_search = RandomizedSearchCV(classifier, param_distributions=param_dist, n_iter=20, cv=3, 
                                 scoring='accuracy', random_state=best_state, n_jobs=-1)

rand_search.fit(X_train, y_train)
best_params = rand_search.best_params_
print("Best Parameters:", best_params)

In [None]:
best_model = rand_search.best_estimator_
y_pred = best_model.predict(X_test)

# Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy on Test Set:", round(accuracy,2))

In [None]:
cm = tf.math.confusion_matrix(labels = y_test, predictions= y_pred)
sns.heatmap(cm, annot=True, fmt='d')
plt.xlabel("Predicted")
plt.ylabel("Truth")