# __Telco Customer Churn Evaluation and Analysis__

>## Question

>> How can we improve customer churn prediction and derive actionable insights to reduce churn rates?

>## Introduction

>> Customer churn rate refers to the proportion of customers who cancel their subscription with a company. This is a critical metric for businesses as losing customeres directly impacts a company's revenue, it  costs more money to attract new customers than maintain current ones, and if there are high churn rates there is likely an underlying reason of dissatisfaction with the company. Due to all these reasons, it is paramount that companies work towards having as low a churn rate as possible. The goal of this notebook is to explore, evaluate, and analyze the IBM telco customer churn dataset in an attempt to derive actionable insights that the company can take in order to start moving towards a lower number for customer churn rate. The notebook will start by cleaning and exploring the data, it will then move onto creating a fundamental model for churn prediction, followed by refining the model to further improve churn prediction, before finally breaking down what features are important in terms of prediction and what actionable insights can be derived to decrease churn rate.

>## Exploration of Data

In [1]:
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import StandardScaler
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import classification_report, accuracy_score, roc_auc_score, confusion_matrix, ConfusionMatrixDisplay, precision_recall_curve, precision_score, recall_score, f1_score
from sklearn.neighbors import KNeighborsClassifier
from matplotlib.pyplot import subplots
import seaborn as sns
import matplotlib.pyplot as plt
from imblearn.over_sampling import SMOTE

In [2]:
data = pd.read_csv('customer_data.csv')

>> According to the dataset, the company is currently experiencing a churn rate of around 27%. Though this proportion is not too high, there it shows us that there is potential for improvements and modifications to be made.

In [3]:
total_customers = data.shape[0]
churned_customers = data[data['Churn'] == 1].shape[0]
churn_rate = churned_customers / total_customers

print(f"Dataset Shape: {data.shape}")
print(f"Columns: {data.columns.tolist()}")
print(data.dtypes)

Dataset Shape: (7043, 21)
Columns: ['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn']
customerID           object
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object


>> A quick observation of the structure of data reveal that the dataset includes 7043 observations with 21 parameters. A number of the columns have 'object' as their data type. We will need to work towards chnaging the data types of the columns to ensure a smooth modelling process.

In [4]:
print(data.isnull().sum())

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64


>> We can see that there are no null values in any of the columns which is a good sign. However, this does not mean that every single entry is significant nor useful. For example, some entries could just be " ", which will not be registered as a null value, but has the potential to cause problems down the road when moving onto the analysis and model building proportion of the notebook.

In [5]:
data.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


>> We can see that some of the categorical columns have more than two choices. This will also need to be dealt with to ensure that the model captures the intricacies and dynamic of all useful columns.

In [6]:
paying_cols = ['MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']

for c in paying_cols:
    print(data[c].unique())

data['TotalSubscriptions'] = 0

for idx, row in data.iterrows():
    total_services = 0 
    for col in paying_cols:
        if row[col] == 'Yes' or row[col] == 'DSL' or row[col] == 'Fiber optic':
            total_services += 1

    data.at[idx, 'TotalSubscriptions'] = total_services


['No phone service' 'No' 'Yes']
['DSL' 'Fiber optic' 'No']
['No' 'Yes' 'No internet service']
['Yes' 'No' 'No internet service']
['No' 'Yes' 'No internet service']
['No' 'Yes' 'No internet service']
['No' 'Yes' 'No internet service']
['No' 'Yes' 'No internet service']


In [7]:
data_mapping = {
    'Yes': 1,
    'No': 0,
    'No phone service': 0,
    'Female': 1,
    'Male': 0
}

internet_dummies = pd.get_dummies(data['InternetService'], prefix='InternetService', drop_first=True)
security_dummies = pd.get_dummies(data['OnlineSecurity'], prefix='OnlineSecurity', drop_first=True)
backup_dummies = pd.get_dummies(data['OnlineBackup'], prefix='OnlineBackup', drop_first=True)
protection_dummies = pd.get_dummies(data['DeviceProtection'], prefix='DeviceProtection', drop_first=True)
support_dummies = pd.get_dummies(data['TechSupport'], prefix='TechSupport', drop_first=True)
tv_dummies = pd.get_dummies(data['StreamingTV'], prefix='StreamingTV', drop_first=True)
movie_dummies = pd.get_dummies(data['StreamingMovies'], prefix='StreamingMovies', drop_first=True)
contract_dummies = pd.get_dummies(data['Contract'], prefix='Contract', drop_first=True)
payment_dummies = pd.get_dummies(data['PaymentMethod'], prefix='PaymentMethod', drop_first=True)

og_col_name = ['InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaymentMethod']

data['Partner'] = data['Partner'].map(data_mapping)
data['Dependents'] = data['Dependents'].map(data_mapping)
data['PhoneService'] = data['PhoneService'].map(data_mapping)
data['MultipleLines'] = data['MultipleLines'].map(data_mapping)
data['gender'] = data['gender'].map(data_mapping)
data['PaperlessBilling'] = data['PaperlessBilling'].map(data_mapping)
data["Churn"] = data['Churn'].map(data_mapping)

data = data.drop(og_col_name, axis=1)

data = pd.concat([data, internet_dummies, security_dummies, backup_dummies, protection_dummies, support_dummies, tv_dummies, movie_dummies, contract_dummies, payment_dummies], axis=1)



>> After checking what unique values each of the categorical columns that have more than two types of responses hold, we can move onto creating a data map. This data map is then being used to help create the dummy variables which will facilitate the modelling process. 

In [8]:
data = data.replace(r'^\s*$', np.nan, regex=True)

total_charges_mean = data['TotalCharges'].astype(float).mean()
data['TotalCharges'].fillna(total_charges_mean, inplace=True)
data['TotalCharges'] = data['TotalCharges'].astype(float)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  data['TotalCharges'].fillna(total_charges_mean, inplace=True)


>> As mentioned previously, just because a column does not have any null values does not mean that all the values in that column are usable. ```'TotalCharges``` happened to contain empty quotation marks without anyything else. This would have created a problem down the line when we try to scale the data. Here, we first changed all of the entries that are empty quotation marks to be null values. Then we took the average of ```'TotalCharges'``` and replaced all the null values under that column with the mean before finally changing the column data type to float.

In [9]:
t_range = [0, 12, 36, 60, float('inf')]
years = ['0-1 year', '1-3 years', '3-5 years', '5+ years']

data['tenure_group'] = pd.cut(data['tenure'], bins=t_range, labels=years, right=False)


>> ```'Tenure'``` was another column that I would modify, however, instead of actually changing anything directly under that column, I created a new column with bins that would categorize the observations depending on what value ```'Tenure'``` was. For example, if an observation had a ```'Tenure'``` value of 4 months, under the new column ```'tenure_group'```, it w would be classified as having been subscribed for ```'0-1 year'```. By doing this, I am able to create visualizations to see potential relatinoships between churn and tenure.

In [10]:
data.to_csv('cleaned_data.csv', index=False)

>> We now move onto the graphing and visualization of the data. I created all the graphs in RStudio as I personally prefer the aesthetics of R when compared to Python libraries such as matplotlib or Seaborn.

__Note.__ The R code used to create these graphs are in the __Graphs__ folder.

<div style="display: flex; justify-content: space-around;">
    <img src="Graphs/churnxgender.jpeg" alt="Image 1" width="450">
    <img src="Graphs/Tenuregroupxchurn.jpeg" alt="Image 2" width="450">
</div>

__Figure I.__ Churn count split between the genders provided in the dataset.

__Figure II.__ Churn count split amongst the different tenure groups.

<div style="display: flex; justify-content: space-around;">
    <img src="Graphs/churncount.jpeg" alt="Image 1" width="450">
    <img src="Graphs/contractchurn.jpeg" alt="Image 1" width="450">
</div>

__Figure III.__ Number of customers who did not churn (0) and number of customers who did churn (1).

>> From observing __Figure I__, we can deduce that ```Gender``` does not seem to be a column that heavily affects the number of people who churn or the number of people who don't.

>> __Figure II__ shows us that longer tenure lengths appear to have a lower number of churners. Converesely, we can see that tenure lengths of 0-1 years have the highest number of people who end up churning.

>> __Figure III__ displays the breakdown of the number of people who did and did not churn with __5174 non-churners__ and __1869 churners__.

>> __Figure IV__ Churn count amongst the different contract types.

>## Model Building

>> Now that we have cleaned the dataset up, and created certain visualizations that give us further insight into the way certain features effect the churn status of a customer, we move onto building a model  

In [11]:
X = data.drop(["Churn", 'customerID', 'tenure_group'], axis=1)
y = data['Churn']

>> I start by splitting the cleaned dataset into the predictors and their observations, X, and the responses, y. X does not include the response, ```customerID``` as it doesn't say anything important, or ```tenure_group``` as the information is already under the ```tenure``` feature. 

In [12]:
scaler = StandardScaler()

X_scaled = scaler.fit_transform(X)

>> Before running a Variance Inflation Factor test, I scale the data in X to acquire more precise values post scaling. 

>> Before running a Variance Inflation Factor test, I scale the data in X to acquire more precise values post scaling. 

In [13]:
vif_data = pd.DataFrame()
vif_data['Feature'] = X.columns
vif_data['VIF'] = [variance_inflation_factor(X_scaled, i) for i in range(X_scaled.shape[1])]
print(vif_data)

  vif = 1. / (1. - r_squared_i)


                                  Feature         VIF
0                                  gender    1.002151
1                           SeniorCitizen    1.153352
2                                 Partner    1.462800
3                              Dependents    1.384160
4                                  tenure    7.304521
5                            PhoneService   34.862059
6                           MultipleLines         inf
7                        PaperlessBilling    1.209019
8                          MonthlyCharges  865.053343
9                            TotalCharges   10.541130
10                     TotalSubscriptions         inf
11            InternetService_Fiber optic  148.263480
12                     InternetService_No         inf
13     OnlineSecurity_No internet service         inf
14                     OnlineSecurity_Yes         inf
15       OnlineBackup_No internet service         inf
16                       OnlineBackup_Yes         inf
17   DeviceProtection_No int

>> From the VIF table, it can be seen that even though I used one-hot encoding where we drop one of the categories in an attempt to avoid collinearity, we can see that __a lot of VIF values for the features are over 5__ -the standard threshold by which to judge if something is experiencing mulit-collinearity. The learning model that I have decided to start with is the Random Forest model. One of the decisions that I chose the Random Forest model is because it automatically handles multicollinearity. The significance of this is that we do not need to decide which features to remove in an attempt to deal with multi-collinearity while risking the loss of an important predictor. 

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=30)

>> I split the scaled data into the training set and the test set. The training set is comprised of 80% while the test set is comprised of the remaining 20%. 

In [15]:
og_rf_model = RandomForestClassifier(
    n_estimators= 100,
    max_depth=10,
    min_samples_leaf=4,
    min_samples_split=10,
    random_state=30,
)

og_rf_model.fit(X_train, y_train)

>> We create a fundamental Random Forest model which we fit to the training data. The values of the parameters were arbitrarily chosen. Hypertuning of parameters will be explored later in the notebook.

In [16]:
og_y_pred = og_rf_model.predict(X_test)
og_y_pred_proba = og_rf_model.predict_proba(X_test)[:,1]

In [17]:
accuracy = accuracy_score(y_test, og_y_pred)
print(f"Accuracy: {accuracy}")
print(f"Classification Report:\n {classification_report(y_test, og_y_pred)}")
roc_auc = roc_auc_score(y_test, og_y_pred_proba)
print(f"ROC-AUC Score: {roc_auc}")

Accuracy: 0.7906316536550745
Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.92      0.86      1001
           1       0.71      0.48      0.57       408

    accuracy                           0.79      1409
   macro avg       0.76      0.70      0.71      1409
weighted avg       0.78      0.79      0.78      1409

ROC-AUC Score: 0.8386943938414527


<div style="display: flex; justify-content: space-around;">
    <img src="Graphs/ogdatacm.jpeg" alt="Image 1" width="450">
</div>

__Figure V.__ Confusion Matrix of the original dataset with a basic Random Forest Model

>> We now use the Random Forest model that we just created and use it to predict the churn status of the test data. From the printed values, we can see that the Random Forest model has a __prediction accuracy of around 0.79__. This is definitely an improvement over the baseline prediction accuracy -assigning an observation to the most common class- of 0.73. However, upon further inspect of the Classification Report, the ```recall``` values for class 0 and class 1 are something to take note of. ```recall``` is how accurate the model is at labelling observations as churners who then turn out to, in fact, churn, and vice versa in regards to non-churners. __The model correctly predicted 92% of people who did not churn__ which is a significantly good accuracy level. On the toher hand, it only __correctly predicted 48% of actual churners.__ This accuracy is more problematic as predicting churners, and figuring out how to decrase the number of churners is the main objective, and only being able to accurately predict 48% of actual churners is a problem. 

>> Figure IV shows the breakdown of the prediction our model made and the number of flase positive, false negatives, and so on.  We can conclude that there definitely needs to be an improvement in the model when it comes to predicting customers who do end up churning, as when it came to churners, the model predicted more wrong than right.

>> Before focusing on improving the model's ability to predict churners, a cross-validation test will be run to ensure that the Random Forest model generalizes well to unseen data and is not overfitting or underfitting the data. 

In [18]:
cv_scores = cross_val_score(og_rf_model, X_train, y_train, cv=5, scoring='accuracy')
print("Cross-Validation Accuracy:", cv_scores.mean())

Cross-Validation Accuracy: 0.8024496415293279


>> The cross-validation accuracy is similar to that of the prediction accuracy of the test set. This is a good indicator that the model generalizes well and is not overfitting or underfitting. 

>## Improving the Model

>> There are three main ways that will be explored in order to see how an improvement to the ```recall``` score of the Random Forest model can be made, combinations of the three will also be tested. The first one is including the parameter ```class_weight='balanced'```, the second is lowering the decision threshold when predicting churn status, and the third is to resample the data in an attempt to balance out in class inbalances using ```SMOTE```.

In [19]:
weighted_rf_model = RandomForestClassifier(
    n_estimators= 100,
    max_depth=10,
    min_samples_leaf=4,
    min_samples_split=10,
    random_state=30,
    class_weight='balanced'
)

weighted_rf_model.fit(X_train, y_train)

>> The ```class_weight='balanced'``` in the Random Forest Classifier is used to address class imbalance by adjusting the weight assigned to each class during training. __Class imbalance has the problem of making the model favor the most common class__. This leads to the model having a __high overall prediction accuracy but potentially doing poorly when attempting to predict the minority class__, as in the case of modelling the basic Random Forest model. ```class_weight='balanced'``` increases the attention that the minority class draws from the model which allows the model to have a better chance of correctly predicting the minority class. This Random Forest model will now be known as the weighted rf model.

In [20]:
weighted_y_pred = weighted_rf_model.predict(X_test)
weighted_y_pred_proba = weighted_rf_model.predict_proba(X_test)[:,1]

accuracy = accuracy_score(y_test, weighted_y_pred)
print(f"Accuracy: {accuracy}")
print(f"Classification Report:\n {classification_report(y_test, weighted_y_pred)}")
roc_auc = roc_auc_score(y_test, weighted_y_pred_proba)
print(f"ROC-AUC Score: {roc_auc}")

Accuracy: 0.7693399574166075
Classification Report:
               precision    recall  f1-score   support

           0       0.87      0.79      0.83      1001
           1       0.58      0.72      0.64       408

    accuracy                           0.77      1409
   macro avg       0.73      0.75      0.74      1409
weighted avg       0.79      0.77      0.78      1409

ROC-AUC Score: 0.8402834910187852


>> The overall accuracy of the weighted rf model is slightly lower than that of the basic rf model. However, the recall value for churners (class 1) is significantly higher, coming in at __72% instead of the 48%__ in the previous model. The precision for non-churners (class 0) has gone up, but has gone down for churners (class 1). The f1-score, which measures the balance between both precision and recall, slightly decreased for non-churners and slightly increased for churners. I believe that the slight decrease in certain metrics of model is worth the trade-off for the significant increase in the recall score of churners. This is because the focus of this project is to improve churn prediction. 

In [21]:
weighted_cv_scores = cross_val_score(weighted_rf_model, X_train, y_train, cv=5, scoring='accuracy')
print("Cross-Validation Accuracy:", weighted_cv_scores.mean())

Cross-Validation Accuracy: 0.7740498438930751


>> The weighted rf model's cross-validation accuracy is slightly above that of overall accuracy, however it is still in an acceptable range where we can confidently say that the model generalizes well and does not overfit orunderfit the training data.

<div style="display: flex; justify-content: space-around;">
    <img src="Graphs/balancedcm.jpeg" alt="Image 1" width="450">
</div>

__Figure VI.__ Confusion matrix of the Random Forest model with ```class_weight='balanced'```added as a parameter.

>> The confusion matrix built off of the weighted rf model shows that the model correctly predicted 100 more churners, despite failing to predict non-churners as accurately.

In [22]:
thresholds = [0.1, 0.2, 0.3, 0.4, 0.5] 
for threshold in thresholds:
    og_y_pred_adjusted = (og_y_pred_proba >= threshold).astype(int)
    print(f"Threshold: {threshold}")
    print(f"Recall (Class 1): {recall_score(y_test, og_y_pred_adjusted):.2f}")
    print(f"Precision (Class 1): {precision_score(y_test, og_y_pred_adjusted):.2f}")
    print(f"F1-Score: {f1_score(y_test, og_y_pred_adjusted):.2f}")
    print("-" * 30)

Threshold: 0.1
Recall (Class 1): 0.93
Precision (Class 1): 0.41
F1-Score: 0.57
------------------------------
Threshold: 0.2
Recall (Class 1): 0.87
Precision (Class 1): 0.52
F1-Score: 0.65
------------------------------
Threshold: 0.3
Recall (Class 1): 0.72
Precision (Class 1): 0.57
F1-Score: 0.63
------------------------------
Threshold: 0.4
Recall (Class 1): 0.61
Precision (Class 1): 0.63
F1-Score: 0.62
------------------------------
Threshold: 0.5
Recall (Class 1): 0.48
Precision (Class 1): 0.71
F1-Score: 0.57
------------------------------


In [23]:
thresholds = [0.1, 0.2, 0.3, 0.4, 0.5]  
for threshold in thresholds:
    weighted_y_pred_adjusted = (weighted_y_pred_proba >= threshold).astype(int)
    print(f"Threshold: {threshold}")
    print(f"Recall (Class 1): {recall_score(y_test, weighted_y_pred_adjusted):.2f}")
    print(f"Precision (Class 1): {precision_score(y_test, weighted_y_pred_adjusted):.2f}")
    print(f"F1-Score: {f1_score(y_test, weighted_y_pred_adjusted):.2f}")
    print("-" * 30)

Threshold: 0.1
Recall (Class 1): 0.99
Precision (Class 1): 0.37
F1-Score: 0.54
------------------------------
Threshold: 0.2
Recall (Class 1): 0.93
Precision (Class 1): 0.43
F1-Score: 0.58
------------------------------
Threshold: 0.3
Recall (Class 1): 0.89
Precision (Class 1): 0.48
F1-Score: 0.63
------------------------------
Threshold: 0.4
Recall (Class 1): 0.81
Precision (Class 1): 0.54
F1-Score: 0.65
------------------------------
Threshold: 0.5
Recall (Class 1): 0.72
Precision (Class 1): 0.58
F1-Score: 0.64
------------------------------


<div style="display: flex; justify-content: space-around;">
    <img src="Graphs/thresholdsweightshape.jpeg" alt="Image 1" width="450">
</div>

__Figure VII.__ A graph of all the different metrics versus thresholds between class weight and no class weight

>> There are a couple choices for which threshold to choose and whether we want to include the ```class_weight='balanced'``` parameter. Moving forward, I will plan on including the ```class_weight='balanced'``` parameter in the Random Forest model, and the decision threshold I will set will be 0.4. I am making these choices based on the fact that the f1-score is highest, and there is a good recall score. This means that the model is good at predicting customers who end up actually churning while also having features and insights that we can take away to reduce churn rate.

<div style="display: flex; justify-content: space-around;">
    <img src="Graphs/wltcm.jpeg" alt="Image 1" width="450">
    
</div>

__Figure VIII.__ Confusion matrix for the prediction outcomes of the Random Forest model with ```class_weight='balanced'``` and a threshold of 0.4.

>> The confusion matrix shown above displays that the model has become much more adept at predicting the number of churners. Out of a total of 398 total churners, the model accurately predicted 322 of them. Even though there is a slight increase in the number of non-churners who the model predicted to be churners, it is not so significant that no insights ca be taken away from the model.

>> Another way that we can deal with class imbalance is by using Synthetic Minority Oversampling Technique (SMOTE). THe goal of SMOTE is to address the class imbalance in the dataset by directly altering the obversations. SMOTE does this be generating synthetic samples of the minority class in order to balance out the distribution. This differs to ```class_weight='balanced'``` and lowering the threshold because neither directly manipulate the actual data. One important aspect to note is what is defined as class imbalance. Despite there not being a strict threshold, one of the standards is when the minority __constitutes less than 20%__ of the total samples. In the original dataset, the __minority class makes up around 26%__ of the total samples.

In [24]:
smote_rf_model = RandomForestClassifier(
    n_estimators= 100,
    max_depth=10,
    min_samples_leaf=4,
    min_samples_split=10,
    random_state=30,
)


In [25]:
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

smote_rf_model.fit(X_train_resampled, y_train_resampled)

In [26]:
smote_y_pred = smote_rf_model.predict(X_test)
smote_y_pred_proba = smote_rf_model.predict_proba(X_test)[:,1]

In [27]:
print(classification_report(y_test, smote_y_pred))

threshold = 0.4
smote_y_pred_adjusted = (smote_y_pred_proba >= threshold).astype(int)
print(classification_report(y_test, smote_y_pred_adjusted))

              precision    recall  f1-score   support

           0       0.87      0.79      0.83      1001
           1       0.58      0.71      0.64       408

    accuracy                           0.77      1409
   macro avg       0.73      0.75      0.74      1409
weighted avg       0.79      0.77      0.78      1409

              precision    recall  f1-score   support

           0       0.90      0.73      0.81      1001
           1       0.55      0.79      0.65       408

    accuracy                           0.75      1409
   macro avg       0.72      0.76      0.73      1409
weighted avg       0.79      0.75      0.76      1409



In [28]:
for threshold in thresholds:
    smote_y_pred_adjusted = (smote_y_pred_proba >= threshold).astype(int)
    print(f"Threshold: {threshold}")
    print(f"Recall (Class 1): {recall_score(y_test, smote_y_pred_adjusted):.2f}")
    print(f"Precision (Class 1): {precision_score(y_test, smote_y_pred_adjusted):.2f}")
    print(f"F1-Score: {f1_score(y_test, smote_y_pred_adjusted):.2f}")
    print("-" * 30)

Threshold: 0.1
Recall (Class 1): 0.98
Precision (Class 1): 0.38
F1-Score: 0.54
------------------------------
Threshold: 0.2
Recall (Class 1): 0.92
Precision (Class 1): 0.43
F1-Score: 0.59
------------------------------
Threshold: 0.3
Recall (Class 1): 0.87
Precision (Class 1): 0.49
F1-Score: 0.63
------------------------------
Threshold: 0.4
Recall (Class 1): 0.79
Precision (Class 1): 0.55
F1-Score: 0.65
------------------------------
Threshold: 0.5
Recall (Class 1): 0.71
Precision (Class 1): 0.58
F1-Score: 0.64
------------------------------


>> According to the first classification report -done without lowering the decision threshold- the balancing of class distribution in the dataset has resulted in the model having a recall value of   0.71 for churners (class 1). Even though this value is higher than that of the vanilla Random Forest model we started out with, it is lower than the recall value of the model that has ```class_weight='balanced'``` included as a parameter alongside a decision threshold of 0.4. The increase in precision is not a strong enough reason to switch over to resample the data and not include the previously decided upon changes. The same can be said for using SMOTE and lowering the threshold to different values.

__Note.__ We do not need to run tests for including ```class_weight='balanced'``` as we are assuming that SMOTE is correcltly balancing out the class distribution in the dataset.

In [29]:
smote_cm = confusion_matrix(y_test, smote_y_pred)
print(smote_cm)

smote_threshold_cm = confusion_matrix(y_test, smote_y_pred_adjusted)
print(smote_threshold_cm)

[[795 206]
 [119 289]]
[[795 206]
 [119 289]]


<div style="display: flex; justify-content: space-around;">
    <img src="Graphs/SMOTE.jpeg" alt="Image 1" width="300">
    <img src="Graphs/lowerSMOTE.jpeg" alt="Image 2" width="300">
</div>

__Figure IX.__ Confusion matrix after using SMOTE on the dataset and testing with a Random Forest model with a decision threshold of 0.5.

__Figure X.__ Confusino matrix after using SMOTE on the dataset and testing with a Random Forest model with a decision threshold of 0.4.

>> The two confusion matrices show the break down how well the model is at predicting churners and non-churners. Though Figure IX does come close to rivalling usind a Random Forest model with ```class_weight='balanced'``` and a decision threshold of 0.4, it still sightly falls short.

>> Before moving onto analyizng what features are important and providng insights using that information as to how to potentially reduce churn, I am first going to hypertune the parameters in the Random Forest model. The purpose of hypertuning the parameters in the Random Forest model is to further optimize the the model's performance. These parameters directly influence how a machine learning algorithm learns patterns and makes predictions. 

>> Using grid search, we can see that the optimum value for ```'n_estimators'```, ```'max_depth'```, ```'min_samples_leaf'```, ```'min_samples_split'``` are 200, 10, 2, and 5, respectively.

In [31]:
hyper_weighted_rf_model = RandomForestClassifier(
    n_estimators= best_params['n_estimators'],
    max_depth=best_params['max_depth'],
    min_samples_leaf=best_params['min_samples_leaf'],
    min_samples_split=best_params['min_samples_split'],
    random_state=30,
    class_weight='balanced'
)

hyper_weighted_rf_model.fit(X_train, y_train)

In [32]:
hyper_weighted_y_pred = weighted_rf_model.predict(X_test)
hyper_weighted_y_pred_proba = weighted_rf_model.predict_proba(X_test)[:,1]

accuracy = accuracy_score(y_test, hyper_weighted_y_pred)
print(f"Accuracy: {accuracy}")
print(f"Classification Report:\n {classification_report(y_test, hyper_weighted_y_pred)}")
roc_auc = roc_auc_score(y_test, hyper_weighted_y_pred_proba)
print(f"ROC-AUC Score: {roc_auc}")

Accuracy: 0.7693399574166075
Classification Report:
               precision    recall  f1-score   support

           0       0.87      0.79      0.83      1001
           1       0.58      0.72      0.64       408

    accuracy                           0.77      1409
   macro avg       0.73      0.75      0.74      1409
weighted avg       0.79      0.77      0.78      1409

ROC-AUC Score: 0.8402834910187852


In [33]:
threshold = 0.4
hyper_weighted_y_pred_adjusted = (weighted_y_pred_proba >= threshold).astype(int)
print(classification_report(y_test, weighted_y_pred_adjusted))

              precision    recall  f1-score   support

           0       0.87      0.79      0.83      1001
           1       0.58      0.72      0.64       408

    accuracy                           0.77      1409
   macro avg       0.73      0.75      0.74      1409
weighted avg       0.79      0.77      0.78      1409



>> After hyptertuning the parameters and rerunning the weighted Random Forest model, we see that the hypertuned paramters are not performing as well as we hoped when it comes to the recall value for class 1. Since this is the aspect of the model that we care most about it, as it demonstrates the model's ability to correctly predict churners, the values used in the initial model are better suited for our purposes. 

>## Feature Analysis

>> Now that we have successfully figured out the best setup for the model, and what threshold has the best recall with minimal trade-off of other metrics, we can run the feature importance test and see what features the Random Forest model is focusing on. This allows us to use this information in accordance to achieving the goal of generating actionable insights that can reduce customer churn.

In [34]:
feature_importance = pd.DataFrame(
    {'importance': weighted_rf_model.feature_importances_},
    index = X.columns
)

In [35]:
feature_importance = pd.DataFrame(
    {'importance': weighted_rf_model.feature_importances_},
    index = X.columns
)

>> According to ```feature_importances_```, the features that the Random Forest model considers to be most important are ```tenure```, followed by ```TotalCharges``` and then ```Contract_Two year```. Down towards the end of the ranked features, we can see that some of the values start to be come rather small. We can create a new dataset that only include features whose values are greather than the mean of all feature values. This may be done in order to simplify the model, making it more inerpretable and removing features that do not add anything significant to its predictive ability. Another reason to remove certain features is because it leads to improved computational efficiency. Computational efficiency is of utmost importance when dealing with larger datasets as it saves time, memory, and money. One last reason that we may want to consider removing certain features is so that it allows the model to focus on the relevant features, improving its generalizability by ignoring noise and irrelevant features.

In [36]:
reduced_features = feature_importance[feature_importance['importance'] < feature_importance['importance'].mean()]
important_features = feature_importance[feature_importance['importance'] > feature_importance['importance'].mean()]
print(important_features)

                                importance
tenure                            0.177070
MonthlyCharges                    0.095514
TotalCharges                      0.144340
InternetService_Fiber optic       0.075428
Contract_One year                 0.039816
Contract_Two year                 0.095637
PaymentMethod_Electronic check    0.053285


>> We now have a final list of features whose importance value is greater than that of the mean importance values of al features. Before moving forward, we want to train the Random Forest model on these features and run a test to see if the model still has good prediction metrics.

In [37]:
reduced_X = X.drop(
    ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines',
     'PaperlessBilling', 'TotalSubscriptions', 'InternetService_No', 'OnlineSecurity_No internet service',
     'OnlineSecurity_Yes', 'OnlineBackup_No internet service', 'OnlineBackup_Yes', 'DeviceProtection_No internet service',
     'DeviceProtection_Yes', 'TechSupport_No internet service', 'TechSupport_Yes', 'StreamingTV_No internet service',
     'StreamingTV_Yes', 'PaymentMethod_Credit card (automatic)', 'PaymentMethod_Mailed check'], 
    axis=1 
)

reduced_scaled_X = scaler.fit_transform(reduced_X)

In [38]:
reduced_X_train, reduced_X_test, reduced_y_train, reduced_y_test = train_test_split(reduced_scaled_X, y, test_size=0.2, random_state=30)

In [39]:
reduced_weighted_rf_model = RandomForestClassifier(
    n_estimators= 100,
    max_depth=10,
    min_samples_leaf=4,
    min_samples_split=10,
    random_state=30,
    class_weight='balanced'
)

reduced_weighted_rf_model.fit(reduced_X_train, reduced_y_train)

In [40]:
reduced_weighted_y_pred = reduced_weighted_rf_model.predict(reduced_X_test)
reduced_weighted_y_pred_proba = reduced_weighted_rf_model.predict_proba(reduced_X_test)[:,1]

accuracy = accuracy_score(reduced_y_test, reduced_weighted_y_pred)
print(f"Accuracy: {accuracy}")
print(f"Classification Report:\n {classification_report(reduced_y_test, reduced_weighted_y_pred)}")
roc_auc = roc_auc_score(reduced_y_test, reduced_weighted_y_pred_proba)
print(f"ROC-AUC Score: {roc_auc}")

Accuracy: 0.7672107877927609
Classification Report:
               precision    recall  f1-score   support

           0       0.87      0.79      0.83      1001
           1       0.58      0.72      0.64       408

    accuracy                           0.77      1409
   macro avg       0.73      0.75      0.73      1409
weighted avg       0.79      0.77      0.77      1409

ROC-AUC Score: 0.8343482497894263


In [41]:
threshold = 0.4
reduced_weighted_y_pred_adjusted = (reduced_weighted_y_pred_proba >= threshold).astype(int)
print(classification_report(reduced_y_test, reduced_weighted_y_pred_adjusted))

              precision    recall  f1-score   support

           0       0.91      0.70      0.79      1001
           1       0.53      0.83      0.65       408

    accuracy                           0.74      1409
   macro avg       0.72      0.77      0.72      1409
weighted avg       0.80      0.74      0.75      1409



In [42]:
reduced_cv_scores = cross_val_score(reduced_weighted_rf_model, X_train, reduced_y_train, cv=5, scoring='accuracy')
print("Cross-Validation Accuracy:", reduced_cv_scores.mean())

Cross-Validation Accuracy: 0.7740498438930751


>> The most important takeaways from these metrics are that the model recall on churners has actually gone up from __81% to 83%__, while only decreasing slightly in other evaluation metrics. Alongside the strong CV score and ROC-AUC score, we show that we in fact do not need to include all of the previousuly described features and only need to keep features whose importance value is above that of the mean. As mentioned previously, this brings along a couple benefits.

>## Conclusion

>> From testing, we figured out that the best configuration that lead to the highest recall value for churners, while still holding relatively strong in other metrics, was including ```class_weight='balanced'``` in the Random Forest model and lowering the decision threshold to 0.4. We also decided that the data does not need to include all the features, only those listed in ```important_features``` -these are the features whose value of importance is higher than that of the mean calculated from all features' value of importance according to the Random Forest model.

>> From ```important_features``` we can see that ```tenure``` is the number one most important features used when determining whether somoene is going to churn. However, there is no way to directly influence the amount of time someone has spent subscribed with the company. Though there is still something to be taken from this. I believe that the most important interpretation of this features is that it tels us that once someone has stayed subscribed with the company for a period of time, the chances that they will suddenly chun drops drastically. This is also something that is shown in __Figure II__. From __Figure IV__ it can be seen that two year contracts have the lowest churn count. Pairing this along side ```TotalCharges```, I would suggest that the company offers a discount on two months for the two year contract plan for new customers. Furthermore, I would also reccomend that this option be given to customers who are not currently on the two year plan. I would also suggest a rewards system, only for customers on the two yera contract plan, that gets increasingly better the longer a customer stays subscribed. To appease the current two year contract plan customers who have been more loyal to the company, I suggest that they automatically start at one of the rewards, thus allowing them to collect all the rewards leading up to that one. This simultaneously rewards them for their loyalty while giving them added incentive to continue subscribing for a longer period of time in order to achieve the latter rewards in the program. In conclusion, by promoting the two year contract with emphasis on the two discounted months, implementing the rewards program and placing customers who have shown loyalty accordingly, would be a great start towards working on reducing customer churn. I will note that this is not an end all be all. I believe that we will need to implement these changes and monitor how people respond to them and then figure out our next course of action from there.