# Customer Churn Prediction for Beta Bank

<B>Introduction:</B>

Beta Bank is experiencing a gradual loss of customers, a phenomenon known as churn. Retaining existing customers is more cost-effective than acquiring new ones, so the bank aims to build a machine learning model to predict whether a customer is likely to leave. The goal is to build a model with an F1 score of at least 0.59 and evaluate its performance using the AUC-ROC metric.

In this project, we will:

1. Load and explore the data.
2. Examine class imbalance.
3. Train baseline models without handling imbalance.
4. Improve the model by addressing class imbalance using class weight adjustment and undersampling.
5. Evaluate and test the final model.

In [1]:
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.utils.class_weight import compute_class_weight
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier



#  Data Loading and Preparation

In [2]:

# Load the dataset
data = pd.read_csv('/datasets/Churn.csv')

# Display the first few rows
data.head()


Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


<b>Data Overview</b>

In [3]:

# Check for missing values
print("Missing values in each column:")
print(data.isnull().sum())

# Fill NaN values in numerical columns with the mean
numerical_columns = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']
data[numerical_columns] = data[numerical_columns].fillna(data[numerical_columns].mean())

# Fill NaN values in categorical columns with the mode
categorical_columns = ['Geography', 'Gender']
for column in categorical_columns:
    # Assign the result back instead of using inplace=True on a column selection
    data[column] = data[column].fillna(data[column].mode()[0])

# Check again for any remaining missing values
print("Missing values after imputation:")
print(data.isnull().sum())

# Check for duplicates
print("Number of duplicates:", data.duplicated().sum())

# Summary statistics of the dataset
print("Summary statistics:")
print(data.describe())



Missing values in each column:
RowNumber            0
CustomerId           0
Surname              0
CreditScore          0
Geography            0
Gender               0
Age                  0
Tenure             909
Balance              0
NumOfProducts        0
HasCrCard            0
IsActiveMember       0
EstimatedSalary      0
Exited               0
dtype: int64
Missing values after imputation:
RowNumber          0
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64
Number of duplicates: 0
Summary statistics:
         RowNumber    CustomerId   CreditScore           Age       Tenure  \
count  10000.00000  1.000000e+04  10000.000000  10000.000000  10000.00000   
mean    5000.50000  1.569094e+07    650.528800     38.921800      4.99769   
std     2886.89568 

<b>Data Preprocessing</b>

We will now remove irrelevant columns and preprocess the categorical features.

Remove Irrelevant Features
Some columns, like RowNumber, CustomerId, and Surname, are not useful for prediction, so we will drop them.

In [4]:
# Drop irrelevant columns
data = data.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1)
print(data.head())


   CreditScore Geography  Gender  Age  Tenure    Balance  NumOfProducts  \
0          619    France  Female   42     2.0       0.00              1   
1          608     Spain  Female   41     1.0   83807.86              1   
2          502    France  Female   42     8.0  159660.80              3   
3          699    France  Female   39     1.0       0.00              2   
4          850     Spain  Female   43     2.0  125510.82              1   

   HasCrCard  IsActiveMember  EstimatedSalary  Exited  
0          1               1        101348.88       1  
1          0               1        112542.58       0  
2          1               0        113931.57       1  
3          0               0         93826.63       0  
4          1               1         79084.10       0  


<b>Handle Categorical Features</b>

We'll encode categorical variables, like Geography and Gender.

In [5]:
# One-hot encode Geography
data = pd.get_dummies(data, columns=['Geography'], drop_first=True)

# Binary encode Gender (0 for Female, 1 for Male)
data['Gender'] = data['Gender'].apply(lambda x: 1 if x == 'Male' else 0)


# Train-Test Split

We'll now split the data into training, validation, and test sets. The training set will be used to train the model, the validation set for tuning, and the test set for final evaluation.


In [6]:


# Define numerical features to scale
numeric_features = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']

# Split data into features and target
features = data.drop('Exited', axis=1)
target = data['Exited']

# Split once: training (60%), validation (20%), and test (20%)
features_train, features_temp, target_train, target_temp = train_test_split(
    features, target, test_size=0.4, random_state=12345)

features_valid, features_test, target_valid, target_test = train_test_split(
    features_temp, target_temp, test_size=0.5, random_state=12345)

# Initialize the scaler
scaler = StandardScaler()

# Apply the scaler only to numerical columns in the training set
features_train_scaled = features_train.copy()
features_valid_scaled = features_valid.copy()
features_test_scaled = features_test.copy()

# Scale only the numerical columns
features_train_scaled[numeric_features] = scaler.fit_transform(features_train[numeric_features])
features_valid_scaled[numeric_features] = scaler.transform(features_valid[numeric_features])
features_test_scaled[numeric_features] = scaler.transform(features_test[numeric_features])



<b>Observations:</b>
    
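Splitting the Data: The data is split into three sets: training (60%), validation (20%), and test (20%). The first split holds out 40% of the rows as a temporary set (features_temp and target_temp), which is then split in half to produce the validation and test sets.

Applying the Scaler: The StandardScaler is fitted on the numerical columns of the training data (features_train) only and then used to transform the validation and test sets, so all three sets share the training set's scaling parameters and no information leaks from the validation or test data. The scaled copies are stored in features_train_scaled, features_valid_scaled, and features_test_scaled. The model cells below pass the unscaled features_train, features_valid, and features_test, so the reported scores reflect unscaled inputs.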
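The 60/20/20 split can also be made class-aware. The following is a minimal sketch on toy data, not the notebook's own split: the `stratify` argument is an addition that the cells above do not use, and the array names and synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the bank features: 1000 rows, about 20% positives
rng = np.random.default_rng(12345)
X = rng.normal(size=(1000, 4))
y = (rng.random(1000) < 0.2).astype(int)

# 60% train, then split the remaining 40% in half for validation and test.
# stratify keeps the share of positives roughly equal in every part.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=12345, stratify=y)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=12345, stratify=y_temp)

split_sizes = (len(X_train), len(X_valid), len(X_test))
positive_rates = [part.mean() for part in (y_train, y_valid, y_test)]
print(split_sizes)
print([round(rate, 3) for rate in positive_rates])
```

With an imbalanced target, stratifying is a common way to keep the validation and test metrics from shifting simply because one part received fewer positive examples.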

#  Investigating Class Imbalance

The target variable Exited is likely imbalanced, meaning most customers have not left the bank.

In [7]:
# Check class distribution
print(target.value_counts(normalize=True))


0    0.7963
1    0.2037
Name: Exited, dtype: float64


<b>Observations:</b>

The dataset is imbalanced:

79.6% of the customers have stayed (Exited = 0), while only 20.4% of customers have left (Exited = 1).

Conclusion: 

A model trained on this dataset without considering the imbalance might favor the majority class (0), leading to poor performance in predicting the minority class (1).
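To illustrate why favoring the majority class is a problem, here is a small sketch with a model that always predicts the most frequent class. The 80/20 toy target is an assumption chosen to resemble the distribution above; it is not the bank data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Toy target with roughly the same 80/20 split as Exited (assumed for illustration)
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((len(y), 1))  # the dummy model ignores the features

# Always predict the most frequent class (0)
majority = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = majority.predict(X)

accuracy = accuracy_score(y, pred)
f1 = f1_score(y, pred, zero_division=0)  # no positive predictions, so F1 is 0
print(accuracy, f1)
```

Accuracy looks respectable while the F1 score for the positive class is zero, which is why this project tracks F1 and AUC-ROC rather than accuracy.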

# Step 4: Train a Baseline Model Without Handling Imbalance

We will train a Logistic Regression model and a Decision Tree without handling the imbalance to see how they perform by default.

<b>4.1 Logistic Regression Baseline</b>

We will start by training a Logistic Regression model with no adjustments for class imbalance and evaluate its F1 score and AUC-ROC.

In [8]:

# Train logistic regression
model_lr = LogisticRegression(random_state=12345)
model_lr.fit(features_train, target_train)

# Make predictions
predictions_valid = model_lr.predict(features_valid)

# Evaluate model performance
f1 = f1_score(target_valid, predictions_valid)
roc_auc = roc_auc_score(target_valid, model_lr.predict_proba(features_valid)[:, 1])

print(f"Baseline Logistic Regression F1 Score: {f1}")
print(f"Baseline Logistic Regression AUC-ROC: {roc_auc}")


Baseline Logistic Regression F1 Score: 0.08786610878661089
Baseline Logistic Regression AUC-ROC: 0.6736612246626219


<b>Observations:</b>

The F1 score is very low (about 0.09), indicating poor performance in predicting the minority class (Exited = 1).
The AUC-ROC of about 0.67 is modest: the model ranks churners above non-churners more often than chance (0.5), but there is considerable room for improvement.

Conclusion:
The model struggles to balance precision and recall on the imbalanced target, which is evident from the low F1 score. It was also fitted on the unscaled features, which may contribute to the weak result.
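To make the precision/recall balance behind the F1 score concrete, the sketch below computes it by hand on toy labels. The labels are invented for illustration and do not come from the validation set.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels: 10 churners, of which the model finds only 1, plus 1 false alarm
y_true = [1] * 10 + [0] * 40
y_pred = [1] + [0] * 9 + [1] + [0] * 39

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 1 / 2
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 1 / 10

# F1 is the harmonic mean of precision and recall
f1_manual = 2 * precision * recall / (precision + recall)
print(precision, recall, f1_manual)
```

Because the harmonic mean is pulled toward the smaller of the two values, a model that rarely predicts the positive class gets a low F1 even when its few positive predictions are fairly precise.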

<b>4.2 Decision Tree Baseline</b>

We will do the same for the Decision Tree classifier.

In [9]:


# Train decision tree
model_tree = DecisionTreeClassifier(random_state=12345)
model_tree.fit(features_train, target_train)

# Make predictions
predictions_valid_tree = model_tree.predict(features_valid)

# Evaluate the decision tree model
f1_tree = f1_score(target_valid, predictions_valid_tree)
roc_auc_tree = roc_auc_score(target_valid, model_tree.predict_proba(features_valid)[:, 1])

print(f"Baseline Decision Tree F1 Score: {f1_tree}")
print(f"Baseline Decision Tree AUC-ROC: {roc_auc_tree}")


Baseline Decision Tree F1 Score: 0.49514563106796117
Baseline Decision Tree AUC-ROC: 0.6801759023463728


<b>Observations:</b>

The Decision Tree has a much higher F1 score than the Logistic Regression baseline (0.495 vs. 0.088), and its AUC-ROC is marginally higher (0.680 vs. 0.674).
An unconstrained Decision Tree can memorize the training set, so the model may be overfitting. The cell above does not compare training and validation scores, so this is not verified here.

Conclusion: Without addressing class imbalance, neither model reaches the target F1 score of 0.59 for predicting customer churn (the minority class).
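One way to check the overfitting concern is to compare training and validation F1 scores at several tree depths. This is a sketch on synthetic data from `make_classification`, not the bank data; the depth values and dataset parameters are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (about 80/20) with some label noise
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.8, 0.2],
                           flip_y=0.05, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345, stratify=y)

# For each depth, store (training F1, validation F1)
scores = {}
for depth in (3, 6, None):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=12345)
    tree.fit(X_train, y_train)
    scores[depth] = (f1_score(y_train, tree.predict(X_train)),
                     f1_score(y_valid, tree.predict(X_valid)))

for depth, (f1_train, f1_valid) in scores.items():
    print(depth, round(f1_train, 3), round(f1_valid, 3))
```

A large gap between the training and validation columns for the unrestricted tree would support the overfitting hypothesis, and a limited `max_depth` is a simple way to narrow that gap.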

<b>Findings:</b>
    
Both baseline F1 scores are below the 0.59 target, most likely because of the class imbalance.
AUC-ROC provides insight into how well the model distinguishes between the two classes.

#  Handle Class Imbalance and Improve the Model

<b>5.1 Class Weight Adjustment</b>

We will train the models again, but this time using class weight adjustment, which adjusts the importance of each class based on its frequency.
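As a rough illustration of what `class_weight='balanced'` computes, the sketch below uses scikit-learn's `compute_class_weight` on a toy 80/20 target and compares the result with the formula `n_samples / (n_classes * count_per_class)`. The toy target is an assumption, not the bank data.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy target with an 80/20 class split (assumed for illustration)
y = np.array([0] * 80 + [1] * 20)
classes = np.unique(y)

# Weights as scikit-learn derives them for class_weight='balanced'
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# The same weights from the formula n_samples / (n_classes * count_per_class)
manual = len(y) / (len(classes) * np.bincount(y))

print(dict(zip(classes, weights)))  # class 0 gets 0.625, class 1 gets 2.5
```

Each error on the minority class therefore costs four times as much as an error on the majority class in this toy case, which pushes the model toward predicting churn more often.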

<b>Logistic Regression with Class Weight:</b>

In [10]:

# Train logistic regression with balanced class weights
model_lr_balanced = LogisticRegression(class_weight='balanced', random_state=12345)
model_lr_balanced.fit(features_train, target_train)

# Make predictions
predictions_valid_balanced = model_lr_balanced.predict(features_valid)

# Evaluate model performance
f1_balanced = f1_score(target_valid, predictions_valid_balanced)
roc_auc_balanced = roc_auc_score(target_valid, model_lr_balanced.predict_proba(features_valid)[:, 1])

print(f"F1 Score (Logistic Regression with class weights): {f1_balanced}")
print(f"AUC-ROC (Logistic Regression with class weights): {roc_auc_balanced}")


F1 Score (Logistic Regression with class weights): 0.44709626093874305
AUC-ROC (Logistic Regression with class weights): 0.7132694971539871


<b>Observations:</b>

Class weight adjustment significantly improves the F1 score from 0.088 to 0.447, indicating that the model is better at predicting churn (the minority class).
The AUC-ROC score has also improved, from 0.674 to 0.713, showing that the model is more capable of distinguishing between churners and non-churners.

Conclusion: Using class_weight='balanced' helped the model give more importance to the minority class, leading to a better F1 score and improved overall performance.

<b>Decision Tree with Class Weight:</b>

In [11]:
# Train decision tree with balanced class weights
model_tree_balanced = DecisionTreeClassifier(class_weight='balanced', random_state=12345)
model_tree_balanced.fit(features_train, target_train)

# Make predictions
predictions_valid_tree_balanced = model_tree_balanced.predict(features_valid)

# Evaluate model performance
f1_tree_balanced = f1_score(target_valid, predictions_valid_tree_balanced)
roc_auc_tree_balanced = roc_auc_score(target_valid, model_tree_balanced.predict_proba(features_valid)[:, 1])

print(f"F1 Score (Decision Tree with class weights): {f1_tree_balanced}")
print(f"AUC-ROC (Decision Tree with class weights): {roc_auc_tree_balanced}")


F1 Score (Decision Tree with class weights): 0.48484848484848486
AUC-ROC (Decision Tree with class weights): 0.6738109352222068


<b>Observations:</b>

Class weight adjustment does not improve the Decision Tree: its F1 score drops slightly from 0.495 to 0.485, whereas Logistic Regression gained substantially.
The AUC-ROC score also drops slightly, from 0.680 to 0.674, and remains lower than the class-weighted Logistic Regression's AUC-ROC of 0.713.

Conclusion: The Decision Tree does not benefit from class weight adjustment on this split. It keeps a somewhat higher F1 score than the class-weighted Logistic Regression (0.485 vs. 0.447) but a lower AUC-ROC.


<b>5.2 Undersampling of the Majority Class</b>

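Another approach is to undersample the majority class (class 0) so that it matches the number of instances in the minority class (class 1). Note that the GridSearchCV cell in this section does not resample the training data: it keeps class_weight='balanced' and tunes a Logistic Regression, so its results reflect class weighting plus grid search rather than undersampling.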
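Random undersampling as described above can be sketched as follows. The `undersample` helper is hypothetical (it is not part of the notebook or of any library), it assumes the features and target share a unique index, and the toy frame stands in for the training data.

```python
import pandas as pd

def undersample(features, target, random_state=12345):
    """Hypothetical helper: randomly drop rows until every class has as many rows as the smallest one."""
    minority_size = target.value_counts().min()
    kept_index = []
    for label in target.unique():
        # Sample minority_size rows from each class (keeps the whole minority class)
        sampled = target[target == label].sample(n=minority_size, random_state=random_state)
        kept_index.extend(sampled.index)
    # Shuffle so the classes are not grouped together
    features_down = features.loc[kept_index].sample(frac=1, random_state=random_state)
    target_down = target.loc[features_down.index]
    return features_down, target_down

# Toy training data: 8 rows of class 0 and 2 rows of class 1
toy_features = pd.DataFrame({"x": range(10)})
toy_target = pd.Series([0] * 8 + [1] * 2)

features_down, target_down = undersample(toy_features, toy_target)
print(target_down.value_counts())
```

Resampling of this kind is normally applied to the training set only, so that the validation and test sets keep the original class distribution.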

In [12]:


# Define the parameter grid (each hyperparameter currently has a single candidate value)
param_grid = {
    'C': [0.1],  # Inverse of regularization strength
    'max_iter': [500],  # Maximum number of solver iterations
    'solver': ['liblinear']  # Optimization algorithm
}

# Initialize the logistic regression model with class weight 'balanced'
model = LogisticRegression(class_weight='balanced', random_state=12345)

# Use GridSearchCV to find the best combination of hyperparameters
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, scoring='f1', cv=5)

# Fit the GridSearchCV on the training data (imbalance is handled by class_weight='balanced', not by resampling)
grid_search.fit(features_train, target_train)

# Retrieve the best model and hyperparameters
best_model = grid_search.best_estimator_

# Evaluate the best model on the validation set
predictions_valid = best_model.predict(features_valid)
f1_valid = f1_score(target_valid, predictions_valid)
roc_auc_valid = roc_auc_score(target_valid, best_model.predict_proba(features_valid)[:, 1])

print(f"Best Hyperparameters: {grid_search.best_params_}")
print(f"F1 Score (Validation Data): {f1_valid}")
print(f"AUC-ROC (Validation Data): {roc_auc_valid}")

# Final evaluation on the test data using the best model
predictions_test = best_model.predict(features_test)
f1_test = f1_score(target_test, predictions_test)
roc_auc_test = roc_auc_score(target_test, best_model.predict_proba(features_test)[:, 1])

print(f"F1 Score (Test Data): {f1_test}")
print(f"AUC-ROC (Test Data): {roc_auc_test}")


Best Hyperparameters: {'C': 0.1, 'max_iter': 500, 'solver': 'liblinear'}
F1 Score (Validation Data): 0.4962279966471081
AUC-ROC (Validation Data): 0.7582703742461544
F1 Score (Test Data): 0.474036850921273
AUC-ROC (Test Data): 0.7350821726622805


<b>Observations:</b>

F1 Score:

Validation Data: 0.4962
Test Data: 0.4740
The F1 score, which balances precision and recall, is below the 0.59 target on both the validation and test sets. This suggests that the model is not effectively predicting the positive class (customers who exited).
AUC-ROC:

Validation Data: 0.7583
Test Data: 0.7351
The AUC-ROC values indicate that the model has a reasonable ability to distinguish between the classes (0 and 1). An AUC value above 0.7 is generally considered acceptable. The AUC-ROC drops by about 0.02 from the validation to the test data.
Potential Overfitting:

The F1 score and AUC-ROC on the validation data are slightly higher than those on the test data, which could be a mild sign of overfitting. With differences of about 0.02 between two 2,000-row sets, it may also simply reflect sampling variation.
Imbalanced Classes:

Because the classes are imbalanced (about 80% of customers did not exit), the F1 score can be particularly affected. It might be beneficial to explore further methods for dealing with class imbalance, such as:
Advanced Resampling Techniques: Consider techniques like SMOTE (Synthetic Minority Over-sampling Technique) for oversampling the minority class, or random undersampling of the majority class. Neither was applied in the cell above.
Different Model: Explore more complex models like Random Forest or Gradient Boosting, which may handle class imbalance better.
Ensemble Methods: Using ensemble techniques might improve overall performance and robustness.
Hyperparameter Tuning:

The grid search evaluated only one value for each hyperparameter (C, max_iter, solver), so the selected values might not be optimal. Experimenting with a range of values, especially for C (the inverse of regularization strength), may improve performance.
Feature Engineering:

Review the feature set to ensure that all relevant variables are included and that irrelevant or highly correlated features are removed. Additional feature engineering might help improve the model's predictive capabilities.
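The hyperparameter suggestion above can be sketched as a grid search over several values of C, with scaling included in the pipeline so that it is refitted inside each cross-validation fold. This is an illustrative sketch on synthetic data from `make_classification`; the candidate values for C and the step names `scale` and `model` are assumptions, not results from this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (about 80/20) standing in for the training set
X, y = make_classification(n_samples=1000, n_features=8, weights=[0.8, 0.2],
                           random_state=12345)

# Scaling lives inside the pipeline, so each fold is scaled with its own training part
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(class_weight="balanced", solver="liblinear",
                                 random_state=12345)),
])

# Several candidate values for C (inverse of regularization strength)
param_grid = {"model__C": [0.01, 0.1, 1, 10]}
search = GridSearchCV(pipeline, param_grid=param_grid, scoring="f1", cv=5)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Putting the scaler inside the pipeline also ensures that the model is actually trained on scaled features, which matters for regularized Logistic Regression.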

#  Conclusion

The project successfully demonstrated the importance of addressing class imbalance in predicting customer churn for Beta Bank. Here's a summary of the final results based on the different approaches explored throughout the project:

Baseline Models:

Logistic Regression (Without Class Imbalance Handling):

F1 Score: 0.0879

AUC-ROC: 0.6737

Observation: The model struggled to predict the minority class (Exited = 1) due to the class imbalance. The F1 score is very low, indicating poor performance in handling the minority class, but the AUC-ROC score is somewhat acceptable, showing that the model can distinguish between the classes to a certain extent.

Decision Tree (Without Class Imbalance Handling):

F1 Score: 0.4951

AUC-ROC: 0.6802

Observation: The Decision Tree performed much better than Logistic Regression in terms of F1 score (0.4951 vs. 0.0879), while the AUC-ROC remained comparable. The model may be overfitting the training data and still does not address class imbalance.

Class Imbalance Handling Techniques:
Logistic Regression with Class Weight Adjustment (class_weight='balanced'):

F1 Score: 0.4471

AUC-ROC: 0.7133

Observation: Applying class weights improved the model's ability to predict the minority class, significantly boosting the F1 score. The AUC-ROC score also improved, indicating that the model is better at distinguishing between the two classes. However, the F1 score is still below 0.5, indicating the need for further improvement.

Decision Tree with Class Weight Adjustment (class_weight='balanced'):

F1 Score: 0.4848

AUC-ROC: 0.6738

Observation: Unlike Logistic Regression, the Decision Tree did not benefit from class weight adjustment: both its F1 score and its AUC-ROC are slightly lower than the unweighted baseline (0.4951 and 0.6802). Its AUC-ROC also remains lower than that of the class-weighted Logistic Regression.

Logistic Regression with Class Weight Adjustment and Grid Search (C=0.1, solver='liblinear'):

F1 Score (validation): 0.4962

AUC-ROC (validation): 0.7583

Observation: Fitting the class-weighted Logistic Regression through GridSearchCV gave the highest validation F1 score and AUC-ROC of the models tried, although the F1 score is only marginally above the baseline Decision Tree (0.4951). Random undersampling of the majority class was planned, but no undersampling run was performed, so no undersampling result is reported.

Best Model and Final Results:

The best-performing model on the validation set in terms of both F1 score and AUC-ROC was Logistic Regression with Class Weight Adjustment fitted through grid search. This model was selected as the final model and evaluated on the test set.

The test results were:

F1 Score: 0.4740

AUC-ROC: 0.7351

Observation: The final model performed significantly better than the baseline Logistic Regression, but its F1 score remained below the project goal of 0.59 on both the validation and test sets. The AUC-ROC of about 0.74 on the test set indicates moderate predictive power. Reaching the target would require further work, such as training on the scaled features, resampling, a wider hyperparameter search, or a different model.
