# Beta Bank
Customers of Beta Bank are leaving little by little every month. Bankers have discovered that it is cheaper to retain existing customers than to attract new ones. <br>
We need to predict whether a customer will leave the bank soon. You have data on customers’ past behavior and contract terminations with the bank. <br>
Create a model with the highest possible F1 score. To pass the review, you need an F1 score of at least 0.59. Check the F1 score on the test set. In addition, you must measure the AUC-ROC metric and compare it with the F1 score.

# Data Description
You can find the data in the file /datasets/Churn.csv.

Features

- RowNumber: data row index
- CustomerId: unique customer identifier
- Surname: last name
- CreditScore: credit score
- Geography: country of residence
- Gender: gender
- Age: age
- Tenure: period during which the customer’s fixed-term deposit has matured (years)
- Balance: account balance
- NumOfProducts: number of banking products used by the customer
- HasCrCard: whether the customer has a credit card (1 – yes; 0 – no)
- IsActiveMember: customer activity status (1 – yes; 0 – no)
- EstimatedSalary: estimated salary

Target
- Exited: the customer has left the bank (1 – yes; 0 – no)

# 1. Initialization

In [None]:
# Make the project root importable so the local `src` package can be found
import sys
import os

sys.path.append(os.path.abspath('..'))

In [None]:
# Load libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re

# Import classification libraries
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, f1_score, classification_report

from src.null_columns import show_null_columns
from src.column_names import stardard_col_names
from src.modeling import train_evaluate_model

In [None]:
# Extract the info from the Datasets
df_churn = pd.read_csv('../data/raw/churn.csv')

# 2. Data Preprocessing
### 2.1 Copy original Dataframe

In [None]:
# Clone datasets to keep the original with no changes
df_churn_clean = df_churn.copy()

### 2.2 Review duplicate / null values

In [None]:
# General View
df_churn.info()
print()
print(df_churn.sample(3))

"""
Findings:
The column headers are in CamelCase, so they need to be standardized to snake_case.
"""
stardard_col_names(df_churn_clean)
df_churn_clean.columns

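The `stardard_col_names` helper lives in `src/column_names.py`, which is not shown in this notebook. As a rough sketch of what such a helper might do, assuming it renames the columns in place, the hypothetical `to_snake_case` and `standardize_col_names` functions below convert CamelCase headers to snake_case:

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Convert a CamelCase header such as 'RowNumber' to 'row_number'."""
    # Insert "_" wherever a lowercase letter or digit is followed by an uppercase letter
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


def standardize_col_names(df: pd.DataFrame) -> None:
    """Rename the columns of `df` in place (hypothetical stand-in for the src helper)."""
    df.columns = [to_snake_case(col) for col in df.columns]


# Empty frame that only carries a few of the original headers
demo = pd.DataFrame(columns=["RowNumber", "CustomerId", "HasCrCard", "Exited"])
standardize_col_names(demo)
print(list(demo.columns))
```

The resulting names match the ones used later in the notebook, such as `row_number` and `customer_id`.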
Duplicate Values

In [None]:
print("the duplicate rows are:", df_churn_clean.duplicated().sum())  # Sum of Duplicated rows
print('number of duplicate values in column "customer_id":', df_churn_clean['customer_id'].duplicated().sum())

Null Values

In [None]:
show_null_columns(df_churn_clean)

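The source of `show_null_columns` is also outside this notebook. A minimal sketch of an equivalent summary, assuming the helper reports the count and percentage of nulls per column (the name `null_summary` is hypothetical), could look like this:

```python
import numpy as np
import pandas as pd


def null_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of nulls for columns that contain any."""
    null_count = df.isna().sum()
    summary = pd.DataFrame({
        "null_count": null_count,
        "null_pct": (null_count / len(df) * 100).round(2),
    })
    # Keep only the columns that actually have missing values
    return summary[summary["null_count"] > 0]


# Toy frame: one of four 'tenure' values is missing
demo = pd.DataFrame({"tenure": [1.0, np.nan, 3.0, 4.0], "age": [30, 41, 52, 28]})
print(null_summary(demo))
```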
In [None]:
"""
Findings:
Since 9.09% of the values in column 'tenure' are null, we need to look at how the data is distributed before deciding how to impute them.
"""
# Visualize distribution
df_churn_clean['tenure'].hist(bins=30)
plt.title('Distribution of tenure')
plt.show()

# Check statistics
print(df_churn_clean['tenure'].describe().round(2))
print(f"Skewness: {df_churn_clean['tenure'].skew()}")

"""
Findings:
The data in column "tenure" is very symmetric (skewness = 0.016), so either the mean or the median would work for imputation.
Since tenure is measured in whole years, I'll use the median to keep the values as integers.
"""
df_churn_clean['tenure'] = df_churn_clean['tenure'].fillna(df_churn_clean['tenure'].median())

# Cast dtypes to preserve data integrity and reduce memory usage
df_churn_clean['tenure'] = df_churn_clean['tenure'].astype(int)
df_churn_clean['geography'] = df_churn_clean['geography'].astype('category')
df_churn_clean['gender'] = df_churn_clean['gender'].astype('category')
df_churn_clean.info()

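To illustrate the choice between mean and median before the cast to `int`, here is a small sketch on made-up values (not taken from the dataset). Note that a median is not guaranteed to be a whole number in general; it happens to be one in this toy series.

```python
import numpy as np
import pandas as pd

# Toy integer-valued column with two missing entries
toy = pd.Series([0, 1, 2, 5, 9, np.nan, np.nan])

mean_filled = toy.fillna(toy.mean())      # mean of 0, 1, 2, 5, 9 is 3.4
median_filled = toy.fillna(toy.median())  # median of 0, 1, 2, 5, 9 is 2.0

# Casting to int truncates 3.4 to 3, silently changing the imputed value.
# The median of this series is already whole, so the cast loses nothing.
print(mean_filled.astype(int).tolist())
print(median_filled.astype(int).tolist())
```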
### 2.3 Preparing data for Modeling

In [None]:
"""
Findings:
Before training an ML model, the columns that carry no predictive information (row index, customer ID, surname) are removed,
and the 2 categorical columns are converted to numerical ones using one-hot encoding. After this, the data is ready for modeling.
"""
df_churn_clean = df_churn_clean.drop(['row_number', 'customer_id', 'surname'], axis=1)   # Removing columns
df_churn_clean = pd.get_dummies(df_churn_clean, columns=['geography', 'gender'], drop_first=True)   # One-hot encode the categorical columns
df_churn_clean.columns = df_churn_clean.columns.str.lower()
print(df_churn_clean.dtypes)

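As a quick illustration of what `pd.get_dummies(..., drop_first=True)` produces, the sketch below encodes a made-up frame with the same two categorical columns. The values are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [42, 35, 51, 29],
    "geography": ["France", "Germany", "Spain", "France"],
    "gender": ["Female", "Male", "Male", "Female"],
})

# drop_first=True drops the first category of each column (France, Female),
# which avoids one redundant, perfectly collinear dummy per feature
encoded = pd.get_dummies(toy, columns=["geography", "gender"], drop_first=True)
print(list(encoded.columns))
```

In the notebook these dummy names are lowercased afterwards, giving names such as `geography_germany` and `gender_male`.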

# 3. Modeling
### 3.1 Data Segmentation

In [None]:
# Split the dataset into features and target
features = df_churn_clean.drop(['exited'], axis=1)
target = df_churn_clean['exited']

# Divide the data (train and test) by using train-test split
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.25, random_state=12345
)

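If the target turns out to be imbalanced, one optional variation is a stratified split, which keeps the class ratio the same in the training and test subsets. The sketch below is not what the notebook ran; it uses toy data and the `stratify` argument of `train_test_split`:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with an 80/20 class ratio
toy_features = pd.DataFrame({"x": np.arange(1000)})
toy_target = pd.Series([0] * 800 + [1] * 200)

x_train, x_test, y_train, y_test = train_test_split(
    toy_features,
    toy_target,
    test_size=0.25,
    random_state=12345,
    stratify=toy_target,  # keep the class ratio the same in both subsets
)

# Share of the positive class in each subset
print(y_train.mean(), y_test.mean())
```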
### 3.2 Class Balance

In [None]:
"""
Findings:
Here we need to check how balanced the data is in the target column 'exited'.
This tells us whether the target distribution is heavily skewed, since class imbalance can bias model training and make accuracy misleading.
"""
# Class balance
print(df_churn_clean['exited'].value_counts())
print()
print(df_churn_clean['exited'].value_counts(normalize=True) * 100)

# Class Balance Visualization
df_churn_clean['exited'].value_counts().plot(kind='bar')
plt.title('Class Distribution')
plt.xlabel('Exited (0 = No, 1 = Yes)')
plt.ylabel('Number of Customers')
plt.show()

Findings: <br>
The class balance analysis shows that the classes [0, 1] are imbalanced. <br>
The results show that 79.63% of customers have not left the bank, while 20.37% of customers have left. <br>
Metrics to use: F1-score and ROC-AUC
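
To see why accuracy is a poor guide at this class ratio, consider a toy "model" that always predicts the majority class. The sketch uses 100 made-up labels at roughly the observed 80/20 split:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# 100 toy customers: 80 stayed (0), 20 left (1)
y_true = np.array([0] * 80 + [1] * 20)

# A "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)
y_score = np.zeros(len(y_true))  # the same score for every customer

print("accuracy:", accuracy_score(y_true, y_pred))
print("f1:", f1_score(y_true, y_pred, zero_division=0))
print("roc_auc:", roc_auc_score(y_true, y_score))
```

Accuracy looks respectable even though no churner is ever identified, while F1 and ROC-AUC expose the problem.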

### 3.3 Training
Since the target is binary (the customer either exited or did not), this is a classification task. <br>
I'll train 3 classification models to determine which one gives the best result. Each model is trained twice, with use_class_weight=False and with use_class_weight=True, because the class imbalance is ~80/20.
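
The `train_evaluate_model` helper is imported from `src/modeling.py`, whose source is not included here. The sketch below is an assumption about how such a helper might work, based only on how it is called in this notebook: it sets `class_weight='balanced'` when `use_class_weight=True`, fits the model, and returns the model with its F1 and AUC-ROC on the test set. The name `train_evaluate_model_sketch` and the toy data are hypothetical. This version passes predicted probabilities to `roc_auc_score`; a helper that passes hard 0/1 predictions instead would generally report different AUC values.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def train_evaluate_model_sketch(model_class, params, features_train, target_train,
                                features_test, target_test, use_class_weight=False):
    """Fit a classifier and return (model, f1, auc) measured on the test set."""
    params = dict(params)  # copy so the caller's dict is not modified
    if use_class_weight:
        params["class_weight"] = "balanced"
    model = model_class(**params)
    model.fit(features_train, target_train)

    predictions = model.predict(features_test)                # hard 0/1 labels for F1
    probabilities = model.predict_proba(features_test)[:, 1]  # P(class 1) for AUC
    f1 = f1_score(target_test, predictions)
    auc = roc_auc_score(target_test, probabilities)
    return model, f1, auc


# Synthetic, imbalanced toy data (not the churn dataset)
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=12345)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=12345, stratify=y
)

toy_params = {"max_depth": 5, "random_state": 12345}
model, f1, auc = train_evaluate_model_sketch(
    DecisionTreeClassifier, toy_params, X_train, y_train, X_test, y_test,
    use_class_weight=True,
)
print(f"F1 = {f1:.4f}, AUC = {auc:.4f}")
```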
#### 3.3.1 DecisionTreeClassifier

In [None]:
# Setting parameters
results_dtc = []

dtc_params = {
    "max_depth": 10,
    "min_samples_split": 20,
    "random_state": 12345
}

In [None]:
# Model without class weighting
model_dtc, f1_dtc, auc_dtc = train_evaluate_model(
    DecisionTreeClassifier,
    dtc_params,
    features_train,
    target_train, 
    features_test, 
    target_test,
    use_class_weight=False
)

results_dtc.append({
    "model": "DecisionTree",
    "class_weight": "None",
    "f1": f1_dtc,
    "auc": auc_dtc
})

In [None]:
# Model with class weighting
model_dtc_bal, f1_dtc_bal, auc_dtc_bal = train_evaluate_model(
    DecisionTreeClassifier,
    dtc_params,
    features_train,
    target_train, 
    features_test, 
    target_test,
    use_class_weight=True
)

results_dtc.append({
    "model": "DecisionTree",
    "class_weight": "Balanced",
    "f1": f1_dtc_bal,
    "auc": auc_dtc_bal
})

In [None]:
# Comparison of Models
results_dtc = pd.DataFrame(results_dtc)
results_dtc

#### 3.3.2 LogisticRegression

In [None]:
# Setting parameters
results_lr = []

lr_params = {
    "solver":'liblinear',
    "random_state": 12345
}

In [None]:
# Model without class weighting
model_lr, f1_lr, auc_lr = train_evaluate_model(
    LogisticRegression,
    lr_params,
    features_train,
    target_train, 
    features_test, 
    target_test,
    use_class_weight=False
)

results_lr.append({
    "model": "LogisticRegression",
    "class_weight": "None",
    "f1": f1_lr,
    "auc": auc_lr
})

In [None]:
# Model with class weighting
model_lr_bal, f1_lr_bal, auc_lr_bal = train_evaluate_model(
    LogisticRegression,
    lr_params,
    features_train,
    target_train, 
    features_test, 
    target_test,
    use_class_weight=True
)

results_lr.append({
    "model": "LogisticRegression",
    "class_weight": "Balanced",
    "f1": f1_lr_bal,
    "auc": auc_lr_bal
})

In [None]:
# Comparison of Models
results_lr = pd.DataFrame(results_lr)
results_lr

#### 3.3.3 RandomForestClassifier

In [None]:
# Setting parameters
results_rfc = []

rfc_params = {
    "n_estimators": 20,
    "min_samples_split": 10,
    "random_state":12345
}

In [None]:
# Model without class weighting
model_rfc, f1_rfc, auc_rfc = train_evaluate_model(
    RandomForestClassifier,
    rfc_params,
    features_train,
    target_train, 
    features_test, 
    target_test,
    use_class_weight=False
)

results_rfc.append({
    "model": "RandomForest",
    "class_weight": "None",
    "f1": f1_rfc,
    "auc": auc_rfc
})

In [None]:
# Model with class weighting
model_rfc_bal, f1_rfc_bal, auc_rfc_bal = train_evaluate_model(
    RandomForestClassifier,
    rfc_params,
    features_train,
    target_train, 
    features_test, 
    target_test,
    use_class_weight=True
)

results_rfc.append({
    "model": "RandomForest",
    "class_weight": "Balanced",
    "f1": f1_rfc_bal,
    "auc": auc_rfc_bal
})

In [None]:
# Comparison of Models
results_rfc = pd.DataFrame(results_rfc)
results_rfc

After running three different classification models, the one that achieved the best F1 score was the RandomForestClassifier with class weighting (class_weight='balanced').
Below are the results obtained:

DecisionTreeClassifier
- Unbalanced: F1 = 0.5736, AUC = 0.7161
- With class_weight: F1 = 0.5651, AUC = 0.7469

LogisticRegression
- Unbalanced: F1 = 0.1171, AUC = 0.5224
- With class_weight: F1 = 0.4893, AUC = 0.6923

RandomForestClassifier
- Unbalanced: F1 = 0.5632, AUC = 0.7035
- With class_weight: F1 = 0.6363, AUC = 0.7611
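
For a side-by-side view, the six results can be collected into a single table and ranked by F1. The sketch below re-enters the scores reported above by hand; the `F1_TARGET` constant and the `meets_target` column are illustrative names.

```python
import pandas as pd

# Scores copied from the results reported in this notebook
results = pd.DataFrame([
    {"model": "DecisionTree", "class_weight": "None", "f1": 0.5736, "auc": 0.7161},
    {"model": "DecisionTree", "class_weight": "Balanced", "f1": 0.5651, "auc": 0.7469},
    {"model": "LogisticRegression", "class_weight": "None", "f1": 0.1171, "auc": 0.5224},
    {"model": "LogisticRegression", "class_weight": "Balanced", "f1": 0.4893, "auc": 0.6923},
    {"model": "RandomForest", "class_weight": "None", "f1": 0.5632, "auc": 0.7035},
    {"model": "RandomForest", "class_weight": "Balanced", "f1": 0.6363, "auc": 0.7611},
])

F1_TARGET = 0.59  # minimum F1 required by the project brief
results["meets_target"] = results["f1"] >= F1_TARGET

# Best F1 first
ranked = results.sort_values("f1", ascending=False).reset_index(drop=True)
print(ranked)
```

Inside the notebook, the same table could instead be built with `pd.concat([results_dtc, results_lr, results_rfc])`, since each of those is already a DataFrame with these columns.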

# 4. Conclusion
The objective of this project was to develop a machine learning model to predict customer churn for Beta Bank, targeting an F1 score of at least 0.59.
Two aspects were considered in selecting a model that reaches the target score:

1. Model Evaluation:
We tested three classification algorithms (Decision Tree, Logistic Regression, and Random Forest) and compared their performance with and without class weighting. <br>
- Decision Tree: Fell short of the threshold, with F1 scores of 0.5736 (unbalanced) and 0.5651 (class-weighted).
- Logistic Regression: Performed poorly, reaching a maximum F1 score of only ~0.49 even with class weighting. Likely causes are non-linear relationships in the data and the fact that the features were not scaled.
- Random Forest: Demonstrated the best performance. The unbalanced model fell short (F1 ~0.56), but class weighting raised the F1 score to ~0.64.

2. Final Model Selection:
- The Random Forest Classifier with class_weight='balanced' is the final selected model.
- Final F1 Score: 0.6363 (exceeds the 0.59 target).
- AUC-ROC Score: 0.7611 (well above the 0.5 of a random classifier, indicating a reasonable ability to distinguish between retained and exiting customers).

Business Impact:
With an F1 score of 0.6363, this model strikes a workable balance between precision and recall, making it a useful tool for Beta Bank. It allows the bank to proactively target at-risk customers with retention strategies, which can reduce churn and the costs associated with acquiring new customers.