# Imperial AI & Machine Learning Capstone Project  
Submission by: **Andrew Major**  
Student Number: **432**

## 1. Problem Statement and Objective  

- **Problem**: Predict customer churn (i.e., which customers are likely to leave the bank).
- **Objective**: Develop a model that accurately predicts churn to help the bank retain valuable customers.


## 2. Initial Setup


In [None]:
import numpy as np
import pandas as pd
import decimal
from termcolor import colored
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
import capstoneFunctions as cf

from sklearn.preprocessing import OneHotEncoder
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.preprocessing import StandardScaler

## 3. Data Preparation  

* **Data Collection**: Source a banking customer dataset, including features such as account balance, transaction frequency, customer demographics, etc.
* **Data Cleaning**:
  * Handle missing values (impute or drop).
  * Remove irrelevant features.
  * Check for duplicates.
* **Data Exploration**:
  * Understand the distribution of features.
  * Identify potential outliers.

### Load the dataset from .csv file  
* Explore columns and missing data

In [None]:
rawData = pd.read_csv('./data/churn_modelling.csv')

In [None]:
rawData.head()

In [None]:
rawData.describe()

Now look for the missing values...

In [None]:
np.where(pd.isnull(rawData))
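
`np.where` returns the row and column positions of any missing cells, which is hard to read on a wide table. A per-column count is often easier to scan. The following is a minimal sketch on a small made-up frame (`toy` and its values are illustrative, not the churn data), and it also covers the duplicate check listed under Data Cleaning:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for rawData (values are made up for illustration)
toy = pd.DataFrame({
    'CreditScore': [619, 608, np.nan, 699],
    'Geography': ['France', 'Spain', 'France', None],
    'Balance': [0.0, 83807.86, 159660.80, 0.0],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Number of rows that exactly duplicate an earlier row
n_duplicates = toy.duplicated().sum()
print(n_duplicates)
```

The same two calls should work unchanged on `rawData` once it is loaded.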

We can drop any irrelevant columns unlikely to influence outcome...

In [None]:
rawData = rawData.drop(columns=['RowNumber', 'CustomerId', 'Surname'])

## 5. Data Preprocessing  

* **Feature Scaling**: Normalise numerical features (e.g., using Min-Max scaling or Z-score normalisation). After consultation, it was agreed to only scale features for the K-Nearest Neighbours model.
* **Handling Categorical Features**:
  * Convert categorical features to numerical using one-hot encoding.
* **Feature Engineering**:
  * Consider creation of new features if relevant (e.g., customer tenure, transaction frequency).
  * Explore interactions between features.

In [None]:
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Convert categorical data into numerical data.
# Note: LabelEncoder assigns integer codes; it does not one-hot encode.
le = LabelEncoder()
rawData['Geography'] = le.fit_transform(rawData['Geography'])
rawData['Gender'] = le.fit_transform(rawData['Gender'])

# Scale the numerical data
numericCols = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']
scaler = StandardScaler()
rawData[numericCols] = scaler.fit_transform(rawData[numericCols])

rawData.head()
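
The cell above uses `LabelEncoder`, which maps `Geography` to integer codes and so implies an ordering between countries. The one-hot approach named in the bullet list avoids that. Below is a minimal sketch on a made-up frame (`toy` and its values are illustrative; `pd.get_dummies` is swapped in here as one possible encoder and is not what the notebook runs):

```python
import pandas as pd

# Toy frame with the two categorical columns and one numerical column
toy = pd.DataFrame({
    'Geography': ['France', 'Spain', 'Germany', 'France'],
    'Gender': ['Female', 'Male', 'Male', 'Female'],
    'Age': [42, 41, 39, 50],
})

# One indicator column per category; drop_first removes one level per
# feature to avoid perfectly collinear columns in linear models
encoded = pd.get_dummies(toy, columns=['Geography', 'Gender'], drop_first=True, dtype=int)
print(encoded)
```

Dropping the first level matters mainly for Logistic Regression; tree-based models are generally indifferent to it.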

Split the data into predictors and outcomes in preparation for creating the training, test and validation sets.

In [None]:
predictors_alt = rawData.drop(columns=['Exited'])
outcomes_alt = rawData['Exited']

Check the distribution of outcomes, as this will influence our choice of performance metrics for the models later.

In [None]:
outcomes_alt.value_counts()

The positive (churn) class represents approximately 20% of the dataset. Because of this imbalance, we will favour the F1 score as a performance metric, since it reflects both precision and recall.

### Plot Correlation Matrix  
* We need to determine whether multicollinearity will be an issue

In [None]:
# Multicollinearity is a property of the predictors, so pass those rather than the outcomes
cf.correlationMatrix(predictors_alt)
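
`cf.correlationMatrix` lives in the project's helper module and is not shown here. As a rough sketch of the same kind of check, assuming nothing about that helper, pairwise Pearson correlations can be computed with `DataFrame.corr()` and screened against a cut-off. The data, the column names and the 0.8 cut-off below are all made up for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
age = rng.normal(40, 10, 200)
toy = pd.DataFrame({
    'Age': age,
    'AgeInMonths': age * 12 + rng.normal(0, 1, 200),  # near-duplicate of Age
    'Balance': rng.normal(50000, 20000, 200),
})

corr = toy.corr()

# Keep the upper triangle only so that each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# List the pairs whose absolute correlation exceeds the cut-off
high_pairs = [
    (row, col, round(upper.loc[row, col], 3))
    for row in upper.index
    for col in upper.columns
    if pd.notna(upper.loc[row, col]) and abs(upper.loc[row, col]) > 0.8
]
print(high_pairs)
```

Any pair reported this way is a candidate for dropping or combining before fitting a linear model.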

## 4. Model Selection  

We will consider the following six alternative models:

1. **Logistic Regression**:
  * A simple yet interpretable model.
  * Suitable for binary classification tasks.
  * Requires categorical inputs to be converted to numerical form (e.g., one-hot encoding for features such as gender or geography).

2. **Random Forest**:
  * Ensemble method combining multiple decision trees.
  * Handles non-linear relationships.
  * Can handle both numerical and categorical features.

3. **K-Nearest Neighbours**:
  * KNN is easy to understand and implement.
  * It doesn’t make any underlying assumptions about the data distribution.
  * Can learn nonlinear decision boundaries.

4. **Gradient Boosting (e.g., XGBoost)**:
  * Powerful ensemble model.
  * Handles complex interactions.
  * Automatically handles missing values.
  * May not require explicit one-hot encoding.

5. **Support Vector Machine**:
  * SVC works well when there is a clear margin of separation between classes.
  * Aims to find the hyperplane that maximises the margin, i.e. the distance to the nearest data points of each class.

6. **Neural Networks (Deep Learning)**:
  * Complex model capable of learning intricate patterns.
  * Requires substantial data and computational resources.
  * Can handle both numerical and categorical features.

## 5. Performance Metrics  

Choose appropriate metrics to compare model performance. Given that we are predicting the minority class representing approximately 20% of the data, we are interested in:

* **Precision**: Proportion of true positive predictions among all positive predictions.
* **Recall (Sensitivity)**: Proportion of actual positives correctly predicted.
* **F1-Score**: Harmonic mean of precision and recall.
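
To make the three definitions concrete, here is a small worked example in plain Python. The confusion-matrix counts are invented for illustration and are not results from this project:

```python
# Hypothetical counts: 1,000 customers, 200 of whom actually churned
tp, fp, fn, tn = 120, 60, 80, 740

precision = tp / (tp + fp)   # share of predicted churners who really churned
recall = tp / (tp + fn)      # share of actual churners that were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```

With these counts, accuracy is 0.86, only a little above the 0.80 obtained by predicting that nobody churns, while the F1 score of about 0.63 gives a more honest picture of minority-class performance.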

## 6. Model Training and Evaluation  

* Split data into training and hold-out validation sets.
* Tune hyperparameters (e.g., learning rate, tree depth) using full grid search and randomised search for comparison, both with cross-validation. It was decided after consultation that the only models requiring full optimisation were the K-Nearest Neighbours and Gradient Boosting models. For programming convenience, all other models were 'optimised' through the same Python module, but with a single value specified for each hyperparameter; this allowed the benefit of the GridSearchCV and RandomizedSearchCV outputs to be obtained for all models.
* Train each model with stratified k-fold cross-validation using the chosen features.
* Evaluate models using the selected performance metrics on the hold-out set.
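
The steps above are wrapped inside the `capstoneFunctions` module, which is not reproduced here. As a hedged sketch of what such a cycle might look like for the K-Nearest Neighbours model, the following uses scikit-learn directly on synthetic data. The data, the parameter grid and the split sizes are assumptions for illustration, not the project's settings:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the churn data (about 20% positives)
X, y = make_classification(n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=0)

# Stratified hold-out split keeps the class ratio in both parts
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Scaling sits inside the pipeline so it is re-fitted on each training fold
pipe = Pipeline([('scale', StandardScaler()), ('knn', KNeighborsClassifier())])
param_grid = {'knn__n_neighbors': [3, 5, 11], 'knn__weights': ['uniform', 'distance']}

search = GridSearchCV(
    pipe, param_grid, scoring='f1',
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
search.fit(X_train, y_train)

print(search.best_params_)                          # best grid point
print(search.best_score_)                           # mean cross-validated F1
print(f1_score(y_val, search.predict(X_val)))       # F1 on the hold-out set
```

`RandomizedSearchCV` can be substituted by passing `param_distributions` and `n_iter` in place of `param_grid`, which is one way to compare the two search strategies on equal terms.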

In [None]:
# Initial attempt to tune FNN hyperparameters before full training alongside other models
FNNresults = cf.trainFNNOnly(trialPredictors, outcomes, nEpochs=10, learningRate=0.0001, display=True, threshold=0.24)

In [None]:
# Pass the predictors as the first argument and the outcomes as the second
results, X_val, y_val = cf.trainTestCycle(predictors_alt, outcomes_alt, stratifiedKF=True, threshold=0.23)
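
The helper's source is not shown, so the meaning of `threshold` is an assumption here: a plausible reading is that it is the probability cut-off above which a customer is labelled as a churner. Under that assumption, the sketch below shows how lowering the cut-off from the usual 0.5 changes the predicted labels. The probabilities, labels and the function `predict_with_threshold` are hypothetical:

```python
import numpy as np

# Hypothetical predicted churn probabilities and true labels
proba = np.array([0.05, 0.10, 0.22, 0.30, 0.45, 0.60, 0.15, 0.25])
y_true = np.array([0, 0, 0, 1, 1, 1, 0, 1])

def predict_with_threshold(proba, threshold):
    """Label a customer as churn (1) when the probability reaches the threshold."""
    return (proba >= threshold).astype(int)

default_pred = predict_with_threshold(proba, 0.5)
lowered_pred = predict_with_threshold(proba, 0.23)

# Recall: share of actual churners that each cut-off catches
recall_default = (default_pred[y_true == 1] == 1).mean()
recall_lowered = (lowered_pred[y_true == 1] == 1).mean()
print(recall_default, recall_lowered)
```

On imbalanced data a lower cut-off typically trades some precision for recall, which is why it is usually chosen by looking at the F1 score rather than fixed at 0.5.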

In [None]:
f1_val_scores = []
f1_train_scores = []
f1_scores = []
confusion_matrices = []
labels = []
for model, data in results.items():
    if model != 'FNN':
        # confusion matrix
        confusion_matrices.append(data['Grid']['validation']['confusion'])
        confusion_matrices.append(data['Random']['validation']['confusion'])
        # Best training scores
        f1_train_scores.append(data['Grid']['bestScore'])
        f1_train_scores.append(data['Random']['bestScore'])
        # validation scores
        f1_val_scores.append(data['Grid']['validation']['f1'])
        labels.append(model + '_grid')
        f1_val_scores.append(data['Random']['validation']['f1'])
        labels.append(model + '_rand')
        # data for plots
        f1_scores.append([data['Grid']['bestScore'],data['Grid']['validation']['f1']])
        f1_scores.append([data['Random']['bestScore'],data['Random']['validation']['f1']])
    else:
        # confusion matrix
        confusion_matrices.append(data['confusion'])
        # Best training scores
        f1_train_scores.append(0)
        # validation scores
        f1_val_scores.append(data['scores']['f1'])
        labels.append(model)
        # data for plots
        f1_scores.append([0, data['scores']['f1']])
print(labels)
print(f1_scores)

In [None]:
import seaborn as sns
names = ['Best Train', 'Validation']
colours = ['blue', 'orange']
# Create a DataFrame
df = pd.DataFrame(f1_scores, index=labels, columns=names)

# Plot the bar chart
ax = df.plot.bar(color=colours)
ax.set_title("F1 Scores for Trained Model Comparison")
ax.set_xlabel("Models")  # Customize x-axis label
ax.set_ylabel("F1 Score")  # Customize y-axis label
plt.savefig('F1_Comparison.png')
plt.show()
# confusion matrix
ax = plt.subplot()
sns.heatmap(results['SVC']['Grid']['validation']['confusion'], annot=True, fmt='g', ax=ax)
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix for SVM Classifier')
plt.savefig('SVC_CM.png')
plt.show()

In [None]:
results

In [None]:
tn, fp, fn, tp = results['Logistic']['Grid']['validation']['confusion'].ravel()
print(tn)
print(fp)
print(fn)
print(tp)