# Credit Card Users Churn Prediction

## Problem Statement

### Business Context

Thera Bank recently saw a steep decline in the number of its credit card users. Credit cards are a good source of income for banks because of the different fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.

Customers leaving the credit card service would mean a loss for the bank, so the bank wants to analyze its customer data, identify the customers who are likely to leave, and understand their reasons for leaving, so that it can improve in those areas.

As a data scientist at Thera Bank, you need to come up with a classification model that will help the bank improve its services so that customers do not give up their credit cards.

### Data Description

* CLIENTNUM: Client number. Unique identifier for the customer holding the account
* Attrition_Flag: Internal event (customer activity) variable - if the account is closed then "Attrited Customer" else "Existing Customer"
* Customer_Age: Age in Years
* Gender: Gender of the account holder
* Dependent_count: Number of dependents
* Education_Level: Educational Qualification of the account holder - Graduate, High School, Unknown, Uneducated, College(refers to college student), Post-Graduate, Doctorate
* Marital_Status: Marital Status of the account holder
* Income_Category: Annual Income Category of the account holder
* Card_Category: Type of Card
* Months_on_book: Period of relationship with the bank (in months)
* Total_Relationship_Count: Total no. of products held by the customer
* Months_Inactive_12_mon: No. of months inactive in the last 12 months
* Contacts_Count_12_mon: No. of Contacts in the last 12 months
* Credit_Limit: Credit Limit on the Credit Card
* Total_Revolving_Bal: Total Revolving Balance on the Credit Card
* Avg_Open_To_Buy: Open to Buy Credit Line (Average of last 12 months)
* Total_Amt_Chng_Q4_Q1: Change in Transaction Amount (Q4 over Q1)
* Total_Trans_Amt: Total Transaction Amount (Last 12 months)
* Total_Trans_Ct: Total Transaction Count (Last 12 months)
* Total_Ct_Chng_Q4_Q1: Change in Transaction Count (Q4 over Q1)
* Avg_Utilization_Ratio: Average Card Utilization Ratio

#### What Is a Revolving Balance?

- If we don't pay the balance of the revolving credit account in full every month, the unpaid portion carries over to the next month. That's called a revolving balance


##### What is the Average Open to buy?

- 'Open to Buy' means the amount left on your credit card to use. Now, this column represents the average of this value for the last 12 months.

##### What is the Average utilization Ratio?

- The Avg_Utilization_Ratio represents how much of the available credit the customer spent. This is useful for calculating credit scores.


##### Relation between Avg_Open_To_Buy, Credit_Limit and Avg_Utilization_Ratio:

- ( Avg_Open_To_Buy / Credit_Limit ) + Avg_Utilization_Ratio = 1
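
The relation above can be checked on a few made-up rows. This is only a sketch: the values are hypothetical (not taken from `BankChurners.csv`), and it assumes that open-to-buy is the credit limit minus the revolving balance and that utilization is the revolving balance divided by the credit limit.

```python
import pandas as pd

# Hypothetical values for three cardholders (not taken from BankChurners.csv)
toy = pd.DataFrame(
    {
        "Credit_Limit": [10000.0, 5000.0, 2000.0],
        "Total_Revolving_Bal": [2500.0, 0.0, 1500.0],
    }
)

# Assumption: open-to-buy = limit - revolving balance
toy["Avg_Open_To_Buy"] = toy["Credit_Limit"] - toy["Total_Revolving_Bal"]
# Assumption: utilization = revolving balance / limit
toy["Avg_Utilization_Ratio"] = toy["Total_Revolving_Bal"] / toy["Credit_Limit"]

# Left-hand side of the relation; it should be 1 for every row
check = toy["Avg_Open_To_Buy"] / toy["Credit_Limit"] + toy["Avg_Utilization_Ratio"]
print(check)
```

On the real data the same expression can be evaluated on `data` to see how closely the three columns follow the relation.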

### **Please read the instructions carefully before starting the project.**
This is a commented Jupyter IPython Notebook file in which all the instructions and tasks to be performed are mentioned.
* Blanks '_______' are provided in the notebook that need to be filled with appropriate code to get the correct result. With every '_______' blank, there is a comment that briefly describes what needs to be filled in the blank space.
* Identify the task to be performed correctly, and only then proceed to write the required code.
* Fill in the code wherever asked by commented lines like "# write your code here" or "# complete the code". Running incomplete code may throw an error.
* Please run the code cells sequentially from the beginning to avoid unnecessary errors.
* Add the results/observations (wherever mentioned) derived from the analysis in the presentation and submit the same.


## Importing necessary libraries

In [1]:
# Installing the libraries with the specified version.
# uncomment and run the following line if Google Colab is being used
!pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==1.5.3 imbalanced-learn==0.10.1 xgboost==2.0.3 -q --user

In [2]:
# Installing the libraries with the specified version.
# uncomment and run the following lines if Jupyter Notebook is being used
# !pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==1.5.3 imbalanced-learn==0.12.0 xgboost==2.0.3 -q --user
# !pip install --upgrade -q threadpoolctl

**Note**: *After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again*.

In [None]:
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np

# To suppress scientific notations
pd.set_option("display.float_format", lambda x: "%.3f" % x)

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# To tune model, get different metric scores, and split data
from sklearn import metrics
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score

# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder

# To impute missing values
from sklearn.impute import SimpleImputer

# To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# To do hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)

# To help with model building
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    BaggingClassifier,
)
from xgboost import XGBClassifier

# To suppress warnings
import warnings
warnings.filterwarnings("ignore")

## Loading the dataset

In [None]:
# Load the dataset
df = pd.read_csv('BankChurners.csv')
df.head()

## Data Overview

In [None]:
df.shape

In [None]:
#make a copy and work on the copy, so the original data (df) is intact.
data = df.copy()

In [None]:
#Review top and bottom 5 rows to understand the data.
data.head()

In [None]:
data.tail()

In [None]:
data.info()

There are a few non-numeric columns (features) that will need to be encoded before modeling.

In [None]:
# check for duplicate rows
data.duplicated().sum()

No duplicate rows were found, so no cleanup is needed.

In [None]:
#check for missing values

data.isnull().sum()

The columns (features) `Education_Level` and `Marital_Status` have missing values, which will need to be imputed later.

In [None]:
data.describe(include='all').T

In [None]:
# Identify unique values and the number of occurences for all the categorical variables.
for i in data.describe(include=["object"]).columns:
    print("Unique values in", i, "are :")
    print(data[i].value_counts())
    print("*" * 50)

In [None]:
# CLIENTNUM is a unique ID for each client and hence will not add value to the modeling
data.drop(["CLIENTNUM"], axis=1, inplace=True)

In [None]:
## Encoding the label (target variable): 0 = existing customer, 1 = attrited customer.
data["Attrition_Flag"] = data["Attrition_Flag"].replace(
    {"Existing Customer": 0, "Attrited Customer": 1}
)

data.tail(10)

In [None]:
Retention_percentage = data['Attrition_Flag'].value_counts(normalize=True) * 100
Retention_percentage

#### Observations
- The dataset has more than 10k rows, which is a decent size.
- `Education_Level` and `Marital_Status` have many missing values. Both are categorical, so if they turn out to be important, the nulls should be replaced with the most frequent value.
- The target classes are imbalanced. A model trained on this data may be biased towards the majority class, so the imbalance needs to be handled (for example, by resampling).


## Exploratory Data Analysis (EDA)

- EDA is an important part of any project involving data.
- It is important to investigate and understand the data better before building a model with it.
- A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
- A thorough analysis of the data, in addition to the questions mentioned below, should be done.

**Questions**:
Based on the EDA, I have come up with the answers below.
1. How is the total transaction amount distributed?
   `Total_Trans_Amt` is right-skewed with outliers. Most customers have a total transaction amount of $5,000 or less.
2. What is the distribution of the level of education of customers?
   The majority of customers are Graduates or completed High School. Very few customers have a Post-Graduate degree or a Doctorate.
3. What is the distribution of the level of income of customers?
   Most customers have an income of less than $40k (3561) or between $40k and $60k (1790). Very few customers have an income higher than $120k (727). Surprisingly, 1112 customers have an income category of "abc", which is not a valid value and needs to be fixed.
4. How does the change in transaction count between Q4 and Q1 (`Total_Ct_Chng_Q4_Q1`) vary by the customer's account status (`Attrition_Flag`)?
   As `Total_Ct_Chng_Q4_Q1` decreases, the likelihood of attrition increases, showing a negative correlation of -0.37. The average value of `Total_Ct_Chng_Q4_Q1` is 0.5 for attrited customers and 0.7 for existing customers.
5. How does the number of months a customer was inactive in the last 12 months (`Months_Inactive_12_mon`) vary by the customer's account status (`Attrition_Flag`)?
   As `Months_Inactive_12_mon` increases, the likelihood of attrition also rises, exhibiting a positive correlation of 0.15. The average value of `Months_Inactive_12_mon` for customers who have attrited is 2, while the average for existing customers is 2.5.
6. What are the attributes that have a strong correlation with each other?
   - `Total_Trans_Ct` has a strong positive correlation with `Total_Trans_Amt` (0.81).
   - `Customer_Age` has a strong positive correlation with `Months_on_book` (0.79).
   - `Avg_Utilization_Ratio` has a fairly strong positive correlation with `Total_Revolving_Bal` (0.62).
   - `Total_Amt_Chng_Q4_Q1` has a moderate positive correlation with `Total_Ct_Chng_Q4_Q1` (0.38).
   - `Avg_Open_To_Buy` has a moderate negative correlation with `Avg_Utilization_Ratio` (-0.54).
   - `Credit_Limit` has a moderate negative correlation with `Avg_Utilization_Ratio` (-0.48).
   - `Total_Trans_Ct` has a moderate negative correlation with `Attrition_Flag` (-0.37).
   - `Total_Trans_Amt` has a moderate negative correlation with `Total_Relationship_Count` (-0.35).
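
Group averages and correlations with the target, like the ones quoted above, can be obtained with a `groupby` and `Series.corr`. The snippet below is a minimal sketch on a made-up six-row frame (the numbers are hypothetical and chosen only to make the pattern visible); on the notebook's data the same two calls would be applied to `data`.

```python
import pandas as pd

# Hypothetical toy frame: 0 = existing customer, 1 = attrited customer
toy = pd.DataFrame(
    {
        "Attrition_Flag": [0, 0, 0, 1, 1, 1],
        "Total_Ct_Chng_Q4_Q1": [0.8, 0.7, 0.9, 0.5, 0.4, 0.6],
    }
)

# Average of the feature within each target class
group_means = toy.groupby("Attrition_Flag")["Total_Ct_Chng_Q4_Q1"].mean()

# Pearson correlation between the feature and the 0/1 target
corr_with_target = toy["Total_Ct_Chng_Q4_Q1"].corr(toy["Attrition_Flag"])

print(group_means)
print("Correlation with target:", round(corr_with_target, 2))
```

A lower class mean for the attrited group goes together with a negative correlation, which is how the averages and the correlation sign in the answers can be cross-checked against each other.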


#### The below functions need to be defined to carry out the Exploratory Data Analysis.

In [None]:
# function to plot a boxplot and a histogram along the same scale.


def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins if bins else "auto"
    )  # histogram; seaborn's default "auto" binning is used when bins is None
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram

In [None]:
# function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height()  # top of the bar

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot

In [None]:
# function to plot stacked bar chart

def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
    # place the legend outside the plot area, to the right
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1), frameon=False)
    plt.show()

In [None]:
### Function to plot distributions

def distribution_plot_wrt_target(data, predictor, target):

    fig, axs = plt.subplots(2, 2, figsize=(12, 10))

    target_uniq = data[target].unique()

    axs[0, 0].set_title("Distribution of " + predictor + " for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
    )

    axs[0, 1].set_title("Distribution of " + predictor + " for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
    )

    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )

    plt.tight_layout()
    plt.show()

# Univariate analysis

In [None]:


continuous_columns = ['Customer_Age', 'Months_on_book', 'Credit_Limit', 'Total_Revolving_Bal', 'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt','Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio']
for column in continuous_columns:
  histogram_boxplot(data, column)

discrete_columns = ['Gender','Dependent_count', 'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category', 'Total_Relationship_Count', 'Months_Inactive_12_mon','Contacts_Count_12_mon']
for column in discrete_columns:
  labeled_barplot(data, column)

In [None]:

numerical_data = data.select_dtypes(include=np.number)

# Create correlation matrix
correlation_matrix = numerical_data.corr()

# Create the heatmap
plt.figure(figsize=(16, 12))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix of Numerical Features')
plt.show()

# Bivariate analysis

In [None]:
# plot stacked_barplot using the above function for all valid columns

for column in discrete_columns:
  stacked_barplot(data, column, 'Attrition_Flag')

In [None]:
# Generate the distribution_plot_wrt_target for all numeric columns using the function defined above.
# The target itself is numeric after encoding, so it is skipped.

for column in numerical_data.columns.drop('Attrition_Flag'):
  distribution_plot_wrt_target(data, column, 'Attrition_Flag')

## Data Pre-processing

In [None]:
# Detect the percentage of outliers for all continuous_columns

def detect_outliers_percentage(data, continuous_columns):
  """
  Detects the percentage of outliers for all continuous columns in a DataFrame.

  Args:
    data: The DataFrame containing the data.
    continuous_columns: A list of column names representing continuous variables.

  Returns:
    A dictionary where keys are column names and values are the percentage of outliers.
  """

  outlier_percentages = {}
  for column in continuous_columns:
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    num_outliers = len(data[(data[column] < lower_bound) | (data[column] > upper_bound)])
    outlier_percentage = (num_outliers / len(data)) * 100

    outlier_percentages[column] = outlier_percentage

  return outlier_percentages


# Example usage (assuming you have the 'data' DataFrame and 'continuous_columns' list defined)
outlier_percentages = detect_outliers_percentage(data, continuous_columns)
outlier_percentages

#### Observations
- Since the outliers appear to be genuine values that are pertinent to this scenario, I will retain them in my analysis.

- To prevent leakage from the test data, I will split the dataset first and only then proceed with preprocessing and imputation.

In [None]:
# Split the dataset into train, test, and validation sets.
# Stratify on the target because the classes are imbalanced.

X = data.drop('Attrition_Flag', axis=1)
y = data['Attrition_Flag']

# First split: 80% train, 20% held out
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

# Second split of the held-out 20%: 80% of it goes to test (16% of all rows)
# and 20% of it goes to validation (4% of all rows)
X_test, X_val, y_test, y_val = train_test_split(
    X_temp, y_temp, test_size=0.2, stratify=y_temp, random_state=1
)

print("Train set shape:", X_train.shape, y_train.shape)
print("Test set shape:", X_test.shape, y_test.shape)
print("Validation set shape:", X_val.shape, y_val.shape)
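
The effect of `stratify` can be seen on a small synthetic target. This is a sketch with made-up data (100 rows, 20% positives), not the bank dataset: the share of the positive class should be the same in both parts of the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 20% positive class (hypothetical, not the bank data)
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 20 + [0] * 80)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=1
)

# With stratification, both parts keep the original class proportions
print("Positive share in train:", y_tr.mean())
print("Positive share in test:", y_te.mean())
```

The same check on the notebook's splits would be `y_train.mean()`, `y_val.mean()`, and `y_test.mean()`, since the target is encoded as 0/1.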

In [None]:
#Display the value counts of the income_category in all test, validation, and training datasets.

print("Income Category Value Counts in Train Data:")
print(X_train['Income_Category'].value_counts())
print("\nIncome Category Value Counts in Validation Data:")
print(X_val['Income_Category'].value_counts())
print("\nIncome Category Value Counts in Test Data:")
print(X_test['Income_Category'].value_counts())

## Missing value imputation




In [None]:
# Replace the invalid 'abc' entries in 'Income_Category' with NaN
X_train['Income_Category'] = X_train['Income_Category'].replace("abc", np.nan)
X_test['Income_Category'] = X_test['Income_Category'].replace("abc", np.nan)
X_val['Income_Category'] = X_val['Income_Category'].replace("abc", np.nan)

print(X_train.isna().sum())
print('-------')
print(X_test.isna().sum())
print('-------')
print(X_val.isna().sum())

In [None]:
#Display the value counts of the income_category across all test, validation, and training datasets.

print("Income Category Value Counts in Train Data:")
print(X_train['Income_Category'].value_counts())
print("\nIncome Category Value Counts in Validation Data:")
print(X_val['Income_Category'].value_counts())
print("\nIncome Category Value Counts in Test Data:")
print(X_test['Income_Category'].value_counts())

In [None]:
# Impute the null values with simpleimputer

from sklearn.impute import SimpleImputer

# Create a SimpleImputer object with the 'most_frequent' strategy
imputer = SimpleImputer(strategy='most_frequent')

# Fit and transform the imputer on the training data for categorical columns
categorical_cols_with_null = ['Education_Level', 'Marital_Status', 'Income_Category']
X_train[categorical_cols_with_null] = imputer.fit_transform(X_train[categorical_cols_with_null])
X_test[categorical_cols_with_null] = imputer.transform(X_test[categorical_cols_with_null])
X_val[categorical_cols_with_null] = imputer.transform(X_val[categorical_cols_with_null])

print(X_train.isna().sum())
print('-------')
print(X_test.isna().sum())
print('-------')
print(X_val.isna().sum())

In [None]:
# Create dummy variables for categorical features in train, test, and validation sets
X_train = pd.get_dummies(X_train, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)
X_val = pd.get_dummies(X_val, drop_first=True)

print("Train set shape:", X_train.shape)
print("Test set shape:", X_test.shape)
print("Validation set shape:", X_val.shape)

# Report any dummy columns that exist in train but are missing from test or validation
for column in X_train.columns:
  if column not in X_test.columns:
    print("ERROR: " + column + " not found in X_test")
  if column not in X_val.columns:
    print("ERROR: " + column + " not found in X_val")

# Align test and validation to the train columns (same set, same order);
# a dummy column that is absent from a split is added and filled with 0
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
X_val = X_val.reindex(columns=X_train.columns, fill_value=0)

## Model Building

### Model evaluation criterion

The nature of predictions made by the classification model will translate as follows:

- True positives (TP) are customers who attrited and were correctly predicted as attrited by the model.
- False negatives (FN) are customers who attrited but were predicted by the model to stay.
- False positives (FP) are customers who stayed but were predicted by the model to attrite.

**Which metric to optimize?**

* We need to choose the metric which will ensure that the maximum number of attriting customers are predicted correctly by the model.
* We want Recall to be maximized: the greater the Recall, the fewer the false negatives.
* We want to minimize false negatives because if the model predicts that a customer will stay when they are actually about to leave, the bank loses the chance to retain that customer and the fee income that comes with them.
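
Recall is TP / (TP + FN), so it can be read directly off the confusion matrix. The sketch below uses ten made-up labels with 1 as the positive class (the attrited customer in this notebook's encoding) and compares the manual computation with `recall_score`.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels: 1 = positive class, 0 = negative class
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall_manual = tp / (tp + fn)
recall_sklearn = recall_score(y_true, y_pred)

print("TP:", tp, "FN:", fn, "FP:", fp, "TN:", tn)
print("Recall (manual):", recall_manual)
print("Recall (sklearn):", recall_sklearn)
```

Here one of the four positives is missed, so recall is 3 / 4 = 0.75; the single false positive does not affect recall at all, only precision.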

**Let's define a function to output different metrics (including recall) on the train and test set and a function to show confusion matrix so that we do not have to use the same code repetitively while evaluating models.**

In [None]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1

        },
        index=[0],
    )

    return df_perf

### Model Building with original data

Sample code for model building with original data

In [None]:
models = []  # Empty list to store all the models
model_performance_df = pd.DataFrame(columns=['Model', 'ModelName', 'Train Recall', 'Validation Recall'])

# Appending models into the list
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("Decision Tree", DecisionTreeClassifier(random_state=1)))
models.append(("AdaBoost", AdaBoostClassifier(random_state=1)))
models.append(("Gradient Boosting", GradientBoostingClassifier(random_state=1)))
models.append(("XG Boosting", XGBClassifier(random_state=1)))

print("\n" "Model Performance:" "\n")
for name, model in models:
    model.fit(X_train, y_train)
    scores = recall_score(y_train, model.predict(X_train))
    scores_val = recall_score(y_val, model.predict(X_val))
    print("Train Score: {}: {}".format(name, scores))
    print("Val Score: {}: {}".format(name, scores_val))
    # Create a temporary DataFrame for the new row
    new_row_df = pd.DataFrame({'Model': [model] , 'ModelName' :[name +' Original Sampling']  , 'Train Recall': [scores], 'Validation Recall': [scores_val]})

    # Concatenate the new row with the existing DataFrame
    model_performance_df = pd.concat([model_performance_df, new_row_df], ignore_index=True)

model_performance_df

### Model Building with Oversampled data


In [None]:
# Synthetic Minority Over Sampling Technique

sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)

X_train_over, y_train_over = sm.fit_resample(X_train, y_train)

print("Before Oversampling value counts of y_train: " + str(y_train.value_counts()))

print("\nAfter Oversampling value counts of y_train: " + str(y_train_over.value_counts()))

print("Before Oversampling shape: " + str(y_train.shape))

print("\nAfter Oversampling shape: " + str(y_train_over.shape))

In [None]:
models = []  # Empty list to store all the models

# Appending models into the list
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("Decision Tree", DecisionTreeClassifier(random_state=1)))
models.append(("AdaBoost", AdaBoostClassifier(random_state=1)))
models.append(("Gradient Boosting", GradientBoostingClassifier(random_state=1)))
models.append(("XG Boosting", XGBClassifier(random_state=1)))

print("\n" "Model Performance:" "\n")
for name, model in models:
    model.fit(X_train_over, y_train_over)
    scores = recall_score(y_train_over, model.predict(X_train_over))
    scores_val = recall_score(y_val, model.predict(X_val))
    print("Train Over Score: {}: {}".format(name, scores))
    print("Val Over Score: {}: {}".format(name, scores_val))
    # Create a temporary DataFrame for the new row
    new_row_df = pd.DataFrame({'Model': [model] , 'ModelName' :[name +' overSampling'] , 'Train Recall': [scores], 'Validation Recall': [scores_val]})

    # Concatenate the new row with the existing DataFrame
    model_performance_df = pd.concat([model_performance_df, new_row_df], ignore_index=True)

model_performance_df

### Model Building with Undersampled data

In [None]:
# Random undersampler for under sampling the data
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)

X_train_un, y_train_un = rus.fit_resample(X_train, y_train)

print("Before Undersampling value counts of y_train: " + str(y_train.value_counts()))

print("\nAfter Undersampling value counts of y_train: " + str(y_train_un.value_counts()))

print("Before Undersampling shape: " + str(y_train.shape))

print("\nAfter Undersampling shape: " + str(y_train_un.shape))

In [None]:
models = []  # Empty list to store all the models

# Appending models into the list
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("Decision Tree", DecisionTreeClassifier(random_state=1)))
models.append(("AdaBoost", AdaBoostClassifier(random_state=1)))
models.append(("Gradient Boosting", GradientBoostingClassifier(random_state=1)))
models.append(("XG Boosting", XGBClassifier(random_state=1)))

print("\n" "Model Performance:" "\n")
for name, model in models:
    model.fit(X_train_un, y_train_un)
    scores = recall_score(y_train_un, model.predict(X_train_un))
    scores_val = recall_score(y_val, model.predict(X_val))
    print("Train Under Score: {}: {}".format(name, scores))
    print("Val Under Score: {}: {}".format(name, scores_val))
    # Create a temporary DataFrame for the new row
    new_row_df = pd.DataFrame({'Model': [model] , 'ModelName' :[name +' UnderSampling'] , 'Train Recall': [scores], 'Validation Recall': [scores_val]})

    # Concatenate the new row with the existing DataFrame
    model_performance_df = pd.concat([model_performance_df, new_row_df], ignore_index=True)

model_performance_df

### Hyperparameter Tuning

#### Sample Parameter Grids

**Note**

1. Sample parameter grids have been provided to do necessary hyperparameter tuning. These sample grids are expected to provide a balance between model performance improvement and execution time. One can extend/reduce the parameter grid based on execution time and system configuration.
  - Please note that if the parameter grid is extended to improve the model performance further, the execution time will increase


- For Gradient Boosting:

```
param_grid = {
    "init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
    "n_estimators": np.arange(50,110,25),
    "learning_rate": [0.01,0.1,0.05],
    "subsample":[0.7,0.9],
    "max_features":[0.5,0.7,1],
}
```

- For Adaboost:

```
param_grid = {
    "n_estimators": np.arange(50,110,25),
    "learning_rate": [0.01,0.1,0.05],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}
```

- For Bagging Classifier:

```
param_grid = {
    'max_samples': [0.8,0.9,1.0],  # floats are fractions of the training set; an int 1 would mean a single sample
    'max_features': [0.7,0.8,0.9],
    'n_estimators' : [30,50,70],
}
```
- For Random Forest:

```
param_grid = {
    "n_estimators": [50,110,25],
    "min_samples_leaf": np.arange(1, 4),
    "max_features": [0.3, 0.4, 0.5, 'sqrt'],  # flat list: each entry is one candidate value
    "max_samples": np.arange(0.4, 0.7, 0.1)
}
```

- For Decision Trees:

```
param_grid = {
    'max_depth': np.arange(2,6),
    'min_samples_leaf': [1, 4, 7],
    'max_leaf_nodes' : [10, 15],
    'min_impurity_decrease': [0.0001,0.001]
}
```

- For XGBoost (optional):

```
param_grid={'n_estimators':np.arange(50,110,25),
            'scale_pos_weight':[1,2,5],
            'learning_rate':[0.01,0.1,0.05],
            'gamma':[1,3],
            'subsample':[0.7,0.9]
}
```
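
To judge how a grid's size relates to execution time, the number of combinations can be counted with scikit-learn's `ParameterGrid`. The sketch below does this for the Decision Tree grid above; the `n_iter` and `cv` values are the ones used in the tuning cells of this notebook.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Decision Tree grid from the sample grids above
dt_grid = {
    "max_depth": np.arange(2, 6),
    "min_samples_leaf": [1, 4, 7],
    "max_leaf_nodes": [10, 15],
    "min_impurity_decrease": [0.0001, 0.001],
}

# Total number of distinct parameter combinations in the grid
n_combinations = len(ParameterGrid(dt_grid))

# RandomizedSearchCV samples n_iter combinations and fits each one cv times
n_iter, cv = 10, 5
n_fits = n_iter * cv

print("Combinations in grid:", n_combinations)
print("Model fits in the randomized search:", n_fits)
```

With 48 combinations, an exhaustive search at 5 folds would need 240 fits, while the randomized search with `n_iter=10` needs 50 (plus one final refit), which is the trade-off the note above describes.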

#### Tuning Decision Tree with UnderSampling data

In [None]:
# defining model
Model = DecisionTreeClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = {
    'max_depth': np.arange(2,6),
    'min_samples_leaf': [1, 4, 7],
    'max_leaf_nodes' : [10, 15],
    'min_impurity_decrease': [0.0001,0.001]
}

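# Recall is the metric to optimize, so use it as the scorer
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)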

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))

In [None]:
# Creating the decision tree with the best parameters from above
tuned_dt = DecisionTreeClassifier(
    random_state=1,
    max_depth=5,
    min_samples_leaf=7,
    max_leaf_nodes=10,
    min_impurity_decrease=0.0001
)

# Fitting on the undersampled data that the parameters were tuned on
tuned_dt.fit(X_train_un, y_train_un)

print("Train recall score: " + str(recall_score(y_train_un, tuned_dt.predict(X_train_un))))
print("Validation recall score: " + str(recall_score(y_val, tuned_dt.predict(X_val))))
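
Recall is the comparison metric in every cell of this section. For reference, here is a small sketch of what `recall_score` computes in the binary case, assuming attrited customers are encoded as 1: the share of actual positives that the model flags. The helper `recall` below is a hand-written illustration, not the scikit-learn function.

```python
def recall(y_true, y_pred, positive=1):
    """Recall = true positives / (true positives + false negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp + fn == 0:
        return 0.0  # no positive examples; defined as 0 here for convenience
    return tp / (tp + fn)

# Toy labels: 4 attrited customers, of which the model catches 3.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
print(recall(y_true, y_pred))
```

The false positive in the fifth position does not affect recall, which is why a model with high recall can still raise many false alarms.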

#### Tuning XGBoost with original data

In [None]:
# Defining the XGBoost model
Model = XGBClassifier(random_state=1)

# Parameter grid to pass to RandomizedSearchCV
param_grid={'n_estimators':np.arange(50,110,25),
            'scale_pos_weight':[1,2,5],
            'learning_rate':[0.01,0.1,0.05],
            'gamma':[1,3],
            'subsample':[0.7,0.9]
}

# Scoring on recall
scorer = metrics.make_scorer(metrics.recall_score)

# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs=-1, scoring=scorer, cv=5, random_state=1)

# Fitting RandomizedSearchCV on the original training data
randomized_cv.fit(X_train, y_train)

print("Best parameters are {} with CV score={}".format(randomized_cv.best_params_, randomized_cv.best_score_))
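
The grid tries `scale_pos_weight` values of 1, 2 and 5. The XGBoost documentation suggests the ratio of negative to positive training examples as a typical starting value for imbalanced data. The sketch below computes that ratio for a list of labels; `suggested_scale_pos_weight` is a hypothetical helper, and the toy labels are made up for illustration rather than taken from the bank data.

```python
def suggested_scale_pos_weight(labels, positive=1):
    """Ratio of negative to positive examples, a common starting heuristic."""
    n_pos = sum(1 for label in labels if label == positive)
    n_neg = len(labels) - n_pos
    if n_pos == 0:
        raise ValueError("no positive examples in labels")
    return n_neg / n_pos

# Toy labels only: 5 negatives for every positive.
toy_labels = [0] * 50 + [1] * 10
print(suggested_scale_pos_weight(toy_labels))
```

Computing this ratio on `y_train` would show whether the grid's upper value of 5 is close to the actual class imbalance.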

In [None]:
# Creating the XGBoost model with best parameters from above
tuned_xgb = XGBClassifier(
    random_state=1,
    subsample=0.7,
    scale_pos_weight=5,
    n_estimators=75,
    learning_rate=0.05,
    gamma=3
)

tuned_xgb.fit(X_train, y_train)

print("Train recall score: " + str(recall_score(y_train, tuned_xgb.predict(X_train))))
print("Validation recall score: " + str(recall_score(y_val, tuned_xgb.predict(X_val))))

#### Tuning XGBoost with oversampled data

In [None]:
# Defining the model
Model = XGBClassifier(random_state=1)

# Parameter grid to pass to RandomizedSearchCV
param_grid={'n_estimators':np.arange(50,110,25),
            'scale_pos_weight':[1,2,5],
            'learning_rate':[0.01,0.1,0.05],
            'gamma':[1,3],
            'subsample':[0.7,0.9]
}

# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs=-1, scoring=scorer, cv=5, random_state=1)

# Fitting RandomizedSearchCV on the oversampled training data
randomized_cv.fit(X_train_over, y_train_over)

print("Best parameters are {} with CV score={}".format(randomized_cv.best_params_, randomized_cv.best_score_))
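
`X_train_over` is built earlier in the notebook. If it was produced with a SMOTE-style oversampler (an assumption; the resampling cell is not part of this section), each synthetic minority row is an interpolation between a real minority row and one of its nearest minority neighbours. A minimal sketch of that single step, with a hypothetical helper name and toy numbers:

```python
import random

def interpolate(point, neighbour, rng):
    """Create one synthetic row on the segment between two minority rows."""
    u = rng.random()  # uniform in [0, 1)
    return [a + u * (b - a) for a, b in zip(point, neighbour)]

rng = random.Random(1)
point = [40.0, 1000.0]      # toy values, e.g. age and credit limit
neighbour = [50.0, 3000.0]  # a nearby minority-class row
synthetic = interpolate(point, neighbour, rng)
print(synthetic)
```

Because synthetic rows are derived from real ones, cross-validation scores computed on oversampled data can look optimistic: a synthetic row in one fold may sit very close to its parent row in another. The recall on the untouched validation set is the more trustworthy number.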

#### Creating the XGBoost model with best parameters from above

In [None]:

tuned_xgb1 = XGBClassifier(
    random_state=1,
    n_estimators=75,
    scale_pos_weight=5,
    learning_rate=0.05,
    gamma=3,
    subsample=0.9
)

# Fitting on the oversampled data that the parameters were tuned on
tuned_xgb1.fit(X_train_over, y_train_over)

print("Train recall score: " + str(recall_score(y_train_over, tuned_xgb1.predict(X_train_over))))
print("Validation recall score: " + str(recall_score(y_val, tuned_xgb1.predict(X_val))))

#### Tuning XGBoost with undersampled data

In [None]:
# Defining the model
Model = XGBClassifier(random_state=1)

# Parameter grid to pass to RandomizedSearchCV
param_grid={'n_estimators':np.arange(50,110,25),
            'scale_pos_weight':[1,2,5],
            'learning_rate':[0.01,0.1,0.05],
            'gamma':[1,3],
            'subsample':[0.7,0.9]
}

# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs=-1, scoring=scorer, cv=5, random_state=1)

# Fitting RandomizedSearchCV on the undersampled training data
randomized_cv.fit(X_train_un, y_train_un)

print("Best parameters are {} with CV score={}".format(randomized_cv.best_params_, randomized_cv.best_score_))

In [None]:
# Creating the XGBoost model with the best parameters from above
tuned_xgb2 = XGBClassifier(
    random_state=1,
    n_estimators=75,
    scale_pos_weight=5,
    subsample=0.7,
    learning_rate=0.05,
    gamma=3
)

# Fitting on the undersampled data that the parameters were tuned on
tuned_xgb2.fit(X_train_un, y_train_un)

print("Train recall score: " + str(recall_score(y_train_un, tuned_xgb2.predict(X_train_un))))
print("Validation recall score: " + str(recall_score(y_val, tuned_xgb2.predict(X_val))))

#### Tuning Gradient Boosting with undersampled data

In [None]:
# Defining the model
Model = GradientBoostingClassifier(random_state=1)

# Parameter grid to pass to RandomizedSearchCV
param_grid = {
    "init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
    "n_estimators": np.arange(50,110,25),
    "learning_rate": [0.01,0.1,0.05],
    "subsample":[0.7,0.9],
    "max_features":[0.5,0.7,1],
}

# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs=-1, scoring=scorer, cv=5, random_state=1)

# Fitting RandomizedSearchCV on the undersampled training data
randomized_cv.fit(X_train_un, y_train_un)

print("Best parameters are {} with CV score={}".format(randomized_cv.best_params_, randomized_cv.best_score_))
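
The grid pairs `learning_rate` values between 0.01 and 0.1 with 50 to 100 estimators. To see why these two parameters trade off, consider an idealised case (an assumption for illustration, not how real trees behave): squared-error loss and a weak learner that predicts the current residual exactly. Each stage then removes a fraction `learning_rate` of the residual, so after m stages a fraction (1 - learning_rate)^m remains.

```python
def remaining_residual(learning_rate, n_estimators):
    """Fraction of the initial residual left after idealised boosting stages."""
    residual = 1.0
    for _ in range(n_estimators):
        # Each stage fits the residual exactly, then is shrunk by learning_rate.
        residual -= learning_rate * residual
    return residual

for lr in (0.01, 0.05, 0.1):
    print(lr, remaining_residual(lr, 100))
```

In this toy setting, 100 stages at a learning rate of 0.01 still leave about 37% of the residual, while 0.1 leaves almost none. This is why small learning rates usually need more estimators to reach the same fit.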

In [None]:
# Creating the Gradient Boosting model with the best parameters from above
tuned_gb = GradientBoostingClassifier(
    random_state=1,
    init=DecisionTreeClassifier(random_state=1),
    n_estimators=100,
    max_features=0.7,
    subsample=0.7,
    learning_rate=0.05
)

# Fitting on the undersampled data that the parameters were tuned on
tuned_gb.fit(X_train_un, y_train_un)

print("Train recall score: " + str(recall_score(y_train_un, tuned_gb.predict(X_train_un))))
print("Validation recall score: " + str(recall_score(y_val, tuned_gb.predict(X_val))))

#### Tuning AdaBoost with undersampled data

In [None]:
# Defining the model
Model = AdaBoostClassifier(random_state=1)

# Parameter grid to pass to RandomizedSearchCV
param_grid = {
    "n_estimators": np.arange(50,110,25),
    "learning_rate": [0.01,0.1,0.05],
    "estimator": [
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}

# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs=-1, scoring=scorer, cv=5, random_state=1)

# Fitting RandomizedSearchCV on the undersampled training data
randomized_cv.fit(X_train_un, y_train_un)

print("Best parameters are {} with CV score={}".format(randomized_cv.best_params_, randomized_cv.best_score_))
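
`learning_rate` in AdaBoost scales how much weight each weak learner gets and how strongly misclassified rows are up-weighted for the next round. A sketch of one round of the discrete (SAMME) update for two classes follows. The function `adaboost_round` is illustrative, and scikit-learn's internal implementation differs in detail.

```python
import math

def adaboost_round(weights, y_true, y_pred, learning_rate=1.0):
    """One discrete AdaBoost round for binary labels: learner weight and new row weights."""
    total = sum(weights)
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p) / total
    if not 0 < err < 0.5:
        raise ValueError("weak learner must have weighted error in (0, 0.5)")
    alpha = learning_rate * math.log((1 - err) / err)
    # Up-weight misclassified rows, then renormalise so the weights sum to 1.
    raw = [w * math.exp(alpha) if t != p else w
           for w, t, p in zip(weights, y_true, y_pred)]
    norm = sum(raw)
    return alpha, [w / norm for w in raw]

weights = [0.25, 0.25, 0.25, 0.25]
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]  # the second row is misclassified
alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(alpha, new_weights)
```

A smaller `learning_rate` shrinks `alpha`, so the row weights shift more gently between rounds and more estimators are typically needed.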

In [None]:
# Creating the AdaBoost model with the best parameters from above
tuned_ab = AdaBoostClassifier(
    random_state=1,
    n_estimators=75,
    learning_rate=0.1,
    estimator=DecisionTreeClassifier(max_depth=3, random_state=1)
)

# Fitting on the undersampled data that the parameters were tuned on
tuned_ab.fit(X_train_un, y_train_un)

print("Train recall score: " + str(recall_score(y_train_un, tuned_ab.predict(X_train_un))))
print("Validation recall score: " + str(recall_score(y_val, tuned_ab.predict(X_val))))

## Test set final performance

In [None]:
# Evaluate recall on the test set for all the tuned models
tuned_models = {
    "tuned_dt": tuned_dt,
    "tuned_xgb": tuned_xgb,
    "tuned_xgb1": tuned_xgb1,
    "tuned_xgb2": tuned_xgb2,
    "tuned_gb": tuned_gb,
    "tuned_ab": tuned_ab,
}
for name, model in tuned_models.items():
    print("Test recall on {}: {}".format(name, recall_score(y_test, model.predict(X_test))))


In [None]:
# Derive the important features from the tuned XGBoost model

import matplotlib.pyplot as plt

feature_importances = tuned_xgb.feature_importances_

feature_importance_df = pd.DataFrame({'Feature': X_train.columns, 'Importance': feature_importances})
feature_importance_df = feature_importance_df.sort_values('Importance', ascending=False)

# Keep the seven most important features
imp_features = feature_importance_df.head(7)

# Create a horizontal bar graph
plt.figure(figsize=(10, 6))
plt.barh(imp_features['Feature'], imp_features['Importance'])
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.title('Top 7 Important Features for Tuned XGB')
plt.gca().invert_yaxis()  # Invert y-axis to show most important feature at the top
plt.show()

## Business Insights

Four usage signals are associated with a higher likelihood of attrition. Customers showing any of them should be contacted proactively with attractive retention offers:

* Total_Trans_Ct: a declining number of transactions.
* Total_Revolving_Bal: a decreasing revolving balance.
* Total_Relationship_Count: a declining number of products held with the bank.
* Total_Ct_Chng_Q4_Q1: a falling change in transaction count from Q1 to Q4.

These factors are critical in understanding what influences customer attrition. To mitigate the risk, the marketing team should also design campaigns for different age groups, promoting suitable products to younger customers and emphasizing value to older ones.

## Conclusions

Based on these insights, the bank can develop targeted marketing campaigns for customers whose transaction count (Total_Trans_Ct) and revolving balance (Total_Revolving_Bal) are declining. These campaigns should offer personalized incentives such as exclusive discounts, cashback on purchases, or loyalty rewards.
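
One way to turn these insights into a campaign list is a simple rule on recent activity. The sketch below flags customers whose Total_Ct_Chng_Q4_Q1 value falls below a cutoff. The cutoff of 0.6, the helper name `flag_for_retention`, and the toy records are all hypothetical and would need to be chosen from the bank's own data.

```python
def flag_for_retention(customers, ratio_cutoff=0.6):
    """Return client numbers whose Total_Ct_Chng_Q4_Q1 is below the cutoff."""
    return [
        c["CLIENTNUM"]
        for c in customers
        if c["Total_Ct_Chng_Q4_Q1"] < ratio_cutoff
    ]

# Toy records, not real customers.
customers = [
    {"CLIENTNUM": 1, "Total_Ct_Chng_Q4_Q1": 0.45},
    {"CLIENTNUM": 2, "Total_Ct_Chng_Q4_Q1": 0.90},
    {"CLIENTNUM": 3, "Total_Ct_Chng_Q4_Q1": 0.55},
]
print(flag_for_retention(customers))
```

In practice such a rule could be combined with the tuned model's predictions, so that offers go first to customers who are both flagged by the rule and predicted to attrite.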


***