<div style="border-radius:10px; border:black solid; padding: 15px; background-color: #257180; font-size:100%; text-align:left">
<p style="font-family:Georgia; font-weight:bold; letter-spacing: 2px; color:white; font-size:200%; text-align:center;padding: 0px;"> Banking Churn Analysis & Modeling.</p>
</div>

<div style="border-radius:10px; border:#808080 solid; padding: 15px; background-color: #F0E68C; font-size:100%; text-align:left">

<h3 align="left"><font color=brown>📊 Business Objective:</font></h3>
   
- The goal is to identify which customers are most likely to churn (leave the service) and to understand the key factors driving that decision: `Churn Reduction through Predictive Analytics`.
- Churn is the process by which a customer stops doing business with a company.

<h3 align="left"><font color=brown>📊 Business Value:</font></h3>
- Customer retention is critical for a bank’s profitability. Predicting which customers are likely to churn can help the bank take proactive steps (e.g., offering personalized services or incentives) to retain valuable customers.

</div>

## **Import Needed Libraries**

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
sns.set(style="darkgrid",font_scale=1.5)
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")


from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from imblearn.over_sampling import SMOTE

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, ExtraTreesClassifier, AdaBoostClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier

from sklearn.metrics import accuracy_score, precision_score, recall_score ,f1_score, confusion_matrix, classification_report

## **Data Representation**

In [None]:
df = pd.read_csv('/kaggle/input/churn-modelling/Churn_Modelling.csv')

In [None]:
df.head()

In [None]:
df.info()

## **Data Wrangling**

**Check Duplicated Values**

In [None]:
df.duplicated().sum()

**Check Missing Values**

In [None]:
df.isna().sum()

**Check Outlier Values**

In [None]:
check_outlier = ['CreditScore', 'Age', 'Balance', 'EstimatedSalary']

In [None]:
plt.figure(figsize = (15, 10))
for ind, val in enumerate(check_outlier):
    plt.subplot(2,2, ind + 1)
    plt.boxplot(df[val], vert=False)
    plt.title(f'Boxplot of {val}')

plt.tight_layout()
plt.show()
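
The boxplots above flag points visually. As a rough numeric companion, the sketch below counts values outside the 1.5 × IQR fences, which is the default rule boxplots use to mark fliers. The helper name `iqr_outlier_mask` and the toy series are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside the k * IQR fences."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)


# Toy data standing in for a numeric column such as df['Age']
toy = pd.Series([1, 2, 3, 4, 5, 100])
outlier_mask = iqr_outlier_mask(toy)
print(toy[outlier_mask].tolist())
```

Applied to the real data, the same helper could be looped over the columns in `check_outlier` to get a count per column.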

## **Exploratory Data Analysis**

- **The goal of EDA is to better understand the distributions of features, identify trends or patterns, and explore relationships between features and the target variable (Exited)**

#### **Univariate Analysis**

**CreditScore Column**

In [None]:
df['CreditScore'].describe()

In [None]:
plt.figure(figsize = (15, 8))
sns.kdeplot(df['CreditScore'])
plt.title('Credit Score Distribution', fontsize = 20)
plt.xlabel('CreditScore')
plt.ylabel('Density')
plt.show()

- The distribution of credit scores is slightly skewed to the left: most customers sit in the upper part of the range (closer to 800), with a longer tail toward low scores.
- This suggests that most customers have a reasonably good credit history.
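
The direction of skew can be checked numerically rather than read off the plot. The sketch below uses `Series.skew()` on two made-up series: a negative value means a longer left tail, a positive value a longer right tail. The toy numbers are assumptions for illustration only.

```python
import pandas as pd

# Mass at high values with a tail toward low values -> left (negative) skew
left_skewed = pd.Series([1, 8, 9, 9, 10, 10, 10])
# Mass at low values with a tail toward high values -> right (positive) skew
right_skewed = pd.Series([1, 1, 1, 2, 2, 3, 10])

print("left-skewed sample:", left_skewed.skew())
print("right-skewed sample:", right_skewed.skew())
```

On the notebook data, `df['CreditScore'].skew()` would give the corresponding figure for the credit score column.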

**Geography Column**

In [None]:
df['Geography'].value_counts()

In [None]:
plt.figure(figsize = (15, 6))
plt.subplot(1,2, 1)
ax = sns.countplot(data = df, x = 'Geography', palette = 'Set2')
for container in ax.containers:
    ax.bar_label(container, fontweight="black", size = 15)
plt.title("Customer Geography Distribution",fontweight="black",size=20,pad=20)


plt.subplot(1,2,2)
plt.pie(df['Geography'].value_counts(), autopct = '%1.1f%%', labels = df['Geography'].value_counts().index, explode = [0,0,0.1], textprops={"fontweight":"black"})
plt.title("Customer Geography Distribution",fontweight="black",size=20,pad=20)

plt.show()

**Gender Column**

In [None]:
df['Gender'].value_counts()

In [None]:
plt.figure(figsize = (15, 6))
plt.subplot(1,2, 1)
ax = sns.countplot(data = df, x = 'Gender', palette = 'Set2')
for container in ax.containers:
    ax.bar_label(container, fontweight="black", size = 15)
plt.title("Customer Gender Distribution",fontweight="black",size=20,pad=20)


plt.subplot(1,2,2)
plt.pie(df['Gender'].value_counts(), autopct = '%1.1f%%', labels = df['Gender'].value_counts().index, explode = [0,0.1], textprops={"fontweight":"black"})
plt.title("Customer Gender Distribution",fontweight="black",size=20,pad=20)

plt.show()

**Age Column**

In [None]:
df['Age'].describe()

In [None]:
plt.figure(figsize = (15, 8))
sns.kdeplot(df['Age'], fill = True, palette = 'Set2')
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Density')
plt.show()

- The age distribution is concentrated around 30–50 years, with fewer customers in the older age range.
- This suggests that most customers are middle-aged, which might be relevant depending on the bank’s target demographic.

**Tenure Column**

In [None]:
df['Tenure'].value_counts()

In [None]:
plt.figure(figsize = (15, 6))
ax = sns.countplot(data = df, x = 'Tenure', palette = 'Set2')
for container in ax.containers:
    ax.bar_label(container, fontweight="black", size = 15)
plt.title("Customer Tenure Distribution",fontweight="black",size=20,pad=20)
plt.show()

- The uniform distribution suggests that the bank has a balanced customer base across different tenure lengths.

**Balance Column**

In [None]:
plt.figure(figsize = (15, 8))
sns.kdeplot(data = df, x = 'Balance', fill = True, palette = 'Set2')
plt.title('Balance Distribution')
plt.xlabel('Balance')
plt.ylabel('Density')
plt.show()

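- The Balance column has a bimodal distribution, which suggests two distinct customer segments:
  - Low-balance customers, who may be more likely to churn.
  - High-balance customers, who may be more engaged and loyal.
- These are hypotheses at this stage; they should be checked against `Exited` in the bivariate analysis.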
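
One way to test the segment idea above is to compare churn rates between the two groups directly. The sketch below splits a made-up frame into zero-balance and positive-balance customers and takes the mean of `Exited` per group. The toy values and the zero/positive split are assumptions for illustration and say nothing about the real dataset.

```python
import pandas as pd

# Made-up rows that only mimic the column names of the notebook's DataFrame
toy = pd.DataFrame({
    "Balance": [0.0, 0.0, 0.0, 0.0, 95000.0, 120000.0, 130000.0, 150000.0],
    "Exited":  [0, 0, 0, 1, 1, 0, 1, 0],
})

# Label each customer by segment, then average the 0/1 target per segment
segment = toy["Balance"].gt(0).map({False: "zero balance", True: "positive balance"})
churn_by_segment = toy.groupby(segment)["Exited"].mean()
print(churn_by_segment)
```

Because `Exited` is 0/1, its mean within a group is that group's churn rate.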

**NumOfProducts Column**

In [None]:
df['NumOfProducts'].value_counts()

In [None]:
plt.figure(figsize = (10, 6))
ax = sns.countplot(data = df, x = 'NumOfProducts', palette = 'Set2')
for container in ax.containers:
    ax.bar_label(container, fontweight = 'black', size = 15)
plt.title('Number of Customer Products Distribution', fontweight = 'black', size = 20, pad = 20)
plt.show()

**HasCrCard Column**

In [None]:
df['HasCrCard'].value_counts()

In [None]:
plt.figure(figsize = (15, 6))
plt.subplot(1,2, 1)
ax = sns.countplot(data = df, x = 'HasCrCard', palette = 'Set2')
for container in ax.containers:
    ax.bar_label(container, fontweight="black", size = 15)
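plt.title("Does the customer have a credit card?",fontweight="black",size=20,pad=20)


plt.subplot(1,2,2)
# Map the 0/1 index to labels so they stay correct whatever order value_counts() returns;
# sns.set_palette() returns None, so sns.color_palette() supplies the actual colors
plt.pie(df['HasCrCard'].value_counts(), autopct = '%1.1f%%', labels = df['HasCrCard'].value_counts().index.map({1: 'Yes', 0: 'No'}), explode = [0,0.1], colors=sns.color_palette("Set2"), textprops={"fontweight":"black"})
plt.title("Does the customer have a credit card?",fontweight="black",size=20,pad=20)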

plt.show()

**IsActiveMember Column**

In [None]:
df['IsActiveMember'].value_counts()

In [None]:
plt.figure(figsize = (15, 6))
plt.subplot(1,2, 1)
ax = sns.countplot(data = df, x = 'IsActiveMember', palette = 'Set2')
for container in ax.containers:
    ax.bar_label(container, fontweight="black", size = 15)
plt.title("Is the customer an active member?",fontweight="black",size=20,pad=20)


plt.subplot(1,2,2)
# Map the 0/1 index to labels so they stay correct whatever order value_counts() returns
plt.pie(df['IsActiveMember'].value_counts(), autopct = '%1.1f%%', labels = df['IsActiveMember'].value_counts().index.map({1: 'Yes', 0: 'No'}), explode = [0,0.1], textprops={"fontweight":"black"})
plt.title("Is the customer an active member?",fontweight="black",size=20,pad=20)

plt.show()

**EstimatedSalary Column**

In [None]:
df['EstimatedSalary'].describe()

In [None]:
plt.figure(figsize = (15, 8))
sns.kdeplot(df['EstimatedSalary'], fill = True, palette = 'Set2')
plt.title('Estimated Salary Distribution')
plt.xlabel("Salary")
plt.ylabel('Density')
plt.show()

**Exited Column**

In [None]:
df['Exited'].value_counts()

In [None]:
plt.figure(figsize = (15, 6))
plt.subplot(1,2, 1)
ax = sns.countplot(data = df, x = 'Exited', palette = 'Set2')
for container in ax.containers:
    ax.bar_label(container, fontweight="black", size = 15)
plt.title("Customer Churned Distribution",fontweight="black",size=20,pad=20)


plt.subplot(1,2,2)
plt.pie(df['Exited'].value_counts(), autopct = '%1.1f%%', labels = df['Exited'].value_counts().index.map({0: 'No', 1: 'Yes'}), explode = [0,0.1], textprops={"fontweight":"black"})
plt.title("Customer Churned Distribution",fontweight="black",size=20,pad=20)
plt.show()

#### **Bivariate Analysis**

**Numerical Features vs. Churn (Exited)**

In [None]:
cols = ['CreditScore', 'Age', 'Balance', 'EstimatedSalary']

In [None]:
plt.figure(figsize = (15, 10))
for ind, val in enumerate(cols):
    plt.subplot(2,2, ind + 1)
    sns.boxplot(data = df, x = 'Exited', y = val, palette = 'Set2')
    plt.title(f'{val} vs Exited (Churn)')
    plt.xlabel("Churn (Exited)")

plt.tight_layout()
plt.show()

**Categorical Features vs. Churn (Exited)**

In [None]:
cat_cols = ['Geography', 'Gender', 'NumOfProducts', 'HasCrCard', 'IsActiveMember']

In [None]:
plt.figure(figsize = (15, 10))
for ind, val in enumerate(cat_cols):
    plt.subplot(3,2, ind + 1)
    sns.countplot(data = df, x = val, hue = 'Exited', palette = 'Set2')
    plt.title(f'{val} vs Exited (Churn)')
    plt.xlabel(val)
    plt.ylabel('Count')

plt.tight_layout()
plt.show()
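
Raw counts can hide differences when categories have very different sizes, so it may help to look at churn rates per category as well. The sketch below uses `pd.crosstab` with `normalize="index"` on made-up rows; the values are assumptions for illustration only.

```python
import pandas as pd

# Made-up rows that only mimic the notebook's column names
toy = pd.DataFrame({
    "Geography": ["France", "France", "France", "France",
                  "Germany", "Germany", "Spain", "Spain"],
    "Exited":    [0, 0, 0, 1, 1, 0, 0, 0],
})

# Each row sums to 1: the share of stayed (0) and churned (1) within a category
rates = pd.crosstab(toy["Geography"], toy["Exited"], normalize="index")
print(rates)
```

The column labelled `1` is the churn rate for each category, which can be compared across categories regardless of how many customers each one has.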

#### **Multivariate Analysis**

In [None]:
cols_correlation = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary', 'Exited']
corr = df[cols_correlation].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot = True, fmt = '0.2f', cmap = 'coolwarm', lw = 0.5)
plt.title('Correlation Matrix of Features and Target (Exited)')
plt.show()
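
A heatmap shows every pair at once; when the interest is mainly in the target, ranking features by their correlation with `Exited` can be easier to read. The sketch below does this on made-up rows. Note that Pearson correlation with a binary target only captures linear association, so this is a quick screen rather than a measure of feature importance.

```python
import pandas as pd

# Made-up rows that only mimic the notebook's column names
toy = pd.DataFrame({
    "CreditScore":   [600, 700, 650, 640, 710, 620],
    "Age":           [25, 32, 41, 47, 55, 60],
    "NumOfProducts": [2, 2, 1, 1, 1, 1],
    "Exited":        [0, 0, 0, 1, 1, 1],
})

# Correlation of every feature with the target, strongest (by magnitude) first
target_corr = (
    toy.corr()["Exited"]
    .drop("Exited")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(target_corr)
```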

## **Data Preprocessing**

**Drop Unneeded Columns**

In [None]:
df.drop(['RowNumber', 'CustomerId', 'Surname'], axis = 1, inplace=True)

**Define Features `X` and Target `y`**

In [None]:
X = df.drop('Exited', axis = 1)
y = df['Exited']

**Data Encoding**

**The right encoding depends on whether a categorical variable is nominal (unordered) or ordinal (ordered). `Geography` and `Gender` are nominal, so they are one-hot encoded below.**
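
To make the nominal/ordinal distinction concrete, here is a minimal sketch on a toy frame. `Geography` is treated as nominal and one-hot encoded; `RiskBand` is a hypothetical ordered column (it does not exist in the dataset) mapped to integers that preserve its order.

```python
import pandas as pd

toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],  # nominal: no natural order
    "RiskBand":  ["low", "high", "medium", "low"],           # hypothetical ordinal column
})

# Nominal -> one indicator column per category
one_hot = pd.get_dummies(toy[["Geography"]], columns=["Geography"], dtype=int)

# Ordinal -> integers that keep the low < medium < high ordering
risk_order = {"low": 0, "medium": 1, "high": 2}
ordinal = toy["RiskBand"].map(risk_order)

print(one_hot)
print(ordinal.tolist())
```

Integer codes on a nominal column would impose an order that does not exist, which is why one-hot encoding is the safer default there.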

In [None]:
X = pd.get_dummies(X, columns=['Geography', 'Gender'], drop_first=False)

In [None]:
X.head()

**Data Scaling**

In [None]:
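cols_scaling = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts']

# fit_transform returns a new array, so assign it back; otherwise X stays unscaled
scaler = StandardScaler()
X[cols_scaling] = scaler.fit_transform(X[cols_scaling])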

**Data Splitting**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
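
A common way to keep test-set statistics out of the scaler is to split first, fit the scaler on the training portion only, and reuse it to transform the test portion. The sketch below shows that order of operations on synthetic data; the column values, the `_toy` names and the `stratify` argument are assumptions, not the notebook's actual pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_toy = pd.DataFrame({
    "CreditScore": rng.normal(650, 100, size=200),
    "Balance": rng.normal(75000, 60000, size=200),
})
y_toy = pd.Series(rng.integers(0, 2, size=200))

# Split first; stratify keeps the class ratio similar in both parts
X_train_toy, X_test_toy, y_train_toy, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
X_train_toy, X_test_toy = X_train_toy.copy(), X_test_toy.copy()

cols = ["CreditScore", "Balance"]
toy_scaler = StandardScaler()
# Learn mean and standard deviation from the training rows only
X_train_toy[cols] = toy_scaler.fit_transform(X_train_toy[cols])
# Apply the same training statistics to the test rows
X_test_toy[cols] = toy_scaler.transform(X_test_toy[cols])

print(X_train_toy[cols].mean().round(6).tolist())
```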

**Data Balancing**
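- SMOTE oversamples the minority class (churned customers) by generating synthetic examples. It is applied to the training set only, so the test set keeps the original class distribution.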

In [None]:
smote = SMOTE(random_state=42)  # fixed seed so the resampled set is reproducible

X_train, y_train = smote.fit_resample(X_train, y_train)

print("Resampled X_train shape:", X_train.shape)
print("Resampled y_train shape:", y_train.shape)
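
For intuition about what SMOTE generates, the sketch below illustrates the interpolation step at its core: a synthetic row is placed at a random position on the line segment between a minority row and a neighbouring minority row. This is a simplified illustration, not imblearn's implementation: it uses only the single nearest neighbour, whereas SMOTE picks among several nearest neighbours, and the function name and toy points are assumptions.

```python
import numpy as np


def interpolate_minority(minority: np.ndarray, n_new: int,
                         rng: np.random.Generator) -> np.ndarray:
    """Create n_new rows, each between a minority row and its nearest minority neighbour."""
    # Pairwise Euclidean distances between all minority rows
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)  # a row is not its own neighbour
    nearest = dists.argmin(axis=1)

    base_idx = rng.integers(0, len(minority), size=n_new)
    lam = rng.random(size=(n_new, 1))  # position along the segment, in [0, 1)
    base = minority[base_idx]
    neighbour = minority[nearest[base_idx]]
    return base + lam * (neighbour - base)


rng = np.random.default_rng(0)
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])
synthetic = interpolate_minority(minority, n_new=5, rng=rng)
print(synthetic.round(2))
```

Because every synthetic row is a blend of two real minority rows, the new points stay inside the region the minority class already occupies.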

## **Machine Learning Models**

**Logistic Regression model**

In [None]:
lr = LogisticRegression(max_iter = 1000)
lr.fit(X_train, y_train)

In [None]:
# Train Score
lr.score(X_train, y_train)

In [None]:
# Test Score
lr.score(X_test, y_test)

In [None]:
# Make predictions on the test set
y_pred = lr.predict(X_test)

In [None]:
# Get our metrics
lr_acc = accuracy_score(y_test, y_pred)
lr_per = precision_score(y_test, y_pred)
lr_rec = recall_score(y_test, y_pred)

In [None]:
# Create a confusion matrix
con_matrix = confusion_matrix(y_test, y_pred)

# Visualize the confusion matrix with a heatmap
sns.heatmap(con_matrix, annot=True, fmt='1.0f', cmap='Blues', lw=.5, cbar=False)
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
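
Accuracy, precision and recall can all be read off the four cells of the confusion matrix, which scikit-learn lays out as `[[TN, FP], [FN, TP]]` for a binary target. The sketch below works through the arithmetic, and adds F1, on made-up counts; the numbers and the helper name are illustrative assumptions.

```python
def metrics_from_confusion(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # of predicted churners, how many churned
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # of actual churners, how many were caught
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up counts, not results from the notebook
example = metrics_from_confusion(tn=40, fp=10, fn=20, tp=30)
print(example)
```

For churn, recall is often the figure to watch, since a missed churner (a false negative) is a customer the bank never gets the chance to retain.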

**SVC Model**

In [None]:
# Define hyperparameter grid
hyper_param = {'kernel': ['linear', 'poly', 'rbf'],
              'C': [0.1, 1, 10]}

svm = SVC()

In [None]:
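# # Perform Grid Search with cross-validation (fit on the training data only,
# # so the test set stays unseen during model selection)
# grid_search = GridSearchCV(svm, hyper_param, cv=5)
# grid_search.fit(X_train, y_train)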
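
Since the SVC cells are commented out, here is a small runnable sketch of the same idea on synthetic data. It wraps scaling and the SVC in a `Pipeline` so the scaler is refit inside each cross-validation fold, and it fits the search on the training split only. The synthetic data, the reduced kernel list and `scoring="f1"` are assumptions chosen to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the churn features and target
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Scaling lives inside the pipeline, so each CV fold scales with its own training rows
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])
param_grid = {"svc__kernel": ["linear", "rbf"], "svc__C": [0.1, 1, 10]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X_tr, y_tr)

print(search.best_params_)
test_f1 = search.score(X_te, y_te)  # F1 of the refit best model on held-out data
print(round(test_f1, 3))
```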

In [None]:
# # Get best model
# best_model = grid_search.best_estimator_
# best_model

In [None]:
# # Train Score
# best_model.score(X_train, y_train)

In [None]:
# # Test Score
# best_model.score(X_test, y_test)

In [None]:
# y_pred = best_model.predict(X_test)

In [None]:
# # Get our metrics
# svc_acc = accuracy_score(y_test, y_pred)
# svc_per = precision_score(y_test, y_pred)
# svc_rec = recall_score(y_test, y_pred)

In [None]:
# # Create a confusion matrix
# conf_matrix = confusion_matrix(y_test, y_pred)

# # Visualize the confusion matrix with a heatmap
# plt.figure(figsize=(6, 4))
# sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', linewidths=.5, cbar=False)
# plt.title('Confusion Matrix')
# plt.xlabel('Predicted')
# plt.ylabel('Actual')
# plt.show()

**K-Neighbors Classifier**

In [None]:
knn = KNeighborsClassifier(n_neighbors=3)

knn.fit(X_train, y_train)

In [None]:
knn.score(X_train, y_train)

In [None]:
knn.score(X_test, y_test)

In [None]:
y_pred = knn.predict(X_test)

In [None]:
# Get our metrics
knn_acc = accuracy_score(y_test, y_pred)
knn_per = precision_score(y_test, y_pred)
knn_rec = recall_score(y_test, y_pred)

In [None]:
# Create a confusion matrix
con_matrix = confusion_matrix(y_test, y_pred)

# Visualize the confusion matrix with a heatmap
sns.heatmap(con_matrix, annot=True, fmt='1.0f', cmap='Blues', lw=.5, cbar=False)
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

**Decision Tree Classifier**

In [None]:
clf = DecisionTreeClassifier(max_depth=4, min_impurity_decrease=0.01, random_state=42)

# Train the model
clf.fit(X_train, y_train)

In [None]:
# Train Score
clf.score(X_train, y_train)

In [None]:
# Test Score
clf.score(X_test, y_test)

In [None]:
# Make predictions on the test set
y_pred = clf.predict(X_test)

In [None]:
# Get our metrics
clf_acc = accuracy_score(y_test, y_pred)
clf_per = precision_score(y_test, y_pred)
clf_rec = recall_score(y_test, y_pred)

In [None]:
# Create a confusion matrix
con_matrix = confusion_matrix(y_test, y_pred)

# Visualize the confusion matrix with a heatmap
sns.heatmap(con_matrix, annot=True, fmt='1.0f', cmap='Blues', lw=.5, cbar=False)
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

**RandomForestClassifier Model**

In [None]:
RF_classifier = RandomForestClassifier(n_estimators=100, min_impurity_decrease=0.01)

RF_classifier.fit(X_train, y_train)

In [None]:
# Train score
RF_classifier.score(X_train, y_train)

In [None]:
# Test score
RF_classifier.score(X_test, y_test)

In [None]:
# Get prediction
y_pred = RF_classifier.predict(X_test)

In [None]:
# Get our metrics
rf_acc = accuracy_score(y_test, y_pred)
rf_per = precision_score(y_test, y_pred)
rf_rec = recall_score(y_test, y_pred)

In [None]:
# Create a confusion matrix
con_matrix = confusion_matrix(y_test, y_pred)

# Visualize the confusion matrix with a heatmap
sns.heatmap(con_matrix, annot=True, fmt='1.0f', cmap='Blues', lw=.5, cbar=False)
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

**Bagging Classifier**

In [None]:
# Create a base decision tree classifier
base_classifier = DecisionTreeClassifier(min_impurity_decrease=0.01)

# Create a bagging classifier with decision trees
bagged_classifier = BaggingClassifier(base_classifier, n_estimators=10)

# Train the bagged classifier on your data
bagged_classifier.fit(X_train, y_train)

In [None]:
bagged_classifier.score(X_train, y_train)


In [None]:
bagged_classifier.score(X_test, y_test)

In [None]:
# Make predictions
y_pred = bagged_classifier.predict(X_test)

In [None]:
# Get our metrics
bag_acc = accuracy_score(y_test, y_pred)
bag_per = precision_score(y_test, y_pred)
bag_rec = recall_score(y_test, y_pred)

In [None]:
# Create a confusion matrix
con_matrix = confusion_matrix(y_test, y_pred)

# Visualize the confusion matrix with a heatmap
sns.heatmap(con_matrix, annot=True, fmt='1.0f', cmap='Blues', lw=.5, cbar=False)
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

**ExtraTreesClassifier**

In [None]:
# ExtraTreesClassifier Model
ET_classifier = ExtraTreesClassifier(n_estimators=100, min_impurity_decrease=0.01)

ET_classifier.fit(X_train, y_train)

In [None]:
# Train score
ET_classifier.score(X_train, y_train)

In [None]:
# Test score
ET_classifier.score(X_test, y_test)

In [None]:
# Get prediction
y_pred_et = ET_classifier.predict(X_test)

In [None]:
# Get our metrics
et_acc = accuracy_score(y_test, y_pred_et)
et_per = precision_score(y_test, y_pred_et)
et_rec = recall_score(y_test, y_pred_et)

In [None]:
# Create a confusion matrix for ExtraTreesClassifier
con_matrix_et = confusion_matrix(y_test, y_pred_et)

# Visualize the confusion matrix with a heatmap
plt.figure(figsize=(6, 4))
sns.heatmap(con_matrix_et, annot=True, fmt='1.0f', cmap='Blues', lw=.5, cbar=False)
plt.title('Confusion Matrix (ExtraTreesClassifier)')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

**AdaBoostClassifier**

In [None]:
# Create AdaBoostClassifier Model
AB_classifier = AdaBoostClassifier(n_estimators=100)

AB_classifier.fit(X_train, y_train)

In [None]:
# Train score
AB_classifier.score(X_train, y_train)

In [None]:
# Test score
AB_classifier.score(X_test, y_test)

In [None]:
# Get prediction
y_pred_ab = AB_classifier.predict(X_test)

In [None]:
# Get our metrics
ada_acc = accuracy_score(y_test, y_pred_ab)
ada_per = precision_score(y_test, y_pred_ab)
ada_rec = recall_score(y_test, y_pred_ab)

In [None]:
# Create a confusion matrix for AdaBoostClassifier
con_matrix_ab = confusion_matrix(y_test, y_pred_ab)

# Visualize the confusion matrix with a heatmap
plt.figure(figsize=(6, 4))
sns.heatmap(con_matrix_ab, annot=True, fmt='1.0f', cmap='Blues', linewidths=.5, cbar=False)
plt.title('Confusion Matrix (AdaBoostClassifier)')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

**XGBClassifier Model**

In [None]:
# Create XGBClassifier Model
XGB_classifier = XGBClassifier(n_estimators=200)

XGB_classifier.fit(X_train, y_train)

In [None]:
# Train score
XGB_classifier.score(X_train, y_train)

In [None]:
# Test score
XGB_classifier.score(X_test, y_test)

In [None]:
# Get prediction
y_pred_xgb = XGB_classifier.predict(X_test)

In [None]:
# Get our metrics
xgb_acc = accuracy_score(y_test, y_pred_xgb)
xgb_per = precision_score(y_test, y_pred_xgb)
xgb_rec = recall_score(y_test, y_pred_xgb)

In [None]:
# Create a confusion matrix for XGBClassifier
con_matrix_xgb = confusion_matrix(y_test, y_pred_xgb)

# Visualize the confusion matrix with a heatmap
plt.figure(figsize=(6, 4))
sns.heatmap(con_matrix_xgb, annot=True, fmt='1.0f', cmap='Blues', lw=.5, cbar=False)
plt.title('Confusion Matrix (XGBClassifier)')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

**CatBoostClassifier**

In [None]:
# Create CatBoostClassifier Model
CatBoost_classifier = CatBoostClassifier(iterations=100, learning_rate=0.1, depth=6)

CatBoost_classifier.fit(X_train, y_train)

In [None]:
# Train score
CatBoost_classifier.score(X_train, y_train)

In [None]:
# Test score
CatBoost_classifier.score(X_test, y_test)

In [None]:
# Get prediction
y_pred_catboost = CatBoost_classifier.predict(X_test)

In [None]:
# Get our metrics
cat_acc = accuracy_score(y_test, y_pred_catboost)
cat_per = precision_score(y_test, y_pred_catboost)
cat_rec = recall_score(y_test, y_pred_catboost)

In [None]:
# Create a confusion matrix for CatBoostClassifier
con_matrix_catboost = confusion_matrix(y_test, y_pred_catboost)

# Visualize the confusion matrix with a heatmap
plt.figure(figsize=(6, 4))
sns.heatmap(con_matrix_catboost, annot=True, fmt='1.0f', cmap='Blues', lw=.5, cbar=False)
plt.title('Confusion Matrix (CatBoostClassifier)')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

**LGBMClassifier**

In [None]:
# Create LGBMClassifier Model
LGBM_classifier = LGBMClassifier(n_estimators=200)
LGBM_classifier.fit(X_train, y_train)

In [None]:
# Train score
LGBM_classifier.score(X_train, y_train)

In [None]:
# Test score
LGBM_classifier.score(X_test, y_test)

In [None]:
# Get prediction
y_pred_lgbm = LGBM_classifier.predict(X_test)

In [None]:
# Get our metrics
lgbm_acc = accuracy_score(y_test, y_pred_lgbm)
lgbm_per = precision_score(y_test, y_pred_lgbm)
lgbm_rec = recall_score(y_test, y_pred_lgbm)

In [None]:
# Create a confusion matrix for LGBMClassifier
con_matrix_lgbm = confusion_matrix(y_test, y_pred_lgbm)

# Visualize the confusion matrix with a heatmap
plt.figure(figsize=(6, 4))
sns.heatmap(con_matrix_lgbm, annot=True, fmt='1.0f', cmap='Blues', lw=.5, cbar=False)
plt.title('Confusion Matrix (LGBMClassifier)')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
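
The fit, score, predict and metrics steps repeat for every model above. One possible way to compare models side by side is a small helper that returns the metrics as a dictionary, collected into a single table. The sketch below runs on synthetic data with three of the model types used above; the `evaluate` helper, the synthetic data and the class weights are assumptions, and the printed numbers are not results from the churn dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def evaluate(model, X_test, y_test) -> dict:
    """Score a fitted classifier on held-out data."""
    y_pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
        "f1": f1_score(y_test, y_pred, zero_division=0),
    }


# Imbalanced synthetic data standing in for the churn features and target
X_toy, y_toy = make_classification(
    n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(max_depth=4, random_state=42),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# One row per model, one column per metric, best F1 first
rows = {name: evaluate(m.fit(X_tr, y_tr), X_te, y_te) for name, m in models.items()}
summary = pd.DataFrame(rows).T.sort_values("f1", ascending=False)
print(summary.round(3))
```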