# Bank Customer Churn Prediction
**Team Members:** Dharani Murugesan, Jeyshree Venkatesan, Nishanthika Murugan, Vaishnavi Uma Asokkumar, Vijay Sarathy Vivekanadan

## Introduction
Customer retention is a key concern for banks. Predicting churn lets a bank proactively engage at-risk customers and prevent revenue loss. This project uses customer data from ABC Multinational Bank to build machine learning models that predict churn.

**Dataset Source:** [Kaggle - Bank Customer Churn Dataset](https://www.kaggle.com/datasets/gauravtopre/bank-customer-churn-dataset/data)


## Dataset Overview
**Features:**
- `customer_id`: Unique ID assigned to each customer
- `credit_score`: Customer's credit score
- `country`: Country of the customer
- `gender`: Male/Female
- `age`: Age of the customer
- `tenure`: Years with the bank
- `balance`: Account balance
- `products_number`: Number of products used
- `credit_card`: Whether the customer has a credit card (1/0)
- `active_member`: Whether the customer is active (1/0)
- `estimated_salary`: Estimated annual salary
- `churn`: Target variable — 1 if customer left, 0 otherwise


## Data Preprocessing
This section loads the dataset. Categorical encoding, feature scaling, and the train/test split are performed in the Random Forest cell, just before the first model is trained.


In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Load dataset
df = pd.read_csv("BankCustomerChurn.csv")  # Replace with your dataset path
df.head()
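
Before encoding, it can help to confirm that the table has no missing values or duplicated IDs. The sketch below runs those checks on a small hypothetical frame (`sample`) that borrows only a few column names from the dataset; the rows are made up for illustration. It also shows one-hot encoding with `pd.get_dummies` as a possible alternative to label encoding for nominal columns, which is an assumption on our part and not the approach used in the cells of this notebook.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from the real dataset.
sample = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "country": ["France", "Spain", "Germany", "France"],
    "gender": ["Female", "Male", "Male", "Female"],
    "balance": [0.0, 83807.86, 159660.80, 0.0],
    "churn": [1, 0, 1, 0],
})

# Count missing values per column and duplicated customer IDs.
missing_per_column = sample.isna().sum()
duplicate_ids = int(sample["customer_id"].duplicated().sum())

# One-hot encode the nominal columns so no artificial ordering is implied.
encoded = pd.get_dummies(sample, columns=["country", "gender"], drop_first=True)

print(missing_per_column)
print("Duplicate IDs:", duplicate_ids)
print(encoded.columns.tolist())
```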


## Exploratory Data Analysis (EDA)
We explore key statistics and correlations to understand how churned customers differ from those who stayed.
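
As a minimal sketch of this kind of exploration, the snippet below computes an overall churn rate, a churn rate per group, and each feature's correlation with the target. The frame `sample` and its values are hypothetical; only the column names come from the feature list above.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from the real dataset.
sample = pd.DataFrame({
    "age": [42, 41, 39, 58, 29, 50],
    "active_member": [1, 1, 0, 0, 1, 0],
    "balance": [0.0, 83807.86, 159660.80, 120000.0, 0.0, 95000.0],
    "churn": [0, 0, 1, 1, 0, 1],
})

# The mean of a 0/1 target is the share of customers who churned.
churn_rate = sample["churn"].mean()

# Churn rate split by activity status.
churn_by_activity = sample.groupby("active_member")["churn"].mean()

# Pearson correlation of each numeric feature with the target.
corr_with_churn = sample.corr()["churn"].drop("churn")

print(f"Overall churn rate: {churn_rate:.2f}")
print(churn_by_activity)
print(corr_with_churn)
```

On the real data, the same calls would run on `df` once it is loaded.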


## Model Building and Evaluation
We trained the following models:
- Logistic Regression
- Naive Bayes
- Support Vector Machine (SVM)
- K-Nearest Neighbors (KNN)
- Random Forest

Each model's performance is evaluated using Accuracy, Precision, Recall, and F1-score.
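
To make the four metrics concrete, the sketch below derives them by hand from a confusion matrix and prints them next to scikit-learn's own scores. The label vectors `y_true` and `y_hat` are hypothetical and chosen only to show that accuracy and recall on the churn class can diverge.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical labels: 1 = churned, 0 = stayed.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of flagged customers who really churned
recall = tp / (tp + fn)                      # share of actual churners that were caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy  manual={accuracy:.3f}  sklearn={accuracy_score(y_true, y_hat):.3f}")
print(f"precision manual={precision:.3f}  sklearn={precision_score(y_true, y_hat):.3f}")
print(f"recall    manual={recall:.3f}  sklearn={recall_score(y_true, y_hat):.3f}")
print(f"f1        manual={f1:.3f}  sklearn={f1_score(y_true, y_hat):.3f}")
```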


### Random Forest Classifier
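The model achieved **86.5% accuracy** with high recall (0.97), making it suitable for identifying churners. Churners are typically the minority class in this kind of data, so the recall should be read from the churn row (label 1) of the classification report rather than from an average. Random Forest also exposes feature importances, and averaging many decision trees makes it less prone to overfitting than a single tree.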


In [None]:
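from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Encode the categorical columns as integers
# (one encoder per column, so each fitted mapping stays available)
gender_encoder = LabelEncoder()
country_encoder = LabelEncoder()
df['gender'] = gender_encoder.fit_transform(df['gender'])
df['country'] = country_encoder.fit_transform(df['country'])

# Separate the features from the target; customer_id is only an identifier
X = df.drop(['customer_id', 'churn'], axis=1)
y = df['churn']

# Split before scaling so test-set statistics do not leak into training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training data only, then apply it to both splits
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Model (random_state fixed so the results are reproducible)
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))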
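
The feature importances mentioned above are available on a fitted forest through its `feature_importances_` attribute. The sketch below reads them from a forest trained on synthetic data from `make_classification`; the data and the names in `feature_names` are placeholders, not the churn features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; with shuffle=False the informative columns come first.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
feature_names = [f"feature_{i}" for i in range(5)]  # hypothetical names

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

# Impurity-based importances are non-negative and sum to 1 across features.
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

With the notebook's own objects, the equivalent would be `pd.Series(rf.feature_importances_, index=X.columns)`, assuming `X` is still the unscaled feature DataFrame.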


### Logistic Regression
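Logistic Regression is a linear model for binary classification. It passes a weighted sum of the features through the sigmoid function to estimate the probability that a customer churns, and predicts churn when that probability reaches 0.5.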


In [None]:
from sklearn.linear_model import LogisticRegression

# Reuses X_train / X_test from the Random Forest cell above
lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

print("Logistic Regression Classification Report:")
print(classification_report(y_test, y_pred_lr))
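
To illustrate how the probability estimate arises, the sketch below recomputes it by hand as the sigmoid of the linear score and compares it with `predict_proba`. It uses synthetic data from `make_classification`; the 0.3 threshold is an arbitrary illustration of how a lower cut-off flags more customers, not a recommendation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data standing in for the churn features.
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression().fit(X_demo, y_demo)

# Linear score z = w . x + b, then the sigmoid maps it into (0, 1).
z = X_demo @ clf.coef_.ravel() + clf.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-z))

# Probability of the positive class as reported by scikit-learn.
sk_proba = clf.predict_proba(X_demo)[:, 1]
print("Max difference:", np.abs(manual_proba - sk_proba).max())

# A lower decision threshold flags at least as many samples as positive.
default_flags = int((sk_proba >= 0.5).sum())
lenient_flags = int((sk_proba >= 0.3).sum())
print("Flagged at 0.5:", default_flags, "| flagged at 0.3:", lenient_flags)
```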


### Naive Bayes
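Naive Bayes is a probabilistic classifier based on Bayes' theorem. It assumes the features are conditionally independent given the class; the Gaussian variant used here models each numeric feature with a per-class normal distribution.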


In [None]:
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
nb.fit(X_train, y_train)
y_pred_nb = nb.predict(X_test)

print("Naive Bayes Classification Report:")
print(classification_report(y_test, y_pred_nb))


### Support Vector Machine (SVM)
SVM finds the hyperplane that separates the classes with the largest margin. Kernel functions let it learn non-linear decision boundaries; scikit-learn's `SVC` uses the RBF kernel by default.


In [None]:
from sklearn.svm import SVC

svm = SVC()
svm.fit(X_train, y_train)
y_pred_svm = svm.predict(X_test)

print("SVM Classification Report:")
print(classification_report(y_test, y_pred_svm))
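
The effect of a kernel is easiest to see on data that no straight line can separate. The sketch below compares a linear and an RBF kernel on scikit-learn's synthetic `make_circles` data (one class forms a ring around the other); it is an illustration on toy data, not a result for the churn dataset.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2-D space.
X_demo, y_demo = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Same classifier, different kernels; score() returns test accuracy.
linear_acc = SVC(kernel="linear").fit(X_tr, y_tr).score(X_te, y_te)
rbf_acc = SVC(kernel="rbf").fit(X_tr, y_tr).score(X_te, y_te)

print(f"Linear kernel accuracy: {linear_acc:.2f}")
print(f"RBF kernel accuracy:    {rbf_acc:.2f}")
```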


### K-Nearest Neighbors (KNN)
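KNN assigns a data point the majority class among its k nearest neighbors in the training set (k = 5 here). It is simple, but prediction can be slow on large datasets because distances to the training points are computed at query time.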


In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)

print("KNN Classification Report:")
print(classification_report(y_test, y_pred_knn))
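
The voting rule can be sketched in a few lines of NumPy. The helper `knn_predict` and the toy points below are hypothetical and exist only to show the mechanics; `KNeighborsClassifier` is what the notebook actually uses. Because the rule relies on Euclidean distance, features on larger scales dominate unless the inputs are standardized first.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Return the majority label among the k training points closest to x_query."""
    distances = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]                    # indices of the k closest
    votes = Counter(y_train[nearest].tolist())             # count labels among them
    return votes.most_common(1)[0][0]

# Two well-separated toy clusters, labelled 0 and 1.
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(points, labels, np.array([0.5, 0.5])))
print(knn_predict(points, labels, np.array([5.5, 5.5])))
```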


## Conclusion
- **Best Model:** Random Forest
- **Recall:** 0.97 — High sensitivity in identifying churners
- **Business Impact:** Enables proactive customer retention strategies

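This project demonstrates the value of predictive modeling in customer retention and the importance of selecting a model according to business goals, for example favoring recall on churners when a missed churner costs more than an unnecessary retention offer.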
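
One way to put that selection on an equal footing is to score every candidate with the same cross-validation folds and the metric that matches the goal. The sketch below does this with recall on the positive class. It runs on synthetic, imbalanced data from `make_classification`, so the ranking it prints says nothing about the churn dataset; the pipeline with `StandardScaler` is our assumption for keeping scaling inside each fold.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the churn data (roughly 20% positives).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.8, 0.2], random_state=0
)

candidates = {
    "Logistic Regression": LogisticRegression(),
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Random Forest": RandomForestClassifier(random_state=0),
}

# The pipeline refits the scaler inside each fold, so validation data never leaks in.
recall_by_model = {
    name: cross_val_score(
        make_pipeline(StandardScaler(), model), X_demo, y_demo,
        cv=5, scoring="recall",
    ).mean()
    for name, model in candidates.items()
}

# Print the candidates from highest to lowest mean recall.
for name, score in sorted(recall_by_model.items(), key=lambda item: -item[1]):
    print(f"{name}: mean recall on the positive class = {score:.3f}")
```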
