# **Title of Project**

Churn Prediction in Banking

## **Objective**

The objective of this project is to predict customer churn in a banking dataset using various machine learning algorithms and identify the best-performing model for future predictions.

## **Data Source**

The dataset used in this project is 'Churn_Modelling.csv'.

## **Import Library**

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import warnings
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
import xgboost as xgb
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import joblib


## **Import Data**

In [None]:
df = pd.read_csv('Churn_Modelling.csv')


## **Describe Data**

In [None]:
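# A notebook cell displays only its last expression, so print each summary explicitly
df.info()
print(df.describe(include='all'))
print(df.isnull().sum())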


## **Data Visualization**

In [None]:
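# Plot the class balance of the target; recent seaborn versions require keyword arguments here
sns.countplot(x='Exited', data=df)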


## **Data Preprocessing**

In [None]:
df = df.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1)
df = pd.get_dummies(df, drop_first=True)
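
To make the encoding step concrete, here is a minimal pure-Python sketch of what `pd.get_dummies(..., drop_first=True)` does to a single categorical column. The column name `Geography` and its values are hypothetical stand-ins rather than values read from the dataset, and `one_hot_drop_first` is an illustrative helper, not a pandas function.

```python
def one_hot_drop_first(values, prefix):
    """Encode category labels as 0/1 indicator columns,
    omitting the first (sorted) category as the baseline."""
    levels = sorted(set(values))
    kept = levels[1:]  # the dropped level is implied when every indicator is 0
    return {f"{prefix}_{level}": [int(v == level) for v in values] for level in kept}


# Hypothetical column; the real dataset's columns may differ
geography = ["France", "Spain", "Germany", "France"]
encoded = one_hot_drop_first(geography, "Geography")
print(encoded)
```

Dropping one level removes a redundant column (the indicators for a full set of levels always sum to 1), which mainly matters for linear models such as logistic regression.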


## **Define Target Variable (y) and Feature Variables (X)**

In [None]:
X = df.drop('Exited', axis=1)
y = df['Exited']


## **Train Test Split**

In [None]:
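# Note: SMOTE runs before the split, so synthetic rows can land in the test set and inflate the scores
X_res, y_res = SMOTE(random_state=47).fit_resample(X, y)
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size=0.3, random_state=47)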
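
The cell above resamples before splitting. A common alternative is to split first and oversample only the training portion, so that the test set contains only original rows. The sketch below illustrates that ordering in plain Python on made-up labels. Simple random duplication stands in for SMOTE here, and the helper names `split_indices` and `oversample_minority` are hypothetical.

```python
import random


def split_indices(n, test_fraction, seed):
    """Shuffle row indices and split them into train and test lists."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = int(n * test_fraction)
    return idx[n_test:], idx[:n_test]


def oversample_minority(indices, labels, seed):
    """Duplicate randomly chosen minority-class rows until both classes
    have the same count. (Random duplication, not SMOTE interpolation.)"""
    rng = random.Random(seed)
    pos = [i for i in indices if labels[i] == 1]
    neg = [i for i in indices if labels[i] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:  # nothing to duplicate
        return list(indices)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(indices) + extra


# Made-up labels: 80 retained customers (0) and 20 churned customers (1)
labels = [0] * 80 + [1] * 20

# 1) Split first, 2) rebalance the training rows only
train_idx, test_idx = split_indices(len(labels), test_fraction=0.3, seed=47)
balanced_train_idx = oversample_minority(train_idx, labels, seed=47)

train_pos = sum(labels[i] for i in balanced_train_idx)
train_neg = len(balanced_train_idx) - train_pos
print(f"train: {train_pos} churned / {train_neg} retained, test rows: {len(test_idx)}")
```

With this ordering the test rows keep the original class ratio, so test metrics describe performance on data that resembles what the model would see in practice.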


## **Modeling**

In [None]:
# Fit the scaler on the training split only, then apply the same scaling to the test split
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Logistic Regression
log = LogisticRegression()
log.fit(X_train, y_train)

# SVM
svm_model = svm.SVC()
svm_model.fit(X_train, y_train)

# KNeighbors Classifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

# Decision Tree Classifier
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

# Random Forest Classifier
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

# Gradient Boosting Classifier
gbc = GradientBoostingClassifier()
gbc.fit(X_train, y_train)

# XGBoost
model_xgb = xgb.XGBClassifier(random_state=42, verbosity=0)
model_xgb.fit(X_train, y_train)
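
The scaling step at the top of the cell above learns its statistics from the training split and reuses them on the test split. The following plain-Python sketch shows that idea for one made-up numeric feature. `fit_scaler` and `apply_scaler` are hypothetical helpers, and the population standard deviation is used because that is what `StandardScaler` computes.

```python
from statistics import mean, pstdev


def fit_scaler(column):
    """Learn the mean and population standard deviation of a training column."""
    return mean(column), pstdev(column)


def apply_scaler(column, mu, sigma):
    """Standardize values using statistics learned from the training data."""
    return [(x - mu) / sigma for x in column]


# Made-up values for a single numeric feature
train_col = [10.0, 20.0, 30.0]
test_col = [20.0, 40.0]

mu, sigma = fit_scaler(train_col)                 # statistics come from training data only
train_scaled = apply_scaler(train_col, mu, sigma)
test_scaled = apply_scaler(test_col, mu, sigma)   # reuse the same mu and sigma
print(train_scaled, test_scaled)
```

Scaling matters most for the distance-based and linear models in this list (KNN, SVM, logistic regression); tree-based models are largely insensitive to it.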


## **Model Evaluation**

In [None]:
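# Evaluation Metrics
def evaluate_model(model, X_test, y_test):
    """Print accuracy, precision, recall, and F1 score for a fitted model."""
    y_pred = model.predict(X_test)
    # Label the output so the seven reports below can be told apart
    print(type(model).__name__)
    print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
    print(f"Precision: {precision_score(y_test, y_pred)}")
    print(f"Recall: {recall_score(y_test, y_pred)}")
    print(f"F1 Score: {f1_score(y_test, y_pred)}")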

# Evaluate Models
evaluate_model(log, X_test, y_test)
evaluate_model(svm_model, X_test, y_test)
evaluate_model(knn, X_test, y_test)
evaluate_model(dt, X_test, y_test)
evaluate_model(rf, X_test, y_test)
evaluate_model(gbc, X_test, y_test)
evaluate_model(model_xgb, X_test, y_test)
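
For reference, the four metrics printed above can be derived from the counts of true/false positives and negatives. The sketch below computes them by hand on made-up labels; `binary_metrics` is a hypothetical helper, and undefined ratios are reported as 0.0 by assumption.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for 0/1 labels,
    treating 1 (churned) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels and predictions: 3 churned customers, 5 retained
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 1, 0, 0, 0, 0]
scores = binary_metrics(y_true_demo, y_pred_demo)
print(scores)
```

Precision answers "of the customers flagged as churners, how many actually left?", while recall answers "of the customers who left, how many were flagged?".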


## **Prediction**

In [None]:
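# Refit the best model (XGBoost) on all resampled data, then save it
# X_res is unscaled, so the saved model expects unscaled features
model_xgb.fit(X_res, y_res)
joblib.dump(model_xgb, 'churn_predict_model')
# Reload the saved model to confirm it can be restored for later predictions
model = joblib.load('churn_predict_model')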


## **Explanation**

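The project focuses on predicting customer churn in a banking dataset. After the necessary libraries are imported and the data is loaded, the identifier columns (`RowNumber`, `CustomerId`, `Surname`) are dropped because they carry no predictive information, and the categorical columns are one-hot encoded. The target variable (`Exited`) and the feature variables are then defined. To address class imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) is applied to the full dataset, and the resampled data is split into training (70%) and testing (30%) sets. Because resampling happens before the split, the test set can contain synthetic samples, so the reported scores may be optimistic.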
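
As a rough illustration of how SMOTE creates new minority samples, the sketch below interpolates between a minority point and one of its nearest minority neighbours. It is a simplified plain-Python rendering of the idea, not the `imblearn` implementation; the data points and the helper `smote_like_sample` are made up for the example.

```python
import math
import random


def smote_like_sample(minority, k, rng):
    """Create one synthetic point on the line segment between a random
    minority sample and one of its k nearest minority-class neighbours."""
    base = rng.choice(minority)
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # interpolation weight in [0, 1)
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]


# Made-up minority-class points with two numeric features
minority_points = [[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [3.0, 3.0]]
rng = random.Random(47)
synthetic = [smote_like_sample(minority_points, k=2, rng=rng) for _ in range(5)]
for point in synthetic:
    print(point)
```

Each synthetic point lies between two existing minority samples, which is why the technique only makes sense on numeric features and why synthetic rows closely resemble the originals they were built from.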

Seven machine learning models (Logistic Regression, SVM, KNeighbors Classifier, Decision Tree Classifier, Random Forest Classifier, Gradient Boosting Classifier, and XGBoost) are trained on standardized features and evaluated on the test set using accuracy, precision, recall, and F1 score. The XGBoost classifier is identified as the best-performing model based on accuracy.

Finally, the best model (XGBoost) is refit on the full resampled dataset and saved with `joblib` for future predictions.