# Customer Churn Analysis and Prediction


Customer churn is the rate at which customers leave a company or stop using its services. It directly affects revenue and long-term growth, which makes it a key input to any customer retention strategy.

This notebook focuses on analyzing and predicting customer churn in a bank using a dataset that contains diverse customer attributes. The primary objective is to identify the key factors influencing customer churn and to build a machine learning model capable of accurately predicting whether a customer is likely to leave the bank.

---

## Dataset Overview

The training data holds 8,000 customer records with 12 columns covering demographics, financial status, and engagement with the bank. These attributes are used to understand customer behavior and predict churn.

---

## Project Objectives

- Explore and understand customer data.
- Identify important features influencing customer churn.
- Apply data preprocessing and feature engineering techniques.
- Build and evaluate machine learning models for churn prediction.
- Select the best-performing model based on evaluation metrics.

---

## Workflow

The notebook follows six steps, from loading the data to evaluating the final model.

### 1. Data Loading and Exploration
- Load the dataset using Pandas.
- Explore the dataset to identify categorical and numerical features.
- Understand the structure and basic characteristics of the data.
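
As a rough illustration of the exploration step, the sketch below separates numerical from non-numerical columns with `select_dtypes`. The frame `toy` and its four rows-worth of columns are a hypothetical stand-in for the bank data, not the actual dataset.

```python
import pandas as pd

# Hypothetical miniature frame using column names from the dataset description
toy = pd.DataFrame({
    "Geography": ["France", "Spain"],
    "Gender": ["Female", "Male"],
    "CreditScore": [619, 608],
    "Balance": [0.0, 83807.86],
})

# Numeric dtypes (int, float) versus everything else (text, category)
numerical_columns = toy.select_dtypes(include="number").columns.tolist()
categorical_columns = toy.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numerical_columns)
print("Categorical:", categorical_columns)
```

Note that binary flags stored as integers would land in the numerical group under this split, so the result is a starting point rather than a final classification.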

### 2. Data Cleaning and Preprocessing
- Check for missing, null, or unknown values.
- Identify and remove duplicate records, if any.
- Ensure data consistency and quality.
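
The cleaning checks listed above could look roughly like this. The frame `toy` and the placeholder string `"Unknown"` are assumptions made for the example; the real data may encode unknown values differently or not at all.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value, one placeholder, and one duplicate row
toy = pd.DataFrame({
    "CreditScore": [619, 608, 608, np.nan],
    "Geography": ["France", "Spain", "Spain", "Unknown"],
})

# Count NaN/None per column
missing_per_column = toy.isna().sum()

# Count a placeholder string that isna() would not catch (assumed encoding)
unknown_count = int((toy["Geography"] == "Unknown").sum())

# Count fully identical rows, then drop them
duplicate_rows = int(toy.duplicated().sum())
cleaned = toy.drop_duplicates().reset_index(drop=True)
```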

### 3. Exploratory Data Analysis (EDA)
- Perform in-depth analysis to uncover patterns, trends, and relationships.
- Use visualizations such as:
  - Histograms
  - Box plots
  - Heatmaps
  - Scatter plots  
- Identify key predictors of customer churn.

### 4. Feature Engineering
- Apply appropriate feature transformation techniques.
- Encode categorical variables into numerical formats suitable for machine learning models.
- Scale numerical features to improve model performance.
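
One possible way to combine encoding and scaling is scikit-learn's `ColumnTransformer`. This is a minimal sketch on a hypothetical four-row frame; the choice of one-hot encoding and standard scaling here is an assumption for illustration, not necessarily the transformation the notebook ends up using.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical sample using column names from the dataset description
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Male", "Male", "Female"],
    "Age": [42, 41, 39, 43],
    "Balance": [0.0, 83807.86, 159660.80, 125510.82],
})

preprocessor = ColumnTransformer(
    transformers=[
        # One 0/1 column per category; unseen categories become all zeros
        ("categorical", OneHotEncoder(handle_unknown="ignore"), ["Geography", "Gender"]),
        # Zero mean and unit variance per numerical column
        ("numerical", StandardScaler(), ["Age", "Balance"]),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

features = preprocessor.fit_transform(toy)
print(features.shape)  # 3 geography + 2 gender + 2 scaled columns
```

Fitting the transformer on the training set only and then calling `transform` on the test set keeps test statistics out of the scaling step.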

### 5. Machine Learning Modeling
- Split the dataset into training and testing sets.
- Train and compare multiple machine learning algorithms.
- Perform hyperparameter tuning to optimize model performance.
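
The split-train-tune sequence can be sketched as follows. The data comes from `make_classification` with a roughly 80/20 class balance as a stand-in for the bank data, and the model, parameter grid, and `f1` scoring are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced binary data (assumption: not the bank dataset)
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=42
)

# Stratify so both splits keep the same class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Cross-validated search over the regularization strength
search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    scoring="f1",
    cv=5,
)
search.fit(X_train, y_train)

best_model = search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 3))
```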

### 6. Model Evaluation and Conclusion
- Evaluate model performance using metrics such as:
  - Accuracy
  - Precision
  - Recall
  - F1-score
- Analyze results and draw conclusions about the most effective churn prediction approach.
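
To make the four metrics concrete, here is a small worked example on hand-made labels (hypothetical values, with 1 standing for a churned customer). The predictions contain 3 true positives, 1 false positive, 1 false negative, and 3 true negatives.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: 1 = exited, 0 = retained
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / all = 6 / 8
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 4
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 4
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(accuracy, precision, recall, f1)
```

When churners are a minority, accuracy alone can look good for a model that rarely predicts churn, which is why precision, recall, and F1 are reported alongside it.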

---

## Dataset Description

### 1. Categorical and Discrete Variables

**Geography**  
- Country of the customer  
- Example values: France, Spain, Germany  

**Gender**  
- Customer’s gender  
- Example values: Male, Female  

**Tenure**  
- Number of years the customer has been with the bank  
- Example values: 1, 5, 10  

**HasCrCard**  
- Indicates whether the customer has a credit card  
- Values: 1 (Yes), 0 (No)  

**NumOfProducts**  
- Number of bank products used by the customer  
- Example values: 1, 2, 3  

**Exited**  
- Indicates whether the customer has left the bank  
- Values: 1 (Exited), 0 (Retained)  

**IsActiveMember**  
- Indicates whether the customer is an active member  
- Values: 1 (Yes), 0 (No)  

**PostExitQuestionnaire**  
- Indicates whether a questionnaire was sent after exit  
- Values: 1 (Distributed), 0 (Not Distributed)  

---

### 2. Continuous Variables

**CreditScore**  
- Customer’s credit score  
- Example values: 450, 750, 850  

**Balance**  
- Bank account balance  
- Example values: 0.00, 50,000.00, 120,000.00  

**Age**  
- Age of the customer  
- Example values: 25, 40, 60  

**EstimatedSalary**  
- Estimated annual salary  
- Example values: 20,000.00, 80,000.00, 200,000.00  

---

By following this structured approach, the project aims to develop a robust and interpretable machine learning model capable of effectively predicting customer churn in the banking sector.


# 1. Importing Libraries

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.preprocessing import FunctionTransformer, PowerTransformer

from statsmodels.stats.outliers_influence import variance_inflation_factor
from scipy.stats import chi2_contingency

from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split, cross_val_score

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import VotingClassifier
from xgboost.sklearn import XGBClassifier

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, ConfusionMatrixDisplay

print("All Libraries Imported successfully")

All Libraries Imported successfully


# 2. Loading Dataset

In [11]:
# Use the first CSV column as the DataFrame index
customers_train = pd.read_csv("../data/data_train.csv", index_col=0)
customers_test = pd.read_csv("../data/data_test.csv", index_col=0)

In [13]:
customers_train.head(5) # top five rows of data set

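|   | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | PostExitQuestionnaire |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 | 1 |
| 1 | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 | 0 |
| 2 | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 | 1 |
| 3 | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 | 0 |
| 4 | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 | 0 |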


# 3. Data Exploration + Pre-Processing

In [14]:
customers_train.info() # Seems like data has no missing values

<class 'pandas.core.frame.DataFrame'>
Index: 8000 entries, 0 to 7999
Data columns (total 12 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   CreditScore            8000 non-null   int64  
 1   Geography              8000 non-null   object 
 2   Gender                 8000 non-null   object 
 3   Age                    8000 non-null   int64  
 4   Tenure                 8000 non-null   int64  
 5   Balance                8000 non-null   float64
 6   NumOfProducts          8000 non-null   int64  
 7   HasCrCard              8000 non-null   int64  
 8   IsActiveMember         8000 non-null   int64  
 9   EstimatedSalary        8000 non-null   float64
 10  Exited                 8000 non-null   int64  
 11  PostExitQuestionnaire  8000 non-null   int64  
dtypes: float64(2), int64(8), object(2)
memory usage: 812.5+ KB


In [15]:
customers_train["Geography"].value_counts()

Geography
France     4010
Spain      1995
Germany    1995
Name: count, dtype: int64

In [16]:
customers_train["Gender"].value_counts()

Gender
Male      4343
Female    3657
Name: count, dtype: int64

In [17]:
def convert_to_category(data_train, data_test, columns):
    """Cast the given columns to the pandas category dtype in both frames (modifies them in place)."""
    for column in columns:
        data_train[column] = data_train[column].astype("category")
        data_test[column] = data_test[column].astype("category")

    return data_train, data_test

# Geography and Gender are the two object (text) columns reported by info()
columns_to_convert = ['Gender', 'Geography']
customers_train, customers_test = convert_to_category(customers_train, customers_test, columns_to_convert)
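
Because `astype("category")` infers its categories from the values present in each frame, converting train and test separately can yield different category sets if one split lacks a value. A hedged alternative is to build a single `CategoricalDtype` from the training data and apply it to both frames. The frames and the name `geography_dtype` below are hypothetical.

```python
import pandas as pd

# Hypothetical splits: the test frame happens to contain no "Germany" rows
train = pd.DataFrame({"Geography": ["France", "Spain", "Germany"]})
test = pd.DataFrame({"Geography": ["Spain", "France"]})

# Define the category set once, from the training data
geography_dtype = pd.CategoricalDtype(categories=sorted(train["Geography"].unique()))

train["Geography"] = train["Geography"].astype(geography_dtype)
test["Geography"] = test["Geography"].astype(geography_dtype)

# Both frames now share the same categories, including the one absent from test
print(list(test["Geography"].cat.categories))
```

Values in the test frame that never appear in training would become `NaN` under this approach, so it is worth checking for them explicitly.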

In [18]:
customers_train.isna().sum() # No missing or null values 

CreditScore              0
Geography                0
Gender                   0
Age                      0
Tenure                   0
Balance                  0
NumOfProducts            0
HasCrCard                0
IsActiveMember           0
EstimatedSalary          0
Exited                   0
PostExitQuestionnaire    0
dtype: int64

In [19]:
customers_train.duplicated().sum() # No duplicate rows

np.int64(0)

In [20]:
customers_train.describe().T # Descriptive statistics for the dataset: mean, standard deviation, and quartiles


|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| CreditScore | 8000.0 | 650.179625 | 96.844314 | 350.0 | 583.0 | 651.0 | 717.0 | 850.0 |
| Age | 8000.0 | 38.937875 | 10.511224 | 18.0 | 32.0 | 37.0 | 44.0 | 92.0 |
| Tenure | 8000.0 | 5.01275 | 2.884376 | 0.0 | 3.0 | 5.0 | 7.0 | 10.0 |
| Balance | 8000.0 | 76800.037193 | 62391.192584 | 0.0 | 0.0 | 97658.06 | 127827.3325 | 250898.09 |
| NumOfProducts | 8000.0 | 1.528 | 0.583102 | 1.0 | 1.0 | 1.0 | 2.0 | 4.0 |
| HasCrCard | 8000.0 | 0.701625 | 0.457574 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| IsActiveMember | 8000.0 | 0.512625 | 0.499872 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| EstimatedSalary | 8000.0 | 100198.588701 | 57524.002768 | 11.58 | 51271.41 | 100272.165 | 149372.3875 | 199992.48 |
| Exited | 8000.0 | 0.205875 | 0.404365 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| PostExitQuestionnaire | 8000.0 | 0.18375 | 0.387304 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
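
For a 0/1 column, the mean is the share of ones, so the `Exited` mean of 0.205875 in the summary above corresponds to about 20.6% of the 8,000 training customers having churned. The sketch below shows the same reading on the five `Exited` values from the `head()` output; the variable names are illustrative.

```python
import pandas as pd

# Exited values of the first five training rows shown by head()
exited = pd.Series([1, 0, 1, 0, 0], name="Exited")

# Mean of a binary column = proportion of positives
churn_rate = exited.mean()

# The same information as explicit class shares
class_share = exited.value_counts(normalize=True)

# Applying the reported mean to the reported row count
churned_in_train = round(0.205875 * 8000)

print(churn_rate, churned_in_train)
```

A churn share near one in five means the classes are imbalanced, which is relevant when choosing evaluation metrics and resampling strategies.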


In [21]:
# Function to rename columns and set them to lowercase
def preprocess_columns(df, mapper):
    df.rename(columns=mapper, inplace=True)
    df.columns = df.columns.str.lower()
    return df

# Column mapping
customers_mapper = {
    "CreditScore": "credit_score", 
    "NumOfProducts": "num_of_products", 
    "HasCrCard": "has_credit_card",
    "IsActiveMember": "is_active_member", 
    "EstimatedSalary": "estimated_salary", 
    "PostExitQuestionnaire": "post_exit_questionnaire"
}

# Apply preprocessing to both train and test datasets
customers_train = preprocess_columns(customers_train, customers_mapper)
customers_test = preprocess_columns(customers_test, customers_mapper)

# Print the training columns to confirm the renaming
print(customers_train.columns)


Index(['credit_score', 'geography', 'gender', 'age', 'tenure', 'balance',
       'num_of_products', 'has_credit_card', 'is_active_member',
       'estimated_salary', 'exited', 'post_exit_questionnaire'],
      dtype='object')
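
The function above renames columns in place, so the frame passed in is modified as a side effect. If that is not wanted, a non-mutating variant is one option. The sketch below is an alternative under that assumption; `preprocess_columns_copy` and the sample frame are hypothetical names, not part of the notebook.

```python
import pandas as pd

def preprocess_columns_copy(df, mapper):
    """Return a renamed, lowercased copy and leave the input frame untouched."""
    # First apply the explicit mapping, then lowercase whatever remains
    return df.rename(columns=mapper).rename(columns=str.lower)

# Hypothetical one-row frame with original-style column names
raw = pd.DataFrame({"CreditScore": [619], "Geography": ["France"], "HasCrCard": [1]})
mapper = {"CreditScore": "credit_score", "HasCrCard": "has_credit_card"}

renamed = preprocess_columns_copy(raw, mapper)

print(list(renamed.columns))
print(list(raw.columns))  # original names are preserved
```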
