<a href="https://www.kaggle.com/code/vidhikishorwaghela/binary-classification-bank-churn-dataset?scriptVersionId=157565236" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Customer Churn Prediction Model

## Overview

This Kaggle notebook builds a machine learning model that predicts customer churn using the Kaggle Playground Series Season 4, Episode 1 dataset. The goal is to predict whether a customer will keep their account or close it (churn).

## Dataset

The dataset includes two CSV files:
- `train.csv`: The training dataset with labeled churn information.
- `test.csv`: The test dataset for making predictions.

## Model Building Process

1. **Data Preprocessing:**
   - Drop the identifier columns (`id`, `CustomerId`, `Surname`), which are not used as features.
   - Split the data into features (`X`) and the target variable (`y`, the `Exited` column).
   - Hold out 20% of the training data as a validation set for model evaluation.
   - Identify numerical and categorical features by column dtype.

2. **Feature Engineering:**
   - Build a preprocessing pipeline that imputes missing numerical values with the column mean and standardizes them.
   - Fill missing categorical values with a constant placeholder and one-hot encode the categorical features.

3. **Model Training:**
   - Construct a model pipeline using the RandomForestClassifier.
   - Train the model on the training set.

4. **Model Evaluation:**
   - Evaluate the model on the validation set using ROC AUC, computed from the predicted churn probabilities.

5. **Prediction and Submission:**
   - Predict churn probabilities on the test set.
   - Create a submission file with the columns `id` and `Exited`.

## Usage

1. Run the notebook cells sequentially to build and evaluate the model.
2. After the training cell runs, the fitted pipeline is available as `model` for making predictions.
3. The final cells generate predictions on the test set and write `submission.csv` to `/kaggle/working/`.

## Dependencies

- pandas
- numpy
- matplotlib
- seaborn
- xgboost
- scikit-learn

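Ensure the required libraries are installed (the modeling code below uses only pandas and scikit-learn; numpy, matplotlib, seaborn, and xgboost are imported but not used yet):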

```python
!pip install pandas numpy matplotlib seaborn xgboost scikit-learn
```


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score



In [2]:
# Load datasets
train_data = pd.read_csv('/kaggle/input/playground-series-s4e1/train.csv')
test_data = pd.read_csv('/kaggle/input/playground-series-s4e1/test.csv')

In [3]:
train_data.describe()

,id,CustomerId,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
count,165034.0,165034.0,165034.0,165034.0,165034.0,165034.0,165034.0,165034.0,165034.0,165034.0,165034.0
mean,82516.5,15692010.0,656.454373,38.125888,5.020353,55478.086689,1.554455,0.753954,0.49777,112574.822734,0.211599
std,47641.3565,71397.82,80.10334,8.867205,2.806159,62817.663278,0.547154,0.430707,0.499997,50292.865585,0.408443
min,0.0,15565700.0,350.0,18.0,0.0,0.0,1.0,0.0,0.0,11.58,0.0
25%,41258.25,15633140.0,597.0,32.0,3.0,0.0,1.0,1.0,0.0,74637.57,0.0
50%,82516.5,15690170.0,659.0,37.0,5.0,0.0,2.0,1.0,0.0,117948.0,0.0
75%,123774.75,15756820.0,710.0,42.0,7.0,119939.5175,2.0,1.0,1.0,155152.4675,0.0
max,165033.0,15815690.0,850.0,92.0,10.0,250898.09,4.0,1.0,1.0,199992.48,1.0


In [4]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 165034 entries, 0 to 165033
Data columns (total 14 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   id               165034 non-null  int64  
 1   CustomerId       165034 non-null  int64  
 2   Surname          165034 non-null  object 
 3   CreditScore      165034 non-null  int64  
 4   Geography        165034 non-null  object 
 5   Gender           165034 non-null  object 
 6   Age              165034 non-null  float64
 7   Tenure           165034 non-null  int64  
 8   Balance          165034 non-null  float64
 9   NumOfProducts    165034 non-null  int64  
 10  HasCrCard        165034 non-null  float64
 11  IsActiveMember   165034 non-null  float64
 12  EstimatedSalary  165034 non-null  float64
 13  Exited           165034 non-null  int64  
dtypes: float64(5), int64(6), object(3)
memory usage: 17.6+ MB


In [5]:
train_data.columns

Index(['id', 'CustomerId', 'Surname', 'CreditScore', 'Geography', 'Gender',
       'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard',
       'IsActiveMember', 'EstimatedSalary', 'Exited'],
      dtype='object')

In [6]:
train_data.head(3)

,id,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,0,15674932,Okwudilichukwu,668,France,Male,33.0,3,0.0,2,1.0,0.0,181449.97,0
1,1,15749177,Okwudiliolisa,627,France,Male,33.0,1,0.0,2,1.0,1.0,49503.5,0
2,2,15694510,Hsueh,678,France,Male,40.0,10,0.0,2,1.0,0.0,184866.69,0


In [7]:
train_data.tail(3)

,id,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
165031,165031,15664752,Hsia,565,France,Male,31.0,5,0.0,1,1.0,1.0,127429.56,0
165032,165032,15689614,Hsiung,554,Spain,Female,30.0,7,161533.0,1,0.0,1.0,71173.03,0
165033,165033,15732798,Ulyanov,850,France,Male,31.0,1,0.0,1,1.0,0.0,61581.79,1


In [8]:
test_data.describe()

,id,CustomerId,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary
count,110023.0,110023.0,110023.0,110023.0,110023.0,110023.0,110023.0,110023.0,110023.0,110023.0
mean,220045.0,15692100.0,656.530789,38.122205,4.996637,55333.611354,1.553321,0.753043,0.495233,112315.147765
std,31761.048671,71684.99,80.315415,8.86155,2.806148,62788.519675,0.544714,0.431244,0.49998,50277.048244
min,165034.0,15565700.0,350.0,18.0,0.0,0.0,1.0,0.0,0.0,11.58
25%,192539.5,15632860.0,597.0,32.0,3.0,0.0,1.0,1.0,0.0,74440.325
50%,220045.0,15690180.0,660.0,37.0,5.0,0.0,2.0,1.0,0.0,117832.23
75%,247550.5,15756930.0,710.0,42.0,7.0,120145.605,2.0,1.0,1.0,154631.35
max,275056.0,15815690.0,850.0,92.0,10.0,250898.09,4.0,1.0,1.0,199992.48


In [9]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110023 entries, 0 to 110022
Data columns (total 13 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   id               110023 non-null  int64  
 1   CustomerId       110023 non-null  int64  
 2   Surname          110023 non-null  object 
 3   CreditScore      110023 non-null  int64  
 4   Geography        110023 non-null  object 
 5   Gender           110023 non-null  object 
 6   Age              110023 non-null  float64
 7   Tenure           110023 non-null  int64  
 8   Balance          110023 non-null  float64
 9   NumOfProducts    110023 non-null  int64  
 10  HasCrCard        110023 non-null  float64
 11  IsActiveMember   110023 non-null  float64
 12  EstimatedSalary  110023 non-null  float64
dtypes: float64(5), int64(5), object(3)
memory usage: 10.9+ MB


In [10]:
test_data.head(3)

,id,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary
0,165034,15773898,Lucchese,586,France,Female,23.0,2,0.0,2,0.0,1.0,160976.75
1,165035,15782418,Nott,683,France,Female,46.0,2,0.0,1,1.0,0.0,72549.27
2,165036,15807120,K?,656,France,Female,34.0,7,0.0,2,1.0,0.0,138882.09


In [11]:
test_data.tail(3)

,id,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary
110020,275054,15728456,Ch'iu,712,France,Male,31.0,2,0.0,2,1.0,0.0,16287.38
110021,275055,15687541,Yegorova,709,France,Female,32.0,3,0.0,1,1.0,1.0,158816.58
110022,275056,15663942,Tuan,621,France,Female,37.0,7,87848.39,1,1.0,0.0,24210.56


In [12]:
# Drop unnecessary columns
drop_columns = ['id', 'CustomerId', 'Surname']
train_data = train_data.drop(columns=drop_columns)
test_data = test_data.drop(columns=drop_columns)

In [13]:
# Separate features and target variable
X = train_data.drop('Exited', axis=1)
y = train_data['Exited']

In [14]:
# Split the data into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
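
The `describe()` output above shows that about 21% of the training rows have `Exited = 1`. A purely random split does not guarantee that the validation set keeps this ratio. One option, not used in this notebook, is to pass `stratify=y` to `train_test_split`. The sketch below illustrates the effect on synthetic data; `X_demo`, `y_demo`, and the other variable names are made up for the example and are not part of the competition data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1,000 rows, roughly 21% positives.
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({"CreditScore": rng.integers(350, 851, size=1000)})
y_demo = pd.Series((rng.random(1000) < 0.21).astype(int), name="Exited")

# stratify=y_demo keeps the class ratio (nearly) identical in both subsets.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"overall churn rate:    {y_demo.mean():.3f}")
print(f"training churn rate:   {y_tr.mean():.3f}")
print(f"validation churn rate: {y_va.mean():.3f}")
```

In the notebook itself, the equivalent change would be adding `stratify=y` to the existing call; whether it changes the validation score noticeably on a dataset of this size is an open question.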

In [15]:
# Define numerical and categorical features
numerical_features = X.select_dtypes(include=['float64', 'int64']).columns
categorical_features = X.select_dtypes(include=['object']).columns

In [16]:
# Create preprocessing pipeline
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])


In [17]:
# Create model pipeline (Random Forest Classifier)
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])


In [18]:
# Train the model
model.fit(X_train, y_train)


In [19]:
# Make predictions on the validation set
y_pred = model.predict_proba(X_valid)[:, 1]
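
The cell above takes the second column of `predict_proba` (the predicted probability of churn) rather than hard class labels, because ROC AUC measures how well the scores rank positives above negatives. As a small illustration on made-up numbers, the sketch below computes AUC as the fraction of correctly ordered (positive, negative) pairs and compares it with `roc_auc_score`. The helper `pairwise_auc` and the sample values are hypothetical.

```python
from itertools import product

from sklearn.metrics import roc_auc_score


def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted churn probabilities, for illustration only.
y_true = [0, 0, 1, 1, 1, 0]
proba = [0.10, 0.40, 0.35, 0.80, 0.45, 0.60]

# Thresholding at 0.5 throws away the ranking information within each class.
hard_labels = [int(p >= 0.5) for p in proba]

auc_from_proba = roc_auc_score(y_true, proba)
auc_from_labels = roc_auc_score(y_true, hard_labels)

print(auc_from_proba, pairwise_auc(y_true, proba))
print(auc_from_labels, pairwise_auc(y_true, hard_labels))
```

On this toy input the probabilities score higher than the thresholded labels, since the many ties among hard labels each count as only half a correct pair.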

In [20]:
# Evaluate the model
roc_auc = roc_auc_score(y_valid, y_pred)
print(f'ROC AUC Score on Validation Set: {roc_auc}')

ROC AUC Score on Validation Set: 0.8742602142946315
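
The score above comes from a single 80/20 split, so it is one estimate with unknown spread. A possible extension, not part of the original notebook, is k-fold cross-validation with `scoring='roc_auc'`. The sketch below runs it on synthetic data from `make_classification`; the data, the reduced `n_estimators=50`, and the variable names are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the churn data (about 80% negatives).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=42
)

# Five stratified folds; each fold is used once as the validation set.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X_demo, y_demo, cv=cv, scoring="roc_auc",
)

print(f"ROC AUC per fold: {scores.round(3)}")
print(f"mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```

In the notebook, the full `model` pipeline and the unsplit `X`, `y` could be passed to `cross_val_score` in the same way. Because the preprocessing lives inside the pipeline, it is refit on each training fold, which avoids leaking validation statistics into the scaler or imputer.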


In [21]:
# Make predictions on the test set
test_predictions = model.predict_proba(test_data)[:, 1]

In [22]:
# Create a submission file keyed by the original test ids.
# The 'id' column was dropped from test_data earlier, and the default index
# (0..n-1) does not match the real ids (165034..275056), so reload them.
test_ids = pd.read_csv('/kaggle/input/playground-series-s4e1/test.csv', usecols=['id'])['id']
submission = pd.DataFrame({'id': test_ids, 'Exited': test_predictions})
submission.to_csv('/kaggle/working/submission.csv', index=False)
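
Before uploading, it can help to sanity-check the submission frame. The sketch below shows one way to do that: it checks the column order, that the `id` values match the test set, and that every prediction is a probability. The helper `check_submission` and the three-row example frames are hypothetical; the example ids mirror the first test rows shown earlier.

```python
import pandas as pd


def check_submission(submission: pd.DataFrame, expected_ids: pd.Series) -> None:
    """Raise AssertionError if the submission frame looks malformed."""
    assert list(submission.columns) == ["id", "Exited"], "unexpected columns"
    assert submission["id"].tolist() == expected_ids.tolist(), "ids do not match the test set"
    assert submission["Exited"].between(0.0, 1.0).all(), "probabilities outside [0, 1]"


# Hypothetical three-row example with ids taken from the test set's first rows.
expected_ids = pd.Series([165034, 165035, 165036], name="id")
good = pd.DataFrame({"id": expected_ids, "Exited": [0.12, 0.85, 0.40]})
check_submission(good, expected_ids)  # passes silently

# A frame whose ids come from a default RangeIndex (0, 1, 2) fails the id check.
bad = good.assign(id=list(range(len(good))))
try:
    check_submission(bad, expected_ids)
except AssertionError as err:
    print(f"rejected: {err}")
```

The id check matters here because the test ids start at 165034, not 0, so a positional index cannot stand in for the `id` column.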