### 1. Import Dependencies

When does binning help?

1. Non-linear relationship with target
    - Binning can help capture non-linear patterns that a linear model might miss

2. Skewed distribution
    - Binning can smooth out skew and reduce the effect of extreme values

3. Interpretability is key
    - Easier for business users to understand "age 18-25" than "age = 23"

4. Model is prone to overfitting (performing very well on the training set, but not on the test set)
    - Binning reduces granularity -> fewer splits -> less overfitting

5. Need to reduce cardinality
    - Helps when a numeric column has too many unique values

6. Sparse or noisy data
    - Binning can group rare or noisy values to improve signal strength

In [1]:
import os
import pandas as pd #alias
import numpy as np #alias
import seaborn as sns
from matplotlib import pyplot as plt

### 2. Basic Processing

In [2]:
df = pd.read_csv('processed/ChurnModelling_Outliers_Handled.csv')
df.head()

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42.0,2,0.0,1,1,1,101348.88,1
1,608,Spain,Female,41.0,1,83807.86,1,0,1,112542.58,0
2,502,France,Female,42.0,8,159660.8,3,1,0,113931.57,1
3,699,France,Female,38.91,1,0.0,2,0,0,93826.63,0
4,850,Spain,Female,43.0,2,125510.82,1,1,1,79084.1,0


#### For this specific dataset, binning can be applied to the numerical columns as follows

- To apply binning you must know the value range and the bin edges in advance. Bins with abnormal intervals are not useful.
- If you do not have clear visibility of the lower and upper bounds of a column, binning is not an ideal choice.
- The bins do not need to be equal in width, but the ranges must be realistic.
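The rules above can be turned into a quick check before committing to a set of bin edges. This is a minimal sketch: `edges_cover_column` is a hypothetical helper, and the sample values are illustrative rather than the full dataset.

```python
import pandas as pd

def edges_cover_column(values: pd.Series, edges: list) -> bool:
    """Return True if the edges are strictly increasing and span the observed range."""
    increasing = all(a < b for a, b in zip(edges, edges[1:]))
    return bool(increasing and edges[0] <= values.min() and values.max() <= edges[-1])

# Illustrative sample values only (assumption: not the full column)
credit_scores = pd.Series([350, 502, 608, 619, 699, 850])
balances = pd.Series([0.0, 83807.86, 159660.80])

print(edges_cover_column(credit_scores, [350, 580, 670, 740, 800, 850]))  # True
print(edges_cover_column(balances, [0, 50_000, 100_000]))                 # False
```

A `False` result means some values would fall outside every bin, which is the situation the bullets warn about.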

#### Eligible
1. Age - binning can be applied as described below (this is a very common choice)
2. EstimatedSalary - binning is possible, but the upper bound is open: what happens if a customer earns far more than the current maximum?
3. Credit Score - the bounds are known exactly (350 and 850), so binning works well
4. Tenure - a good candidate, but it needs further understanding before the bins are chosen

#### Not Eligible
1. Geography - no point in binning because this is already a categorical variable
2. Balance - the upper bound is unknown. What if a customer arrives with 100M? We cannot build bins with such abnormal ranges.
3. Gender, NumOfProducts, HasCrCard, IsActiveMember - these are already categorical


In [3]:
df['Age'].min(), df['Age'].max()

(18.0, 92.0)

For Age, the feature binning groups can simply be:

- 18 to under 30 => Youngsters
- 30 to under 50 => Middle Age
- 50+ => Seniors
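The Age groups above could be produced with `pd.cut`. This is a sketch under two assumptions: the lower edge of each bin is inclusive (so 30 counts as Middle Age), and the sample ages below stand in for `df['Age']`.

```python
import numpy as np
import pandas as pd

# Illustrative ages (assumption: stand-in for df['Age'])
ages = pd.Series([18.0, 29.5, 30.0, 38.91, 49.9, 50.0, 92.0], name="Age")

# right=False makes the intervals left-closed: [18, 30), [30, 50), [50, inf)
age_bins = pd.cut(
    ages,
    bins=[18, 30, 50, np.inf],
    labels=["Youngsters", "Middle Age", "Seniors"],
    right=False,
)
print(age_bins.tolist())
```

With the real data this would become `df['AgeBins'] = pd.cut(df['Age'], ...)`; the column name `AgeBins` is hypothetical.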

In [4]:
df['CreditScore'].min(), df['CreditScore'].max()

(350, 850)

For CreditScore, the feature binning groups are:

- 350 - 579 => 'Poor'
- 580 - 669 => 'Fair'
- 670 - 739 => 'Good'
- 740 - 799 => 'Very Good'
- 800 - 850 => 'Excellent'

In [5]:
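def custom_binning_credit_score(score):
    # Upper edges are exclusive: 579 is 'Poor', 580 is 'Fair'
    if score < 580:
        return 'Poor'
    if score < 670:
        return 'Fair'
    if score < 740:
        return 'Good'
    if score < 800:
        return 'Very Good'
    if score <= 850:
        return 'Excellent'
    # Fail loudly instead of silently returning None for an invalid score
    raise ValueError("Credit Score can't go beyond 850")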
        
df['CreditScoreBins'] = df['CreditScore'].apply(custom_binning_credit_score)
del df['CreditScore']

df 

Unnamed: 0,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,CreditScoreBins
0,France,Female,42.00,2,0.00,1,1,1,101348.88,1,Fair
1,Spain,Female,41.00,1,83807.86,1,0,1,112542.58,0,Fair
2,France,Female,42.00,8,159660.80,3,1,0,113931.57,1,Poor
3,France,Female,38.91,1,0.00,2,0,0,93826.63,0,Good
4,Spain,Female,43.00,2,125510.82,1,1,1,79084.10,0,Excellent
...,...,...,...,...,...,...,...,...,...,...,...
9995,France,Male,39.00,5,0.00,2,1,0,96270.64,0,Very Good
9996,France,Male,35.00,10,57369.61,1,1,1,101699.77,0,Poor
9997,France,Female,36.00,7,0.00,1,0,1,42085.58,1,Good
9998,Germany,Male,42.00,3,75075.31,2,1,0,92888.52,1,Very Good
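The row-wise `apply` in cell 5 can also be written as a single vectorized `pd.cut` call. This is an alternative sketch, not the notebook's method. It assumes integer scores, which is why the last edge is 851: with `right=False` the final interval is `[800, 851)`, so 850 is still included. The sample scores are illustrative.

```python
import pandas as pd

# Illustrative scores sitting on each bin boundary
scores = pd.Series([350, 579, 580, 669, 670, 739, 740, 799, 800, 850], name="CreditScore")

edges = [350, 580, 670, 740, 800, 851]  # 851 keeps 850 inside the last left-closed bin
labels = ["Poor", "Fair", "Good", "Very Good", "Excellent"]

score_bins = pd.cut(scores, bins=edges, labels=labels, right=False)
print(score_bins.tolist())

# Values outside the edges become NaN rather than raising an error
out_of_range = pd.cut(pd.Series([349, 900]), bins=edges, labels=labels, right=False)
print(out_of_range.isna().tolist())  # [True, True]
```

Unlike the function in cell 5, `pd.cut` does not raise on invalid scores, so a follow-up `isna()` check is needed if out-of-range values should stop the pipeline.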


In [6]:
df.to_csv('processed/ChrunModelling_Binning_Applied.csv', index=False)