# Predicting Churn - Desafio Data Science - Indicium

The EDA notebook covers the problem in detail, so this notebook focuses on the prediction model.

Because the target variable Exited takes the values 0 or 1, this is a **classification** problem: given the customer data, the model predicts 0 if the customer will stay with the bank and 1 if the customer will leave.


## Importing libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import warnings
import joblib
warnings.filterwarnings('ignore')

# Set random seed
SEED = 11

In [2]:
df = pd.read_csv('Abandono_clientes.csv')
df.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


## Feature Engineering


In [3]:
# Start by dropping columns that are not needed for the prediction
df.drop(columns=['RowNumber', 'CustomerId', 'Surname'], inplace=True)

In [4]:
# Check for null values
df.isnull().sum()

CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64

No null values were found, so we can use all the data for the prediction without worrying about missing values.


#### Encoding categorical variables

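Encoding matters because most models accept only numerical inputs, so the categorical columns Gender and Geography must be converted to numbers before training.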

In [5]:
# Converting Gender to binary
df['Male'] = df['Gender'].map({'Male': 1, 'Female': 0})
df.drop(columns=['Gender'], inplace=True)

# One-hot encode the Geography column, since it is a categorical variable
df = pd.get_dummies(df, columns=['Geography'], prefix='Geo', dtype=int)

df.head()

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Male,Geo_France,Geo_Germany,Geo_Spain
0,619,42,2,0.0,1,1,1,101348.88,1,0,1,0,0
1,608,41,1,83807.86,1,0,1,112542.58,0,0,0,0,1
2,502,42,8,159660.8,3,1,0,113931.57,1,0,1,0,0
3,699,39,1,0.0,2,0,0,93826.63,0,0,1,0,0
4,850,43,2,125510.82,1,1,1,79084.1,0,0,0,0,1


We can see that the Geography column was transformed into 3 new columns, one for each country. This is necessary because models that accept only numerical values cannot use a categorical variable directly, so we use one-hot encoding to turn it into numerical indicator columns.
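As a minimal sketch of what one-hot encoding produces, the toy frame below uses hypothetical values (not the bank data). The `drop_first=True` variant is an optional alternative that this notebook does not use; it keeps one column fewer, which can help linear models avoid perfectly collinear indicator columns.

```python
import pandas as pd

# Toy frame with hypothetical values, mirroring the Geography column
toy = pd.DataFrame({"Geography": ["France", "Spain", "Germany", "France"]})

# One indicator column per category, stored as integers
encoded = pd.get_dummies(toy, columns=["Geography"], prefix="Geo", dtype=int)
print(encoded)

# Optional variant (not used in this notebook): drop the first category
encoded_reduced = pd.get_dummies(
    toy, columns=["Geography"], prefix="Geo", dtype=int, drop_first=True
)
print(encoded_reduced.columns.tolist())
```

Each row has exactly one indicator set to 1, so no ordering between countries is implied.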


### Splitting the data

Before splitting the data, I want to clarify one decision. After analyzing the correlation matrix and the variable charts (available in the EDA notebook), I decided to remove the HasCrCard feature: its correlation with the target variable is very low, and its chart shows no significant difference between the two classes of the target variable. Furthermore, I ran some tests with logistic regression, random forest, and decision tree that are not shown here, to keep the notebook from getting too long. The accuracy was the same with and without this feature, and its feature importance was very low in all three models. So our final dataset will be:
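The kind of check described above can be sketched on synthetic data. This is not the author's actual test: the columns `signal` and `noise_flag`, the helper `holdout_accuracy`, and the tree depth are all hypothetical, with `noise_flag` standing in for a weak binary feature such as HasCrCard.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(11)
n = 1000

# Synthetic data: one informative feature and one unrelated binary flag
demo = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise_flag": rng.integers(0, 2, size=n),
})
demo["target"] = (demo["signal"] + rng.normal(scale=0.5, size=n) > 0).astype(int)

# Correlation of each feature with the target
print(demo.corr()["target"].drop("target"))

def holdout_accuracy(features):
    """Fit a small tree on the given features; return test accuracy and importances."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        demo[features], demo["target"], test_size=0.2, random_state=11
    )
    model = DecisionTreeClassifier(max_depth=3, random_state=11).fit(X_tr, y_tr)
    return model.score(X_te, y_te), dict(zip(features, model.feature_importances_))

acc_with, imp_with = holdout_accuracy(["signal", "noise_flag"])
acc_without, _ = holdout_accuracy(["signal"])
print(acc_with, acc_without, imp_with)
```

If accuracy barely changes and the importance of the candidate feature stays near zero, dropping it is a reasonable simplification.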

In [6]:
df.drop(columns=['HasCrCard']).head()

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,IsActiveMember,EstimatedSalary,Exited,Male,Geo_France,Geo_Germany,Geo_Spain
0,619,42,2,0.0,1,1,101348.88,1,0,1,0,0
1,608,41,1,83807.86,1,1,112542.58,0,0,0,0,1
2,502,42,8,159660.8,3,0,113931.57,1,0,1,0,0
3,699,39,1,0.0,2,0,93826.63,0,0,1,0,0
4,850,43,2,125510.82,1,1,79084.1,0,0,0,0,1


In [7]:
X = df.drop(columns=['Exited', 'HasCrCard'])
y = df['Exited']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED)

Splitting the data into training and testing sets lets us evaluate the model's performance on unseen data, which is crucial for assessing how well it generalizes to new customers. Here, the split yields 8,000 samples for training and 2,000 samples for testing.
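The sketch below reproduces the 80/20 split sizes on synthetic arrays. The `stratify` argument is an optional addition that this notebook does not use; it keeps the class ratio the same in both sets. The 20% positive rate is a hypothetical value chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset: 10,000 rows, hypothetical 20% positive class
X_demo = np.arange(10_000 * 2).reshape(10_000, 2)
y_demo = np.array([1] * 2_000 + [0] * 8_000)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=11, stratify=y_demo
)

print(X_tr.shape, X_te.shape)    # row counts of each set
print(y_tr.mean(), y_te.mean())  # positive rate in each set
```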

#### Talking about scaling
For tree-based models, scaling the data is not necessary. These models are not distance-based, so their performance and structure are unaffected by features being on different scales. In fact, scaling can even reduce the interpretability of tree-based models because the raw feature values often carry meaningful insights into splits and thresholds.
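A quick way to see this is to fit the same tree on raw and standardized copies of a synthetic dataset and compare predictions. The data and the target rule (age above 45) are hypothetical and only serve the illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(11)

def make_features(size):
    # Two columns on very different scales: an age-like and a salary-like feature
    return np.column_stack([
        rng.integers(18, 93, size=size),
        rng.uniform(11.58, 199_992, size=size),
    ])

X_demo = make_features(500)
y_demo = (X_demo[:, 0] > 45).astype(int)  # hypothetical target rule
X_new = make_features(200)

scaler = StandardScaler().fit(X_demo)
tree_raw = DecisionTreeClassifier(random_state=11).fit(X_demo, y_demo)
tree_scaled = DecisionTreeClassifier(random_state=11).fit(scaler.transform(X_demo), y_demo)

# Standardization is monotonic per feature, so the tree finds equivalent splits
same_predictions = np.array_equal(
    tree_raw.predict(X_new), tree_scaled.predict(scaler.transform(X_new))
)
print(same_predictions)
```

The raw tree also keeps its thresholds in the original units (years, currency), which is what makes its splits easy to read.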

However, for other types of models, scaling is crucial. For example, in this case, the variables have different ranges: Age ranges from 18 to 92, while EstimatedSalary ranges from 11.58 to 199,992. If features are not on the same scale, it can introduce bias in distance-based models, such as k-Nearest Neighbors (k-NN), or hinder convergence in gradient-based models, such as logistic regression or neural networks. Scaling puts all features on a comparable scale so that none dominates simply because of its units, which typically improves the performance and reliability of these models.

The timing of scaling is equally important. Splitting the dataset first and fitting the scaler only on the training set preserves the independence of the test set, prevents data leakage, and keeps the model evaluation accurate.

So we will keep an unscaled version of the data for tree-based models and a scaled version for the other models.
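To make the distance argument concrete, the sketch below compares Euclidean distances between three hypothetical customers before and after standardization. The feature ranges follow the ones quoted above, but the data itself is synthetic.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(11)

# Synthetic age-like and salary-like columns
X_demo = np.column_stack([
    rng.integers(18, 93, size=1000).astype(float),
    rng.uniform(11.58, 199_992, size=1000),
])

a = np.array([[25.0, 100_000.0]])
b = np.array([[75.0, 100_000.0]])  # 50 years older, same salary
c = np.array([[25.0, 101_000.0]])  # same age, salary higher by 1,000

def dist(u, v):
    return float(np.linalg.norm(u - v))

# Unscaled: the salary difference dominates purely because of its units
raw_ab, raw_ac = dist(a, b), dist(a, c)

# Standardized: each feature is measured in units of its own standard deviation
scaler = StandardScaler().fit(X_demo)
scaled_ab = dist(scaler.transform(a), scaler.transform(b))
scaled_ac = dist(scaler.transform(a), scaler.transform(c))

print(raw_ab, raw_ac)
print(scaled_ab, scaled_ac)
```

On raw values, a 50-year age gap looks smaller than a 1,000 salary gap; after standardization the ordering flips.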


In [None]:
# Scaling the data for models that require it
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Save the scaler
joblib.dump(scaler, 'scalers_n_models/standard_scaler.pkl');
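
A saved scaler is only useful if it can be restored and applied to new data with the statistics learned from the training set. The sketch below shows that round trip on synthetic data, writing to a temporary directory rather than the notebook's own folder. Note that `joblib.dump` does not create missing folders, so the target directory must exist first.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(11)
X_train_demo = rng.normal(loc=50, scale=10, size=(100, 3))  # synthetic training data
X_new_demo = rng.normal(loc=50, scale=10, size=(5, 3))      # synthetic unseen data

# Fit on training data only, so no information from new data leaks in
scaler_demo = StandardScaler().fit(X_train_demo)

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "scalers_n_models"
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder before dumping
    path = str(out_dir / "standard_scaler.pkl")
    joblib.dump(scaler_demo, path)
    restored = joblib.load(path)

# The restored scaler reuses the training mean and standard deviation
same_output = np.allclose(
    scaler_demo.transform(X_new_demo), restored.transform(X_new_demo)
)
print(same_output)
```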

## Testing models

### Tree-based models

Because business interpretability is important for this problem, we will start with tree-based models. These models are easy to interpret and can provide insights into feature importance, which can be valuable for understanding the factors that drive customer churn.
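As a rough illustration of that interpretability, the sketch below fits a shallow tree on synthetic data and prints its feature importances and decision rules. The column names echo the dataset, but the values and the target rule (older, inactive customers leave) are hypothetical and are not a finding about the bank data.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(11)
n = 1000

# Synthetic features with hypothetical values
X_demo = pd.DataFrame({
    "Age": rng.integers(18, 93, size=n),
    "IsActiveMember": rng.integers(0, 2, size=n),
    "EstimatedSalary": rng.uniform(11.58, 199_992, size=n),
})
# Hypothetical rule used only to generate labels for this demo
y_demo = ((X_demo["Age"] > 50) & (X_demo["IsActiveMember"] == 0)).astype(int)

tree = DecisionTreeClassifier(max_depth=2, random_state=11).fit(X_demo, y_demo)

# Importance of each feature, largest first
importances = pd.Series(
    tree.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)

# Human-readable if/else rules learned by the tree
rules = export_text(tree, feature_names=list(X_demo.columns))
print(rules)
```

The printed rules use thresholds in the original units, which is what makes them easy to discuss with business stakeholders.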

#### Decision Tree