# Summary

Welcome to the Binary Classification with Bank Churn Dataset competition! The goal of this challenge is to predict whether a bank customer will churn. The dataset has 13 columns describing customer characteristics and behavior:

    01. Customer ID: A unique identifier for each customer.
    02. Surname: The customer's last name or surname.
    03. Credit Score: A numerical representation of the customer's credit score.
    04. Geography: The country where the customer resides, with options including France, Spain, or Germany.
    05. Gender: The customer's gender, categorized as Male or Female.
    06. Age: The customer's age.
    07. Tenure: The number of years the customer has been associated with the bank.  
    08. Balance: The current account balance of the customer.
    09. NumOfProducts: The count of bank products the customer uses, such as savings accounts or credit cards.
    10. HasCrCard: A binary indicator of whether the customer possesses a credit card (1 = yes, 0 = no).
    11. IsActiveMember: A binary indicator of the customer's active membership status (1 = yes, 0 = no).
    12. EstimatedSalary: The estimated salary of the customer.
    13. Exited: The target variable indicating whether the customer has churned (1 = yes, 0 = no).
    
Submissions are evaluated using the area under the ROC curve (AUC-ROC), a widely used metric for binary classification models. For more details about the original dataset, see the Kaggle dataset page: [Bank Customer Churn Prediction](https://www.kaggle.com/datasets/shantanudhakadd/bank-customer-churn-prediction). Let's get started and aim for accurate churn predictions!
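As a rough illustration of what AUC-ROC measures, the sketch below computes it from first principles: the fraction of (churned, retained) pairs in which the churned customer receives the higher score, with ties counted as half. The function name `pairwise_auc` and the toy labels and scores are made up for illustration; in practice `sklearn.metrics.roc_auc_score` does this far more efficiently.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical data): 1 = churned, 0 = stayed
labels = [0, 0, 1, 1, 0, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70]
auc = pairwise_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

Because only the ordering of the scores matters, the metric rewards well-ranked probabilities rather than hard 0/1 predictions.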

# Imports

In [1]:
# Libraries for data wrangling
import pandas as pd
import numpy as np 
# Libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Library for progress bars
from tqdm import tqdm


# Libraries with functions used in modelling
from sklearn.base import clone
from sklearn.inspection import permutation_importance
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform
from category_encoders import MEstimateEncoder
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.metrics import roc_auc_score, roc_curve, make_scorer, mean_squared_error, accuracy_score
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler

# Libraries with the models
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.ensemble import HistGradientBoostingClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.naive_bayes import GaussianNB, BernoulliNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
import optuna

# Library to ignore warnings
import warnings
warnings.filterwarnings("ignore")



In [2]:
sns.set_theme(style = 'white', palette = 'viridis')
pal = sns.color_palette('viridis')

In [3]:
seed = 10

# Loading data

In [4]:
df_train = pd.read_csv("data/train.csv", index_col='id')
df_test = pd.read_csv("data/test.csv", index_col='id')

In [5]:
# Split the feature names by dtype using the public pandas API
numerical_features = list(df_test.select_dtypes(include='number').columns)
categorical_features = list(df_test.drop(columns=numerical_features).columns)

# Encoding of the data

In [6]:
# Columns to one-hot encode
onehot_columns = ['Geography', 'Gender']

# Extract the categorical columns from the DataFrame
df_onehot = df_train[onehot_columns]

# Create the encoder; dense output is needed to build a DataFrame
# (on scikit-learn < 1.2 the argument is named `sparse` instead of `sparse_output`)
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

# Fit and transform the categorical columns, keeping the original index so that
# the concatenation below aligns rows by 'id' rather than relying on row position
df_train_encoded = pd.DataFrame(encoder.fit_transform(df_onehot), columns=encoder.get_feature_names_out(onehot_columns), index=df_train.index)

# Concatenate the one-hot encoded columns with the original DataFrame
df_train = pd.concat([df_train, df_train_encoded], axis=1)

# Drop the original categorical columns
df_train = df_train.drop(onehot_columns, axis=1)




In [7]:
mencoder = MEstimateEncoder(cols=['Surname'])

In [8]:
df_train_mencoded = mencoder.fit_transform(df_train.drop(columns=['Exited']), df_train.Exited)
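To make the `Surname` encoding less opaque, here is a minimal pure-Python sketch of the M-estimate idea: each category is replaced by a smoothed target mean, `(n_pos + prior * m) / (n + m)`, where `prior` is the overall churn rate. The helper `m_estimate_encode`, the toy surnames, and `m = 1.0` are assumptions for illustration; `MEstimateEncoder` may differ in details such as how it handles unseen or missing categories.

```python
from collections import defaultdict


def m_estimate_encode(categories, targets, m=1.0):
    """Return ({category: smoothed target mean}, global prior)."""
    prior = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    # Smoothed mean: categories with few rows are pulled toward the prior
    encoding = {cat: (sums[cat] + prior * m) / (counts[cat] + m) for cat in counts}
    return encoding, prior


# Hypothetical toy data: surname and whether the customer churned
surnames = ["Smith", "Smith", "Smith", "Chen", "Chen", "Rossi"]
exited = [1, 0, 0, 1, 1, 0]

encoding, prior = m_estimate_encode(surnames, exited, m=1.0)
for name, value in encoding.items():
    print(f"{name}: {value:.3f}")

# In this sketch, a surname never seen in training falls back to the prior
print(f"unseen -> {encoding.get('Ivanov', prior):.3f}")
```

A surname that occurs only once ends up close to the overall churn rate, which limits how much the model can memorize rare names.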

In [9]:
# Columns to one-hot encode (same as for the training set)
onehot_columns = ['Geography', 'Gender']

# Extract the categorical columns from the test DataFrame
df_onehot_test = df_test[onehot_columns]

# Reuse the encoder fitted on the training data instead of refitting on the test set,
# so both sets are guaranteed to share the same encoded columns
df_test_encoded = pd.DataFrame(encoder.transform(df_onehot_test), columns=encoder.get_feature_names_out(onehot_columns))

# Join the encoded columns by row position and drop the original categorical columns
df_test = pd.merge(left=df_test.reset_index().reset_index(), right=df_test_encoded.reset_index(), how='inner', on='index').drop(columns=['index'] + onehot_columns)

In [10]:
df_test = df_test.set_index('id')

In [11]:
df_test_mencoded = mencoder.transform(df_test)

In [12]:
df_test_mencoded

| id | CustomerId | Surname | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Geography_France | Geography_Germany | Geography_Spain | Gender_Female | Gender_Male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 165034 | 15773898 | 0.188126 | 586 | 23.0 | 2 | 0.00 | 2 | 0.0 | 1.0 | 160976.75 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 165035 | 15782418 | 0.263225 | 683 | 46.0 | 2 | 0.00 | 1 | 1.0 | 0.0 | 72549.27 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 165036 | 15807120 | 0.175690 | 656 | 34.0 | 7 | 0.00 | 2 | 1.0 | 0.0 | 138882.09 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 165037 | 15808905 | 0.187270 | 681 | 36.0 | 8 | 0.00 | 1 | 1.0 | 0.0 | 113931.57 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 165038 | 15607314 | 0.267633 | 752 | 38.0 | 10 | 121263.62 | 1 | 1.0 | 0.0 | 139431.00 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 275052 | 15662091 | 0.161045 | 570 | 29.0 | 7 | 116099.82 | 1 | 1.0 | 1.0 | 148087.62 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 275053 | 15774133 | 0.242410 | 575 | 36.0 | 4 | 178032.53 | 1 | 1.0 | 1.0 | 42181.68 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 275054 | 15728456 | 0.271010 | 712 | 31.0 | 2 | 0.00 | 2 | 1.0 | 0.0 | 16287.38 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 275055 | 15687541 | 0.255843 | 709 | 32.0 | 3 | 0.00 | 1 | 1.0 | 1.0 | 158816.58 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |


# LGBM optimization

## Data processing

In [13]:
data = df_train_mencoded
target = df_train.Exited

In [14]:
from sklearn.model_selection import cross_val_score

In [17]:
def objective(trial, data=data, target=target, seed=seed):
    # Search space; n_estimators, boosting_type and colsample_bytree stay fixed
    param = {
        'random_state': seed,
        'n_estimators': 100,
        'boosting_type': 'dart',
        # Regularization strengths are sampled on a log scale
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-3, 10.0, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-3, 10.0, log=True),
        'colsample_bytree': 0.8,
        # Note: with the default subsample_freq=0, LightGBM does not apply row subsampling
        'subsample': trial.suggest_categorical('subsample', [0.4,0.5,0.6,0.7,0.8,1.0]),
        'learning_rate': trial.suggest_categorical('learning_rate', [0.006,0.008,0.01,0.014,0.017,0.02]),
        'max_depth': trial.suggest_categorical('max_depth', [10,20,50]),
        'num_leaves' : trial.suggest_int('num_leaves', 2, 500),
        'min_child_samples': trial.suggest_int('min_child_samples', 1, 300)
    }
    model = LGBMClassifier(**param)

    # Objective value: mean ROC AUC over 5 cross-validation folds
    score = cross_val_score(model, data, target, cv=5, scoring='roc_auc', n_jobs=-1).mean()

    return score
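The two regularization strengths above are searched on a log scale between 1e-3 and 10. As a sketch of what that means, the snippet below draws log-uniform samples with only the standard library: it samples uniformly in log space and exponentiates, so each decade receives roughly the same share of draws. The helper name `sample_loguniform` is hypothetical, and this is not Optuna's internal sampler (its default TPE sampler also adapts to earlier trials).

```python
import math
import random


def sample_loguniform(low, high, rng):
    """Draw one value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(10)  # fixed seed so the sketch is reproducible
samples = [sample_loguniform(1e-3, 10.0, rng) for _ in range(10_000)]

# Count how many draws land in each decade:
# [1e-3, 1e-2), [1e-2, 1e-1), [1e-1, 1), [1, 10]
decades = [0, 0, 0, 0]
for value in samples:
    idx = int(math.floor(math.log10(value))) + 3
    idx = min(max(idx, 0), 3)  # guard against floating-point edge cases
    decades[idx] += 1

print(decades)  # roughly 2,500 per decade
```

A plain uniform draw on the same interval would put about 90% of the samples between 1 and 10, leaving small regularization values almost unexplored.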

In [18]:
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
print(f'Number of finished trials: {len(study.trials)}')
print(f'Best trial: {study.best_trial.params}')

[I 2024-01-11 15:19:38,741] A new study created in memory with name: no-name-581abe9d-2f87-4c7d-b5de-bf405719b138
[I 2024-01-11 15:19:42,272] Trial 0 finished with value: 0.8934212947369833 and parameters: {'reg_alpha': 0.12985802439288033, 'reg_lambda': 0.13836791867676862, 'subsample': 0.4, 'learning_rate': 0.008, 'max_depth': 50, 'num_leaves': 43, 'min_child_samples': 88}. Best is trial 0 with value: 0.8934212947369833.
[I 2024-01-11 15:19:47,289] Trial 1 finished with value: 0.8963777755590213 and parameters: {'reg_alpha': 0.9391952575860697, 'reg_lambda': 0.08633836272281764, 'subsample': 0.8, 'learning_rate': 0.008, 'max_depth': 10, 'num_leaves': 210, 'min_child_samples': 296}. Best is trial 1 with value: 0.8963777755590213.
[I 2024-01-11 15:19:53,244] Trial 2 finished with value: 0.8970701424993195 and parameters: {'reg_alpha': 0.03604062614906168, 'reg_lambda': 0.9454219405299346, 'subsample': 1.0, 'learning_rate': 0.01, 'max_depth': 20, 'num_leaves': 499, 'min_child_samples': 

Number of finished trials: 100
Best trial: {'reg_alpha': 0.47912783030143985, 'reg_lambda': 0.0010280123809212156, 'subsample': 0.6, 'learning_rate': 0.02, 'max_depth': 20, 'num_leaves': 290, 'min_child_samples': 137}


In [None]:
#study.trials_dataframe()

In [19]:
optuna.visualization.plot_optimization_history(study)

In [21]:
optuna.visualization.plot_slice(study, params=['reg_alpha', 'reg_lambda', 'subsample', 'learning_rate', 'max_depth', 'num_leaves', 'min_child_samples'])

In [22]:
optuna.visualization.plot_param_importances(study)

In [None]:
#splits = 5
#skf = StratifiedKFold(n_splits = splits, random_state = seed, shuffle = True)
#np.random.seed(seed)

In [None]:
#params = study.best_params
#params['random_state'] = seed
#params['n_estimators'] = 100

In [23]:
study.best_params

{'reg_alpha': 0.47912783030143985,
 'reg_lambda': 0.0010280123809212156,
 'subsample': 0.6,
 'learning_rate': 0.02,
 'max_depth': 20,
 'num_leaves': 290,
 'min_child_samples': 137}

In [25]:
# Rebuild the model with the fixed settings and the best parameters found by the study
best_model = LGBMClassifier(
    random_state=seed,
    n_estimators=100,
    boosting_type='dart',
    colsample_bytree=0.8,
    reg_alpha=0.47912783030143985,
    reg_lambda=0.0010280123809212156,
    subsample=0.6,
    learning_rate=0.02,
    max_depth=20,
    num_leaves=290,
    min_child_samples=137,
)

In [26]:
best_model.fit(data, target)

[LightGBM] [Info] Number of positive: 34921, number of negative: 130113
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001039 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1373
[LightGBM] [Info] Number of data points in the train set: 165034, number of used features: 15
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.211599 -> initscore=-1.315315
[LightGBM] [Info] Start training from score -1.315315
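The `initscore` in the log above appears to be the log-odds of the positive-class share reported as `pavg`. A quick check with the class counts from the log, using only the standard library:

```python
import math

# Class counts reported in the LightGBM log above
n_positive = 34921
n_negative = 130113

pavg = n_positive / (n_positive + n_negative)  # share of churned customers
initscore = math.log(pavg / (1.0 - pavg))      # log-odds of churn

print(f"pavg      = {pavg:.6f}")
print(f"initscore = {initscore:.6f}")
```

Both values agree with the log to six decimals. Boosting therefore starts from a constant prediction of about 21% churn, and the trees learn corrections to it.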


In [29]:
# Predict churn probabilities (positive-class column) for the encoded test set
submission = df_test_mencoded.copy()
submission['Exited'] = best_model.predict_proba(submission)[:, 1]



In [32]:
submission.Exited.to_csv('data/submission_lgbm.csv')