# ⚽S4E1 - EDA & initial submission - Binary Classification with a Bank Churn Dataset 

Welcome to 2024! For this Episode of the Series, your task is to predict whether a customer continues with their account or closes it (i.e., churns). Good luck!

## Evaluation

Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.
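
As a rough illustration of the metric (not the competition's scoring code): ROC AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting half. The helper name `pairwise_auc` and the toy data below are made up for this sketch; in practice `sklearn.metrics.roc_auc_score` computes the same quantity far more efficiently.

```python
def pairwise_auc(y_true, y_score):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    Quadratic in the number of rows, so only suitable for small examples.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# toy labels and predicted churn probabilities (illustrative values only)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(y_true, y_score)
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly -> 0.75
```

Because only the ordering of the scores matters, any monotonic rescaling of the predicted probabilities leaves the AUC unchanged.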

## Submission Format

For each id in the test set, you must predict the probability for the target variable Exited. The file should contain a header and have the following format:

```
id,Exited
0,0.9
1,0.1
2,0.5
etc.
```
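
A quick sanity check of a submission file before uploading might look like the sketch below. The function name `validate_submission` and the specific checks are assumptions about what a well-formed file looks like, not Kaggle's actual validation rules.

```python
import csv
import io


def validate_submission(csv_text, expected_ids):
    """Return a list of problems found in a submission; empty means it looks fine."""
    problems = []
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows or rows[0] != ["id", "Exited"]:
        return ["header must be exactly 'id,Exited'"]
    seen_ids = []
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != 2:
            problems.append(f"line {line_no}: expected 2 fields, got {len(row)}")
            continue
        try:
            prob = float(row[1])
        except ValueError:
            problems.append(f"line {line_no}: '{row[1]}' is not a number")
            continue
        # this comparison is also False for NaN, so NaN values are flagged too
        if not 0.0 <= prob <= 1.0:
            problems.append(f"line {line_no}: probability {prob} outside [0, 1]")
        seen_ids.append(row[0])
    if sorted(seen_ids) != sorted(str(i) for i in expected_ids):
        problems.append("ids do not match the test set")
    return problems


good = "id,Exited\n0,0.9\n1,0.1\n2,0.5\n"
bad = "id,Exited\n0,1.2\n1,0.1\n"
print(validate_submission(good, [0, 1, 2]))  # no problems reported
print(validate_submission(bad, [0, 1, 2]))   # out-of-range value and a missing id
```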

## Data Description

The dataset for this competition (both train and test) was generated from a deep learning model trained on the Bank Customer Churn Prediction dataset. Feature distributions are close to, but not exactly the same as, the original.

# Code

## ToC

- [Imports](#Imports)


## Imports

In [None]:
# essentials
import os
import pathlib
from copy import copy


import pandas as pd
import numpy as np
from tqdm import tqdm

# visualisation
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

# sklearn imports
import sklearn
from sklearn.model_selection import train_test_split, StratifiedKFold, RandomizedSearchCV, cross_val_score
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, MaxAbsScaler, PowerTransformer, FunctionTransformer, StandardScaler
from sklearn.pipeline import Pipeline, make_pipeline, make_union, FeatureUnion
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, chi2, f_classif, SequentialFeatureSelector, RFECV
from sklearn.calibration import CalibratedClassifierCV
from sklearn.base import clone as clone_model
from sklearn.metrics import classification_report, confusion_matrix, log_loss
from sklearn.impute import SimpleImputer, MissingIndicator, KNNImputer


from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, HistGradientBoostingClassifier, ExtraTreesClassifier, BaggingClassifier, StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression, ElasticNet, SGDClassifier, RidgeClassifier, PassiveAggressiveClassifier, TweedieRegressor
from sklearn.svm import SVC, LinearSVC, NuSVC
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB, ComplementNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import RocCurveDisplay, roc_auc_score, make_scorer, roc_curve

from sklearn.preprocessing import Binarizer, Normalizer, RobustScaler, StandardScaler
from sklearn.preprocessing import FunctionTransformer

# others
import xgboost as xgb 
import lightgbm as lgb

import optuna
import shap

RANDOM_SEED = 64

palette = ["#4464ad", "#dc136c", "#F4FF52", "#f58f29","#45cb85"]

sns.set_theme(style="whitegrid")
sns.set_palette(palette)
sns.palplot(palette)

## Data loading & EDA

First we will check:

1. Number and types of columns
2. Number of rows in train and test
3. Missing values
4. Target variable distribution

In [None]:
IN_KAGGLE = False

kaggle_folder = "/kaggle/input/"
local_folder = "./data/"
input_folder = kaggle_folder if IN_KAGGLE else local_folder
train_df = pd.read_csv(input_folder + "playground-series-s4e1/train.csv", index_col="id")
test_df = pd.read_csv(input_folder + "playground-series-s4e1/test.csv", index_col="id")
submission_df = pd.read_csv(input_folder + "playground-series-s4e1/sample_submission.csv")
original_df = pd.read_csv(input_folder + "bank-customer-churn-prediction/Churn_Modelling.csv")

target_col = "Exited"

def initial_feature_engineering(df):
    df['HasCrCard'] = df['HasCrCard'].astype('bool')
    df['IsActiveMember'] = df['IsActiveMember'].astype('bool')
    return df

train_df = initial_feature_engineering(train_df)

train_df.head()

In [None]:
num_columns = len(train_df.columns)
num_rows = len(train_df)

train_df.info()

<div>
    <div style="background-color: #4F8EC9; padding: 10px; border-radius: 5px; border: 5px solid #3C6D9C; margin: 10px 0;">
        <h4>Data shape</h4>
        <p>Data contains 13 columns, of which 8 are numeric, 2 are boolean and 3 are categorical. There are no missing values.</p>
    </div>
</div>

### Comparison of target column value counts across train and original datasets

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(10, 6))

sns.despine()

fig.suptitle("Target distribution", fontsize=20)

sns.countplot(x=target_col, data=train_df, hue=target_col, ax=ax[0])
sns.countplot(x=target_col, data=original_df, hue=target_col, ax=ax[1])

plt.show()

In [None]:

total_train = len(train_df)
total_original = len(original_df)


train_target_pct = train_df[target_col].value_counts() / total_train
original_target_pct = original_df[target_col].value_counts() / total_original


fig, ax = plt.subplots(1, 2, figsize=(10, 6))

sns.despine()

fig.suptitle("Target distribution (proportion of rows)", fontsize=20)

sns.barplot(x=train_target_pct.index, y=train_target_pct.values, ax=ax[0], hue=train_target_pct.values, palette=palette)
sns.barplot(x=original_target_pct.index, y=original_target_pct.values, ax=ax[1], hue=original_target_pct.values, palette=palette)
plt.show()

In [None]:
train_df[target_col].value_counts()

The competition dataset and the original have a similar target distribution: around 80% of customers stay with the bank in each.


Next steps:

5. Distribution of numeric features
6. Distribution of categorical features
7. Correlation between numeric features and target
8. Chi-square test on categorical features and target

## Distribution of numeric features

In [None]:
numeric_features = train_df.select_dtypes(include=np.number).columns.tolist()
# remove target col from list
numeric_features.remove(target_col)
print(numeric_features)

#### CustomerId - 'unique' identifier?

In [None]:
train_df['CustomerId'].value_counts()

In [None]:
val_cnt = train_df['CustomerId'].value_counts().reset_index()
val_cnt[ val_cnt['count'] > 1 ]

#### CreditScore


In [None]:
fig, ax = plt.subplots(1, 2, figsize=(16, 8))

sns.despine()

fig.suptitle(f"CreditScore histogram ({target_col} = 0, {target_col} = 1) ", fontsize=20)

sns.histplot(x="CreditScore", data=train_df[train_df[target_col] == 0], ax=ax[0], kde=True, color=palette[0])
sns.histplot(x="CreditScore", data=train_df[train_df[target_col] == 1], ax=ax[1], kde=True, color=palette[1])

<div>
    <div style="background-color: #4F8EC9; padding: 10px; border-radius: 5px; border: 5px solid #3C6D9C; margin: 10px 0;">
        <h4>Credit score</h4>
        <p>
          <ul>
            <li>Distribution has slight left skew, and we see the 850 cutoff as it's the max FICO credit score.</li>
            <li>Min value is 350, max is 850, 50% of customers have score of at least 659. FICO scores could be as low as 300, but we don't have such customers in the dataset.</li>
            <li>Credit score is not very different between customers who stay and leave</li>
          </ul>        
        </p>
    </div>
</div>

#### Age

In [None]:
train_df['Age'].describe()

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(16, 8))

sns.despine()
fig.suptitle(f"Age histogram ({target_col} = 0, {target_col} = 1) ", fontsize=20)

sns.histplot(x="Age", data=train_df[train_df[target_col] == 0], ax=ax[0], kde=True, color=palette[0])
sns.histplot(x="Age", data=train_df[train_df[target_col] == 1], ax=ax[1], kde=True, color=palette[1])

Min Age is 18 while the oldest customer is 92 years old.

#### Tenure

In [None]:
train_df['Tenure'].describe()

In [None]:
train_df['Tenure'].value_counts()

In [None]:
# note: the numerators count unique CustomerIds, while the denominator is the number of rows;
# if CustomerId repeats across rows (see the check above), these shares are lower than row-based ones
customers_less_than_1_yr = train_df[train_df['Tenure'] < 1]['CustomerId'].unique()
customers_more_than_9_yrs = train_df[train_df['Tenure'] > 9]['CustomerId'].unique()
total_cust = len(train_df)
print(f"Customers with less than 1 year of tenure: {len(customers_less_than_1_yr)/total_cust*100:.2f}%")
print(f"Customers with more than 9 years of tenure: {len(customers_more_than_9_yrs)/total_cust*100:.2f}%")

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(16, 8))

sns.despine()
fig.suptitle(f"Tenure histogram", fontsize=20)

sns.histplot(x="Tenure", data=train_df, ax=ax, kde=True, color=palette[0])

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(16, 8))

sns.despine()
fig.suptitle(f"Tenure histogram by {target_col}", fontsize=20)

# `color` is ignored when `hue` is set, so it is omitted here
sns.histplot(x="Tenure", data=train_df, ax=ax, kde=True, hue=target_col)

Around 2% of customers have been with the bank for more than 9 years. Around the same share of customers joined the bank less than 1 year ago.

#### Balance

In [None]:
train_df['Balance'].describe()

In [None]:
train_df[train_df[target_col] == 0]['Balance'].describe()

In [None]:
train_df[train_df[target_col] == 1]['Balance'].describe()

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(16, 8))

sns.despine()
fig.suptitle(f"Balance histogram ({target_col} = 0, {target_col} = 1) ", fontsize=20)
# distribution plot 

sns.histplot(x="Balance", data=train_df[train_df[target_col] == 0], ax=ax[0], kde=True, color=palette[0])
sns.histplot(x="Balance", data=train_df[train_df[target_col] == 1], ax=ax[1], kde=True, color=palette[1])

sns.displot(train_df, x="Balance", hue=target_col, kind="ecdf")

<div>
    <div style="background-color: #4F8EC9; padding: 10px; border-radius: 5px; border: 5px solid #3C6D9C; margin: 10px 0;">
        <h4>Balance</h4>
        <p>
          <ul>
            <li>More than 50% of customers had a balance of 0 (at the time of data collection). The max balance is around 250k.</li>
            <li>Among customers who exited the bank, 50% of them had a balance of at least ~100k</li>
          </ul>        
        </p>
    </div>
</div>

#### NumOfProducts

In [None]:
train_df['NumOfProducts'].describe()

In [None]:
train_df[train_df[target_col] == 0]['NumOfProducts'].describe()

In [None]:
train_df[train_df[target_col] == 1]['NumOfProducts'].describe()

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(16, 8))

sns.despine()
fig.suptitle(f"NumOfProducts histogram ({target_col} = 0, {target_col} = 1) ", fontsize=20)

sns.histplot(x="NumOfProducts", data=train_df[train_df[target_col] == 0], ax=ax[0], kde=True, color=palette[0])
sns.histplot(x="NumOfProducts", data=train_df[train_df[target_col] == 1], ax=ax[1], kde=True, color=palette[1])

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(16, 8))

sns.despine()
fig.suptitle(f"NumOfProducts histogram ({target_col} = 0, {target_col} = 1) ", fontsize=20)

sns.histplot(x="NumOfProducts", data=train_df, ax=ax, kde=True, hue=target_col, color=palette[0])

In [None]:
three_or_more_products = train_df[train_df['NumOfProducts'] >= 3]
total_num_three_or_more_products = len(three_or_more_products)
val_cnt = three_or_more_products[target_col].value_counts()
# calculate pct
val_cnt = val_cnt / total_num_three_or_more_products * 100
val_cnt

In [None]:
less_than_three = train_df[train_df['NumOfProducts'] < 3]
less_than_three[target_col].value_counts()

<div>
    <div style="background-color: #4F8EC9; padding: 10px; border-radius: 5px; border: 5px solid #3C6D9C; margin: 10px 0;">
        <h4>NumOfProducts</h4>
        <p>
          <ul>
            <li>In both groups, min number of products was 1 while max was 4</li>
            <li>Among customers who stay with the bank, most have 2 products, while having 1 product is the second most common</li>
            <li>Among customers who exited the bank, having 1 product is the most common, but having 3 or 4 products is much more common than in the other group</li>
            <li>Almost 90% of customers who have 3 or 4 products leave the bank</li>
          </ul>        
        </p>
    </div>
</div>

#### EstimatedSalary

In [None]:
train_df["EstimatedSalary"].describe()

In [None]:
train_df[train_df[target_col] == 0]["EstimatedSalary"].describe()

In [None]:
train_df[train_df[target_col] == 1]["EstimatedSalary"].describe()

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(16, 8))

sns.despine()
fig.suptitle(f"EstimatedSalary histogram ({target_col} = 0, {target_col} = 1) ", fontsize=20)

sns.histplot(x="EstimatedSalary", data=train_df[train_df[target_col] == 0], ax=ax[0], kde=True, color=palette[0])
sns.histplot(x="EstimatedSalary", data=train_df[train_df[target_col] == 1], ax=ax[1], kde=True, color=palette[1])

People who leave the bank have a slightly higher salary than those who stay.

## Categorical features

In [None]:
categorical_features = train_df.select_dtypes(exclude=np.number).columns.tolist()

print(categorical_features)

#### HasCrCard

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(16, 8))

sns.despine()
fig.suptitle(f"HasCrCard distribution between target groups", fontsize=20)

val_cnt = train_df[[target_col, 'HasCrCard']].value_counts(normalize=True).reset_index()
pivot_table = val_cnt.pivot_table(index='HasCrCard', columns=target_col, values='proportion', aggfunc=sum, fill_value=0)
sns.heatmap(pivot_table, annot=True, fmt=".2f", cmap="Reds")
val_cnt

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(16, 8))

sns.despine()
fig.suptitle("HasCrCard distribution between target groups (proportions among target groups)", fontsize=20)

val_cnt = train_df[[target_col, 'HasCrCard']].value_counts().reset_index()
pivot_table = val_cnt.pivot_table(index='HasCrCard', columns=target_col, values='count', aggfunc=sum, fill_value=0)
# divide each cell by total sum in each column
pivot_table = pivot_table / pivot_table.sum(axis=0)
sns.heatmap(pivot_table, annot=True, fmt=".2f", cmap="Reds")
val_cnt

#### IsActiveMember

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(16, 8))

sns.despine()
fig.suptitle(f"IsActiveMember distribution between target groups", fontsize=20)

val_cnt = train_df[[target_col, 'IsActiveMember']].value_counts(normalize=True).reset_index()
pivot_table = val_cnt.pivot_table(index='IsActiveMember', columns=target_col, values='proportion', aggfunc=sum, fill_value=0)

sns.heatmap(pivot_table, annot=True, fmt=".2f", cmap="Reds")
val_cnt

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(16, 8))

sns.despine()
fig.suptitle(f"IsActiveMember distribution between target groups (proportions among target groups)", fontsize=20)

val_cnt = train_df[[target_col, 'IsActiveMember']].value_counts().reset_index()
pivot_table = val_cnt.pivot_table(index='IsActiveMember', columns=target_col, values='count', aggfunc=sum, fill_value=0)
# divide each cell by total sum in each column
pivot_table = pivot_table / pivot_table.sum(axis=0)
sns.heatmap(pivot_table, annot=True, fmt=".2f", cmap="Reds")
val_cnt

<div>
    <div style="background-color: #4F8EC9; padding: 10px; border-radius: 5px; border: 5px solid #3C6D9C; margin: 10px 0;">
        <h4>HasCrCard & IsActiveMember</h4>
        <p>
          <ul>
            <li>Having a credit card does not seem to contribute to staying or leaving.</li>
            <li>About 70% of customers who left were not active members, whereas about 55% of customers who stayed were active members.</li>
          </ul>        
        </p>
    </div>
</div>

#### Gender

In [None]:
train_df[[target_col, 'Gender']].value_counts()

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(16, 8))

sns.despine()
fig.suptitle(f"Gender proportions among target groups", fontsize=20)

val_cnt = train_df[[target_col, 'Gender']].value_counts().reset_index()
pivot_table = val_cnt.pivot_table(index='Gender', columns=target_col, values='count', aggfunc=sum, fill_value=0)
# divide each cell by total sum in each column
pivot_table = pivot_table / pivot_table.sum(axis=0)
sns.heatmap(pivot_table, annot=True, fmt=".2f", cmap="Reds")
val_cnt

Among people who stay with the bank, 60% are men, while 60% of people who leave the bank are women.

#### Geography

In [None]:
train_df['Geography'].value_counts()

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(16, 8))

sns.despine()
fig.suptitle(f"Geography proportions among target groups", fontsize=20)

val_cnt = train_df[[target_col, 'Geography']].value_counts().reset_index()
pivot_table = val_cnt.pivot_table(index='Geography', columns=target_col, values='count', aggfunc=sum, fill_value=0)
# divide each cell by total sum in each column
pivot_table = pivot_table / pivot_table.sum(axis=0)
sns.heatmap(pivot_table, annot=True, fmt=".2f", cmap="Reds")
val_cnt

#### Surname

In [None]:
train_df['Surname'].value_counts()

Not likely to be useful for prediction. We will drop this column before training.

## Correlation between numeric features and target

In [None]:
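# plot correlation matrix

corr_matrix = train_df[numeric_features + [target_col]].corr(method='pearson')

# create a new figure; without this, `fig` would still point at the previous cell's figure
fig, ax = plt.subplots(1, 1, figsize=(16, 8))
fig.suptitle("Correlation matrix", fontsize=20)

sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="Reds", ax=ax)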

There are no strongly correlated features.
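
One caveat worth keeping in mind is that the Pearson coefficient only measures linear association, so a value near zero does not rule out a dependence of another shape. The toy sketch below (the helper `pearson_r` and the data are illustrative, not taken from the competition data) shows a variable that is fully determined by another and still has zero Pearson correlation with it.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


xs = [-2, -1, 0, 1, 2]
linear = [2 * x + 1 for x in xs]   # perfectly linear in x
squared = [x ** 2 for x in xs]     # fully determined by x, but not linearly

r_linear = pearson_r(xs, linear)
r_squared = pearson_r(xs, squared)
print(r_linear, r_squared)
```

Tree-based models such as the ones trained later can still pick up non-linear relationships that this matrix does not show.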

## Chi-square test on categorical features and target

Here we want to create a table that tells us which categorical features differ the most between the target groups. We will use a chi-square test to check whether each difference is statistically significant.

In [None]:
from scipy.stats import chi2_contingency

def chi_square_test(dataframe, target_col, categorical_features):
    # Create a dictionary to hold the results
    results = {"Feature": [], "Chi-Square Statistic": [], "p-Value": [], "Significant": []}

    # Iterate over each categorical feature
    for feature in categorical_features:
        # Create a contingency table of feature values against the target
        contingency_table = pd.crosstab(dataframe[feature], dataframe[target_col])

        # Perform the chi-square test (named chi2_stat so it does not shadow sklearn's chi2 import)
        chi2_stat, p, dof, expected = chi2_contingency(contingency_table)

        # Determine if the result is significant (0.005 is stricter than the conventional 0.05)
        significant = p < 0.005

        # Append the results to the dictionary
        results["Feature"].append(feature)
        results["Chi-Square Statistic"].append(chi2_stat)
        results["p-Value"].append(p)
        results["Significant"].append(significant)

    # Convert the dictionary to a DataFrame and return it
    return pd.DataFrame(results).sort_values(by="Chi-Square Statistic", ascending=False)

In [None]:
results_df = chi_square_test(train_df, target_col, ["Geography", "IsActiveMember", "Gender", "HasCrCard"])
results_df
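
To make the statistic in the table less of a black box, here is a minimal hand computation of the Pearson chi-square statistic for a contingency table. The function name `chi_square_statistic` and the counts are made up for illustration. Note that `scipy.stats.chi2_contingency` applies Yates' continuity correction by default for 2x2 tables, so for such tables it returns a slightly smaller value than this uncorrected formula unless `correction=False` is passed.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # expected count under independence of the row and column variables
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


# made-up counts: rows = feature value (e.g. False/True), columns = Exited (0/1)
toy_table = [[10, 20],
             [30, 40]]
stat, dof = chi_square_statistic(toy_table)
print(round(stat, 4), dof)  # 0.7937 1
```

With a training set this large, even small differences between groups produce very small p-values, so the size of the statistic is usually more informative for ranking features than the significance flag.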

## Model training

In [None]:
train_df = pd.read_csv(input_folder + "playground-series-s4e1/train.csv", index_col="id")
test_df = pd.read_csv(input_folder + "playground-series-s4e1/test.csv", index_col="id")
submission_df = pd.read_csv(input_folder + "playground-series-s4e1/sample_submission.csv")
original_df = pd.read_csv(input_folder + "bank-customer-churn-prediction/Churn_Modelling.csv")
target_col = "Exited"

numeric_features = ['CustomerId', 'CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']
categorical_features = ['Surname', 'Geography', 'Gender', 'HasCrCard', 'IsActiveMember']

features_to_drop = ['CustomerId', 'Surname']

for f in features_to_drop:
    if f in numeric_features:
        numeric_features.remove(f)
    if f in categorical_features:
        categorical_features.remove(f)
    
    train_df = train_df.drop(columns=f)

def initial_feature_engineering(df):
    df['HasCrCard'] = df['HasCrCard'].astype('bool')
    df['IsActiveMember'] = df['IsActiveMember'].astype('bool')
    df['Gender'] = df['Gender'].map({ "Male": 0, "Female": 1}).astype("bool")
    # encode geography
    df = pd.get_dummies(df, columns=['Geography'])

    return df


train_df = initial_feature_engineering(train_df)

X_train, X_val, y_train, y_val = train_test_split(train_df.drop(columns=target_col), train_df[target_col], test_size=0.2, random_state=RANDOM_SEED, stratify=train_df[target_col])

In [None]:

def create_pipeline(model, numeric_scalers=("scaler", StandardScaler())):
    numeric_pipeline = Pipeline(
        [numeric_scalers]
    )

    categorical_pipeline = Pipeline([
        #("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        ("one_hot_encoder", OneHotEncoder(handle_unknown="ignore", drop='if_binary')),
    ])

    preprocessor = ColumnTransformer([
        ("numeric", numeric_pipeline, numeric_features),
        #("categorical", categorical_pipeline, categorical_features),
    ], remainder='passthrough')

    return Pipeline([
        ("preprocessor", preprocessor),
        ("classifier", model),
    ])

def train_models(models, X_train, y_train):
    trained_models = {}
    for model_name, model in tqdm(models.items()):
        model = create_pipeline(model)
        model.fit(X_train, y_train)
        trained_models[model_name] = model
    return trained_models

def evaluate_models(models, X_val, y_val):
    # collect one row per model: "model_name", "accuracy", "precision", "recall", "auc" (area under the ROC curve)
    rows = []

    for model_name, model in tqdm(models.items()):
        y_pred = model.predict(X_val)
        y_proba = model.predict_proba(X_val)[:, 1]
        rows.append({
            "model_name": model_name,
            "accuracy": model.score(X_val, y_val),
            "precision": sklearn.metrics.precision_score(y_val, y_pred),
            "recall": sklearn.metrics.recall_score(y_val, y_pred),
            "auc": sklearn.metrics.roc_auc_score(y_val, y_proba),
        })
    # building the frame once avoids repeatedly concatenating onto an empty DataFrame
    return pd.DataFrame(rows, columns=["model_name", "accuracy", "precision", "recall", "auc"])

def plot_roc_curve(models, X_val, y_val):
    fig, ax = plt.subplots(1, 1, figsize=(16, 8))
    palette_to_use = sns.color_palette("husl", len(models))
    # for each model, plot the roc curve in the same plot, with other color
    for i, (model_name, model) in enumerate(models.items()):
        y_proba = model.predict_proba(X_val)[:, 1]
        fpr, tpr, _ = roc_curve(y_val, y_proba)
        roc_auc = roc_auc_score(y_val, y_proba)
        ax.plot(fpr, tpr, label=f"{model_name} (AUC = {roc_auc:.2f})", color=palette_to_use[i])
    # chance-level diagonal, drawn once rather than once per model
    ax.plot([0, 1], [0, 1], color='black', linestyle='--')
    ax.set_xlabel("False Positive Rate")
    ax.set_ylabel("True Positive Rate")
    ax.set_title("ROC Curve")
    # show legend
    ax.legend()


In [None]:
models = {
    "xgboost": xgb.XGBClassifier(random_state=RANDOM_SEED, n_jobs=-1),
    "lightgbm": lgb.LGBMClassifier(random_state=RANDOM_SEED, n_jobs=-1, verbosity=-1),
    "logistic_regression": LogisticRegression(random_state=RANDOM_SEED, n_jobs=-1),
    "knn": KNeighborsClassifier(n_jobs=-1),
    "decision_tree": DecisionTreeClassifier(random_state=RANDOM_SEED),
    "random_forest": RandomForestClassifier(random_state=RANDOM_SEED, n_jobs=-1),
    "gradient_boosting": GradientBoostingClassifier(random_state=RANDOM_SEED),
    "extra_trees": ExtraTreesClassifier(random_state=RANDOM_SEED, n_jobs=-1),
    "bagging": BaggingClassifier(random_state=RANDOM_SEED, n_jobs=-1),
    "sgd": SGDClassifier(random_state=RANDOM_SEED, loss="log_loss", n_jobs=-1),
}

print("Training models...")
trained_models = train_models(models, X_train, y_train)
print("Evaluating models...")
results_df = evaluate_models(trained_models, X_val, y_val)
results_df.sort_values(by="auc", ascending=False)

In [None]:
plot_roc_curve(trained_models, X_val, y_val)

#### Bonus: Shapley values

In [None]:
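pipeline = trained_models['lightgbm']
best_model = pipeline.named_steps["classifier"]
preprocessor = pipeline.named_steps["preprocessor"]

x_shap_sample = X_train.sample(500, random_state=RANDOM_SEED)

# the classifier was fitted on the preprocessor's output, so explain it on the same representation
feature_names = [name.split("__")[-1] for name in preprocessor.get_feature_names_out()]
x_shap_transformed = pd.DataFrame(
    preprocessor.transform(x_shap_sample), columns=feature_names, index=x_shap_sample.index
).astype(float)

explainer = shap.TreeExplainer(best_model, data=x_shap_transformed, feature_perturbation="interventional", model_output='probability')
# calling the explainer returns an Explanation object, which the shap.plots.* functions below expect
shap_values = explainer(x_shap_transformed)

shap.summary_plot(shap_values.values, x_shap_transformed, plot_type="violin")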

In [None]:
shap.plots.beeswarm(shap_values)

For a particular sample

In [None]:
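# X_train has no target column, so look the labels up in y_train and
# pick one row position per class within the SHAP sample
y_shap_sample = y_train.loc[x_shap_sample.index].to_numpy()
sample_target_0 = int(np.random.choice(np.flatnonzero(y_shap_sample == 0)))
sample_target_1 = int(np.random.choice(np.flatnonzero(y_shap_sample == 1)))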


shap.plots.waterfall(shap_values[sample_target_0])

In [None]:
shap.plots.force(shap_values[sample_target_0])

In [None]:
shap.plots.waterfall(shap_values[sample_target_1])

In [None]:
shap.plots.force(shap_values[sample_target_1])

## Optimizing hyperparameters for the best model

We will use Optuna in two passes: first to tune the other hyperparameters with the learning rate left at its default, then a second time to find the best learning rate. Only the first pass is shown below.

In [None]:
def objective_lightgbm(trial):
    lightgbm_model_optuna = lgb.LGBMClassifier(random_state=RANDOM_SEED, verbose=-1, n_jobs=-1)
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_SEED)
    clf = create_pipeline(lightgbm_model_optuna)
    params = {
        'classifier__n_estimators' : 377,
        #"classifier__learning_rate" : trial.suggest_float('learning_rate',1e-4, 0.25, log=True),
        "classifier__max_depth":trial.suggest_int('max_depth',3,50),
        'classifier__reg_alpha': trial.suggest_float('reg_alpha', 1e-3, 10.0),
        'classifier__reg_lambda': trial.suggest_float('reg_lambda', 1e-3, 10.0),
        "classifier__min_child_weight" : trial.suggest_float('min_child_weight', 0.5,4),
        "classifier__min_child_samples" : trial.suggest_int('min_child_samples',1,100),
        "classifier__subsample" : trial.suggest_float('subsample', 0.4, 1),
        "classifier__subsample_freq" : trial.suggest_int('subsample_freq',0,5),
        "classifier__colsample_bytree" : trial.suggest_float('colsample_bytree',0.2,1),
        "classifier__num_leaves" : trial.suggest_int('num_leaves', 2, 64*2),
        "classifier__max_bin" : trial.suggest_int('max_bin', 128, 1024),
    }
    
    clf.set_params(**params)
    return cross_val_score(clf, X_train, y_train, cv = skf, scoring='roc_auc', n_jobs=-1).mean()

In [None]:
%%time
study = optuna.create_study(direction='maximize')
study.optimize(objective_lightgbm, n_trials=1, timeout=1000)
print("Best score:", study.best_value)
print("Best params:", study.best_params)
lgbm_best_params = study.best_params
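
One detail to watch: `study.best_params` is keyed by the names passed to `trial.suggest_*` (`max_depth`, `reg_alpha`, ...), not by the `classifier__`-prefixed names that a pipeline's `set_params` expects. A small helper along these lines can map them back; the name `to_pipeline_params` and the example values are hypothetical.

```python
def to_pipeline_params(best_params, step="classifier", fixed=None):
    """Prefix Optuna parameter names with a pipeline step name.

    `fixed` holds values that were hard-coded in the objective and therefore
    never appear in `study.best_params` (for example the number of estimators).
    """
    merged = dict(fixed or {})
    merged.update(best_params)
    return {f"{step}__{name}": value for name, value in merged.items()}


# illustrative values only, not results from an actual study
example_best = {"max_depth": 7, "num_leaves": 31}
pipeline_params = to_pipeline_params(example_best, fixed={"n_estimators": 377})
print(pipeline_params)
```

The resulting dictionary can then be unpacked into `set_params` on a pipeline built with `create_pipeline`.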

# Submission

In [None]:
train_df = pd.read_csv(input_folder + "playground-series-s4e1/train.csv", index_col="id")
test_df = pd.read_csv(input_folder + "playground-series-s4e1/test.csv", index_col="id")
submission_df = pd.read_csv(input_folder + "playground-series-s4e1/sample_submission.csv")
original_df = pd.read_csv(input_folder + "bank-customer-churn-prediction/Churn_Modelling.csv")
target_col = "Exited"

numeric_features = ['CustomerId', 'CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']
categorical_features = ['Surname', 'Geography', 'Gender', 'HasCrCard', 'IsActiveMember']

features_to_drop = ['CustomerId', 'Surname']

for f in features_to_drop:
    if f in numeric_features:
        numeric_features.remove(f)
    if f in categorical_features:
        categorical_features.remove(f)
    
    test_df = test_df.drop(columns=f)
    train_df = train_df.drop(columns=f)

def initial_feature_engineering(df):
    df['HasCrCard'] = df['HasCrCard'].astype('bool')
    df['IsActiveMember'] = df['IsActiveMember'].astype('bool')
    df['Gender'] = df['Gender'].map({ "Male": 0, "Female": 1}).astype("bool")
    # encode geography
    df = pd.get_dummies(df, columns=['Geography'])

    return df

train_df = initial_feature_engineering(train_df)
test_df = initial_feature_engineering(test_df)
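
Because `pd.get_dummies` is applied to train and test separately, the two frames only end up with the same `Geography_*` columns if every category appears in both. A defensive sketch, using toy frames and assumed variable names, is to reindex the encoded test columns against the encoded training columns:

```python
import pandas as pd

# toy frames: the test frame happens to lack one Geography value
toy_train = pd.DataFrame({"Age": [30, 41, 52], "Geography": ["France", "Spain", "Germany"]})
toy_test = pd.DataFrame({"Age": [25, 63], "Geography": ["France", "Spain"]})

toy_train_enc = pd.get_dummies(toy_train, columns=["Geography"])
toy_test_enc = pd.get_dummies(toy_test, columns=["Geography"])

# add any dummy column missing from test (filled with 0) and match the column order
toy_test_aligned = toy_test_enc.reindex(columns=toy_train_enc.columns, fill_value=0)

print(list(toy_test_aligned.columns))
```

On the real data the reference columns would be the training features without the target column, since the test set has no `Exited` column.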

In [None]:
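# train model on train data
model = clone_model(trained_models['lightgbm'])
# study.best_params uses the bare trial names, so add the pipeline step prefix;
# n_estimators was fixed at 377 in the objective and is therefore not in best_params
model.set_params(classifier__n_estimators=377, **{f"classifier__{name}": value for name, value in lgbm_best_params.items()})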

X_train = train_df.drop(columns=target_col)
y_train = train_df[target_col]
X_test = test_df

model.fit(X_train, y_train)

In [None]:
# predict on test data
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]


# create submission df

submission_df = pd.DataFrame({
    "id": test_df.index,
    target_col: y_proba
})


submission_df.to_csv("submission.csv", index=False)
submission_df