<a href="https://colab.research.google.com/github/boiBASH/Elite-Bank-Project/blob/main/Data_Transformation_and_Model_Training_with_Dagshub.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%%capture
!pip install pyngrok
!pip install catboost
!pip install xgboost
!pip install shap
!pip install -q dagshub mlflow

In [None]:
# Import the necessary libraries
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import shap
import subprocess
import mlflow
import dagshub
from pyngrok import ngrok, conf
import getpass
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from mlflow.models.signature import infer_signature
from sklearn.preprocessing import OrdinalEncoder, StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix

In [None]:
dagshub.init(repo_owner='boiBASH', repo_name='Elite-Bank-Project', mlflow=True)

In [None]:
df = pd.read_csv("/content/Bank_Marketing_Dataset.csv")

In [None]:
# Select the column types
scale_columns = [
    "age",
    "balance",
    "day",
    "duration"
]

categorical_columns = df.select_dtypes(include = ["object"]).columns.tolist()
categorical_columns.remove("deposit")

In [None]:
# Extract features and labels from dataset
X, y = df.drop(labels = ["deposit"], axis = 1), df["deposit"]

In [None]:
# Encode labels
map_dictionary = {
    "yes": 1,
    "no": 0
}

y = y.map(map_dictionary)

In [None]:
# Separate into train and test splits
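# A fixed random_state keeps the split identical across reruns, so logged runs stay comparable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)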
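
The effect of `stratify` can be checked on a small synthetic example. The sketch below is illustrative only: the `toy_*` names and the 30/70 class balance are assumptions made up for the demonstration and are not taken from the bank marketing data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 30 positives and 70 negatives
toy = pd.DataFrame({"x": range(100), "deposit": [1] * 30 + [0] * 70})
toy_X, toy_y = toy[["x"]], toy["deposit"]

# Same call pattern as the cell above, with a fixed seed for repeatability
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=0
)

# With stratification, both splits keep the original share of positives
train_rate = toy_y_train.mean()
test_rate = toy_y_test.mean()
print(f"train positive rate: {train_rate:.2f}, test positive rate: {test_rate:.2f}")
```

Without `stratify`, the two rates would generally drift apart, which matters most when one class is rare.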

In [None]:
# Implement data preparation transformer
def get_transformer(categorical_columns, scale_columns, one_hot=False):
    """Encode the categorical columns, scale the numeric ones, and pass the rest through."""
    # Only the categorical encoder differs between the two variants
    if one_hot:
        encoder_step = ("onehot", OneHotEncoder(), categorical_columns)
    else:
        encoder_step = ("ord", OrdinalEncoder(), categorical_columns)
    transformer = ColumnTransformer(
        transformers=[
            encoder_step,
            ("scale", StandardScaler(), scale_columns)
        ],
        remainder="passthrough"
    )
    return transformer

In [None]:
def transform_data(df, transformer):
    """
    Fit and transform the DataFrame using the provided transformer.
    Returns a DataFrame with the appropriate feature names.
    """
    transformed_array = transformer.fit_transform(df)
    try:
        feature_names = transformer.get_feature_names_out()
    except AttributeError:
        feature_names = [f"feature_{i}" for i in range(transformed_array.shape[1])]
    return pd.DataFrame(transformed_array, columns=feature_names)

In [None]:
# Logistic Regression pipeline using one-hot encoding for categorical variables
lr_pipe = Pipeline(
    steps=[
        ("1", get_transformer(categorical_columns, scale_columns, one_hot=True)),
        ("2", LogisticRegression(max_iter=1000))
    ]
)

# CatBoost pipeline using ordinal encoding for categorical variables
cat_pipe = Pipeline(
    steps=[
        ("1", get_transformer(categorical_columns, scale_columns, one_hot=False)),
        ("2", CatBoostClassifier())
    ]
)

# XGBoost pipeline using ordinal encoding for categorical variables
xgb_pipe = Pipeline(
    steps=[
        ("1", get_transformer(categorical_columns, scale_columns, one_hot=False)),
        ("2", XGBClassifier())
    ]
)

In [None]:
# Fit LogisticRegression Model
lr_pipe.fit(X_train, y_train)

The format of the columns of the 'remainder' transformer in ColumnTransformer.transformers_ will change in version 1.7 to match the format of the other transformers.
At the moment the remainder columns are stored as indices (of type int). With the same ColumnTransformer configuration, in the future they will be stored as column names (of type str).



In [None]:
# Fit CatBoost model
cat_pipe.fit(X_train, y_train)

Learning rate set to 0.026238
0:	learn: 0.6762805	total: 51ms	remaining: 51s
1:	learn: 0.6623483	total: 55.2ms	remaining: 27.5s
2:	learn: 0.6512269	total: 59.1ms	remaining: 19.6s
3:	learn: 0.6376379	total: 62.7ms	remaining: 15.6s
4:	learn: 0.6268372	total: 66.4ms	remaining: 13.2s
5:	learn: 0.6147106	total: 70.4ms	remaining: 11.7s
6:	learn: 0.6046366	total: 74.1ms	remaining: 10.5s
7:	learn: 0.5948885	total: 77.9ms	remaining: 9.66s
8:	learn: 0.5860148	total: 81.7ms	remaining: 8.99s
9:	learn: 0.5781420	total: 85.4ms	remaining: 8.45s
10:	learn: 0.5694255	total: 89.2ms	remaining: 8.02s
11:	learn: 0.5613750	total: 93ms	remaining: 7.66s
12:	learn: 0.5539463	total: 96.6ms	remaining: 7.33s
13:	learn: 0.5469171	total: 100ms	remaining: 7.06s
14:	learn: 0.5401197	total: 104ms	remaining: 6.83s
15:	learn: 0.5339919	total: 108ms	remaining: 6.63s
16:	learn: 0.5283369	total: 112ms	remaining: 6.45s
17:	learn: 0.5229701	total: 115ms	remaining: 6.29s
18:	learn: 0.5166368	total: 119ms	remaining: 6.14s
19:	

In [None]:
# Fit XGBoost model
xgb_pipe.fit(X_train, y_train)

In [None]:
def train_and_log_pipeline(pipeline, model_name, preprocess_name="1", model_step="2"):
    with mlflow.start_run(run_name=model_name):
        # Train the pipeline
        pipeline.fit(X_train, y_train)

        # Predictions on raw (untransformed) data
        y_pred = pipeline.predict(X_test)
        y_prob = pipeline.predict_proba(X_test)[:, 1]

        # Compute evaluation metrics
        accuracy = accuracy_score(y_test, y_pred)
        precision = precision_score(y_test, y_pred)
        recall = recall_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred)
        roc_auc = roc_auc_score(y_test, y_prob)

        # Compute specificity
        tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
        specificity = tn / (tn + fp) if (tn + fp) > 0 else None

        # Infer the model signature from the raw (untransformed) input
        signature = infer_signature(X_test, y_pred)
        mlflow.sklearn.log_model(pipeline, model_name, signature=signature)

        # Log metrics
        mlflow.log_metric("accuracy", accuracy)
        mlflow.log_metric("precision", precision)
        mlflow.log_metric("recall", recall)
        mlflow.log_metric("f1_score", f1)
        mlflow.log_metric("roc_auc", roc_auc)
        if specificity is not None:
            mlflow.log_metric("specificity", specificity)

        # --- SHAP Explainability ---
        try:
            # Extract the model from the pipeline
            model = pipeline.named_steps[model_step]

            # Transform data for SHAP using the transformer step
            X_train_transformed = pipeline.named_steps[preprocess_name].transform(X_train)
            X_test_transformed = pipeline.named_steps[preprocess_name].transform(X_test)
            try:
                feature_names = pipeline.named_steps[preprocess_name].get_feature_names_out()
            except AttributeError:
                feature_names = [f"feature_{i}" for i in range(X_test_transformed.shape[1])]
            X_test_transformed = pd.DataFrame(X_test_transformed, columns=feature_names)

            # Compute SHAP values
            explainer = shap.Explainer(model, X_train_transformed)
            shap_values = explainer(X_test_transformed)
            shap_values_array = shap_values.values

            # SHAP Summary Plot
            shap.summary_plot(shap_values_array, X_test_transformed, show=False)
            summary_fig = plt.gcf()  # Get the current figure created by shap
            summary_fig.set_size_inches(10, 6)
            summary_plot_path = f"{model_name}_shap_summary.png"
            summary_fig.savefig(summary_plot_path, bbox_inches="tight")
            mlflow.log_artifact(summary_plot_path)
            plt.close(summary_fig)

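            # SHAP dependence plots for the 3 features with the largest mean |SHAP| value
            top_idx = np.argsort(np.abs(shap_values_array).mean(axis=0))[::-1][:3]
            top_features = X_test_transformed.columns[top_idx]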
            for feature in top_features:
                shap.dependence_plot(feature, shap_values_array, X_test_transformed, show=False)
                dep_fig = plt.gcf()
                dep_fig.set_size_inches(10, 6)
                dep_plot_path = f"{model_name}_shap_dependence_{feature}.png"
                dep_fig.savefig(dep_plot_path, bbox_inches="tight")
                mlflow.log_artifact(dep_plot_path)
                plt.close(dep_fig)

            print(f"✅ SHAP explanations logged for {model_name}")

        except Exception as e:
            print(f"⚠️ SHAP logging failed for {model_name}: {e}")

        print(f"✅ Model {model_name} logged successfully in MLflow!")
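
The specificity calculation inside the function can be sanity-checked in isolation. The following sketch uses made-up labels (`toy_true` and `toy_pred` are hypothetical) and confirms that `tn / (tn + fp)` matches recall computed with the negative class treated as the positive label.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels: 5 negatives and 3 positives
toy_true = [0, 0, 0, 0, 1, 1, 1, 0]
toy_pred = [0, 1, 0, 0, 1, 0, 1, 0]

# For binary labels {0, 1}, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()
toy_specificity = tn / (tn + fp)

# Specificity is the recall of the negative class
toy_specificity_alt = recall_score(toy_true, toy_pred, pos_label=0)
print(toy_specificity, toy_specificity_alt)
```

The guard `if (tn + fp) > 0` in the function covers the edge case where the test labels contain no negatives at all.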

## --- Set up MLflow Experiment ---

## --- Run the Pipelines ---

In [None]:
train_and_log_pipeline(lr_pipe, "LogisticRegression", preprocess_name="1", model_step="2")



✅ SHAP explanations logged for LogisticRegression
✅ Model LogisticRegression logged successfully in MLflow!
🏃 View run LogisticRegression at: https://dagshub.com/boiBASH/Elite-Bank-Project.mlflow/#/experiments/0/runs/d82f69923a3640f498dd8a5aff4865d4
🧪 View experiment at: https://dagshub.com/boiBASH/Elite-Bank-Project.mlflow/#/experiments/0


In [None]:
train_and_log_pipeline(cat_pipe, "CatBoost", preprocess_name="1", model_step="2")

Learning rate set to 0.026238
0:	learn: 0.6762805	total: 15ms	remaining: 15s
1:	learn: 0.6623483	total: 26ms	remaining: 13s
2:	learn: 0.6512269	total: 37.1ms	remaining: 12.3s
3:	learn: 0.6376379	total: 52.2ms	remaining: 13s
4:	learn: 0.6268372	total: 59.1ms	remaining: 11.8s
5:	learn: 0.6147106	total: 66ms	remaining: 10.9s
6:	learn: 0.6046366	total: 75.9ms	remaining: 10.8s
7:	learn: 0.5948885	total: 86.4ms	remaining: 10.7s
8:	learn: 0.5860148	total: 97.7ms	remaining: 10.8s
9:	learn: 0.5781420	total: 109ms	remaining: 10.8s
10:	learn: 0.5694255	total: 120ms	remaining: 10.8s
11:	learn: 0.5613750	total: 130ms	remaining: 10.7s
12:	learn: 0.5539463	total: 140ms	remaining: 10.7s
13:	learn: 0.5469171	total: 150ms	remaining: 10.6s
14:	learn: 0.5401197	total: 163ms	remaining: 10.7s
15:	learn: 0.5339919	total: 173ms	remaining: 10.7s
16:	learn: 0.5283369	total: 184ms	remaining: 10.7s
17:	learn: 0.5229701	total: 200ms	remaining: 10.9s
18:	learn: 0.5166368	total: 211ms	remaining: 10.9s
19:	learn: 0.5



✅ SHAP explanations logged for CatBoost
✅ Model CatBoost logged successfully in MLflow!
🏃 View run CatBoost at: https://dagshub.com/boiBASH/Elite-Bank-Project.mlflow/#/experiments/0/runs/ef64735c09614c069d2c9241461c0ac6
🧪 View experiment at: https://dagshub.com/boiBASH/Elite-Bank-Project.mlflow/#/experiments/0


In [None]:
train_and_log_pipeline(xgb_pipe, "XGBoost", preprocess_name="1", model_step="2")



✅ SHAP explanations logged for XGBoost
✅ Model XGBoost logged successfully in MLflow!
🏃 View run XGBoost at: https://dagshub.com/boiBASH/Elite-Bank-Project.mlflow/#/experiments/0/runs/a2b19531f3424f48b915407a6988f04a
🧪 View experiment at: https://dagshub.com/boiBASH/Elite-Bank-Project.mlflow/#/experiments/0


In [24]:
mlflow.autolog()

2025/03/15 02:22:40 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.
2025/03/15 02:22:40 INFO mlflow.tracking.fluent: Autologging successfully enabled for xgboost.
2025/03/15 02:22:41 INFO mlflow.tracking.fluent: Autologging successfully enabled for pyspark.
