# Stacking Classifier
* In this last experiment, we stack the estimators we have already trained and use Logistic Regression as the final estimator to see whether we get better results.
* The models are selected based on their performance and the diversity of their learning approaches.
* There are 2 approaches we can take:
    * The first approach is to make the stack as diverse as possible. With that in mind, we selected the following 6 estimators:
        * Logistic Regression v2
        * Linear SVC v2
        * Random Forest v2
        * SVC v2
        * KNN v2
        * KNN v3

    * If the first approach doesn't work, we can try the kitchen-sink method and create a stack with all the estimators. This may be computationally expensive, and it is more likely to perform worse than the first approach due to the lack of diversity.

## Install Libraries

In [1]:
# %pip install scikit-learn

## Import Libraries

In [2]:
import os
import sys
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from pathlib import Path
import matplotlib.pyplot as plt
from sklearn.metrics import f1_score,recall_score,precision_score,precision_recall_curve
import seaborn as sns
import joblib


# Build an absolute path from this notebook's parent directory
module_path = os.path.abspath(os.path.join('..'))

# Add to sys.path if not already present
if module_path not in sys.path:
    sys.path.append(module_path)
    
from src.utils import preprocessing
from src.utils import common
from src.utils.training import refit_strategy

## Initialize Directories

In [3]:
data_root_dir = Path("..", "data/")
models_root_dir = Path("..", "models/")

## Read Data

In [4]:
X_train = pd.read_csv(Path(data_root_dir,"X_train.csv"))
y_train = pd.read_csv(Path(data_root_dir,"y_train.csv"))

In [5]:
# Map each selected model name to its saved joblib file
estimator_mapping = [
    {"name": "logistic_regression_v2", "file_name": "logistic_regression_v2.joblib", "estimator": None},
    {"name": "linear_svc_v2", "file_name": "linear_svc_v2.joblib", "estimator": None},
    {"name": "random_forest_v2", "file_name": "random_forest_v2.joblib", "estimator": None},
    {"name": "svc_v2", "file_name": "svc_v2.joblib", "estimator": None},
    {"name": "knn_v2", "file_name": "knn_v2.joblib", "estimator": None},
    {"name": "knn_v3", "file_name": "knn_v3.joblib", "estimator": None}
]

In [6]:
# Load each saved joblib file and pair it with its model name
def map_estimator(mapping):
    """Return a (name, estimator) tuple in the format the ensemble classes expect."""
    model_path = Path(models_root_dir, mapping["file_name"])
    estimator = joblib.load(model_path)
    return (mapping["name"], estimator)

estimators = list(map(map_estimator, estimator_mapping))
estimators

[('logistic_regression_v2',
  Pipeline(steps=[('preprocessing',
                   ColumnTransformer(transformers=[('preprocess_gender',
                                                    Pipeline(steps=[('default_cat_pipeline',
                                                                     Pipeline(steps=[('fill_empty_strings',
                                                                                      FunctionTransformer(feature_names_out='one-to-one',
                                                                                                          func=<function fill_empty_strings_fn at 0x7faa0e5000e0>)),
                                                                                     ('strip_spaces',
                                                                                      FunctionTransformer(feature_names_out='one-to-one',
                                                                                                          func=<functio

In [11]:
# from sklearn.ensemble import StackingClassifier
# from sklearn.linear_model import LogisticRegression


# final_estimator = LogisticRegression(random_state=42, max_iter=10000)

# stacking_clf = StackingClassifier(
#     estimators=estimators,
#     final_estimator=final_estimator,
#     passthrough=False,
#     stack_method="predict",
#     n_jobs=-1,
#     verbose=10
# )

# stacking_clf.fit(X=X_train,y=y_train.values.ravel())


Observations:
* The `StackingClassifier` ran for more than 25 hours without finishing. On investigation, we found that stacking requires 37 model fits here: each of the 6 estimators is fit once on the full training set and once per fold of the default 5-fold cross-validation, plus one fit of the final estimator. Of those, 12 fits are computationally expensive and not multi-threaded. On top of that, the preprocessing pipeline is executed separately for each of the 37 fits.
* To keep the project moving and free up resources (my PC), we pivot to `VotingClassifier` and see whether it gives a performance improvement.
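
The fit count quoted above can be reproduced with a little arithmetic. This is a minimal sketch, assuming the default 5-fold cross-validation of `StackingClassifier`; `stacking_fit_count` is a hypothetical helper written for illustration, and reading the 12 expensive fits as two single-threaded base estimators is an assumption, since the notebook does not name them.

```python
def stacking_fit_count(n_estimators, cv_folds=5):
    """Count the model fits a stacking ensemble performs.

    Each base estimator is fit once on the full training set and once per
    cross-validation fold; the final estimator is fit once on the
    cross-validated predictions.
    """
    full_fits = n_estimators
    cv_fits = n_estimators * cv_folds
    final_fit = 1
    return full_fits + cv_fits + final_fit


# Six base estimators, as selected at the top of this notebook.
total_fits = stacking_fit_count(n_estimators=6)

# Assumption: two of the base estimators are the slow, single-threaded ones.
# Each is fit 5 + 1 times, which matches the 12 expensive fits mentioned above.
slow_fits = 2 * (5 + 1)

print(total_fits, slow_fits)
```

Lowering `cv_folds` reduces the count linearly, for example `stacking_fit_count(6, cv_folds=3)` gives 25 fits.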

# Hard Voting Classifier

In [12]:
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_validate

voting_clf = VotingClassifier(
    estimators=estimators,
    voting="hard"
)

scoring = ["recall", "precision", "f1"]

# Get a reliable performance estimate using cross-validation
voting_classifier_scores = cross_validate(
    estimator=voting_clf,
    X=X_train,
    y=y_train.values.ravel(),
    cv=3, # or 5
    scoring=scoring,
    n_jobs=-1
)

voting_clf.fit(X_train, y_train.values.ravel())

In [13]:
voting_classifier_scores

{'fit_time': array([70.02628922, 68.13778281, 69.01933575]),
 'score_time': array([10.96803546, 11.85654569, 11.56871486]),
 'test_recall': array([0.89302112, 0.88292011, 0.88797062]),
 'test_precision': array([0.84860384, 0.84379114, 0.85955556]),
 'test_f1': array([0.87024609, 0.86291227, 0.87353207])}

In [14]:
mean_recall,mean_precision,mean_f1 = common.calculate_mean_from_cv(voting_classifier_scores)
mean_recall,mean_precision,mean_f1

Mean Recall: 0.8879706152433425, Mean Precision: 0.8879706152433425,Mean F1: 0.8688968088314876


(np.float64(0.8879706152433425),
 np.float64(0.8879706152433425),
 np.float64(0.8688968088314876))

Observation:
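* Based on the cross-validation scores, this model outperforms all the other models, with a mean recall of `~0.89`.
* Note that the mean precision printed above is identical to the mean recall. Averaging the per-fold `test_precision` scores gives `~0.85`, so the printed precision should not be taken at face value. The mean F1 of `~0.87` is consistent with the per-fold scores.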
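
The mean precision printed above is identical to the mean recall, which the per-fold `test_precision` values do not support. The following is a minimal sketch of recomputing the means directly from the `cross_validate` output, using the fold scores copied from the output above. `mean_scores` is a hypothetical helper for illustration and is not part of `src.utils.common`.

```python
import numpy as np

# Per-fold scores copied from the hard voting cross-validation output above.
cv_scores = {
    "test_recall": np.array([0.89302112, 0.88292011, 0.88797062]),
    "test_precision": np.array([0.84860384, 0.84379114, 0.85955556]),
    "test_f1": np.array([0.87024609, 0.86291227, 0.87353207]),
}


def mean_scores(scores, metrics=("recall", "precision", "f1")):
    """Average each metric over the folds, reading its own `test_<metric>` key."""
    return {metric: float(np.mean(scores[f"test_{metric}"])) for metric in metrics}


means = mean_scores(cv_scores)
print({name: round(value, 4) for name, value in means.items()})
```

With these fold scores, the means come out at roughly 0.888 for recall, 0.851 for precision, and 0.869 for F1.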

In [15]:
# common.update_models_metrics("Hard Voting Classifier", "v0", mean_recall,mean_precision,mean_f1)
# _, file_name = common.save_model(
#     "Hard Voting Classifier", "v0", voting_clf)

# Soft Voting Classifier

* For the soft voting classifier, we need to create new instances of SVC and Linear SVC, because our saved estimators do not support the `predict_proba` method.
* Since `LinearSVC` does not provide probability estimates, we use `SVC` with `kernel="linear"` and `probability=True` instead.
* For both classifiers, we keep the rest of the params the same as in their respective versions.
* As an experiment, we also create new instances of all the other classifiers to see how it affects the performance.
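
Soft voting averages class probabilities, so every estimator in the ensemble needs a `predict_proba` method. A quick way to check this before building the ensemble is sketched below; the candidate estimators use default settings for illustration and are not the tuned versions from this project.

```python
from sklearn.svm import SVC, LinearSVC

# Illustrative candidates with default hyperparameters (not the tuned models).
candidates = {
    "LinearSVC": LinearSVC(),
    "SVC": SVC(),
    "SVC linear with probability": SVC(kernel="linear", probability=True),
}

# hasattr works on unfitted estimators because scikit-learn only exposes
# predict_proba when the estimator is configured to support it.
supports_proba = {
    name: hasattr(estimator, "predict_proba")
    for name, estimator in candidates.items()
}

for name, supported in supports_proba.items():
    print(f"{name}: predict_proba available = {supported}")
```

Enabling `probability=True` makes `SVC` fit an additional internal calibration step, which increases training time.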

In [16]:
# creating all estimator pipelines
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# logistic regression v2
logistic_regression_v2 = Pipeline([
    ("preprocessing", preprocessing.pipeline),
    ("normalizing", StandardScaler()),
    ("prediction", LogisticRegression(max_iter=1000, random_state=42))
]).set_params(
    **{
        "prediction__C": 1,
        "prediction__penalty": "l2",
        "prediction__solver": "saga",
        "preprocessing__age_pipeline__age_encoding__age_range_encoding__encoding": "ordinal",
        "preprocessing__cgpa_pipeline__cgpa_encoding__cgpa_range_encoding__encoding": "ordinal",
        "preprocessing__degree_pipeline__degree_encoding__degree_level_encoding__encoding": "ordinal",
        "preprocessing__dietary_habits_pipeline__dietary_habits_encoding__encoding": "ordinal",
        "preprocessing__hours_pipeline__hours_encoding__hours_range_encoding__encoding": "onehot",
        "preprocessing__sleep_duration_pipeline__sleep_duration_encoding__encoding": "ordinal"
    })

# linear svc v2
linear_svc_v2 = Pipeline([
    ("preprocessing", preprocessing.pipeline),
    ("normalizing", StandardScaler()),
    ("prediction", SVC(random_state=42, kernel="linear", probability=True))
]).set_params(**{
    "prediction__C": 0.1,
    "preprocessing__age_pipeline__age_encoding__age_range_encoding__encoding": "ordinal",
    "preprocessing__cgpa_pipeline__cgpa_encoding__cgpa_range_encoding__encoding": "ordinal",
    "preprocessing__degree_pipeline__degree_encoding__degree_level_encoding__encoding": "ordinal",
    "preprocessing__dietary_habits_pipeline__dietary_habits_encoding__encoding": "ordinal",
    "preprocessing__hours_pipeline__hours_encoding__hours_range_encoding__encoding": "onehot",
    "preprocessing__sleep_duration_pipeline__sleep_duration_encoding__encoding": "onehot"
})

# random forest v2
random_forest_v2 = Pipeline([
    ("preprocessing", preprocessing.pipeline),
    ("normalizing", StandardScaler()),
    ("prediction", RandomForestClassifier(random_state=42))
]).set_params(**{
    "prediction__criterion": "gini",
    "prediction__max_features": "sqrt",
    "prediction__n_estimators": 400,
    "preprocessing__age_pipeline__age_encoding__age_range_encoding__encoding": "onehot",
    "preprocessing__cgpa_pipeline__cgpa_encoding__cgpa_range_encoding__encoding": "ordinal",
    "preprocessing__degree_pipeline__degree_encoding__degree_level_encoding__encoding": "ordinal",
    "preprocessing__dietary_habits_pipeline__dietary_habits_encoding__encoding": "ordinal",
    "preprocessing__hours_pipeline__hours_encoding__hours_range_encoding__encoding": "ordinal",
    "preprocessing__sleep_duration_pipeline__sleep_duration_encoding__encoding": "onehot"
})

# svc v2
svc_v2 = Pipeline([
    ("preprocessing", preprocessing.pipeline),
    ("normalizing", StandardScaler()),
    ("prediction", SVC(random_state=42, probability=True))
]).set_params(**{
    "prediction__C": 0.1,
    "prediction__degree": 4,
    "prediction__kernel": "poly",
    "preprocessing__age_pipeline__age_encoding__age_range_encoding__encoding": "ordinal",
    "preprocessing__cgpa_pipeline__cgpa_encoding__cgpa_range_encoding__encoding": "ordinal",
    "preprocessing__degree_pipeline__degree_encoding__degree_level_encoding__encoding": "ordinal",
    "preprocessing__dietary_habits_pipeline__dietary_habits_encoding__encoding": "ordinal",
    "preprocessing__hours_pipeline__hours_encoding__hours_range_encoding__encoding": "ordinal",
    "preprocessing__sleep_duration_pipeline__sleep_duration_encoding__encoding": "onehot"
})

# knn v2
knn_v2 = Pipeline([
    ("preprocessing", preprocessing.pipeline),
    ("normalizing", MinMaxScaler()),
    ("prediction", KNeighborsClassifier())
]).set_params(**{
    "prediction__algorithm": "brute",
    "prediction__n_neighbors": 20,
    "prediction__weights": "uniform"
})

# knn v3
knn_v3 = Pipeline([
    ("preprocessing", preprocessing.pipeline),
    ("normalizing", StandardScaler()),
    ("prediction", KNeighborsClassifier())
]).set_params(**{
    "prediction__algorithm": "ball_tree",
    "prediction__n_neighbors": 20,
    "prediction__weights": "distance",
    "preprocessing__age_pipeline__age_encoding__age_range_encoding__encoding": "onehot",
    "preprocessing__cgpa_pipeline__cgpa_encoding__cgpa_range_encoding__encoding": "onehot",
    "preprocessing__degree_pipeline__degree_encoding__degree_level_encoding__encoding": "ordinal",
    "preprocessing__dietary_habits_pipeline__dietary_habits_encoding__encoding": "ordinal",
    "preprocessing__hours_pipeline__hours_encoding__hours_range_encoding__encoding": "ordinal",
    "preprocessing__sleep_duration_pipeline__sleep_duration_encoding__encoding": "ordinal"
})
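
All six pipelines above receive the same `preprocessing.pipeline` object, and `set_params` with a `preprocessing__...` key modifies that object in place. If `preprocessing.pipeline` is a single module-level instance (an assumption, since its definition is not shown in this notebook), the encoding settings from the last `set_params` call would apply to every pipeline. The sketch below uses stand-in estimators to show the effect and how `sklearn.base.clone` avoids it.

```python
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for a shared, module-level preprocessing object (hypothetical).
shared_step = StandardScaler()

# Two pipelines that reuse the very same step instance.
pipe_a = Pipeline([("preprocessing", shared_step), ("prediction", LogisticRegression())])
pipe_b = Pipeline([("preprocessing", shared_step), ("prediction", LogisticRegression())])

# Changing the step through pipe_a also changes what pipe_b sees.
pipe_a.set_params(preprocessing__with_mean=False)
leaked = pipe_b.get_params()["preprocessing__with_mean"]

# Cloning gives each pipeline its own unfitted copy with the same parameters.
template = StandardScaler()
pipe_c = Pipeline([("preprocessing", clone(template)), ("prediction", LogisticRegression())])
pipe_d = Pipeline([("preprocessing", clone(template)), ("prediction", LogisticRegression())])

pipe_c.set_params(preprocessing__with_mean=False)
isolated = pipe_d.get_params()["preprocessing__with_mean"]

print("shared step, pipe_b with_mean:", leaked)
print("cloned step, pipe_d with_mean:", isolated)
```

Applied to this notebook, the equivalent change would be to pass `clone(preprocessing.pipeline)` as the `"preprocessing"` step of each pipeline, so that each estimator keeps its own encoding settings.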

In [17]:
estimators = [
    ("logistic_regression_v2",logistic_regression_v2),
    ("linear_svc_v2",linear_svc_v2),
    ("random_forest_v2",random_forest_v2),
    ("svc_v2",svc_v2),
    ("knn_v2",knn_v2),
    ("knn_v3",knn_v3)
]

In [18]:
voting_clf = VotingClassifier(
    estimators=estimators,
    voting="soft"
)

scoring = ["recall", "precision", "f1"]

# Get a reliable performance estimate using cross-validation
voting_classifier_scores = cross_validate(
    estimator=voting_clf,
    X=X_train,
    y=y_train.values.ravel(),
    cv=3, # or 5
    scoring=scoring,
    n_jobs=-1
)

voting_clf.fit(X_train, y_train.values.ravel())

In [19]:
voting_classifier_scores

{'fit_time': array([71.07017612, 66.00241423, 67.85215664]),
 'score_time': array([11.2282865 , 11.43015838, 11.29761267]),
 'test_recall': array([0.90128558, 0.88980716, 0.89761249]),
 'test_precision': array([0.83763601, 0.83283197, 0.84650357]),
 'test_f1': array([0.86829592, 0.86037736, 0.87130919])}

In [20]:
mean_recall,mean_precision,mean_f1 = common.calculate_mean_from_cv(voting_classifier_scores)
mean_recall,mean_precision,mean_f1

Mean Recall: 0.8962350780532599, Mean Precision: 0.8962350780532599,Mean F1: 0.866660823395622


(np.float64(0.8962350780532599),
 np.float64(0.8962350780532599),
 np.float64(0.866660823395622))

Observation:
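* This model has a slightly higher recall than the hard voting classifier (`~0.90` vs `~0.89`), but the change is not significant.
* The mean precision printed above again duplicates the recall value. Averaging the per-fold scores gives a precision of `~0.84` and an F1 of `~0.87`, both marginally lower than for hard voting, so soft voting trades a little precision for recall.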

In [21]:
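# NOTE: mean_precision as returned by common.calculate_mean_from_cv equals
# mean_recall in the output above; verify the helper before relying on the
# precision value saved here.
common.update_models_metrics("Soft Voting Classifier", "v0", mean_recall, mean_precision, mean_f1)
_, file_name = common.save_model(
    "Soft Voting Classifier", "v0", voting_clf)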