
# Preamble
This problem set is an extension of Problem Set 6. You will need the MNIST 784 dataset from OpenML, with dimensionality reduced so that about 75\% of the original variance is retained.

As with last week, the first 60,000 observations are available to use as training data, and the remaining 10,000 images as test data.  In training the models, you do not need to use all 60,000 observations.  (It is suggested to partition the training data into a training dataset and holdout dataset rather than use cross-validation.  Training on as few as 5000 observations is sufficient to reduce training time.)

For purposes of this problem set, recode the target variable for both the test and training sets to classify whether a digit is less than 5 (i.e., $y \in \left\{0, 1, 2, 3, 4\right\}$).  That is, the target variable should take the value 0 where the corresponding observation depicts a 0, 1, 2, 3, or 4; and the value 1 where the corresponding observation depicts a 5, 6, 7, 8, or 9.


In [61]:
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import MinMaxScaler
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [62]:
#fetch OpenML data
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)

# Split into training (first 60,000 rows) and test (remaining 10,000 rows) sets
X_train_full, y_train_full = X[:60_000], y[:60_000]
X_test, y_test = X[60_000:], y[60_000:]

In [63]:
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X_train_full, y_train_full, train_size=5000, random_state=0
)

In [64]:
# Reference taken from week-6 homework
import seaborn as sns
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

# Pixel values have known min/max bounds, so MinMaxScaler is a natural choice.
# Note: the scaler is not part of the PCA pipeline below; it is reused later in the SVM pipeline.
mm_scaler = MinMaxScaler()

# A float n_components keeps the fewest components that retain 75% of the original variance
pca = PCA(n_components=0.75)

# Pipeline containing only the PCA step (PCA is fit on the unscaled pixel values)
pipe = Pipeline(steps=[('pca', pca)])

In [65]:
X_train_pca = pipe.fit_transform(X_train)
print(f"shape of training dataset before PCA: {X_train.shape}")
print(f"shape of training dataset after PCA: {X_train_pca.shape}")

shape of training dataset before PCA: (5000, 784)
shape of training dataset after PCA: (5000, 33)
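
As a small illustration of how a fractional `n_components` behaves, the sketch below fits PCA on synthetic data (not MNIST) and checks the cumulative explained variance of the components that were kept. The array `X_demo` and the other `*_demo` names are hypothetical stand-ins introduced only for this example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for image data: 500 samples, 50 correlated features
# generated from 10 latent factors plus a little noise (an assumption for the demo).
latent = rng.normal(size=(500, 10))
mixing = rng.normal(size=(10, 50))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(500, 50))

# A float in (0, 1) asks PCA to keep enough components to exceed that variance share
pca_demo = PCA(n_components=0.75)
X_demo_reduced = pca_demo.fit_transform(X_demo)

# Cumulative share of variance explained by the retained components
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
print(f"components kept: {pca_demo.n_components_}")
print(f"variance retained: {cumulative[-1]:.3f}")
```

The last entry of `cumulative` is the variance share actually retained, which lands at or slightly above the requested 0.75 because components are added one at a time.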


In [66]:
X_test_pca = pipe.transform(X_test)

Recode `y_train` and `y_test` as binary labels: 0 for digits 0 through 4 and 1 for digits 5 through 9.

In [67]:
y_train_final = (y_train.astype(int) >= 5).astype(int)
y_test_final = (y_test.astype(int) >= 5).astype(int)

# Problem 1 -- Classifiers

Train 3 classifiers on the dataset, each using a different algorithm.  Each classifier must have an $F_1$ score of at least 0.9.  At least one classifier must use gradient boosting (AdaBoost, Gradient Boost, or xgboost).  Show the $F_1$ score and classification report for each model.

In [68]:
scores={}

In [69]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import f1_score, classification_report

In [70]:
def get_optimized_classifier(classifier, param_grid, X_train, y_train):
    """
    Run GridSearchCV for a given classifier and return the best model with its training-set metrics.

    Args:
        classifier: The classifier object to be optimized.
        param_grid (dict): Parameter grid for GridSearchCV.
        X_train, y_train: Training data.

    Returns:
        dict: The best model, its parameters, and its F1 score and
        classification report, both computed on the training data.
    """

    grid_search = GridSearchCV(classifier, param_grid, cv=2, scoring='f1', n_jobs=-1, verbose=2)
    grid_search.fit(X_train, y_train)

    best_model = grid_search.best_estimator_
    # Evaluated on the training data only (no holdout or test data), so these scores are optimistic
    y_pred = best_model.predict(X_train)
    f1 = f1_score(y_train, y_pred)

    return {
        'model': best_model,
        'best_params': grid_search.best_params_,
        'f1_score': f1,
        'classification_report': classification_report(y_train, y_pred)
    }

classifiers = [
    (GradientBoostingClassifier(), {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.01, 0.1, 0.2],
        'max_depth': [3, 5, 7]
    }),
    (RandomForestClassifier(), {
        'n_estimators': [50, 100, 200],
        'max_depth': [5, 10, None],
        'min_samples_split': [2, 5, 10]
    }),
    (AdaBoostClassifier(), {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.01, 0.1, 1.0]
    })
]


results = []
for classifier, param_grid in classifiers:
    result = get_optimized_classifier(classifier, param_grid, X_train_pca, y_train_final)
    results.append(result)


for clf_name, result in zip(['Gradient Boosting', 'Random Forest', 'ADABoost'], results):
    print(f"\n{clf_name} Results:")
    print(f"Best parameters: {result['best_params']}")
    print(f"F1 Score: {result['f1_score']}")
    print(result['classification_report'])

best_models = [result['model'] for result in results]
best_grd, best_rf, best_ada = best_models

Fitting 2 folds for each of 27 candidates, totalling 54 fits
Fitting 2 folds for each of 27 candidates, totalling 54 fits
Fitting 2 folds for each of 9 candidates, totalling 18 fits





Gradient Boosting Results:
Best parameters: {'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 200}
F1 Score: 1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      2564
           1       1.00      1.00      1.00      2436

    accuracy                           1.00      5000
   macro avg       1.00      1.00      1.00      5000
weighted avg       1.00      1.00      1.00      5000


Random Forest Results:
Best parameters: {'max_depth': None, 'min_samples_split': 5, 'n_estimators': 200}
F1 Score: 1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      2564
           1       1.00      1.00      1.00      2436

    accuracy                           1.00      5000
   macro avg       1.00      1.00      1.00      5000
weighted avg       1.00      1.00      1.00      5000


ADABoost Results:
Best parameters: {'learning_rate': 1.0, 'n_estimators': 200}
F1 Score: 0.891781936533767
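
The scores above are computed on the same data the models were fit on, so they can overstate how well the models generalize. A minimal sketch of a holdout evaluation is shown below on synthetic data; `make_classification` and the `*_demo`, `*_fit`, and `*_hold` names are assumptions for illustration, not part of the notebook's data flow.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem standing in for the PCA-reduced digits
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=8, random_state=0
)

# Keep 30% of the rows aside; the model never sees them during fitting
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=0
)

clf_demo = RandomForestClassifier(n_estimators=100, random_state=0)
clf_demo.fit(X_fit, y_fit)

# Compare the score on the fitting data with the score on unseen data
f1_train = f1_score(y_fit, clf_demo.predict(X_fit))
f1_hold = f1_score(y_hold, clf_demo.predict(X_hold))
print(f"training F1: {f1_train:.3f}")
print(f"holdout F1:  {f1_hold:.3f}")
```

In this notebook, the analogous step would presumably be to transform `X_holdout` with the already fitted `pipe` and recode `y_holdout` the same way as `y_train` before scoring.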

# Problem 2 -- Voting ensemble model

(20 pts) Build a voting ensemble model that combines the three classifiers from the previous problem, in addition to the SVM model developed last week.  What is the $F_1$ score of the ensemble model?

Recreate the SVM classifier using the hyperparameters found last week.

In [71]:
# Using hyperparameters from last week's assignment to get best SVM model
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
best_svc = SVC(kernel='rbf', C=10)
svc_pipeline = make_pipeline(mm_scaler, best_svc)

In [72]:
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import f1_score
clfs = [('ada', best_ada),
        ('grd', best_grd),
        ('rf', best_rf),
        ('svc', svc_pipeline)]

# voting='hard' is the default: each classifier casts one vote per sample
clf_vot = VotingClassifier(clfs)

# Fit the ensemble on the training data
clf_vot.fit(X_train_pca, y_train_final)

# Evaluate the F1 score on the training data
y_pred = clf_vot.predict(X_train_pca)
f1 = f1_score(y_train_final, y_pred)
print(f"F1 score of voting ensemble model: {f1}")



F1 score of voting ensemble model: 1.0
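
`VotingClassifier(clfs)` uses the default `voting='hard'`, a majority vote over predicted labels. With an even number of estimators, ties are possible, and scikit-learn resolves them by picking the class that comes first in ascending sort order. The sketch below contrasts hard voting with soft voting (averaged class probabilities) on synthetic data; the estimators and `*_demo` names are illustrative assumptions, and `SVC` needs `probability=True` to take part in soft voting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Small synthetic binary problem (a stand-in, not the MNIST data)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)


def make_estimators():
    """Return fresh (name, estimator) pairs for an ensemble."""
    return [
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
        # probability=True enables predict_proba, which soft voting requires
        ("svc", SVC(kernel="rbf", C=10, probability=True, random_state=0)),
    ]


# Hard voting: majority vote over each estimator's predicted label
hard_vote = VotingClassifier(make_estimators(), voting="hard").fit(X_demo, y_demo)
# Soft voting: argmax of the averaged predicted class probabilities
soft_vote = VotingClassifier(make_estimators(), voting="soft").fit(X_demo, y_demo)

pred_hard = hard_vote.predict(X_demo)
pred_soft = soft_vote.predict(X_demo)
proba_soft = soft_vote.predict_proba(X_demo)

agreement = (pred_hard == pred_soft).mean()
print(f"hard/soft agreement on the demo data: {agreement:.3f}")
```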


# Problem 3 -- Stacking ensemble model
Stacking uses a final classifier (often a logistic regression) that outputs an aggregate of the predictors. Repeat the previous problem using a StackingClassifier rather than voting to compute the final prediction.  What is the $F_1$ score of the stacking classifier?

In [73]:
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, classification_report

In [74]:
stacking_clf = StackingClassifier(
    estimators=clfs,
    final_estimator=LogisticRegression(),
    verbose=1
)
stacking_clf.fit(X_train_pca, y_train_final)

# Make predictions on the training set (the test set is not used here)
y_pred_stacking = stacking_clf.predict(X_train_pca)

# Calculate the training F1 score
f1_stacking = f1_score(y_train_final, y_pred_stacking)

print(f"F1 score of stacking ensemble model: {f1_stacking}")



F1 score of stacking ensemble model: 1.0
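
For context on what `StackingClassifier` does internally: the final estimator is trained on cross-validated predictions of the base estimators (controlled by `cv`, 5-fold by default), while the base estimators themselves are refit on the full training data. The sketch below shows this on synthetic data and inspects the meta-features passed to the final estimator; the data and `*_demo` names are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary problem standing in for the PCA-reduced digits
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_demo, y_demo, train_size=0.75, random_state=0
)

stack_demo = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("svc", SVC(kernel="rbf", C=10)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # folds used to build out-of-fold predictions for the final estimator
)
stack_demo.fit(X_fit, y_fit)

# transform() returns the base estimators' outputs, i.e. the final estimator's inputs
meta_features = stack_demo.transform(X_hold)
f1_hold = f1_score(y_hold, stack_demo.predict(X_hold))
print(f"meta-feature matrix shape: {meta_features.shape}")
print(f"holdout F1: {f1_hold:.3f}")
```

Scoring on `X_hold` rather than `X_fit` keeps the estimate separate from the data used for fitting.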
