<a href="https://colab.research.google.com/github/devparikh0506/DATA-602/blob/main/week_10/Homework_10.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Problem 1 - Hyperparameter optimization.

The template provides a dataset derived from the merged weather-citation dataframe that you generated in Week 2. Each observation in the dataframe represents the number of red light citations issued in a given hour since 2020, together with a bad weather indicator for that hour (refer to the Week 2 solution). The dataframe includes additional features such as circular components of the day of week, time of day, and day of year, and indicator variables for whether the day is a weekend or a U.S. federal holiday. The template provides code to split the data into training and test instances, and to scale both.

Using a regression algorithm of your choice to predict the number of citations from the other features:

1. Using a hyperparameter optimization method of your choice, find optimal hyperparameter values that maximize the regressor's $R^2$ value.
2. Print the selected hyperparameter values.
3. Fit the estimator, using the optimal hyperparameters that you identified in (1), to the full set of training data.
4. Evaluate the $R^2$ metric against the provided test data.

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer

In [2]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


The code below loads and prepares the training and test data sets.

In [3]:
df = pd.read_parquet("/content/drive/Shareddrives/DS602-F22/Data/weather_week10.parquet")
y = df.pop("count")
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2)
ct = ColumnTransformer(
    [
        ("ts", MinMaxScaler(), ["timestamp"]),
        ("drop", "drop", ["wobsts"])
    ], remainder="passthrough"
)
X_train = ct.fit_transform(X_train)
X_test = ct.transform(X_test)
y_train = y_train.values
y_test = y_test.values
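
As a minimal sketch of what the `ColumnTransformer` above does, the toy example below applies the same three steps (scale `timestamp`, drop `wobsts`, pass everything else through) to a made-up dataframe. The column `bad_weather` and all values are hypothetical stand-ins, not the actual contents of the parquet file.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for the weather-citation data (all values are made up).
train = pd.DataFrame({
    "timestamp": [100.0, 200.0, 300.0],
    "wobsts": [1, 2, 3],
    "bad_weather": [0, 1, 0],  # hypothetical passthrough feature
})
test = pd.DataFrame({
    "timestamp": [400.0],
    "wobsts": [4],
    "bad_weather": [1],
})

ct_demo = ColumnTransformer(
    [
        ("ts", MinMaxScaler(), ["timestamp"]),
        ("drop", "drop", ["wobsts"]),
    ],
    remainder="passthrough",
)

train_t = ct_demo.fit_transform(train)  # learns min/max from the training rows only
test_t = ct_demo.transform(test)        # reuses the training min/max

print(train_t)  # scaled timestamp first, then the passthrough column
print(test_t)
```

Because the scaler is fitted on the training rows only, a test timestamp outside the training range maps outside [0, 1]. With the shuffled `train_test_split` used above this should be rare, but it is worth keeping in mind for time-ordered splits.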

In [4]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
# RandomForestRegressor is chosen because it handles complex, non-linear relationships
# and reduces overfitting through ensemble learning.
rf_model = RandomForestRegressor()

In [5]:
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(rf_model, param_grid, scoring='r2', cv=2, verbose=2, n_jobs=-1)
grid_search.fit(X_train, y_train)

Fitting 2 folds for each of 27 candidates, totalling 54 fits


In [6]:
print(f"Best hyperparameters found using GridSearchCV: {grid_search.best_params_}")
print(f"Best R^2 score found using GridSearchCV: {grid_search.best_score_}")

Best hyperparameters found using GridSearchCV: {'max_depth': 20, 'min_samples_split': 10, 'n_estimators': 200}
Best R^2 score found using GridSearchCV: 0.8188561602634765
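
Beyond the single best score, it can help to see how close the other candidates were. The sketch below runs a much smaller grid on synthetic data (an assumption made so the example is fast and self-contained) and ranks the candidates using `cv_results_`. In the notebook, the same inspection could be applied to `grid_search.cv_results_`.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the citation data.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

demo_search = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=0),
    {"max_depth": [3, None], "min_samples_split": [2, 10]},
    scoring="r2",
    cv=3,
)
demo_search.fit(X_demo, y_demo)

# One row per candidate, best-ranked first.
results = pd.DataFrame(demo_search.cv_results_)
cols = [
    "param_max_depth",
    "param_min_samples_split",
    "mean_test_score",
    "std_test_score",
    "rank_test_score",
]
summary = results[cols].sort_values("rank_test_score")
print(summary)
```

If the top few rows differ by less than their `std_test_score`, the choice among them is probably not significant, and a cheaper configuration may be just as good.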


Fit the model with the best hyperparameters to the full training data.

In [7]:
rf_model = RandomForestRegressor(**grid_search.best_params_)
rf_model.fit(X_train, y_train)
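
As an aside, `GridSearchCV` refits the best candidate on the full training data by default (`refit=True`), so the refitted model is also available as `best_estimator_`. The sketch below illustrates this on synthetic data; the data and the small grid are assumptions made for the example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for the citation data.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

demo_search = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=0),
    {"max_depth": [3, None]},
    scoring="r2",
    cv=3,
)  # refit=True is the default
demo_search.fit(X_tr, y_tr)

# The best candidate, already refitted on all of X_tr / y_tr.
best_model = demo_search.best_estimator_
test_r2 = best_model.score(X_te, y_te)  # R^2 for regressors
print(demo_search.best_params_, test_r2)
```

Constructing a fresh estimator from `best_params_`, as done in the cell above, is equivalent in intent; the two fits can differ slightly for a random forest unless `random_state` is fixed.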

Test the model against the test data.

In [8]:
from sklearn.metrics import r2_score
y_pred = rf_model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"R^2 score on test data: {r2}")

R^2 score on test data: 0.831984216290856
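
For reference, $R^2 = 1 - SS_{res} / SS_{tot}$, where $SS_{res}$ is the sum of squared residuals and $SS_{tot}$ is the sum of squared deviations of the targets from their mean. The short sketch below computes this by hand on made-up numbers and compares it with `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.5, 5.0, 3.0, 6.5])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_hat))
```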


# Problem 2 - Evaluation
This problem is an extension of problem sets 6 and 7.  Refer to the Model Evaluation and Selection section from Week 8 for guidance on model evaluation.

After completing problem set 7, you should have 5 models: the 3 classifiers from week 7, problem 1; the voting model from week 7, problem 2; and the stacking model from week 7, problem 3.  If you wish, you may also include the SVM model you developed for week 6, problem 2.

The code below performs the required preprocessing from Week 7:

In [9]:
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import StandardScaler

from xgboost import XGBClassifier
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import StackingClassifier

#fetch OpenML data
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False,
                    parser="auto")

#split into test/training sets
N=60000
X_train, y_train = X[:N, :], y[:N]
X_test, y_test = X[N:, :], y[N:]

#scale data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# PCA
pca = PCA(n_components = 0.75)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

# recoded target variable
recode_fn = lambda y: np.choose(np.isin(y, list("01234")), [-1,1])
y_test, y_train = (recode_fn(y) for y in [y_test, y_train])
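
To make the recoding step concrete, the sketch below applies the same logic to four made-up string labels: digits 0 through 4 map to 1 and digits 5 through 9 map to -1. The explicit `astype(int)` cast is added here for clarity and is not part of the original `recode_fn`.

```python
import numpy as np

# Made-up labels in the string form assumed for the MNIST targets.
labels = np.array(["5", "0", "4", "9"])

mask = np.isin(labels, list("01234"))            # True for digits 0-4
recoded = np.choose(mask.astype(int), [-1, 1])   # False -> -1, True -> 1

print(recoded)  # [-1  1  1 -1]
```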



You may use the following models if you wish, or use the classifiers you developed in the homework assignments.

In [11]:
non_ensemble_classifiers = {'svm': SVC(coef0=0.3864834922270453, degree=4, kernel='poly'),
               'xgboost': XGBClassifier(learning_rate=0.19299343762455148, max_depth=9, n_estimators=409),
               'rfc': RandomForestClassifier(max_depth=18, max_features=0.26213576122784343,
                        min_samples_leaf=6),
               'logreg': Pipeline(
                   steps=[
                          ('polynomialfeatures', PolynomialFeatures()),
                          ('logisticregression',
                            LogisticRegression(C=0.00013211684329604987, max_iter=2000))
                          ]),
                }
ensemble_classifiers = {
    'voting': VotingClassifier([(k, v) for k, v in non_ensemble_classifiers.items()]),
    'stacking': StackingClassifier([(k, v) for k, v in non_ensemble_classifiers.items()])
}






Then:

a. Select a model from the candidate models.  You may, but are not required to, select the best-performing model.  Justify your rationale for selecting that model.  You may refer back to your solution and the posted solution from prior assignments to inform your selection.  Document any assumptions.



I am selecting the VotingClassifier for its balanced performance and simplicity: it combines the predictions of several models by majority vote, which tends to improve generalization and reduce variance, and it is easier to interpret than more complex ensemble methods such as stacking.

In [18]:
model = ensemble_classifiers['voting']
model
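
With its default setting (`voting="hard"`), `VotingClassifier` predicts the label chosen by the majority of its members. The sketch below checks this on synthetic data with three simple classifiers; the data and the member models are assumptions for illustration, not the candidates defined above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem with labels 0/1.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

voter = VotingClassifier([
    ("logreg", LogisticRegression(max_iter=1000)),
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ("nb", GaussianNB()),
])  # voting="hard" is the default
voter.fit(X_demo, y_demo)

# Each fitted member's predictions, shape (n_samples, n_members).
member_preds = np.column_stack([est.predict(X_demo) for est in voter.estimators_])

# With labels 0/1 and three voters, the majority is 1 when at least two vote 1.
manual_vote = (member_preds.sum(axis=1) >= 2).astype(int)

print(np.array_equal(manual_vote, voter.predict(X_demo)))
```

An odd number of members avoids ties in a binary problem. The four-member ensemble above can tie, in which case scikit-learn resolves the vote in favor of the first class in sorted label order.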

b. Train the selected model on the full set of training data.  (As with earlier assignments, you may train on a subset of data if training takes too long.)


In [19]:
model.fit(X_train, y_train)

c. Preprocess the test data (last 10,000 observations of MNIST) identically to the training data.


Because I am using the provided preprocessing code, X_test and y_test have already been through the same preprocessing steps as X_train and y_train, so I use them directly.

In [20]:
X_test[:1, :]

array([[-5.03127351,  3.64112918, -7.32268689,  0.54986598, -2.77308791,
         2.10817192,  3.76234458,  6.8195386 , -4.07925865,  2.28805112,
        -0.77496835,  4.03829138, -1.68804222, -1.24066622, -1.05615137,
         0.88068608, -1.61994675, -2.97384063,  1.71794568, -2.84879893,
         0.94924206, -0.03903973,  1.97077762,  0.32725795,  4.16806047,
        -0.60561678, -1.44824394,  1.16477363,  0.36972318, -0.47856909,
         2.65222156, -1.58789162,  0.18472617,  0.64628324,  0.37560366,
         0.46404012, -1.21253202,  2.14328094,  2.7007946 , -1.69455615,
        -2.1579641 , -0.20540823,  1.44556758, -0.59223814,  0.85916959,
        -0.9053983 , -0.19027172,  0.16669902,  0.20493263, -0.45902117,
         1.63698415, -1.21857818, -1.65696904, -1.46428435, -0.69669855,
        -0.76318435, -0.52334722, -0.42755947, -0.04484432, -0.82979789,
        -0.46332114, -0.67451862, -1.49102635,  0.06895393,  0.60481114,
         0.28821297, -0.52112234, -0.45263761, -0.6

In [21]:
y_test[1]

1

d. Use the fully-trained selected model from (b) to evaluate performance on the preprocessed test data from (c).  Confirm that the accuracy of the test model is comparable to the expected performance of the model prior to selection.



In [23]:
from sklearn.metrics import f1_score
y_pred = model.predict(X_test)
f1 = f1_score(y_test, y_pred)
print(f"F1 score of model on test data: {f1}")

F1 score of model on test data: 0.9767032106499608


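The F1 score on the test data is high (about 0.977), which is acceptable for the ensemble model. Note that this is an F1 score rather than accuracy, so it is not directly comparable to an accuracy estimate obtained before selection.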
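
Since part (d) asks about accuracy relative to the performance expected before selection, one possible check is to compare a cross-validated accuracy on the training data with the accuracy on the held-out test data. The sketch below shows the pattern on synthetic data with a single logistic regression as a stand-in; in the notebook, `model`, `X_train`, `y_train`, `X_test`, and `y_test` would take the place of the hypothetical names used here, and cross-validating the full voting ensemble on 60,000 rows would likely take a long time.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic binary data, recoded to -1/1 to mirror the labels used above.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
y_demo = 2 * y_demo - 1
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000)  # stand-in for the selected model

# Expected performance prior to testing: mean cross-validated accuracy.
cv_acc = cross_val_score(clf, X_tr, y_tr, cv=5, scoring="accuracy").mean()

# Performance on the held-out test data.
clf.fit(X_tr, y_tr)
y_hat = clf.predict(X_te)
test_acc = accuracy_score(y_te, y_hat)
test_f1 = f1_score(y_te, y_hat)  # pos_label defaults to 1

print(f"CV accuracy: {cv_acc:.3f}, test accuracy: {test_acc:.3f}, test F1: {test_f1:.3f}")
```

A test accuracy close to the cross-validated estimate suggests the model generalizes as expected; a large gap in either direction would warrant a closer look.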