# CS82B – Principles of Data Science  
## Module 9 – Lab 1: Hyperparameter Tuning Using Grid Search  
**Student Name:** Derek McCrary  
**Course:** CS82B – Principles of Data Science  
**Due Date:** June 1, 2025


**Objective:**
The objective of this assignment is to practice using the `GridSearchCV` class in scikit-learn for hyperparameter tuning. You will implement a grid search to find the optimal hyperparameters for a K-nearest neighbors (KNN) regression model using the California Housing Prices dataset.

**Instructions:**

**Hyperparameter Tuning and Grid Search:** Hyperparameter tuning is the process of finding the hyperparameter values that give a machine learning model its best performance on a given task. Hyperparameters are settings chosen before training rather than learned from the data; they control the model's behavior and can significantly affect its performance. Parameters, in contrast, are learned from the data during training: they are the coefficients or weights the model adjusts to make predictions. Grid search tunes hyperparameters exhaustively: it evaluates every combination of the candidate values you list, typically with cross-validation, and keeps the best-scoring combination.
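
The distinction between hyperparameters and parameters can be seen directly on a scikit-learn estimator. The sketch below is illustrative only: it uses a tiny made-up dataset, and `Ridge` is swapped in as an assumption because it has learned coefficients to inspect (a KNN model stores the training data instead of fitting weights).

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor

# Tiny made-up dataset (illustrative only): y = 2*x0 + 3*x1.
X_toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y_toy = 2 * X_toy[:, 0] + 3 * X_toy[:, 1]

# Hyperparameters are set when the estimator is constructed, before any data is seen.
ridge = Ridge(alpha=0.1)
ridge_hyperparams = ridge.get_params()
has_coef_before_fit = hasattr(ridge, "coef_")  # parameters do not exist yet

# Parameters (coef_, intercept_) are learned from the data by fit().
ridge.fit(X_toy, y_toy)
print("hyperparameter alpha:", ridge_hyperparams["alpha"])
print("learned coefficients:", ridge.coef_)

# The KNN hyperparameters tuned in this lab, at their default values.
knn_defaults = KNeighborsRegressor().get_params()
print({name: knn_defaults[name] for name in ("n_neighbors", "weights", "p")})
```

The last line shows the defaults that any tuned KNN model should be compared against.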

**Implementation of GridSearchCV for KNN Regression:** In this assignment, you will complete the incomplete code for a grid search using the `GridSearchCV` class in scikit-learn to find the optimal hyperparameters for a KNN regression model. The KNN regression model predicts continuous target variables based on the values of the nearest neighbors in the feature space.
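
To make the prediction rule concrete, the sketch below computes a KNN regression prediction by hand and checks it against `KNeighborsRegressor`. The one-dimensional data and the query point are made up for illustration; they are not part of the lab dataset.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up 1-D training data and a single query point (illustrative only).
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
y_toy = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
x_query = np.array([[1.2]])
k = 3

# Find the k training points closest to the query.
distances = np.abs(X_toy[:, 0] - x_query[0, 0])
nearest = np.argsort(distances)[:k]

# weights='uniform': plain average of the neighbors' targets.
manual_uniform = y_toy[nearest].mean()

# weights='distance': average weighted by inverse distance, so closer points count more.
inverse_distance = 1.0 / distances[nearest]
manual_distance = np.sum(inverse_distance * y_toy[nearest]) / np.sum(inverse_distance)

# The same two predictions from scikit-learn.
sk_uniform = KNeighborsRegressor(n_neighbors=k, weights="uniform").fit(X_toy, y_toy).predict(x_query)[0]
sk_distance = KNeighborsRegressor(n_neighbors=k, weights="distance").fit(X_toy, y_toy).predict(x_query)[0]

print("uniform :", manual_uniform, sk_uniform)
print("distance:", manual_distance, sk_distance)
```

The `n_neighbors` and `weights` values used here are two of the hyperparameters the grid search varies; `p` changes how the distances themselves are measured.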

**Analyzing the Results:** Analyze the results obtained from the grid search. Discuss the impact of the best hyperparameters found on the model's performance. Compare the results with the default hyperparameter settings of the KNN regression model. Interpret the best score obtained and assess the performance improvement achieved through hyperparameter tuning.

**Compare and Analyze the Results:** Once you have run `RandomizedSearchCV`, compare its results with those obtained from `GridSearchCV`. Evaluate how the alternative search method affects both the model's performance and the time taken for hyperparameter tuning.

**Conclusion and Discussion:** Conclude your assignment by summarizing the key findings from both `GridSearchCV` and `RandomizedSearchCV`.

In [1]:
!pip3 install --upgrade pip
!pip3 install scikit-optimize



In [2]:
# Step 1: Import required libraries
import numpy as np
import pandas as pd

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

**Loading the California Housing Dataset:** Start by loading the California Housing Prices dataset with the `fetch_california_housing()` function from scikit-learn's datasets module. The dataset describes California districts using eight numeric features, such as median income, average number of rooms, and population. The target is the median house value of each district, expressed in units of $100,000, so the error metrics reported later are in the same units. Separate the features and the target variable.

In [3]:
# Step 2: Load the California Housing dataset
data = fetch_california_housing(as_frame=True)
X = data.data
y = data.target

# Step 3: Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 4: Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

**Setting up the GridSearchCV Object:** Initialize the KNN regressor model (`KNeighborsRegressor()`) and define a dictionary of hyperparameters. The hyperparameters dictionary should include the values you want to explore during the grid search. Then, set up the `GridSearchCV` object using the defined KNN regressor model, the hyperparameters dictionary, the scoring metric (use `'neg_mean_absolute_error'` for mean absolute error), and the number of CPU cores to utilize (`n_jobs=-1` to use all available CPU cores).

In [4]:
# Step 5: Define the KNN model and parameter grid
knn = KNeighborsRegressor()
param_grid = {
    'n_neighbors': [3, 5, 7, 9, 11],
    'weights': ['uniform', 'distance'],
    'p': [1, 2]  # Manhattan and Euclidean distances
}

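# Step 6: Perform Grid Search with Cross-Validation
# Note: this run uses scoring='r2', not the 'neg_mean_absolute_error' named in the
# instructions, so best_score_ is a mean cross-validated R², not a negated MAE.
grid_search = GridSearchCV(knn, param_grid, cv=5, scoring='r2', n_jobs=-1)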
grid_search.fit(X_train_scaled, y_train)
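
The cell above scales `X_train` once and then cross-validates on the scaled data, so every validation fold has already contributed to the scaler's mean and variance. One common way to keep scaling inside each fold is to wrap the scaler and the model in a `Pipeline`. The sketch below is a minimal version under stated assumptions: synthetic data from `make_regression` stands in for the California Housing data so it runs offline, the `_demo` names are hypothetical, and `n_jobs` is left at its default.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: replaces the California Housing features).
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# The scaler is refit on the training part of every CV fold.
pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsRegressor())])

# Pipeline hyperparameters are addressed as <step name>__<parameter name>.
pipe_grid = {
    "knn__n_neighbors": [3, 5, 7, 9, 11],
    "knn__weights": ["uniform", "distance"],
    "knn__p": [1, 2],
}

pipe_search = GridSearchCV(pipe, pipe_grid, cv=5, scoring="r2")
pipe_search.fit(X_tr_demo, y_tr_demo)  # raw features go in; scaling happens inside

test_r2 = pipe_search.score(X_te_demo, y_te_demo)  # uses the refit best pipeline
print("Best parameters:", pipe_search.best_params_)
print(f"Test R²: {test_r2:.4f}")
```

With this layout there is no separate `X_train_scaled` to keep track of, and the test set is transformed by the same scaler that was fit with the winning configuration.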

**Fitting the GridSearchCV Object:** Fit the `GridSearchCV` object to the features and target variables using the `fit()` method. This will perform cross-validation with the specified hyperparameter combinations and evaluate the models using the negative mean absolute error (MAE) as the scoring metric.
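
scikit-learn always treats a higher score as better, which is why error metrics are exposed in negated form such as `'neg_mean_absolute_error'`. The sketch below shows how `best_score_` might be read under that convention. It is a sketch only: the data is synthetic and the `mae_search` name is hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data so the sketch runs without downloading anything.
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=0)

mae_search = GridSearchCV(
    KNeighborsRegressor(),
    {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]},
    cv=5,
    scoring="neg_mean_absolute_error",
)
mae_search.fit(X_demo, y_demo)

# best_score_ is the mean cross-validated score of the winning combination.
# With a negated error metric it is <= 0, so flip the sign to report an MAE.
best_cv_mae = -mae_search.best_score_

print("Best estimator:", mae_search.best_estimator_)
print("Best parameters:", mae_search.best_params_)
print(f"Best score (negated MAE): {mae_search.best_score_:.4f}")
print(f"Cross-validated MAE:      {best_cv_mae:.4f}")
```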


In [5]:
# Step 7: Evaluate the best model
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test_scaled)

r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)

**Printing the Best Results:** Once the grid search is complete, print the best estimator, best parameters, and best score obtained from the grid search. The best estimator is the KNN model refit on the full training set with the best hyperparameters found. The best parameters are the specific hyperparameter values that yielded the best cross-validated performance. The best score is the mean cross-validated score of that model under the chosen `scoring` metric; with `'neg_mean_absolute_error'` it is the negated MAE, so values closer to zero are better.

In [6]:
# Step 8: Display results
print("Best Parameters:", grid_search.best_params_)
print(f"R² Score: {r2:.4f}")
print(f"Root Mean Squared Error: {rmse:.4f}")
print(f"Mean Absolute Error: {mae:.4f}")

Best Parameters: {'n_neighbors': 9, 'p': 1, 'weights': 'distance'}
R² Score: 0.7233
Root Mean Squared Error: 0.6027
Mean Absolute Error: 0.4071
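
The instructions also ask for a comparison against the default hyperparameter settings. A minimal sketch of that comparison follows. It assumes synthetic data in place of the housing data and uses hypothetical `_demo` names, so its numbers say nothing about the results printed above.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption), split and scaled like the notebook's data.
X_demo, y_demo = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=42)
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)
scaler_demo = StandardScaler().fit(X_tr_demo)
X_tr_s, X_te_s = scaler_demo.transform(X_tr_demo), scaler_demo.transform(X_te_demo)

# Baseline: default hyperparameters (n_neighbors=5, weights='uniform', p=2).
baseline = KNeighborsRegressor().fit(X_tr_s, y_tr_demo)
baseline_mae = mean_absolute_error(y_te_demo, baseline.predict(X_te_s))

# Tuned: the same grid as in the notebook, which contains the default combination.
grid_demo = {"n_neighbors": [3, 5, 7, 9, 11], "weights": ["uniform", "distance"], "p": [1, 2]}
tuned = GridSearchCV(KNeighborsRegressor(), grid_demo, cv=5, scoring="neg_mean_absolute_error")
tuned.fit(X_tr_s, y_tr_demo)
tuned_mae = mean_absolute_error(y_te_demo, tuned.predict(X_te_s))

print(f"Default KNN test MAE: {baseline_mae:.4f}")
print(f"Tuned KNN test MAE:   {tuned_mae:.4f}  {tuned.best_params_}")
```

Because the default combination is part of the grid, the tuned model cannot score worse than the default in cross-validation. On the held-out test set the gap can be small or even reversed, which is worth noting in the analysis.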


**Exploring Alternative Hyperparameter Search Methods:** In addition to `GridSearchCV`, scikit-learn offers other hyperparameter search methods that you can explore. `RandomizedSearchCV` is a popular alternative: instead of trying every combination, it samples a fixed number of combinations (`n_iter`) from the specified hyperparameter distributions.
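
The difference in search budget can be inspected without fitting any model. The sketch below uses `ParameterGrid` and `ParameterSampler`, the scikit-learn helpers that enumerate grid candidates and draw random candidates. The variable names are hypothetical, and the values mirror the ones used in this lab.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Grid search: every combination of the listed values.
grid = {"n_neighbors": [3, 5, 7, 9, 11], "weights": ["uniform", "distance"], "p": [1, 2]}
grid_candidates = list(ParameterGrid(grid))  # 5 * 2 * 2 combinations

# Randomized search: n_iter draws from lists and/or distributions.
distributions = {
    "n_neighbors": randint(3, 15),  # integers 3..14 (upper bound is exclusive)
    "weights": ["uniform", "distance"],
    "p": [1, 2],
}
sampled_candidates = list(ParameterSampler(distributions, n_iter=10, random_state=42))

print(len(grid_candidates), "grid candidates")
print(len(sampled_candidates), "sampled candidates")
for candidate in sampled_candidates[:3]:
    print(candidate)
```

With 5-fold cross-validation, the grid above costs 100 model fits plus one final refit, while 10 sampled candidates cost 50 fits plus the refit.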

**Try RandomizedSearchCV:** As an extension to this assignment, try implementing `RandomizedSearchCV` with the same KNN regression model and California Housing dataset. Follow the same steps as in the `GridSearchCV` implementation, but substitute the search class and adjust its parameters (for example, `param_distributions` and `n_iter` in place of the parameter grid). Experiment with different hyperparameter distributions or search strategies.

In [7]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

# Step 1: Define the KNN model
knn = KNeighborsRegressor()

# Step 2: Define the parameter distributions
param_distributions = {
    'n_neighbors': randint(3, 15),  # random integers from 3 to 14
    'weights': ['uniform', 'distance'],
    'p': [1, 2]
}

# Step 3: Perform Randomized Search with Cross-Validation
random_search = RandomizedSearchCV(
    knn,
    param_distributions=param_distributions,
    n_iter=10,  # number of random combinations to try
    cv=5,
    scoring='r2',
    random_state=42,
    n_jobs=-1
)

random_search.fit(X_train_scaled, y_train)

# Step 4: Evaluate the best model
best_random_model = random_search.best_estimator_
y_pred_random = best_random_model.predict(X_test_scaled)

# Step 5: Show evaluation metrics
r2_random = r2_score(y_test, y_pred_random)
rmse_random = np.sqrt(mean_squared_error(y_test, y_pred_random))
mae_random = mean_absolute_error(y_test, y_pred_random)

print("Best Parameters (RandomizedSearchCV):", random_search.best_params_)
print(f"R² Score: {r2_random:.4f}")
print(f"Root Mean Squared Error: {rmse_random:.4f}")
print(f"Mean Absolute Error: {mae_random:.4f}")

Best Parameters (RandomizedSearchCV): {'n_neighbors': 7, 'p': 1, 'weights': 'distance'}
R² Score: 0.7205
Root Mean Squared Error: 0.6057
Mean Absolute Error: 0.4080
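
The assignment also asks how the two search methods compare in tuning time, which the cells above do not measure. One way to measure it is sketched below. It assumes synthetic data and a hypothetical `timed_fit` helper, and the timings it prints depend on the machine, so they are not a substitute for timing the actual notebook cells.

```python
import time

from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (assumption) so the sketch runs offline.
X_demo, y_demo = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)


def timed_fit(search, X, y):
    """Fit a search object and return the elapsed wall-clock seconds (hypothetical helper)."""
    start = time.perf_counter()
    search.fit(X, y)
    return time.perf_counter() - start


grid_demo = GridSearchCV(
    KNeighborsRegressor(),
    {"n_neighbors": [3, 5, 7, 9, 11], "weights": ["uniform", "distance"], "p": [1, 2]},
    cv=5,
    scoring="r2",
)
random_demo = RandomizedSearchCV(
    KNeighborsRegressor(),
    {"n_neighbors": randint(3, 15), "weights": ["uniform", "distance"], "p": [1, 2]},
    n_iter=10,
    cv=5,
    scoring="r2",
    random_state=42,
)

grid_seconds = timed_fit(grid_demo, X_demo, y_demo)
random_seconds = timed_fit(random_demo, X_demo, y_demo)

for name, search, seconds in [
    ("GridSearchCV", grid_demo, grid_seconds),
    ("RandomizedSearchCV", random_demo, random_seconds),
]:
    n_candidates = len(search.cv_results_["params"])
    print(f"{name}: {n_candidates} candidates, best CV R² = {search.best_score_:.4f}, {seconds:.2f} s")
```

Reporting the candidate count next to the elapsed time makes the trade-off explicit: the randomized search evaluates half as many candidates here, and it can also sample `n_neighbors` values that the grid never lists.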
