# Hyperparameter tuning of Machine Learning Models

* Hyperparameter tuning is the process of `finding the best hyperparameters` for a machine learning model.
* `Hyperparameters` are settings that the model does not learn from the data, such as the number of trees in a random forest.
* They are set before training begins.
* Hyperparameter tuning is also known as `hyperparameter optimization`.
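
To make the distinction concrete, here is a minimal sketch that contrasts hyperparameters (chosen before `fit`) with what the model learns (available only after `fit`). The synthetic data from `make_regression` and the specific values are assumptions chosen for illustration, not part of the Tips example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, used only for illustration.
X_demo, y_demo = make_regression(n_samples=50, n_features=3, noise=0.1, random_state=0)

# Hyperparameters: chosen by us before training.
demo_model = RandomForestRegressor(n_estimators=5, max_depth=2, random_state=0)
hyperparameters = demo_model.get_params()
print("n_estimators:", hyperparameters["n_estimators"])
print("max_depth:", hyperparameters["max_depth"])

# Learned structure: the fitted trees exist only after fit().
demo_model.fit(X_demo, y_demo)
print("Fitted trees:", len(demo_model.estimators_))
print("Depth of each tree:", [tree.get_depth() for tree in demo_model.estimators_])
```

The hyperparameters constrain what can be learned: no fitted tree can be deeper than `max_depth`, and the forest contains exactly `n_estimators` trees.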

## Why Is Hyperparameter Tuning Important?

* Model performance often depends heavily on the hyperparameters, so the right choice can make a large difference.
* Tuning searches for the hyperparameter values that give the best cross-validated performance.
* It helps to balance overfitting and underfitting, for example by limiting how deep each tree can grow.
* It helps to make the model more robust, because settings are chosen by validation rather than by guesswork.

## Techniques for Hyperparameter Tuning

There are several techniques for hyperparameter tuning. The most commonly used ones available in scikit-learn are:
* Grid Search
* Random Search
* Successive Halving
  * Halving Grid Search
  * Halving Random Search 
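
As a rough illustration of how grid search and random search differ, the sketch below enumerates candidates in plain Python. It is a conceptual sketch, not scikit-learn's implementation; the grid values match the ones used in the code cells of this notebook, and `n_iter = 10` is an assumed budget.

```python
import itertools
import random

params = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
}

# Grid search: evaluate every combination of the listed values.
names = list(params)
grid = [dict(zip(names, values)) for values in itertools.product(*params.values())]

# Random search: evaluate only a fixed budget of randomly chosen combinations.
rng = random.Random(42)
n_iter = 10
sampled = rng.sample(grid, n_iter)

print(len(grid))     # 3 * 4 * 3 = 36 candidates
print(len(sampled))  # 10 candidates
```

With 5-fold cross-validation, the full grid costs 36 × 5 = 180 model fits, while the random sample costs 10 × 5 = 50.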

In [37]:
import seaborn as sns
import pandas as pd
# The experimental flag must be imported before the Halving* classes.
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, HalvingGridSearchCV, HalvingRandomSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Hyperparameter Tuning with scikit-learn on the Tips Dataset
# This notebook demonstrates hyperparameter tuning with scikit-learn's GridSearchCV, RandomizedSearchCV, and HalvingGridSearchCV on the Tips dataset.


# Load and Explore the Data
tips = sns.load_dataset('tips')
tips.head()

,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [38]:
tips.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64   
dtypes: category(4), float64(2), int64(1)
memory usage: 7.4 KB


In [39]:
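from sklearn.preprocessing import LabelEncoder

# Encode each categorical column as integer codes.
# Note: LabelEncoder is designed for target labels; OrdinalEncoder is the feature-oriented equivalent.
le = LabelEncoder()
cat_columns = tips.select_dtypes(include=['category']).columns
for col in cat_columns:
    tips[col] = le.fit_transform(tips[col])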


In [40]:
X = tips.drop('tip', axis=1)
y = tips['tip']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(random_state=42)

params = {
    'n_estimators': [50, 100, 200],  # Number of trees in the forest
    'max_depth': [None, 10, 20, 30], # Maximum depth of each tree
    'min_samples_split': [2, 5, 10]  # Minimum samples required to split a node
}

grid_search = GridSearchCV(
    estimator=model,
    param_grid=params,
    cv=5,       # 5-fold cross-validation
    n_jobs=-1,  # Use all available CPU cores
    scoring='neg_mean_squared_error'
)

grid_search.fit(X_train, y_train)

print("Grid Search Best Params : ", grid_search.best_params_)
print("Grid Search Best Score : ", -grid_search.best_score_)

rf_best = grid_search.best_estimator_
y_pred = rf_best.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
print("Test MSE with Best Params: ", mse)

Grid Search Best Params :  {'max_depth': 10, 'min_samples_split': 5, 'n_estimators': 100}
Grid Search Best Score :  1.1604106955754185
Test MSE with Best Params:  1.0437141793821858
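
Beyond the single best score, the fitted search object records the cross-validation results for every candidate in `cv_results_`. The sketch below shows one way to rank them. It uses synthetic data and a small grid (both assumptions for illustration) so that it runs on its own; the same pattern should apply to `grid_search` above.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data and a small grid, used only for illustration.
X_demo, y_demo = make_regression(n_samples=120, n_features=4, noise=10.0, random_state=0)

demo_search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [10, 20], "max_depth": [2, None]},
    cv=3,
    scoring="neg_mean_squared_error",
)
demo_search.fit(X_demo, y_demo)

# One row per candidate; flip the sign so the column holds a plain MSE.
results = pd.DataFrame(demo_search.cv_results_)
results["mean_mse"] = -results["mean_test_score"]
summary = (
    results[["params", "mean_mse", "std_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(summary)
```

Comparing `mean_mse` with `std_test_score` across the top rows gives a sense of whether the best candidate is clearly ahead or within the noise of the cross-validation folds.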


In [41]:
X = tips.drop('tip', axis=1)
y = tips['tip']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(random_state=42)

params = {
    'n_estimators': [50, 100, 200],  # Number of trees in the forest
    'max_depth': [None, 10, 20, 30], # Maximum depth of each tree
    'min_samples_split': [2, 5, 10]  # Minimum samples required to split a node
}

# No random_state is set, so the sampled combinations can differ between runs.
randomized_search = RandomizedSearchCV(
    estimator=model,
    param_distributions=params,
    n_iter=10,  # Default: evaluate 10 of the 36 combinations, sampled at random
    cv=5,       # 5-fold cross-validation
    n_jobs=-1,  # Use all available CPU cores
    scoring='neg_mean_squared_error'
)

randomized_search.fit(X_train, y_train)

print("randomized Search Best Params : ", randomized_search.best_params_)
print("randomized Search Best Score : ", -randomized_search.best_score_)

rf_best = randomized_search.best_estimator_
y_pred = rf_best.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
print("Test MSE with Best Params: ", mse)

randomized Search Best Params :  {'n_estimators': 100, 'min_samples_split': 10, 'max_depth': 10}
randomized Search Best Score :  1.160473954910404
Test MSE with Best Params:  0.9849131613311116
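
Random search is not limited to fixed lists: `param_distributions` also accepts distributions to sample from, and `random_state` makes the sampling repeatable. The sketch below assumes synthetic data and illustrative ranges built with `scipy.stats.randint`; none of these values come from the Tips example.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data, used only for illustration.
X_demo, y_demo = make_regression(n_samples=120, n_features=4, noise=10.0, random_state=0)

# Lists are sampled uniformly; randint(low, high) draws integers in [low, high).
param_distributions = {
    "n_estimators": randint(10, 41),      # integers 10..40
    "max_depth": [None, 3, 5],
    "min_samples_split": randint(2, 11),  # integers 2..10
}

def run_search(seed):
    """Run a small randomized search with a fixed sampling seed."""
    search = RandomizedSearchCV(
        estimator=RandomForestRegressor(random_state=0),
        param_distributions=param_distributions,
        n_iter=5,
        cv=3,
        scoring="neg_mean_squared_error",
        random_state=seed,
    )
    search.fit(X_demo, y_demo)
    return search

first = run_search(seed=42)
second = run_search(seed=42)
print(first.best_params_)
print("Same result with the same seed:", first.best_params_ == second.best_params_)
```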


In [None]:
%%time
# The %%time cell magic must be the first line of the cell.
# HalvingGridSearchCV speeds up hyperparameter search by iteratively eliminating poor-performing
# parameter combinations (a "successive halving" approach).
# Initialize HalvingGridSearchCV
halving_grid_search = HalvingGridSearchCV(
    estimator=model,
    param_grid=params,
    cv=5,
    factor=2,
    # resource='n_estimators',
    scoring='neg_mean_squared_error'
)

# Fit HalvingGridSearchCV
halving_grid_search.fit(X_train, y_train)

# Best Parameters and Evaluation
print(f"Best Parameters: {halving_grid_search.best_params_}")

# Predict on test set
best_rf = halving_grid_search.best_estimator_
y_pred = best_rf.predict(X_test)

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error on Test Set: {mse:.2f}")

Best Parameters: {'max_depth': 30, 'min_samples_split': 5, 'n_estimators': 50}
Mean Squared Error on Test Set: 1.05
CPU times: total: 1min 56s
Wall time: 2min 32s
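
To see how successive halving narrows the field, the fitted search exposes the number of candidates and the amount of resource (training samples by default) used at each iteration. The following is a minimal sketch on synthetic data with an assumed small grid, not a rerun of the cell above.

```python
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data and a small grid, used only for illustration.
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

demo_halving = HalvingGridSearchCV(
    estimator=RandomForestRegressor(n_estimators=10, random_state=0),
    param_grid={"max_depth": [2, 4, 6, None], "min_samples_split": [2, 10]},
    cv=3,
    factor=2,        # Keep roughly half of the candidates after each iteration
    scoring="neg_mean_squared_error",
    random_state=0,  # Makes the subsampling of training rows repeatable
)
demo_halving.fit(X_demo, y_demo)

# Each iteration evaluates fewer candidates on more samples.
for i, (n_cand, n_res) in enumerate(
    zip(demo_halving.n_candidates_, demo_halving.n_resources_)
):
    print(f"iteration {i}: {n_cand} candidates, {n_res} samples each")
```

Early iterations judge candidates on small subsets of the data, so a configuration that only performs well with more data can be eliminated too soon; that is the trade-off for the reduced number of full-size fits.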
