# Optimizing Your Machine Learning – Hyperparameter Tuning Simplified

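What exactly is hyperparameter tuning? It is the process of choosing a machine learning model's configuration settings, known as hyperparameters, so that the model performs as well as possible. Unlike model parameters, which are learned during training, hyperparameters are set before training starts.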

The aim is to find the *optimal combination of hyperparameter values* that leads to the **best model performance**.

### Methods:

1.	Grid Search: Systematically searches through a predefined set of hyperparameter combinations.
2.	Random Search: Randomly samples hyperparameter combinations from a specified range.
3.	Bayesian Optimization: Uses probabilistic models to model the performance landscape and iteratively explores promising regions.
4.	Genetic Algorithms: Employs evolutionary principles to evolve a population of hyperparameter sets over multiple generations.

For the purposes of this article, I will briefly step through the two most popular methods, Grid Search and Random Search, with an example of each.


We need to install the required package **scikit-learn** for this exercise.

In [2]:
pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.
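Before touching any data, the difference between the two methods can be sketched with scikit-learn's `ParameterGrid` and `ParameterSampler` helpers. The toy search space below is hypothetical and exists only for illustration: Grid Search enumerates every combination, while Random Search draws a fixed number of them.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Hypothetical toy search space, used only for illustration
space = {
    "max_depth": [3, 5, 7],
    "min_samples_leaf": [1, 2, 4],
}

# Grid search enumerates every combination: 3 x 3 = 9 candidates
grid_candidates = list(ParameterGrid(space))

# Random search draws a fixed number of candidates from the same space
random_candidates = list(ParameterSampler(space, n_iter=4, random_state=0))

print(len(grid_candidates))    # 9
print(len(random_candidates))  # 4
for candidate in random_candidates:
    print(candidate)
```

The cost of Grid Search grows with the product of the list lengths, whereas the cost of Random Search is fixed by `n_iter`.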


Let's use the open-source Titanic dataset for our analysis and modeling. To keep things simple and to focus only on the hyperparameter tuning methods, we will use the same classifier and the same set of hyperparameters for both methods.

In reality, there may be more adjustments and iterations, or we may even conclude we need to switch to a different algorithm / classifier.

In [3]:
# Import necessary libraries
import seaborn as sns
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import accuracy_score

# Load the Titanic dataset from seaborn
titanic = sns.load_dataset('titanic')

# Drop rows with a missing value in any column for simplicity
# (most rows lack `deck`, so only a small subset of passengers remains)
titanic = titanic.dropna()

# Prepare the data
X = titanic[['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare']]
y = titanic['survived']

# Convert categorical variables to numerical using one-hot encoding
X = pd.get_dummies(X, columns=['sex'], drop_first=True)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Decision Tree Classifier
dt_classifier = DecisionTreeClassifier()


Here we define the hyperparameters for both methods:

In [4]:
# Hyperparameter grid
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 5, 7, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
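To get a feel for how much work this grid implies, the number of candidates and model fits can be counted up front. This is a small sketch that repeats the grid above; the 5 folds and the 10 random draws are the values used in the searches in this article.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as defined above, repeated so the snippet runs on its own
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 7, 10, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

cv_folds = 5         # cv=5 in both searches
n_random_draws = 10  # n_iter=10 in the random search

n_candidates = len(ParameterGrid(param_grid))  # 2 * 5 * 3 * 3

print(n_candidates)               # 90 combinations in the grid
print(n_candidates * cv_folds)    # 450 fits for Grid Search
print(n_random_draws * cv_folds)  # 50 fits for Random Search
```

Each search also performs one extra fit at the end to retrain the best candidate on the full training set.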

In [5]:

# GridSearchCV
grid_search = GridSearchCV(dt_classifier, param_grid, cv=5)
grid_search.fit(X_train, y_train)
grid_search_best_params = grid_search.best_params_


# best_estimator_ has already been refit on the full training set with the best parameters
grid_best_model = grid_search.best_estimator_
grid_predictions = grid_best_model.predict(X_test)



# Evaluate performance
grid_accuracy = accuracy_score(y_test, grid_predictions)

print("Grid Search Best Parameters:", grid_search_best_params)
print("Grid Search Accuracy:", grid_accuracy)


Grid Search Best Parameters: {'criterion': 'entropy', 'max_depth': 5, 'min_samples_leaf': 2, 'min_samples_split': 10}
Grid Search Accuracy: 0.717391304347826
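Besides the best parameters, a fitted search object keeps the cross-validation score of every candidate in `cv_results_`, which helps show how close the runners-up were. The sketch below uses synthetic stand-in data and a smaller, hypothetical grid (both assumptions, not the Titanic setup above) so that it runs on its own.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for this sketch, not the Titanic data)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

demo_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [3, 5, None], "min_samples_leaf": [1, 4]},  # hypothetical small grid
    cv=5,
)
demo_search.fit(X_demo, y_demo)

# Mean cross-validated accuracy of the winning candidate
print(demo_search.best_score_)

# One row per candidate, sorted so that rank 1 comes first
results = pd.DataFrame(demo_search.cv_results_)
cols = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[cols].sort_values("rank_test_score").head())
```

The same attributes are available on the `grid_search` object fitted above.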


Next, let's take a look at RandomizedSearchCV, which here samples 10 candidates (`n_iter=10`) from the same grid.

In [6]:
# RandomizedSearchCV
random_search = RandomizedSearchCV(dt_classifier, param_grid, n_iter=10, cv=5, random_state=42)
random_search.fit(X_train, y_train)
random_search_best_params = random_search.best_params_

# best_estimator_ has already been refit on the full training set with the best parameters
random_best_model = random_search.best_estimator_
random_predictions = random_best_model.predict(X_test)


# Evaluate performance
random_accuracy = accuracy_score(y_test, random_predictions)

print("Random Search Best Parameters:", random_search_best_params)
print("Random Search Accuracy:", random_accuracy)


Random Search Best Parameters: {'min_samples_split': 2, 'min_samples_leaf': 1, 'max_depth': 3, 'criterion': 'gini'}
Random Search Accuracy: 0.7391304347826086
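The run above samples from fixed lists, but `RandomizedSearchCV` also accepts distributions, which is how it can draw values from a range rather than from a handful of preset choices. The following is a minimal sketch on synthetic stand-in data; the ranges and the data are assumptions chosen for illustration, not the settings used in this article.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for this sketch, not the Titanic data)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# randint(low, high) draws integers from low up to high - 1
param_distributions = {
    "criterion": ["gini", "entropy"],
    "max_depth": randint(2, 11),          # integers 2..10
    "min_samples_split": randint(2, 11),  # integers 2..10
    "min_samples_leaf": randint(1, 5),    # integers 1..4
}

demo_random = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions,
    n_iter=10,
    cv=5,
    random_state=42,
)
demo_random.fit(X_demo, y_demo)

print(demo_random.best_params_)
print(demo_random.best_score_)
```

With distributions, the search is no longer limited to the values someone thought to list in advance, while the budget stays fixed at `n_iter` candidates.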


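A quick comparison of the two test accuracies tells us which search produced the better model in this run.


Here, Random Search came out slightly ahead on the test set (about 0.739 versus 0.717 for Grid Search), even though it evaluated only 10 of the 90 combinations in the grid. The gap is small, the test set is small, and the decision tree is created without a `random_state`, so the ranking may change between runs. Because Grid Search tries every combination, its best cross-validation score should be at least as good as the one Random Search finds on the same grid, but a better cross-validation score does not guarantee a better score on held-out test data.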
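When comparing the two searches, it can help to look at the cross-validation score each one selected on, not only at test accuracy. The sketch below uses synthetic stand-in data and a seeded tree (both assumptions made so the snippet is repeatable and self-contained) with the same grid as in this article.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Titanic data)
X_cmp, y_cmp = make_classification(n_samples=300, n_features=6, random_state=1)

grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 7, 10, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# Seeding the tree makes each candidate's score repeatable across both searches
tree = DecisionTreeClassifier(random_state=0)

grid_cmp = GridSearchCV(tree, grid, cv=5).fit(X_cmp, y_cmp)
rand_cmp = RandomizedSearchCV(
    tree, grid, n_iter=10, cv=5, random_state=42
).fit(X_cmp, y_cmp)

print("Grid Search CV accuracy:  ", grid_cmp.best_score_)
print("Random Search CV accuracy:", rand_cmp.best_score_)
```

With identical folds and a seeded estimator, the random search scores a subset of the candidates the grid search scores, so its best cross-validation score cannot exceed the grid's. Test accuracy carries no such guarantee.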