## Hyperparameter tuning

Hyperparameter tuning is the process of selecting the best values for a machine learning model's hyperparameters. Unlike model parameters, which are learned from the data, hyperparameters are set before training begins and control aspects of the learning process itself.

Effective tuning helps the model learn better patterns, avoid overfitting or underfitting, and achieve higher accuracy on unseen data.

**CV - cross-validation**

### GridSearchCV
It trains the model on every possible combination of the specified hyperparameter values to find the best-performing setup.
- Brute-force: every combination in the grid is evaluated.
- Slow and computationally expensive, which makes it hard to use with large datasets or many hyperparameters.

How it works:
- Create a grid of potential values for each hyperparameter.
- Train the model for every combination in the grid.
- Evaluate each model using cross-validation.
- Select the combination that gives the highest score.
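
To make the cost of the steps above concrete, here is a minimal sketch that enumerates a grid using only the standard library. The grid itself (`C`, `penalty`, `max_iter` and their values) and the helper `expand_grid` are illustrative assumptions, not part of scikit-learn.

```python
from itertools import product

# Hypothetical grid: names and values are illustrative assumptions.
param_grid = {
    "C": [0.01, 0.1, 1.0, 10.0],
    "penalty": ["l1", "l2"],
    "max_iter": [100, 500],
}


def expand_grid(grid):
    """Return one dict per combination of the listed values."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*grid.values())]


combinations = expand_grid(param_grid)
n_folds = 5

print(len(combinations))            # number of candidate settings
print(len(combinations) * n_folds)  # model fits needed for 5-fold cross-validation
```

The number of combinations is the product of the list lengths, so adding one more hyperparameter with a handful of values multiplies the total work. This is why grid search becomes impractical as the grid grows.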

In [1]:
# Tuning logistic regression using GridSearchCV

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import numpy as np
from sklearn.datasets import make_classification

# generate sample data
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10, n_classes=2, random_state=42)

# define candidate C values (inverse regularization strength) on a logarithmic scale
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}

logreg = LogisticRegression()

# GridSearchCV tries all combinations from param_grid and uses 5-fold cross-validation
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)

logreg_cv.fit(X, y)

print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_))
print("Best score is {}".format(logreg_cv.best_score_))

Tuned Logistic Regression Parameters: {'C': np.float64(0.006105402296585327)}
Best score is 0.853


### RandomizedSearchCV
It picks random combinations of hyperparameters from the given ranges instead of checking every single one.
- In each iteration it tries a new random combination of hyperparameter values.
- It records the model's performance for each combination.
- After several attempts it selects the best-performing set.
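
The loop described above can be sketched without any library. In this sketch the search space, the helper `sample_candidates` and the scoring function `toy_score` are all hypothetical. `toy_score` only stands in for a cross-validated score and does not train a model.

```python
import random

# Hypothetical search space: each entry draws one value from a seeded generator.
search_space = {
    "max_depth": lambda rng: rng.choice([3, 5, None]),
    "min_samples_leaf": lambda rng: rng.randint(1, 8),  # inclusive on both ends
    "criterion": lambda rng: rng.choice(["gini", "entropy"]),
}


def sample_candidates(space, n_iter, seed=0):
    """Draw n_iter random combinations from the search space."""
    rng = random.Random(seed)
    return [{name: draw(rng) for name, draw in space.items()} for _ in range(n_iter)]


def toy_score(params):
    """Stand-in for a cross-validated score; NOT a real model evaluation."""
    depth_bonus = 0.0 if params["max_depth"] is None else 0.01 * params["max_depth"]
    return 0.8 + depth_bonus - 0.005 * params["min_samples_leaf"]


candidates = sample_candidates(search_space, n_iter=10)
best = max(candidates, key=toy_score)  # keep the best-scoring combination
print(best)
```

The cost is fixed by `n_iter` rather than by the size of the search space, which is the main practical difference from grid search.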

In [3]:
# Tuning decision tree with RandomizedSearchCV

from sklearn.datasets import make_classification
from scipy.stats import randint
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_classes=2, random_state=42)

# define a list or a distribution to sample from for each hyperparameter
# (randint(1, 9) draws integers from 1 to 8; the upper bound is exclusive)
param_dist = {
    "max_depth": [3, None],
    "max_features": randint(1, 9),
    "min_samples_leaf": randint(1, 9),
    "criterion": ["gini", "entropy"]
}

tree = DecisionTreeClassifier()

# n_iter random combinations (10 by default) are sampled and evaluated using 5-fold cross-validation
tree_cv = RandomizedSearchCV(tree, param_dist, cv=5)
tree_cv.fit(X, y)

print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
print("Best score is {}".format(tree_cv.best_score_))

Tuned Decision Tree Parameters: {'criterion': 'gini', 'max_depth': None, 'max_features': 5, 'min_samples_leaf': 8}
Best score is 0.826


### Bayesian Optimization
It treats hyperparameter tuning like a mathematical optimization problem and learns from past results to decide what to try next.
- Build a probabilistic model (surrogate function) that predicts performance based on hyperparameters.
- Update this model after each evaluation.
- Use an acquisition function on the model to choose the most promising set to try next.
- Repeat until the trial budget is exhausted, then keep the best combination found.

Common surrogate models used in Bayesian optimization include:
- Gaussian Processes
- Random Forest Regression
- Tree-structured Parzen Estimators (TPE)

**Tip: The Optuna study can be saved to a database, which allows stopping and resuming hyperparameter tuning or running it across multiple machines at once.**

In [4]:
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris

# 1. Define the objective function
def objective(trial):
    # Load data inside or outside the function
    data, target = load_iris(return_X_y=True)
    
    # 2. Suggest hyperparameters using the 'trial' object
    # suggest_int for discrete values, suggest_float for continuous
    n_estimators = trial.suggest_int('n_estimators', 10, 200)
    max_depth = trial.suggest_int('max_depth', 2, 32, log=True)
    
    # 3. Initialize and evaluate the model
    clf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth)
    
    # We use cross-validation to get a stable score
    score = cross_val_score(clf, data, target, n_jobs=-1, cv=3)
    accuracy = score.mean()
    
    return accuracy

# 4. Create a study and optimize
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)

# 5. Results
print(f"Best Accuracy: {study.best_value}")
print(f"Best Params: {study.best_params}")

[I 2026-02-10 11:30:39,038] A new study created in memory with name: no-name-02bc37a0-dc87-458b-8176-294d21c4dcb6
[I 2026-02-10 11:30:55,438] Trial 0 finished with value: 0.96 and parameters: {'n_estimators': 181, 'max_depth': 7}. Best is trial 0 with value: 0.96.
[I 2026-02-10 11:30:57,348] Trial 1 finished with value: 0.9666666666666667 and parameters: {'n_estimators': 189, 'max_depth': 10}. Best is trial 1 with value: 0.9666666666666667.
[I 2026-02-10 11:30:58,994] Trial 2 finished with value: 0.96 and parameters: {'n_estimators': 34, 'max_depth': 18}. Best is trial 1 with value: 0.9666666666666667.
[I 2026-02-10 11:31:00,674] Trial 3 finished with value: 0.96 and parameters: {'n_estimators': 102, 'max_depth': 8}. Best is trial 1 with value: 0.9666666666666667.
[I 2026-02-10 11:31:02,440] Trial 4 finished with value: 0.9466666666666667 and parameters: {'n_estimators': 160, 'max_depth': 2}. Best is trial 1 with

Best Accuracy: 0.9666666666666667
Best Params: {'n_estimators': 189, 'max_depth': 10}


### Cross-Validation (CV)
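You should never tune hyperparameters on your test set: doing so leaks test information into model selection ("data leakage") and makes the final test score overly optimistic. Instead, use K-Fold Cross-Validation on the training data.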
1. Split the training data into $K$ "folds" (usually 5 or 10).
2. Train the model on $K-1$ folds and validate on the remaining fold. 
3. Repeat this $K$ times so every fold acts as the validator once. 
4. Average the scores. This gives a much more reliable estimate of how the hyperparameters will perform on unseen data.
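
The splitting in steps 1 to 3 can be sketched with plain index lists. The helper `k_fold_indices` is a hypothetical illustration, not scikit-learn's implementation; it splits in order without shuffling and gives any leftover samples to the first folds.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train, validation in folds:
    print(len(train), validation)
```

Every sample lands in exactly one validation fold, so each one is used for validation once and for training $K-1$ times. The per-fold scores from step 4 would then be averaged with a plain mean.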

### Parameters for most common algorithms

| Algorithm | Key Hyperparameters | What they do |
| :--- | :--- | :--- |
| **Random Forest** | `n_estimators`, `max_depth` | Number of trees and how deep they grow. |
| **XGBoost** | `learning_rate` ($\eta$), `subsample` | Step size of each boosting update and fraction of training rows sampled for each tree. |
| **SVM** | `C`, `kernel`, `gamma` ($\gamma$) | Penalty for margin violations, shape of the decision boundary, and influence range of each training point. |
| **KNN** | `n_neighbors`, `weights` | Number of neighbors and how much they "count." |

### Advantages of hyperparameter tuning
- **Improved Model Performance**: Finding the optimal combination of hyperparameters can significantly boost model accuracy and robustness.
- **Reduced Overfitting and Underfitting**: Tuning helps to prevent both overfitting and underfitting, resulting in a well-balanced model.
- **Enhanced Model Generalizability**: By selecting hyperparameters that optimize performance on validation data, the model is more likely to generalize well to unseen data.
- **Optimized Resource Utilization**: With careful tuning, resources such as computation time and memory can be used more efficiently, avoiding unnecessary work.
- **Improved Model Interpretability**: Tuning can favor simpler settings, such as a shallower tree, which make the model easier to interpret.