In [1]:
!pip install optuna
!pip install optuna-dashboard
!pip install optuna-integration


In [2]:
import optuna

# Introduction

Hyperparameter tuning is the process of optimizing certain parameters in a machine learning model to improve the model's performance. These parameters are not learned from the data but are determined before the model training process begins. In practice, choosing the right hyperparameter values can significantly enhance the model's accuracy and computational efficiency.

Optuna is an open-source framework for automating hyperparameter tuning. Using Bayesian optimization methods, Optuna provides an efficient and straightforward way to find the best set of hyperparameters. One of Optuna's main advantages is its ability to manage the tuning process dynamically, allowing users to adaptively optimize hyperparameters based on previous experiment results.

In addition to Bayesian optimization, Optuna supports grid search and random search as alternative sampling strategies, multi-objective optimization, and pruning of unpromising trials (including a Hyperband-based pruner) to accelerate the tuning process. By combining these techniques, Optuna can optimize the hyperparameters of machine learning models efficiently.

This tutorial will discuss how to perform hyperparameter tuning using Optuna with a case study on the XGBoost model. However, the principles and methods used can be applied to other machine learning models, including deep learning models built with TensorFlow and PyTorch, as well as models from libraries like Scikit-Learn, LightGBM, and CatBoost.

# Dataset Setup

The code below prepares the data: it reads the dataset, preprocesses it, and splits it into training and testing sets. The dataset has an `id` column that contains only the row order and is unrelated to modeling, so it can be removed.

Next, determine the feature columns and also the target variable column. The target variable is located in the last column, which is `quality`. This column contains values `LOW` or `HIGH`, indicating the class of the observation. There are 11 other columns, all of which are numeric, and these will be used to predict the `quality` class. Since the class labels are still in text form, they need to be converted into numeric values, for example, using 0 for the `LOW` class and 1 for the `HIGH` class.

The next step is to split the data into training and testing sets using the `train_test_split` function. Here, we use 70 percent for training data and 30 percent for testing data.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Reading the dataset
data = pd.read_csv("https://raw.githubusercontent.com/sainsdataid/dataset/main/wine-quality-binary.csv")

# Removing the id column (not used for modeling)
data = data.drop(columns="id")    

# Defining features and target
X = data.drop('quality', axis=1)
y = data['quality'].apply(lambda q: 1 if q=="HIGH" else 0) 

print(data.info())

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1143 entries, 0 to 1142
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed.acidity         1143 non-null   float64
 1   volatile.acidity      1143 non-null   float64
 2   citric.acid           1143 non-null   float64
 3   residual.sugar        1143 non-null   float64
 4   chlorides             1143 non-null   float64
 5   free.sulfur.dioxide   1143 non-null   float64
 6   total.sulfur.dioxide  1143 non-null   float64
 7   density               1143 non-null   float64
 8   pH                    1143 non-null   float64
 9   sulphates             1143 non-null   float64
 10  alcohol               1143 non-null   float64
 11  quality               1143 non-null   object 
dtypes: float64(11), object(1)
memory usage: 107.3+ KB
None
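A possible variation, shown here only as a sketch: the labels can be encoded with an explicit mapping, and `stratify` can be passed to `train_test_split` so both splits keep similar class proportions. The small `toy` frame and its values are made up for illustration; the tutorial's own split above does not stratify.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical miniature dataset with the same kind of LOW/HIGH target
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0, 9.1, 10.0, 11.5, 12.3, 9.9, 10.8, 9.5],
    "pH": [3.5, 3.2, 3.3, 3.1, 3.4, 3.6, 3.3, 3.2, 3.0, 3.4, 3.3, 3.5],
    "quality": ["LOW", "LOW", "HIGH", "HIGH", "HIGH", "LOW",
                "LOW", "HIGH", "HIGH", "LOW", "HIGH", "LOW"],
})

X_toy = toy.drop(columns="quality")
y_toy = toy["quality"].map({"LOW": 0, "HIGH": 1})  # explicit label encoding

# stratify=y_toy keeps the LOW/HIGH ratio similar in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

print(y_tr.value_counts().to_dict(), y_te.value_counts().to_dict())
```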


# XGBoost Model Hyperparameters

Hyperparameters are key components that influence the performance of an XGBoost model. Below are some of the hyperparameters in the XGBoost model:

- **eta** or **learning_rate**: Shrinks the contribution of each new tree. Lower values make training slower but more stable and usually require more trees. Values that are too high can make the model overshoot and overfit.
- **max_depth**: The maximum depth of each tree. The deeper the tree, the more complex the model, and the higher the risk of overfitting.
- **n_estimators**: The number of trees (boosting rounds) to build.
- **subsample**: The proportion of training samples (rows) randomly drawn to build each tree.
- **colsample_bytree**: The proportion of features (columns) randomly sampled when building each tree.
- **min_child_weight**: The minimum sum of instance weights (Hessian) required in a child node. A split that would leave a child below this threshold is not made, so larger values give a more conservative model and help prevent overfitting.
- **gamma**: The minimum loss reduction required to split a leaf node further. Higher values make the model more conservative; if set too high, the model can become too simple to capture the complexity of the data.

In addition to these seven parameters, there are other parameters that can also be controlled for XGBoost model training. More details can be found in the official documentation at [**XGBoost Parameters (xgboost 2.1.0 documentation)**](https://xgboost.readthedocs.io/en/stable/parameter.html).

# Hyperparameter Tuning with Optuna

In practice, not all model hyperparameters need to be tuned. We can apply tuning to only a few parameters, while others can be set directly or use their default values. In this example, we will focus on tuning the hyperparameters mentioned earlier, namely `max_depth`, `learning_rate`, `n_estimators`, `subsample`, `colsample_bytree`, `min_child_weight`, and `gamma`. Feel free to adjust or reduce the number of hyperparameters if needed.

## Objective Function

In the tuning process, the first step is to create an objective function. This function contains a list of model hyperparameters and the corresponding search value ranges. For integer-type parameters such as maximum tree depth (`max_depth`) and the number of trees (`n_estimators`), we can define the values using the `suggest_int` function. Meanwhile, for parameters with decimal values, such as `learning_rate`, we use the `suggest_float` function. Additionally, if there are categorical parameters (for example, the optimizer parameter in neural network models), they can be set up using the `suggest_categorical` function.

Still within the objective function, we instantiate the XGBoost model with the suggested parameters. Next, to evaluate the model's performance, we use k-fold cross-validation with 3 folds. The return value of the objective function is the average score across the folds. When no `scoring` argument is given, `cross_val_score` uses the estimator's own `score` method, which is accuracy for classifiers. Other metrics can be used instead, such as `balanced_accuracy`, `average_precision`, `f1`, and so on (refer to [**Metrics and Scoring**](https://scikit-learn.org/stable/modules/model_evaluation.html)).
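As a small illustration of switching the metric, the `scoring` argument of `cross_val_score` accepts the metric names listed above. This sketch uses a `LogisticRegression` on synthetic data as a lightweight stand-in; both are assumptions made to keep the example fast, and the same argument applies to an `XGBClassifier`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data (illustrative only)
X_demo, y_demo = make_classification(n_samples=300, n_features=11, random_state=42)
clf = LogisticRegression(max_iter=1000)

# Default: the classifier's own score method, i.e. accuracy
acc = cross_val_score(clf, X_demo, y_demo, cv=3).mean()

# Explicit alternatives selected by name
f1 = cross_val_score(clf, X_demo, y_demo, cv=3, scoring="f1").mean()
bal_acc = cross_val_score(clf, X_demo, y_demo, cv=3, scoring="balanced_accuracy").mean()

print(f"accuracy={acc:.3f}  f1={f1:.3f}  balanced_accuracy={bal_acc:.3f}")
```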

In [2]:
import xgboost as xgb
from sklearn.model_selection import cross_val_score

# Defining the objective function
def objective(trial):
    param = {
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.1),
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
        'gamma': trial.suggest_float('gamma', 0, 5),
    }
    
    # Initialize the XGBoost model
    model = xgb.XGBClassifier(**param,
                              tree_method='hist',  # histogram-based tree construction
                              device='cuda')       # train on the GPU; with multiple GPUs, select one by ID, e.g. 'cuda:0'

    # On a machine without a GPU, remove the device argument (or set device='cpu')
    
    score = cross_val_score(model, X_train, y_train, cv=3).mean()   # calculating score using cross-validation
    return score

## Hyperparameter Search

Based on the objective function, we can create an object for the hyperparameter search with `create_study`. The parameter `direction` is set to `"maximize"` because the objective function returns accuracy, and a higher accuracy means a better model.

After the study object is created, run the `optimize` method. In the example below, the search runs 100 trials. Setting `n_jobs=-1` runs trials in parallel, with the number of parallel jobs set to the CPU count. This value can be adjusted: `n_jobs=1` runs the trials one at a time, and other values can be chosen to suit the available hardware.

Once the search process is complete, the best parameter values can be accessed via the `best_params` property of the study object.

**Note**: The results presented here may vary due to the involvement of random numbers, both in the model creation within the objective function and during the hyperparameter search process.

In [3]:
import optuna

# Create and run the optimization process with 100 trials
study = optuna.create_study(study_name="example_xgboost_study", direction='maximize') 
study.optimize(objective, n_trials=100, show_progress_bar=True, n_jobs=-1)   

# Retrieve the best parameter values
best_params = study.best_params
print(f"\nBest parameters: {best_params}")

[I 2024-10-05 08:55:03,092] A new study created in memory with name: example_xgboost_study


  0%|          | 0/100 [00:00<?, ?it/s]

Potential solutions:
- Use a data structure that matches the device ordinal in the booster.
- Set the device for booster before call to inplace_predict.




[I 2024-10-05 08:55:04,403] Trial 2 finished with value: 0.7312288211164616 and parameters: {'max_depth': 8, 'learning_rate': 0.08041857568258248, 'n_estimators': 106, 'subsample': 0.6331414087871776, 'colsample_bytree': 0.9942577014151945, 'min_child_weight': 10, 'gamma': 4.61923261800357}. Best is trial 2 with value: 0.7312288211164616.
[I 2024-10-05 08:55:06,421] Trial 4 finished with value: 0.7412257234847418 and parameters: {'max_depth': 8, 'learning_rate': 0.04196275526867259, 'n_estimators': 216, 'subsample': 0.8614958796458743, 'colsample_bytree': 0.7896644021769165, 'min_child_weight': 9, 'gamma': 3.418039865485575}. Best is trial 4 with value: 0.7412257234847418.
[I 2024-10-05 08:55:07,280] Trial 0 finished with value: 0.7399960575596295 and parameters: {'max_depth': 6, 'learning_rate': 0.04588135456414662, 'n_estimators': 414, 'subsample': 0.8719094134785379, 'colsample_bytree': 0.9200106577725002, 'min_child_weight': 9, 'gamma': 1.6195248840304954}. Best is trial 4 with val

Optuna provides visualization tools to show which parameters have the most significant influence on improving model performance during the search process. In this example, the `gamma` parameter is the most important, followed by `subsample` and `min_child_weight`.

In [4]:
import optuna.visualization as vis

display(vis.plot_param_importances(study))
display(vis.plot_optimization_history(study))

## Training the Best Model and Evaluation

The final step is to train the model with the best parameters found on the entire training data. The model can then be used to predict the test data. To enable reuse, we can also save the model to a file, for example, using the `dump` function from the `joblib` library.

The workflow consists of three steps (the cell below covers the first two):

1. **Training the model with the best parameters:**
   After you retrieve the best parameters from Optuna, train the model using the entire training dataset.

2. **Making predictions on the test data:**
   After training, use the model to predict the labels of the test data and evaluate its performance.

3. **Saving the model:**
   Save the trained model to a file using `joblib` so that it can be loaded and reused later.



In [5]:
from sklearn.metrics import accuracy_score, classification_report

# Train the model with the best parameters
best_model = xgb.XGBClassifier(**best_params)
best_model.fit(X_train, y_train)

# Predict the test data
y_pred = best_model.predict(X_test)

# Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

# Generate a classification report
cr = classification_report(y_test, y_pred)
print(f"\nReport:\n{cr}")

Accuracy: 0.7667638483965015

Report:
              precision    recall  f1-score   support

           0       0.72      0.78      0.75       152
           1       0.81      0.76      0.78       191

    accuracy                           0.77       343
   macro avg       0.76      0.77      0.77       343
weighted avg       0.77      0.77      0.77       343
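The saving step mentioned earlier could look like the sketch below, which uses `joblib.dump` and `joblib.load`. To keep it self-contained, a scikit-learn `GradientBoostingClassifier` trained on synthetic data stands in for `best_model`, and the file name `best_model.joblib` is a hypothetical choice; a fitted `XGBClassifier` can be passed to `joblib.dump` in the same way.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Stand-in for the tuned model (illustrative data and estimator)
X_demo, y_demo = make_classification(n_samples=200, n_features=11, random_state=42)
stand_in_model = GradientBoostingClassifier(random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "best_model.joblib")  # hypothetical file name
    joblib.dump(stand_in_model, model_path)    # save the fitted model
    reloaded_model = joblib.load(model_path)   # load it back

# The reloaded model should reproduce the original predictions
same_predictions = bool(
    (reloaded_model.predict(X_demo) == stand_in_model.predict(X_demo)).all()
)
print(same_predictions)
```

In practice the file would be written to a permanent location rather than a temporary directory, and loaded with the same library versions that were used to save it.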



<hr>  

### Source: [sainsdata.id](https://sainsdata.id/machine-learning/12313/tuning-hyperparameter-model-xgboost-dengan-optuna/) <hr>
