# Project AIF 24-25



### 1. Introduction
This notebook report is part of the AIF 24-25 course of the Master's Degree in Computer Science (Artificial Intelligence curriculum) at the University of Pisa.  
It builds on code from my thesis work, "The Hyperview Challenge: proposal of a Machine Learning model to improve the state of the art in estimating soil parameters from hyperspectral images", which is based on the [Hyperview Challenge](https://platform.ai4eo.eu/seeing-beyond-the-visible-permanent).

In this project, we compare different Optuna samplers, in particular the genetic-algorithm-based `NSGAIISampler` alongside `TPESampler` and `RandomSampler`, to assess how well they tune a Random Forest regressor for predicting soil parameters. We systematically analyze both **efficiency** (CPU time, memory usage) and **effectiveness** (MSE relative to a baseline).

Below, we present a compact report including **Related Works**, **Methodologies**, **Assessment**, **Conclusions**, and an **Appendix**.  
We also provide the full code and additional documentation in the [GitHub repository](https://github.com/alessioPardiniJob/AIF24-25).

### 2. Related Work

Hyperparameter optimization plays a crucial role in developing efficient machine learning models. Akiba et al. (2019) presented Optuna, a hyperparameter optimization framework, demonstrating that it outperforms traditional frameworks such as Hyperopt and SMAC on challenging problems such as Combined Algorithm Selection and Hyperparameter Optimization (CASH) with complex search spaces ([paper](https://arxiv.org/abs/1907.10902)).


Chiba (2024) evaluated open-source optimization algorithms, including TPE, CMA-ES, and NSGA-II, for optimizing the composition of functionally graded materials (FGMs) under thermal variations. Using the Optuna framework, the study found CMA-ES to be the most effective in reducing residual thermal stress, while TPE converged faster but was less effective than the evolutionary algorithms ([paper](https://doi.org/10.1504/IJCAET.2024.10063825)).

Most existing literature has focused on the NSGA-II sampler for multi-objective optimization, as detailed by Haiping et al. (2023) ([paper](https://doi.org/10.1007/s10462-023-10526-z)), leaving a gap in systematic evaluations for single-objective optimization.

Our study aims to address this gap by evaluating different configurations of the NSGA-II sampler in a single-objective setting (minimizing the relative MSE, i.e. the Random Forest's MSE divided by the baseline's MSE). We analyze both computational performance (CPU time and memory usage) and effectiveness (reduction in relative MSE), broadening the set of practical sampler comparisons available.


### 3. Methodologies

#### 3.1 Setup and Constants
Notebook initialization with standard imports, third-party libraries, and essential constants.

In [8]:
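import gc
import multiprocessing as mp
import os
import sys
import time
import numpy as np
import optuna
import pandas as pd
import psutil
from optuna.samplers import NSGAIISampler, RandomSampler, TPESampler
from optuna.samplers.nsgaii import SPXCrossover
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
# Make the project's 'utils' directory (relative to the notebook) importable
notebook_dir = os.getcwd()
utils_path = os.path.abspath(os.path.join(notebook_dir, '../utils'))
sys.path.append(utils_path)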


In [3]:
# Constants used in the notebook
DEBUG = True
AUGMENT_CONSTANT_RF = 1
LABEL_NAMES = ["P2O5", "K", "Mg", "pH"]
LABEL_MAXS = np.array([325.0, 625.0, 400.0, 7.8])
COL_IX = [0, 1, 2, 3]

# Set number of trials based on debug mode
n_trials = 3 if DEBUG else 20

#### 3.2 Data Loading, Preprocessing, and Cross-Validation Pipeline for Training and Test Sets

(For details on data loading and preprocessing, refer to the [GitHub repository](https://github.com/alessioPardiniJob/AIF24-25).)


In [9]:
# Ensure the directory and file paths are correctly set for your system
train_data_dir = os.path.abspath(os.path.join(notebook_dir, '../data/train_data/train_data'))
test_data_dir = os.path.abspath(os.path.join(notebook_dir, '../data/test_data'))
gt_data_path = os.path.abspath(os.path.join(notebook_dir, '../data/train_data/train_gt.csv'))


# Load training raw data
X_train, M_train, y_train, X_aug_train, M_aug_train, y_aug_train = load_data(
    train_data_dir, gt_data_path, is_train=True, augment_constant=AUGMENT_CONSTANT_RF, DEBUG=DEBUG, LABEL_MAXS=LABEL_MAXS
)

# Load test raw data
X_test, M_test = load_data(
    test_data_dir, gt_file_path=None, is_train=False, DEBUG=DEBUG, LABEL_MAXS=LABEL_MAXS
)

# Preprocessing the loaded data
X_tr_processed_RF, avg_edge_train = preprocess(X_train, M_train)
X_aug_processed_RF, avg_edge_train_aug_RF = preprocess(X_aug_train, M_aug_train)
X_te_processed_RF, avg_edge_test = preprocess(X_test, M_test)

# Select set of labels 
y_train_col = y_train[:, COL_IX]  
y_aug_train_col = y_aug_train[:len(y_train_col)*AUGMENT_CONSTANT_RF, COL_IX]

# 5-fold cross validation for training.
kfold = KFold(shuffle=True, random_state=2022)
kfold.get_n_splits(X_aug_train, y_aug_train_col)

Loading training data ..: 100%|██████████| 10/10 [00:00<00:00, 1041.93it/s]
Loading augmentation 1 ..: 100%|██████████| 10/10 [00:00<00:00, 671.98it/s]
Loading test data ..: 100%|██████████| 10/10 [00:00<00:00, 1049.02it/s]
INFO: Preprocessing data ...: 100%|██████████| 10/10 [00:00<00:00, 420.98it/s]
INFO: Preprocessing data ...: 100%|██████████| 10/10 [00:00<00:00, 419.48it/s]
INFO: Preprocessing data ...: 100%|██████████| 10/10 [00:00<00:00, 435.81it/s]


5

#### 3.3 Objective Function Definition

The objective function optimizes the hyperparameters of the [Random Forest](https://scikit-learn.org/1.6/modules/generated/sklearn.ensemble.RandomForestRegressor.html) model using Optuna:
- **n_estimators**: Number of trees, suggested in the range [50, 300] with steps of 25.
- **max_depth**: Maximum tree depth, suggested in the range [3, 15].
- **min_samples_split**: Minimum samples required to split a node, suggested in the range [2, 10].
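
To give a sense of scale, the three ranges above define a finite grid whose size can be counted directly. The sketch below is only an illustration: the helper `n_values` and the `search_space` dictionary are hypothetical names, and the bounds are copied from the list above.

```python
from math import prod

# Ranges copied from the list above: (low, high, step), bounds inclusive.
search_space = {
    "n_estimators": (50, 300, 25),
    "max_depth": (3, 15, 1),
    "min_samples_split": (2, 10, 1),
}


def n_values(low, high, step):
    """Count the integers in [low, high] reachable from low in increments of step."""
    return (high - low) // step + 1


sizes = {name: n_values(*bounds) for name, bounds in search_space.items()}
total = prod(sizes.values())

print(sizes)  # number of admissible values per hyperparameter
print(total)  # number of distinct configurations in the grid
```

With 11 x 13 x 9 = 1287 possible configurations, a study of 20 trials evaluates under 2% of the grid, so the choice of sampler determines which small part of the space is actually explored.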

In [21]:
def objective(trial):
    """
    Objective function for Optuna:
    Performs cross-validation with RandomForest and returns the relative MSE (mse_rf / mse_bl) to minimize.
    """
    # Suggested parameters
    n_estimators = trial.suggest_int("n_estimators", 50, 300, step=25)
    max_depth = trial.suggest_int("max_depth", 3, 15)
    min_samples_split = trial.suggest_int("min_samples_split", 2, 10)

    # Cross-validation
    y_hat_rf_cv, y_hat_bl_cv, y_v_list_cv = [], [], []

    for idx, (ix_train, ix_valid) in enumerate(kfold.split(np.arange(len(y_train)), avg_edge_train.astype(int))):
        # Merge training data
        X_t = np.concatenate((X_tr_processed_RF[ix_train], X_aug_processed_RF[ix_train]), axis=0)
        y_t = np.concatenate((y_train_col[ix_train], y_aug_train_col[ix_train]), axis=0)

        # Validation data
        X_v, y_v = X_tr_processed_RF[ix_valid], y_train_col[ix_valid]
        y_v_list_cv.append(y_v)

        # Baseline model
        baseline = BaselineRegressor()
        baseline.fit(X_t, y_t)
        y_b = baseline.predict(X_v)
        y_hat_bl_cv.append(y_b)

        # RandomForest model
        model = RandomForestRegressor(
            n_estimators=n_estimators,
            max_depth=max_depth,
            min_samples_split=min_samples_split,
            n_jobs=-1
        )
        model.fit(X_t, y_t)
        y_hat_rf_cv.append(model.predict(X_v))

    # Compute relative MSE for each fold
    total_score = 0.0

    for y_hat, y_b, y_v in zip(y_hat_rf_cv, y_hat_bl_cv, y_v_list_cv):
        fold_score = sum(
            mean_squared_error(y_v[:, i] * LABEL_MAXS[i], y_hat[:, i] * LABEL_MAXS[i]) /
            mean_squared_error(y_v[:, i] * LABEL_MAXS[i], y_b[:, i] * LABEL_MAXS[i])
            for i in COL_IX
        ) / len(COL_IX)
        total_score += fold_score

    return total_score / len(y_hat_rf_cv)

#### 3.4 Execution of the Study in a Subprocess

To ensure isolation and accurate measurement of resource utilization, each Optuna study is executed in a separate child process. The `run_study_in_subprocess` function handles the study's execution, collects metrics related to execution time, CPU usage, and memory consumption, and returns the primary results, including the optimal parameters identified.


In [22]:
def run_study_in_subprocess(sampler, sampler_name, n_trials):
    """
    Function that:
      - Runs in a child process.
      - Measures CPU, memory, and time usage of the child process only.
      - Executes the Optuna study with the specified sampler.
      - Returns a dictionary with results (best trial, timings, etc.).
    """
    gc.collect()  # Force garbage collection for safety

    proc = psutil.Process()  # Create a "Process" instance for the current child process

    # Initial measurements
    start_cpu, start_mem, start_time = proc.cpu_times(), proc.memory_info().rss, time.time()

    # Create and run the study
    study = optuna.create_study(direction="minimize", sampler=sampler)
    study.optimize(objective, n_trials=n_trials, n_jobs=-1)

    # Final measurements
    end_time, end_cpu, end_mem = time.time(), proc.cpu_times(), proc.memory_info().rss

    return {
        "sampler_name": sampler_name,
        "best_value": study.best_trial.value,
        "best_params": study.best_trial.params,
        "elapsed_time_sec": end_time - start_time,
        "cpu_time_used_sec": (end_cpu.user - start_cpu.user) + (end_cpu.system - start_cpu.system),
        "memory_used_bytes": end_mem - start_mem
    }

#### 3.5 Sampler Configuration

We define configurations for **samplers** in Optuna, including variants of `NSGAIISampler` and standard samplers like `TPESampler` and `RandomSampler`.

##### NSGAIISampler Configurations

- **Pop5_Cr90_Mut10**:
  - `population_size=5`, `crossover_prob=0.9`, `mutation_prob=0.1`
  - **Expectations**: Lower resource consumption, accepting a trade-off in quality.

- **Pop10_Cr85_Mut15**:
  - `population_size=10`, `crossover_prob=0.85`, `mutation_prob=0.15`
  - **Expectations**: Higher resource consumption, better accuracy compared to Pop5.

- **Pop10_SPXCrossover_Cr80_Mut20**:
  - `population_size=10`, `crossover=SPXCrossover()`, `crossover_prob=0.8`, `mutation_prob=0.2`
  - **Expectations**: Increased diversity with quality solutions, slightly higher computational costs.

---

##### Other Samplers

- **Configurations**: `TPESampler()`, `RandomSampler()`
  - **Expectations**:
    - **TPESampler**: Good balance between effectiveness and resource usage.
    - **RandomSampler**: Maximum efficiency but lower effectiveness.


In [25]:
def _study_worker(conn, sampler, sampler_name, n_trials):
    """Run one study in the child process and send its result dict to the parent."""
    # A named function is used instead of a lambda, which cannot be pickled
    # by multiprocessing start methods other than "fork".
    conn.send(run_study_in_subprocess(sampler, sampler_name, n_trials))
    conn.close()


if __name__ == "__main__":
    # Sampler configurations
    sampler_configurations = [
        (NSGAIISampler(population_size=5, crossover_prob=0.9, mutation_prob=0.1), "NSGAIISampler_customPop5_Cr90_Mut10"),
        (NSGAIISampler(population_size=10, crossover_prob=0.85, mutation_prob=0.15), "NSGAIISampler_customPop10_Cr85_Mut15"),
        (NSGAIISampler(population_size=10, crossover_prob=0.8, mutation_prob=0.2, crossover=SPXCrossover()), "NSGAIISampler_customPop10_SPXCrossover_Cr80_Mut20"),
        (TPESampler(), "TPESampler"),
        (RandomSampler(), "RandomSampler")
    ]

    results = []

    # Run one study per sampler configuration, each in its own child process
    for sampler, sampler_name in sampler_configurations:
        parent_conn, child_conn = mp.Pipe()
        p = mp.Process(
            target=_study_worker,
            args=(child_conn, sampler, sampler_name, n_trials)
        )
        p.start()
        child_conn.close()  # The parent only reads; the child keeps its own copy

        # Receive before joining so the child never blocks on a full pipe
        results.append(parent_conn.recv())
        p.join()

[I 2025-01-23 15:44:50,161] A new study created in memory with name: no-name-6ea4b717-aace-4abc-bb6e-8c95e33f5bf2
[I 2025-01-23 15:44:56,239] Trial 1 finished with value: 1.1589597679263397 and parameters: {'n_estimators': 75, 'max_depth': 12, 'min_samples_split': 6}. Best is trial 1 with value: 1.1589597679263397.
[I 2025-01-23 15:44:57,248] Trial 2 finished with value: 1.1596876563711382 and parameters: {'n_estimators': 225, 'max_depth': 14, 'min_samples_split': 7}. Best is trial 1 with value: 1.1589597679263397.
[I 2025-01-23 15:44:57,250] Trial 0 finished with value: 1.15633747315337 and parameters: {'n_estimators': 125, 'max_depth': 13, 'min_samples_split': 7}. Best is trial 0 with value: 1.15633747315337.
[I 2025-01-23 15:44:57,371] A new study created in memory with name: no-name-3b5011b8-0a6d-4172-8732-0f84dd08775b

In [26]:
# Save results to a DataFrame and a text file
df_results = pd.DataFrame(results)
print(df_results)
output_file = "final_results.txt"
with open(output_file, "w") as f:
    f.write("================= FINAL RESULTS =================\n" + df_results.to_string(index=False))


                                        sampler_name  best_value  \
0                NSGAIISampler_customPop5_Cr90_Mut10    1.156337   
1               NSGAIISampler_customPop10_Cr85_Mut15    1.178561   
2  NSGAIISampler_customPop10_SPXCrossover_Cr80_Mut20    1.103687   
3                                         TPESampler    1.132061   
4                                      RandomSampler    1.133910   

                                         best_params  elapsed_time_sec  \
0  {'n_estimators': 125, 'max_depth': 13, 'min_sa...          7.092334   
1  {'n_estimators': 175, 'max_depth': 7, 'min_sam...          5.853074   
2  {'n_estimators': 125, 'max_depth': 5, 'min_sam...          6.225976   
3  {'n_estimators': 200, 'max_depth': 3, 'min_sam...          7.327610   
4  {'n_estimators': 250, 'max_depth': 13, 'min_sa...          8.143943   

   cpu_time_used_sec  memory_used_bytes  
0              28.11           33148928  
1              28.58           34082816  
2              26.13

### 4. Assessment

#### 4.1 Performance Metrics

- **Relative MSE** \(**MSE_RF** / **MSE_baseline**\):  
  This metric is the ratio of the Random Forest's mean squared error to that of the baseline regressor, computed per soil parameter and averaged over the four parameters and the cross-validation folds. A value below 1 indicates that the Random Forest outperforms the baseline model.

- **CPU Time**:  
  The sum of user and system CPU time (in seconds) is recorded using `psutil` in the dedicated child process.

- **Memory Usage**:  
  The difference in the Resident Set Size (RSS) before and after execution is measured via `psutil`.
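
As a small worked example of the relative MSE, the sketch below computes it for one fold on made-up numbers. The values and the helper names (`mse`, `relative_mse`) are hypothetical, and the baseline is assumed here to predict a constant per label, which may differ from the project's `BaselineRegressor`.

```python
from statistics import fmean


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return fmean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def relative_mse(y_true_cols, y_model_cols, y_base_cols):
    """Average over labels of MSE(model) / MSE(baseline) for one fold."""
    ratios = [
        mse(t, m) / mse(t, b)
        for t, m, b in zip(y_true_cols, y_model_cols, y_base_cols)
    ]
    return fmean(ratios)


# Hypothetical toy values for two labels (one list per label).
y_true = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
y_model = [[1.0, 2.0, 4.0], [10.0, 20.0, 25.0]]
# Assumed baseline: a constant prediction per label.
y_base = [[2.0, 2.0, 2.0], [20.0, 20.0, 20.0]]

score = relative_mse(y_true, y_model, y_base)
print(score)  # below 1: the model beats the baseline on this toy fold
```

Because each label's ratio divides two MSEs computed on the same scale, multiplying both by `LABEL_MAXS[i]` as in the objective function does not change the ratio; the rescaling only restores the original units.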

#### 4.2 Results Summary

| **Sampler Name**                                | **Best Value** | **Best Parameters**                                                                 | **Elapsed Time (s)** | **CPU Time Used (s)** | **Memory Used (bytes)** |
|-------------------------------------------------|----------------|------------------------------------------------------------------------------------|-----------------------|-----------------------|--------------------------|
| NSGAIISampler_customPop5_Cr90_Mut10             | 0.858477       | {'n_estimators': 100, 'max_depth': 15, 'min_samples_split': 3}                     | 2016.477158           | 171903.91            | 76177408                |
| NSGAIISampler_customPop10_Cr85_Mut15            | 0.859063       | {'n_estimators': 125, 'max_depth': 14, 'min_samples_split': 6}                     | 1580.563911           | 134562.86            | 72318976                |
| NSGAIISampler_customPop10_SPXCrossover_Cr80_Mut20 | 0.853163       | {'n_estimators': 300, 'max_depth': 15, 'min_samples_split': 2}                     | 2048.171755           | 167423.77            | 100864000               |
| TPESampler                                      | 0.854925       | {'n_estimators': 300, 'max_depth': 15, 'min_samples_split': 2}                     | 2292.659146           | 196769.72            | 102473728               |
| RandomSampler                                   | 0.859254       | {'n_estimators': 150, 'max_depth': 15, 'min_samples_split': 4}                     | 1705.460728           | 144133.91            | 58699776                |

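The results indicate consistent effectiveness (relative MSE) across all methods: best values range from 0.853163 to 0.859254, and these small differences are likely due to the limited number of trials allowed by the available computational resources. Note that these values differ from the cell outputs in Section 3, which were produced with `DEBUG = True` (3 trials).  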

In terms of efficiency, the **NSGA-II** configurations differ noticeably from one another. **NSGAIISampler_customPop10_SPXCrossover_Cr80_Mut20** used about 167,424 seconds of CPU time and roughly 101 MB of memory, making it one of the more resource-intensive configurations, while also reaching the lowest relative MSE (0.853163). **NSGAIISampler_customPop10_Cr85_Mut15** had the lowest CPU time (about 134,563 seconds) and elapsed time of all samplers, with a slightly higher relative MSE (0.859063), so it offers a cheaper trade-off between computational cost and accuracy than the SPX crossover variant.  

The **TPE Sampler** delivered a competitive relative MSE (0.854925) but had the highest computational cost, with about 196,770 seconds of CPU time and roughly 102 MB of memory, making it less efficient than both the NSGA-II configurations and RandomSampler.
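
The comparison above can be made easier to read with two derived quantities: memory in megabytes, and the ratio of CPU time to elapsed time, which roughly indicates the average number of CPU cores kept busy. The sketch below uses the values copied from the results table; the dictionary layout and the names `results` and `derived` are illustrative choices only.

```python
# (best_value, elapsed_time_sec, cpu_time_used_sec, memory_used_bytes),
# copied from the results table in Section 4.2.
results = {
    "NSGAIISampler_customPop5_Cr90_Mut10": (0.858477, 2016.477158, 171903.91, 76177408),
    "NSGAIISampler_customPop10_Cr85_Mut15": (0.859063, 1580.563911, 134562.86, 72318976),
    "NSGAIISampler_customPop10_SPXCrossover_Cr80_Mut20": (0.853163, 2048.171755, 167423.77, 100864000),
    "TPESampler": (0.854925, 2292.659146, 196769.72, 102473728),
    "RandomSampler": (0.859254, 1705.460728, 144133.91, 58699776),
}

derived = {
    name: {
        "best_value": best,
        "cpu_per_wall": cpu / elapsed,  # approx. average number of busy cores
        "memory_mb": mem / 1e6,         # decimal megabytes
    }
    for name, (best, elapsed, cpu, mem) in results.items()
}

for name, d in derived.items():
    print(f"{name:52s} {d['best_value']:.6f} {d['cpu_per_wall']:6.1f} {d['memory_mb']:7.1f}")
```

The CPU-to-elapsed ratio is similar for all five samplers, which suggests that the differences in CPU time mostly reflect how long each study ran rather than how well it used the available cores.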

### 5. Conclusions

This study compared multiple Optuna samplers, focusing on **NSGA-II**, for tuning a Random Forest model that predicts soil parameters from hyperspectral data. Relative MSE was comparable across samplers (best values between 0.853 and 0.859), and the **NSGA-II** configurations balanced accuracy and resource usage effectively. **TPE** reached a similar relative MSE at the highest CPU and memory cost, while **RandomSampler** used the least memory but obtained the highest (worst) best value.  

Future work could explore larger trial budgets, alternative machine learning models, or hybrid sampling strategies to further enhance optimization efficiency and accuracy.

### 6. Appendix

#### 6.1 Individual Contributions

- **Alessio Pardini**

#### 6.2 Relationship with AIF 24-25

In this project, we analyzed the performance of Optuna's genetic sampler, NSGAIISampler, connecting it to the topic of genetic algorithms covered in the AIF 24-25 course, specifically applied here to the context of model selection.

Theoretically, a genetic algorithm can be viewed as a variant of stochastic beam search inspired by the metaphor of natural selection. It operates on a population of individuals (in this case, candidate hyperparameter configurations), evaluates their fitness (here, the relative MSE of a Random Forest model), and selects the most promising individuals based on that fitness. NSGA-II performs selection using non-dominated sorting to group the population into Pareto fronts, prioritizing individuals from the best front. With a single objective, as in this project, one configuration dominates another exactly when its objective value is strictly lower, so the fronts reduce to a ranking by relative MSE. Within the same front, the crowding distance is used to maintain diversity and avoid the population collapsing into one region of the search space.
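
The following sketch illustrates non-dominated sorting on made-up objective values, first with one objective and then with two. It is a simple quadratic-time illustration written for this report, not Optuna's implementation; the function names and the sample values are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and better on at least one (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def non_dominated_sort(points):
    """Group the indices of `points` into successive Pareto fronts."""
    remaining = set(range(len(points)))
    fronts = []
    while remaining:
        # A point belongs to the current front if no remaining point dominates it.
        front = sorted(
            i for i in remaining
            if not any(dominates(points[j], points[i]) for j in remaining if j != i)
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts


# Hypothetical single-objective scores (relative MSE, lower is better).
single = [(0.86,), (0.85,), (0.86,), (0.90,)]
single_fronts = non_dominated_sort(single)
print(single_fronts)  # fronts follow the ranking by value; ties share a front

# Hypothetical two-objective case (relative MSE, CPU hours) for comparison.
double = [(0.85, 3.0), (0.86, 2.0), (0.90, 4.0)]
double_fronts = non_dominated_sort(double)
print(double_fronts)  # the first two points trade off and share the first front
```

In the single-objective case each front contains only configurations with identical scores, which is why non-dominated sorting adds nothing beyond an ordinary ranking in this project's setting.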

Crossover operators are then applied to combine configurations and generate new ones. In our tests, we used Optuna's default crossover operator and also evaluated `SPXCrossover`. Finally, mutation introduces random changes to hyperparameters. The original NSGA-II formulation employs polynomial mutation, which introduces controlled random variations while keeping generated values within valid bounds. In Optuna, the `mutation_prob` parameter sets the probability of mutating each parameter when a new individual is created.
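
To make the selection, crossover, and mutation steps concrete, here is a minimal single-objective genetic algorithm over the search space of Section 3.3. It is a sketch under several assumptions and not Optuna's `NSGAIISampler`: parents are drawn uniformly at random, crossover is uniform, mutation resamples a parameter from its range, survival is elitist, and `toy_fitness` is a hypothetical stand-in for the relative MSE.

```python
import random

# Search space from Section 3.3: name -> (low, high, step), bounds inclusive.
SPACE = {
    "n_estimators": (50, 300, 25),
    "max_depth": (3, 15, 1),
    "min_samples_split": (2, 10, 1),
}


def sample_param(name, rng):
    """Draw one admissible value for a hyperparameter."""
    low, high, step = SPACE[name]
    return rng.randrange(low, high + 1, step)


def random_individual(rng):
    return {name: sample_param(name, rng) for name in SPACE}


def toy_fitness(ind):
    """Hypothetical stand-in for the relative MSE (lower is better)."""
    return (
        abs(ind["n_estimators"] - 300) / 250
        + abs(ind["max_depth"] - 15) / 12
        + abs(ind["min_samples_split"] - 2) / 8
    )


def uniform_crossover(p1, p2, rng):
    """Each hyperparameter is inherited from one of the two parents at random."""
    return {name: rng.choice((p1[name], p2[name])) for name in SPACE}


def mutate(ind, mutation_prob, rng):
    """Resample each hyperparameter independently with probability mutation_prob."""
    return {
        name: sample_param(name, rng) if rng.random() < mutation_prob else value
        for name, value in ind.items()
    }


def evolve(population_size=10, generations=20, crossover_prob=0.85,
           mutation_prob=0.15, seed=0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(population_size)]
    history = [min(map(toy_fitness, population))]
    for _ in range(generations):
        children = []
        while len(children) < population_size:
            p1, p2 = rng.sample(population, 2)
            if rng.random() < crossover_prob:
                child = uniform_crossover(p1, p2, rng)
            else:
                child = dict(p1)
            children.append(mutate(child, mutation_prob, rng))
        # Elitist survival: keep the best individuals among parents and children.
        population = sorted(population + children, key=toy_fitness)[:population_size]
        history.append(toy_fitness(population[0]))
    return population[0], history


best, history = evolve()
print(best, history[-1])
```

Because survivors are chosen from parents and children together, the best fitness can never get worse from one generation to the next, which mirrors the elitism that NSGA-II applies through its front-based selection.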





