# Hyperparameter Optimization of LightGBM Using Hill Climbing and Genetic Algorithms

## Introduction

In this notebook, we explore hyperparameter optimization for a **LightGBM** classifier using two optimization strategies: **Hill Climbing** and **Genetic Algorithms**. LightGBM is a gradient boosting framework built on decision trees. It grows trees leaf-wise and uses histogram-based split finding, which keeps training fast and memory-efficient on large datasets, and it supports distributed training.

Our objective is to find a hyperparameter configuration that maximizes cross-validated classification accuracy on the **Breast Cancer Wisconsin** dataset, and to check whether automated search improves on default parameters or manual tuning.

## Dataset

We use the **Breast Cancer Wisconsin dataset**, a standard benchmark for binary classification. It contains **569 samples** (212 malignant, 357 benign), each described by **30 numerical features** computed from digitized images of cell nuclei.

## Hyperparameter Space

We define a comprehensive hyperparameter space for the LightGBM classifier, focusing on parameters that significantly impact model performance:

- **`num_leaves`**: Maximum number of leaves per tree. Because LightGBM grows trees leaf-wise, this is the main control on model complexity; larger values fit more complex models and overfit more easily.
- **`max_depth`**: Maximum depth of each tree. Limiting depth constrains model complexity and helps prevent overfitting.
- **`learning_rate`**: Shrinkage factor applied to each tree's contribution. Smaller values make each boosting step more conservative and usually require more estimators.
- **`n_estimators`**: Number of boosting iterations (trees). More estimators can improve performance but increase training time.
- **`min_child_samples`**: Minimum number of samples required in a leaf. Larger values reduce overfitting by preventing very small leaves.
- **`subsample`**: Fraction of training rows sampled for bagging. Row sampling reduces overfitting; in LightGBM it takes effect only when `subsample_freq` is greater than 0.
- **`colsample_bytree`**: Fraction of features sampled when constructing each tree. The column-wise counterpart of `subsample`.
- **`reg_alpha`**: L1 regularization term on leaf weights. Pushes small leaf weights toward zero.
- **`reg_lambda`**: L2 regularization term on leaf weights. Shrinks leaf weights and helps prevent overfitting.
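
To make the idea of a search space concrete, the sketch below draws one random configuration over the parameters listed above using only the standard library. The ranges in `SPACE` and the helper `sample_config` are hypothetical and chosen for illustration; they are not the ranges used later in this notebook. The cap on `num_leaves` reflects the fact that a binary tree of depth `d` has at most `2**d` leaves.

```python
import random

# Hypothetical ranges, for illustration only (not taken from this notebook).
SPACE = {
    "num_leaves": ("int", 8, 128),
    "max_depth": ("int", 3, 10),
    "learning_rate": ("float", 0.01, 0.3),
    "n_estimators": ("choice", [50, 100, 200, 400]),
    "min_child_samples": ("int", 5, 50),
    "subsample": ("float", 0.5, 1.0),
    "colsample_bytree": ("float", 0.5, 1.0),
    "reg_alpha": ("float", 0.0, 1.0),
    "reg_lambda": ("float", 0.0, 1.0),
}


def sample_config(space, rng):
    """Draw one configuration uniformly at random from `space`."""
    config = {}
    for name, spec in space.items():
        kind = spec[0]
        if kind == "int":
            config[name] = rng.randint(spec[1], spec[2])  # inclusive bounds
        elif kind == "float":
            config[name] = rng.uniform(spec[1], spec[2])
        else:  # "choice": pick one of the listed values
            config[name] = rng.choice(spec[1])
    # A tree of depth d cannot have more than 2**d leaves.
    config["num_leaves"] = min(config["num_leaves"], 2 ** config["max_depth"])
    return config


config = sample_config(SPACE, random.Random(0))
print(config)
```

A seeded `random.Random` instance keeps the draw reproducible without touching the global random state.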

## Optimization Methods

### Hill Climbing Optimizer

- **Description**: Starts with a random hyperparameter configuration and iteratively makes local changes to find better configurations.
- **Strengths**: Simple to implement and can quickly find local optima.
- **Challenges**: May get stuck in local optima and miss the global optimum.
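
The loop described above can be sketched in a few lines. This is a minimal greedy variant on a toy objective, not the `HillClimbingOptimizer` used in this notebook, whose internals (warm start, random restarts, fidelity handling) are not shown here. The helper names `hill_climb`, `toy_score`, and `grid_neighbors` are hypothetical; the two value grids are the ordinal values used for `max_depth` and `n_estimators` in the code cell further down.

```python
DEPTHS = [3, 4, 5, 6, 7]
N_ESTIMATORS = [50, 100, 150, 200]


def toy_score(cfg):
    """Toy objective with a single peak at max_depth=5, n_estimators=150."""
    depth, n_est = cfg
    return -((depth - 5) ** 2) - ((n_est - 150) / 50) ** 2


def grid_neighbors(cfg):
    """Configurations reachable by moving one parameter one step on its grid."""
    depth, n_est = cfg
    i, j = DEPTHS.index(depth), N_ESTIMATORS.index(n_est)
    moves = []
    if i > 0:
        moves.append((DEPTHS[i - 1], n_est))
    if i < len(DEPTHS) - 1:
        moves.append((DEPTHS[i + 1], n_est))
    if j > 0:
        moves.append((depth, N_ESTIMATORS[j - 1]))
    if j < len(N_ESTIMATORS) - 1:
        moves.append((depth, N_ESTIMATORS[j + 1]))
    return moves


def hill_climb(score, start, neighbors, max_steps=100):
    """Move to the best neighbor until no neighbor improves the score."""
    current, current_score = start, score(start)
    for _ in range(max_steps):
        best_neighbor = max(neighbors(current), key=score)
        best_score = score(best_neighbor)
        if best_score <= current_score:
            break  # no improving neighbor: a local optimum
        current, current_score = best_neighbor, best_score
    return current, current_score


best, best_score = hill_climb(toy_score, (3, 50), grid_neighbors)
print(best, best_score)
```

On this single-peaked toy objective the climb reaches the peak from any start. On a real validation-accuracy surface with several peaks, the same loop stops at whichever peak is nearest, which is why restarts from new random points are commonly added.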

### Genetic Algorithm Optimizer

- **Description**: Mimics the process of natural selection by maintaining a population of solutions that evolve over time using selection, crossover, and mutation.
- **Strengths**: Explores a wider region of the search space and is less likely to get trapped in a local optimum.
- **Challenges**: Computationally more intensive due to the population-based approach.
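
The selection, crossover, and mutation steps can be sketched as follows. This is a generic illustration on a toy objective, assuming tournament selection, uniform crossover, and per-gene resampling as mutation; it is not the implementation behind `GeneticAlgorithmOptimizer`. The function names are hypothetical, and the default values simply mirror the settings passed to the optimizer later in this notebook.

```python
import random


def sample_gene(name, rng):
    """Draw a fresh value for one hyperparameter (toy two-parameter space)."""
    if name == "max_depth":
        return rng.choice([3, 4, 5, 6, 7])
    return rng.uniform(0.01, 0.2)  # learning_rate


def sample_config(rng):
    return {name: sample_gene(name, rng) for name in ("max_depth", "learning_rate")}


def toy_score(cfg):
    """Toy objective peaking at max_depth=5, learning_rate=0.1."""
    return -((cfg["max_depth"] - 5) ** 2) - 100 * (cfg["learning_rate"] - 0.1) ** 2


def genetic_search(score, rng, population_size=20, generations=15,
                   crossover_rate=0.8, mutation_rate=0.1,
                   elitism=1, tournament_size=3):
    population = [sample_config(rng) for _ in range(population_size)]
    history = []  # best score seen in each generation
    for _ in range(generations):
        ranked = sorted(population, key=score, reverse=True)
        history.append(score(ranked[0]))
        next_population = ranked[:elitism]  # elitism: keep the top configs as-is
        while len(next_population) < population_size:
            # Tournament selection: best of a small random subset.
            parent_a = max(rng.sample(population, tournament_size), key=score)
            parent_b = max(rng.sample(population, tournament_size), key=score)
            if rng.random() < crossover_rate:
                # Uniform crossover: each gene comes from either parent.
                child = {k: rng.choice((parent_a[k], parent_b[k])) for k in parent_a}
            else:
                child = dict(parent_a)
            # Mutation: resample each gene with a small probability.
            child = {k: (sample_gene(k, rng) if rng.random() < mutation_rate else v)
                     for k, v in child.items()}
            next_population.append(child)
        population = next_population
    best = max(population, key=score)
    history.append(score(best))
    return best, history


best, history = genetic_search(toy_score, random.Random(0))
print(best, history[-1])
```

Because the best configuration is always copied into the next generation, the recorded best score never decreases from one generation to the next.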

## Evaluation Function

We define an evaluation function that:

- Initializes the classifier with the given hyperparameters, passing them directly via `**config`.
- Uses **Stratified K-Fold Cross-Validation** to evaluate performance, so that each fold has a similar class distribution.
- Takes a fidelity parameter equal to the 1-based fold index: fidelity `k` trains and scores on fold `k` only. This enables multi-fidelity optimization, where the optimizer trades evaluation cost against the reliability of the score by choosing how many folds to spend on a configuration.
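
One way a fold-indexed evaluation function can be used for multi-fidelity search is sketched below: score a configuration fold by fold and stop early once its running mean falls below a threshold. The pruning rule, the helper `evaluate_up_to`, and the per-fold accuracies in `FAKE_FOLD_SCORES` are all assumptions made for illustration; the optimizers in this notebook may schedule fidelities differently.

```python
from statistics import mean

# Made-up per-fold accuracies standing in for real model training.
FAKE_FOLD_SCORES = {
    "strong": [0.96, 0.95, 0.97, 0.96, 0.98],
    "weak": [0.80, 0.82, 0.81, 0.79, 0.80],
}


def evaluate_fold(config, fidelity):
    """Return the score of `config` on the fold with 1-based index `fidelity`."""
    return FAKE_FOLD_SCORES[config][fidelity - 1]


def evaluate_up_to(evaluate, config, max_fidelity, prune_below=None):
    """Average fold scores, stopping early if the running mean drops too low.

    Returns the mean score and the number of folds actually evaluated.
    """
    scores = []
    for fidelity in range(1, max_fidelity + 1):
        scores.append(evaluate(config, fidelity))
        if prune_below is not None and mean(scores) < prune_below:
            break  # not promising: skip the remaining folds
    return mean(scores), len(scores)


strong_mean, strong_folds = evaluate_up_to(evaluate_fold, "strong", 5, prune_below=0.9)
weak_mean, weak_folds = evaluate_up_to(evaluate_fold, "weak", 5, prune_below=0.9)
print(strong_mean, strong_folds)
print(weak_mean, weak_folds)
```

Here the weak configuration costs one fold instead of five, while the strong one receives the full five-fold estimate.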

## Goals

- **Optimize Hyperparameters**: Find the hyperparameter configuration that yields the highest classification accuracy.
- **Compare Optimization Strategies**: Evaluate which optimization method performs better in terms of finding optimal parameters and computational efficiency.
- **Efficient Resource Utilization**: Conduct the optimization within a predefined computational budget.
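
A computational budget can be enforced by counting evaluation calls. The sketch below wraps an evaluation function and raises once the budget is spent, assuming one budget unit per call; how this notebook's optimizers actually account for `budget` is not specified here. `BudgetExceeded`, `BudgetedEvaluator`, and `toy_evaluate` are hypothetical names, and a plain random search stands in for the optimizer.

```python
import random


class BudgetExceeded(Exception):
    """Hypothetical exception raised once the evaluation budget is spent."""


class BudgetedEvaluator:
    """Wrap an evaluation function and charge one unit per call (an assumption)."""

    def __init__(self, evaluate, budget):
        self.evaluate = evaluate
        self.remaining = budget

    def __call__(self, config, fidelity):
        if self.remaining <= 0:
            raise BudgetExceeded("evaluation budget exhausted")
        self.remaining -= 1
        return self.evaluate(config, fidelity)


def toy_evaluate(config, fidelity):
    """Toy score peaking at learning_rate=0.1; ignores the fidelity argument."""
    return 1.0 - abs(config["learning_rate"] - 0.1)


rng = random.Random(0)
evaluator = BudgetedEvaluator(toy_evaluate, budget=50)
best_config, best_score, calls = None, float("-inf"), 0
try:
    while True:  # random search until the budget runs out
        candidate = {"learning_rate": rng.uniform(0.01, 0.2)}
        score = evaluator(candidate, fidelity=1)
        calls += 1
        if score > best_score:
            best_config, best_score = candidate, score
except BudgetExceeded:
    pass

print(calls, best_config, best_score)
```

Giving both optimizers the same budget makes their results comparable in terms of cost as well as accuracy.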

## Expected Outcomes

- Identification of hyperparameter configurations that improve model performance over default settings.
- Insights into the most influential hyperparameters for LightGBM on this dataset.
- Comparative analysis of the effectiveness of Hill Climbing and Genetic Algorithms in hyperparameter optimization.



In [None]:
cd ..

In [None]:
import logging
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import GradientBoostingClassifier
from typing import Dict, Any, Callable, List
import numpy as np
import random

# Import the required classes from the focus_opt package
from focus_opt.hp_space import HyperParameterSpace, CategoricalHyperParameter, OrdinalHyperParameter, ContinuousHyperParameter
from focus_opt.config_candidate import ConfigCandidate
from focus_opt.helpers import OutOfBudgetError, SessionContext
from focus_opt.optimizers import BaseOptimizer, HillClimbingOptimizer, GeneticAlgorithmOptimizer

# Set up logging
logging.basicConfig(level=logging.INFO)

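# Define the hyperparameter space for scikit-learn's GradientBoostingClassifier.
# Its parameter names (loss, min_samples_split, min_samples_leaf) differ from
# the LightGBM parameter names listed in the introduction.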
hp_space = HyperParameterSpace("Gradient Boosting HP Space")
hp_space.add_hp(CategoricalHyperParameter(name="loss", values=["log_loss", "exponential"]))
hp_space.add_hp(OrdinalHyperParameter(name="n_estimators", values=[50, 100, 150, 200]))
hp_space.add_hp(OrdinalHyperParameter(name="max_depth", values=[3, 4, 5, 6, 7]))
hp_space.add_hp(ContinuousHyperParameter(name="learning_rate", min_value=0.01, max_value=0.2))
hp_space.add_hp(ContinuousHyperParameter(name="subsample", min_value=0.5, max_value=1.0))
hp_space.add_hp(ContinuousHyperParameter(name="min_samples_split", min_value=2, max_value=20, is_int=True))
hp_space.add_hp(ContinuousHyperParameter(name="min_samples_leaf", min_value=1, max_value=20, is_int=True))

# Load the Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Evaluation function: each call trains and scores on a single CV fold
def gbm_evaluation(config: Dict[str, Any], fidelity: int) -> float:
    """
    Evaluation function for a Gradient Boosting Classifier with cross-validation.

    Args:
        config (Dict[str, Any]): Configuration of hyperparameters.
        fidelity (int): Fidelity level (index of the CV fold).

    Returns:
        float: Accuracy for the specified CV fold.
    """
    logging.info(f"Evaluating config: {config} at fidelity level: {fidelity}")

    # Initialize the Gradient Boosting Classifier with the config as **kwargs.
    # A fixed random_state keeps scores reproducible when subsample < 1.0.
    clf = GradientBoostingClassifier(**config, random_state=42)

    # Build the 5-fold stratified splitter; the fixed seed yields the same folds on every call
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

    # Get the train and test indices for the specified fold
    for fold_index, (train_index, test_index) in enumerate(skf.split(X, y)):
        if fold_index + 1 == fidelity:
            X_train, X_test = X[train_index], X[test_index]
            y_train, y_test = y[train_index], y[test_index]
            clf.fit(X_train, y_train)
            score = clf.score(X_test, y_test)
            logging.info(f"Score for config {config} at fold {fidelity}: {score}")
            return score

    raise ValueError(f"Invalid fidelity level: {fidelity}")

# Instantiate the HillClimbingOptimizer
hill_climbing_optimizer = HillClimbingOptimizer(
    hp_space=hp_space,
    evaluation_function=gbm_evaluation,
    max_fidelity=5,
    maximize=True,
    log_results=True,
    warm_start=20,
    random_restarts=5,
)
# Run the Hill Climbing optimization
best_candidate_hill_climbing = hill_climbing_optimizer.optimize(budget=500)
print(f"Best candidate from Hill Climbing: {best_candidate_hill_climbing.config} with score: {best_candidate_hill_climbing.evaluation_score}")

# Instantiate the GeneticAlgorithmOptimizer
ga_optimizer = GeneticAlgorithmOptimizer(
    hp_space=hp_space,
    evaluation_function=gbm_evaluation,
    max_fidelity=5,
    maximize=True,
    population_size=20,
    crossover_rate=0.8,
    mutation_rate=0.1,
    elitism=1,
    tournament_size=3,
    min_population_size=5,
    log_results=True
)

# Run the Genetic Algorithm optimization
best_candidate_ga = ga_optimizer.optimize(budget=500)
print(f"Best candidate from Genetic Algorithm: {best_candidate_ga.config} with score: {best_candidate_ga.evaluation_score}")


In [None]:
print(f"Best candidate from Hill Climbing: {best_candidate_hill_climbing.config} with score: {best_candidate_hill_climbing.evaluation_score}")

In [None]:
print(f"Best candidate from Genetic Algorithm: {best_candidate_ga.config} with score: {best_candidate_ga.evaluation_score}")
