# Hyperparameter tuning by Aditya


## Section 1: Introduction to Hyperparameter Tuning

### 1.1 Understanding Hyperparameters
Hyperparameters are external configuration settings for a model that cannot be learned from the data. These include learning rates, batch sizes, and regularization strengths.

### 1.2 Importance of Hyperparameter Tuning
Hyperparameter values significantly affect a model's performance. Hyperparameter tuning aims to find the combination that gives the best accuracy and generalization on unseen data.

## What are batch sizes?

Batch size is a hyperparameter that specifies the number of data samples utilized in one iteration (or training step) to update the model's weights. It is an important concept in the training of neural networks and other machine learning models.

*   **Batch:** A set of data samples processed together in one iteration. The model weights are updated after processing each batch.
*   **Batch Size:** The number of data samples in a batch. It is a hyperparameter that can be adjusted based on computational resources, model architecture, and the size of the training dataset.
*   **Typical values:** Batch sizes are often chosen as powers of 2, such as 32, 64, or 128. The choice can also depend on the nature of the data. For example, in tasks where data points are correlated or sequential (as in time series), a smaller batch size is sometimes preferred.
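
To make the link between batch size and the number of weight updates concrete, here is a minimal sketch in plain Python. The helper name `iter_batches` and the toy dataset of 100 integers are illustrative assumptions, not part of any library.

```python
import math


def iter_batches(samples, batch_size):
    """Yield consecutive slices of `samples` with at most `batch_size` items."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


# Toy dataset of 100 samples (integers standing in for real data).
dataset = list(range(100))
batch_size = 32

batches = list(iter_batches(dataset, batch_size))

# One weight update per batch, so one epoch has ceil(N / batch_size) updates.
updates_per_epoch = math.ceil(len(dataset) / batch_size)

print(len(batches), updates_per_epoch)   # 4 4
print([len(b) for b in batches])         # [32, 32, 32, 4]
```

Note that the last batch is smaller whenever the dataset size is not a multiple of the batch size.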

## What are regularization strengths?

Regularization is a technique used to prevent overfitting in machine learning models. The regularization strength is a hyperparameter that controls how much regularization is applied during training. It scales the penalty term added to the loss function, which balances fitting the training data well against keeping the model simple enough to generalize.

- Two common types of regularization are L1 regularization (Lasso) and L2 regularization (Ridge). The regularization strength, usually denoted as "alpha" or "lambda," determines the magnitude of the regularization term added to the loss function.

- Hyperparameter tuning techniques, such as grid search or randomized search, are often employed to find the best regularization strength for a given problem.
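
As a rough illustration of how the strength scales the penalty, the sketch below adds an L1 or L2 term to a mean squared error. The function names, the toy data, and the use of `alpha` for the strength are assumptions made for this example only.

```python
def mse(weights, X, y):
    """Mean squared error of a linear model without an intercept."""
    errors = [sum(w * x for w, x in zip(weights, row)) - target
              for row, target in zip(X, y)]
    return sum(e * e for e in errors) / len(errors)


def l1_penalty(weights):
    """Lasso-style penalty: sum of absolute weights."""
    return sum(abs(w) for w in weights)


def l2_penalty(weights):
    """Ridge-style penalty: sum of squared weights."""
    return sum(w * w for w in weights)


def regularized_loss(weights, X, y, alpha, penalty):
    """Data-fit term plus alpha times the chosen penalty."""
    return mse(weights, X, y) + alpha * penalty(weights)


X = [[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]]
y = [5.0, 4.0, 9.0]
weights = [1.0, 2.0]   # fits the toy data exactly, so the MSE is 0

for alpha in (0.0, 0.1, 1.0):
    print(alpha,
          regularized_loss(weights, X, y, alpha, l1_penalty),
          regularized_loss(weights, X, y, alpha, l2_penalty))
```

With `alpha = 0` the loss is just the data-fit term; as `alpha` grows, the same weights are penalized more heavily, which pushes training toward smaller weights.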


## Must not forget the estimators

Estimators are a fundamental concept in the scikit-learn library for machine learning in Python. In scikit-learn, an estimator is any object that learns from data. Examples include classifiers, regressors, and clustering algorithms. Every estimator exposes a _fit_ method to train the model, and estimators that make predictions (such as classifiers and regressors) also expose a _predict_ method.

- Hyperparameters are external configurations for an estimator that cannot be learned from the training data. They affect the model's learning process and its ability to generalize to new data.

- Identifying the optimal values for hyperparameters is crucial for achieving the best performance of an estimator. The process of hyperparameter tuning involves systematically searching through different hyperparameter combinations to find the ones that result in the best model.

- Common techniques include grid search, random search, and more advanced methods like Bayesian optimization.
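
The sketch below shows the estimator interface on the iris dataset. `LogisticRegression` and the values chosen for `C` and `max_iter` are arbitrary picks for illustration; any scikit-learn classifier would follow the same pattern.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Hyperparameters are passed to the constructor, before any data is seen.
clf = LogisticRegression(C=1.0, max_iter=1000)

# get_params() returns the estimator's hyperparameters as a dictionary.
print(clf.get_params()["C"])   # 1.0

# set_params() changes hyperparameters by name, which is the interface
# that tuning tools rely on to try different values.
clf.set_params(C=0.1)

# fit() learns the model parameters (here, the coefficients) from the data.
clf.fit(X, y)

# predict() applies the learned parameters to samples.
predictions = clf.predict(X[:5])
print(predictions.shape)       # (5,)
```

The distinction to keep in mind: `C` is a hyperparameter chosen by us, while `clf.coef_` holds parameters learned by `fit`.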

## In deep learning we also have a few different types of hyperparameters

- Number of epochs
- Dropout rate (Fraction of randomly selected neurons to be ignored during training)
- Activation functions
- Weight initialization (there are different methods for choosing the initial values of the weights)
- Optimizer
- Learning rate scheduler
- Kernel size
- Strides

## Kernel size

In CNNs, a kernel (or filter) is a small matrix applied to an input image to extract features. The kernel size refers to the dimensions of this matrix.

- For example, a kernel size of (3,3) implies a 3x3 matrix. During convolution, the kernel slides over the input image in both dimensions, and at each position, it performs an element-wise multiplication and sums up the results to produce a feature map.

- **Strides:** The stride defines the step size at which the kernel moves across the input image during convolution. A larger stride produces a smaller feature map.
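
The effect of kernel size and stride on the feature map can be sketched in plain Python. The `conv2d` helper below is a hypothetical, unpadded implementation written for this example; like most deep learning frameworks, it does not flip the kernel. For an input of size n, kernel size k, and stride s, the output size is floor((n - k) / s) + 1.

```python
def conv2d(image, kernel, stride=1):
    """Unpadded 2D convolution (no kernel flip) on a list-of-lists image."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (len(image) - k_h) // stride + 1
    out_w = (len(image[0]) - k_w) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            top, left = i * stride, j * stride
            # Element-wise multiply the window with the kernel and sum.
            total = sum(image[top + a][left + b] * kernel[a][b]
                        for a in range(k_h) for b in range(k_w))
            row.append(total)
        output.append(row)
    return output


# 5x5 input holding 0..24, and a 3x3 kernel of ones (sums each window).
image = [[r * 5 + c for c in range(5)] for r in range(5)]
kernel = [[1] * 3 for _ in range(3)]

out_s1 = conv2d(image, kernel, stride=1)
out_s2 = conv2d(image, kernel, stride=2)

print(len(out_s1), len(out_s1[0]))   # 3 3
print(out_s2)                        # [[54, 72], [144, 162]]
```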





# Learning about different tools for hyperparameter tuning

## Grid Search

Grid search is a hyperparameter tuning technique used in machine learning to systematically search through a predefined set of hyperparameter values for a given model. The goal is to find the combination of hyperparameter values that results in the best model performance.

Here's a detailed explanation of grid search and its parameters:

1. **Hyperparameters:**
   - Hyperparameters are parameters that are not learned from the data but are set prior to the training process.
   - Examples include learning rate, regularization strength, the number of hidden layers, and the number of nodes in each layer for neural networks.

2. **Grid Search:**
   - Grid search involves defining a grid of hyperparameter values to explore.
   - For each combination of hyperparameters in the grid, a model is trained and evaluated.
   - The combination that produces the best performance is selected as the optimal set of hyperparameters.

3. **Parameters of Grid Search:**
   - **`param_grid`:** A dictionary, or a list of dictionaries, that defines the hyperparameter grid to search. Each key is a hyperparameter name, and each value is a list of candidate values for that hyperparameter.

   - **`scoring`:** The metric used to evaluate the model for each combination of hyperparameters, such as accuracy, precision, recall, or F1 score.

   - **`cv` (cross-validation):** Grid search uses cross-validation to assess each combination more reliably. When given an integer, `cv` sets the number of folds.

   - **`refit`:** When set to True (the default), the estimator is retrained with the best hyperparameters on all of the data passed to `fit`, and the result is exposed as `best_estimator_`.

   - **`verbose`:** Controls how much progress information is printed during the search. Higher values print more detail.

   - **n_jobs:** Specifies the number of parallel jobs to run during the grid search. Setting it to -1 uses all available processors.
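
Because every combination in the grid is trained once per fold, the cost of grid search grows quickly. The sketch below enumerates a grid with `itertools.product` to count the work involved; the grid values are illustrative choices for a random forest, and this is only an approximation of what a grid search tool does internally.

```python
from itertools import product

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}
cv = 5  # number of cross-validation folds

# Cartesian product of all value lists, one dict per combination.
names = list(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*param_grid.values())]

# Each combination is fitted once per fold.
n_fits = len(combinations) * cv

print(len(combinations), n_fits)   # 81 405
print(combinations[0])
```

Adding a fifth hyperparameter with three values would triple the count to 243 combinations, which is why random search becomes attractive for larger spaces.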


In [11]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Create a Random Forest classifier
rf_classifier = RandomForestClassifier()

# Perform grid search with cross-validation
grid_search = GridSearchCV(estimator=rf_classifier, param_grid=param_grid, scoring='accuracy', cv=5)
grid_search.fit(X_train, y_train)

# Print the best hyperparameters
print("Best Hyperparameters:", grid_search.best_params_)

# Evaluate the model on the test set
best_model = grid_search.best_estimator_
accuracy = best_model.score(X_test, y_test)
print("Test Accuracy:", accuracy)

Best Hyperparameters: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 50}
Test Accuracy: 1.0


## Random search

Random search is another hyperparameter tuning technique, similar to grid search, used to find the optimal combination of hyperparameters for a machine learning model. Instead of searching through a predefined grid of hyperparameter values, random search samples hyperparameter values randomly from specified distributions. This approach can be more efficient than grid search when the search space is large.



*   Random search involves defining a range or distribution for each hyperparameter.
*   For each iteration, random values are sampled from these distributions for each hyperparameter.
*   The model is trained and evaluated with these random hyperparameter values.
*   The process is repeated for a specified number of iterations or until computational resources are exhausted.
*   The combination that produces the best performance is selected as the optimal set of hyperparameters.
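
The steps above can be sketched without any library. Here `objective` is a made-up stand-in for a cross-validated score, and the hyperparameter names and ranges are assumptions chosen only to show both a continuous and an integer distribution.

```python
import random


def objective(params):
    """Toy stand-in for a cross-validated score (higher is better)."""
    return (-(params["learning_rate"] - 0.1) ** 2
            - 0.01 * abs(params["max_depth"] - 6))


def sample_params(rng):
    """Draw one random value for each hyperparameter."""
    return {
        # Log-uniform draw between 1e-3 and 1.
        "learning_rate": 10 ** rng.uniform(-3, 0),
        # Uniform integer draw, both ends inclusive.
        "max_depth": rng.randint(2, 12),
    }


rng = random.Random(42)   # fixed seed so the run is repeatable
n_iter = 20               # evaluation budget

trials = []
for _ in range(n_iter):
    params = sample_params(rng)
    trials.append((objective(params), params))

# Keep the combination with the highest score.
best_score, best_params = max(trials, key=lambda t: t[0])
print(best_score, best_params)
```

Unlike grid search, the budget (`n_iter`) is set directly and does not grow with the number of hyperparameters.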







In [12]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

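# This cell reuses RandomForestClassifier, X_train, X_test, y_train and y_test from the grid search cell above.
# Define the hyperparameter distributions.
# scipy's randint(low, high) samples integers from low up to high - 1 (the upper bound is excluded).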
param_distributions = {
    'n_estimators': randint(50, 200),
    'max_depth': [None, 10, 20],
    'min_samples_split': randint(2, 10),
    'min_samples_leaf': randint(1, 4)
}

# Create a Random Forest classifier
rf_classifier = RandomForestClassifier()

# Perform random search with cross-validation
random_search = RandomizedSearchCV(estimator=rf_classifier, param_distributions=param_distributions,
                                   n_iter=10, scoring='accuracy', cv=5, random_state=42)
random_search.fit(X_train, y_train)

# Print the best hyperparameters
print("Best Hyperparameters:", random_search.best_params_)

# Evaluate the model on the test set
best_model = random_search.best_estimator_
accuracy = best_model.score(X_test, y_test)
print("Test Accuracy:", accuracy)


Best Hyperparameters: {'max_depth': 20, 'min_samples_leaf': 3, 'min_samples_split': 4, 'n_estimators': 137}
Test Accuracy: 1.0


## Bayesian Optimization

Bayesian Optimization is a probabilistic model-based optimization technique for finding the global optimum of an objective function that is noisy or expensive to evaluate. It places a probability distribution over the objective function and uses the information gained from each evaluation to update this distribution, guiding the search toward promising regions.

Here's a detailed explanation of Bayesian Optimization and its parameters:

1. **Objective Function:**
   - The function we want to optimize, often referred to as the objective function or the fitness function.

2. **Surrogate Model:**
   - Bayesian Optimization maintains a probabilistic surrogate model that approximates the true objective function. Gaussian processes (GP) are commonly used as surrogate models due to their ability to model uncertainty.

3. **Acquisition Function:**
   - The acquisition function is a criterion that guides the search for the next point to evaluate. It balances exploration (sampling in unexplored regions) and exploitation (sampling in regions where the objective function is likely to be optimal).

4. **Parameters of Bayesian Optimization:**
   - **Bounds:** Specifies the search space constraints for each hyperparameter. For each dimension, you define the lower and upper bounds.

   - **Objective Function:** The actual function to be optimized.

   - **Surrogate Model (Kernel, GP Parameters):** The choice of the surrogate model, along with its hyperparameters, affects how well the model captures the true objective function.

   - **Acquisition Function (e.g., Expected Improvement, Probability of Improvement):** The acquisition function determines how the surrogate model guides the search. Different acquisition functions have different characteristics.

   - **Initial Design:** A set of initial points where the objective function is evaluated. These points help in building the initial surrogate model.

   - **Exploration-Exploitation Trade-off:** Parameters that control the balance between exploration and exploitation. For example, some formulations of Expected Improvement include a small offset (often written ξ), where larger values favor exploration.

   - **Number of Iterations (or Evaluations):** Specifies the total number of iterations or evaluations of the objective function.

   - **Noise Model:** Bayesian Optimization can handle noisy evaluations. The noise model, if known, can be incorporated into the optimization process.

   - **Parallelism:** Specifies the number of parallel evaluations of the objective function that can be performed simultaneously.
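
To show how these pieces fit together, here is a minimal sketch of a Bayesian Optimization loop using only NumPy. Everything in it is an assumption made for illustration: the 1-D toy `objective`, the RBF kernel with a fixed length scale, the grid of candidate points, and the `xi` offset in Expected Improvement. Dedicated libraries handle kernel fitting, multiple dimensions, and numerical stability far more carefully.

```python
import math
import numpy as np


def objective(x):
    """Toy 1-D function to maximize; stands in for an expensive model score."""
    return -(x - 2.0) ** 2 + math.sin(3.0 * x)


def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two 1-D arrays of points."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


def gp_posterior(x_train, y_train, x_query, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP surrogate."""
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_tq = rbf_kernel(x_train, x_query)
    solved = np.linalg.solve(k_tt, k_tq)
    mean = solved.T @ y_train
    var = 1.0 - np.sum(k_tq * solved, axis=0)   # prior variance is 1
    return mean, np.sqrt(np.clip(var, 1e-12, None))


def expected_improvement(mean, std, best_y, xi=0.01):
    """Expected Improvement for maximization; xi nudges toward exploration."""
    improvement = mean - best_y - xi
    z = improvement / std
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return improvement * cdf + std * pdf


bounds = (0.0, 5.0)
n_init, n_iter = 3, 10
rng = np.random.default_rng(0)

# Initial design: a few random points inside the bounds.
x_obs = rng.uniform(bounds[0], bounds[1], size=n_init)
y_obs = np.array([objective(x) for x in x_obs])
candidates = np.linspace(bounds[0], bounds[1], 200)

for _ in range(n_iter):
    # Standardize targets so the zero-mean, unit-variance prior is reasonable.
    y_mean, y_std = y_obs.mean(), y_obs.std() + 1e-9
    mean, std = gp_posterior(x_obs, (y_obs - y_mean) / y_std, candidates)
    ei = expected_improvement(mean, std, (y_obs.max() - y_mean) / y_std)
    x_next = candidates[np.argmax(ei)]   # point that maximizes the acquisition
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, objective(x_next))

best_x, best_y = x_obs[np.argmax(y_obs)], y_obs.max()
print(f"best x = {best_x:.3f}, best y = {best_y:.3f}")
```

The loop mirrors the list above: `bounds` sets the search space, the first `n_init` points form the initial design, the GP is the surrogate model, Expected Improvement is the acquisition function, and `n_iter` is the evaluation budget.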