*This Notebook has been created by PALISSON Antoine.*<br>


In [None]:
# Basic packages
import os
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Tensorflow
import tensorflow as tf

# Preprocessing

In this exercise, we will use an income dataset: `'adult.csv'`.

**<font color='blue'>1. Load the dataset and show its content.**

**<font color='blue'>2. Separate the features and the label (the `income` variable).**

**<font color='blue'>3. Create a test and validation set.**

**<font color='blue'>4. Preprocess the data.**

*Tips: The preprocessing is different for the numerical and the categorical data*

# Benchmark

In this part, you will build a **benchmark** to compare the effect of the optimizer algorithms and the learning rate.

Here is a small function that should be run before every training to reset the seeds.<br> This ensures that the **randomness is always the same during training**, hence making the results comparable.

In [None]:
def reset_seeds():
    os.environ['PYTHONHASHSEED']=str(2)
    tf.random.set_seed(2)
    np.random.seed(2)
    random.seed(2)

**<font color='blue'>1. Build a model architecture that contains 2 hidden layers with 16 neurons and a ReLU activation.<br> The remaining parameters are set to default.<br> Add the appropriate output layer.**

*Tips: You should run the reset_seeds() function at the beginning of the cell.*

**<font color='blue'>2. Compile the model with the SGD optimizer, a learning rate of 0.005 and the appropriate loss.**

**<font color='blue'>3. Train the model for 10 epochs with a batch size of 32.**

*Tips: Don't forget the validation data!*

**<font color='blue'>4. Display the loss curves (training & validation).** 

# Optimizers

Momentum is a hyperparameter that controls the influence of previous gradients on the current update. It can be added to the SGD algorithm.

A higher momentum value gives previous gradients more weight in the current update, which can speed up convergence and help the optimizer move through plateaus and shallow local minima. A lower momentum value makes each update follow the current gradient more closely, which limits overshooting but may slow down convergence.

The most common values for momentum are between 0.9 and 0.99.
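
The effect of momentum can be illustrated without TensorFlow. The sketch below is a plain Python illustration, not the Keras implementation: it applies the update rule described in the Keras SGD documentation to a toy loss `f(w) = w**2`. The helper names `sgd_step` and `minimize`, the toy loss and the starting point are assumptions made for this example.

```python
def sgd_step(w, velocity, grad, lr=0.005, momentum=0.0, nesterov=False):
    """One SGD update on a scalar weight (update rule as documented for Keras SGD)."""
    # The velocity accumulates an exponentially decaying sum of past gradients.
    velocity = momentum * velocity - lr * grad
    if nesterov:
        # Nesterov variant: apply the momentum term once more, as a look-ahead.
        w = w + momentum * velocity - lr * grad
    else:
        w = w + velocity
    return w, velocity


def minimize(momentum=0.0, nesterov=False, steps=50):
    """Minimize the toy loss f(w) = w**2 (gradient 2 * w), starting from w = 1."""
    w, velocity = 1.0, 0.0
    for _ in range(steps):
        grad = 2.0 * w
        w, velocity = sgd_step(w, velocity, grad, momentum=momentum, nesterov=nesterov)
    return w


w_plain = minimize(momentum=0.0)
w_momentum = minimize(momentum=0.9)
w_nesterov = minimize(momentum=0.9, nesterov=True)

# Distance to the minimum (w = 0) after 50 steps for each variant.
print(abs(w_plain), abs(w_momentum), abs(w_nesterov))
```

On this toy problem, both momentum variants end closer to the minimum than plain SGD with the same learning rate. Real loss surfaces are noisier, so the gain has to be checked on the loss curves.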

**<font color='blue'>1. Using the same architecture as the benchmark model, add some momentum to the SGD optimizer using its `momentum` parameter.**

**<font color='blue'>2. Compare the result with the benchmark.**

**<font color='blue'>3. Add both the momentum and the Nesterov Momentum (`nesterov=True`) to the SGD optimizer.<br> Train the model and compare the results.**

In TensorFlow, the Adam optimizer is implemented as `tf.keras.optimizers.Adam()`.<br> Its `beta_1` and `beta_2` parameters correspond to beta_z and beta_s respectively (in the lesson).
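
To see where `beta_1` and `beta_2` act, here is a minimal sketch of the textbook Adam update on a scalar weight, written in plain Python. It is an illustration under assumptions, not the TensorFlow implementation: the function name `adam_step` and the toy loss `f(w) = w**2` are made up for this example, and the default values mirror the documented defaults of `tf.keras.optimizers.Adam`.

```python
import math


def adam_step(w, m, v, grad, t, lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7):
    """One textbook Adam update; t is the step number, starting at 1."""
    # beta_1 controls the moving average of the gradients (first moment).
    m = beta_1 * m + (1.0 - beta_1) * grad
    # beta_2 controls the moving average of the squared gradients (second moment).
    v = beta_2 * v + (1.0 - beta_2) * grad ** 2
    # Bias correction compensates for m and v being initialised at zero.
    m_hat = m / (1.0 - beta_1 ** t)
    v_hat = v / (1.0 - beta_2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + epsilon)
    return w, m, v


# Three updates on the toy loss f(w) = w**2, whose gradient is 2 * w.
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    w, m, v = adam_step(w, m, v, 2.0 * w, t)
    print(t, w)
```

With a steady gradient sign, each early update moves the weight by roughly the learning rate, whatever the gradient magnitude. This is one reason Adam is less sensitive to the gradient scale than SGD.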


**<font color='blue'>4. Using the same architecture as the benchmark model, replace the SGD optimizer by the Adam optimizer.<br> Compare the result.**

# Learning Rate

## Fixed learning rate

**<font color='blue'>1. Using the benchmark architecture and the Adam optimizer, try the following learning rate values:**<br>
**<font color='blue'>0.00001 / 0.0001 / 0.001 / 0.01 / 0.1 / 1**<br>
**<font color='blue'>Compare the losses**

## Learning rate scheduler

TensorFlow Keras provides several learning rate schedulers that can be used to adjust the learning rate during training including:


*   **Exponential** decay function - `tf.keras.optimizers.schedules.ExponentialDecay` - [Documentation](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/schedules/ExponentialDecay)
    * `decay_rate` (typically ranges between 0.1 and 0.5) parameter controls the rate at which the learning rate decreases
    * `decay_steps` parameter (typically ranges between 1000 and 10000 but depends on the number of instances) sets the number of training steps (batches, not instances) over which the learning rate is multiplied by `decay_rate`
*   **Cosine** decay function - `tf.keras.optimizers.schedules.CosineDecay` - [Documentation](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/schedules/CosineDecay)
    * `alpha` (ranges between 0.0 and 1.0) sets the minimum learning rate, expressed as a fraction of the initial learning rate.
*   **Polynomial** decay function - `tf.keras.optimizers.schedules.PolynomialDecay` - [Documentation](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/schedules/PolynomialDecay)
    * `power` (typically ranges between 0.5 and 2.0) sets the power of the polynomial.
*   **Power** decay function - `tf.keras.optimizers.schedules.InverseTimeDecay` - [Documentation](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/schedules/InverseTimeDecay)
*   **Piecewise** decay function - `tf.keras.optimizers.schedules.PiecewiseConstantDecay` - [Documentation](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/schedules/PiecewiseConstantDecay)
    * `boundaries`: A list of strictly increasing step numbers (optimizer steps, not epochs) at which a new learning rate is applied.
    * `values`: A list of length `len(boundaries) + 1` that specifies the learning rate for each interval delimited by the boundaries. The first value is the initial learning rate, and each subsequent value applies after the corresponding boundary.
*   **Performance** decay - `tf.keras.callbacks.ReduceLROnPlateau` - [Documentation](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/ReduceLROnPlateau)
    * `factor` (typically ranges between 0.1 and 0.5) sets the factor by which the learning rate is multiplied at each reduction (`new_lr = lr * factor`).
    * `patience` (typically ranges between 3 and 10) sets the number of epochs without improvement of the monitored metric to wait before reducing the learning rate.
    * `min_delta` sets the minimum change in the monitored metric that is considered an improvement.
    * `cooldown` sets the number of epochs to wait after reducing the learning rate before resuming normal operation.
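
To build intuition for the step-based schedules listed above, the sketch below re-implements their decay formulas in plain Python, as I understand them from the TensorFlow documentation (non-staircase, non-cycling variants). The function names are assumptions made for this example. They are not the TensorFlow classes, and the numbers passed to them are arbitrary.

```python
import math
from bisect import bisect_left


def exponential_decay(step, initial_lr, decay_steps, decay_rate):
    # The learning rate is multiplied by decay_rate every decay_steps steps.
    return initial_lr * decay_rate ** (step / decay_steps)


def cosine_decay(step, initial_lr, decay_steps, alpha=0.0):
    # Half a cosine wave from initial_lr down to alpha * initial_lr.
    step = min(step, decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * step / decay_steps))
    return initial_lr * ((1.0 - alpha) * cosine + alpha)


def polynomial_decay(step, initial_lr, decay_steps, end_lr=0.0001, power=1.0):
    # Polynomial interpolation from initial_lr to end_lr (linear when power=1).
    step = min(step, decay_steps)
    return (initial_lr - end_lr) * (1.0 - step / decay_steps) ** power + end_lr


def inverse_time_decay(step, initial_lr, decay_steps, decay_rate):
    # The learning rate is divided by a term that grows linearly with the step.
    return initial_lr / (1.0 + decay_rate * step / decay_steps)


def piecewise_constant(step, boundaries, values):
    # values[0] up to and including boundaries[0], values[1] after it, and so on.
    return values[bisect_left(boundaries, step)]


for step in (0, 500, 1000, 2000):
    print(
        step,
        exponential_decay(step, 0.01, 1000, 0.5),
        cosine_decay(step, 0.01, 1000, alpha=0.1),
        polynomial_decay(step, 0.01, 1000),
        inverse_time_decay(step, 0.01, 1000, 1.0),
        piecewise_constant(step, [1000, 2000], [0.01, 0.005, 0.001]),
    )
```

In all of these functions, `step` counts optimizer updates (one per batch), so a sensible `decay_steps` depends on the dataset size and the batch size.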

There is also a callback, `tf.keras.callbacks.LearningRateScheduler`, that lets you define your own function to schedule the learning rate. The function receives the epoch index and the current learning rate and returns the new learning rate - [Documentation](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/LearningRateScheduler)
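
As an illustration, here is a minimal sketch of a schedule function with the `(epoch, lr)` signature accepted by this callback. The name `step_decay` and the choice of halving every 3 epochs are assumptions made for the example. The loop only simulates what the callback would do at the start of each epoch.

```python
def step_decay(epoch, lr):
    """Halve the learning rate every 3 epochs (the epoch index starts at 0)."""
    if epoch > 0 and epoch % 3 == 0:
        return lr * 0.5
    return lr


# Simulate 10 epochs: the callback would call the function once per epoch.
lr = 0.01
history = []
for epoch in range(10):
    lr = step_decay(epoch, lr)
    history.append(lr)

print(history)
```

In a Keras training run, such a function would be passed as `tf.keras.callbacks.LearningRateScheduler(step_decay)` in the `callbacks` argument of `model.fit`.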

**<font color='blue'>1. Use a piecewise decay learning rate scheduler with three boundaries (i.e. four different learning rates) using the benchmark architecture and the Adam optimizer.**

**<font color='blue'>2. Do the same with an exponential learning rate scheduler.<br> Compare the results.**

**<font color='blue'>3. Do the same with a performance learning rate scheduler.<br> Compare the results.**

# Hyperparametrization

There are several Python libraries available for tuning the hyperparameters of a neural network including:
* Keras Tuner - [Documentation](https://keras.io/keras_tuner/)
* Optuna - [Documentation](https://optuna.readthedocs.io/en/stable/index.html)
* Ray Tune - [Documentation](https://docs.ray.io/en/latest/tune/index.html)
* Hyperopt - [Documentation](http://hyperopt.github.io/hyperopt/)

These libraries have similar APIs and can be used in a similar way. Here's a general method for using these libraries for hyperparameter tuning:

* **Define the search space**: *The first step is to define the search space for your hyperparameters. This includes specifying the range of values that each hyperparameter can take. For example, you can define a search space that includes the learning rate, number of hidden layers, batch size, and other hyperparameters.*

* **Define the objective function**: *The next step is to define the objective function that you want to optimize. This is typically the performance metric of your model, such as accuracy or loss. You'll need to train and evaluate your model for each combination of hyperparameters and return the performance metric.*

* **Choose a search algorithm**: *Next, you'll need to choose a search algorithm to explore the search space and find the optimal hyperparameters. The available algorithms vary depending on the library you choose, but popular choices include random search, grid search, Bayesian optimization, and other techniques.*

* **Run the search**: *Once you've defined the search space, objective function, and search algorithm, you can start the search. The library will run multiple experiments, each with a different set of hyperparameters, and evaluate the performance of your model. When the trial budget is exhausted, the library reports the best set of hyperparameters it has found, which is not guaranteed to be the global optimum.*

* **Evaluate the best model**: *Finally, once the search is complete, you can evaluate the performance of the best model using the optimal hyperparameters.*
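
The steps above can be sketched with a hand-written random search in plain Python, without any of the tuning libraries. Everything in this sketch is illustrative: the search space values, the helper `random_search`, and above all `objective`, which is a hypothetical stand-in formula. In practice it would train the model with the given hyperparameters and return the validation loss.

```python
import math
import random

# Step 1: the search space (candidate values chosen arbitrarily for this sketch).
search_space = {
    "learning_rate": [1e-4, 1e-3, 1e-2, 1e-1],
    "units": [8, 16, 32, 64],
    "batch_size": [32, 64, 128],
}


# Step 2: the objective function to minimize.
def objective(params):
    """Hypothetical placeholder for 'train the model, return the validation loss'."""
    lr_term = (math.log10(params["learning_rate"]) + 3.0) ** 2
    units_term = abs(params["units"] - 32) / 64
    batch_term = params["batch_size"] / 1280
    return lr_term + units_term + batch_term


# Steps 3 and 4: the search algorithm (random search) and its execution.
def random_search(space, objective_fn, n_trials=20, seed=2):
    rng = random.Random(seed)
    history = []
    for _ in range(n_trials):
        params = {name: rng.choice(choices) for name, choices in space.items()}
        history.append((params, objective_fn(params)))
    best_params, best_loss = min(history, key=lambda trial: trial[1])
    return best_params, best_loss, history


# Step 5: inspect the best configuration found within the trial budget.
best_params, best_loss, history = random_search(search_space, objective)
print(best_params, best_loss)
```

The libraries listed above follow the same structure. They mainly add smarter search algorithms, early stopping of weak trials and result logging.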

**<font color='blue'>Using one of these libraries, find the best hyperparameters for the `adult.csv` dataset.**

*Tips: You should try different combinations of number of hidden layers, number of neurons, activation functions, weight initialization, optimizer, learning rate, batch size, regularization techniques (norms, dropout, batch ..) and so on. <br> You can also treat your preprocessing choices as hyperparameters!*