# Worksheet: Neural Network

An important practical question is how to tune hyperparameters such as the activation function, optimizer, batch size, and number of epochs. Many theories have been developed to guide hyperparameter selection, but they rely on assumptions that are hard to check in practice.

In practice, we build several models with different hyperparameters and then compare the test performance of each model.

In this worksheet, you will select the optimizer used to train a neural network. You will try the **Adam**, **SGD**, and **Adagrad** optimizers. Details about each optimizer are given below. All other hyperparameters are fixed as follows:

- Neural network structure: Shallow neural network with 200 neurons in the hidden layer. Activation function is ReLU. 

- Loss: This is a regression problem, so you should use MSE as both the loss and the metric.


- Number of epochs = 10, batch_size = 16


**You should write two functions in this worksheet.** The first function takes training samples and an optimizer as inputs and returns a trained model. The second function takes test samples and a trained model as inputs and returns the test error. Each function should work for any training samples, test samples, optimizer, and model.

Train three models, one for each optimizer mentioned above, using the two functions you wrote, then report the test error of each model.

**Remark:** You may need to change the default parameters of each optimizer to make training work, especially for the SGD optimizer.
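One plausible reason SGD is sensitive to its defaults is that plain gradient descent diverges when the learning rate is too large relative to the curvature of the loss. The toy sketch below assumes a one-dimensional quadratic loss with a hypothetical curvature of 500; it is not the worksheet's network, only an illustration of the effect.

```python
def gradient_descent(lr, curvature=500.0, w0=1.0, steps=20):
    """Run plain gradient descent on f(w) = 0.5 * curvature * w**2."""
    w = w0
    for _ in range(steps):
        grad = curvature * w   # derivative of f at w
        w = w - lr * grad      # each step multiplies w by (1 - lr * curvature)
    return w

# The update is stable only when |1 - lr * curvature| < 1, i.e. lr < 2 / curvature
w_default = gradient_descent(lr=0.01)   # factor is -4: the iterates blow up
w_small = gradient_descent(lr=0.001)    # factor is 0.5: the iterates shrink to 0

print("lr=0.01  ->", abs(w_default))
print("lr=0.001 ->", abs(w_small))
```

In this toy setting the SGD default `learning_rate=0.01` diverges while `0.001` converges, which is the kind of behavior the remark warns about.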

**Grading policy:**
1. Docstrings and inline comments are added to explain your functions and code.
2. All optimizers work properly and three test errors are reported.

## Optimizers:

#### Adam

Official paper: https://arxiv.org/abs/1412.6980

Syntax: 

    tf.keras.optimizers.Adam(
        learning_rate=0.001,
        beta_1=0.9,
        beta_2=0.999,
        epsilon=1e-07,
        amsgrad=False,
        name='Adam',
        **kwargs)

Parameters:

- `learning_rate`: Rate at which the algorithm updates the parameters. A tensor or a float. Default value is 0.001.

- `beta_1`: Exponential decay rate for the first-moment estimates. A constant float tensor or a float. Default value is 0.9.

- `beta_2`: Exponential decay rate for the second-moment estimates. A constant float tensor or a float. Default value is 0.999.

- `epsilon`: Small constant used for numerical stability. A float. Default value is 1e-07.

- `amsgrad`: Whether to use the AMSGrad variant. Default value is False.

- `name`: Optional name for the operation.

- `**kwargs`: Additional keyword arguments, for example `clipnorm`.
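To make the roles of `learning_rate`, `beta_1`, `beta_2`, and `epsilon` concrete, here is a minimal NumPy sketch of the Adam update rule as stated in the paper linked above. The function name `adam_minimize` and the quadratic test problem are assumptions for illustration, and the Keras implementation may differ in details such as where `epsilon` is applied.

```python
import numpy as np

def adam_minimize(grad_fn, w0, learning_rate=0.001, beta_1=0.9,
                  beta_2=0.999, epsilon=1e-7, steps=1000):
    """Minimize a function with the Adam update rule, given its gradient."""
    w = np.asarray(w0, dtype=float).copy()
    m = np.zeros_like(w)  # running estimate of the gradient mean (1st moment)
    v = np.zeros_like(w)  # running estimate of the squared gradient (2nd moment)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta_1 * m + (1 - beta_1) * g
        v = beta_2 * v + (1 - beta_2) * g ** 2
        m_hat = m / (1 - beta_1 ** t)  # bias correction for the zero initialization
        v_hat = v / (1 - beta_2 ** t)
        w = w - learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
    return w

# Hypothetical test problem: f(w) = ||w - target||^2, with gradient 2 * (w - target)
target = np.array([1.0, -2.0])
w_adam = adam_minimize(lambda w: 2 * (w - target), np.zeros(2),
                       learning_rate=0.05, steps=2000)
print(w_adam)
```

Because the step is divided by the square root of the second-moment estimate, each coordinate moves by roughly `learning_rate` per step early on, regardless of the gradient's scale.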

#### RMSprop
Syntax: 

    tf.keras.optimizers.RMSprop(
        learning_rate=0.001,
        rho=0.9,
        momentum=0.0,
        epsilon=1e-07,
        centered=False,
        name='RMSprop',
        **kwargs)

Parameters:

- `learning_rate`: Rate at which the algorithm updates the parameters. A tensor or a float. Default value is 0.001.

- `rho`: Discounting factor for the gradient history. Default value is 0.9.

- `momentum`: Accelerates RMSprop in the relevant direction. A float. Default value is 0.0.

- `epsilon`: Small constant used for numerical stability. A float. Default value is 1e-07.

- `centered`: If True, gradients are normalized by the estimated variance of the gradient. A boolean. Setting it to True may help training, but it is computationally more expensive. Default value is False.

- `name`: Optional name for the operation.

- `**kwargs`: Additional keyword arguments, for example `clipnorm`.
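The effect of `rho` and `epsilon` can be seen in a minimal NumPy sketch of the plain RMSprop update (the `momentum=0.0`, `centered=False` case). The function name `rmsprop_minimize`, the zero-initialized accumulator, and the quadratic test problem are assumptions for illustration rather than a copy of the Keras implementation.

```python
import numpy as np

def rmsprop_minimize(grad_fn, w0, learning_rate=0.001, rho=0.9,
                     epsilon=1e-7, steps=1000):
    """Minimize a function with the plain RMSprop update, given its gradient."""
    w = np.asarray(w0, dtype=float).copy()
    avg_sq = np.zeros_like(w)  # discounted average of squared gradients
    for _ in range(steps):
        g = grad_fn(w)
        avg_sq = rho * avg_sq + (1 - rho) * g ** 2
        # Divide by the root of the running average, so each coordinate
        # takes steps of roughly learning_rate regardless of gradient scale
        w = w - learning_rate * g / (np.sqrt(avg_sq) + epsilon)
    return w

# Hypothetical test problem: f(w) = ||w - target||^2, with gradient 2 * (w - target)
target = np.array([1.0, -2.0])
w_rmsprop = rmsprop_minimize(lambda w: 2 * (w - target), np.zeros(2),
                             learning_rate=0.01, steps=1000)
print(w_rmsprop)
```

With a constant learning rate the iterates end up hovering near the minimizer at a distance on the order of `learning_rate`, rather than converging exactly.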




#### Adagrad
Syntax: 

    tf.keras.optimizers.Adagrad(
        learning_rate=0.001,
        initial_accumulator_value=0.1,
        epsilon=1e-07,
        name="Adagrad",
        **kwargs)

Parameters:

- `learning_rate`: Rate at which the algorithm updates the parameters. A tensor or a float. Default value is 0.001.

- `initial_accumulator_value`: Starting value for the per-parameter accumulators. A float. Must be non-negative. Default value is 0.1.

- `epsilon`: Small constant used for numerical stability. A float. Default value is 1e-07.

- `name`: Optional name for the operation.

- `**kwargs`: Additional keyword arguments, for example `clipnorm`.
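A minimal NumPy sketch of the Adagrad update shows what `initial_accumulator_value` does and why the effective step size can only shrink over time. The function name `adagrad_minimize`, the placement of `epsilon`, and the quadratic test problem are assumptions for illustration; the Keras implementation may differ in details.

```python
import numpy as np

def adagrad_minimize(grad_fn, w0, learning_rate=0.001,
                     initial_accumulator_value=0.1, epsilon=1e-7, steps=100):
    """Minimize with the Adagrad update; also return the per-step effective rates."""
    w = np.asarray(w0, dtype=float).copy()
    acc = np.full_like(w, initial_accumulator_value)  # per-parameter accumulator
    effective_rates = []
    for _ in range(steps):
        g = grad_fn(w)
        acc = acc + g ** 2  # squared gradients are summed and never discounted
        rate = learning_rate / (np.sqrt(acc) + epsilon)
        w = w - rate * g
        effective_rates.append(rate.copy())
    return w, np.array(effective_rates)

# Hypothetical test problem: f(w) = ||w - target||^2, with gradient 2 * (w - target)
target = np.array([1.0, -2.0])
w_adagrad, rates = adagrad_minimize(lambda w: 2 * (w - target), np.zeros(2),
                                    learning_rate=0.5, steps=200)
print("final w:", w_adagrad)
print("first and last effective rates:", rates[0], rates[-1])
```

Because the accumulator only grows, the effective learning rate is non-increasing for every parameter, which is one reason Adagrad often needs a larger `learning_rate` than the default to make progress in a fixed number of epochs.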

#### SGD and its variations
Syntax: 

    tf.keras.optimizers.SGD(
        learning_rate=0.01,
        momentum=0.0,
        nesterov=False,
        name='SGD',
        **kwargs)

Parameters:

- `learning_rate`: Rate at which the algorithm updates the parameters. A tensor or a float. Default value is 0.01.

- `momentum`: Accelerates gradient descent in the relevant direction. A float. Default value is 0.0.

- `nesterov`: Whether to apply Nesterov momentum. A boolean. Default value is False.

- `name`: Optional name for the operation.

- `**kwargs`: Additional keyword arguments, for example `clipnorm`.
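The three SGD variants differ only in how the velocity enters the update. The NumPy sketch below follows the update rule described in the Keras documentation for `momentum` and `nesterov`; the function name `sgd_minimize` and the quadratic test problem are assumptions for illustration, and full-batch gradients are used instead of mini-batches to keep the example deterministic.

```python
import numpy as np

def sgd_minimize(grad_fn, w0, learning_rate=0.01, momentum=0.0,
                 nesterov=False, steps=500):
    """Gradient descent with optional (Nesterov) momentum, given the gradient."""
    w = np.asarray(w0, dtype=float).copy()
    velocity = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        velocity = momentum * velocity - learning_rate * g
        if nesterov:
            # Look-ahead step: apply the updated velocity plus a fresh gradient step
            w = w + momentum * velocity - learning_rate * g
        else:
            w = w + velocity
    return w

# Hypothetical test problem: f(w) = ||w - target||^2, with gradient 2 * (w - target)
target = np.array([1.0, -2.0])
grad = lambda w: 2 * (w - target)

err_plain = np.linalg.norm(sgd_minimize(grad, np.zeros(2)) - target)
err_momentum = np.linalg.norm(sgd_minimize(grad, np.zeros(2), momentum=0.9) - target)
err_nesterov = np.linalg.norm(
    sgd_minimize(grad, np.zeros(2), momentum=0.9, nesterov=True) - target)

print("plain SGD error:        ", err_plain)
print("momentum error:         ", err_momentum)
print("Nesterov momentum error:", err_nesterov)
```

On this well-conditioned toy problem both momentum variants get closer to the minimizer than plain SGD in the same number of steps; on the worksheet's network the outcome also depends on the learning rate and on mini-batch noise.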


#### Some notes on different optimizers:
1. https://www.geeksforgeeks.org/adam-optimizer/
2. https://medium.com/ai³-theory-practice-business/optimization-in-deep-learning-5a5d263172e

#### Which optimizer should you use?

- Convergence speed (rough rule of thumb, slowest to fastest):

SGD < SGD(momentum) = SGD(momentum, nesterov=True) < Adagrad = RMSprop = Adam

- Convergence quality (rough rule of thumb, best to worst):

SGD = SGD(momentum) = SGD(momentum, nesterov=True) >= RMSprop = Adam > Adagrad


## Please run the following cell to generate synthetic data.

In [1]:
import os

# Reduce TensorFlow's C++ log output. This variable must be set before
# TensorFlow is imported: '1' hides INFO, '2' also hides WARNING, '3' also hides ERROR.
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1'

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras import layers

# Only show errors from TensorFlow's Python logger
tf.get_logger().setLevel('ERROR')

# generate data points
datadim = 5              # intrinsic dimension of data
dim = 500                # dimension of the ambient space
N = 5000                 # data set size
eps = 0.25               # noise level in the observed data

# create low dimensional data points
Xreal = np.random.randn(N, datadim)
y = np.tanh(Xreal[:,0]) + np.cos(Xreal[:,1]) - np.exp(-Xreal[:,4])

# a matrix to embed the given data into a high-dimensional ambient space
transform = np.random.randn(datadim,dim)

# the high-dimensional data is obtained by matrix multiplication
X = Xreal @ transform

# making the observations noisy
X += np.random.normal(0, eps, size = X.shape)
y += np.random.normal(0, eps, size = y.shape)

# Test data points:
Ntest = 500
Xrealtest = np.random.randn(Ntest, datadim)
ytest = np.tanh(Xrealtest[:,0]) + np.cos(Xrealtest[:,1]) - np.exp(-Xrealtest[:,4])
Xtest = Xrealtest@transform
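The data above lie close to a 5-dimensional subspace of the ambient space, and this can be checked from the singular values of the data matrix. The sketch below regenerates a smaller data set in the same way (the reduced sizes `dim = 200` and `N = 2000`, the fixed seed, and the variable names are assumptions chosen to keep the check fast) and looks for a gap after the fifth singular value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same construction as the worksheet's data, at a smaller (assumed) size
datadim, dim, N, eps = 5, 200, 2000, 0.25
X_low = rng.standard_normal((N, datadim))          # low-dimensional points
embed = rng.standard_normal((datadim, dim))        # linear embedding
X_high = X_low @ embed + rng.normal(0, eps, size=(N, dim))  # noisy embedding

# Singular values are returned in decreasing order
singular_values = np.linalg.svd(X_high, compute_uv=False)

# Ratio between the last "signal" and the first "noise" singular value
gap = singular_values[datadim - 1] / singular_values[datadim]

print("leading singular values:", np.round(singular_values[:datadim + 2], 1))
print("gap after the first", datadim, "singular values:", round(float(gap), 1))
```

A large gap after the fifth singular value indicates that only five directions carry signal and the remaining ones are dominated by noise.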




In [2]:
import tensorflow as tf
from tensorflow.keras import layers
from sklearn.metrics import mean_squared_error
import numpy as np


# Define the model training function
def train_model(X_train, y_train, optimizer):
    """
    Trains a neural network model on the provided training data using the specified optimizer.
    
    Parameters:
    X_train (ndarray): Training input data
    y_train (ndarray): Training target data
    optimizer (tf.keras.optimizers.Optimizer): Optimizer to be used for training
    
    Returns:
    model (tf.keras.Model): Trained neural network model
    """
    # Define the model
    model = tf.keras.Sequential([
        layers.Dense(200, activation='relu', input_shape=(X_train.shape[1],)),
        layers.Dense(1)  # Output layer for regression
    ])
    
    # Compile the model
    model.compile(optimizer=optimizer, loss='mse', metrics=['mse'])
    
    # Train the model
    model.fit(X_train, y_train, epochs=10, batch_size=16, verbose=1)
    return model

# Define the model evaluation function
def evaluate_model(X_test, y_test, model):
    """
    Evaluates the trained model on test data and returns the test error.
    
    Parameters:
    X_test (ndarray): Test input data
    y_test (ndarray): Test target data
    model (tf.keras.Model): Trained neural network model
    
    Returns:
    test_error (float): Mean squared error on the test data
    """
    y_pred = model.predict(X_test)
    
    # Ensure there are no NaNs in predictions before calculating MSE
    if np.isnan(y_pred).any():
        raise ValueError("Prediction contains NaN values. Check optimizer parameters.")
    
    test_error = mean_squared_error(y_test, y_pred)
    return test_error

# Initialize optimizers
adam_optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
sgd_optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True, clipnorm=1.0)
adagrad_optimizer = tf.keras.optimizers.Adagrad(learning_rate=0.001)
rmsprop_optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9, momentum=0.0, epsilon=1e-07, centered=False)

# Train models with different optimizers
adam_model = train_model(X, y, adam_optimizer)
sgd_model = train_model(X, y, sgd_optimizer)
adagrad_model = train_model(X, y, adagrad_optimizer)
rmsprop_model = train_model(X, y, rmsprop_optimizer)

# Evaluate each model and print test errors
adam_test_error = evaluate_model(Xtest, ytest, adam_model)
sgd_test_error = evaluate_model(Xtest, ytest, sgd_model)
adagrad_test_error = evaluate_model(Xtest, ytest, adagrad_model)
rmsprop_test_error = evaluate_model(Xtest, ytest, rmsprop_model)

print("Test error with Adam optimizer:", adam_test_error)
print("Test error with SGD optimizer:", sgd_test_error)
print("Test error with Adagrad optimizer:", adagrad_test_error)
print("Test error with RMSprop optimizer:", rmsprop_test_error)

Test error with Adam optimizer: 0.3671846143904421
Test error with SGD optimizer: 0.42106051200398154
Test error with Adagrad optimizer: 0.4612084479734086
Test error with RMSprop optimizer: 0.3936385244511009
