# Worksheet: Neural Network

In this worksheet, I want to answer some practical questions that come up when training a neural network.

## Activation function


- **ReLU activation function** $\mathop{ReLU}(x) = \max(x,0)$


- **Sigmoid function** $\mathop{Sigmoid}(x) = 1 / (1 + \exp(-x))$


- **Leaky ReLU** $\mathop{LeakyReLU}_\alpha(x) = \max(\alpha x, x)$, usually with $\alpha=0.01$, although some practitioners report that $\alpha=0.2$ works better.


- **ELU** $\mathop{ELU}_\alpha(x) = \begin{cases} \alpha(\exp(x)-1) & \mbox{if } x<0 \\ x & \mbox{if } x \ge 0 \end{cases}$

- **SELU** $\mathop{SELU}_\alpha(x) = \begin{cases} \mathrm{scale}\cdot\alpha(\exp(x)-1) & \mbox{if } x<0 \\ \mathrm{scale}\cdot x & \mbox{if } x \ge 0 \end{cases}$

Remark: ReLU stands for rectified linear unit and (S)ELU stands for (scaled) exponential linear unit.

Which activation function should you use? As a rule of thumb, SELU > ELU > leaky ReLU > ReLU > sigmoid, although the ranking can vary by problem. However, if speed is your priority, ReLU might still be the best choice, since many libraries and hardware accelerators provide ReLU-specific optimizations.
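
The definitions above can be sketched in plain Python for scalar inputs. This is only an illustration, not how a deep learning library implements them: the function names are chosen here, the ELU default $\alpha=1$ is an assumption, and the SELU constants are the ones usually quoted from the SELU paper.

```python
import math

# SELU constants as usually quoted from the SELU paper (assumed here).
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def relu(x):
    """ReLU(x) = max(x, 0)."""
    return max(x, 0.0)


def sigmoid(x):
    """Sigmoid(x) = 1 / (1 + exp(-x)), written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def leaky_relu(x, alpha=0.01):
    """LeakyReLU(x) = max(alpha * x, x), valid for 0 < alpha < 1."""
    return max(alpha * x, x)


def elu(x, alpha=1.0):
    """ELU: x for x >= 0, alpha * (exp(x) - 1) otherwise."""
    return x if x >= 0 else alpha * (math.exp(x) - 1.0)


def selu(x, alpha=SELU_ALPHA, scale=SELU_SCALE):
    """SELU is ELU multiplied by a fixed scale."""
    return scale * elu(x, alpha)


# Print each activation at a negative, zero, and positive input.
for x in (-2.0, 0.0, 2.0):
    print(x, relu(x), sigmoid(x), leaky_relu(x), elu(x), selu(x))
```

For negative inputs ReLU returns exactly zero, while leaky ReLU, ELU, and SELU return small negative values, so they still pass a gradient.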

## Optimizers

#### Adam

Official paper: https://arxiv.org/abs/1412.6980

Syntax: 

    tf.keras.optimizers.Adam(
        learning_rate=0.001,
        beta_1=0.9,
        beta_2=0.999,
        epsilon=1e-07,
        amsgrad=False,
        name='Adam',
        **kwargs)
        
Parameters:

- `learning_rate`: Step size used when the algorithm updates the parameters. Tensor or float. Default value is 0.001.

- `beta_1`: Exponential decay rate for the first-moment estimates. Constant float tensor or float. Default value is 0.9.

- `beta_2`: Exponential decay rate for the second-moment estimates. Constant float tensor or float. Default value is 0.999.

- `epsilon`: Small constant used for numerical stability. Float. Default value is 1e-07.

- `amsgrad`: Whether to use the AMSGrad variant. Boolean. Default value is False.

- `name`: Optional name for the operation.

- `**kwargs`: Additional keyword arguments.
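
To make the roles of `learning_rate`, `beta_1`, `beta_2`, and `epsilon` concrete, here is a minimal sketch of one Adam update in plain Python, following the update rule in the paper linked above. The function name `adam_step` is hypothetical, and the Keras implementation may differ in details such as where `epsilon` enters.

```python
import math


def adam_step(theta, grad, m, v, t,
              learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7):
    """Apply one Adam update to a list of scalar parameters.

    t is the 1-based step counter; m and v are the running first- and
    second-moment estimates (start them at zero).
    """
    new_theta, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(theta, grad, m, v):
        m_i = beta_1 * m_i + (1.0 - beta_1) * g        # first-moment estimate
        v_i = beta_2 * v_i + (1.0 - beta_2) * g * g    # second-moment estimate
        m_hat = m_i / (1.0 - beta_1 ** t)              # bias correction
        v_hat = v_i / (1.0 - beta_2 ** t)
        new_theta.append(p - learning_rate * m_hat / (math.sqrt(v_hat) + epsilon))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v


# Minimize f(theta) = theta_0^2 + theta_1^2, whose gradient is 2 * theta.
theta, m, v = [1.0, -3.0], [0.0, 0.0], [0.0, 0.0]
for t in range(1, 4):
    grad = [2.0 * p for p in theta]
    theta, m, v = adam_step(theta, grad, m, v, t)
    print(t, theta)
```

On the first step the bias-corrected ratio equals the sign of the gradient, so each parameter moves by roughly `learning_rate` regardless of the gradient's scale.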

#### RMSprop
Syntax: 

    tf.keras.optimizers.RMSprop(
        learning_rate=0.001,
        rho=0.9,
        momentum=0.0,
        epsilon=1e-07,
        centered=False,
        name='RMSprop',
        **kwargs)

Parameters:

- `learning_rate`: Step size used when the algorithm updates the parameters. Tensor or float. Default value is 0.001.

- `rho`: Discounting factor for the moving average of squared gradients. Default value is 0.9.

- `momentum`: Accelerates RMSprop in the relevant direction. Float. Default value is 0.0.

- `epsilon`: Small constant used for numerical stability. Float. Default value is 1e-07.

- `centered`: If True, gradients are normalized by the estimated variance of the gradient. Boolean. Setting it to True may help with training, but it is computationally more expensive. Default value is False.

- `name`: Optional name for the operation.

- `**kwargs`: Additional keyword arguments.
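
The effect of `rho` and `epsilon` can be seen in a minimal sketch of the basic RMSprop update (with `momentum=0.0` and `centered=False`). The function name `rmsprop_step` is hypothetical, and the exact placement of `epsilon` is an assumption that may differ from the Keras implementation.

```python
import math


def rmsprop_step(theta, grad, avg_sq, learning_rate=0.001, rho=0.9, epsilon=1e-7):
    """Apply one plain RMSprop update to a list of scalar parameters.

    avg_sq holds the moving average of squared gradients (start it at zero).
    """
    new_theta, new_avg_sq = [], []
    for p, g, s in zip(theta, grad, avg_sq):
        s = rho * s + (1.0 - rho) * g * g   # discounted average of g^2
        new_theta.append(p - learning_rate * g / (math.sqrt(s) + epsilon))
        new_avg_sq.append(s)
    return new_theta, new_avg_sq


# Two parameters whose gradients differ by a factor of 100.
theta, avg_sq = [0.0, 0.0], [0.0, 0.0]
theta, avg_sq = rmsprop_step(theta, [2.0, 200.0], avg_sq)
print(theta)
```

Both parameters move by the same amount even though their gradients differ by a factor of 100, because each gradient is divided by its own running magnitude.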




#### Adagrad
Syntax: 

    tf.keras.optimizers.Adagrad(
        learning_rate=0.001,
        initial_accumulator_value=0.1,
        epsilon=1e-07,
        name="Adagrad",
        **kwargs)
        
Parameters:

- `learning_rate`: Step size used when the algorithm updates the parameters. Tensor or float. Default value is 0.001.

- `initial_accumulator_value`: Starting value for the per-parameter accumulators (the running sums of squared gradients). Float. Must be non-negative. Default value is 0.1.

- `epsilon`: Small constant used for numerical stability. Float. Default value is 1e-07.

- `name`: Optional name for the operation.

- `**kwargs`: Additional keyword arguments.
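
A minimal sketch of the Adagrad update shows what the accumulator does. The function name `adagrad_step` is hypothetical, and the placement of `epsilon` outside the square root is an assumption; library implementations may differ.

```python
import math


def adagrad_step(theta, grad, accum, learning_rate=0.001, epsilon=1e-7):
    """Apply one Adagrad update to a list of scalar parameters.

    accum holds the running sum of squared gradients for each parameter
    (start it at initial_accumulator_value, e.g. 0.1).
    """
    new_theta, new_accum = [], []
    for p, g, a in zip(theta, grad, accum):
        a = a + g * g                       # the accumulator only ever grows
        new_theta.append(p - learning_rate * g / (math.sqrt(a) + epsilon))
        new_accum.append(a)
    return new_theta, new_accum


# With a constant gradient, every step is smaller than the previous one.
theta, accum = [0.0], [0.1]
previous = theta[0]
for step in range(1, 4):
    theta, accum = adagrad_step(theta, [1.0], accum)
    print(step, previous - theta[0])        # size of this step
    previous = theta[0]
```

Because the accumulator never decreases, the effective learning rate shrinks over time, which is one reason Adagrad can stall late in training.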

#### SGD and its variations
Syntax: 

    tf.keras.optimizers.SGD(
        learning_rate=0.01,
        momentum=0.0,
        nesterov=False,
        name='SGD',
        **kwargs)
        
Parameters:

- `learning_rate`: Step size used when the algorithm updates the parameters. Tensor or float. Default value is 0.01.

- `momentum`: Accelerates gradient descent in the relevant direction and dampens oscillations. Float. Default value is 0.0.

- `nesterov`: Whether to apply Nesterov momentum. Boolean. Default value is False.

- `name`: Optional name for the operation.

- `**kwargs`: Additional keyword arguments.
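
The three SGD variations differ only in how the velocity is used. The sketch below follows the update rule given in the Keras SGD documentation; the function name `sgd_step` is hypothetical.

```python
def sgd_step(theta, grad, velocity, learning_rate=0.01, momentum=0.0, nesterov=False):
    """Apply one SGD update to a list of scalar parameters.

    velocity holds the running update direction (start it at zero).
    With momentum=0.0 this reduces to plain gradient descent.
    """
    new_theta, new_velocity = [], []
    for p, g, vel in zip(theta, grad, velocity):
        vel = momentum * vel - learning_rate * g
        if nesterov:
            # Look ahead along the updated velocity before stepping.
            p = p + momentum * vel - learning_rate * g
        else:
            p = p + vel
        new_theta.append(p)
        new_velocity.append(vel)
    return new_theta, new_velocity


# Compare the first step of the three variants on the same gradient.
for kwargs in ({}, {"momentum": 0.9}, {"momentum": 0.9, "nesterov": True}):
    theta, velocity = sgd_step([1.0], [2.0], [0.0], learning_rate=0.1, **kwargs)
    print(kwargs, theta, velocity)
```

With plain momentum the first step equals the plain SGD step, since the velocity starts at zero; the speed-up appears on later steps as the velocity builds up.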


#### Some notes on different optimizers:
1. https://www.geeksforgeeks.org/adam-optimizer/
2. https://medium.com/ai³-theory-practice-business/optimization-in-deep-learning-5a5d263172e

#### Which optimizer should you use?

- Convergence speed:

SGD < SGD(momentum) = SGD(momentum, nesterov=True) < Adagrad = RMSprop = Adam

- Convergence quality:

SGD = SGD(momentum) = SGD(momentum, nesterov=True) >= RMSprop = Adam > Adagrad


## Now it is your turn to try TensorFlow

Please run the following cell to generate synthetic data.

In [1]:

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import tensorflow as tf
    from tensorflow.keras import layers

    # generate data points
    datadim = 5              # intrinsic dimension of data
    dim = 500                # dimension of the ambient space
    N = 5000                 # data set size
    eps = 0.25               # noise level in the observed data

    # create low-dimensional data points
    Xreal = np.random.randn(N, datadim)
    y = np.tanh(Xreal[:, 0]) + np.cos(Xreal[:, 1]) - np.exp(-Xreal[:, 4])

    # a matrix to embed the given data into a high-dimensional ambient space
    transform = np.random.randn(datadim, dim)

    # the high-dimensional data is obtained by matrix multiplication
    X = Xreal @ transform

    # make the observations noisy
    X += np.random.normal(0, eps, size=X.shape)
    y += np.random.normal(0, eps, size=y.shape)

    # test data points (generated without noise)
    Ntest = 500
    Xrealtest = np.random.randn(Ntest, datadim)
    ytest = np.tanh(Xrealtest[:, 0]) + np.cos(Xrealtest[:, 1]) - np.exp(-Xrealtest[:, 4])
    Xtest = Xrealtest @ transform



- Neural network structure: a shallow neural network with 200 neurons in the hidden layer and ReLU activation.

- Loss: this is a regression problem, so use MSE as both the loss and the metric.

- Optimizers: try the Adam, SGD, and Adagrad optimizers. You will probably need to change the default parameters to make the optimizers work properly, especially for SGD.

- Number of epochs = 10, batch_size = 16

Report the test MSE for each model.
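
As a reminder of what the reported number means, here is a minimal sketch of the mean squared error in plain Python. The function name `mean_squared_error` and the toy values are illustrative only; with NumPy arrays the same quantity is `np.mean((y_true - y_pred) ** 2)`.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy example: only the last prediction is off, by 2.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 5.0]
print(mean_squared_error(y_true, y_pred))
```

A useful sanity check for any model is to compare its test MSE with the MSE of always predicting the mean of the training targets.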

In [2]:

    # Model 1

    model = tf.keras.models.Sequential([
        # hidden layer: 200 neurons, 500 features
        layers.Dense(200, input_shape=(500,), activation="relu", kernel_initializer="he_normal"),
        # output: predict 1 number (regression)
        layers.Dense(1),
    ])

    loss_fn = tf.keras.losses.MeanSquaredError()

    optim_fn = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False, name='Adam')

    model.compile(optimizer=optim_fn, loss=loss_fn)

    history = model.fit(X, y, epochs=10, batch_size=16)

    print('\n')

    print(f"The test mse is {model.evaluate(Xtest, ytest)}")

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


The test mse is 1.6245051622390747


In [3]:

    # Model 2

    model = tf.keras.models.Sequential([
        # hidden layer: 200 neurons, 500 features
        layers.Dense(200, input_shape=(500,), activation="relu", kernel_initializer="he_normal"),
        # output: predict 1 number (regression)
        layers.Dense(1),
    ])

    loss_fn = tf.keras.losses.MeanSquaredError()

    optim_fn = tf.keras.optimizers.Adagrad(learning_rate=0.001, initial_accumulator_value=0.1, epsilon=1e-07, name="Adagrad")

    model.compile(optimizer=optim_fn, loss=loss_fn)

    history = model.fit(X, y, epochs=10, batch_size=16)

    print('\n')

    print(f"The test mse is {model.evaluate(Xtest, ytest)}")

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


The test mse is 1.4805461168289185


In [4]:

    # Model 3

    model = tf.keras.models.Sequential([
        # hidden layer: 200 neurons, 500 features
        layers.Dense(200, input_shape=(500,), activation="relu", kernel_initializer="he_normal"),
        # output: predict 1 number (regression)
        layers.Dense(1),
    ])

    loss_fn = tf.keras.losses.MeanSquaredError()

    optim_fn = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.0, nesterov=False, name='SGD')

    model.compile(optimizer=optim_fn, loss=loss_fn)

    history = model.fit(X, y, epochs=10, batch_size=16)

    print('\n')

    print(f"The test mse is {model.evaluate(Xtest, ytest)}")

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


The test mse is 1.548887014389038
