<a href="https://colab.research.google.com/github/Uzmamushtaque/CSCI4962-Projects-ML-AI/blob/main/Lecture_6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lecture 6

# Topics for Today

1. Issues to address when designing a ML/AI project
2. Overall Issues/Strategy for a Deep Learning Project

# Machine Learning Strategy

After a first iteration of training a model and looking at the results, you can decide on a number of possible updates to the model. Some of these are:

1. Collect more data
2. Train the algorithm longer
3. Try a different optimization algorithm
4. Use regularization
5. Tune hyperparameters


There are efficient strategies that can help you move in the right direction. We will discuss a few of those in today's lecture.


# Tuning ML Process (Orthogonalization)

A typical ML process is:

1. Fit the training set well on the cost function
2. Fit the dev set well on the cost function
3. Fit the test set well on the cost function

If the model performs well at all three steps, we are in an ideal world. Usually this is not the case, so we need to fine-tune the model at each step. Each step has its own set of parameters to tune.

For example, if step 1 does not give a good result, we may want to focus on training a larger model or choosing a better optimization algorithm.

If step 2 does not give good results, we might need to use regularization or collect more training data.

If step 3 does not give good results, the remedy could be to get a bigger dev set or to choose a different model.

Organizing the remedies along these three separate directions, so that each adjustment targets one step without disturbing the others, is sometimes referred to as orthogonalization. In short, the issue you are facing determines the remedy.
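The idea can be sketched as a small diagnostic helper. This is only an illustration: the function name `suggest_remedy`, the error values, and the tolerance `tol` are hypothetical and not part of the lecture. It checks the three steps in order and reports the first one that looks off.

```python
def suggest_remedy(train_err, dev_err, test_err, target_err, tol=0.02):
    """Return the first step that looks off, with a remedy named in the text.

    All thresholds are illustrative assumptions, not recommended values.
    """
    # Step 1: is the training error close to the error we are aiming for?
    if train_err - target_err > tol:
        return "step 1: training fit is poor; try a better optimization algorithm"
    # Step 2: does the dev error stay close to the training error?
    if dev_err - train_err > tol:
        return "step 2: dev fit is poor; try regularization"
    # Step 3: does the test error stay close to the dev error?
    if test_err - dev_err > tol:
        return "step 3: test fit is poor; try a bigger dev set"
    return "all three steps look fine"


print(suggest_remedy(0.15, 0.16, 0.17, target_err=0.05))
print(suggest_remedy(0.05, 0.14, 0.15, target_err=0.05))
```

Because each check looks at only one gap, a fix for one step can be evaluated without re-tuning the others.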

# Evaluation Metric

When selecting your evaluation metrics there are many aspects that must be taken into consideration.

Tracking two evaluation metrics at once can be confusing for certain problems. For example, Precision and Recall usually trade off against each other: one model may score well on Precision but poorly on Recall, and another model the reverse. One way to deal with this situation is to come up with a single evaluation metric that encompasses both (or all) of your metrics. In this case it could be the F1 score, which is the harmonic mean of Precision and Recall: $F_1 = 2PR/(P+R)$.
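As a small worked example, the sketch below computes Precision, Recall and F1 from confusion counts. The counts for the two models are made up for illustration; the helper name `precision_recall_f1` is also an assumption.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and their harmonic mean (F1) from counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical confusion counts for two classifiers on the same dev set.
model_a = precision_recall_f1(tp=90, fp=60, fn=10)  # higher recall
model_b = precision_recall_f1(tp=70, fp=10, fn=30)  # higher precision

for name, (p, r, f1) in (("A", model_a), ("B", model_b)):
    print(f"model {name}: precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Neither model wins on both Precision and Recall, but the single F1 number gives an unambiguous ranking.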

Another way of combining metrics is to formulate an optimization problem. If both accuracy and runtime are important, you can maximize accuracy subject to a maximum runtime constraint. This is useful for large neural networks that achieve high accuracy at the expense of very high runtime.
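A minimal sketch of this constrained selection follows. The candidate models, their accuracies and runtimes are invented for illustration, and `select_model` is a hypothetical helper.

```python
# Hypothetical candidates: accuracy on the dev set and inference time per example.
candidates = [
    {"name": "small", "accuracy": 0.90, "runtime_ms": 20},
    {"name": "medium", "accuracy": 0.94, "runtime_ms": 80},
    {"name": "large", "accuracy": 0.96, "runtime_ms": 400},
]


def select_model(candidates, max_runtime_ms):
    """Maximize accuracy subject to runtime <= max_runtime_ms."""
    feasible = [c for c in candidates if c["runtime_ms"] <= max_runtime_ms]
    if not feasible:
        return None  # no model satisfies the constraint
    return max(feasible, key=lambda c: c["accuracy"])


best = select_model(candidates, max_runtime_ms=100)
print(best["name"])
```

Here accuracy is the metric being optimized and runtime only has to be good enough, so the most accurate model is rejected for being too slow.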

# Hyperparameter Tuning

The possible approaches for finding good hyperparameter values are:

1. Hand tuning (Trial and Error) - Parameters are chosen through trial-and-error experiments and the experience of the user.

2. Grid Search - A grid is created from candidate values for each parameter. All possible parameter combinations are then tried and the best one is selected.

3. Random Search - Instead of trying all possible combinations as in Grid Search, only a randomly selected subset of the combinations is tried and the best one is chosen.

4. Bayesian Optimization (Gaussian Process) - A Gaussian Process uses the previously evaluated parameter sets and their resulting accuracies to model the accuracy of parameter sets that have not been observed yet. An acquisition function uses this information to suggest the next set of parameters. [link](https://ekamperi.github.io/machine%20learning/2021/05/08/bayesian-optimization.html)

5. Tree-structured Parzen Estimators (TPE) - In each iteration, TPE collects a new observation, and at the end of the iteration the algorithm decides which set of parameters it should try next. [link](https://medium.com/optuna/multivariate-tpe-makes-optuna-even-more-powerful-63c4bfbaebe2)
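Grid search and random search, as described in the list above, can be sketched in a few lines. The function `validation_score` is a hypothetical stand-in for training a model and scoring it on the dev set, and the candidate values are arbitrary.

```python
import itertools
import random


def validation_score(lr, dropout):
    """Hypothetical stand-in for 'train a model, return its dev-set score'."""
    return -((lr - 0.01) ** 2) * 1e4 - (dropout - 0.3) ** 2


lr_values = [0.001, 0.01, 0.1]
dropout_values = [0.1, 0.3, 0.5]

# Grid search: evaluate every combination on the grid.
grid = list(itertools.product(lr_values, dropout_values))
best_grid = max(grid, key=lambda params: validation_score(*params))

# Random search: evaluate only a randomly chosen subset of the combinations.
rng = random.Random(0)
subset = rng.sample(grid, k=4)
best_random = max(subset, key=lambda params: validation_score(*params))

print("grid search:  ", best_grid, "after", len(grid), "evaluations")
print("random search:", best_random, "after", len(subset), "evaluations")
```

Random search spends fewer evaluations but may miss the best combination on the grid. In practice random search often samples each hyperparameter from a distribution instead of from a fixed grid.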

One important aspect of hyperparameter tuning is that most search-based methods sample the logarithm of a hyperparameter rather than its actual value. For example, when searching for $\eta$ between 0.001 and 0.1, we first sample $\log_{10}\eta$ uniformly between -3 and -1 and then raise 10 to that power. However, some hyperparameters are searched on a uniform scale.
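The log-scale sampling of $\eta$ can be sketched as follows. The helper name `sample_log_uniform` and the sample size are assumptions made for illustration.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Sample so that log10 of the result is uniform between log10(low) and log10(high)."""
    exponent = rng.uniform(math.log10(low), math.log10(high))
    return 10.0 ** exponent


rng = random.Random(0)
etas = [sample_log_uniform(0.001, 0.1, rng) for _ in range(10_000)]

# With log-scale sampling, roughly half of the samples fall in each decade.
frac_below = sum(eta < 0.01 for eta in etas) / len(etas)
print(f"fraction of samples below 0.01: {frac_below:.2f}")
```

Sampling $\eta$ uniformly between 0.001 and 0.1 instead would put only about 9% of the samples below 0.01, so the smaller learning rates would barely be explored.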

# Feature Preprocessing

Feature pre-processing in neural networks is not very different from other ML models.

1. *Additive Preprocessing and mean centering:* It is useful to mean-center the data to remove any kind of bias from the model. Many algorithms, like PCA, work under the assumption of mean-centered data. In practice, a vector of column-wise means is subtracted from each data point.

A similar type of preprocessing is used to get rid of negative values when that is desired. One way is to add the absolute value of the most negative entry of a feature to every value of that feature.

2. *Feature Normalization:* A common practice is to divide each feature value by its standard deviation. When this scaling is combined with mean-centering, the data is said to be standardized. The basic idea is that each feature is presumed to be drawn from a standard normal distribution with zero mean and unit variance.

Another type of feature normalization is min-max normalization: compute the minimum and maximum value of each attribute, then subtract the minimum from each value and divide by the difference between the maximum and the minimum. The scaled values lie in $[0, 1]$.

Feature normalization usually improves performance because it is common for features to differ in scale by more than an order of magnitude. These techniques keep the learning algorithm from being more sensitive to some features than to others merely because of their scale.
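Mean-centering, standardization and min-max normalization can be sketched with NumPy as below. The toy matrix `X` is made up, and the sketch assumes that no feature is constant (otherwise the divisions would be by zero).

```python
import numpy as np

# Toy data: 3 points, 2 features on very different scales (illustrative values).
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Mean-centering: subtract the vector of column-wise means from every point.
centered = X - X.mean(axis=0)

# Standardization: mean-centering followed by division by the standard deviation.
standardized = centered / X.std(axis=0)

# Min-max normalization: rescale every feature to the range [0, 1].
mins, maxs = X.min(axis=0), X.max(axis=0)
min_max = (X - mins) / (maxs - mins)

print(standardized)
print(min_max)
```

After either transformation the two columns are on the same scale, even though the raw values differ by two orders of magnitude.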

3. *Whitening:* Whitening is a technique of creating a new set of de-correlated features. PCA (Principal Component Analysis) is used to achieve this. 

PCA can be thought of as the application of Singular Value Decomposition (SVD) after mean-centering a data matrix. Let $D$ be an $n \times d$ mean-centered data matrix and let $C$ be the $d \times d$ covariance matrix, whose entries give the covariance between pairs of dimensions. Then $C=(D^T D)/n$.

The eigenvectors of the covariance matrix provide de-correlated directions in the data. Eigenvalues provide variance along each of the directions. Therefore, if we use top-k eigenvectors (largest k eigenvalues) of the covariance matrix, most of the variance in the data will be retained and noise will be removed. One can choose some threshold eigenvalue when selecting these new dimensions.

Let the final matrix be $P$, which has dimensions $d \times k$ and whose columns are the top-$k$ eigenvectors. The data matrix $D$ can be transformed into the $k$-dimensional axis system by multiplying it with the matrix $P$. The resulting matrix $U = DP$ is $n \times k$, and its rows contain the $n$ transformed points. The variances of the columns of $U$ are the corresponding eigenvalues (this is a property of the de-correlating transformation of principal component analysis). In whitening, each column of $U$ is scaled to unit variance by dividing it by its standard deviation, that is, by the square root of the corresponding eigenvalue. The transformed features are then fed into the neural network. Because the number of inputs changes from $d$ to $k$, this may change the architecture of the network.

The basic idea behind whitening is that the data is assumed to be generated from an independent Gaussian distribution along each principal component. By whitening we assume that each such distribution is a standard normal distribution and each feature has equal importance.
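The whitening steps described above can be sketched with NumPy. The function name `pca_whiten`, the small constant `eps` (added to avoid division by zero) and the random toy data are assumptions for illustration.

```python
import numpy as np


def pca_whiten(X, k, eps=1e-8):
    """Project X onto its top-k principal components and scale each to unit variance."""
    D = X - X.mean(axis=0)                    # mean-centered n x d data matrix
    n = D.shape[0]
    C = D.T @ D / n                           # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)      # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]     # indices of the k largest eigenvalues
    P = eigvecs[:, order]                     # d x k matrix of top-k eigenvectors
    U = D @ P                                 # n x k de-correlated features
    return U / np.sqrt(eigvals[order] + eps)  # divide each column by its std


rng = np.random.default_rng(0)
# Toy correlated data: independent Gaussians mixed by a fixed matrix.
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
X = rng.normal(size=(500, 3)) @ mixing

W = pca_whiten(X, k=2)
cov_W = W.T @ W / W.shape[0]
print(np.round(cov_W, 3))
```

The covariance of the whitened features is approximately the identity matrix: the new features are de-correlated and each has unit variance.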

# Initialization

Initialization is particularly important in neural networks because of stability issues in the training process. The activations of each successive layer can become progressively weaker or stronger. This effect is exponentially related to the depth of the network and is therefore particularly severe in deep networks. One way to deal with this issue is to choose good initialization points such that the gradients are stable across different layers.

One possible way of generating initial weights is to draw random values from a Gaussian distribution with zero mean and a small standard deviation (usually chosen to be around $\sqrt{1/r}$). Here $r$ is the number of inputs to the neuron for which the weight is being picked. Bias neurons are always initialized to 0. Alternatively, weights can be initialized by sampling from the uniform distribution on $[-\sqrt{1/r},\sqrt{1/r}]$.
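Both schemes can be sketched with NumPy. The layer sizes and the helper names `init_gaussian` and `init_uniform` are arbitrary choices for illustration.

```python
import numpy as np


def init_gaussian(r, n_out, rng):
    """Zero-mean Gaussian weights with standard deviation sqrt(1/r)."""
    return rng.normal(loc=0.0, scale=np.sqrt(1.0 / r), size=(r, n_out))


def init_uniform(r, n_out, rng):
    """Weights drawn uniformly from [-sqrt(1/r), sqrt(1/r)]."""
    limit = np.sqrt(1.0 / r)
    return rng.uniform(-limit, limit, size=(r, n_out))


rng = np.random.default_rng(0)
r, n_out = 256, 128           # r inputs per neuron, n_out neurons (illustrative)

W_gauss = init_gaussian(r, n_out, rng)
W_unif = init_uniform(r, n_out, rng)
b = np.zeros(n_out)           # biases start at 0

print("target scale:", np.sqrt(1.0 / r))
print("Gaussian std:", W_gauss.std())
print("uniform max :", np.abs(W_unif).max())
```

Scaling by $\sqrt{1/r}$ means that neurons with many inputs get smaller weights, which keeps the weighted sum from growing with the number of inputs.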

More on initialization: [link](https://www.deeplearning.ai/ai-notes/initialization/)

# Vanishing and Exploding Gradient Problem

In the backpropagation algorithm, we have the following updates:

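$dZ^{[l]} = dA^{[l]} * g^{\prime[l]} (Z^{[l]})$

$dW^{[l]} = dZ^{[l]} \, A^{[l-1]T}$

$db^{[l]} = dZ^{[l]}$

$dA^{[l-1]} = W^{[l]T} \, dZ^{[l]}$

Here $*$ denotes the element-wise product and $A^{[l-1]}$ is the activation of the previous layer.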

Notice the recursive relationship between the gradients of successive layers. Let us assume we are using the sigmoid activation, whose output $f$ lies in $(0, 1)$. The derivative of the sigmoid is given by $f(1-f)$, which takes its maximum value at $f=0.5$. Therefore, the value of $g^{\prime[l]} (Z^{[l]})$ is no more than 0.25 even at its maximum. If the absolute value of each weight is assumed to be equal to one (we are considering a network with a single node per layer), each backward step makes $dZ^{[l]}$ at most 0.25 times $dZ^{[l+1]}$. So after moving back $r$ layers, the gradient has shrunk by a factor of the order of $0.25^r$. This example is a simplified version of what is known as the vanishing gradient problem.

As the backpropagation algorithm advances backward from the output layer towards the input layer, the gradients often get smaller and smaller and approach zero, which leaves the weights of the initial (lower) layers nearly unchanged. As a result, gradient descent may converge very slowly or fail to reach a good solution.

One way of dealing with this problem is to choose an activation with larger gradients or to initialize the weights to larger values. If we go too far in this direction, we may get the opposite issue of exploding gradients.
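The simplified one-node-per-layer argument can be checked numerically. The sketch below assumes the best case for the sigmoid (pre-activation $z = 0$ at every layer, so the derivative is exactly 0.25); the function name `gradient_scale` and the weight values are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    f = sigmoid(z)
    return f * (1.0 - f)


def gradient_scale(depth, w=1.0, z=0.0):
    """Factor by which a gradient is multiplied after `depth` backward steps
    through a chain of single sigmoid nodes that all have weight w."""
    scale = 1.0
    for _ in range(depth):
        scale *= w * sigmoid_grad(z)
    return scale


for depth in (1, 5, 10):
    print(depth, gradient_scale(depth, w=1.0), gradient_scale(depth, w=8.0))
```

With $w = 1$ the factor is $0.25^r$ and quickly approaches zero. With a large weight such as $w = 8$ the per-layer factor becomes 2, and the same product grows exponentially instead, which is the exploding gradient case.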



# Solution to the gradient problem

1. Initialization: In addition to the techniques listed above, here is a [link](http://proceedings.mlr.press/v15/glorot11a.html) to a paper that proposes a way to deal with this issue. For the signal to flow properly, the authors argue that:

  a. The variance of the outputs of each layer should be equal to the variance of its inputs.

  b. The gradients should have equal variance before and after flowing through a layer in the reverse direction.

  Note: Keras has a variance scaling initializer that is worth exploring.
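To illustrate condition (a), the sketch below uses the widely cited Glorot (Xavier) uniform rule, which draws weights from $[-\sqrt{6/(n_{in}+n_{out})}, \sqrt{6/(n_{in}+n_{out})}]$. The formula is stated here as an assumption since the lecture text does not spell it out, and the sketch uses linear layers only (no activation) to keep the variance argument simple.

```python
import numpy as np


def glorot_uniform(fan_in, fan_out, rng):
    """Uniform weights with variance 2 / (fan_in + fan_out)."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))


rng = np.random.default_rng(0)
n = 512                                  # same width for every layer (illustrative)
x = rng.normal(size=(1000, n))           # unit-variance inputs

h = x
for _ in range(5):                       # five linear layers, no activation
    h = h @ glorot_uniform(n, n, rng)

ratio = h.var() / x.var()
print(f"output variance / input variance after 5 layers: {ratio:.2f}")
```

With equal fan-in and fan-out the weight variance is $1/n$, so each layer roughly preserves the variance of its input and the ratio stays close to 1 instead of shrinking or growing with depth.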


2. Choice of Activation Functions: The sigmoid saturates for large inputs (negative or positive). This turned out to be a major reason behind vanishing gradients, which makes the sigmoid a poor choice for the hidden layers of a network.

To tackle the saturation of activation functions like sigmoid and tanh, we can use non-saturating functions such as ReLU and its variants.

$\mathrm{ReLU}(z) = \max(0, z)$

Outputs 0 for any negative input.

Range: $[0, \infty)$

Unfortunately, the ReLU function is not a perfect pick for the intermediate layers of the network in some cases. It suffers from a problem known as dying ReLUs, wherein some neurons effectively die: as training progresses, they output 0 for every input.

[Leaky Relu](https://cdn-images-1.medium.com/max/800/1*W6BURQnUE62qyxJxMDpdnA.png)

$\mathrm{LeakyReLU}_\alpha(z) = \max(\alpha z, z)$

The amount of “leak” is controlled by the hyperparameter $\alpha$, which is the slope of the function for $z < 0$.

This small slope ensures that neurons using leaky ReLU never die; although they may go into a long coma during training, they always have a chance to eventually wake up.
$\alpha$ can also be trained, that is, the model learns the value of $\alpha$ during training. This variant, in which $\alpha$ is a parameter rather than a hyperparameter, is called parametric leaky ReLU (PReLU).
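A minimal NumPy sketch of ReLU and leaky ReLU, with their derivatives, is shown below. The default `alpha=0.01` and the input values are illustrative, and the `max` form of leaky ReLU assumes $0 \le \alpha \le 1$.

```python
import numpy as np


def relu(z):
    return np.maximum(0.0, z)


def relu_grad(z):
    # Gradient is 0 for negative inputs: a neuron stuck there stops learning.
    return np.where(z > 0, 1.0, 0.0)


def leaky_relu(z, alpha=0.01):
    # max(alpha * z, z) equals z for z >= 0 and alpha * z for z < 0
    # (assuming 0 <= alpha <= 1).
    return np.maximum(alpha * z, z)


def leaky_relu_grad(z, alpha=0.01):
    # Gradient is alpha (not 0) for negative inputs, so updates never vanish entirely.
    return np.where(z > 0, 1.0, alpha)


z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z), relu_grad(z))
print(leaky_relu(z, alpha=0.1), leaky_relu_grad(z, alpha=0.1))
```

For the negative inputs, ReLU passes back a gradient of 0 while leaky ReLU passes back $\alpha$, which is why a leaky unit can recover.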


3. Batch Normalization: [Link](https://arxiv.org/abs/1502.03167)

The following key points explain the intuition behind BN and how it works:

  a. It consists of adding an operation in the model just before or after the activation function of each hidden layer.

  b. This operation simply zero-centers and normalizes each input, then scales and shifts the result using two new parameter vectors per layer: one for scaling, the other for shifting.

  c. The operation lets the model learn the optimal scale and mean of each of the layer’s inputs.

  d. To zero-center and normalize the inputs, the algorithm needs to estimate each input’s mean and standard deviation.

  e. It does so by evaluating the mean and standard deviation of the input over the current mini-batch (hence the name “Batch Normalization”).


Standardizing the activations of the prior layer means that the assumptions the subsequent layer makes about the spread and distribution of its inputs will not change during the weight update, at least not dramatically. This has the effect of stabilizing and speeding up the training process of deep neural networks.
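The training-time forward pass described in points (b) to (e) can be sketched with NumPy. The names `gamma` (scale), `beta` (shift) and `eps` (a small constant for numerical stability) are conventional but are assumptions here. The sketch covers a single mini-batch only and leaves out the running statistics used at inference time.

```python
import numpy as np


def batch_norm_forward(X, gamma, beta, eps=1e-5):
    """Normalize each input over the mini-batch, then scale and shift."""
    mu = X.mean(axis=0)                      # per-input mean over the mini-batch
    var = X.var(axis=0)                      # per-input variance over the mini-batch
    X_hat = (X - mu) / np.sqrt(var + eps)    # zero-center and normalize
    return gamma * X_hat + beta              # learnable scale and shift


rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(64, 4))   # mini-batch of 64, 4 inputs

out = batch_norm_forward(X, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))

out2 = batch_norm_forward(X, gamma=np.full(4, 2.0), beta=np.full(4, 1.0))
print(out2.mean(axis=0).round(3), out2.std(axis=0).round(3))
```

With `gamma = 1` and `beta = 0` every input has approximately zero mean and unit standard deviation; other values of `gamma` and `beta` let the layer choose a different scale and mean.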

# Model Selection

[link](https://machinelearningmastery.com/a-gentle-introduction-to-model-selection-for-machine-learning/)

# Readings

[Paper 1](https://arxiv.org/pdf/1907.13359.pdf)