# Deep Learning and its usage in Categorizing Exoplanets

## Lecture 1: Introduction to Neural Networks

### Neural Networks
**Neural Networks** are computing systems used in deep learning. They follow an input-output process: the inputs are passed through weights and functions to determine *hidden units*, and the output is computed from those hidden units. The concept is inspired by the human brain, where neurons fire in response to their inputs.

**Hidden Units** are the units found in each layer of the Neural Network. All inputs of our Neural Network are connected to these hidden units, though with differing dependencies, which are denoted as *weights*.

**Weights** in a Neural Network help us to denote the dependency of our hidden values to our inputs. For example, say one of our Hidden Units is a host star's type. Inputs such as the star's temperature and radius will have higher weights in determining the host star's type. The orbital distance of an orbiting exoplanet, however, will likely have a much lower weight, as the orbital distance of this exoplanet will not explicitly help us to identify its host star's type.

![simplified graphic of three layers in a neural network](Resources/weights-image.webp)

**Layers** in a Neural Network each consist of a few parts. First, the layer receives a single input or a vector of inputs, which are either the initial inputs or the outputs of the previous layer. Next, the inputs are weighted, as discussed above. The weighted data is then transformed by functions which will be discussed further later on. Lastly, the layer produces an output, which either feeds the next layer or, at the end of the Neural Network, passes through an *activation function* to become the final output.

![simplified graphic of three layers in a neural network](Resources/neural_network_w_matrices.png)

An **Activation Function** is the final portion of the Neural Network which decides whether the final output of the "neuron" fires; for a binary classification, for example, this portion makes the decision of whether it is a 1 or a 0, depending on the criteria decided by the inputs and hidden units.

---
### Homework: Perceptron
In basic terms, a **Perceptron** is a single layer Neural Network, meaning it contains all parts of a singular layer listed above. 

![simplified graphic of a perceptron](Resources/perceptron-image.webp)

Perceptrons are usually used for Binary classification, or classifying the data into two parts.
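
As a rough illustration of the single layer described above, the sketch below weighs two inputs, adds a bias, and applies a step activation. The function name `perceptron_predict` and the weight and bias values are made up for this example (they happen to reproduce a logical AND); they are not taken from the lecture.

```python
def perceptron_predict(x, weights, bias):
    """Return 1 if the weighted sum of the inputs plus the bias is positive, else 0."""
    z = sum(w * x_j for w, x_j in zip(weights, x)) + bias  # weigh the inputs
    return 1 if z > 0 else 0  # step activation: the "neuron" fires or it does not


# Hypothetical weights and bias, chosen by hand so the perceptron acts like a logical AND
weights = [1.0, 1.0]
bias = -1.5

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
predictions = [perceptron_predict(x, weights, bias) for x in inputs]
print(predictions)
```

Only the input (1, 1) gives a positive weighted sum, so it is the only one classified as 1.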

---
### References
*Concepts*. ML Glossary. https://ml-cheatsheet.readthedocs.io/en/latest/nn_concepts.html

Sharma, Shagar. (2017, September 9). *What the Hell is Perceptron?: The Fundamentals of Neural Networks.* Towards Data Science. https://towardsdatascience.com/what-the-hell-is-perceptron-626217814f53

## Lecture 2: Logistic Regression

### Logistic Regression

**Logistic Regression** is used in binary classification: it lets us build a model that predicts a binary outcome from input variables, based on examples seen before. To be able to make accurate predictions, these models require *training sets*.

**Training Sets** are sets of data, stored as matrices, with known outputs. By fitting our models to them, we can make progressively more accurate predictions for new inputs that are similar to the ones already seen.

A **Linear Regression Model** is a regression model used to predict trends in data given some input, but which assumes the relationship between input and output is linear. In this relationship, the weighted value of the input is the slope and has some constant b as the y-intercept. However, given a model with binary classification, we will want our y-values to be between or equal to 0 and 1. To do this, we will require a *Sigmoid function*.

A **Sigmoid Function** turns the output into a probability between 0 and 1 instead of the literal target variable, which is what lets us make binary classifications of our data. For example, rather than outputting the literal size of an exoplanet, the model outputs the probability that the planet is big enough to be a Hot Jupiter; comparing that probability with a threshold (commonly 0.5) gives *Yes*, big enough to be a Hot Jupiter, or *No*, not big enough to be a Hot Jupiter. (Of course other factors go into making this decision, but let's keep things simple.)

To apply the Sigmoid Function, we simply make our linear function the input of the Sigmoid Function, and thus it becomes our *Activation function*.
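
A minimal sketch of this step, assuming a single input: the linear function $z = ωx + b$ is computed first and then passed through the sigmoid. The weight, bias, and planet radii below are invented for illustration, and the 0.5 threshold is a common convention rather than something fixed by the lecture.

```python
import math


def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1 / (1 + math.exp(-z))


def predict_probability(x, omega, b):
    """Linear function z = omega * x + b passed through the sigmoid."""
    z = omega * x + b
    return sigmoid(z)


# Hypothetical weight and bias, chosen only for illustration
omega, b = 4.0, -4.0

for radius in (0.2, 1.0, 1.8):  # made-up planet radii
    a = predict_probability(radius, omega, b)
    label = 1 if a >= 0.5 else 0  # threshold the probability to get a yes/no answer
    print(radius, round(a, 3), label)
```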

---
### Homework: 

Consider $n$ inputs and $m$ data in our training set. Find the dimension of $X$, $X^i$, $ω^i$, $b$, $z$, and $a$ where $z$ is the input of the Sigmoid function and $a$ is the Activation function.

$X = [(x^1_1, x^2_2,...,x^n_n)]$

$X^i = [X]$

$ω =$ weighted value

$b =$ y-intercept

$z = ω^iX^i + b$

$a = \frac{1}{1 + e^{-(ω^iX^i + b)}}$


### Lecture 2 Variable Cheatsheet
**Used in either the lecture or the notes**

| Variable | Meaning |
| -------- | ------- |
| $y$ | output $\in \{0, 1\}$ |
| $Y$ | output matrix of our training set |
| $y^i$ | known output to corresponding input $X^i$ |
| $x$ | input vector |
| $X^i$ | Each column in the input |
| $n$ | length of input vector |
| $m$ | number of inputs for which we have a known correct output |
| $z$ | the linear relationship between the output and input where $z = ω^iX^i+b$ |
| $ω^i$ | weights |
| $b$ | constant bias |
| $σ(z)$ | sigmoid function where $σ(z) = \frac{1}{1+e^{-z}}$ |
| $a$ | activation function where $a = σ(ωX + b)$ |


## Lecture 3: Loss and Cost Functions

### Loss and Cost Functions

A **Loss/Error Function** is used to measure how well a model can make predictions. We say we want to "minimize" our Loss Function, meaning we want to find the parameters necessary for our model to make the best possible predictions.

There are multiple common Loss Functions we can utilize, but for now we will focus on the **Log-Likelihood Loss Function** which states 

$L(a^i,y^i) = -y^i\log(a^i) - (1 - y^i)\log(1 - a^i)$

where $a^i$ is the predicted output and $y^i$ is the known output. Minimizing the loss function means making the overall difference between the actual and predicted outputs as small as possible.

A **Cost Function** gives an idea of the average loss from the Loss Function for the entire training set involved. We define the Cost Function as

$J(ω,b) = \frac{1}{m}Σ^m_{i=1}L(a^i,y^i)$

where $ω$ is our weights, $b$ is bias, and $m$ is the number of input-output sets we have in our training set.
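
The two definitions can be sketched in a few lines. The function names and the example predictions and labels are made up for illustration; the point is only that predictions closer to the known outputs give a lower cost.

```python
import math


def log_likelihood_loss(a, y):
    """Loss for one example: -y*log(a) - (1 - y)*log(1 - a)."""
    return -y * math.log(a) - (1 - y) * math.log(1 - a)


def cost(predictions, labels):
    """Average of the loss over all m examples in the training set."""
    m = len(labels)
    return sum(log_likelihood_loss(a, y) for a, y in zip(predictions, labels)) / m


labels = [1, 0, 1]  # made-up known outputs
good_predictions = [0.9, 0.1, 0.8]  # close to the labels
poor_predictions = [0.4, 0.6, 0.3]  # far from the labels

print(cost(good_predictions, labels))
print(cost(poor_predictions, labels))
```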

### Homework:

A necessary and sufficient condition for a function $f(x)$ to be convex on an interval is that the second derivative $\frac{d^2f}{dx^2} \geq 0$ for all $x$ in the interval. We can show a local minimum of a convex function is also a global minimum. In a neural network we can try to minimize the cost function. Does the cost function have to be convex?

The cost function does not have to be convex, but a convex cost function is much easier to minimize. Because every local minimum of a convex function is also a global minimum, gradient descent cannot get stuck at a point that is only locally best. A non-convex cost function can have several local minima, so gradient descent may settle in one that is not the global minimum, and we cannot be sure we have found the best possible parameters.

---
### References
Mack, Conor. (2017, November 27). *Machine Learning Fundamentals (I): Cost Functions and Gradient Descent*. Towards Data Science. https://towardsdatascience.com/machine-learning-fundamentals-via-linear-regression-41a5d11f5220#:~:text=Put%20simply%2C%20a%20cost%20function,to%20as%20loss%20or%20error.)

### Lecture 3 Variable Cheatsheet
**Used in either the lecture or the notes**

| Variable | Meaning |
| -------- | ------- |
| $L(a^i, y^i)$ | Log-Likelihood Function|
| $a^i$ | Predicted output |
| $y^i$ | Known output |
| $J(ω, b)$ | Cost Function |
| $ω$ | Weights |
| $b$ | Constant bias |
| $m$ | Number of elements in the training set |


## Lecture 4: Gradient Descent
### Gradient Descent
**Gradient Descent** is a minimization algorithm often utilized in deep learning. It is used for optimization and for training neural networks. Using Gradient Descent, we are able to train the model so that its weight and bias values minimize the cost function.

Gradient Descent requires a **Learning Scale** (more commonly called the *learning rate*), or step size, to function. The Learning Scale is the size of the steps taken to reach the minimum value. It should be reasonable, neither too large nor too small. If it is too large, we run the risk of overshooting the minimum. If it is too small, the steps may be more precise, but reaching the minimum will take many more iterations. A small (but not excessively small) value is best for the Learning Scale.

The steps taken to determine the weight and bias using gradient descent are as follows:
1. The Weight and Bias are initially defined, randomly or by some guess.
2. We find the slope of the cost function at the current Weight and Bias values.
3. We adjust the Weight and Bias values by taking a step in the direction of steepest descent, with the size of the step set by the slope and the Learning Scale.
4. We repeat until the Weight and Bias values no longer change significantly, which indicates we have reached a minimum of the cost function.

This process can be mathematically defined as:

$ω := ω - α(\frac{dJ(ω, b)}{dω})$ and $b := b - α(\frac{dJ(ω, b)}{db})$ 

where $α$ is the learning scale.
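
The update rule can be tried out on a one-parameter stand-in cost. The cost $J(ω) = (ω - 3)^2$, the starting point, and the two step sizes below are assumptions made for illustration, not part of the lecture; the second call shows what overshooting looks like when the step size is too large.

```python
def gradient_descent(d_cost, omega_start, alpha, n_iteration):
    """Repeatedly step against the slope: omega := omega - alpha * dJ/domega."""
    omega = omega_start
    for _ in range(n_iteration):
        omega = omega - alpha * d_cost(omega)
    return omega


def d_cost(omega):
    """Slope of the stand-in cost J(omega) = (omega - 3)^2, whose minimum is at omega = 3."""
    return 2 * (omega - 3)


converged = gradient_descent(d_cost, omega_start=0.0, alpha=0.1, n_iteration=100)
diverged = gradient_descent(d_cost, omega_start=0.0, alpha=1.1, n_iteration=100)

print(converged)  # very close to 3
print(diverged)   # each step overshoots the minimum by more than the last
```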

The **Backpropagation Method** is the way by which we find the derivative (slope of the tangent line) of the loss function with respect to the weights of our given inputs and the bias. The method applies the chain rule step by step: first we take the derivative with respect to the predicted output $a$, then with respect to $z$, where $z = ω^iX^i + b$, and finally with respect to $ω_1$, $ω_2$, and $b$. After deriving these final values, we are able to update the parameters as mathematically defined above.

After modifying our values, we repeat this process as described in step 4 above.

### Homework:
In programming, the term $\frac{dL}{dq}$ is also denoted by $dq$ where $q$ is a parameter, such as $z$ or $ω_i$. Calculate $da$, $dz$, $dω_1$, $dω_2$, and $db$ using the mathematical definition of a loss function and considering sigmoid function.

$L(a^i, y^i) = -y^i\log(a^i) - (1 - y^i)\log(1 - a^i)$

$da = \frac{dL}{da} = -\frac{y^i}{a^i} + \frac{1 - y^i}{1 - a^i}$

Since $a = \frac{1}{1 + e^{-z}}$, we have $e^{-z} = \frac{1}{a} - 1$ and

$\frac{da}{dz} = \frac{e^{-z}}{(1 + e^{-z})^2} = e^{-z} * a^2 = (\frac{1}{a} - 1) * a^2 = a(1 - a)$

$dz = \frac{dL}{dz} = \frac{dL}{da} * \frac{da}{dz} = (-\frac{y^i}{a^i} + \frac{1 - y^i}{1 - a^i}) * a^i(1 - a^i) = -y^i(1 - a^i) + a^i(1 - y^i) = -y^i + y^ia^i + a^i - y^ia^i = a^i - y^i$

$dω_1 = \frac{dL}{dω_1} = \frac{dL}{da} * \frac{da}{dz} * \frac{dz}{dω_1} = (a^i - y^i)x^i_1$

$dω_2 = \frac{dL}{dω_2} = \frac{dL}{da} * \frac{da}{dz} * \frac{dz}{dω_2} = (a^i - y^i)x^i_2$

$db = \frac{dL}{db} = \frac{dL}{da} * \frac{da}{dz} * \frac{dz}{db} = (a^i - y^i)$
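
One way to sanity-check results like these is to compare them with a numerical estimate of the slope. The sketch below does that for $dω_1$ and $db$ with central finite differences; the example values of $ω$, $b$, $x$, and $y$ and the step `h` are arbitrary choices for illustration.

```python
import math


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


def loss(omega, b, x, y):
    """Log-likelihood loss of one example with two inputs."""
    z = omega[0] * x[0] + omega[1] * x[1] + b
    a = sigmoid(z)
    return -y * math.log(a) - (1 - y) * math.log(1 - a)


# Made-up example values
omega, b, x, y = [0.3, -0.2], 0.1, [1.5, 2.0], 1

# Analytic results from the homework
a = sigmoid(omega[0] * x[0] + omega[1] * x[1] + b)
d_omega_1 = (a - y) * x[0]
db = a - y

# Central finite differences as an independent numerical estimate
h = 1e-6
numeric_d_omega_1 = (loss([omega[0] + h, omega[1]], b, x, y)
                     - loss([omega[0] - h, omega[1]], b, x, y)) / (2 * h)
numeric_db = (loss(omega, b + h, x, y) - loss(omega, b - h, x, y)) / (2 * h)

print(d_omega_1, numeric_d_omega_1)
print(db, numeric_db)
```

The two numbers on each printed line should agree to several decimal places.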

### References

Gosavi, Bhushan. (2019 November 4). *Mathematics Behind the Artificial Neural Networks: Part 1*. Medium. https://medium.com/analytics-vidhya/mathematics-behind-artificial-neural-networks-part-1-2214dab225c2

*Weights and Biases*. AI Wiki. https://machine-learning.paperspace.com/wiki/weights-and-biases

*What is Gradient Descent?*. IBM. https://www.ibm.com/topics/gradient-descent

### Lecture 4 Variable Cheatsheet
**Used in either the lecture or the notes**

| Variable | Meaning |
| -------- | ------- |
| $ω$ | Weight parameter |
| $b$ | Constant bias parameter |
| $α$ | Learning scale | 
| $x_n$ | some input number n |
| $z$ | the linear relationship between the output and input where $z = ω^iX^i+b$ |
| $a^i$ | predicted output |
| $σ(z)$ | sigmoid function where $σ(z) = \frac{1}{1 + e^{-z}}$ |
| $y^i$ | known output to corresponding input $X^i$ |
| $X^i$ | Each column in the input |
| $d$ | derivative |

## Lecture 5: Gradient Descent over All Elements in Training Set
### Gradient Descent Notes for Programming
As discussed in Lecture 4, Gradient Descent modifies the weight using the derivative of the cost function, which is the average of the derivatives of the loss over the training set:

$\frac{δ}{δω_1}J(ω,b) = \frac{1}{m}Σ^m_{i=1}\frac{δ}{δω_1}L(a^i,y^i)$

where $J(ω,b) = \frac{1}{m}Σ^m_{i=1}L(a^i,y^i)$ and $L(a^i,y^i) = -y^i\log(a^i) - (1 - y^i)\log(1 - a^i)$

and where $ω := ω - α(\frac{dJ(ω, b)}{dω})$ and $b := b - α(\frac{dJ(ω, b)}{db})$

where $α$ is the learning scale.

*Remember!* In Neural Networks we need to avoid **for-loops** wherever possible. Programs that replace explicit loops with operations on whole arrays are called **vectorized code**. This matters for the speed of our program: when we train our neural network using Gradient Descent, nested for-loops only slow down an algorithm that already takes a good deal of time.
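
To make the idea concrete, the sketch below computes $z = ωX + b$ for every example twice, once with nested loops and once with a single NumPy call, and checks that the results match. The array sizes and random values are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the example is repeatable
X = rng.random((3, 5))          # made-up data: 3 inputs (rows) for 5 examples (columns)
omega = rng.random(3)           # one weight per input
b = 0.5

# Loop version: one example at a time, one input at a time
z_loop = np.zeros(5)
for i in range(5):
    for j in range(3):
        z_loop[i] += omega[j] * X[j, i]
    z_loop[i] += b

# Vectorized version: the same calculation as a single matrix product
z_vectorized = np.dot(omega, X) + b

print(np.allclose(z_loop, z_vectorized))
```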

### Practice

```python
# Loop-based version, including edits for functionality
import numpy as np


# returns our sigmoid function σ(x)
# np.exp(): returns e to the power of x, for a single number or for all elements in array x
def sigmoid(x):
    return 1 / (1 + np.exp(-x))


# returns the derivative of the sigmoid function, σ'(x)
# ** operator: exponent
def sigmoid_prim(x):
    return np.exp(-x) / (1 + np.exp(-x)) ** 2


# gradient descent function given one node j and one layer l
def gradient_descent_one_node_one_layer(X_t, Y, n_iteration, learning_rate):
    alpha = learning_rate
    n = X_t.shape[0]  # .shape: given dimensions [n, m], .shape[0] returns n, the number of predictors
    m = len(Y)  # size of output array Y: number of exoplanets

    # the parameters are initialized once, before the iterations start
    omega = np.random.rand(n) * np.sqrt(1 / n)  # initial values for the weight parameters
    b = np.random.rand()  # initial value for bias parameter

    # each iteration moves the weight and bias one step closer to the minimum of the cost
    for k in range(n_iteration):
        J = 0  # cost for this iteration
        db = 0  # derivative of the cost with respect to the bias
        d_omega = np.zeros(n)  # np.zeros: new array of the given shape filled with zeros

        # looping over all planets (the columns of X_t)
        for i in range(m):
            z = np.dot(omega, X_t[:, i]) + b  # np.dot: dot product of the weights and the inputs of planet i
            a = sigmoid(z)  # activating z with sigmoid

            J += -Y[i] * np.log(a) - (1 - Y[i]) * np.log(1 - a)  # adding the loss of planet i to the cost
            dz = a - Y[i]
            db += dz

            # looping over all predictors for each planet
            for j in range(n):
                d_omega[j] += dz * X_t[j, i]

        # averaging over the training set
        J = J / m
        d_omega = d_omega / m
        db = db / m

        # the next parameters: subtract the derivative times the learning rate
        omega = omega - alpha * d_omega
        b = b - alpha * db

    return omega, b


def main():
    # each column of X_t is one planet and each row one predictor (placeholder values)
    X_t = np.array([[5, 15, 25], [5, 15, 25]]).T
    print("X_t = ", X_t)
    Y = np.array([0, 1])  # known outputs, one per planet; they must be 0 or 1 (placeholder values)
    n_iteration = 100_000
    learning_rate = 0.0008

    print(gradient_descent_one_node_one_layer(X_t, Y, n_iteration, learning_rate))


if __name__ == "__main__":
    main()
```

```
X_t =  [[ 5  5]
 [15 15]
 [25 25]]
```


```python
# Vectorized version
import numpy as np


# np.exp() works on whole arrays, so sigmoid can be applied to every planet at once
def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def gradient_descent_one_node_one_layer(X_t, Y, n_iteration, learning_rate):
    # init variables
    n = X_t.shape[0]  # number of inputs (predictors)
    m = len(Y)  # number of planets
    alpha = learning_rate

    omega = np.random.rand(1, n) * np.sqrt(1 / n)  # array [1, n] of random numbers
    b = np.random.rand()  # random value

    for k in range(n_iteration):
        z = np.dot(omega, X_t) + b  # omega*X + b for all planets, array [1, m]
        a = sigmoid(z)

        dz = a - Y  # a - y for all planets, array [1, m]

        d_omega = (1 / m) * np.dot(dz, X_t.T)  # 1/m * sum of X^i(a^i - y^i), array [1, n]
        db = (1 / m) * np.sum(dz)  # 1/m * sum of (a^i - y^i)

        omega -= alpha * d_omega
        b -= alpha * db

    return omega, b


def main():
    # X_t is a 2D array where each row holds one predictor (recessional velocity, mass, and the like)
    # and each column is one exoplanet (placeholder values)
    X_t = np.array([[5, 15, 25], [5, 15, 25]]).T
    print("X_t = ", X_t)

    # Y is our known outputs, one value (0 or 1) per planet (placeholder values)
    Y = np.array([0, 1])
    n_iteration = 100_000
    learning_rate = 0.0008

    omega, b = gradient_descent_one_node_one_layer(X_t, Y, n_iteration, learning_rate)
    print("Omega values: ", omega, "B value: ", b)


if __name__ == "__main__":
    main()
```

```
X_t =  [[ 5  5]
 [15 15]
 [25 25]]
```


### Lecture 5 Variable Cheatsheet
**Used in either the lecture or the notes**

| Variable | Meaning |
| -------- | ------- |
| $δ$ or $d$ | derivative |
| $a$ | predicted output |
| $σ(x)$ | Sigmoid function |
| $α$ or alpha | learning rate |
| $k$ | iterable |
| $X_t$ | input array $X^i$ where $t$ is the transpose of the input matrix |
| $Y$ | output array |
| n_iteration | number of gradient descent iterations to run |
| $z$ | the linear relationship between the input and output in the training set |
| $db$ | the derivative of the cost function with respect to the bias parameter |
| d_omega | the derivative of the cost function with respect to the weight parameter |
| $n$ | [0] dimension of the input array |
| omega | initial values for [omega_1, omega_2], our weight parameters |
| $b$ | initial value for bias parameter |
| $m$ | size of output array $Y$/number of planets |
| $i$ | iterable |
| $J$ | Cost function |
| $dz$ | Difference between predicted output and known output |
| $j$ | iterable |



### References
Bakhvalov, Denis. (10 November 2017). *Vectorization Part 7. Tips for Writing Vectorizable Code*. Easyperf. https://easyperf.net/blog/2017/11/10/Tips_for_writing_vectorizable_code

*Built-In Functions*. Python Documentation. https://docs.python.org/3/library/functions.html#map

Hsu, Jonathan. (2019 December 14). *How to Replace your Python For Loops with Map, Filter, and Reduce*. Better Programming. https://betterprogramming.pub/how-to-replace-your-python-for-loops-with-map-filter-and-reduce-c1b5fa96f43a

Malli. *Python NumPy zeros() Function*. SparkBy{Examples}. https://sparkbyexamples.com/numpy/numpy-zeros-function/#:~:text=NumPy%20zeros()%20function%20is,of%20floating%20values%20by%20default. 

*numpy.sum*. NumPy. https://numpy.org/doc/stable/reference/generated/numpy.sum.html

Pozo Ramos, Leodanis. *Python's map(): Processing Iterables Without a Loop*. Real Python. https://realpython.com/python-map-function/

*Python: What is the difference between math.exp and numpy.exp and why do numpy creators choose to introduce exp again*. Stack Overflow. https://stackoverflow.com/questions/30712402/python-what-is-the-difference-between-math-exp-and-numpy-exp-and-why-do-numpy-c 

Shah, Ahmar. (2022 Jul 22). *Understand Vectorization for Deep Learning*. Medium. https://towardsdatascience.com/understand-vectorization-for-deep-learning-d712d260ab0f

Tarleton, Shane. (2022 February 23). *Use map() instead of for() loops*. Medium. https://blog.bitsrc.io/please-use-map-instead-of-for-loops-5a2f54f088c8

Waqar, Hassaan. (2023). *What is the exp function in NumPy?*. Educative. https://www.educative.io/answers/what-is-the-exp-function-in-numpy 

*What does .shape[] do in "for i in range(Y.shape[0])"?* Stack Overflow. https://stackoverflow.com/questions/10200268/what-does-shape-do-in-for-i-in-rangey-shape0 

## Lecture 6: Deep Neural Network
### Deep Neural Network
There are two parts involved in each hidden layer:
1. We calculate $z = ω^TX + b$, or the output depending on our input and its weights and bias.
2. We calculate our activation function $a = g(z)$

![simplified graphic of a neural network including equations](Resources/neural%20network%20diagram.jpeg)

We know we can have multiple layers within our neural network. These layers can differ in multiple ways:
- They can each have differing numbers of inputs
- The number of **nodes** in each layer can vary: **Nodes** are the units that have one or more inputs, an activation function, and an output.

A **Deep Neural Network** is made by incorporating more nodes and layers. Oftentimes, Sigmoid functions are not used in the hidden layers of deep neural networks, as training with many such layers is slower. So, we will use the ReLU (Rectified Linear Unit) function instead. The two parts of computing a node in each layer have the general form:

1. $z^{[l]}_j = (ω^T)^{[l]}_ja^{[l - 1]} + b^{[l]}_j$ 

2. $a^{[l]}_j = g^{[l]}(z^{[l]}_j)$

where $a^{[0]}$ represents the input layer, $l$ is the index of the layer, $j$ is the node's index in the given layer, and $a^{[l]}$ is the output of layer $l$.

We can define the outputs of layer $l$ as:

$Z^{[l]} = [z^{[l]}_1, z^{[l]}_2,..., z^{[l]}_{n_l}]$

$A^{[l]} = [a^{[l]}_1, a^{[l]}_2,...,a^{[l]}_{n_l}]$

where $n_l$ is the number of nodes in the layer $l$.
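
The two-step form above can be sketched as a loop over layers, with each layer's weights stacked into one matrix so that all of its nodes are computed at once. The layer sizes, the random weights, and the use of ReLU in every layer are assumptions made for this illustration.

```python
import numpy as np


def relu(z):
    return np.maximum(0, z)


def forward(x, weights, biases):
    """Apply z = omega a_prev + b and a = g(z) layer by layer, starting from a[0] = x."""
    a = x
    for omega, b in zip(weights, biases):
        z = np.dot(omega, a) + b  # Z for the whole layer, one row per node
        a = relu(z)               # A for the whole layer, one row per node
    return a


# Hypothetical network: 3 inputs, a layer of 4 nodes, then a layer of 2 nodes
layer_sizes = [3, 4, 2]
rng = np.random.default_rng(0)
weights = [rng.standard_normal((layer_sizes[l], layer_sizes[l - 1])) for l in (1, 2)]
biases = [np.zeros((layer_sizes[l], 1)) for l in (1, 2)]

x = np.array([[1.0], [2.0], [3.0]])  # a[0], the input layer
output = forward(x, weights, biases)
print(output.shape)
```

Row $j$ of each weight matrix plays the role of $(ω^T)^{[l]}_j$, so the matrix product evaluates every node of the layer in one step.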

### Homework
Consider a deep neural network with $L$ layers and $n$ inputs. Each layer $l$ has $n_l$ nodes. Find the rank (dimensions) of the matrices $ω^{[l]}_j$, $b^{[l]}_j$, $z^{[l]}_j$, $a^{[l]}_j$, $ω^{[l]}$, $b^{[l]}$, $Z^{[l]}$, $A^{[l]}$.

Rank of $ω^{[l]}_j = (1, n_{l - 1})$

Rank of $b^{[l]}_j = (1, 1)$

Rank of $z^{[l]}_j = (1, 1)$

Rank of $a^{[l]}_j = (1, 1)$

Rank of $ω^{[l]} = (n_l, n_{l - 1})$

Rank of $b^{[l]} = (n_l, 1)$

Rank of $Z^{[l]} = (n_l, 1)$

Rank of $A^{[l]} = (n_l, 1)$


### Lecture 6 Variable Cheatsheet
**Used in either the lecture or the notes**

| Variable | Meaning |
| -------- | ------- |
| $ω$ | Weight parameter |
| $T$ | transpose, as in $ω^T$ |
| $b$ | Constant bias parameter |
| $X$ | some input |
| $z$ | the linear relationship between the output and input where $z = ωX^i+b$ |
| $a$ | predicted output |
| $g(z)$ | activation function where $a = g(z)$ |
| $y^i$ | output to corresponding input $X^i$ |
| $X^i$ | Each column in the input |
| $p^l_j$ | Parameter where $l$ is the parameter's layer and $j$ is the parameter's index in the layer |
| $l$ | layer number |

## Lecture 7: Parameters vs. Hyperparameters

**Model Parameters** are the two parameters, weight (per input) and bias, which relate our input and output, and which allow us to make accurate predictions and have a refined model. They are estimated or learned from the data. Importantly, they are defined and *changed within the learning process*. They are learned, not explicitly defined.

An *activation function* (our sigmoid function, for example) or the learning rate in the gradient descent model is also a setting of the model, but it is not a Model Parameter. Rather, because it is explicitly defined by the user to control the learning process, it is called a **Hyperparameter**.

To reiterate, the key difference is that parameters are changed and refined through the learning process, while hyperparameters are explicitly defined beforehand to control the learning process.

### Homework

Find five hyperparameters in a neural network defined by the logistic regression model and gradient descent model.

1. Activation Function
2. Learning Rate
3. Training set
4. Size of training set
5. Number of iterations of gradient descent
6. Regularization parameter


## Lecture 8: Two Major Problems in Deep Learning - Regularization and Vanishing Gradient

**Problem 1. Regularization**

**Regularization** is the suppression of some or all parameters of a model; this means they can only take certain values or ranges of values.

Regularization is used to counteract **overfitting**. Overfitting happens when we devise a model which is too complex: it fits the training set closely but does not work for more general datasets. This can happen when, for example, we have a neural network that is too deep, meaning it has too many hidden layers.

To regularize the parameters of our model, we need to add an additional term to our cost function. Our original cost function is as follows: 

$J(ω,b) = \frac{1}{m}Σ^m_{i=1}L(a^i,y^i)$.

With the added term:

$J(ω,b) = \frac{1}{m}Σ^m_{i=1}L(a^i,y^i) + \frac{λ}{2m}|ω|^2$.

$\frac{λ}{2m}|ω|^2$ is called the **L<sub>2</sub>-Regularization**. $λ$ is called the **Regularization parameter**. This is another hyperparameter. 

$|ω|^2$ is the squared **Euclidean norm**, where $|ω|^2 = Σ^{n_z}_{j = 1}ω^2_j$ and $n_z$ is the number of nodes in the hidden layer. So for multiple layers:

$J(ω^{[1]},b^{[1]},...,ω^{[L]},b^{[L]}) = \frac{1}{m}Σ^m_{i=1}L(a^i,y^i) + \frac{λ}{2m}Σ^L_{l=1}|ω^{[l]}|^2$

where $L$ is the number of layers and $|ω^{[l]}|^2$ is the squared **Frobenius norm**, where

$|ω^{[l]}|^2 = Σ^{n^{[l - 1]}}_{i=1}Σ^{n^{[l]}}_{j=1}(ω^{[l]}_{ij})^2$.

A **norm** is a function that measures the size of a vector. The Euclidean norm, $|ω|$, is the square root of the sum of the squared values of all $ω$ elements, so the squared norm $|ω|^2$ used above is simply that sum. The Frobenius norm is the same idea applied to a matrix: it is the Euclidean norm of the matrix when its entries are treated as one long vector.

The way this works is that the added term penalizes large $ω$'s, so minimizing the cost reduces their values. Smaller weights make the model simpler, which reduces overfitting and in turn reduces the variance of the estimated regression parameters. How strongly the weights are reduced is controlled by $λ$: increasing $λ$ shrinks the $ω$'s further, while decreasing $λ$ allows larger $ω$'s and therefore a more non-linear model.
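
A small sketch of the regularized cost for a single weight vector, assuming the log-likelihood loss from Lecture 3. The function name and the example numbers are invented; the weights are chosen so that $|ω|^2 = 25$, which makes the size of the added term easy to check by hand.

```python
import numpy as np


def regularized_cost(a, y, omega, lam):
    """Average log-likelihood loss plus the L2 term lambda / (2m) * |omega|^2."""
    m = len(y)
    loss = -y * np.log(a) - (1 - y) * np.log(1 - a)
    l2_term = (lam / (2 * m)) * np.sum(omega ** 2)
    return np.mean(loss) + l2_term


# Made-up predictions, labels, and weights
a = np.array([0.9, 0.2, 0.7])
y = np.array([1, 0, 1])
omega = np.array([3.0, -4.0])  # |omega|^2 = 9 + 16 = 25

unregularized = regularized_cost(a, y, omega, lam=0.0)
regularized = regularized_cost(a, y, omega, lam=0.6)
print(unregularized, regularized)
```

With $λ = 0.6$ and $m = 3$ the added term is $\frac{0.6}{6} * 25 = 2.5$, so the second value is larger than the first by that amount.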

**Problem 2: Vanishing Gradient**

The **Vanishing Gradient Problem** happens when the gradient becomes very small. This may happen because backpropagation multiplies together the derivatives of the activation functions of every layer in a deep neural network, so the size of our gradient can decrease exponentially as it passes back through the network. If it is too small, the gradient will be unable to effectively update the weights and biases of the initial layers with each training iteration. This problem becomes more significant the more layers we have.

To avoid this, we can initialize our model parameters ($ω$ and $b$) by multiplying random values by the square root of the variance we want them to have, where the variance of $ω$ is defined as

$var(ω^{[l]}) = \frac{1}{n^{[l-1]}}$

and thus the initial value of $ω$ is defined as

$ω^{[l]} = random(ω^{[l]}) * \sqrt{\frac{1}{n^{[l-1]}}}$.

We will also use a ReLU function instead of the typical Sigmoid function in deep neural networks to further mitigate this issue. This changes the variance to:

$var(ω^{[l]}) = \frac{2}{n^{[l-1]}}$

and thus the initial value of $ω$ is defined as

$ω^{[l]} = random(ω^{[l]})*\sqrt{\frac{2}{n^{[l-1]}}}$
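
A possible sketch of this initialization is shown below. It assumes the random values are drawn from a standard normal distribution (variance 1), so that multiplying by the square root gives exactly the target variance; the notes do not say which distribution `random` stands for. The function name and layer sizes are made up.

```python
import numpy as np


def initialize_weights(n_prev, n_curr, activation="relu", rng=None):
    """Random weights of shape (n_curr, n_prev) scaled to the variance given above."""
    rng = np.random.default_rng() if rng is None else rng
    variance = 2 / n_prev if activation == "relu" else 1 / n_prev
    # Assumption: standard normal draws, so the scaled weights have the target variance
    return rng.standard_normal((n_curr, n_prev)) * np.sqrt(variance)


rng = np.random.default_rng(0)
omega = initialize_weights(n_prev=500, n_curr=400, activation="relu", rng=rng)
print(omega.shape, omega.var())  # the sample variance should be close to 2 / 500 = 0.004
```

Note that `np.random.rand`, as used in the practice code of Lecture 5, draws uniform values between 0 and 1 instead, which have a variance of 1/12 rather than 1.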

### Homework

Describe the other popular regularization terms commonly used in Neural Networks.

1. L1 - L1 Regularization adds the sum of the absolute values of the weights as a penalty term to the loss function. The difference between this and L2 is that the values are not squared before they are summed.
2. L2 - L2 Regularization adds the sum of the squared weights as the penalty term to the loss function. This technique works well to combat overfitting, given we choose an appropriate lambda value, and is less generalized than L1.

The way these two methods combat overfitting is by

- constraining our weights closer to zero, which results in a simpler network which is less prone to overfitting.
- constraining the input $z$ of the activation function, so that the activation functions behave more linearly.

3. Dropout - the Dropout method randomly removes some nodes from our network during training, forcibly simplifying the network and leading to a less complex model. This also means that each training sample may be trained on a different network. The resulting network will end up being much simpler than the L2 Regularization result.
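
As a hedged illustration of the Dropout idea, the sketch below applies "inverted dropout", a common variant in which the surviving nodes are rescaled so the expected output stays the same. Whether the lecture uses this exact variant is an assumption; the `dropout` function, the `keep_prob` value, and the layer size are made up.

```python
import numpy as np


def dropout(a, keep_prob, rng):
    """Randomly zero out nodes of a layer's output and rescale the survivors."""
    mask = rng.random(a.shape) < keep_prob  # True for the nodes that are kept
    return a * mask / keep_prob             # rescale so the expected output is unchanged


rng = np.random.default_rng(0)
a = np.ones((1000, 1))  # made-up layer output with 1000 nodes
a_dropped = dropout(a, keep_prob=0.8, rng=rng)

kept = np.count_nonzero(a_dropped)
print(kept)  # roughly 800 of the 1000 nodes survive
```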

### References

Andreoni, Riccardo. (2022 Aug 24). *Regularization Techniques for Neural Networks*. Medium. https://towardsdatascience.com/regularization-techniques-for-neural-networks-379f5b4c9ac3#:~:text=In%20addition%20to%20L1%2FL2,effects%20on%20reducing%20high%20variance.

Brownlee, Jason. (2019 Jan 9). *A Gentle Introduction to the Rectified Linear Unit (ReLU)*. Machine Learning Mastery. https://machinelearningmastery.com/rectified-linear-activation-function-for-deep-learning-neural-networks/

Coding Lane. *L2 Regurlarization neural network in Python from Scratch | Explanation with Implementation*. YouTube. https://www.youtube.com/watch?v=R0Dz8R0wgBs

Deep Shallownet. *Vector Norm, L1 Norm, Euclidean Norm, Max Norm, Euclidean Distance*. YouTube. https://www.youtube.com/watch?v=5RYAGXa8ywc

*How to calculate regularization parameter in linear regression*. Stack Overflow. https://stackoverflow.com/questions/12182063/how-to-calculate-the-regularization-parameter-in-linear-regression#:~:text=The%20regularization%20parameter%20reduces%20overfitting,overfitting%20but%20also%20greater%20bias.

Karim, Raimi. (2018 Dec 26). *Intuitions on L1 and L2 Regularisation*. Medium. https://towardsdatascience.com/intuitions-on-l1-and-l2-regularisation-235f2db4c261 

Nagpal, Anuja. (2017 Oct 13). *L1 and L2 Regularization Methods*. Medium. https://towardsdatascience.com/l1-and-l2-regularization-methods-ce25e7fc831c

Thakur, Ayush. (2022 May 11). *ReLU vs. Sigmoid Function in Deep Neural Networks*. Weights and Biases. https://wandb.ai/ayush-thakur/dl-question-bank/reports/ReLU-vs-Sigmoid-Function-in-Deep-Neural-Networks--VmlldzoyMDk0MzI 

Wang, Chi-Feng. (2019 Jan 8). *The Vanishing Gradient Problem*. Medium. https://towardsdatascience.com/the-vanishing-gradient-problem-69bf08b15484


### Lecture 8 Variable Cheatsheet
**Used in either the lecture or the notes**

| Variable | Meaning |
| -------- | ------- |
| $J(ω, b)$ | Cost function with parameters weight $ω$ and bias $b$ |
| $m$ | Number of elements in the training set |
| $i$ | iterable |
| $L(a^i, y^i)$ | Log-likelihood Function with predicted output $a^i$ and known output $y^i$ |
| $\frac{λ}{2m}ω^2$ | L<sub>2</sub>-Regularization with Regularization parameter $λ$ | 
| $ω^2$ | Euclidean Norm |
| $n_z$ | Number of nodes in the hidden layer |
| $L$ | Number of layers |
| $(ω^{l})^2$ | Frobenius norm |
| $l$ | layer |


## Lecture 9: Activation Function

An **Activation Function**, as noted earlier on, decides whether the neuron, or output, 'fires' or not. An example of this was the usage of the Sigmoid function in binary classification, which maps the output to a value between 0 and 1 that is then read as either a 0 or a 1.

The Sigmoid function is typically only used for the last layer. Within our hidden layers, we may need something a little more specific than a 0 or a 1. So for hidden layers, $tanh(x)$ works better. The values of $tanh(x)$ lie in the interval (-1, 1), as opposed to the Sigmoid function's (0, 1). The biggest way this affects our neural network is that $tanh(x)$ results in higher values of the gradient (slope) during training and therefore larger updates to the weights of the network.

However, the most common activation function in deep learning is the **Rectified Linear Unit**, brought up last lecture. Its form is the following:

$g(z) = max(0, z)$

Using **ReLU** allows the learning process to happen much faster than if we used $tanh$, as long as $z > 0$. The disadvantage, however, is that for any data where $z \leq 0$, the slope is 0, meaning we come again to the Vanishing Gradient problem and learn quite slowly, if at all.

To mitigate this issue, we use the **Leaky ReLU**, which has a small slope $β$ for $z \leq 0$ rather than a slope of 0. It has the following form:

$g(z) = max(βz, z)$
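
The activation functions from this lecture can be compared side by side on a few sample values. The default `beta=0.01` below is a commonly used value and an assumption here, since the notes only say that $β$ is small.

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def relu(z):
    return np.maximum(0, z)


def leaky_relu(z, beta=0.01):
    """max(beta * z, z); beta = 0.01 is an assumed, commonly used value."""
    return np.maximum(beta * z, z)


z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))     # values strictly between 0 and 1
print(np.tanh(z))     # values strictly between -1 and 1
print(relu(z))        # negative inputs become 0
print(leaky_relu(z))  # negative inputs are scaled by beta instead
```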

### Homework

If you do not use non-linear activation function (such as sigmoid or $tanh$) there is no point in adding more than one layer in your neural network. Prove this mathematically.

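Say we use the linear activation function $g(z) = z$ in every layer. The first layer then outputs $a^{[1]} = ω^{[1]}x + b^{[1]}$, and the second layer outputs

$a^{[2]} = ω^{[2]}a^{[1]} + b^{[2]} = ω^{[2]}(ω^{[1]}x + b^{[1]}) + b^{[2]} = (ω^{[2]}ω^{[1]})x + (ω^{[2]}b^{[1]} + b^{[2]})$.

This is again a linear function of $x$, with weight $ω' = ω^{[2]}ω^{[1]}$ and bias $b' = ω^{[2]}b^{[1]} + b^{[2]}$, so the two layers compute exactly what a single layer could compute. Repeating the argument shows the same for any number of layers: without a non-linear activation function, the extra layers add nothing.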
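
A quick numerical check of this claim, using random placeholder weights: two layers with $g(z) = z$ are compared against one layer whose weight and bias are built from theirs. The layer sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
omega_1, b_1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
omega_2, b_2 = rng.standard_normal((2, 4)), rng.standard_normal((2, 1))
x = rng.standard_normal((3, 1))

# Two layers with the identity activation g(z) = z
a_1 = np.dot(omega_1, x) + b_1
a_2 = np.dot(omega_2, a_1) + b_2

# One layer built by combining the weights and biases of the two layers
omega_combined = np.dot(omega_2, omega_1)
b_combined = np.dot(omega_2, b_1) + b_2
a_single = np.dot(omega_combined, x) + b_combined

print(np.allclose(a_2, a_single))
```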

### References

Antoniadis, Panagiotis. *Activation Functions: Sigmoid vs Tanh*. Baeldung. https://www.baeldung.com/cs/sigmoid-vs-tanh-functions#:~:text=This%20means%20that%20using%20the,use%20the%20tanh%20activation%20function.




### Lecture 9 Variable Cheatsheet
**Used in either the lecture or the notes**

| Variable | Meaning |
| -------- | ------- |
| $g(z)$ | activation function |
| $β$ | slope coefficient (hyperparameter) |
| $βz$ | small slope coefficient applied to $z$ |