# <center> Linear Algebra and Gradient Descents in Artificial Neural Network (ANN) and Deep Learning </center>


## I. Problem Statement


### 1. What is an artificial neural network (ANN) and deep learning?

An artificial neural network (ANN) is a mathematical model that combines linear algebra, statistics, and learning algorithms. It is inspired by (rather than a faithful copy of) the neural connections in our nervous system: it takes a given number of inputs and computes a specified number of outputs that should match the actual result as closely as possible.

Deep learning belongs to the family of ANN algorithms, and in most cases, the two terms can be used interchangeably.

### 2. What is the architecture of a neural network? 

For a system to be considered an ANN, it must contain a labeled, directed graph structure where each node in the graph performs some simple computation. From graph theory, we know that a directed graph consists of a set of nodes (vertices) and a set of connections (edges) that link together pairs of nodes. 

![simple ANN](https://929687.smushcdn.com/2407837/wp-content/uploads/2021/04/perceptron.png?size=630x459&lossy=1&strip=1&webp=1) 

The figure above depicts a simple neural network that computes the weighted sum of the inputs x and the weights w. This weighted sum is then passed through an activation function to determine whether the neuron fires.
* Each node performs a simple computation. 
* Each connection then carries a signal (i.e., the output of the computation) from one node to another, labeled by a weight indicating the extent to which the signal is amplified or diminished. 
    > * Some connections have large, positive weights that amplify the signal, indicating that the signal is very important when making a classification. 
    > * Others have negative weights, diminishing the strength of the signal, thus specifying that the output of the node is less important in the final output.
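The computation performed by a single node can be sketched in a few lines of Python. This is a minimal illustration, not the implementation developed later in the project: the input and weight values are made up, and the step threshold of 0 is an assumption.

```python
import numpy as np

def step(z):
    """Step activation: the neuron fires (1) if the weighted sum is positive, else 0."""
    return 1 if z > 0 else 0

def neuron(x, w, b=0.0):
    """Weighted sum of the inputs plus an optional bias, passed through the step function."""
    z = np.dot(w, x) + b
    return step(z)

# Illustrative values only (not taken from the figure).
x = np.array([1.0, 0.5, -1.0])   # inputs
w = np.array([0.8, 0.2, 0.5])    # one weight per connection

print(neuron(x, w))
```

A large positive weight lets its input dominate the sum, while a negative weight pushes the sum down and makes firing less likely.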

In practice, a neural network can have a complex architecture with multiple hidden layers, where the outputs of each layer become the inputs of the next layer.

![ANN with hidden layer](https://929687.smushcdn.com/2407837/wp-content/uploads/2021/04/intro_nn_feedforward.png?size=630x445&lossy=1&strip=1&webp=1)

The figure above shows a feedforward neural network with an input layer of 3 nodes, a first hidden layer of 2 nodes, a second hidden layer of 3 nodes, and an output layer of 2 nodes.
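A forward pass through a network with these layer sizes might look like the sketch below. The random weights, zero biases, and the choice of sigmoid as the activation are assumptions made for illustration; only the layer sizes (3, 2, 3, 2) come from the figure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes from the figure: 3 inputs -> 2 hidden -> 3 hidden -> 2 outputs.
layer_sizes = [3, 2, 3, 2]

# One weight matrix and one bias vector per pair of consecutive layers.
weights = [rng.standard_normal((n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def sigmoid(z):
    """Squash each value into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Propagate x through every layer; each layer's output feeds the next layer."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(a @ W + b)
    return a

x = np.array([0.5, -1.2, 3.0])   # one data point with 3 features (made up)
y = forward(x, weights, biases)
print(y.shape)
```

The weight matrix shapes (3x2, 2x3, 3x2) follow directly from the number of nodes in consecutive layers.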


### 3.  What are the roles of linear algebra and gradient descent in creating and optimizing a neural network?

#### a. Linear Algebra Role

**Linear algebra** is the study of lines and planes, vector spaces and mappings that are required for linear transforms. Linear algebra enables the mathematical representation and operations that a neural network can perform because mathematically, a neural network is constructed from vectors and matrices. 

![ANN in linear algebra](https://miro.medium.com/max/444/1*xmWCqN_VbKQXJg8veEpXug.png)

In the simple neural network above, the values x1 and x2 in the vector X are the inputs to the network and typically correspond to a single row (i.e., one data point) of the input matrix. The input vector X is multiplied by the weight matrix to obtain the weighted sum in the hidden or output layer. The weighted sum is then passed through an activation function (step, ReLU, sigmoid, etc.) to determine the output passed to the next layer.

*Mathematical representation of the input vector multiplied by the weights, with bias $b$ and activation function $f$:* <center> $f\left(\sum_{i=1}^{n} w_i x_i + b\right)$ </center>
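As a small worked example of this computation, take two inputs, two weights, a bias, and the sigmoid as the activation $f$. The numbers are made up for illustration, and the bias is assumed to be added to the weighted sum before the activation is applied:

$$
\begin{aligned}
x &= (1,\ 2), \quad w = (0.5,\ -0.25), \quad b = 0.1 \\
z &= \sum_{i=1}^{2} w_i x_i + b = (0.5)(1) + (-0.25)(2) + 0.1 = 0.1 \\
f(z) &= \frac{1}{1 + e^{-z}} = \frac{1}{1 + e^{-0.1}} \approx 0.525
\end{aligned}
$$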

####  b. Activation Functions Role

After the input vector has been multiplied by the weight matrix to obtain the weighted sum, this weighted sum is passed through an activation function to obtain an output. The range and shape of the output depend on the activation function we use.

![Type of activation functions](https://929687.smushcdn.com/2407837/wp-content/uploads/2021/04/activation_functions-768x585.png?lossy=1&strip=1&webp=1)

The figure above shows six popular activation functions used in ANNs and deep learning.

1. **Step function**: The simplest activation function is the `step` function, used by the Perceptron algorithm. Its output is binary (0 or 1). The step function is not differentiable at the threshold and has zero gradient everywhere else, which makes it difficult to train the network with gradient-based methods.

![Step function](https://929687.smushcdn.com/2407837/wp-content/latex/90a/90af9fe0904f486d2ccf18160fc2d816-ffffff-000000-0.png?lossy=1&strip=1&webp=1)


2. **Sigmoid function**: The `sigmoid` function overcomes the shortcoming of the step function. Its output is continuous between 0 and 1, differentiable everywhere, and symmetric about the point (0, 0.5), which is good for learning algorithms. However, the sigmoid saturates for inputs of large magnitude, where its gradient becomes very small; in a deep neural network this slows or stalls training. Its output is also never negative, so it is not centered at zero.

![sigmoid](https://929687.smushcdn.com/2407837/wp-content/latex/9fb/9fb9ad64edf81dcf824ef1ce6b8bc1bd-ffffff-000000-0.png?lossy=1&strip=1&webp=1)

3. **tanh function (hyperbolic tangent)**: `tanh` has a similar shape to the sigmoid function, but its output ranges from -1 to 1 and is centered at zero. However, `tanh` also saturates for inputs of large magnitude, so it suffers from the same small-gradient problem in deep neural networks.

<center>$f(z) = \tanh(z) = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$</center>

4. **ReLU (Rectified Linear Unit)**: `ReLU` often works better than sigmoid and tanh because it does not saturate for positive inputs and is very cheap to compute. It is the most popular activation function in deep learning, as it outperforms `tanh` and `sigmoid` in many applications. However, `ReLU` outputs 0 for all negative inputs, so its gradient there is also 0.

<center>$f(x) = \max(0, x)$</center>

5. **Leaky ReLU and ELU (Exponential Linear Unit)**: two variants of `ReLU` that allow negative output values and often perform better than the standard `ReLU`.

    `Leaky ReLU`

![leaky](https://929687.smushcdn.com/2407837/wp-content/latex/79b/79bf41ddaddcf7b9bfcbfb332ce2162c-ffffff-000000-0.png?lossy=1&strip=1&webp=1)

    
    `ELU`

![ELU](https://929687.smushcdn.com/2407837/wp-content/latex/331/3311aae5da977567ae0b6afe0ac68fdd-ffffff-000000-0.png?lossy=1&strip=1&webp=1)
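The activation functions listed above can be written as short NumPy functions. This is a sketch rather than a reference implementation: the step threshold of 0 and the `alpha` defaults for Leaky ReLU and ELU are assumed values, since the exact definitions appear only in the images. The helper `sigmoid_grad` is added to make the saturation of the sigmoid visible.

```python
import numpy as np

def step(z):
    """Binary output: 1 where z is positive, 0 elsewhere (threshold of 0 assumed)."""
    return np.where(z > 0, 1.0, 0.0)

def sigmoid(z):
    """Smooth output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid; it shrinks toward 0 as |z| grows (saturation)."""
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh(z):
    """Smooth output in (-1, 1), centered at zero."""
    return np.tanh(z)

def relu(z):
    """Passes positive values through and outputs 0 for negative values."""
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    """Like ReLU, but negative inputs keep a small slope alpha (assumed default)."""
    return np.where(z >= 0, z, alpha * z)

def elu(z, alpha=1.0):
    """Like ReLU, but negative inputs decay smoothly toward -alpha (assumed default)."""
    # np.minimum keeps exp() from overflowing on large positive inputs.
    return np.where(z >= 0, z, alpha * (np.exp(np.minimum(z, 0.0)) - 1.0))

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
for name, fn in [("step", step), ("sigmoid", sigmoid), ("tanh", tanh),
                 ("relu", relu), ("leaky_relu", leaky_relu), ("elu", elu)]:
    print(f"{name:>10}: {np.round(fn(z), 4)}")
print("sigmoid gradient:", sigmoid_grad(z))
```

Printing the sigmoid gradient at the two extremes of `z` shows how little signal is left for learning once the function saturates.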


####  c. Gradient Descent Role
While **linear algebra** describes how a neural network is represented and which mathematical operations manipulate its inputs and outputs, **gradient descent** drives the learning and optimization process. Optimization algorithms are the engines that power neural networks and enable them to learn patterns from data. Obtaining a high-accuracy classifier depends on finding a weight matrix `W` and a bias vector `b` such that our data points are correctly classified. **Gradient descent** is a very popular algorithm that lets us improve `W` and `b` step by step instead of randomly guessing values. At each iteration, it evaluates the current parameters (weights and biases), computes the loss (using a predefined loss function such as softmax cross-entropy or the multi-class SVM hinge loss), and then takes a small step in the direction that decreases that loss.

![Gradient Descent](https://929687.smushcdn.com/2407837/wp-content/uploads/2021/04/naive_loss_bowl_loss.png?size=630x291&lossy=1&strip=1&webp=1)
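The iterative procedure described above can be sketched on a toy problem. The bowl-shaped loss, its minimum at (3, -1), the learning rate, and the number of steps are all made-up choices for illustration; a real network would use a loss computed from data.

```python
import numpy as np

def loss(w):
    """Bowl-shaped toy loss with its minimum at w = (3, -1)."""
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2

def loss_grad(w):
    """Analytical gradient of the toy loss."""
    return np.array([2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)])

def gradient_descent(w0, learning_rate=0.1, n_steps=100):
    """Repeatedly take a small step against the gradient, i.e. downhill."""
    w = np.array(w0, dtype=float)
    for _ in range(n_steps):
        w = w - learning_rate * loss_grad(w)
    return w

w_final = gradient_descent([0.0, 0.0])
print(w_final, loss(w_final))
```

The learning rate controls the step size: too small and progress is slow, too large and the updates can overshoot the bottom of the bowl.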

Our goal when applying gradient descent is to navigate toward the global minimum at the bottom of the loss landscape by stepping against the gradient, that is, downhill along the slope. For a one-dimensional function, the slope is the instantaneous rate of change of the function at a given point. The gradient generalizes the slope to functions that take a vector of numbers instead of a single number: it is the vector of slopes (more formally, partial derivatives), one for each dimension of the input space. The mathematical expression for the derivative of a 1-D function with respect to its input is:

![gradient equation](https://929687.smushcdn.com/2407837/wp-content/latex/cb0/cb088516f5270cc0ac19d24b5c7e663b-ffffff-000000-0.png?lossy=1&strip=1&webp=1)
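A derivative can be approximated numerically by evaluating the function at two nearby points and dividing the difference by the distance between them. The sketch below uses a centered difference, a common variant of this finite-difference idea chosen here for accuracy; it is not necessarily the exact expression shown in the image. The function `f` and the step size `h` are illustrative assumptions.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Approximate the gradient of f at the 1-D vector x, one dimension at a time."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        offset = np.zeros_like(x)
        offset[i] = h
        # Centered difference: slope between two points on either side of x.
        grad[i] = (f(x + offset) - f(x - offset)) / (2.0 * h)
    return grad

def f(x):
    """Toy function with known gradient (2 * x0, 3)."""
    return x[0] ** 2 + 3.0 * x[1]

g = numerical_gradient(f, [2.0, 5.0])
print(g)
```

Each entry of the result is the slope along one input dimension, which is exactly the "vector of slopes" view of the gradient. Numerical gradients are slow for large networks but useful for checking an analytical gradient.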


### 4. What is the difference between ANN and traditional Machine Learning? What are the applications of artificial neural networks and deep learning in real life?

ANNs and deep learning are very powerful at classifying unstructured data such as images, audio, and other media.

In traditional machine learning algorithms, we need to perform feature engineering and feature extraction on the inputs before feeding them to the algorithm. Deep learning, and specifically Convolutional Neural Networks, takes a different approach. Instead of relying on a hand-defined set of rules and algorithms to extract features from the inputs, the network learns these features automatically during training.

![DL and ML](https://929687.smushcdn.com/2407837/wp-content/uploads/2021/03/feature_extraction_vs_dl-606x1024.png?lossy=1&strip=1&webp=1)

In this project, I will implement a small neural network in Python, including the activation function and the gradient descent algorithm, based on the mathematical model and concepts introduced in the previous sections.

Then I will use this basic neural network to train on and classify handwritten digits from the MNIST dataset. I will also validate the outcome of this neural network.

Finally, I will introduce more complex neural networks, such as Convolutional Neural Networks for image classification, and then implement a PyTorch deep learning model to classify a color image dataset.


## II. Gradient Descent Algorithm

## III. Simple Feedforward Neural Network Implementation

## Reference
[Stanford University - CS231n: Convolutional Neural Networks for Visual Recognition](https://cs231n.github.io/)

[Adrian Rosebrock, PhD - Gradient Descent with Python](https://www.pyimagesearch.com/2016/10/10/gradient-descent-with-python/?_ga=2.50450089.1649789072.1638069243-1249829026.1638069243) 