<div style="font-size:60px;font-weight: bolder;padding:40px 0">Artificial Neural Network</div>

<div>
<div style="float:left"><img src="images/fig-ai-ml-dl-nlp.png" /></div>
<div><img src="images/fig-ai-ml-ann-dl.png" /></div>
</div>   



# Artificial Neural Networks or ANN

An Artificial Neural Network (ANN) is an information-processing model inspired by the way biological nervous systems, such as the brain, process information. It is composed of a large number of highly interconnected processing elements (neurons) working in unison to solve a specific problem.

<p><strong class="iu jv">Topics to cover:</strong></p>
<ol>
    <li>Neurons</li>
    <li>Activation Functions</li>
    <li>Types of Activation Functions</li>
    <li>How do Neural Networks work</li>
    <li>How do Neural Networks learn (Backpropagation)</li>
    <li>Gradient Descent</li>
    <li>Stochastic Gradient Descent</li>
    <li>Training ANN with Stochastic Gradient Descent</li>
</ol>

Neural networks consist of input and output layers, as well as (in most cases) a hidden layer consisting of units that transform the input into something that the output layer can use. They are excellent tools for finding patterns which are far too complex or numerous for a human programmer to extract and teach the machine to recognize.

## Neurons

Biological neurons (also called nerve cells) are the fundamental units of the brain and nervous system. They receive sensory input from the external world via dendrites, process it, and send the output through axons.

<img src="images/fig-ann-1.png" />


The following diagram represents the general model of an ANN, which is inspired by a biological neuron. It is also called a **Perceptron**.

## Perceptron Model

A single-layer neural network is called a **Perceptron**. It gives a single output.

<img src="images/fig-ann-2.png" />


In the above figure, for a single observation, $ x_0, x_1, x_2, x_3, \dots, x_n $ represent the **inputs** (independent variables) to the network. Each of these inputs is multiplied by a connection **weight**, or synapse. The weights are written $ w_0, w_1, w_2, w_3, \dots, w_n $, and each weight indicates the strength of its connection.
$b$ is a **bias** value. The bias shifts the weighted sum up or down, which moves the point at which the activation function fires.
In the simplest case, these products are summed and fed to a transfer function (the **activation function**), and the result is sent on as the **output**.

Mathematically, 

$$ \hat{y} = (x_0 w_0 + b_0) + (x_1 w_1 + b_1) + (x_2 w_2 + b_2) + \dots + (x_n w_n + b_n) = \sum_{i=0}^{n} (x_i w_i + b_i) $$

We update the weights whenever the perceptron misclassifies an observation.

The weight update equation is:

$$ \text{weight} \leftarrow \text{weight} + \text{learning rate} \times (\text{expected} - \text{predicted}) \times x $$
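As a minimal sketch of the perceptron described above, the following pure-Python code computes the weighted sum plus bias, applies a step activation, and applies the weight update rule whenever a prediction is wrong. The function names (`predict`, `train_perceptron`), the step threshold at 0, the single shared bias, and the AND-gate toy data are assumptions chosen for illustration; they are not part of the original text.

```python
def predict(weights, bias, x):
    """Return 1 if the weighted sum plus bias is positive, else 0."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else 0


def train_perceptron(samples, learning_rate=0.1, epochs=20):
    """Train on (inputs, expected) pairs; weights only change on mistakes."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, expected in samples:
            error = expected - predict(weights, bias, x)  # 0 when the prediction is correct
            # weight = weight + learning_rate * (expected - predicted) * x
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
            # the bias behaves like a weight on a constant input of 1
            bias += learning_rate * error
    return weights, bias


# Toy data (assumed for illustration): the logical AND of two inputs.
and_data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights, bias = train_perceptron(and_data)
predictions = [predict(weights, bias, x) for x, _ in and_data]
print(weights, bias, predictions)
```

Because the update is proportional to `expected - predicted`, correctly classified samples leave the weights unchanged.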

## Neural Networks

A single perceptron won’t be enough to learn complicated systems.
Fortunately, we can expand on the idea of a single perceptron, to create a multi-layer perceptron model. 

To build a network of perceptrons, we can connect layers of perceptrons, using a **multi-layer perceptron (MLP)**  model.

## Multilayer Perceptron (MLP) or Feed Forward Neural Network (FFNN)

The outputs of one perceptron are fed directly as inputs to another perceptron.

In a multilayer perceptron, there can be more than one linear layer (combination of neurons). In the simple example of a three-layer network, the first layer is the input layer, the last is the output layer, and the middle layer is called the hidden layer. We feed our input data into the input layer and take the output from the output layer. We can add as many hidden layers as we want, making the model as complex as the task requires.
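The layer-by-layer flow can be sketched in a few lines of plain Python. This is only an illustration under assumed values: the helper `dense`, the sigmoid hidden activation, and every weight and bias below are hypothetical, not taken from a trained model.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def identity(z):
    return z


def dense(inputs, weights, biases, activation):
    """One fully connected layer: every neuron sees a weighted sum of all inputs."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]


# Hypothetical parameters for a network with 2 inputs, 3 hidden units, 1 output.
hidden_weights = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
hidden_biases = [0.0, 0.1, -0.1]
output_weights = [[1.0, -1.0, 0.5]]
output_biases = [0.2]

x = [1.0, 2.0]                                                  # input layer
hidden = dense(x, hidden_weights, hidden_biases, sigmoid)       # hidden layer
y_hat = dense(hidden, output_weights, output_biases, identity)  # output layer
print(hidden, y_hat)
```

Adding another hidden layer is just one more `dense` call whose input is the previous layer's output.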

<img src="images/fig-ann-9.png" style = "width:60%; height:auto;" />

#### Neural Networks become "Deep Neural Networks" if they contain 2 or more hidden layers.

<img src="images/fig-ann-10.jpg" style = "width:60%; height:auto;" />




# Activation function

Activation functions, also known as non-linearities, describe the input-output relation in a non-linear way.

The activation function is important for an ANN to **learn** and make sense of something really complicated. Its main purpose is to convert the input signal of a node in an ANN to an output signal. This output signal is used as input to the next layer in the stack.

#### The activation function decides whether a neuron should be activated or not, based on the weighted sum of its inputs plus the bias. The motive is to introduce non-linearity into the output of a neuron.

If we do not apply an activation function, the output signal is simply a linear function (a polynomial of degree one). Linear functions are easy to solve, but they are limited in complexity and have less power: a stack of linear layers is still just one linear function. Without an activation function, our model cannot learn and model complicated data such as images, videos, audio, speech, etc.
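The limitation of purely linear layers can be checked with a small worked example. The sketch below uses single-input "layers" with made-up weights (`w1`, `b1`, `w2`, `b2` are assumptions) and shows that two stacked linear layers give exactly the same outputs as one combined linear layer, while inserting a non-linearity between them (here a rectifier, chosen only as an example) breaks that equivalence.

```python
def linear(x, w, b):
    return w * x + b


# Hypothetical parameters of two stacked one-input layers.
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5


def two_linear_layers(x):
    return linear(linear(x, w1, b1), w2, b2)


# Algebraically: w2 * (w1 * x + b1) + b2 = (w2 * w1) * x + (w2 * b1 + b2)
w_combined = w2 * w1
b_combined = w2 * b1 + b2


def one_linear_layer(x):
    return linear(x, w_combined, b_combined)


def relu(z):
    return max(0.0, z)


def two_layers_with_activation(x):
    return linear(relu(linear(x, w1, b1)), w2, b2)


for x in [-2.0, 0.0, 1.5]:
    print(x, two_linear_layers(x), one_linear_layer(x), two_layers_with_activation(x))
```

For `x = -2.0` the first layer produces a negative value, so the version with the activation diverges from the purely linear one.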


#### Neural networks are considered "Universal Function Approximators". It means that, with enough hidden units, they can approximate any continuous function on a bounded input range to arbitrary accuracy.

https://en.wikipedia.org/wiki/Activation_function#Comparison_of_activation_functions

## Types of Activation Functions:

### 1. Threshold Activation Function — (Binary step function)
A binary step function is a threshold-based activation function. If the input value is above a certain threshold, the neuron is activated and sends a fixed signal (1) to the next layer; otherwise it stays inactive and sends 0.

<img src="images/fig-ann-3.png" />

Activation function A = "activated" if y > threshold, else not activated.

Or: A = 1 if y > threshold, 0 otherwise.

This function works for creating a binary classifier (1 or 0), but it breaks down if you want multiple such neurons to be connected to bring in more classes (Class1, Class2, Class3, etc.). In this case, several neurons may give 1 at the same time, so we cannot decide which class to pick.
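A short sketch makes the multi-class problem concrete. The function name `binary_step`, the threshold of 0, and the three class scores are assumptions used only for this illustration.

```python
def binary_step(y, threshold=0.0):
    """Return 1 when the input exceeds the threshold, otherwise 0."""
    return 1 if y > threshold else 0


# Hypothetical weighted sums arriving at three output neurons, one per class.
class_scores = {"Class1": 2.3, "Class2": 0.4, "Class3": -1.1}
activations = {name: binary_step(score) for name, score in class_scores.items()}
print(activations)
```

Both `Class1` and `Class2` fire with the same value 1, even though their scores differ a lot, so the step output alone gives no way to rank them.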

### 2. Sigmoid Activation Function — (Logistic function)
A Sigmoid function is a mathematical function having a characteristic "S"-shaped curve or sigmoid curve which ranges between 0 and 1, therefore it is used for models where we need to predict the probability as an output.

<img src="images/fig-ann-4.png" />

The sigmoid function is differentiable, which means we can find the slope of the curve at any point.
The drawback of the sigmoid activation function is that it can cause the neural network to get stuck during training when strongly negative input is provided, because the output and its slope are then both close to zero.

### 3. Hyperbolic Tangent Function — (tanh)
It is similar to the sigmoid but often performs better. It is non-linear in nature, so we can stack layers. The function ranges over (-1, 1).

<img src="images/fig-ann-5.png" />

The main advantage of this function is that strongly negative inputs are mapped to negative outputs, and only inputs near zero are mapped to near-zero outputs, so the network is less likely to get stuck during training.
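To compare the two S-shaped functions numerically, the sketch below evaluates the sigmoid and tanh together with their slopes at a few inputs. The helper names and the sample inputs are assumptions; the slope formulas are the standard derivatives $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ and $\tanh'(z) = 1 - \tanh^2(z)$.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_slope(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def tanh_slope(z):
    return 1.0 - math.tanh(z) ** 2


for z in [-6.0, 0.0, 6.0]:
    print(
        f"z={z:+.1f}  sigmoid={sigmoid(z):.4f} (slope {sigmoid_slope(z):.4f})  "
        f"tanh={math.tanh(z):+.4f} (slope {tanh_slope(z):.4f})"
    )
```

At `z = -6` the sigmoid output is close to 0, while tanh stays clearly negative. Around zero, tanh also has the steeper slope (1.0 versus 0.25), which is one way to read the claim that it is less likely to get stuck.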

### 4. Rectified Linear Units — (ReLu)
ReLu is the most used activation function in CNNs and ANNs. Its output ranges from zero to infinity: $ [0,\infty) $

<img src="images/fig-ann-6.png" />


It gives an output x if x is positive and 0 otherwise. At first glance it seems to have the same problem as a linear function, since it is linear on the positive axis. However, ReLu is non-linear overall, and a combination of ReLu units is also non-linear. In fact, it is a good approximator: any continuous function can be approximated with a combination of ReLu units.

ReLu has been found to perform very well, especially because it mitigates the **vanishing gradient** problem.
We’ll often default to ReLu due to its overall good performance.

In one reported experiment (Krizhevsky et al., 2012), a network with ReLu units trained about 6 times faster than the same network with hyperbolic tangent units.


ReLu should only be applied to the hidden layers of a neural network. For the output layer, use a softmax function for classification problems and a linear function for regression problems.
One problem is that some neurons are fragile during training and can die: a large weight update can push a neuron into a state where it never activates on any data point again, so its gradient stays at zero. In short, ReLu can result in dead neurons.

To fix the problem of dying neurons, **Leaky ReLu** was introduced. Leaky ReLu introduces a small slope for negative inputs to keep the updates alive. Leaky ReLu ranges from $ -\infty $ to $ +\infty $.

<img src="images/fig-ann-7.jpeg" />

The leak extends the range of the ReLu function. Usually, the slope a is 0.01 or so.

When a is not fixed at 0.01 but drawn at random during training, the function is called **Randomized ReLu**.
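The difference between ReLu and Leaky ReLu is easiest to see in their slopes for negative inputs. The following sketch assumes the usual definitions (output `x` for positive inputs, `0` or `a * x` otherwise) and uses `a = 0.01`; the function names are illustrative.

```python
def relu(x):
    return x if x > 0 else 0.0


def relu_slope(x):
    # The slope is 0 for negative inputs, so no update flows back: a "dead" region.
    return 1.0 if x > 0 else 0.0


def leaky_relu(x, a=0.01):
    return x if x > 0 else a * x


def leaky_relu_slope(x, a=0.01):
    # A small non-zero slope keeps the weight updates alive for negative inputs.
    return 1.0 if x > 0 else a


for x in [-3.0, -0.5, 2.0]:
    print(x, relu(x), relu_slope(x), leaky_relu(x), leaky_relu_slope(x))
```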

## Multi-Class Activation Functions

Notice that all these activation functions make sense for a single output, either a continuous label or a binary classification (either a 0 or 1). But what should we do if we have a multi-class situation?

There are 2 main types of multi-class situations:

* **1. Non-Exclusive Classes** - A data point can have multiple classes/categories assigned to it. E.g. Photos can have multiple tags (e.g. beach, family, vacation, etc)


* **2. Mutually Exclusive Classes** - Only one class per data point. E.g. Photos can be categorized as being in grayscale (black and white) or full color photos. A photo can not be both at the same time.

#### Organizing Multiple Classes:
The easiest way to organize multiple classes is to simply have 1 output node per class.


<img src="images/fig-ann-11.png" />
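With one output node per class, the two multi-class situations are commonly handled with different output activations. The sketch below is an illustration under assumptions: it applies softmax (the function named earlier for classification outputs) to mutually exclusive classes and an independent sigmoid per node to non-exclusive classes; the raw scores are made up.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1 (mutually exclusive classes)."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # subtracting the max avoids overflow
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


raw_scores = [2.0, 1.0, 0.1]  # hypothetical outputs of three output nodes

exclusive = softmax(raw_scores)                    # exactly one class is picked
non_exclusive = [sigmoid(s) for s in raw_scores]   # each tag is judged on its own

print("softmax:", [round(p, 3) for p in exclusive])
print("sigmoid:", [round(p, 3) for p in non_exclusive])
```

The softmax outputs compete with each other and sum to 1, while the sigmoid outputs are independent, so several tags can be likely at the same time.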

# Cost Functions and Gradient Descent

### How do Neural networks learn?

We now understand that neural networks take in inputs, multiply them by weights, and add biases to them. This result is then passed through an activation function, and at the end of all the layers this leads to some output. The output $ \hat{y} $ is the model’s estimate of the label. We take the estimated outputs of the network and compare them to the real values of the label. Based on the difference between the actual value and the predicted value, an error value, given by the **Cost Function**, is computed and sent back through the system. The **cost function** (often referred to as a **loss function**) must be an average so it can output a single value.

#### Cost Function: One half of the squared difference between actual and output value.


We'll use the following variables:
* y to represent the true value
* a to represent neuron’s prediction

In terms of weights and bias:
* $wx + b = z$
* Pass z into activation function $ \sigma(z) = a $

One very common cost function is the quadratic cost function:

$$ C = \frac{1}{2n} \sum_x \big\lVert y(x) - a^L(x) \big\rVert^2 $$

We simply calculate the difference between the real values $y(x)$ and our predicted values $a^L(x)$, where $n$ is the number of training samples and $L$ denotes the output layer.
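For scalar outputs, the quadratic cost is a one-line computation. The sketch below assumes one output value per sample; the function name `quadratic_cost` and the sample values are made up for illustration.

```python
def quadratic_cost(actual, predicted):
    """C = 1 / (2n) * sum over samples of (y - a)^2, for scalar outputs."""
    n = len(actual)
    return sum((y - a) ** 2 for y, a in zip(actual, predicted)) / (2 * n)


y_true = [1.0, 0.0, 1.0, 1.0]   # true values y
a_pred = [0.9, 0.2, 0.6, 1.0]   # network predictions a

cost = quadratic_cost(y_true, a_pred)
print(round(cost, 5))
```

A perfect prediction gives a cost of exactly 0, and the cost grows with the square of each error, so large mistakes are penalised much more than small ones.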

The cost function is used to adjust the thresholds (biases) and weights in each layer of the network before the next input is processed. Our aim is to minimize the cost function: the lower the cost, the closer the predicted value is to the actual value. In this way, the error becomes marginally smaller in each run as the network learns how to analyze values.

We feed the resulting error back through the entire neural network. The weighted synapses connecting the input variables to the neurons are the only thing we have control over.

As long as there is a disparity between the actual value and the predicted value, we need to adjust those weights. Once we tweak them a little and run the neural network again, a new cost value is produced, hopefully smaller than the last.

We repeat this process until we have scrubbed the cost function down to as small a value as possible.

<img src="images/fig-ann-12.png" />

## Backpropagation

The procedure described above is known as **Back-propagation**. It is applied repeatedly through the network until the error value is as small as possible.

Backpropagation is a common method for training a neural network.

Once we get our cost/loss value, how do we actually go back and adjust our weights and biases? This is backpropagation.

Fundamentally, we want to know how the cost function changes with respect to the weights in the network, so we can update the weights to minimize the cost function.
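For a single sigmoid neuron with $z = wx + b$, $a = \sigma(z)$ and cost $C = \frac{1}{2}(y - a)^2$, "how the cost changes with respect to the weights" is given by the chain rule: $\frac{\partial C}{\partial w} = (a - y)\,\sigma'(z)\,x$. The sketch below computes this gradient and compares it with a finite-difference estimate. It is a one-neuron illustration, not a full backpropagation implementation, and the function names and numeric values are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cost(w, b, x, y):
    a = sigmoid(w * x + b)
    return 0.5 * (y - a) ** 2


def gradients(w, b, x, y):
    """Chain rule: dC/dw = dC/da * da/dz * dz/dw (and dz/db = 1 for the bias)."""
    a = sigmoid(w * x + b)
    dC_da = a - y
    da_dz = a * (1.0 - a)
    dC_dz = dC_da * da_dz
    return dC_dz * x, dC_dz


w, b, x, y = 0.5, -0.3, 2.0, 1.0   # hypothetical weight, bias, input, target
grad_w, grad_b = gradients(w, b, x, y)

# Numerical check: nudge each parameter slightly and measure the change in cost.
eps = 1e-6
numeric_w = (cost(w + eps, b, x, y) - cost(w - eps, b, x, y)) / (2 * eps)
numeric_b = (cost(w, b + eps, x, y) - cost(w, b - eps, x, y)) / (2 * eps)
print(grad_w, numeric_w, grad_b, numeric_b)
```

Here the prediction is below the target, so both gradients are negative: increasing `w` or `b` would reduce the cost.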

<img src="images/fig-ann-13.gif" />


## There are various ways to adjust weights:

## Brute-force method

Best suited for a single-layer feed-forward network. Here you try a number of possible weights and plot the cost for each, which gives a U-shaped curve. We want to eliminate all the other weights except the one right at the bottom of that curve.


The optimal weight can be found using simple elimination techniques. This process of elimination works if you have one weight to optimize. For a complex NN with many weights, the number of combinations to try explodes, and the method fails because of the Curse of Dimensionality.

The alternative approach that we have is called Batch Gradient Descent.

## Batch-Gradient Descent

Gradient Descent is an optimization technique that is used to improve deep learning and neural network-based models by minimizing the cost function.

<img src="images/fig-ann-15.png" />

In Gradient Descent, instead of trying every weight one at a time and ticking off every wrong weight as we go, we look at the slope of the cost curve at the current weight.

#### If slope → Negative, the cost falls to the right, so increase the weight.
#### If slope → Positive, the cost falls to the left, so decrease the weight.

This way a vast number of incorrect weights are skipped. The cost, however, is computed over the whole data set before every weight update. For instance, if we have 3 million samples, each update has to loop through all 3 million of them.
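A minimal sketch of batch gradient descent for a model with one weight, $\hat{y} = wx$, and the quadratic cost $C = \frac{1}{2n}\sum (y - wx)^2$. The data, learning rate and number of steps are assumptions for illustration. Note that every update uses all samples.

```python
def cost_slope(w, xs, ys):
    """Slope dC/dw of C = 1/(2n) * sum((y - w*x)^2), averaged over ALL samples."""
    n = len(xs)
    return -sum((y - w * x) * x for x, y in zip(xs, ys)) / n


def batch_gradient_descent(xs, ys, learning_rate=0.05, steps=50):
    w = 0.0
    for _ in range(steps):
        slope = cost_slope(w, xs, ys)   # one pass over the whole data set
        w -= learning_rate * slope      # negative slope raises w, positive slope lowers it
    return w


# Toy data (assumed): roughly y = 2x with a little noise.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

w = batch_gradient_descent(xs, ys)
print(round(w, 4))
```

Subtracting `learning_rate * slope` implements both slope rules at once: the weight always moves in the downhill direction of the cost curve.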

## Stochastic Gradient Descent(SGD)

Gradient Descent works fine when we have a convex curve, just like in the above figure. But if the curve is not convex, Gradient Descent can get stuck in a local minimum instead of reaching the global one.

The word ‘stochastic’ describes a system or process that is linked with random probability. Hence, in Stochastic Gradient Descent, samples are selected randomly for each iteration instead of using the whole data set.

<img src="images/fig-ann-14.jpg" />

In SGD, we take one row of data at a time, run it through the neural network, and then adjust the weights. For the second row, we run it, compute the cost function, and adjust the weights again. And so on…

SGD helps us avoid the problem of local minima, because the noise in its updates can push the weights out of a shallow minimum. Each update is also much faster than in Gradient Descent, because it processes one row at a time and doesn’t have to load the whole data set into memory.

One thing to note is that, because SGD is generally noisier than typical Gradient Descent, it usually takes a higher number of iterations to reach the minimum. Even so, each iteration is so much cheaper that SGD is still computationally much less expensive than typical Gradient Descent. Hence, in most scenarios, SGD is preferred over Batch Gradient Descent for optimizing a learning algorithm.
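The row-by-row update can be sketched for the same kind of one-weight model, $\hat{y} = wx$. The toy data, learning rate, number of epochs and random seed are assumptions for illustration. Each update uses the slope of the cost for a single, randomly ordered row.

```python
import random


def sgd(xs, ys, learning_rate=0.05, epochs=20, seed=0):
    rng = random.Random(seed)      # fixed seed so the run is repeatable
    w = 0.0
    history = []                   # weight after every single-row update
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)       # visit the rows in a random order
        for i in indices:
            slope = -(ys[i] - w * xs[i]) * xs[i]   # slope of 1/2 * (y - w*x)^2 for this row only
            w -= learning_rate * slope
            history.append(w)
    return w, history


# Toy data (assumed): roughly y = 2x with a little noise.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

w, history = sgd(xs, ys)
print(round(w, 4), [round(v, 3) for v in history[-4:]])
```

Because the rows disagree slightly about the best weight, the last few entries of `history` keep jumping around the optimum instead of settling on one value. This is the noise described above.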

## Adam optimization

The **Adam optimization** algorithm is an extension to stochastic gradient descent. It can be used instead of the classical stochastic gradient descent procedure to update network weights iteratively based on training data.

Stochastic gradient descent maintains a single learning rate (termed alpha) for all weight updates and the learning rate does not change during training.

Adam optimization maintains a per-parameter learning rate, adapted from running averages of each parameter's gradient and squared gradient, which improves performance on problems with sparse gradients.

Adam is a popular algorithm in the field of deep learning because it achieves good results fast.
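The Adam update for a single weight can be sketched as follows. The update equations and the default values `beta1 = 0.9`, `beta2 = 0.999`, `epsilon = 1e-8` follow the commonly published form of the algorithm. The function name `adam_minimize`, the learning rate of 0.1 and the toy cost $(w - 3)^2$ are assumptions for illustration.

```python
import math


def adam_minimize(gradient, w=0.0, learning_rate=0.1, beta1=0.9, beta2=0.999,
                  epsilon=1e-8, steps=500):
    m = 0.0   # running average of the gradient (first moment)
    v = 0.0   # running average of the squared gradient (second moment)
    for t in range(1, steps + 1):
        g = gradient(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)   # bias correction for the zero initialisation
        v_hat = v / (1 - beta2 ** t)
        # The effective step size is scaled by this weight's own gradient history.
        w -= learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return w


# Toy cost (assumed): C(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_best = adam_minimize(lambda w: 2 * (w - 3.0))
print(round(w_best, 4))
```

In a real network, `m` and `v` are kept separately for every weight, which is what gives each parameter its own adaptive learning rate.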

# Deep Learning Frameworks

**Keras**, **TensorFlow** and **PyTorch** are the top three frameworks preferred by data scientists as well as beginners in the field of Deep Learning.
<div style="width:100%;">
    <div style="float:left"><img src="images/keras.png" /></div>
    <div style="float:left;margin-left: auto;margin-right: auto;width: 60%;"><img src="images/tensorflow.png" /></div>
    <div style="float:right"><img src="images/pytorch.png" /></div>
</div>

**Keras** is an open source neural network library written in Python. It is capable of running on top of TensorFlow. It is designed to enable fast experimentation with deep neural networks.

**With TensorFlow 2.0, Keras is integrated into TensorFlow as `tf.keras`. The standalone Keras library could also run on top of CNTK and Theano.**

**TensorFlow** is an open-source software library for dataflow programming across a range of tasks. It is a symbolic math library that is used for machine learning applications like neural networks.


**PyTorch** is an open source machine learning library for Python, based on Torch. It is used for applications such as natural language processing and was developed by Facebook’s AI research group.

Keras is a **high-level API** capable of running on top of TensorFlow, CNTK and Theano. It has gained favor for its ease of use and syntactic simplicity, facilitating fast development.

TensorFlow is a framework that provides both high- and low-level APIs. PyTorch, on the other hand, is a lower-level API focused on direct work with array expressions. It has gained immense interest in the last year, becoming a preferred solution for academic research and for applications of deep learning that require optimizing custom expressions.

Keras is comparatively slower, whereas TensorFlow and PyTorch offer a similar pace, which is fast and suitable for high-performance workloads.

We will also focus on how to identify and deal with overfitting through Early Stopping Callbacks and Dropout Layers.

* **Early Stopping** - Keras can automatically stop training based on a loss condition on the validation data passed during the model.fit() call.

* **Dropout Layers** - Dropout can be added to layers to “turn off” neurons during training to prevent overfitting. Each Dropout layer will “drop” a  user-defined percentage of neuron units in the previous layer every batch.
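The logic behind both techniques can be sketched in plain Python, independent of any framework. This is an illustration under assumptions, not the Keras implementation: `dropout` uses the common "inverted dropout" convention (surviving units are scaled by `1 / (1 - rate)`), and `early_stopping_epoch` stops after `patience` consecutive epochs without a new best validation loss. The loss values are made up.

```python
import random


def dropout(activations, rate, rng):
    """Zero each unit with probability `rate`; rescale the rest to keep the expected sum."""
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


def early_stopping_epoch(val_losses, patience):
    """Return the index of the epoch at which training would be stopped."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1   # never triggered: training runs to the end


rng = random.Random(0)
dropped = dropout([1.0] * 1000, rate=0.5, rng=rng)
print(sum(1 for a in dropped if a == 0.0), "of 1000 units dropped in this batch")

val_losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.70]   # hypothetical validation losses
stop_epoch = early_stopping_epoch(val_losses, patience=2)
print("stop at epoch index", stop_epoch)
```

A fresh random mask is drawn on every call, which matches the idea of dropping a different set of units for every batch. Dropout is only applied during training, not when making predictions.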

### Popularity

<img src="images/fig-ann-16.png" />

# Deep Learning Model Life-Cycle

The five steps in the life-cycle are as follows:

* **1. Define the model** - Defining the model requires that you first select the type of model that you need and then choose the architecture or network topology. Models can be defined either with the Sequential API or the Functional API.

* **2. Compile the model** - Compiling the model requires that you first select a loss function that you want to optimize, such as mean squared error or cross-entropy. It also requires that you select an algorithm to perform the optimization procedure, typically stochastic gradient descent, or a modern variation, such as Adam. 

    The three most common loss functions are:
    * binary_crossentropy -  for binary classification.
    * sparse_categorical_crossentropy - for multi-class classification.
    * mse - (mean squared error) for regression.
    

* **3. Fit the model** - Fitting the model requires that you first select the training configuration, such as the number of **epochs** (loops through the training dataset) and the **batch size** (number of samples used to estimate the model error before each weight update). While fitting the model, a progress bar will summarize the status of each epoch and the overall training process. This can be simplified to a simple report of model performance each epoch by setting the **verbose** argument to 2. All output can be turned off during training by setting **verbose** to 0.

* **4. Evaluate the model** - Evaluating the model requires that you first choose a holdout dataset used to evaluate the model. This should be data not used in the training process so that we can get an unbiased estimate of the performance of the model when making predictions on new data.

* **5. Make predictions** - Making a prediction is the final step in the life-cycle. It is why we wanted the model in the first place. It requires you have new data for which a prediction is required, e.g. where you do not have the target values.

```python
# check version
import tensorflow
print(tensorflow.__version__)
```

```
2.0.0
```


Now that we are familiar with the model life-cycle, let’s take a look at the two main ways to use the tf.keras API to build models: sequential and functional.


# 1. Sequential Model API (Simple)

The sequential model API is the simplest and is the API that I recommend, especially when getting started.

It is referred to as “sequential” because it involves defining a [Sequential class](https://www.tensorflow.org/api_docs/python/tf/keras/Sequential) and adding layers to the model one by one in a linear manner, from input to output.

The example below defines a Sequential MLP model that accepts eight inputs, has one hidden layer with 10 nodes and then an output layer with one node to predict a numerical value.

```python
# example of a model defined with the sequential api
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

# define the model
model = Sequential()
model.add(Dense(10, input_shape=(8,)))
model.add(Dense(1))
```

Note that the visible layer of the network is defined by the **"input_shape"** argument on the first hidden layer. That means in the above example, the model expects the input for one sample to be a vector of eight numbers.


The sequential API is easy to use because you keep calling model.add() until you have added all of your layers.

For example, here is a deep MLP with five hidden layers.

```python
# example of a model defined with the sequential api
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

# define the model
model = Sequential()
model.add(Dense(100, input_shape=(8,)))
model.add(Dense(80))
model.add(Dense(30))
model.add(Dense(10))
model.add(Dense(5))
model.add(Dense(1))
```

# 2. Functional Model API (Advanced)

The functional API is more complex but is also more flexible.

It involves explicitly connecting the output of one layer to the input of another layer. Each connection is specified.

First, an input layer must be defined via the Input class, and the shape of an input sample is specified. We must retain a reference to the input layer when defining the model.

```python
# define the layers
x_in = Input(shape=(8,))
```

Next, a fully connected layer can be connected to the input by calling the layer and passing the input layer. This will return a reference to the output connection in this new layer.

```python
x = Dense(10)(x_in)
```

We can then connect this to an output layer in the same manner.

```python
x_out = Dense(1)(x)
```

Once connected, we define a Model object and specify the input and output layers. The complete example is listed below.


```python
# example of a model defined with the functional api
from tensorflow.keras import Model
from tensorflow.keras import Input
from tensorflow.keras.layers import Dense

# define the layers
x_in = Input(shape=(8,))
x = Dense(10)(x_in)
x_out = Dense(1)(x)

# define the model
model = Model(inputs=x_in, outputs=x_out)
```

Because each connection is specified explicitly, the functional API allows for more complicated model designs, such as models that may have multiple input paths (separate vectors) and models that have multiple output paths (e.g. a word and a number).
