# `Deep Learning Fundamentals`

### Syllabus

`Module 1 - Introduction to Deep Learning`
- Introduction to Deep Learning
- Biological Neural Networks
- Artificial Neural Networks - Forward Propagation

`Module 2 - Artificial Neural Networks`
- Gradient Descent

`Module 3 - Deep Learning Libraries`
- Introduction to Deep Learning Libraries
- Regression Models with Keras
- Classification Models with Keras

`Module 4 - Deep Learning Models`
- Shallow and Deep Neural Networks
- Convolutional Neural Networks
- Recurrent Neural Networks
- Autoencoders

------
`Neural networks are universal function approximators`

## Artificial Neural Network

In [2]:
%%html
<img src='https://upload.wikimedia.org/wikipedia/commons/thumb/4/46/Colored_neural_network.svg/1200px-Colored_neural_network.svg.png' width=70%>

In [3]:
%%html
<img src='https://i.stack.imgur.com/LgmYv.png' width=70%>

## Forward Propagation

Forward propagation is the process through which information propagates from the input layer to the output layer.

1. Start with the input layer as the input to the first hidden layer.
2. Compute the weighted sum at the nodes of the current layer.
3. Compute the output of the nodes of the current layer.
4. Set the output of the current layer to be the input to the next layer.
5. Move to the next layer in the network.
6. Repeat steps 2 - 5 until the output of the output layer has been computed.
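
The steps above can be sketched in plain Python. This is only a minimal illustration: the helper names (`dense_forward`, `forward_propagate`), the tiny network and its weights are all hypothetical and were chosen so the arithmetic is easy to follow by hand.

```python
def relu(z):
    """ReLU activation: zero for negative inputs, identity otherwise."""
    return max(0.0, z)


def dense_forward(inputs, weights, biases, activation):
    """Compute the outputs of one fully connected layer."""
    outputs = []
    for node_weights, bias in zip(weights, biases):
        # Step 2: weighted sum of the inputs at this node
        z = sum(w * x for w, x in zip(node_weights, inputs)) + bias
        # Step 3: output of the node
        outputs.append(activation(z))
    return outputs


def forward_propagate(x, layers):
    """Run x through every layer; layers holds (weights, biases, activation) tuples."""
    current = x  # Step 1: the input layer feeds the first hidden layer
    for weights, biases, activation in layers:
        # Step 4: the output of this layer becomes the input of the next one
        current = dense_forward(current, weights, biases, activation)
    return current


# Hypothetical network: 2 inputs -> 2 hidden ReLU nodes -> 1 linear output node
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], relu),
    ([[1.0, 2.0]], [0.5], lambda z: z),
]
output = forward_propagate([2.0, 1.0], layers)
print(output)  # [4.5]
```

For the input `[2.0, 1.0]` the hidden layer produces `[1.0, 1.5]`, and the output node computes `1.0 * 1.0 + 2.0 * 1.5 + 0.5 = 4.5`.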

### Perceptron

A perceptron is a single-layer neural network; a multi-layer perceptron is what is usually called a neural network. The perceptron is a linear binary classifier trained with supervised learning: it assigns each input to one of two classes.

<img src='https://upload.wikimedia.org/wikipedia/commons/6/60/ArtificialNeuronModel_english.png' width=70%>
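
As a rough illustration of a perceptron acting as a linear binary classifier, the sketch below trains one on the logical AND function using the classic perceptron update rule. The function names, the learning rate of 1.0 and the choice of AND as a dataset are assumptions made for this example, not part of the course material.

```python
def perceptron_predict(inputs, weights, bias):
    """Step activation applied to the weighted sum: returns class 1 or 0."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if z >= 0 else 0


def perceptron_train(samples, labels, learning_rate=1.0, epochs=10):
    """Perceptron learning rule: nudge the weights by the prediction error."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = y - perceptron_predict(x, weights, bias)
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias


# Logical AND is linearly separable, so the perceptron can learn it
samples = [[0, 0], [0, 1], [1, 0], [1, 1]]
labels = [0, 0, 0, 1]
weights, bias = perceptron_train(samples, labels)
predictions = [perceptron_predict(x, weights, bias) for x in samples]
print(predictions)  # [0, 0, 0, 1]
```

A dataset that is not linearly separable (XOR, for example) cannot be fit by a single perceptron, no matter how many epochs are used.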

### Keras Sequential Network with Dense Layers Example 

In [4]:
import keras


In [None]:
import pandas as pd

In [None]:
df=pd.read_csv('metal_price.csv',index_col='Date')

In [None]:
target = df['Price']

In [None]:
# Use every column except the target 'Price' as a predictor
predictors = df.drop(columns=['Price'])

In [None]:
from keras.models import Sequential

In [None]:
from keras.layers import Dense

Keras provides two ways to build a model:

- the `Sequential` model
- the `Model` class (functional API)

In [None]:
model=Sequential()

In [None]:
n_columns=predictors.shape[1]
model.add(Dense(5,activation='relu', input_shape=(n_columns,)))
model.add(Dense(5,activation='relu'))
model.add(Dense(1))

In [None]:
model.compile(optimizer='adam',loss='mean_squared_error')

In [None]:
model.fit(predictors,target,epochs=4)

In [None]:
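# test_data is not defined in this notebook; it is assumed to hold
# unseen rows with the same predictor columns as `predictors`
predictions=model.predict(test_data)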

## Backpropagation

## Activation Functions

Activation functions introduce non-linearity into the network, which is why they are also called non-linearities. Neural networks are trained using backpropagation, which requires `differentiable` activation functions (in practice, differentiable almost everywhere is enough).

Without an activation function, the output would simply be a linear function of the input, so a neural network without activation functions would be no more expressive than a linear regression model. Such a network could not learn and model complicated kinds of data such as images, video, audio and speech.

That is why we use deep learning, that is, neural networks with many hidden layers and complex architectures, to make sense of complicated, high-dimensional, non-linear, big datasets and to extract knowledge from them.
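
The claim that a network without activation functions collapses to a linear model can be checked numerically. The sketch below uses a hypothetical one-input, one-hidden-node network with arbitrary example weights: two stacked linear layers give exactly the same result as a single linear layer, while inserting a ReLU in between breaks that equivalence.

```python
def two_linear_layers(x, w1, b1, w2, b2):
    """Two layers with no activation function in between."""
    hidden = w1 * x + b1
    return w2 * hidden + b2


def collapsed_layer(x, w1, b1, w2, b2):
    """The single linear layer that the two layers above are equivalent to."""
    return (w2 * w1) * x + (w2 * b1 + b2)


def two_layers_with_relu(x, w1, b1, w2, b2):
    """Same network, but with a ReLU applied to the hidden node."""
    hidden = max(0, w1 * x + b1)
    return w2 * hidden + b2


# Arbitrary example parameters (integers keep the arithmetic exact)
params = (2, 1, -3, 4)
for x in [-2, -1, 0, 1, 2]:
    print(x, two_linear_layers(x, *params), collapsed_layer(x, *params),
          two_layers_with_relu(x, *params))
```

The first two columns of output always agree, whereas the ReLU version differs for inputs where the hidden node is negative.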

### Types of Activation Functions

`Identity / Linear`

The problem with a linear activation function is that its derivative is a constant: the gradient carries no information about the input, so gradient descent always proceeds on the same constant gradient.

* Function
<img src='https://cdn-images-1.medium.com/max/1000/1*gklL4_EwFpXPSzFC4sPT1g.png'>
* Derivative: a constant (1 for the identity function)
* Graph
<img src='https://i.stack.imgur.com/NKESX.png'>

`Step / Heaviside`

The step function is typically only useful within single-layer perceptrons, an early type of neural network that can be used for classification when the input data is linearly separable. It is suited to binary classification tasks.

* Function
<img src='https://cdn-images-1.medium.com/max/1000/1*LfKVxBfSYSFyUwEw5YInFg.png'>

* Graph
<img src='https://i.stack.imgur.com/vRdzT.png'>

`Bipolar`

* Graph
<img src='https://i.stack.imgur.com/5tQJ9.png'>

`Piecewise Linear`

* Graph
<img src='https://i.stack.imgur.com/cguIH.png'>

`Sigmoid / Logistic Activation Function`

The sigmoid non-linearity squashes real numbers into the range (0, 1). In particular, large negative numbers map close to 0 and large positive numbers map close to 1. It is mostly used for binary classification problems.

It has two major drawbacks:
* Sigmoids saturate and kill gradients (Vanishing Gradient Problem)

If the local gradient is very small, it will effectively "kill" the gradient and almost no signal will flow through the neuron to its weights and recursively to its data. 

Additionally, one must take extra care when initializing the weights of sigmoid neurons to prevent saturation. For example, if the initial weights are too large, most neurons become saturated and the network will barely learn.

* Sigmoid outputs are not zero-centered


* Function
<img src='https://cdn-images-1.medium.com/max/1000/1*Bhzlu8WmM1UFo8TltoRHOQ.png'>

* Derivative
<img src='https://cdn-images-1.medium.com/max/1000/1*Q_fCmWPcz4F8IoNXm9tqcg.png'>

* Graph
<img src='https://i.stack.imgur.com/COTWF.png'>
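
To make the saturation problem concrete, here is a small sketch of the sigmoid and its derivative, using the standard identity σ'(z) = σ(z)(1 − σ(z)). The function names are chosen for this example. The printed values show how the gradient is largest at zero and shrinks rapidly for inputs of large magnitude.

```python
import math


def sigmoid(z):
    """Logistic function; fine for moderate |z| (very negative z would overflow exp)."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    """Derivative of the sigmoid, expressed through the sigmoid itself."""
    s = sigmoid(z)
    return s * (1.0 - s)


for z in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"z={z:6.1f}  sigmoid={sigmoid(z):.6f}  derivative={sigmoid_derivative(z):.6f}")
```

The derivative never exceeds 0.25 (its value at z = 0), and at z = ±10 it is already below 0.0001, which is the "killed" gradient described above.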

`Complementary log-log`

$a_{ij} = \sigma(z_{ij}) = 1 - \exp(-\exp(z_{ij}))$
* Graph
<img src='https://i.stack.imgur.com/LcZHq.png'>

`Hyperbolic Tangent (Tanh)`

The tanh non-linearity squashes real numbers into the range (-1, 1). It is a scaled and shifted sigmoid: tanh(x) = 2σ(2x) − 1. Its output is centered around zero and its derivatives are larger than the sigmoid's, so tanh usually converges faster than the sigmoid (logistic) activation function. The downside is that it suffers from the vanishing gradient problem as well.

* Function
<img src='https://cdn-images-1.medium.com/max/1000/1*rACT-asoF6gANDE2fW07IQ.png'>

* Graph
<img src='https://i.stack.imgur.com/gQ9zn.png'>
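
The relationship between tanh and the sigmoid can be verified with a few lines of Python. This sketch relies on the identity tanh(z) = 2σ(2z) − 1 and on the derivative 1 − tanh²(z); the helper names are chosen for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def tanh_from_sigmoid(z):
    """tanh written as a scaled and shifted sigmoid."""
    return 2.0 * sigmoid(2.0 * z) - 1.0


def tanh_derivative(z):
    """Derivative of tanh: 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2


for z in (-2.0, -0.5, 0.0, 0.7, 3.0):
    print(f"z={z:5.1f}  tanh={math.tanh(z):.6f}  via sigmoid={tanh_from_sigmoid(z):.6f}  "
          f"derivative={tanh_derivative(z):.6f}")
```

At z = 0 the tanh derivative is 1.0, compared with 0.25 for the sigmoid, but it still decays towards zero for large |z|, so saturation remains an issue.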

`Absolute`

$a_{ij} = \sigma(z_{ij}) = |z_{ij}|$
* Graph
<img src='https://i.stack.imgur.com/BADmK.png'>

`Rectifier / ReLU`

Also known as the Rectified Linear Unit (ReLU), Max, or the Ramp Function. It is zero when x < 0 and linear with slope 1 when x > 0. Krizhevsky et al. reported roughly 6 times faster convergence with ReLU than with tanh in one of their experiments.

It was found to greatly accelerate the convergence of stochastic gradient descent compared to the sigmoid/tanh functions. It is argued that this is due to its linear, non-saturating form.

Compared to tanh/sigmoid neurons that involve expensive operations (exponentials, etc.), the ReLU can be implemented by simply thresholding a matrix of activations at zero.

Unfortunately, ReLU units can be fragile during training and can "die". For example, a large gradient flowing through a ReLU neuron could cause the weights to update in such a way that the neuron will never activate on any data-point again. If this happens, then the gradient flowing through the unit will forever be zero from that point on. With a proper setting of the learning rate (not too high) this is less frequently an issue.

* Function
<img src='https://cdn-images-1.medium.com/max/1000/1*ZH-D-NXMq82joIHyJocZ3w.png'>
* Graph
<img src='https://i.stack.imgur.com/a7hU1.png'>
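
The sketch below illustrates both points made above: ReLU amounts to thresholding activations at zero, and a neuron whose pre-activation is negative for every input receives a zero gradient (a "dead" unit). The weight, bias and inputs of the dead neuron are hypothetical values picked for the demonstration.

```python
def relu(z):
    """Threshold at zero."""
    return max(0.0, z)


def relu_derivative(z):
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


# ReLU applied element-wise to a small matrix of activations
activations = [[-1.5, 0.0, 2.0], [3.0, -0.2, 0.5]]
thresholded = [[relu(a) for a in row] for row in activations]
print(thresholded)  # [[0.0, 0.0, 2.0], [3.0, 0.0, 0.5]]

# A hypothetical neuron pushed into the negative regime by a large update:
# its pre-activation is negative for every input, so no gradient flows back.
weight, bias = -1.0, -5.0
inputs = [0.5, 1.0, 2.0]
dead_gradients = [relu_derivative(weight * x + bias) for x in inputs]
print(dead_gradients)  # [0.0, 0.0, 0.0]
```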

`Modified ReLU`

* Leaky ReLU

Leaky ReLUs are one attempt to fix the "dying ReLU" problem. Instead of being zero when x < 0, a leaky ReLU has a small non-zero slope (of 0.01, or so) for negative inputs.

<img src='https://cdn-images-1.medium.com/max/1000/1*qVrYlFchG7YiTX5WWx1r7A.png'>

* Parametric Rectified Linear Unit (PReLU)
<img src='https://cdn-images-1.medium.com/max/1000/1*pZ5_JgEGDHEWsTFoVfK_2g.png'>

* Randomized Leaky Rectified Linear Unit (RReLU)
<img src='https://cdn-images-1.medium.com/max/1000/1*2hEeawGNKs0WwGz9I0JqAg.png'>
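
The three variants share the same form and differ only in how the negative-side slope is chosen. The sketch below is a rough illustration under that assumption: `leaky_relu` takes a fixed slope, PReLU would use the same function with a slope learned during training, and `rrelu` samples the slope at random during training. The bounds 1/8 and 1/3 used for sampling are illustrative defaults, not values taken from this course.

```python
import random


def leaky_relu(z, alpha=0.01):
    """Leaky ReLU with a fixed small slope alpha for negative inputs.

    PReLU has the same form, except that alpha is a learned parameter.
    """
    return z if z > 0 else alpha * z


def rrelu(z, lower=1 / 8, upper=1 / 3, rng=random):
    """Randomized leaky ReLU (training mode): slope drawn uniformly from [lower, upper]."""
    alpha = rng.uniform(lower, upper)
    return z if z > 0 else alpha * z


print(leaky_relu(3.0))              # positive inputs pass through unchanged
print(leaky_relu(-2.0))             # negative inputs are scaled by 0.01
print(leaky_relu(-2.0, alpha=0.2))  # the PReLU form with a hypothetical learned slope
print(rrelu(-1.0, rng=random.Random(0)))
```

Because the slope for negative inputs is never exactly zero, some gradient always flows back through the unit.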

`Exponential Linear Unit (ELU)`

Exponential linear units try to push the mean activation closer to zero, which speeds up learning. ELUs have been reported to obtain higher classification accuracy than ReLUs on some benchmarks. α is a hyper-parameter that needs to be tuned.

* Function
<img src='https://cdn-images-1.medium.com/max/1000/1*rc2g2ZIm4lRCNt8gWMhjyA.png'>

* Graph
<img src='https://cdn-images-1.medium.com/max/1000/1*gfEr6eAKDZT8hHf2t7u7Lw.png'>
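
A minimal sketch of ELU, assuming the usual definition: the identity for positive inputs and α(exp(z) − 1) otherwise. The default α = 1.0 is a common choice used here only for illustration.

```python
import math


def elu(z, alpha=1.0):
    """Exponential linear unit: linear for z > 0, smooth saturation to -alpha for z <= 0."""
    return z if z > 0 else alpha * (math.exp(z) - 1.0)


for z in (-50.0, -1.0, 0.0, 2.0):
    print(f"z={z:6.1f}  elu={elu(z):.6f}")
```

Unlike ReLU, negative inputs produce negative outputs (bounded below by −α), which is what pulls the mean activation towards zero.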

`SoftPlus`

The derivative of the softplus function is the logistic function. ReLU and softplus are largely similar, except near 0, where softplus is smooth and differentiable. ReLU and its derivative are much easier and more efficient to compute than softplus, which has log(.) and exp(.) in its formulation.

* Function
<img src='https://cdn-images-1.medium.com/max/1000/1*EyVsonpsBRp5fdNa-djnBw.png'>

* Derivative
<img src='https://cdn-images-1.medium.com/max/1000/1*D3YKEgImpixP1lst0_uljQ.png'>

* Graph
<img src='https://cdn-images-1.medium.com/max/1000/1*w275Sin5bKAIaWBaJ6zXcA.png'>
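
The statement that the derivative of softplus is the logistic function can be checked numerically. The sketch below assumes the standard definition softplus(z) = log(1 + exp(z)) and compares a central finite-difference estimate of its slope with the sigmoid; the helper names and the step size `h` are choices made for this example.

```python
import math


def softplus(z):
    """Smooth approximation of ReLU: log(1 + exp(z))."""
    return math.log1p(math.exp(z))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def numerical_derivative(f, z, h=1e-5):
    """Central finite-difference estimate of f'(z)."""
    return (f(z + h) - f(z - h)) / (2.0 * h)


for z in (-2.0, 0.0, 3.0):
    print(f"z={z:5.1f}  softplus'={numerical_derivative(softplus, z):.6f}  "
          f"sigmoid={sigmoid(z):.6f}")
```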

`Maxout`

A maxout layer is simply a layer where the activation function is the max of the inputs.

The Maxout neuron computes the function $\max(w_1^T x + b_1, w_2^T x + b_2)$. Both ReLU and Leaky ReLU are special cases of this form (for example, for ReLU we have $w_1 = 0$ and $b_1 = 0$). The Maxout neuron therefore enjoys all the benefits of a ReLU unit (linear regime of operation, no saturation) and does not have its drawbacks (dying ReLU).

However, unlike the ReLU neurons it doubles the number of parameters for every single neuron, leading to a high total number of parameters.
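
The special cases mentioned above can be sketched as follows. The helper names and the example weights are hypothetical. Setting the first affine piece to zero recovers ReLU, and setting it to a scaled copy of the second piece (scale 0.01 here) recovers a leaky ReLU.

```python
def affine(weights, bias, inputs):
    """Weighted sum plus bias."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def maxout(inputs, w1, b1, w2, b2):
    """Maxout neuron with two affine pieces: max(w1.x + b1, w2.x + b2)."""
    return max(affine(w1, b1, inputs), affine(w2, b2, inputs))


# Hypothetical weights for the second affine piece
w2, b2 = [0.5, 1.0], 0.25

for x in ([1.0, -2.0], [3.0, 1.0]):
    z = affine(w2, b2, x)
    as_relu = maxout(x, [0.0, 0.0], 0.0, w2, b2)                      # w1 = 0, b1 = 0
    as_leaky = maxout(x, [0.01 * w for w in w2], 0.01 * b2, w2, b2)   # w1 = 0.01 * w2
    print(f"z={z:6.2f}  maxout as ReLU={as_relu:.4f}  maxout as leaky ReLU={as_leaky:.4f}")
```

Each maxout neuron stores two weight vectors and two biases, which is where the doubling of parameters comes from.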

`Softmax`

Also known as the Normalized Exponential. The softmax function is often used in the final layer of a neural network-based classifier. The output of a neuron is dependent on the other neurons in that layer. 

Softmax functions convert a raw value into a posterior probability. This provides a measure of certainty. It squashes the outputs of each unit to be between 0 and 1, just like a sigmoid function. But it also divides each output such that the total sum of the outputs is equal to 1.

The output of the softmax function is equivalent to a categorical probability distribution: it gives the probability of each class being the true one.

* Function
<img src='https://cdn-images-1.medium.com/max/1000/1*XipIlq5eCmQMUDxr4PNH_g.png'>
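
A minimal sketch of softmax in plain Python is shown below. Subtracting the maximum score before exponentiating is a common numerical-stability trick that is not mentioned in the text above; it does not change the result. The example scores are arbitrary.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that are positive and sum to 1."""
    shift = max(scores)  # stability trick: avoids overflow in exp for large scores
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax([2.0, 1.0, 0.1])
print(probabilities)
print(sum(probabilities))
```

Each output depends on all the scores in the layer through the shared denominator, which is why changing one score changes every probability.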

-----

It is very rare to mix and match different types of neurons in the same network, even though there is no fundamental problem with doing so.

ReLU and its variants should be preferred over sigmoid or tanh activation functions, and ReLUs are also faster to train. If ReLU is causing dead neurons, use Leaky ReLU or one of its other variants. Sigmoid and tanh suffer from the vanishing gradient problem and should generally be avoided in hidden layers, where ReLUs are usually the best choice. You can try tanh, but expect it to work worse than ReLU/Maxout.

Activation functions which are easily differentiable and easy to train should be used.

-----

Dropout - Regularization

Loss function

Feature Scaling

Model Initialization

Types of Neural Networks

CNN

RNN

LSTM

Autoencoders

ML Algorithms

Decision Trees

Random Forest

K-Means

K-Folds

Algorithms for Linear Regression