# Intro to Deep Learning
### Table of Contents
<p>
    <div class="lev1 toc-item">
        <a href="## Why Deep Learning and Why Now?" data-toc-modified-id="## Why Deep Learning and Why Now?">
            <span class="toc-item-num">1&nbsp;&nbsp;</span>Why Deep Learning and Why Now?
        </a>
    </div>
</p>

## 1. Why Deep Learning and Why Now?

### 1.1. Why Deep Learning
* In traditional `Machine Learning`, the `features are hand-engineered`. This approach is time-consuming, brittle, and does not scale.
* The key idea of `Deep Learning` is to learn the underlying patterns directly from data.
* Ex: Learning features such as lines and edges directly from image data.

### 1.2. Why Now?
* `Neural Networks` date back decades, but they have gained a lot of momentum recently for the reasons below.
    1. Big Data
        * Larger Datasets
        * Easier collection and storage
    2. Hardware
        * GPU (Graphics Processing Units)
        * Massively Parallelizable
    3. Software
        * Improved Techniques
        * New Models
        * Toolboxes

## 2. What is the fundamental building block of a Neural Network?
* A Neuron is the fundamental building block of a Neural Network.
* A Neuron is also called a `Perceptron`.
* The Perceptron is the structural building block of deep learning.

<img src='images/perceptron.png'>

* The idea of the Perceptron is very simple.
* Forward propagation of information through a Neuron:
    1. Define a set of `Inputs` to the Neuron: $x_{1}, \dots, x_{m}$
    2. Each of these inputs has a corresponding `Weight`: $w_{1}, \dots, w_{m}$
    3. Multiply each input by its corresponding weight and take the sum $\sum$ of all the products.
    4. Pass the sum through a `Non-Linear Activation` function, which produces the final output $\hat{y}$
* We usually also have a `Bias` term in the Neuron.
<img src='images/perceptron_with_bias_term.png'>

### 2.1. What is the purpose of a Bias Term?
* The `Bias` term shifts the activation function to the left or to the right, regardless of the input values.
* So, the `Bias` term is not affected by the input $X$.

$\hat{y} = g\left( w_{0} + \sum_{i=1}^{m} x_{i} w_{i} \right)$

The above summation equation can be rewritten using linear algebra in terms of vectors and dot products.

$\hat{y} = g\left(w_{0} + X^{T}W\right)$

\begin{align}
\text{where: } X = \begin{bmatrix}
x_{1} \\ \vdots \\ x_{m}
\end{bmatrix},
\quad
W = \begin{bmatrix}
w_{1} \\ \vdots \\ w_{m}
\end{bmatrix}
\end{align}

* So, to compute the output of a single perceptron we need to:
    1. Take the dot product of $X^{T}$ and $W$, which performs the element-wise multiplication and summation, and add the bias $w_{0}$
    2. Apply the non-linearity
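
The two steps above can be sketched in plain Python. This is only an illustration: the helper names `sigmoid` and `perceptron` and the input values are assumptions made for this example, and sigmoid is used as the non-linearity $g$.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, used here as the non-linearity g."""
    return 1.0 / (1.0 + math.exp(-z))

def perceptron(x, w, w0):
    """Return g(w0 + sum_i x_i * w_i) for an input list x and a weight list w."""
    z = w0 + sum(x_i * w_i for x_i, w_i in zip(x, w))  # bias + dot product
    return sigmoid(z)

# Illustrative values only (not taken from the figures)
x = [1.0, 2.0]
w = [3.0, -2.0]
w0 = 1.0
y_hat = perceptron(x, w, w0)  # z = 1 + 3 - 4 = 0, so y_hat = sigmoid(0) = 0.5
```

Changing `w0` alone moves `z`, and therefore the output, without touching the inputs, which is the shifting role of the bias described earlier.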

<img src='images/perceptron_with_activation_func.png'>

* Sigmoid: squashes any real-valued input into the range $(0, 1)$, so its output can be interpreted as a probability

<img src='images/activation_func.png'>

### 2.2. What is the Importance of an Activation Function?
* The purpose of an activation function is to introduce non-linearities into the network.
* Linear activation functions produce linear decisions no matter the network size.
* Non-linearities allow us to approximate arbitrarily complex functions.
<img src='images/activate_fun_ex.png'>
<img src='images/activation_func_ex2.png'>
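
A small sketch of why linear activations are not enough: two stacked linear layers collapse into a single linear layer, while inserting a non-linearity between them does not. The one-dimensional "layers", the weights, and the choice of ReLU are all assumptions made for this example.

```python
def linear(x, w, b):
    """A one-input, one-output 'layer' with identity (linear) activation."""
    return w * x + b

def two_linear_layers(x):
    # hypothetical weights, chosen only for illustration
    h = linear(x, 2.0, 1.0)       # first layer:  h = 2x + 1
    return linear(h, -3.0, 0.5)   # second layer: y = -3h + 0.5

def collapsed(x):
    # the same mapping written as ONE linear layer: y = -6x - 2.5
    return linear(x, -6.0, -2.5)

def relu(z):
    return max(0.0, z)

def two_layers_with_relu(x):
    h = relu(linear(x, 2.0, 1.0))  # non-linearity between the layers
    return linear(h, -3.0, 0.5)

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
same_as_linear = all(abs(two_linear_layers(x) - collapsed(x)) < 1e-12 for x in xs)
relu_outputs = [two_layers_with_relu(x) for x in xs]
```

Here `same_as_linear` is `True`, whereas `relu_outputs` is flat for negative inputs and then slopes downward, a shape no single linear layer can produce.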

<img src='images/perceptron_example.png'>
<hr>

<img src='images/perceptron_example_1.png'>

<img src='images/perceptron_ex_2.png'>

## 3. Building Neural Networks with Perceptrons

<img src='images/perceptron_simplified.png'>
<hr>

<img src='images/ps_1.png'>
<hr>

### Multi Output Perceptron
* Because all the inputs are densely connected to all the outputs, these layers are called `Dense` layers.
<img src='images/ps_2.png'>

### Build a Dense Layer from Scratch

```python
import tensorflow as tf

class MyDenseLayer(tf.keras.layers.Layer):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        # initialize weights and biases
        self.W = self.add_weight(shape=[input_dim, output_dim])
        self.b = self.add_weight(shape=[1, output_dim])

    def call(self, inputs):
        # forward propagate the inputs
        z = tf.matmul(inputs, self.W) + self.b
        # feed through a non-linear activation
        output = tf.math.sigmoid(z)
        return output
```
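We can achieve the same using the built-in Keras layer (passing `activation='sigmoid'` to match the non-linearity used above):

```python
import tensorflow as tf
layer = tf.keras.layers.Dense(units=2, activation='sigmoid')
```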

### Single Layer Neural Network
<img src='images/single_layer_nn.png'>
<img src='images/single_layer_nn1.png'>

### Multi Output Perceptron
<img src='images/multi_output_perc.png'>

#### Code to Implement it:
```python
import tensorflow as tf
n = 4  # number of hidden units (placeholder value)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(n),
    tf.keras.layers.Dense(2)
])
```

### Deep Neural Network
<img src='images/deep_nn.png'>

#### Code to implement:
```python
import tensorflow as tf
n1, n2 = 8, 4  # hidden layer sizes (placeholder values)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(n1),
    tf.keras.layers.Dense(n2),
    # ... more hidden layers ...
    tf.keras.layers.Dense(2)
])
```


## 4. Applying Neural Networks

#### Example Problem: Will I pass this class?

Let's start with a simple two-feature model:
* $x_{1}$ = Number of lectures attended
* $x_{2}$ = Hours spent on final project
<img src='images/ex_problem.png'>

<img src='images/ex_problem1.png'>

* When we pass in the data point $x^{(1)}=[4,5]$, the perceptron outputs 0.1 even though the actual value is $1$, because the network has not been trained yet.

### Quantifying Loss
* The loss of our network measures the cost incurred from incorrect predictions.
<img src='images/q_loss.png'>

### Empirical Loss / Objective Function / Cost Function / Empirical Risk
<img src='images/el.png'>

* The example problem that we are discussing is a Binary Classification problem.
* In a Binary Classification task we use `Binary Cross Entropy Loss` as the loss function.

### Binary Cross Entropy Loss
* Compares the predicted probability distribution with the actual labels; the loss grows as the predictions move further away from the actual values.
<img src='images/bcel.png'>

```python
import tensorflow as tf
# y: one-hot labels, predicted: raw (pre-softmax) logits, both of shape [batch, 2]
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=predicted))
```
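
To make the loss itself concrete, here is a minimal plain-Python sketch of the standard binary cross-entropy formula, $J = -\frac{1}{n}\sum_{i}\big[y^{(i)}\log\hat{y}^{(i)} + (1-y^{(i)})\log(1-\hat{y}^{(i)})\big]$. It assumes the predictions are already probabilities in $(0, 1)$. The function name, the clipping constant `eps`, and the sample values are assumptions made for this example.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Made-up labels and predictions, for illustration only
confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])
```

Predictions that are confident and correct give a small loss, while predictions that are confident and wrong are penalized heavily, so `confident_wrong` is much larger than `confident_right`.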

* Suppose that instead of predicting whether a student will pass or fail, we want to predict the student's grade/score.
* The task then becomes a Regression problem, so the loss function we use also changes.

### Mean Squared Error Loss
<img src='images/mser.png'>

```python
import tensorflow as tf
loss = tf.reduce_mean(tf.square(tf.subtract(y,predicted)))
```
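
The same computation in plain Python, as a sketch: square each difference between the actual and predicted value, then average. The function name and the scores are made up for illustration.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical final scores and predictions
mse = mean_squared_error([90.0, 60.0], [80.0, 70.0])  # (10**2 + 10**2) / 2 = 100.0
```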

## 5. Training Neural Networks

### Loss Optimization
* We want to find the network weights that achieve the lowest loss.
<img src='images/lo.png'>
<img src='images/lo1.png'>

### Gradient Descent
<img src='images/gd.png'>

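```python
import tensorflow as tf

lr = 0.01  # learning rate (placeholder value)
weights = tf.Variable(tf.random.normal(shape=[2]))  # random initial weights

def compute_loss(w):  # placeholder loss; substitute the network's loss
    return tf.reduce_sum(tf.square(w))

for _ in range(100):  # fixed step count here; in practice loop until convergence
    with tf.GradientTape() as g:
        loss = compute_loss(weights)
    gradient = g.gradient(loss, weights)  # computed after the tape has recorded the loss
    weights.assign_sub(lr * gradient)  # in-place update keeps `weights` a tf.Variable
```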
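
The same loop can be followed by hand on a problem small enough to check. The sketch below runs gradient descent on a hypothetical one-dimensional loss $J(w) = (w - 3)^2$, whose gradient is $2(w - 3)$ and whose minimum is at $w = 3$. The loss, learning rate, and step count are assumptions chosen for illustration.

```python
import random

def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)  # dJ/dw, derived by hand for this loss

def gradient_descent(w, lr=0.1, steps=100):
    for _ in range(steps):
        w = w - lr * grad(w)  # step against the gradient
    return w

random.seed(0)
w_init = random.gauss(0.0, 1.0)  # random initial weight
w_final = gradient_descent(w_init)
```

Each update shrinks the distance to the minimum by a constant factor of $1 - 2 \cdot 0.1 = 0.8$, so `w_final` ends up very close to 3.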

### Computing Gradients: Backpropagation
<img src='images/bp.png'>
<img src='images/bp1.png'>
<img src='images/bp2.png'>
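
Backpropagation is the chain rule applied from the loss back to each weight. The sketch below does this by hand for a tiny hypothetical network, $x \rightarrow h = \sigma(w_1 x) \rightarrow \hat{y} = w_2 h$ with a squared-error loss, and compares the result with a finite-difference estimate. The architecture, loss, and numbers are assumptions made for this example and are not taken from the figures.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, w2):
    """Tiny network: x -> h = sigmoid(w1 * x) -> y_hat = w2 * h."""
    h = sigmoid(w1 * x)
    return h, w2 * h

def loss_fn(x, y, w1, w2):
    _, y_hat = forward(x, w1, w2)
    return (y_hat - y) ** 2

def backprop(x, y, w1, w2):
    """Gradients of the loss w.r.t. w1 and w2 via the chain rule."""
    h, y_hat = forward(x, w1, w2)
    dJ_dyhat = 2.0 * (y_hat - y)
    dJ_dw2 = dJ_dyhat * h                # dy_hat/dw2 = h
    dJ_dh = dJ_dyhat * w2                # dy_hat/dh  = w2
    dJ_dw1 = dJ_dh * h * (1.0 - h) * x   # dh/dz = h(1 - h), dz/dw1 = x
    return dJ_dw1, dJ_dw2

def numerical_grad(f, w, eps=1e-6):
    """Central finite-difference estimate of df/dw."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)

x, y, w1, w2 = 1.5, 1.0, 0.4, -0.7
g1, g2 = backprop(x, y, w1, w2)
n1 = numerical_grad(lambda w: loss_fn(x, y, w, w2), w1)
n2 = numerical_grad(lambda w: loss_fn(x, y, w1, w), w2)
```

The analytic gradients `g1`, `g2` agree with the numerical estimates `n1`, `n2` to within finite-difference error, which is a common sanity check for a hand-written backward pass.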

## 6. Neural Networks in Practice: Optimization
* Training a Neural Network can be difficult because loss functions can be difficult to optimize.
<img src='images/ogd.png'>

### How can we set the learning rate?
* A small learning rate converges slowly and can get stuck in false local minima.
* A large learning rate overshoots, becomes unstable, and diverges.
* A stable learning rate converges smoothly and avoids local minima.
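
The slow-convergence and divergence cases can be seen on a toy problem. The sketch below runs gradient descent on a hypothetical loss $J(w) = w^2$ with three learning rates. Because this loss is convex it has no local minima, so it only illustrates speed and stability. The specific rates and step count are assumptions.

```python
def run_gd(lr, w=1.0, steps=50):
    """Gradient descent on J(w) = w**2, whose gradient is 2*w."""
    for _ in range(steps):
        w = w - lr * 2.0 * w
    return w

too_small = abs(run_gd(0.001))  # barely moves away from the starting point
moderate = abs(run_gd(0.1))     # shrinks steadily toward the minimum at 0
too_large = abs(run_gd(1.1))    # every step overshoots further: diverges
```

After 50 steps the small rate is still close to where it started, the moderate rate is near zero, and the large rate has grown by several orders of magnitude.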

### How to deal with this?
#### Idea 1:
* Try lots of different learning rates and see what works **just right**

#### Idea 2:
* Do something smarter
* Design an adaptive learning rate that **adapts** to the landscape.

### Adaptive Learning Rates
* Learning rates are no longer fixed
* Can be made larger or smaller depending on:
    * how large the gradient is
    * how fast learning is happening
    * size of particular weights
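
As one concrete example of an adaptive scheme, the sketch below implements an Adagrad-style update, in which each weight's step is divided by the square root of its accumulated squared gradients. Adagrad is picked here purely for illustration. The loss function and hyperparameters are assumptions.

```python
import math

def adagrad(grad_fn, w, lr=0.5, steps=200, eps=1e-8):
    """Adagrad-style updates for a list of weights w."""
    w = list(w)
    accum = [0.0] * len(w)  # running sum of squared gradients, one per weight
    for _ in range(steps):
        g = grad_fn(w)
        for i in range(len(w)):
            accum[i] += g[i] ** 2
            # the effective step size shrinks for weights with large past gradients
            w[i] -= lr * g[i] / (math.sqrt(accum[i]) + eps)
    return w

# Hypothetical loss J(w) = 100*w0**2 + w1**2: very different gradient scales
def grad_fn(w):
    return [200.0 * w[0], 2.0 * w[1]]

w_final = adagrad(grad_fn, [1.0, 1.0])
```

Although the gradient along `w[0]` is 100 times larger than along `w[1]`, the per-weight scaling lets both coordinates converge at essentially the same pace. A single fixed learning rate would have to be tuned to the steeper direction.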
    
### Gradient Descent Algorithms
<img src='images/gda.png'>

## Putting it all together
```python
import tensorflow as tf
model = tf.keras.Sequential([...])
# pick your favorite optimizer
optimizer = tf.keras.optimizers.SGD()
# x, y, and compute_loss are assumed to be defined elsewhere
while True:
    with tf.GradientTape() as tape:
        # forward pass through the network (inside the tape so it is recorded)
        prediction = model(x)
        # compute the loss
        loss = compute_loss(y, prediction)
    # update the weights using the gradient
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
```

## 7. Neural Networks in Practice: Mini-Batches

* Computing the gradient over the entire dataset is computationally very expensive.
* Computing the gradient from a single data point is cheap, but the estimate is very noisy.
* So, to mitigate this, instead of computing the gradient over all the points or over just one, we compute it over a batch of points.

#### Stochastic Gradient Descent
* With mini-batches, the gradient is fast to compute and a much better estimate of the true gradient than one taken from a single point

<img src='images/sgd.png'>


### Mini-batches while training

* More accurate estimation of the gradient
    * Smoother convergence
    * Allows for larger learning rates
* Mini-batches lead to fast training
    * Can parallelize computation and achieve significant speed increases on GPUs
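
A minimal sketch of mini-batch training in plain Python, assuming a one-weight model $y = w x$ trained with mean squared error on synthetic data. The data, batch size, learning rate, and function names are all assumptions for illustration.

```python
import random

def batch_gradient(w, batch):
    """Gradient of the mean squared error of y = w*x over the given examples."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def minibatch_sgd(data, w=0.0, lr=0.05, batch_size=8, epochs=50, seed=0):
    rng = random.Random(seed)
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)  # draw different mini-batches every epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            w -= lr * batch_gradient(w, batch)  # one update per mini-batch
    return w

# Synthetic, noise-free data generated from y = 2x (an assumption for illustration)
data = [(x / 10.0, 2.0 * x / 10.0) for x in range(-20, 21)]
w_fit = minibatch_sgd(data)
```

Each update only touches `batch_size` examples, so it costs a fraction of a full-dataset gradient, yet `w_fit` still recovers the slope of 2 used to generate the data.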

## 8. Neural Networks in Practice: Overfitting
* Overfitting: a problem of generalization, where the model fits the training data too closely and performs poorly on unseen data

### The Problem of Overfitting
<img src='images/over_fitting.png'>

* To handle overfitting in Neural Networks we rely on a family of techniques called Regularization.

### Regularization
* What is it?
    * Regularization is a technique that constrains our optimization problem to discourage learning overly complex models
* Why do we need it?
    * It improves the generalization of our model on unseen data
* The most popular Regularization techniques in Deep Learning are:
    * Dropout
    * Early Stopping

#### Dropout
* During training, randomly set some activations to 0.
    * Typically drop 50% of activations in a layer.
    * Forces the network to not rely on any single node.

<img src='images/drop_out.png'>
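
A sketch of dropout applied to one layer's activations. It uses the common "inverted dropout" convention, in which the surviving activations are rescaled by `1 / (1 - rate)` so their expected value is unchanged. That rescaling is an assumption added here and is not described in the notes above. The function and argument names are also illustrative.

```python
import random

def dropout(activations, rate=0.5, training=True, rng=random):
    """Zero each activation with probability `rate` during training."""
    if not training or rate == 0.0:
        return list(activations)  # no dropout at inference time
    keep = 1.0 - rate
    # inverted dropout: rescale survivors so the expected activation is unchanged
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
layer_output = [1.0] * 10
dropped = dropout(layer_output, rate=0.5, rng=rng)  # each entry is 0.0 or 2.0
unchanged = dropout(layer_output, training=False)   # returned as-is
```

A different random subset is zeroed on every call, so no single node can be relied upon during training.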

#### Early Stopping
* Stop training before we have a chance to overfit.
<img src='images/es.png'>
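
One common way to decide when to stop is to watch the loss on held-out validation data and stop once it has not improved for a few epochs. The sketch below assumes such a `patience` rule, which the notes do not spell out. The validation losses are made up.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch whose weights we would keep."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch

# Made-up validation losses: improving at first, then rising as the model overfits
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
stop_at = early_stopping_epoch(val_losses)  # epoch 3 has the lowest validation loss
```

In a real training loop the weights from `stop_at` would be saved as a checkpoint and restored once training halts.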

## Core Foundation Review
<img src='images/foundation_review.png'>