### Concepts
1. Intro to neural networks and their mathematics
1. Intro to PyTorch
1. Transfer learning
1. CNN
1. Style transfer
1. RNN

### Data and exercises:
1. MNIST
1. CIFAR-100
1. Cat-Dog
1. Flower Images
1. IMDb movie database

and exercises related to these

#### Finally, we will do one project, where we will build, train, validate, and test a flower image classifier model from scratch

## To display the quizzes, run the following code cell once for every new kernel

In [None]:
import sys
sys.path.append('../helper')
from quizzy import list_quiz, show_quiz

## Perceptron

<img src="assets/linear_boundary.png" width=900px>

## 3-D

3-D visualization

<img src="assets/three_dimension.png" width=900px>

## Higher dimensional
<img src="assets/higher_dimension.PNG" width=900px>

## Why neural network
A perceptron resembles a neuron in the brain.

<img src="assets/neurons.PNG" width=900px>

In [None]:
show_quiz(name='percetron_trick')

## Non-Linear Regions

What if we want to reject the student?

<img src="assets/non_linear_reject_student.PNG" width=900px>


No matter what score the student gets on the test, if the grades are terrible, the student will be rejected. So the data will look like this:


<img src="assets/non_linear_data.PNG" width=900px>


How can this be separated?


Circle
<img src="assets/non_linear_circle.PNG" width=900px>

Two lines
<img src="assets/non_linear_multi_lines.PNG" width=900px>

Curve
<img src="assets/non_linear_curve.PNG" width=900px>

Let's go with the curve...



Unfortunately, the perceptron algorithm won't work this time. We need something more complex: we need to redefine the perceptron algorithm for lines so that it generalizes to other problems.


## Error Functions

The error function tells us how far we are from the solution. It gives us a distance; we then look around to see which step takes us closest to the solution, take that step, and repeat.

<img src="assets/mount_clouds.png" width=900px>


Here, the key metric used to solve the problem is height. We will call the height the error.

<img src="assets/mount_height.png" width=900px>

## Let's try to split the data using the number of misclassified points as the error:

<img src="assets/split_data_error.PNG" width=900px>

When we try to descend, the discrete error is still 2 after a small step, so it gives us no direction to move in.
<img src="assets/error_function_discrete_continuous.PNG" width=900px>


## Gradient Descent

<img src="assets/gd_error.PNG" width=900px>

In [None]:
show_quiz(name='conditions_for_gradient_descent')

## Discrete to Continuous Prediction

<img src="assets/discrete_to_continuous_prediction.PNG" width=900px>

<img src="assets/activation_function_step_to_sigmoid.PNG" width=900px>

<img src="assets/sigmoid_prediction.PNG" width=900px>

<img src="assets/logit_to_sigmoid.PNG" width=900px>

<img src="assets/sigmoid_perceptron.PNG" width=900px>

In [None]:
show_quiz(name='perceptron_boundry')

## Multi-Class Classification and Softmax
<img src="assets/softmax_1.PNG" width=900px>

<img src="assets/softmax_2.PNG" width=900px>

<img src="assets/softmax_3.PNG" width=900px>

### What could be the problems with this method?
<img src="assets/softmax_4.PNG" width=900px>

In [None]:
show_quiz(name='convert_number_to_positive')

<img src="assets/softmax_5.PNG" width=900px>

## Softmax function and its definition
<img src="assets/softmax_6.PNG" width=900px>

## Quiz: Coding Softmax
And now, it's your time to shine! Let's code the formula for the Softmax function in Python.

Refer to the exercise notebook for softmax coding!
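If you want something to compare your attempt against, here is one minimal sketch of softmax using NumPy. The function name `softmax` and the example scores are assumptions made for illustration; the exercise notebook may use a different signature.

```python
import numpy as np

def softmax(scores):
    """Return softmax probabilities for a 1-D list or array of scores."""
    scores = np.asarray(scores, dtype=float)
    # Subtracting the maximum leaves the result unchanged
    # but keeps exp() from overflowing for large scores.
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / exp_scores.sum()

# Hypothetical scores for three classes.
probabilities = softmax([2.0, 1.0, 0.0])
print(probabilities)        # three positive numbers
print(probabilities.sum())  # they add up to 1 (up to rounding)
```

Because every score goes through the exponential first, negative scores still end up as positive probabilities.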

## One-Hot Encoding

So far, all our algorithms are numerical: they need numbers as input. But the input data will not always look like numbers.

<img src="assets/one_hot_encoding1.png" width=900px>

<img src="assets/one_hot_encoding2.png" width=900px>
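As a rough illustration of the idea, the sketch below turns a list of category labels into one-hot rows with NumPy. The helper name `one_hot_encode` and the animal labels are hypothetical; in practice a library such as pandas or scikit-learn would usually handle this.

```python
import numpy as np

def one_hot_encode(labels):
    """Map each label to a row with a single 1 in its class column."""
    classes = sorted(set(labels))                 # fixed column order
    column = {name: i for i, name in enumerate(classes)}
    encoded = np.zeros((len(labels), len(classes)), dtype=int)
    for row, label in enumerate(labels):
        encoded[row, column[label]] = 1
    return classes, encoded

# Hypothetical non-numeric input data.
classes, encoded = one_hot_encode(["duck", "beaver", "walrus", "duck"])
print(classes)
print(encoded)
```

Each class gets its own column, so no artificial ordering between the classes is introduced.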

## Maximum Likelihood

Probability will be one of our best friends as we go through Deep Learning. In this lesson, we'll see how we can use probability to evaluate (and improve!) our models.


Let's say I have two models that give a student an 80% and a 55% probability of getting accepted. Which one is the better model?

The best model is the one that gives the highest probabilities to the events that actually happened to us, whether that is acceptance or rejection.

This method is called ***Maximum Likelihood***. We pick the model that gives the existing labels the highest probability. Thus, by maximizing the probability, we can pick the best possible model.


<img src="assets/maximum_likelihood1.png" width=900px>


### Goal is to maximize the probability:

<img src="assets/maximum_likelihood2.png" width=900px>

This method is called maximum likelihood.
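The comparison can be sketched in a few lines. The probabilities below are made-up values, not the ones from the slides: each entry is the probability a model assigns to the label a point actually has, and the points are assumed to be independent so the probabilities multiply.

```python
import numpy as np

def likelihood(probabilities_of_actual_labels):
    """Probability of seeing all the actual labels, assuming independent points."""
    return float(np.prod(probabilities_of_actual_labels))

# Hypothetical probabilities that two models give to the true labels of four points.
model_a = [0.6, 0.2, 0.1, 0.7]
model_b = [0.7, 0.9, 0.8, 0.6]

print(likelihood(model_a))
print(likelihood(model_b))  # the larger value marks the preferred model
```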

In [None]:
show_quiz(name='maximum_likelihood')

## Maximizing Probabilities

In this lesson and quiz, we will learn how to maximize a probability, using some math. Nothing more than high school math, so get ready for a trip down memory lane!

*Is there any connection between error function and Probability?* Let's see.


### Issue with product

<img src="assets/product_to_sum.png" width=900px>

In [None]:
show_quiz(name='product_to_sum')

## Cross Entropy

<img src="assets/cross_entropy1.PNG" width=900px>

**Low Value -> Good cross entropy**

**High Value -> Bad cross entropy**


* *The negative logarithm of a large probability (close to 1) is a small number*

#### Misclassified points contribute larger values

<img src="assets/cross_entropy2.PNG" width=900px>

If we have a bunch of events and probabilities, how likely is it that those events happen based on the probabilities?

**If it's very likely -> small cross entropy**

**If it's unlikely -> large cross entropy**


#### Let's elaborate

<img src="assets/cross_entropy3.png" width=900px>

<img src="assets/cross_entropy4.png" width=900px>

<img src="assets/cross_entropy5.png" width=900px>

<img src="assets/cross_entropy6.png" width=900px>

***The formula really is the sum of the negatives of the logarithms, which is precisely the cross entropy.***

***So, the cross entropy really tells us whether two vectors are similar or different.***

## Multi-Class Cross-Entropy

<img src="assets/multi_class_cross_entropy1.PNG" width=900px>

<img src="assets/multi_class_cross_entropy2.PNG" width=900px>

<img src="assets/multi_class_cross_entropy3.PNG" width=900px>

Let's add some parameters:

<img src="assets/multi_class_cross_entropy4.PNG" width=900px>


In [None]:
show_quiz(name='cross_entropy')

In [None]:
show_quiz(name='cross_entropy_probability_relation')

## Quiz: Coding Cross-entropy
Now, it's time to shine! Let's code the formula for cross-entropy in Python.

Refer to the exercise notebook for cross-entropy coding!
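For reference, here is one possible sketch of the two-class cross-entropy, assuming `labels` holds 0/1 values and `probabilities` holds the predicted probability of label 1 for each point. The function name and the example numbers are illustrative and may differ from the exercise notebook.

```python
import numpy as np

def cross_entropy(labels, probabilities):
    """Sum of negative log-probabilities assigned to the actual labels."""
    y = np.asarray(labels, dtype=float)
    p = np.asarray(probabilities, dtype=float)
    return float(-np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

labels = [1, 1, 0]                                # hypothetical events
good = cross_entropy(labels, [0.8, 0.7, 0.1])     # probabilities agree with labels
bad = cross_entropy(labels, [0.2, 0.3, 0.9])      # probabilities contradict labels
print(good, bad)
```

The second call returns the larger value, matching the rule that unlikely events give a large cross entropy.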

# Logistic Regression
Now, we're finally ready for one of the most popular and useful algorithms in Machine Learning, and the building block of all that constitutes Deep Learning. The Logistic Regression Algorithm. And it basically goes like this:

* Take your data
* Pick a random model
* Calculate the error
* Minimize the error, and obtain a better model
* Enjoy!

## Calculating the Error Function

<img src="assets/cross_entropy7.PNG" width=900px>

We concluded that the second model is better, because its cross entropy is much smaller.

<img src="assets/error_function1.PNG" width=900px>

<img src="assets/error_function2.PNG" width=900px>

<img src="assets/error_function3.PNG" width=900px>

<img src="assets/error_function4.PNG" width=900px>

<img src="assets/error_function5.PNG" width=900px>

<img src="assets/error_function6.PNG" width=900px>

***We now have one formula for two classes and one for m classes, and the two look different. But it's a nice exercise to convince yourself that they are the same.***

## Minimizing the error function

<img src="assets/error_function7.PNG" width=900px>

***We will use Gradient Descent to minimize the error function.***

<img src="assets/error_function8.PNG" width=900px>

## Gradient Descent

In this lesson, we'll learn the principles and the math behind the gradient descent algorithm.

<img src="assets/gradient_descent1.PNG" width=900px>

The gradient of $E$ is the vector formed by the partial derivatives of $E$ with respect to $w_1$ and $w_2$. This gradient tells us the direction to move in if we want to increase the error function the most. Thus, the negative of the gradient tells us how to decrease the error function the most.

---

<img src="assets/gradient_descent2.PNG" width=900px>

Once we take a step, we will be in a lower position. So, we'll do it again and again, until we get to the bottom of the mountain.

---

<img src="assets/gradient_descent3.PNG" width=900px>

This is how we calculate the gradient. As before, we don't want to make any dramatic changes, so we introduce the learning rate $\alpha$ and we will multiply the gradient by that number.

After updating the weights and bias, the prediction we have now is better than the previous one. This is precisely the ***Gradient Descent*** step.
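To see the repeated step in isolation, here is a minimal sketch on a toy error function $E(w_1, w_2) = w_1^2 + w_2^2$. This bowl-shaped function, the starting point, and the learning rate are assumptions chosen for illustration; it is not the error function from the slides.

```python
import numpy as np

def error(w):
    """Toy error function: a bowl with its minimum at the origin."""
    return float(np.sum(w ** 2))

def gradient(w):
    """Vector of partial derivatives of the toy error function."""
    return 2 * w

w = np.array([3.0, -4.0])   # hypothetical starting weights
learning_rate = 0.1         # alpha
errors = [error(w)]

for _ in range(50):
    # Step against the gradient, scaled by the learning rate.
    w = w - learning_rate * gradient(w)
    errors.append(error(w))

print(errors[0], errors[-1])  # the error shrinks with every step
```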

## Gradient Calculation
In the last few lessons, we learned that in order to minimize the error function, we need to take some derivatives. So let's get our hands dirty and actually compute the derivative of the error function. The first thing to notice is that the sigmoid function has a really nice derivative. Namely,

$$\sigma^{\prime}(x) = \sigma(x) (1 - \sigma(x))$$
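A quick numerical check of this identity, comparing the formula against a central finite difference. The step size `h` and the sample points are arbitrary choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    """Derivative of the sigmoid, written in terms of the sigmoid itself."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-5.0, 5.0, 11)
h = 1e-5
# Central difference approximation of the derivative.
numerical = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
max_gap = float(np.max(np.abs(numerical - sigmoid_prime(x))))
print(max_gap)  # tiny, so the formula and the approximation agree
```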

Now, let's do some mathematics...
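Here is a sketch of the derivation for a single point, assuming the prediction is $\hat{y} = \sigma(w_1 x_1 + \dots + w_n x_n + b)$ and the error at that point is the two-class cross-entropy:

$$E = -y \ln(\hat{y}) - (1 - y) \ln(1 - \hat{y})$$

Using the sigmoid derivative and the chain rule:

$$\frac{\partial \hat{y}}{\partial w_j} = \hat{y} (1 - \hat{y}) \, x_j$$

$$\frac{\partial E}{\partial w_j} = -\frac{y}{\hat{y}} \frac{\partial \hat{y}}{\partial w_j} + \frac{1 - y}{1 - \hat{y}} \frac{\partial \hat{y}}{\partial w_j} = -y (1 - \hat{y}) x_j + (1 - y) \hat{y} x_j = -(y - \hat{y}) x_j$$

The same calculation for the bias gives

$$\frac{\partial E}{\partial b} = -(y - \hat{y})$$

so the gradient at the point is a scalar times the coordinates of the point:

$$\nabla E = -(y - \hat{y}) (x_1, \dots, x_n, 1)$$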

In [None]:
show_quiz(name='scalar_gradient_relation')

So, a small gradient means we'll change our coordinates by a little bit, and a large gradient means we'll change our coordinates by a lot.

If this sounds anything like the perceptron algorithm, this is no coincidence! We'll see it in a bit.

## Gradient Descent Step

The gradient descent step simply consists of subtracting a multiple of the gradient of the error function at every point. This updates the weights and bias as follows: $w_i' \leftarrow w_i + \alpha (y - \hat{y}) x_i$ and $b' \leftarrow b + \alpha (y - \hat{y})$.

*Note:* Since we've taken the average of the errors, the term we are adding should be $\frac{1}{m} \cdot \alpha$ instead of $\alpha$, but as $\alpha$ is a constant, then in order to simplify calculations, we'll just take $\frac{1}{m} \cdot \alpha$  to be our learning rate, and abuse the notation by just calling it $\alpha$.

## Logistic Regression Algorithm

<img src="assets/logistic_regression1.PNG" width=900px>

<img src="assets/logistic_regression2.PNG" width=900px>

---

<img src="assets/logistic_regression3.PNG" width=900px>

Have we seen something like that before?

We look at each point, and what each point does is add a multiple of itself to the weights of the line, in order to get the line to move closer to it if it is misclassified.
This is pretty much what the perceptron algorithm was doing. We'll look at the similarities.
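The whole loop can be sketched as follows. This is a vectorized variant that averages the update over all points in each epoch, not the notebook's solution; the function name, the toy dataset, the learning rate, and the number of epochs are all assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_logistic_regression(features, labels, learning_rate=0.5, epochs=200, seed=0):
    """Start from a random model and repeatedly take gradient descent steps."""
    rng = np.random.default_rng(seed)
    weights = rng.normal(scale=0.1, size=features.shape[1])  # random starting model
    bias = 0.0
    for _ in range(epochs):
        predictions = sigmoid(features @ weights + bias)
        residual = labels - predictions                      # (y - y_hat) per point
        # Each point adds a multiple of itself to the weights.
        weights += learning_rate * (features.T @ residual) / len(labels)
        bias += learning_rate * residual.mean()
    return weights, bias

# Hypothetical toy data: two clearly separated classes.
features = np.array([[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.8]])
labels = np.array([0.0, 0.0, 1.0, 1.0])

weights, bias = train_logistic_regression(features, labels)
predicted = (sigmoid(features @ weights + bias) > 0.5).astype(float)
print(predicted)
```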

# Notebook: Implementing Gradient Descent
In the following notebook, you'll be able to implement the gradient descent algorithm on the following sample dataset with two classes.

<div style="text-align:center;">
    <img src="assets/dataset.png" width=450px>
    Red and blue data points with some overlap.
</div>

### Workspace
To open this notebook, you have the following option:

> * Clone the repo from [Github](https://github.com/ashwinwadte/deep-learning-v2-pytorch) and open the notebook ***GradientDescent.ipynb*** in the ***intro-neural-networks > gradient-descent*** folder. You can either download the repository via the command line with `git clone https://github.com/udacity/deep-learning-v2-pytorch.git`, or download it as an archive file from [this link](https://github.com/ashwinwadte/deep-learning-v2-pytorch/archive/master.zip).

### Instructions
In this notebook, you'll be implementing the functions that build the gradient descent algorithm, namely:

`sigmoid`: The sigmoid activation function.

`output_formula`: The formula for the prediction.

`error_formula`: The formula for the error at a point.

`update_weights`: The function that updates the parameters with one gradient descent step.

Once you have implemented them, run the `train` function. It will graph several of the lines drawn in successive gradient descent steps. It will also graph the error function, and you can see it decreasing as the number of epochs grows.

This is a self-assessed lab. If you need any help or want to check your answers, feel free to check out the solutions notebook in the same folder.

## Perceptron vs Gradient Descent

<img src="assets/gd_vs_perceptron1.PNG" width=900px>

---

<img src="assets/gd_vs_perceptron2.PNG" width=900px>

## Continuous Perceptron

<img src="assets/continuous_perceptron1.PNG" width=900px>

---

<img src="assets/continuous_perceptron2.PNG" width=900px>

## Non-Linear Models

<img src="assets/non_linear_models1.PNG" width=900px>

---

<img src="assets/non_linear_models2.PNG" width=900px>

***Everything will be the same as before, except that this boundary equation will not be linear. That's where neural networks come into play.***

## Neural Network (NN) Architecture
Ok, so we're ready to put these building blocks together, and build great Neural Networks! (Or Multi-Layer Perceptrons, however you prefer to call them.)

<img src="assets/nn_arch1.PNG" width=900px>
<img src="assets/nn_arch2.PNG" width=900px>
<img src="assets/nn_arch3.PNG" width=900px>
<img src="assets/nn_arch4.PNG" width=900px>
<img src="assets/nn_arch5.PNG" width=900px>
<img src="assets/nn_arch6.PNG" width=900px>
<img src="assets/nn_arch7.PNG" width=900px>
<img src="assets/nn_arch8.PNG" width=900px>
<img src="assets/nn_arch9.PNG" width=900px>
<img src="assets/nn_arch10.PNG" width=900px>
<img src="assets/nn_arch11.PNG" width=900px>

In [None]:
show_quiz(name="perceprton_weights_bias")

### Multiple layers
Now, not all neural networks look like the one above. They can be way more complicated! In particular, we can do the following things:

* Add more nodes to the input, hidden, and output layers.
* Add more layers.

We'll see the effects of these changes.

#### Architecture of neural network

<img src="assets/nn_arch12.PNG" width=900px>

###### More hidden nodes
<img src="assets/nn_arch13.PNG" width=900px>

###### More input nodes
<img src="assets/nn_arch14.PNG" width=900px>

###### More output nodes
<img src="assets/nn_arch15.PNG" width=900px>

###### More layers - deep neural network
<img src="assets/nn_arch16.PNG" width=900px>

##### In general, we can do this many times and obtain a highly complex model with lots of hidden layers
<img src="assets/nn_arch17.PNG" width=900px>

### Multi-Class Classification
And here we elaborate a bit more on what can be done if our neural network needs to model data with more than one output.

<img src="assets/nn_arch18.PNG" width=900px>
But it seems like overkill.

---

<img src="assets/nn_arch19.PNG" width=900px>

In [None]:
show_quiz(name='output_layer')

## Feedforward
Feedforward is the process neural networks use to turn the input into an output. Let's study it more carefully, before we dive into how to train the networks.

<img src="assets/feed_forward1.PNG" width=900px>
<img src="assets/feed_forward2.PNG" width=900px>
<img src="assets/feed_forward3.PNG" width=900px>
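A minimal sketch of feedforward for a network with one hidden layer, assuming sigmoid activations in both layers. The layer sizes, the random weights, and names such as `W1` and `b1` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def feedforward(x, W1, b1, W2, b2):
    """Turn an input vector into an output by applying the layers in order."""
    hidden = sigmoid(W1 @ x + b1)        # first layer: linear map, then sigmoid
    output = sigmoid(W2 @ hidden + b2)   # second layer uses the hidden values as input
    return hidden, output

rng = np.random.default_rng(0)
x = np.array([0.5, -1.0])                             # hypothetical input point
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)  # 2 inputs -> 3 hidden nodes
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)  # 3 hidden nodes -> 1 output

hidden, output = feedforward(x, W1, b1, W2, b2)
print(output)  # a single probability between 0 and 1
```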

### Error Function
Just as before, neural networks will produce an error function, which at the end, is what we'll be minimizing. Let's see the error function for a neural network.

<img src="assets/feed_forward4.PNG" width=900px>

## Backpropagation
Now, we're ready to get our hands into training a neural network. For this, we'll use the method known as backpropagation. In a nutshell, backpropagation will consist of:

* Doing a feedforward operation.
* Comparing the output of the model with the desired output.
* Calculating the error.
* Running the feedforward operation backwards (backpropagation) to spread the error to each of the weights.
* Using this to update the weights, and get a better model.
* Continuing this until we have a model that is good.


This sounds more complicated than it actually is. Let's take a look. First we will see a conceptual interpretation of what backpropagation is.

#### Multi-layer perceptron
For a single perceptron, we saw gradient descent; we will apply the same logic to a multi-layer perceptron.

<img src="assets/back_propagation1.PNG" width=900px>

**NOTE**: We are only considering weights for simplicity, but in reality, we will update the bias as well.


<img src="assets/back_propagation2.PNG" width=900px>
<img src="assets/back_propagation3.PNG" width=900px>

### Chain Rule

<img src="assets/back_propagation4.PNG" width=900px>

**Note**: Feedforwarding is literally composing functions, and backpropagation is literally taking the derivative at each piece.

### Backpropagation in detail
<img src="assets/back_propagation5.PNG" width=900px>
<img src="assets/back_propagation6.PNG" width=900px>
<img src="assets/back_propagation7.PNG" width=900px>
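The chain-rule bookkeeping can be sketched for a tiny network with one hidden layer and a single sigmoid output, assuming the cross-entropy error and, as in the note above, weights only (no bias). All names and sizes are hypothetical. The last lines compare one analytic derivative against a finite-difference estimate as a sanity check.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, W1, W2):
    hidden = sigmoid(W1 @ x)       # hidden layer activations
    y_hat = sigmoid(W2 @ hidden)   # scalar prediction
    return hidden, y_hat

def error(y, y_hat):
    return float(-(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

def backward(x, y, W1, W2):
    """Apply the chain rule layer by layer, from the output back to the input."""
    hidden, y_hat = forward(x, W1, W2)
    delta_output = y_hat - y                                  # dE/d(output pre-activation)
    grad_W2 = delta_output * hidden                           # dE/dW2
    delta_hidden = delta_output * W2 * hidden * (1 - hidden)  # error passed to the hidden layer
    grad_W1 = np.outer(delta_hidden, x)                       # dE/dW1
    return grad_W1, grad_W2

rng = np.random.default_rng(0)
x, y = np.array([0.5, -1.0]), 1.0
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=3)
grad_W1, grad_W2 = backward(x, y, W1, W2)

# Finite-difference estimate of dE/dW1[0, 0] for comparison.
eps = 1e-6
W1_shifted = W1.copy()
W1_shifted[0, 0] += eps
numeric = (error(y, forward(x, W1_shifted, W2)[1]) - error(y, forward(x, W1, W2)[1])) / eps
print(grad_W1[0, 0], numeric)
```

A gradient descent step would then subtract the learning rate times `grad_W1` and `grad_W2` from the corresponding weights.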

# Notebook: Analyzing Student Data
Now, we're ready to put neural networks into practice. We'll analyze a dataset of student admissions at UCLA.

To open this notebook, you have the following option:

> * Clone the repo from [Github](https://github.com/ashwinwadte/deep-learning-v2-pytorch) and open the notebook ***StudentAdmissions.ipynb*** in the ***intro-neural-networks > student_admissions*** folder. You can either download the repository with `git clone https://github.com/udacity/deep-learning-v2-pytorch.git`, or download it as an archive file from [this link](https://github.com/ashwinwadte/deep-learning-v2-pytorch/archive/master.zip).

### Instructions
In this notebook, you'll be implementing some of the steps in the training of the neural network, namely:

* One-hot encoding the data
* Scaling the data
* Writing the backpropagation step


This is a self-assessed lab. If you need any help or want to check your answers, feel free to check out the solutions notebook in the same folder.

# Training
Now we know how to build and train a neural network to fit our data, but sometimes we train and find that nothing works as planned. Why? Because many things can fail: our architecture may be poorly chosen, the data may be noisy, or the model may take years to run. We need ways to optimize the training. Let's see how we can do that.

# Testing

<img src="assets/testing1.PNG" width=900px>
<img src="assets/testing2.PNG" width=900px>
<img src="assets/testing3.PNG" width=900px>
<img src="assets/testing4.PNG" width=900px>

# Overfitting and Underfitting

So let's talk about life! In life there are two mistakes one can make. One may try to kill Godzilla with a fly swatter, i.e. oversimplify the problem (in ML, this is called **Underfitting**), or try to kill a fly with a bazooka, i.e. use an overly complicated solution when a much simpler one would do (in ML, this is called **Overfitting**).

<img src="assets/fitting1.PNG" width=900px>
<img src="assets/fitting2.PNG" width=900px>

---
<img src="assets/fitting3.PNG" width=900px>

Sometimes we refer to **underfitting** as **Error due to bias**.
 
---
<img src="assets/fitting4.PNG" width=900px>
<img src="assets/fitting5.PNG" width=900px>
<img src="assets/fitting6.PNG" width=900px>

Sometimes we refer to **overfitting** as **Error due to variance**.

<img src="assets/fitting7.PNG" width=900px>
<img src="assets/fitting8.PNG" width=900px>

In real life, we will always end up either underfitting or overfitting. Should we go for smaller pants or bigger pants? It's less bad to go for bigger pants and then get a belt.

That's what we are going to do, we will err on the side of overly complicated models and then we'll apply certain techniques to prevent overfitting on it.

# Early stopping
So let's start with a complicated network architecture, one more complicated than we need, and live with it.

<img src="assets/fitting9.PNG" width=900px>

<img src="assets/fitting10.PNG" width=900px>
<img src="assets/fitting11.PNG" width=900px>


So, we do gradient descent until the testing error stops decreasing and starts to increase. At that moment, we stop. This algorithm is called **Early stopping** and is widely used to train neural networks.
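The stopping rule can be sketched on a recorded list of testing errors, one per epoch. The `patience` parameter (how many non-improving epochs to tolerate before stopping) is an assumption added here, and the error values are made up.

```python
def early_stopping_epoch(testing_errors, patience=1):
    """Return the epoch with the lowest testing error seen before stopping."""
    best_error = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, err in enumerate(testing_errors):
        if err < best_error:
            best_error, best_epoch = err, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # the error started increasing: stop training
    return best_epoch

# Hypothetical testing error per epoch: it falls, then rises as the model overfits.
testing_errors = [0.90, 0.60, 0.40, 0.35, 0.37, 0.45, 0.60]
stop_at = early_stopping_epoch(testing_errors)
print(stop_at)  # → 3
```

In a real training loop the same check would run after each epoch, keeping a copy of the weights from the best epoch.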

# Regularization

<img src="assets/regularization1.PNG" width=900px>

In [None]:
show_quiz(name='regularization')

<img src="assets/regularization2.PNG" width=900px>
So, there is overfitting, but in a subtle way.

<img src="assets/regularization3.PNG" width=900px>
It is much harder to do gradient descent in the second case, since the derivatives are mostly close to zero and then very large when we get to the middle of the curve. Therefore, in order to do gradient descent properly, we prefer the model on the left over the one on the right.

This is very well summarized by:
<img src="assets/regularization4.PNG" width=900px>

### L1 and L2 Regularization
<img src="assets/regularization5.PNG" width=900px>

<img src="assets/regularization6.PNG" width=900px>
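As a rough sketch, both penalties are simple functions of the weights that get added to the error. The strength parameter `lam` (often written $\lambda$) and the two example weight vectors are assumptions made for illustration.

```python
import numpy as np

def l1_penalty(weights, lam):
    """L1 regularization term: lambda times the sum of absolute weights."""
    return float(lam * np.sum(np.abs(weights)))

def l2_penalty(weights, lam):
    """L2 regularization term: lambda times the sum of squared weights."""
    return float(lam * np.sum(np.square(weights)))

small = np.array([1.0, 1.0])     # hypothetical model with small weights
large = np.array([10.0, 10.0])   # hypothetical model with large weights

print(l1_penalty(small, 1.0), l1_penalty(large, 1.0))
print(l2_penalty(small, 1.0), l2_penalty(large, 1.0))
```

The model with large weights is penalized much more heavily, especially under L2, which pushes training toward the gentler model.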

# Dropout

<img src="assets/dropout1.PNG" width=900px>
<img src="assets/dropout2.PNG" width=900px>
<img src="assets/dropout3.PNG" width=900px>
<img src="assets/dropout4.PNG" width=900px>

# Local Minima

<img src="assets/local_minima1.PNG" width=900px>

# Random restart

<img src="assets/local_minima2.PNG" width=900px>
<img src="assets/local_minima3.PNG" width=900px>

# Vanishing Gradient

<img src="assets/vanishing_gradient1.PNG" width=900px>
<img src="assets/vanishing_gradient2.PNG" width=900px>
<img src="assets/vanishing_gradient3.PNG" width=900px>

# Other Activation Functions

The best way to fix this vanishing gradient issue is to change the activation functions.

<img src="assets/activation_functions1.PNG" width=900px>
<img src="assets/activation_functions2.PNG" width=900px>
<img src="assets/activation_functions3.PNG" width=900px>
<img src="assets/activation_functions4.PNG" width=900px>

Note that the last unit is a sigmoid, since our final output still needs to be a probability between 0 and 1. However, if we let the final unit be a ReLU, we can actually end up with regression models that predict a value. This will be of use in the recurrent neural network section.
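A short sketch comparing the derivatives of three common activation functions, assuming the alternatives meant here are tanh and ReLU. The sigmoid derivative never exceeds 0.25, which is why products of many of them shrink quickly, while the ReLU derivative is exactly 1 for positive inputs.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    """Return x for positive inputs and 0 otherwise."""
    return np.maximum(0.0, x)

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_derivative(x):
    return 1.0 - np.tanh(x) ** 2

def relu_derivative(x):
    return (np.asarray(x) > 0).astype(float)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))
print(sigmoid_derivative(x))  # at most 0.25
print(tanh_derivative(x))     # at most 1
print(relu_derivative(x))     # 0 or 1
```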

# Batch vs Stochastic Gradient Descent

<img src="assets/bsgd1.PNG" width=900px>
<img src="assets/bsgd2.PNG" width=900px>
<img src="assets/bsgd3.PNG" width=900px>
<img src="assets/bsgd4.PNG" width=900px>
<img src="assets/bsgd5.PNG" width=900px>
<img src="assets/bsgd6.PNG" width=900px>

Notice that with the same data, we took four steps, whereas with normal (batch) gradient descent we took only one step using all the data. Of course, the four steps were less accurate, but in practice it's much better to take a bunch of slightly inaccurate steps than to take one good one.
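Splitting the data into batches can be sketched as below. The dataset of 24 points and the batch size of 6 are assumptions chosen so that one pass over the data yields four batches, and therefore four gradient steps.

```python
import numpy as np

def iterate_minibatches(features, labels, batch_size, seed=0):
    """Yield the data in shuffled batches; each batch would drive one gradient step."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))        # shuffle once per pass
    for start in range(0, len(features), batch_size):
        idx = order[start:start + batch_size]
        yield features[idx], labels[idx]

# Hypothetical dataset: 24 points with 2 features each.
features = np.arange(48, dtype=float).reshape(24, 2)
labels = np.arange(24) % 2

batches = list(iterate_minibatches(features, labels, batch_size=6))
print(len(batches))  # → 4
```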

# Learning Rate Decay

<img src="assets/momentum1.PNG" width=900px>
<img src="assets/momentum2.PNG" width=900px>

# Momentum
<img src="assets/momentum3.PNG" width=900px>