**Content**

1.   Overview of Deep Learning Module.
2.   Intuition behind Neural Networks.
3.   Programming Frameworks used in Deep Learning.
4.   Biological Neuron Vs Artificial Neuron.
5.   Logistic Regression as Neural Network.
6.   Perceptron Model.
7.   Training of Perceptron Model.
8.   Multi-Layered Perceptron (MLP).
9.   Training of MLP.

#### Deep-Learning Module Breakup:

![](imgs/img1.png)

#### Objective:

The objectives of the Deep Learning module are to:
- Equip learners with strong foundations so as to understand and work with any new model.
- Cover most widely used models.
- Enable learners to read and understand research papers.
  ![](imgs/img2.png)


***Assumption:***

Learners are comfortable with Classical Machine Learning (ML) and Maths.

#### Why Deep Learning?:

- In practice, when we have tabular data, the go-to strategy is to use GBDT/XGBoost/RF. (This is not a hard-and-fast rule. Sometimes even Logistic Regression performs better.)
- But for image, speech, complex time-series, and textual data, Deep Learning is used, and Neural Networks are its foundation.
- We will study Images in Computer Vision (CV) module and Textual data in Natural Language processing (NLP) module.
- We will also study Neural Collaborative Filtering (NCF) that is used in State of the Art (SOTA) Recommender Systems (RecSys).
  ![](imgs/img3.png)

*Note:*
- If we have Image/Speech/Textual Data, Deep Learning (DL) is the de-facto approach used nowadays.



#### Neural Networks:

- The simplest model in Neural Networks (N-N) is the Perceptron, built in 1957 by Rosenblatt.
- The whole area of N-N is loosely inspired by the human/mammalian brain, but it cannot simulate the understanding of a modern human brain.
  
  ![](imgs/img4.png)
- The next significant progress came in 1986, when Geoff Hinton and his co-authors published a paper on backpropagation. He also wrote about the foundations of DL and how to build modern DL algorithms.
- The idea only gained traction in 2012, when a seminal paper (AlexNet) was released. This will be covered in the CV module.

***Trivia:***

Andrew Ng and his team used graphic processors to speed-up computations in DL around 2007-08.


#### Programming Frameworks used in DL:

- The 2 major frameworks used are:
  - TensorFlow (TF) and Keras, developed by Google.
  - PyTorch, developed by Facebook.
![](imgs/img5.png)

*Note:*
- TensorFlow will be used in our Classes.
- Conceptually if you understand one, the other is quite simple to work with.


#### Inspiration:

- N-N is loosely inspired by the human/mammalian brain. This is not an exact connection but an approximate one.
- A **Neuron** is basically a brain cell. It has a lot of connections with other neurons. The incoming connections are known as **Dendrites**. They carry electro-chemical signals to the neuron, which processes this information and sends it on to other neurons through **Axons**.
- This is loosely how brain cells work.
  ![](imgs/img6.png)

*Note:*

We will not be studying this Biological Neuron, but Artificial Neurons which are mathematical.



***An example from Biological Domain:***

- Imagine we touched a hot plate. Electro-chemical signals go from the fingers through the spinal cord to the brain. The moment that happens, the brain sends a signal to move the muscles.
- Over a period of time, brain learns from good and bad experiences.

#### Artificial Neuron:

- Imagine we have inputs $x_1,x_2, x_3$ coming into a neuron and every edge has weights $w_1,w_2,w_3$ associated with it.
  ![](imgs/img7.png)
- Each input to the neuron is multiplied by the weight of its edge. Note that this weight tells how important that input is.
- The neuron applies a function $(f)$, often referred to as an $Activation \ Function$. This function takes the input $(w_1x_1 +w_2x_2 +w_3x_3)$ and generates the output $O_1$.
- In some cases, it generates more than one output. The output can be sent to a lot of other neurons.
- If the function were a Sigmoid function, this would be similar to Logistic Regression, which is why we also add a bias term $b$ to the weighted sum.

*Note:*

$O_1$ is the output of $Neuron_1$.
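
The artificial neuron described above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names `sigmoid` and `neuron_output` and the sample values of the inputs, weights and bias are assumptions made for the example, not part of the lecture.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(x, w, b, activation):
    """Weighted sum of the inputs plus bias, passed through an activation f."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return activation(z)

# Hypothetical inputs x1..x3 and edge weights w1..w3 (values chosen for illustration).
x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 0.5]
b = 0.0

o_1 = neuron_output(x, w, b, sigmoid)   # weighted sum is 0.5 - 2.0 + 1.5 = 0.0
print(o_1)                              # sigmoid(0.0) = 0.5
```

Swapping `sigmoid` for another function changes the type of neuron without touching the weighted-sum part.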



#### Terminology Alert:

- $O_i$ is the output, which is a function of $$\sum_{j=1}^d x_{ij} w_j$$ where $x_i$ is a d-dimensional datapoint $(x_i\in \mathbb{R}^{d})$.
- $f$ is called the activation function, and the $w_j$ are the weights on the edges.
- For each $x_i$, there is an output $O_i$.
  ![](imgs/img8.png)



#### Artificial Neural Network: Perceptron

- We will first study ANN: Perceptron, then Logistic Regression from N-N perspective, followed by Multi-Layered Perceptron (MLP) and code.
  ![](imgs/img9.png)
- Logistic Regression and the Perceptron are closely related concepts.
- The fundamental working of Logistic Regression is as follows.
- Given points $x_i$ which are d-dimensional $(x_i \in \mathbb{R}^{d})$ and labels $y_i$, we predict $\hat{y_i}$, which is a Sigmoid function of $(w^Tx_i+b)$. Here $w$ is a d-dimensional vector and $b$ is a scalar.



- Let's see how we can represent this concept as an N-N.
- $x_{i1}$ is an input to the activation function. Along with $x_{i1}$ (first feature of $x_i$), we have other features like $x_{i2}, x_{i3}...x_{id}$ as inputs.
  ![](imgs/img10.png)
- Each of these inputs is associated with one of the weights $w_1, w_2, w_3...w_d$, and a constant input $1$ is associated with $b$.
- The activation function is a Sigmoid Function and we need to learn these weights.
- Sigmoid function generates output $O_i$ which is $\hat{y_i}$.
- Parameters to train are $w_1, w_2, w_3,...w_d$ and $b$.
- We can find these parameters using Stochastic Gradient Descent (SGD) or batch Gradient Descent.


#### N-N representation of Logistic Regression:

- Let's say we have inputs $x_{i1}, x_{i2},...x_{id}$ with weights $(w_1, w_2,...w_d)$, bias $b$ and a Sigmoid activation function. Let $O_i$ be the output.
  ![](imgs/img11.png)
- For every $x_i$, we can pass it through the activation function and get the output. This is called **Forward Propagation**.
- The Sigmoid function is represented as $f_\sigma (w^Tx+b)$.
- In Logistic Regression, the loss function is Log-loss: $Loss_\mu(y_i,O_i)$ where $O_i \to \hat{y_i}$.
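
As a rough sketch of the points above, forward propagation through this single Sigmoid neuron and the log-loss for one point could look as follows. The weights, the data point and the helper names (`forward`, `log_loss`, the `eps` guard) are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x_i, w, b):
    """Forward propagation for one point: O_i = sigmoid(w^T x_i + b)."""
    z = sum(w_j * x_ij for w_j, x_ij in zip(w, x_i)) + b
    return sigmoid(z)

def log_loss(y_i, o_i, eps=1e-12):
    """Binary log-loss for one point; eps guards against log(0)."""
    o_i = min(max(o_i, eps), 1.0 - eps)
    return -(y_i * math.log(o_i) + (1 - y_i) * math.log(1.0 - o_i))

# Hypothetical weights and a single labelled point, for illustration only.
w, b = [2.0, -1.0], 0.5
x_i, y_i = [1.0, 2.5], 1

o_i = forward(x_i, w, b)       # z = 2.0 - 2.5 + 0.5 = 0.0, so o_i = 0.5
loss_i = log_loss(y_i, o_i)    # -log(0.5) = log(2)
print(o_i, loss_i)
```

Summing `loss_i` over all points gives the quantity that training tries to minimize.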


##### Dataset Representation:

  ![](imgs/img12.png)




##### Evaluation of Querypoint $x_q$:

- Let's assume that we found best weights using SGD. We have a querypoint $x_q$, a d-dimensional point such that $x_q \in \mathbb{R}^{d}$.
  ![](imgs/img13.png)
- All we have to do is take each of the dimensions $(x_{q1}, x_{q2},..x_{qd})$ and pass them to the function $f_\sigma$. We generate the output $O_q$ as $O_q = \hat{y_q} = \sigma({w^Tx_q+b})$.
- $\hat{y_q}$ could also be thought of as $\hat{y_q} = P(y_q=1 | x_q,w)$.




#### Diagrammatic Representation of Perceptron:

- Let $x_i \in \mathbb{R}^{d}, y_i \in \{0,1\}$.
  ![](imgs/img14.png)
- In the Perceptron, the activation function is different, but the rest of the structure remains the same.
- $O_i = f_{perceptron}(x_i,w,b)$ where

\begin{equation}
  f_{perceptron}(x_i,w,b)=\begin{cases}
    1, & \text{if $w^Tx_i+b>0$}.\\
    0, & \text{otherwise}.
  \end{cases}
\end{equation}


*Note:*

- We have a single neuron in the Perceptron, similar to Logistic Regression.
- The only difference is that Logistic Regression uses the Sigmoid activation, while the Perceptron uses the step function defined above.
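
A minimal sketch of the Perceptron activation $f_{perceptron}$ in plain Python is given below. The weights and the three test points are hypothetical values picked so that the decision boundary is the line $x_1 + x_2 = 1$.

```python
def f_perceptron(x_i, w, b):
    """Step activation: 1 if w^T x_i + b > 0, else 0."""
    z = sum(w_j * x_ij for w_j, x_ij in zip(w, x_i)) + b
    return 1 if z > 0 else 0

# Hypothetical weights; the decision boundary is the line x1 + x2 = 1.
w, b = [1.0, 1.0], -1.0

near = f_perceptron([0.6, 0.6], w, b)     # just past the boundary
far = f_perceptron([50.0, 50.0], w, b)    # very far past the boundary
below = f_perceptron([0.2, 0.2], w, b)    # on the other side
print(near, far, below)                   # 1 1 0
```

Both points on the positive side map to the same output $1$, however far they are from the boundary, because the step function does not produce a probability.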




#### Various ways to represent Perceptron Model:

We can represent Logistic Regression and the Perceptron using equations, geometrically, or with an N-N diagram.
![](imgs/img40.png)




- Following is the geometric representation of Perceptron Model.
- Geometrically, Perceptron is a hyperplane $(\Pi^d)$ based separator without any squashing of output.
  ![](imgs/img15.png)


***Disadvantages:***
- The Perceptron has certain disadvantages as well.
- The impact of outliers is massive as there is no squashing.
- It's a linear model, hence we cannot get non-linear decision boundaries.
- No probabilities. Points close to the decision boundary and points far away from it get the same output value.





#### Training of Perceptron Model:

- It is simple to train Perceptron model.
  ![](imgs/img16.png)
- We have an $x_i$ corresponding to each $y_i$, where $y_i \in \{0,1\}$. $\hat{y_i}$ is computed as $f_p(w^Tx_i+b)$, where $\hat{y_i} \in \{0,1\}$.
- We can come up with a loss function like Mean Squared Error (MSE), computed on predictions that are only $0$ or $1$ (no probabilities).
- The optimization problem would look like:
$$\underset{w,b}{\min} \ \sum_{i=1}^n Loss(y_i,\hat{y_i}) + \lambda \sum_{j=1}^d w_j^2$$
- This optimization problem could be solved using Gradient Descent (GD).
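
To give a feel for training a Perceptron in code, here is a small sketch. Note the swap: because the step activation is flat almost everywhere, this example uses the classic Perceptron update rule $w \leftarrow w + \eta (y_i - \hat{y_i}) x_i$ rather than the regularized Gradient Descent objective written above. The AND-gate dataset, the learning rate and the number of epochs are assumptions made for the illustration.

```python
def predict(x_i, w, b):
    """Perceptron output: 1 if w^T x_i + b > 0, else 0."""
    z = sum(w_j * x_ij for w_j, x_ij in zip(w, x_i)) + b
    return 1 if z > 0 else 0

def train_perceptron(X, y, eta=0.1, epochs=50):
    """Classic perceptron rule: nudge w and b only on misclassified points."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            error = y_i - predict(x_i, w, b)      # one of -1, 0, +1
            w = [w_j + eta * error * x_ij for w_j, x_ij in zip(w, x_i)]
            b += eta * error
    return w, b

# Toy, linearly separable data: the logical AND of two binary inputs.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 0, 1]

w, b = train_perceptron(X, y)
predictions = [predict(x_i, w, b) for x_i in X]
print(predictions)   # [0, 0, 0, 1]
```

The rule only converges when the data is linearly separable, which ties in with the Perceptron being a linear model.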







- So far, we have seen simple single-neuron models such as:
  - Logistic Regression with Log-loss and $L_2$ regularization.
  - Perceptron (Not used much).
- Similarly, a Linear SVM could also be thought of as a single neuron model.
  ![](imgs/img17.png)
- In SVM, we have hinge loss with some $L_2$ regularization.

*Note:*

These models have a single neuron. What if we had multiple such models? How do we connect them?







#### Multi-Layered Perceptron (MLP):

- Imagine we have $(x_{i1}, x_{i2}, x_{i3}, x_{i4})$ representation of a single point $x_i$ such that $x_i \in \mathbb{R}^{4}$.
- Suppose we have multiple activation functions $(f)$ in the 1st layer. Each of the inputs is connected to every activation function in $1^{st}$ Layer.
  ![](imgs/img18.png)
- Outputs of the $1^{st}$ layer are inputs to the activation functions of the $2^{nd}$ layer.
- Theoretically, all the $f$'s can be different. But in practice, we use the same activation function.
- Outputs of the $2^{nd}$ layer are inputs to the activation function of the final layer, which generates the output $O_i$.
- This is what a Multi-Layered Perceptron (MLP) model looks like.
- An MLP has multiple neurons arranged in multiple layers.
- With 4 inputs, 3 neurons in the $1^{st}$ layer, 2 in the $2^{nd}$ and 1 in the final layer, the parameter counts are: $1^{st} \ layer \to (12 \ weights + 3 \ biases)$, $2^{nd} \ layer \to (6 \ weights + 2 \ biases)$ and $3^{rd} \ layer \to (2 \ weights + 1 \ bias)$.
- All these weights are different. If we know the weights, we can do Forward Propagation (FP).
- FP is nothing but going from the input layer to the output while doing a matrix multiplication and applying the activation function at each layer.

*Note:*

We are ignoring $b$ for diagrammatic simplicity.
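
Forward Propagation through the MLP described above can be sketched as follows. This is an illustrative sketch in plain Python, assuming Sigmoid activations everywhere, random initial weights, zero biases and a made-up input point. The helper names (`init_layer`, `layer_forward`, `forward_propagation`) are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_layer(n_in, n_out, rng):
    """One fully connected layer: an n_out x n_in weight matrix plus n_out biases."""
    W = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out
    return W, b

def layer_forward(W, b, inputs):
    """Each neuron takes a weighted sum of all inputs, adds its bias, applies f."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b_j)
            for row, b_j in zip(W, b)]

def forward_propagation(layers, x_i):
    """Feed the outputs of each layer into the next one."""
    out = x_i
    for W, b in layers:
        out = layer_forward(W, b, out)
    return out

rng = random.Random(0)                 # fixed seed so runs are repeatable
sizes = [4, 3, 2, 1]                   # 4 inputs -> 3 neurons -> 2 neurons -> 1 output
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]

n_weights = [len(W) * len(W[0]) for W, _ in layers]   # weights per layer
n_biases = [len(b) for _, b in layers]                # biases per layer
o_i = forward_propagation(layers, [0.5, -1.2, 3.0, 0.7])
print(n_weights, n_biases, o_i)
```

The printed counts match the $12 + 3$, $6 + 2$ and $2 + 1$ parameters listed above, and `o_i` holds the single output of the final layer.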









#### Why do we need MLP?

- Suppose we have simple functions from algebra, $f_1, f_2, \ldots, f_7$, as shown below.
  ![](imgs/img19.png)
- $f_1$ could be written as $f_1(x_{i1}, x_{i2}) \to x_{i1} + x_{i2}$.
- Then, $f_6(f_1(x_{i1}, x_{i2})) \to \sin(x_{i1} + x_{i2})$.










#### Terminology Alert:

- Composition of functions is $(f \circ g)(x) = f(g(x))$ and $(g \circ f)(x) = g(f(x))$.
- So in MLP, we are doing composition of functions.
- This is a great strategy to build complex functions using simple functions.
  ![](imgs/img20.png)
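
As a small illustration of composition, the earlier example $f_6(f_1(x_{i1}, x_{i2})) = \sin(x_{i1} + x_{i2})$ can be written in plain Python as below. The helper `compose` and the sample inputs are assumptions made for this sketch.

```python
import math

def compose(f, g):
    """Return the function that applies g first and then f, i.e. f(g(...))."""
    def composed(*args):
        return f(g(*args))
    return composed

def f_1(x_i1, x_i2):
    """Simple building block: addition."""
    return x_i1 + x_i2

f_6 = math.sin                         # simple building block: sine

f_6_of_f_1 = compose(f_6, f_1)         # sin(x_i1 + x_i2)
value = f_6_of_f_1(1.0, 2.0)
print(value)                           # same as math.sin(3.0)
```

Each layer of an MLP plays the role of one such building block, and stacking layers corresponds to repeated calls to `compose`.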









- Note that $(w^Tx_i+b)$ is already a linear function, on top of which we compose another function.
- If we keep on composing functions, we can build extremely complex functions.
  ![](imgs/img21.png)









#### Misc. Questions:

***Question:*** How long should we keep composing non-linear functions?

***Answer:***
- Imagine a 100-layered MLP, which means we are doing 100 function compositions, and compare it with a 2-layered MLP.
- The 100-layered MLP will certainly be more complex.
  ![](imgs/img22.png)

*Note:* A 100-layered MLP will tend to overfit more. So depth is actually a hyperparameter.









***Question:*** What happens if each activation function in MLP is a linear function?

***Answer:***
- Then the whole MLP collapses into a linear model, which is why the activation functions used in Deep Learning are always non-linear.
  ![](imgs/img23.png)

- Imagine we have an MLP as shown below:
  ![](imgs/img24.png)
- If every activation is linear, say $f(z) = \alpha z$, and we expand the output $O_i$, we get:
$$O_i = \alpha^{2}w_1 (w^Tx_i+b) + \alpha^{2}w_2 (w'^Tx_i+b')$$
- This is still a linear function of $x_i$. If all the activation functions are linear, then the MLP will be a linear model.
- Hence, we always use non-linear activation functions in Deep Learning.
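
The collapse into a linear model can be checked numerically. The sketch below assumes a linear activation of the form $f(z) = \alpha z$ (inferred from the $\alpha^2$ in the expansion above), two hidden neurons and an output neuron without bias. All parameter values are hypothetical.

```python
ALPHA = 0.5   # assumed slope of the linear activation f(z) = ALPHA * z

def linear_activation(z):
    return ALPHA * z

def dot(u, v):
    return sum(u_j * v_j for u_j, v_j in zip(u, v))

# Hypothetical parameters: two hidden neurons (w, b) and (w_prime, b_prime),
# combined by an output neuron with weights w_1 and w_2 (no output bias).
w, b = [1.0, -2.0], 0.5
w_prime, b_prime = [0.3, 0.8], -1.0
w_1, w_2 = 2.0, -1.5

def mlp_output(x_i):
    """Two-layer MLP where every activation is linear."""
    h_1 = linear_activation(dot(w, x_i) + b)
    h_2 = linear_activation(dot(w_prime, x_i) + b_prime)
    return linear_activation(w_1 * h_1 + w_2 * h_2)

# The same network collapsed into one linear model v^T x + c.
v = [ALPHA**2 * (w_1 * a + w_2 * a_p) for a, a_p in zip(w, w_prime)]
c = ALPHA**2 * (w_1 * b + w_2 * b_prime)

def single_linear_model(x_i):
    return dot(v, x_i) + c

x_i = [1.5, -0.7]
print(mlp_output(x_i), single_linear_model(x_i))   # the two values agree
```

Whatever input is tried, the two functions agree up to floating-point rounding, so the extra layer adds no expressive power.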







***Question:*** Should we use different activation functions in different layers?

***Answer:***
- Theoretically we can. But studies have shown that it does not add much value.
  ![](imgs/img25.png)











***Question:*** Is the number of neurons in a layer also a hyperparameter?

***Answer:***
- Yes, the number of neurons is also a hyperparameter.










***Question:*** Does it make sense to keep all activation functions in every layer the same?

***Answer:***
- Yes, in practice we go for that.

***Trivia:***

Sigmoid was very popular in the 1980s and 1990s. Nowadays ReLU is more popular. (We will study this in future classes.)










#### Underlying Maths:

- Imagine we have a Multi-Layered Perceptron with a 4-dimensional input $x_i$.
  ![](imgs/img26.png)
- The weights of each layer are represented as $w^k_{ij}$, where $i\to$ node in the previous layer, $j\to$ node in the next layer and $k\to$ index of the next layer.
- Each activation function has an output $O_{ij}$, where $i \to$ current layer and $j \to$ index of the neuron.
- Such networks, where every node is connected to every node in the next layer, are also known as fully connected MLPs.











- Here is an example of how weights are labelled at each layer.
- In Deep Learning, weights are matrices as shown below.
  ![](imgs/img27.png)












#### How to train MLP:

- Suppose we have a training dataset $D_{Tr} = \bigg\{ (x_i,y_i)\bigg\}_{i=1}^{n}$ where $x_i \in \mathbb{R}^{d}$.
  ![](imgs/img28.png)
- $x_i$ is passed through the MLP, where we do Forward Propagation and get the prediction $\hat{y_i}$.
- Forward Propagation is basically a bunch of matrix multiplications, each followed by applying the activation function.
- We need to define a loss function which takes the ground truth and the prediction.
- The loss function could be MSE for Regression and Log-Loss for Classification.
- During training, we determine these unknown weights.













##### Training Steps:

1.  Initialize all weights randomly. We want to minimize:
$$\sum_{i=1}^{n} \ Loss_i$$

    (The trick we can use here is calculus.)
    ![](imgs/img29.png)

2.  $\forall \ x_i$, we do FP: pass each $x_i$ through the MLP and get the output $O_i$, which in turn is used to determine $Loss_i$. We want to find the weights $(w)$ that minimize the sum of all these losses.












3.  Each weight $W^k_{ij}$ is updated as follows, where $\eta$ is the learning rate:
$${W^k_{ij}}_{new} = {W^k_{ij}}_{old} - \eta \cdot \frac{\partial Loss}{\partial W^k_{ij}}\biggr\vert_{{W^k_{ij}}_{old}}$$
  ![](imgs/img30.png)
4.  Repeat steps 2 and 3 $\forall \ i,j,k$ till convergence.













*Note:*

- This is a brute-force approach and highly sub-optimal.
- For this to work, all activation functions and the loss function have to be differentiable.
- There is a better strategy, Backpropagation (will be discussed in the next class).
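
One way to picture the brute-force training steps above is to estimate every partial derivative separately by perturbing one weight at a time. The sketch below does this with finite differences on a tiny, hypothetical MLP (2 inputs, 2 Sigmoid neurons, 1 Sigmoid output). The network size, the toy dataset, the learning rate and the step size `h` are all assumptions made for illustration, and finite differences are a stand-in for computing each derivative by hand.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(params, x):
    """Tiny MLP: 2 inputs -> 2 sigmoid neurons -> 1 sigmoid output (9 parameters)."""
    h_1 = sigmoid(params[0] * x[0] + params[1] * x[1] + params[2])
    h_2 = sigmoid(params[3] * x[0] + params[4] * x[1] + params[5])
    return sigmoid(params[6] * h_1 + params[7] * h_2 + params[8])

def total_loss(params, X, y):
    """Sum of squared errors over the whole training set."""
    return sum((y_i - predict(params, x_i)) ** 2 for x_i, y_i in zip(X, y))

def brute_force_step(params, X, y, eta=0.1, h=1e-5):
    """Estimate each partial derivative separately, then update every weight."""
    base = total_loss(params, X, y)
    grads = []
    for k in range(len(params)):
        bumped = list(params)
        bumped[k] += h                                  # perturb one weight only
        grads.append((total_loss(bumped, X, y) - base) / h)
    return [p - eta * g for p, g in zip(params, grads)]

# Toy XOR-style data, assumed purely for illustration.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
y = [0.0, 1.0, 1.0, 0.0]

rng = random.Random(0)
params = [rng.uniform(-1.0, 1.0) for _ in range(9)]     # step 1: random initialisation
initial_loss = total_loss(params, X, y)
for _ in range(200):                                    # steps 2 to 4, repeated
    params = brute_force_step(params, X, y)
final_loss = total_loss(params, X, y)
print(initial_loss, final_loss)
```

Every update needs one extra pass over the data per weight, which hints at why this approach does not scale to networks with many parameters.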

#### Misc. Questions:

***Question:*** We keep referring to SGD when we talk about GD. In practice, is it more common to use SGD or mini-batch GD?

***Answer:***
- Imagine we have 1 million data points.
  ![](imgs/img31.png)
- In Gradient Descent, we compute $\frac{\partial Loss}{\partial w}$ using all $n$ data points.
- In Stochastic Gradient Descent, we compute $\frac{\partial Loss}{\partial w}$ using one random point.
  ![](imgs/img32.png)
- In mini-batch Gradient Descent with batch size $k$, we compute $\frac{\partial Loss}{\partial w}$ using a random batch of $k$ points.
- Mini-batch Gradient Descent is the most widely used in practice.
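
The three variants differ only in which points feed each gradient computation. The sketch below makes that explicit for a one-weight model $\hat{y} = w x$ with squared error. The synthetic data, the batch size of 32, the learning rate and the helper names are assumptions chosen for the example.

```python
import random

def gradient(w, X, y, indices):
    """d/dw of the squared error for y_hat = w * x, summed over the chosen points."""
    return sum(-2.0 * X[i] * (y[i] - w * X[i]) for i in indices)

def pick_indices(n, mode, k, rng):
    """Choose which points contribute to one gradient computation."""
    if mode == "gd":            # all n points
        return list(range(n))
    if mode == "sgd":           # one random point
        return [rng.randrange(n)]
    if mode == "mini-batch":    # a random batch of k points
        return rng.sample(range(n), k)
    raise ValueError(f"unknown mode: {mode}")

rng = random.Random(0)
n = 1000                                    # stand-in for the 1 million points in the text
X = [rng.uniform(-1.0, 1.0) for _ in range(n)]
y = [3.0 * x for x in X]                    # synthetic targets with true weight 3.0

batch_sizes = {mode: len(pick_indices(n, mode, 32, rng))
               for mode in ("gd", "sgd", "mini-batch")}

w = 0.0
for _ in range(500):                        # mini-batch updates with k = 32
    batch = pick_indices(n, "mini-batch", 32, rng)
    w -= 0.1 * gradient(w, X, y, batch) / len(batch)
print(batch_sizes, w)
```

Each mini-batch update touches only 32 of the 1000 points, yet `w` still ends up very close to the true weight 3.0.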











***Question:*** How is determining the weights in Logistic Regression different from doing so in an MLP, given that Gradient Descent is used in both?

***Answer:***
- Imagine we have a simple example as mentioned below.
- We have input $x_i$ with 2 activation functions $f_1, f_2$ and we get the predictions $\hat{y_i}$.
  ![](imgs/img33.png)
- Assume we use a simple MSE as the loss function, so $Loss_i = (y_i - \hat{y_i})^2$.
$$\frac{\partial Loss}{\partial w_1} = \sum_{i=1}^{n}\frac{\partial Loss_i}{\partial w_1}$$
  ![](imgs/img34.png)
- Elaborating further on a single term:
$$\frac{\partial Loss_i}{\partial w_1} = \frac{\partial }{\partial w_1} \ (y_i - f_2(f_1(x_i,w_1) \cdot w_2))^2$$
- Note that $f_1$ and $f_2$ are differentiable, and we solve this using the chain rule. In Logistic Regression there is a single activation to differentiate through, whereas here $w_1$ sits inside nested functions.
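
To make the chain rule concrete, the sketch below differentiates $Loss_i$ with respect to $w_1$ for the two-function example, assuming both $f_1$ and $f_2$ are Sigmoid and $f_1(x_i, w_1) = \sigma(w_1 x_i)$. The result is then compared with a numerical estimate. The parameter values and function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_i(w_1, w_2, x_i, y_i):
    """Loss_i = (y_i - f_2(f_1(x_i, w_1) * w_2))^2 with sigmoid f_1 and f_2."""
    o_1 = sigmoid(w_1 * x_i)          # f_1(x_i, w_1)
    y_hat = sigmoid(o_1 * w_2)        # f_2(o_1 * w_2)
    return (y_i - y_hat) ** 2

def dloss_dw1(w_1, w_2, x_i, y_i):
    """Chain rule: dLoss/dy_hat * dy_hat/do_1 * do_1/dw_1."""
    o_1 = sigmoid(w_1 * x_i)
    y_hat = sigmoid(o_1 * w_2)
    dloss_dyhat = -2.0 * (y_i - y_hat)
    dyhat_do1 = y_hat * (1.0 - y_hat) * w_2     # sigmoid'(z) = s(z) * (1 - s(z))
    do1_dw1 = o_1 * (1.0 - o_1) * x_i
    return dloss_dyhat * dyhat_do1 * do1_dw1

# Hypothetical values, for illustration only.
w_1, w_2, x_i, y_i = 0.4, -0.7, 1.5, 1.0

analytic = dloss_dw1(w_1, w_2, x_i, y_i)
h = 1e-6
numeric = (loss_i(w_1 + h, w_2, x_i, y_i) - loss_i(w_1 - h, w_2, x_i, y_i)) / (2 * h)
print(analytic, numeric)              # the two estimates agree closely
```

The three factors in `dloss_dw1` correspond to the three nested functions, one factor per layer of nesting.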















***Question:*** Isn't DL similar to Stacking?

***Answer:***
- In stacking, all the models $M_1, M_2, M_3$ are built independently.
- The outputs of these models are passed to $M_4$, which is built afterwards.
  ![](imgs/img35.png)
- In DL, we are learning the weights of all activation functions together.
  ![](imgs/img36.png)
- So in DL, we have more flexibility to fine tune the weights simultaneously.

















***Question:*** Can you give a similar distinction between boosting and MLP? In boosting too, we build trees sequentially, each trying to correct the mistakes made by the previous trees.

***Answer:***
- In boosting, we have a base learner $h_i(x)$. Using errors made by this model, we build next model $h_{i+1}(x)$.
- While building $h_{i+1}(x)$, $h_{i}(x)$ is fixed. Similarly while building $h_{i+2}(x)$, $h_{i+1}(x)$ is fixed.
  ![](imgs/img37.png)
- In DL, an activation function is connected to another activation. All of them are trained simultaneously.


















***Question:*** Since we are using multiple layers, does our loss function remain convex (i.e. a single minimum), or is there a change here?

***Answer:***
- Since we compose multiple functions, the loss turns out to be very complex and often non-convex.
  ![](imgs/img38.png)
- Hence, we will have multiple local minima.




















***Question:*** What would the time complexity (TC) look like for NN in comparison to traditional ML algorithms? This looks super expensive.

***Answer:***
- DL models are computationally expensive as there are a lot of parameters to train.
- A traditional Linear Regression model would take a few seconds to train. But DL models take minutes or even hours.
  ![](imgs/img39.png)
- It needs dedicated hardware (GPUs).
















***Question:*** Do we also get feature importance for NN?

***Answer:***
- It is very difficult to get feature importances.
- There are complex techniques like LIME, which will be covered in the CV module.





