# Deep Learning (2021)

## Theory of Deep Neural Networks

This notebook covers practical machine learning knowledge and deep neural networks, including feedforward and convolutional neural networks. 

### Theory and Knowledge
<span style="color:red">Activation functions play an important role in modern deep neural networks. For sigmoid and tanh, the output range and derivative are derived below.</span>

#### <u>Sigmoid</u>: 
$$\sigma(x) = \dfrac{1}{1+e^{-x}}$$

The input domain of $\sigma(x)$ is $-\infty < x < \infty$, so we can check three cases to find the output range of $\sigma(x)$:

Case 1. $x\rightarrow-\infty$:

As $x$ approaches $-\infty$, $e^{-x}\rightarrow\infty$, which implies $\sigma(x)\rightarrow 0$.

Case 2. $x = 0$:

At $x=0$, $e^{-x}=e^{0}=1$, which implies $\sigma(0)=\dfrac{1}{2}$.

Case 3. $x\rightarrow\infty$:

As $x$ approaches $\infty$, $e^{-x}\rightarrow 0$, which implies $\sigma(x)\rightarrow 1$.

Because $1+e^{-x}>1$ for every finite $x$, neither limit is attained, and therefore $\sigma(x)\in(0,1)$.

The derivative for $\sigma(x)$ is as follows:

\begin{equation}
\begin{split}
\dfrac{d\sigma}{dx} & = \dfrac{d}{dx}(1+e^{-x})^{-1}\\
& = e^{-x}(1+e^{-x})^{-2}\\
& = \dfrac{e^{-x}}{(1+e^{-x})^2}\\
& = \sigma(x)\dfrac{e^{-x}}{1+e^{-x}}\\
& = \sigma(x)\left(\dfrac{1+e^{-x}}{1+e^{-x}}-\dfrac{1}{1+e^{-x}}\right)\\
& = \sigma(x)\left(1-\sigma(x)\right)\\
\end{split}
\end{equation}

where the chain rule $\dfrac{d}{dx}u(v(x))=\dfrac{du}{dv}\dfrac{dv}{dx}$ was used with $u(v)=v^{-1}$ and $v(x)=1+e^{-x}$.
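As a quick sanity check on the identity $\sigma'(x)=\sigma(x)(1-\sigma(x))$, the sketch below compares the closed form against a central finite difference. The helper names `sigmoid` and `sigmoid_grad`, the step size `eps`, and the test grid are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Closed-form derivative sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# Central finite difference as an independent estimate of the derivative.
xs = np.linspace(-5.0, 5.0, 11)
eps = 1e-6  # assumed step size
numeric = (sigmoid(xs + eps) - sigmoid(xs - eps)) / (2.0 * eps)

max_err = np.max(np.abs(numeric - sigmoid_grad(xs)))
print("max abs error:", max_err)
```

The derivative peaks at $x=0$ with value $\tfrac{1}{4}$ and decays towards zero in both tails, which is one way to see why deep sigmoid networks can suffer from vanishing gradients.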

#### <u>Tanh</u>: 
$$\sigma(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}= \dfrac{1-e^{-2x}}{1+e^{-2x}}$$

Case 1. $x\rightarrow-\infty$:

As $x$ approaches $-\infty$, $e^{x}\rightarrow 0$ and $e^{-x}\rightarrow\infty$, so the first form gives $\sigma(x)\rightarrow -1$.

Case 2. $x = 0$:

At $x=0$, $e^{x}=e^{-x}=1$, which implies $\sigma(0)=0$.

Case 3. $x\rightarrow\infty$:

As $x$ approaches $\infty$, $e^{-2x}\rightarrow 0$, so the second form gives $\sigma(x)\rightarrow 1$.

Therefore, we see $\sigma(x)\in(-1,1)$

The derivative for $\sigma(x)$ is as follows:

\begin{equation}
\begin{split}
\dfrac{d\sigma}{dx} & = \dfrac{d}{dx}\left(\dfrac{e^x-e^{-x}}{e^x+e^{-x}}\right)\\
& = \dfrac{(e^x+e^{-x})(e^x+e^{-x})- (e^x-e^{-x})(e^x-e^{-x})}{(e^x+e^{-x})^2}\\
& = 1 - \dfrac{(e^x-e^{-x})^2}{(e^x+e^{-x})^2}\\
& = 1 - \sigma(x)^2
\end{split}
\end{equation}

where the quotient rule was used $\dfrac{d}{dx}\left(\dfrac{u}{v}\right)= \dfrac{\tfrac{d}{dx}(u)v-u\tfrac{d}{dx}(v)}{v^2}$
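The same kind of numerical check can be applied to the tanh result. This is a minimal sketch: `tanh_from_exp` and `tanh_grad` are hypothetical helper names, and `eps` is an assumed step size. It also confirms that the exponential form agrees with NumPy's built-in `np.tanh`.

```python
import numpy as np

def tanh_from_exp(x):
    """tanh written with exponentials, as in the definition above."""
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def tanh_grad(x):
    """Closed-form derivative 1 - tanh(x)^2."""
    return 1.0 - tanh_from_exp(x) ** 2

xs = np.linspace(-3.0, 3.0, 13)
eps = 1e-6  # assumed step size

# Central finite difference as an independent estimate of the derivative.
numeric = (tanh_from_exp(xs + eps) - tanh_from_exp(xs - eps)) / (2.0 * eps)
max_err = np.max(np.abs(numeric - tanh_grad(xs)))

# The exponential form should match NumPy's implementation.
agrees_with_numpy = np.allclose(tanh_from_exp(xs), np.tanh(xs))
print(max_err, agrees_with_numpy)
```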

<span style="color:red">The softmax activation transforms logits (raw discriminative scores) into prediction probabilities.</span>

 Consider a classification task with $M=4$ classes and a data example $x$ with a ground-truth label $y=2$. Assume that at the output layer of a feed-forward neural network, we obtain the logits $h^{L}=[2,-1,5,0]$.

The softmax function is given by

\begin{equation}
\text{softmax}(h_m)= \dfrac{e^{h_m}}{\sum_{i=1}^Me^{h_i}},\, \text{ for } m=1,\cdots,M
\end{equation}

Therefore, we can calculate the corresponding probabilities:

\begin{equation}
\begin{split}
p(x) & = 
\begin{bmatrix}
\dfrac{e^2}{e^2+e^{-1}+e^5+e^0}\\
\dfrac{e^{-1}}{e^2+e^{-1}+e^5+e^0}\\
\dfrac{e^5}{e^2+e^{-1}+e^5+e^0}\\
\dfrac{e^0}{e^2+e^{-1}+e^5+e^0}\\
\end{bmatrix}\\
& = 
\begin{bmatrix}
0.0470\\
0.0023\\
0.9443\\
0.0064\\
\end{bmatrix}\\
& = 
\begin{bmatrix}
4.70\%\\
0.23\%\\
94.43\%\\
0.64\%\\
\end{bmatrix}\\
\end{split}
\end{equation}

We can calculate the cross-entropy loss caused by the feed-forward neural network at $(x,y)$.

The general formula for cross entropy is given by

\begin{equation}
CE = - \sum_i y_i\log(p_i)
\end{equation}
where $y_i$ is the $i$-th entry of the one-hot encoded true label, $p_i$ is the predicted probability of class $i$, and $\log$ denotes the natural logarithm.

Given that the ground truth is $y=2$, the one-hot encoding is $y_2=1$ and $y_1=y_3=y_4=0$. Therefore, the cross entropy for this prediction is 

\begin{equation}
\begin{split}
CE & = -\log(p_2)\\
& = -\log(0.0023407)\\
& \approx 6.057
\end{split}
\end{equation}

In [2]:
import numpy as np
from math import exp
from math import log

def softmax(data):
    result = np.zeros(len(data))
    total = 0
    for row in range(len(data)):
        result[row] = exp(float(data[row]))
        total += result[row]
    return result/total

def CE_loss(y, p):
    result = 0
    for row in range(len(y)):
        result -= y[row]*log(p[row])
    return result
    
truth = np.array([0,1,0,0])
h = np.array([2,-1,5,0])
p = softmax(h)
loss = CE_loss(truth,p)
print("Question a)\n\t{}".format(p))
print("\nQuestion b)\n\t{}".format(loss))

Question a)
	[0.04701312 0.00234065 0.9442837  0.00636253]

Question b)
	6.0573286242556375
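The loop-based `softmax` above works for these logits, but `exp` overflows for large inputs. A common variant, sketched here, subtracts the maximum logit before exponentiating, which leaves the softmax output unchanged because the common factor cancels. The names `stable_softmax` and `cross_entropy` are hypothetical and not taken from the cell above.

```python
import numpy as np

def stable_softmax(logits):
    """Softmax computed on logits shifted by their maximum to avoid overflow."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

def cross_entropy(one_hot, probs):
    """-sum_i y_i * log(p_i) with the natural log; only y_i > 0 terms contribute."""
    one_hot = np.asarray(one_hot, dtype=float)
    mask = one_hot > 0
    return -np.sum(one_hot[mask] * np.log(probs[mask]))

h = [2, -1, 5, 0]
truth = [0, 1, 0, 0]  # class 2 in 1-based indexing
p = stable_softmax(h)
loss = cross_entropy(truth, p)
print(p)
print(loss)

# Logits this large would overflow a naive exp(); the shifted version stays finite.
big = stable_softmax([1000.0, 1001.0])
print(big)
```

Masking out the zero entries of the one-hot vector also avoids evaluating `log` on probabilities that do not contribute to the loss.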


<span style="color:red">A linear operation and an element-wise activation are the two building blocks of a layer in a feedforward neural network.</span>


Assume that hidden layer $1$ has value 
$$h^1(x)= \left[\begin{array}{cc}
1.5 & 2.0 \end{array}\right]^T$$

and that the weight matrix at the second layer (which has no bias, i.e. $b^2=0$) is:
$$W^{2}=\left[\begin{array}{cc}
-1 & -1\\
-1 & 1\\
-1 & 0
\end{array}\right]$$

We can calculate the value of the hidden layer $\bar{h}^{2}(x)$ after applying *the linear operation* with the matrix $W^2$ and the bias $b^2$ over $h^1$.


The value at $\bar{h}^{2}(x)$ is given by

\begin{equation}
\bar{h}^{2}(x) = W^2h^1+b^2
\end{equation}
and given there is no bias for hidden layer 2, we can calculate the value to be

\begin{equation}
\begin{split}
\bar{h}^{2}(x) & =
\begin{bmatrix}
-1 & -1\\
-1 & 1\\
-1 & 0
\end{bmatrix}
\begin{bmatrix}
1.5 \\
2.0 
\end{bmatrix}\\
& = 
\begin{bmatrix}
-3.5 \\
0.5 \\
-1.5
\end{bmatrix}
\end{split}
\end{equation}

Assume that we apply *the ReLU activation function* at the second layer. What is the value of the hidden layer $h^2(x)$ after we apply the activation function?

The ReLU function is

\begin{equation}
\text{ReLU}(z) = \max(0,z)
\end{equation}

Therefore, the value of the hidden layer after the activation is applied is

\begin{equation}
\begin{split}
h^2(x) = 
\begin{bmatrix}
0 \\
0.5 \\
0
\end{bmatrix}
\end{split}
\end{equation}

In [3]:
h1 = np.array([[1.5],
               [2]])
W2 = np.array([[-1,-1],
               [-1,1],
               [-1,0]])

h2bar = np.matmul(W2,h1)
h2 = np.maximum(h2bar,np.zeros(h2bar.shape))
print("Question a)\n{}".format(h2bar))
print("\nQuestion b)\n{}".format(h2))

Question a)
[[-3.5]
 [ 0.5]
 [-1.5]]

Question b)
[[0. ]
 [0.5]
 [0. ]]
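The two building blocks can be wrapped in one small reusable function. The sketch below is an illustration only: the names `relu` and `dense` and the optional `b` and `activation` arguments are assumptions, not part of the original notebook. It reproduces the values of $\bar{h}^2$ and $h^2$ computed above.

```python
import numpy as np

def relu(z):
    """Element-wise max(0, z)."""
    return np.maximum(z, 0.0)

def dense(W, h, b=None, activation=None):
    """Compute activation(W @ h + b) for a column vector h."""
    z = W @ h
    if b is not None:
        z = z + b
    return z if activation is None else activation(z)

W2 = np.array([[-1.0, -1.0],
               [-1.0,  1.0],
               [-1.0,  0.0]])
h1 = np.array([[1.5],
               [2.0]])

h2bar = dense(W2, h1)                  # linear operation only (no bias)
h2 = dense(W2, h1, activation=relu)    # linear operation followed by ReLU
print(h2bar.ravel(), h2.ravel())
```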


<span style="color:red">Multilayered feedforward neural network for a regression</span> 

Consider a network for a regression problem that predicts three real-valued outputs $y_1, y_2$, and $y_3$.

The architecture of this network ($3\,(\text{Input})\rightarrow 4\,(\text{ReLU})\rightarrow 3\,(\text{Output})$) is shown in the following figure:


<img src="FeedforwardNN.png" 
    style="display: block; 
           margin-left: auto;
           margin-right: auto;
           width: 65%;
           background: white;"
           />

We now feed a feature vector $x=\left[\begin{array}{ccc}
1.2 & -1 & 2\end{array}\right]^{T}$ with ground-truth label $y=\left[\begin{array}{ccc} 1.5 & -1 & 2\end{array}\right]^{T}$ to the above network. 

**Forward propagation**

What is the value of $\bar{h}^{1}(x)$?

\begin{equation}
\begin{split}
\bar{h}^1 & = W^1x+b^1\\
& = \begin{bmatrix} 1 & -1 &0\\0 & -1 & 1\\ 1 & 1 & -1\\1 & 1 & 1\end{bmatrix}
\begin{bmatrix} 1.2\\ -1\\ 2\end{bmatrix}
+\begin{bmatrix} 0\\ 1\\ 1\\ -1\end{bmatrix}\\
& = \begin{bmatrix} 2.2\\ 3\\ -1.8\\ 2.2\end{bmatrix}
+\begin{bmatrix} 0\\ 1\\ 1\\ -1\end{bmatrix}\\
& = \begin{bmatrix} 2.2\\ 4\\ -0.8\\ 1.2\end{bmatrix}
\end{split}
\end{equation}

What is the value of $h^{1}(x)$?

The ReLU function is

\begin{equation}
\text{ReLU}(z) = \max(0,z)
\end{equation}

Therefore, the value of the hidden layer after the activation is applied is

\begin{equation}
\begin{split}
h^1(x) & = \text{ReLU}\left(\bar{h}^1\right)\\
& = \begin{bmatrix}
2.2 \\
4 \\
0\\
1.2
\end{bmatrix}
\end{split}
\end{equation}

What is the predicted value $\hat{y}$?

\begin{equation}
\begin{split}
\hat{y} & = W^2h^1+b^2\\
& = \begin{bmatrix} 1.5 & 1 &1 & -1\\0 & 0 & 1 & 1\\ -1 & 1 & 1 & -1\end{bmatrix}
\begin{bmatrix} 2.2\\ 4\\ 0\\ 1.2\end{bmatrix}
+\begin{bmatrix} 1\\ 0\\ 0.5\end{bmatrix}\\
& = \begin{bmatrix} 6.1\\ 1.2\\ 0.6\end{bmatrix}
+\begin{bmatrix} 1\\ 0\\ 0.5\end{bmatrix}\\
& = \begin{bmatrix} 7.1\\ 1.2\\ 1.1\end{bmatrix}
\end{split}
\end{equation}

What is the value of the L2 loss $\ell_2$?

\begin{equation}
\begin{split}
\ell_2 & = \sqrt{\sum_{i=1}^3\left(y_i-\hat{y}_i\right)^2}\\
& = \sqrt{\left(1.5-7.1\right)^2+\left(-1-1.2\right)^2+\left(2-1.1\right)^2}\\
& = 6.08
\end{split}
\end{equation}
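The forward pass above is the same linear-then-activation step applied layer by layer, so it generalizes to any depth. The following sketch assumes a hypothetical `forward` helper that takes a list of `(W, b, activation)` tuples; it reproduces $\hat{y}$ and $\ell_2$ for the network in this example.

```python
import numpy as np

def relu(z):
    """Element-wise max(0, z)."""
    return np.maximum(z, 0.0)

def forward(x, layers):
    """Apply each (W, b, activation) layer in turn; activation may be None."""
    h = x
    for W, b, activation in layers:
        z = W @ h + b
        h = z if activation is None else activation(z)
    return h

x = np.array([[1.2], [-1.0], [2.0]])
y = np.array([[1.5], [-1.0], [2.0]])
W1 = np.array([[1, -1, 0], [0, -1, 1], [1, 1, -1], [1, 1, 1]], dtype=float)
b1 = np.array([[0.0], [1.0], [1.0], [-1.0]])
W2 = np.array([[1.5, 1, 1, -1], [0, 0, 1, 1], [-1, 1, 1, -1]], dtype=float)
b2 = np.array([[1.0], [0.0], [0.5]])

# Hidden layer uses ReLU; the regression output layer is linear.
yhat = forward(x, [(W1, b1, relu), (W2, b2, None)])
l2 = np.linalg.norm(y - yhat)  # Euclidean norm of the residual
print(yhat.ravel(), l2)
```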

**Backward propagation**

For backward propagation, we first need to calculate $\dfrac{\partial \ell_2}{\partial h^{2}},\dfrac{\partial \ell_2}{\partial W^{2}}$, and $\dfrac{\partial \ell_2}{\partial b^{2}}$.

Using $h^2 =\hat{y} = W^2h^1+b^2$ and $\dfrac{\partial \ell_2}{\partial \hat{y}_i} = \dfrac{-2\left(y_i-\hat{y}_i\right)}{2\ell_2} = \dfrac{\hat{y}_i-y_i}{\ell_2}$,

\begin{equation}
\begin{split}
\dfrac{\partial \ell_2}{\partial h^2} & =\dfrac{\partial \ell_2}{\partial \hat{y}}\\
& =\begin{bmatrix} \dfrac{\hat{y}_1-y_1}{\ell_2}& \dfrac{\hat{y}_2-y_2}{\ell_2} & \dfrac{\hat{y}_3-y_3}{\ell_2} \end{bmatrix}\\
& = \dfrac{1}{6.08}\begin{bmatrix}5.6 & 2.2 & -0.9 \end{bmatrix}\\
& = \begin{bmatrix}0.92 & 0.36 &  -0.15\end{bmatrix} 
\end{split}
\end{equation}


Using $h^2 = W^2 h^1 + b^2$, the gradient with respect to $W^2$ is the outer product of $\left(\dfrac{\partial \ell_2}{\partial h^2}\right)^T$ and $\left(h^1\right)^T$:

\begin{equation}
\begin{split}
\dfrac{\partial \ell_2}{\partial W^2} & = \dfrac{\partial \ell_2}{\partial h^2}\dfrac{\partial h^2}{\partial W^2} \\
& = \begin{bmatrix}0.92 \\ 0.36 \\  -0.15\end{bmatrix}  \begin{bmatrix} 2.2 &4 &0 &1.2\end{bmatrix}\\
& = \begin{bmatrix} 2.03 & 3.68 & 0 & 1.10\\0.80 & 1.45 & 0 & 0.43\\ -0.33 & -0.59 & 0 & -0.18\end{bmatrix}\\
\end{split}
\end{equation}


\begin{equation}
\begin{split}
\dfrac{\partial \ell_2}{\partial b^2} & = \dfrac{\partial \ell_2}{\partial h^2}\dfrac{\partial h^2}{\partial b^2}\\
& = \begin{bmatrix}0.92 \\ 0.36 \\  -0.15\end{bmatrix} 
\end{split}
\end{equation}

Next, we need to calculate the derivatives $\dfrac{\partial \ell_2}{\partial h^{1}}, \dfrac{\partial \ell_2}{\partial \bar{h}^{1}},\dfrac{\partial \ell_2}{\partial W^{1}}$, and $\dfrac{\partial \ell_2}{\partial b^{1}}$.

Using $h^2 = W^2h^1+b^2$,

\begin{equation}
\begin{split}
\dfrac{\partial \ell_2}{\partial h^1} & =\dfrac{\partial \ell_2}{\partial h^2}\dfrac{\partial h^2}{\partial h^1}\\
& = \begin{bmatrix}0.92 & 0.36 &  -0.15\end{bmatrix}   \begin{bmatrix} 1.5 & 1 &1 & -1\\0 & 0 & 1 & 1\\ -1 & 1 & 1 & -1\end{bmatrix}\\
&  = \begin{bmatrix}1.53 & 0.77 & 1.13 & -0.41\end{bmatrix}
\end{split}
\end{equation}


Using the fact that ReLU acts element-wise, its Jacobian is diagonal:

\begin{equation}
\begin{split}
h^1 & = \text{ReLU}\left(\bar{h}^1\right)\\
\dfrac{\partial h^1}{\partial \bar{h}^1} & = \begin{bmatrix}\text{ReLU}^\prime\left(\bar{h}^1_1\right) & 0 & 0 & 0\\0 &\text{ReLU}^\prime\left(\bar{h}^1_2\right) & 0 & 0\\ 0 & 0 & \text{ReLU}^\prime\left(\bar{h}^1_3\right) & 0\\ 0 & 0 & 0 & \text{ReLU}^\prime\left(\bar{h}^1_4\right)\end{bmatrix}\\
\end{split}
\end{equation}

where 

\begin{equation}
\text{ReLU}^\prime(x) =\left\{
\begin{matrix}
0 &\text{if }x \leq 0\\
1 &\text{if }x > 0\\
\end{matrix}\right.
\end{equation}


\begin{equation}
\begin{split}
\dfrac{\partial \ell_2}{\partial \bar{h}^1} & = \dfrac{\partial \ell_2}{\partial h^1}\dfrac{\partial h^1}{\partial \bar{h}^1}\\
& = \begin{bmatrix}1.53 & 0.77 & 1.13 & -0.41\end{bmatrix}  \begin{bmatrix}\text{ReLU}^\prime\left(2.2\right) & 0& 0 &0\\0 &\text{ReLU}^\prime\left(4\right) & 0& 0\\ 0 & 0 & \text{ReLU}^\prime\left(-0.8\right)& 0\\0 & 0 & 0 & \text{ReLU}^\prime\left(1.2\right)\end{bmatrix}\\
& = \begin{bmatrix}1.53 & 0.77 & 0 & -0.41 \end{bmatrix}\\
\end{split}
\end{equation}

Using $\bar{h}^1= W^1x +b^1$,

\begin{equation}
\begin{split}
\dfrac{\partial \ell_2}{\partial W^1} & =\dfrac{\partial \ell_2}{\partial \bar{h}^1}\dfrac{\partial \bar{h}^1}{\partial W^1}\\
& =\begin{bmatrix}1.53 \\ 0.77 \\ 0 \\ -0.41 \end{bmatrix} \begin{bmatrix} 1.2 & -1 & 2\end{bmatrix}\\
& = \begin{bmatrix}1.83 &  -1.53 & 3.06\\0.93 &  -0.77 & 1.55\\0 &  0 & 0\\ -0.49 & 0.41 & -0.82\end{bmatrix}
\end{split}
\end{equation}

\begin{equation}
\begin{split}
\dfrac{\partial \ell_2}{\partial b^1} & =\dfrac{\partial \ell_2}{\partial \bar{h}^1}\dfrac{\partial \bar{h}^1}{\partial b^1}\\
& =\begin{bmatrix}1.53 \\ 0.77 \\ 0 \\ -0.41 \end{bmatrix} 
\end{split}
\end{equation}

**SGD update**

Assume that we use SGD with learning rate $\eta=0.01$ to update the model parameters. What are the values of $W^2, b^2$ and $W^1, b^1$ after updating?

Updated values are rounded to 3 decimal places and computed from the unrounded gradients. The subscripts $1$ and $2$ denote the parameters before and after the update.

\begin{equation}
\begin{split}
W_2^2 & = W_1^2 - \eta \dfrac{\partial \ell_2}{\partial W_1^2}\\
& = \begin{bmatrix} 1.5 & 1 &1 & -1\\0 & 0 & 1 & 1\\ -1 & 1 & 1 & -1\end{bmatrix} - 0.01\begin{bmatrix} 2.03 & 3.68 & 0 & 1.10\\0.80 & 1.45 & 0 & 0.43\\ -0.33 & -0.59 & 0 & -0.18\end{bmatrix}\\
& = \begin{bmatrix} 1.480 & 0.963 &  1.000 & -1.011\\-0.008 & -0.014 &  1.000 & 0.996\\-0.997 &  1.006 & 1.000 & -0.998\end{bmatrix}\\
\end{split}
\end{equation}

\begin{equation}
\begin{split}
b_2^2 & = b_1^2 - \eta \dfrac{\partial \ell_2}{\partial b_1^2}\\
& = \begin{bmatrix} 1\\ 0\\ 0.5\end{bmatrix} - 0.01\begin{bmatrix}0.92 \\ 0.36 \\  -0.15\end{bmatrix} \\
& = \begin{bmatrix} 0.991 \\-0.004 \\0.501 \end{bmatrix}\\
\end{split}
\end{equation}

\begin{equation}
\begin{split}
W_2^1 & = W_1^1 - \eta \dfrac{\partial \ell_2}{\partial W_1^1}\\
& = \begin{bmatrix} 1 & -1 &0\\0 & -1 & 1\\ 1 & 1 & -1\\1 & 1 & 1\end{bmatrix} - 0.01\begin{bmatrix}1.83 &  -1.53 & 3.06\\0.93 &  -0.77 & 1.55\\0 &  0 & 0\\ -0.49 & 0.41 & -0.82\end{bmatrix}\\
& = \begin{bmatrix} 0.982 & -0.985 &  -0.031 \\-0.009 & -0.992 &  0.985\\1.000 &  1.000 & -1.000\\1.005 & 0.996 & 1.008 \end{bmatrix}\\
\end{split}
\end{equation}

\begin{equation}
\begin{split}
b_2^1 & = b_1^1 - \eta \dfrac{\partial \ell_2}{\partial b_1^1}\\
& = \begin{bmatrix} 0\\ 1\\ 1\\ -1\end{bmatrix} - 0.01\begin{bmatrix}1.53 \\ 0.77 \\ 0 \\ -0.41 \end{bmatrix}  \\
& = \begin{bmatrix} -0.015\\ 0.992\\ 1.000\\ -0.996\end{bmatrix}\\
\end{split}
\end{equation}
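Analytic gradients are easy to get wrong by a sign or a transpose, so a finite-difference check is a useful safeguard. The sketch below re-implements the forward and backward pass in column-vector form, using $\partial \ell_2/\partial \hat{y} = (\hat{y}-y)/\ell_2$, compares every gradient with a central difference, and confirms that one SGD step with $\eta=0.01$ lowers the loss. The function names `loss_fn`, `analytic_grads`, and `numeric_grad` and the step size `eps` are assumptions made for illustration.

```python
import numpy as np

x = np.array([[1.2], [-1.0], [2.0]])
y = np.array([[1.5], [-1.0], [2.0]])
W1 = np.array([[1, -1, 0], [0, -1, 1], [1, 1, -1], [1, 1, 1]], dtype=float)
b1 = np.array([[0.0], [1.0], [1.0], [-1.0]])
W2 = np.array([[1.5, 1, 1, -1], [0, 0, 1, 1], [-1, 1, 1, -1]], dtype=float)
b2 = np.array([[1.0], [0.0], [0.5]])

def loss_fn(W1, b1, W2, b2):
    """L2 loss of the 3 -> 4 (ReLU) -> 3 network on the single example (x, y)."""
    h1 = np.maximum(W1 @ x + b1, 0.0)
    yhat = W2 @ h1 + b2
    return np.sqrt(np.sum((y - yhat) ** 2))

def analytic_grads(W1, b1, W2, b2):
    """Backpropagation; every gradient has the same shape as its parameter."""
    h1bar = W1 @ x + b1
    h1 = np.maximum(h1bar, 0.0)
    yhat = W2 @ h1 + b2
    loss = np.sqrt(np.sum((y - yhat) ** 2))
    d_yhat = (yhat - y) / loss        # dl/dyhat as a column vector
    dW2 = d_yhat @ h1.T               # outer product
    db2 = d_yhat
    d_h1 = W2.T @ d_yhat
    d_h1bar = d_h1 * (h1bar > 0)      # ReLU mask
    dW1 = d_h1bar @ x.T
    db1 = d_h1bar
    return dW1, db1, dW2, db2

def numeric_grad(param_index, eps=1e-6):
    """Central-difference gradient w.r.t. one of (W1, b1, W2, b2)."""
    params = [W1, b1, W2, b2]
    grad = np.zeros_like(params[param_index])
    for idx in np.ndindex(grad.shape):
        plus = [p.copy() for p in params]
        minus = [p.copy() for p in params]
        plus[param_index][idx] += eps
        minus[param_index][idx] -= eps
        grad[idx] = (loss_fn(*plus) - loss_fn(*minus)) / (2.0 * eps)
    return grad

grads = analytic_grads(W1, b1, W2, b2)
max_errs = [np.max(np.abs(g - numeric_grad(i))) for i, g in enumerate(grads)]
print("max abs errors (W1, b1, W2, b2):", max_errs)

# One SGD step should reduce the loss when the gradient sign is correct.
eta = 0.01
updated = [p - eta * g for p, g in zip([W1, b1, W2, b2], grads)]
loss_before = loss_fn(W1, b1, W2, b2)
loss_after = loss_fn(*updated)
print(loss_before, loss_after)
```

The check is only meaningful away from the ReLU kink. Here every pre-activation is at least $0.8$ away from zero, so a perturbation of size `eps` does not change which units are active.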

In [4]:
import numpy as np

# Input feature vector and ground-truth target (column vectors).
x = np.array([[1.2],
              [-1],
              [2]])
y = np.array([[1.5],
              [-1],
              [2]])
# Layer 1 parameters (3 inputs -> 4 hidden units).
b1 = np.array([[0],
               [1],
               [1],
               [-1]])
W1 = np.array([[1,-1,0],
               [0,-1,1],
               [1,1,-1],
               [1,1,1]])
# Layer 2 parameters (4 hidden units -> 3 outputs).
b2 = np.array([[1],
               [0],
               [0.5]])
W2 = np.array([[1.5,1,1,-1],
               [0,0,1,1],
               [-1,1,1,-1]])
eta = 0.01  # SGD learning rate

# Forward pass.
h1bar = W1.dot(x) + b1
h1 = np.maximum(h1bar,np.zeros(h1bar.shape))
h2bar = np.matmul(W2,h1) + b2
yhat = h2bar
loss = np.sqrt(np.sum((y-yhat)**2))

# Backward pass; gradients w.r.t. vectors are stored as row vectors.
# d/dyhat of sqrt(sum((y-yhat)^2)) is (yhat-y)/loss.
dl2dh2 = np.transpose((1/loss)*(yhat-y))
dl2dW2 = np.transpose(dl2dh2).dot(np.transpose(h1))
dl2db2 = np.transpose(dl2dh2)
dl2dh1 = dl2dh2.dot(W2)
# Diagonal ReLU Jacobian: 1 where h1bar > 0, else 0.
dh1dhbar1 = np.array([[1,  0,  0,  0 ],
                      [0,  1,  0,  0 ],
                      [0,  0,  0,  0 ],
                      [0,  0,  0,  1 ]])
dl2dhbar1 = dl2dh1.dot(dh1dhbar1)
dl2dW1 = np.transpose(dl2dhbar1).dot(np.transpose(x))
# The bias gradient uses the gradient after the ReLU mask (dl2dhbar1, not dl2dh1).
dl2db1 = np.transpose(dl2dhbar1)

# SGD update.
W22 = W2-eta*dl2dW2
b22 = b2-eta*dl2db2
W21 = W1-eta*dl2dW1
b21 = b1-eta*dl2db1

print("Question a)\n{}".format(h1bar))
print("\nQuestion b)\n{}".format(h1))
print("\nQuestion c)\n{}".format(yhat))
print("\nQuestion d)\n{}".format(loss))
print("\nQuestion e)\ndl/dh2\n{}\ndl/dW2\n{}\ndldb2\n{}".format(dl2dh2,dl2dW2,dl2db2))
print("\nQuestion f)\ndl/dh1\n{}\ndl/dhbar1\n{}\ndl/dW1\n{}\ndldb1\n{}".format(dl2dh1,dl2dhbar1,dl2dW1,dl2db1))
print("\nQuestion g)\nW2 at t=2\n{}\nb2 at t=2\n{}\nW1 at t=2\n{}\nb1 at t=2\n{}\n".format(W22,b22,W21,b21))

Question a)
[[ 2.2]
 [ 4. ]
 [-0.8]
 [ 1.2]]

Question b)
[[2.2]
 [4. ]
 [0. ]
 [1.2]]

Question c)
[[7.1]
 [1.2]
 [1.1]]

Question d)
6.0835844697020525

Question e)
dl/dh2
[[ 0.92050994  0.36162891 -0.1479391 ]]
dl/dW2
[[ 2.02512188  3.68203978  0.          1.10461193]
 [ 0.79558359  1.44651563  0.          0.43395469]
 [-0.32546602 -0.59175639  0.         -0.17752692]]
dldb2
[[ 0.92050994]
 [ 0.36162891]
 [-0.1479391 ]]

Question f)
dl/dh1
[[ 1.52870401  0.77257085  1.13419975 -0.41094194]]
dl/dhbar1
[[ 1.52870401  0.77257085  0.         -0.41094194]]
dl/dW1
[[ 1.83444482 -1.52870401  3.05740803]
 [ 0.92708502 -0.77257085  1.54514169]
 [ 0.          0.          0.        ]
 [-0.49313033  0.41094194 -0.82188388]]
dldb1
[[ 1.52870401]
 [ 0.77257085]
 [ 0.        ]
 [-0.41094194]]

Question g)
W2 at t=2
[[ 1.47974878  0.9631796   1.         -1.01104612]
 [-0.00795584 -0.01446516  1.          0.99566045]
 [-0.99674534  1.00591756  1.         -0.99822473]]
b2 at t=2
[[ 0.9907949 ]
 [-0.003

--- 