# Part 1. Hyperparameter Tuning

## Tuning Process

So many params to tune in deep learning, so it helps to rank them by priority.

Top priority:

- learning rate


Second priority:

- momentum term ($\beta$)
- mini-batch size 
- number of hidden units


Third (low) priority:

- number of layers
- learning rate decay


Almost never tune:

- Adam params: $\beta_1$ (0.9), $\beta_2$ (0.999), $\epsilon$ ($10^{-8}$)



### How do you select values to explore? 


#### Random > Grid

In earlier days, grid search was the norm: systematically explore a grid of values. This works okay when the number of params is small.


Now it is better to sample randomly!! Reasons:

- Difficult to know in advance which params are critical.
- With a grid we try far fewer distinct values of each param for the same number of runs.
- For example, if $\epsilon$ has little impact on your results, a 5x5 grid over two params costs 25 runs but explores only 5 distinct values of the other param (e.g., learning rate). Random sampling would explore 25 distinct values with the same budget.
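To make the counting argument concrete, here is a minimal sketch assuming two hyperparameters scaled to $[0, 1]$ and a budget of 25 runs. The names `grid_trials` and `random_trials` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Grid search: 5 candidate values per hyperparameter -> 25 combinations
grid_values = np.linspace(0.0, 1.0, 5)
grid_trials = np.array([(p1, p2) for p1 in grid_values for p2 in grid_values])

# Random search: the same budget of 25 runs, each drawn independently
random_trials = rng.uniform(0.0, 1.0, size=(25, 2))

# Distinct values explored for the first hyperparameter
grid_distinct = len(np.unique(grid_trials[:, 0]))
random_distinct = len(np.unique(random_trials[:, 0]))
print(grid_distinct, random_distinct)
```

The grid only ever tries 5 values of each hyperparameter, while random sampling tries a new value on every run.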


#### Coarse to Fine

- You can first explore the whole region 
- If you find a specific region that works well, sample more densely in the zoomed in region.
- Sounds like explore vs exploit
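A rough sketch of coarse-to-fine sampling, under the assumption that we have some score to maximize. `toy_score` is a hypothetical stand-in for a validation metric, and the box half-width of 0.1 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_score(x, y):
    # Hypothetical validation score with its best value at (0.7, 0.2)
    return -((x - 0.7) ** 2 + (y - 0.2) ** 2)

# Coarse pass: sample the whole [0, 1] x [0, 1] region
coarse = rng.uniform(0.0, 1.0, size=(30, 2))
coarse_scores = toy_score(coarse[:, 0], coarse[:, 1])
best = coarse[np.argmax(coarse_scores)]

# Fine pass: sample more densely in a small box around the best coarse point
half_width = 0.1
low = np.clip(best - half_width, 0.0, 1.0)
high = np.clip(best + half_width, 0.0, 1.0)
fine = rng.uniform(low, high, size=(30, 2))
fine_scores = toy_score(fine[:, 0], fine[:, 1])

best_overall = max(coarse_scores.max(), fine_scores.max())
print(best, best_overall)
```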


## Using an Appropriate Scale to pick Hyperparams

Sampling randomly in the above notes doesn't mean we should always sample uniformly at random!!

It is important to select the correct scale.

To illustrate, if your hyperparameter range is from 2 to 5, it is fine to sample randomly or even do grid search.

But what about a range like $\alpha \in [0.0001, 1]$? In this case it doesn't make sense to sample uniformly, since about 90% of the samples will land in the region $[0.1, 1]$.

In such cases, we should sample on a log scale. Below is an implementation:

In [None]:
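import numpy as np
a = -4                      # lower exponent: 10**-4 = 0.0001
r = a * np.random.rand()    # r is uniform in (-4, 0]
alpha = 10**r               # alpha lies in (0.0001, 1], uniform on a log scale
alpha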

As a general rule, if you want to sample from $10^a$ to $10^b$ in the logarithmic scale:

- Sample uniformly at random in range $r \in [a,b]$
- Then set your param to $\alpha = 10^{r}$
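The general rule can be sketched as a small helper. `sample_log_uniform` is a name invented here, not a library function. The comparison with naive uniform sampling shows where the samples end up for the range $[0.0001, 1]$.

```python
import numpy as np

def sample_log_uniform(a, b, size, rng):
    """Draw `size` values between 10**a and 10**b, uniform in the exponent."""
    r = rng.uniform(a, b, size=size)  # r is uniform in [a, b]
    return 10.0 ** r

rng = np.random.default_rng(0)
n = 100_000

uniform_alpha = rng.uniform(1e-4, 1.0, size=n)           # naive uniform sampling
log_alpha = sample_log_uniform(-4, 0, size=n, rng=rng)   # log-scale sampling

# Fraction of samples that land in [0.1, 1]
frac_uniform = np.mean(uniform_alpha >= 0.1)
frac_log = np.mean(log_alpha >= 0.1)
print(frac_uniform, frac_log)
```

Uniform sampling puts about 90% of the draws in $[0.1, 1]$, while log-scale sampling puts about 25% there (one of the four decades).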


#### Sampling the hyperparam for exponentially weighted averages

We usually want to sample $\beta \in [0.9,0.999]$. In this case, we apply the same logarithmic scale to $(1-\beta)$, which ranges over $[0.001, 0.1]$. 

That way we spend as many samples on the range $[0.99,0.999]$ as on $[0.9,0.99]$!!


**Intuition**

- For values very close to 1, small changes to $\beta$ actually make a big difference:
    - Consider $0.9000 \to 0.9005$. The difference would be minimal.
    - But for $0.999 \to 0.9995$, the difference (impact of the change) is quite big.
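A minimal sketch of sampling $\beta$ this way. It also uses the usual rule of thumb (an assumption brought in here, not derived in these notes) that an exponentially weighted average with parameter $\beta$ averages over roughly $1/(1-\beta)$ steps, which is one way to see why changes near 1 matter so much.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1 - beta ranges over [0.001, 0.1] = [10**-3, 10**-1]
r = rng.uniform(-3, -1, size=10_000)
beta = 1 - 10.0 ** r

# Roughly half the samples should fall in each of the two decades
frac_low = np.mean(beta < 0.99)    # beta in [0.9, 0.99)
frac_high = np.mean(beta >= 0.99)  # beta in [0.99, 0.999]

# Approximate averaging window 1 / (1 - beta) for the values in the bullets above
window_a, window_b = 1 / (1 - 0.9000), 1 / (1 - 0.9005)
window_c, window_d = 1 / (1 - 0.9990), 1 / (1 - 0.9995)
print(frac_low, frac_high)
print(window_a, window_b, window_c, window_d)
```

The window barely moves for $0.9000 \to 0.9005$ (about 10 steps either way) but doubles for $0.999 \to 0.9995$ (about 1000 to about 2000 steps).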

## Hyperparameters tuning in Practice
(Pandas vs Caviar)

Tips and tricks for hyperparam search process..


- Intuitions do get stale: re-evaluate occasionally. Do not assume that certain params will always work better.


Two schools of thought: 

- Babysitting one model
- Train many models in parallel


**Babysitting one model**

(panda approach to raising kids)

- Every day you look at the results and manually set the next values, over the course of many days.

- This is necessary if you have limited compute and have to find a good region without many runs.


**Training many models in parallel**

(caviar approach: lay millions of eggs and don't pay much attention to any single one)

- Train many models in parallel
- Do not pay too much human attention to babysitting each model


# Part 2. Batch Normalization


## Normalizing Activations in a Network

Batch normalization:

- Makes your model much more robust to small hyperparameter changes
- Also makes it easier to train much deeper models

Let's see how it works...


- Remember that normalizing your input features helps speed up learning. This is because it gives the contours of the cost function a much more circular shape.


How about the inner activations?

- It would be nice if we could also normalize the activations (which are the inputs to deeper layers), since that would help the deeper layers train faster. How can we do that?

- Technically, we normalize the values before the activation function. So we normalize $Z^{[l]}$, not $A^{[l]}$.


#### Implementing Batch Norm.

- Given some values from a specific layer $Z^{[l]}$: $Z^{[l](1)},Z^{[l](2)},..., Z^{[l](m)}$, compute the mean and variance, then normalize:

$$
\mu = \dfrac{1}{m} \sum_i Z^{[l](i)}\\
\sigma^2 = \dfrac{1}{m} \sum_i (Z^{[l](i)}-\mu)^2\\
Z_{norm}^{[l](i)} = \dfrac{Z^{[l](i)}-\mu}{\sqrt{\sigma^2+\epsilon}}
$$

$\epsilon$ is added for numerical stability.


Sometimes we don't want hidden units to have mean=0 and var=1. In that case what we do is:


$$\tilde{Z}^{[l](i)} = \gamma^{[l]} Z_{norm}^{[l](i)} + \beta^{[l]}$$

where $\gamma^{[l]}$ and $\beta^{[l]}$ are learnable parameters for layer $l$. $\gamma$ sets the scale (standard deviation) and $\beta$ sets the mean of the result. For instance, $\gamma = \sqrt{\sigma^2+\epsilon}$ and $\beta = \mu$ would recover the original $Z$. Plug in the values and try it yourself!!


Intuition: We know normalizing inputs is helpful, so we use batch norm to normalize the hidden layer outputs as well. The only difference is that we don't ALWAYS want to fully normalize to mean=0 and variance=1, so we use the gamma and beta params to keep the mean and variance adjustable. Either way, this normalization helps standardize your inner layer outputs!!
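The formulas above can be sketched in NumPy as follows. This is a forward pass only, with `batch_norm_forward` being an illustrative name and `eps=1e-8` an assumed default, not how any particular library implements it.

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize Z of shape (n_units, m) per hidden unit across the m samples."""
    mu = Z.mean(axis=1, keepdims=True)                 # (n_units, 1)
    var = ((Z - mu) ** 2).mean(axis=1, keepdims=True)  # (n_units, 1)
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    Z_tilde = gamma * Z_norm + beta                    # learnable scale and shift
    return Z_tilde, Z_norm

rng = np.random.default_rng(0)
Z = rng.normal(loc=3.0, scale=5.0, size=(4, 64))  # 4 hidden units, 64 samples

gamma = np.full((4, 1), 2.0)   # desired standard deviation per unit
beta = np.full((4, 1), -1.0)   # desired mean per unit
Z_tilde, Z_norm = batch_norm_forward(Z, gamma, beta)

print(Z_norm.mean(axis=1), Z_norm.var(axis=1))
print(Z_tilde.mean(axis=1), Z_tilde.std(axis=1))
```

`Z_norm` ends up with mean 0 and variance 1 per unit, and `Z_tilde` with mean $\beta$ and standard deviation $\gamma$.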

## Fitting Batch Norm into a Neural Network

With batch norm:

- $Z^{[1]} = W^{[1]} X + b^{[1]}$
- $\tilde{Z}^{[1]} = \gamma^{[1]} Z_{norm}^{[1]} + \beta^{[1]}$
- $A^{[1]} = g^{[1]}(\tilde{Z}^{[1]})$

and same for every layer. So parameters of your network are:

- for every layer, $\beta^{[l]}$ and $\gamma^{[l]}$: $\beta^{[1]}, \gamma^{[1]}, \beta^{[2]}, \gamma^{[2]}, \dots, \beta^{[L]}, \gamma^{[L]}$
- in addition to your $W$ and $b$ parameters
- You will also train these new parameters using gradient descent!! 

$$
\beta^{[l]} = \beta^{[l]} - \alpha \cdot d\beta^{[l]}\\
\gamma^{[l]} = \gamma^{[l]} - \alpha \cdot d\gamma^{[l]}
$$

Both $\beta^{[l]}$ and $\gamma^{[l]}$ have dimension $(n^{[l]},1)$, just like the bias term, because we want to learn a separate scale and shift for each hidden unit!

So after calculating $Z^{[l]}$, which has shape $(n^{[l]},m)$ where $m$ is the batch size, we normalize it, then multiply by the $\gamma$ scaling factor and add the $\beta$ shift for each hidden unit.

Note that in practice we rarely implement the batch norm operation by hand, since modern libraries such as PyTorch or TensorFlow (`tf.nn.batch_normalization`) provide it.


#### In practice batch norm is applied with mini-batch GD

- Take your mini-batch
- Compute mean and variance just on this mini-batch


One important detail:

- In the typical setting we always have $Z = WX+b$.
- But when you think about it, since we normalize the $Z$s by subtracting the mean, $b$ is kind of useless: it is the same for all samples, so it shifts the mean by the same amount and gets subtracted out.
- For this reason we can think of $\beta^{[l]}$ as playing the role of our good old bias parameter $b$, since:

$$
\tilde{Z}^{[l]} = \gamma^{[l]} * Z_{norm}^{[l]} + \beta^{[l]}
$$

So when we apply batch norm, we can actually remove the bias parameter.


Remember that gamma and beta hold separate values for EACH hidden unit (so they are arrays with one learnable parameter per hidden unit)!! You are normalizing each hidden unit separately across the samples.
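A quick numerical check of the claim that $b$ becomes redundant once we subtract the mean. The shapes and the helper name `normalize` are arbitrary choices for this sketch.

```python
import numpy as np

def normalize(Z, eps=1e-8):
    # Per-unit normalization across the samples (columns)
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    return (Z - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))   # 3 hidden units, 5 input features
X = rng.normal(size=(5, 32))  # mini-batch of 32 samples
b = rng.normal(size=(3, 1))   # bias, one value per hidden unit

with_bias = normalize(W @ X + b)
without_bias = normalize(W @ X)
print(np.allclose(with_bias, without_bias))  # True: b is cancelled by the mean subtraction
```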

## Why does Batch Norm work? 

It makes your intermediate layers more robust to changes in their inputs.

To set context: imagine you trained your cat classifier on black cats, but at test time you get pictures of colored cats. This is referred to as covariate shift: even though the underlying function (cat vs non-cat) did not change, the distribution of your data did. This reduces the performance of your model.


Now think of covariate shift happening at each iteration:

- In a deep neural network, imagine the fourth layer. After each backpropagation step, layers 1 to 3 have updated their weights. That means that the input to layer 4 changes at each iteration.

- By applying batch norm to the output of the previous layers, we cannot make the values keep exactly the same distribution. But at least we can ensure that they keep a similar scale (mean and variance).

- This lets each layer train more independently of the fluctuations in earlier layers, which helps speed up training.

- It makes your hidden layers more robust to changes.


**Batch Norm as regularization**

Secondary benefit of batch norm: 

- Since the normalization is applied on mini-batches, each mini-batch will have slightly different mean and variance. 
- This acts a bit like dropout in that it introduces some noise to each layer's output. 
- This can be seen as a way of introducing regularization to your network.
- Increasing the mini-batch size reduces the regularization effect.
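A small sketch of where the noise comes from, assuming the values of one hidden unit are drawn from a standard normal. `spread_of_batch_means` is a made-up helper that measures how much the mini-batch mean fluctuates from batch to batch.

```python
import numpy as np

rng = np.random.default_rng(0)
activations = rng.normal(size=65_536)  # stand-in for one hidden unit's Z values

def spread_of_batch_means(values, batch_size):
    # Split into mini-batches and measure how much the batch means differ
    batches = values.reshape(-1, batch_size)
    return batches.mean(axis=1).std()

noise_small = spread_of_batch_means(activations, 32)
noise_large = spread_of_batch_means(activations, 1024)
print(noise_small, noise_large)
```

The batch means fluctuate much more for batches of 32 than for batches of 1024, which matches the last bullet: bigger mini-batches mean less noise and less regularization.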

## Batch Norm at Test Time

Maybe in prediction time our mini-batch size is 1. What will we do??


A different way of calculating $\mu$ and $\sigma^2$ is needed, since we don't have a mini-batch now.

In typical applications: 

- Estimate them using a separate exponentially weighted average (across mini-batches): keep a running average of the mean and variance during training.


- During test time, use these running averages of $\mu$ and $\sigma^2$ to normalize, following the same batch norm formula.


In practice, batch norm is quite robust to minor changes in these values, so there is no need to worry too much about exactly how we estimate them. 
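A minimal sketch of the running-average idea, assuming a decay rate of 0.9 (called `momentum` here, an arbitrary choice) and simulated pre-activations in place of a real network.

```python
import numpy as np

rng = np.random.default_rng(0)
momentum = 0.9   # assumed decay rate for the running averages
eps = 1e-8

running_mu = np.zeros((3, 1))
running_var = np.ones((3, 1))

# Training: update the running statistics from each mini-batch
for _ in range(200):
    Z = rng.normal(loc=2.0, scale=3.0, size=(3, 64))  # 3 units, 64 samples
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    running_mu = momentum * running_mu + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var

# Test time: a single example, normalized with the running statistics
z_single = rng.normal(loc=2.0, scale=3.0, size=(3, 1))
z_norm = (z_single - running_mu) / np.sqrt(running_var + eps)
print(running_mu.ravel(), running_var.ravel())
```

The running values settle near the true mean (2) and variance (9) of the simulated data, so a single test example can be normalized without a mini-batch.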

# Part 3. Multi-class Classification

## Softmax Regression

Softmax regression is a way to go from binary (0/1) classification to N classes.

Softmax is something we already know very well, so no need to take massive notes here I believe.


In [None]:
import numpy as np

Z_l = np.random.randn(4,1) # last layer output without activation

In [None]:
Z_l

In [None]:
soft_zl = np.exp(Z_l)
soft_zl = soft_zl / (np.sum(soft_zl)) # this becomes the last layer activation

In [None]:
soft_zl

In [None]:
soft_zl.sum()

- The unusual thing about this activation function is that it is not element-wise anymore.

- Since we have to normalize all the exponentials by their total, the activation takes a (4,1) vector as input and outputs a (4,1) vector. This is different from sigmoid, which acts on a single value.


You can use a neural network with zero hidden layers and a softmax output to get several linear decision boundaries, e.g. for 3 classes. 

$$
y_{pred} = softmax(WX+b)
$$

The decision boundaries will always be linear, since we only take a linear combination of the inputs.

## Training a Softmax Classifier

A deeper look at how we train a neural network with a softmax output.

The name comes from the contrast with "hard-max", which would just output $[1,0,0,0]$ if the first element had the highest value. 

When the number of classes ($C$) equals 2, softmax reduces to logistic regression. So softmax is a generalization of logistic regression to N classes.
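A quick numerical check of the two-class case: with two logits $z_1, z_2$, the first softmax output equals $\text{sigmoid}(z_1 - z_2)$. The logit values below are arbitrary.

```python
import numpy as np

def softmax(z):
    # Subtracting the max does not change the result but avoids overflow
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

z = np.array([1.3, -0.4])  # two arbitrary logits
p_softmax = softmax(z)[0]
p_sigmoid = sigmoid(z[0] - z[1])
print(p_softmax, p_sigmoid)
```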

Convention:

$$
Z^{[L]} = [5,2,-1,3]\\
t = [e^5,e^2,e^{-1},e^3]
$$
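Continuing this example in code, as a small sketch: normalize $t$ to get the softmax output and compare it with the "hard max" of the same vector.

```python
import numpy as np

Z_L = np.array([5.0, 2.0, -1.0, 3.0])
t = np.exp(Z_L)   # element-wise exponentials
a = t / t.sum()   # softmax output, sums to 1

hardmax = np.zeros_like(Z_L)
hardmax[np.argmax(Z_L)] = 1.0  # all the mass on the largest entry

print(np.round(a, 3))  # roughly [0.842 0.042 0.002 0.114]
print(hardmax)         # [1. 0. 0. 0.]
```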


Loss Function:

target output $y = [0,1,0,0]$, prediction $y_{pred}  = [0.3,0.2,0.1,0.4]$

$$
L(y_{pred},y) = - \sum_{j=1}^4 y_j * log(y_{pred j})\\
= -y_2 * log(y_{pred 2}) =  -log(y_{pred 2})
$$

So we should make $y_{pred 2}$ as close to 1 as possible.

Reminder: each target $y$ is the one-hot encoding of the target class for that sample.
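The loss for the example above, worked out numerically. `better_pred` is a made-up second prediction, added only to show that the loss shrinks as $y_{pred 2}$ approaches 1.

```python
import numpy as np

y = np.array([0.0, 1.0, 0.0, 0.0])       # one-hot target (class 2)
y_pred = np.array([0.3, 0.2, 0.1, 0.4])  # prediction from the example

loss = -np.sum(y * np.log(y_pred))
print(loss)  # equals -log(0.2), about 1.609

# A more confident correct prediction gives a smaller loss
better_pred = np.array([0.05, 0.9, 0.02, 0.03])
better_loss = -np.sum(y * np.log(better_pred))
print(better_loss)
```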

Gradient of the loss with respect to the output of the last layer (before softmax):
$$
dZ^{[L]}  = y_{pred} - y
$$

where $y_{pred}$ is the softmax output vector and $y$ is the one-hot encoding target vector!!
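Before reading the derivation, the formula can be sanity-checked numerically. This sketch compares $y_{pred} - y$ with a central finite-difference estimate of the gradient; the step size `h` and the example vectors are arbitrary choices.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss_fn(z, y):
    return -np.sum(y * np.log(softmax(z)))

z = np.array([5.0, 2.0, -1.0, 3.0])
y = np.array([0.0, 1.0, 0.0, 0.0])

analytic = softmax(z) - y  # dZ = y_pred - y

# Central finite differences, one coordinate at a time
h = 1e-5
numeric = np.zeros_like(z)
for j in range(len(z)):
    step = np.zeros_like(z)
    step[j] = h
    numeric[j] = (loss_fn(z + step, y) - loss_fn(z - step, y)) / (2 * h)

print(np.max(np.abs(analytic - numeric)))
```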

**!!!!   DO NOT SKIP THE DERIVATION OF THE SOFTMAX GRADIENT   !!!!**

READ: https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/

### (Bonus) Proof for Derivative through Softmax ($\dfrac{\partial \mathcal{L}}{\partial Z}$)

For the proof to make sense, let's first agree on some notation. Consider an example where we have 4 classes and the correct answer is class 3. So $Y$ will look like this:
$$
Y = \begin{bmatrix}
0\\
0\\
1\\
0
\end{bmatrix}
$$


$$
Z = WX+B\\
A = \text{softmax}(Z)\\
\mathcal{L} = - \sum_{i=1}^4 Y_i \cdot log(A_i)  = - log(A_3)
$$

The last equality follows from $Y$ being a one-hot encoding with only $Y_3=1$, as shown above.

We are interested in knowing $\dfrac{\partial \mathcal{L}}{\partial Z}$. Recall that we can use the chain rule to get this value:

$$
\dfrac{\partial \mathcal{L}}{\partial Z} = \dfrac{\partial \mathcal{L}}{\partial A} \cdot \dfrac{\partial A}{\partial Z}
$$

**Step 1. $\dfrac{\partial \mathcal{L}}{\partial A}$**


This step is straightforward since we have already written out the loss above. Since it only depends on $A_3$, the derivative looks like this:

$$
\dfrac{\partial \mathcal{L}}{\partial A} = \begin{bmatrix}
0\\
0\\
-\dfrac{1}{A_3}\\
0
\end{bmatrix}
$$

since $\dfrac{\partial log(x)}{\partial x}=\dfrac{1}{x}$.


**Step 2. Derivative for Softmax**

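Recall the softmax operation $A_i = s(Z)_i = \dfrac{e^{Z_i}}{\sum_{j=1}^N e^{Z_j}}$. We are interested in the derivative of $A_3 = s(Z)_3$ with respect to every $Z_j$, where $j \in \{1,2,3,4\}$ (the other outputs drop out since $\partial \mathcal{L}/\partial A_i$ is zero for $i \neq 3$).

For this we will need the quotient rule.

##### Recall Quotient Rule.

$$
f(x) = \dfrac{g(x)}{h(x)}\\
f^{'}(x) = \dfrac{g^{'}(x)\cdot h(x) - h^{'}(x)\cdot g(x)}{[h(x)]^2}
$$

For simplicity let's refer to the denominator of softmax ($\sum_{i} e^{Z_i}$) as $T$ and to $s(Z)_j$ simply as $S_j$, so $S_j=\dfrac{e^{Z_j}}{T}$ (and $S_j = A_j$). 

We apply the quotient rule to softmax for each index $j$ to get $\dfrac{\partial S_3}{\partial Z_j}$.


**Case 1. $j=3$**


Then the quotient rule gives:

$$
\begin{align}
\dfrac{\partial S_3}{\partial Z_3} = \dfrac{e^{Z_3} \cdot T - e^{Z_3}\cdot e^{Z_3}}{T^2} &= \dfrac{e^{Z_3}\cdot (T-e^{Z_3})}{T^2}\\ &= \dfrac{e^{Z_3}}{T} \cdot \dfrac{T - e^{Z_3}}{T}\\ &= S_3 \cdot (1-S_3)
\end{align}
$$


**Case 2. $j \neq 3$**

Now the numerator $e^{Z_3}$ does not depend on $Z_j$, so $g^{'} = 0$ and only the second term of the quotient rule survives:

$$
\dfrac{\partial S_3}{\partial Z_j} = \dfrac{0 \cdot T - e^{Z_j}\cdot e^{Z_3}}{T^2} = -S_3 \cdot S_j
$$

**Putting it together**

Multiplying by $\dfrac{\partial \mathcal{L}}{\partial A_3} = -\dfrac{1}{S_3}$ from Step 1 gives $\dfrac{\partial \mathcal{L}}{\partial Z_3} = S_3 - 1$ and $\dfrac{\partial \mathcal{L}}{\partial Z_j} = S_j$ for $j \neq 3$, which is exactly $A - Y$.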

# Part 4. Intro to Programming Frameworks

## Deep Learning Frameworks

It is more practical to use a deep learning framework. (The course is old enough that it mentions stuff like Caffe instead of just TensorFlow and PyTorch.)

Criteria:

- Ease of programming (development and deployment)
- Running speed
- Truly open (open source with good governance)



## TensorFlow

Super basic structure and intro to TensorFlow. 



In [1]:
import numpy as np 
import tensorflow as tf



In [5]:
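w = tf.Variable(0,dtype=tf.float32)        # the single trainable parameter
optimizer = tf.keras.optimizers.Adam(0.1)  # Adam with learning rate 0.1

def train_step():
    # Record the forward computation so gradients can be taken afterwards
    with tf.GradientTape() as tape:
        cost = w**2 - 10*w + 25            # equals (w - 5)^2, minimized at w = 5
    trainable_variables = [w]
    grads = tape.gradient(cost,trainable_variables)
    optimizer.apply_gradients(zip(grads,trainable_variables))

print(w)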


<tf.Variable 'Variable:0' shape=() dtype=float32, numpy=0.0>


In [6]:
train_step()
print(w)

<tf.Variable 'Variable:0' shape=() dtype=float32, numpy=0.09999931>


In [7]:
for i in range(1000):
    train_step()
print(w)

<tf.Variable 'Variable:0' shape=() dtype=float32, numpy=5.000001>


And voila!! $w$ converges to 5, the minimum of $(w-5)^2$, using only `tf.GradientTape` and an optimizer to which we hand the trainable parameter $w$.

To summarize the above: 

- We declare the cost function within GradientTape context.
- Then calculate grads for the trainable params, given the cost we want to minimize.
- Finally, apply the gradients using the optimizer
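To see what the framework automates, here is the same minimization done by hand in plain Python. Note the swaps: this sketch uses plain gradient descent rather than Adam, and the gradient is derived manually instead of being recorded by a tape.

```python
def cost(w):
    return w**2 - 10 * w + 25

def grad(w):
    # Derivative worked out by hand: d/dw (w^2 - 10w + 25) = 2w - 10
    return 2 * w - 10

w = 0.0
learning_rate = 0.1
for _ in range(200):
    w = w - learning_rate * grad(w)  # plain gradient descent update

print(w, cost(w))
```

With a framework, the `grad` function is what the gradient tape computes for us, and the update line is what the optimizer applies.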

In [9]:
# Another function
# What if we have both input (x) and parameters (w)
w = tf.Variable(0,dtype=tf.float32)
x = np.array([1.0,-10,25.0],dtype=np.float32)
optimizer = tf.keras.optimizers.Adam(0.1)

def cost_fn():
    cost = x[0]*w**2 + x[1]*w + x[2]
    return cost

print(w)
optimizer.minimize(cost_fn,[w])
print(w)

<tf.Variable 'Variable:0' shape=() dtype=float32, numpy=0.0>
<tf.Variable 'Variable:0' shape=() dtype=float32, numpy=0.09999931>


In [12]:
# optimizer.minimize() is an alternative representation 
# of the first implementation
# Similar to what we built in micrograd with Karpathy
def training(x,w,optimizer):
    def cost_fn():
        cost = x[0]*w**2 + x[1]*w + x[2]
        return cost
    for i in range(1000):
        optimizer.minimize(cost_fn,[w])
        
    return w 

In [13]:
training(x,w,optimizer)

<tf.Variable 'Variable:0' shape=() dtype=float32, numpy=5.0000005>