# Hyperparameter tuning, batch normalization and frameworks

## Hyperparameter tuning

* hyperparameters ranked by importance (deeper nesting = usually less important to tune)
    * learning rate
        * momentum
        * batch size
        * number of hidden units
            * number of layers
            * learning rate decay

* optimization strategies
    * grid search
    * random search
    * coarse to fine - zoom in on potentially interesting subsets of space
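
The coarse-to-fine idea can be sketched as two rounds of random search, where the second round samples a smaller box around the best point of the first. This is only an illustration: `toy_validation_loss`, the bounds, the trial counts and the `shrink` factor are all assumptions made up for the example, and the toy function stands in for actually training a model.

```python
import random

def toy_validation_loss(lr_exponent, momentum):
    # Hypothetical stand-in for "train a model and return its validation loss".
    return (lr_exponent + 2.5) ** 2 + (momentum - 0.9) ** 2

def random_search(bounds, n_trials, rng):
    """Sample n_trials points uniformly inside bounds and return the best one."""
    best_point, best_loss = None, float("inf")
    for _ in range(n_trials):
        point = tuple(rng.uniform(lo, hi) for lo, hi in bounds)
        loss = toy_validation_loss(*point)
        if loss < best_loss:
            best_point, best_loss = point, loss
    return best_point, best_loss

def zoom(bounds, center, shrink=0.25):
    """Build a smaller box around center, clipped to the original bounds."""
    new_bounds = []
    for (lo, hi), c in zip(bounds, center):
        half_width = (hi - lo) * shrink / 2
        new_bounds.append((max(lo, c - half_width), min(hi, c + half_width)))
    return new_bounds

rng = random.Random(0)
# Coarse round: (log10 of learning rate, momentum) over a wide box.
coarse_bounds = [(-4.0, 0.0), (0.5, 0.999)]
coarse_point, coarse_loss = random_search(coarse_bounds, 25, rng)

# Fine round: same budget, spent inside a box around the coarse winner.
fine_bounds = zoom(coarse_bounds, coarse_point)
fine_point, fine_loss = random_search(fine_bounds, 25, rng)

best_loss = min(coarse_loss, fine_loss)
print("coarse:", coarse_point, coarse_loss)
print("fine:  ", fine_point, fine_loss)
```

The fine round is not guaranteed to beat the coarse one, which is why the sketch keeps the minimum over both rounds.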

* sampling scales
    * uniform (number of layers, number of hidden units)
    * log scale (learning rate; for momentum, sample $1-\beta$ on a log scale, because the averaging window $1/(1-\beta)$ becomes very sensitive to small changes as $\beta \to 1$)
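
A minimal sketch of these sampling scales, using only the standard library. The function names and the ranges (learning rate in $[10^{-4}, 1]$, $\beta$ in $[0.9, 0.999]$, 50 to 100 hidden units) are illustrative assumptions, not values prescribed by these notes.

```python
import random

def sample_learning_rate(rng, low_exp=-4.0, high_exp=0.0):
    # Sample the exponent uniformly, so each decade gets equal probability.
    r = rng.uniform(low_exp, high_exp)
    return 10 ** r

def sample_momentum(rng, low_exp=-3.0, high_exp=-1.0):
    # Sample 1 - beta on a log scale: 1 - beta in [1e-3, 1e-1],
    # which corresponds to beta in [0.9, 0.999].
    r = rng.uniform(low_exp, high_exp)
    return 1 - 10 ** r

def sample_hidden_units(rng, low=50, high=100):
    # A plain uniform draw is reasonable for counts like this.
    return rng.randint(low, high)

rng = random.Random(0)
for _ in range(5):
    lr = sample_learning_rate(rng)
    beta = sample_momentum(rng)
    units = sample_hidden_units(rng)
    print(f"lr={lr:.5f}  beta={beta:.4f}  window~{1 / (1 - beta):.0f}  units={units}")
```

Printing the window $1/(1-\beta)$ next to $\beta$ shows why a linear scale would be wasteful: moving $\beta$ from 0.9 to 0.9005 barely changes the window, while moving it from 0.999 to 0.9995 doubles it.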

* babysitting one model (panda) vs training many models in parallel (caviar)


## Batch normalization

* Can we normalize activation inputs/outputs of each layer to speed up the learning process?
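* algorithm (one layer description)
    * pre-activation values of a unit across a mini-batch of $m$ examples: $z^{(1)},...,z^{(m)}$
    * calculate the z-score of those values $z^{(i)}_{norm} = \frac{z^{(i)}-\mu_z}{\sqrt{\sigma^2_z+\epsilon}}$, where $\mu_z$ and $\sigma^2_z$ are the mini-batch mean and variance and $\epsilon$ guards against division by zero
    * rescale using $\tilde{z}^{(i)} = \gamma z^{(i)}_{norm} + \beta$, where $\gamma$ and $\beta$ are learnable params, so the network can pick a mean and variance other than 0 and 1 (e.g. to move a sigmoid out of its near-linear region)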
* algorithm (multi-layer)
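    * forward step -> compute $z^{[l]}$ as usual, then normalize it to $\tilde{z}^{[l]}$ and feed that into the activation
    * backward step -> also compute grads for $\beta$ & $\gamma$, then update parameters $W$, $\beta$, $\gamma$
    * NOTE: the bias $b^{[l]}$ is not needed here, since subtracting $\mu$ cancels any constant added to $z^{[l]}$; $\beta^{[l]}$ takes over its role
    * dimensions $\beta^{[l]}$ : $(n^{[l]},1)$ ; $\gamma^{[l]}$ : $(n^{[l]},1)$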
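
The single-unit forward step above can be sketched in plain Python as follows. The function name `batch_norm_forward`, the example values and the choice $\epsilon = 10^{-8}$ are assumptions for illustration. A real layer would apply the same computation to every unit at once with array operations.

```python
import math

def batch_norm_forward(z, gamma, beta, eps=1e-8):
    """Batch-normalize the pre-activations of one unit over a mini-batch."""
    m = len(z)
    mu = sum(z) / m                                  # mini-batch mean
    var = sum((zi - mu) ** 2 for zi in z) / m        # mini-batch variance
    z_norm = [(zi - mu) / math.sqrt(var + eps) for zi in z]
    z_tilde = [gamma * zn + beta for zn in z_norm]   # learnable scale and shift
    return z_tilde, mu, var

z = [1.0, 2.0, 3.0, 4.0]
z_tilde, mu, var = batch_norm_forward(z, gamma=2.0, beta=0.5)
print("mean, variance of z:", mu, var)
print("z_tilde:", z_tilde)

# Adding a constant bias to every z changes nothing after normalization,
# which is why the layer's bias term can be dropped.
z_shifted = [zi + 10.0 for zi in z]
z_tilde_shifted, _, _ = batch_norm_forward(z_shifted, gamma=2.0, beta=0.5)
print("same output with bias:", z_tilde_shifted)
```

After the transform the mini-batch has mean $\beta$ and standard deviation close to $\gamma$, regardless of the scale of the incoming $z$.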

* test time
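    * a single test example has no mini-batch to compute statistics from, so $\mu$ and $\sigma^2$ are estimated as exponentially weighted averages of the mini-batch statistics seen during training, and those estimates are used to compute $z_{norm}$ and $\tilde{z}$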
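
A sketch of how such running estimates might be kept and then used for one test example. The names `ewa_update` and `batch_norm_inference`, the decay of 0.9, and the initial values 0 and 1 are assumptions for this example; frameworks differ in their defaults and some apply a bias correction that is omitted here.

```python
import math

def ewa_update(running, batch_value, decay=0.9):
    """One step of an exponentially weighted average."""
    return decay * running + (1.0 - decay) * batch_value

def batch_stats(z):
    """Mean and variance of one unit's pre-activations over a mini-batch."""
    m = len(z)
    mu = sum(z) / m
    var = sum((zi - mu) ** 2 for zi in z) / m
    return mu, var

def batch_norm_inference(z, running_mu, running_var, gamma, beta, eps=1e-8):
    """Normalize a single test-time value with the stored running statistics."""
    z_norm = (z - running_mu) / math.sqrt(running_var + eps)
    return gamma * z_norm + beta

# Training: update the running statistics after each mini-batch.
mini_batches = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [0.0, 2.0, 4.0]]
running_mu, running_var = 0.0, 1.0  # assumed initial values
for batch in mini_batches:
    mu, var = batch_stats(batch)
    running_mu = ewa_update(running_mu, mu)
    running_var = ewa_update(running_var, var)

# Test time: one example, no mini-batch statistics available.
z_test_tilde = batch_norm_inference(2.5, running_mu, running_var, gamma=1.0, beta=0.0)
print(running_mu, running_var, z_test_tilde)
```

With only three mini-batches the running mean is still pulled toward its initial value of 0; over a realistic number of training steps that influence decays away.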

* intuition
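    * normalizing inside the network speeds up training for the same reason input normalization does (it makes the optimization bowl more symmetric)
    * covariate shift -> a model trained on one data distribution struggles when that distribution shifts, even if the ground truth function stays the same; batch norm limits how much the inputs seen by later layers shift as earlier layers are updated
    * stabilizing the distribution of each layer's inputs (mean and variance are governed by $\beta$ and $\gamma$), so layers can learn more independently
    * slight regularization effect for mini-batches (noise is introduced because $\mu$ and $\sigma^2$ are estimated on a small amount of data)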


## Multiclass

* generalization of logistic regression to multiple-class problems -> softmax regression
* number of neurons in the output layer equals the number of classes $(n_{classes},1)$
* softmax activation function -> $a^{[l]}_j=\frac{e^{z^{[l]}_j}}{\sum_{i=1}^{n_{classes}} e^{z^{[l]}_i}}$
* loss function -> $L(\hat{y},y) = -\sum_{i=1}^{n_{classes}} y_i \log(\hat{y}_i)$ (MLE equivalent)
* cost function -> $\frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)},y^{(i)})$
* backprop step -> $dz^{[l]}=\hat{y}-y$, size $(n_{classes},1)$
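
The softmax, loss and backprop formulas above can be checked numerically for a single example. The sketch below uses only the standard library; the logits and the one-hot label are made-up values, and subtracting the maximum logit is a common numerical-stability trick that is not part of the formulas in these notes (it leaves the result unchanged).

```python
import math

def softmax(z):
    shift = max(z)                       # stability trick, cancels in the ratio
    exps = [math.exp(zi - shift) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y_hat, y):
    # Only classes with y_i > 0 contribute; skipping the rest avoids log(0).
    return -sum(yi * math.log(pi) for yi, pi in zip(y, y_hat) if yi > 0)

def loss_from_logits(z, y):
    return cross_entropy(softmax(z), y)

z = [2.0, 1.0, 0.1]        # made-up logits for 3 classes
y = [1.0, 0.0, 0.0]        # one-hot label
y_hat = softmax(z)

# Analytic gradient from the notes: dz = y_hat - y
dz_analytic = [p - t for p, t in zip(y_hat, y)]

# Numerical gradient by central differences, for comparison.
h = 1e-6
dz_numeric = []
for j in range(len(z)):
    z_plus = z[:j] + [z[j] + h] + z[j + 1:]
    z_minus = z[:j] + [z[j] - h] + z[j + 1:]
    dz_numeric.append((loss_from_logits(z_plus, y) - loss_from_logits(z_minus, y)) / (2 * h))

print("y_hat:      ", y_hat)
print("dz analytic:", dz_analytic)
print("dz numeric: ", dz_numeric)
```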

## Frameworks

* development & deployment vs speed vs open/closed source
* tensorflow
    * lets the user define only the forward pass and leaves the backward pass to the underlying engine (see code example below)
    * builds the computational graph during forward-prop, up to the cost function, then uses that graph for back-prop

In [1]:
import numpy as np
import tensorflow as tf

w = tf.Variable(0, dtype=tf.float32)
optimizer = tf.optimizers.Adam(0.1)

def train_step():
    # the gradient tape records the operations of the forward pass so they can be "played backwards" to compute the grads
    with tf.GradientTape() as tape:
        cost = w**2-10*w+25 # quadratic cost (w-5)^2, minimized at w = 5
    trainable_variables = [w]
    grads = tape.gradient(cost, trainable_variables)
    optimizer.apply_gradients(zip(grads, trainable_variables))

print(f"initial w {w}")
for i in range(1000):
    train_step()
print(f"final w {w}")

initial w <tf.Variable 'Variable:0' shape=() dtype=float32, numpy=0.0>
final w <tf.Variable 'Variable:0' shape=() dtype=float32, numpy=5.000001>


In [2]:
w = tf.Variable(0, dtype=tf.float32)
optimizer = tf.optimizers.Adam(0.1)
x = np.array([1.0,-10.0,25.0], dtype=np.float32)

def training(x, w, optimizer):
    print(f"initial w {w}")
    def cost_fn():
        # same quadratic as before, with the coefficients supplied as data in x
        return x[0]*w**2+x[1]*w+x[2]
    for i in range(1000):
        # record the forward pass, then differentiate the cost with respect to w
        with tf.GradientTape() as tape:
            cost = cost_fn()
        grads = tape.gradient(cost, [w])
        optimizer.apply_gradients(zip(grads, [w]))
    print(f"final w {w}")

training(x,w,optimizer)

initial w <tf.Variable 'Variable:0' shape=() dtype=float32, numpy=0.0>
final w <tf.Variable 'Variable:0' shape=() dtype=float32, numpy=5.000001>


## Reading list


* [https://www.tensorflow.org/guide/autodiff](https://www.tensorflow.org/guide/autodiff)
* [https://www.tensorflow.org/api_docs/python/tf/GradientTape](https://www.tensorflow.org/api_docs/python/tf/GradientTape)
