# 1. Hyperparameter Tuning
- Purpose: Find the best hyperparameters for the model.
- Note: Hyperparameters are not learned during training, but are set before training.

### 1.1 Grid Search
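- Try every combination from a predefined set of candidate values for each hyperparameter and choose the best one.
- Challenges:
    - Computationally expensive: the number of combinations grows exponentially with the number of hyperparameters.
    - Not suitable for continuous hyperparameters: they must be discretized first, so good values between grid points can be missed.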
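
As a rough illustration of the idea, the sketch below enumerates a small grid with `itertools.product`. The `validation_loss` function and the hyperparameter names are hypothetical stand-ins for training a real model and measuring its validation loss.

```python
import itertools

def validation_loss(learning_rate, hidden_units):
    """Hypothetical stand-in for training a model and returning its validation loss."""
    return (learning_rate - 0.01) ** 2 + ((hidden_units - 64) / 1000) ** 2

# Hypothetical grid: every value of one hyperparameter is paired with
# every value of the other, giving 3 x 3 = 9 combinations.
grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "hidden_units": [32, 64, 128],
}

names = list(grid)
combos = [dict(zip(names, values)) for values in itertools.product(*grid.values())]

# Evaluate every combination and keep the one with the lowest loss.
best = min(combos, key=lambda params: validation_loss(**params))
print(len(combos), best)
```

Adding a third hyperparameter with three candidate values would already triple the number of runs, which is where the cost of grid search comes from.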

### 1.2 Random Search
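- Sample random combinations of hyperparameters and choose the best one.
- Advantages:
    - Cheaper than a full grid: the number of trials is a fixed budget, independent of how many hyperparameters are searched.
    - Suitable for continuous hyperparameters, since values are drawn directly from a range.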
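
A minimal sketch of random search, assuming a fixed trial budget. The `validation_loss` function, the hyperparameter names, and their ranges are hypothetical and only serve to make the example runnable.

```python
import random

def validation_loss(dropout, hidden_units):
    """Hypothetical stand-in for training a model and returning its validation loss."""
    return (dropout - 0.3) ** 2 + ((hidden_units - 64) / 1000) ** 2

def random_search(n_trials, seed=0):
    """Draw n_trials random configurations and return the best one with its loss."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        params = {
            "dropout": rng.uniform(0.0, 0.5),      # continuous hyperparameter
            "hidden_units": rng.randint(16, 256),  # integer hyperparameter
        }
        loss = validation_loss(**params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss

best_params, best_loss = random_search(n_trials=50)
print(best_params, best_loss)
```

Unlike a grid, each trial here tests a fresh value of every hyperparameter, and the budget (`n_trials`) is chosen directly.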

### 1.3 Bayesian Optimization
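- Build a probabilistic model of the loss as a function of the hyperparameters and use it to choose the next hyperparameters to try.
- Advantages:
    - Sample efficient: it typically needs fewer training runs than grid or random search, because each new trial is informed by the previous results.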
    - Suitable for continuous hyperparameters.
    - Can handle noisy loss functions.
    
### 1.4 Tuning Process
- **Grid vs. Random Sampling:** Instead of a grid, random sampling of hyperparameters is often more effective.
- **Coarse to Fine Sampling:** When you find hyperparameter values that give better performance, zoom into a smaller region around these values and sample more densely within it.
- **Scale for Sampling:** For hyperparameters that span several orders of magnitude (such as the learning rate), search on a logarithmic scale instead of a linear scale. For instance, uniform random sampling from the range $(0.0001, 1)$ puts about 90% of the values between $0.1$ and $1$, so the smaller magnitudes are barely explored. Therefore, using log-scale sampling:

\begin{align}
0.0001 &= 10^{-4} \rightarrow a=-4 \\
1 &= 10^0 \rightarrow b=0 \\
r &= a + (b-a) \times \text{np.random.rand()} \\
\text{result} &= 10^r
\end{align}

In this example, $r$ is drawn uniformly from $[a, b] = [-4, 0]$, so $10^r$ is uniformly distributed on a log scale between $0.0001$ and $1$.
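
The sketch below compares linear and log-scale sampling over the same range. It uses the standard-library `random` module in place of `np.random.rand()`, and the helper names `sample_log_uniform` and `fraction_at_least` are illustrative, not from any library.

```python
import math
import random

def sample_log_uniform(low, high, rng=random):
    """Draw one value between low and high, uniform on a log10 scale."""
    a = math.log10(low)             # e.g. 0.0001 -> -4
    b = math.log10(high)            # e.g. 1      ->  0
    r = a + (b - a) * rng.random()  # uniform exponent in [a, b)
    return 10 ** r

def fraction_at_least(samples, threshold):
    """Share of samples that are greater than or equal to threshold."""
    return sum(s >= threshold for s in samples) / len(samples)

rng = random.Random(0)  # seeded so that runs are reproducible
lin_samples = [rng.uniform(1e-4, 1.0) for _ in range(10_000)]
log_samples = [sample_log_uniform(1e-4, 1.0, rng) for _ in range(10_000)]

frac_lin = fraction_at_least(lin_samples, 0.1)
frac_log = fraction_at_least(log_samples, 0.1)
print(frac_lin, frac_log)
```

With linear sampling, about 90% of the draws should land in $[0.1, 1]$. With log-scale sampling, each of the four decades between $10^{-4}$ and $10^0$ should receive about a quarter of the draws.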

<hr>

# 2. Batch Normalization
- Purpose: Normalize the inputs to each layer, so that the inputs to the activation function are not too large/small. It maintains a consistent distribution of inputs to each layer.

$$Z_{norm}^{(i)} = \frac{Z^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}}$$

where $\mu$ and $\sigma^2$ are the mean and variance of the layer's pre-activations $Z$ computed over the current mini-batch, and $\epsilon$ is a small constant that avoids division by zero.
- Placement: After FC (fully connected) or CONV (convolutional) layers, but before the activation function.
- Note: Some activations, such as tanh, may not work best with zero-mean, unit-variance inputs, so a learnable scale and shift are applied after normalization:

$$\tilde{Z}^{(i)} = \gamma \frac{Z^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

where $\gamma$ and $\beta$ are learnable parameters of the model.
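
As a quick check on what the scale and shift can express, consider the particular choice $\gamma = \sqrt{\sigma^2 + \epsilon}$ and $\beta = \mu$. Substituting it into the formula above undoes the normalization:

$$\gamma \frac{Z^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta = (Z^{(i)} - \mu) + \mu = Z^{(i)}$$

So the learnable parameters let the network choose the mean and variance of each layer's inputs, including recovering the original values if that turns out to be useful.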

### 2.1 Batch Normalization Algorithm
- Given: A mini-batch of $m$ examples.
- Calculate: $\mu = \frac{1}{m} \sum_{i=1}^{m} Z^{(i)}$ and $\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} (Z^{(i)} - \mu)^2$.
- Normalize: $Z_{norm}^{(i)} = \frac{Z^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}}$.
- Scale and shift: $\tilde{Z}^{(i)} = \gamma Z_{norm}^{(i)} + \beta$, where $\gamma$ and $\beta$ are learnable parameters of the model.
- Output: $\tilde{Z}^{(i)}$, which replaces $Z^{(i)}$ as the input to the activation function.
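
The steps above can be sketched as a forward pass for a single unit. This is a minimal illustration in plain Python: the function name `batch_norm_forward` and the default `eps` are assumptions, and a real layer would apply the same computation to every feature of the mini-batch at once.

```python
import math

def batch_norm_forward(z_batch, gamma, beta, eps=1e-8):
    """Normalize one unit's pre-activations over a mini-batch, then scale and shift."""
    m = len(z_batch)
    mu = sum(z_batch) / m                           # mini-batch mean
    var = sum((z - mu) ** 2 for z in z_batch) / m   # mini-batch variance
    z_norm = [(z - mu) / math.sqrt(var + eps) for z in z_batch]
    z_tilde = [gamma * zn + beta for zn in z_norm]  # learnable scale and shift
    return z_tilde, mu, var

z_batch = [1.0, 2.0, 3.0, 4.0]  # hypothetical pre-activations, m = 4

# gamma=1, beta=0 leaves the normalized values unchanged.
out, mu, var = batch_norm_forward(z_batch, gamma=1.0, beta=0.0)
print(mu, var, out)

# Other values of gamma and beta move the output to a different mean and spread.
shifted, _, _ = batch_norm_forward(z_batch, gamma=2.0, beta=3.0)
print(shifted)
```

With `gamma=1.0` and `beta=0.0` the output has approximately zero mean and unit variance; with `gamma=2.0` and `beta=3.0` its mean is approximately 3 and its standard deviation approximately 2.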

<hr>