# Training Neural Networks

## Part 1: Architectures

### Overview

- **Naming conventions**: Neural networks are organized into layers, with the output of each layer connected to the next layer.
    - An $n$-layer network doesn't count the input vector (but does count the final output layer).
- **Output layer**: The output layer has no activation since it represents raw scores (the softmax / SVM loss is applied to the raw scores).
- **Sizing**: count the number of learnable parameters (weights and biases) in the network.
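
As a small worked example of sizing, the sketch below counts the weights and biases of a fully connected network. The layer sizes (a 3-4-4-1 network) are illustrative and not taken from these notes.

```python
def count_parameters(layer_sizes):
    """Count learnable parameters in a fully connected network.

    layer_sizes lists the input size first, then the width of each layer,
    e.g. [3, 4, 4, 1] is a 3-layer network (the input is not counted).
    """
    pairs = list(zip(layer_sizes, layer_sizes[1:]))
    weights = sum(n_in * n_out for n_in, n_out in pairs)
    biases = sum(layer_sizes[1:])  # one bias per neuron
    return weights + biases


# 3*4 + 4*4 + 4*1 = 32 weights, plus 4 + 4 + 1 = 9 biases
print(count_parameters([3, 4, 4, 1]))  # 41
```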

Universal approximators
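- A neural network with at least one hidden layer can approximate any continuous function (on a bounded input range) to arbitrary accuracy, given enough hidden units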

Choosing an architecture
- **Larger models have more representational power**
- **Prefer larger models**: don't use smaller models to avoid overfitting. Instead, keep a larger model and use other techniques to fix overfitting: dropout, L2 regularization, input noise.

### Activation functions

#### Sigmoid
$\sigma(x) = \frac{1}{1+e^{-x}}$

Pros
- Squashes inputs to $(0,1)$

Cons
- Saturation: gradient tends to 0 at either end, so during backprop gradients may die
- Outputs are not zero-centered: if the output of a layer is always positive, then in the *next* layer, all of its weights will update in the same direction (positive or negative), depending on the sign of the upstream gradient.
    - This is because in the *next* layer, the gradient is $\frac{d (Wx+b)}{d W} = x$, but $x>0$ (since the previous layer went through a sigmoid). So the gradient on the weights will be `local_grad * dout` = $x \cdot \texttt{dout}$.
    - If you plot weight versus loss, the weights will only update in the same direction (hence zigzag pattern) and converge more slowly than if weights move in different, optimal directions towards minimum loss.
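
The same-sign argument above can be checked numerically. This is a minimal sketch for a single neuron; the input values and the upstream gradient `dout` are made up for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def weight_gradients(x, dout):
    """Gradient of one neuron's weights: dL/dw_i = x_i * dout."""
    return [xi * dout for xi in x]


# Inputs to the next layer are sigmoid outputs, so every x_i lies in (0, 1).
x = [sigmoid(z) for z in (-2.0, 0.5, 3.0)]

grads_pos = weight_gradients(x, dout=0.7)   # every entry is positive
grads_neg = weight_gradients(x, dout=-0.7)  # every entry is negative
print(grads_pos)
print(grads_neg)
```

Whatever the inputs are, the sign of every weight gradient is the sign of `dout`, which is what forces the zigzag path.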

#### Tanh

This is essentially a rescaled sigmoid, $\tanh(x) = 2\sigma(2x) - 1$, so its outputs are zero-centered in $(-1,1)$. It still saturates, but it is always preferred to sigmoid.

#### ReLU
$f(x) = \max(0,x)$

Pros
- No saturation (in the positive region)
- Efficient to compute
- Converges quickly

Cons
- Not zero-centered
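- Dying ReLU: if a neuron's input is negative, its local gradient is zero and its weights get no update. A neuron whose input is negative for every example never recovers.
    - To mitigate this, you can initialize the bias to a small positive value (so that the ReLU starts out active)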

#### Leaky ReLU
- Purpose is to fix dying ReLU problem
- Has small positive slope when input is negative
- Has all benefits of ReLU, but gradients will not die

#### Maxout
$\max(w_1^T x + b_1, w_2^T x + b_2)$

Generalizes ReLU and Leaky ReLU. Need more weights.
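
To see why maxout generalizes the two ReLU variants, here is a minimal sketch restricted to scalar inputs (the notes' formula uses weight vectors). The slope value 0.01 is an assumed, commonly used default.

```python
def relu(x):
    return max(0.0, x)


def leaky_relu(x, slope=0.01):
    # slope is the small positive gradient used for negative inputs
    return x if x > 0 else slope * x


def maxout(x, w1, b1, w2, b2):
    """Scalar-input maxout unit: max(w1*x + b1, w2*x + b2)."""
    return max(w1 * x + b1, w2 * x + b2)


for x in (-2.0, 0.0, 3.0):
    # ReLU is maxout with one branch fixed to zero (w1 = b1 = 0).
    assert maxout(x, 0.0, 0.0, 1.0, 0.0) == relu(x)
    # Leaky ReLU is maxout with branches slope*x and x (for 0 < slope < 1).
    assert maxout(x, 0.01, 0.0, 1.0, 0.0) == leaky_relu(x)
```

The cost is visible in the signature: maxout carries two sets of weights and biases per unit instead of one.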

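#### GeLU / ELU / SeLU / Swish
GeLU: weight the input by the standard normal CDF, $x \cdot \Phi(x)$. This is the expected value of randomly multiplying the input by 0 or 1, where larger inputs are more likely to be kept. Useful for transformers.

ELU is basically ReLU with outputs closer to zero mean. Downside is that you need to compute exp (more expensive than max).

SeLU is a scaled version of ELU with fixed constants that self-normalizes.

Swish ($x \cdot \sigma(x)$): reported to outperform the others on CIFAR-10.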
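
For reference, these four activations can be written as scalar functions. This is a sketch using the standard published definitions, which are not spelled out in these notes; the default `alpha` and `beta` values and the SeLU constants are the usual ones.

```python
import math


def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def elu(x, alpha=1.0):
    # Identity for positive inputs, smooth exponential curve below zero.
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def selu(x, alpha=1.6732632423543772, scale=1.0507009873554805):
    # ELU with fixed constants chosen so activations self-normalize.
    return scale * elu(x, alpha)


def swish(x, beta=1.0):
    # x * sigmoid(beta * x)
    return x / (1.0 + math.exp(-beta * x))


for name, fn in (("gelu", gelu), ("elu", elu), ("selu", selu), ("swish", swish)):
    print(name, [round(fn(x), 4) for x in (-2.0, 0.0, 2.0)])
```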



#### Summary
In general: use ReLU. If you have dying ReLU: try leaky ReLU or maxout (for marginal gains). Try tanh as a last resort. Never use sigmoid.

For transformers: use GeLU.


## Part 2: Preprocessing and Loss Functions

### Preprocessing
Want: 1) zero-centered data with 2) unit variance.

Compute mean only using training data (not validation/test sets), then apply preprocessing to train/val/test.
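
A minimal sketch of that rule in plain Python: the statistics are fitted on the training rows and then reused unchanged for validation data. The function names and the toy data are illustrative.

```python
import math


def fit_standardizer(train_rows, eps=1e-8):
    """Per-feature mean and standard deviation, from training rows only."""
    n = len(train_rows)
    columns = list(zip(*train_rows))
    means = [sum(col) / n for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / n) + eps
        for col, m in zip(columns, means)
    ]
    return means, stds


def standardize(rows, means, stds):
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
val = [[2.0, 40.0]]

means, stds = fit_standardizer(train)        # statistics come from train only
train_std = standardize(train, means, stds)  # zero mean, unit variance
val_std = standardize(val, means, stds)      # reuses the training statistics
print(val_std)
```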

Additional techniques
- PCA: reduce (N,D) data to, e.g., (N,100) by projecting onto the 100 directions (principal components) with the highest variance
- Whitening: achieves decorrelation (features are not correlated) and normalization (each feature has zero mean, unit variance)


#### Weight initialization
1. Constant weight initialization: bad because weights will learn exact same thing
2. Random weight initialization: output variance depends on the number of inputs to the neuron
    - For example, one neuron might output $w^T x + b$, so if $x$ has lots of elements, the output scalar has a larger variance
3. Normalization: multiply by $\frac{1}{\sqrt{n}}$, where $n$ is the number of inputs of the neuron (Xavier initialization uses this scaling in its fan-in form).
4. Kaiming (He): multiply by $\sqrt{\frac{2}{n}}$
    - ReLU zeroes out half of its inputs, which halves the variance; the extra factor of 2 compensates, so variance stays normalized when using ReLU

5. Batchnorm: normalization is differentiable. Insert a normalization layer *after* fully connected layer, but *before* nonlinearity. Normalization is done automatically by the network.

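    - Learnable parameters: $\gamma, \beta$: $x_{i,j} \rightarrow \gamma (x_{i,j} - \mu) / \sqrt{\sigma^2 + \epsilon} + \beta$, where $\mu$ and $\sigma^2$ are the mean and variance and $\epsilon$ is a small constant for numerical stability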
    - Batchnorm for FC nets: avg, stdev for each feature across all examples
    - Batchnorm for convnets: avg, stdev for each channel (eg: rgb) across all pixels of all images
    - Layernorm, instancenorm: just averaging over different values
        - Layernorm: average with axis=1
        - Instancenorm: each example and channel gets an average
        - This'll make more sense in code (probably)
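
Two points from the list above are easier to see in code: the $\frac{1}{\sqrt{n}}$ scaling from item 3, and the different averaging axes of batchnorm and layernorm. The sketch below uses plain Python lists for an (N, D) input; the helper names, the toy data, and `eps=1e-5` are assumptions for illustration.

```python
import math
import random


def output_variance(n_inputs, scale, trials=2000, seed=0):
    """Empirical variance of w.x for x ~ N(0, 1) and w ~ scale * N(0, 1)."""
    rng = random.Random(seed)
    outs = [
        sum(scale * rng.gauss(0, 1) * rng.gauss(0, 1) for _ in range(n_inputs))
        for _ in range(trials)
    ]
    mean = sum(outs) / trials
    return sum((o - mean) ** 2 for o in outs) / trials


def normalize(values, eps=1e-5):
    """Shift and scale a list of numbers to zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def batchnorm(x, gamma, beta):
    """Statistics per feature, across all N examples (axis=0)."""
    cols = [normalize(col) for col in zip(*x)]
    return [[g * v + b for g, v, b in zip(gamma, row, beta)] for row in zip(*cols)]


def layernorm(x, gamma, beta):
    """Statistics per example, across its D features (axis=1)."""
    return [[g * v + b for g, v, b in zip(gamma, normalize(row), beta)] for row in x]


n = 50
print(output_variance(n, scale=1.0))                 # grows with n (about n)
print(output_variance(n, scale=1.0 / math.sqrt(n)))  # about 1, whatever n is

x = [[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]]
ones, zeros = [1.0] * 3, [0.0] * 3
print(batchnorm(x, ones, zeros))  # each column has zero mean
print(layernorm(x, ones, zeros))  # each row has zero mean
```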

#### Regularization
1. L2: weights tend to small, diffuse numbers
2. L1: weights tend to be sparse (many become exactly zero)
3. Max norm constraint: the norm of each neuron's weight vector is capped at a constant $c$
4. Dropout: keep neuron alive with probability $p$, dead otherwise.
    - With vanilla dropout, hidden layer outputs must be scaled by probability $p$ during prediction
    - Inverted dropout: scale by $1/p$ at train time instead, so prediction needs no scaling and stays fast
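
A minimal sketch of inverted dropout on a list of activations. The function names and the explicit `rng` argument are illustrative choices.

```python
import random


def dropout_train(h, p, rng):
    """Inverted dropout: keep each unit with probability p, rescale by 1/p."""
    return [v / p if rng.random() < p else 0.0 for v in h]


def dropout_predict(h):
    """At prediction time, inverted dropout is the identity."""
    return list(h)


rng = random.Random(0)
h = [1.0] * 10000
dropped = dropout_train(h, p=0.5, rng=rng)

# Rescaling at train time keeps the expected activation unchanged,
# so the mean stays close to the original value of 1.0.
print(sum(dropped) / len(dropped))
```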

Note: Don't usually regularize bias

Overall, use dropout ($p=0.5$) with L2 regularization.


### Loss Functions
Various problem types:
* Classification: discrete outputs
    * Problem: lots of categories makes softmax layer slow. You can do fancy stuff with trees here: organize classes into a tree, and the output layer consists of decisions in the tree to go left or right.
* Attribute classification: each input might belong to multiple classes (like an image might have many hashtags)
    * Train a binary classifier for each attribute independently.
    * Example: dog hashtag or not, cat hashtag or not, etc...
* Regression: continuous outputs
    * Minimize the squared L2 norm of the difference between predicted value and actual value.
    * This loss is much harder to optimize, since the network needs to be relatively close to the "correct" answer. For example, if the desired output is 100 and the network outputs `1e4`, this will cause a huge gradient.
    * Try to discretize the problem instead of doing regression (it's fragile with L2 norm). For example, if you're predicting review stars from 1-5, do a classification instead of regression.
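
To put numbers on the regression example, the sketch below evaluates the squared error and its gradient for a single scalar prediction.

```python
def l2_loss_and_grad(pred, target):
    """Squared error for a scalar prediction and its gradient w.r.t. pred."""
    diff = pred - target
    return diff ** 2, 2.0 * diff


print(l2_loss_and_grad(1e4, 100.0))    # (98010000.0, 19800.0)
print(l2_loss_and_grad(101.0, 100.0))  # (1.0, 2.0)
```

The gradient grows linearly with the size of the error, so one badly wrong prediction can dominate an update.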


## Part 3: Learning and Evaluation

### Gradient Checks

- Use centered formula: $\frac{f(x+h)-f(x-h)}{2h}$
- Use relative error: $\frac{|a-b|}{\max(|a|,|b|)}$
- Use doubles instead of floats
- Be careful of kinks: numerical gradient may disagree with analytical gradient. Analytical gradient may be zero, but numerical gradient might cross over the kink and give a slightly positive value.
- Use a small number of datapoints to avoid kinks
- Be careful of too-small $h$: might lead to numerical precision problems
- Do the check during a "characteristic mode of operation": doing gradcheck on initial weights might not be representative, so instead do gradcheck after a short burn-in period of training
- Turn off randomization (dropout), or fix the random seed. This is to ensure numerical and analytical computations are equivalent.
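
The first two bullets can be sketched as follows. The test function $f(x) = \sum_i x_i^3$ and the step size `h=1e-5` are assumptions chosen for illustration; Python floats are already doubles.

```python
def numerical_gradient(f, x, h=1e-5):
    """Centered-difference estimate of df/dx_i for each coordinate of x."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_plus[i] += h
        x_minus = list(x)
        x_minus[i] -= h
        grad.append((f(x_plus) - f(x_minus)) / (2.0 * h))
    return grad


def relative_error(a, b, eps=1e-12):
    # eps guards against division by zero when both gradients are zero
    return abs(a - b) / max(abs(a), abs(b), eps)


def f(x):
    # Hypothetical loss: sum of cubes, with analytic gradient 3 * x_i^2
    return sum(v ** 3 for v in x)


def analytic_gradient(x):
    return [3.0 * v ** 2 for v in x]


x = [0.5, -1.2, 2.0]
errors = [
    relative_error(num, ana)
    for num, ana in zip(numerical_gradient(f, x), analytic_gradient(x))
]
print(max(errors))  # tiny when the two gradients agree
```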


### Sanity Checks

- Ensure loss is the expected value with random weights
- Increase regularization, see if it increases loss (it should)
- Overfit on small set (20 examples). If this doesn't work, it's not worth training the whole network.
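
As an example of the first check, assuming a softmax classifier with $C$ classes: small random weights give near-equal scores, so the expected initial loss is $\ln C$. A minimal sketch with 10 classes:

```python
import math


def softmax_cross_entropy(scores, label):
    """Cross-entropy loss of one example, computed from raw scores."""
    m = max(scores)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[label]


num_classes = 10
scores = [0.0] * num_classes  # stand-in for the near-equal initial scores
initial_loss = softmax_cross_entropy(scores, label=3)
print(initial_loss, math.log(num_classes))  # both about 2.3026
```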

### Babysitting Learning Process

Mainly track *loss*, *accuracy*.

#### Loss
- Loss function should go down (ideally exponentially)
- x-axis: epoch, y-axis: loss
- What's an **epoch**?
    - Training happens in *iterations*, each processing one *minibatch*. One epoch is when the entire training set has been processed (in expectation). So for example, if the minibatch size is 5 and there are 20 total examples in the training set, an epoch would consist of 4 minibatches.
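
The arithmetic in that example, as a one-line helper (the function name is illustrative):

```python
import math


def iterations_per_epoch(num_examples, batch_size):
    # The last minibatch may be smaller than the rest, hence the ceiling.
    return math.ceil(num_examples / batch_size)


print(iterations_per_epoch(20, 5))  # 4
```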
 
#### Accuracy
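- The closer validation acc is to training acc, the better the generalization (and therefore the less overfitting).
- Validation acc can be measured at any point, even after each minibatch, but since that is expensive it is usually measured periodically (e.g. once per epoch).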

#### Weight/Update Ratio
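- Check the learning rate by measuring the ratio of update magnitude to weight magnitude.
- Update / weight (ratio of norms, per parameter set) should be around `1e-3`. Much higher suggests the learning rate is too high; much lower suggests it is too low.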
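
A minimal sketch of this measurement for one parameter set under vanilla SGD. The weights, gradients, and learning rate are made-up values.

```python
import math


def l2_norm(values):
    return math.sqrt(sum(v * v for v in values))


def update_ratio(weights, grads, lr):
    """Ratio of update magnitude to weight magnitude for vanilla SGD."""
    update = [-lr * g for g in grads]
    return l2_norm(update) / l2_norm(weights)


weights = [0.5, -0.3, 0.8]
grads = [0.02, -0.01, 0.04]
ratio = update_ratio(weights, grads, lr=0.02)
print(ratio)  # roughly 9e-4, close to the 1e-3 target
```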

#### Activation per Layer 
- Activations should cover the whole range (ie: for tanh, they should be spread out between [-1,1]).

#### First-layer visualizations
- For images, check first-layer weights: should be able to see smooth and defined features.


### Optimizers

#### Vanilla SGD
`W -= lr * dW`

Issues with SGD:
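* **Slow convergence**: The loss landscape may be much steeper in one direction than another, so updates zigzag along the steep direction and make slow progress along the shallow one
* **Gets stuck**: Local minima or saddle points prevent finding the best weights
    * Saddle points are more common than local minima in high-dimensional space
    * The minibatch gradient is only a noisy approximation of the optimal direction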

#### Momentum
```
# accumulate past updates in velocity v (note decay factor mu)
v  = mu*v - lr * dW
W += v
```

* Keep updating even at local minima / saddle points with **momentum**
* Momentum: keep track of past gradients to use as part of weight update

#### Nesterov Momentum
Compute gradient by looking ahead: instead of the gradient at `W`, find the gradient at `W + mu*v`, the point where the momentum step alone would land. Has stronger theoretical guarantees for convex convergence.
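
A minimal sketch comparing the two updates on a scalar weight. The toy loss $L(w) = \frac{1}{2} w^2$, the starting point, and the hyperparameters are assumptions for illustration.

```python
def grad(w):
    # Hypothetical loss L(w) = 0.5 * w^2, so dL/dw = w.
    return w


def momentum_step(w, v, lr, mu):
    v = mu * v - lr * grad(w)
    return w + v, v


def nesterov_step(w, v, lr, mu):
    # Evaluate the gradient at the look-ahead point w + mu * v.
    v = mu * v - lr * grad(w + mu * v)
    return w + v, v


w, v = 5.0, 0.0
for _ in range(200):
    w, v = nesterov_step(w, v, lr=0.1, mu=0.9)
print(abs(w))  # close to 0, the minimum of the loss
```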

#### Anneal Learning Rate
- Slow down learning rate over time
- Cosine, linear, invsqrt, constant decay
- Linear warmup: start with a small learning rate and increase it linearly over the first iterations, because a learning rate that is too high early on may lead to bad local minima
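
One possible way to combine linear warmup with cosine decay is sketched below. The function name, the step counts, and the choice to decay to zero are assumptions for illustration.

```python
import math


def learning_rate(step, total_steps, base_lr, warmup_steps):
    """Linear warmup to base_lr, then cosine decay towards zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


schedule = [
    learning_rate(s, total_steps=100, base_lr=0.1, warmup_steps=10)
    for s in range(100)
]
# Starts small, reaches base_lr at the end of warmup, then decays.
print(schedule[0], schedule[9], schedule[99])
```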

#### Adagrad (adaptive gradient)
- Each parameter has its own effective learning rate
- Keep a running sum of squared gradients per parameter, and divide the update by its square root
- Parameters with small accumulated gradients get a higher effective lr, and vice versa

#### RMSProp
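- With Adagrad, the effective learning rate goes to zero since the sum of squared gradients keeps accumulating in the denominator
- Use an exponentially decaying average of squared gradients instead, so that the learning rate doesn't go to zero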

#### Adam
- Adam = RMSProp + Momentum (plus a bias correction for the first few steps)
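
The three adaptive updates can be sketched for a single scalar parameter as follows. The hyperparameter defaults are commonly used values, not taken from these notes.

```python
import math


def adagrad_step(w, g, cache, lr=0.01, eps=1e-8):
    cache += g * g  # running sum of squared gradients
    return w - lr * g / (math.sqrt(cache) + eps), cache


def rmsprop_step(w, g, cache, lr=0.01, decay=0.99, eps=1e-8):
    cache = decay * cache + (1 - decay) * g * g  # leaky average, not a sum
    return w - lr * g / (math.sqrt(cache) + eps), cache


def adam_step(w, g, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g      # momentum-style average of gradients
    v = beta2 * v + (1 - beta2) * g * g  # RMSProp-style average of squares
    m_hat = m / (1 - beta1 ** t)         # bias correction, t counts from 1
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


# With a constant gradient of 1.0, Adagrad's step shrinks like lr / sqrt(t).
w, cache = 0.0, 0.0
adagrad_steps = []
for _ in range(100):
    w_new, cache = adagrad_step(w, 1.0, cache)
    adagrad_steps.append(abs(w_new - w))
    w = w_new
print(adagrad_steps[0], adagrad_steps[99])  # about 0.01, then about 0.001
```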

### Model Ensembles

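Train many models, average their predictions at test time. May offer marginal gains.
- Try different initializations, hyperparameters
- Use different checkpoints of a single model
- Maintain a running average of the weights: keep a second copy of the model in memory, updated as a decaying average of the weights during training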
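
A minimal sketch of two of these ideas: averaging class probabilities across models, and keeping a decaying average of the weights. The function names, the toy probabilities, and `decay=0.999` are illustrative assumptions.

```python
def average_predictions(prob_lists):
    """Average class probabilities from several models for one example."""
    n = len(prob_lists)
    return [sum(p) / n for p in zip(*prob_lists)]


def update_running_average(avg_weights, weights, decay=0.999):
    """Decaying average of weights, kept alongside the trained model."""
    return [decay * a + (1 - decay) * w for a, w in zip(avg_weights, weights)]


probs = average_predictions([[0.6, 0.4], [0.8, 0.2], [0.7, 0.3]])
print(probs)  # approximately [0.7, 0.3]
```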