## Training
### Weight initializers 

| Initializer       | Best For                | Key Advantage                                      | Major Downside                                 |
|-------------------|-------------------------|----------------------------------------------------|-----------------------------------------------|
| Zero              | Bias Initialization     | Simplicity                                         | Fails for weights (symmetry problem)          |
| Random            | General                 | Breaks symmetry                                    | May cause vanishing/exploding gradients       |
| Xavier (Glorot)   | Tanh, Sigmoid           | Keeps variance consistent                          | Not optimal for ReLU                          |
| He                | ReLU, Leaky ReLU        | Prevents vanishing/exploding gradients with ReLU   | Can still explode gradients in deep nets      |
| LeCun             | SELU                    | Self-normalizing for specific activations          | Limited activation support                    |
| Orthogonal        | Recurrent Networks (RNNs) | Maintains long-term gradient flow                  | Costlier to compute (needs QR or SVD)          |

#### <u>1 - Zero initialization</u>
In this method, all weights are initialized to zero.  

**<u>Pros:</u>**  
- Simple to implement.  
- Works well for bias initialization.  

**<u>Cons:</u>**  
- If all weights are zero, every neuron in a layer computes the same output and receives the same gradient, so the neurons stay identical throughout training (the symmetry problem).  
- Because symmetry is never broken, the layer has no more representational capacity than a single neuron, which effectively prevents learning.

**Conclusion**: This method should not be used for weights, only for biases.  
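The symmetry problem can be reproduced on a toy network. The following is a minimal NumPy sketch, assuming a one-hidden-layer sigmoid network without biases and a squared-error loss on a single sample; the function name `train_steps` and all sizes are illustrative choices, not part of any library.

```python
import numpy as np

def train_steps(W1, W2, x, y, lr=0.1, steps=20):
    """Plain gradient descent on 0.5 * (W2 . sigmoid(W1 x) - y)^2 (hypothetical helper)."""
    for _ in range(steps):
        h = 1.0 / (1.0 + np.exp(-(W1 @ x)))          # hidden activations
        err = W2 @ h - y                              # scalar prediction error
        dW2 = err * h                                 # gradient w.r.t. output weights
        dW1 = np.outer(err * W2 * h * (1.0 - h), x)   # gradient w.r.t. hidden weights
        W1 = W1 - lr * dW1
        W2 = W2 - lr * dW2
    return W1, W2

x = np.array([1.0, -2.0, 0.5])
y = 1.0

# Zero start: every hidden neuron gets the same update at every step.
W1_zero, W2_zero = train_steps(np.zeros((4, 3)), np.zeros(4), x, y)

# Small random start: the neurons begin different and stay different.
rng = np.random.default_rng(0)
W1_rand, W2_rand = train_steps(rng.normal(0.0, 0.1, (4, 3)),
                               rng.normal(0.0, 0.1, 4), x, y)

rows_identical_zero = np.allclose(W1_zero, W1_zero[0])
rows_identical_rand = np.allclose(W1_rand, W1_rand[0])
print(rows_identical_zero, rows_identical_rand)
```

With the zero start the weights do change during training, but all four hidden neurons remain copies of each other.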

#### <u>2 - Random initialization</u>
Weights are initialized to small random values, often drawn from a uniform or normal distribution.  

**<u>Pros:</u>**  
- Helps break symmetry, as weights are randomly different for different neurons.

**<u>Cons:</u>**  
- If the random values are too large, it may lead to exploding gradients (gradients become too large).
- If the values are too small, it may lead to vanishing gradients (gradients become too small).

**Conclusion:** This method works better than zero initialization but may suffer from issues like vanishing/exploding gradients, especially in deep networks.
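How much the scale matters can be seen by pushing random data through a deep tanh network. This is a rough sketch of the forward-pass side of the problem only (the gradients are not computed here); the helper `forward_tanh`, the depth, the width, and the two standard deviations are arbitrary values chosen for illustration.

```python
import numpy as np

def forward_tanh(weight_std, depth=20, width=256, seed=0):
    """Propagate random inputs through `depth` tanh layers (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((128, width))
    for _ in range(depth):
        W = rng.normal(0.0, weight_std, size=(width, width))
        a = np.tanh(a @ W)
    return a

small = forward_tanh(0.01)   # weights too small
large = forward_tanh(1.0)    # weights too large

small_std = small.std()                          # spread of the final activations
large_saturated = np.mean(np.abs(large) > 0.99)  # share of saturated tanh units
print(small_std, large_saturated)
```

With the small scale the signal shrinks at every layer until almost nothing is left; with the large scale most units sit in the flat region of tanh, where the local derivative is close to zero.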

#### <u>3 - Xavier (Glorot) Initialization</u>
Weights are initialized using a distribution with zero mean and a variance of 
$\frac{2}{n_{in} + n_{out}}$, where $n_{in}$ is the number of input units and $n_{out}$ is the number of output units for the layer. This method works well for sigmoid and tanh activation functions.  

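- <b>Xavier Uniform</b>: Weights are drawn from a uniform distribution on $[-a, a]$ with $a = \sqrt{\frac{6}{n_{in} + n_{out}}}$.
- <b>Xavier Normal</b>: Weights are drawn from a normal distribution with standard deviation $\sqrt{\frac{2}{n_{in} + n_{out}}}$.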

**<u>Pros:</u>**  
- Keeps the variance of activations and gradients roughly constant across layers, which mitigates vanishing/exploding gradients.
- Good for shallow to moderately deep networks.

**<u>Cons:</u>**  
- May not work well for very deep networks.
- Not optimal for activation functions like ReLU: the derivation assumes an activation that is roughly linear around zero (like tanh), whereas ReLU zeroes out about half of its inputs, so the signal shrinks from layer to layer.

**Conclusion:** Great for activation functions like <b>tanh</b> and <b>sigmoid</b>, but less suited for <b>ReLU</b>.
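Both Xavier variants can be sketched in a few lines of NumPy. The sketch below assumes the common Glorot formulation with variance $\frac{2}{n_{in} + n_{out}}$; the function names and the layer sizes are made up for the example and are not a framework API.

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng):
    """Uniform on [-limit, limit]; a uniform on [-a, a] has variance a^2 / 3."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

def xavier_normal(n_in, n_out, rng):
    """Zero-mean normal with variance 2 / (n_in + n_out)."""
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_in, n_out))

rng = np.random.default_rng(0)
n_in, n_out = 300, 500                 # illustrative layer sizes
target_var = 2.0 / (n_in + n_out)

W_u = xavier_uniform(n_in, n_out, rng)
W_n = xavier_normal(n_in, n_out, rng)
print(target_var, W_u.var(), W_n.var())
```

Both samplers target the same variance; they differ only in the shape of the distribution.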

#### <u>4 - He Initialization</u>
Similar to Xavier initialization but modified for ReLU and Leaky ReLU activation functions. Weights are initialized using a variance of 
$\frac{2}{n_{in}}$, where $n_{in}$ is the number of input units.

- <b>He Uniform</b>: Weights are drawn from a uniform distribution on $[-a, a]$ with $a = \sqrt{\frac{6}{n_{in}}}$.
- <b>He Normal</b>: Weights are drawn from a normal distribution with standard deviation $\sqrt{\frac{2}{n_{in}}}$.

**<u>Pros:</u>**  
- Specifically designed for <b>ReLU</b> activation, helping maintain the gradient flow in deep networks.
- Mitigates vanishing/exploding gradients in deep networks with ReLU activation: the factor of 2 compensates for ReLU zeroing out about half of the pre-activations.

**<u>Cons:</u>**  
- Can still suffer from exploding gradients in very deep networks.

**Conclusion:** Optimal for ReLU-based networks, widely used in modern deep learning architectures.
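The effect of the factor of 2 can be checked by comparing how much signal survives a deep ReLU stack under He and Xavier scaling. This is an illustrative sketch with equal layer widths and random inputs; `relu_signal_ratio` and the two sampler functions are hypothetical helpers, not library calls.

```python
import numpy as np

def he_normal(n_in, n_out, rng):
    """Zero-mean normal with variance 2 / n_in."""
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

def xavier_normal(n_in, n_out, rng):
    """Zero-mean normal with variance 2 / (n_in + n_out)."""
    return rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_in, n_out))

def relu_signal_ratio(init_fn, depth=20, width=512, seed=0):
    """Mean squared activation after `depth` ReLU layers, relative to the input."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((256, width))
    a = x
    for _ in range(depth):
        a = np.maximum(a @ init_fn(width, width, rng), 0.0)
    return np.mean(a ** 2) / np.mean(x ** 2)

he_ratio = relu_signal_ratio(he_normal)
xavier_ratio = relu_signal_ratio(xavier_normal)
print(he_ratio, xavier_ratio)
```

Under these assumptions the He-initialized stack keeps the signal at roughly its input magnitude, while the Xavier-initialized stack loses about half of it per layer.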

#### <u>5 - LeCun Initialization</u>
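Weights are initialized using a zero-mean distribution with a variance of $\frac{1}{n_{in}}$. This scaling also works reasonably well with sigmoid or tanh, but it is best known as the recommended initialization for SELU (Scaled Exponential Linear Units). For ReLU and Leaky ReLU, He initialization is the usual choice.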

**<u>Pros:</u>**  
- Works well with <b>SELU</b> activation functions.
- Combined with SELU, keeps activations close to zero mean and unit variance from layer to layer, so they neither explode nor vanish.

**<u>Cons:</u>**  
- Limited to specific activation functions, so it's not universally applicable.

**Conclusion:** Best for <b>SELU</b> activation, used in self-normalizing neural networks.
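The self-normalizing behavior can be observed numerically. The sketch below pairs a LeCun-normal sampler with a hand-written SELU (using the standard constants from the self-normalizing networks paper); the width, depth, and helper names are assumptions made for the demonstration.

```python
import numpy as np

SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(z):
    """SELU: scale * z for z > 0, scale * alpha * (exp(z) - 1) otherwise."""
    negative_part = SELU_ALPHA * np.expm1(np.minimum(z, 0.0))
    return SELU_SCALE * np.where(z > 0.0, z, negative_part)

def lecun_normal(n_in, n_out, rng):
    """Zero-mean normal with variance 1 / n_in."""
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))

rng = np.random.default_rng(0)
width = 512
a = rng.standard_normal((256, width))   # standardized random inputs
for _ in range(20):
    a = selu(a @ lecun_normal(width, width, rng))

final_mean, final_std = a.mean(), a.std()
print(final_mean, final_std)
```

After 20 layers the activation statistics should still sit near mean 0 and standard deviation 1, which is the property SELU networks rely on.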

#### <u>6 - Orthogonal Initialization</u>
Weights are initialized as orthogonal matrices, typically used in recurrent neural networks (RNNs).  

**<u>Pros:</u>**  
- Helps avoid exploding and vanishing gradient problems.
- Maintains the flow of gradients over long sequences in recurrent networks.

**<u>Cons:</u>**  
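- A fully orthogonal matrix exists only when the number of input and output units is the same (square matrices); for other shapes the result is semi-orthogonal, with orthonormal rows or columns but not both.
- Computationally more expensive than other initialization methods, since it requires a QR or SVD decomposition of a random matrix.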

**Conclusion:** Commonly used in RNNs to help maintain long-term dependencies.
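A common way to build such a matrix is to take the QR decomposition of a random Gaussian matrix. The following is a sketch of that construction, not a specific framework's implementation; the function name `orthogonal` and the `gain` parameter are illustrative. For rectangular shapes it returns a semi-orthogonal matrix.

```python
import numpy as np

def orthogonal(rows, cols, gain=1.0, rng=None):
    """Return a (rows, cols) matrix with orthonormal rows or columns."""
    rng = np.random.default_rng() if rng is None else rng
    big, small = max(rows, cols), min(rows, cols)
    a = rng.standard_normal((big, small))
    q, r = np.linalg.qr(a)            # q has orthonormal columns, shape (big, small)
    q = q * np.sign(np.diag(r))       # fix column signs so the result is not biased by QR conventions
    if rows < cols:
        q = q.T                       # wide matrix: orthonormal rows instead
    return gain * q

rng = np.random.default_rng(0)
W_square = orthogonal(4, 4, rng=rng)
W_tall = orthogonal(6, 3, rng=rng)
W_wide = orthogonal(3, 6, rng=rng)

# An orthogonal recurrent matrix preserves vector norms under repeated multiplication.
v = rng.standard_normal(4)
v_after = np.linalg.matrix_power(W_square, 50) @ v
print(np.linalg.norm(v), np.linalg.norm(v_after))
```

The norm-preservation check at the end is the reason this scheme is attractive for recurrent weights: repeated multiplication neither shrinks nor inflates the hidden state.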

#### <u>7 - Layer-Specific Initialization</u>
- <b>Bias Initialization</b>: Often initialized to zero. In some cases, biases may be initialized to small positive values to avoid dead neurons (e.g., for ReLU).

**<u>Pros:</u>**  
- A small positive bias keeps more ReLU units active at the start of training, so they receive a nonzero gradient from the first update.

**<u>Cons:</u>**  
- The bias value is one more hyperparameter to tune, and the performance benefit can be marginal.
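The effect of a positive bias on ReLU units can be measured directly. This is a small sketch assuming He-scaled weights, standard-normal inputs, and a bias of 0.1; the value 0.1 and the helper `active_fraction` are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 256, 128
x = rng.standard_normal((1000, n_in))                          # random input batch
W = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))   # He-scaled weights

def active_fraction(bias_value):
    """Share of pre-activations above zero, i.e. ReLU units with a nonzero gradient."""
    b = np.full(n_out, bias_value)
    return np.mean(x @ W + b > 0.0)

frac_zero_bias = active_fraction(0.0)
frac_small_bias = active_fraction(0.1)
print(frac_zero_bias, frac_small_bias)
```

The gain in active units is modest for a bias this small, which is consistent with the benefit often being marginal.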