# Neural networks consist of many Linear Transformations

A linear transformation takes the form

$ Y = X \cdot W + b $

**W** and **b** are the weight and bias matrices and **X** is the input matrix.  **Y** is the output of the **linear transformation**.

A linear transformation changes a set of points (vectors) in space so that straight lines remain straight and the origin stays fixed. The matrix product $X \cdot W$ does exactly this, rotating, scaling, or shearing the input. Adding the bias **b** then shifts every point by the same amount, so strictly speaking the full expression $X \cdot W + b$ is an *affine* transformation, although machine learning texts commonly call the whole layer linear. In simpler terms, it is a rule for moving every point to a specific new location without bending or tearing the shape formed by those points.

In the code below, notice that W is passed to matmul with the `.t()` method, which transposes the matrix W.  In the first instance of the Adder model, X (the input) has a **shape** of (1, 2) and W has a shape of (4, 2).  Matrix multiplication requires the inner dimensions to match: the number of columns of the left matrix must equal the number of rows of the right matrix.  Transposing W changes its shape to (2, 4), so the product is defined, and the result (Y) is a matrix of shape (1, 4): (1, 2) × (2, 4) = (1, 4).  For 2-D matrices like these, **broadcasting rules** come into play when b is added element-wise, not in the matrix product itself.

```
Y = torch.matmul(X, W.t()) + b
```
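To make the shape arithmetic concrete, here is a minimal sketch. The input values, the random weights, and the zero bias are assumptions chosen only for illustration; they are not the Adder model's actual parameters.

```python
import torch

X = torch.tensor([[30.0, 70.0]])  # input, shape (1, 2)
W = torch.randn(4, 2)             # weights, shape (4, 2): (outputs, inputs)
b = torch.zeros(4)                # bias, shape (4,)

# (1, 2) @ (2, 4) -> (1, 4); adding b broadcasts it across the single row
Y = torch.matmul(X, W.t()) + b

# torch.nn.functional.linear applies the same transpose internally
Y_ref = torch.nn.functional.linear(X, W, b)

print(X.shape, W.t().shape, Y.shape)
```

Storing W as (outputs, inputs) is the layout `torch.nn.Linear` uses for its `weight`, which is why the transpose appears in the forward computation.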

# Nonlinearity is introduced through Activation Functions

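Activation functions introduce nonlinearity into neural networks. Without them, any stack of linear transformations collapses into a single linear transformation, no matter how many layers it has. Applying a nonlinear function between layers is what enables the model to represent more complex data patterns.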
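A small sketch of why the nonlinearity matters, using scalar "layers" with made-up weights (all names and values here are hypothetical): two stacked linear steps are always equivalent to one, while inserting a ReLU between them breaks that equivalence.

```python
def linear(x, w, b):
    """A one-dimensional linear layer: w * x + b."""
    return w * x + b

def relu(x):
    """Replace negative values with zero."""
    return max(0.0, x)

w1, b1 = 2.0, -1.0   # first layer (arbitrary values)
w2, b2 = -3.0, 0.5   # second layer (arbitrary values)

def two_linear(x):
    # Two linear layers applied back to back, no activation
    return linear(linear(x, w1, b1), w2, b2)

def collapsed(x):
    # The same mapping written as a single linear layer
    return linear(x, w2 * w1, w2 * b1 + b2)

def with_relu(x):
    # ReLU between the layers: no single linear layer reproduces this
    return linear(relu(linear(x, w1, b1)), w2, b2)

for x in (-2.0, 0.0, 1.0, 3.0):
    print(x, two_linear(x), collapsed(x), with_relu(x))
```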

Here's a list of commonly used activation functions, their brief descriptions, ideal usage scenarios, and how to use them in PyTorch:

1. **ReLU (Rectified Linear Unit)**
    - **Description**: ReLU replaces all negative values in the vector with zero.
    - **Where to Use**: Hidden layers in most networks.
    - **PyTorch Code**: `torch.nn.ReLU()`

2. **Sigmoid**
    - **Description**: Maps input values to the range of 0 to 1.
    - **Where to Use**: Binary classification output layer.
    - **PyTorch Code**: `torch.nn.Sigmoid()`

3. **Tanh (Hyperbolic Tangent)**
    - **Description**: Maps input to a range between -1 and 1.
    - **Where to Use**: Hidden layers when outputs may be negative.
    - **PyTorch Code**: `torch.nn.Tanh()`

4. **Softmax**
    - **Description**: Converts a real vector to a probability distribution.
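    - **Where to Use**: Multi-class classification output layer. Note that `torch.nn.CrossEntropyLoss()` expects raw logits and applies log-softmax internally, so do not put a Softmax layer in front of it.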
    - **PyTorch Code**: `torch.nn.Softmax(dim=1)`

5. **Leaky ReLU**
    - **Description**: Similar to ReLU but allows a small gradient for negative values.
    - **Where to Use**: Hidden layers to avoid dead neurons.
    - **PyTorch Code**: `torch.nn.LeakyReLU()`

6. **Swish**
    - **Description**: Self-gated activation function, `x * sigmoid(x)`.
    - **Where to Use**: Hidden layers; reported to match or outperform ReLU in some deeper networks.
    - **PyTorch Code**: `torch.nn.SiLU()` (Swish with β = 1), or a custom function.

7. **ELU (Exponential Linear Unit)**
    - **Description**: Similar to ReLU but produces smooth, negative outputs for negative inputs, which keeps the gradient nonzero there and pushes mean activations closer to zero.
    - **Where to Use**: Hidden layers when faster learning is needed.
    - **PyTorch Code**: `torch.nn.ELU()`

To use any of these in a PyTorch model, you can add them as a layer, e.g., `self.act1 = torch.nn.ReLU()`, and then apply them in your `forward` method: `x = self.act1(x)`.
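The pattern described above can be sketched as a complete module. `TinyNet`, its layer names, and its layer sizes are hypothetical choices for illustration only.

```python
import torch

class TinyNet(torch.nn.Module):
    """Hypothetical two-layer network; the layer sizes are arbitrary."""

    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(2, 4)   # weight has shape (4, 2)
        self.act1 = torch.nn.ReLU()        # activation stored as a layer
        self.fc2 = torch.nn.Linear(4, 1)

    def forward(self, x):
        x = self.fc1(x)
        x = self.act1(x)                   # nonlinearity between the linear layers
        return self.fc2(x)

model = TinyNet()
out = model(torch.tensor([[30.0, 70.0]]))  # input of shape (1, 2)
print(out.shape)
```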

# Broadcasting Rules

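Broadcasting rules describe how tensors of different shapes are combined in element-wise operations, such as adding the bias `b` to the result of a matrix product.  Understanding them helps make sense of when to transpose a matrix and which shape transformations are possible.  [Andrej Karpathy's Zero to Hero course](https://karpathy.ai/zero-to-hero.html) explains this well!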

[Read this to understand broadcasting rules](https://numpy.org/doc/stable/user/basics.broadcasting.html).  
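As a rough sketch of the rule described in the NumPy documentation linked above: shapes are compared from the rightmost dimension, and two dimensions are compatible when they are equal or one of them is 1. The helper name `broadcast_shape` is made up for this example; it only computes the resulting shape and performs no arithmetic.

```python
from itertools import zip_longest

def broadcast_shape(a, b):
    """Return the shape produced by broadcasting shapes a and b together."""
    result = []
    # Compare dimensions from right to left; a missing dimension counts as 1.
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x == 1:
            result.append(y)
        elif y == 1 or x == y:
            result.append(x)
        else:
            raise ValueError(f"shapes {a} and {b} cannot be broadcast together")
    return tuple(reversed(result))

print(broadcast_shape((1, 4), (4,)))    # a bias of shape (4,) added to Y of shape (1, 4)
print(broadcast_shape((3, 1), (1, 4)))  # both operands are stretched
```

Shapes such as (1, 4) and (3,) are incompatible, because the trailing dimensions 4 and 3 differ and neither is 1; the helper raises `ValueError` in that case.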

# Neural networks usually start out with random values that need to be trained

A model is **trained** by adjusting its weights by small amounts so that the **accuracy** improves and the **loss** shrinks.  During training, it is common for the **loss** to temporarily grow instead of shrink.  This can happen when an update step is too large and skips over a better weight combination.  The loss can also stall when the model settles into a local minimum.


# Gradient Descent and Learning Rate

Training with gradient descent computes a gradient for each parameter of the network.  The gradient is a floating point number that gives the slope of the **loss** with respect to that parameter: its sign indicates the direction in which the loss increases, and its magnitude indicates how steeply.  Once the gradients are distributed through the network, each weight or bias is updated by subtracting the gradient multiplied by a **learning rate**: `w = w - learning_rate * gradient`.  The subtraction is why the process is called **descent**: stepping against the slope moves the parameter toward lower loss.  If the gradient were added instead, the loss would continue to grow away from the desired result.
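A minimal sketch of the update rule `w = w - learning_rate * gradient`, applied to a hypothetical one-parameter loss whose minimum is known to be at `w = 3`. The loss function, starting point, and step count are assumptions for illustration.

```python
def loss(w):
    """Toy loss with its minimum at w = 3."""
    return (w - 3.0) ** 2

def gradient(w):
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)

w = 0.0
learning_rate = 0.1
history = [loss(w)]

for step in range(50):
    # Step against the slope: subtract the scaled gradient
    w = w - learning_rate * gradient(w)
    history.append(loss(w))

print(w, history[0], history[-1])
```

Each step shrinks the distance to the minimum by a constant factor here, so the loss falls steadily; real networks have far less regular loss surfaces.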

# Optimizer

Gradients are deposited on each weight, and depending on the network setup, multiple gradients may be deposited on the same weight.  When this happens, the gradients are summed together.  PyTorch also keeps accumulating gradients across backward passes, so it is important to reset them to zero before each new backward pass.  In PyTorch this is done by calling `zero_grad()` on the **optimizer**, which is also the object that applies the parameter updates via `step()`.
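The accumulation behaviour can be observed directly. This is a small sketch with a single made-up parameter `w`; the expression `2.0 * w` is just a stand-in for a loss.

```python
import torch

w = torch.tensor([1.0], requires_grad=True)   # a single hypothetical parameter
optimizer = torch.optim.SGD([w], lr=0.1)

(2.0 * w).sum().backward()
first = w.grad.clone()          # gradient of 2*w with respect to w is 2

(2.0 * w).sum().backward()      # second backward pass without resetting
accumulated = w.grad.clone()    # gradients were summed: 2 + 2

optimizer.zero_grad()           # reset before the next backward pass
# Depending on the PyTorch version, w.grad is now None or a tensor of zeros.
print(first, accumulated, w.grad)
```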

# Learning Rate

The **learning rate** is typically a small number, for example 1e-4 or 1e-5 (the optimizer examples in this document use 0.01 and 0.001).  The primary reason for keeping the learning rate small is to prevent the model from skipping past the best settings.  Learning rates can also be made dynamic, for example by altering them based on the current epoch.
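To illustrate what "skipping past the best settings" looks like, here is a sketch on the toy loss `w**2` (an assumption for illustration, with its minimum at `w = 0`). The helper `run` and both learning rates are hypothetical.

```python
def run(learning_rate, steps=20, w=1.0):
    """Gradient descent on loss(w) = w**2, whose gradient is 2*w."""
    for _ in range(steps):
        w = w - learning_rate * 2.0 * w
    return w

small = run(0.1)   # each step multiplies w by 0.8, so w shrinks toward 0
large = run(1.1)   # each step multiplies w by -1.2, so w overshoots and grows

print(small, large)
```

With the larger rate every step jumps over the minimum and lands farther away than it started, which is the divergence a small learning rate is meant to prevent.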

# Epoch

An epoch is one complete pass through the entire training dataset. During an epoch, the model's parameters are updated iteratively using subsets of the training data, often referred to as batches. Multiple epochs are usually necessary to sufficiently train a model.  In the code below, 2000 epochs are used training on the single equation `30+70 = 100`.
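As a separate, minimal illustration of how epochs and batches relate (this is not the training code referred to above), the loop below only counts parameter updates. The dataset size, batch size, and epoch count are arbitrary.

```python
dataset = list(range(10))   # ten hypothetical training samples
batch_size = 4
num_epochs = 3

updates = 0
batches_seen = []
for epoch in range(num_epochs):
    # One epoch: walk through the whole dataset in batches
    for start in range(0, len(dataset), batch_size):
        batch = dataset[start:start + batch_size]
        batches_seen.append(batch)
        updates += 1        # a real loop would compute the loss and update here

print(updates)
```

Ten samples with a batch size of four give three batches per epoch (the last one holds only two samples), so three epochs produce nine updates.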

# Hyperparameters

A hyperparameter is a parameter whose value is set before the training process begins, as opposed to the parameters of the model, which are learned during training. Examples include learning rate, batch size, and number of epochs. Hyperparameters are often tuned to optimize model performance.

# Loss function

The loss function (often called criterion) is used to compare the model's prediction to the actual value.  In the example below, the loss function is the MSELoss() function.  The MSELoss function calculates the average of the squares of the differences between predicted and actual values.

# Overfitting

Overfitting occurs when a model learns the training data too well, capturing noise and anomalies, rather than the underlying pattern. As a result, it performs poorly on new, unseen data. Overfitting is often a sign that a model is too complex relative to the simplicity of the problem. Techniques like **regularization**, **dropout**, and **simpler architectures** can help mitigate overfitting.

## Gradient Descent

Models are trained using **gradient descent**.  The "gradient" in the name refers to the derivative of the function at the current point. The algorithm takes steps proportional to the negative of the gradient, moving towards the minimum of the function.

The derivative of a function at a given point is essentially the slope of the function at that point. In the context of a function $f(x)$ of a single variable, the derivative $f'(x)$ represents the slope of the tangent line to the function at a specific point $x$. This slope indicates how the function is changing at that point. For functions of more than one variable, the concept of a derivative generalizes to partial derivatives.

A partial derivative is the derivative of a function of multiple variables with respect to one of those variables, keeping the other variables constant. For a function $f(x, y, z, \ldots)$, the partial derivative with respect to $x$ indicates how $f$ changes as $x$ changes, while keeping $y, z, \ldots$ constant. It is denoted as $\frac{\partial f}{\partial x}$ for the variable $x$.

Partial derivatives are used to form the gradient vector, which combines all the partial derivatives of a function into a single vector. This is useful in multivariable calculus and optimization problems, including machine learning algorithms like gradient descent.

The chain rule extends to functions of multiple variables through the use of partial derivatives. This generalizes to more complicated functions and is a cornerstone of backpropagation in neural networks, where it's used to compute gradients of a loss function with respect to the weights.

[Andrej Karpathy's Zero to Hero course](https://karpathy.ai/zero-to-hero.html) explains this well!
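A worked sketch of the chain rule on a one-weight model, under made-up values for `x` and `y`: the loss is `(w*x - y)**2`, so the derivative with respect to `w` is the derivative of the square times the derivative of the prediction. A finite-difference estimate is used as an independent check.

```python
def loss(w, x=2.0, y=10.0):
    prediction = w * x
    return (prediction - y) ** 2

def analytic_grad(w, x=2.0, y=10.0):
    # Chain rule: dL/dw = dL/dprediction * dprediction/dw
    #                   = 2 * (w*x - y)  *  x
    return 2.0 * (w * x - y) * x

def numeric_grad(w, h=1e-6):
    # Central finite difference: slope measured from two nearby points
    return (loss(w + h) - loss(w - h)) / (2.0 * h)

w = 1.5
print(analytic_grad(w), numeric_grad(w))
```

Backpropagation applies this same rule layer by layer, reusing intermediate results instead of re-deriving each gradient from scratch.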

# Loss Functions

Common loss functions and their corresponding PyTorch classes are listed below. Each loss function has its own characteristics and is suited for specific types of problems, so choosing the right one is crucial for training an effective model.

1. **Mean Squared Error (MSE)**
   - **PyTorch Function**: `torch.nn.MSELoss()`
   - **Description**: Calculates the average of the squares of the differences between predicted and actual values.
   - **Best Suited For**: Regression problems.

2. **Mean Absolute Error (MAE)**
   - **PyTorch Function**: `torch.nn.L1Loss()`
   - **Description**: Calculates the average of the absolute differences between predicted and actual values.
   - **Best Suited For**: Regression problems with outliers.

3. **Cross-Entropy Loss (Log Loss)**
   - **PyTorch Function**: `torch.nn.CrossEntropyLoss()`
   - **Description**: Measures the performance of a classification model, rewarding confidence in correct classifications.
   - **Best Suited For**: Binary and multi-class classification.

4. **Hinge Loss**
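   - **PyTorch Function**: `torch.nn.MultiMarginLoss()` (multi-class hinge loss). `torch.nn.HingeEmbeddingLoss()` is a related loss for learning whether two inputs are similar, using labels 1 and -1.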
   - **Description**: Used for "maximum-margin" classification, particularly for support vector machines.
   - **Best Suited For**: Binary classification.

5. **Categorical Cross-Entropy**
   - **PyTorch Function**: `torch.nn.CrossEntropyLoss()`
   - **Description**: Extension of Cross-Entropy loss for multi-class classification problems.
   - **Best Suited For**: Multi-class classification with single label.

6. **Kullback-Leibler Divergence**
   - **PyTorch Function**: `torch.nn.KLDivLoss()`
   - **Description**: Measures how one probability distribution diverges from another.
   - **Best Suited For**: Multi-class classification, recommendation systems.

7. **Poisson Loss**
   - **PyTorch Function**: `torch.nn.PoissonNLLLoss()`
   - **Description**: Measures the difference between the predicted average occurrence rate and the actual rate.
   - **Best Suited For**: Count-based regression problems.

8. **Cosine Similarity**
   - **PyTorch Function**: Use `torch.nn.CosineSimilarity()` and create a custom loss
   - **Description**: Measures the cosine of the angle between the predicted and actual vectors to measure similarity.
   - **Best Suited For**: Text similarity, clustering.

9. **Huber Loss**
   - **PyTorch Function**: `torch.nn.HuberLoss()`; `torch.nn.SmoothL1Loss()` is a closely related variant.
   - **Description**: Combination of MAE and MSE; less sensitive to outliers than MSE.
   - **Best Suited For**: Regression problems with occasional outliers.

10. **Negative Log-Likelihood (NLL)**
    - **PyTorch Function**: `torch.nn.NLLLoss()`
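    - **Description**: Similar to Cross-Entropy but expects log-probabilities as input; `torch.nn.LogSoftmax()` followed by `NLLLoss` is equivalent to `CrossEntropyLoss` on raw logits.
    - **Best Suited For**: Classification problems where the model outputs log-probabilities (LogSoftmax).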

11. **Focal Loss**
    - **PyTorch Function**: Not built-in; custom implementation required.
    - **Description**: Modification of Cross-Entropy that gives more weight to hard-to-classify examples.
    - **Best Suited For**: Imbalanced classification problems.

12. **Dice Loss**
    - **PyTorch Function**: Not built-in; custom implementation required.
    - **Description**: Measures overlap between predicted and ground truth sets; often used in image segmentation.
    - **Best Suited For**: Image segmentation tasks.


# Mean Squared Error (MSE) Loss Function

- **Essence**: Measures the average of the squares of the errors or deviations between predicted and actual values.

- **Why Use**: Suitable for regression problems or tasks where predicting the exact numeric value is important. It penalizes larger errors more significantly than smaller ones.

- **PyTorch Code**:
  ```python
  criterion = torch.nn.MSELoss()
  ```

- **How It Works**: The formula for MSE is:

  $$
  \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
  $$
  
  Here, $y_i$ is the actual value and $\hat{y}_i$ is the predicted value. $n$ is the total number of samples.

The MSE loss function is sensitive to outliers and might not be suitable for data with heavy-tailed distributions. However, it is one of the most commonly used loss functions for regression tasks.
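The MSE formula can be checked by hand with a few numbers. This is a plain-Python sketch with made-up values; in a PyTorch model the same quantity comes from `torch.nn.MSELoss()`.

```python
def mse(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    n = len(actual)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / n

actual = [3.0, -0.5, 2.0, 7.0]      # hypothetical targets
predicted = [2.5, 0.0, 2.0, 8.0]    # hypothetical model outputs

# Squared errors: 0.25, 0.25, 0.0, 1.0 -> mean = 1.5 / 4
print(mse(actual, predicted))
```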

# Mean Absolute Error (MAE) Loss Function

- **Essence**: Measures the average of the absolute differences between predicted and actual values.

- **Why Use**: Suitable for regression problems, particularly when you want to be less sensitive to outliers compared to MSE.

- **PyTorch Code**:
  ```python
  criterion = torch.nn.L1Loss()
  ```

- **How It Works**: The formula for MAE is:

  $$
  \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
  $$

  Here, $y_i$ is the actual value and $\hat{y}_i$ is the predicted value. $n$ is the total number of samples.

MAE is more robust to outliers compared to MSE and is often used when the distribution of errors or residuals is not normally distributed.
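The difference in outlier sensitivity shows up in a small comparison. The values below are invented: the two prediction lists are identical except that one has a single error of 10 instead of 0.5.

```python
def mae(actual, predicted):
    """Mean of the absolute differences."""
    return sum(abs(y - y_hat) for y, y_hat in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    """Mean of the squared differences."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / len(actual)

actual = [1.0, 2.0, 3.0, 4.0]
clean = [1.5, 2.5, 3.5, 4.5]          # every prediction is off by 0.5
with_outlier = [1.5, 2.5, 3.5, 14.0]  # the last prediction is off by 10

print(mae(actual, clean), mae(actual, with_outlier))
print(mse(actual, clean), mse(actual, with_outlier))
```

The single outlier raises MAE from 0.5 to 2.875 but raises MSE from 0.25 to 25.1875, because squaring magnifies the one large error.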

# Cross-Entropy (Log Loss) Loss Function

- **Essence**: Measures the dissimilarity between the true label distribution and the predicted probabilities in classification tasks.

- **Why Use**: Commonly used in binary and multi-class classification problems where outputs can be interpreted as probabilities.

- **PyTorch Code**:
  ```python
  criterion = torch.nn.CrossEntropyLoss()  # For multi-class
  criterion = torch.nn.BCEWithLogitsLoss()  # For binary classification
  ```

- **How It Works**: For binary classification, the formula is:

  $$
  \text{Cross-Entropy Loss} = -\frac{1}{n} \sum_{i=1}^{n} [y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i)]
  $$

  For multi-class classification:

  $$
  \text{Cross-Entropy Loss} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{C} y_{ic} \log(\hat{y}_{ic})
  $$

  Here, $y_{ic}$ is 1 if the $i$-th sample belongs to class $c$ and 0 otherwise. $\hat{y}_{ic}$ is the predicted probability that the $i$-th sample belongs to class $c$.

Cross-Entropy Loss is particularly useful when the output can be interpreted as the probability of belonging to certain classes. It heavily penalizes predictions that are confidently wrong.
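The penalty for confident mistakes can be seen by evaluating the binary formula directly. This plain-Python sketch takes probabilities strictly between 0 and 1 (unlike `torch.nn.BCEWithLogitsLoss()`, which takes raw logits); the labels and probabilities are made up.

```python
import math

def binary_cross_entropy(actual, predicted):
    """Binary cross-entropy for labels in {0, 1} and probabilities in (0, 1)."""
    total = 0.0
    for y, p in zip(actual, predicted):
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(actual)

labels = [1, 0]
confident_right = binary_cross_entropy(labels, [0.9, 0.1])
confident_wrong = binary_cross_entropy(labels, [0.1, 0.9])

print(confident_right, confident_wrong)
```

The confidently wrong predictions cost roughly twenty times more than the confidently right ones, since the loss grows without bound as the probability assigned to the true class approaches zero.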

# Optimizers

Below are some common optimizers, their characteristics, and example PyTorch code snippets to initialize them:

### SGD (Stochastic Gradient Descent)
- **Why**: Simplicity and ease of implementation.
- **Essence**: Updates parameters using the gradient of the loss function.
- **Code**: 
  ```python
  optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
  ```

### Momentum
- **Why**: Faster convergence compared to plain SGD.
- **Essence**: Adds a momentum term to SGD, which considers past gradients.
- **Code**:
  ```python
  optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
  ```

### Adagrad
- **Why**: Suitable for sparse data, adjusts learning rates.
- **Essence**: Scales learning rate for each parameter individually.
- **Code**:
  ```python
  optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)
  ```

### RMSprop
- **Why**: Good for non-stationary objectives.
- **Essence**: Adapts learning rates during training.
- **Code**:
  ```python
  optimizer = torch.optim.RMSprop(model.parameters(), lr=0.01)
  ```

### Adam
- **Why**: Good default for many problems.
- **Essence**: Combines features of Momentum and RMSprop.
- **Code**:
  ```python
  optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
  ```

### AdamW
- **Why**: Improves generalization.
- **Essence**: Similar to Adam but with decoupled weight decay.
- **Code**:
  ```python
  optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)
  ```

Each of these optimizers can be more effective depending on the specific problem, architecture, or data you are working with.

# Stochastic Gradient Descent (SGD) Optimizer

- **Essence**: Unlike traditional Gradient Descent, which computes the gradient using the entire dataset, SGD estimates the gradient using a single sample or a small batch of samples. This makes each update faster but noisier.

- **Why Use**: Faster computation and ability to escape local minima due to the noisy updates. Useful when you have a very large dataset.

- **PyTorch Code**: 
  ```python
  optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
  ```

- **How It Works**: The weight update formula is the same as in traditional gradient descent:
  ``` 
  w = w - learning_rate * estimated_gradient
  ```
  The key difference is that `estimated_gradient` is calculated using a subset of the entire dataset.

The main advantage of SGD is computational efficiency, allowing for faster iterations and suitability for large-scale data. The randomness in choosing mini-batches can also help escape local minima for non-convex problems.
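A minimal sketch of mini-batch SGD in plain Python, fitting a single weight to the made-up relationship `y = 2x`. The dataset, batch size, learning rate, and epoch count are all assumptions; the point is that each update uses the gradient of only one small batch.

```python
import random

random.seed(0)                                   # reproducible shuffling

xs = [float(i) for i in range(1, 9)]             # hypothetical inputs 1..8
ys = [2.0 * x for x in xs]                       # targets follow y = 2x

w = 0.0
learning_rate = 0.01
batch_size = 2

for epoch in range(50):
    indices = list(range(len(xs)))
    random.shuffle(indices)                      # new batch composition each epoch
    for start in range(0, len(indices), batch_size):
        batch = indices[start:start + batch_size]
        # Gradient of the mean squared error over this batch only
        estimated_gradient = sum(
            2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch
        ) / len(batch)
        w = w - learning_rate * estimated_gradient

print(w)
```

Because this toy data contains no noise, every batch agrees on the best weight and `w` settles at 2; with real data the batch gradients disagree slightly, which is the source of the noisy updates described above.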

# Momentum Optimizer

- **Essence**: Momentum helps the optimizer to navigate along the relevant directions and softens the oscillations in the irrelevant directions. It accumulates past gradients and uses them to make future updates, thereby adding inertia to the optimization process.
  
- **Why Use**: It speeds up the convergence and mitigates problems like getting stuck in local minima or oscillating.

- **PyTorch Code**: 
  ```python
  optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
  ```
  
- **How It Works**: Traditional gradient descent updates weights (`w`) as follows:
  ``` 
  w = w - learning_rate * gradient
  ```
  Momentum modifies this by introducing a velocity (`v`) term:
  ```
  v = momentum * v - learning_rate * gradient
  w = w + v
  ```

The `momentum` term usually has a value between 0 and 1; a typical value is 0.9. The velocity `v` is initialized as zero, and subsequently updated with the weighted sum of the negative gradient and the previous velocity.
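The velocity update can be traced in plain Python. This sketch applies the two formulas above to the toy loss `w**2` (an assumption for illustration); the starting point and step count are arbitrary.

```python
def gradient(w):
    """Gradient of the toy loss w**2."""
    return 2.0 * w

w, v = 5.0, 0.0                      # velocity starts at zero
learning_rate, momentum = 0.1, 0.9
trajectory = [w]

for step in range(300):
    v = momentum * v - learning_rate * gradient(w)   # blend old velocity with new gradient
    w = w + v
    trajectory.append(w)

print(trajectory[:4], w)
```

The first step is identical to plain gradient descent because the velocity is still zero; later steps carry inertia, so on this loss `w` overshoots zero and oscillates with shrinking amplitude before settling.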

# Adagrad (Adaptive Gradient Algorithm) Optimizer

- **Essence**: Adagrad adapts the learning rates for each parameter individually based on the historical gradient information.

- **Why Use**: Suitable for problems with features that have different frequencies. Good for sparse data and NLP tasks.

- **PyTorch Code**: 
  ```python
  optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)
  ```

- **How It Works**: The weight update formula is:
  ```
  G_t = G_{t-1} + (gradient_t)^2
  w = w - (learning_rate / sqrt(G_t + epsilon)) * gradient_t
  ```
  where $G_t$ is the sum of the squares of the past gradients and $\epsilon$ is a smoothing term to prevent division by zero.

Adagrad adjusts the effective learning rate for each parameter, which can be beneficial for problems where some features are sparse and need more aggressive updates. However, the learning rate may decrease too fast, effectively stopping learning; thus, it's not always suitable for all problems.
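The shrinking effective learning rate is easy to observe in a plain-Python sketch of the update above. The toy loss `w**2`, the base learning rate, and the step count are assumptions for illustration.

```python
import math

def gradient(w):
    """Gradient of the toy loss w**2."""
    return 2.0 * w

w = 5.0
G = 0.0                         # running sum of squared gradients
learning_rate = 1.0
epsilon = 1e-8
effective_rates = []

for step in range(100):
    g = gradient(w)
    G = G + g ** 2
    effective = learning_rate / math.sqrt(G + epsilon)
    w = w - effective * g
    effective_rates.append(effective)

print(effective_rates[0], effective_rates[-1], w)
```

Since `G` can only grow, the effective rate can only fall, which is the behaviour that eventually slows learning.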

# RMSprop (Root Mean Square Propagation) Optimizer

- **Essence**: RMSprop adjusts the learning rate during training, with a bias toward more recent gradients. It aims to resolve Adagrad's rapidly decreasing learning rate.

- **Why Use**: Useful for non-stationary objectives and online learning scenarios. Also effective for problems that are sensitive to parameter updates, such as recurrent neural networks (RNNs).

- **PyTorch Code**: 
  ```python
  optimizer = torch.optim.RMSprop(model.parameters(), lr=0.01)
  ```

- **How It Works**: The weight update formula is:
  ```
  E[g^2]_t = (1 - decay_rate) * (gradient_t)^2 + decay_rate * E[g^2]_{t-1}
  w = w - (learning_rate / sqrt(E[g^2]_t + epsilon)) * gradient_t
  ```
  where $E[g^2]_t$ is the moving average of the square of the gradient, and $\epsilon$ is a smoothing term to prevent division by zero.

Like AdaDelta, RMSprop was designed to counter Adagrad's steadily shrinking learning rate. It uses a moving average of squared gradients to normalize the gradient itself, which means the step size is decided on a per-parameter basis.
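A plain-Python sketch of the RMSprop update above on the toy loss `w**2`. The loss, the hyperparameter values, and the step count are assumptions for illustration.

```python
import math

def gradient(w):
    """Gradient of the toy loss w**2."""
    return 2.0 * w

w = 1.0
avg_sq = 0.0                    # moving average of squared gradients
learning_rate = 0.01
decay_rate = 0.9
epsilon = 1e-8
step_sizes = []

for step in range(500):
    g = gradient(w)
    avg_sq = (1 - decay_rate) * g ** 2 + decay_rate * avg_sq
    update = learning_rate / math.sqrt(avg_sq + epsilon) * g
    w = w - update
    step_sizes.append(abs(update))

print(w, max(step_sizes))
```

Because the gradient is divided by the root of its own recent average, the step size stays on the order of the learning rate regardless of how large the raw gradient is.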

# Adam (Adaptive Moment Estimation) Optimizer

- **Essence**: Combines the advantages of both Momentum and RMSprop. It calculates adaptive learning rates for each parameter and also keeps an exponentially decaying average of past gradients, similar to momentum.

- **Why Use**: Effective in practice and suitable for most non-convex optimization problems. Good for deep learning and complex architectures.

- **PyTorch Code**:
  ```python
  optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
  ```

- **How It Works**: The weight update formula combines terms from both Momentum and RMSprop:
  ```
  m_t = beta1 * m_{t-1} + (1 - beta1) * gradient_t
  v_t = beta2 * v_{t-1} + (1 - beta2) * (gradient_t)^2
  
  m_t_hat = m_t / (1 - beta1^t)
  v_t_hat = v_t / (1 - beta2^t)

  w = w - learning_rate * m_t_hat / (sqrt(v_t_hat) + epsilon)
  ```
  where $m_t$ and $v_t$ are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients respectively.

Adam works well in practice and is a common default choice across many machine learning and deep learning tasks, although it does not outperform every other optimizer on every problem.
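A single Adam step written out in plain Python, following the formulas above. The function name `adam_step` and the gradient values are made up; the default hyperparameters are the commonly used ones.

```python
import math

def adam_step(w, g, m, v, t, learning_rate=0.001,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Apply one Adam update to weight w given gradient g at step t (t >= 1)."""
    m = beta1 * m + (1 - beta1) * g            # first moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2       # second moment estimate
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return w, m, v

# First step with a tiny gradient and with a huge gradient
small_w, _, _ = adam_step(w=1.0, g=0.001, m=0.0, v=0.0, t=1)
large_w, _, _ = adam_step(w=1.0, g=1000.0, m=0.0, v=0.0, t=1)

print(1.0 - small_w, 1.0 - large_w)
```

On the first step the bias-corrected moments reduce to the gradient and its square, so the weight moves by almost exactly the learning rate in both cases: the update size depends on the gradient's sign and consistency, not its raw scale.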

# AdamW (Weight Decay Adam) Optimizer

- **Essence**: A modification of the original Adam optimizer that decouples weight decay from the optimization steps. This corrects the weight decay calculation in the original Adam algorithm.

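- **Why Use**: Effective when you want weight decay regularization in your models, especially in deep learning tasks. With Adam, adding an L2 penalty to the loss is not equivalent to weight decay, because the penalty's gradient is rescaled by the adaptive learning rate; AdamW applies the decay directly to the weights instead.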

- **PyTorch Code**:
  ```python
  optimizer = torch.optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)
  ```

- **How It Works**: AdamW separates the weight decay term from the parameter update:
  ```
  m_t = beta1 * m_{t-1} + (1 - beta1) * gradient_t
  v_t = beta2 * v_{t-1} + (1 - beta2) * (gradient_t)^2

  m_t_hat = m_t / (1 - beta1^t)
  v_t_hat = v_t / (1 - beta2^t)

  w = w - learning_rate * weight_decay * w - learning_rate * m_t_hat / (sqrt(v_t_hat) + epsilon)
  ```
  This makes the optimizer more suitable for fine-tuning and helps in achieving better generalization.

AdamW is particularly useful in scenarios where weight decay regularization is required, as it corrects the shortcomings of how weight decay is handled in the original Adam optimizer.
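The decoupling can be made visible with a plain-Python sketch of one AdamW step. The function name `adamw_step` and the input values are hypothetical; a zero gradient is used so that only the weight decay term acts.

```python
import math

def adamw_step(w, g, m, v, t, learning_rate=0.001, beta1=0.9,
               beta2=0.999, epsilon=1e-8, weight_decay=0.01):
    """Apply one AdamW update to weight w given gradient g at step t (t >= 1)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: shrink w directly, outside the adaptive term
    w = w - learning_rate * weight_decay * w
    # Regular Adam update using the loss gradient only
    w = w - learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return w, m, v

w, m, v = adamw_step(w=2.0, g=0.0, m=0.0, v=0.0, t=1)
print(w)
```

With no gradient signal, the weight still shrinks by the factor `1 - learning_rate * weight_decay`, and the moment estimates stay at zero because the decay never passes through them.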