# Neural networks consist of many Linear Transformations

Notice that the computation in each layer is essentially the same, taking the form:

$ Y = X \cdot W + b $

**W** and **b** are the weight and bias matrices and **X** is the input matrix.  **Y** is the output of the **linear transformation**.

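A linear transformation is a way to change a set of points (vectors) in space so that lines remain lines and the origin remains fixed. In simpler terms, it's a rule for moving every point to a specific new location without twisting, warping, or tearing the shape formed by those points. The matrix multiplication $ X \cdot W $ is the linear part and can rotate, scale, or shear the input data. Adding the bias **b** then shifts the result, so strictly speaking the full expression is an *affine* transformation, although in machine learning the whole operation is usually still called a linear layer.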

In the above code, the function used the `t()` function, which transposes a matrix. In the first instance, the input_tensor has a **shape** of (1, 2) and W_input has a shape of (4, 2). Matrix multiplication requires the inner dimensions to match: the number of columns of the left matrix must equal the number of rows of the right matrix. Transposing W_input changes its shape to (2, 4), so multiplying (1, 2) by (2, 4) gives a result (Y_input) of shape (1, 4). **Broadcasting rules** then govern how b_input is added to that result.

```python
Y_input = torch.matmul(input_tensor, W_input.t()) + b_input
```
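
The shapes described above can be checked with a minimal sketch. The tensors here are filled with random placeholder values (an assumption for illustration, not the values used elsewhere in this document); only the shapes matter.

```python
import torch

torch.manual_seed(0)

# Shapes follow the text: one sample with 2 features, a layer with 4 outputs.
input_tensor = torch.randn(1, 2)   # (1, 2)
W_input = torch.randn(4, 2)        # (4, 2): one row of weights per output
b_input = torch.randn(4)           # (4,): one bias per output

# Transposing W_input gives (2, 4), so (1, 2) @ (2, 4) -> (1, 4).
Y_input = torch.matmul(input_tensor, W_input.t()) + b_input
print(Y_input.shape)   # torch.Size([1, 4])

# torch.nn.functional.linear computes the same x @ W.T + b in one call.
Y_check = torch.nn.functional.linear(input_tensor, W_input, b_input)
print(torch.allclose(Y_input, Y_check))   # True
```

Storing the weights as (outputs, inputs) and transposing at use time is the same convention `torch.nn.Linear` follows.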


# Broadcasting Rules

If you understand broadcasting rules, it can help to make sense of when to transpose a matrix and what types of shape transformations can happen.  [Andrej Karpathy's Zero to Hero course](https://karpathy.ai/zero-to-hero.html) explains this well!

[Read this to understand broadcasting rules](https://numpy.org/doc/stable/user/basics.broadcasting.html).  
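
As a small illustration of those rules, the sketch below adds tensors of different shapes. The tensor names and values are made up for the example. Shapes are compared from right to left, and two dimensions are compatible when they are equal or one of them is 1.

```python
import torch

a = torch.ones(3, 4)                            # shape (3, 4)
row = torch.tensor([1.0, 2.0, 3.0, 4.0])        # shape (4,)
col = torch.tensor([[10.0], [20.0], [30.0]])    # shape (3, 1)

print((a + row).shape)     # (3, 4) + (4,)   -> torch.Size([3, 4])
print((a + col).shape)     # (3, 4) + (3, 1) -> torch.Size([3, 4])
print((row + col).shape)   # (4,)   + (3, 1) -> torch.Size([3, 4])

# Incompatible: the trailing dimensions are 4 and 3, and neither is 1.
try:
    a + torch.ones(3)
    broadcast_failed = False
except RuntimeError:
    broadcast_failed = True
print(broadcast_failed)    # True
```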

# Neural networks usually start out with random values that need to be trained

A model is **trained** by having its weights adjusted by small amounts so that the **accuracy** improves and the **loss** shrinks. During training, it is common for the **loss** to occasionally grow instead of shrink. This can happen when an update steps past the ideal weight combination, and progress can also stall when the model settles into a local minimum.


# Gradient Descent and Learning Rate

Training combines two steps. **Backpropagation** deposits a gradient on each parameter of the network. The gradient is a floating point number that describes how the **loss** changes as that parameter changes: its sign gives the direction in which the loss increases, and its magnitude gives how steeply. **Gradient descent** then updates each weight or bias by subtracting the gradient multiplied by a **learning rate** $ \eta $:

$ w \leftarrow w - \eta \cdot \frac{\partial L}{\partial w} $

The subtraction is why the process is called **descent**. The gradient is like a slope pointing uphill, so stepping in the negative direction moves the parameter toward a lower **loss**, which usually improves **accuracy** as well. If the gradient were added instead, the result would continue to grow away from the desired result.
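
A minimal sketch of gradient descent on a single parameter, assuming a toy loss $ (w - 3)^2 $ whose minimum is at 3. The loss, the starting value, and the learning rate are all chosen only for illustration.

```python
import torch

# One hypothetical parameter and a toy loss with its minimum at w = 3.
w = torch.tensor(0.0, requires_grad=True)
learning_rate = 0.1

losses = []
for step in range(50):
    loss = (w - 3.0) ** 2
    losses.append(loss.item())
    loss.backward()                    # fills in w.grad = d(loss)/dw
    with torch.no_grad():
        w -= learning_rate * w.grad    # step against the gradient
    w.grad.zero_()                     # clear the gradient before the next step

print(w.item())                # close to 3.0
print(losses[0], losses[-1])   # the loss shrinks over the run
```

In practice an optimizer object performs the update and the zeroing, but the arithmetic is the same.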

# Optimizer

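The gradients are deposited to each weight, and depending upon the network setup multiple gradients may be deposited to the same weight. When this happens, the gradients are simply summed together. PyTorch also keeps summing into the same gradient across successive backward passes, so it is important to set all gradients to zero before each new backward pass. In PyTorch this is done by calling `zero_grad()` on the **optimizer**, which is also the object that applies the gradients to the weights when `step()` is called.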
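
The summing behaviour can be seen in a short sketch. The single parameter `w` and the expression `w * 3.0` are hypothetical and chosen so the gradient is always 3.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

# Two backward passes without zeroing: the gradients are summed.
(w * 3.0).backward()            # d/dw = 3
(w * 3.0).backward()            # another 3 is added
accumulated = w.grad.item()     # 6.0

optimizer.zero_grad()           # reset before the next backward pass
(w * 3.0).backward()
fresh = w.grad.item()           # 3.0

optimizer.step()                # w <- 2.0 - 0.1 * 3.0
print(accumulated, fresh, w.item())
```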

# Learning Rate

The **learning rate** is typically a small number, often somewhere between 1e-2 and 1e-5. The primary reason for keeping the learning rate small is to prevent the model from skipping past the best settings. The trade-off is that a very small learning rate makes training slow. There is a lot that can be done with learning rates, including making them dynamic by altering them based upon the current epoch.
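
One way to make the learning rate depend on the epoch is a scheduler. The sketch below uses `torch.optim.lr_scheduler.StepLR` with a placeholder model; the step size of 10 epochs and the factor of 0.1 are arbitrary values picked for the example.

```python
import torch

model = torch.nn.Linear(2, 1)   # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# Multiply the learning rate by 0.1 every 10 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

lrs = []
for epoch in range(30):
    lrs.append(optimizer.param_groups[0]["lr"])
    optimizer.step()    # a real loop would compute a loss and call backward() first
    scheduler.step()    # advance the schedule once per epoch

print(lrs[0], lrs[10], lrs[20])   # roughly 0.01, 0.001, 0.0001
```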

# Epoch

An epoch is one complete pass through the entire training dataset. During an epoch, the model's parameters are updated iteratively using subsets of the training data, often referred to as batches. Multiple epochs are usually necessary to sufficiently train a model.  In the code below, 2000 epochs are used training on the single equation `30+70 = 100`.
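
The relationship between epochs and batches can be sketched with a `DataLoader`. The dataset here is hypothetical (10 random samples whose target is the sum of the two inputs), and the batch size and epoch count are arbitrary.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
# Hypothetical dataset: 10 samples, 2 features each, target is their sum.
X = torch.rand(10, 2)
y = X.sum(dim=1, keepdim=True)
loader = DataLoader(TensorDataset(X, y), batch_size=4, shuffle=True)

model = torch.nn.Linear(2, 1)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

num_epochs = 3
updates = 0
for epoch in range(num_epochs):          # one epoch = one full pass over the data
    for batch_X, batch_y in loader:      # 10 samples / batch size 4 -> 3 batches
        optimizer.zero_grad()
        loss = criterion(model(batch_X), batch_y)
        loss.backward()
        optimizer.step()
        updates += 1

print(updates)   # 3 epochs * 3 batches = 9 parameter updates
```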

# Hyperparameters

A hyperparameter is a parameter whose value is set before the training process begins, as opposed to the parameters of the model, which are learned during training. Examples include learning rate, batch size, and number of epochs. Hyperparameters are often tuned to optimize model performance.

# Loss function

The loss function (often called criterion) is used to compare the model's prediction to the actual value.  In the example below, the loss function is the MSELoss() function.  The MSELoss function calculates the average of the squares of the differences between predicted and actual values.
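
As a small check of that definition, the sketch below compares `MSELoss()` with the same calculation written by hand. The prediction and target values are made up for the example.

```python
import torch

prediction = torch.tensor([2.0, 4.0, 6.0])
target = torch.tensor([1.0, 4.0, 8.0])

criterion = torch.nn.MSELoss()
loss = criterion(prediction, target)

# By hand: squared differences are 1, 0 and 4, so the mean is 5 / 3.
manual = ((prediction - target) ** 2).mean()
print(loss.item(), manual.item())
```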

# Overfitting

Overfitting occurs when a model learns the training data too well, capturing noise and anomalies, rather than the underlying pattern. As a result, it performs poorly on new, unseen data. Overfitting is often a sign that a model is too complex relative to the simplicity of the problem. Techniques like **regularization**, **dropout**, and **simpler architectures** can help mitigate overfitting.

## Gradient Descent (more details)

Models are trained using **gradient descent**.  The "gradient" in the name refers to the derivative of the function at the current point. The algorithm takes steps proportional to the negative of the gradient, moving towards the minimum of the function.

The derivative of a function at a given point is essentially the slope of the function at that point. In the context of a function $ f(x) $ of a single variable, the derivative $ f'(x) $ represents the slope of the tangent line to the function at a specific point $ x $. This slope indicates how the function is changing at that point. For functions of more than one variable, the concept of a derivative generalizes to partial derivatives.

A partial derivative is the derivative of a function of multiple variables with respect to one of those variables, keeping the other variables constant. For a function $ f(x, y, z, \ldots) $, the partial derivative with respect to $ x $ indicates how $ f $ changes as $ x $ changes, while keeping $ y, z, \ldots $ constant. It is denoted as $ \frac{\partial f}{\partial x} $ for the variable $ x $.

Partial derivatives are used to form the gradient vector, which combines all the partial derivatives of a function into a single vector. This is useful in multivariable calculus and optimization problems, including machine learning algorithms like gradient descent.

The chain rule extends to functions of multiple variables through the use of partial derivatives. This generalizes to more complicated functions and is a cornerstone of backpropagation in neural networks, where it's used to compute gradients of a loss function with respect to the weights.
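
PyTorch's autograd computes these quantities automatically. The sketch below uses two made-up functions so the results can be checked by hand: one to show partial derivatives, one to show the chain rule.

```python
import torch

# f(x, y) = x^2 * y + y^3, evaluated at (x, y) = (2, 3).
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
f = x ** 2 * y + y ** 3
f.backward()

# df/dx = 2xy = 12 and df/dy = x^2 + 3y^2 = 4 + 27 = 31
print(x.grad.item(), y.grad.item())   # 12.0 31.0

# Chain rule: g(u) = u^2 with u = 3t + 1, so dg/dt = 2u * 3.
t = torch.tensor(1.0, requires_grad=True)
u = 3 * t + 1
g = u ** 2
g.backward()
print(t.grad.item())   # 2 * 4 * 3 = 24.0
```

The pair (12, 31) is the gradient vector of `f` at that point.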

[Andrej Karpathy's Zero to Hero course](https://karpathy.ai/zero-to-hero.html) explains this well!

# Loss Functions

Each loss function has its own characteristics and is suited for specific types of problems. Choosing the right loss function is crucial for training an effective model. Here's a list of common loss functions with their corresponding PyTorch functions:

1. **Mean Squared Error (MSE)**
   - **PyTorch Function**: `torch.nn.MSELoss()`
   - **Description**: Calculates the average of the squares of the differences between predicted and actual values.
   - **Best Suited For**: Regression problems.

2. **Mean Absolute Error (MAE)**
   - **PyTorch Function**: `torch.nn.L1Loss()`
   - **Description**: Calculates the average of the absolute differences between predicted and actual values.
   - **Best Suited For**: Regression problems with outliers.

3. **Cross-Entropy Loss (Log Loss)**
   - **PyTorch Function**: `torch.nn.CrossEntropyLoss()`
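   - **Description**: Measures the performance of a classification model, rewarding confidence in correct classifications and penalizing confident mistakes. Expects raw scores (logits) as input.
   - **Best Suited For**: Multi-class classification. For binary classification, `torch.nn.BCEWithLogitsLoss()` is the usual choice.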

4. **Hinge Loss**
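   - **PyTorch Function**: `torch.nn.MultiMarginLoss()` (multi-class hinge loss). Note that `torch.nn.HingeEmbeddingLoss()` is a different loss, used to learn whether two inputs are similar or dissimilar.
   - **Description**: Used for "maximum-margin" classification, particularly for support vector machines.
   - **Best Suited For**: Margin-based binary and multi-class classification.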

5. **Categorical Cross-Entropy**
   - **PyTorch Function**: `torch.nn.CrossEntropyLoss()`
   - **Description**: Extension of Cross-Entropy loss for multi-class classification problems.
   - **Best Suited For**: Multi-class classification with single label.

6. **Kullback-Leibler Divergence**
   - **PyTorch Function**: `torch.nn.KLDivLoss()`
   - **Description**: Measures how one probability distribution diverges from another.
   - **Best Suited For**: Multi-class classification, recommendation systems.

7. **Poisson Loss**
   - **PyTorch Function**: `torch.nn.PoissonNLLLoss()`
   - **Description**: Measures the difference between the predicted average occurrence rate and the actual rate.
   - **Best Suited For**: Count-based regression problems.

8. **Cosine Similarity**
   - **PyTorch Function**: Use `torch.nn.CosineSimilarity()` and create a custom loss
   - **Description**: Measures the cosine of the angle between the predicted and actual vectors to measure similarity.
   - **Best Suited For**: Text similarity, clustering.

9. **Huber Loss**
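   - **PyTorch Function**: `torch.nn.HuberLoss()`. The closely related `torch.nn.SmoothL1Loss()` is equivalent when both are used with their default settings.
   - **Description**: Quadratic for small errors (like MSE) and linear for large errors (like MAE); less sensitive to outliers than MSE.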
   - **Best Suited For**: Regression problems with occasional outliers.

10. **Negative Log-Likelihood (NLL)**
    - **PyTorch Function**: `torch.nn.NLLLoss()`
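    - **Description**: Computes the same quantity as Cross-Entropy but expects log-probabilities as input instead of raw logits; typically paired with `torch.nn.LogSoftmax()`.
    - **Best Suited For**: Classification problems where the model already outputs log-probabilities (LogSoftmax).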

11. **Focal Loss**
    - **PyTorch Function**: Not built into core PyTorch; a custom implementation is required (torchvision offers `torchvision.ops.sigmoid_focal_loss`).
    - **Description**: Modification of Cross-Entropy that gives more weight to hard-to-classify examples.
    - **Best Suited For**: Imbalanced classification problems.

12. **Dice Loss**
    - **PyTorch Function**: Not built-in; custom implementation required.
    - **Description**: Measures overlap between predicted and ground truth sets; often used in image segmentation.
    - **Best Suited For**: Image segmentation tasks.
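
To illustrate how two entries in this list relate, the sketch below checks that `CrossEntropyLoss()` applied to raw logits gives the same value as `NLLLoss()` applied to log-softmax outputs. The logits and class labels are random placeholders.

```python
import torch

torch.manual_seed(0)
logits = torch.randn(3, 5)            # 3 samples, 5 classes (raw scores)
targets = torch.tensor([1, 0, 4])     # correct class index for each sample

# CrossEntropyLoss takes raw logits directly.
ce = torch.nn.CrossEntropyLoss()(logits, targets)

# NLLLoss expects log-probabilities, so log-softmax is applied first.
log_probs = torch.nn.functional.log_softmax(logits, dim=1)
nll = torch.nn.NLLLoss()(log_probs, targets)

print(torch.allclose(ce, nll))   # True
```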


# Optimizers

Below are some common optimizers, their characteristics, and example PyTorch code snippets to initialize them:

### SGD (Stochastic Gradient Descent)
- **Why**: Simplicity and ease of implementation.
- **Essence**: Updates parameters using the gradient of the loss function.
- **Code**: 
  ```python
  optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
  ```

### Momentum
- **Why**: Faster convergence compared to plain SGD.
- **Essence**: Adds a momentum term to SGD, which considers past gradients.
- **Code**:
  ```python
  optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
  ```

### Adagrad
- **Why**: Suitable for sparse data; frequently updated parameters get smaller steps.
- **Essence**: Scales the learning rate for each parameter individually using the accumulated sum of its squared gradients.
- **Code**:
  ```python
  optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)
  ```

### RMSprop
- **Why**: Good for non-stationary objectives.
- **Essence**: Adapts the learning rate of each parameter using a moving average of its recent squared gradients.
- **Code**:
  ```python
  optimizer = torch.optim.RMSprop(model.parameters(), lr=0.01)
  ```

### Adam
- **Why**: Good default for many problems.
- **Essence**: Combines features of Momentum and RMSprop.
- **Code**:
  ```python
  optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
  ```

### AdamW
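- **Why**: Often generalizes better than Adam when weight decay is used.
- **Essence**: Adam with decoupled weight decay: the decay is applied directly to the weights instead of being added to the gradient.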
- **Code**:
  ```python
  optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)
  ```

Each of these optimizers can be more effective depending on the specific problem, architecture, or data you are working with.
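
A rough way to compare optimizers is to train the same tiny model with each of them. The sketch below does this on a hypothetical four-sample dataset whose target is the sum of the two inputs. The helper `final_losses` is made up for this example, and the result says nothing about how the optimizers rank on real problems.

```python
import torch

def final_losses(make_optimizer, steps=200):
    """Train a tiny linear model and return (first_loss, last_loss)."""
    torch.manual_seed(0)                      # same starting weights for every optimizer
    model = torch.nn.Linear(2, 1)
    X = torch.tensor([[0.3, 0.7], [0.1, 0.2], [0.5, 0.5], [0.9, 0.4]])
    y = X.sum(dim=1, keepdim=True)            # toy target: sum of the two inputs
    criterion = torch.nn.MSELoss()
    optimizer = make_optimizer(model.parameters())

    history = []
    for _ in range(steps):
        optimizer.zero_grad()
        loss = criterion(model(X), y)
        loss.backward()
        optimizer.step()
        history.append(loss.item())
    return history[0], history[-1]

optimizers = {
    "SGD": lambda p: torch.optim.SGD(p, lr=0.01),
    "Momentum": lambda p: torch.optim.SGD(p, lr=0.01, momentum=0.9),
    "Adam": lambda p: torch.optim.Adam(p, lr=0.001),
    "AdamW": lambda p: torch.optim.AdamW(p, lr=0.001),
}

results = {name: final_losses(make) for name, make in optimizers.items()}
for name, (first, last) in results.items():
    print(f"{name:10s} first={first:.4f} last={last:.4f}")
```

Resetting the seed inside the helper keeps the comparison fair, since every optimizer starts from identical weights.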