<a href="https://colab.research.google.com/github/christophergaughan/PyTorch/blob/main/Copy_of_PyTorch_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

| Hyperparameter             | Binary Classification                                                                                              | Multiclass Classification                                                                                  |
|----------------------------|-------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------|
| **Input layer shape (in_features)** | Same as number of features (e.g. 5 for age, sex, height, weight, smoking status in heart disease prediction) | Same as binary classification                                                                             |
| **Hidden layer(s)**        | Problem specific, minimum = 1, maximum = unlimited                                                               | Same as binary classification                                                                             |
| **Neurons per hidden layer** | Problem specific, generally 10 to 512                                                                            | Same as binary classification                                                                             |
| **Output layer shape (out_features)** | 1 (one class or the other)                                                                                  | 1 per class (e.g. 3 for food, person, or dog photo)                                                       |
| **Hidden layer activation** | Usually ReLU (rectified linear unit) but can be many others                                                      | Same as binary classification                                                                             |
| **Output activation**      | Sigmoid (`torch.sigmoid` in PyTorch)                                                                             | Softmax (`torch.softmax` in PyTorch)                                                                      |
| **Loss function**          | Binary crossentropy (`torch.nn.BCELoss` in PyTorch)                                                              | Cross entropy (`torch.nn.CrossEntropyLoss` in PyTorch)                                                    |
| **Optimizer**              | SGD (stochastic gradient descent), Adam (see `torch.optim` for more options)                                    | Same as binary classification                                                                             |
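
As a rough illustration of the output-layer rows above, the sketch below puts a binary head and a three-class head on the same five input features and applies the matching activation. The batch size, the random inputs, and the variable names are assumptions made only for this example.

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(4, 5)  # hypothetical batch: 4 samples, 5 features each

binary_head = nn.Linear(in_features=5, out_features=1)      # one logit per sample
multiclass_head = nn.Linear(in_features=5, out_features=3)  # one logit per class

with torch.inference_mode():
    binary_probs = torch.sigmoid(binary_head(x))                 # shape (4, 1), values in (0, 1)
    multiclass_probs = torch.softmax(multiclass_head(x), dim=1)  # shape (4, 3), each row sums to 1

print(binary_probs.shape, multiclass_probs.shape)
print(multiclass_probs.sum(dim=1))
```

The only structural difference between the two setups is `out_features` and the activation applied to the logits.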


Classification is the problem of predicting which category (class) an input belongs to, for example whether a data point is class 0 or class 1.

## Make classification data and get it ready

- This is a ready-made toy dataset generator from scikit-learn

In [None]:
import sklearn
from sklearn.datasets import make_circles

# Make 1000 samples (points lying on two concentric circles)
n_samples = 1000

# Create circles
X, y = make_circles(n_samples,
                    noise = 0.03,
                    random_state=42)


In [None]:
len(X), len(y)

In [None]:
print(f'First 5 samples of X:\n {X[:5]}')
print(f'First 5 samples of y:\n {y[:5]}')

In [None]:
y

## Clearly, we have a binary classification problem here, as the label column $y$ contains only 0s and 1s

In [None]:
# Make a dataframe
import pandas as pd
circles = pd.DataFrame({'X1': X[:, 0],
                        'X2': X[:, 1],
                        'label': y})
circles.head()

In [None]:
# Visualize data
import matplotlib.pyplot as plt
plt.scatter(x=X[:, 0],
            y=X[:, 1],
            c=y,
            cmap=plt.cm.RdYlBu);

### This is a *toy dataset*: small enough to experiment with, yet rich enough to practice the PyTorch workflow on

**Our goal: separate the blue dots from the red dots**

In [None]:
# Check input and output shapes
X.shape, y.shape

In [None]:
# The data is in NumPy arrays; we need to turn it into PyTorch tensors
import torch
X = torch.from_numpy(X).type(torch.float)
y = torch.from_numpy(y).type(torch.float)

In [None]:
X[:5], y[:5]

In [None]:
print(f'Shape of X: {X.shape}')
print(f'Shape of y: {y.shape}')

In [None]:
print(f'Values for one sample of X: {X[0]} with shape: {X[0].shape}')
print(f'Values for one sample of y: {y[0]} with shape: {y[0].shape}')

## Create train and test splits

In [None]:
torch.__version__

In [None]:
X.dtype, y.dtype

In [None]:
# Split data randomly
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=42)


In [None]:
len(X_train), len(X_test), len(y_train), len(y_test)

## Build Model

1. Set up device-agnostic code
2. Construct a model by subclassing `nn.Module`
3. Define a loss function and optimizer
4. Create a training and test loop

In [None]:
import torch
from torch import nn

device = 'cuda' if torch.cuda.is_available() else 'cpu'
device


1. Subclass `nn.Module`
2. Create 2 `nn.Linear()` layers capable of handling the shapes in our data
3. Define `forward()` method that outlines the forward pass
4. Instantiate our model class and send it to the target `device`

In [None]:
# Subclass nn.Module
class CircleModelV0(nn.Module):
    def __init__(self):
        super().__init__()
        # # Create nn.Linear layers capable of handling the shapes of our data
        # self.layer_1 = nn.Linear(in_features=2,
        #                          out_features=5) # upscales to 5 features (hidden layers

        # self.layer_2 = nn.Linear(in_features=5,
        #                          out_features=1) # we're predicting a 0 or 1
        self.two_linear = nn.Sequential(
            nn.Linear(in_features=2,
                      out_features=5),
            nn.Linear(in_features=5,
                      out_features=1)
        )
    # define the forward pass
    def forward(self, x):
        return self.two_linear(x)
    #   return self.layer_2(self.layer_1(x)) # x-> layer_1 -> layer_2 -> output

# Instantiate instance of model class and send to target device
model_0 = CircleModelV0().to(device)
model_0


### Note in the code above:

The commented-out forward pass, `return self.layer_2(self.layer_1(x))`, may look "backwards": read left to right, the last layer appears first and the input appears last. This is only how nested function calls are written; the computation still flows from input to output. Let's break it down:

#### Understanding the forward pass
Order of operations:

* When you call `self.layer_1(x)`, the input `x` is passed through `layer_1`. This produces the intermediate output of the first layer.
* The intermediate output is then passed to `self.layer_2`, which produces the final output.

In functional terms:
`x -> layer_1 -> layer_2 -> output`

However, the Python code is written as:
`return self.layer_2(self.layer_1(x))`

This is standard practice in programming: the innermost function (`layer_1`) is applied first and the outermost function (`layer_2`) is applied last.

#### Why It Feels "Backwards":

* Neural network layers are typically thought of as a forward progression from input to output.
* In the `forward` method, the nesting can feel reversed: the transformations are applied in order starting from the input, but the layer that is applied last is the one written first.

#### It's Just Function Composition:

* The code uses function composition, where one function's output is the input to the next. This is conceptually similar to:
`f(g(x))`
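
To check that the nested call, a step-by-step version, and `nn.Sequential` really compute the same thing, here is a minimal sketch. The standalone layers and the random batch are assumptions made for this example; they are not the notebook's `model_0`.

```python
import torch
from torch import nn

torch.manual_seed(0)
layer_1 = nn.Linear(in_features=2, out_features=5)
layer_2 = nn.Linear(in_features=5, out_features=1)
x = torch.randn(4, 2)  # hypothetical batch of 4 samples with 2 features

with torch.inference_mode():
    nested = layer_2(layer_1(x))                     # written "inside out"
    hidden = layer_1(x)                              # same computation, one step at a time
    stepwise = layer_2(hidden)
    sequential = nn.Sequential(layer_1, layer_2)(x)  # same layers, applied in listed order
    same = torch.equal(nested, stepwise) and torch.equal(nested, sequential)

print(same, nested.shape)
```

All three spellings run `layer_1` first and `layer_2` second; only the way the code reads differs.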


In [None]:
next(model_0.parameters()).device

In [None]:
# Let's replicate the model above using nn.Sequential
model_0 = nn.Sequential(
    nn.Linear(in_features=2,
              out_features=5),
    nn.Linear(in_features=5,
              out_features=1)).to(device)

model_0


In [None]:
model_0.state_dict()

In [None]:
# Make predictions (remember to use inference mode)
with torch.inference_mode():
    untrained_preds = model_0(X_test.to(device))
print(f'Length of preds: {len(untrained_preds)}')
print(f'Shape of preds: {untrained_preds.shape}')
print(f'First 10 preds: {untrained_preds[:10]}')
print(f'First 10 y_test: {y_test[:10]}')

In [None]:
X_test[:10], y_test[:10]

### Set-up loss function and optimizer

Which loss and optimizer should we use?

- Depends on the problem
    - regression: MAE, MSE
    - Classification: binary cross entropy or categorical cross entropy

# Optimizer and Loss Functions in PyTorch

However, the same optimizer function can often be used across different problem spaces.

For example, the stochastic gradient descent optimizer (SGD, `torch.optim.SGD()`) can be used for a range of problems, and the same applies to the Adam optimizer (`torch.optim.Adam()`).

| **Loss Function/Optimizer**               | **Problem Type**                   | **PyTorch Code**                        |
|-------------------------------------------|-------------------------------------|-----------------------------------------|
| **Stochastic Gradient Descent (SGD)**     | Classification, regression, many others. | `torch.optim.SGD()`                     |
| **Adam Optimizer**                         | Classification, regression, many others. | `torch.optim.Adam()`                    |
| **Binary Cross Entropy Loss**             | Binary classification               | `torch.nn.BCEWithLogitsLoss` or `torch.nn.BCELoss` |
| **Cross Entropy Loss**                    | Multi-class classification          | `torch.nn.CrossEntropyLoss`             |
| **Mean Absolute Error (MAE) or L1 Loss**  | Regression                          | `torch.nn.L1Loss`                       |
| **Mean Squared Error (MSE) or L2 Loss**   | Regression                          | `torch.nn.MSELoss`                      |
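
The two binary cross entropy options in the table differ in what they expect as input: `BCEWithLogitsLoss` takes raw logits, while `BCELoss` takes probabilities that have already been passed through a sigmoid. A small sketch with made-up logits and targets (assumptions for illustration only) shows that they agree on well-behaved inputs:

```python
import torch
from torch import nn

logits = torch.tensor([-1.5, 0.3, 2.0, -0.2])  # hypothetical raw model outputs
targets = torch.tensor([0.0, 1.0, 1.0, 0.0])   # hypothetical labels

loss_from_logits = nn.BCEWithLogitsLoss()(logits, targets)      # sigmoid applied internally
loss_from_probs = nn.BCELoss()(torch.sigmoid(logits), targets)  # sigmoid applied by us

print(loss_from_logits.item(), loss_from_probs.item())
```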


In [None]:
# Setup loss function

loss_fn = nn.BCEWithLogitsLoss()

# Setup optimizer
optimizer = torch.optim.SGD(params=model_0.parameters(),
                            lr=0.1)



In [None]:
model_0.state_dict()

In [None]:
# Calculate accuracy- out of 100 examples what percentage does our model get right?
def accuracy_fn(y_true, y_pred):
    correct = torch.eq(y_true, y_pred).sum().item()
    acc = (correct / len(y_pred)) * 100
    return acc

### Train Model

1. Forward pass
2. Calculate the loss
3. Optimizer zero grad
4. Loss backward (backpropagation)
5. Optimizer step (gradient descent)

Also, we are going to perform the following:

`going from raw logits -> prediction probabilities -> prediction labels`

The raw outputs of our model are logits. We convert them into prediction probabilities by passing them to an activation function (e.g. sigmoid for binary classification or softmax for multiclass classification).

Then we convert our model's prediction probabilities to **prediction labels** by either rounding them (binary) or taking `argmax()` (multiclass).
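
The logits -> probabilities -> labels pipeline can be sketched for both cases as follows. The logit values are invented for illustration and do not come from the notebook's model.

```python
import torch

# Binary case: one logit per sample
binary_logits = torch.tensor([-2.0, 0.5, 3.0])
binary_probs = torch.sigmoid(binary_logits)  # probabilities in (0, 1)
binary_labels = torch.round(binary_probs)    # round to the nearer of 0 and 1

# Multiclass case: one logit per class, one row per sample
multi_logits = torch.tensor([[2.0, 0.5, 0.1],
                             [0.1, 0.2, 3.0]])
multi_probs = torch.softmax(multi_logits, dim=1)  # each row sums to 1
multi_labels = multi_probs.argmax(dim=1)          # index of the most probable class

print(binary_labels)  # tensor([0., 1., 1.])
print(multi_labels)   # tensor([0, 2])
```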

In [None]:
model_0

In [None]:
# View the first 5 outputs of the forward pass on the test data
model_0.eval()
with torch.inference_mode():
    y_logits = model_0(X_test.to(device))[:5]
y_logits

In [None]:
y_test[:5]

In [None]:
# Since we are performing a binary classification- use sigmoid activation function
y_probs = torch.sigmoid(y_logits)
y_probs

For our prediction probability values, we need to perform a range-style rounding on them:
* `y_pred_probs` >= 0.5, `y=1` (class 1)
* `y_pred_probs` < 0.5, `y=0` (class 0)

In [None]:
# Find the predicted labels (round the prediction probabilities)
y_preds = torch.round(y_probs)

# In full (logits->pred_probs->pred_labels)
y_pred_labels = torch.round(torch.sigmoid(model_0(X_test.to(device))[:5]))
y_pred_labels

# Check for equality
print(torch.eq(y_preds.squeeze(), y_pred_labels.squeeze()))

# Get rid of extra dimension
y_preds.squeeze()

In [None]:
y_test[:5]

# Building a training and test loop

In [None]:
device

In [None]:
!nvidia-smi

We can also set a CUDA manual seed, so that random numbers generated on the GPU are reproducible too

In [None]:
torch.manual_seed(42)

In [None]:
torch.cuda.manual_seed(42)

### Remember we are using `BCEWithLogitsLoss`
It is more numerically stable than a plain sigmoid followed by `BCELoss` (as per the docs)

## PyTorch BCEWithLogitsLoss

In **PyTorch**, `BCEWithLogitsLoss` combines a Sigmoid layer and the Binary Cross-Entropy (BCE) loss in one single class. Mathematically, for each scalar input $x_i$ (the **logit**) and corresponding label $y_i \in \{0,1\}$, the loss for one sample is given by:

$$
\ell_i = -\Bigl[y_i \cdot \log\bigl(\sigma(x_i)\bigr) \;+\; \bigl(1 - y_i\bigr)\cdot \log\bigl(1 - \sigma(x_i)\bigr)\Bigr],
$$

where $\sigma(x_i)$ is the Sigmoid function:

$$
\sigma(x_i) = \frac{1}{1 + e^{-x_i}}.
$$

If we have a mini-batch of $N$ samples, the **mean** (or **sum**, depending on the `reduction` parameter) of all individual losses $\ell_i$ is typically taken as the final scalar loss value:

$$
\text{BCEWithLogitsLoss} = \frac{1}{N} \sum_{i=1}^{N} \ell_i.
$$

### Optional Weights

- **Weight (`weight`):** In PyTorch, you can assign a per-element weight $w_i$ to handle unbalanced data. This modifies the loss term to:

  $$
  \ell_i = -\, w_i\, \Bigl[y_i \cdot \log\bigl(\sigma(x_i)\bigr) + (1 - y_i)\cdot \log\bigl(1 - \sigma(x_i)\bigr)\Bigr].
  $$

- **Positive-class weight (`pos_weight`)**: This is an additional multiplier for the positive targets, useful when you have *many* more negatives than positives. It modifies only the term for $y_i=1$. Specifically,

$$
\ell_i = -\Bigl[\mathrm{pos\_weight} \cdot y_i \cdot \log\bigl(\sigma(x_i)\bigr) + (1 - y_i) \cdot \log\bigl(1 - \sigma(x_i)\bigr)\Bigr].
$$
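
As a sanity check on the `pos_weight` formula, the sketch below compares PyTorch's built-in loss with a direct implementation of the expression above. The logits, targets, and the weight of 3 are made-up values used only for illustration.

```python
import torch
from torch import nn

logits = torch.tensor([-1.0, 0.5, 2.0, -0.3])  # hypothetical raw outputs
targets = torch.tensor([0.0, 1.0, 1.0, 0.0])   # hypothetical labels
pos_weight = torch.tensor([3.0])               # positives count 3x as much

# Built-in version
builtin = nn.BCEWithLogitsLoss(pos_weight=pos_weight)(logits, targets)

# Manual version, following the formula term by term
probs = torch.sigmoid(logits)
per_sample = -(pos_weight * targets * torch.log(probs)
               + (1 - targets) * torch.log(1 - probs))
manual = per_sample.mean()  # default reduction is the mean

print(builtin.item(), manual.item())
```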


By accepting raw logits $x_i$ (i.e., values **before** the Sigmoid), `BCEWithLogitsLoss` is more numerically stable than applying a Sigmoid followed by a separate `BCELoss`.
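
To get a feel for where the instability comes from, here is a pure-Python sketch. The function names are hypothetical, and the rearranged formula is a standard algebraic rewrite used for illustration; it is not claimed to be PyTorch's exact internal code. For a large logit, the sigmoid rounds to exactly `1.0` in floating point, so the naive formula ends up taking `log(0)`.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_naive(logit, target):
    """Sigmoid first, then the textbook BCE formula."""
    p = sigmoid(logit)
    try:
        return -(target * math.log(p) + (1 - target) * math.log(1 - p))
    except ValueError:  # math.log(0.0) raises a domain error
        return float("inf")

def bce_from_logit(logit, target):
    """Algebraically equivalent form that never evaluates log(0)."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

print(bce_naive(2.0, 1.0), bce_from_logit(2.0, 1.0))    # the two agree on a moderate logit
print(bce_naive(40.0, 0.0), bce_from_logit(40.0, 0.0))  # naive blows up, rewritten form gives 40.0
```

Working from the logit directly avoids ever forming a probability that has been rounded to exactly 0 or 1.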


In [None]:
#Set number of epochs = 100
epochs = 100

# Put split data to target device
X_train, y_train, X_test, y_test = X_train.to(device), y_train.to(device), X_test.to(device), y_test.to(device)

# Build our training and evaluation loop
for epoch in range(epochs):
    # Training
    model_0.train()

    # Forward pass (squeeze removes the extra size-1 dimension from the output)
    y_logits = model_0(X_train).squeeze()
    y_pred = torch.round(torch.sigmoid(y_logits)) # turn logits -> pred probs -> pred labels

    # Calculate loss/accuracy (see accuracy_fn above)
    # loss = loss_fn(torch.sigmoid(y_logits), # nn.BCELoss expects prediction probabilities as input
    #                y_train)
    loss = loss_fn(y_logits, # nn.BCEWithLogitsLoss expects raw logits as input
                   y_train)  # Note the order of the arguments: (input, target)
    acc = accuracy_fn(y_true=y_train,
                      y_pred=y_pred)

    # Optimizer zero grad
    optimizer.zero_grad()

    # Loss backward (backpropagation)
    loss.backward()

    # Optimizer step (gradient descent)
    optimizer.step()

    # Testing
    model_0.eval()
    with torch.inference_mode():
        # Forward pass
        test_logits = model_0(X_test).squeeze()
        test_pred = torch.round(torch.sigmoid(test_logits))
        # Calculate test loss/accuracy
        test_loss = loss_fn(test_logits,
                            y_test)
        test_acc = accuracy_fn(y_true=y_test,
                               y_pred=test_pred)

    # Print out what's happening
    if epoch % 10 == 0:
        print(f"Epoch: {epoch} | Loss: {loss:.5f}, Acc: {acc:.2f}% | Test Loss: {test_loss:.5f}, Test Acc: {test_acc:.2f}%")



#### Our results are akin to flipping a coin. Not ideal

In [None]:
circles.label.value_counts()

Why is our model not learning?

Let's visualize its predictions. To do so, we'll import a helper function called `plot_decision_boundary`.

A very useful website for our endeavors in ML/DL is https://madewithml.com/, in particular this lesson by Goku Mohandas:
https://madewithml.com/courses/mlops/evaluation/

Here we will use mrdbourke's helper function for visualizing our results.



In [None]:
import requests
from pathlib import Path

# 1. (Optional) Remove the existing (likely invalid) helper_functions.py
# !rm helper_functions.py

# 2. Use the *raw* GitHub URL
url_to_download = "https://raw.githubusercontent.com/mrdbourke/pytorch-deep-learning/main/helper_functions.py"

if Path("helper_functions.py").is_file():
    print("helper_functions.py already exists, skipping download")
else:
    print("Downloading helper_functions.py")
    request = requests.get(url_to_download)
    with open("helper_functions.py", "wb") as f:
        f.write(request.content)


In [None]:
from helper_functions import plot_predictions, plot_decision_boundary


In [None]:
# Plot the decision boundary of the model
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.title("Train")
plot_decision_boundary(model_0, X_train, y_train)
plt.subplot(1, 2, 2)
plt.title("Test")
plot_decision_boundary(model_0, X_test, y_test)

#### This shows us why our model is getting such poor results.

## Improve our model
* Add more layers - give the model more chances to learn about the patterns in the data
* Add more hidden units - go from 5 hidden units to 10 hidden units
* Fit for longer
* Change the activation functions - at the moment the only non-linearity is the sigmoid applied to the output logits; there is none between the layers
* Change the learning rate (warning: too large a value can make the updates explode, too small a value makes learning stall)
* Change the loss function
* Change the optimization function

These options are all from our model's perspective b/c they relate to the form of our model, rather than the data

Because these options are all values that we choose ourselves, rather than values the model learns, they are called **hyperparameters**

##### Below we will
* Add more hidden units 5 -> 10
* Increase the number of layers 2 -> 3
* Increase the number of epochs 100 -> 1000

*Ideally, we would change only one of these at a time; otherwise we cannot know which change improved or degraded our model. We change several at once just to save time.*

In [None]:
class CircleModelV1(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer_1 = nn.Linear(in_features=2,
                                 out_features=10)
        self.layer_2 = nn.Linear(in_features=10,
                                 out_features=10)
        self.layer_3 = nn.Linear(in_features=10,
                                 out_features=1) # a single output unit (one logit) for the binary choice
    def forward(self, x):
        # z = self.layer_1(x)
        # z = self.layer_2(z)
        # z = self.layer_3(z)
        return self.layer_3(self.layer_2(self.layer_1(x))) # think f(g(x)): same result as the step-by-step version above, without intermediate variables

model_1 = CircleModelV1().to(device)
model_1



In [None]:
# Create the loss function
loss_fn = nn.BCEWithLogitsLoss()

# Create an optimizer
optimizer = torch.optim.SGD(params=model_1.parameters(),
                            lr=0.1)

In [None]:
# Write a training and evaluation loop
torch.manual_seed(42)
torch.cuda.manual_seed(42)
epochs = 1000 #training for longer

# Put data on target device
X_train, y_train, X_test, y_test = X_train.to(device), y_train.to(device), X_test.to(device), y_test.to(device)

for epoch in range(epochs):
    model_1.train()
    # forward pass
    y_logits = model_1(X_train).squeeze()
    y_pred = torch.round(torch.sigmoid(y_logits))

    # Calculate loss/accuracy
    loss = loss_fn(y_logits,
                   y_train)
    acc = accuracy_fn(y_true=y_train,
                      y_pred=y_pred)

    # Zero the gradients
    optimizer.zero_grad()

    # Loss backward
    loss.backward()

    # Optimizer step (gradient descent)
    optimizer.step()

    #Testing
    model_1.eval()
    with torch.inference_mode():
        # Forward pass
        test_logits = model_1(X_test).squeeze()
        test_pred = torch.round(torch.sigmoid(test_logits))

        # Calculate loss/accuracy
        test_loss = loss_fn(test_logits,
                            y_test)
        test_acc = accuracy_fn(y_true=y_test,
                               y_pred=test_pred)
    # Print out whats happening
    if epoch % 100 == 0:
        print(f"Epoch: {epoch} | Loss: {loss:.5f}, Acc: {acc:.2f}% | Test Loss: {test_loss:.5f}, Test Acc: {test_acc:.2f}%")

In [None]:
# Plot decision boundary
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.title("Train")
plot_decision_boundary(model_1, X_train, y_train)
plt.subplot(1, 2, 2)
plt.title("Test")
plot_decision_boundary(model_1, X_test, y_test)

## Nope - still a coin flip

One way to troubleshoot a larger problem is to test out a smaller problem

In [None]:
# Create some data (same as previous notebook)
weight = 0.7
bias = 0.3
start = 0
end = 1
step = 0.01
X_regression = torch.arange(start, end, step).unsqueeze(dim=1)
y_regression = weight * X_regression + bias

# Check the data
print(X_regression[:5], y_regression[:5])

In [None]:
train_split = int(0.8 * len(X_regression))
X_train_regression, y_train_regression = X_regression[:train_split], y_regression[:train_split]
X_test_regression, y_test_regression = X_regression[train_split:], y_regression[train_split:]
# Check lengths of each
len(X_train_regression), len(X_test_regression), len(y_train_regression), len(y_test_regression)

In [None]:
plot_predictions(train_data=X_train_regression,
                 train_labels = y_train_regression,
                 test_data = X_test_regression,
                 test_labels = y_test_regression);

In [None]:
X_train_regression.shape, y_train_regression.shape

Our present model is set up to receive 2 input features per sample; here we are feeding in only one feature.

In [None]:
X_train_regression[:10]

Only one value per sample, as shown above.

Adjust `model_1` to fit a straight line, using `nn.Sequential`

In [None]:
# same architecture as model_1 (but using nn.Sequential())
model_2 = nn.Sequential(
    nn.Linear(in_features=1,
              out_features=10),
    nn.Linear(in_features=10,
              out_features=10),
    nn.Linear(in_features=10,
              out_features=1)
).to(device)
model_2

In [None]:
# Loss and optimizer
loss_fn = nn.L1Loss() # MAE loss with regression data
optimizer = torch.optim.SGD(model_2.parameters(),
                            lr=0.01)

In [None]:
# Train the model
torch.manual_seed(42)
torch.cuda.manual_seed(42)

# Set the number of epochs
epochs = 1000

# Put the data on the target device
X_train_regression, y_train_regression, X_test_regression, y_test_regression = X_train_regression.to(device), y_train_regression.to(device), X_test_regression.to(device), y_test_regression.to(device)

# Training
for epoch in range(epochs):
    model_2.train() # back to training mode (eval mode is switched on below in every epoch)
    y_pred = model_2(X_train_regression)
    loss = loss_fn(y_pred, y_train_regression)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Testing
    model_2.eval()
    with torch.inference_mode():
        test_pred = model_2(X_test_regression)
        test_loss = loss_fn(test_pred, y_test_regression)

    # Print out what is happening- Note, there is no accuracy here because we
    # are doing regression
    if epoch % 100 == 0:
        print(f"Epoch: {epoch} | Loss: {loss:.5f} | Test Loss: {test_loss:.5f}")

**OK, model_2 can indeed learn: the losses become very small, particularly after we lowered the learning rate hyperparameter from lr=0.1 to lr=0.01. This suggests that the architecture and training loop are sound, at least on linear data.**

In [None]:
# Turn on evaluation mode
model_2.eval()

# Create predictions (inference) with the model
with torch.inference_mode():
    y_preds = model_2(X_test_regression)

# Plot data and predictions
plot_predictions(train_data=X_train_regression,
                 train_labels=y_train_regression,
                 test_data=X_test_regression,
                 test_labels=y_test_regression,
                 predictions=y_preds)

#### Wait, why is the plot blank? We get the error `can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.`

*Explanation*: Matplotlib works with NumPy arrays, which live in CPU memory, so tensors on the GPU must be copied to the CPU first!


In [None]:
# write the correct code!
model_2.eval()

# Create predictions (inference) with the model
with torch.inference_mode():
    y_preds = model_2(X_test_regression)

# Plot data and predictions
plot_predictions(train_data=X_train_regression.cpu(),
                 train_labels=y_train_regression.cpu(),
                 test_data=X_test_regression.cpu(),
                 test_labels=y_test_regression.cpu(),
                 predictions=y_preds.cpu())

*Now the question becomes: is it the data our model can't learn from? Is it the circular nature of the data? Our model comprises only linear functions, which can only draw straight lines. Is the problem that our data has non-linear characteristics?*

The big reveal here is that we will have to employ non-linear activation functions!

Neural networks have the benefit of combining straight (linear) functions with non-linear ones. Our model was hamstrung because it could only use straight lines in its calculations: a stack of linear layers is itself just one linear function. Our data is non-linear, so introducing non-linear components will be the key to our success.
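
To see why a model built only from `nn.Linear` layers can draw only straight lines, the sketch below collapses two stacked linear layers into a single weight matrix and bias and checks that the outputs match. The layer sizes mirror the notebook's models, but the standalone layers and the random inputs are assumptions for this illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
layer_1 = nn.Linear(in_features=2, out_features=10)
layer_2 = nn.Linear(in_features=10, out_features=1)
x = torch.randn(8, 2)  # hypothetical batch of 8 two-feature samples

with torch.inference_mode():
    stacked = layer_2(layer_1(x))  # two linear layers, no activation in between

    # Collapse both layers into one: W = W2 @ W1, b = W2 @ b1 + b2
    W = layer_2.weight @ layer_1.weight                # shape (1, 2)
    b = layer_2.weight @ layer_1.bias + layer_2.bias   # shape (1,)
    collapsed = x @ W.T + b                            # a single linear layer

    # With a ReLU in between, this collapse is no longer possible in general
    with_relu = layer_2(torch.relu(layer_1(x)))

    matches = torch.allclose(stacked, collapsed, atol=1e-5)

print(matches)
```

However many linear layers are stacked, the result is still a single straight-line decision boundary, which cannot separate an inner circle from an outer one.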

### Recreating non-linear data (red and blue circles)

In [None]:
# Make and plot data
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_circles

n_samples = 1000
X, y = make_circles(n_samples,
                    noise=0.03,
                    random_state=42)

plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.RdYlBu)



In [None]:
# Convert data to tensors and then to train and test splits
import torch
from sklearn.model_selection import train_test_split

# Turn data into tensors
X = torch.from_numpy(X).type(torch.float)
y = torch.from_numpy(y).type(torch.float)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=42)

# Check the data
X_train[:5], y_train[:5]

## Building a model with non-linearity

### Non-Linear Activation Functions in PyTorch

Activation functions introduce non-linearity into a neural network, allowing it to learn complex patterns in data. Below are some commonly used non-linear activation functions in PyTorch:

1. **ReLU (Rectified Linear Unit)**:
   - Formula: $f(x) = \max(0, x)$
   - Introduces sparsity and helps mitigate the vanishing gradient problem.
   - Commonly used in hidden layers.

2. **Sigmoid**:
   - Formula: $f(x) = \frac{1}{1 + e^{-x}}$
   - Squashes input to a range between 0 and 1.
   - Suitable for the output of binary classification problems but may suffer from the vanishing gradient issue when used in hidden layers.

3. **Tanh (Hyperbolic Tangent)**:
   - Formula: $f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
   - Squashes input to a range between -1 and 1.
   - Zero-centered, which can help during optimization compared to Sigmoid.

4. **Leaky ReLU**:
   - Formula: $f(x) = x$ if $x > 0$, else $f(x) = \alpha x$, where $\alpha$ is a small positive constant.
   - Addresses the "dying ReLU" problem by allowing a small gradient for negative inputs.

5. **Softmax**:
   - Formula: $f(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$
   - Converts outputs into probabilities, typically used in the final layer for multi-class classification.

6. **ELU (Exponential Linear Unit)**:
   - Formula:

     $$
     f(x) =
     \begin{cases}
     x & \text{if } x > 0 \\
     \alpha (e^x - 1) & \text{if } x \leq 0
     \end{cases}
     $$

   - Similar to ReLU but smooths the transition for negative inputs.

7. **Swish**:
   - Formula: $f(x) = x \cdot \text{sigmoid}(x)$
   - A smooth, self-gated activation function known to improve performance on some deep learning tasks (available in PyTorch as `torch.nn.SiLU`).

#### Key Considerations:
- **Choice of Activation Function**: Depends on the task and architecture. ReLU and its variants are widely used in hidden layers, while Softmax and Sigmoid are popular in output layers for classification tasks.
- **Non-Linearity**: Activation functions enable neural networks to approximate complex, non-linear mappings.

We could experiment with these activation functions to see their impact on the **make_circles** dataset, which is inherently non-linear!
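
As a quick check on the formulas listed above, the sketch below implements several of them by hand and compares each against the corresponding PyTorch built-in. The input values and the choices $\alpha = 0.01$ (Leaky ReLU) and $\alpha = 1$ (ELU) are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])  # hypothetical inputs
alpha = 0.01                                   # assumed Leaky ReLU slope

relu_manual = torch.maximum(torch.zeros_like(x), x)            # max(0, x)
leaky_manual = torch.where(x > 0, x, alpha * x)                # x if x > 0 else alpha * x
elu_manual = torch.where(x > 0, x, 1.0 * (torch.exp(x) - 1))   # ELU with alpha = 1
swish_manual = x * torch.sigmoid(x)                            # x * sigmoid(x)
tanh_manual = (torch.exp(x) - torch.exp(-x)) / (torch.exp(x) + torch.exp(-x))
softmax_manual = torch.exp(x) / torch.exp(x).sum()             # e^{x_i} / sum_j e^{x_j}

checks = {
    "relu": torch.allclose(relu_manual, F.relu(x)),
    "leaky_relu": torch.allclose(leaky_manual, F.leaky_relu(x, negative_slope=alpha)),
    "elu": torch.allclose(elu_manual, F.elu(x, alpha=1.0)),
    "swish": torch.allclose(swish_manual, F.silu(x)),
    "tanh": torch.allclose(tanh_manual, torch.tanh(x)),
    "softmax": torch.allclose(softmax_manual, torch.softmax(x, dim=0)),
}
print(checks)
```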


In [None]:
from torch import nn
class CircleModelV2(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer_1 = nn.Linear(in_features=2, out_features=10)
        self.layer_2 = nn.Linear(in_features=10, out_features=10)
        self.layer_3 = nn.Linear(in_features=10, out_features=1)
        self.relu = nn.ReLU() # non-linear activation function, applied element-wise
    def forward(self, x):
        # Put the non-linear activations between the linear layers (the last layer outputs raw logits)
        return self.layer_3(self.relu(self.layer_2(self.relu(self.layer_1(x))))) # function composition (compare the chain rule)

model_3 = CircleModelV2().to(device)
model_3

### Why I think of the Chain Rule in Calculus

The `return` statement on line 11 of the cell above has a form that confused me at first. One thing to remember is that it is conceptually similar to the **chain rule** in calculus! Here's why:

---

### The Chain Rule in Calculus
The chain rule is used to compute the derivative of a composition of functions. If $f(x) = g(h(x))$, then the derivative $f'(x)$ is:

$$
f'(x) = g'(h(x)) \cdot h'(x)
$$

It works by applying the derivative of the outer function $( g )$, evaluated at the inner function $( h(x) )$, and multiplying it by the derivative of the inner function $( h )$.
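
A tiny worked example, with functions chosen purely for illustration: take $h(x) = 3x + 1$ and $g(u) = u^2$, so $f(x) = (3x + 1)^2$ and the chain rule gives $f'(x) = 2(3x + 1) \cdot 3$. At $x = 2$ this is $2 \cdot 7 \cdot 3 = 42$, and PyTorch's autograd arrives at the same number:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

h = 3 * x + 1  # inner function h(x) = 3x + 1, so h'(x) = 3
f = h ** 2     # outer function g(u) = u^2,    so g'(u) = 2u

f.backward()   # autograd applies the chain rule through the composition

autograd_grad = x.grad.item()
manual_grad = (2 * h.item()) * 3  # g'(h(x)) * h'(x)

print(autograd_grad, manual_grad)  # 42.0 42.0
```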

---

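### Thus when I see the code snippet below I think of the chain rule
The code:
```python
return self.layer_3(self.relu(self.layer_2(self.relu(self.layer_1(x)))))
```
is a composition of functions evaluated from the inside out: `layer_1`, then `relu`, then `layer_2`, then `relu` again, then `layer_3`. The forward pass only composes the functions; the chain rule itself comes into play during `loss.backward()`, when autograd differentiates through this composition.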

In [None]:
# Create the loss function
loss_fn = nn.BCEWithLogitsLoss()

# Create an optimizer
optimizer = torch.optim.SGD(params=model_3.parameters(),
                            lr=0.01) # learning rate: the step size of each update, which affects how quickly the loss converges

In [None]:
# Write a training and evaluation loop
torch.manual_seed(42)
torch.cuda.manual_seed(42)
epochs = 1000

# Put data on target device
X_train, y_train, X_test, y_test = X_train.to(device), y_train.to(device), X_test.to(device), y_test.to(device)

# Loop through the data
for epoch in range(epochs):
    #Training
    model_3.train()
    # forward pass
    y_logits = model_3(X_train).squeeze()
    y_pred = torch.round(torch.sigmoid(y_logits))

    # Calculate loss/accuracy
    loss = loss_fn(y_logits,
                   y_train)
    acc = accuracy_fn(y_true=y_train,
                      y_pred=y_pred)

    # Zero the gradients
    optimizer.zero_grad()

    # Loss backward
    loss.backward()

    # Optimizer step (gradient descent)
    optimizer.step()

    #Testing
    model_3.eval()
    with torch.inference_mode():
        # Forward pass
        test_logits = model_3(X_test).squeeze()
        test_pred = torch.round(torch.sigmoid(test_logits)) #logits -> prediction probabilities -> prediction labels

        # Calculate loss/accuracy
        test_loss = loss_fn(test_logits,
                            y_test)
        test_acc = accuracy_fn(y_true=y_test,
                               y_pred=test_pred)
    # Print out whats happening
    if epoch % 100 == 0:
        print(f"Epoch: {epoch} | Loss: {loss:.4f}, Acc: {acc:.2f}% | Test Loss: {test_loss:.4f}, Test Acc: {test_acc:.2f}%")

### Visualize the model
* non linear activation function
* lr = 0.01 (my change)

In [None]:
model_3.eval()
with torch.inference_mode():
    y_preds = torch.round(torch.sigmoid(model_3(X_test))).squeeze()
y_preds[:10], y_test[:10]

In [None]:
# plot decision Boundaries
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.title("Train")
plot_decision_boundary(model_3, X_train, y_train)
plt.subplot(1, 2, 2)
plt.title("Test")
plot_decision_boundary(model_3, X_test, y_test)

## Test setups that may work

### Increase Model Capacity
* To handle the non-linear separability of the `make_circles` dataset and potential outlier effects from unscaled data, add more hidden units and layers:

## **Approach Overview**
In this part of the project, we're trying to get better results than before. We train a binary classification neural network on the `make_circles` dataset, without scaling the features. The network is designed to handle the dataset's non-linear nature by including multiple hidden layers with ReLU activations and sufficient hidden units. The key components of the approach include:

1. **Model Architecture**:
   - A 3-layer neural network with ReLU activations is used.
   - The output layer produces logits, which are raw predictions used by `BCEWithLogitsLoss` for numerical stability.

2. **Loss Function**:
   - `BCEWithLogitsLoss` is chosen as it combines sigmoid activation with binary cross-entropy loss, making it efficient for binary classification tasks.

3. **Optimizer**:
   - The Adam optimizer is used with a learning rate of `0.01` to enable efficient training and faster convergence.

4. **Weight Initialization**:
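   - Xavier uniform initialization is applied to keep the scale of activations and gradients roughly constant across layers. (Kaiming/He initialization is the variant derived specifically for ReLU activations.)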

5. **Training and Evaluation**:
   - During training, the model computes logits, calculates loss, and updates weights using backpropagation.
   - In evaluation mode, test predictions are generated, and the test loss and accuracy are reported.

6. **Key Hyperparameters**:
   - The model is trained for `1500` epochs, and results are printed every `100` epochs.

With this setup, the model should reach high accuracy on the `make_circles` dataset; the test accuracy printed during training and the decision-boundary plots are the check.
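To make the loss-function point concrete, here is a minimal sketch showing that `BCEWithLogitsLoss` on raw logits gives the same value as applying `torch.sigmoid` followed by `BCELoss`. The tensors are made-up illustrative values, not the notebook's data.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Illustrative stand-ins for model outputs and labels (not the notebook's data)
logits = torch.randn(8)                        # raw model outputs
targets = torch.randint(0, 2, (8,)).float()    # binary labels as floats

# One step: sigmoid + binary cross-entropy fused together
loss_with_logits = nn.BCEWithLogitsLoss()(logits, targets)

# Two steps: explicit sigmoid, then plain binary cross-entropy
loss_two_step = nn.BCELoss()(torch.sigmoid(logits), targets)

print(loss_with_logits.item(), loss_two_step.item())
```

The fused version is preferred because it avoids computing `log(sigmoid(x))` in two separate, less numerically stable steps.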


In [None]:
# Define the neural network model for binary classification
class CircleModeV2(nn.Module):
    def __init__(self):
        super().__init__()
        # Input layer to the first hidden layer (2 input features, 64 hidden units)
        self.layer_1 = nn.Linear(2, 64)
        self.relu_1 = nn.ReLU()  # Apply ReLU activation for non-linearity

        # Second hidden layer (64 hidden units)
        self.layer_2 = nn.Linear(64, 64)
        self.relu_2 = nn.ReLU()  # Apply ReLU activation for non-linearity

        # Output layer (1 unit for binary classification logits)
        self.layer_3 = nn.Linear(64, 1)

    def forward(self, x):
        # Pass the input through the layers with activations
        x = self.relu_1(self.layer_1(x))
        x = self.relu_2(self.layer_2(x))
        return self.layer_3(x)  # Return raw logits for use with BCEWithLogitsLoss


In [None]:
# Instantiate the model and move it to the target device (GPU if available, otherwise CPU)
model_3a = CircleModeV2().to(device)

# Ensure training and testing data are also on the same device
X_train, y_train = X_train.to(device), y_train.to(device)
X_test, y_test = X_test.to(device), y_test.to(device)


In [None]:
print(torch.unique(y_train))  # Should show the two class labels, 0. and 1. (floats, as BCEWithLogitsLoss requires)


## Training Setup

## ☠ Beware!

Watch out for a mismatch in the shapes of `y_logits` and `y_train`. `BCEWithLogitsLoss` requires the shapes of the logits (`y_logits`) and the targets (`y_train`) to match exactly.

In our code, `y_logits` has the shape [800], while `y_train` has the shape [800, 1], which is why the training loop below calls `.squeeze()` on both.
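The shape mismatch can be reproduced in isolation. This is a small sketch with stand-in tensors (zeros and ones in the shapes quoted above, not the real training data) showing the error and the fix:

```python
import torch
from torch import nn

loss_fn = nn.BCEWithLogitsLoss()

# Stand-ins with the shapes described above (not the real data)
logits = torch.zeros(800)        # like model(X_train).squeeze()  -> [800]
targets = torch.ones(800, 1)     # like a y_train with a trailing dim -> [800, 1]

try:
    loss_fn(logits, targets)
    mismatch_raised = False
except (ValueError, RuntimeError) as err:
    mismatch_raised = True
    print(f"Shape mismatch: {err}")

# Fix: drop the trailing dimension so both tensors are [800]
loss = loss_fn(logits, targets.squeeze(dim=1))
print(loss.shape, loss.item())
```

Passing `dim=1` to `squeeze` removes only the trailing dimension, which is safer than a bare `.squeeze()` that removes every size-1 dimension.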

#### Reinitialize Weights

In [None]:
# Function to initialize weights for the linear layers
def initialize_weights(m):
    if isinstance(m, nn.Linear):
        # Xavier (Glorot) initialization for weights; Kaiming (He) initialization is the variant designed for ReLU
        nn.init.xavier_uniform_(m.weight)
        # Set biases to zero
        nn.init.zeros_(m.bias)

# Apply the weight initialization to all layers of the model
model_3a.apply(initialize_weights)
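As a side note on the initializer choice, the sketch below applies Xavier uniform and Kaiming uniform to a layer with the same shape as `layer_1` and checks each against its textbook bound. The standalone `layer` is a hypothetical stand-in, not part of `model_3a`.

```python
import math
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Linear(2, 64)  # hypothetical layer, same shape as layer_1

# Xavier uniform: U(-a, a) with a = sqrt(6 / (fan_in + fan_out))
nn.init.xavier_uniform_(layer.weight)
xavier_bound = math.sqrt(6 / (2 + 64))
xavier_max = layer.weight.abs().max().item()

# Kaiming uniform for ReLU: U(-a, a) with a = sqrt(2) * sqrt(3 / fan_in)
nn.init.kaiming_uniform_(layer.weight, nonlinearity="relu")
kaiming_bound = math.sqrt(2.0) * math.sqrt(3 / 2)
kaiming_max = layer.weight.abs().max().item()

print(f"Xavier:  max |w| = {xavier_max:.4f} (bound {xavier_bound:.4f})")
print(f"Kaiming: max |w| = {kaiming_max:.4f} (bound {kaiming_bound:.4f})")
```

Either initializer may work for a network this small; the difference tends to matter more as depth grows.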


In [None]:
# Define the Binary Cross-Entropy Loss with Logits
# This loss function is designed for binary classification and expects raw logits
loss_fn = nn.BCEWithLogitsLoss()

# Use the Adam optimizer with a learning rate of 0.01 for efficient training
# Adam dynamically adjusts learning rates for each parameter
optimizer = torch.optim.Adam(params=model_3a.parameters(), lr=0.01)


In [None]:
# Set manual seeds for reproducibility
torch.manual_seed(42)
torch.cuda.manual_seed(42)

# Define the number of epochs for training
epochs = 1500

for epoch in range(epochs):
    # Set the model to training mode
    model_3a.train()

    # Perform a forward pass to calculate logits
    y_logits = model_3a(X_train).squeeze()  # Squeeze to ensure dimensions match

    # Calculate the loss using BCEWithLogitsLoss
    loss = loss_fn(y_logits, y_train.squeeze())

    # Zero gradients to prevent accumulation
    optimizer.zero_grad()

    # Backpropagate the loss to compute gradients
    loss.backward()

    # Update model weights using the optimizer
    optimizer.step()

    # Set the model to evaluation mode for testing
    model_3a.eval()
    with torch.no_grad():  # Disable gradient computation for efficiency
        # Forward pass for the test data
        test_logits = model_3a(X_test).squeeze()  # Logits for test data

        # Calculate test loss
        test_loss = loss_fn(test_logits, y_test.squeeze())

        # Convert logits to probabilities and round to binary predictions
        test_pred = torch.round(torch.sigmoid(test_logits))

        # Calculate test accuracy
        test_acc = (test_pred == y_test.squeeze()).float().mean().item() * 100

    # Print results every 100 epochs
    if epoch % 100 == 0:
        print(f"Epoch {epoch}: Loss = {loss:.4f}, Test Loss = {test_loss:.4f}, Test Acc = {test_acc:.2f}%")


In [None]:
model_3a.eval()
with torch.inference_mode():
    y_preds = torch.round(torch.sigmoid(model_3a(X_test))).squeeze()
y_preds[:10], y_test[:10]

In [None]:
# plot decision Boundaries
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.title("Train")
plot_decision_boundary(model_3a, X_train, y_train)
plt.subplot(1, 2, 2)
plt.title("Test")
plot_decision_boundary(model_3a, X_test, y_test)

## Here we have successfully isolated the red from the blue circles

### Replicate non-linear activation functions

Rather than telling a neural network what to learn, we give it the tools to discover patterns in the data, and it tries to figure out those patterns on its own.

These tools are linear and non-linear functions.

In [None]:
# Create a tensor
A = torch.arange(-10, 10, 1, dtype = torch.float32)
A.dtype


In [None]:
# Visualize the tensor
plt.plot(A);

In [None]:
plt.plot(torch.relu(A));

In [None]:
def relu(x: torch.Tensor) -> torch.Tensor:
    return torch.maximum(torch.tensor(0), x) # inputs must be tensors
relu(A)

In [None]:
plt.plot(relu(A));

In [None]:
# Now let's do the same for sigmoid
def sigmoid(x: torch.Tensor) -> torch.Tensor:
    return 1 / (1 + torch.exp(-x))
sigmoid(A)

In [None]:
plt.plot(sigmoid(A));
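A quick sanity check, sketched here under the assumption that the hand-written functions are meant to reproduce PyTorch's built-ins, compares them numerically instead of by eye. The `relu` below uses `torch.zeros_like` so the zero tensor matches the input's dtype and device; that is a small variation on the cell above.

```python
import torch

def relu(x: torch.Tensor) -> torch.Tensor:
    # Element-wise max(0, x); zeros_like keeps dtype and device consistent
    return torch.maximum(torch.zeros_like(x), x)

def sigmoid(x: torch.Tensor) -> torch.Tensor:
    # 1 / (1 + e^(-x))
    return 1 / (1 + torch.exp(-x))

A = torch.arange(-10, 10, 1, dtype=torch.float32)

relu_matches = torch.equal(relu(A), torch.relu(A))
sigmoid_matches = torch.allclose(sigmoid(A), torch.sigmoid(A), atol=1e-6)
print(relu_matches, sigmoid_matches)
```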

**Challenge** Improve this model

## Building a multi-class classification model using PyTorch

In [None]:
# Import dependencies
import torch
from torch import nn
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# Set hyperparameters
NUM_CLASSES = 4
NUM_FEATURES = 2
RANDOM_SEED = 42
HIDDEN_UNITS = 8

X_blob, y_blob = make_blobs(n_samples=1000,
                            n_features=NUM_FEATURES,
                            centers=NUM_CLASSES,
                            cluster_std=1.5, # Creates noise in the data
                            random_state=RANDOM_SEED)

# Turn data into tensors
X_blob = torch.from_numpy(X_blob).type(torch.float)
y_blob = torch.from_numpy(y_blob).type(torch.float)

# Split into training and test
X_blob_train, X_blob_test, y_blob_train, y_blob_test = train_test_split(
                                                                        X_blob,
                                                                        y_blob,
                                                                        test_size=0.2,
                                                                        random_state=RANDOM_SEED
)

# Plot the data
plt.figure(figsize=(10, 7))
plt.scatter(X_blob[:, 0], X_blob[:, 1], c=y_blob, cmap=plt.cm.RdYlBu);

In [None]:
# Create device agnostic code

device = "cuda" if torch.cuda.is_available() else "cpu"
device

In [None]:
import torch
import torch.nn as nn

class BlobModel(nn.Module):
    def __init__(self, input_features, output_features, hidden_units=8):
        """
        Initializes a multi-class classification model.
        Args:
        input_features (int): Number of input features to the model.
        output_features (int): Number of output features of the model.
        hidden_units (int): Number of hidden units between layers, default 8.
        """
        super().__init__()
        self.linear_layer_stack = nn.Sequential(
            nn.Linear(in_features=input_features, out_features=hidden_units),
            nn.ReLU(),
            nn.Linear(in_features=hidden_units, out_features=hidden_units),
            nn.ReLU(),
            nn.Linear(in_features=hidden_units, out_features=output_features)
        )

    def forward(self, x):  # Correctly indented at the class level- WATCH INDENTATION
        return self.linear_layer_stack(x)

# Create instance of the BlobModel and send to target device
model_4 = BlobModel(
    input_features=NUM_FEATURES,
    output_features=NUM_CLASSES,
    hidden_units=HIDDEN_UNITS
).to(device)

model_4  # Display the model architecture


In [None]:
X_blob_train.shape, y_blob_train.shape, X_blob_test.shape, y_blob_test.shape

In [None]:
torch.unique(y_blob_train)

In [None]:
# Create loss and optimizer- with multiclass we use cross entropy loss
# Note: we have a balanced training set
loss_fn = nn.CrossEntropyLoss() # loss function measures how wrong our model's predictions are
optimizer = torch.optim.SGD(params=model_4.parameters(), # optimizer updates our model's parameters to try to reduce the loss
                            lr=0.1)


In [None]:
# Getting raw outputs of our model (i.e. logits)
# Note: the data is still on the CPU, so it is moved to the model's device first
model_4.eval()
with torch.inference_mode():
    y_logits = model_4(X_blob_test.to(device))

y_logits[:10]


In [None]:
y_blob_test[:10]

### NOTE: we need to get the `y_logits` shown above into the same form as the `y_blob_test` labels shown directly above this cell

* each row of `y_logits` has 4 values, one logit per class
* **To train, test, and evaluate our model, we need to convert the model's outputs (logits) to prediction probabilities and then to prediction labels.**

`logits` (raw outputs of model) -> `prediction probabilities` (use `torch.softmax()`) -> `prediction labels` (take the `argmax` of the prediction probabilities)  

In [None]:
# Convert our model's logit outputs to prediction probabilities
y_pred_probs = torch.softmax(y_logits, dim=1) # dim=1 is the class dimension, so each row sums to 1
print(y_logits[:5])
print(y_pred_probs[:5])

In [None]:
y_pred_probs[0]

In [None]:
torch.sum(y_pred_probs[0])

**Note: after applying softmax, the probabilities in each row of `y_pred_probs` sum to 1**

In [None]:
torch.max(y_pred_probs[0])

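#### So when we look at the first row of logits:

[-0.3817,  0.2051,  0.1333, -0.9696]

`-0.3817` -> logit for class 0

`0.2051` -> logit for class 1

`0.1333` -> logit for class 2, and so on.

These are raw scores, not probabilities (they can be negative and do not sum to 1). After `torch.softmax()`, each position holds the predicted probability of the corresponding class, and the largest logit (`0.2051`, class 1) receives the largest probability.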
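As a worked example, the four numbers quoted above can be pushed through softmax by hand. This sketch hard-codes that row as a 1-D tensor (hence `dim=0`), so it runs without the model:

```python
import torch

# The row of raw model outputs quoted above
logits = torch.tensor([-0.3817, 0.2051, 0.1333, -0.9696])

# Softmax turns raw scores into positive values that sum to 1
probs = torch.softmax(logits, dim=0)
label = probs.argmax()

print(probs)        # one probability per class
print(probs.sum())  # sums to 1 (up to floating-point rounding)
print(label)        # index of the most likely class
```

Because softmax preserves the ordering of its inputs, `argmax` on the probabilities picks the same class as `argmax` on the raw logits.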

### Convert our model's prediction probabilities to prediction labels using `argmax()`, which returns the index of the largest value

In [None]:
y_preds = torch.argmax(y_pred_probs, dim=1)
y_preds

In [None]:
y_blob_test

**Ideally these two blocks of numbers would match, but the model's weights are still random, so it is no surprise that they don't: no training has taken place yet**

### Now we will create our training and testing loop
**BE CAREFUL of data types here or else errors will occur**

An error is easily generated here because `y_blob_train` is of type Float, but PyTorch's `CrossEntropyLoss` expects class-index targets (labels) to be of type Long. The labels in `y_blob_train` are floating-point numbers, but they should be integers representing class indices.

```
# Ensure labels are of type Long
loss = loss_fn(y_logits, y_blob_train.long())

```
**ALSO**
`y_blob_test` is still of type Float when passed to the `CrossEntropyLoss` function. As mentioned earlier, `CrossEntropyLoss` requires the target labels (`y_blob_test`) to be of type `Long`. Additionally, the `test_logits` tensor must have the correct shape for the loss function to work.

#### Solution:

1. Ensure `y_blob_test` is of type `torch.long`: convert `y_blob_test` to `torch.long` before passing it to the loss function:
   `test_loss = loss_fn(test_logits, y_blob_test.long())`
2. Check the shape of `test_logits`: it must be `(batch_size, num_classes)` for `CrossEntropyLoss`. If `test_logits` is squeezed improperly or has an incorrect shape, the loss call can fail.
   Remove `.squeeze()` from `test_logits` unless you are certain the dimensions are correct after squeezing:
   `test_logits = model_4(X_blob_test)  # Don't squeeze`

#### Here is the correct validation loop:
```
model_4.eval()
with torch.inference_mode():
    test_logits = model_4(X_blob_test)  # Raw logits without squeezing
    test_pred = torch.softmax(test_logits, dim=1).argmax(dim=1)  # Get predictions
    # Ensure y_blob_test is of type Long
    test_loss = loss_fn(test_logits, y_blob_test.long())  
    test_acc = accuracy_fn(y_true=y_blob_test, y_pred=test_pred)

```
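The dtype problem can be reproduced without the model. The sketch below uses random logits and a handful of made-up float labels; the exact error message varies across PyTorch versions, so it is printed, not assumed.

```python
import torch
from torch import nn

torch.manual_seed(0)
loss_fn = nn.CrossEntropyLoss()

# Illustrative stand-ins: 6 samples, 4 classes
logits = torch.randn(6, 4)
float_labels = torch.tensor([0., 1., 2., 3., 1., 0.])  # class indices stored as floats

try:
    loss_fn(logits, float_labels)
    float_labels_raised = False
except (RuntimeError, ValueError) as err:
    float_labels_raised = True
    print(f"Float labels rejected: {err}")

# Casting to integer class indices fixes the call
loss = loss_fn(logits, float_labels.long())
print(loss.item())
```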




# Principles to Avoid Runtime Errors in PyTorch Training Loops

When building and training models in PyTorch, it’s important to follow these principles to prevent common runtime errors:

---

## 1. **Ensure Data Type Compatibility**
- **Input Data (`X`)**:
  - Should be a floating-point tensor that matches the dtype of the model's parameters (`torch.float32` by default).
  - Ensure that the device matches your model's device (e.g., `X.to(device)`).

- **Target Data (`y`)**:
  - Must be of type `torch.long` when using classification loss functions like `CrossEntropyLoss`.
  - Convert using `.long()`: `y.long()`.

---

## 2. **Understand `CrossEntropyLoss` Requirements**
- The `CrossEntropyLoss` function expects:
  - **Logits**: Raw model outputs of shape `(batch_size, num_classes)` with no activation applied (e.g., no `softmax`).
  - **Targets**: Ground truth labels of shape `(batch_size,)` with integer class indices (type `torch.long`).

- **Avoid Applying `softmax`**:
  - `CrossEntropyLoss` includes `log_softmax` internally, so applying `softmax` beforehand is unnecessary and may cause incorrect behavior.
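To illustrate why an extra `softmax` is harmful, this sketch compares three computations on hand-picked logits whose largest entry is always the correct class: `CrossEntropyLoss` on raw logits, the equivalent `log_softmax` + `nll_loss`, and the mistaken version that applies `softmax` first.

```python
import torch
from torch import nn
import torch.nn.functional as F

# Hand-picked logits: the model is confident and correct for every sample
logits = torch.tensor([[4.0, 0.0, 0.0, 0.0],
                       [0.0, 4.0, 0.0, 0.0],
                       [0.0, 0.0, 4.0, 0.0]])
labels = torch.tensor([0, 1, 2])

loss_fn = nn.CrossEntropyLoss()

ce = loss_fn(logits, labels)                                  # correct usage
manual = F.nll_loss(F.log_softmax(logits, dim=1), labels)     # same thing, spelled out
double_softmax = loss_fn(torch.softmax(logits, dim=1), labels)  # mistake: softmax applied twice

print(ce.item(), manual.item(), double_softmax.item())
```

The doubly-softmaxed loss stays large even though every prediction is correct, because probabilities squeezed into the range 0 to 1 look like nearly uniform logits to the loss function.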

---

## 3. **Shape Consistency**
- Ensure the model's output (`logits`) and target labels (`y`) are compatible:
  - **Logits**: `(batch_size, num_classes)`
  - **Targets**: `(batch_size,)`
- **Avoid Unnecessary Squeezing**:
  - Do not use `.squeeze()` unless you are certain of its effect on tensor dimensions.
  - Print tensor shapes using `.shape` for debugging.
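The squeezing advice is easiest to see with a batch of one. In this sketch (random stand-in tensors), `.squeeze()` leaves a full batch untouched but silently removes the batch dimension from a single-sample batch, after which `dim=1` no longer exists:

```python
import torch

batch_logits = torch.randn(800, 4)   # full batch: (batch_size, num_classes)
single_logits = torch.randn(1, 4)    # batch containing one sample

batch_squeezed = batch_logits.squeeze()
single_squeezed = single_logits.squeeze()
print(batch_squeezed.shape)    # unchanged: no size-1 dimensions to remove
print(single_squeezed.shape)   # batch dimension is gone

try:
    torch.softmax(single_squeezed, dim=1)
    softmax_failed = False
except (IndexError, RuntimeError) as err:
    softmax_failed = True
    print(f"softmax over dim=1 failed: {err}")
```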

---

## 4. **Device Management**
- Always ensure tensors are on the same device as the model (`cpu` or `cuda`):
  ```python
  X = X.to(device)
  y = y.to(device)
  ```


In [None]:
torch.manual_seed(42)
torch.cuda.manual_seed(42)

# Set the number of epochs
epochs = 100

# Put data to target device
X_blob_train, y_blob_train, X_blob_test, y_blob_test  = X_blob_train.to(device), y_blob_train.to(device), X_blob_test.to(device), y_blob_test.to(device)

# Loop through data
for epoch in range(epochs):
    model_4.train()
    # Forward pass
    y_logits = model_4(X_blob_train)  # keep shape (batch_size, num_classes); no squeeze needed
    y_pred = torch.softmax(y_logits, dim=1).argmax(dim=1)
    # Calculate loss
    loss = loss_fn(y_logits, y_blob_train.long())
    acc = accuracy_fn(y_true=y_blob_train,
                      y_pred=y_pred)

    optimizer.zero_grad()

    # Loss backward
    loss.backward()

    # Optimizer step
    optimizer.step() # update parameters in our model

    # testing
    model_4.eval()
    with torch.inference_mode():
        test_logits = model_4(X_blob_test)  # Raw logits without squeezing
        test_pred = torch.softmax(test_logits, dim=1).argmax(dim=1)  # Get predictions
        # Ensure y_blob_test is of type Long
        test_loss = loss_fn(test_logits, y_blob_test.long())
        test_acc = accuracy_fn(y_true=y_blob_test, y_pred=test_pred)

    # Print out what is happening
    if epoch % 10 == 0:
        print(f"Epoch: {epoch} | Loss: {loss:.4f}, Acc: {acc:.2f}% | Test Loss: {test_loss:.4f}, Test Acc: {test_acc:.2f}%")



### Wow! Excellent results!!!!

In [None]:
# Evaluate - Make predictions
model_4.eval()
with torch.inference_mode():
    y_logits = model_4(X_blob_test).squeeze() # remember - logits are the raw output of our model
y_logits[:10]

In [None]:
# Go from logits -> prediction probabilities
y_pred_probs = torch.softmax(y_logits, dim=1)
y_pred_probs[:10]

In [None]:
y_blob_test

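**So the two are not yet directly comparable: `y_pred_probs` holds one probability per class for each sample, while `y_blob_test` holds a single class label per sample**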

In [None]:
# Go from pred_probs -> pred_labels
y_preds = torch.argmax(y_pred_probs, dim=1)
y_preds

In [None]:
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.title("Train")
plot_decision_boundary(model_4, X_blob_train, y_blob_train)
plt.subplot(1, 2, 2)
plt.title("Test")
plot_decision_boundary(model_4, X_blob_test, y_blob_test)


### We have separated our data about as well as we could
**Question**
* Could we have separated our data without the help of non-linear activation functions?
* Let's comment out our ReLU activation functions and find out

In [None]:
import torch
import torch.nn as nn

class BlobModel(nn.Module):
    def __init__(self, input_features, output_features, hidden_units=8):
        """
        Initializes a multi-class classification model.
        Args:
        input_features (int): Number of input features to the model.
        output_features (int): Number of output features of the model.
        hidden_units (int): Number of hidden units between layers, default 8.
        """
        super().__init__()
        self.linear_layer_stack = nn.Sequential(
            nn.Linear(in_features=input_features, out_features=hidden_units),
            # nn.ReLU(),
            nn.Linear(in_features=hidden_units, out_features=hidden_units),
            # nn.ReLU(),
            nn.Linear(in_features=hidden_units, out_features=output_features)
        )

    def forward(self, x):  # Correctly indented at the class level- WATCH INDENTATION
        return self.linear_layer_stack(x)

# Create instance of the BlobModel and send to target device
model_4a = BlobModel(
    input_features=NUM_FEATURES,
    output_features=NUM_CLASSES,
    hidden_units=HIDDEN_UNITS
).to(device)

model_4a  # Display the model architecture


In [None]:
# Create loss and optimizer- with multiclass we use cross entropy loss
# Note: we have a balanced training set
loss_fn = nn.CrossEntropyLoss() # loss function measures how wrong our model's predictions are
optimizer = torch.optim.SGD(params=model_4a.parameters(), # optimizer updates our model's parameters to try to reduce the loss
                            lr=0.1)


In [None]:
# Getting raw outputs of our model (i.e. logits)
# Note: the data is moved to the model's device before the forward pass
model_4a.eval()
with torch.inference_mode():
    y_logits = model_4a(X_blob_test.to(device))

y_logits[:10]

In [None]:
# Convert our model's logit outputs to prediction probabilities
y_pred_probs = torch.softmax(y_logits, dim=1) # dim=1 is the class dimension, so each row sums to 1
print(y_logits[:5])
print(y_pred_probs[:5])

In [None]:
torch.manual_seed(42)
torch.cuda.manual_seed(42)

# Set the number of epochs
epochs = 100

# Put data to target device
X_blob_train, y_blob_train, X_blob_test, y_blob_test  = X_blob_train.to(device), y_blob_train.to(device), X_blob_test.to(device), y_blob_test.to(device)

# Loop through data
for epoch in range(epochs):
    model_4a.train()
    # Forward pass
    y_logits = model_4a(X_blob_train)  # keep shape (batch_size, num_classes); no squeeze needed
    y_pred = torch.softmax(y_logits, dim=1).argmax(dim=1)
    # Calculate loss
    loss = loss_fn(y_logits, y_blob_train.long())
    acc = accuracy_fn(y_true=y_blob_train,
                      y_pred=y_pred)

    optimizer.zero_grad()

    # Loss backward
    loss.backward()

    # Optimizer step
    optimizer.step() # update parameters in our model

    # testing
    model_4a.eval()
    with torch.inference_mode():
        test_logits = model_4a(X_blob_test)  # Raw logits without squeezing
        test_pred = torch.softmax(test_logits, dim=1).argmax(dim=1)  # Get predictions
        # Ensure y_blob_test is of type Long
        test_loss = loss_fn(test_logits, y_blob_test.long())
        test_acc = accuracy_fn(y_true=y_blob_test, y_pred=test_pred)

    # Print out what is happening
    if epoch % 10 == 0:
        print(f"Epoch: {epoch} | Loss: {loss:.4f}, Acc: {acc:.2f}% | Test Loss: {test_loss:.4f}, Test Acc: {test_acc:.2f}%")


In [None]:
# Evaluate - Make predictions
model_4a.eval()
with torch.inference_mode():
    y_logits = model_4a(X_blob_test).squeeze() # remember - logits are the raw output of our model
y_logits[:10]

In [None]:
# Go from logits -> prediction probabilities
y_pred_probs = torch.softmax(y_logits, dim=1)
y_pred_probs[:10]

In [None]:
# Go from pred_probs -> pred_labels
y_preds = torch.argmax(y_pred_probs, dim=1)
y_preds

In [None]:
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.title("Train")
plot_decision_boundary(model_4a, X_blob_train, y_blob_train)
plt.subplot(1, 2, 2)
plt.title("Test")
plot_decision_boundary(model_4a, X_blob_test, y_blob_test)


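### The answer is YES!!!!

Without ReLU the model can only draw straight-line decision boundaries, and for this blob dataset that is enough. The same would not hold for data such as `make_circles`.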

In [None]:
model_4a
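One way to see why the ReLU-free model is limited to straight boundaries: stacked `nn.Linear` layers with nothing in between collapse into a single linear map. The sketch below builds a hypothetical three-layer stack with the same layer sizes as `model_4a` (fresh random weights, not the trained model) and checks that one combined weight matrix and bias reproduce its output.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stack mirroring model_4a's layer sizes (2 -> 8 -> 8 -> 4), no activations
stack = nn.Sequential(nn.Linear(2, 8), nn.Linear(8, 8), nn.Linear(8, 4))
l1, l2, l3 = stack

with torch.no_grad():
    # Compose the three affine maps into one: y = x @ W.T + b
    W = l3.weight @ l2.weight @ l1.weight                        # shape (4, 2)
    b = l3.weight @ (l2.weight @ l1.bias + l2.bias) + l3.bias    # shape (4,)

    x = torch.randn(10, 2)
    collapsed_matches = torch.allclose(stack(x), x @ W.T + b, atol=1e-5)

print(W.shape, b.shape, collapsed_matches)
```

So without activations, adding layers adds parameters but no extra expressive power.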

## A few more classification metrics..(to evaluate classification model)
* Accuracy - out of 100 samples, how many does our model get correct? (Good to use when you have balanced classes)
* Precision
* Recall
* F1-score
* Confusion matrix
* Classification report

# Metrics Table

| **Metric Name**         | **Metric Formula**                                                         | **PyTorch Code**                                                                                                                                          | **When to Use**                                                                 |
|--------------------------|---------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------|
| **Precision**            | $ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} $          | `precision = tp / (tp + fp + 1e-10)`                                                                                | Use when minimizing false positives is important (e.g., spam detection).        |
| **Recall**               | $ \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} $             | `recall = tp / (tp + fn + 1e-10)`                                                                                  | Use when minimizing false negatives is important (e.g., disease diagnosis).     |
| **F1-Score**             | $ \text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $ | `f1 = 2 * (precision * recall) / (precision + recall + 1e-10)`                                                     | Use when balancing precision and recall is important.                           |
| **Accuracy**             | $ \text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}} $ | `accuracy = torchmetrics.Accuracy(task="multiclass", num_classes=num_classes)(predictions, targets)` <br>`# OR` <br>`accuracy = (tp + tn) / (tp + tn + fp + fn + 1e-10)` | Use when evaluating overall correctness of predictions (most informative with balanced classes). |
| **Confusion Matrix**     | N/A                                                                       | `conf_matrix = torch.zeros(num_classes, num_classes)` <br>`for t, p in zip(targets, predictions): conf_matrix[t.long(), p.long()] += 1`                  | Use for a detailed view of model performance for each class.                    |
| **Classification Report**| N/A                                                                       | `from sklearn.metrics import classification_report` <br>`report = classification_report(y_true, y_pred, target_names=class_names)`                      | Use for a comprehensive summary of precision, recall, F1-score, and support.    |

---

### Notes:
1. **TP (True Positives)**: Correctly predicted positive samples.
2. **FP (False Positives)**: Incorrectly predicted positive samples.
3. **FN (False Negatives)**: Positive samples incorrectly predicted as negative.
4. **TN (True Negatives)**: Correctly predicted negative samples.
5. **num_classes**: Total number of classes in the classification task.
6. **1e-10**: Small value added *to avoid division by zero.*

---

## Precision-Recall Relationship

### Precision and Recall:
- **Precision** focuses on the proportion of correct positive predictions out of all positive predictions made by the model.
- **Recall** focuses on the proportion of actual positive samples that were correctly identified by the model.

### When to Use:
- **Favor precision (accept lower recall)**: when false positives are costly (e.g., spam detection).
- **Favor recall (accept lower precision)**: when false negatives are costly (e.g., disease diagnosis).
- **F1-Score**: Use to balance precision and recall when both are critical.
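The trade-off can be made concrete by moving the decision threshold on predicted probabilities. This is a minimal sketch on eight made-up samples; `precision_recall` is a hypothetical helper written for this illustration, not a library function.

```python
import torch

# Made-up labels and predicted probabilities for 8 samples
targets = torch.tensor([1, 1, 1, 1, 0, 0, 0, 0])
probs = torch.tensor([0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1])

def precision_recall(threshold):
    """Hypothetical helper: precision and recall at a given decision threshold."""
    preds = (probs >= threshold).long()
    tp = ((preds == 1) & (targets == 1)).sum().item()
    fp = ((preds == 1) & (targets == 0)).sum().item()
    fn = ((preds == 0) & (targets == 1)).sum().item()
    precision = tp / (tp + fp + 1e-10)
    recall = tp / (tp + fn + 1e-10)
    return precision, recall

results = {t: precision_recall(t) for t in (0.25, 0.5, 0.75)}
for threshold, (precision, recall) in results.items():
    print(f"threshold={threshold:.2f} | precision={precision:.3f} | recall={recall:.3f}")
```

Raising the threshold makes the model more conservative about predicting the positive class, which here lifts precision and lowers recall; lowering it does the opposite.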

---

### Example Usage in PyTorch

```python
import torch
import torchmetrics

# Example inputs
targets = torch.tensor([1, 0, 1, 1, 0, 1])
predictions = torch.tensor([1, 0, 1, 0, 0, 1])

# True positives, false positives, false negatives, true negatives
tp = torch.sum((predictions == 1) & (targets == 1))
fp = torch.sum((predictions == 1) & (targets == 0))
fn = torch.sum((predictions == 0) & (targets == 1))
tn = torch.sum((predictions == 0) & (targets == 0))

# Precision
precision = tp / (tp + fp + 1e-10)

# Recall
recall = tp / (tp + fn + 1e-10)

# F1-Score
f1 = 2 * (precision * recall) / (precision + recall + 1e-10)

# Accuracy (TorchMetrics); recent torchmetrics releases require the `task` argument
accuracy_metric = torchmetrics.Accuracy(task="binary")
accuracy = accuracy_metric(predictions, targets)

# Confusion Matrix
num_classes = 2
conf_matrix = torch.zeros(num_classes, num_classes)
for t, p in zip(targets, predictions):
    conf_matrix[t.long(), p.long()] += 1

# Print results
print(f"Precision: {precision.item():.4f}")
print(f"Recall: {recall.item():.4f}")
print(f"F1-Score: {f1.item():.4f}")
print(f"Accuracy: {accuracy.item():.4f}")
print(f"Confusion Matrix:\n{conf_matrix}")
```


**If you want access to many more PyTorch metrics**, see pages such as the PyTorch-Ignite reference (for accuracy): https://pytorch.org/ignite/generated/ignite.metrics.Accuracy.html

![Alt Text](https://bit.ly/42h5KMI)


In [None]:
!pip install torchmetrics

In [None]:
from torchmetrics import Accuracy

# Setup metric, specifying the task as 'multiclass' and num_classes
torchmetric_accuracy = Accuracy(task="multiclass", num_classes=NUM_CLASSES).to(device) # NUM_CLASSES = 4, defined earlier

# Calculate accuracy (cast the float labels to integer class indices first)
torchmetric_accuracy(y_preds, y_blob_test.long())