<a href="https://colab.research.google.com/github/christophergaughan/PyTorch/blob/main/PyTorch_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

| Hyperparameter             | Binary Classification                                                                                              | Multiclass Classification                                                                                  |
|----------------------------|-------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------|
| **Input layer shape (in_features)** | Same as number of features (e.g. 5 for age, sex, height, weight, smoking status in heart disease prediction) | Same as binary classification                                                                             |
| **Hidden layer(s)**        | Problem specific, minimum = 1, maximum = unlimited                                                               | Same as binary classification                                                                             |
| **Neurons per hidden layer** | Problem specific, generally 10 to 512                                                                            | Same as binary classification                                                                             |
| **Output layer shape (out_features)** | 1 (one class or the other)                                                                                  | 1 per class (e.g. 3 for food, person, or dog photo)                                                       |
| **Hidden layer activation** | Usually ReLU (rectified linear unit) but can be many others                                                      | Same as binary classification                                                                             |
| **Output activation**      | Sigmoid (`torch.sigmoid` in PyTorch)                                                                             | Softmax (`torch.softmax` in PyTorch)                                                                      |
| **Loss function**          | Binary crossentropy (`torch.nn.BCELoss` in PyTorch)                                                              | Cross entropy (`torch.nn.CrossEntropyLoss` in PyTorch)                                                    |
| **Optimizer**              | SGD (stochastic gradient descent), Adam (see `torch.optim` for more options)                                    | Same as binary classification                                                                             |


Classification is the problem of predicting which of a set of categories (classes) an input belongs to.
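
The output-layer and loss rows of the table above can be sketched as follows. This is only an illustration under assumptions: the batch of 4 samples with 5 features, the random targets, and the variable names are all made up for the example. It also uses `nn.BCEWithLogitsLoss` (which takes raw logits) rather than `nn.BCELoss` for the binary case.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical batch: 4 samples with 5 features each
features = torch.randn(4, 5)

# Binary classification: one output unit, sigmoid, binary cross entropy
binary_head = nn.Linear(in_features=5, out_features=1)
binary_logits = binary_head(features).squeeze(dim=1)   # shape: [4]
binary_targets = torch.tensor([0.0, 1.0, 1.0, 0.0])    # float labels (made up)
binary_loss = nn.BCEWithLogitsLoss()(binary_logits, binary_targets)
binary_probs = torch.sigmoid(binary_logits)            # values in (0, 1)

# Multiclass classification: one output unit per class, softmax, cross entropy
num_classes = 3
multi_head = nn.Linear(in_features=5, out_features=num_classes)
multi_logits = multi_head(features)                    # shape: [4, 3]
multi_targets = torch.tensor([0, 2, 1, 2])             # integer class indices (made up)
multi_loss = nn.CrossEntropyLoss()(multi_logits, multi_targets)
multi_probs = torch.softmax(multi_logits, dim=1)       # each row sums to 1

print(binary_logits.shape, multi_logits.shape)
```

Note that `nn.CrossEntropyLoss` also expects raw logits, so the softmax here is only needed when we want to read off probabilities.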

## Make classification data and get it ready

- This dataset is generated with scikit-learn's `make_circles` function

In [None]:
import sklearn
from sklearn.datasets import make_circles

# Make 1000 samples
n_samples = 1000

# Create circles
X, y = make_circles(n_samples,
                    noise = 0.03,
                    random_state=42)


In [None]:
len(X), len(y)

In [None]:
print(f'First 5 samples of X:\n {X[:5]}')
print(f'First 5 samples of y:\n {y[:5]}')

In [None]:
y

## Clearly, we have a binary classification problem here, as the label column $y$ contains only 0s and 1s

In [None]:
# Make a dataframe
import pandas as pd
circles = pd.DataFrame({'X1': X[:, 0],
                        'X2': X[:, 1],
                        'label': y})
circles.head()

In [None]:
# Visualize data
import matplotlib.pyplot as plt
plt.scatter(x=X[:, 0],
            y=X[:, 1],
            c=y,
            cmap=plt.cm.RdYlBu);

### This is a *toy dataset*: small enough to experiment with, yet enough to practice the fundamentals of PyTorch code

**Our goal: separate the blue dots from the red dots**

In [None]:
# Check input and output shapes
X.shape, y.shape

In [None]:
# The data is in numpy arrays, we need to turn into pytorch tensors
import torch
X = torch.from_numpy(X).type(torch.float)
y = torch.from_numpy(y).type(torch.float)

In [None]:
X[:5], y[:5]

In [None]:
print(f'Shape of X: {X.shape}')
print(f'Shape of y: {y.shape}')

In [None]:
print(f'Values for one sample of X: {X[0]} with shape: {X[0].shape}')
print(f'Values for one sample of y: {y[0]} with shape: {y[0].shape}')

## Create train and test splits

In [None]:
torch.__version__

In [None]:
X.dtype, y.dtype

In [None]:
# Split data randomly
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=42)


In [None]:
len(X_train), len(X_test), len(y_train), len(y_test)

## Build Model

1. Set up device-agnostic code
2. Construct a model by subclassing `nn.Module`
3. Define a loss function and optimizer
4. Create a training and test loop

In [None]:
import torch
from torch import nn

device = 'cuda' if torch.cuda.is_available() else 'cpu'
device


1. Subclass `nn.Module`
2. Create 2 `nn.Linear()` layers capable of handling the shapes in our data
3. Define `forward()` method that outlines the forward pass
4. Instantiate our model class and send it to the target `device`

In [None]:
# Subclass nn.Module
class CircleModelV0(nn.Module):
    def __init__(self):
        super().__init__()
        # # Create nn.Linear layers capable of handling the shapes of our data
        # self.layer_1 = nn.Linear(in_features=2,
        #                          out_features=5) # upscales to 5 features (hidden units)

        # self.layer_2 = nn.Linear(in_features=5,
        #                          out_features=1) # we're predicting a 0 or 1
        self.two_linear = nn.Sequential(
            nn.Linear(in_features=2,
                      out_features=5),
            nn.Linear(in_features=5,
                      out_features=1)
        )
    # define the forward pass
    def forward(self, x):
        return self.two_linear(x)
    #   return self.layer_2(self.layer_1(x)) # x-> layer_1 -> layer_2 -> output

# Instantiate instance of model class and send to target device
model_0 = CircleModelV0().to(device)
model_0


### Note in the code above:

The commented-out forward pass, `return self.layer_2(self.layer_1(x))`, may seem "backwards": read left to right, the last layer appears first and the input appears last. This simply reflects how nested function calls are evaluated. Let's break it down:

#### Understanding the forward Pass
Order of Operations:

* When you call `self.layer_1(x)`, the input `x` is passed through `layer_1`.
* This produces the intermediate output of the first layer.
* The intermediate output is then passed to `self.layer_2`, which produces the final output.

In functional terms:
`x -> layer_1 -> layer_2 -> output`

However, the Python code is written as:
`return self.layer_2(self.layer_1(x))`

This is standard practice in programming because you apply the innermost function (layer 1) first and then the outermost function (layer 2).

#### Why It Feels "Backwards":

* Neural network layers are typically thought of as a forward progression from input to output.
* In the `forward` method, the "nesting" structure can feel reversed because you start with the input, apply transformations in order, but write it with the innermost function first.

#### It's Just Function Composition:

* The code uses function composition, where one function's output is the input to the next. This is conceptually similar to: `f(g(x))`
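
To convince ourselves that the nested call, the step-by-step version, and `nn.Sequential` all describe the same computation, here is a small sketch. The two standalone layers and the random batch of 3 samples are assumptions made for the example; they are not the layers of `model_0`.

```python
import torch
from torch import nn

torch.manual_seed(42)

# Two standalone layers with the same shapes as in the model above
layer_1 = nn.Linear(in_features=2, out_features=5)
layer_2 = nn.Linear(in_features=5, out_features=1)
x = torch.randn(3, 2)  # hypothetical batch of 3 samples with 2 features

# Nested form: the innermost call runs first
nested = layer_2(layer_1(x))

# Step-by-step form: same operations, written in reading order
hidden = layer_1(x)
stepwise = layer_2(hidden)

# nn.Sequential applies the layers in the order they are listed
sequential = nn.Sequential(layer_1, layer_2)(x)

print(torch.allclose(nested, stepwise), torch.allclose(nested, sequential))
```

Which form to write is mostly a matter of readability; the order of execution is the same in all three.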


In [None]:
next(model_0.parameters()).device

In [None]:
# Let's replicate the model above using nn.Sequential
model_0 = nn.Sequential(
    nn.Linear(in_features=2,
              out_features=5),
    nn.Linear(in_features=5,
              out_features=1)).to(device)

model_0


In [None]:
model_0.state_dict()

In [None]:
# Make predictions (remember to use inference mode)
with torch.inference_mode():
    untrained_preds = model_0(X_test.to(device))
print(f'Length of preds: {len(untrained_preds)}')
print(f'Shape of preds: {untrained_preds.shape}')
print(f'First 10 preds: {untrained_preds[:10]}')
print(f'First 10 y_test: {y_test[:10]}')

In [None]:
X_test[:10], y_test[:10]

### Set-up loss function and optimizer

Which loss and optimizer should we use?

- Depends on the problem
    - regression: MAE, MSE
    - Classification: binary cross entropy or categorical cross entropy

# Optimizer and Loss Functions in PyTorch

While the loss function depends on the problem, the same optimizer can often be used across different problem spaces.

For example, the stochastic gradient descent optimizer (SGD, `torch.optim.SGD()`) can be used for a range of problems, and the same applies to the Adam optimizer (`torch.optim.Adam()`).

| **Loss Function/Optimizer**               | **Problem Type**                   | **PyTorch Code**                        |
|-------------------------------------------|-------------------------------------|-----------------------------------------|
| **Stochastic Gradient Descent (SGD)**     | Classification, regression, many others. | `torch.optim.SGD()`                     |
| **Adam Optimizer**                         | Classification, regression, many others. | `torch.optim.Adam()`                    |
| **Binary Cross Entropy Loss**             | Binary classification               | `torch.nn.BCEWithLogitsLoss` or `torch.nn.BCELoss` |
| **Cross Entropy Loss**                    | Multi-class classification          | `torch.nn.CrossEntropyLoss`             |
| **Mean Absolute Error (MAE) or L1 Loss**  | Regression                          | `torch.nn.L1Loss`                       |
| **Mean Squared Error (MSE) or L2 Loss**   | Regression                          | `torch.nn.MSELoss`                      |


In [None]:
# Setup loss function

loss_fn = nn.BCEWithLogitsLoss()

# Setup optimizer
optimizer = torch.optim.SGD(params=model_0.parameters(),
                            lr=0.1)



In [None]:
model_0.state_dict()

In [None]:
# Calculate accuracy- out of 100 examples what percentage does our model get right?
def accuracy_fn(y_true, y_pred):
    correct = torch.eq(y_true, y_pred).sum().item()
    acc = (correct / len(y_pred)) * 100
    return acc

### Train Model

1. Forward pass
2. Calculate the loss
3. Optimizer zero grad
4. Loss backward (backpropagation)
5. Optimizer step (gradient descent)

We are also going to perform the following conversion:

`going from raw logits -> prediction probabilities -> prediction labels`

The raw outputs of our model are **logits**. We convert them into prediction probabilities by passing them to an activation function (e.g. sigmoid for binary classification or softmax for multiclass classification).

Then we convert our model's prediction probabilities to **prediction labels** by either rounding them (binary) or taking the `argmax()` (multiclass).
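
As a quick sketch of the logits -> probabilities -> labels pipeline, here are both cases side by side. The logit values are hand-picked for illustration and do not come from our model.

```python
import torch

# Binary case: one logit per sample (hypothetical values)
binary_logits = torch.tensor([-2.0, -0.1, 0.3, 4.0])
binary_probs = torch.sigmoid(binary_logits)       # squash into (0, 1)
binary_labels = torch.round(binary_probs)         # tensor([0., 0., 1., 1.])

# Multiclass case: one logit per class for each sample (hypothetical values)
multi_logits = torch.tensor([[2.0, 0.5, -1.0],
                             [0.1, 0.2, 3.0]])
multi_probs = torch.softmax(multi_logits, dim=1)  # each row sums to 1
multi_labels = torch.argmax(multi_probs, dim=1)   # tensor([0, 2])

print(binary_labels, multi_labels)
```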

In [None]:
model_0

In [None]:
# View the first 5 outputs of the forward pass on the test data
model_0.eval()
with torch.inference_mode():
    y_logits = model_0(X_test.to(device))[:5]
y_logits

In [None]:
y_test[:5]

In [None]:
# Since we are performing a binary classification- use sigmoid activation function
y_probs = torch.sigmoid(y_logits)
y_probs

For our prediction probability values, we need to perform a range-style rounding on them:
* `y_probs` >= 0.5, `y=1` (class 1)
* `y_probs` < 0.5, `y=0` (class 0)

In [None]:
# Find the predicted labels (round the prediction probabilities)
y_preds = torch.round(y_probs)

# In full (logits->pred_probs->pred_labels)
y_pred_labels = torch.round(torch.sigmoid(model_0(X_test.to(device))[:5]))
y_pred_labels

# Check for equality
print(torch.eq(y_preds.squeeze(), y_pred_labels.squeeze()))

# Get rid of extra dimension
y_preds.squeeze()

In [None]:
y_test[:5]

# Building a training and test loop

In [None]:
device

In [None]:
!nvidia-smi

We can also set a CUDA manual seed

In [None]:
torch.manual_seed(42)

In [None]:
torch.cuda.manual_seed(42)

### Remember we are using `BCEWithLogitsLoss`
It is more numerically stable (as per the docs)

## PyTorch BCEWithLogitsLoss

In **PyTorch**, `BCEWithLogitsLoss` combines a Sigmoid layer and the Binary Cross-Entropy (BCE) loss in one single class. Mathematically, for each scalar input $x_i$ (the **logit**) and corresponding label $y_i \in \{0,1\}$, the loss for one sample is given by:

$$
\ell_i = -\Bigl[y_i \cdot \log\bigl(\sigma(x_i)\bigr) \;+\; \bigl(1 - y_i\bigr)\cdot \log\bigl(1 - \sigma(x_i)\bigr)\Bigr],
$$

where $\sigma(x_i)$ is the Sigmoid function:

$$
\sigma(x_i) = \frac{1}{1 + e^{-x_i}}.
$$

If we have a mini-batch of $N$ samples, the **mean** (or **sum**, depending on the `reduction` parameter) of all individual losses $\ell_i$ is typically taken as the final scalar loss value:

$$
\text{BCEWithLogitsLoss} = \frac{1}{N} \sum_{i=1}^{N} \ell_i.
$$

### Optional Weights

- **Weight (`weight`):** In PyTorch, you can assign a per-sample weight $w_i$ to handle unbalanced data. This modifies the loss term to:

  $$
  \ell_i = -\, w_i\, \Bigl[y_i \cdot \log\bigl(\sigma(x_i)\bigr) + (1 - y_i)\cdot \log\bigl(1 - \sigma(x_i)\bigr)\Bigr].
  $$

- **Positive weight (`pos_weight`)**: This is an additional multiplier for the positive targets, useful when you have *many* more negatives than positives. It modifies the loss term for $y_i=1$. Specifically,

$$
\ell_i = -\Bigl[\mathrm{pos\_weight} \cdot y_i \cdot \log(\sigma(x_i)) + (1 - y_i) \cdot \log\bigl(1 - \sigma(x_i)\bigr)\Bigr].
$$


By accepting raw logits $x_i$ (i.e., values **before** the Sigmoid), `BCEWithLogitsLoss` is more numerically stable than applying a Sigmoid followed by a separate `BCELoss`.
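
A small sketch can check these formulas numerically. The logits, targets, and the `pos_weight` value of 3.0 are made-up numbers chosen for illustration; the point is only that the built-in loss, the two-step sigmoid + `BCELoss` version, and the formula written out by hand agree.

```python
import torch
from torch import nn

# Hypothetical logits and labels for four samples
logits = torch.tensor([-1.5, 0.2, 3.0, -0.4])
targets = torch.tensor([0.0, 1.0, 1.0, 0.0])

# Built-in loss on raw logits vs. sigmoid followed by BCELoss
loss_with_logits = nn.BCEWithLogitsLoss()(logits, targets)
loss_two_step = nn.BCELoss()(torch.sigmoid(logits), targets)

# The per-sample formula written out by hand, then averaged over the batch
probs = torch.sigmoid(logits)
manual = -(targets * torch.log(probs)
           + (1 - targets) * torch.log(1 - probs)).mean()

# pos_weight multiplies only the positive-target term (3.0 is an arbitrary choice)
weighted = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([3.0]))(logits, targets)
manual_weighted = -(3.0 * targets * torch.log(probs)
                    + (1 - targets) * torch.log(1 - probs)).mean()

print(loss_with_logits, loss_two_step, manual)
print(weighted, manual_weighted)
```

For moderate logits like these, all versions agree. The stability advantage of `BCEWithLogitsLoss` shows up for very large positive or negative logits, where the sigmoid saturates.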


In [None]:
# Set the number of epochs
epochs = 100

# Put split data on the target device
X_train, y_train, X_test, y_test = X_train.to(device), y_train.to(device), X_test.to(device), y_test.to(device)

# Build our training and evaluation loop
for epoch in range(epochs):
    # Training
    model_0.train()

    # Forward pass (squeeze removes the extra size-1 dimension from the output)
    y_logits = model_0(X_train).squeeze()
    y_pred = torch.round(torch.sigmoid(y_logits)) # turn logits -> pred probs -> pred labels

    # Calculate loss/accuracy (accuracy_fn is defined above)
    # loss = loss_fn(torch.sigmoid(y_logits), # nn.BCELoss expects prediction probabilities as input
    #                y_train)
    loss = loss_fn(y_logits, # nn.BCEWithLogitsLoss expects raw logits as input
                   y_train)  # Note the order of the arguments: (input, target)
    acc = accuracy_fn(y_true=y_train,
                      y_pred=y_pred)

    # Optimizer zero grad
    optimizer.zero_grad()

    # Loss backward (backpropagation)
    loss.backward()

    # Optimizer step (gradient descent)
    optimizer.step()

    # Testing
    model_0.eval()
    with torch.inference_mode():
        # Forward pass
        test_logits = model_0(X_test).squeeze()
        test_pred = torch.round(torch.sigmoid(test_logits))
        # Calculate test loss/accuracy
        test_loss = loss_fn(test_logits,
                            y_test)
        test_acc = accuracy_fn(y_true=y_test,
                               y_pred=test_pred)

    # Print out what's happening
    if epoch % 10 == 0:
        print(f"Epoch: {epoch} | Loss: {loss:.5f}, Acc: {acc:.2f}% | Test Loss: {test_loss:.5f}, Test Acc: {test_acc:.2f}%")



#### Our results are akin to flipping a coin. Not ideal

In [None]:
circles.label.value_counts()

Why is our model not learning?

Let's visualize its predictions. To do so, we'll import a function called `plot_decision_boundary`.

A very useful website for our endeavors in ML/DL is https://madewithml.com/

Specifically, this page by Goku Mohandas:
https://madewithml.com/courses/mlops/evaluation/

Here we will use mrdbourke's helper functions for visualizing our results.



In [None]:
import requests
from pathlib import Path

# 1. (Optional) Remove the existing (likely invalid) helper_functions.py
# !rm helper_functions.py

# 2. Use the *raw* GitHub URL
url_to_download = "https://raw.githubusercontent.com/mrdbourke/pytorch-deep-learning/main/helper_functions.py"

if Path("helper_functions.py").is_file():
    print("helper_functions.py already exists, skipping download")
else:
    print("Downloading helper_functions.py")
    request = requests.get(url_to_download)
    with open("helper_functions.py", "wb") as f:
        f.write(request.content)


In [None]:
from helper_functions import plot_predictions, plot_decision_boundary


In [None]:
# Plot the decision boundary of the model
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.title("Train")
plot_decision_boundary(model_0, X_train, y_train)
plt.subplot(1, 2, 2)
plt.title("Test")
plot_decision_boundary(model_0, X_test, y_test)

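#### This shows us why our model is getting such poor results: a model made only of linear layers can only draw a straight-line decision boundary, and our data is circular.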

## Improve our model
* Add more layers - give the model more chances to learn about the patterns in the data
* Add more hidden units - go from 5 hidden units to 10 hidden units
* Fit for longer (more epochs)
* Change the activation functions - so far the only one is the sigmoid applied to the output logits (good for binary classification)
* Change the learning rate (beware of vanishing/exploding gradients)
* Change the loss function
* Change the optimization function

These options are all from the model's perspective because they concern the form of the model and how it is trained, rather than the data.

Because these are all values that we can set ourselves, they are called **hyperparameters**.
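
One way to make such hyperparameters easy to vary is to pass them in as constructor arguments. The class below is a hypothetical sketch (the name `ConfigurableCircleModel` and its arguments are not part of this notebook), and like the models so far it stacks only linear layers.

```python
import torch
from torch import nn

class ConfigurableCircleModel(nn.Module):
    """Hypothetical model whose size is set by constructor arguments."""
    def __init__(self, in_features=2, hidden_units=10, hidden_layers=2):
        super().__init__()
        layers = []
        width = in_features
        for _ in range(hidden_layers):
            layers.append(nn.Linear(width, hidden_units))
            width = hidden_units
        layers.append(nn.Linear(width, 1))  # single logit for binary classification
        self.layers = nn.Sequential(*layers)

    def forward(self, x):
        return self.layers(x)

def count_parameters(model):
    """Total number of weights and biases in the model."""
    return sum(p.numel() for p in model.parameters())

# Same layer shapes as the 2 -> 5 -> 1 model and a 2 -> 10 -> 10 -> 1 model
small = ConfigurableCircleModel(hidden_units=5, hidden_layers=1)
larger = ConfigurableCircleModel(hidden_units=10, hidden_layers=2)

print(count_parameters(small), count_parameters(larger))  # 21 151
```

Changing one argument at a time makes it easier to tell which hyperparameter actually helped.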

##### Below we will
* Add more hidden units: 5 -> 10
* Increase the number of layers: 2 -> 3
* Increase the number of epochs: 100 -> 1000

Ideally, we would change only one of these at a time; otherwise we cannot tell which change improved or degraded our model. We change all three at once just to save time.

In [None]:
class CircleModelV1(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer_1 = nn.Linear(in_features=2,
                                 out_features=10)
        self.layer_2 = nn.Linear(in_features=10,
                                 out_features=10)
        self.layer_3 = nn.Linear(in_features=10,
                                 out_features=1) # a single output unit, since this is a binary choice
    def forward(self, x):
        # z = self.layer_1(x)
        # z = self.layer_2(z)
        # z = self.layer_3(z)
        return self.layer_3(self.layer_2(self.layer_1(x))) # nested calls f(g(h(x))); same result as the step-by-step version above

model_1 = CircleModelV1().to(device)
model_1



In [None]:
# Create the loss function
loss_fn = nn.BCEWithLogitsLoss()

# Create an optimizer
optimizer = torch.optim.SGD(params=model_1.parameters(),
                            lr=0.1)

In [None]:
# Write a training and evaluation loop
torch.manual_seed(42)
torch.cuda.manual_seed(42)
epochs = 1000 #training for longer

# Put data on target device
X_train, y_train, X_test, y_test = X_train.to(device), y_train.to(device), X_test.to(device), y_test.to(device)

for epoch in range(epochs):
    model_1.train()
    # forward pass
    y_logits = model_1(X_train).squeeze()
    y_pred = torch.round(torch.sigmoid(y_logits))

    # Calculate loss/accuracy
    loss = loss_fn(y_logits,
                   y_train)
    acc = accuracy_fn(y_true=y_train,
                      y_pred=y_pred)

    # Zero the gradients
    optimizer.zero_grad()

    # Loss backward
    loss.backward()

    # Optimizer step (gradient descent)
    optimizer.step()

    # Testing
    model_1.eval()
    with torch.inference_mode():
        # Forward pass
        test_logits = model_1(X_test).squeeze()
        test_pred = torch.round(torch.sigmoid(test_logits))

        # Calculate loss/accuracy
        test_loss = loss_fn(test_logits,
                            y_test)
        test_acc = accuracy_fn(y_true=y_test,
                               y_pred=test_pred)
    # Print out whats happening
    if epoch % 100 == 0:
        print(f"Epoch: {epoch} | Loss: {loss:.5f}, Acc: {acc:.2f}% | Test Loss: {test_loss:.5f}, Test Acc: {test_acc:.2f}%")

In [None]:
# Plot decision boundary
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.title("Train")
plot_decision_boundary(model_1, X_train, y_train)
plt.subplot(1, 2, 2)
plt.title("Test")
plot_decision_boundary(model_1, X_test, y_test)

## Nope, still a coin flip

One way to troubleshoot a larger problem is to test out a smaller problem

In [None]:
# Create some data (same as previous notebook)
weight = 0.7
bias = 0.3
start = 0
end = 1
step = 0.01
X_regression = torch.arange(start, end, step).unsqueeze(dim=1)
y_regression = weight * X_regression + bias

# Check the data
print(X_regression[:5], y_regression[:5])

In [None]:
train_split = int(0.8 * len(X_regression))
X_train_regression, y_train_regression = X_regression[:train_split], y_regression[:train_split]
X_test_regression, y_test_regression = X_regression[train_split:], y_regression[train_split:]
# Check lengths of each
len(X_train_regression), len(X_test_regression), len(y_train_regression), len(y_test_regression)

In [None]:
plot_predictions(train_data=X_train_regression,
                 train_labels = y_train_regression,
                 test_data = X_test_regression,
                 test_labels = y_test_regression);

In [None]:
X_train_regression.shape, y_train_regression.shape

Our present model is set up to receive 2 input features; here we are feeding in only one.

In [None]:
X_train_regression[:10]

Only one value per sample, as shown above.

Adjust `model_1` to fit a straight line, using `nn.Sequential`

In [None]:
# same architecture as model_1 (but using nn.Sequential())
model_2 = nn.Sequential(
    nn.Linear(in_features=1,
              out_features=10),
    nn.Linear(in_features=10,
              out_features=10),
    nn.Linear(in_features=10,
              out_features=1)
).to(device)
model_2

In [None]:
# Loss and optimizer
loss_fn = nn.L1Loss() # MAE loss with regression data
optimizer = torch.optim.SGD(model_2.parameters(),
                            lr=0.01)

In [None]:
# Train the model
torch.manual_seed(42)
torch.cuda.manual_seed(42)

# Set the number of epochs
epochs = 1000

# Put the data on the target device
X_train_regression, y_train_regression, X_test_regression, y_test_regression = X_train_regression.to(device), y_train_regression.to(device), X_test_regression.to(device), y_test_regression.to(device)

# Training
for epoch in range(epochs):
    model_2.train() # eval() is called below, so switch back to training mode each epoch
    y_pred = model_2(X_train_regression)
    loss = loss_fn(y_pred, y_train_regression)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Testing
    model_2.eval()
    with torch.inference_mode():
        test_pred = model_2(X_test_regression)
        test_loss = loss_fn(test_pred, y_test_regression)

    # Print out what is happening- Note, there is no accuracy here because we
    # are doing regression
    if epoch % 100 == 0:
        print(f"Epoch: {epoch} | Loss: {loss:.5f} | Test Loss: {test_loss:.5f}")

**OK, we see that model_2 can indeed learn: the losses become very small, particularly after we lowered the learning rate from lr=0.1 to lr=0.01. This suggests that the architecture itself is a valid approach.**

In [None]:
# Turn on evaluation mode
model_2.eval()

# Create predictions (inference) with the model
with torch.inference_mode():
    y_preds = model_2(X_test_regression)

# Plot data and predictions
plot_predictions(train_data=X_train_regression,
                 train_labels=y_train_regression,
                 test_data=X_test_regression,
                 test_labels=y_test_regression,
                 predictions=y_preds)

#### Wait, why is the plot blank? We get the error `can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.`

*Explanation*: matplotlib works with NumPy arrays, which live in CPU memory, so tensors on the GPU must be copied to the CPU first!


In [None]:
# write the correct code!
model_2.eval()

# Create predictions (inference) with the model
with torch.inference_mode():
    y_preds = model_2(X_test_regression)

# Plot data and predictions
plot_predictions(train_data=X_train_regression.cpu(),
                 train_labels=y_train_regression.cpu(),
                 test_data=X_test_regression.cpu(),
                 test_labels=y_test_regression.cpu(),
                 predictions=y_preds.cpu())

*Now the question becomes: is it the data our model can't learn from? Is it the circular nature of the data? Our model consists only of linear functions, which can only draw straight lines. Is the problem that our data has non-linear characteristics?*

The big reveal here is that we will have to employ non-linear activation functions!

Neural networks have the benefit of combining linear (straight-line) functions with non-linear ones. Our model was hamstrung because we gave it only straight lines to work with. Our data is non-linear, so introducing non-linear components will be the key to our success.
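
To see why stacking linear layers does not help, note that a stack of `nn.Linear` layers with no activation in between collapses into a single linear map. The sketch below checks this numerically; the freshly initialized layers and the random input batch are assumptions made for the example and are not the trained `model_1`.

```python
import torch
from torch import nn

torch.manual_seed(42)

# Three stacked linear layers with no activation functions in between
stacked = nn.Sequential(
    nn.Linear(2, 10),
    nn.Linear(10, 10),
    nn.Linear(10, 1),
)

with torch.no_grad():
    W1, b1 = stacked[0].weight, stacked[0].bias
    W2, b2 = stacked[1].weight, stacked[1].bias
    W3, b3 = stacked[2].weight, stacked[2].bias

    # Fold the three layers into one weight matrix and one bias vector
    W = W3 @ W2 @ W1                  # shape: [1, 2]
    b = W3 @ (W2 @ b1 + b2) + b3      # shape: [1]

    x = torch.randn(5, 2)             # hypothetical batch of 5 samples
    stacked_out = stacked(x)
    collapsed_out = x @ W.T + b       # a single linear layer's worth of computation

print(torch.allclose(stacked_out, collapsed_out, atol=1e-5))
```

So however many linear layers we add, the decision boundary stays a straight line. A non-linear activation between the layers is what breaks this collapse.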

### Recreating non-linear data (red and blue circles)

In [None]:
# Make and plot data
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_circles

n_samples = 1000
X, y = make_circles(n_samples,
                    noise=0.03,
                    random_state=42)

plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.RdYlBu)



In [None]:
# Convert data to tensors and then to train and test splits
import torch
from sklearn.model_selection import train_test_split

# Turn data into tensors
X = torch.from_numpy(X).type(torch.float)
y = torch.from_numpy(y).type(torch.float)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=42)

# Check the data
X_train[:5], y_train[:5]

## Building a model with non-linearity

### Non-Linear Activation Functions in PyTorch

Activation functions introduce non-linearity into a neural network, allowing it to learn complex patterns in data. Below are some commonly used non-linear activation functions in PyTorch:

1. **ReLU (Rectified Linear Unit)**:
   - Formula: $f(x) = \max(0, x)$
   - Introduces sparsity and mitigates the vanishing gradient problem.
   - Commonly used in hidden layers.

2. **Sigmoid**:
   - Formula: $f(x) = \frac{1}{1 + e^{-x}}$
   - Squashes input to a range between 0 and 1.
   - Suitable for binary classification problems but may suffer from the vanishing gradient issue.

3. **Tanh (Hyperbolic Tangent)**:
   - Formula: $f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
   - Squashes input to a range between -1 and 1.
   - Zero-centered, which can help during optimization compared to Sigmoid.

4. **Leaky ReLU**:
   - Formula: $f(x) = x$ if $x > 0$, else $f(x) = \alpha x$, where $\alpha$ is a small positive constant.
   - Addresses the "dying ReLU" problem by allowing a small gradient for negative inputs.

5. **Softmax**:
   - Formula: $f(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$
   - Converts outputs into probabilities, typically used in the final layer for multi-class classification.

6. **ELU (Exponential Linear Unit)**:
   - Formula:
$$
f(x) =
\begin{cases}
x & \text{if } x > 0 \\
\alpha (e^x - 1) & \text{if } x \leq 0
\end{cases}
$$

   - Similar to ReLU but smooths the transition for negative inputs.

7. **Swish**:
   - Formula: $f(x) = x \cdot \text{sigmoid}(x)$
   - A smooth, self-gated activation function known to improve performance on some deep learning tasks (available in PyTorch as `torch.nn.SiLU`).

#### Key Considerations:
- **Choice of Activation Function**: Depends on the task and architecture. ReLU and its variants are widely used in hidden layers, while Softmax and Sigmoid are popular in output layers for classification tasks.
- **Non-Linearity**: Activation functions enable neural networks to approximate complex, non-linear mappings.

We could experiment with these activation functions to see their impact on the **make_circles** dataset, which is inherently non-linear!
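
As a starting point for such experiments, here is a sketch that applies each of the activation functions listed above to the same small tensor. The input values and the `negative_slope` and `alpha` settings are arbitrary choices for illustration.

```python
import torch
from torch import nn

# A few hand-picked inputs on both sides of zero
x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

activations = {
    "ReLU": nn.ReLU(),
    "Sigmoid": nn.Sigmoid(),
    "Tanh": nn.Tanh(),
    "LeakyReLU": nn.LeakyReLU(negative_slope=0.01),
    "Softmax": nn.Softmax(dim=0),   # normalizes across the whole vector
    "ELU": nn.ELU(alpha=1.0),
    "SiLU (Swish)": nn.SiLU(),
}

# Apply every activation to the same input and collect the results
outputs = {name: fn(x) for name, fn in activations.items()}
for name, out in outputs.items():
    print(f"{name:>12}: {out}")
```

Unlike the others, Softmax is not element-wise: each output depends on every input, which is why it needs a `dim` argument.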


In [None]:
from torch import nn
class CircleModelV2(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer_1 = nn.Linear(in_features=2, out_features=10)
        self.layer_2 = nn.Linear(in_features=10, out_features=10)
        self.layer_3 = nn.Linear(in_features=10, out_features=1)
        self.relu = nn.ReLU() # applied element-wise; a non-linear activation function
    def forward(self, x):
        # Where should the non-linear activations go? After each hidden layer
        return self.layer_3(self.relu(self.layer_2(self.relu(self.layer_1(x))))) # composed functions (cf. the chain rule)

model_3 = CircleModelV2().to(device)
model_3

### Why I think of the Chain Rule in Calculus

The code on line 11 of the cell above has a form that confused me at first. One thing to remember is that it is conceptually similar to the **chain rule** in calculus! Here's why:

---

### The Chain Rule in Calculus
The chain rule is used to compute the derivative of a composition of functions. If $f(x) = g(h(x))$, then the derivative $f'(x)$ is:

$$
f'(x) = g'(h(x)) \cdot h'(x)
$$

It works by applying the derivative of the outer function $g$, evaluated at the inner function $h(x)$, and multiplying it by the derivative of the inner function $h$.
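
PyTorch's autograd applies this rule for us. As a small check, the sketch below compares the gradient computed by `backward()` with the chain rule worked out by hand. The functions $h(x) = x^2$ and $g(u) = \sin(u)$ and the point $x = 1.5$ are arbitrary choices for illustration.

```python
import torch

# f(x) = g(h(x)) with h(x) = x**2 and g(u) = sin(u)
x = torch.tensor(1.5, requires_grad=True)
u = x ** 2          # inner function h
f = torch.sin(u)    # outer function g
f.backward()        # autograd computes df/dx

# Chain rule by hand: g'(h(x)) * h'(x) = cos(x**2) * 2x
with torch.no_grad():
    by_hand = torch.cos(x ** 2) * 2 * x

print(x.grad, by_hand)
```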

---

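### Thus when I see the code snippet below I think of the chain rule
The code:
```python
return self.layer_3(self.relu(self.layer_2(self.relu(self.layer_1(x)))))
```
composes functions in the same nested way as $g(h(x))$. When `loss.backward()` runs, PyTorch's autograd applies the chain rule to this composition to compute the gradients.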


In [None]:
# Create the loss function
loss_fn = nn.BCEWithLogitsLoss()

# Create an optimizer
optimizer = torch.optim.SGD(params=model_3.parameters(),
                            lr=0.01) # the learning rate sets the step size, and so how quickly the model converges

In [None]:
# Write a training and evaluation loop
torch.manual_seed(42)
torch.cuda.manual_seed(42)
epochs = 1000

# Put data on target device
X_train, y_train, X_test, y_test = X_train.to(device), y_train.to(device), X_test.to(device), y_test.to(device)

# Loop through the data
for epoch in range(epochs):
    #Training
    model_3.train()
    # forward pass
    y_logits = model_3(X_train).squeeze()
    y_pred = torch.round(torch.sigmoid(y_logits))

    # Calculate loss/accuracy
    loss = loss_fn(y_logits,
                   y_train)
    acc = accuracy_fn(y_true=y_train,
                      y_pred=y_pred)

    # Zero the gradients
    optimizer.zero_grad()

    # Loss backward
    loss.backward()

    # Optimizer step (gradient descent)
    optimizer.step()

    # Testing
    model_3.eval()
    with torch.inference_mode():
        # Forward pass
        test_logits = model_3(X_test).squeeze()
        test_pred = torch.round(torch.sigmoid(test_logits)) #logits -> prediction probabilities -> prediction labels

        # Calculate loss/accuracy
        test_loss = loss_fn(test_logits,
                            y_test)
        test_acc = accuracy_fn(y_true=y_test,
                               y_pred=test_pred)
    # Print out whats happening
    if epoch % 100 == 0:
        print(f"Epoch: {epoch} | Loss: {loss:.4f}, Acc: {acc:.2f}% | Test Loss: {test_loss:.4f}, Test Acc: {test_acc:.2f}%")

### Visualize the model
* non linear activation function
* lr = 0.01 (my change)

In [None]:
model_3.eval()
with torch.inference_mode():
    y_preds = torch.round(torch.sigmoid(model_3(X_test))).squeeze()
y_preds[:10], y_test[:10]

In [None]:
# plot decision Boundaries
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.title("Train")
plot_decision_boundary(model_3, X_train, y_train)
plt.subplot(1, 2, 2)
plt.title("Test")
plot_decision_boundary(model_3, X_test, y_test)

#### Still room for improvement

### Replicate non-linear activation functions

With neural networks, rather than telling the model what to learn, we give it the tools to discover patterns in the data, and it tries to figure out those patterns on its own.

These tools are linear and non-linear functions.

In [None]:
# Create a tensor
A = torch.arange(-10, 10, 1, dtype = torch.float32)
A.dtype


In [None]:
# Visualize the tensor
plt.plot(A);

In [None]:
plt.plot(torch.relu(A));

In [None]:
def relu(x: torch.Tensor) -> torch.Tensor:
    return torch.maximum(torch.tensor(0), x) # inputs must be tensors
relu(A)

In [None]:
plt.plot(relu(A));

In [None]:
# Now let's do the same for sigmoid
def sigmoid(x: torch.Tensor) -> torch.Tensor:
    return 1 / (1 + torch.exp(-x))
sigmoid(A)

In [None]:
plt.plot(sigmoid(A));