## Introduction
Suppose you have a set of 28x28 grayscale images, and your task is to 
classify each image into one of 10 classes. You can try a neural network 
for this task, so you devise the following architecture:  

![image1.jpg](../images/image8.jpg)  

Each input node corresponds to a pixel of the image. The hidden layers use 
an activation function (in this case, ReLU), and the output layer ends with 
a final decision function (LogSoftmax) over the ten classes.

We can calculate the number of trainable parameters based on our 
architecture of 784, 128, 64 and 10 nodes. Each node is connected to every 
node in the next layer, and each of those connections carries a weight. 
Every node in the receiving layer also has a bias term, so a layer with 
$n$ inputs and $m$ outputs contributes $n \times m + m$ parameters. 
Training the network means finding the weights and biases that minimize 
the prediction error. For this case:

![image2.jpg](../images/image2.jpg)  
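
As a rough check of the count, the arithmetic can be written out in a few lines of plain Python. This is only a sketch: the helper name `count_parameters` is made up for illustration, and the layer sizes are the ones quoted in the text.

```python
# Layer sizes from the text: input, two hidden layers, output
layer_sizes = [784, 128, 64, 10]

def count_parameters(sizes):
    """Count weights and biases of a fully connected network."""
    total = 0
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = n_in * n_out  # one weight per connection
        biases = n_out          # one bias per node in the receiving layer
        total += weights + biases
    return total

total_params = count_parameters(layer_sizes)
print(total_params)  # 109386
```

The three layers contribute 100,480, 8,256 and 650 parameters respectively, which adds up to 109,386.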

We need to remember that each layer is computed from the previous one, 
by applying the weights and adding a bias:
$$
L = WX + b
$$


The activation functions introduce non-linearity into our neural network, 
which allows it to learn more complex structures. For example, ReLU 
(rectified linear unit) returns 0 if the input is zero or below, and 
returns the input unchanged if it is greater than zero. Let's look at the 
function:

![image3.jpg](../images/image3.jpg)
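
The definition above can be tried on a handful of values. The sketch below writes ReLU by hand with `torch.clamp` (the helper `relu` is illustrative, not part of the notebook) and compares it against PyTorch's built-in `torch.relu`.

```python
import torch

def relu(x: torch.Tensor) -> torch.Tensor:
    """ReLU written out by hand: max(0, x), element-wise."""
    return torch.clamp(x, min=0)

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
manual = relu(x)
builtin = torch.relu(x)

print(manual.tolist())                # [0.0, 0.0, 0.0, 0.5, 2.0]
print(torch.equal(manual, builtin))   # True
```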


In [1]:
# Step 0. Load libraries and custom functions
# Torch ----------------------------------------------------------------
import torch
from torch import nn
import torch.nn.functional as F
torch.manual_seed(2024) # for reproducibility
# Basics ---------------------------------------------------------------
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(2024) # for reproducibility

# Use Apple's Metal (MPS) backend if it is available, otherwise the CPU
if torch.backends.mps.is_available():
    device = torch.device('mps')
    x = torch.ones(1, device=device)
    print(f'Use mps: {x}')
else:
    device = torch.device('cpu')
    print('Use cpu')

Use mps: tensor([1.], device='mps:0')


In [2]:
# Step 1. Create your NN class
class MyNeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 10)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        x = F.log_softmax(x, dim=1)
        return x
    
model = MyNeuralNetwork()

In this case the base class is `nn.Module`, so `MyNeuralNetwork` is a 
derived class whose parent is `nn.Module`. The constructor is `__init__`, 
and `super().__init__()` calls the constructor of the base class. `self` 
lets us access the object's attributes and other methods. FC stands for 
fully connected layer, and in this case we define three of them. This is 
what an FC layer looks like:  

![image4.jpg](../images/image4.jpg)

This creates a linear transformation ($WX + b$), where each `fc` layer is 
the transition from one layer to the next:

![image5.jpg](../images/image5.jpg)
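
To make the linear transformation concrete, here is a small sketch comparing the output of `nn.Linear` with the same computation written out by hand. The sizes (4 inputs, 3 outputs, batch of 2) are arbitrary values chosen for illustration. Note that PyTorch stores the weight with shape `[out_features, in_features]` and treats each sample as a row, so the hand-written form is `x @ W.T + b`.

```python
import torch
from torch import nn

torch.manual_seed(0)

fc = nn.Linear(in_features=4, out_features=3)  # weight: [3, 4], bias: [3]
x = torch.randn(2, 4)                          # batch of 2 samples, 4 features each

layer_out = fc(x)                              # what the layer computes
manual_out = x @ fc.weight.T + fc.bias         # same transformation written out

print(tuple(fc.weight.shape), tuple(fc.bias.shape))  # (3, 4) (3,)
print(torch.allclose(layer_out, manual_out))         # True
```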

The `forward` method defines the model structure: which components are 
applied and in what order. 

Finally, we create an object based on the class we've just created. 
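
As a quick sanity check of such an object, the sketch below pushes a batch of random inputs through a network with the 784, 128, 64, 10 layout described in the introduction. The class name `TinyClassifier` and the fake batch are assumptions made for this example only.

```python
import torch
from torch import nn
import torch.nn.functional as F

class TinyClassifier(nn.Module):  # hypothetical name, layout taken from the text
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 10)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return F.log_softmax(self.fc3(x), dim=1)

torch.manual_seed(0)
net = TinyClassifier()
batch = torch.rand(5, 784)   # 5 fake flattened 28x28 images
log_probs = net(batch)       # calling the object runs forward()
probs = log_probs.exp()      # undo the log to recover probabilities

print(tuple(log_probs.shape))                     # (5, 10)
print(probs.sum(dim=1))                           # each row sums to roughly 1
print(sum(p.numel() for p in net.parameters()))   # 109386
```

Calling `net(batch)` rather than `net.forward(batch)` is the usual pattern, since `nn.Module.__call__` also runs any registered hooks.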

### Tensors

A `torch.Tensor` is a multi-dimensional matrix containing elements of a 
single data type. Be aware that `torch.tensor()` always copies data. If 
you already have a tensor and only want to change its gradient tracking 
without copying it, use `detach()` or `requires_grad_()`. 
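
The copy behaviour can be observed directly. The sketch below is an illustration under a few assumptions: it uses a NumPy array as the source data and brings in `torch.as_tensor`, which is not mentioned in the text above, as a non-copying counterpart to `torch.tensor()`.

```python
import numpy as np
import torch

arr = np.array([1.0, 2.0, 3.0])

copied = torch.tensor(arr)      # always copies the data
shared = torch.as_tensor(arr)   # reuses the NumPy buffer when dtype and device allow

arr[0] = 100.0                  # modify the original array in place
print(copied[0].item())         # 1.0   (the copy is unaffected)
print(shared[0].item())         # 100.0 (the shared tensor sees the change)

t = torch.ones(3, requires_grad=True)
d = t.detach()                  # new tensor object, same storage, no gradient tracking
print(d.requires_grad)                 # False
print(d.data_ptr() == t.data_ptr())    # True
```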

The contents of a tensor can be accessed and modified using Python’s 
indexing and slicing notation:

```python
x = torch.Tensor([[2,3],[4,5]])
x[0][0]=1
```
Use torch.Tensor.item() to get a Python number from a tensor containing 
a single value:

```python
x = torch.Tensor([[1]]) # item() raises an error if the tensor holds more than one value
x.item()
```


In [3]:
w = torch.zeros(4,3)
w

tensor([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]])

In [5]:
# The size() method and the shape attribute give the same answer
print(f'{w.size()}, {w.shape}')

torch.Size([4, 3]), torch.Size([4, 3])


In [6]:
# When we initialize weights, we use normal random numbers, so we use randn
w = torch.randn(4,3)
w

tensor([[-0.7314, -0.5315,  1.6219],
        [ 0.2582,  0.7322,  0.0904],
        [ 1.1983, -1.6961, -2.4074],
        [-0.4155,  1.2816, -0.6278]])

In [7]:
# We can create a tensor with the same shape as another tensor
# (rand_like fills it with uniform random numbers in [0, 1))
t = torch.rand_like(w)
w

tensor([[-0.7314, -0.5315,  1.6219],
        [ 0.2582,  0.7322,  0.0904],
        [ 1.1983, -1.6961, -2.4074],
        [-0.4155,  1.2816, -0.6278]])

In [10]:
# Functions whose name ends with an underscore mutate the tensor in place
t.fill_(2.3) # So tensor t has already changed

tensor([[2.3000, 2.3000, 2.3000],
        [2.3000, 2.3000, 2.3000],
        [2.3000, 2.3000, 2.3000],
        [2.3000, 2.3000, 2.3000]])

In [11]:
# And of course we can reshape a tensor using view. 
t.view(3,4)

tensor([[2.3000, 2.3000, 2.3000, 2.3000],
        [2.3000, 2.3000, 2.3000, 2.3000],
        [2.3000, 2.3000, 2.3000, 2.3000]])

In [14]:
# With -1, torch infers that dimension from the remaining ones
t.view(1,-1)

tensor([[2.3000, 2.3000, 2.3000, 2.3000, 2.3000, 2.3000, 2.3000, 2.3000, 2.3000,
         2.3000, 2.3000, 2.3000]])
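
A common use of `view` with `-1` is flattening a batch of images before a fully connected layer. The sketch below uses a made-up batch of zeros with the `[N, C, H, W]` layout; it also shows that the view shares storage with the original tensor.

```python
import torch

images = torch.zeros(64, 1, 28, 28)        # stand-in batch: N, C, H, W
flat = images.view(images.shape[0], -1)    # keep N, let torch infer 1*28*28 = 784

print(tuple(flat.shape))                   # (64, 784)

flat[0, 0] = 1.0                           # write through the view
print(images[0, 0, 0, 0].item())           # 1.0, the original tensor changed too
```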

In [15]:
# You can convert a (CPU) tensor to a NumPy array; a change to either one
# affects both, since they share the same memory
t.numpy()

array([[2.3, 2.3, 2.3],
       [2.3, 2.3, 2.3],
       [2.3, 2.3, 2.3],
       [2.3, 2.3, 2.3]], dtype=float32)

In [22]:
# You can obtain the value with item by indexing
x = torch.Tensor([1,2])
x[1].item()

2.0

### Train a neural network

First we pass a batch of images with their labels through the network. 
Initially, the weights are set to random numbers. They are then gradually 
adjusted based on the feedback signal (the loss), so training a neural 
network means adjusting its weights. One pass over all the images is 
called an epoch. With each epoch we expect the loss to decrease. After 
some iterations the net starts to learn patterns specific to the current 
data.  

![image6.jpg](../images/image6.jpg)

Depending on the kind of problem, whether classification or regression, 
you need a loss function that fits:

Problem Type | Last layer activation | Loss Function
-------------|-----------------------|--------------
Binary classification|Sigmoid|Binary crossentropy
Multiclass, single-label classification|Softmax|Categorical crossentropy
Multiclass, multilabel classification|Sigmoid|Binary crossentropy
Regression to arbitrary values|None|MSE (mean squared error)
Regression to values between 0 and 1|Sigmoid|MSE or binary crossentropy

For example, `nn.CrossEntropyLoss()` combines `nn.LogSoftmax()` and 
`nn.NLLLoss()` in one single class, where NLLLoss means negative 
log-likelihood loss. 

Let's see the difference in the implementation of the forward pass with 
the negative log-likelihood loss (`NLLLoss`):
```python
criterion = nn.NLLLoss()
...
def forward(self, x):
    x = F.relu(self.fc1(x))
    x = F.relu(self.fc2(x))
    x = self.fc3(x)
    x = F.log_softmax(x, dim=1)
    return x
```

and a forward pass with `CrossEntropyLoss` as criterion:
```python
criterion = nn.CrossEntropyLoss()
...
def forward(self, x):
    x = F.relu(self.fc1(x))
    x = F.relu(self.fc2(x))
    x = self.fc3(x)
    return x
```
In this case, the model doesn't need the `log_softmax` call, because the 
criterion applies it internally. 

So the class should be defined as follows:
```python
class FMNIST(nn.Module):
    def __init__(self):
        super().__init__()  # required before registering layers
        self.fc1 = nn.Linear(784, 784)
        self.fc2 = nn.Linear(784, 784)
        self.fc3 = nn.Linear(784, 10)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)  # raw logits: no activation on the last layer
        return x

model = FMNIST()
```
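
To check that the two setups really produce the same loss, the sketch below feeds the same raw scores through both paths. The logits and labels are random, made-up values used purely for illustration.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)            # raw scores for 4 samples, 10 classes
labels = torch.tensor([3, 0, 9, 1])    # made-up class indices

# Option 1: log_softmax in the model, NLLLoss as criterion
loss_nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), labels)

# Option 2: raw logits from the model, CrossEntropyLoss as criterion
loss_ce = nn.CrossEntropyLoss()(logits, labels)

print(torch.allclose(loss_nll, loss_ce))  # True
```

Applying `log_softmax` in the model and then using `CrossEntropyLoss` on top would normalize twice, so only one of the two pairings should be used.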

### Autograd
The loss of our model is a function of the weights, and our task is to 
find the combination of weights where the loss is at its minimum. 

![image7.jpg](../images/image7.jpg)

Since the function is differentiable, we can compute the gradient of the 
loss with respect to the weights of the model. This could be a challenging 
problem if we remember how many trainable parameters there are (in our 
example, 109,386), but Autograd computes it for us. In the forward pass 
we calculate the loss, and in the backward pass we calculate the gradient 
with respect to each of the weights. 
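
The forward and backward passes can be seen on a toy problem small enough to verify by hand. This is a sketch with three made-up weights and a squared-error loss, not the network from this notebook.

```python
import torch

# Three "weights" tracked by Autograd, one input and one target value
w = torch.tensor([1.0, -2.0, 3.0], requires_grad=True)
x = torch.tensor([0.5, 1.0, -1.0])
target = torch.tensor(2.0)

pred = (w * x).sum()           # forward pass: 0.5 - 2.0 - 3.0 = -4.5
loss = (pred - target) ** 2    # (-6.5)^2 = 42.25

loss.backward()                # backward pass: fills w.grad

# Analytic gradient for comparison: d(loss)/dw = 2 * (pred - target) * x
expected = 2 * (pred.detach() - target) * x

print(loss.item())                        # 42.25
print(w.grad.tolist())                    # [-6.5, -13.0, 13.0]
print(torch.allclose(w.grad, expected))   # True
```

An optimizer step would then move each weight a small distance against its gradient, which is what `optimizer.step()` does in a training loop.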

In [None]:
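from torch import optim
criterion = nn.NLLLoss()  # expects log-probabilities (log_softmax output)
optimizer = optim.SGD(model.parameters(), lr=0.01)

num_epochs = 3

# train_loader is assumed to be a DataLoader that yields (images, labels)
for i in range(num_epochs):
    cum_loss = 0
    for images, labels in train_loader:
        images = images.view(images.shape[0], -1)  # flatten to [N, 784]
        optimizer.zero_grad()             # reset gradients from the previous step
        output = model(images)            # forward pass
        loss = criterion(output, labels)
        loss.backward()                   # backward pass: compute gradients
        optimizer.step()                  # update the weights
        cum_loss += loss.item()
    print(f'Training loss: {cum_loss/len(train_loader)}')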

In [None]:
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

In [None]:
training_data = datasets.FashionMNIST(
    root='data',
    train=True,
    download=True,
    transform=ToTensor()
)

test_data = datasets.FashionMNIST(
    root='data',
    train=False,
    download=True,
    transform=ToTensor(),
)

In [None]:
# DataLoader wraps an iterable over our dataset, and supports automatic 
# batching, sampling, shuffling and multiprocess data loading
batch_size = 64
train_dataloader = DataLoader(training_data, batch_size=batch_size)
test_dataloader = DataLoader(test_data, batch_size=batch_size)

for X, y in test_dataloader:
    # N: Batch size, C: Channels, H: Height, W: Width
    print(f'Size of X[N,C,H,W]: {X.shape}, type: {X.dtype}')
    print(f'Size of y: {y.shape}, type: {y.dtype}')
    break

In [None]:
print(f'{device}')

In [None]:
class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Dropout(p=0.3),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10))
        
    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

model = NeuralNetwork().to(device)
print(model)

In [None]:
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

def train(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    model.train()
    for batch, (X,y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)
        pred = model(X)
        loss = loss_fn(pred, y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if batch % 100 == 0:
            loss, current = loss.item(), (batch+1) * len(X)
            print(f'Loss: {loss:>.8f}, [{(current / size)}]')

In [None]:
def test(dataloader, model, loss_fn):
    size = len(dataloader.dataset)
    num_batches =  len(dataloader)
    model.eval()
    test_loss, correct = 0, 0
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()
    test_loss /= num_batches
    correct /= size
    print(f'Test error: \n Accuracy: {correct*100:>.1f}%, avg loss: {test_loss:>.8f}')
    



In [None]:
epochs = 10
for t in range(epochs):
    print(f'Epoch {t+1}\n')
    train(train_dataloader, model, loss_fn, optimizer)
    test(test_dataloader, model, loss_fn)
print('Done!')