# Building Neural Network Models

In this recipe, we start with the TorchVision functions that load and preprocess image data. 
We then define a basic neural network architecture as a class and look at the modules and 
methods available for doing so. The focus is a fully connected neural network class whose 
attributes are the layers used to classify different types of clothing.

## Description of the dataset

We will be using the Fashion-MNIST dataset. This is a dataset of Zalando's article images, 
consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example 
is a 28 x 28 grayscale image, which we flatten into a vector of 784 values.
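
The flattening step can be sketched on its own, without loading the dataset. The batch below is random data standing in for real images (an assumption made purely for illustration), shaped the way a data loader with a batch size of 64 would deliver it.

```python
import torch

# Stand-in for one batch of grayscale images: (batch, channels, height, width).
# These are random values, not actual Fashion-MNIST pixels.
images = torch.rand(64, 1, 28, 28)

# Keep the batch dimension and collapse the rest into one axis of 1 * 28 * 28 = 784.
flat = images.view(images.shape[0], -1)

print(flat.shape)  # torch.Size([64, 784])
```

Passing -1 lets PyTorch infer the size of the flattened axis, so the same line works for any batch size.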


## Note

This recipe should be used in conjunction with the recipe named "02_Loading Data Using PyTorch",
to pull the Fashion-MNIST data.

In [None]:
import torch
from torch import nn
from torchvision import datasets, transforms

In [None]:
# define transforms for the preprocessing of our image data

transform = transforms.Compose([
    transforms.ToTensor(), # convert the input to PyTorch tensors
    transforms.Normalize((0.5,), (0.5,)), # normalize the data
])
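
To see what the two transforms above do to pixel values, here is a small sketch that reproduces the arithmetic of Normalize by hand, using plain tensor operations instead of TorchVision. The pixel values are hypothetical; the only assumption is that ToTensor() has already scaled them to the range [0, 1].

```python
import torch

# Hypothetical pixel intensities after ToTensor(), which scales images to [0, 1]
pixels = torch.tensor([0.0, 0.25, 0.5, 1.0])

# Normalize((0.5,), (0.5,)) computes (x - mean) / std for the single grayscale channel
mean, std = 0.5, 0.5
normalized = (pixels - mean) / std

print(normalized)  # values: -1.0, -0.5, 0.0, 1.0
```

With a mean and standard deviation of 0.5, the inputs end up centered around zero in the range [-1, 1].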

In [None]:
# define the batch_size to divide our dataset into chunks to be fed into the model

batch_size = 64

# path to data
path = "./Data"

# download=False because the data is already on disk; set download=True on the first run
train_data = datasets.FashionMNIST(path, train=True, download=False, transform=transform)

# training data loader
train_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size, shuffle=True)


# test data

test_data = datasets.FashionMNIST(path, train=False, transform=transform)

# test data loader
test_loader = torch.utils.data.DataLoader(test_data, batch_size=batch_size, shuffle=True)

Our main task here is to define the neural network class, which will be a subclass of nn.Module.

## *Important Notes*

We could define the model class with any name, but what is important is that it is a subclass of 
nn.Module and has super().__init__(), which provides the model with a lot of useful methods and 
attributes and retains knowledge of the architecture.

We use nn.Linear() to define fully connected layers by passing in the input and output dimensions.
We use a softmax layer on the output of the last layer because we are classifying into 10 output 
classes. We use ReLU activation in the layers before the output layer to learn nonlinearity in the 
data. The hidden1 layer takes 784 input units and gives out 256 output units. The hidden2 layer 
outputs 128 units, and the output layer has 10 output units representing the 10 output classes. 
The softmax layer converts the activations into probabilities that sum to 1 along dimension 1.

In [None]:
class FashionNetwork(nn.Module):

    def __init__(self):
        super().__init__()

        self.hidden1 = nn.Linear(784, 256)  # first hidden layer
        self.hidden2 = nn.Linear(256, 128)  # second hidden layer
        self.output = nn.Linear(128, 10)    # output layer
        self.activation = nn.ReLU()         # activation function for the inner layers
        self.softmax = nn.Softmax(dim=1)    # softmax activation layer
        

    
    def forward(self, x):
        x = self.hidden1(x) # move the input to the first hidden layer, with 256 nodes
        x = self.activation(x)  # pass the outputs from the first hidden layer through 
                                #the activation function, which in our case is ReLU
        x = self.hidden2(x)
        x = self.activation(x)
        x = self.output(x)      # pass the last output layer, with 10 output classes
        
        output = self.softmax(x)    # push the output using the softmax function

        return output               # return the output tensor

In [None]:
# create the network object
model = FashionNetwork()

# a quick look at our model
print(model)

## Summary Notes

A network defined with nn.Module needs to have a forward() method defined. It takes the 
input tensor and passes it through the network components defined in the __init__() method 
in the network class, in the sequence of operations defined in the forward method.

The forward method is called automatically when we call the model object with an input, as in 
model(x). Each nn.Linear layer creates its own weight and bias tensors, and nn.Module registers 
them as model parameters that we use in the forward method. The linear unit by itself defines a 
linear function, such as *xW* + *B*; to have nonlinear capabilities, we need to insert nonlinear 
activation functions, and here we use one of the most popular activation functions, ReLU, although 
you could use other activation functions available in PyTorch.
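
One practical consequence of this automatic call is that nn.Module looks the method up by the exact name forward. The sketch below uses two hypothetical toy classes, Tiny and Misspelled, which are not part of this recipe. It shows that calling the model object runs forward, and that a misspelled method name leaves the default forward in place, which raises NotImplementedError.

```python
import torch
from torch import nn

class Tiny(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 2)

    def forward(self, x):
        return self.linear(x)

class Misspelled(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 2)

    def foward(self, x):  # deliberately misspelled: nn.Module never calls this
        return self.linear(x)

x = torch.rand(3, 4)

out = Tiny()(x)    # calling the object dispatches to forward()
print(out.shape)   # torch.Size([3, 2])

try:
    Misspelled()(x)
    raised = False
except NotImplementedError:
    raised = True

print(raised)      # True

# The parameters created by nn.Linear are registered on the model automatically
param_names = [name for name, _ in Tiny().named_parameters()]
print(param_names)  # ['linear.weight', 'linear.bias']
```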

We pass the final layer output through softmax because we want one output class to have a higher 
probability than all the other classes, and the output probabilities to sum to 1. The dim=1 
argument ensures that softmax is taken across the columns of the output, that is, across the 10 
class scores of each sample in the batch.
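
A quick check of this behavior, as a sketch on random scores rather than real model outputs (the batch of 4 samples is an arbitrary choice):

```python
import torch
from torch import nn

torch.manual_seed(0)
logits = torch.randn(4, 10)  # hypothetical raw scores: 4 samples, 10 classes

probs = nn.Softmax(dim=1)(logits)

# Each row now holds the class probabilities of one sample
row_sums = probs.sum(dim=1)
print(torch.allclose(row_sums, torch.ones(4)))  # True

# The most probable class is the one with the largest raw score
predicted = probs.argmax(dim=1)
print(torch.equal(predicted, logits.argmax(dim=1)))  # True
```

Using dim=0 instead would normalize each class column across the samples in the batch, which is not what we want for classification.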

We can also define the network architecture without a network class by using the *nn.Sequential* module. 
When writing a class, it is important that the operations in the *forward* method are ordered properly, 
although the order in which the layers are defined in \__init__ doesn't matter. You can use *nn.Tanh* for 
tanh activation. You can access the weight and bias tensors of a layer from the model object, for example 
with *model.hidden1.weight* and *model.hidden1.bias*.
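
As a minimal sketch of the nn.Sequential alternative, the same layer sizes can be written as follows. The variable name seq_model is an assumption for illustration; layers in nn.Sequential are addressed by position rather than by attribute name.

```python
import torch
from torch import nn

# Same architecture as the class-based network, listed in execution order
seq_model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
    nn.Softmax(dim=1),
)

# Index 0 is the first linear layer; nn.Linear stores weight as (out_features, in_features)
print(seq_model[0].weight.shape)  # torch.Size([256, 784])
print(seq_model[0].bias.shape)    # torch.Size([256])

# A random batch of 2 flattened images, used only to check the output shape
out = seq_model(torch.rand(2, 784))
print(out.shape)                  # torch.Size([2, 10])
```

Here the order of the arguments is the order of execution, so there is no separate forward method to keep in sync.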

# Defining the loss function

A machine learning model, when being trained, may have some deviation between the predicted output and the 
actual output, and this difference is called the *error* of the model. The function that lets us calculate 
this *error* is called the *loss function*, or *error function*. This function provides a metric to evaluate 
all possible solutions and choose the most optimized model. The loss function has to be able to reduce all 
attributes of the model down to a single number so that an improvement in that *loss function value* is 
representative of a better model. 

In this recipe, we will define a *loss function* for our fashion dataset using the loss function available in 
PyTorch.

First, we will modify our existing network architecture to output the log of softmax instead of softmax, 
starting with the \__init__ method of the network class.

Next, we will make the same change in the forward method of the neural network.

In [None]:
class FashionNetwork(nn.Module):

    def __init__(self):
        super().__init__()

        self.hidden1 = nn.Linear(784, 256)  # first hidden layer
        self.hidden2 = nn.Linear(256, 128)  # second hidden layer
        self.output = nn.Linear(128, 10)    # output layer
        self.activation = nn.ReLU()         # activation function for the inner layers
        self.log_softmax = nn.LogSoftmax(dim=1)  # log softmax activation layer
        

    
    def forward(self, x):
        x = self.hidden1(x) # move the input to the first hidden layer, with 256 nodes
        x = self.activation(x)  # pass the outputs from the first hidden layer through 
                                #the activation function, which in our case is ReLU
        x = self.hidden2(x)
        x = self.activation(x)
        x = self.output(x)      # pass the last output layer, with 10 output classes
        
        output = self.log_softmax(x)    # apply log softmax to get log-probabilities

        return output               # return the output tensor

In [None]:
model = FashionNetwork()

print(model)

In [None]:
# Define our loss function; we will use negative log likelihood loss for this

criterion = nn.NLLLoss()

## Section Notes

We replaced softmax with log softmax so that we work with log-probabilities rather than 
probabilities, which has nice theoretical interpretations. There are various reasons 
for doing this, including improved numerical stability and better-behaved gradients during 
optimization. These advantages can be extremely important when training a model that is 
computationally challenging and expensive. Furthermore, the loss penalizes the model heavily 
when it assigns a low probability to the correct class. We use negative log likelihood together 
with *log softmax* because nn.NLLLoss expects log-probabilities as its input; passing it plain 
*softmax* outputs would give incorrect loss values. It is useful in classification between *n* 
classes. Taking the log means that we are not working directly with very small values between 
0 and 1, and the negative sign makes the loss non-negative, because the *logarithm* of a 
probability that is less than 1 is negative. Our goal is to minimize this negative log 
likelihood loss. In PyTorch, the loss function is conventionally called a *criterion*, and so 
we named our loss function criterion.
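
To make the relationship concrete, the following sketch computes the loss on random scores and hypothetical labels (neither comes from the fashion dataset). It checks that nn.NLLLoss on log softmax outputs is the mean of the negated log-probabilities of the correct classes, and that it matches nn.CrossEntropyLoss applied directly to the raw scores.

```python
import torch
from torch import nn

torch.manual_seed(0)
logits = torch.randn(5, 10)               # hypothetical raw scores: 5 samples, 10 classes
labels = torch.tensor([0, 3, 9, 1, 7])    # hypothetical true class indices

log_probs = nn.LogSoftmax(dim=1)(logits)
nll = nn.NLLLoss()(log_probs, labels)

# By hand: pick the log-probability of the correct class per sample, negate, average
manual = -log_probs[torch.arange(5), labels].mean()

# CrossEntropyLoss combines log softmax and NLLLoss in one step
ce = nn.CrossEntropyLoss()(logits, labels)

print(torch.allclose(nll, manual))  # True
print(torch.allclose(nll, ce))      # True
```

Either pairing works for training; the recipe keeps log softmax inside the model and uses nn.NLLLoss as the criterion.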

# Implementing optimizers

In the previous recipe, *Defining the loss function*, we spoke of errors and error functions, and 
learned that, for us to get a good model, we need to minimize the errors that are calculated. 
*Backpropagation* is the method by which neural networks learn from errors: it calculates the 
*partial derivative* of the error with respect to each of the *weights*. *Optimization* functions 
are responsible for using these derivatives to modify the weights so that the error is reduced. 
The *gradient* points in the direction in which the error increases, and so the weights are moved 
in the opposite direction. The *optimizer function* combines the *model parameters* and 
*loss function* to iteratively modify the model parameters to reduce the model error. *Optimizers* can 
be thought of as fiddling with the model weights to get the best possible model based on the difference 
between the model's prediction and the actual output, and the *loss function* acts as a guide by 
indicating when the *optimizer* is going right or wrong.

The *learning rate* is a *hyperparameter* of the *optimizer*, which controls the amount by which the 
*weights* are updated. It should not be so large that the *weights* are updated by a huge amount, the 
algorithm fails to converge, and the error gets bigger and bigger; at the same time, it should not be 
so small that it takes forever to reach the *minimum* of the *cost function/error function*.

In PyTorch, the optim module provides a number of *optimizers*.
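
The effect of the learning rate can be seen on a toy problem before touching the network. The sketch below is an illustration under assumptions of our own: it minimizes the one-parameter function (w - 3)^2 and uses optim.SGD rather than Adam, because plain gradient descent makes the step size easy to reason about. The helper name minimize is hypothetical.

```python
import torch
from torch import optim

def minimize(lr, steps=50):
    """Run gradient descent on (w - 3)^2 from w = 0 and return the final w."""
    w = torch.tensor(0.0, requires_grad=True)
    optimizer = optim.SGD([w], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()    # clear the gradient from the previous step
        loss = (w - 3.0) ** 2
        loss.backward()          # compute d(loss)/dw
        optimizer.step()         # move w against the gradient, scaled by lr
    return w.item()

good = minimize(lr=0.1)  # each step shrinks the distance to the minimum
bad = minimize(lr=1.1)   # each step overshoots, so the distance keeps growing

print(abs(good - 3.0) < 1e-3)  # True
print(abs(bad - 3.0) > 100)    # True
```

The same zero_grad, backward, step sequence is what a training loop for the network repeats for every batch.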

In [None]:
from torch import optim

In [None]:
# create an optimizer object. We will use the Adam optimizer and pass model parameters

optimizer = optim.Adam(model.parameters())

# To check for the defaults of the optimizer, you can do the following
optimizer.defaults

In [None]:
# You can also add the learning rate as an additional parameter

optimizer = optim.Adam(model.parameters(), lr=3e-3)

In [25]:
# Let's start training our model, starting with the number of epochs

epochs = 10


# start the training loop

for _ in range(epochs):
    # initialize the loss
    running_loss = 0

    # iterate through each batch of images in the training data loader, which we defined earlier
    for image, label in train_loader:
        # reset the gradients to zero
        optimizer.zero_grad()
        
        # reshape the image
        image = image.view(image.shape[0], -1)

        # get the prediction from the model
        pred = model(image)

        # calculate the loss/error
        loss = criterion(pred, label)

        # call the .backward() method on the loss
        loss.backward()

        # call the .step() method on the optimizer
        optimizer.step()

        # append to the running loss
        running_loss += loss.item()

    # Finally, print the average loss per batch after each epoch
    print(f'Training loss: {running_loss/len(train_loader):.4f}')

