# Multilayer Neural Networks with PyTorch

In [1]:
import torch

Inherit from `torch.nn.Module` to build a neural network:

In [2]:
class MyNeuralNetwork(torch.nn.Module):
    def __init__(self, num_inputs, num_outputs):
        # initialize parent class
        super().__init__()
        # Define layers
        self.layers = torch.nn.Sequential(
            torch.nn.Linear(num_inputs, 8),  # First hidden layer with 8 neurons
            torch.nn.ReLU(),                   # Activation function for first layer

            # Second hidden layer
            torch.nn.Linear(8, 4),           # Second hidden layer with 4 neurons
            torch.nn.ReLU(),                   # Activation function for second layer

            # output layer
            torch.nn.Linear(4, num_outputs)
        )

    def forward(self, x):
        # The output of the last layer is called the logits
        logits = self.layers(x) 
        return logits

Now we can initialize the model and take a look at its layers:

In [3]:
model = MyNeuralNetwork(num_inputs=5, num_outputs=2)
model

MyNeuralNetwork(
  (layers): Sequential(
    (0): Linear(in_features=5, out_features=8, bias=True)
    (1): ReLU()
    (2): Linear(in_features=8, out_features=4, bias=True)
    (3): ReLU()
    (4): Linear(in_features=4, out_features=2, bias=True)
  )
)

You may have heard of models like Google's Gemini or OpenAI's GPT-4 having billions of *parameters*. In this context, parameters are the adjustable values (weights and biases) that training tunes to reduce the loss. The model defined above is small, so it's easy to count its parameters by hand.

Before using code to see the number of parameters, let's calculate the value manually by looking at each layer. 

The first layer has 5 `in_features` and 8 `out_features`, which means that the input is a `1 x 5` vector, and after it goes through the first linear transformation, it becomes a `1 x 8` vector. Therefore, the shape of the weights should be `5 x 8`, right?

In [16]:
first_layer = model.layers[0]
first_layer.weight.shape

torch.Size([8, 5])

Well, that's not the case. As you can see, the shape of the first layer's weight is the *transpose* of what we expected.

This is because PyTorch's `nn.Linear` stores its weights as `(out_features, in_features)`, with one row per output neuron.
Let's revisit how we initialized the first layer.


In [9]:
torch.nn.Linear(5, 8)

Linear(in_features=5, out_features=8, bias=True)

If you look at the [official docs](https://docs.pytorch.org/docs/stable/generated/torch.nn.Linear.html) for `torch.nn.Linear`, under `Variables` > `weight`, it says:

> weight (torch.Tensor) – the learnable weights of the module of shape (out_features, in_features).

So mathematically, what's really happening under the hood is:

$z = XW^T + b$

where $X$ is the `1 x 5` input, $W$ is the stored `8 x 5` weight matrix (transposed before the multiplication), and $b$ is the bias.
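
As a quick sanity check, the sketch below compares the output of an `nn.Linear` layer with a manual computation that multiplies the input by the transposed weight and adds the bias. The names `layer`, `x`, `manual`, and `builtin` and the seed value are illustrative choices, not part of the model above.

```python
import torch

torch.manual_seed(0)             # arbitrary seed, only for repeatability
layer = torch.nn.Linear(5, 8)    # same shape as the first layer of the model
x = torch.randn(1, 5)            # one sample with 5 features

# Manual computation: input times the transposed stored weight, plus the bias
manual = x @ layer.weight.T + layer.bias
builtin = layer(x)

print(layer.weight.shape)                           # torch.Size([8, 5])
print(manual.shape)                                 # torch.Size([1, 8])
print(torch.allclose(manual, builtin, atol=1e-6))   # True
```

Storing one row per output neuron means each row of `layer.weight` holds the 5 incoming weights of a single neuron.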

 

What doesn't change is that the weights of the first layer contribute `5 * 8 = 40` parameters (or `8 * 5 = 40`, the order doesn't matter), along with an extra `8` parameters from the bias, one for each output feature.

In [None]:
for name, param in first_layer.named_parameters():
    print(f"{name}: shape={tuple(param.shape)}, num_params={param.numel()}")

weight: shape=(8, 5), num_params=40
bias: shape=(8,), num_params=8


In total, the first layer will have `40 + 8 = 48` parameters.

The same kind of calculation goes for the next two layers defined with `torch.nn.Linear`. 


The second linear layer takes in the output of the first linear layer `first_layer` (after the ReLU, which doesn't change the shape), so its input is a `1 x 8` vector. Since it needs to transform that input into a `1 x 4` vector, the matrix used in the multiplication has to be `8 x 4`. However, because `nn.Linear` stores its weights as `(out_features, in_features)`, the stored weight is the transpose of that matrix: `4 x 8`.

In [22]:
second_linear_layer = model.layers[2]
second_linear_layer.weight.shape

torch.Size([4, 8])

Since this layer also has a bias, one for each output neuron, the total number of parameters from this layer will be `4 * 8 + 4 = 36`.

In [23]:
for name, param in second_linear_layer.named_parameters():
    print(f"{name}: shape={tuple(param.shape)}, num_params={param.numel()}")

weight: shape=(4, 8), num_params=32
bias: shape=(4,), num_params=4


Since the calculation stays the same, I'll just show the code for counting the parameters in the last layer defined using `nn.Linear`.

In [24]:
third_linear_layer = model.layers[4]
third_linear_layer.weight.shape

torch.Size([2, 4])

In [25]:
for name, param in third_linear_layer.named_parameters():
    print(f"{name}: shape={tuple(param.shape)}, num_params={param.numel()}")

weight: shape=(2, 4), num_params=8
bias: shape=(2,), num_params=2


In total, we can expect to see `48 + 36 + 10 = 94` parameters in the model.

In [28]:
model_total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total number of trainable parameters in the model: {model_total_params}")

Total number of trainable parameters in the model: 94
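
The layer-by-layer arithmetic above follows a single pattern: each fully connected layer contributes `in_features * out_features` weights plus `out_features` biases. A minimal pure-Python sketch of that pattern, where the helper name `count_linear_params` is made up for illustration:

```python
def count_linear_params(layer_sizes):
    """Count weights and biases for a stack of fully connected layers.

    layer_sizes lists the width of every layer, input first,
    e.g. [5, 8, 4, 2] for the model in this notebook.
    """
    total = 0
    # Pair each layer width with the next one: (5, 8), (8, 4), (4, 2)
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weights + biases
    return total


print(count_linear_params([5, 8, 4, 2]))  # 94
```

The ReLU layers don't appear in the list because they have no parameters.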


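One thing to note here is that every parameter of an `nn.Linear` layer has `requires_grad` set to `True` by default, which is why all 94 parameters pass the `if p.requires_grad` filter above. This flag tells PyTorch to compute a gradient for the parameter during backpropagation; setting it to `False` freezes the parameter and skips that gradient computation.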
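
To see what the `requires_grad` filter in the counting code does, here is a small sketch that freezes one layer and counts again. It rebuilds the same layer stack with a bare `nn.Sequential` (called `net` here, a name chosen for this example) so that the snippet runs on its own.

```python
import torch

# Same layer sizes as MyNeuralNetwork(num_inputs=5, num_outputs=2)
net = torch.nn.Sequential(
    torch.nn.Linear(5, 8),
    torch.nn.ReLU(),
    torch.nn.Linear(8, 4),
    torch.nn.ReLU(),
    torch.nn.Linear(4, 2),
)

# Freeze the first linear layer: its 48 parameters no longer require gradients
for param in net[0].parameters():
    param.requires_grad = False

total = sum(p.numel() for p in net.parameters())
trainable = sum(p.numel() for p in net.parameters() if p.requires_grad)
print(total, trainable)  # 94 46
```

The total stays at 94, while the trainable count drops by the 48 parameters of the frozen layer.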

In [29]:
print(model.layers[0].weight)

Parameter containing:
tensor([[ 4.4285e-01, -3.0492e-01, -3.2734e-01,  3.2656e-01, -1.0983e-01],
        [-1.3585e-01, -8.8028e-02,  1.6133e-01,  2.7886e-02, -1.9292e-01],
        [ 1.3574e-01,  2.4653e-01,  7.3773e-02, -1.2550e-01,  2.4974e-01],
        [ 4.3568e-01,  3.3548e-01,  5.4901e-04,  3.9819e-01,  3.4916e-01],
        [-6.4779e-02, -3.4380e-01,  2.1437e-01, -1.7376e-01,  3.3959e-01],
        [ 4.1931e-01, -6.9882e-02,  2.9337e-01, -3.6723e-03,  7.8795e-05],
        [ 1.8392e-01,  1.5690e-01,  1.9126e-01,  2.6826e-01, -2.2186e-02],
        [-3.1236e-01,  4.4170e-01, -1.9657e-01,  9.7641e-02, -2.2728e-02]],
       requires_grad=True)


The weights are initialized with small random numbers every time you create a new model. To make the initial values reproducible, you can seed the random number generator with `torch.manual_seed` before creating the model.

In [32]:
torch.manual_seed(42)
my_model = MyNeuralNetwork(num_inputs=5, num_outputs=2)
print(my_model.layers[0].weight)

Parameter containing:
tensor([[ 0.3419,  0.3712, -0.1048,  0.4108, -0.0980],
        [ 0.0902, -0.2177,  0.2626,  0.3942, -0.3281],
        [ 0.3887,  0.0837,  0.3304,  0.0606,  0.2156],
        [-0.0631,  0.3448,  0.0661, -0.2088,  0.1140],
        [-0.2060, -0.0524, -0.1816,  0.2967, -0.3530],
        [-0.2062, -0.1263, -0.2689,  0.0422, -0.4417],
        [ 0.4039, -0.3799,  0.3453,  0.0744, -0.1452],
        [ 0.2764,  0.0697,  0.3613,  0.0489, -0.1410]], requires_grad=True)
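
A short sketch of what seeding buys you: two layers created right after the same seed start with identical weights, while a layer created without reseeding does not. The layer names below are illustrative.

```python
import torch

torch.manual_seed(42)
layer_a = torch.nn.Linear(5, 8)

torch.manual_seed(42)            # reset the generator to the same state
layer_b = torch.nn.Linear(5, 8)

layer_c = torch.nn.Linear(5, 8)  # created without reseeding

print(torch.equal(layer_a.weight, layer_b.weight))  # True
print(torch.equal(layer_a.weight, layer_c.weight))  # False
```

Note that the seed has to be set immediately before the model is created, since every random call in between advances the generator.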


## Forward pass


With a model initialized, you can generate output vectors by feeding the model with input vectors. This "feeding" of the input vector that results in an output is called the *forward pass*. Since `my_model` takes in 5 features as its input, the input will be a randomly generated `1 x 5` matrix. 

In [37]:
torch.manual_seed(42)
X = torch.randn((1, 5))
X, X.shape

(tensor([[ 0.3367,  0.1288,  0.2345,  0.2303, -1.1229]]), torch.Size([1, 5]))

In [39]:
forward_output = my_model(X)
forward_output, forward_output.shape

(tensor([[0.3806, 0.3704]], grad_fn=<AddmmBackward0>), torch.Size([1, 2]))

When you pass the input to the model like `my_model(X)`, this will automatically call the `forward()` method that was defined when we first defined the model class. 

The output tensor has a `grad_fn` attribute, which records the last operation that produced it. In the cell above, that operation is `Addmm`: a matrix multiplication (`mm`) followed by an addition (`Add`), which is exactly what the final linear layer does. PyTorch uses this information when we optimize the model's parameters with backpropagation, where it handles the gradient calculation for you.
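
As a small illustration of how `grad_fn` gets used, the sketch below runs a backward pass through a single linear layer. Summing the output is just a stand-in for a real loss function, and the names `layer`, `x`, and `out` are chosen for this example only.

```python
import torch

torch.manual_seed(42)
layer = torch.nn.Linear(5, 2)
x = torch.randn(1, 5)

out = layer(x)
print(out.grad_fn is not None)   # True: the output is part of a computation graph
print(layer.weight.grad)         # None: no backward pass has run yet

# Stand-in for a loss: reduce the output to a single number, then backpropagate
out.sum().backward()
print(layer.weight.grad.shape)   # torch.Size([2, 5]), same shape as the weight
```

After `backward()`, every parameter with `requires_grad=True` holds its gradient in `.grad`, which is what an optimizer would read to update the weights.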

However, if you don't intend to perform backprop (e.g., you have a finalized model that you only use for predictions), you can run the forward pass inside a `with torch.no_grad():` block. PyTorch then skips the bookkeeping needed for gradients, which saves memory and computation.

In [40]:
with torch.no_grad():
    forward_output_no_grad = my_model(X)
    print(forward_output_no_grad, forward_output_no_grad.shape)

tensor([[0.3806, 0.3704]]) torch.Size([1, 2])
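
The difference between the two forward passes can also be checked directly on the output tensors. This sketch uses a single stand-in layer rather than the full model so that it runs on its own.

```python
import torch

layer = torch.nn.Linear(5, 2)
x = torch.randn(1, 5)

tracked = layer(x)               # normal forward pass
with torch.no_grad():
    untracked = layer(x)         # forward pass without gradient tracking

print(tracked.requires_grad, tracked.grad_fn is not None)  # True True
print(untracked.requires_grad, untracked.grad_fn)          # False None
print(torch.allclose(tracked, untracked))                  # True: same values
```

The numbers are identical; only the gradient bookkeeping is missing from the second tensor.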


## Logits

We have the outputs, now what? 
Well, if you look at the output tensor, it has two numbers, one per output category. Each number can be translated into a probability: the probability of the input belonging to the corresponding category. To get there, the output tensor needs to go through a `softmax` operation, a mathematical function that converts a vector of numbers into a probability distribution. As a result, the numbers in `softmax`'s output are all between 0 and 1 and sum to 1.

In [41]:
with torch.no_grad():
    forward_output_no_grad = my_model(X)
    softmax_output = torch.nn.functional.softmax(forward_output_no_grad, dim=1)
    print(softmax_output, softmax_output.shape)    

tensor([[0.5025, 0.4975]]) torch.Size([1, 2])
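
To make the softmax step less of a black box, here is a sketch that computes it by hand: exponentiate each logit, then divide by the sum of the exponentials. The logits are the rounded values printed earlier, typed in manually, so the result only matches the output above approximately.

```python
import torch

# Rounded logits copied from the forward pass output above
logits = torch.tensor([[0.3806, 0.3704]])

# softmax(z)_i = exp(z_i) / sum_j exp(z_j), computed along the category dimension
exps = torch.exp(logits)
manual = exps / exps.sum(dim=1, keepdim=True)

builtin = torch.softmax(logits, dim=1)

print(manual)                            # close to tensor([[0.5025, 0.4975]])
print(torch.allclose(manual, builtin))   # True
print(manual.sum(dim=1))                 # sums to 1 up to floating-point rounding
```

In practice the built-in function is preferable, since it is written to stay numerically stable for large logits.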


One interesting thing to note here is that the two probabilities are almost 50:50. The model hasn't been trained yet, so its randomly initialized weights carry no information about which category the input belongs to.