
# Layers and Modules


In [None]:
import torch
from torch import nn 
from torch.nn import functional as F

net = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))

X = torch.rand(2, 20)
net(X).shape



torch.Size([2, 10])

In [None]:
net.__call__(X).shape

torch.Size([2, 10])

## A Custom Module

Basic functionality of a module:

1 - Ingest input data as arguments to its forward propagation method. 

2 - Generate an output from the input passed to it, at the end of the forward propagation computation. 

3 - Calculate the gradient of its output with respect to its input. This is usually handled automatically by backpropagation. 

4 - Store and provide access to the parameters that are necessary for the forward propagation (weights and biases). 

5 - Initialize model parameters as needed. 
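
As a minimal sketch of these responsibilities, the snippet below uses the built-in `nn.Linear` with arbitrary sizes (4 inputs, 3 outputs, chosen only for illustration). Note that no backward code is written anywhere; autograd derives it from the forward computation.

```python
import torch
from torch import nn

# Explicit sizes so that the parameters exist right away
layer = nn.Linear(4, 3)

# The module stores its parameters and exposes them by name
param_shapes = {name: tuple(p.shape) for name, p in layer.named_parameters()}
print(param_shapes)  # {'weight': (3, 4), 'bias': (3,)}

# Calling the module runs forward() on the input and returns the output
X = torch.rand(2, 4, requires_grad=True)
out = layer(X)
print(tuple(out.shape))  # (2, 3)

# The gradient of the output with respect to the input comes from autograd
out.sum().backward()
print(tuple(X.grad.shape))  # (2, 4)
```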

In [None]:
class MLP(nn.Module):
  def __init__(self):
    # Calling the constructor of the parent class nn.Module to perform the necessary initialization
    super().__init__()
    self.hidden = nn.LazyLinear(256)
    self.out = nn.LazyLinear(10)
  def forward(self, X):
    return self.out(F.relu(self.hidden(X)))

In [None]:
net = MLP()
net(X).shape



torch.Size([2, 10])

## The Sequential Module

We can build our own simplified version of `Sequential` by providing two things:

1 - A method to append modules one by one to a list

2 - A forward propagation method to pass an input through the chain of modules

In [None]:
class MySequential(nn.Module):
  def __init__(self, *args):
    super().__init__()
    for idx, module  in enumerate(args):
      self.add_module(str(idx), module)
    
  def forward(self, X):
    for module in self.children():
      X = module(X)
    return X

In [None]:
net = MySequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))
net(X).shape



torch.Size([2, 10])
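
The reason for going through `add_module` rather than a plain Python list can be sketched as follows. `ListOnly` is a hypothetical class written only for contrast, and explicit `nn.Linear` sizes are used so that the parameters exist immediately.

```python
import torch
from torch import nn

class MySequential(nn.Module):
  def __init__(self, *args):
    super().__init__()
    for idx, module in enumerate(args):
      # add_module registers the child so PyTorch can track its parameters
      self.add_module(str(idx), module)

  def forward(self, X):
    for module in self.children():
      X = module(X)
    return X

class ListOnly(nn.Module):
  # Hypothetical counter-example: a plain list is NOT registered
  def __init__(self, *args):
    super().__init__()
    self.layers = list(args)

net = MySequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))
child_names = [name for name, _ in net.named_children()]
param_names = [name for name, _ in net.named_parameters()]
print(child_names)  # ['0', '1', '2']
print(param_names)  # ['0.weight', '0.bias', '2.weight', '2.bias']

unregistered = ListOnly(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))
num_params = len(list(unregistered.parameters()))
print(num_params)  # 0
```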

## Executing code in the forward propagation method

`Sequential` is less helpful when we want to include Python control flow in the forward propagation, or apply arbitrary mathematical operations to the outputs of layers instead of relying only on predefined network layers. 

We may also want to use constant parameters: values that are neither the result of previous layers nor updatable parameters. 

The following MLP does both:


In [None]:
class FixedHiddenMLP(nn.Module):
  def __init__(self):
    super().__init__()
    self.rand_weight = torch.rand((20, 20))
    self.linear = nn.LazyLinear(20)
  def forward(self, X):
    X = self.linear(X)
    X = F.relu(X @ self.rand_weight + 1)
    # Reuse the fully connected layer. This is equivalent to sharing parameters between two fully connected layers
    X = self.linear(X)
    # Control flow like this is unlikely in a real network; it only showcases what a custom class can do that Sequential() cannot
    while X.abs().sum() > 1:
      X /= 2
    return X.sum()

In [None]:
net = FixedHiddenMLP()
net(X)



tensor(0.0711, grad_fn=<SumBackward0>)
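
To check that a constant tensor like `rand_weight` really stays out of training, we can look at what the module reports as its parameters. The sketch below also shows `register_buffer`, which is an alternative way to store constants so that they are saved in the `state_dict`. It is not what the class above does, and `FixedWeight` and the buffer name `fixed` are hypothetical.

```python
import torch
from torch import nn

class FixedWeight(nn.Module):
  def __init__(self):
    super().__init__()
    # Plain tensor attribute: not a parameter, not saved in state_dict
    self.rand_weight = torch.rand(20, 20)
    # Buffer: still not a parameter, but saved in state_dict
    self.register_buffer("fixed", torch.rand(20, 20))
    self.linear = nn.Linear(20, 20)

m = FixedWeight()
param_names = [name for name, _ in m.named_parameters()]
state_keys = set(m.state_dict().keys())
print(param_names)  # ['linear.weight', 'linear.bias']
print("fixed" in state_keys, "rand_weight" in state_keys)  # True False
```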

We can also use `Sequential` inside a custom class; in other words, modules can be nested. 

In [None]:
class NestMLP(nn.Module):
  def __init__(self):
    super().__init__()
    self.net = nn.Sequential(nn.LazyLinear(64), nn.ReLU(), nn.LazyLinear(32), nn.ReLU())
    self.linear = nn.LazyLinear(16)

  def forward(self, X):
    return self.linear(self.net(X))

In [None]:
chimera = nn.Sequential(NestMLP(), nn.LazyLinear(20), FixedHiddenMLP())
chimera(X)



tensor(0.3221, grad_fn=<SumBackward0>)

## Parameter Management

Sometimes we need to access the parameters of a model directly: to save them to disk, to control the initialization of a complex model ourselves instead of leaving it to the library, or for similar reasons. 

In [None]:
net = nn.Sequential(nn.LazyLinear(8), nn.ReLU(), nn.LazyLinear(1))
X = torch.rand(size = (2, 4))
net(X).shape



torch.Size([2, 1])

### Parameter Access

Layers of a `Sequential` model can be accessed by index, and each layer's parameters are available through its attributes, for example via `state_dict()`. 

In [None]:
net[2].state_dict()

OrderedDict([('weight',
              tensor([[ 0.0707,  0.0520, -0.2712, -0.2416, -0.2818, -0.3370,  0.2806, -0.0592]])),
             ('bias', tensor([-0.1263]))])

#### Targeted Parameters

Parameters are complex objects containing values, gradients, and additional information. When asked for a parameter, PyTorch returns a `Parameter` object, so we need to request `.data` explicitly to get the underlying numerical values. 

In [None]:
type(net[2].weight), net[2].weight.data

(torch.nn.parameter.Parameter,
 tensor([[ 0.0707,  0.0520, -0.2712, -0.2416, -0.2818, -0.3370,  0.2806, -0.0592]]))

Since backpropagation has not been run for this network yet, `grad` is still `None`. 

In [None]:
net[2].bias.grad is None

True
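
For contrast, here is a small sketch of what happens once backpropagation does run. It rebuilds the same architecture with explicit `nn.Linear` sizes so that the snippet is self-contained. Summing the outputs is just a convenient stand-in for a loss.

```python
import torch
from torch import nn

net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
X = torch.rand(2, 4)

grad_before = net[2].bias.grad
print(grad_before is None)  # True

# Each of the 2 examples contributes 1 to d(sum of outputs)/d(bias)
net(X).sum().backward()
grad_after = net[2].bias.grad
print(grad_after)  # tensor([2.])
```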

#### All parameters at once

In [None]:
[(name, param.shape) for name, param in net.named_parameters()]

[('0.weight', torch.Size([8, 4])),
 ('0.bias', torch.Size([8])),
 ('2.weight', torch.Size([1, 8])),
 ('2.bias', torch.Size([1]))]
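
The dotted names listed by `named_parameters()` are the same keys used by the network's `state_dict()`, which gives another way to reach a specific tensor. The sketch below assumes the same architecture, built with explicit `nn.Linear` sizes so that it runs on its own.

```python
import torch
from torch import nn

net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

# Look up a parameter by its dotted name
bias_by_name = net.state_dict()["2.bias"]
same_values = torch.equal(bias_by_name, net[2].bias.data)
print(same_values)  # True

# Total number of scalar parameters: 4*8 + 8 + 8*1 + 1
total = sum(p.numel() for p in net.parameters())
print(total)  # 49
```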

### Tied Parameters/Weight Sharing

In [None]:
shared = nn.LazyLinear(8)
net = nn.Sequential(nn.LazyLinear(8), nn.ReLU(), shared, nn.ReLU(), shared, nn.ReLU(), nn.LazyLinear(1))
net(X)



tensor([[-0.0531],
        [-0.0540]], grad_fn=<AddmmBackward0>)

In [None]:
print(net[2].weight.data[0] == net[4].weight.data[0])
net[2].weight.data[0, 0] = 100
# Since they are the same object, a change made through one is visible through the other
print(net[2].weight.data[0] == net[4].weight.data[0])

tensor([True, True, True, True, True, True, True, True])
tensor([True, True, True, True, True, True, True, True])


Since the parameters are shared, the gradients from every place the layer is used are added together during backpropagation. 
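
A tiny worked example makes the accumulation visible. Assume a single shared weight w = 2 with no bias, applied twice to x = 3, so that y = w * (w * x) and dy/dw = 2wx = 12. The sizes and values are chosen only for illustration.

```python
import torch
from torch import nn

shared = nn.Linear(1, 1, bias=False)
with torch.no_grad():
  shared.weight.fill_(2.0)

# The same layer object appears twice in the network
net = nn.Sequential(shared, shared)
x = torch.tensor([[3.0]])

y = net(x)
y.sum().backward()

# Both uses contribute 6, and the contributions are summed
grad = shared.weight.grad.item()
print(grad)  # 12.0

# The shared weight is counted once, not twice
num_params = len(list(net.parameters()))
print(num_params)  # 1
```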

## Parameter Initialization

Most deep learning frameworks initialize parameters randomly by default. Sometimes, however, we want to initialize them according to a specific scheme or set them manually.

In [1]:
import torch
from torch import nn

In [2]:
net = nn.Sequential(nn.LazyLinear(8), nn.ReLU(), nn.LazyLinear(1))
X = torch.rand(size = (2, 4))
net(X).shape



torch.Size([2, 1])

### Built In Initialization

Initializing the weights as Gaussian random variables with standard deviation 0.01 and setting the biases to zero. Because the forward pass above has already run, each `nn.LazyLinear` has been converted to a regular `nn.Linear`, so that is the type the function has to check for. 

In [3]:
def init_normal(module):
  # Materialized lazy layers are instances of nn.Linear
  if isinstance(module, nn.Linear):
    nn.init.normal_(module.weight, mean = 0, std = .01)
    nn.init.zeros_(module.bias)

# .apply() calls the given function on every submodule of the network, recursively
net.apply(init_normal)
# The weights are now small random values and the bias is exactly zero
net[0].weight.data[0], net[0].bias.data[0]

Initializing all parameters to zero. 

In [4]:
def init_zero(module):
  # Materialized lazy layers are instances of nn.Linear
  if isinstance(module, nn.Linear):
    nn.init.constant_(module.weight, 0)
    nn.init.zeros_(module.bias)

net.apply(init_zero)
net[0].weight.data[0], net[0].bias.data[0]

(tensor([0., 0., 0., 0.]), tensor(0.))

We can also apply different initializers to different blocks. 

Initializing the first layer with the Xavier initializer and the second layer to a constant of 42:

In [5]:
def xavier_init(module):
  # Materialized lazy layers are instances of nn.Linear
  if isinstance(module, nn.Linear):
    nn.init.xavier_uniform_(module.weight)

def init_42(module):
  if isinstance(module, nn.Linear):
    nn.init.constant_(module.weight, 42)

net[0].apply(xavier_init)
net[2].apply(init_42)
# The first row holds Xavier-uniform values; every entry of the second row equals 42
net[0].weight.data[0], net[2].weight.data[0]

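Sometimes we may also want completely customized initializations. Here each weight is drawn from a uniform distribution on [-10, 10] and then set to zero if its absolute value is below 5. 

In [6]:
def custom_init(module):
  # Materialized lazy layers are instances of nn.Linear
  if isinstance(module, nn.Linear):
    print("init ", *[(name, param.shape) for name, param in module.named_parameters()][0])
    nn.init.uniform_(module.weight,  -10, 10)
    # Keep only the weights whose magnitude is at least 5; zero out the rest
    module.weight.data *= module.weight.data.abs() >= 5

net.apply(custom_init)
net[0].weight.data[0]

In [7]:
# Parameters can also be set directly through .data
net[0].weight.data[:] += 1
net[0].weight.data[0, 0] = 42
# The first entry of this row is now 42; the others are the initialized values plus 1
net[0].weight.data[0]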

### Lazy Initialization

In the cells above, we never specified the input dimensions of any layer, yet we were still able to perform a forward pass. The framework infers the missing dimensions on the fly, the first time data passes through the network; this is called deferred (lazy) initialization. 

Later on, when working with CNNs, this technique will become even more convenient, since the input dimensionality affects the dimensionality of each subsequent layer. 

In [1]:
!pip install d2l


In [9]:
import torch
from torch import nn 
from d2l import torch as d2l

In [10]:
net = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))



At this point, we cannot access the weights, because the framework does not know the dimensions of the layers yet. 

In [11]:
net[0].weight

<UninitializedParameter>

In [12]:
X = torch.rand(2, 20)
net(X)

net[0].weight.shape

torch.Size([256, 20])
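
The transition can be checked programmatically. As a sketch: before the first forward pass the weight is an `UninitializedParameter`, and afterwards PyTorch has swapped the layer's class to the regular `nn.Linear`. This is why type checks in initialization functions should test for `nn.Linear` once data has passed through.

```python
import torch
from torch import nn
from torch.nn.parameter import UninitializedParameter

net = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))

type_before = type(net[0]).__name__
was_uninitialized = isinstance(net[0].weight, UninitializedParameter)
print(type_before, was_uninitialized)  # LazyLinear True

# The first forward pass lets each lazy layer infer its input size
net(torch.rand(2, 20))

type_after = type(net[0]).__name__
shape_after = tuple(net[0].weight.shape)
print(type_after, shape_after)  # Linear (256, 20)
```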

In [14]:
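def apply_init(self, inputs, init = None):
  # Written as a method: self is assumed to be a model that keeps its layers in self.net
  # Run one forward pass so that lazy layers can infer their shapes
  self.forward(*inputs)
  # Then apply the initializer (if any) to every submodule
  if init is not None:
    self.net.apply(init)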
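
The same idea can be sketched as a standalone function that takes the network as an ordinary argument. This is a hypothetical variant for illustration, not the method above. It runs a dummy forward pass to materialize the lazy layers and then applies the initializer.

```python
import torch
from torch import nn

def apply_init(net, inputs, init=None):
  # Hypothetical standalone variant: the forward pass materializes lazy layers
  net(*inputs)
  if init is not None:
    net.apply(init)

def init_normal(module):
  # After the forward pass, lazy layers are instances of nn.Linear
  if isinstance(module, nn.Linear):
    nn.init.normal_(module.weight, mean=0, std=0.01)
    nn.init.zeros_(module.bias)

net = nn.Sequential(nn.LazyLinear(8), nn.ReLU(), nn.LazyLinear(1))
apply_init(net, [torch.rand(2, 4)], init_normal)

bias_is_zero = bool((net[0].bias == 0).all())
print(tuple(net[0].weight.shape), bias_is_zero)  # (8, 4) True
```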

## Custom Layers

### Layers without parameters

The following layer simply subtracts the mean from its input. We only need to inherit from the base `nn.Module` class and implement the forward propagation function. 

In [2]:
import torch
from torch import nn 
from torch.nn import functional as F 
from d2l import torch as d2l 

In [3]:
class CenteredLayer(nn.Module):
  def __init__(self):
    super().__init__()
  def forward(self, X):
    return X - X.mean()

In [4]:
layer = CenteredLayer()
layer(torch.tensor([1, 2, 3, 4], dtype = torch.float32))

tensor([-1.5000, -0.5000,  0.5000,  1.5000])

Now we can use this layer in a neural net. 

In [5]:
net = nn.Sequential(nn.LazyLinear(10), CenteredLayer())
X = torch.rand((15, 10))
Y = net(X)
Y.mean()



tensor(-3.1789e-09, grad_fn=<MeanBackward0>)
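
`CenteredLayer` subtracts a single mean computed over all elements, which is why `Y.mean()` is zero only up to floating point error. As a variation, a layer could instead center each feature column separately. `ColumnCenteredLayer` below is a hypothetical name for that sketch.

```python
import torch
from torch import nn

class ColumnCenteredLayer(nn.Module):
  # Hypothetical variant: subtract the per-column mean over the batch
  def forward(self, X):
    return X - X.mean(dim=0, keepdim=True)

X = torch.rand(15, 10)
Y = ColumnCenteredLayer()(X)

# Every column mean should now be zero up to floating point error
max_col_mean = Y.mean(dim=0).abs().max().item()
print(max_col_mean < 1e-6)  # True

# Like CenteredLayer, this layer has no parameters
num_params = len(list(ColumnCenteredLayer().parameters()))
print(num_params)  # 0
```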

### Layers with Parameters

In [6]:
class MyLinear(nn.Module):
  def __init__(self, in_units, units):
    super().__init__()
    self.weight = nn.Parameter(torch.randn(in_units, units))
    self.bias = nn.Parameter(torch.randn(units, ))
  
  def forward(self, X):
    linear = torch.matmul(X, self.weight.data) + self.bias.data
    return F.relu(linear)

In [7]:
linear = MyLinear(5, 3)
linear.weight

Parameter containing:
tensor([[-0.8844,  1.2134,  0.8407],
        [ 0.1381, -0.1755,  1.6680],
        [-2.4763, -0.1815,  0.0935],
        [-0.8110,  1.1586, -0.9401],
        [ 1.6361,  1.8363,  0.4628]], requires_grad=True)

In [8]:
linear(torch.rand(2, 5))

tensor([[0.0000, 1.7787, 2.5256],
        [0.0000, 1.2261, 1.3980]])

In [15]:
net = nn.Sequential(MyLinear(64, 8), MyLinear(8, 1))
net(torch.rand(2, 64))

tensor([[0.],
        [0.]])
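
Note that the outputs above carry no `grad_fn`. Because `forward` reads `self.weight.data` and `self.bias.data`, the computation is detached from autograd and the parameters receive no gradients. A sketch of a trainable variant, under the hypothetical name `MyLinearTrainable`, uses the parameters directly:

```python
import torch
from torch import nn
from torch.nn import functional as F

class MyLinear(nn.Module):
  # The layer as defined above, reading .data in forward
  def __init__(self, in_units, units):
    super().__init__()
    self.weight = nn.Parameter(torch.randn(in_units, units))
    self.bias = nn.Parameter(torch.randn(units))

  def forward(self, X):
    return F.relu(torch.matmul(X, self.weight.data) + self.bias.data)

class MyLinearTrainable(MyLinear):
  # Hypothetical variant: use the parameters themselves so autograd tracks them
  def forward(self, X):
    return F.relu(torch.matmul(X, self.weight) + self.bias)

X = torch.rand(2, 5)

detached_out = MyLinear(5, 3)(X)
print(detached_out.requires_grad)  # False

layer = MyLinearTrainable(5, 3)
out = layer(X)
out.sum().backward()
has_grad = layer.weight.grad is not None
print(out.requires_grad, has_grad)  # True True
```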