# Layers and Blocks 

In [1]:
import torch

In [2]:
from torch import nn
from torch.autograd import Variable
import torch.nn.functional as F

In [3]:
net = nn.Sequential(nn.Linear(20, 256),nn.ReLU(),nn.Linear(256, 10))

Blocks are combinations of one or more layers.The above code generates a network with a hidden layer of 256 units, followed by a ReLU activation and another 10 units governing the output. In particular, we used the nn.Sequential constructor to generate an empty network into which we then inserted both layers.

In the following we will explain the various steps needed to go from defining layers to defining blocks (of one or more layers). To get started, we need a bit of reasoning about software. For most intents and purposes a block behaves very much like a fancy layer. That is, it provides the following functionality:

It needs to ingest data (the input).

It needs to produce a meaningful output. This is typically encoded in what we will call the forward function. It allows us to invoke a block via net(X) to obtain the desired output. What happens behind the scenes is that it invokes forward to perform forward propagation.

It needs to produce a gradient with regard to its input when invoking backward. Typically this is automatic.

It needs to store parameters that are inherent to the block. For instance, the block above contains two hidden layers, and we need a place to store parameters for it.

Obviously it also needs to initialize these parameters as needed

## A Custom Block

The nn.Module class provides the functionality required for much of what we need. It is a model constructor provided in the nn module, which we can inherit to define the model we want. The following inherits the Block class to construct the multilayer perceptron mentioned at the beginning of this section. The MLP class defined here overrides the __init__ and forward functions of the Block class. They are used to create model parameters and define forward computations, respectively. Forward computation is also forward propagation.

In [4]:
class MLP(nn.Module):
    # Declare a layer with model parameters. Here, we declare two fully
    # connected layers
    def __init__(self, **kwargs):
        # Call the constructor of the MLP parent class Module to perform the
        # necessary initialization. In this way, other function parameters can
        # also be specified when constructing an instance, such as the model
        # parameter, params, described in the following sections
        super(MLP, self).__init__(**kwargs)
        self.hidden = nn.Sequential(nn.Linear(20,256),nn.ReLU())  # Hidden layer
        self.output = nn.Linear(256,10)  # Output layer

    # Define the forward computation of the model, that is, how to return the
    # required model output based on the input x
    def forward(self, x):
        return self.output(self.hidden(x))

Let’s look at it a bit more closely. The forward method invokes a network simply by evaluating the hidden layer self.hidden(x) and subsequently by evaluating the output layer self.output( ... ). This is what we expect in the forward pass of this block.

In order for the block to know what it needs to evaluate, we first need to define the layers. This is what the __init__ method does. It first initializes all of the Block-related parameters and then constructs the requisite layers. This attached the corresponding layers and the required parameters to the class. Note that there is no need to define a backpropagation method in the class. The system automatically generates the backward method needed for back propagation by automatically finding the gradient. The same applies to the initialize method, which is generated automatically. Let’s try this out:

In [5]:
net1 = MLP()
x = torch.randn(2,20)
net1(x)

tensor([[ 0.1308, -0.1152,  0.1580,  0.0634,  0.3933,  0.2577,  0.0927, -0.0791,
         -0.2182,  0.0789],
        [-0.0340,  0.0796,  0.1644, -0.3716, -0.1382, -0.0067, -0.1728,  0.3552,
         -0.3119,  0.2546]], grad_fn=<AddmmBackward>)

## A Sequential Block

The Block class is a generic component describing dataflow. In fact, the Sequential class is derived from the Block class: when the forward computation of the model is a simple concatenation of computations for each layer, we can define the model in a much simpler way. The purpose of the Sequential class is to provide some useful convenience functions. In particular, the add method allows us to add concatenated Block subclass instances one by one, while the forward computation of the model is to compute these instances one by one in the order of addition. Below, we implement a MySequential class that has the same functionality as the Sequential class. This may help you understand more clearly how the Sequential class works.

In [6]:
class MySequential(nn.Sequential):
    def __init__(self, **kwargs):
        super(MySequential, self).__init__(**kwargs)

    def add_module(self, block):
        # Here, block is an instance of a Block subclass, and we assume it has  
        # a unique name. We save it in the member variable _children of the
        # Block class, and its type is OrderedDict. When the MySequential
        # instance calls the initialize function, the system automatically
        # initializes all members of _children
        self._modules[block] = block
        

    def forward(self, x):
        # OrderedDict guarantees that members will be traversed in the order
        # they were added
        for block in self._modules.values():
            x = block(x)
        return x

At its core is the add method. It adds any block to the ordered dictionary of children. These are then executed in sequence when forward propagation is invoked. Let’s see what the MLP looks like now.

In [7]:
net = MySequential()
net.add_module(nn.Linear(20,256))
net.add_module(nn.ReLU())
net.add_module(nn.Linear(256,10))
x = torch.randn(2,20)
net(x)

tensor([[-0.4160, -0.5861,  0.2552,  0.0811,  0.5112, -0.2685, -0.0360,  0.0191,
          0.4249,  0.1650],
        [-0.0161, -0.2798,  0.4258, -0.0046, -0.1093, -0.3125, -0.1284, -0.0041,
         -0.0086, -0.0243]], grad_fn=<AddmmBackward>)

 # Blocks with Code

Although the Sequential class can make model construction easier, and you do not need to define the forward method, directly inheriting the Block class can greatly expand the flexibility of model construction. In particular, we will use Python’s control flow within the forward method. While we’re at it, we need to introduce another concept, that of the constant parameter. These are parameters that are not used when invoking backprop. This sounds very abstract but here’s what’s really going on. Assume that we have some function

(5.1.1)
f(x,w)=3⋅w⊤x.
 
In this case 3 is a constant parameter. We could change 3 to something else, say  c  via

(5.1.2)
f(x,w)=c⋅w⊤x.

In [8]:
class FancyMLP(nn.Sequential):
    def __init__(self, **kwargs):
        super(FancyMLP, self).__init__(**kwargs)
        
        # Random weight parameters created with the get_constant are not
#         # iterated during training (i.e. constant parameters)
#           self.rand_weight = self.parameters.get_constant(
#                                           'rand_weight', nd.random.uniform(shape=(20, 20)))


      # in pytorch you can use any module block if incoming input shape is same with defined block
    
        
        self.dense1=nn.Sequential(nn.Linear(20,20),nn.ReLU())
        
        self.rand_weight=nn.Parameter(torch.empty(20,20).uniform_(0, 1) )
    
        self.dense = nn.Sequential(nn.Linear(20,256),nn.ReLU())
        
        self.register_buffer('random_weights', self.rand_weight)
        
    def forward(self, x):
        x = self.dense1(x)
        # Use the constant parameters created, as well as the relu and dot
        # functions of NDArray
        print(x.shape)
        # in pytorch dot product is for 1d tensors 
        # for 2d tensors use torch.mm or torch.bmm
        x = F.relu(torch.mm(x, Variable(self.random_weights).data) + 1)
        # Reuse the fully connected layer. This is equivalent to sharing
        # parameters with two fully connected layers
        x = self.dense(x)
        # Here in Control flow, we need to call asscalar to return the scalar
        # for comparison
        while x.norm().item() > 1:
            x /= 2
        if x.norm().item() < 0.8:
            x *= 10
        return x.sum()
        

In this FancyMLP model, we used constant weight Rand_weight (note that it is not a model parameter), performed a matrix multiplication operation (nd.dot<), and reused the same Dense layer. Note that this is very different from using two dense layers with different sets of parameters. Instead, we used the same network twice. Quite often in deep networks one also says that the parameters are tied when one wants to express that multiple parts of a network share the same parameters. Let’s see what happens if we construct it and feed data through it.

In [9]:
net = FancyMLP()
x = torch.randn(2,20)
net(x)

torch.Size([2, 20])


tensor(11.3613, grad_fn=<SumBackward0>)

There’s no reason why we couldn’t mix and match these ways of build a network. Obviously the example below resembles more a chimera, or less charitably, a Rube Goldberg Machine. That said, it combines examples for building a block from individual blocks, which in turn, may be blocks themselves. Furthermore, we can even combine multiple strategies inside the same forward function. To demonstrate this, here’s the network.

In [22]:
class NestMLP(nn.Sequential):
    def __init__(self, **kwargs):
        super(NestMLP, self).__init__(**kwargs)
        self.net = nn.Sequential(nn.Linear(20,64),nn.ReLU(),nn.Linear(64, 32),nn.ReLU())
        self.dense = nn.Sequential(nn.Linear(32,20),nn.ReLU())

    def forward(self, x):
        return self.dense(self.net(x))

chimera = nn.Sequential()
chimera.add_module("Linear1",nn.Linear(20,20))
chimera.add_module("NestNlp",NestMLP())
chimera.add_module("FancyMLP",FancyMLP())

chimera(x)

torch.Size([2, 20])


tensor(96.3867, grad_fn=<SumBackward0>)

# Compilation 

The avid reader is probably starting to worry about the efficiency of this. After all, we have lots of dictionary lookups, code execution, and lots of other Pythonic things going on in what is supposed to be a high performance deep learning library. The problems of Python’s Global Interpreter Lock are well known. In the context of deep learning it means that we have a super fast GPU (or multiple of them) which might have to wait until a puny single CPU core running Python gets a chance to tell it what to do next. This is clearly awful and there are many ways around it. The best way to speed up Python is by avoiding it altogether.

Pytorch does this by allowing for Hybridization (Section 11.1). In it, the Python interpreter executes the block the first time it’s invoked. The Pytorch runtime records what is happening and the next time around it short circuits any calls to Python. This can accelerate things considerably in some cases but care needs to be taken with control flow. We suggest that the interested reader skip forward to the section covering hybridization and compilation after finishing the current chapter.

# Summary

1.Layers are blocks
2.Many layers can be a block
3.Many blocks can be a block
4.Code can be a block
5.Blocks take are of a lot of housekeeping, such as parameter initialization, backprop and related issues.
Sequential concatenations of layers and blocks are handled by the eponymous Sequential block.

## Exercise

1.What kind of error message will you get when calling an __init__ method whose parent class not in the __init__ function of the parent class?

2.What kinds of problems will occur if you remove the item function in the FancyMLP class?

3.What kinds of problems will occur if you change self.net defined by the Sequential instance in the NestMLP class to self.net = [nn.Sequential(nn.Linear(64,256),nn.ReLU()), nn.Sequential(nn.Linear(32,256),nn.ReLU())]?

4.Implement a block that takes two blocks as an argument, say net1 and net2 and returns the concatenated output of both networks in the forward pass (this is also called a parallel block).

5.Assume that you want to concatenate multiple instances of the same network. Implement a factory function that generates multiple instances of the same block and build a larger network from it.