# Chapter 6: Builder's Guide

### Dhuvi Karthikeyan

2/07/2023

## 6.1 Layers and Modules

Organizational Hierarchy:

* Module (Organized and repeating blocks of layers): Can describe a single layer, multiple layers, or the entire model itself. It is an abstraction of a functional unit of a neural network.
    * Layer: Consists of multiple neurons. Neurons in the same layer receive their input at the same step of information propagation. A layer takes a set of inputs and returns a set of outputs.
        * Neuron: Takes a set of inputs and returns a deterministic scalar output based on internal state variables (parameters)
        
**In practice:**

* Module is a class-level entity:
    1. Must take input data for forward pass
    2. Must generate output using forward call
    3. Must compute gradients of its output with respect to its input and parameters (handled by autograd)
    4. Must initialize model parameters robustly as needed
    5. Must store parameters and make them accessible and updateable
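The five requirements above can be seen in a very small module. This is only a sketch: the class name `TinyMLP`, the layer sizes, and the variable names are made up for illustration and are not from the book.

```python
import torch
from torch import nn
from torch.nn import functional as F


class TinyMLP(nn.Module):  # hypothetical name, for illustration only
    def __init__(self):
        super().__init__()
        # 4./5. Layers create, initialize, and register their parameters here
        self.hidden = nn.Linear(4, 8)
        self.out = nn.Linear(8, 1)

    def forward(self, X):
        # 1./2. Take the input data and produce the output
        return self.out(F.relu(self.hidden(X)))


tiny = TinyMLP()
X_demo = torch.rand(2, 4, requires_grad=True)
y_demo = tiny(X_demo)      # calling the module dispatches to forward()
y_demo.sum().backward()    # 3. autograd computes the gradients

param_names = [name for name, _ in tiny.named_parameters()]
print(y_demo.shape)        # torch.Size([2, 1])
print(param_names)
print(X_demo.grad.shape, tiny.hidden.weight.grad.shape)
```

Nothing in `TinyMLP` mentions gradients: defining `forward` is enough, and autograd takes care of the backward pass.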

### 6.1.3 Executing Code in the Forward Propagation Method

The forward pass need not be a purely mathematical transformation. It can also include control flow:

In [2]:
import torch
import torch.nn as nn
from torch.nn import functional as F

class FixedHiddenMLP(nn.Module):
    def __init__(self):
        super().__init__()
        # Random weight parameters that will not compute gradients and
        # therefore keep constant during training
        self.rand_weight = torch.rand((20, 20))
        self.linear = nn.LazyLinear(20)

    def forward(self, X):
        X = self.linear(X)
        X = F.relu(X @ self.rand_weight + 1)
        # Reuse the fully connected layer. This is equivalent to sharing
        # parameters with two fully connected layers
        X = self.linear(X)
        # Control flow
        while X.abs().sum() > 1:
            X /= 2
        return X.sum()

How does backpropagation through the while loop work?
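One way to approach this question: autograd records only the operations that actually ran, so every executed loop iteration adds one division to the graph, and the loop condition itself is never differentiated. The toy example below is a sketch with made-up values (it is not the `FixedHiddenMLP` above) and uses out-of-place division to keep things simple.

```python
import torch

x = torch.tensor([3.0, 5.0], requires_grad=True)

y = x * 1.0              # non-leaf tensor that we can halve repeatedly
steps = 0
while y.abs().sum() > 1:
    y = y / 2            # each executed iteration adds one node to the graph
    steps += 1

y.sum().backward()
print(steps)             # 3, since 8 -> 4 -> 2 -> 1
print(x.grad)            # every entry is 1 / 2**3 = 0.125
```

The gradient is that of the unrolled computation `x / 2 / 2 / 2`. A different input could take a different number of iterations and would then get a different scale factor.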

## 6.2 Parameter Management

Most of the time one doesn't need to worry about writing parameters to memory or shuttling them between CPU and GPU. The following is a minimum viable approach to understanding parameter management for non-standard use cases.

### 6.2.1 Parameter Accesses

In [8]:
net = nn.Sequential(nn.LazyLinear(8), nn.ReLU(), nn.LazyLinear(1)) # Dummy Net
X = torch.rand(size=(2, 4))
net(X).shape

torch.Size([2, 1])

In [9]:
net[2].state_dict()

OrderedDict([('weight',
              tensor([[-0.1544,  0.1493,  0.0143,  0.1825, -0.1538,  0.1383, -0.1592,  0.2035]])),
             ('bias', tensor([0.2956]))])

In [10]:
net[2].bias.data

tensor([0.2956])

In [11]:
# Get all the parameters all at once
[(name, param.shape) for name, param in net.named_parameters()]

[('0.weight', torch.Size([8, 4])),
 ('0.bias', torch.Size([8])),
 ('2.weight', torch.Size([1, 8])),
 ('2.bias', torch.Size([1]))]

### 6.2.2 Sharing parameters

The book shares parameters by simply reusing a layer object. It is not obvious at first why this would be a good move. Sharing parameters also shares updates, but do the two uses necessarily receive the same gradients?

In [19]:
shared = nn.LazyLinear(8)
net = nn.Sequential(nn.LazyLinear(8), nn.ReLU(),
                    shared, nn.ReLU(),
                    shared, nn.ReLU(),
                    nn.LazyLinear(1))
net(X)

tensor([[0.0942],
        [0.0942]], grad_fn=<AddmmBackward0>)

In [13]:
print(net[2].weight.data[0] == net[4].weight.data[0]) # Check that they are in fact the same

tensor([True, True, True, True, True, True, True, True])


In [20]:
print(net[0].weight.data[0] == net[4].weight.data[0]) # Raises an error: net[0] rows have 4 entries (one per input feature), net[4] rows have 8

RuntimeError: The size of tensor a (4) must match the size of tensor b (8) at non-singleton dimension 0

In [24]:
net[2].weight.data[0,0] = 99 #Change a value in net[2]
net[2].weight.data

tensor([[ 9.9000e+01,  2.3116e-01, -2.0159e-01,  2.5524e-01,  3.3988e-01,
         -1.3143e-01,  9.9637e-02, -2.7399e-02],
        [ 3.5148e-01, -2.9227e-01,  1.0891e-01,  2.8594e-01, -2.7712e-01,
         -2.2687e-01,  3.1859e-02, -2.6861e-01],
        [ 2.4074e-01, -2.5918e-01,  2.1355e-01, -9.4026e-02, -1.7284e-01,
         -1.0327e-01,  2.5735e-01, -1.5697e-01],
        [ 2.0667e-01, -3.9887e-02, -1.7813e-01, -8.4779e-02, -1.9416e-01,
         -1.3511e-01, -1.8886e-01, -2.6019e-01],
        [ 3.3085e-02,  3.4689e-02,  2.8782e-01, -1.2055e-01, -2.3878e-01,
         -3.4340e-01, -2.4142e-01, -2.9684e-01],
        [-2.1810e-01,  3.2054e-01, -1.6799e-01, -6.0757e-02,  2.9624e-01,
         -2.5178e-01,  3.6727e-02,  2.9169e-01],
        [ 3.4839e-01, -7.2290e-02, -2.2480e-02,  3.3316e-01,  3.0976e-01,
         -3.4173e-01,  2.6596e-01, -1.0933e-01],
        [ 1.7213e-01, -1.9290e-01, -2.5482e-01, -9.4157e-02,  3.4947e-01,
          1.1196e-01,  2.1353e-02, -1.2128e-01]])

In [25]:
net[4].weight.data # The change has been reflected here since it is the same object

tensor([[ 9.9000e+01,  2.3116e-01, -2.0159e-01,  2.5524e-01,  3.3988e-01,
         -1.3143e-01,  9.9637e-02, -2.7399e-02],
        [ 3.5148e-01, -2.9227e-01,  1.0891e-01,  2.8594e-01, -2.7712e-01,
         -2.2687e-01,  3.1859e-02, -2.6861e-01],
        [ 2.4074e-01, -2.5918e-01,  2.1355e-01, -9.4026e-02, -1.7284e-01,
         -1.0327e-01,  2.5735e-01, -1.5697e-01],
        [ 2.0667e-01, -3.9887e-02, -1.7813e-01, -8.4779e-02, -1.9416e-01,
         -1.3511e-01, -1.8886e-01, -2.6019e-01],
        [ 3.3085e-02,  3.4689e-02,  2.8782e-01, -1.2055e-01, -2.3878e-01,
         -3.4340e-01, -2.4142e-01, -2.9684e-01],
        [-2.1810e-01,  3.2054e-01, -1.6799e-01, -6.0757e-02,  2.9624e-01,
         -2.5178e-01,  3.6727e-02,  2.9169e-01],
        [ 3.4839e-01, -7.2290e-02, -2.2480e-02,  3.3316e-01,  3.0976e-01,
         -3.4173e-01,  2.6596e-01, -1.0933e-01],
        [ 1.7213e-01, -1.9290e-01, -2.5482e-01, -9.4157e-02,  3.4947e-01,
          1.1196e-01,  2.1353e-02, -1.2128e-01]])

What happens to the gradients during backprop? Gradients are additive, so the contributions from each use of the shared layer are summed and accumulated on the one parameter object (which makes sense, because the layer contributes to the output twice). This is also why you cannot just take the gradient from one use and multiply it by two: each use sees a different input, so the two contributions generally differ.
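A small check of this, as a sketch with hand-picked numbers: the 2x2 weight matrix, the input, and the helper `make_layer` are all made up for illustration. A tied layer used twice is compared with two separate layers that start from the same values.

```python
import torch
from torch import nn

W = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
x_demo = torch.tensor([[1.0, 1.0]])


def make_layer():
    # Hypothetical helper: a bias-free linear layer with fixed weights
    layer = nn.Linear(2, 2, bias=False)
    with torch.no_grad():
        layer.weight.copy_(W)
    return layer


# Tied: one layer object used twice
tied = make_layer()
tied(tied(x_demo)).sum().backward()

# Untied: two separate layers with identical starting values
first, second = make_layer(), make_layer()
second(first(x_demo)).sum().backward()

print(first.weight.grad)    # [[4, 4], [6, 6]]
print(second.weight.grad)   # [[3, 7], [3, 7]]
print(tied.weight.grad)     # [[7, 11], [9, 13]], the sum of the two
```

The two uses produce different gradients, and the tied layer receives their sum.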

## 6.3 Parameter Initialization

### Default Initialization

In [26]:
def init_normal(module):
    # Only touch fully connected layers
    if type(module) == nn.Linear:
        # nn.init.normal_: fill the tensor in place from a normal distribution with the given mean and std
        nn.init.normal_(module.weight, mean=0, std=0.01)
        # Alternative, nn.init.constant_: fill the tensor with a constant
        # nn.init.constant_(module.bias, 1)
        # nn.init.zeros_: fill the tensor with zeros
        nn.init.zeros_(module.bias)

# .apply() calls the function on every submodule and on the module itself
net.apply(init_normal)
net[0].weight.data[0], net[0].bias.data[0]

(tensor([ 0.0011, -0.0130, -0.0026,  0.0050]), tensor(0.))

### Custom Initialization

In [27]:
def init_xavier(module):
    if type(module) == nn.Linear:
        nn.init.xavier_uniform_(module.weight)
def init_42(module):
    if type(module) == nn.Linear:
        nn.init.constant_(module.weight, 42)

In [28]:
net[0].apply(init_xavier)
net[2].apply(init_42)
print(net[0].weight.data[0])
print(net[2].weight.data)

tensor([ 0.2216, -0.3519,  0.1610,  0.2502])
tensor([[42., 42., 42., 42., 42., 42., 42., 42.],
        [42., 42., 42., 42., 42., 42., 42., 42.],
        [42., 42., 42., 42., 42., 42., 42., 42.],
        [42., 42., 42., 42., 42., 42., 42., 42.],
        [42., 42., 42., 42., 42., 42., 42., 42.],
        [42., 42., 42., 42., 42., 42., 42., 42.],
        [42., 42., 42., 42., 42., 42., 42., 42.],
        [42., 42., 42., 42., 42., 42., 42., 42.]])


## 6.4 Lazy Initialization

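Lazy initialization defers creating the parameters until the first forward pass, when the input shape becomes known. While this may seem "lazy", it's particularly useful in various contexts. Take, for example, a scenario where a convnet is applied to image data of unknown resolution: the input size of any fully connected layer after the convolutions then depends on the image size and cannot be written down in advance.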
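To make this concrete, here is a minimal sketch (the name `lazy_net` and the input width of 20 are arbitrary choices for illustration). It shows that a lazy layer holds only a placeholder parameter until data has passed through it.

```python
import torch
from torch import nn
from torch.nn.parameter import UninitializedParameter

lazy_net = nn.Sequential(nn.LazyLinear(8), nn.ReLU(), nn.LazyLinear(1))

# Before any data is seen, the weight is a shapeless placeholder
was_lazy = isinstance(lazy_net[0].weight, UninitializedParameter)
print(was_lazy)                       # True

# The first forward pass infers the input size (20 here)
lazy_net(torch.rand(2, 20))
print(lazy_net[0].weight.shape)       # torch.Size([8, 20])
print(lazy_net[2].weight.shape)       # torch.Size([1, 8])
```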

In [29]:
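# `@class_method` is not a Python built-in, which is what raised the NameError here.
# The book attaches this function to its own Module class with a d2l decorator;
# without d2l it can stay a plain function whose first argument is the module.
def apply_init(self, inputs, init=None):
    # A dummy forward pass lets the lazy layers infer their shapes
    self.forward(*inputs)
    # Only then can a custom initializer be applied to the parameters
    if init is not None:
        self.net.apply(init)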

## 6.5 Custom Layers

### 6.5.1 Layers without Parameters

In [30]:
class CenteredLayer(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, X):
        return X - X.mean()

### 6.5.2 Layers with Parameters

In [31]:
class MyLinear(nn.Module):
    def __init__(self, in_units, units):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(in_units, units))
        self.bias = nn.Parameter(torch.randn(units,))

    def forward(self, X):
        # Use the Parameters themselves (not .data) so that autograd tracks them
        linear = torch.matmul(X, self.weight) + self.bias
        return F.relu(linear)

## 6.6 File I/O

### 6.6.1 Loading and Saving Tensors

x = torch.arange(4)

y = torch.zeros(4)

torch.save([x, y],'x-files')
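Reading the tensors back is the mirror image. The sketch below repeats the save step so that it runs on its own; writing into a temporary directory and the `*_saved` / `*_loaded` names are choices made here for illustration.

```python
import os
import tempfile

import torch

x_saved = torch.arange(4)
y_saved = torch.zeros(4)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'x-files')
    torch.save([x_saved, y_saved], path)
    # torch.load returns the same Python structure that was saved
    x_loaded, y_loaded = torch.load(path)

print(x_loaded)   # tensor([0, 1, 2, 3])
print(y_loaded)   # tensor([0., 0., 0., 0.])
```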

### 6.6.2 Loading and Saving Model Parameters

In [32]:
# `net` is the model built in 6.2.2; the file name 'statedict' is only illustrative
torch.save(net.state_dict(), 'statedict')
net.load_state_dict(torch.load('statedict'))
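A self-contained round trip might look like the sketch below. The state dict holds only the parameters, not the architecture, so the model has to be rebuilt in code before loading. The helper `make_mlp`, the layer sizes, and the file name are assumptions made for illustration.

```python
import os
import tempfile

import torch
from torch import nn


def make_mlp():
    # Hypothetical helper: the architecture must be recreated in code
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))


source = make_mlp()
clone = make_mlp()            # same architecture, different random weights
X_io = torch.rand(2, 4)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'mlp.params')
    torch.save(source.state_dict(), path)
    clone.load_state_dict(torch.load(path))

same_output = torch.allclose(source(X_io), clone(X_io))
print(same_output)            # True
```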

## 6.7 GPUs

According to the book, GPU performance has increased by roughly a factor of 1000 every decade since 2000.

### 6.7.1 Computing Devices

In [33]:
torch.cuda.device_count()

1

### 6.7.2 Tensors and GPUs

In [34]:
x = torch.tensor([1,2,3])
x.device # Defaults to CPU

device(type='cpu')

In [37]:
y = torch.tensor([1,2,3], device='cuda:0')
y.device

device(type='cuda', index=0)

In [38]:
x + y

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

In [39]:
z = x.cuda(0) # Copy the tensor onto GPU
y + z

tensor([2, 4, 6], device='cuda:0')

**Note**: Transferring data between devices is often much slower than the computation itself. To limit this cost, create tensors on the device where they will be used and avoid copying between devices more often than necessary.

* Many small operations are actually worse than one big operation
* Several operations done together are better than many single operations interspersed through the code
* Printing a tensor or converting it to NumPy copies GPU data to main memory first, which adds transfer overhead
* As long as data and parameters are on the same device, we can expect training to run efficiently
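The points above can be sketched as device-agnostic code that falls back to the CPU when no GPU is present. The model, batch size, and loop are placeholders chosen for illustration, and only forward passes are run.

```python
import torch
from torch import nn

# Pick the device once and reuse it everywhere
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1)).to(device)
batch = torch.rand(16, 4, device=device)   # created directly on the target device

losses = []
for _ in range(3):
    loss = model(batch).pow(2).mean()
    losses.append(loss.detach())           # stays on the device, nothing is copied yet

# One transfer at the end instead of one per step
history = torch.stack(losses).cpu()
print(history.shape)                       # torch.Size([3])
```

Collecting the losses on the device and moving them once at the end avoids a device-to-host copy on every iteration.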