# Builders’ Guide
This chapter looks at how the APIs work. It covers the key components of deep learning computation: model construction, parameter access and initialization, designing custom layers and blocks, reading and writing models to disk, and leveraging GPUs to achieve dramatic speedups.
## 6.1. Layers and Modules
A linear model with a single output is a neural network that consists of a single neuron. It:
1. takes some set of inputs,
2. generates a corresponding scalar output,
3. has a set of associated parameters that can be updated to optimize some objective function of interest.

Networks with multiple outputs arrange their neurons in layers. Each layer:
1. takes a set of inputs,
2. generates corresponding outputs,
3. is described by a set of tunable parameters.

An MLP as a whole has the same structure as its layers. It:
1. takes raw inputs (features),
2. generates outputs (predictions),
3. possesses parameters (the combined parameters from all constituent layers).

A neural network **module** is the abstraction used to implement complex networks. It can be a single layer, a component consisting of multiple layers, or the entire model.   
A module is represented by a **class**. Any subclass must define a forward propagation method that transforms its input into output, and must store any necessary parameters. A backpropagation method that calculates gradients is also needed, but automatic differentiation supplies it, so we do not need to write it ourselves.  

In [None]:
import torch
from torch import nn
from torch.nn import functional as F

Implement a network with 1 fully connected hidden layer, 256 units and ReLU activation, and a fully connected output layer, 10 units, no activation function.

In [None]:
net = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))

X = torch.rand(2, 20)
net(X).shape

torch.Size([2, 10])

We construct the model by instantiating an ```nn.Sequential```, which defines a special kind of ```Module```. Each of the 2 fully connected layers is an instance of the ```Linear``` class, which is itself a subclass of ```Module```.  
The ```forward``` propagation method chains the modules in the list, passing the output of each as input to the next.  
We invoke the model via ```net(X)``` (a shorthand for ```net.__call__(X)```) to obtain its output.
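As a small illustration of the call shorthand, the sketch below compares calling the model with calling ```forward``` directly. It assumes fixed-size ```nn.Linear``` layers (instead of the lazy layers used in these notes) so that it runs on its own; the variable names are only illustrative.

```python
import torch
from torch import nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))
X = torch.rand(2, 20)

# net(X) goes through nn.Module.__call__, which in turn runs forward
out_call = net(X)
# calling forward directly gives the same values here, but it would skip
# any hooks registered on the module, so net(X) is the usual entry point
out_forward = net.forward(X)

same = torch.equal(out_call, out_forward)
print(out_call.shape, same)
```

In everyday code only the ```net(X)``` form is used; the direct ```forward``` call is shown purely for comparison.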

### 6.1.1. A Custom Module
Basic functionality that each module must provide:
1. Ingest input data as arguments to its forward propagation method.
2. Generate an output by having the forward propagation method return a value. Note that the output may have a different shape from the input.
3. Calculate the gradient of its output with respect to its input, which can be accessed via its backpropagation method.
4. Store and provide access to those parameters necessary for executing the forward propagation computation.
5. Initialize model parameters as needed.

Code a module for an MLP with 1 hidden layer of 256 hidden units and a 10-dimensional output layer. We supply only the ```__init__``` method and the forward propagation method.  
- The ```forward``` method takes the input X, calculates the hidden representation with the activation applied, and outputs the logits.
- The ```__init__``` method invokes the parent's initializer via ```super().__init__()```.
- It then instantiates 2 fully connected layers and assigns them to ```self.hidden``` and ```self.out```. The system generates backpropagation automatically.

In [None]:
class MLP(nn.Module):
    def __init__(self):
        # Call the constructor of the parent class nn.Module to perform
        # the necessary initialization
        super().__init__()
        self.hidden = nn.LazyLinear(256)
        self.out = nn.LazyLinear(10)

    # Define the forward propagation of the model, that is, how to return the
    # required model output based on the input X
    def forward(self, X):
        return self.out(F.relu(self.hidden(X)))

In [None]:
net = MLP()
net(X).shape

torch.Size([2, 10])

### 6.1.2. The Sequential Module
```Sequential``` is used to daisy-chain other modules together. For our simplified sequential we need 2 methods:
1. A method for appending modules one by one to a list.
2. A forward propagation method for passing an input through the chain of modules, in the same order as they were appended.

- In ```__init__```, each module is added by calling the ```add_module``` method.
- In the ```forward``` propagation method, each added module is executed in the order in which it was added.

In [None]:
# same functionality as the default Sequential
class MySequential(nn.Module):
    def __init__(self, *args):
        super().__init__()
        for idx, module in enumerate(args):
            self.add_module(str(idx), module)

    def forward(self, X):
        for module in self.children():
            X = module(X)
        return X

In [None]:
net = MySequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))
net(X).shape

torch.Size([2, 10])
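The reason for going through ```add_module``` rather than a plain Python list can be sketched as follows. The class ```ListHolder``` is a hypothetical counter-example that is not part of these notes, and fixed-size ```nn.Linear``` layers are assumed so the parameters exist immediately.

```python
import torch
from torch import nn

class MySequential(nn.Module):
    def __init__(self, *args):
        super().__init__()
        for idx, module in enumerate(args):
            # add_module registers the child under the given name
            self.add_module(str(idx), module)

    def forward(self, X):
        for module in self.children():
            X = module(X)
        return X

class ListHolder(nn.Module):
    """Hypothetical counter-example: layers kept in a plain list."""
    def __init__(self, *args):
        super().__init__()
        self.layers = list(args)  # a plain list is not registered

net = MySequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))
child_names = [name for name, _ in net.named_children()]
param_names = [name for name, _ in net.named_parameters()]

holder = ListHolder(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))
num_unregistered = len(list(holder.parameters()))

print(child_names, param_names, num_unregistered)
```

Registered children are what ```children()```, ```parameters()``` and ```state_dict()``` traverse, which is why the simplified class can iterate over ```self.children()``` in its forward pass.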

### 6.1.3. Executing Code in the Forward Propagation Method
Not all architectures are simple daisy chains. Sometimes we want to perform arbitrary mathematical operations or Python control flow in the forward pass.  
So far, all operations have acted on the network's activations and its parameters. We may also want to use terms that are neither the output of a previous layer nor updatable parameters; these are called **constant parameters**. We implement a ```FixedHiddenMLP``` class as follows:
- A weight matrix is initialized randomly; it is not a model parameter and is never updated by backpropagation.
- A while-loop checks whether the output's $\ell_1$ norm is larger than 1 and, while it is, divides the output by 2.
- The method returns the sum of the entries in X.

This model may not be useful in practice; it only shows how to integrate arbitrary code into the flow of a neural network computation.

In [None]:
class FixedHiddenMLP(nn.Module):
    def __init__(self):
        super().__init__()
        # Random weight parameters that will not compute gradients and
        # therefore keep constant during training
        self.rand_weight = torch.rand((20, 20))
        self.linear = nn.LazyLinear(20)

    def forward(self, X):
        X = self.linear(X)
        X = F.relu(X @ self.rand_weight + 1)
        # Reuse the fully connected layer. This is equivalent to sharing
        # parameters with two fully connected layers
        X = self.linear(X)
        # Control flow
        while X.abs().sum() > 1:
            X /= 2
        return X.sum()

In [None]:
net = FixedHiddenMLP()
net(X)

tensor(-0.0721, grad_fn=<SumBackward0>)
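To see that a constant such as ```rand_weight``` really stays outside the set of trainable parameters, one can list what the module reports. The sketch below uses a hypothetical class ```FixedWeightDemo``` and also shows ```register_buffer```, an alternative that is not used in these notes but keeps a constant tensor in the ```state_dict``` (and moves it along with the module) without making it trainable.

```python
import torch
from torch import nn

class FixedWeightDemo(nn.Module):
    """Hypothetical module holding constants in two different ways."""
    def __init__(self):
        super().__init__()
        # plain tensor attribute, as in FixedHiddenMLP: invisible to the module
        self.rand_weight = torch.rand((20, 20))
        # buffer: saved in state_dict, but still not a trainable parameter
        self.register_buffer('rand_buffer', torch.rand((20, 20)))
        self.linear = nn.Linear(20, 20)

demo = FixedWeightDemo()
param_names = [name for name, _ in demo.named_parameters()]
state_keys = list(demo.state_dict().keys())

print(param_names)
print(state_keys)
```

Only the parameters of ```linear``` appear among the parameters, so an optimizer built from ```demo.parameters()``` would never update either constant.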

In [None]:
# mix modules together example
class NestMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.LazyLinear(64), nn.ReLU(),
                                 nn.LazyLinear(32), nn.ReLU())
        self.linear = nn.LazyLinear(16)

    def forward(self, X):
        return self.linear(self.net(X))

chimera = nn.Sequential(NestMLP(), nn.LazyLinear(20), FixedHiddenMLP())
chimera(X)

tensor(0.3910, grad_fn=<SumBackward0>)

### 6.1.4. Summary
- Individual layers can be modules. Many layers can comprise a module. Many modules can comprise a module.
- A module can contain code. Modules take care of parameter initialization and backpropagation.
- Use ```Sequential``` to handle sequential concatenations of layers and modules.

## 6.2. Parameter Management
- Accessing parameters for debugging, diagnostics, and visualizations.
- Sharing parameters across different model components.

### 6.2.1. Parameter Access
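When a model is defined via the ```Sequential``` class, we can access any layer by indexing into the model as if it were a list. Each layer's parameters are available as attributes of that layer (for example ```weight``` and ```bias```).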

In [None]:
import torch
from torch import nn
net = nn.Sequential(nn.LazyLinear(8),
                    nn.ReLU(),
                    nn.LazyLinear(1))

X = torch.rand(size=(2, 4))
net(X).shape

torch.Size([2, 1])

In [None]:
net[2].state_dict()

OrderedDict([('weight',
              tensor([[-0.1904, -0.0517, -0.1419, -0.3515, -0.2102, -0.0083,  0.2476,  0.0567]])),
             ('bias', tensor([0.3388]))])

#### 6.2.1.1. Targeted Parameters
Each parameter is represented as an instance of the parameter class. To access the underlying numerical values:

In [None]:
# extract the bias from the second fully connected layer (net[2]); this returns a Parameter instance
type(net[2].bias), net[2].bias.data

(torch.nn.parameter.Parameter, tensor([0.3388]))

In [None]:
# a parameter object contains the value, the gradient, and additional info;
# no backward pass has run yet, so the gradient is still None
net[2].weight.grad is None

True

#### 6.2.1.2. All Parameters at Once
Access the parameters of all layers at once:

In [None]:
[(name, param.shape) for name, param in net.named_parameters()]

[('0.weight', torch.Size([8, 4])),
 ('0.bias', torch.Size([8])),
 ('2.weight', torch.Size([1, 8])),
 ('2.bias', torch.Size([1]))]
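The names printed above can also serve as keys for looking up a single parameter. This is a minimal sketch assuming fixed-size ```nn.Linear``` layers so that it runs without a forward pass; the variable names are illustrative.

```python
import torch
from torch import nn

net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

# look up a tensor by its dotted name in the state dictionary
bias_by_name = net.state_dict()['2.bias']

# or build a name -> Parameter mapping from named_parameters()
params = dict(net.named_parameters())

same_values = torch.equal(bias_by_name, net[2].bias.data)
same_object = params['2.bias'] is net[2].bias
print(same_values, same_object)
```

The dotted names follow the module hierarchy, so nested modules produce longer names of the same form.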

### 6.2.2. Tied Parameters
Parameters can be shared across multiple layers. The parameters of the shared layers are tied: they are represented by the same tensor, so a change made through one layer is visible through the other.

In [None]:
# We need to give the shared layer a name so that we can refer to its parameters
shared = nn.LazyLinear(8)
net = nn.Sequential(nn.LazyLinear(8), nn.ReLU(),
                    shared, nn.ReLU(),
                    shared, nn.ReLU(),
                    nn.LazyLinear(1))

net(X)
# Check whether the parameters are the same
print(net[2].weight.data[0] == net[4].weight.data[0])
net[2].weight.data[0, 0] = 100
# Make sure that they are actually the same object rather than just having the same value
print(net[2].weight.data[0] == net[4].weight.data[0])

tensor([True, True, True, True, True, True, True, True])
tensor([True, True, True, True, True, True, True, True])
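Object identity can also be checked directly instead of comparing values. The sketch below assumes fixed-size ```nn.Linear``` layers and additionally counts how many distinct parameter tensors the network reports.

```python
import torch
from torch import nn

shared = nn.Linear(8, 8)
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(),
                    shared, nn.ReLU(),
                    shared, nn.ReLU(),
                    nn.Linear(8, 1))

# both positions hold the very same Parameter object
is_same_object = net[2].weight is net[4].weight

# parameters() yields each shared tensor only once:
# weight and bias for net[0], for the shared layer, and for net[6]
num_unique = len(list(net.parameters()))

print(is_same_object, num_unique)
```

Because the shared layer is counted once, an optimizer built from ```net.parameters()``` holds a single copy of the tied weights.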


### 6.2.3. Summary
There are several ways to access model parameters, and parameters can be tied across layers.

## 6.3. Parameter Initialization
How to initialize parameters properly:  
By default, weights and biases are drawn uniformly from a range computed from the input and output dimensions. The ```nn.init``` module provides a variety of preset initialization methods.

In [None]:
import torch
from torch import nn

In [None]:
net = nn.Sequential(nn.LazyLinear(8), nn.ReLU(), nn.LazyLinear(1))
X = torch.rand(size=(2, 4))
net(X).shape

torch.Size([2, 1])
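The default range can be probed empirically. The check below rests on an assumption about recent PyTorch releases, namely that ```nn.Linear``` draws both weights and biases from a uniform range bounded by 1/sqrt(in_features); if a given version uses a different default, the bound would need adjusting.

```python
import math
import torch
from torch import nn

layer = nn.Linear(4, 8)

# assumed default bound for nn.Linear: 1 / sqrt(in_features)
bound = 1 / math.sqrt(layer.in_features)

w_max = layer.weight.abs().max().item()
b_max = layer.bias.abs().max().item()

# a small tolerance guards against floating-point rounding of the bound
print(w_max <= bound + 1e-6, b_max <= bound + 1e-6)
```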

### 6.3.1. Built-in Initialization
Initialize all weight parameters as Gaussian random variables with standard deviation 0.01, and set all bias parameters to 0.

In [None]:
def init_normal(module):
    if type(module) == nn.Linear:
        nn.init.normal_(module.weight, mean=0, std=0.01)
        nn.init.zeros_(module.bias)

net.apply(init_normal)
net[0].weight.data[0], net[0].bias.data[0]

(tensor([ 0.0032, -0.0064, -0.0138, -0.0060]), tensor(0.))

In [None]:
# initialize all weights to a given constant (here 1) and all biases to 0
def init_constant(module):
    if type(module) == nn.Linear:
        nn.init.constant_(module.weight, 1)
        nn.init.zeros_(module.bias)

net.apply(init_constant)
net[0].weight.data[0], net[0].bias.data[0]

(tensor([1., 1., 1., 1.]), tensor(0.))

In [None]:
# apply different initializers for certain blocks.
def init_xavier(module):
    if type(module) == nn.Linear:
        nn.init.xavier_uniform_(module.weight)

def init_42(module):
    if type(module) == nn.Linear:
        nn.init.constant_(module.weight, 42)

# initialize the first fully connected layer (net[0]) with the Xavier initializer
net[0].apply(init_xavier)
# initialize the second fully connected layer (net[2]) to the constant 42
net[2].apply(init_42)
print(net[0].weight.data[0])
print(net[2].weight.data)

tensor([ 0.0935, -0.3470, -0.3493,  0.4180])
tensor([[42., 42., 42., 42., 42., 42., 42., 42.]])


#### 6.3.1.1. Custom Initialization
For initialization methods that are not provided by the framework, define a function that implements them.

In [None]:
# U(5,10) with p=1/4, 0 with p=1/2, U(-10,-5) with p=1/4
def my_init(module):
    if type(module) == nn.Linear:
        print("Init", *[(name, param.shape)
                        for name, param in module.named_parameters()][0])
        nn.init.uniform_(module.weight, -10, 10)
        module.weight.data *= module.weight.data.abs() >= 5

net.apply(my_init)
net[0].weight[:2]

Init weight torch.Size([8, 4])
Init weight torch.Size([1, 8])


tensor([[-7.1898, -0.0000, -0.0000,  0.0000],
        [ 6.6734, -5.5504, -0.0000, -7.9045]], grad_fn=<SliceBackward0>)
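The mixture described in the comment of ```my_init``` (zero with probability 1/2, otherwise uniform on one of the two outer intervals) can be sanity-checked on a larger layer. This sketch applies the same two steps to an illustrative 100 x 100 ```nn.Linear``` layer and measures the observed fractions; the layer size and seed are arbitrary choices.

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Linear(100, 100)

with torch.no_grad():
    # step 1: draw every weight from U(-10, 10)
    nn.init.uniform_(layer.weight, -10, 10)
    # step 2: zero out every weight whose magnitude is below 5
    layer.weight.mul_(layer.weight.abs() >= 5)

w = layer.weight.detach()
frac_zero = (w == 0).float().mean().item()
frac_pos = (w >= 5).float().mean().item()
frac_neg = (w <= -5).float().mean().item()

print(frac_zero, frac_pos, frac_neg)
```

With 10,000 samples the three fractions should land close to 1/2, 1/4 and 1/4 respectively.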

In [None]:
# set parameters directly
net[0].weight.data[:] += 1
net[0].weight.data[0, 0] = 42
net[0].weight.data[0]

tensor([42.,  1.,  1.,  1.])

### 6.3.2. Summary
We can initialize parameters using built-in and custom initializers.



## 6.4. Lazy Initialization
Some things we have done so far may seem unintuitive: we defined network architectures without specifying the input dimensionality, added layers without specifying the output dimension of the previous layer, and even "initialized" parameters before providing enough information to determine how many parameters the model should contain.  
The framework **defers initialization**: it waits until the first time we pass data through the model and infers the size of each layer on the fly. Next, we go deeper into the mechanics of initialization.

In [None]:
import torch
from torch import nn
from d2l import torch as d2l

In [None]:
# instantiate an MLP
net = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))

# the framework has not initialized any parameters yet
net[0].weight

<UninitializedParameter>

In [None]:
# pass data through the network so that the framework initializes the parameters
X = torch.rand(2, 20)
net(X)

net[0].weight.shape

torch.Size([256, 20])
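The state of a lazy layer before and after the first forward pass can be inspected directly. The sketch below relies on ```UninitializedParameter``` from ```torch.nn.parameter``` and on the observation that, in the PyTorch versions this was written against, a ```LazyLinear``` turns into a regular ```Linear``` once its shapes are known.

```python
import torch
from torch import nn
from torch.nn.parameter import UninitializedParameter

layer = nn.LazyLinear(256)

# before any data is seen, the weight is only a placeholder
before_type = type(layer).__name__
before_uninit = isinstance(layer.weight, UninitializedParameter)

# the first forward pass infers in_features from the input
layer(torch.rand(2, 20))

after_type = type(layer).__name__
after_shape = tuple(layer.weight.shape)

print(before_type, before_uninit, after_type, after_shape)
```

This class change is what lets checks of the form ```type(module) == nn.Linear``` in the initializer functions of Section 6.3 match layers that were created lazily, provided data has been passed through the network first.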

As soon as the input dimensionality (20) is known, the framework can determine the shape of the first layer's weight matrix. Having the shape of the first layer, the framework proceeds through the computational graph to the second layer and so on, until all shapes are known. Only the first layer requires lazy initialization in this case, but the framework initializes sequentially. Once all parameter shapes are known, the framework can finally initialize the parameters.   
The following method passes in dummy inputs through the network for a dry run to infer all parameter shapes and subsequently initializes the parameters. It will be used later when default random initializations are not desired.


In [None]:
# dry run for better initialization.
@d2l.add_to_class(d2l.Module)  #@save
def apply_init(self, inputs, init=None):
    self.forward(*inputs)
    if init is not None:
        self.net.apply(init)

### 6.4.1. Summary
Lazy initialization can be convenient, allowing the framework to infer parameter shapes automatically, making it easy to modify architectures and eliminating one common source of errors. We can pass data through the model to make the framework finally initialize parameters.

## 6.5. Custom Layers
### 6.5.1. Layers without Parameters
Construct a custom layer that does not have any parameters of its own.

In [None]:
import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l

In [None]:
class CenteredLayer(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, X):
        return X - X.mean()

In [None]:
# example of usage
layer = CenteredLayer()
layer(torch.tensor([1.0, 2, 3, 4, 5]))

tensor([-2., -1.,  0.,  1.,  2.])

In [None]:
# use as a component in complex models
net = nn.Sequential(nn.LazyLinear(128), CenteredLayer())

In [None]:
# example of model usage, mean should be 0
Y = net(torch.rand(4, 8))
Y.mean()

tensor(5.5879e-09, grad_fn=<MeanBackward0>)

### 6.5.2. Layers with Parameters
Implement a fully connected layer with
- 2 parameters: weight and bias
- 2 input arguments: ```in_units``` and ```units```, the number of inputs and the number of outputs, respectively.

In [None]:
class MyLinear(nn.Module):
    def __init__(self, in_units, units):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(in_units, units))
        self.bias = nn.Parameter(torch.randn(units,))

    def forward(self, X):
        linear = torch.matmul(X, self.weight.data) + self.bias.data
        return F.relu(linear)

In [None]:
# instantiate the MyLinear class
linear = MyLinear(5, 3)
# access model parameters
linear.weight

Parameter containing:
tensor([[ 1.0234, -0.0247,  2.1286],
        [ 0.1858, -0.2367, -1.3932],
        [ 1.7267,  1.1290, -0.4259],
        [-0.0433,  0.5845, -0.0928],
        [-0.4610,  0.1135,  0.5796]], requires_grad=True)

In [None]:
# forward propagation calculations using custom layers
linear(torch.rand(2, 5))

tensor([[0.7767, 0.0470, 0.0000],
        [0.0000, 0.0000, 0.0000]])

In [None]:
# models using custom layers
net = nn.Sequential(MyLinear(64, 8), MyLinear(8, 1))
net(torch.rand(2, 64))

tensor([[6.6446],
        [0.0000]])
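Note that the printed outputs above carry no ```grad_fn```: the forward method reads ```self.weight.data``` and ```self.bias.data```, which bypasses automatic differentiation. If the layer is meant to be trained, one option is to use the parameters directly. The class below is a hypothetical variant for illustration, not the version from these notes.

```python
import torch
from torch import nn
from torch.nn import functional as F

class MyLinearTrainable(nn.Module):
    """Hypothetical variant of MyLinear that keeps gradients flowing."""
    def __init__(self, in_units, units):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(in_units, units))
        self.bias = nn.Parameter(torch.randn(units))

    def forward(self, X):
        # use the parameters themselves (not .data) so autograd tracks them
        linear = torch.matmul(X, self.weight) + self.bias
        return F.relu(linear)

layer = MyLinearTrainable(5, 3)
out = layer(torch.rand(2, 5))
out.sum().backward()

print(out.requires_grad, layer.weight.grad.shape)
```

After the backward pass, ```weight.grad``` and ```bias.grad``` are populated, so an optimizer could update this layer.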

### 6.5.3. Summary
Design custom layers via the basic layer class, which allows us to define flexible new layers. Layers can have local parameters, which can be created through built-in functions.

## 6.6. File I/O
We often want to save learned models for later use, and when running a long training process it is good practice to periodically save intermediate results.
### 6.6.1. Loading and Saving Tensors
Invoke the ```load``` and ```save``` functions to read and write individual tensors.

In [None]:
x = torch.arange(4)
torch.save(x, 'x-file')
x

tensor([0, 1, 2, 3])

In [None]:
# read the data from the file
x2 = torch.load('x-file')
x2 == x

tensor([True, True, True, True])

In [None]:
# store a list of tensors and read back
y = torch.zeros(4)
torch.save([x, y],'x-files')
x2, y2 = torch.load('x-files')
(x2, y2)

(tensor([0, 1, 2, 3]), tensor([0., 0., 0., 0.]))

In [None]:
# write and read a dictionary that maps from strings to tensors
mydict = {'x': x, 'y': y}
torch.save(mydict, 'mydict')
mydict2 = torch.load('mydict')
mydict2

{'x': tensor([0, 1, 2, 3]), 'y': tensor([0., 0., 0., 0.])}

### 6.6.2. Loading and Saving Model Parameters
Entire networks can be saved and loaded using built-in functionality. This saves the model's **parameters**, not the entire model; the architecture needs to be specified separately.  
To reinstate a model, we generate the architecture in code and then load the parameters from disk.

In [None]:
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.LazyLinear(256)
        self.output = nn.LazyLinear(10)

    def forward(self, x):
        return self.output(F.relu(self.hidden(x)))

net = MLP()
X = torch.randn(size=(2, 20))
Y = net(X)

In [None]:
# store the parameters as a file
torch.save(net.state_dict(), 'mlp.params')

In [None]:
# recover the model
# instantiate the original MLP model
clone = MLP()
# read parameters from the file
clone.load_state_dict(torch.load('mlp.params'))
clone.eval()

MLP(
  (hidden): LazyLinear(in_features=0, out_features=256, bias=True)
  (output): LazyLinear(in_features=0, out_features=10, bias=True)
)

In [None]:
# verify that the clone computes the same outputs as the original
Y_clone = clone(X)
Y_clone == Y

tensor([[True, True, True, True, True, True, True, True, True, True],
        [True, True, True, True, True, True, True, True, True, True]])
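A slightly more defensive round trip might look like the sketch below. It assumes a PyTorch version whose ```torch.load``` accepts the ```weights_only``` and ```map_location``` arguments, uses fixed-size ```nn.Linear``` layers, and writes to a temporary directory; the file name simply mirrors the one used above.

```python
import os
import tempfile

import torch
from torch import nn

def make_mlp():
    # the architecture lives in code; only the parameters go to disk
    return nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))

net = make_mlp()
X = torch.randn(2, 20)
Y = net(X)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'mlp.params')
    torch.save(net.state_dict(), path)
    # weights_only=True restricts unpickling to tensors and plain containers;
    # map_location='cpu' loads the file even if it was saved from a GPU
    state = torch.load(path, map_location='cpu', weights_only=True)

clone = make_mlp()
clone.load_state_dict(state)
clone.eval()

outputs_match = torch.equal(clone(X), Y)
print(outputs_match)
```

Wrapping the architecture in a small factory function such as the hypothetical ```make_mlp``` keeps the saved parameters and the code that rebuilds the model in sync.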

### 6.6.3. Summary
- The ```save``` and ```load``` functions can be used to perform file I/O for tensors.
- The parameters of a network can be saved and loaded via a parameter dictionary.
- The architecture has to be saved in code rather than in parameters.


## 6.7. GPUs
How to use a single NVIDIA GPU for calculations.  
In PyTorch, every array has a device; we often refer to it as a **context**. By default, all variables and associated computation are assigned to the CPU. Other contexts might be various GPUs.  
By assigning arrays to contexts intelligently, we can minimize the time spent transferring data between devices. For example, when training neural networks on a server with a GPU, we typically prefer for the model’s parameters to live on the GPU.

In [5]:
# use google colab
import torch
from torch import nn
from d2l import torch as d2l

In [3]:
%pip uninstall -y d2l
%pip install --no-deps d2l

In [2]:
import d2l

### 6.7.1. Computing Devices
Specify devices for storage and calculation. By default, tensors are created in the main memory and the CPU is used for calculation. The CPU and GPU can be indicated by ```torch.device('cpu')``` and ```torch.device('cuda')```. The ```cpu``` device means all physical CPUs and memory. A ```cuda``` device represents one card and the corresponding memory. With multiple GPUs, ```torch.device(f'cuda:{i}')``` represents the $i^{th}$ GPU ($i$ starts at 0), and ```cuda:0``` is equivalent to ```cuda```.

In [5]:
def cpu():  #@save
    """Get the CPU device."""
    return torch.device('cpu')

def gpu(i=0):  #@save
    """Get a GPU device."""
    return torch.device(f'cuda:{i}')

cpu(), gpu(), gpu(1)

(device(type='cpu'),
 device(type='cuda', index=0),
 device(type='cuda', index=1))

In [6]:
# query the number of available GPUs
def num_gpus():  #@save
    """Get the number of available GPUs."""
    return torch.cuda.device_count()

num_gpus()

1

In [7]:
# fall back gracefully so that code still runs if the requested GPUs do not exist
def try_gpu(i=0):  #@save
    """Return gpu(i) if exists, otherwise return cpu()."""
    if num_gpus() >= i + 1:
        return gpu(i)
    return cpu()

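def try_all_gpus():  #@save
    """Return all available GPUs, or [cpu(),] if no GPU exists."""
    devices = [gpu(i) for i in range(num_gpus())]
    # fall back to the CPU so that the returned list is never empty
    return devices if devices else [cpu()]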

try_gpu(), try_gpu(10), try_all_gpus()

(device(type='cuda', index=0),
 device(type='cpu'),
 [device(type='cuda', index=0)])
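A common device-agnostic pattern, sketched here without the helper functions above, picks the device once and reuses it for both tensors and models. It relies only on ```torch.cuda.is_available()``` and therefore also runs on a machine without a GPU; the variable names are illustrative.

```python
import torch
from torch import nn

# pick the first GPU if one is usable, otherwise stay on the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

x = torch.ones(2, 3, device=device)   # create the tensor on that device
net = nn.Linear(3, 1).to(device)      # move the model parameters there too
y = net(x)                            # inputs and parameters now match

print(device, y.device, y.shape)
```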

### 6.7.2. Tensors and GPUs
By default, tensors are created on the CPU. We can query the device where the tensor is located.  
Whenever we want to operate on multiple terms, they need to be on the same device.

In [8]:
x = torch.tensor([1, 2, 3])
x.device

device(type='cpu')

#### 6.7.2.1. Storage on the GPU
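Ways to store a tensor on the GPU:
- Specify the storage device when creating the tensor.
- Move an existing tensor to the GPU with ```.cuda()``` or ```.to(device)```, as in the Copying section below.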

In [9]:
X = torch.ones(2, 3, device=try_gpu())
X

tensor([[1., 1., 1.],
        [1., 1., 1.]], device='cuda:0')

In [12]:
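# with >= 2 GPUs this tensor would live on 'cuda:1'; with only one GPU,
# try_gpu(1) falls back to the CPU, so the output below shows no device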
Y = torch.rand(2, 3, device=try_gpu(1))
Y

tensor([[0.8111, 0.6049, 0.5527],
        [0.8126, 0.0376, 0.5428]])

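#### 6.7.2.2. Copying
To compute X + Y, both tensors must live on the same device. With 2 GPUs, we would copy X to the second GPU and add there; with a single GPU, as here, we instead move Y from the CPU to the GPU.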

In [18]:
# with 2 GPUs: copy X to the second GPU
# Z = X.cuda(1)
# print(X)
# print(Z)

# with 1 GPU: copy Y from the CPU to GPU 0 (as Z), then compute
Z = Y.cuda(0)
X + Z

tensor([[1.8111, 1.6049, 1.5527],
        [1.8126, 1.0376, 1.5428]], device='cuda:0')

In [21]:
# if Z already lives on GPU 0, Z.cuda(0) returns Z itself instead of making a copy
Z.cuda(0) is Z

True
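The same no-copy behavior can be observed on the CPU, which makes it easy to try without a GPU. The sketch below uses ```Tensor.to``` and ```Tensor.cpu```, and passes ```copy=True``` for the case where a fresh copy is wanted.

```python
import torch

x = torch.ones(2, 3)  # created on the CPU by default

# already on the requested device: the very same object comes back
same_to = x.to('cpu') is x
same_cpu = x.cpu() is x

# copy=True forces a new tensor even though nothing has to move
forced = x.to('cpu', copy=True)
is_forced_copy = forced is not x

print(same_to, same_cpu, is_forced_copy)
```

This is why calling ```.to(device)``` defensively before a computation costs little when the tensor is already in place.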

#### 6.7.2.3. Side Notes
- People use GPUs to do machine learning because they expect them to be fast.
- Transferring variables between devices is slow, much slower than computation. It also makes parallelization a lot more difficult, so copy operations should be used with care.
- A few large operations are much better than many small operations interspersed in the code.
- When we print tensors or convert tensors to the NumPy format, if the data is not in the main memory, the framework copies it to the main memory first, resulting in additional transmission overhead.
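The points about transfers can be illustrated with a running loss total. In this sketch the per-step "losses" are random stand-ins rather than real training losses, and the device is chosen so that the code also runs on a CPU-only machine.

```python
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# stand-in values for per-step losses that live on the compute device
losses = [torch.rand((), device=device) for _ in range(100)]

# pattern 1: .item() on every step forces one device-to-host copy per step
total_per_step = sum(loss.item() for loss in losses)

# pattern 2: accumulate on the device and copy to the host once at the end
running = torch.zeros((), device=device)
for loss in losses:
    running += loss
total_once = running.item()

print(total_per_step, total_once)
```

Both totals agree up to floating-point rounding; the second pattern simply moves one scalar across the device boundary instead of one hundred.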

### 6.7.3. Neural Networks and GPUs
A neural network model can specify devices. When the input is a tensor on the GPU, the model calculates the result on the same GPU.

In [24]:
net = nn.Sequential(nn.LazyLinear(1))
net = net.to(device=try_gpu())
net(X), net[0].weight.data.device


(tensor([[0.1920],
         [0.1920]], device='cuda:0', grad_fn=<AddmmBackward0>),
 device(type='cuda', index=0))

In [6]:
# let the trainer support GPU
@d2l.add_to_class(d2l.Trainer)  #@save
def __init__(self, max_epochs, num_gpus=0, gradient_clip_val=0):
    self.save_hyperparameters()
    self.gpus = [d2l.gpu(i) for i in range(min(num_gpus, d2l.num_gpus()))]

@d2l.add_to_class(d2l.Trainer)  #@save
def prepare_batch(self, batch):
    if self.gpus:
        batch = [a.to(self.gpus[0]) for a in batch]
    return batch

@d2l.add_to_class(d2l.Trainer)  #@save
def prepare_model(self, model):
    model.trainer = self
    model.board.xlim = [0, self.max_epochs]
    if self.gpus:
        model.to(self.gpus[0])
    self.model = model

### 6.7.4. Summary
- By default, data is created in the main memory and the CPU is used for calculations.
- The deep learning framework requires all input data for a calculation to be on the same device.
- It is much better to allocate memory for logging inside the GPU and only move larger logs.