# Chapter 6: Builders' Guide

This chapter covers the key components of deep learning computation: model construction, parameter access and initialization, designing custom layers and blocks, reading and writing models to disk, and leveraging GPUs.

🔑 **KEY INSIGHT**: This chapter moves you from *end user* to *power user* - giving you tools to implement complex custom models while leveraging mature framework capabilities.

---
## 6.1 Layers and Modules

Neural network *modules* can describe a single layer, multiple layers, or an entire model. They can be combined recursively into larger artifacts.

In [55]:
import torch
from torch import nn
from torch.nn import functional as F

### Basic MLP with Sequential

🔑 **KEY INSIGHT**: `nn.Sequential` defines a special kind of `Module` that maintains an ordered list of constituent modules. The forward propagation chains each module, passing output as input to the next.

In [56]:
net = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))

X = torch.rand(2, 20)
net(X).shape

torch.Size([2, 10])
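
The chaining described above can be reproduced by hand. This is a minimal sketch, not the library's implementation; it assumes explicit `nn.Linear` sizes (20, 256, 10) in place of the lazy layers so that it runs on its own.

```python
import torch
from torch import nn

torch.manual_seed(0)
# Explicit input sizes are an assumption made so the sketch does not rely on
# lazy initialization.
net = nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))
X = torch.rand(2, 20)

# Chain the constituent modules manually: each output becomes the next input.
out = X
for module in net:
    out = module(out)

# The manual chain and the Sequential call should agree.
same = torch.equal(out, net(X))
print(same)
```

Iterating over an `nn.Sequential` yields its modules in the order they were passed to the constructor, which is what makes the manual loop equivalent.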

### A Custom Module

🔑 **KEY INSIGHT**: To create a custom module, inherit from `nn.Module`, define `__init__` to set up layers, and implement `forward` for the computation. Backpropagation is handled automatically by autograd.

In [57]:
class MLP(nn.Module):
    def __init__(self):
        # Call the constructor of the parent class nn.Module to perform
        # the necessary initialization
        super().__init__()
        self.hidden = nn.LazyLinear(256)
        self.out = nn.LazyLinear(10)

    # Define the forward propagation of the model, that is, how to return the
    # required model output based on the input X
    def forward(self, X):
        return self.out(F.relu(self.hidden(X)))

In [58]:
net = MLP()
net(X).shape

torch.Size([2, 10])

### The Sequential Module

To see how `nn.Sequential` works, we build a simplified `MySequential` class that registers each module in order and chains them in `forward`.

In [59]:
class MySequential(nn.Module):
    def __init__(self, *args):
        super().__init__()
        for idx, module in enumerate(args):
            self.add_module(str(idx), module)

    def forward(self, X):
        for module in self.children():
            X = module(X)
        return X

In [60]:
net = MySequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))
net(X).shape

torch.Size([2, 10])
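
Why `MySequential` calls `add_module` instead of keeping the layers in a plain Python list can be checked directly. The sketch below is illustrative only; `ListHolder` and `RegisteredHolder` are hypothetical names that do not appear in the chapter.

```python
import torch
from torch import nn

class ListHolder(nn.Module):  # hypothetical: layers kept in a plain list
    def __init__(self, *layers):
        super().__init__()
        self.layers = list(layers)  # a plain list is NOT registered

class RegisteredHolder(nn.Module):  # hypothetical: mirrors MySequential
    def __init__(self, *layers):
        super().__init__()
        for idx, layer in enumerate(layers):
            self.add_module(str(idx), layer)  # registered as a child module

plain = ListHolder(nn.Linear(4, 3), nn.Linear(3, 2))
registered = RegisteredHolder(nn.Linear(4, 3), nn.Linear(3, 2))

# Count the parameter tensors each container exposes.
n_plain = len(list(plain.parameters()))
n_registered = len(list(registered.parameters()))
print(n_plain, n_registered)
```

Only registered children show up in `parameters()`, `state_dict()`, and `.to(device)`, so an optimizer built from `plain.parameters()` would have nothing to update.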

### Executing Code in the Forward Propagation Method

🔑 **KEY INSIGHT**: You can incorporate arbitrary Python control flow (while loops, if statements) and mathematical operations in the forward method - not just predefined layers.

In [61]:
class FixedHiddenMLP(nn.Module):
    def __init__(self):
        super().__init__()
        # Random weights kept as a plain tensor rather than an nn.Parameter:
        # no gradients are computed for them, so they stay constant during
        # training
        self.rand_weight = torch.rand((20, 20))
        self.linear = nn.LazyLinear(20)

    def forward(self, X):
        X = self.linear(X)
        X = F.relu(X @ self.rand_weight + 1)
        # Reuse the fully connected layer. This is equivalent to sharing
        # parameters with two fully connected layers
        X = self.linear(X)
        # Control flow
        while X.abs().sum() > 1:
            X /= 2
        return X.sum()

In [62]:
net = FixedHiddenMLP()
net(X)

tensor(0.1201, grad_fn=<SumBackward0>)
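
In `FixedHiddenMLP`, `rand_weight` is a plain tensor attribute, so it is excluded from `parameters()`, but it is also not saved in `state_dict()` and not moved by `.to(device)`. One possible alternative, shown here as a sketch rather than as the chapter's method, is `register_buffer`. The class name `FixedWeightLayer` is hypothetical.

```python
import torch
from torch import nn

class FixedWeightLayer(nn.Module):  # hypothetical name, for illustration only
    def __init__(self):
        super().__init__()
        # A buffer is stored in state_dict and follows .to(device), but it is
        # not a Parameter, so an optimizer never updates it.
        self.register_buffer("rand_weight", torch.rand(20, 20))
        self.linear = nn.Linear(20, 20)

    def forward(self, X):
        return self.linear(torch.relu(X @ self.rand_weight + 1))

layer = FixedWeightLayer()
param_names = [name for name, _ in layer.named_parameters()]
state_keys = list(layer.state_dict().keys())
print(param_names)
print(state_keys)
```

The buffer appears among the `state_dict` keys but not among the named parameters, which is the behavior wanted for a constant that should travel with the model.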

### Mixing and Matching Modules

In [63]:
class NestMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.LazyLinear(64), nn.ReLU(),
                                 nn.LazyLinear(32), nn.ReLU())
        self.linear = nn.LazyLinear(16)

    def forward(self, X):
        return self.linear(self.net(X))

chimera = nn.Sequential(NestMLP(), nn.LazyLinear(20), FixedHiddenMLP())
chimera(X)

tensor(0.0994, grad_fn=<SumBackward0>)

---
## 6.2 Parameter Management

Accessing parameters for debugging, diagnostics, visualizations, and sharing parameters across model components.

In [64]:
import torch
from torch import nn

In [65]:
net = nn.Sequential(nn.LazyLinear(8),
                    nn.ReLU(),
                    nn.LazyLinear(1))

X = torch.rand(size=(2, 4))
net(X).shape

torch.Size([2, 1])

### Parameter Access

🔑 **KEY INSIGHT**: Access any layer by indexing into the model like a list. Each layer's parameters are in its `state_dict()`.

In [66]:
net[2].state_dict()

OrderedDict([('weight',
              tensor([[-0.3092, -0.2822, -0.3134,  0.3273, -0.0952, -0.0106, -0.0956, -0.1442]])),
             ('bias', tensor([0.2549]))])

### Targeted Parameters

In [67]:
type(net[2].bias), net[2].bias.data

(torch.nn.parameter.Parameter, tensor([0.2549]))

In [68]:
net[2].weight.grad is None

True
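
The gradient is `None` only because backpropagation has not run yet. A minimal sketch, assuming explicit `nn.Linear` sizes so it runs standalone, shows the attribute being filled in after a backward pass:

```python
import torch
from torch import nn

# Explicit sizes (4 -> 8 -> 1) are an assumption matching the shapes above.
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
X = torch.rand(2, 4)

grad_before = net[2].weight.grad   # no backward pass yet
net(X).sum().backward()            # populate .grad on every parameter
grad_after = net[2].weight.grad

print(grad_before is None, grad_after.shape)
```

After `backward()`, each parameter's `.grad` has the same shape as the parameter itself.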

### All Parameters at Once

In [69]:
[(name, param.shape) for name, param in net.named_parameters()]

[('0.weight', torch.Size([8, 4])),
 ('0.bias', torch.Size([8])),
 ('2.weight', torch.Size([1, 8])),
 ('2.bias', torch.Size([1]))]

### Tied Parameters

🔑 **KEY INSIGHT**: Shared/tied parameters are the same exact tensor - changing one changes the other. During backpropagation, gradients from multiple uses are added together.

In [70]:
# We need to give the shared layer a name so that we can refer to its
# parameters
shared = nn.LazyLinear(8)
net = nn.Sequential(nn.LazyLinear(8), nn.ReLU(),
                    shared, nn.ReLU(),
                    shared, nn.ReLU(),
                    nn.LazyLinear(1))

net(X)
# Check whether the parameters are the same
print(net[2].weight.data[0] == net[4].weight.data[0])
net[2].weight.data[0, 0] = 100
# Make sure that they are actually the same object rather than just having the
# same value
print(net[2].weight.data[0] == net[4].weight.data[0])

tensor([True, True, True, True, True, True, True, True])
tensor([True, True, True, True, True, True, True, True])
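
That gradients from multiple uses of a tied parameter add up can be verified on a case small enough to work out by hand. The sketch below assumes a single scalar weight `w = 2` applied twice to `x = 3`, so `y = w * (w * x)` and `dy/dw = 2 * w * x = 12`.

```python
import torch
from torch import nn

shared = nn.Linear(1, 1, bias=False)
with torch.no_grad():
    shared.weight.fill_(2.0)       # w = 2, chosen for easy arithmetic

net = nn.Sequential(shared, shared)
same_object = net[0].weight is net[1].weight   # one Parameter, used twice

x = torch.tensor([[3.0]])
y = net(x)                         # y = w * (w * x) = 12
y.sum().backward()

# Each use contributes 6 (outer use: its input w*x = 6; inner use: w * x = 6),
# and autograd accumulates them into a single .grad of 12.
grad = shared.weight.grad
print(same_object, y.item(), grad.item())
```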


---
## 6.3 Parameter Initialization

PyTorch's `nn.init` module provides various preset initialization methods.

In [71]:
import torch
from torch import nn

In [72]:
net = nn.Sequential(nn.LazyLinear(8), nn.ReLU(), nn.LazyLinear(1))
X = torch.rand(size=(2, 4))
net(X).shape

torch.Size([2, 1])

### Built-in Initialization

Initialize all weights as Gaussian with std=0.01, biases to zero.

In [73]:
def init_normal(module):
    if type(module) == nn.Linear:
        nn.init.normal_(module.weight, mean=0, std=0.01)
        nn.init.zeros_(module.bias)

net.apply(init_normal)
net[0].weight.data[0], net[0].bias.data[0]

(tensor([ 0.0136, -0.0090,  0.0121, -0.0087]), tensor(0.))

Initialize all weights to the constant 1 and all biases to zero.

In [74]:
def init_constant(module):
    if type(module) == nn.Linear:
        nn.init.constant_(module.weight, 1)
        nn.init.zeros_(module.bias)

net.apply(init_constant)
net[0].weight.data[0], net[0].bias.data[0]

(tensor([1., 1., 1., 1.]), tensor(0.))

### Different Initializers for Different Layers

🔑 **KEY INSIGHT**: Call `apply()` on individual layers to give them different initialization schemes (e.g., Xavier for the first layer, a constant for the last).

In [75]:
def init_xavier(module):
    if type(module) == nn.Linear:
        nn.init.xavier_uniform_(module.weight)

def init_42(module):
    if type(module) == nn.Linear:
        nn.init.constant_(module.weight, 42)

net[0].apply(init_xavier)
net[2].apply(init_42)
print(net[0].weight.data[0])
print(net[2].weight.data)

tensor([ 0.6192, -0.2691, -0.1070, -0.3047])
tensor([[42., 42., 42., 42., 42., 42., 42., 42.]])


### Custom Initialization

In [76]:
def my_init(module):
    if type(module) == nn.Linear:
        print("Init", *[(name, param.shape)
                        for name, param in module.named_parameters()][0])
        nn.init.uniform_(module.weight, -10, 10)
        module.weight.data *= module.weight.data.abs() >= 5

net.apply(my_init)
net[0].weight[:2]

Init weight torch.Size([8, 4])
Init weight torch.Size([1, 8])


tensor([[0.0000, 0.0000, 9.9566, -0.0000],
        [6.3867, 0.0000, 5.5473, 0.0000]], grad_fn=<SliceBackward0>)
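
The effect of `my_init` is easier to see on a larger layer. Drawing from U(-10, 10) and zeroing every entry with magnitude below 5 should leave roughly half the weights at zero and the rest in [5, 10] or [-10, -5]. A sketch, assuming a 100×100 layer and a fixed seed chosen purely for illustration:

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Linear(100, 100)   # size is an assumption; larger means a tighter estimate
with torch.no_grad():
    nn.init.uniform_(layer.weight, -10, 10)
    # Keep only entries with |w| >= 5; the boolean mask zeroes the others.
    layer.weight.mul_(layer.weight.abs() >= 5)

zero_fraction = (layer.weight == 0).float().mean().item()
nonzero = layer.weight[layer.weight != 0]
min_magnitude = nonzero.abs().min().item()
print(zero_fraction, min_magnitude)
```

The zero fraction lands near 0.5 because the interval (-5, 5) covers half of (-10, 10).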

Direct parameter modification.

In [77]:
net[0].weight.data[:] += 1
net[0].weight.data[0, 0] = 42
net[0].weight.data[0]

tensor([42.0000,  1.0000, 10.9566,  1.0000])

---
## 6.4 Lazy Initialization

🔑 **KEY INSIGHT**: The framework *defers initialization*, waiting until the first data pass to infer layer dimensions on the fly. This eliminates specifying input dimensions upfront.

In [78]:
from d2l import torch as d2l
import torch
from torch import nn

In [79]:
net = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))

At this point, the network doesn't know input dimensions - parameters aren't initialized yet.

In [80]:
net[0].weight

<UninitializedParameter>

Passing data through the network triggers initialization.

In [81]:
X = torch.rand(2, 20)
net(X)

net[0].weight.shape

torch.Size([256, 20])
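
What changes during that first forward pass can be inspected on a single layer. This sketch relies on `torch.nn.parameter.UninitializedParameter`, the placeholder type printed above, and on the lazy layer turning into a regular `nn.Linear` once its shape is known:

```python
import torch
from torch import nn
from torch.nn.parameter import UninitializedParameter

layer = nn.LazyLinear(256)
lazy_before = isinstance(layer.weight, UninitializedParameter)
class_before = type(layer).__name__

layer(torch.rand(2, 20))  # the first forward pass infers in_features = 20

lazy_after = isinstance(layer.weight, UninitializedParameter)
class_after = type(layer).__name__
print(class_before, lazy_before, "->", class_after, lazy_after)
```

This is also why the initialization functions in Section 6.3 can test `type(module) == nn.Linear` on a network built from `nn.LazyLinear`: by the time `apply` runs, the layers have already been materialized.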

### Helper Method for Lazy Initialization

In [82]:
@d2l.add_to_class(d2l.Module)  #@save
def apply_init(self, inputs, init=None):
    self.forward(*inputs)
    if init is not None:
        self.net.apply(init)
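
The same dry-run-then-initialize pattern works without the `d2l` classes. Below is a sketch of a free-function variant. The function `apply_init` here is a hypothetical stand-in for the method above, not part of PyTorch, and `init_xavier` is an illustrative initializer.

```python
import torch
from torch import nn

def apply_init(net, inputs, init=None):  # hypothetical free-function variant
    # A dry-run forward pass materializes every lazy parameter first.
    with torch.no_grad():
        net(*inputs)
    # Only then is it safe to apply an initializer to the weights.
    if init is not None:
        net.apply(init)

def init_xavier(module):  # illustrative initializer
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        nn.init.zeros_(module.bias)

net = nn.Sequential(nn.LazyLinear(8), nn.ReLU(), nn.LazyLinear(1))
apply_init(net, [torch.rand(2, 4)], init_xavier)
print(net[0].weight.shape, net[0].bias)
```

The order matters: the initializer needs concrete weight shapes, and those exist only after the forward pass.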

---
## 6.5 Custom Layers

Building custom layers when framework-provided layers are insufficient.

In [83]:
from d2l import torch as d2l
import torch
from torch import nn
from torch.nn import functional as F

### Layers without Parameters

In [84]:
class CenteredLayer(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, X):
        return X - X.mean()

In [85]:
layer = CenteredLayer()
layer(d2l.tensor([1.0, 2, 3, 4, 5]))

tensor([-2., -1.,  0.,  1.,  2.])

Incorporate custom layer in a model.

In [86]:
net = nn.Sequential(nn.LazyLinear(128), CenteredLayer())

In [87]:
Y = d2l.rand(4, 8)
Y = net(Y)
Y.mean()

tensor(-4.6566e-09, grad_fn=<MeanBackward0>)

### Layers with Parameters

🔑 **KEY INSIGHT**: Use `nn.Parameter` to create learnable parameters. This registers them for automatic gradient computation and optimization.

In [88]:
class MyLinear(nn.Module):
    def __init__(self, in_units, units):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(in_units, units))
        self.bias = nn.Parameter(torch.randn(units,))

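    def forward(self, X):
        # Note: `.data` reads the raw tensors and bypasses autograd, so this
        # output carries no grad_fn; use self.weight and self.bias directly
        # when the layer should be trained
        linear = torch.matmul(X, self.weight.data) + self.bias.data
        return F.relu(linear)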

In [89]:
linear = MyLinear(5, 3)
linear.weight

Parameter containing:
tensor([[ 0.6299,  0.4485, -1.1769],
        [ 1.0370,  0.1767,  0.8941],
        [-0.8960,  2.8925,  1.2377],
        [-0.5161,  1.0163,  1.0485],
        [ 0.5116, -0.2036,  0.4059]], requires_grad=True)

In [90]:
linear(torch.rand(2, 5))

tensor([[1.3679, 0.7704, 2.9373],
        [0.7794, 0.3532, 1.5523]])

Construct models using custom layers.

In [91]:
net = nn.Sequential(MyLinear(64, 8), MyLinear(8, 1))
net(torch.rand(2, 64))

tensor([[14.7006],
        [21.6946]])
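
The outputs above carry no `grad_fn` because `forward` reads `self.weight.data` and `self.bias.data`, which sidestep autograd. For a layer that is meant to be trained, a possible variant uses the parameters directly. `MyLinearTrainable` is a hypothetical name for this sketch.

```python
import torch
from torch import nn
from torch.nn import functional as F

class MyLinearTrainable(nn.Module):  # hypothetical variant of MyLinear
    def __init__(self, in_units, units):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(in_units, units))
        self.bias = nn.Parameter(torch.randn(units))

    def forward(self, X):
        # Using the Parameters themselves keeps the op in the autograd graph.
        return F.relu(torch.matmul(X, self.weight) + self.bias)

layer = MyLinearTrainable(5, 3)
out = layer(torch.rand(2, 5))
out.sum().backward()

tracked = out.requires_grad                  # output is part of the graph
has_grad = layer.weight.grad is not None     # gradients reached the weight
# For contrast: the same product through .data is cut off from autograd.
detached = torch.matmul(torch.rand(2, 5), layer.weight.data).requires_grad
print(tracked, has_grad, detached)
```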

---
## 6.6 File I/O

Saving and loading tensors and model parameters for checkpointing and deployment.

In [92]:
import torch
from torch import nn
from torch.nn import functional as F

### Loading and Saving Tensors

In [93]:
x = torch.arange(4)
torch.save(x, 'x-file')

In [94]:
x2 = torch.load('x-file')
x2

tensor([0, 1, 2, 3])

Store and load a list of tensors.

In [95]:
y = torch.zeros(4)
torch.save([x, y],'x-files')
x2, y2 = torch.load('x-files')
(x2, y2)

(tensor([0, 1, 2, 3]), tensor([0., 0., 0., 0.]))

Save and load a dictionary mapping strings to tensors.

In [96]:
mydict = {'x': x, 'y': y}
torch.save(mydict, 'mydict')
mydict2 = torch.load('mydict')
mydict2

{'x': tensor([0, 1, 2, 3]), 'y': tensor([0., 0., 0., 0.])}

### Loading and Saving Model Parameters

🔑 **KEY INSIGHT**: Save model *parameters* (not the entire model). Because a model can contain arbitrary code, the framework stores only its parameters; to restore it, recreate the architecture in code, then load the parameters from disk.

In [97]:
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.LazyLinear(256)
        self.output = nn.LazyLinear(10)

    def forward(self, x):
        return self.output(F.relu(self.hidden(x)))

net = MLP()
X = torch.randn(size=(2, 20))
Y = net(X)

Store the parameters.

In [98]:
torch.save(net.state_dict(), 'mlp.params')

Recover the model by loading parameters.

In [99]:
clone = MLP()
clone.load_state_dict(torch.load('mlp.params'))
clone.eval()

MLP(
  (hidden): LazyLinear(in_features=0, out_features=256, bias=True)
  (output): LazyLinear(in_features=0, out_features=10, bias=True)
)

Verify both instances produce the same output.

In [100]:
Y_clone = clone(X)
Y_clone == Y

tensor([[True, True, True, True, True, True, True, True, True, True],
        [True, True, True, True, True, True, True, True, True, True]])
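
A self-contained version of the save-and-restore round trip is sketched below. It assumes explicit layer sizes and a temporary directory so that no files are left behind. The `map_location` argument of `torch.load` is included because checkpoints written on a GPU are often read on a CPU-only machine.

```python
import os
import tempfile

import torch
from torch import nn

def make_mlp():  # hypothetical helper: the architecture is recreated in code
    return nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))

net = make_mlp()
X = torch.randn(2, 20)
Y = net(X)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "mlp.params")
    torch.save(net.state_dict(), path)
    # map_location="cpu" remaps any stored GPU tensors onto the CPU.
    state = torch.load(path, map_location="cpu")

clone = make_mlp()
result = clone.load_state_dict(state)   # reports missing/unexpected keys
clone.eval()
outputs_match = torch.equal(clone(X), Y)
print(result.missing_keys, result.unexpected_keys, outputs_match)
```

Checking the returned `missing_keys` and `unexpected_keys` is a cheap way to catch a mismatch between the saved parameters and the recreated architecture.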

---
## 6.7 GPUs

Using GPUs for accelerated computation. GPU performance has increased ~1000x every decade since 2000.

In [101]:
from d2l import torch as d2l
import torch
from torch import nn

### Computing Devices

In [102]:
def cpu():  #@save
    """Get the CPU device."""
    return torch.device('cpu')

def gpu(i=0):  #@save
    """Get a GPU device."""
    return torch.device(f'cuda:{i}')

cpu(), gpu(), gpu(1)

(device(type='cpu'),
 device(type='cuda', index=0),
 device(type='cuda', index=1))

In [103]:
def num_gpus():  #@save
    """Get the number of available GPUs."""
    return torch.cuda.device_count()

num_gpus()

1

In [104]:
def try_gpu(i=0):  #@save
    """Return gpu(i) if exists, otherwise return cpu()."""
    if num_gpus() >= i + 1:
        return gpu(i)
    return cpu()

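def try_all_gpus():  #@save
    """Return all available GPUs, or [cpu(),] if no GPU exists."""
    devices = [gpu(i) for i in range(num_gpus())]
    # Fall back to the CPU so the result matches the docstring on GPU-free machines
    return devices if devices else [cpu()]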

try_gpu(), try_gpu(10), try_all_gpus()

(device(type='cuda', index=0),
 device(type='cpu'),
 [device(type='cuda', index=0)])
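
Outside the `d2l` helpers, the same fallback logic is commonly written with `torch.cuda.is_available()`. A minimal device-agnostic sketch that runs with or without a GPU:

```python
import torch

# Pick the GPU when one is present, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create both operands directly on the chosen device.
x = torch.ones(2, 3, device=device)
y = torch.rand(2, 3, device=device)
z = x + y   # works because both operands live on the same device

print(device.type, z.device)
```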

### Tensors and GPUs

Query tensor device.

In [105]:
x = torch.tensor([1, 2, 3])
x.device

device(type='cpu')

### Storage on the GPU

🔑 **KEY INSIGHT**: All operands must be on the same device. Transferring data between devices is slow - minimize transfers for performance.

In [106]:
X = torch.ones(2, 3, device=try_gpu())
X

tensor([[1., 1., 1.],
        [1., 1., 1.]], device='cuda:0')

In [107]:
Y = torch.rand(2, 3, device=try_gpu(1))
Y

tensor([[0.7984, 0.7177, 0.8226],
        [0.1696, 0.0482, 0.4586]])

This machine has only one GPU (`num_gpus()` returned 1), so `try_gpu(1)` falls back to the CPU; that is why `Y` prints without a `device` field.

### Copying Between Devices

In [108]:
Z = X.to(try_gpu(1))
print(X)
print(Z)

tensor([[1., 1., 1.],
        [1., 1., 1.]], device='cuda:0')
tensor([[1., 1., 1.],
        [1., 1., 1.]])


In [109]:
Y + Z

tensor([[1.7984, 1.7177, 1.8226],
        [1.1696, 1.0482, 1.4586]])

If a tensor already lives on the target device, `to()` returns the tensor itself without copying.

In [110]:
Z.to(try_gpu(1)) is Z

True
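
The no-copy behavior can be checked on the CPU alone. The sketch below contrasts three calls; it is an illustration of `Tensor.to`, not an exhaustive account of when copies happen.

```python
import torch

x = torch.ones(2, 3)                     # created on the CPU

same = x.to("cpu") is x                  # already on the target device: no copy
converted = x.to(torch.float64) is x     # dtype changes, so a new tensor results
forced = x.to("cpu", copy=True) is x     # copy=True requests a fresh tensor

print(same, converted, forced)
```

Writing `x = x.to(device)` is therefore cheap when `x` is already in place, which keeps device-agnostic code simple.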

### Neural Networks and GPUs

In [111]:
net = nn.Sequential(nn.LazyLinear(1))
net = net.to(device=try_gpu())

In [112]:
net(X)

tensor([[-0.1139],
        [-0.1139]], device='cuda:0', grad_fn=<AddmmBackward0>)

Confirm model parameters are on GPU.

In [113]:
net[0].weight.data.device

device(type='cuda', index=0)

### Trainer GPU Support

In [114]:
@d2l.add_to_class(d2l.Trainer)  #@save
def __init__(self, max_epochs, num_gpus=0, gradient_clip_val=0):
    self.save_hyperparameters()
    self.gpus = [d2l.gpu(i) for i in range(min(num_gpus, d2l.num_gpus()))]

@d2l.add_to_class(d2l.Trainer)  #@save
def prepare_batch(self, batch):
    if self.gpus:
        batch = [d2l.to(a, self.gpus[0]) for a in batch]
    return batch

@d2l.add_to_class(d2l.Trainer)  #@save
def prepare_model(self, model):
    model.trainer = self
    model.board.xlim = [0, self.max_epochs]
    if self.gpus:
        model.to(self.gpus[0])
    self.model = model

---
## Summary

This chapter covered:
- **Modules**: Building blocks that can be layers, groups of layers, or entire models
- **Parameter Management**: Accessing, initializing, and sharing parameters
- **Lazy Initialization**: Deferring parameter creation until first data pass
- **Custom Layers**: Creating layers with or without learnable parameters
- **File I/O**: Saving and loading tensors and model parameters
- **GPUs**: Utilizing hardware acceleration for faster computation