
(From the text)

Most of the time, we will be able to ignore the nitty-gritty details of how parameters are declared and manipulated, relying on deep learning frameworks to do the heavy lifting. However, when we move away from stacked architectures with standard layers, we will sometimes need to get into the weeds of declaring and manipulating parameters. In this section, we cover the following:

- Accessing parameters for debugging, diagnostics, and visualizations.
- Sharing parameters across different model components.

In [1]:
import torch
from torch import nn

In [2]:
net = nn.Sequential(nn.LazyLinear(8),
                    nn.ReLU(),
                    nn.LazyLinear(1))

X = torch.rand(size=(2, 4))
net(X).shape
# batch size of 2, 1 output feature



torch.Size([2, 1])

In [3]:
print(X) # 2 rows 4 cols OR batch size 2 and 4 features

tensor([[0.3860, 0.8639, 0.2895, 0.1375],
        [0.2288, 0.2885, 0.3286, 0.4527]])


# Parameter Access

In [4]:
# in Sequential class, layers can be accessed like a list
net[2].state_dict() # output layer.
# the weight has 8 entries: the first layer outputs 8 features, ReLU keeps
# that shape, so the output layer takes 8 inputs and maps them to 1 output

OrderedDict([('weight',
              tensor([[-0.3389, -0.2414,  0.1266, -0.3465, -0.0219, -0.2282, -0.2693, -0.0183]])),
             ('bias', tensor([-0.2206]))])

In [5]:
net[2].bias

Parameter containing:
tensor([-0.2206], requires_grad=True)

In [6]:
type(net[2].bias)

torch.nn.parameter.Parameter

In [7]:
net[2].bias.data

tensor([-0.2206])

In [9]:
# no gradient yet, since we have not invoked backpropagation
net[2].weight.grad is None

True
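To see the other side of this check, the sketch below runs one backward pass and looks at the gradient again. It uses a separate illustrative network (`demo_net` and `demo_X` are names assumed here, not part of the notebook) with explicit `nn.Linear` sizes so it runs on its own.

```python
import torch
from torch import nn

torch.manual_seed(0)
# Illustrative stand-in for `net`, with the same layer sizes
demo_net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
demo_X = torch.rand(2, 4)

# Before any backward pass, no gradient has been stored
grad_before = demo_net[2].weight.grad

# Reduce the output to a scalar so that backward() can be called
demo_net(demo_X).sum().backward()
grad_after = demo_net[2].weight.grad

print(grad_before is None)  # True
print(grad_after.shape)     # torch.Size([1, 8]), same shape as the weight
```

The gradient has the same shape as the parameter it belongs to, and it accumulates across backward passes until it is reset.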

In [10]:
# access all parameters
[(name, param.shape) for name, param in net.named_parameters()]
# note that weight shapes are [output features, input features],
# not [batch size, num of features] like the data

[('0.weight', torch.Size([8, 4])),
 ('0.bias', torch.Size([8])),
 ('2.weight', torch.Size([1, 8])),
 ('2.bias', torch.Size([1]))]
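The names returned by `named_parameters()` are also the keys of the model's `state_dict()`, so a single parameter can be looked up by name. The sketch below shows this on a separate illustrative network (`demo_net` is an assumed name) and also counts the total number of parameters.

```python
import torch
from torch import nn

# Illustrative stand-in for `net`, with the same layer sizes
demo_net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

# Look up one parameter tensor by its name
output_bias = demo_net.state_dict()['2.bias']
print(output_bias.shape)  # torch.Size([1])

# Total number of scalar parameters: 8*4 + 8 + 1*8 + 1
num_params = sum(p.numel() for p in demo_net.parameters())
print(num_params)  # 49
```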

In [11]:
# share parameters across multiple layers
# give the shared layer a name
shared = nn.LazyLinear(8)
net = nn.Sequential(nn.LazyLinear(8), nn.ReLU(),
                    shared, nn.ReLU(),
                    shared, nn.ReLU(),
                    nn.LazyLinear(1))

net(X)



tensor([[-0.1818],
        [-0.1781]], grad_fn=<AddmmBackward0>)

In [12]:
# check whether the parameters are the same
print(net[2].weight.data[0] == net[4].weight.data[0])

tensor([True, True, True, True, True, True, True, True])


In [13]:
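net[2].weight.data[0, 0] = 100
print(net[2].weight.data[0] == net[4].weight.data[0])
# still all True: net[2] and net[4] are the same module object,
# so a change made through one is visible through the other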

tensor([True, True, True, True, True, True, True, True])


(From the text)

You might wonder, when parameters are tied what happens to the gradients? Since the model parameters contain gradients, the gradients of the second hidden layer and the third hidden layer are added together during backpropagation.
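One way to check the claim about summed gradients is to compare a tied network against an untied copy with identical values. This is a minimal sketch, not the text's own code; `tied`, `untied`, `first`, and `last` are names assumed here, and explicit `nn.Linear` sizes are used so the layers can be copied before the first forward pass.

```python
import copy
import torch
from torch import nn

torch.manual_seed(0)
X = torch.rand(2, 4)

first, shared, last = nn.Linear(4, 8), nn.Linear(8, 8), nn.Linear(8, 1)
tied = nn.Sequential(first, nn.ReLU(),
                     shared, nn.ReLU(),
                     shared, nn.ReLU(),
                     last)

# Same starting values, but the two middle layers are independent copies
untied = nn.Sequential(copy.deepcopy(first), nn.ReLU(),
                       copy.deepcopy(shared), nn.ReLU(),
                       copy.deepcopy(shared), nn.ReLU(),
                       copy.deepcopy(last))

tied(X).sum().backward()
untied(X).sum().backward()

# The tied gradient should equal the sum of the two untied gradients
summed = untied[2].weight.grad + untied[4].weight.grad
print(tied[2] is tied[4])                           # True: one module, used twice
print(torch.allclose(shared.weight.grad, summed, atol=1e-6))
```

Because both uses of `shared` write into the same `.grad` tensor, the contributions from the two positions in the network are accumulated rather than overwritten.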

# Parameter Initialization

Initialization is more than drawing random numbers: there are several initialization schemes, and a poor choice can lead to vanishing or exploding gradients. More discussion [here](https://d2l.ai/chapter_multilayer-perceptrons/numerical-stability-and-init.html#sec-numerical-stability)

In [14]:
import torch
from torch import nn

In [15]:
# by default, PyTorch draws weights and biases from a uniform distribution
# whose range is computed from the layer dimensions
net = nn.Sequential(nn.LazyLinear(8), nn.ReLU(),
                    nn.LazyLinear(1))
X = torch.rand(size=(2, 4))
net(X).shape # [batch size, num of output features]



torch.Size([2, 1])
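As a rough check on the default, the `nn.Linear` documentation describes the initial values as drawn from U(-sqrt(k), sqrt(k)) with k = 1/in_features. The sketch below assumes that documented behaviour holds for the installed PyTorch version; `layer` and `bound` are illustrative names.

```python
import math
import torch
from torch import nn

in_features = 4
layer = nn.Linear(in_features, 8)  # illustrative layer, not part of `net`

# Assumed default range, taken from the nn.Linear documentation
bound = 1 / math.sqrt(in_features)

print(bound)  # 0.5
print(layer.weight.abs().max().item() <= bound)
print(layer.bias.abs().max().item() <= bound)
```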

In [19]:
# nn.init module has preset initialization methods
# here we are initializing all weight parameters as Gaussian random variables
# with standard deviation 0.01, while bias parameters are cleared to 0
def init_normal(module):
  if type(module) == nn.Linear:
    print("setting the weights!")
    nn.init.normal_(module.weight, mean=0, std=0.01)
    nn.init.zeros_(module.bias)

In [17]:
type(net)

In [18]:
type(net) == nn.Linear

False

In [20]:
net.apply(init_normal)
net[0].weight.data, net[0].bias.data[0]

setting the weights!
setting the weights!


(tensor([[-0.0005, -0.0017,  0.0034,  0.0200],
         [-0.0058,  0.0106,  0.0013,  0.0011],
         [-0.0220, -0.0092, -0.0226,  0.0016],
         [-0.0023, -0.0020,  0.0052,  0.0110],
         [-0.0082, -0.0036,  0.0106, -0.0116],
         [ 0.0145,  0.0122,  0.0045,  0.0082],
         [ 0.0069,  0.0045, -0.0097, -0.0055],
         [-0.0024, -0.0158, -0.0190, -0.0052]]),
 tensor(0.))

In [21]:
# can also initialize to a constant value
def init_constant(module):
  if type(module) == nn.Linear:
    print("setting the weights!")
    nn.init.constant_(module.weight, 1)
    nn.init.zeros_(module.bias)

net.apply(init_constant)
net[0].weight.data, net[0].bias.data

setting the weights!
setting the weights!


(tensor([[1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.]]),
 tensor([0., 0., 0., 0., 0., 0., 0., 0.]))

In [22]:
# can also apply different initializations to different blocks
# e.g. Xavier (Glorot) initialization for the first layer, a constant for the second
def init_xavier(module):
  if type(module) == nn.Linear:
    nn.init.xavier_uniform_(module.weight)

def init_42(module):
  if type(module) == nn.Linear:
    nn.init.constant_(module.weight, 42)

net[0].apply(init_xavier)
net[2].apply(init_42)
print(net[0].weight.data)
print(net[2].weight.data)

tensor([[-0.0569,  0.1886,  0.1994,  0.1537],
        [ 0.6438, -0.1747, -0.2422, -0.2468],
        [ 0.3365,  0.5402, -0.6510,  0.2670],
        [-0.3514, -0.5031,  0.6726, -0.2203],
        [ 0.5402,  0.7013, -0.5386, -0.2538],
        [ 0.1620,  0.1076, -0.5887,  0.6259],
        [-0.5505, -0.5028, -0.4610, -0.7055],
        [ 0.2180,  0.4093, -0.0358, -0.5391]])
tensor([[42., 42., 42., 42., 42., 42., 42., 42.]])
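The Xavier values above all stay inside a fixed range. `nn.init.xavier_uniform_` samples from U(-a, a) with a = gain * sqrt(6 / (fan_in + fan_out)), and the default gain is 1. The sketch below computes that bound for a 4-to-8 layer (`layer` and `bound` are illustrative names) and checks that the sampled weights respect it.

```python
import math
import torch
from torch import nn

fan_in, fan_out = 4, 8
layer = nn.Linear(fan_in, fan_out)  # illustrative layer, not part of `net`
nn.init.xavier_uniform_(layer.weight)

# Bound of the uniform distribution with the default gain of 1
bound = math.sqrt(6 / (fan_in + fan_out))

print(round(bound, 4))  # 0.7071
print(layer.weight.abs().max().item() <= bound)
```

This agrees with the printed weights for `net[0]`, whose largest magnitude is just under 0.7071.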


We want to represent the following distribution as an initialization function:

$$
w \sim \begin{cases}
    U(5, 10) & \textrm{ with probability } \frac{1}{4} \\
    0 & \textrm{ with probability } \frac{1}{2} \\
    U(-10, -5) & \textrm{ with probability } \frac{1}{4}
\end{cases}
$$

In [26]:
def my_init(module):
  if type(module) == nn.Linear:
    print("Init", *[(name, param.shape) for
                    name, param in module.named_parameters()][0])
    nn.init.uniform_(module.weight, -10, 10)
    module.weight.data *= module.weight.data.abs() >= 5

net.apply(my_init)
net[0].weight[:2]
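# this matches the distribution of w: a U(-10, 10) draw has |w| >= 5 with
# probability 1/2, split evenly between [5, 10] and [-10, -5]; the other
# half of the draws is zeroed by the mask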

Init weight torch.Size([8, 4])
Init weight torch.Size([1, 8])


tensor([[ 0.0000,  5.8175, -9.7044, -0.0000],
        [-7.5696,  0.0000,  8.5682, -9.2032]], grad_fn=<SliceBackward0>)
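An empirical check of `my_init` is to apply the same two steps to a large tensor and measure how often each case occurs. This is a sketch under the assumption that a million samples is enough to see the proportions clearly; `w` and the `frac_*` names are illustrative.

```python
import torch
from torch import nn

torch.manual_seed(0)
w = torch.empty(1000, 1000)

# Same two steps as my_init: sample U(-10, 10), then zero out |w| < 5
nn.init.uniform_(w, -10, 10)
w *= w.abs() >= 5

frac_zero = (w == 0).float().mean().item()
frac_pos = (w >= 5).float().mean().item()
frac_neg = (w <= -5).float().mean().item()

# Expected to be close to 0.5, 0.25 and 0.25 respectively
print(frac_zero, frac_pos, frac_neg)
```

Every surviving value lies in [5, 10] or [-10, -5], and within each interval it is still uniformly distributed because the mask only discards values, it does not rescale them.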

In [31]:
# we always have the option of setting parameters directly
net[0].weight.data[:] += 1
net[0].weight.data[0, 0] = 42
net[0].weight.data
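# each re-run of this cell adds 1 again; the values below are shifted by 5
# relative to the In [26] output, consistent with the cell having run five times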

tensor([[42.0000, 10.8175, -4.7044,  5.0000],
        [-2.5696,  5.0000, 13.5682, -4.2032],
        [-2.1281,  5.0000,  5.0000,  5.0000],
        [10.1413,  5.0000,  5.0000, -3.9971],
        [11.2375, -0.5513,  5.0000, -1.1889],
        [11.9251,  5.0000,  5.0000,  5.0000],
        [-4.0174,  5.0000,  5.0000,  5.0000],
        [ 5.0000, 13.5916,  5.0000,  5.0000]])
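Writing through `.data` works, but a commonly used alternative is to modify the parameter inside a `torch.no_grad()` block, which keeps the update out of the autograd graph without bypassing the parameter itself. The sketch below assumes a fresh illustrative layer (`layer` is not part of the notebook) whose weights start at zero so the result is easy to read.

```python
import torch
from torch import nn

layer = nn.Linear(4, 8)  # illustrative layer, not part of `net`
nn.init.zeros_(layer.weight)

# In-place edits on a parameter are allowed while gradient tracking is off
with torch.no_grad():
    layer.weight += 1
    layer.weight[0, 0] = 42

print(layer.weight[0])             # first entry 42, the rest 1
print(layer.weight.requires_grad)  # True: still a trainable parameter
```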