## Deferred Initialization

There’s no way PyTorch (or any other framework, for that matter) could predict what the input dimensionality of a network would be. Later on, when working with convolutional networks and images, this problem will become even more pertinent, since the input dimensionality (i.e., the resolution of an image) will affect the dimensionality of subsequent layers at a long range. Hence, the ability to set parameters without knowing the dimensionality at the time of writing the code can greatly simplify statistical modeling. In what follows, we will discuss how this works using initialization as an example. After all, we cannot initialize variables that we don’t know exist.

## Instantiating a Network

In [2]:
import torch
import torch.nn as nn


def getnet(in_features, out_features):
    # Two-layer MLP: in_features -> 256 -> out_features
    net = nn.Sequential(
        nn.Linear(in_features, 256),
        nn.ReLU(),
        nn.Linear(256, out_features),
    )
    return net


net = getnet(20, 10)

### In PyTorch, it is not possible to define an nn.Linear layer without specifying its in_features

In [3]:
for name,param in net.named_parameters():
    print(name,'\t\t',param.shape)

0.weight 		 torch.Size([256, 20])
0.bias 		 torch.Size([256])
2.weight 		 torch.Size([10, 256])
2.bias 		 torch.Size([10])
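
A quick way to check the statement above is to leave `in_features` out and look at the error. This is only a small sketch; the helper name `try_linear_without_in_features` is made up for illustration.

```python
import torch.nn as nn


def try_linear_without_in_features():
    """Return the error message raised when in_features is omitted."""
    try:
        nn.Linear(out_features=256)  # in_features deliberately left out
    except TypeError as err:
        return str(err)
    return None


message = try_linear_without_in_features()
print(message)
```

The constructor refuses to build the layer, because it needs both dimensions to allocate the weight matrix right away.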


It is nevertheless possible to make a network learn its input size from the data by writing a custom nn.Module in PyTorch.
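
One possible shape for such a custom module is sketched below. The class name `DeferredLinear` and its design (registering the weight as `None` and creating it on the first forward pass) are assumptions made for illustration, not something the original text prescribes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeferredLinear(nn.Module):
    """Hypothetical dense layer that creates its weight on the first forward pass."""

    def __init__(self, out_features):
        super().__init__()
        self.out_features = out_features
        # The weight's shape is unknown until we see data, so register a placeholder.
        self.register_parameter("weight", None)
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        if self.weight is None:
            # Infer in_features from the last dimension of the first input.
            weight = torch.empty(
                self.out_features, x.shape[-1], device=x.device, dtype=x.dtype
            )
            nn.init.xavier_uniform_(weight)
            self.weight = nn.Parameter(weight)
        return F.linear(x, self.weight, self.bias)


deferred_net = nn.Sequential(DeferredLinear(256), nn.ReLU(), DeferredLinear(10))
n_before = len(list(deferred_net.parameters()))  # only the two biases exist so far

y = deferred_net(torch.rand(2, 20))  # first forward pass creates the weights
shapes = {name: tuple(p.shape) for name, p in deferred_net.named_parameters()}
print(n_before, shapes)
```

One caveat with this approach: an optimizer created before the first forward pass would not see the weights, since they do not exist yet.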

## Deferred Initialization in Practice
Now that we know how it works in theory, let’s see when the initialization is actually triggered. To do so, we mock up an initializer. When invoked through net.apply, **init_weights** prints the class name of each module it visits, so we can see when, and on which modules, initialization runs. The weights printed afterwards are the nonzero random values from the layers' default initialization. Starting from nonzero random values helps: if all weights started at zero, every hidden unit in a layer would compute the same output and receive the same gradient, and since neural networks tend to get stuck in local minima, it is a good idea to be able to give them many different starting values.


In [16]:
def init_weights(m):
    # Mock initializer: only report which module is being visited.
    print("Init", m.__class__.__name__)

net.apply(init_weights)
print(net[0].weight)
print(net[2].weight)

Init Linear
Init ReLU
Init Linear
Init Sequential
Parameter containing:
tensor([[-0.0466,  0.0647,  0.2163,  ...,  0.0978,  0.1920, -0.0540],
        [-0.1062,  0.1817, -0.0595,  ..., -0.1299,  0.0142, -0.1386],
        [ 0.1698,  0.0321, -0.1018,  ..., -0.1713,  0.0010,  0.1861],
        ...,
        [-0.1661,  0.0226, -0.0038,  ...,  0.1891,  0.1786, -0.0330],
        [-0.1302, -0.0338, -0.1963,  ..., -0.1408,  0.1567, -0.1617],
        [ 0.1900,  0.0453,  0.1034,  ...,  0.0176,  0.2069,  0.0638]],
       requires_grad=True)
Parameter containing:
tensor([[-0.0330, -0.0555,  0.0070,  ..., -0.0128, -0.0463,  0.0476],
        [ 0.0429, -0.0208,  0.0578,  ..., -0.0459, -0.0164,  0.0171],
        [ 0.0593,  0.0171, -0.0435,  ...,  0.0578, -0.0053,  0.0499],
        ...,
        [ 0.0112, -0.0114,  0.0338,  ...,  0.0055, -0.0406, -0.0320],
        [ 0.0511,  0.0041,  0.0606,  ...,  0.0600,  0.0384,  0.0513],
        [ 0.0591, -0.0084, -0.0215,  ...,  0.0515, -0.0211,  0.0217]],
       requires_grad=True)
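
The mock initializer above only reports which modules `apply` visits. A version that actually sets the parameters could look like the sketch below. The name `init_normal` and the choice of a Gaussian with standard deviation 0.01 are assumptions for illustration.

```python
import torch
import torch.nn as nn


def init_normal(m):
    """Hypothetical initializer: small Gaussian weights, zero biases."""
    if isinstance(m, nn.Linear):
        nn.init.normal_(m.weight, mean=0.0, std=0.01)
        nn.init.zeros_(m.bias)


demo_net = nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))
demo_net.apply(init_normal)  # visits each submodule, then the container itself

weight_std = demo_net[0].weight.std().item()
bias_sum = demo_net[0].bias.abs().sum().item()
print(weight_std, bias_sum)
```

The `isinstance` check matters because `apply` also visits modules such as `nn.ReLU` and the `nn.Sequential` container, which have no weight to initialize.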

In [17]:
x=torch.rand((2,20))
y=net(x) # Forward computation
for name,param in net.named_parameters():
    print(name,'\t\t',param.shape)

0.weight 		 torch.Size([256, 20])
0.bias 		 torch.Size([256])
2.weight 		 torch.Size([10, 256])
2.bias 		 torch.Size([10])


Because we passed in_features explicitly, the shapes are the same before and after the forward pass. With deferred initialization, this is the point at which they would be bound: as soon as we know the input dimensionality, $\mathbf{x} \in \mathbb{R}^{20}$, it is possible to define the weight matrix for the first layer, i.e., $\mathbf{W}_1 \in \mathbb{R}^{256 \times 20}$. With that out of the way, we can progress to the second layer, define its dimensionality to be $10 \times 256$, and so on through the computational graph, binding all the dimensions as they become available. Once these are known, we can proceed by initializing the parameters.
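
The way dimensions propagate from layer to layer can be traced in a few lines of plain Python. This is only a sketch of the bookkeeping, not framework code; the function name `infer_weight_shapes` is hypothetical.

```python
def infer_weight_shapes(input_dim, layer_widths):
    """Bind each dense layer's weight shape once the input dimension is known."""
    shapes = []
    in_dim = input_dim
    for out_dim in layer_widths:
        # A dense layer mapping in_dim -> out_dim stores an (out_dim, in_dim) matrix.
        shapes.append((out_dim, in_dim))
        in_dim = out_dim  # this layer's output feeds the next layer
    return shapes


shapes = infer_weight_shapes(20, [256, 10])
print(shapes)  # [(256, 20), (10, 256)]
```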

Deferred initialization can also cause confusion. Before the first forward computation, the parameter shapes are unknown, so we cannot directly manipulate the model parameters; for example, we cannot use the data and set_data functions to get and modify them. Therefore, we often force initialization by sending a sample observation through the network.
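
The layers used so far have no deferred mode, but PyTorch 1.8 and later also ship lazy variants such as `nn.LazyLinear`, which can illustrate forcing initialization with a sample input. The sketch below assumes such a version is installed; the variable names are made up for illustration.

```python
import torch
import torch.nn as nn
from torch.nn.parameter import UninitializedParameter

# LazyLinear takes only out_features; in_features is inferred on the first call.
lazy_net = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))
pending_before = isinstance(lazy_net[0].weight, UninitializedParameter)

# Force initialization by sending one dummy observation through the network.
with torch.no_grad():
    lazy_net(torch.zeros(1, 20))

pending_after = isinstance(lazy_net[0].weight, UninitializedParameter)

# Now the parameters have shapes and can be modified directly.
with torch.no_grad():
    lazy_net[0].weight.fill_(0.5)

print(pending_before, pending_after, tuple(lazy_net[0].weight.shape))
```

The dummy input only needs the right feature dimension; its values are irrelevant, since the forward pass is used purely to bind the shapes.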

## Forced Initialization

Deferred initialization does not occur if the system knows the shape of all parameters when calling the initialize
function. This can occur in two cases:
1. We’ve already seen some data and we just want to reset the parameters.
2. We specified all input and output dimensions of the network when defining it.


The first case works just fine, as illustrated below.

Once we have seen some data and the parameters are defined, we can initialize those parameters again using the **init_weights_forced** function.

In [18]:
net1=nn.Sequential()
net1.add_module("Linear1",nn.Linear(20,256))
net1.add_module("Linear2",nn.Linear(256,10))


In [19]:
def init_weights_forced(m):
    # Re-initialize only the dense layers; other modules are left untouched.
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)

In the second case, we specify the remaining set of parameters when initializing the network.

In [20]:
net1.apply(init_weights_forced)
for name,param in net1.named_parameters():
    print(name,'\t\t',param.shape)

Linear1.weight 		 torch.Size([256, 20])
Linear1.bias 		 torch.Size([256])
Linear2.weight 		 torch.Size([10, 256])
Linear2.bias 		 torch.Size([10])



In the case above, the dimensions are known beforehand, so the model can set in_features and initialize its parameters immediately.

The second case requires us to specify the remaining set of parameters when creating the layer. For instance, for dense
layers we also need to specify the in_features so that initialization can occur immediately once the initializer is called.
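
When all dimensions are given, re-initialization takes effect immediately, which can be verified directly. The sketch below re-initializes a fully specified layer with Xavier uniform initialization and checks the result against the bound $\sqrt{6 / (\text{fan\_in} + \text{fan\_out})}$. The helper name `reinit_xavier` is hypothetical.

```python
import math

import torch
import torch.nn as nn


def reinit_xavier(m):
    """Hypothetical helper: Xavier-uniform re-initialization for dense layers."""
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)


layer = nn.Linear(20, 256)  # both dimensions given, so the weight exists already
before = layer.weight.detach().clone()

layer.apply(reinit_xavier)  # takes effect immediately, no forward pass needed

# Xavier uniform draws from U(-a, a) with a = sqrt(6 / (fan_in + fan_out)).
bound = math.sqrt(6.0 / (20 + 256))
changed = not torch.equal(before, layer.weight)
within_bound = layer.weight.abs().max().item() <= bound + 1e-6
print(changed, within_bound)
```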

## Summary

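1.  Standard PyTorch layers such as nn.Linear do not defer initialization; their input dimensions must be given up front. (PyTorch 1.8 and later also ship lazy variants such as nn.LazyLinear that infer in_features on the first forward pass.)
2.  Deferred initialization is a good thing. It allows us to set many things automagically and it removes a great
    source of errors from defining novel network architectures.
3.  We can override this by specifying all implicitly defined variables.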


## Exercises

1. What happens if you specify only parts of the input dimensions? Do you still get immediate initialization?
2. What happens if you specify mismatching dimensions?
3. What would you need to do if you have input of varying dimensionality? Hint - look at parameter tying.