<a href="https://colab.research.google.com/github/jeffheaton/t81_558_deep_learning/blob/pytorch/t81_558_class_03_5_weights.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# T81-558: Applications of Deep Neural Networks
**Module 3: Introduction to PyTorch**
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

# Module 3 Material

* Part 3.1: Deep Learning and Neural Network Introduction [[Video]](https://www.youtube.com/watch?v=OaJntP14cRA&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_03_1_neural_net.ipynb)
* Part 3.2: Introduction to PyTorch [[Video]](https://www.youtube.com/watch?v=z5X2qV5h_p0&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_03_2_pytorch.ipynb)
* Part 3.3: Saving and Loading a PyTorch Neural Network [[Video]](https://www.youtube.com/watch?v=NkG8w_Ua2Yo&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_03_3_save_load.ipynb)
* Part 3.4: Early Stopping in PyTorch to Prevent Overfitting [[Video]](https://www.youtube.com/watch?v=7Fboe7_aTtY&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_03_4_early_stop.ipynb)
* **Part 3.5: Extracting Weights and Manual Calculation** [[Video]](https://www.youtube.com/watch?v=Fw9VqcqFP_c&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_03_5_weights.ipynb)

# Google CoLab Instructions

The following code ensures that Google CoLab is running and maps Google Drive if needed.

In [1]:
try:
    import google.colab
    COLAB = True
    print("Note: using Google CoLab")
except ImportError:
    print("Note: not using Google CoLab")
    COLAB = False

Note: using Google CoLab


# Part 3.5: Extracting Weights and Manual Network Calculation

## Weight Initialization

The weights of a neural network determine the output for the neural network. The training process can adjust these weights, so the neural network produces useful output. Most neural network training algorithms begin by initializing the weights to a random state. Training then progresses through iterations that continuously improve the weights to produce better output.

The random weights of a neural network impact how well that neural network can be trained. If a neural network fails to train, you can remedy the problem by simply restarting with a new set of random weights. However, this solution can be frustrating when you are experimenting with the architecture of a neural network and trying different combinations of hidden layers and neurons. If you add a new layer, and the network’s performance improves, you must ask yourself if this improvement resulted from the new layer or from a new set of weights. Because of this uncertainty, we look for two key attributes in a weight initialization algorithm:

* How consistently does this algorithm provide good weights?
* How much of an advantage do the weights of the algorithm provide?

One of the most common yet least practical approaches to weight initialization is to set the weights to random values within a specific range. Numbers between -1 and +1 or -5 and +5 are often the choice. If you want to ensure that you get the same set of random weights each time, you should use a seed. The seed specifies a set of predefined random weights to use. For example, a seed of 1000 might produce random weights of 0.5, 0.75, and 0.2. These values are still random; you cannot predict them, yet you will always get these values when you choose a seed of 1000. 
Not all seeds are created equal. One problem with random weight initialization is that the random weights created by some seeds are much more difficult to train than others. The weights can be so bad that training is impossible. If you cannot train a neural network with a particular weight set, you should generate a new set of weights using a different seed.
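
The effect of a seed can be sketched with Python's standard `random` module instead of PyTorch, so the example runs anywhere. The helper name `random_weights` and the range of -1 to +1 are assumptions made for illustration. PyTorch's `torch.manual_seed`, used later in this part, plays the same role for its own generator.

```python
import random

def random_weights(seed, count=3, low=-1.0, high=1.0):
    """Return `count` uniform random weights in [low, high] for a given seed.

    Hypothetical helper for illustration; not part of PyTorch.
    """
    rng = random.Random(seed)  # private generator, so the global state is untouched
    return [rng.uniform(low, high) for _ in range(count)]

first = random_weights(1000)
second = random_weights(1000)
other = random_weights(1001)

print(first == second)  # same seed, same weights
print(first == other)   # a different seed gives a different set of weights
```

Because each seed maps to one fixed set of weights, trying a new seed is a cheap way to get a different starting point when a network refuses to train.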

Because weight initialization is a problem, considerable research has been conducted around it. By default, PyTorch uses a [uniform random distribution](https://discuss.pytorch.org/t/how-are-layer-weights-and-biases-initialized-by-default/13073) based on the size of the layer. The Xavier weight initialization algorithm, introduced in 2010 by Glorot & Bengio[[Cite:glorot2010understanding]](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf), is also a common choice for weight initialization. This relatively simple algorithm uses normally distributed random numbers.  

To use the Xavier weight initialization, it is necessary to understand that normally distributed random numbers are not the typical random numbers between 0 and 1 that most programming languages generate. Normally distributed random numbers are centered on a mean ($\mu$, mu) that is typically 0. If 0 is the center (mean), then you will get roughly equal numbers of random values above and below 0. The next question is how far these random numbers will venture from 0. In theory, you could end up with both positive and negative numbers close to the maximum positive and negative ranges supported by your computer. In practice, nearly all of the random numbers you see will fall within three standard deviations of the center.

The standard deviation ($\sigma$, sigma) parameter specifies the size of this standard deviation. For example, if you specified a standard deviation of 10, you would mainly see random numbers between -30 and +30, and the numbers nearer to 0 have a much higher probability of being selected.  

For a standard normal distribution (mean 0, standard deviation 1), the probability density peaks at about 0.4 at the center, which in this case is 0. Additionally, the density decreases very quickly beyond -2 or +2 standard deviations. By defining the center and how large the standard deviations are, you can control the range of random numbers that you will receive.
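
These figures can be checked with Python's standard-library `statistics.NormalDist`. This is only a sketch, and the variable names are illustrative. Note that the value of about 0.4 at the center is a probability density, and it applies when the standard deviation is 1.

```python
from statistics import NormalDist

standard = NormalDist(mu=0.0, sigma=1.0)
wide = NormalDist(mu=0.0, sigma=10.0)

# Height of the bell curve at its center.
peak_standard = standard.pdf(0.0)   # about 0.4 when sigma is 1
peak_wide = wide.pdf(0.0)           # ten times lower when sigma is 10

# Share of values expected within three standard deviations of the mean.
within_3_sigma = wide.cdf(30.0) - wide.cdf(-30.0)

print(f"peak density (sigma=1):  {peak_standard:.4f}")
print(f"peak density (sigma=10): {peak_wide:.4f}")
print(f"share within +/-30 (sigma=10): {within_3_sigma:.4f}")
```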

The Xavier weight initialization sets all weights to normally distributed random numbers. These weights are always centered at 0; however, their standard deviation varies depending on how many connections are present for the current layer of weights. Specifically, the following equation determines the variance, where $n_{in}$ and $n_{out}$ are the numbers of input and output connections of the layer:

$$ Var(W) = \frac{2}{n_{in}+n_{out}} $$

The above equation shows how to obtain the variance for all weights. The square root of the variance is the standard deviation. Most random number generators accept a standard deviation rather than a variance. As a result, you usually need to take the square root of the above equation. Figure 3.XAVIER shows how this algorithm might initialize one layer. 

**Figure 3.XAVIER: Xavier Weight Initialization**
![Xavier Weight Initialization](https://github.com/jeffheaton/t81_558_deep_learning/blob/pytorch/images/xavier_weight.png?raw=1)

We complete this process for each layer in the neural network.  
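
As a rough sketch of the equation above, the following computes the Xavier standard deviation and fills each layer's weight matrix using only the standard library. The function names `xavier_std` and `xavier_layer`, the seed, and the layer sizes are assumptions made for illustration. In PyTorch, `torch.nn.init.xavier_normal_` provides this initialization for a real layer.

```python
import math
import random

def xavier_std(n_in, n_out):
    """Standard deviation from Var(W) = 2 / (n_in + n_out)."""
    return math.sqrt(2.0 / (n_in + n_out))

def xavier_layer(n_in, n_out, rng):
    """Build an n_out x n_in weight matrix of normal samples centered at 0."""
    std = xavier_std(n_in, n_out)
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

rng = random.Random(42)  # arbitrary seed, chosen only for repeatability
layer_sizes = [2, 2, 1]  # 2 inputs, 2 hidden neurons, 1 output

# Repeat the process for each pair of adjacent layers.
for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
    weights = xavier_layer(n_in, n_out, rng)
    print(f"{n_in} -> {n_out}: std = {xavier_std(n_in, n_out):.4f}, weights = {weights}")
```

Wider layers have more connections, so the denominator grows and the initial weights are drawn from a narrower distribution.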

## Manual Neural Network Calculation

This section will build a neural network and analyze it down to the individual weights. We will train a simple neural network that learns the XOR function. It is not hard to hand-code the neurons to provide an [XOR function](https://en.wikipedia.org/wiki/Exclusive_or); however, for simplicity, we will let PyTorch train this network for us. The neural network is small, with two inputs, two hidden neurons, and a single output. We will train with the Adam optimizer until the loss falls below 0.01, resetting the weights every 1,000 epochs if training has not yet converged. This approach is crude, but it gets the result, and our focus here is not on tuning.

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np

x = torch.Tensor(
    [[0,0],
     [0,1], 
     [1,0], 
     [1,1]])
y = torch.Tensor([0,1,1,0]).view(-1,1)

class Net(nn.Module):
    def __init__(self, input_dim = 2, output_dim=1):
        super(Net, self).__init__()
        self.lin1 = nn.Linear(input_dim, 2)
        self.lin2 = nn.Linear(2, output_dim)
    
    def forward(self, x):
        x = self.lin1(x)
        x = torch.relu(x)
        x = self.lin2(x)
        return x

    def reset(self):
        for layer in self.children():
            if hasattr(layer, 'reset_parameters'):
                layer.reset_parameters()

# Seed the neural network so we have consistent (yet random) starting
# weights.
torch.manual_seed(60)
model = Net()

loss_func = nn.MSELoss()

#optimizer = optim.SGD(model.parameters(), lr=0.02, momentum=0.9)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

i = 0
loss = 1
# Train until the MSE loss drops below 0.01.
while loss>1e-2:
  i += 1
  optimizer.zero_grad()
  pred = model(x)
  loss = loss_func(pred, y)
  loss.backward()
  optimizer.step()

  if i % 100 == 0:
    print(f"Epoch: {i}, {loss}")
  # If training has stalled, restart from fresh random weights.
  if i % 1000 == 0:
    model.reset()

print(f"Final loss: {float(loss)}")

Epoch: 100, 0.2602725625038147
Epoch: 200, 0.11703130602836609
Epoch: 300, 0.01011891569942236
Final loss: 0.009912579320371151


The output above shows the loss falling below the 0.01 threshold shortly after epoch 300. With a loss this low, the network's predictions should be near 0.0 for the first and fourth inputs ([0,0] and [1,1]) and near 1.0 for the middle two inputs ([0,1] and [1,0]). Due to random starting weights, it is sometimes necessary to restart training several times to get a good result. So that we get consistent yet random values in the book and videos, we use a random seed value of 60. We chose the value of 60 because it provides good results and ensures the underlying values do not change from the book or videos if this code is rerun.

Now that we've trained the neural network, we can dump the weights.  

In [3]:
for layerNum, layer in enumerate(model.children()):
  for toNeuronNum, bias in enumerate(layer.bias):
    print(f'b{layerNum} -> l{layerNum+1}n{toNeuronNum} = {bias}')

  # layer.weight has shape (out_features, in_features), so each row
  # holds the incoming weights of one destination neuron.
  for toNeuronNum, wgt in enumerate(layer.weight):
    for fromNeuronNum, wgt2 in enumerate(wgt):
      print(f'l{layerNum}n{fromNeuronNum} '
            f'-> l{layerNum+1}n{toNeuronNum} = {wgt2}')

b0 -> l1n0 = 1.036415696144104
b0 -> l1n1 = 1.3471280336380005
l0n0 -> l1n0 = -1.0405253171920776
l0n1 -> l1n0 = -1.0412225723266602
l0n0 -> l1n1 = -0.6748256683349609
l0n1 -> l1n1 = -0.6751413345336914
b1 -> l2n0 = 0.1878463476896286
l1n0 -> l2n0 = -1.6615756750106812
l1n1 -> l2n0 = 1.1424684524536133


If you change the seed, you will probably get different weights. There are many ways to solve the XOR function.

In the next section, we copy/paste the weights from above and recreate the calculations done by the neural network. I converted these weights to Python code.



In [4]:
b0_l1n0 = 1.036415696144104
b0_l1n1 = 1.3471280336380005
l0n0_l1n0 = -1.0405253171920776
l0n1_l1n0 = -1.0412225723266602
l0n0_l1n1 = -0.6748256683349609
l0n1_l1n1 = -0.6751413345336914
b1_l2n0 = 0.1878463476896286
l1n0_l2n0 = -1.6615756750106812
l1n1_l2n0 = 1.1424684524536133

We can now calculate the output of the neural network for any two inputs.

In [5]:
# Choose your two inputs
input0 = 0
input1 = 1

# Calculate the hidden layer: each hidden neuron sums its weighted
# inputs and adds its bias
hidden0Sum = (input0*l0n0_l1n0)+(input1*l0n1_l1n0)+b0_l1n0
hidden1Sum = (input0*l0n0_l1n1)+(input1*l0n1_l1n1)+b0_l1n1

# ReLU activation function
hidden0 = max(0,hidden0Sum)
hidden1 = max(0,hidden1Sum)

# Calculate the output layer
output = (hidden0*l1n0_l2n0)+(hidden1*l1n1_l2n0)+b1_l2n0

print(f"Final output: {output}")

Final output: 0.9555699518847405


As you can see, we get an output near the value of 1.0 for our inputs of 0 and 1. This value is consistent with the XOR operator and is the same output that PyTorch would provide for these weights.
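
The same calculation can be repeated for all four XOR inputs. The sketch below stores the trained values listed above in nested lists, one row per destination neuron, which is the `(out_features, in_features)` layout that `nn.Linear` uses for its `weight`. The names `W1`, `B1`, `W2`, `B2`, and `forward` are illustrative and not taken from the notebook.

```python
# Trained values copied from the weight dump above.
W1 = [[-1.0405253171920776, -1.0412225723266602],   # incoming weights of hidden neuron 0
      [-0.6748256683349609, -0.6751413345336914]]   # incoming weights of hidden neuron 1
B1 = [1.036415696144104, 1.3471280336380005]
W2 = [-1.6615756750106812, 1.1424684524536133]      # incoming weights of the output neuron
B2 = 0.1878463476896286

def forward(inputs):
    """Manually evaluate the 2-2-1 network with a ReLU hidden layer."""
    hidden = [
        max(0.0, sum(w * v for w, v in zip(row, inputs)) + b)
        for row, b in zip(W1, B1)
    ]
    return sum(w * h for w, h in zip(W2, hidden)) + B2

for inputs in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(f"{inputs} -> {forward(inputs):.4f}")
```

Rounding each result to the nearest integer reproduces the XOR truth table. For the input [1,1], both hidden neurons are switched off by the ReLU, so the output is simply the output bias.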