<a href="https://colab.research.google.com/github/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_03_5_weights.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# T81-558: Applications of Deep Neural Networks
**Module 3: Introduction to TensorFlow**
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

# Module 3 Material

* Part 3.1: Deep Learning and Neural Network Introduction [[Video]](https://www.youtube.com/watch?v=zYnI4iWRmpc&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_03_1_neural_net.ipynb)
* Part 3.2: Introduction to Tensorflow and Keras [[Video]](https://www.youtube.com/watch?v=PsE73jk55cE&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_03_2_keras.ipynb)
* Part 3.3: Saving and Loading a Keras Neural Network [[Video]](https://www.youtube.com/watch?v=-9QfbGM1qGw&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_03_3_save_load.ipynb)
* Part 3.4: Early Stopping in Keras to Prevent Overfitting [[Video]](https://www.youtube.com/watch?v=m1LNunuI2fk&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_03_4_early_stop.ipynb)
* **Part 3.5: Extracting Weights and Manual Calculation** [[Video]](https://www.youtube.com/watch?v=7PWgx16kH8s&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_03_5_weights.ipynb)

# Google CoLab Instructions

The following code ensures that Google CoLab is running the correct version of TensorFlow.

In [1]:
try:
    %tensorflow_version 2.x
    COLAB = True
    print("Note: using Google CoLab")
except:
    print("Note: not using Google CoLab")
    COLAB = False

Note: not using Google CoLab


# Part 3.5: Extracting Weights and Manual Network Calculation

### Weight Initialization

The weights of a neural network determine the output for the neural network.  The process of training can adjust these weights so the neural network produces useful output.  Most neural network training algorithms begin by initializing the weights to a random state.  Training then progresses through a series of iterations that continuously improve the weights to produce better output.

The random weights of a neural network impact how well that neural network can be trained.  If a neural network fails to train, you can remedy the problem by simply restarting with a new set of random weights. However, this solution can be frustrating when you are experimenting with the architecture of a neural network and trying different combinations of hidden layers and neurons.  If you add a new layer, and the network’s performance improves, you must ask yourself if this improvement resulted from the new layer or from a new set of weights.  Because of this uncertainty, we look for two key attributes in a weight initialization algorithm:

* How consistently does this algorithm provide good weights?
* How much of an advantage do the weights of the algorithm provide?

One of the most common, yet least effective, approaches to weight initialization is to set the weights to random values within a specific range.  Numbers between -1 and +1 or -5 and +5 are often the choice.  If you want to ensure that you get the same set of random weights each time, you should use a seed.  The seed specifies a set of predefined random weights to use.  For example, a seed of 1000 might produce random weights of 0.5, 0.75, and 0.2. These values are still random; you cannot predict them, yet you will always get these values when you choose a seed of 1000. 
Not all seeds are created equal.  One problem with random weight initialization is that the random weights created by some seeds are much more difficult to train than others.  In fact, the weights can be so bad that training is impossible.  If you find that you cannot train a neural network with a particular weight set, you should generate a new set of weights using a different seed.
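
The effect of a seed can be sketched with Python's standard `random` module. This is only an illustration: the helper name `random_weights`, the count of three weights, and the range of -1 to +1 are assumptions chosen to mirror the paragraph above. The actual values depend on the generator, so they will not match the 0.5, 0.75, and 0.2 used as an example in the text.

```python
import random

def random_weights(seed, count=3, low=-1.0, high=1.0):
    """Return `count` uniform random weights in [low, high] for a given seed.

    Hypothetical helper for illustration; not part of Keras.
    """
    rng = random.Random(seed)  # private generator, so global state is untouched
    return [rng.uniform(low, high) for _ in range(count)]

first = random_weights(1000)
second = random_weights(1000)
other = random_weights(1001)

print(first)
print(first == second)  # same seed, same weights
print(first == other)   # different seed, different weights
```

Using a private `random.Random` instance keeps the example from disturbing any other code that relies on the global generator.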

Because weight initialization is a problem, there has been considerable research around it.  In this course, we use the Xavier weight initialization algorithm, introduced in 2010 by Glorot & Bengio[[Cite:glorot2010understanding]](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf), which produces good weights with reasonable consistency.  This relatively simple algorithm uses normally distributed random numbers.  

To use the Xavier weight initialization, it is necessary to understand that normally distributed random numbers are not the typical random numbers between 0 and 1 that most programming languages generate.  Instead, normally distributed random numbers are centered on a mean ($\mu$, mu) that is typically 0.  If 0 is the center (mean), then you will get roughly equal numbers of random values above and below 0.  The next question is how far these random numbers will venture from 0.  In theory, you could end up with both positive and negative numbers close to the maximum positive and negative ranges supported by your computer.  In practice, nearly all of the random numbers will fall within three standard deviations of the center.

The standard deviation ($\sigma$, sigma) parameter controls how far the numbers spread from the mean.  For example, if you specified a standard deviation of 10, then you would mainly see random numbers between -30 and +30, and the numbers nearer to 0 have a much higher probability of being selected.  

For a standard normal distribution (mean 0, standard deviation 1), the probability density peaks at the center, 0, at roughly 0.4.  This value is a density, not a 40% chance of drawing exactly 0.  The density decreases very quickly beyond -2 or +2 standard deviations.  By defining the center and how large the standard deviations are, you are able to control the range of random numbers that you will receive.
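
These properties can be checked empirically. The following sketch uses only Python's standard library; the helper name `normal_pdf`, the seed of 42, and the sample count are arbitrary choices for illustration, and the standard deviation of 10 mirrors the example above.

```python
import math
import random

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Probability density of a normal distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Density at the center of a standard normal distribution (about 0.4)
peak = normal_pdf(0.0)

# Draw samples with mean 0 and standard deviation 10
rng = random.Random(42)  # arbitrary seed, used only for repeatability
sigma = 10.0
samples = [rng.gauss(0.0, sigma) for _ in range(100_000)]

# Fraction of samples within one standard deviation (-10 to +10)
within_1 = sum(abs(s) <= sigma for s in samples) / len(samples)
# Fraction of samples within three standard deviations (-30 to +30)
within_3 = sum(abs(s) <= 3 * sigma for s in samples) / len(samples)

print(f"peak density: {peak:.4f}")
print(f"within 1 sigma: {within_1:.3f}")
print(f"within 3 sigma: {within_3:.3f}")
```

Almost every sample should land between -30 and +30, with the majority much closer to 0.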

The Xavier weight initialization sets all of the weights to normally distributed random numbers.  These weights are always centered at 0; however, their standard deviation varies depending on how many connections are present for the current layer of weights.  Specifically, the following equation determines the variance of the weights, where $n_{in}$ and $n_{out}$ are the number of inputs to and outputs from the layer:

$ Var(W) = \frac{2}{n_{in}+n_{out}} $

The above equation shows how to obtain the variance for all of the weights.  The square root of the variance is the standard deviation.  Most random number generators accept a standard deviation rather than a variance.  As a result, you usually need to take the square root of the above equation.  Figure 3.XAVIER shows how one layer might be initialized. 

**Figure 3.XAVIER: Xavier Weight Initialization**
![Xavier Weight Initialization](images/xavier_weight.png)

This process is completed for each layer in the neural network.  
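
As a minimal sketch of the procedure, the code below applies the variance formula above to each layer of a small network. It uses only Python's standard library rather than Keras, and the function names `xavier_std` and `xavier_layer`, the seed, and the layer sizes are assumptions made for illustration.

```python
import math
import random

def xavier_std(n_in, n_out):
    """Standard deviation from Var(W) = 2 / (n_in + n_out)."""
    return math.sqrt(2.0 / (n_in + n_out))

def xavier_layer(n_in, n_out, rng):
    """Return an n_in x n_out weight matrix drawn from a normal
    distribution with mean 0 and the Xavier standard deviation."""
    std = xavier_std(n_in, n_out)
    return [[rng.gauss(0.0, std) for _ in range(n_out)] for _ in range(n_in)]

rng = random.Random(0)  # arbitrary seed

# Layer sizes for a network with 2 inputs, 2 hidden neurons, and 1 output
layer_sizes = [2, 2, 1]
layers = [xavier_layer(n_in, n_out, rng)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

for i, weights in enumerate(layers):
    n_in, n_out = layer_sizes[i], layer_sizes[i + 1]
    print(f"layer {i}: std={xavier_std(n_in, n_out):.4f}, weights={weights}")
```

Note that the square root is taken once per layer, because the random number generator expects a standard deviation rather than a variance.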

### Manual Neural Network Calculation

In this section, we will build a neural network and analyze it down to the individual weights.  We will train a simple neural network that learns the XOR function.  It is not hard to hand-code the neurons to provide an [XOR function](https://en.wikipedia.org/wiki/Exclusive_or); however, for simplicity, we will allow Keras to train this network for us.  The code below trains for 1,000 epochs with the Adam optimizer and, if the predictions miss their targets, starts over with a new set of random weights.  This brute-force approach gets the result, and our focus here is not on tuning.  The neural network is small: two inputs, two hidden neurons, and a single output.

In [18]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
import numpy as np

# Inputs: all four combinations of two binary values
x = np.array([
    [0,0],
    [1,0],
    [0,1],
    [1,1]
])

# XOR targets (the function described in the text):
# y = np.array([0, 1, 1, 0])

# AND targets (used for the run shown below)
y = np.array([
    0,
    0,
    0,
    1
])

done = False
cycle = 1

# Retrain from fresh random weights until the predictions meet the targets
while not done:
    print("Cycle #{}".format(cycle))
    cycle+=1
    # Build the network: 2 inputs -> 2 hidden ReLU neurons -> 1 linear output
    model = Sequential()
    model.add(Dense(2, input_dim=2, activation='relu'))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam')
    model.fit(x,y,verbose=0,epochs=1000)

    # Predict
    pred = model.predict(x)

    # Check if successful.  It takes several runs with this
    # small of a network.
    # XOR check:
    # done = pred[0]<0.01 and pred[3]<0.01 and pred[1]>0.9 and pred[2]>0.9

    # AND check
    done = pred[0]<0.01 and pred[1]<0.01 and pred[2]<0.01 and pred[3]>0.9

    print(pred)

Cycle #1
[[ 0.04159322]
 [ 0.46762845]
 [-0.02182114]
 [ 0.46762845]]
Cycle #2
[[4.4047832e-05]
 [1.1861324e-05]
 [4.9145758e-01]
 [5.0707573e-01]]
Cycle #3
[[0.05464108]
 [0.4342929 ]
 [0.13322935]
 [0.5341744 ]]
Cycle #4
[[ 3.1945109e-04]
 [-5.5730343e-06]
 [ 4.8719850e-01]
 [ 4.8719850e-01]]
Cycle #5
[[-0.04376695]
 [ 0.21955067]
 [ 0.13795224]
 [ 0.74178535]]
Cycle #6
[[-0.00074727]
 [ 0.51930183]
 [-0.00074727]
 [ 0.48225167]]
Cycle #7
[[-0.01157847]
 [ 0.00891877]
 [ 0.07052737]
 [ 0.95593256]]
Cycle #8
[[-0.01391426]
 [ 0.25073278]
 [ 0.0401625 ]
 [ 0.77133065]]
Cycle #9
[[0.31546926]
 [0.31546926]
 [0.05141005]
 [0.31546926]]
Cycle #10
[[0.27912396]
 [0.27912396]
 [0.18518391]
 [0.27912396]]
Cycle #11


[[-0.07997365]
 [ 0.28215003]
 [ 0.04408159]
 [ 0.75185347]]
Cycle #12
[[0.32330984]
 [0.32330984]
 [0.00404537]
 [0.32330984]]
Cycle #13
[[0.30516782]
 [0.30516782]
 [0.23576722]
 [0.22861147]]
Cycle #14
[[0.32338554]
 [0.01282173]
 [0.32338554]
 [0.32338554]]
Cycle #15
[[ 0.13187174]
 [-0.04817165]
 [ 0.3846255 ]
 [ 0.59107524]]
Cycle #16
[[ 0.3378831 ]
 [-0.01186562]
 [ 0.3378831 ]
 [ 0.3378831 ]]
Cycle #17
[[0.24999624]
 [0.24999624]
 [0.24999624]
 [0.24999624]]
Cycle #18
[[-0.05550189]
 [ 0.00483248]
 [ 0.37518388]
 [ 0.6945916 ]]
Cycle #19
[[0.24999842]
 [0.24999842]
 [0.24999842]
 [0.24999842]]
Cycle #20
[[-0.02857309]
 [ 0.02693377]
 [ 0.02856149]
 [ 0.97106445]]
Cycle #21
[[0.24999624]
 [0.24999624]
 [0.24999624]
 [0.24999624]]
Cycle #22


[[-0.02682036]
 [ 0.4287374 ]
 [-0.00311719]
 [ 0.61190635]]
Cycle #23
[[-0.02005848]
 [ 0.2929457 ]
 [ 0.04565657]
 [ 0.754847  ]]
Cycle #24
[[ 0.12646687]
 [-0.047885  ]
 [ 0.01280123]
 [ 0.5705865 ]]
Cycle #25
[[ 1.4396310e-03]
 [ 4.8564982e-01]
 [-7.7724457e-05]
 [ 4.8564982e-01]]
Cycle #26
[[-0.05470636]
 [ 0.02306806]
 [ 0.02806042]
 [ 0.9787191 ]]
Cycle #27
[[-0.19201668]
 [ 0.15022136]
 [ 0.16391195]
 [ 0.8465635 ]]
Cycle #28
[[-0.07819801]
 [ 0.03910092]
 [ 0.12668857]
 [ 0.91210103]]
Cycle #29
[[-0.00236635]
 [ 0.00304632]
 [ 0.00200506]
 [ 0.99630266]]


In [17]:
pred[3]



array([0.9929306], dtype=float32)

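The listing above trains on the AND targets, so a successful run has values near 0.0 in the first three spots (inputs [0,0], [1,0], and [0,1]) and a value near 1.0 in the fourth spot (input [1,1]).  With the XOR targets, which are commented out in the listing, the first and fourth values (inputs [0,0] and [1,1]) would be near 0.0 and the middle two (inputs [1,0] and [0,1]) near 1.0.  Some of these numbers are printed in scientific notation.  Due to random starting weights, it is sometimes necessary to run the above through several cycles to get a good result.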
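
The stopping test in the training loop can also be written as a small reusable check. The sketch below is plain Python with no Keras dependency; the helper name `meets_targets` is hypothetical, the thresholds of 0.01 and 0.9 are the ones used in the loop, and the predictions are copied (rounded) from cycles 7 and 29 of the output above.

```python
def meets_targets(pred, targets, low=0.01, high=0.9):
    """True when every prediction is below `low` for a 0 target
    and above `high` for a 1 target."""
    for p, t in zip(pred, targets):
        if t == 1 and not p > high:
            return False
        if t == 0 and not p < low:
            return False
    return True

and_targets = [0, 0, 0, 1]

# Predictions copied from the output above (rounded)
cycle_7 = [-0.0116, 0.0089, 0.0705, 0.9559]
cycle_29 = [-0.0024, 0.0030, 0.0020, 0.9963]

print(meets_targets(cycle_7, and_targets))   # third value is not below 0.01
print(meets_targets(cycle_29, and_targets))  # all four values meet the thresholds
```

Like the test in the loop, this check places no lower bound on predictions for a 0 target, so a strongly negative prediction would still pass.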

Now that the neural network is trained, let's dump the weights.  

In [11]:
# Dump weights
for layerNum, layer in enumerate(model.layers):
    # Dense layers return [kernel, bias]; kernel rows are source neurons
    weights, biases = layer.get_weights()
    
    for toNeuronNum, bias in enumerate(biases):
        print(f'{layerNum}B -> L{layerNum+1}N{toNeuronNum}: {bias}')
    
    for fromNeuronNum, wgt in enumerate(weights):
        for toNeuronNum, wgt2 in enumerate(wgt):
            print(f'L{layerNum}N{fromNeuronNum} '
                  f'-> L{layerNum+1}N{toNeuronNum} = {wgt2}')

0B -> L1N0: -0.2524045705795288
0B -> L1N1: -0.6790608167648315
L0N0 -> L1N0 = -1.0603439807891846
L0N0 -> L1N1 = 0.6886779069900513
L0N1 -> L1N0 = 0.24870148301124573
L0N1 -> L1N1 = 0.6872497797012329
1B -> L2N0: -0.006746573373675346
L1N0 -> L2N0 = 0.054775845259428024
L1N1 -> L2N0 = 1.4345310926437378


If you rerun this, you will probably get different weights.  Many different sets of weights can solve the same function.

In the next section, we copy/paste the weights from above and recreate the calculations done by the neural network.  Because weights can change with each training run, the code below uses the weights from the following run, which was trained on XOR:

```
0B -> L1N0: -1.2913415431976318
0B -> L1N1: -3.021530048386012e-08
L0N0 -> L1N0 = 1.2913416624069214
L0N0 -> L1N1 = 1.1912699937820435
L0N1 -> L1N0 = 1.2913411855697632
L0N1 -> L1N1 = 1.1912697553634644
1B -> L2N0: 7.626241297587034e-36
L1N0 -> L2N0 = -1.548777461051941
L1N1 -> L2N0 = 0.8394404649734497
```

In [12]:
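input0 = 0
input1 = 1

# Hidden layer weighted sums (weights rounded to one decimal place)
hidden0Sum = (input0*1.3)+(input1*1.3)+(-1.3)
hidden1Sum = (input0*1.2)+(input1*1.2)+(0)

print(hidden0Sum) # 0.0
print(hidden1Sum) # 1.2

# ReLU activation on the hidden layer
hidden0 = max(0,hidden0Sum)
hidden1 = max(0,hidden1Sum)

print(hidden0) # 0
print(hidden1) # 1.2

# Output layer weighted sum
outputSum = (hidden0*-1.6)+(hidden1*0.8)+(0)
print(outputSum) # 0.96

# Dense(1) has no activation, so the output is the weighted sum itself
output = outputSum

print(output) # 0.96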

0.0
1.2
0
1.2
0.96
0.96
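
The same calculation can be repeated for all four input combinations. The sketch below uses the unrounded weights listed earlier in this section instead of the rounded values in the cell above, so the results should land much closer to 0 and 1. The names `relu` and `forward` and the list layout are illustrative choices, not part of Keras.

```python
def relu(v):
    """Rectified linear unit, the hidden layer's activation."""
    return max(0.0, v)

# Weights copied from the listing earlier in this section;
# hidden_kernel[i][j] connects input i to hidden neuron j.
hidden_kernel = [
    [1.2913416624069214, 1.1912699937820435],  # from input 0
    [1.2913411855697632, 1.1912697553634644],  # from input 1
]
hidden_bias = [-1.2913415431976318, -3.021530048386012e-08]
output_kernel = [-1.548777461051941, 0.8394404649734497]
output_bias = 7.626241297587034e-36

def forward(input0, input1):
    """Manual forward pass: ReLU hidden layer, linear output."""
    inputs = [input0, input1]
    hidden = [
        relu(sum(inputs[i] * hidden_kernel[i][j] for i in range(2))
             + hidden_bias[j])
        for j in range(2)
    ]
    return sum(hidden[j] * output_kernel[j] for j in range(2)) + output_bias

for a, b in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(f"{a} XOR {b} -> {forward(a, b):.4f}")
```

For the input [1,1], both hidden neurons fire, and the negative weight on the first hidden neuron cancels the positive contribution of the second, which is how this particular set of weights produces XOR.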
