<a href="https://colab.research.google.com/github/davidAcode/davidAcode.github.io/blob/master/Dave_teaches_how_to_create_NN_on_Github_02142019.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The best teacher is the person who just learned yesterday the stuff you are studying today, because she remembers what she struggled with and how she overcame it, and she can pass those shortcuts on to you.

Today will be a giant step forward, both in your learning and your confidence in your mastery.  You are about to create your first neural network (NN) that will be able to guess, to learn from its mistakes, and to do better next time.  Your NN will solve a simple binary classification problem, i.e., "given these 1's and 0's in the matrix X as inputs, what is your prediction of the 1's and 0's in matrix y?"

Thanks to super-teachers Andrew Trask and Siraj Raval, today's material--a working neural network in about 15 lines of Python--is the foundation of what you need to progress towards an understanding of the cutting edge techniques being used in deep learning today.  My mentor Adam Koenig, a Stanford PhD in aerospace engineering, helped me learn this stuff, and now I'm going to stand on these teachers' shoulders and help you learn it too.

First read Andrew Trask's [blog post](http://iamtrask.github.io/2015/07/12/basic-python-network/) several times and absorb his teaching as best you can.  Then, return to this post.

I suggest you open this blog post in two side-by-side windows and show the code in the left window while you scroll through my explanation of it in your right window.  First, I'll show you the entire code we'll be studying today, and below that is my detailed step-by-step explanation of what it does.  Get ready for the wonder of watching a computer learn from its mistakes and recognize patterns!  We're about to give birth to our own little baby brain... :-)

Here's the code first.  Display this in your left window:


In [0]:
#This is the "3 Layer Network" near the bottom of: 
#http://iamtrask.github.io/2015/07/12/basic-python-network/

#First housekeeping: import numpy, a powerful library of math tools.
import numpy as np
#1 Sigmoid Function: changes numbers to probabilities and finds slope to use in gradient descent
def nonlin(x,deriv=False):
  if(deriv==True):
    return x*(1-x)
  
  return 1/(1+np.exp(-x))
#2 X Matrix: This is a set of inputs in the training set that we will use to 
#train our network.
X = np.array([[0,0,1],
              [0,1,1],
              [1,0,1],
              [1,1,1]])
#3 y Vector: The output layer: Our training set of 4 target values. Once our NN
#can correctly predict these 4 target values from the inputs provided by X, 
#it is now ready to predict in real life.
y = np.array([[0],
             [1],
             [1],
             [0]])
#4 SEED: This is housekeeping. Ya gotta seed the random numbers we will generate
#in the training process, to make debugging easier.
np.random.seed(1)

#5 SYNAPSES: aka "Weights." These 2 matrices are the "brain."  It remembers, learns, improves.
syn0 = 2*np.random.random((3,4)) - 1 # First layer of weights, Synapse 0, connecting l0 to l1.
syn1 = 2*np.random.random((4,1)) - 1 # Second layer of weights, Synapse 1 connecting l1 to l2.

#6 FOR LOOP: this iterator takes our NN through 60,000 guesses, tweaks, and improvements.
for j in range(60000):
  
  #7 FEED FORWARD NETWORK: Think of l0, l1 and l2 as 3 matrices as the "neurons" 
  #that combine with the "synapses" matrices in #5 to think, predict, improve, remember.
  l0=X
  l1=nonlin(np.dot(l0,syn0))
  l2=nonlin(np.dot(l1,syn1))
  
  #8 TARGET, and how much we missed it by. y is a 4x1 vector containing our 4 target 
  #values. When we subtract the l2 vector (our first 4 guesses) from y, our target,
  #we get l2_error: how much our neural network missed the target by on this iteration.
  l2_error = y - l2
  
  #9 PRINT ERROR: in 60,000 iterations, j divided by 10,000 leaves a remainder of 0
  #only 6 times. We're going to check our data every 10,000 iterations to see if
  #the l2_error is reducing, and we're missing our target by less each time.
  if (j% 10000)==0:
    print("Our average l2 error after 10,000 more iterations: " + str(np.mean(np.abs(l2_error))))

  #10 In what DIRECTION is y, our desired target value, from our NN's latest guess? We
  #take the slope of our latest guess, multiply it by how much that latest guess
  #missed our target of y, and the resulting l2_delta tells us by what value to update
  #each weight in our syn1 synapses so that our next prediction will be even better.
  
  #AK: The term "direction" seems misleading to me.  The delta modifies the error with weights 
  #from the derivatives to induce large changes in low confidence values and small
  #changes in high confidence values
  l2_delta = l2_error*nonlin(l2,deriv=True)
  
  #11 BACK PROPAGATION: After we "fed forward" in Step 7, now we work backwards to 
  #find the l1 error.  l1 error is the difference between the ideal l1 that would 
  #provide the ideal l2 we want and the most recent computed l1.  To find l1_error, 
  #we have to multiply our confidence-weighted miss from Step 10 (l2_delta) 
  #by our last guess at the optimal weights (syn1). We'll then use l1_error to update syn0 below.
  l1_error = l2_delta.dot(syn1.T)

  #12 In what DIRECTION is l1, the desired target value of our hard-working middle layer 1,
  #from l1's latest guess?  Similar to #10 above, we want to tweak this middle layer
  #so it sends a better prediction to l2, making it easier for l2 to better predict target y.
  
  #AK: I don't think l2 should be mentioned here.  It is similar to what was 
  #done before.  Add weights to produce large changes in low confidence values 
  #and small changes in high confidence values
  l1_delta = l1_error * nonlin(l1,deriv=True)
  
  #13 UPDATE SYNAPSES: aka Gradient Descent. This step is where the synapses, the true
  #"brain" of our network, learn from their mistakes, remember, and improve--learning!
  syn1 += l1.T.dot(l2_delta)
  syn0 += l0.T.dot(l1_delta)

#Print results!
print("Our l2 predictions after all 60,000 iterations of training: ")
print(l2)

l0
[[0 0 1]
 [0 1 1]
 [1 0 1]
 [1 1 1]]
syn0
[[-0.16595599  0.44064899 -0.99977125 -0.39533485]
 [-0.70648822 -0.81532281 -0.62747958 -0.30887855]
 [-0.20646505  0.07763347 -0.16161097  0.370439  ]]
l1
[[0.44856632 0.51939863 0.45968497 0.59156505]
 [0.28639589 0.32350963 0.31236398 0.51538526]
 [0.40795614 0.62674606 0.23841622 0.49377636]
 [0.25371248 0.42628115 0.14321233 0.41732254]]
syn1
[[-0.5910955 ]
 [ 0.75623487]
 [-0.94522481]
 [ 0.34093502]]
l2
[[0.47372957]
 [0.48895696]
 [0.54384086]
 [0.54470837]]
l2_delta
[[-7.51235207e-06]
 [ 7.55068179e-06]
 [ 6.74174863e-06]
 [-8.00425612e-06]]
l1_delta
[[ 9.46832171e-10 -3.08497572e-05  5.08949926e-05 -3.15810072e-05]
 [-5.49981023e-05  9.22271485e-11 -1.98772953e-05  4.06826431e-05]
 [-5.58922384e-05  4.01319985e-05 -2.09610752e-05  8.85971300e-11]
 [ 7.74711336e-05 -1.94497871e-05  8.78140259e-11 -1.97665444e-05]]
Our average l2 error after 10,000 more iterations: 0.4964100319027255

Now, let's go through each step of the code in detail, beginning with
lines 6-11:

"nonlin" above is a type of standard logistic function known as a Sigmoid function.  Logistic functions are very commonly used in science, statistics, and probability.  This Sigmoid function is written in a more complicated way than necessary because it serves two functions:

1) to take each of the matrices within its parentheses and convert each value to a number between 0 and 1 (aka a statistical probability).  This is done by line 11: `return 1/(1+np.exp(-x))` 
We will see below that this is very important, because this conversion to a 0-1 number gives us **FOUR** very **big advantages**.  I will discuss these four in detail below, but for now, just know that the sigmoid function converts every number in every matrix within its parentheses into a number between 0 and 1 that falls somewhere on the S-curve illustrated here:

![alt text](https://iamtrask.github.io/img/sigmoid.png)
(taken with gratitude from: 
[Andrew Trask](https://iamtrask.github.io/2015/07/12/basic-python-network/))

So, Part 1 of the Sigmoid function has converted a number into a number between 0 and 1, i.e., a statistical probability, which is also known as a confidence measure.  In other words, the number answers the question, "how confident are we that this number correctly predicts an outcome?"  

Now for Part 2.  The second part of this sigmoid function is in lines 8 and 9: `if(deriv==True):` followed by `return x*(1-x)`.
When called to do so by `deriv=True` in the code below, line 9 takes the confidence measure from Part 1 and converts it into the slope of the Sigmoid S curve at that point, which will be used to tweak the synapse matrices of our NN and nudge them towards greater accuracy in prediction.  Note that line 9 expects x to already be an output of the sigmoid (a number between 0 and 1), which is why the code below always passes it l1 or l2.
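To see both parts of `nonlin()` at work, here is a small sketch.  The sample numbers in `z` and the numerical cross-check are my own additions for illustration; they are not part of Trask's code.  The point is simply that `x*(1-x)`, fed a sigmoid output, gives the same slope you would measure directly on the S curve.

```python
import numpy as np

def nonlin(x, deriv=False):
    # Same function as in the post: with deriv=True, x must already be a sigmoid output.
    if deriv:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

z = np.array([-2.0, 0.0, 2.0])   # raw numbers (hypothetical sample values)
p = nonlin(z)                     # Part 1: squashed into the range 0 to 1
slope = nonlin(p, deriv=True)     # Part 2: slope of the S curve at each point

# Cross-check the shortcut against a numerical slope estimate (illustration only).
h = 1e-6
numeric_slope = (nonlin(z + h) - nonlin(z - h)) / (2 * h)

print(p)
print(slope)
print(np.allclose(slope, numeric_slope))
```

Notice that the middle value, 0, maps to a probability of 0.5, and that is where the slope is largest (0.25).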

So the sigmoid function plays a super-important role in making our NN learn, but don't worry if you don't understand it all yet.  I'll explain it in detail below.  Let's move on to Step 2:


In [0]:
#2 X Matrix: The input. Our training set of 4 examples that we'll use to
#make better-and-better predictions that will eventually match our target, the correct answers.
X = np.array([[0,0,1],
              [0,1,1],
              [1,0,1],
              [1,1,1]])

#2) Creating X input: Lines 12-17
Lines 12-17, step 2, create a 4x3 matrix of input values that we will use to train our network.  X will become layer 0, or l0 of our network, so this is the beginning of the "toy brain" we are creating!  

Think of each row of this matrix as a training example we'll feed into our network, and each column is one node of our input.  So our Matrix X can be visualized as this:

![alt text](https://lh3.googleusercontent.com/1cxj3K8bj_IOL8lhF9pRQSbDQyo3imZGVmZlEsjfvX1ImXBIJZHeJgOtX2kOQoaEbfhCbqtR2JFzSXK_r41dGGVdH6rde7QgoIqEGGn-v37CcD6IWw-nk9put1txPGeKvP-fxo3Lwk3nlwd3nkIdgjMu8HQf9ccyAZGlgRj7YH3avElDzwJr6kEDygOaLxmR9xo427aQHMwSjhdQKrrHwfTSsrUZmXyKh5F2anucPUAB-BAfTNyIG4brluYZwwKq3Mhx4eOJkn51V0D4zlcDVDPEK7k789ZUYVSfEcJ4_QbIY7YVeSHkqO-jQ3XB6XeK1WydnTFF8-OcOgBEzuKB6LDUNUYttH8eKE-k7OB5vi7h56PmIK35-ciOix52zaQRP0s8kcBeL5NAimjhIZjC6j5Am7CAf7VeBZUs9q9UvemlNYRZze_SWmlJIQBVx05GOShiGlHXAYZxdEWSOT9P4B09RmeqVCxakgQbgzOpS-05KVFzcIDn8HYsKOLj05uURaLfPL8AP3kKEOSE4z3iSdBxVnRuu6taQcRBU9q9UbZbdo7z0vXPbacLuxc1nPlvOkb_0fcHQxqU1Dn-i17fBUeV1Mcv_sjxhvXzvlVJq3qhJAyYD5YYDi0AgLLIU-o2XZ31DGFkvNQxioM8mPT5tOeycr-n4sfS=w518-h389-no
)

Notice that I have turned the 4 rows (aka 4 training examples) onto their sides to make 4 columns.  I don't mean to confuse you. It is very important to visualize matrices as nodes, and you will see why below.  But first, we have to do some housekeeping and create the other parts of our NN.  Next: creating our output layer, also known as our target values:

#3) Create y output: Lines 18-24
This code creates our output layer: Our training set of 4 target values. Once our NN can correctly predict these 4 target values from the inputs provided by matrix X above, it is now ready to predict in real life.  Think of X, the input layer, layer 0, as the beginning of our NN.  y, the output layer, is the end:
![alt text](https://lh3.googleusercontent.com/6J6pt5FDEceBFHqP18HZ3rdWE3oRHhOlFG2ZAeng6vROeQZuLajnYrCw2oAmPv39yYnDHi2R1K91kF1B9fyAZePNaTcZhyxOiW7U1MPeQUyBcMWH02wT4yrVyWIHn9Yz8vXlASd9G14mVgoxTvRvWtDBsQlVwf2gn48TLy3Xm7Z7JepB5X0yk_xusS5-7nwtspG8NkKQwVrJjCEzwR6wAN5KgUfFd15r1RI-6Hwb082hcipNWl5TX8pqvNdrMkrOjYHnNexc-CUGwIESRd3ramL72W4Wb1B8WciircrSFlt3bHRNO_11ZVB5W061dOhV8O9xhmPRiFQQImS99UpbK7Lwuu5FG4hLyr6rYPJsyMsMlT5yOsSjlypBTrKEJq8USJJmsPrGdbQ_hKtwuo-ywnmOTS4Bpz0eSI_JxNfzV4hFN4uo9LxcLqlSvUBH72HlDlPSsMpyMroieRJifYraW6XXpeZ__DXSbxWkm9oz4x-W1YJNQV_pljbHEEHTv4iGS22YyyLd84gPYoagm74ptwXodx2jJi9KHy2E_2HBvpVmQVgVgNFZq7-JO13PzzSOhhpT57EzBr1ToMbjRAaNl4KHCBjJY-5-l7mE1Q5-NTtMH5MV219sPi4za2Fq06hfDmJKMKDei1BgFxkczdmr92PdCm4-98z9=w518-h388-no)



#4) Seed your random numbers: Lines 25-27
This step is housekeeping. We have to seed the random numbers we will generate in synapses/weights for the next step in our training process, to make debugging easier.  You don't have to understand how this code works, you just have to include it.

#5) Create "Synapses" of your brain--Weights: Lines 29-31
These 2 matrices are the "brain" of our NN.  They are the part of our NN that learns by trial-and-error: making predictions, then improving the next prediction, then remembering the improvements--learning!

![alt text](https://lh3.googleusercontent.com/S-ihGzoU2SJQvm36_2f0ORWDmJFe-gs2tjVB-MEZgKXAfhpLzwroAg7x7w7ynCDbycJtytdC5KmRTj0yNQ0mFBYLgE26L7CTl8HeZfjXnqOQkPqyR0tWRSmKg2VxscEEwmhEAu06kmI4YddkLXVIqx0cjPcuLqQSdOfprAUbH661aB5-H5XPdkiW83RCRGUYvUYZM0rpQhJM3LuV8jAO9ZEO3il87MQEahMKOPrsNp2KWsbDX9HF3B17_cz_ZyYRa7to30FdKOaOxL1R8ccFOzGHfEtfv5B5AuBX8Nqxq6LV-j1DSU_OTosxenEY238Bpc8aUnTLniN6T65YLQ1Re17yK28z3B-Mcc4OHsRYHObk2pZU7ASedCSO8SgW4pUyT1IJZpei0wNGLYRi50kLCgDraQr07aDodr-HpkP7jRAQcEFNWnFGPNsvA3sQz4QT2wZhEx9kVJLZdVHEsglCE1_bzdzKnE3MPv0QNAz72juBmicCxcNEv_p-BAlQsvkDrES1QdIn0PTXJAo8Wd2f1RndNWqdxlA4qBU1cwhNPjouv1CFbAQnDL9IiwbeXFD5AUjL4jDDUZYt8MGwZ1Mg6YGakLOWz-RZ05uitJFtQSV-UgC4tZ6K02kbLbv-8JYDK5-JoRfap6S6JeTapIa_nf_f1IFg397-=w968-h238-no)

Notice how this code, `syn0 = 2*np.random.random((3,4)) - 1` creates a 3x4 matrix and seeds it with random values.  This will be the first layer of weights, Synapse 0, that connects l0 to l1.  You'll visualize it below in a minute.

The function np.random.random produces random numbers uniformly distributed between 0 and 1 (with a corresponding mean of 0.5).  But we want this initialization to have a mean zero.  Why?  So that the initial weight numbers in this matrix do not have an a-priori bias towards values of 1 or 0, because this would imply a confidence that we do not yet have (i.e. in the beginning, the network has no idea what is going on so it should display no confidence until we update it after each iteration).  

So, how do we convert a set of numbers with an average of 0.5 to a set with a mean of 0?  We first double all the random numbers (resulting in a distribution between 0 and 2 with mean 1) and then we subtract one (resulting in a distribution between -1 and 1 with mean 0).  That's why you see 2* at the beginning of our equation, and - 1 at the end: `2*np.random.random((3,4)) - 1`
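Here is a quick sketch of that doubling-and-shifting trick.  The variable names `raw` and `big` are mine, and the 100,000-number sample is only there because 12 numbers are too few to show the average clearly.

```python
import numpy as np

np.random.seed(1)                   # same seed as the post, so the numbers repeat

raw = np.random.random((3, 4))      # uniform between 0 and 1, mean near 0.5
syn0 = 2 * raw - 1                  # doubled, then shifted: between -1 and 1, mean near 0

print(raw.min() >= 0, raw.max() < 1)
print(syn0.min() >= -1, syn0.max() < 1)

# A much bigger sample (hypothetical, for illustration) shows the mean settling near 0.
big = 2 * np.random.random(100000) - 1
print(abs(big.mean()) < 0.02)
```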

Notice that we are generating a 3x4 matrix.  Why?  Because l0 (aka our X matrix) is a 4x3, and matrix multiplication requires the inner 2 size numbers to match, i.e., a 4x3 matrix must be multiplied by a 3x_?_ matrix--in this case, a 3x4.  See how those inner two numbers must be the same?

Then this line of code, `syn1 = 2*np.random.random((4,1)) - 1` creates a 4x1 vector and seeds it with random values.  This will be our NN's second layer of weights, Synapse 1, connecting l1 to l2.  Keep an eye on the size of each matrix we are creating, because this will become *very* important soon.
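If the size rule feels abstract, this sketch may help.  It leaves out the sigmoid on purpose so we can focus only on sizes; the names `hidden` and `output` are placeholders of mine, not names from the original code.

```python
import numpy as np

np.random.seed(1)
l0 = np.array([[0, 0, 1],
               [0, 1, 1],
               [1, 0, 1],
               [1, 1, 1]])                  # 4x3
syn0 = 2 * np.random.random((3, 4)) - 1     # 3x4
syn1 = 2 * np.random.random((4, 1)) - 1     # 4x1

hidden = np.dot(l0, syn0)                   # (4x3)(3x4) gives a 4x4
output = np.dot(hidden, syn1)               # (4x4)(4x1) gives a 4x1
print(hidden.shape, output.shape)

# When the inner sizes do not match, NumPy refuses: a 4x3 cannot multiply a 4x1.
try:
    np.dot(l0, syn1)
except ValueError as err:
    print("shape mismatch:", err)
```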



#6) For Loop: Lines 33-34
This is a for loop that will take our NN through 60,000 iterations.  For each iteration, our network will take X, our input data, and based on that data, give its best guess at a prediction of what our y output is. It will then analyze how it did, learn from its mistakes, and give a slightly better prediction on the next iteration.  60,000 times, until it has learned by trial-and-error how to take the X input and predict accurately what the y output is.  Then our NN will be ready to take new input data of the same kind and predict its output!


#7) Feed Forward Network: Lines 36-40
This is where our NN makes its first guess at a prediction. Think of the 3 matrices l0, l1 and l2 as the "neurons" that combine with the "synapses" matrices we created in #5 to think, predict, improve, remember.  This is where matrix multiplication becomes key.

First, we take the dot product of the 4x3 l0 and the 3x4 syn0 to create (hidden layer) l1, a 4x4:
![alt text](https://lh3.googleusercontent.com/Qbej-Y21C2su8KnB4dzus7l39oJ8c1VZ-3oLOJLqWrb6sr_pAjyq2vBPJ8Y8VG8JPaQQKOybzyTo7sqgDI6OPdd24VAR71aMXSzXRpaPpxLsYwgvEhWJxC0CwtUIlJp5Xjar-jvUMBrybLnnsbhJ9s-cf1Exci0nlUNFCiFjtH1O5SdST71fPXhZoUUOxN9-hWv6Msr5Obpg0ve5RrZc_VvSGD9I85nqxRvpaILZkQQLeggj4tlBW528lymNR9e28gwOy9fuAopRaM37Mfw0or9aIZiK4jjiSuU0FoHX757Mv0JgvmM6QgVOb2sNxpX6Lx2eASFX6nxctGwXW0hX1wnzZRRf_KttjjeOTSa6r70KZdKy3f8RH4JK3oxroGBI4RXWOaSwNHHgmd7ljoo0uvUmKy3ScE0cMummCWfTNAv8dcvvi53JqbixVglUL_DQ-v-P4ZyW5HY2xJTWZ8KgByme0FlZTQFURPkvHFw7g375kI3Ad188o5L9p30SlvyKZ_N5Dp2VXHS5cavs42-OVWY47v0fYijR8VpPBOAfWNrcVAn1jOO_YLOwv7z3VFc8RbyMSoAFve0b3m6qGNHh-oYvpVhAuelaUzRWb5x0HYSLUq-wQZRksy8klBo2LWhQxsYTFoZSKLocyNj83uBnCd5-lrAve-SU=w968-h594-no)


Notice that line 39 uses the `np.dot()` syntax because we are taking the dot product of two matrices.  Later we'll use the simpler multiplication syntax, i.e., `*`, when we are multiplying two same-sized arrays element by element.  But you must be clear on the size of the matrices you are multiplying in order to use the right syntax.
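The difference between the two kinds of multiplication is easier to see with tiny made-up arrays.  The arrays `a`, `b`, `m` and `v` below are hypothetical examples of mine, not part of the network.

```python
import numpy as np

a = np.array([[1.0], [2.0], [3.0], [4.0]])    # 4x1
b = np.array([[0.5], [0.5], [2.0], [2.0]])    # 4x1

elementwise = a * b        # still 4x1: each entry times its partner (0.5, 1, 6, 8)
print(elementwise.shape)

m = np.array([[1.0, 2.0],
              [3.0, 4.0]])                     # 2x2
v = np.array([[10.0], [1.0]])                  # 2x1

product = np.dot(m, v)     # 2x1: each row of m combined with the column v (12, 34)
print(product.shape)
```

So `*` keeps the size and works entry by entry, while `np.dot()` combines whole rows with whole columns and can change the size.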

Still on line 39, we next take the Sigmoid function of that dot product because its raw values can be any positive or negative number, and we need values between 0 and 1, hence: `l1=nonlin(np.dot(l0,syn0))`

It is on line 39 that we see ***Big Advantage #1*** of the ***Four Big Advantages of the Sigmoid Function.***  When we pass the dot product matrix of l0 and syn0 through the `nonlin()` function, the sigmoid converts each value in the matrix into a statistical probability between 0 and 1.  This means, "the closer the value is to 1, the more certainty there is that such-and-such is the case, whereas the closer the value is to 0, the more certainty that such-and-such is NOT the case."  "So, what?" you may ask.  Well, it doesn't matter in lines 39 and 40, but it matters a *ton* when we hit line 61 and beyond.  Stay tuned.

Exactly the same thing happens on line 40, as we take the dot product of 4x4 l1 and 4x1 syn1 and then run that product through the Sigmoid function to produce a 4x1 l2 with each value becoming a statistical probability from 0-1.
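Here is the feed forward step on its own, as a minimal sketch you can run once outside the loop.  It assumes the same seed and the same X as the code at the top; I have dropped the `deriv` option from `nonlin()` because we don't need it yet.

```python
import numpy as np

def nonlin(x):
    # Sigmoid only; the slope part is not needed for feeding forward.
    return 1 / (1 + np.exp(-x))

np.random.seed(1)
X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])
syn0 = 2 * np.random.random((3, 4)) - 1
syn1 = 2 * np.random.random((4, 1)) - 1

l0 = X
l1 = nonlin(np.dot(l0, syn0))    # 4x4 hidden layer, every value between 0 and 1
l2 = nonlin(np.dot(l1, syn1))    # 4x1 first guess, every value between 0 and 1

print(l1.shape, l2.shape)
print(l2)
```

On this very first pass all four guesses hover near 0.5, because the weights are still random and the network has learned nothing yet.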

Now we have completed the Feed Forward portion of our network.  This would be a good time to visualize what we have done so far, both in terms of the matrices involved, and also the layers of "neurons" and the "synapses" connecting those neurons:

![alt text](https://lh3.googleusercontent.com/Rsh6bYtBEu_oAMLVXppmxO0ayEBDJVQg8YwDZGz0_I6JKB588v5aVQbzQKrx-srfY1sGZm4V83NuSIPxaWBntZH3Hm584xbax4RR0f3L5IAOLSB3jH5ZNXJnnhl7gAQR9w2UsqOuxsdoRDXj1xrmLZkC_ZCEZrooNcHJRpEPySTsTiTb61j9AQg-uH2TLUewt02YPshI4i0f3Ynoc9IHa6IgywBH_k_Baae6FcOIuTPApdSiLDviMoZuIZLMOdGusJ1oV_E4sl7wWHcr4JQOlpz8rGArnyLkUywKh_1BJigH_z8s1EcUuA6NfUY8W_4qGJsEqLKDac2hhvvgqoUQTiLqis7PKp_8eS-MjpSv5tz17gclQAj4snNmlF_gEKLWMgRuPVju1zGmoMZeMPqMnXDE-QWVVAIKFYwtCe83KELSDKmkRCI_hySi0MYualR9tL7EPRdSjpS6VWoChupgnHodSdPKmNkx7qcGI8vE-BMWiVzxC1gxICM9DUc5h7LCAkwR2p4tN94tQB8z12AxI5-TI8l_tT1-RvgrhqO5QJv2tBzud7P6rJBUZxSVHqCt4wlP4ZbVgAviCwJDpzLzbsI3HE9666hHMNZLcUqsTd2YRwaV-p7m8DTcDmy4IzB36qNMDSNip8VG4CunowqCbQlw6kXBzVYp=w968-h698-no)





#8) By How Much Did We Miss the Target? Lines 42-45
The 4x1 y vector is our goal, our target.  Given our input X of layer 0, we want to produce an output, layer 2, that is as close to the 4 values of y as possible.  Each one of our 60,000 iterations should bring us, by trial-and-error and learning from our mistakes, closer to the 4 target values of y.  So, for each iteration, we take our best prediction so far, the 4x1 vector l2, and subtract it from the 4x1 vector y.  The difference is l2_error, i.e., how much each value of l2 missed its target value in y.  

This is the exciting first step in the learning process of our NN.  Once we know what we missed by, in the following steps we will seek to correct that error and do better next time.

#9) Print Error: Lines 47-51
Line 50 is a clever parlor trick to have the computer print out our l2_error every 10,000 iterations.  The line, `if (j% 10000)==0:` means, "If your iterator is at a number of iterations that, when divided by 10,000, leaves no remainder, then..."  `j%10000` has a remainder of 0 only six times: at 0 iterations, 10,000, 20,000, and so on up to 50,000 (the loop stops at 59,999, so 60,000 itself is never reached).  So this print-out gives us a nice report on the progress of our NN's learning.

The code `+ str(np.mean(np.abs(l2_error))))` simplifies our print out by taking the absolute value of each of the 4 values, then averaging all 4 into one mean number and printing that.
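Steps 8 and 9 can be tried by hand with a small sketch.  The `l2` values below are rounded first-pass guesses that I typed in for illustration, and `report_points` is a helper name of mine.

```python
import numpy as np

y = np.array([[0], [1], [1], [0]])
l2 = np.array([[0.47], [0.49], [0.54], [0.54]])   # rounded guesses, for illustration only

l2_error = y - l2                       # signed miss for each training example
print(l2_error.ravel())
print(np.mean(np.abs(l2_error)))        # one summary number: the average size of the miss

# Which values of j in range(60000) trigger the progress report?
report_points = [j for j in range(60000) if j % 10000 == 0]
print(report_points)
```

The sign of each error tells us whether the guess was too high (negative) or too low (positive); taking the absolute value before averaging stops those signs from cancelling each other out.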

#10) In What DIRECTION is y?
You might call Step 10, "How much do I tweak my NN before its next iteration and prediction?"  For statistics and calculus buffs, we could simply say, "In line 61 we compute l2_delta, which weights the error by the derivatives to induce large changes in low confidence values and small changes in high confidence values."  Whew!  For the rest of us mere mortals, let me unpack that a bit:

Here is where you will see the beauty of the Sigmoid function in four *magical* steps.  To me, the genius of neural networks' ability to think is found largely in these four steps.  We saw in line 39 that Step 1 was when `nonlin()` transformed each value of our matrix into a statistical probability between 0 and 1.  But I have yet to mention ***Big Advantage #2:*** that statistical probability is ***also*** a simple measure of confidence.  Numbers approaching 1 suggest high confidence that the answer is 1.  Numbers approaching 0 suggest high confidence that the answer is 0.  What matters here is ***confidence.***  If our NN's prediction, the l2 value, is high-confidence and high-accuracy, that's an outstanding prediction, and we want to leave the syn0 and syn1 weights that produced that outstanding prediction alone.  We don't want to mess with what's working; we want to fix what's NOT working.

That's why we focus our attention on the numbers in the middle:  all numbers approaching 0.5 in the middle are wishy-washy, and lacking confidence.  So, how can we tweak our NN to produce four l2 values that are both high-confidence and high-accuracy?  

The key lies in the values, or ***weights*** of syn0 and syn1.  As I mentioned above, syn0 and syn1 are the center, the absolute *brains* of our neural network.  We are going to take the four values of the l2_error and perform beautiful, elegant math on them to produce an l2_delta.  l2_delta means, basically, "the amount by which we need to increase-or-decrease our weight values in syn0 and syn1 in order to reduce wishy-washiness and maximize the high-confidence, high-accuracy values of the prediction of our next iteration."

***Get ready for beauty.***

Here is ***Big Advantage #3*** of the ***Four Big Advantages of the Sigmoid Function:*** Do you remember that diagram of the beautiful S-curve of the Sigmoid function that I showed you above?  Well, lo-and-behold, each of the 4 probability/confidence numbers of l2 lies somewhere on the S curve of the sigmoid graph (pictured again below, but this time with more detail).  If we search for that number (e.g. 0.9) on the Y axis of the graph below, we can see that it falls on the S curve roughly where you see the green dot: ![alt text](https://iamtrask.github.io/img/sigmoid-deriv-2.png)
(taken with gratitude from: 
[Andrew Trask](https://iamtrask.github.io/2015/07/12/basic-python-network/))

Note that the S curve above has very shallow slope at both the upper extreme (near 1) and the lower extreme (near 0).  Does that sound familiar?  Wonder of wonders, a shallow slope on the sigmoid curve coincides with high confidence in our predictions!  And a shallow slope is a tiny number: at a prediction of 0.99, for example, the slope is only 0.99*(1-0.99) = 0.0099.

Therefore, when we go to update our synapses, we basically want to leave alone the weights behind our high-confidence predictions, because they already have good accuracy.  And, miracle-of-miracles, because our high-confidence/low-slope numbers end up being so tiny, multiplying the l2_error by these teeny-tiny numbers has exactly the effect we want: the updates to syn0 and syn1 are tiny too, so our confident, accurate, high-performing values are largely left alone, and unchanged.

The great news is that our wishy-washy indecisive predictions, those in the middle of the S-curve, are the numbers that leave the biggest "footprint" on our S-curve.  What I mean is, the values around 0.5 can be traced on the Y axis of our graph above to the middle of the S-curve, where the slope is steepest, and therefore the value of that slope is a big number (0.5*(1-0.5) = 0.25, the biggest it can be).

When we multiply the l2_error by that large slope number, it's going to give the corresponding prediction a big nudge in the right direction towards confident and accurate prediction.  When I say, "in the right direction," what I mean is that some values of our l2_delta are going to be negative, because we want them to push the prediction down towards 0.  Other values of our l2_delta are going to be positive, because we want them to push the prediction up towards 1.
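To make the big-nudge versus small-nudge idea concrete, here is a sketch with three hand-picked predictions.  The `y` and `l2` values are hypothetical numbers I chose: one confident guess, one wishy-washy guess, and one guess leaning the wrong way.

```python
import numpy as np

def nonlin(x, deriv=False):
    if deriv:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

y = np.array([[1.0], [1.0], [0.0]])
l2 = np.array([[0.99], [0.50], [0.60]])   # confident, wishy-washy, leaning the wrong way

l2_error = y - l2                          # how far off, and in which direction
slope = nonlin(l2, deriv=True)             # shallow near 0 or 1, steepest at 0.5
l2_delta = l2_error * slope                # the error, scaled by the slope

print(slope.ravel())       # roughly 0.0099, 0.25, 0.24
print(l2_delta.ravel())    # roughly 0.000099, 0.125, -0.144
```

The confident guess of 0.99 gets almost no nudge, the 0.50 guess gets a big positive nudge up towards 1, and the 0.60 guess gets a big negative nudge down towards 0.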

So it's important to notice that there is a sense of "direction" involved here.  Have you heard of gradient descent described as, "a ball dropped in a bowl and rolling back-and-forth until it comes to a rest at the global minimum, the bottom of the bowl?"  That's what the Sigmoid does for us.  It helps us to find the bottom of the bowl, the minimum of the cost function, the lowest error in our predictions.  Think of our 60,000 iterations as the ball rolling back-and-forth in the bowl until it no longer needs to change direction because it has come to rest at the ideal, perfect bottom of that bowl.  Picture it like this:

![alt text](https://mail.google.com/mail/u/0?ui=2&ik=e3f869f938&attid=0.2&permmsgid=msg-a:r4352876950048414936&th=1691255aa52a4d54&view=fimg&sz=s0-l75-ft&attbid=ANGjdJ8FdFORGv3w0jn-Bs8GhlKpg2D1XPRzSF6OaNCqE8hchNYMIAymIg-nK1xCdIsQup54rJmkW2l0qttCzg03Hq8PJOv4KX0ae14e2dkswvLMt74Rzdhwt2ZJQBQ&disp=emb&realattid=ii_jsexnu8o2)

(Both the image above and the one below are taken with gratitude from Grant Sanderson.)


![alt text](https://lh3.googleusercontent.com/jIup60T65tIKtXg0B-Np6jeNXk4TvQTRgBI1btNRZUZ4yy_ZEyL1bN3RwiSjzKNcbyXQN6z7vdV55NzGFxJfUpZXkyU6HTmrScht0rbk5BXGC6eO79LrZuuVpJdHE4fr4QYwvdbO)





Take your time with the above points and make sure you understand them.  Do you see why the sigmoid function is a thing of beauty?  It takes any random number in one of our matrices and turns it into a statistical probability, which turns into a confidence level, which turns into a big-or-small tweak of our synapses, always in the direction of greater confidence and accuracy.  The sigmoid function is the miracle by which mere numbers in a matrix can "learn." A single number, along with its many colleagues in a matrix, can indicate probability and confidence, and that matrix can learn and remember that learning, over-and-over again as it improves with each iteration!  
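If you would like to watch the learning happen in numbers, here is a compact sketch of the whole loop that records the average error every 10,000 iterations.  The `errors` list is my own addition for illustration, and the back propagation line follows Trask's 3-layer network by passing `l2_delta` backwards.

```python
import numpy as np

def nonlin(x, deriv=False):
    if deriv:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
y = np.array([[0], [1], [1], [0]])

np.random.seed(1)
syn0 = 2 * np.random.random((3, 4)) - 1
syn1 = 2 * np.random.random((4, 1)) - 1

errors = []                                  # average miss, recorded every 10,000 iterations
for j in range(60000):
    l1 = nonlin(np.dot(X, syn0))
    l2 = nonlin(np.dot(l1, syn1))
    l2_error = y - l2
    if j % 10000 == 0:
        errors.append(np.mean(np.abs(l2_error)))
    l2_delta = l2_error * nonlin(l2, deriv=True)
    l1_error = l2_delta.dot(syn1.T)          # as in Trask's 3-layer network
    l1_delta = l1_error * nonlin(l1, deriv=True)
    syn1 += l1.T.dot(l2_delta)
    syn0 += X.T.dot(l1_delta)

print(errors)          # should shrink from roughly 0.5 towards 0
print(np.round(l2))    # rounded predictions, to compare with y
```

The first recorded error is close to 0.5 (pure guessing), and each later report should be smaller than the one before.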








In [0]:

# TODO: teach my Sigmund Freud ditty
# TODO: diagram showing matrices, etc.
# TODO: the best teacher, etc.
import numpy as np
#create sigmoid function to change numbers to probabilities
def nonlin(x,deriv=False):
  if(deriv==True):
    return x*(1-x)
  
  return 1/(1+np.exp(-x))
# input layer: each row is a training example, each column is one input node
X = np.array([[0,0,1],
              [0,1,1],
              [1,0,1],
              [1,1,1]])
# output layer: 4 target values (training examples). Final Layer of the Network,
# which is our hypothesis, and should approximate the correct answer as we train
y = np.array([[0],
             [1],
             [1],
             [0]])
# seed your random numbers to make debugging easier
np.random.seed(1)

# randomly initialize your weights with mean of 0
syn0 = 2*np.random.random((3,4)) - 1 # First layer of weights, Synapse 0, connecting l0 to l1.
syn1 = 2*np.random.random((4,1)) - 1 # Second layer of weights, Synapse 1 connecting l1 to l2.

for j in range(60000):
  
  # Feed forward through layers 0, 1, and 2
  l0=X
  l1=nonlin(np.dot(l0,syn0))
  l2=nonlin(np.dot(l1,syn1))
  
  #Purpose of the line below is to determine by how much did the neural 
  #network miss in its prediction of the target value? (see notebook drawing) l2 
  #is a 4x1 vector containing the NN's best guess of what the value is in the 
  #vector y.  y is a 4x1 vector of training targets. We do element-wise subtraction
  #of l2 from y.
  l2_error = y - l2
  
  if (j% 10000)==0:
    print("Error: "+str(np.mean(np.abs(l2_error))))
    
  # Purpose: to obtain a 4x1 vector, l2_delta, containing the pos or neg values 
  #(one per training example) that drive the adjustment of syn1 (a 4x1).  Each of these 4 
  #"tweaks" helps syn1 bring l2 closer to a minimum of the error.  (Open question for Adam: is that a local or a global minimum?)
  
  #First we take the derivative of all 4 elements of the 4x1 l2.  To take the deriv here
  #simply means to determine the slope of the tangent to the point where each value lies
  #on the S curve of the Sigmoid function.
  
  #YES to this Question: How does one number give us both rise and run to determine WHERE the 
  #number falls on the S curve? See Trask: does each l2 value represent the y axis,
  #such that the 3 Trask dots on the S curve should be: (0.27, 0.5, and 0.9)?
  
  #Each derivative represents a confidence value, i.e., the values closer to 0 or to 1 mean 
  #the NN is very confident of its prediction.  
  #
  #AK: The above sentence is unclear.  Both the value of the sigmoid and its derivative are related to confidence, but the relationships are
  #different.  For
  #
  #All values somewhere in the middle, around  
  #.5 mean the NN is very wishy-washy, and not confident about its prediction.  As you see in the diagram of the S-curve, middling
  #numbers have the steepest slope, i.e. a bigger value, which means nonlin(l2,True) will yield a relatively large value for that element in the vector l2_delta.
  # For our purposes, we don't care about the highly confident numbers near 0 or 1.  Their slope is so shallow that their derivative will become a tiny weight in l2_delta, which will 
  #barely tweak syn1 when we update it below.
  
  #Question: for, say, 0.5, what numbers determine rise/run? What IS the slope of 0.5?
  #Answer: Don't worry about this.  Let nonlin(l2,True) compute the slope for you: x*(1-x), so for 0.5 it is 0.5*0.5 = 0.25.
  #The 4x1 vector of nonlin(l2,True) is multiplied element-wise by the 4x1 
  #l2_error value.  The "least confident" values of nonlin(l2,True) are the
  #biggest values, so they yield a relatively big value among the l2_delta elements
  #that will eventually be used to adjust the syn1 elements when we update the synapses.
  
  l2_delta = l2_error*nonlin(l2,deriv=True)
  # l2_delta is the error of the network scaled INVERSELY by the confidence. 
  # It's almost identical to the l2_error except that very confident errors are muted.
  
  # Purpose of the line below is back propagation: to determine: By how much did each l1 node 
  #contribute to the l2 error (according to the weights)?
  l1_error = l2_delta.dot(syn1.T)
  #Question: dot prod is rowXcol, so why is syn1 transposed?
  #Answer: l2_delta is 4x1 and syn1 is 4x1, so syn1.T (a 1x4) makes the inner sizes match and yields a 4x4 l1_error.
  #Question: I would have thought that l2_error DIVIDED by syn1 = l1_error...
  #So: by multiplying l2_delta by the weights in syn1, we can calculate the error in the middle/hidden layer?
  
  #Purpose of line below: uses the "confidence weighted error" from l2 to establish an 
   # error for l1. To do this, it simply sends the error across the weights 
   #from l2 to l1. This gives what you could call a "contribution weighted error" 
   #because we learn how much each node value in l1 "contributed" to the error in l2. 
   #This step is called "backpropagation" and is the namesake of the algorithm. 
   #We then update syn0 using the same steps we did in the 2 layer implementation.
    
     # Weighting l2_delta by the weights in syn1, we can calculate the error in the middle/hidden layer.
  

  # In what direction is the target l1?
  # Were we really sure?  If so, don't change it too much
  l1_delta = l1_error * nonlin(l1,deriv=True)
  # This is the l1 error of the network scaled by the confidence. Again, it's 
  # almost identical to the l1_error except that confident errors are muted.
  
  syn1 += l1.T.dot(l2_delta)
  syn0 += l0.T.dot(l1_delta)
print("Output after training:")
print(l2)

Sigmund Freud ditty and memorizing code

1) that converted number between 0 and 1 becomes a statistical probability and;
2) it also becomes a confidence measure, and that's where Part 2 comes in.  Another part of this sigmoid function, i.e.,
'  if(deriv==True):
    return x*(1-x)'
takes the confidence measure from Part 1) and converts it into **big advantage #3**, which is;
3) a slope, and then;
4) a weight with which to tweak the matrices of our NN and nudge them towards greater accuracy in prediction.  


In [0]:
import numpy as np
def nonlin(x,deriv=False):
  if(deriv==True):
    return x*(1-x)
  
  return 1/(1+np.exp(-x))
# X and l1 are defined at the top level: indented after the return inside nonlin(), they never ran.
X = np.array([[-0.1,4.5,2.7,1.1],
              [0.1,7,6.9,7],
              [9.4,6,6.2,5.8],
              [9.6,8.5,10.4,8.1]])
l1 = np.random.random((4,4))

print(l1)