
#Building a Neural Network: A Simple, Complete Explanation That Skips No Steps
Do you ever feel overwhelmed when you study A.I.?  Discouraged?  Incapable?  Filled with self-doubt?
Do you ever feel like everybody else must get these A.I. concepts easily, but you're too dumb or poorly educated to ever succeed?
Does it seem like other people have a gift for this learning, but you don't, so why even bother?
I have felt all of the above.  Chronically.  For 18 months.  That's how long I have been teaching myself linear algebra and neural networks.  I am writing to you now for two reasons:  1) I don't *ever* want you to go through the hell I went through in learning this stuff; and 2) Teaching this material to you forces me to understand it well.  

The best teacher is the person who just learned yesterday the stuff you are studying today, because she remembers what she struggled with and how she overcame it, and she can pass those shortcuts on to you.

The best way to master Deep Learning is to master the fundamentals, build a neural network (NN), and then memorize the code, because then you can apply these fundamentals to any NN you meet in future academic papers or work projects.  It all starts here.  Today we will build a NN that can learn by trial-and-error: make a prediction, learn from its mistakes, and do better next time.  Our NN will solve a simple binary classification problem, i.e., "given these 1's and 0's in the matrix X as inputs, what do you predict the 1's and 0's in matrix y will be?"

Thanks to super-teachers Andrew Trask and Siraj Raval, I mastered today's material--a working neural network in about 15 lines of Python--and it is the foundation of what you need to progress towards an understanding of the cutting edge techniques being used in deep learning today.  My mentor Adam Koenig, a Stanford PhD in aerospace engineering, helped me learn this stuff, and now I'm going to stand on these teachers' shoulders and help you learn it too.

Here is the Big Picture: a diagram of the exact 3 layer neural network we will build today (For now, just focus on the bottom labels: "Input Layer, Synapses, etc."  We'll get to the labels at the top later):

  ![alt text](https://lh3.googleusercontent.com/Xo38Db_mZRpNGXbtVCF1qtJEottBhoN2TzvQwJbGUWPuXjoi3SYAVQAvW7xCVaRguJapbcwJvGQ-OvZTxsoTz9mIaGDgTlYhAE_c7WbUGRLTeKxrxAEtfSwZ0M4lbzap6ViWyTCQCWCJET5Vcl0bjWBdr6WSq1k_I_aqTl2vwN3vCpif5Y-Ew8DXltcYZYR6cx4eoOW9qHLi88E9S5KpNxQeFYqXnyh5XQVDKFj-NV4Lr55cGaZnC-BcBvYWI8hdA0UNz1rv_f0FBWBgiDkIgqmPg1O54dVxt_KP25qs9eFnWTHC7QaJEbGf5ekVMGtf3sqj_imlvO7inyOFOnP2Zqh8bbJpGLT8VO9kdvxBPGk8fS8KF5pa_TGUtR_XeD1z1Sq9u8mu4vd_Vvs3YNsNQVVaBj_lP0PqZ2xQ7usWSjoxmpBnNtSeTS_JMMeUSFPPQGpNtq6uxZp-ak8JE6ivpPGZhL7QKZOeQcp0lBL_BylUumqoa718v8Qm2jCCFOawLb4k6-OtwDv8OLFw4b3vzDvoviBd3_1QDKADFaBJcdgXeqVk3TFqHRmLBcnOaP9mmTO5qKZa_Cu5FwNzi-5HwcKa50BX8xQrWpzt-8pibDLOoY54wlETSkE1LCFBcyd6oIiJhZ6FkB0rcWa8NzTv-sLM2NY4GMFx=w970-h588-no
  )
Let me help you get your bearings in this diagram.  This is a 3-layer, feed-forward neural network.  The input layer is on the left: 3 neurons (sometimes known as nodes or features).  This layer is connected to the hidden layer (layer 1) by 12 synapses.  Layer 1 is then connected to the output layer, layer 2, by 4 synapses.  

What does this diagram mean?  The network is saying to you, "Based on the information you input into my left side, I will successfully predict the correct answer on my right side, the output.  I will do this in my hidden middle layer through trial-and-error: I will guess at the correct answer, compare my guess to the actual correct answer, then learn from my mistakes and improve, guessing better next time.  I will do this over-and-over 60,000 times, until I can predict near-perfectly what the outcome is of the information you input into me."

Now, let's get an overview of our code.  I suggest you open this blog post in two side-by-side windows and show the code in the left window while you scroll through my explanation of it in your right window.  First, I'll show you the entire code we'll be studying today, and underneath that is my detailed step-by-step explanation of what it does.  As you see from the comments below (the lines beginning with a #), I have broken this process of building a NN down into 13 steps.  Get ready for the wonder of watching a computer learn from its mistakes and recognize patterns!  We're about to give birth to our own little baby brain... :-)

I am grateful for Andrew Trask's [blog post](http://iamtrask.github.io/2015/07/12/basic-python-network/) from which the code below is taken (though the comments are mine). Display this in your left window:

DCQ: In lines 14 and 21, why do the X and y arrays need that second set of brackets?

**AK: The inner brackets specify bounds on the rows of a matrix.  The outer brackets specify the ends of the matrix.**

DCQ: In Lines 39, 57, 64 and 71, why does multiplication use "np.dot" in some places and just ".dot" or "*" in others?

**AK: Syntactically, I'm pretty sure * gives elementwise multiplication and .dot gives normal matrix multiplication.  These are selected based on the needs of the given operation.  Also, I'm pretty sure l1.dot(l2_delta) and np.dot(l1,l2_delta) will produce the same result.**
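Here's a tiny sketch you can run to check both of AK's answers for yourself.  The matrices `a` and `b` are made-up examples (they are not part of our network), chosen only so the arithmetic is easy to verify by hand:

```python
import numpy as np

# Inner brackets mark the rows; the outer pair wraps the whole matrix.
y_2d = np.array([[0], [1], [1], [0]])   # 4 rows, 1 column
y_1d = np.array([0, 1, 1, 0])           # no inner brackets: a flat array
print(y_2d.shape)   # (4, 1)
print(y_1d.shape)   # (4,)

# Hypothetical 2x2 matrices, used only for this demonstration.
a = np.array([[1, 2],
              [3, 4]])
b = np.array([[10, 20],
              [30, 40]])

elementwise = a * b         # multiplies matching positions
matrix_prod = a.dot(b)      # rows-times-columns matrix multiplication
same_thing = np.dot(a, b)   # function form of the same operation

print(elementwise)                              # [[ 10  40] [ 90 160]]
print(matrix_prod)                              # [[ 70 100] [150 220]]
print(np.array_equal(matrix_prod, same_thing))  # True
```

So `*` and `.dot` really are different operations, while `a.dot(b)` and `np.dot(a, b)` are two spellings of the same one.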



```python
#This is the "3 Layer Network" near the bottom of: 
#http://iamtrask.github.io/2015/07/12/basic-python-network/

#First housekeeping: import numpy, a powerful library of math tools.
import numpy as np
#1 Sigmoid Function: changes numbers to probabilities and finds slope to use in gradient descent
def nonlin(x,deriv=False):
  if(deriv==True):
    return x*(1-x)

  return 1/(1+np.exp(-x))
#2 X Matrix: This is a set of inputs in the training set that we will use to 
#train our network.
X = np.array([[0,0,1],
              [0,1,1],
              [1,0,1],
              [1,1,1]])
#3 y Vector: Our training set of 4 target values. Once our NN
#can correctly predict these 4 target values from the inputs provided by X, 
#it is now ready to predict in real life.
y = np.array([[0],
             [1],
             [1],
             [0]])
#4 SEED: This is housekeeping. One has to seed the random numbers we will generate
#in the training process, to make debugging easier.
np.random.seed(1)

#5 SYNAPSES: aka "Weights." These 2 matrices are the "brain." It learns, remembers, improves.
syn0 = 2*np.random.random((3,4)) - 1 # 1st layer of weights, Synapse 0, connects l0 to l1.
syn1 = 2*np.random.random((4,1)) - 1 # 2nd layer of weights, Synapse 1 connects l1 to l2.

#6 FOR LOOP: this iterator takes our NN through 60,000 guesses, tweaks, and improvements.
for j in range(60000):

  #7 FEED FORWARD NETWORK: Think of l0, l1 and l2 as 3 matrix layers of "neurons" 
  #that combine with the "synapses" matrices in #5 to predict, improve, remember.
  l0=X
  l1=nonlin(np.dot(l0,syn0))
  l2=nonlin(np.dot(l1,syn1))

  #8 TARGET, and how much we missed it by. y is a 4x1 vector containing our 4 target 
  #values. When we subtract the l2 vector (our first 4 guesses) from y, our target,
  #we get l2_error: how much our neural network missed the target by on this iteration.
  l2_error = y - l2

  #9 PRINT ERROR: in 60,000 iterations, j divided by 10,000 leaves a remainder of 0
  #only 6 times. We're going to check our data every 10,000 iterations to see if
  #the l2_error is reducing, and we're missing our target by less each time.
  if (j% 10000)==0:
    print("Avg l2_error after 10,000 more iterations: "+str(np.mean(np.abs(l2_error))))

  #10 In what DIRECTION is y, our desired target value, from our NN's latest guess? We
  #take the slope of our latest guess, multiply it by how much that latest guess
  #missed our target of y.  In line 75 we then multiply the resulting l2_delta by l1 to update
  #each weight in our syn1 synapses so that our next prediction will be even better.
  l2_delta = l2_error*nonlin(l2,deriv=True)

  #11 BACK PROPAGATION: After we "fed forward" our input in Step 7, now we work backwards
  #to find the l1 error. l1 error is the difference between the ideal l1 that would 
  #provide the ideal l2 we want and the most recent computed l1.  To find l1_error, 
  #we have to multiply l2_delta (i.e., what we want our l2 to be in the next iteration)
  #by our last guess at the optimal weights (syn1). We'll then use l1_error to update syn0.
  l1_error = l2_delta.dot(syn1.T)

  #12 In what DIRECTION is l1, the desired target value of our hard-working middle layer 1,
  #from l1's latest guess?  Pos. or neg.? Similar to #10 above, we want to tweak this 
  #middle layer so it sends a better prediction to l2, so l2 will better predict target y.
  #In other words, add weights that will produce large changes in low confidence values and 
  #small changes in high confidence values
  l1_delta = l1_error * nonlin(l1,deriv=True)

  #13 UPDATE SYNAPSES: aka Gradient Descent. This step is where the synapses, the true
  #"brain" of our network, learn from their mistakes, remember, and improve--learning!
  syn1 += l1.T.dot(l2_delta)
  syn0 += l0.T.dot(l1_delta)

#Print results!
print("Our l2 predictions after all 60,000 iterations of training: ")
print(l2)
```

```
Avg l2_error after 10,000 more iterations: 0.4964100319027255
Avg l2_error after 10,000 more iterations: 0.008584525653247157
Avg l2_error after 10,000 more iterations: 0.0057894598625078085
Avg l2_error after 10,000 more iterations: 0.004629176776769985
Avg l2_error after 10,000 more iterations: 0.0039587652802736475
Avg l2_error after 10,000 more iterations: 0.003510122567861678
Our l2 predictions after all 60,000 iterations of training: 
[[0.00260572]
 [0.99672209]
 [0.99701711]
 [0.00386759]]
```


Now let's go through each of the 13 steps of the code in detail:

#1) The Sigmoid Function: lines 6-11:

"nonlin()" is a type of standard logistic function known as a Sigmoid function.  Logistic functions are very commonly used in science, statistics, and probability.  This Sigmoid function is written in a more complicated way than necessary here because it serves two functions:

1) to take each of the matrices within its parentheses and convert each value to a number between 0 and 1 (aka a statistical probability).  This is done by line 11: `return 1/(1+np.exp(-x))` 
We will see below that this is very important, because this conversion to a 0-1 number gives us **FOUR** very **big advantages**.  I will discuss these four in detail below, but for now, just know that the sigmoid function converts every number in every matrix within its parentheses into a number between 0 and 1 that falls somewhere on the S-curve illustrated here:

![alt text](https://iamtrask.github.io/img/sigmoid.png)

(taken with gratitude from: 
[Andrew Trask](https://iamtrask.github.io/2015/07/12/basic-python-network/))

So, Part 1 of the Sigmoid function has converted each value in the matrix into a statistical probability, which is also known as a confidence measure.  In other words, the number answers the question, "how confident are we that this number correctly predicts an outcome?"  You may wonder, So what?  Well, our goal is a neural network that confidently makes accurate predictions.  The fastest way to achieve that goal is to fix the non-confident, wishy-washy, low-accuracy predictions, while leaving the good predictions alone.  Remember this concept of wishy-washy, non-confident numbers.  It will be important below.

Now for Part 2.  The second part of this sigmoid function is in lines 8 and 9:
```
  if(deriv==True):
    return x*(1-x)
```
When called to do so by `deriv=True` in the code below, line 9 takes the confidence measure from Part 1 and converts it into a slope at a particular point on the Sigmoid S curve, which will be used to tweak the synapse matrices of our NN and nudge them towards greater accuracy in prediction.  

So the sigmoid function plays a super-important role in making our NN learn, but don't worry if you don't understand it all yet.  I'll explain it in detail below in Step 10.  Let's move on to Step 2:
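If you'd like to poke at both parts of the sigmoid right now, here is a minimal sketch.  The input values in `z` are made up for illustration, and the numerical cross-check at the end is my own addition, not part of the network code.  One thing worth noticing: the slope branch expects a value that has *already* been through Part 1, which is exactly how the network code calls it.

```python
import numpy as np

def nonlin(x, deriv=False):
    # Same logic as the nonlin() in the network code.
    if deriv:
        return x * (1 - x)          # x must already be a sigmoid output
    return 1 / (1 + np.exp(-x))

z = np.array([-5.0, 0.0, 5.0])      # made-up raw inputs
out = nonlin(z)                     # Part 1: squash into the 0-1 range
slope = nonlin(out, deriv=True)     # Part 2: slope of the S-curve at those points

# Cross-check Part 2 against a numerical slope (rise over run on a tiny step h).
h = 1e-6
numeric_slope = (nonlin(z + h) - nonlin(z - h)) / (2 * h)

print(out)
print(slope)
print(np.allclose(slope, numeric_slope, atol=1e-6))
```

The middle input lands at 0.5, the most wishy-washy output possible, and that is where the slope is largest (0.25).  The two confident outputs near 0 and 1 get slopes close to zero.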


#2) Creating X input: Lines 12-17
Lines 12-17, step 2, create a 4x3 matrix of input values that we will use to train our network.  X will become layer 0, or l0 of our network, so this is the beginning of the "toy brain" we are creating!  
```
Line 14 creates the X input (which becomes l0, layer 0, in line 38)
X: 
[0 0 1]
[0 1 1]
[1 0 1]
[1 1 1]

```
Think of each row of this matrix as a training example we'll feed into our network, and each column is one node of our input.  So our Matrix X can be visualized as the 4x3 matrix that is l0 in the diagram below:

![alt text](https://lh3.googleusercontent.com/Xo38Db_mZRpNGXbtVCF1qtJEottBhoN2TzvQwJbGUWPuXjoi3SYAVQAvW7xCVaRguJapbcwJvGQ-OvZTxsoTz9mIaGDgTlYhAE_c7WbUGRLTeKxrxAEtfSwZ0M4lbzap6ViWyTCQCWCJET5Vcl0bjWBdr6WSq1k_I_aqTl2vwN3vCpif5Y-Ew8DXltcYZYR6cx4eoOW9qHLi88E9S5KpNxQeFYqXnyh5XQVDKFj-NV4Lr55cGaZnC-BcBvYWI8hdA0UNz1rv_f0FBWBgiDkIgqmPg1O54dVxt_KP25qs9eFnWTHC7QaJEbGf5ekVMGtf3sqj_imlvO7inyOFOnP2Zqh8bbJpGLT8VO9kdvxBPGk8fS8KF5pa_TGUtR_XeD1z1Sq9u8mu4vd_Vvs3YNsNQVVaBj_lP0PqZ2xQ7usWSjoxmpBnNtSeTS_JMMeUSFPPQGpNtq6uxZp-ak8JE6ivpPGZhL7QKZOeQcp0lBL_BylUumqoa718v8Qm2jCCFOawLb4k6-OtwDv8OLFw4b3vzDvoviBd3_1QDKADFaBJcdgXeqVk3TFqHRmLBcnOaP9mmTO5qKZa_Cu5FwNzi-5HwcKa50BX8xQrWpzt-8pibDLOoY54wlETSkE1LCFBcyd6oIiJhZ6FkB0rcWa8NzTv-sLM2NY4GMFx=w970-h588-no
)
You may wonder, "How does Matrix X become layer 0 in the diagram above?"  We'll get to that soon.  Next, let's create our list of the four correct answers we want our NN to be able to predict.

#3) Create y output: Lines 18-24
This code creates "the truth," or "the future."  Here's what I mean: think of our neural network as a psychic that learns by trial-and-error.  Our psychic can "read our palm," the input layer X above, and from that palm reading she can "predict the future."  Her prediction will be layer 2 (l2, a vector), but the actual, true future is the y vector (a vector is a one-column matrix or array) we are creating here.  For each iteration, our psychic neural network will take input X, make a prediction, l2, and compare it to y, the truth, to see how she did.  She'll learn from her mistakes and do better next time.  60,000 times!  If the network is properly trained, the predicted l2 will approach closer-and-closer to the true future, y, with each iteration.  

To use another metaphor, I also like to think of y as our "target" values, and I picture an archery target.  Once our NN can correctly predict these 4 target values from the inputs provided by matrix X above, it is now ready to predict in real life.  Think of X above, the input layer, layer 0, as the beginning of our NN.  And y is our truth.  Our truth looks like this:



```
Line 21 creates the y vector, a set of target values we strive to predict.
y: 
[0]
[1]
[1]
[0]

```


#4) Seed your random numbers: Lines 25-27
This step is housekeeping. We have to seed the random numbers we will generate in synapses/weights for the next step in our training process, to make debugging easier.  You don't have to understand how this code works; you just have to include it.
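If you are curious anyway, this small experiment (a sketch of my own, with made-up variable names `first`, `second`, and `third`) shows what seeding buys us: the same seed always produces the same "random" numbers, so every run of the network starts from the same weights.

```python
import numpy as np

np.random.seed(1)
first = np.random.random((3, 4))

np.random.seed(1)                    # reset to the same seed
second = np.random.random((3, 4))

third = np.random.random((3, 4))     # no reset: the generator just keeps going

print(np.array_equal(first, second))  # True: same seed, same numbers
print(np.array_equal(first, third))   # False: different draw
```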

#5) Create "Synapses" of your brain--Weights: Lines 29-31
These 2 matrices are the "brain" of our NN.  They are the part of our NN that learns by trial-and-error: they make predictions, improve their next prediction, and remember their improvements--learning!

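Notice how this code, `syn0 = 2*np.random.random((3,4)) - 1` creates a 3x4 matrix and seeds it with random values.  This will be the first layer of synapses, or weights, Synapse 0, that connects l0 to l1.  Here is a snapshot of syn0 (note that these values lie outside the -1 to 1 range that the initialization code can produce, so they appear to be syn0 as it looks after training rather than when it is freshly created):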


```
Line 30: syn0 = 2*np.random.random((3,4)) - 1: creates synapse 0
syn0: 
[ 5.67534974  5.1809666  -6.96032933 -4.91055814]
[-3.94870047 -6.6558582  -7.25683472 -4.61369466]
[ 1.77928043 -2.53186624  2.87700966  7.11388595]
```
Here's where it fits in our diagram below as "syn0 (3x4)":

![alt text](https://lh3.googleusercontent.com/Xo38Db_mZRpNGXbtVCF1qtJEottBhoN2TzvQwJbGUWPuXjoi3SYAVQAvW7xCVaRguJapbcwJvGQ-OvZTxsoTz9mIaGDgTlYhAE_c7WbUGRLTeKxrxAEtfSwZ0M4lbzap6ViWyTCQCWCJET5Vcl0bjWBdr6WSq1k_I_aqTl2vwN3vCpif5Y-Ew8DXltcYZYR6cx4eoOW9qHLi88E9S5KpNxQeFYqXnyh5XQVDKFj-NV4Lr55cGaZnC-BcBvYWI8hdA0UNz1rv_f0FBWBgiDkIgqmPg1O54dVxt_KP25qs9eFnWTHC7QaJEbGf5ekVMGtf3sqj_imlvO7inyOFOnP2Zqh8bbJpGLT8VO9kdvxBPGk8fS8KF5pa_TGUtR_XeD1z1Sq9u8mu4vd_Vvs3YNsNQVVaBj_lP0PqZ2xQ7usWSjoxmpBnNtSeTS_JMMeUSFPPQGpNtq6uxZp-ak8JE6ivpPGZhL7QKZOeQcp0lBL_BylUumqoa718v8Qm2jCCFOawLb4k6-OtwDv8OLFw4b3vzDvoviBd3_1QDKADFaBJcdgXeqVk3TFqHRmLBcnOaP9mmTO5qKZa_Cu5FwNzi-5HwcKa50BX8xQrWpzt-8pibDLOoY54wlETSkE1LCFBcyd6oIiJhZ6FkB0rcWa8NzTv-sLM2NY4GMFx=w970-h588-no)

The function np.random.random produces random numbers uniformly distributed between 0 and 1 (with a corresponding mean of 0.5).  But we want this initialization to have a mean zero.  Why?  So that the initial weight numbers in this matrix do not have an a-priori bias towards values of 1 or 0, because this would imply a confidence that we do not yet have (i.e. in the beginning, the network has no idea what is going on so it should display no confidence until we update it after each iteration).  

So, how do we convert a set of numbers with an average of 0.5 to a set with a mean of 0?  We first double all the random numbers (resulting in a distribution between 0 and 2 with mean 1) and then we subtract one (resulting in a distribution between -1 and 1 with mean 0).  That's why you see `2*` at the beginning of our equation, and - 1 at the end: `2*np.random.random((3,4)) - 1`
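You can watch the doubling-and-subtracting trick work on a big batch of random numbers.  This is only a quick sketch: the sample size of 100,000 is an arbitrary choice of mine, picked so the averages settle close to their theoretical values.

```python
import numpy as np

np.random.seed(1)
raw = np.random.random(100000)   # uniform between 0 and 1
shifted = 2 * raw - 1            # doubled, then shifted down by 1

print(raw.min(), raw.max(), raw.mean())              # range about 0 to 1, mean near 0.5
print(shifted.min(), shifted.max(), shifted.mean())  # range about -1 to 1, mean near 0
```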

Notice that we are generating a 3x4 matrix.  Why?  Because l0 (aka our X matrix) is a 4x3, and matrix multiplication requires the inner 2 size numbers to match, i.e., a 4x3 matrix must be multiplied by a 3x_?_ matrix--in this case, a 3x4.  See how those inner two numbers must be the same?

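Then this line of code, `syn1 = 2*np.random.random((4,1)) - 1` creates a 4x1 vector and seeds it with random values (depicted with 4 question marks in the diagram).  This will be our NN's second layer of weights, Synapse 1, connecting l1 to l2.  Meet syn1 (like the syn0 values shown above, these lie outside the -1 to 1 initialization range, so they appear to be a snapshot taken after training):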


```
Line 31: syn1 = 2*np.random.random((4,1)) - 1: creating synapse 1
syn1:  
[ -9.39072641]
[  9.43509921]
[-12.43520534]
[ 10.32941201]

```
Keep an eye on the size of each matrix we are creating (i.e., 4x3, 3x4, 4x4, etc.), because this will become *very* important soon.
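Here is a short sketch that lets NumPy report those sizes for us.  It uses the same X and the same initialization as our network; the `mismatch_raised` flag and the deliberately wrong multiplication at the end are my own additions to show what happens when the inner two numbers don't match.

```python
import numpy as np

np.random.seed(1)
l0 = np.array([[0, 0, 1],
               [0, 1, 1],
               [1, 0, 1],
               [1, 1, 1]])
syn0 = 2 * np.random.random((3, 4)) - 1
syn1 = 2 * np.random.random((4, 1)) - 1

l1 = 1 / (1 + np.exp(-np.dot(l0, syn0)))   # (4,3) x (3,4) -> (4,4)
l2 = 1 / (1 + np.exp(-np.dot(l1, syn1)))   # (4,4) x (4,1) -> (4,1)

print(l0.shape, syn0.shape, l1.shape, syn1.shape, l2.shape)
print(syn0.size + syn1.size)               # 12 + 4 = 16 weights in total

# Deliberate mistake: (4,3) x (4,1) has inner numbers 3 and 4, which differ.
mismatch_raised = False
try:
    np.dot(l0, syn1)
except ValueError as err:
    mismatch_raised = True
    print("shape mismatch:", err)
```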



#6) For Loop: Lines 33-34
This is a for loop that takes our NN through 60,000 iterations.  For each iteration, our network will take X, our input data, and based on that data, give its best guess at a prediction of what our y output is. It will then analyze how it did, learn from its mistakes, and give a slightly better prediction on the next iteration.  60,000 times, until it has learned by trial-and-error how to take the X input and predict accurately what the y output is.  Then our NN will be ready to take new input of the same kind and predict its output!


#7) Feed Forward Network: Lines 36-40
This is where our NN makes its first guess at a prediction. Think of l0, l1 and l2 as 3 matrices that are the "neurons" that combine with the "synapses" matrices we created in #5 to think, predict, improve, remember.  

This is an exciting part of our deep learning process, so I'm going to teach this same deep learning process from three perspectives: 
1) First, I will tell you a spellbinding fairy tale of feed forward;  2) Second, I will draw stunningly beautiful pictures of feed forward; and 3) I will open up the hood and show you the matrix multiplication that is the engine of feed forward.

I'm Irish.  Who doesn't love a good story?  My mentor Adam Koenig suggested the following analogy, which I have ridiculously exaggerated into a fairy tale, because **I am an *artiste*:**

##*The Princess and The Castle*, Chapter 1 of 4: The Feed Forward Network

Imagine yourself as a neural network.  You happen to be a neural network with a valid driver's license, and you're the type of neural network that enjoys driving cars and loves romance.  You eagerly wish to meet The Love Of Your Life.  Well, Miracle of Miracles, you have just found out that if you drive to a certain castle, your Prince/Princess Charming is waiting to meet you for the first time, sweep you off your feet, and live happily ever after.  Joy!

Needless to say, you're fairly motivated to find the princess's castle.  After all, the princess is our y vector: she is The Truth, the future we're trying to predict.  Unfortunately, finding her castle is going to require some patience and persistence, because you have already attempted to drive to the castle thousands of times and you keep getting lost.  

But there is fabulous news:  you know that every day, with every driving trip, you're getting closer-and-closer to the Princess (who is of course The Truth, the y we strive to predict).  The bad news is, alas, each time that you don't arrive at her castle, POOF! you wake up the next morning back at your house (Matrix X, Layer 0) and have to start again from there.  It looks a bit like this:

![alt text](https://lh3.googleusercontent.com/NvKKJXSOSSYZSn0N7swTG5fIigN519BM19xQtFzmnHVPdiaF-TDutS3oRfwTtA9165otPBJplP9nnvbS2g7Ah1FDmJtOAWr6Nk_z3lM_CPkTJoMjs01VN7kGU_B3SaLNQqeo0Ka5r9Jm3B8SKsyzA0UOmWRd1k7MwOUvuIC06Mm-l5b-YiRmypMVpL18X6Y2g-MEhm5ciw_QUHYrYShY0hhszJHkx3UHIIsRlTAB8WnLZG8PNvVOmB_qxxjadYHzufsJU-S62w36nnDA1fey_vHionTEx8v8eTy_qo98rfw24yY6Wlk4DjKSyBYHAcBdz7RJgtTWNp8uiQXDPnMZcLePtdIDQwcg1KJlHlgU9baiNDYH4bAlqETs9IeEpIYUzI3tL2EKXtzcbzXrARi3Lbkb7lEDqbYlk66jWdWjNoL0JCz3RCkyo5KhDzLA-tA2wIsjHZjxOhnfn0kck7hUMQIUIPKl7oBeSW6kFEzcpCcC3d9yEa3TXAw1Rh5bSzyv-V8sEqVUYlh1Con0JP0OX7QypP8-fjZ7-HhmLQzCqXk4uMkAIeGnbsw2JnhILI45NJrEvgSwmlqWfTfukdv1604XWdUMrvTKPTlh6DnqouLwuTrIW2pKmwFdlQBq-hB6YoU2pX0B1fMomw2l7A6WaDvbmSz9G6Bo=w970-h768-no)

Fortunately, this story has a happy ending, but you will have to keep attempting to arrive at her house and correcting your route for, say, another 58,000 trials (and errors!) before you fall into her arms.  Don't worry, it'll be worth it.

Let's take a look at one of those drives (an iteration) from your house, X to her castle, y:  each trip you make is the feed forward pass of lines 36-40, and each day you arrive at a new place, only to discover it is NOT the castle.  Drat!  Of course, you want to figure out how to arrive a little closer to your beloved on the next drive.  I'll explain that in Steps 8 and 10 below.  Stay tuned for *The Princess and the Castle, Chapter 2: Learning from Your Errors.* 

Above is an analogy for the feed forward process.  Below is a view under-the-hood of the math that makes it happen.  

We're going to walk through one example of one weight only, out of the 16.  The weight we will study is the tippy-top line of the 12 lines of syn0, between the top neurons of l0 and l1, and we will call it syn0,1 (technically it is syn0(1,1) for "row 1 column 1 of the matrix syn0").  Here's what it looks like:
![alt text](https://lh3.googleusercontent.com/CDts4VOB4z3Quo58n7Sz9MvVLFZk-UHniYTX4r4CIOP0e0xRjoy9Un67N73KX492-gq91XB5KSMjDY3WBKPNZ5yYrLA6NURJ_crOYKpcQUILfQNavN8_WQDpBWxkkJ3g5mWNDhstKQ1IiJ6Rk_6SifQFk-qiRKH34usHsWj8-HXuBDEiwKcz8Aq-s0Bf3JclMJZhqbpdHllSi-ZH2qFc9UZx_1NVHSw02h33j3Pncnl22AIvw-84YGRk8ALriyYXHLYEiqK_Vuk19bjEQxFSBvOyOeUshMmlStpiFVR452QsWaKahKYF7xemmqT_Ret7ck0uOCFfcN0VLN9UxoJwT1k44C1ugs9HJ5xbhcMXlZcXjA8d8M2L3DVpJqxdLMhEsxHqDv34e3SSb9MUMGyMvrwRQgAAbNPILRBlaJQw_eT_Gr6BBcCyoqWbow9LNorEyv0gHYo-qj3q4naltOM_06lsQ-Cb5369RXff_TT-QjSg59oL45ss-mrD2FtG92n_3Hi9T_XXRAzLWMxeHydPEcPAQQKeudWQ_9RxwhW_tBATGbRpNDA3ELb0gUw3fCy3v7YWLZs26IITnkWbwLHAlN7hYKiQZUaZwgzxMgDX9r2LcV_1MYeqcF4Pspx4NEhXjExK3QzCWx1ZZ9ZLQwuRxOrz-WWAp7MS=w973-h722-no)

Why are the circles representing the neurons of l2 and l1 divided down the middle?  The left-hand side (depicted in the variables I use with an "LH") is the value used as input for the sigmoid function, and the right-hand side is the output of the sigmoid function: l1 or l2.  In this context, recall that the sigmoid is simply taking the product of the previous layer times the previous synapse and "squishing" it down to a value between 0 and 1.

OK, here is Feed Forward using one of our training examples, row 3 of Matrix X, aka l0: [1,0,1].  Here's what it looks like:

![alt text](https://lh3.googleusercontent.com/_1cLOu2Rxc7xchgANum00LLlcGEectc2ffdXbpB4VEZX2cT_8czgF8PebXO7R_9WNj3TBDB6AearSjfszEednS_9GvXQ1RCmfPG9cdOQFbkDsefjx2MPgrgCfuzBPLLcbPEv9ZXl1fjv9_MBGzY5KOtlo0mW2iy1xNcQTkcWmUUyiN5MAKhRVeolOEQ8s-Ct7J0Kgd2YYspn4u5D_EyFuLdlfNTCGPyEdm8YQP5FDPxNNwtzm9Yv_LIFIlpFtF5aOZgzTWoqsG08Tr6MDoXNpLVT6Yk8xKEpLJsZiejHA4e9dzHExEJKPWO2dtUnzDSBBlMs-Vvfw2ECHi6Q5M3WLkuA2aAM72e2pKDgtZnM1b85Zktx54jF-tktJAJg6LAB6YJOsUBQI67tW5O3tt-C3Gu9Td5K0bwtChbvXfFeEMIU5SLp1yKGq-bybnVsUWZ0JegxCZeFjZefvN-2qCcYjJSn92BKS2qHRen0prilTfs9kcNhNAdwaPWM5vh-IOB1085hQHNSdg1Qnk0sZk3djKYnplktOvbNGwXoY0x7-8yfBzk00-awZb-Sqt7ngb_8IvXyRAMo7iS8KgnIKuGOp-yqCvCxt3Zc2hORrWfY06Omz1TZXQmWlYtIyJPGrU6yyqdvxYABz4VlCj-AxXke0WeZMgNgfmdU=w973-h722-no)

Here's what Feed Forward looks like in pseudo-code, and you can follow the forward, left-to-right process in the diagram above. (Note that I add "LH," meaning left-hand, to specify the left half of the circle we are using in our diagrams to depict a neuron in a given layer; so, "l1LH" means, "the left half of the circle representing l1," which means, "before the product has been passed through the nonlin() function.")
```
l0 x syn0 (don't forget to add the products of the other l0 values x the other syn0 values)  = l1_LH ->  

nonlin(l1_LH) = l1  ->  l1 x syn1 (again, add the products of the other syn1 multiplications) = l2_LH ->  

nonlin(l2_LH) = l2  ->  y-l2 = l2_error
```
Again: the default branch of nonlin() is the part of the Sigmoid function that renders any number as a value between 0 and 1.  It is the code, `return 1/(1+np.exp(-x))`.  It does not take slope.  But in back prop, we're going to use the *other* part of the Sigmoid function, the part that does take slope, i.e., `return x*(1-x)`, because you will notice that lines 57 and 71 specifically request the Sigmoid to take slope with the argument `deriv=True`.

I artificially assigned syn0,1 a beginning value of 2.  2 is just a value we made up; it could be any number, but hey--ya gotta start somewhere, right?  Let's walk through the math of Feed Forward slowly: 
l0 x syn0 = l1_LH, so in our example 1 x 2 = 2, but don't forget we have to add the other two products of l0 x the corresponding weights of syn0.  In our example, l0,2 x syn0,2 = 0 x something = 0, so it doesn't matter.  But l0,3 x syn0,3 *does* matter because l0,3=1, so let's just make up a simple, convenient value for syn0,3 of 3.  Therefore, l0,3 x syn0,3 = 1 x 3 = 3.  Our product of l0,1 x syn0,1 + our product of l0,3 x syn0,3 = 2+3 = 5, and 5 is l1_LH.  Next, we have to run l1_LH through our nonlin() function to create a probability between 0 and 1.  Nonlin(l1_LH) uses the code, `return 1/(1+np.exp(-x))`, so in our example that would be: 1/(1+(2.718^-5)) = 0.99, so l1 (the RH side of the l1 node) is 0.99.

So, what just happened above?  The computer used some fancy code, `return 1/(1+np.exp(-x))`, to do what we could do manually with our eyeballs--it told us the corresponding y value of x=5 on the sigmoid curve as pictured in this diagram:

![alt text](https://iamtrask.github.io/img/sigmoid.png)

(taken with gratitude from: 
[Andrew Trask](https://iamtrask.github.io/2015/07/12/basic-python-network/))

Notice that, at 5 on the X axis, the corresponding point on the blue, curved line is about 0.99 on the y axis.  Our code converted 5 into a statistical probability between 0 and 1.  It's helpful to visualize this on the diagram so you know there's no abstract hocus-pocus going on here.  The computer did what we did: it used math to "eyeball what 5 on the X axis would be on the Y axis of our diagram."  Nothing more.

Now, rinse-and-repeat: we'll multiply our l1 value by our syn1,1 value.  l1 x syn1 = l2_LH which in our example would be 0.99 x 3 (3 is a made-up number we just assigned because hey--ya gotta start somewhere) = 2.97.  But again, don't forget that to 2.97 we have to add all the products of all the other l1 neurons times their corresponding syn1 weights, so for simplicity's sake we'll just pretend those all added up to -2.  So you end up with -2 + 2.97 = 0.97, which is l2_LH.  Next we run l2_LH through our fabulous nonlin() function, which would be: 1/(1+2.718^-0.97) = ~0.7, which is l2, which is our very first prediction of what the truth, y, might be!  Congratulations!  You just completed your first forward feed!

Now, let's assemble all our variables in one place, for clarity:
```
l0=1
syn0,1=2
l1_LH=5
l1=0.99
syn1,1=3
l2_LH=0.97
l2=~0.7
y=1 (this is value 3 of vector y, which corresponds to training example #3, row 3 of l0)
l2_error = y-l2 = 1-0.7 = 0.3
```
OK, above was forward feed.  Our next goal is to find Step 8: by how much did we miss our target truth y, the princess' castle?  Well, turns out we missed by 0.3.  But any distance between us and our beloved princess is too much, so how can we reduce that l2_error of 0.3 to put us finally in her arms?  The back propagation step below will soon teach us the exact amount we want to increase/decrease syn0,1 in order to decrease l2_error and firmly embrace our beloved.
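The hand arithmetic of this walkthrough can be replayed in a few lines of Python.  Treat this as a sketch under the same made-up assumptions as the text: the weights 2 and 3 are invented, the middle weight in `syn0_col1` is an arbitrary placeholder (it gets multiplied by 0), and `other_products` is a hypothetical stand-in for the three l1 x syn1 products we pretended add up to -2.  The computer carries more decimal places than our hand-rounded figures, so expect small differences in the last digits.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

l0_row = np.array([1, 0, 1])            # training example 3, row 3 of l0
syn0_col1 = np.array([2.0, 0.0, 3.0])   # made-up weights; the 0.0 is a placeholder

l1_LH = l0_row.dot(syn0_col1)           # 1*2 + 0*0 + 1*3 = 5
l1 = sigmoid(l1_LH)                     # right-hand side of the l1 neuron

syn1_1 = 3.0                            # made-up weight
other_products = -2.0                   # pretend sum of the other l1 x syn1 products
l2_LH = l1 * syn1_1 + other_products
l2 = sigmoid(l2_LH)                     # our prediction

y = 1                                   # target for training example 3
l2_error = y - l2

print(l1_LH, l1, l2_LH, l2, l2_error)
```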

This is where matrix multiplication becomes key (for those of you who are rookies to matrix multiplication and linear algebra, Grant Sanderson teaches it brilliantly, with lovely graphics, in [14 YouTube videos](https://www.youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab).  Watch those first, then return here).

First, on line 39 we multiply the 4x3 l0 and the 3x4 Syn0 to create (hidden layer) l1, a 4x4 matrix.  
```
X (aka,l0):      syn0:
[0 0 1]          [ 5.67534974  5.1809666  -6.96032933 -4.91055814]
[0 1 1]     X    [-3.94870047 -6.6558582  -7.25683472 -4.61369466]   =
[1 0 1]          [ 1.77928043 -2.53186624  2.87700966  7.11388595]
[1 1 1]
(multiply row 1 of l0 by column 1 of syn0, then row 1 by column 2, etc.)

Product of l0 x syn0:
[  1.77928043  -2.53186624   2.87700966   7.11388595]
[ -2.16942003  -9.18772444  -4.37982506   2.50019128]
[  7.45463017   2.64910035  -4.08331967   2.20332781]
[  3.5059297   -4.00675784 -11.34015439  -2.41036686]

Now we pass it through the "nonlin()" function in line 39, which is a fancy math expression you don't need to understand: "1/(1 + 2.718281^-x)". Just trust me that it gives us values between 0 and 1, and I'll explain it more in Step 10:

This is layer 1, the hidden layer of our neural network:
l1:  
[8.55607991e-01 7.36542129e-02 9.46698170e-01 9.99186935e-01]
[1.02530388e-01 1.02276900e-04 1.23725522e-02 9.24155229e-01]
[9.99421579e-01 9.33955520e-01 1.65721667e-02 9.00547951e-01]
[9.70856017e-01 1.78672362e-02 1.18857985e-05 8.23855801e-02]

Note that each value is written in exponential notation, so the first value is the same as 0.855607991.  And don't make the same mistake I made at first: that hyphen in the "e-01" does NOT mean your number is negative.  If this number were negative, it would be written as, "-8.55607991e-01" with the negative sign out front. Rather, the sign after the e tells you which way to move your decimal point: a minus moves it to the left, a plus moves it to the right.  For example, "8.55607991e+01" is 85.5607991.
```
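You don't have to take the numbers above on faith.  Here is a sketch that redoes the multiplication with the syn0 values copied from the printout above (so they are rounded to 8 decimal places) and then squashes the product with the same formula nonlin() uses.  The `set_printoptions` call is my own addition; it asks NumPy to print plain decimals instead of exponential notation.

```python
import numpy as np

def nonlin(x):
    return 1 / (1 + np.exp(-x))

l0 = np.array([[0, 0, 1],
               [0, 1, 1],
               [1, 0, 1],
               [1, 1, 1]])

# syn0 values copied from the printout above.
syn0 = np.array([[ 5.67534974,  5.1809666,  -6.96032933, -4.91055814],
                 [-3.94870047, -6.6558582,  -7.25683472, -4.61369466],
                 [ 1.77928043, -2.53186624,  2.87700966,  7.11388595]])

product = np.dot(l0, syn0)   # (4,3) x (3,4) -> (4,4)
l1 = nonlin(product)         # every value squashed between 0 and 1

np.set_printoptions(suppress=True, precision=8)  # plain decimals, no e-notation
print(product)
print(l1)
print(float("8.55607991e-01"))   # the same number written without the exponent
```

Because the first row of l0 is [0, 0, 1], the first row of the product is simply the third row of syn0.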
If you find yourself feeling faint at the mere sight of matrix multiplication, fear not.  We're going to start simple, and break down our multiplication into tiny pieces, so you can get a feel for how this works.  Let's take one, single training example from our input.  Until now, we've been talking about l0 as a 4x3 matrix.  For a clearer, simpler example we will take only the first row, `[0,0,1]`, because its two zeros give us the simplest demonstration.  

So our single example is [0,0,1].  In other words, a 1x3 matrix.  We're going to multiply that by syn0, which would still be a 3x4 matrix, and our new l1 would be a 1x4 matrix.  Here's how that simplified process can be visualized:
```
(multiply row 1 of l0 by column 1 of syn0, then row 1 by column 2, etc.)

row 1 of l0:     col 1 of syn0:
                 [ 5.67534974]
[0 0 1]    X     [-3.94870047]   =   (0 x 5.67534974) + (0 x -3.94870047) + (1 x 1.77928043)   =   1.77928043
                 [ 1.77928043]

Do the same with columns 2, 3, and 4 of syn0 and you get the 1x4 product for row 1:
[  1.77928043  -2.53186624   2.87700966   7.11388595]

Repeat for rows 2, 3, and 4 of l0, then pass the full 4x4 product through "nonlin()" and you get the l1 values
l1:  
[8.55607991e-01 7.36542129e-02 9.46698170e-01 9.99186935e-01]
[1.02530388e-01 1.02276900e-04 1.23725522e-02 9.24155229e-01]
[9.99421579e-01 9.33955520e-01 1.65721667e-02 9.00547951e-01]
[9.70856017e-01 1.78672362e-02 1.18857985e-05 8.23855801e-02]
```
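Here is the same single-example multiplication as a short NumPy sketch, in case you want to run it.  Rows 2 and 3 of `syn0` are copied from the printout above; row 1 is something I back-calculated from the printed product, so treat it as approximate.  (It doesn't affect this example anyway, because `[0 0 1]` multiplies row 1 by zero.)

```python
import numpy as np

def nonlin(x):
    """Sigmoid: maps each value into the range between 0 and 1."""
    return 1 / (1 + np.exp(-x))

# One training example: row 1 of l0, as a 1x3 matrix.
l0_row = np.array([[0, 0, 1]])

# syn0 is 3x4. Row 1 is back-calculated (an assumption); rows 2-3 are from the printout.
syn0 = np.array([
    [ 5.67534974,  5.18096659, -6.96032933, -4.91055814],
    [-3.94870047, -6.6558582,  -7.25683472, -4.61369466],
    [ 1.77928043, -2.53186624,  2.87700966,  7.11388595],
])

# Row 1 of l0 times column 1 of syn0, written out term by term.
first_term = 0 * 5.67534974 + 0 * -3.94870047 + 1 * 1.77928043
print(first_term)        # 1.77928043

# np.dot repeats that for all four columns at once: (1x3) . (3x4) -> (1x4).
product = np.dot(l0_row, syn0)
l1_row = nonlin(product)
print(product.shape)     # (1, 4)
print(l1_row)
```

The last print should match the first row of l1 shown above.
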
Note that, on line 39, we take the Sigmoid function of the dot product of l0 and syn0, because we need l1 to have values between 0 and 1, hence: `l1=nonlin(np.dot(l0,syn0))`

It is on line 39 that we see ***Big Advantage #1*** of the ***Four Big Advantages of the Sigmoid Function.***  When we pass the dot product matrix of l0 and syn0 through the `nonlin()` function, the sigmoid converts each value in the matrix into a statistical probability between 0 and 1.  This means, "the closer the value is to 1, the more certainty there is that such-and-such is the case, whereas the closer the value is to 0, the more certainty that such-and-such is NOT the case."

You may be thinking, "Such-and-such?  Dave, that's a bit abstract--what do you mean?"  Well, right now, our NN is dealing with a pretty abstract problem, i.e., binary 1's and 0's. Because our simple little toy network is binary, it can only handle on/off or yes/no questions.  But for clarity, let's give those 1's and 0's some meaning:

Imagine that our problem is one of image recognition, e.g., "Is there a fish in this image?"  "1" means Yes, and "0" means No. Within this context, any number we pass through the sigmoid function yields a statistical probability.  For example, an output of 0.999 is the equivalent of the network saying "It is highly probable that there is a fish in this picture". A number of 0.001 is the equivalent of "There is definitely not a fish in this picture".

"OK," you might be thinking, "So I can eyeball the values in l1 above and deduce whether they are predicting a probable 1 or a probable 0.  So, what?"  Well, it doesn't matter in lines 39 and 40, but it matters a *ton* when we hit line 57 and beyond.  Stay tuned.

Here's where these values appear in our picture of neurons and synapses:

![alt text](https://lh3.googleusercontent.com/QtRwLfpxhGoXP6_bwxjlKdn0CSirkxyfuR1EKOaG94HSMmR3sjCwX5ueB3SPR6xvq6L997dTjvHKH67pnFGFXIeesj9iMCBGT46RINqOr1OtH0MWdqcvGn6K9NcIMv-ahzl1Yy28FlXF2qqnZ4WZ-rFRu4BDGha1pPgxKbcGaFsoN9tQ5LoS5r66D2Jho5ejjXXUMd3M45OFvoTIEtIk2l2LXdxF7X85yXST5iDiXhMBajh4cp65b7UrOd4qlSEZlk-t7MAE1ZleppKEzTGOFNoXoVfRJQZf1F7KvvlNqldydwn93FWLBB1kwMrLSFZkqvpX8mmc5aQxpJowev6FXJuG2HDSMNhuGtLpWBMJCq3eZM126bo_iAzxvzxakl6BksyUNR9IIqh61dbLYq2EHcxDtP4l68OeI5tWRnFBEU-b_jXWz4BdEkCmEXICgYpaHVdQnnuwiy74KmUjKW3Kvac1xaqb1P4xNl3Mr67COckWvi2sRqPY-scqnpHQ7QCstxLJTopN9QJAXmYO6eC0YFxhg4uFB4nwvep9kWScIJdD-cs0siu6w62VPyx1CEVQk2CbTpt8uCkiRmrUdRhPxV5s29OXhQ8cvS6HtCRaAXcQ8wYayRcRkn8i9dc7LZNQLGqHpO8DAEkQvdVC3b3Gce339fJBgbLj=w970-h710-no)

So, the above diagram gives you a picture of how row 1 of our input X, which is one of four training examples, feeds through its first step in a NN.  But you may recall that we have 4 examples, not one.  The diagram below shows the same training example, [0 0 1] in its natural habitat--stacked on top of the other 3 training examples.  Once you can visualize how the neurons of example 1 (i.e., row 1 of Matrix X) are simply stacked on top of the other 3 rows/training examples of Matrix X, then you have the key to The Kingdom: hopefully, now you could visualize even a typical NN, with thousands or millions of neurons, containing thousands or millions of examples, stacked on top of each other.  The same image above still holds--there's just a whole lot of stacking going on!  These stacks are referred to as a "full batch" configuration, which is a very common model.

![alt text](https://lh3.googleusercontent.com/csNXEukbaM4zJP6fg3AmE6HWsR3QNOK9ZSgY-PJ72a3D_WRqBUEMrVi3S760aHxd0krvHk6_fGM6s3RfWXueb1FZeGv8dw5ClJ3zcBNoPE6QHxQCbdlJM7myAGXuIpUtWqbhnqQ3Y7QFf93klMO3u1OJrhnEMo_ykyNbfrJmBr-Ay7X-uCqV0eUqKZ4FwmurAo1xGAvXZ9uFxNRXWX7w8LWYY9LlAXi7jogTedzApJUcj1fSB4lcGp9FAvBacrFlAqZpMvV8JEv3ad1yczvWCmY2z0CGG8x7krXFEC-OTY04gZNwXfYM5lIzAjPIH12rvUBbBBzEZ4v0sG5rYVlP8xsYEeIFmmEWQDNnjDdftfwc2jcQlq3Xemyh4tOuKg8j-hj4vgQLfamI4TI7KS-cSB9HUZH-VujwGnUTZKtGa150hYtu4AWcUG16h95TirKEXM3RaXPyEQErKCYWRLggb8fblUVqjvIRLz90UxXmgcgoL5cS0e8fHGHJwy8B_udVzK5zAly1-C3fTG0V0T8oUwprUAdaBcWURn2NC1wXX9efF_B3enpLhtk8E40FWcFPk6ae0WXFIPcUBOvSjflw1c98P7Bw_wunp8V8Vxsy-CI_CeK11NX1BHVvmRkcoXYN70LyN4fPd8A3nenmfoEMHEvDPm67esoj=w968-h923-no)

Exactly the same thing happens on line 40, as we take the dot product of 4x4 l1 and 4x1 syn1 and then run that product through the Sigmoid function to produce a 4x1 l2 with each value becoming a statistical probability from 0-1.


```
l1 (4x4):                                                         syn1 (4x1): 
[8.55607991e-01 7.36542129e-02 9.46698170e-01 9.99186935e-01]     [ -9.39072641]
[1.02530388e-01 1.02276900e-04 1.23725522e-02 9.24155229e-01]  X  [  9.43509921]  =
[9.99421579e-01 9.33955520e-01 1.65721667e-02 9.00547951e-01]     [-12.43520534]
[9.70856017e-01 1.78672362e-02 1.18857985e-05 8.23855801e-02]     [ 10.32941201]

Then pass the above 4x1 product through "nonlin()" and you get l2, our prediction:
l2: 
 [1.52039467e-04]
 [9.99781882e-01]
 [9.99801142e-01]
 [3.04170696e-04]
   ```
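To see line 40 in action, here is a small NumPy sketch that feeds the printed l1 and syn1 values through the same two operations.  The numbers are copied from the printout above; the local `nonlin` is my stand-in for the tutorial's function, assuming it is the standard sigmoid.

```python
import numpy as np

def nonlin(x):
    """Sigmoid: maps each value into the range between 0 and 1."""
    return 1 / (1 + np.exp(-x))

l1 = np.array([
    [8.55607991e-01, 7.36542129e-02, 9.46698170e-01, 9.99186935e-01],
    [1.02530388e-01, 1.02276900e-04, 1.23725522e-02, 9.24155229e-01],
    [9.99421579e-01, 9.33955520e-01, 1.65721667e-02, 9.00547951e-01],
    [9.70856017e-01, 1.78672362e-02, 1.18857985e-05, 8.23855801e-02],
])
syn1 = np.array([[-9.39072641], [9.43509921], [-12.43520534], [10.32941201]])

# (4x4) . (4x1) -> (4x1): one prediction per training example.
l2 = nonlin(np.dot(l1, syn1))
print(l2.shape)   # (4, 1)
print(l2)
```

Because the printed inputs are rounded, expect the result to agree with the l2 above only to the first several digits.
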
We have now completed the Feed Forward portion of our network.  If you can visualize what we have done so far, both in terms of the matrices involved, and also as layers of "neurons" and the "synapses" connecting those neurons, then you have done outstanding work.  Bravo to you.









#8) By How Much Did We Miss the Target? Lines 42-45
```
l2_error = y - l2
```
The 4x1 y vector is our goal, our target.  Given our input X of layer 0, we want to produce an output, layer 2, that is as close to the 4 values of y as possible.  Each one of our 60,000 iterations should bring us, by trial-and-error and learning from our mistakes, closer to the 4 target values of y.  So, for each iteration, we take our best prediction so far, the 4x1 vector l2, and subtract it from the 4x1 vector y.  The remainder is l2_error, i.e., how much each value of l2 missed its target value in y.  

In Step 7, you might say that we made our first try, or trial--as in "trial-and-error."  This is our first attempt at a prediction of what y, the truth, might be.  Step 8 is the exciting first step of figuring out our "error," in the learning process of our NN.  Once we know what we missed by, in the steps below we will seek to correct that error and do better next trial.

But before we move on to more steps, let's take a careful look at what we have in l2_error.  There's a lot of important information here.  y has 4 values, l2 has 4 values, and we subtracted each value of l2 from the corresponding value of y.  We ended up with 4 values in l2_error: 4 "misses" of the y target.

So, what? You may ask.  Well, consider: some of those misses were quite small.  Our l2 prediction was pretty close to correct, so when we subtract that l2 value from its corresponding y value, the remainder in l2_error is a small number; a Small Miss.  But there were also Big Misses.  Pay close attention to those big misses, because they will matter a *lot* in the steps below:

```
l2_error = y - l2      (Note that this example is already after 10,000 iterations, so the numbers are relatively small.)

y:        l2:           l2_error:    Relatively speaking...
[0]      [0.00015]     [-0.00015204] a small miss
[1]  -   [0.9998]   =  [ 0.00021812] a Big Miss
[1]      [0.9998]      [ 0.00019886] a small miss
[0]      [0.0003]      [-0.00030417] a Big Miss
```
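The subtraction above is a one-liner in NumPy.  This sketch uses the l2 values printed in Step 7; the variable `biggest_miss_row` is just my own helper for pointing at the Biggest Miss.

```python
import numpy as np

y = np.array([[0], [1], [1], [0]])
l2 = np.array([[1.52039467e-04], [9.99781882e-01], [9.99801142e-01], [3.04170696e-04]])

# Element-by-element subtraction: each prediction is compared with its own target.
l2_error = y - l2
print(l2_error.ravel())

# Rank the misses by size, ignoring sign (row numbers counted from 1).
biggest_miss_row = int(np.argmax(np.abs(l2_error))) + 1
print(biggest_miss_row)   # 4
```
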
Why do we care about the big misses?  Because correcting the bigger misses improves our network's accuracy faster and cheaper than messing around with the small misses.  If it ain't broke, don't fix it.  Throughout our network, we want to focus on the Big Misses and the Low Confidence/Wishy-Washy/Big Slope Ratio numbers (which I'll explain below).  For now, just remember this key point: the Big Misses matter bigtime.


#9) Print Error: Lines 47-51
Line 50 is a clever parlor trick to have the computer print out our l2_error every 10,000 iterations.  The line, `if (j% 10000)==0:` means, "If your iterator is at a number of iterations that, when divided by 10,000, leaves no remainder, then..."  `j%10000` would have a remainder of 0 only six times: at 0 iterations, 10,000, 20,000, and so on to 50,000 (the iterator counts from 0 up to 59,999, so it never actually lands on 60,000).  So this print-out gives us a nice report on the progress of our NN's learning.

The code `+ str(np.mean(np.abs(l2_error))))` simplifies our print-out by taking the absolute value of each of the 4 values, then averaging all 4 into one mean number and printing that.  Here's an example:
```
Avg l2_error after 10,000 iterations: 0.00021829659275871905     (Not bad, huh?  :-)
```
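Both tricks can be tried out in plain Python.  In this sketch I'm assuming the training loop is `for j in range(60000)`, which is how I read "60,000 iterations"; the averaging line is the plain-Python equivalent of `np.mean(np.abs(l2_error))`.

```python
# Which iterations trigger the print-out?
report_points = [j for j in range(60000) if (j % 10000) == 0]
print(report_points)   # [0, 10000, 20000, 30000, 40000, 50000]

# Average size of the four misses, ignoring their signs.
l2_error = [-0.00015204, 0.00021812, 0.00019886, -0.00030417]
avg_error = sum(abs(miss) for miss in l2_error) / len(l2_error)
print("Avg l2_error:", avg_error)
```

Taking the absolute value first matters: without it, the positive and negative misses would partly cancel each other out and the average would look better than it really is.
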




#10) In What DIRECTION is y?  Lines 53-57  
```
l2_delta = l2_error*nonlin(l2,deriv=True)
```
Now we have entered the brain of the beast; here is the secret sauce of Deep Learning.

You might call Step 10, "How much do I tweak my NN before its next iteration and prediction?"  For statistics and calculus buffs, we could simply say, "In line 57 we compute the l2_delta by weighting each l2_error with the derivative of the Sigmoid at l2, which induces large changes in low confidence values and small changes in high confidence values."  Whew!  For the rest of us mere mortals, let's unpack that a bit, by returning to our (spellbinding) fairy tale:

##*The Princess and the Castle, Chapter 2: Learning from Your Errors.*

You may recall, back in Step 7, you made a feed forward pass and drove to l2, your best guess as to where the castle y is located, but you arrived at l2 only to discover you were *closer* to the castle, but not yet arrived.  And you know that soon, you will (POOF!) disappear and wake up the next morning back at your house, l0, and start over.

How can you improve your driving directions to get closer to the love of your life tomorrow?

First, when you arrive at today's destination, you eagerly ask a local knight how far today's arrival place is from the Princess's castle.  This chivalrous knight tells you the distance you are from Castle y (this is the l2_error, or "how much you missed the princess by").  Every day, at the end of each trip, before you disappear for the day, you want to compute **by how much** you want to change today's failed l2 prediction such that tomorrow your l2 prediction will be perfect and you can fall into your beloved's arms.  This is the l2_delta.  It is the amount you want to change today's l2 so that tomorrow, that new-and-improved l2 will hopefully lead to the castle drawbridge!

Note that the l2_delta is NOT the same as l2_error because l2_error only tells you how many miles you are from your princess.  l2_delta also factors in how confident you were in the turn-by-turn directions by which you missed the castle today.  These confidence numbers are the derivatives (forget calculus, you don't need it here, so let's just use the word "slope," as in Good Ol' rise-over-run), or slope of each value of l2.  Think of these slopes as the confidence levels you had in each of the turns in the directions we're using for today's trip.  Some of those turns you were super-confident of.  With other turns, you weren't certain if they were right or not.  

But wait: perhaps this concept of using confidence levels to compute where you want to arrive tomorrow seems a bit abstract and confusing?  Actually, you use confidence levels to navigate all the time--you just aren't conscious of it.  

Think about a time when you got lost.  At first, you started out assuming you were on the right route, or you wouldn't have taken that route in the first place.  You started out confident.  But your trip seems to be taking longer than you expected, and you wonder, "Gee, did I miss a turn?"  Less confident.  Then, as time passes and you should have arrived by now, you become more certain you missed that turn.  Low confidence.  And you know you are not at your destination, but you are not sure where your destination is from your current location.  So now, you stop and ask a nice lady for directions, but she tells you more turns and landmarks than you can remember, so you can only follow her directions part-way before you are again unsure how to proceed.  So you ask directions again, but this time you are closer, so the directions are simpler, and you follow them to the letter and arrive joyfully at your destination.

It's very important to notice a couple of things: 

First, as you learned by trial-and-error, you had varying confidence levels.  A bit later below, I will explain in detail how those confidence levels allow our network to learn by trial-and-error, and then I will explain how our beloved Sigmoid function gives us those all-important confidence levels.

Second, notice that your trip had two segments--the first segment was your route up to where you asked the nice lady for directions (l1), and segment two was your route from the nice lady to l2, the place you thought was your destination, but you had to ask how far you were from your true destination.  At first, you were sure you were on the correct route. Then, you wondered if you missed a turn. Then you were certain you missed a turn, and stopped to ask directions before proceeding further.  Those 2 segments of your daily trip look like the dog legs pictured here, and each day with your improvements, the dog leg gets a bit straighter.  It's like the process you go through as our romantic, driving, neural network: 

![alt text](https://lh3.googleusercontent.com/F1eSirIXLvjK9Dxs_jYgG_7jo9BYfiKxtYgcXqic8zWZhNVP6POfCTzeqg9tbGO-7vIfw-TenaJ0Rb4OK_7FGL5ilwtx8osTl_LyFhpWixZSVOjmAfzKJuicGJ2CXPt4u-Da7JmR7LEpqwjXFIs84kpQc8rj-NHOHGuwaoKxiB2un3FgMDp0JqLfP73g_Gc7j4GxpmUXOVGHSJY3YQNjDiBeoey-GzEkAIdZaV4ygSb-sIV7gSNfnkMGp0bgy3HeIn_sVGadqjviQswdEHbleQOaKy6lMr2FhRdYQoRZrKwMTRX3ziDaDylQtCVOIygLAzA0ezkhr9V4Aq-qf1kBe_679XfFsRuvE8zjLXnM0D-sqBQAL9fNao9-8gEEHu3Z4tLURbh8ve_IiGYFAOLS4Sedu3jHwexmmFfPs8Zd6UUVEWhiGBhdqS-mpiw8Ptg4t9qsJnH__h4bxhoyziri6MWzR_qkBsTCrKRSnhag0X6qeKQGfcnk_QPQusVnN6JNhIeeJtwLcLBBPEn_sWgBxCMAzeGwniphYXyYEc2_56Nm53u9Glb6bh1oZmPrRLotBqzEAZBS6TexIKkKNfX7WVU83AAWOmUAJBxW74hitwLqski032aBVWK9rUHNekhVIymCow66_09qT8i5okt9ZIT1l8shph6i=w970-h744-no)

Every day, on every trip, you (our 3-layer network) start out with a set of directions to the princess (syn0).  When those directions end, you stop at l1 and ask for further directions (syn1).  These take you to your final destination of the day, your prediction of where you *thought* the princess was.  But no castle stands before you.  So you ask the knight, "How far to the castle? (l2_error)" And because you are a calculus genius, you can multiply the l2_error by how confident you were in each turn of your directions (the derivative, or slope, of l2) and come up with where you want to arrive tomorrow (your l2_delta).  

So you must compute 3 facts before you can learn from them and re-attempt your quest for the princess's castle.  You must know:
1) Your current location (l2);
2) How far you are from the princess's castle (l2_error); and
3) What changes you need to make in your set of turns to increase your certainty that your next driving attempt will get you closer to the castle (l2_delta).

Once you possess these three facts, then you can compute the required changes in the navigation turns (i.e., the weights of the synapses).   This is line 75, the change to the weights of syn1, which is the product of l1 and l2_delta.  The changes in syn1 will help to realize the changes you seek in your next l2 (i.e., to end up closer to that darn castle).
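
Here is a rough sketch of those three facts turning into a change in syn1, using the 10,000th-iteration values printed earlier.  One loud assumption: I'm taking the update to be "transpose of l1, dot l2_delta, added to syn1", which is the form used in Andrew Trask's original network; line 75 itself isn't shown in this section, so check it against your own listing.

```python
import numpy as np

l1 = np.array([
    [8.55607991e-01, 7.36542129e-02, 9.46698170e-01, 9.99186935e-01],
    [1.02530388e-01, 1.02276900e-04, 1.23725522e-02, 9.24155229e-01],
    [9.99421579e-01, 9.33955520e-01, 1.65721667e-02, 9.00547951e-01],
    [9.70856017e-01, 1.78672362e-02, 1.18857985e-05, 8.23855801e-02],
])
syn1 = np.array([[-9.39072641], [9.43509921], [-12.43520534], [10.32941201]])
y = np.array([[0], [1], [1], [0]])

# Fact 1: where we ended up today.
l2 = np.array([[1.52039467e-04], [9.99781882e-01], [9.99801142e-01], [3.04170696e-04]])
# Fact 2: how far we are from the castle.
l2_error = y - l2
# Fact 3: each miss scaled by the sigmoid slope at that prediction.
l2_delta = l2_error * l2 * (1 - l2)

# ASSUMED update form: (4x4 transposed) . (4x1) -> (4x1), one nudge per weight in syn1.
syn1_update = np.dot(l1.T, l2_delta)
syn1_new = syn1 + syn1_update
print(syn1_update.ravel())
```

Notice how small the nudges are this late in training: the network is already close to the castle, so the directions barely change.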

Now, of course the smartypants readers will notice that I have only told you how to improve tomorrow's directions for Part 2 of our journey, from l1 to l2.  Ahh, there's always a stickler, isn't there?  Well, we're going to learn how to improve the directions (syn0) of Part 1 of our journey (l0 to l1) in Step 11 of our process.  Right now, I want to do a deep dive into using confidence levels to compute the l2_delta.  Wake up now, because below is some fascinating and important stuff:

Here is where you will see the beauty of the Sigmoid function in four *magical* steps.  To me, the genius of our neural network's ability to learn is found largely in these four steps.  We saw how, in line 39, Step 1 was when the `nonlin()` transformed each value of our matrix into a statistical probability (i.e., a prediction) between 0 and 1.  But I have yet to mention that that statistical probability is ***also*** a simple measure of confidence--numbers approaching 1 suggest high confidence that the NN's (neural network's) prediction of "1" is correct.  Numbers approaching 0 suggest high confidence that the NN's prediction of "0" is correct.  This is ***Big Advantage #2*** of the ***Four Big Advantages of the Sigmoid Function:***  If our NN's prediction, the four values of l2, is high-confidence and high-accuracy, that's an outstanding prediction, and we want to leave the syn0 and syn1 weights that produced that outstanding prediction alone.  We don't want to mess with what's working; we want to fix what's NOT working.

Let me explain the above from a different angle: right now, our NN is dealing with a pretty abstract problem, i.e., 1's and 0's.  Let's give those 1's and 0's meaning: imagine that our problem is one of image recognition, e.g., "Is there a fish in this image?"  "1" means Yes, and "0" means No. Within this context, the output is simultaneously a prediction and a confidence measure.  An output of 0.999 is the equivalent of the network saying "I am extremely confident that there is a fish in this picture".  A number of 0.001 is the equivalent of "There is definitely not a fish in this picture".  Low confidence numbers are in the vicinity of 0.5.  For example, a value of 0.4 would be similar to "I don't think there is a fish in this picture, but I'm not sure."

That's why we focus our attention on the numbers in the middle:  all numbers approaching 0.5 in the middle are wishy-washy, and lacking confidence.  So, how can we tweak our NN to produce four l2 values that are both high-confidence and high-accuracy?  

The key lies in the values, or ***weights*** of syn0 and syn1.  As I mentioned above, syn0 and syn1 are the center, the absolute *brains* of our neural network.  We are going to take the four values of the l2_error and perform beautiful, elegant math on them to produce an l2_delta.  l2_delta means, basically, "the change we want to see in the output of the network (l2) so that it better resembles y (the truth)."  In other words, l2_delta is the change you want to see in l2 in the next feed-forward pass in the next iteration.

***Get ready for beauty.***

Here is ***Big Advantage #3*** of the ***Four Big Advantages of the Sigmoid Function:*** Do you remember that diagram of the beautiful S-curve of the Sigmoid function that I showed you above?  Well, lo-and-behold, each of the 4 probability/confidence values of l2 lies somewhere on the S curve of the sigmoid graph (pictured again below, but this time with more detail).  If we search for that number (e.g. 0.9) on the Y axis of the graph below, we can see that it corresponds with a point on the S curve roughly where you see the green dot: ![alt text](https://iamtrask.github.io/img/sigmoid-deriv-2.png)
(taken with gratitude from: 
[Andrew Trask](https://iamtrask.github.io/2015/07/12/basic-python-network/))

Did you notice not only the green dot but also the green line through the dot?  That green line is meant to represent the slope of the *tangent* to the curve at the exact point where that dot is.  You don't need to know calculus to take the slope of a curve at a particular point--the computer will do that for you.  But you do have to notice that the S curve above has very shallow slope at both the upper extreme (near 1) and the lower extreme (near 0).  Does that sound familiar?  Wonder of wonders, a shallow slope on the sigmoid curve coincides with high confidence in our predictions!  And you also need to know that a shallow slope on the S-curve comes out to a tiny number for slope.  That's good news.  Why?

Because, when we go to update our synapses, we basically want to leave our high confidence weights alone since they already have good accuracy.  To "leave them alone" means that the amount we add to them is a tiny number, near zero, so the values remain virtually unchanged.  And here comes ***Big Advantage #4*** of the ***Four Big Advantages of the Sigmoid Function:*** Miracle-of-miracles, our high-confidence numbers correspond to shallow slope on the S-curve, which corresponds to tiny slope numbers.  Therefore, multiplying the l2_error by these teeny-tiny numbers gives a teeny-tiny l2_delta, and updating syn0 and syn1 with it has exactly the effect we want: the values in our synapses are left virtually unchanged, so our confident, accurate, high-performing values in l2 remain so.

By the same token, our wishy-washy, indecisive, low-accuracy l2 values, which correspond to points in the middle of the S-curve, are the numbers that have the biggest slope on our S-curve.  What I mean is, the values around 0.5 can be traced on the Y axis of our graph above to the middle of the S-curve, where the slope is steepest, and therefore the value of that slope is a big number.  Those big numbers mean a big change when we multiply them by the matching misses in l2_error, as we do in line 57.
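
You can watch confidence turn into slope with a few lines of plain Python.  I'm assuming that `nonlin(x,deriv=True)` computes `x*(1-x)`, the usual shortcut for the sigmoid's slope when x is already a sigmoid output; that formula does reproduce the l2 slopes printed in this tutorial.  The function name `sigmoid_slope` is my own.

```python
def sigmoid_slope(output):
    """Slope of the S-curve at a point, given the sigmoid's OUTPUT at that point."""
    return output * (1 - output)

# Confident predictions sit near 0 or 1; wishy-washy ones sit near 0.5.
for output in (0.001, 0.1, 0.4, 0.5, 0.6, 0.9, 0.999):
    print(f"l2 = {output:<5}  slope = {sigmoid_slope(output):.6f}")
```

The slope peaks at 0.25 when the prediction is exactly 0.5 and shrinks towards zero at both confident ends of the curve.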

In detail now, how do we compute the l2_delta?  

In line 45, we found l2_error, which measures how much our first prediction, l2, missed the target values of y, our truth, our future, and our princess.  You may recall we are particularly interested in the Big Misses.  

In line 57, the first thing we do is use **the second part** of our beloved Sigmoid function, `nonlin(l2,deriv=True)`, to find the slope of each of the 4 values in our l2 prediction.  This slope tells us which predictions were confident, and which were (wait for it...) Wishy-Washy.  This is how we find and fix the weakest links in our network, the low-confidence predictions.  Note that with `deriv=True` the function computes the slope INSTEAD of the sigmoid itself, so the slopes do not get passed through the sigmoid a second time.  We then multiply those 4 slopes, our confidence measures, by the four misses in `l2_error`, and the product of this multiplication is `l2_delta`.  Oh, Lordy!  Line 57 is an important step--did you notice that we are multiplying the Big Misses by the Wishy-Washy Predictions (i.e., the l2 predictions that had big slopes)?  Super-duper key point, as I'll explain below.  But first, let's make sure you can visualize what I just said:
```
Below is the element-by-element multiplication of this line of code, in order of operations: l2_delta = l2_error*nonlin(l2,deriv=True)

Take the l2 predictions, find their slopes, multiply each slope by the matching l2_error, and the product is l2_delta

l2:              l2 slopes:                          l2_error:                      l2_delta:
[0.000152039]    [0.00015202] least Wishy-Washy      [-0.00015204] smallest miss    [-2.31e-08] smallest change
[0.999781882]    [0.00021807] Wishy-Washy         X  [ 0.00021812] Big Miss      =  [ 4.76e-08] WWxBM=Big Change
[0.999801142]    [0.00019882] less Wishy-Washy       [ 0.00019886] small miss       [ 3.95e-08] small change
[0.000304171]    [0.00030408] Wishy-Washy-est        [-0.00030417] Biggest Miss     [-9.25e-08] WWxBM=Biggest Change

(After 10,000 iterations every prediction is already quite confident, so all four slopes are tiny.  "Big" and "small" here are only relative to each other.)
```
Notice that, the Big Misses are (relatively speaking), the biggest numbers in l2_error.  And the Wishy-Washy's have the steepest slope, so they are the biggest numbers in `nonlin(l2,deriv=True)`.  So, when we multiply the Big Misses X The Wishy-Washy's, we are multiplying the biggest numbers by the biggest numbers, which will give us--guess what?--the biggest numbers in our vector, l2_delta.  
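
Here is line 57 as a runnable NumPy sketch, so you can see that it is an element-by-element multiplication and that the slopes are used as-is.  The two-branch `nonlin` below is my reconstruction of the tutorial's function (I'm assuming the `deriv=True` branch returns `x*(1-x)`); the l2 values are the 10,000th-iteration predictions printed in Step 7.

```python
import numpy as np

def nonlin(x, deriv=False):
    if deriv:
        return x * (1 - x)       # slope of the sigmoid, given a sigmoid OUTPUT x
    return 1 / (1 + np.exp(-x))  # the sigmoid itself

y = np.array([[0], [1], [1], [0]])
l2 = np.array([[1.52039467e-04], [9.99781882e-01], [9.99801142e-01], [3.04170696e-04]])

l2_error = y - l2
l2_slopes = nonlin(l2, deriv=True)
l2_delta = l2_error * l2_slopes   # element-by-element, NOT a dot product

for miss, slope, delta in zip(l2_error.ravel(), l2_slopes.ravel(), l2_delta.ravel()):
    print(f"miss {miss: .8f}  x  slope {slope:.8f}  =  change {delta: .3e}")
```

The fourth row has both the biggest miss and the steepest slope, so it ends up with the biggest change.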

Why is that fabulous news?  Think of l2_delta as "the change we want to see in l2 in the next iteration."  The **big** l2_delta values are the **big** changes we want to have in the l2 prediction of the next iteration, and we'll make those happen by making **big** tweaks in the corresponding values of syn1 and syn0 below, so in our next feed-forward pass, when we multiply l0xsyn0, the most heavily tweaked values in syn0 will make a big change to our least accurate, wishy-washy-est values in l1.  So l1 will be improved.  Then, when we multiply the new-and-improved l1 by our better-weighted syn1, that will yield our best l2 prediction yet!  Happy Happy!  Joy Joy!

##Super Key Point: Positive-or-Negative Direction, and Gradient Descent
When we update our synapse matrix by adding to each element (aka, its value or number) an amount built from that large slope number, it's going to give that element a big nudge in the right direction towards confident and accurate prediction.  When I say, "in the right direction," what I mean is that some values of our l2_delta are going to be negative, because the matching l2 prediction came out too high and we want it to drop towards 0.  Other values of our l2_delta are going to be positive, because the matching l2 prediction came out too low and we want the weights in syn0 and syn1 to push it up towards 1.

So it's important to notice that there is a sense of "direction" involved here.  When we talk about "what direction is the target y value from our current l2 value?" we mean, do we need a positive l2_delta value, which adjusts the weights in syn1 so that l2 moves in a positive, larger direction, or a negative l2_delta value, which moves l2 in a negative direction?

Have you heard of gradient descent?  Gradient descent is all about direction.  It is often described as, "a ball dropped in a bowl and rolling in a **back-and-forth direction** until it comes to a rest at the global minimum, the bottom of the bowl."  That's what the slope of the Sigmoid helps us do.  It helps us to find the bottom of the bowl, which minimizes the cost function, which minimizes the error in our predictions.  Think of our 60,000 iterations as the ball rolling in a back-and-forth direction in the bowl until it no longer needs to change direction because it has come to rest at the ideal, perfect bottom of that bowl, where error is at a minimum, and life is good.  Picture it like this:

![alt text](https://mail.google.com/mail/u/0?ui=2&ik=e3f869f938&attid=0.2&permmsgid=msg-a:r4352876950048414936&th=1691255aa52a4d54&view=fimg&sz=s0-l75-ft&attbid=ANGjdJ8FdFORGv3w0jn-Bs8GhlKpg2D1XPRzSF6OaNCqE8hchNYMIAymIg-nK1xCdIsQup54rJmkW2l0qttCzg03Hq8PJOv4KX0ae14e2dkswvLMt74Rzdhwt2ZJQBQ&disp=emb&realattid=ii_jsexnu8o2)
(taken with gratitude from [Grant Sanderson](https://www.youtube.com/watch?v=IHZwWFHWa-w&t=2s&index=3&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi) Ch. 2)

Above is a nice, simple picture of the "rolling ball" of gradient descent.  Line 57 computes the slope at each value of l2.  Of the 3 "bowls" pictured in the diagram above, it is clear that the true global minimum is the deepest bowl on the far left.  But, for simplicity's sake, let's pretend that the bowl in the middle is the global minimum.  So, a steep slope downwards to the right (i.e., a negative slope value, as depicted by the green line in the picture) means our ball will roll a lot to the right (i.e., in the positive direction, the opposite sign of the slope), causing a big, positive adjustment to the corresponding weights of syn1 that will be used in the next iteration to predict l2.  But if, for example, you have a shallow slope downwards to the left, that would mean the prediction value is already accurate and confident, which produces a tiny, positive slope value, so the ball will roll very little to the left (i.e., in the negative direction), thus adjusting by very little the corresponding weight in syn1, so the next iteration's prediction of that value will remain largely unchanged.  This makes sense because the back-and-forth motion of the rolling ball is becoming smaller and smaller before it soon comes to rest at the global minimum, the bottom of the bowl, so there's no need to move much.  It is already close to the ideal.
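
If you'd like to roll the ball yourself, here is a toy sketch in plain Python.  Everything in it is hypothetical and chosen only for illustration: the bowl is the curve error(w) = (w - 3)^2 with its bottom at w = 3, and `step_size` is a made-up number picked large enough to make the ball overshoot and roll back-and-forth.  It is not the tutorial's network, just the same idea with one weight.

```python
def slope(w):
    """Slope of the hypothetical bowl error(w) = (w - 3) ** 2 at position w."""
    return 2 * (w - 3)

w = 0.0            # start on the left wall of the bowl
step_size = 0.9    # hypothetical; large enough that the ball overshoots the bottom
path = [w]
for _ in range(50):
    # Move AGAINST the slope: a negative slope pushes w to the right, and vice versa.
    w = w - step_size * slope(w)
    path.append(w)

print([round(p, 3) for p in path[:4]])   # [0.0, 5.4, 1.08, 4.536]
print(round(path[-1], 3))                # 3.0
```

The first few positions swing from one side of the bottom to the other, each swing smaller than the last, until the ball settles at the bottom of the bowl.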

The above 2-dimensional diagram is a tad oversimplified, so here's a more accurate picture of what gradient descent looks like:
![alt text](https://lh3.googleusercontent.com/jIup60T65tIKtXg0B-Np6jeNXk4TvQTRgBI1btNRZUZ4yy_ZEyL1bN3RwiSjzKNcbyXQN6z7vdV55NzGFxJfUpZXkyU6HTmrScht0rbk5BXGC6eO79LrZuuVpJdHE4fr4QYwvdbO)
(taken with gratitude from [Grant Sanderson](https://www.youtube.com/watch?v=IHZwWFHWa-w&t=2s&index=3&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi) Ch. 2)

Think of the dotted white line in the diagram as the path our gradient descent ball takes toward "The Bottom of the Bowl where Accurate Prediction Lives."  Each dot of that dotted white line represents a tweak, or update of the weights syn0 and syn1, that will take our ball closer-and-closer to the bottom of the bowl, which is the global minimum error, where our NN's predictions are most accurate.

#Tying It All Together with One Picture

Here is a perfect example of why the best teacher is someone who learned yesterday the material you are learning today.  I only discovered the following insight after a year of studying gradient descent, because all the experts take this point so much for granted that they don't bother mentioning it.  Here is a BIG chance for you to learn from my mistakes, and gain a super-key insight that eluded me for over a year, even though it was right under my nose.  Here goes:

Take another look at the red, 3-D "warped bowl" in the diagram above.  Think of it as just that, a warped bowl.  Notice that the warped bowl is not drifting in space.  It is sitting on a "white table top," that white grid.  Do you notice that the only place our warped bowl actually touches the plane of our white table top is at the global minimum, i.e., the bottom of the lowest dip in the bowl?  Perfect.  Now you have all the info you need for this stunning insight:

The grid of our white table top is formed by the axes of syn0 and syn1!  That means that, for example, every value in syn0 is a point on the X axis of our grid, and every value in syn1 is a point on the Y axis of our grid.  When we do a forward feed through our network, the error we arrive at is simply the height of our "ball" above the syn0, syn1 coordinate on that grid.  Once we have the height above the grid coordinates of the plane below, we know *exactly* where our ball is on the surface of our bowl.  And when we compute gradient descent, it tells us the slope of the surface of the bowl at the exact coordinate of (syn0, syn1, and the error from Forward Feed).  Finding the slope of where our ball is tells us the direction our ball should roll to make the quickest descent to the bottom of the bowl where error is 0 because height is 0 because our ball is touching the syn0, syn1 grid plane.
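Here is a toy sketch of that "height above the table top" idea.  I am assuming a pretend network with just one weight per layer (`w0` standing in for syn0 and `w1` for syn1) so the grid really is two-dimensional.  The function name `height`, the squared-error cost, and the starting values are my own assumptions for illustration, not part of our real code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def height(w0, w1, x=1.0, y=1.0):
    # Feed forward through a toy network with ONE weight per layer,
    # then return the error as the "height" above the (w0, w1) table top.
    l1 = sigmoid(x * w0)
    l2 = sigmoid(l1 * w1)
    return 0.5 * (y - l2) ** 2

w0, w1 = 2.0, 3.0   # hypothetical position of the ball on the grid
h = height(w0, w1)

# Slope of the bowl at that spot, estimated with a tiny nudge along each axis.
eps = 1e-6
slope_w0 = (height(w0 + eps, w1) - height(w0 - eps, w1)) / (2 * eps)
slope_w1 = (height(w0, w1 + eps) - height(w0, w1 - eps)) / (2 * eps)

print("height:", h)
print("slope along w0:", slope_w0, " slope along w1:", slope_w1)
```

Both slopes come out negative here, which says "the bowl tilts downhill as either weight grows," so the ball should roll toward larger `w0` and `w1`.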

And that's it.  That is the best geometric representation of what a neural network does, in one picture.  Why doesn't EVERYBODY teach it like this?  If you can SEE what a neural network does in 3-D, it makes it SO much easier to understand why we do all these abstract steps in math and in code.

DCQ: Adam, where does the slope of the sigmoid S-curve fit into this geometric picture?  Is the slope of the S-curve somehow linked to the (gradient descent) slope of the surface of the bowl where our ball is?

Take your time with the above points and make sure you understand them.  Do you see why the sigmoid function is a thing of beauty?  It takes any random number in one of our matrices and:

1) turns it into a statistical probability, 

2) which is also a confidence level, 

3) which turns into a big-or-small tweak of our synapses, and 

4) that tweak is always in the direction of greater confidence and accuracy.  

The sigmoid function is the miracle by which mere numbers in this matrix can "learn." A single number, along with its many colleagues in a matrix, can represent probability and confidence, which allows a matrix to "learn" by trial-and-error.  That is a thing of beauty, but there is more elegance to come!  As you learn other networks, you will see there are many functions that "learn" in even more beautiful ways than the sigmoid we have studied here.
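The four points above can be sketched in a few lines.  This is only an illustration under my own assumptions: the `raw` values and the matching `y` answers are invented, and `tweak` is just my label for "error times slope."

```python
import numpy as np

def nonlin(x, deriv=False):
    if deriv:
        # Expects x to ALREADY be a sigmoid output (a value between 0 and 1).
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

raw = np.array([-4.0, -0.5, 0.5, 4.0])   # hypothetical pre-sigmoid values
y = np.array([0.0, 0.0, 1.0, 1.0])       # hypothetical true answers

prediction = nonlin(raw)                           # 1) a probability between 0 and 1
confidence_slope = nonlin(prediction, deriv=True)  # 2) shallow slope = high confidence
error = y - prediction
tweak = error * confidence_slope                   # 3) big-or-small, 4) signed toward y

for r, p, s, t in zip(raw, prediction, confidence_slope, tweak):
    print(f"raw={r:+.1f}  prediction={p:.3f}  slope={s:.3f}  tweak={t:+.4f}")
```

The confident predictions (raw values of -4 and +4) get a much smaller tweak than the wishy-washy ones near the middle of the S-curve.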


#Draft: The Big Picture on Back Prop
Let's bust two myths, shall we?  
Myth #1: Back Propagation is Super-Hard

False.  Back propagation requires patience and persistence.  If you read a text on back prop once and throw up your hands because you understand nothing, you're done.  But if you watch Grant Sanderson's [video 3 on back prop](https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi) 5 times at reduced speed, then watch [video 4 on the math of back prop](https://www.youtube.com/watch?v=tIeHLnjs5U8&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&t=320s&index=5) five times, you'll be well on your way.  Grit.

Myth #2: To understand back prop, you need calculus

Many people post online that you need multivariable calculus in order to understand AI.  Not true.  Andrew Trask says that, even if you took three semesters of college-level calc, only a tiny subset of that material would be useful for learning back propagation: the Chain Rule.  But, even if you took those 3 semesters of college calculus, the chain rule is often presented very differently in college from the way you would use it in back propagation.  So, bottom line?  Don't make the same mistake I did: I panicked every time I heard the word, "derivative," and it was self-defeating.  You must fight that inner voice saying, "I don't have the background to master this."  There are workarounds--simple ways to do calculus without calling it "calculus."  But there is no workaround for grit.  

Here is my favorite saying: "There is the task, and there is the drama ***about*** the task."  
Leave your drama here now.  Please give me your grit, and your trust.  Let's learn back prop.

Here is the big picture.  What is the purpose of back prop?  To find the best way to tweak our network so that it gives a better prediction in the next iteration.  Let's break that down:

We have control over 16 variables in our network: 12 variables in the 3x4 matrix syn0, and 4 variables in the 4x1 vector syn1.  Look at this diagram and understand that every line (aka, "edge" or synapse) you see represents one variable, containing one number, aka one weight.  
[[[insert diagram of 3 layer network with 16 edges, label edges as "weights."  ]]]
These 16 weights are all we can control.  l0, our input, is fixed and unchanging.  l1 is determined *exclusively* by the weights in syn0 by which you multiply the fixed values of l0.  And l2 is determined *exclusively* by the weights in syn1 by which you multiplied l1.  Those 16 lines pictured above, the synapses, the weights, are the only numbers you can tweak to achieve your goal, which is an l2_error that gets smaller and smaller until l2 almost equals y (in other words, the l2_error ball has come to rest at the bottom of the bowl and your gradient descent is complete).  l2_error is what we call your "cost," and back propagation seeks to minimize your cost.
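If you want to count those 16 weights for yourself, here is a minimal sketch.  Note my assumptions: the full matrix `X` below is assumed (this section only shows that its row 3 is [1,0,1]), and the random seed is an arbitrary choice.

```python
import numpy as np

def nonlin(x):
    return 1 / (1 + np.exp(-x))

np.random.seed(1)  # arbitrary seed so the run is repeatable

# Assumed inputs: 4 training examples with 3 features each.
X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])

syn0 = 2 * np.random.random((3, 4)) - 1   # 3x4 matrix: 12 weights
syn1 = 2 * np.random.random((4, 1)) - 1   # 4x1 vector:  4 weights

l0 = X                          # fixed input, never tweaked
l1 = nonlin(l0.dot(syn0))       # depends only on l0 and syn0
l2 = nonlin(l1.dot(syn1))       # depends only on l1 and syn1

print("weights we control:", syn0.size + syn1.size)
print("shapes:", l0.shape, l1.shape, l2.shape)
```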

So, here's the challenge: every time you tweak one of your 16 lines/weights/variables in syn0 and syn1, it has a ripple effect through the network.  How can you calculate the best value for each of the 16 weights while taking into consideration its effect on all the other 15 weights, all at the same time?  That sounds crazy-complex, right?

Can do.  Let me show you the World's Greatest Parlor Trick.  Those of you who know calculus will understand when I say we are going to use the Chain Rule to take derivatives.  But those of us who don't know calculus are not intimidated in the least because we will use slope, Good Ol' rise-over-run, to juggle 16 bowling pins in the air at the same time.  The secret?  In this context, to find the slope is to take the derivative.  They are exactly the same thing.

We're going to walk through one example of one line/weight only.  First we will walk through forward-feed, as in Step 7, then we'll walk backwards through back propagation.  The weight we will study is the top line of syn0, between the top neurons of l0 and l1, and we will call it syn0,1 (technically it is syn0(1,1) for "row 1 column 1 of the matrix syn0").  Our question is, "When we nudge syn0,1 up or down, how much does that move the l2_error up or down?"  In other words, think of a derivative as a "sensitivity," or a relationship, or a ratio: we know that if we wiggle syn0,1 up or down, then the l2_error will wiggle up or down in proportion to that nudge.  But will it move a little?  A lot?  What is the ratio of l2_error's wiggle to syn0,1's wiggle?  How sensitive is l2_error with respect to syn0,1?
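You can measure that "wiggle ratio" directly, with no calculus at all: nudge the weight, re-run forward feed, and divide.  Here is a minimal sketch, assuming random starting weights (the seed and the size of `nudge` are arbitrary choices of mine, and `l2_error_for` is a helper name I made up).

```python
import numpy as np

def nonlin(x):
    return 1 / (1 + np.exp(-x))

def l2_error_for(syn0, syn1, l0, y):
    # One forward feed for a single training example; returns y - l2 as a plain number.
    l1 = nonlin(l0.dot(syn0))
    l2 = nonlin(l1.dot(syn1))
    return (y - l2).item()

np.random.seed(1)                         # arbitrary seed
syn0 = 2 * np.random.random((3, 4)) - 1
syn1 = 2 * np.random.random((4, 1)) - 1
l0 = np.array([[1, 0, 1]])                # row 3 of X
y = np.array([[1]])                       # its matching truth

nudge = 1e-4                              # a tiny wiggle of syn0,1
before = l2_error_for(syn0, syn1, l0, y)
nudged_syn0 = syn0.copy()
nudged_syn0[0, 0] += nudge                # row 1, column 1 of syn0
after = l2_error_for(nudged_syn0, syn1, l0, y)

sensitivity = (after - before) / nudge    # wiggle of l2_error per wiggle of syn0,1
print("sensitivity of l2_error to syn0,1:", sensitivity)
```

Back propagation gets this same ratio without re-running the network once per weight, which is why it scales to all 16 weights at once.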

Said another way, we know that changing syn0,1 has a ripple effect on l2_error.  First we must measure each ripple of that ripple effect.  Our ripple effect has 4 ripples.  Let me show you those 4 in a diagram, then explain below.
[[[insert diagram "back prop, aka The Ripple Effect]]]
Ignore the math and look at the bottom diagram first.  When we increase or decrease the value syn0,1, that's Ripple 1.  The change in syn0,1 will cause l1 to increase/decrease by a certain proportion, aka ratio, and to calculate that ratio of change is to measure Ripple 2.  Next, that change in l1 gets multiplied by syn1,1 on its way into l2, so the value entering l2 (before nonlin()) changes in proportion to the change in l1, and measuring that ratio of change will give us Ripple 3.  Note that syn1,1 itself does not change here; it is simply the multiplier.  You can probably guess that l2 will then change in proportion to the change in the value entering it, and that is Ripple 4.  Our goal is to calculate the ratio that each ripple ripples, in order to know the amount we want to increase/decrease syn0,1 in order to minimize l2_error on our next iteration.  When we say our neural network "learns," we really mean it reduces l2_error with each iteration such that the network's predictions become more and more accurate each time.

Why are the circles representing the neurons of l2 and l1 divided down the middle?  The left half of the circle holds the value *before* it has been passed through `nonlin()`, and the right half holds the value *after* it has been passed through `nonlin()`.

Let's make a Big Picture comparison between Feed Forward and Back Prop.  

Here's what Feed Forward looks like in pseudo-code (note that I add "LH," meaning left-hand, to specify the left half of the circle we are using in our diagrams to depict a neuron in a given layer; so, "l1LH" means, "the left half of the circle representing l1," which means, "before the product has been passed through the nonlin() function.")

`l0*syn0 = l1LH  ->  nonlin(l1LH) = l1  ->  l1*syn1 = l2LH  ->  nonlin(l2LH) = l2  ->  y-l2 = l2_error`

Note that nonlin() is the part of the Sigmoid function that renders any number as a value between 0 and 1.  It is the code, `return 1/(1+np.exp(-x))`.  It does not take slope.  But in back prop, we're going to use the part of the Sigmoid function that does take slope, `return x*(1-x)`, because you will notice that lines 57 and 71 specifically request the Sigmoid to take slope by passing `deriv=True`, which triggers the `(deriv==True)` branch of the function.  Note that the x fed to this slope-taking part (l2 on line 57, l1 on line 71) is a value that has *already* been through the Sigmoid.
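Here is a quick sanity check of the two halves of the function, written as a sketch.  The value `z = 0.94` is an arbitrary example of mine.  It shows that `x*(1-x)` gives the true slope of the S-curve only when you hand it a number that has already been through the Sigmoid.

```python
import numpy as np

def nonlin(x, deriv=False):
    if deriv == True:
        return x * (1 - x)          # slope, given an already-sigmoided x
    return 1 / (1 + np.exp(-x))     # squash any number into the range 0 to 1

z = 0.94                            # hypothetical pre-sigmoid value
out = nonlin(z)                     # forward-feed half
slope_from_output = nonlin(out, deriv=True)

# Good Ol' rise-over-run on the S-curve itself, using a tiny run.
h = 1e-6
slope_by_nudge = (nonlin(z + h) - nonlin(z - h)) / (2 * h)

# Feeding the PRE-sigmoid value into the slope half gives a different number.
not_the_slope = nonlin(z, deriv=True)

print(slope_from_output, slope_by_nudge, not_the_slope)
```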

OK, now we are going to walk through Forward Feed using one of our training examples, row 3 of Matrix X, aka l0: [1,0,1].  Here's what it looks like:
[[[insert diagram of fwd feed with real numbers but only 1 edge]]]

syn0,1 has a beginning value of 2, which we seek to change (2 is just a random value we assigned, it could be any number, but hey--ya gotta start somewhere, right?)  That change will ripple through our network to decrease l2_error.  Let's walk through the math slowly: 
l0*syn0 = l1_LH, so in our example 1*2=2, but don't forget we have to add the other two products of l0 times the corresponding weights of syn0.  In our example, l0,2*syn0,2 = 0*something = 0, so it doesn't matter.  But l0,3*syn0,3 *does* matter because l0,3=1, so let's just make up a simple, convenient value for syn0,3 of 3.  Therefore, l0,3*syn0,3 = 1*3 = 3.  Our product of l0,1*syn0,1 + our product of l0,3*syn0,3 = 2+3 = 5, and 5 is l1_LH.  Next, we have to run l1_LH through our nonlin() function to create a probability between 0 and 1.  nonlin(l1_LH) is the code, `return 1/(1+np.exp(-x))`, so in our example that would be: 1/(1+(2.718^-5)) = 0.993, so l1 (the RH side of the l1 node) is 0.993.

So, what just happened above?  The computer used some fancy code, `return 1/(1+np.exp(-x))`, to do what we could do manually with our eyeballs--it told us the corresponding y value of x=5 on the sigmoid curve pictured in this diagram:
[[[insert Trask's sigmoid diagram again]]]
Notice that, at 5 on the X axis, the corresponding point on the blue, curved line is about 0.99 on the y axis.  Our code converted 5 into a statistical probability between 0 and 1.  It's helpful to visualize this on the diagram so you know there's no abstract hocus-pocus going on here.  The computer did what we did: it used math to "eyeball what 5 on the X axis would be on the Y axis of our diagram."  Nothing more.

Now, rinse-and-repeat: we'll multiply our l1 value by our syn1,1 value.  l1*syn1 = l2_LH, which in our example would be 0.993*3 (3 is a random number we just assigned because hey--ya gotta start somewhere) = 2.98.  But again, don't forget that to 2.98 we have to add all the products of all the other l1 neurons times their corresponding syn1 weights, so for simplicity's sake we'll just say those all added up to -2.  So you end up with -2 + 2.98 = 0.98, which is l2_LH.  Next we run l2_LH through our fabulous nonlin() function, which would be: 1/(1+2.718^-0.98) = ~0.7, which is l2, which is our very first prediction of what the truth, y, might be!  Congratulations!  You just completed your first forward feed!

Now, let's assemble all our variables in one place, for clarity:
l0=1
syn0,1=2
l1_LH=5
l1=0.993
syn1,1=3
l2_LH=0.98
l2=~0.7
y=1 (this is value 3 of the vector y, which corresponds to our training example #3, which is row 3 of matrix X, aka l0)
l2_error = y-l2 = 1-0.7 = 0.3
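If you'd like to check this forward feed without a calculator, here is a minimal sketch in plain Python.  A few things are my own assumptions for illustration: the value 7 for the middle weight (it gets multiplied by 0, so any number works), and the name `other_l1_products` for the "-2" we agreed to pretend the other l1 neurons contributed.  Run exactly, nonlin(5) comes out to about 0.993.

```python
import math

def nonlin(x):
    return 1 / (1 + math.exp(-x))

l0 = [1, 0, 1]                 # row 3 of X
syn0_column_1 = [2, 7, 3]      # syn0,1 = 2; middle weight is arbitrary (hypothetical 7); syn0,3 = 3

# Left half of the l1 circle: the weighted sum, before nonlin().
l1_LH = sum(inp * w for inp, w in zip(l0, syn0_column_1))
l1 = nonlin(l1_LH)             # right half of the l1 circle

syn1_1 = 3
other_l1_products = -2         # pretend total from the other l1 neurons
l2_LH = l1 * syn1_1 + other_l1_products
l2 = nonlin(l2_LH)             # our prediction

y = 1
l2_error = y - l2

print("l1_LH =", l1_LH)
print("l1    =", round(l1, 3))
print("l2_LH =", round(l2_LH, 2))
print("l2    =", round(l2, 2))
print("l2_error =", round(l2_error, 2))
```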

OK, above was forward feed.  Our first goal was to find Step 8: by how much did we miss our target truth y, the princess' castle?  Well, turns out we missed by 0.3.  But any distance between us and our beloved princess is too much, so how can we reduce that l2_error of 0.3 to put us finally in her arms?  Back propagation will now teach us the exact amount we want to increase syn0,1 in order to decrease l2_error and firmly embrace our beloved.

ADAM STOP HERE.  BELOW BE DRAGONS...

**AK: I like the lower arrows in the diagram (e.g. dl2_error/dl2_prenonlin), but I don't get why your ripple 2 and ripple 4 arrows go backwards.**
DCQ response to AK: I made the arrows on ripple 4 that way because I thought ripple 4 is the ratio of the l2 predictions over l2 before its derivatives are taken.  Therefore: RH side of the l2 circle over LH side of l2 circle.  Incorrect?

And I thought ripple 2 was "l1 post-slope-taking nonlin" over "l1 pre-slope-taking nonlin", that is to say, "RH side of the l1 circle over LH side of the l1 circle.  No?"  How/where would *you* draw the arrows to rep Ripples 4 and 2?

Last Thu after dinner when you looked at the 4 ratios in our chain, you labelled them, "slope, weight, slope, weight."  What does that mean?  For example, d l2_error / d l2 pre-slope-taking doesn't involve taking any slope at all.  The slope was only taken in l2_delta, not l2_error.  So, when you labelled this first ratio "slope," do you mean, "taking the slope of l2, as we do in l2_delta"?  Or are you referring to another slope?

**AK: All of the sensitivities you need to compute are sensitivity of the output (right side) to a change in the input (left side), so it would make more sense to have all arrows going to the right.  Same for ripple 2, which would be better described as the sensitivity of the output of the sigmoid (right side) to a change in the input (left side).  Notice that I have not used the word slope.  The slope should not be included in any of the derivative expressions.  The slope IS the derivative.  This is a confusion between what you are trying to compute and the mechanics of the computation.  As for the slope, weight, slope, weight sequence, that is what you get when you compute the sensitivities.**

**There are too many arrows from equations.  You should have one arrow per equation.  The first arrow should go from nonlin(l2,deriv=True) to dl2_error/dl2_prenonlin.  The second arrow should be from .dot(syn1.T) to dl2_prenonlin/dl1_postnonlin.  The third arrow should be from nonlin(l1,deriv=True) to dl1_postnonlin/dl1_prenonlin.  The fourth arrow should go from l0 to dl1_prenonlin/dsyn01.**

It seems your arrows always go from the *last* part of the line of code to the ratio.  Why?  Let's take line 64 and ratio 2: why does ".dot(syn1.T)" point its arrow to d l2 pre-slope-taking / d l1 post slope taking?  Is it because syn1 is what lies between RH half of circle l1 and LH half of circle l2?  But what's the math linking these two things by an arrow?  Is it because d l2 pre-euler's # / d l1 post-euler's number = syn1?  Perhaps if you walked me through an example of the chain rule, using real numbers, I could grasp this better?  Feel free to do a pencil diagram and send me a photo of it, so you can highlight stuff and draw arrows easily.  

DCQ: And on line 76, unlike the first three lines of code, you draw the arrow from the *first* part of the equation, l0. Why?  What's different about this one?

**AK: You start from the most downstream part (l2_error) and work your way upstream through the sensitivities.  For each equation, the arrow goes from the new information (the more upstream component) to the corresponding sensitivity.  It will be easier to explain this in person with a few diagrams that I don't have time to write explanations for now**

**You probably noticed that I did not include an arrow for l2_error.  One thing that bothers me about the problem formulation is that there is no direct expression for the scalar cost to be minimized (note that for the batch update l2_error is a 4x1 vector, not a scalar).  It seems implied that the scalar cost C would be defined as `C = 0.5 * l2_error.T.dot(l2_error)`.  This would provide dC/dl2_error = l2_error.**

DCQ: I don't completely understand the above paragraph, but I *think* you are referring to the same confusion I have about why the overall equation we're trying to solve is d l2_error over d syn0,1.  It seems like you've tossed some extra computations into d l2_error, i.e., y - l2 = l2_error.  I would have expected simply l2 on top of this ratio.  On none of the other ratios do we have multiple math steps, do we?  So, if we're introducing y for the first time, and subtracting l2 from it to come up with another variable we're seeing for the first time, l2_error, isn't that a problem?  Don't we have to give this ratio equation "simple, clean, singular" variables to compare?

**AK:  I think you get the main point.  The problem is l2_error is a vector, not a single number.  For the approach to completely make sense, the output needs to be a single number (e.g. the length of l2_error).  I think he is hiding this for some reason that is unclear to me.**
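To see what Adam means by a single-number cost, here is a small sketch of the scalar cost he suggests, C = half of l2_error dotted with itself.  The `y` and `l2` values are invented for illustration.  The nudge test at the end checks his claim that the sensitivity of C to each entry of l2_error is just that entry of l2_error.

```python
import numpy as np

y = np.array([[0.0], [1.0], [1.0], [0.0]])    # assumed targets, one per training example
l2 = np.array([[0.2], [0.6], [0.7], [0.4]])   # hypothetical predictions
l2_error = y - l2                             # a 4x1 vector, not a single number

# The scalar cost: half the squared length of l2_error.
C = 0.5 * l2_error.T.dot(l2_error).item()

# Nudge each entry of l2_error and measure how much C moves (rise over run).
eps = 1e-6
numeric = np.zeros_like(l2_error)
for i in range(l2_error.shape[0]):
    bumped = l2_error.copy()
    bumped[i, 0] += eps
    C_bumped = 0.5 * bumped.T.dot(bumped).item()
    numeric[i, 0] = (C_bumped - C) / eps

print("C =", C)
print("measured sensitivities:", numeric.ravel())
print("l2_error itself:       ", l2_error.ravel())
```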



![alt text](https://lh3.googleusercontent.com/Y4B2dRd-RDZJGa7-bd6kevegaX-SdVqCpF2Ap3U5Cvxki1nE_iBq0_7LxcoUYjlS90WQwQF94BCVDUBYWJXMKbWz5MHgW2h5__r5hYxqjR1l2Xfn3qC_V6oOp67ne2VA9Cb6hBbrjP11sNPPyVe7J6Vl0-DLlpzreUMqMQTxHbK2c23o2fMo4hpM7dfEdgmTgEZJCevHmL52_4oOLvGcDrKMJ-4puWxN8b3zMKvVkbntRyHCjzEWRafLBCnYx_gWnInMjBLrDPcO3AENqp5In1geRPhUb3N4FK-nsU3k1YlAVke4r6YVCsKVjFYYDBs1004aPsjDKe-XtDU9mX3j9-_KKo3PUw53WnVy-i1070fyB2bSgf4ZqTmdin-vcc-LEw-b89fj91yuI5HwKKel5heB4ToW7t1ezqLM_-KJ4veUU35LQkD4B3ZNHTNGLQlVw6fhhj4ZftmnmJc1s3BWNpp_7hf2hLifp6c3Fahyny_XqA5rA6BHGHge6XXIdoTtUJY6Sfor_O_LJDI0pUlsq7TbnRgETzq_HopqbCchPFufdYhte2_x8TthNb_F1-exlFFtUQlKkaFKH9YaDZb2cka0c15Q40SLeH_LT4WnjH0xlfCdqLXhQHqf8wCGPHQGGJZHEPLVqLOziTW3GdiVKxNsQIi0FtS1=w970-h750-no)

DCQ: Adam, below is basically my attempt to verbally teach through everything that is in the above diagram.  How did I do?  Please edit below.
I don't see your answer to a question I asked you on my actual pencil diagram: What does "pre-nonlin" mean the way we used it on our Thu night diagram?  Does it mean: 1) pre-changing the values into numbers 0 to 1, or 2) pre-taking the slopes of the values (in our case, of l2 and l1)? i.e., x*(1-x) ? Tx, DC
**It means “what l2 would be if we did not use the nonlin function”
Or dk2 using the variables in the calculation sequence I added.  Your l2_delta computation is wrong.  You don’t put the slopes through the nonlin function (the deriv=True option uses the function to compute slopes INSTEAD of evaluating nonlin).**
You mean, "...if we did not use the nonlin function in either of its 2 forms"???  Wow: didn't see that comin'...

DCQ: Is "d l2_error" exactly the same thing as "l2_delta"?  And "d l1_error" is the same as "l1_delta"?

**AK: No, any quantity preceded by d basically means an infinitesimally small change.  Trask's intermediate variables (l2_delta, l1_error, l1_delta) are combinations of the sensitivities used to capture the chain rule, but using a more intuitive explanation.  Unfortunately, attachment to these variables makes the chain rule more difficult to explain.**

We will begin with one line/weight, the top line of syn0, between the top neurons of l0 and l1, and we will call it syn0,1 (technically it is syn0(1,1) for "row 1 column 1 of the matrix syn0").  Our question is, "When we nudge syn0,1, how much does that move the l2_error, our cost?"  In other words, think of a derivative as a "sensitivity," or a relationship, or a ratio: we know that if we nudge syn0,1, then l2_error will move proportionally to that nudge.  But will it move a little?  A lot?  What is the ratio of l2_error's movement to syn0,1's movement?  How sensitive is l2_error with respect to syn0,1?

Said another way, we know that changing syn0,1 has a ripple effect on l2_error.  First we must measure each ripple of that ripple effect.  Our ripple effect has 4 ripples.  Let me show you those 4 in a diagram, then explain below.
[[[insert diagram labelling R and L half of l2 and l1, also listing where the 4 ripples are]]]
Why are the circles representing the neurons of l2 and l1 divided down the middle?  The left half of the circle holds the value *before* it has been passed through `nonlin()`, and the right half holds the value *after* it has been passed through `nonlin()`.

Let's make a Big Picture comparison between Feed Forward and Back Prop.  

Here's what Feed Forward looks like in pseudo-code (note that I add "LH," meaning left-hand, to specify the left half of the circle we are using in our diagrams to depict a neuron in a given layer; so, "l1LH" means, "the left half of the circle representing l1," which means, "before the product has been passed through the nonlin() function.")

l0xsyn0 = l1LH ->  nonlin(l1LH) = l1  ->  l1xsyn1 = l2LH ->  nonlin(l2LH) = l2  ->  y-l2 = l2_error

Note that nonlin() is the part of the Sigmoid function that renders any number as a value between 0 and 1.  It is the code, `return 1/(1+np.exp(-x))`.  It does not take slope.  But in back prop, we're going to use the part of the Sigmoid function that does take slope, `return x*(1-x)`, because you will notice that lines 57 and 71 specifically request the Sigmoid to take slope by passing `deriv=True`, which triggers the `(deriv==True)` branch of the function.

OK, above was forward feed.  Below is what Back Prop looks like.  It's a super-key point to note that, from line 57 onward, the order of forward-feed is reversed.  Back propagation is indeed working backwards from the results you got on your previous iteration to the new tweak to the synapses you should use to get *better* results in your next iteration.  

We will represent back prop as a ratio or slope, and we're going to use the letter "d" to mean, "the change in." So, every time you read a "d," read it as, "the change in." Here we go.  

First, here is our overall goal: to find how a change in syn0,1 causes a change in the l2_error for the better in the next iteration.
```
d l2_error (y - l2, the R half of the l2 circle)       
----------                                    
d syn0,1 (the top synapse btw l0 & l1)       
```
In other words, we want to know the ripple effect of a change in syn0,1 on the l2_error.  Let's follow our code as it works backwards through those 4 ripples to find a ratio that will tell us, "Hey--if you move syn0,1 *this* much, it will move l2_error *that* much":
Line 57: `l2_delta = l2_error*nonlin(l2,deriv=True)`

This means, "take the slope of l2.  Multiply it by l2_error to get l2_delta, the change we want to see in our next l2 iteration."  

This is Step 1 of our back propagation, which is to find the 4th ripple of the ripple effect.  To do this, we need to create a ratio of change, which will tell us "if we change this, how much will it change *that*?".  So, l2_error * nonlin(l2,deriv=true) is expressed as the ratio,  d l2_error / d nonlin(l2,deriv=true), and nonlin(l2,deriv=true) means, "the slope of l2," so you could write this even more simply as, d l2_error / d the slope of l2.  Let's represent this code as a ratio of change:
```
  d l2_error (y - the R half of the l2 circle)   
  --------------------------------------------       
   d l2 (L half of l2 circle, before taking slope)
```
The above ratio answers the question, what is the sensitivity of l2_error, with respect to l2?  In other words, if I change l2, how much will that affect l2_error?
DCQ: Adam, I only got to here this morning (Monday morning).  No need to check below. Tx, Dave

**AK:  The above bit is too long and does not make sense.  You want to compute d l2_error/dl2_prenonlin.  That is how much does the number coming out of the sigmoid change if you change the number going in, which is simply the slope of the sigmoid.**

The above reads as, "Hey--a change in l2 will affect the l2_error *this* much."  That's our 4th ripple effect--only 3 more ripple effects to go!  Here's our next line of code:
Line 64: `l1_error = l2_delta.dot(syn1.T)`
Again: instead of multiplying forward, we are dividing backward to get a ratio of change, so:
```
change in l1 (aka, "l1_error")     d l2_delta
                               =   ----------
                                    d syn1,1
```
The above reads as, "Hey--a change in syn1,1 will affect the l2_delta *this* much."  This is our 3rd ripple in the ripple effect that changing syn0,1 has on l2_error.  

**(AK: This is wrong.  The ONLY weight we consider is syn0,1)**

Let's move on to Line 71:
```
l1_delta = l1_error * nonlin(l1,deriv=True)

Working backwards: d slope of l1 (the R. half of the l1 neuron circle)
                   ---------------------------------------------------  = l1_delta
                    d l1_error (the L. half of the l1 neuron circle)
```
The above reads as, "Hey--a change in the l1 error will affect the l1_delta *this* much."  This is our 2nd ripple in the ripple effect that changing syn0,1 has on l2_error.  Let's move on to Line 76, where we actually update syn0 for the next iteration, where it will partner with syn1 to produce a better l2 prediction:

**(AK: I'm starting to get lost here.  You never need to take a derivative of the slope in this process. The derivative IS the slope)**
```
syn0 += l0.T.dot(l1_delta)

Working backwards: d l1 (aka the l1_delta)
                   -----------------------  
                    d syn0,1
```
The above reads as, "Hey--a change in syn0,1 will affect l1 *this* much: the l1_delta."
 
 **(AK: The terms seem mixed here.  A change in l1 is not equal to the change in l1_delta)**

To sum up:
The slope of l2_error (Ripple 4) = l2_delta x change in weight of syn1 (Ripple 3) = l1_error x slope of l1_error (Ripple 2) = l1_delta x input l0 = change in weight of syn0 (Ripple 1).  

In the other direction: Ripple 1, a change in value of the synapse/line syn0,1, causes Ripple 2, a change in the value of l1 before computing slope (L. half of l1 circle) over l1 after computing slope (R. half of l1 circle).  This makes sense because l0 X a change in syn0 = a change in l1 before the next step, which is to take the slope of l1.

This change in slope, Ripple 2, causes Ripple 3, the proportional change to syn1,1 given the change to syn0,1.  This change in weight syn1,1 equals the change in ratio of l2 before computing slope (L. half of l2 circle) over l1 after computing of slope (R. half of the l1 circle).  Ripple 3, the change in syn1,1, leads to Ripple 4, which would be a change to the ratio of l2 (LH half of circle) over the final values of l2 after slope-finding (RH side of circle).

So, the ripple of Ripple 4 would be when we pass the changed values of l2 (LH half of circle) *as changed by the ripple effect of changing syn0,1* through the slope-finding part of Sigmoid function to become the final values of l2 (RH side of circle).  Therefore, measuring the ripple of Ripple 4 is measuring the slope of the Sigmoid function.  This is what we did in line 57 of our code! `l2_delta = l2_error*nonlin(l2,deriv=True)` Let's break down line 57 piece-by-piece: 

**(AK:  I'm still lost, too many mentions of the word ripple.  Probably better consider the ripples independently before combining in the end)**

Let's start with `(l2,deriv=True)`.  As I explained in Step 10 above, **Big Advantage 3** of the **Four Big Advantages of the Sigmoid Function** is that we can take the slope of l2 and get a number which we can read on the graph of the Sigmoid as either a high-confidence, shallow-slope number (near the top or bottom of the graph), or a low-confidence, steep-slope number (near the middle of the graph).  `(l2,deriv=True)` means, "Compute the slope (i.e., the rise-over-run in the graph of the Sigmoid) of l2, and this will be the confidence level of that value of l2."  Then, as a housekeeping measure, to run this confidence level through `nonlin()` simply makes it a number between 0 and 1.  Finally, since we're working BACKWARDS through the code, instead of multiplying this confidence level by l2_error, `l2_error*()`, we DIVIDE the confidence level by l2_error to get a ratio of rise over run, which in this case equals the ripple-induced change in confidence levels of l2 over the ripple-induced change in l2_error.  This ratio is the l2_delta, and the change in l2 caused by the original ripple of changing syn0,1. 
DCQ: is this last sentence correct?  
**AK: It is not correct due to the problems in the setup leading towards the last sentence**

Let's continue backwards to compute the slope ratio of our third ripple, which is a reverse of `l1_error = l2_delta.dot(syn1.T)`.  This is the change in l2, before we computed slope, over the change in l1, post-computing of slope, 
DCQ: Am I on the right track here, or is this explanation redundant? (**AK: This section still needs a major structural change.  It needs to start with the big picture of what we are aiming for.**)

**AK: I think you are conflating the derivatives and the intermediate terms (e.g. l2_delta, l1_error, l1_delta)**

**AK: Overall, a good start to teaching back prop without calculus.  However, it can be greatly condensed without losing any of the value.  I would recommend showing each of the derivatives/ripples individually (via the suggested chain using intermediate variables), then show how they are used to compute the intermediate variables in the code.**
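Following Adam's recommendation, here is a sketch that computes each of the four sensitivities on its own for the single weight syn0,1, multiplies them into one chain, and then shows that the intermediate variables in the code (l2_delta, l1_error, l1_delta) arrive at the same number.  My assumptions: random starting weights with an arbitrary seed, one training example (row 3 of X), and my own variable names for the four ratios.

```python
import numpy as np

def nonlin(x, deriv=False):
    if deriv:
        return x * (1 - x)          # slope, given an already-sigmoided x
    return 1 / (1 + np.exp(-x))

np.random.seed(1)                   # arbitrary seed
syn0 = 2 * np.random.random((3, 4)) - 1
syn1 = 2 * np.random.random((4, 1)) - 1
l0 = np.array([[1.0, 0.0, 1.0]])    # row 3 of X
y = np.array([[1.0]])

# Forward feed.
l1 = nonlin(l0.dot(syn0))
l2 = nonlin(l1.dot(syn1))
l2_error = y - l2

# The four sensitivities, each computed on its own.
d_l2_d_l2pre = nonlin(l2, deriv=True)[0, 0]    # slope of the sigmoid at l2
d_l2pre_d_l1 = syn1[0, 0]                      # the weight syn1,1
d_l1_d_l1pre = nonlin(l1, deriv=True)[0, 0]    # slope of the sigmoid at the top l1 neuron
d_l1pre_d_syn0 = l0[0, 0]                      # the input that syn0,1 multiplies

# Chain them: how much l2 moves per unit move of syn0,1.
# (l2_error = y - l2 moves by the same amount with the opposite sign.)
chain = d_l2_d_l2pre * d_l2pre_d_l1 * d_l1_d_l1pre * d_l1pre_d_syn0

# The intermediate variables used in the code.
l2_delta = l2_error * nonlin(l2, deriv=True)
l1_error = l2_delta.dot(syn1.T)
l1_delta = l1_error * nonlin(l1, deriv=True)
syn0_update = l0.T.dot(l1_delta)

print("chain of four sensitivities:", chain)
print("l2_error times chain:       ", l2_error[0, 0] * chain)
print("update the code gives syn0,1:", syn0_update[0, 0])
```

The last two printed numbers match: the code's update for syn0,1 is simply the error scaled by the product of the four sensitivities.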


#11) Back Propagation: Lines 59-64
```
  #12 In what DIRECTION is l1, the desired target value of our hard-working middle layer 1,
  #from l1's latest guess?  Pos. or neg.? Similar to #10 above, we want to tweak this 
  #middle layer so it sends a better prediction to l2, so l2 will better predict target y.
  #In other words, add weights that will produce large changes in low confidence values and 
  #small changes in high confidence values
  l1_delta = l1_error * nonlin(l1,deriv=True)
```
First, let's do a quick review:
In Step 8, we computed l2_error, which is the difference between the prediction of the network (l2) and the truth (y).  Next, in Step 10 we computed l2_delta by multiplying l2_error times the confidence levels we had in each value of l2.  And soon in Step 13 we will update syn1 by multiplying l2_delta by l1.  This multiplication is needed so that larger changes are applied to weights that have more impact on l2.  However, we do not apply this update until after we are ready to update syn0, for the sake of consistency.    
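Here is that review as a minimal sketch.  The full `X` and `y` below are assumed (this section only confirms that row 3 of X is [1,0,1] and that its y value is 1), the seed is arbitrary, and `syn1_update` is a name I made up to show the update being computed now but applied later.

```python
import numpy as np

def nonlin(x, deriv=False):
    if deriv:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

np.random.seed(1)                                        # arbitrary seed
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])   # assumed inputs
y = np.array([[0], [1], [1], [0]])                            # assumed truth
syn0 = 2 * np.random.random((3, 4)) - 1
syn1 = 2 * np.random.random((4, 1)) - 1

# Forward feed.
l0 = X
l1 = nonlin(l0.dot(syn0))
l2 = nonlin(l1.dot(syn1))

l2_error = y - l2                               # Step 8: how far off is each prediction?
l2_delta = l2_error * nonlin(l2, deriv=True)    # Step 10: error scaled by confidence

# The syn1 update is l1 times l2_delta, but we hold it until syn0's update is ready too.
syn1_update = l1.T.dot(l2_delta)

print("l2_error:", l2_error.ravel())
print("l2_delta:", l2_delta.ravel())
print("pending syn1 update:", syn1_update.ravel())
```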

This is an important distinction.  We know our goal is to get to The Ideal l2 Prediction, which is close to or the same as our truth, y.  But we'd only be using half our horsepower to get there if we only update syn1.  We want to update both syn1 and syn0 in order to maximize our efficiency in creating The Ideal l2.  Finding the l1_delta is the key to updating syn0, and that's our next job now.

In Steps 11 and 12, we are going to compute the l1_delta, but in a slightly different manner.  We are going to use back propagation.    

Welcome back to our fairy tale:
##*The Princess and the Castle, Chapter 3: Back Propagation Lets You REALLY Learn From Your Mistakes.*

Continuing your (unmistakable) genius, now you can work backwards to learn from your mistakes.  Your l2_delta tells you the confidence you had in each of the turns (syn1) that brought you to l2.  So, you go back through your turns and eyeball the low-confidence turns to figure out the first place you went wrong.  In math terms, you will multiply tomorrow's ideal arrival and the confidence you had in today's predictions (l2_delta) by today's not-so-perfect directions (syn1) to figure out the first place you went wrong on today's route (l1_error, which is the distance between today's screwy l1 stop and tomorrow's fabulous l1 stop).  

You can then multiply the l1_error by how confident you were in your *first* set of turns/predictions that got you to l1, and the product will be the l1_delta, with which you can eagerly update syn0 to bring you ever-closer to your betrothed.  I am brushing back the tears, just thinking about it...

Consider this: back when we were finding l2_error, life was easy.  We knew l2, and we knew "The Ideal l2" we were shooting for, which was simply y.  But with l1_error, things are different.  We don't know what "The Ideal l1" is.  There's no y to tell us that.  y only tells us The Ideal l2.  So, how are we going to figure out what The Ideal l1 is?  We must first figure out how much l1 would need to change to produce the desired effect on l2.  We're going to have to take what we *do* know and work backwards.

We know syn1.  It is the current navigational turns we used today to drive to where we are now (l2), which is not yet at the castle.  We know l2_delta.  It is "how much we missed the mark (Castle y) by" (l2_error), scaled by our level of confidence (nonlin(l2,deriv=True)) about what went wrong and what needs fixing.  

So, if we multiply the current (mistaken) navigation we used today (syn1) by the changes we'd like to see in tomorrow's l2 (i.e., l2_delta), we can discover the mistakes in l1 (the l1_error) that led us off the best route to arriving at Castle y.  Think of this as the "contribution weighted error."  In other words, "how much did the mistakes in l1 (the l1_error) contribute to the mistakes in l2 that took us off the perfect route to the princess?"

If we multiply our mistakes in navigation (syn1) by our mistakes in l1 (which contains all the mistakes we made the *first* time we got lost), that would take us to l2, the location we ended up at the *second* time we got lost.  But we're working backwards now.  So, if we multiply the changes we need in our second set of directions to get us to the castle on the next attempt (l2_delta) by the navigation mistakes we just made that got us to our current, lost location (syn1), that product will tell us our l1_error (how much our first set of directions took us off the ideal route to the gas station).  Again: l1_error tells us how much our first set of directions (the product of l0 * syn0) **contributed** to ending up lost at this second, lost location (l2).  

This multiplication of the values of syn1 by the values of l2_delta creates an l1_error that has large values for those components of l1 that have the biggest impact on l2.  And that's very good news, because now we know which components of l1 matter the most to l2, and we can tweak the weights of syn0 accordingly in the next iteration.  Using our princess castle analogy, tweaking the key weights in syn1 will take us from the "first place we ended up when we got lost," l1, to a better place this time, an l2 closer to the gas station.  We'll still end up lost a second time (l2), but we'll be closer to Castle y because we improved our navigation slightly on *both* routes that got us lost, syn0 and syn1.  We got a *tad* less lost on both routes this time, and arrived closer to the gas station.  

For calculus buffs: the above is merely the chain rule in disguise.  For the rest of us mere mortals, here's the intuitive explanation:  We are trying to find the error in the hidden layer, l1.  If a value of the hidden layer l1 has no impact on the next layer (l2), then it can't really be "wrong".  In this case, we are really trying to find the effects of the weights of syn0 on the output l2.  To do this, we need to include the current values of the weights in syn1.
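Here is a tiny sketch of that intuition. The numbers are made up for illustration (they are not from the actual training run), and the shapes assume 4 examples, 4 hidden nodes and 1 output. Watch the third hidden node, whose weight in syn1 is zero:

```python
import numpy as np

# Made-up values, for illustration only
l2_delta = np.array([[0.10], [-0.20], [0.05], [0.00]])   # 4 examples x 1 output
syn1 = np.array([[2.0], [-1.0], [0.0], [0.5]])           # 4 hidden nodes x 1 output

# Step 11: send each example's l2_delta back through the weights of syn1
l1_error = l2_delta.dot(syn1.T)    # (4 x 1) . (1 x 4) -> 4 x 4

print(l1_error)
```

Each row is one training example and each column is one hidden node. The third column comes out as all zeros because that node's weight in syn1 is zero: it had no impact on l2, so it gets no blame. The first column gets the largest values because its weight (2.0) is the largest.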
```
l1_delta = l1_error * nonlin(l1,deriv=True)
Take the l1 predictions, find their slopes with nonlin(l1,deriv=True), multiply those slopes by the l1_error, and the product is l1_delta

DCQ: Are the numbers below just too darn small to be effective at teaching Big Misses and Small Misses? Instead of using iteration number 10,000 below, should I use iterations, say, 1 or 2? 

l1:                              l0.dot(syn0), before nonlin():      after nonlin() (same as l1):
[0.8556 0.07365 0.9467  0.999]   [ 1.779  -2.5318   2.877   7.114]   [0.8556 0.07365 0.9467 0.9991]
[0.1025 0.00010 0.0124  0.924]   [-2.169  -9.1877  -4.380   2.500]   [0.1025 0.00010 0.0124 0.9241]
[0.9994 0.93396 0.0166  0.901]   [ 7.454   2.6491  -4.083   2.203]   [0.9994 0.93396 0.0166 0.9005]
[0.9709 0.01786 0.00001 0.082]   [ 3.506  -4.007   -11.34  -2.410]   [0.9709 0.0179  0.00001 0.0824]


l1_error: 
[[ 0.00071393 -0.00071731  0.00094539 -0.0007853 ]
 [-0.00102426  0.0010291  -0.00135632  0.00112664]
 [-0.0009338   0.00093822 -0.00123654  0.00102714]
 [ 0.00142841 -0.00143516  0.0018915  -0.00157119]]
```



[[[insert matrix multiplication to show how "wrong values" are aggressively changed and "right values" are basically ignored (or promoted?)]]]
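A small sketch of that element-by-element multiplication, using the first row of the l1 and l1_error printouts above. It assumes nonlin(l1,deriv=True) returns l1*(1-l1), the slope of the sigmoid at each value of l1:

```python
import numpy as np

# First row of the l1 and l1_error printouts above
l1_row = np.array([0.8556, 0.07365, 0.9467, 0.999])
l1_error_row = np.array([0.00071393, -0.00071731, 0.00094539, -0.0007853])

# Assumed behavior of nonlin(l1, deriv=True) for a sigmoid
slopes = l1_row * (1 - l1_row)

# Element-wise product: each error is scaled by the slope at that value
l1_delta_row = l1_error_row * slopes

for value, slope, err, delta in zip(l1_row, slopes, l1_error_row, l1_delta_row):
    print(f"l1={value:.5f}  slope={slope:.6f}  error={err:+.8f}  delta={delta:+.10f}")
```

The four errors are all about the same size, but the fourth value of l1 (0.999) sits on the flat part of the sigmoid, so its slope is tiny and its delta is shrunk to almost nothing. The first value (0.8556) is less certain, so its delta comes out roughly a hundred times larger.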

DCQ: why does multiplying the confidence-weighted error by the syn1 weights give us the contribution-weighted error?
DCQ: In other words, why does multiplying "our confidence levels in how much we missed the mark by (l2_delta)" times "the navigation that got us to l2 (syn1)" give us "how far we were off the ideal route the first time we got lost (l1_error)?"  Is this Pythagorean?  Or nested derivatives?  Or something else?

**AK:  The mathematically rigorous answer:  it's the chain rule from calculus in disguise.  The intuitive answer:  We are trying to find the error in the hidden layer.  If a node of the hidden layer has no impact on the next layer (l2), then it can't really be "wrong".  In this case, we are really trying to find the effects of the weights of syn0 on the output l2.  To do this, we need to include the current values of the weights in syn1.**
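For readers who want to see the chain rule spelled out, here is one way to write it. This is a sketch under an assumption: that the network is minimizing half the sum of squared errors (the tutorial never names a loss). The names $z_1 = l_0\,\mathrm{syn0}$ and $z_2 = l_1\,\mathrm{syn1}$ are introduced here just for illustration, $\sigma$ stands for the sigmoid inside nonlin(), and $\odot$ means element-by-element multiplication.

$$
\begin{aligned}
E &= \tfrac{1}{2}\sum (y - l_2)^2, \qquad l_1 = \sigma(z_1), \qquad l_2 = \sigma(z_2) \\
\frac{\partial E}{\partial z_2} &= -(y - l_2)\odot\sigma'(z_2) = -\,\mathrm{l2\_delta} \\
\frac{\partial E}{\partial l_1} &= \frac{\partial E}{\partial z_2}\,\mathrm{syn1}^{T} = -\,\mathrm{l2\_delta}\;\mathrm{syn1}^{T} = -\,\mathrm{l1\_error} \\
\frac{\partial E}{\partial z_1} &= \frac{\partial E}{\partial l_1}\odot\sigma'(z_1) = -\,\mathrm{l1\_delta}
\end{aligned}
$$

So it is nested derivatives, not Pythagoras: multiplying l2_delta by the transpose of syn1 is the link in the chain that carries the error back through the second set of weights.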

A picture of the process would look like this:

![alt text](https://lh3.googleusercontent.com/VUKlbC8K8bVPdWmuKUY8g8bYbrPgWfHzczy_Dg0-B8dI0lvDHH6N9ItPGDW7g1JBKB25ld7K4a_uVEwJifL1XnGRXrqZ8mAcxSqEXHt5T7WV6O2Lzi5LQ5hhn2oSGR77Zd2FGXX74IJXUCD_GRV7X7mOrrIfl3_-LNOhP1uWreS2BxrkVnWsUrhpBxXFpA6fdBLdUQMhtjcrBNFPQvClXT4l_AuMpoo_OhJI0ToFM4MMnASgpq3xllqoQrxVtLSm4lePvDEPgTct2Oh5DLBhhhLDI9XpgnKydebZqTbnzCCglaZSOWzb3ldXOliQV4K9OuFxB4ttR4J0CeKLQ9_sjDlsARKrCgL-hNDfymabbd0BIReaQODKPYrqsXXwAO1FAraCxm_a1DzltVr6daZTGXRRL-_eUaFXN1bhvciarTkeeLcgsTTmzisnMGSR0zTyXoN-2GVAfTwSnyY4oiSd-MdVkuFXOpVg0zUFHCNg5_xC7R3svd-WtOe68YK4HLPRIh1RfqWGbScgExQcrE65r1GIVuoxDfFm906rsGxaW0U5txPxgmroVYo1nF6xy9x68HByAByKV3yjUo5pgRdV5VPS5zVQK9tb1sYu9imxRWxxYo0L1EQ-4nF0TF3SuYpQxPqD2NX6uZ6-yxi-7fcW2t_SID6PgAlh=w970-h752-no)

Key points of this diagram: (DCQ: are these 4 points accurate?)

-Route 1 is not the best route to Castle y.  It is *a* route.  You will get lost.  Ditto for route 2.  You will get lost again, but you'll be closer to the gas station;

-l1 is the first location you ended up at when you got lost.  You ask a nice lady for directions.  She gives you a set of turns, syn1;

-Feed forward is stopping at a nice lady's house, l1, to ask for directions when you realize you're lost.

-Back Prop is returning to that nice lady's house and asking, "Where am I right now?" (l1) Then you calculate how far that location is from the ideal route (l1_error).

**AK:  It's a good start, but needs improvement.  My recommendation would be to break up the explanation into three parts: 1) feed-forward, 2) back-propagation, 3) update synapses.  Feed forward is when you drive the car all the way to l2.  Back propagation is when you realize you are lost and ask the nice lady at l2 for directions.  She tells you where the gas station is relative to where you are now, and you figure out how you ended up here based on the route you took (e.g. It went wrong when I took that first exit.).  Updating the synapses is when you plan your new route incorporating the directions you got from the nice lady.  I think you should only have one "route" per instance of l2 since you don't really stop in the middle.**

**DCresponse:** I understand completely what you wrote above.  I would make one distinction: It seems to me that I do indeed "stop to ask directions" at l1 because my driving route to l2 is more like 2 vectors than one, straight line.  Here's what I mean: 
At l1, multiplying by syn1 is like getting the "new set of directions," that puts me on a new route to l2.  I don't continue in one straight line from l0 to l2.  There is an update happening at l1xsyn1 that does change my direction, if only slightly.


Said another way, even if I use your analogy and "ask the lady only at position l2" and discover "I went wrong when I took that first exit," the point at that first exit represents a change in direction of my vector, no?

And, to rep that accurately in my diagram, my first iteration should be more of a "dog's leg," then the 2nd has slightly less of a bend in that leg, since I "improved" my directions from l0 to l1 AND from l1 to l2 by updating both synapses, and each following iteration has less-and-less of a bend in the dog's leg as the 2 "segments" of my journey become more efficient, thus resembling more of a straight line from l0 to l2, no?

So, I guess I'm using "stop to ask directions" in 3 ways: 

1) while still driving to l2 in forward-feed, I get corrections in my route when I multiply l1 by syn1; 

2) at l2, to "stop to ask directions" is more like, "Hey lady, how far is the gas station from here? What directions should I use to arrive directly there next time I drive from l0?" i.e., l2_delta

3) at l2, to "stop to ask directions" is more like, "hey, where did I go wrong here? Oh!  I messed up when I took that first exit" (i.e., chain rule back prop);

Is the above more accurate?

**AK: I think you are using "asking for directions" to mean multiple things and that bothers me in an explanation. To me "asking for directions" is the best analogy for comparing what you did to what you would ideally want to do (i.e. figuring out l2_error) and should ONLY be used for that purpose.**

**I agree that the legs from X to l1 and l1 to l2 can be treated as distinct.  However, these legs are entirely pre-determined by the current weights of your synapses.  I think a better analogy would be the leg from X to l1 is following the first page of instructions on a AAA Triptik.  At l1, you turn the page and l1 to l2 is the next page of instructions.  However, all of these instructions were pre-determined when you had the Triptik printed.  Building on the analogy, back-propagation is when the lady tells you how far off you are and you go back through your instructions and figure out which ones contributed the most to your error.  Updating the synapses is printing a new Triptik with corrections (i.e. I will not take this exit cause it messed everything up last time.) The main problem with this analogy is that nobody under 25 will have any idea what a Triptik is.**
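AK's three-part breakdown (feed forward, back propagation, update synapses) could be sketched as three small functions. This is only one way to arrange it: the function names, the toy X and y, the seed, the layer sizes and the 10,000 iterations are assumptions for illustration, and nonlin() is assumed to be the sigmoid with deriv=True returning x*(1-x).

```python
import numpy as np

def nonlin(x, deriv=False):
    """Sigmoid. With deriv=True, x must already be a sigmoid output."""
    if deriv:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

def feed_forward(l0, syn0, syn1):
    """Drive the whole pre-printed route: l0 -> l1 -> l2."""
    l1 = nonlin(l0.dot(syn0))
    l2 = nonlin(l1.dot(syn1))
    return l1, l2

def back_propagate(y, l1, l2, syn1):
    """Find out how far off we are at l2, then trace the blame back to l1."""
    l2_error = y - l2
    l2_delta = l2_error * nonlin(l2, deriv=True)
    l1_error = l2_delta.dot(syn1.T)
    l1_delta = l1_error * nonlin(l1, deriv=True)
    return l2_error, l2_delta, l1_delta

def update_synapses(l0, l1, syn0, syn1, l1_delta, l2_delta):
    """Print a corrected set of directions for the next trip."""
    new_syn1 = syn1 + l1.T.dot(l2_delta)
    new_syn0 = syn0 + l0.T.dot(l1_delta)
    return new_syn0, new_syn1

# Assumed toy data: 4 examples, 3 inputs each, 1 target each
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
y = np.array([[0], [1], [1], [0]])

np.random.seed(1)
syn0 = 2 * np.random.random((3, 4)) - 1
syn1 = 2 * np.random.random((4, 1)) - 1

errors = []
for j in range(10000):
    l1, l2 = feed_forward(X, syn0, syn1)
    l2_error, l2_delta, l1_delta = back_propagate(y, l1, l2, syn1)
    syn0, syn1 = update_synapses(X, l1, syn0, syn1, l1_delta, l2_delta)
    errors.append(np.mean(np.abs(l2_error)))

print("first error:", errors[0])
print("last error: ", errors[-1])
```

Notice that feed_forward() never looks at y and never changes a weight. Both legs of the trip are fixed by syn0 and syn1 before the car leaves the driveway, which is the point of the Triptik analogy.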



#12) In what DIRECTION is the target (ideal) l1?  Lines 66-69
As before, we compute l1_delta by multiplying l1_error by the derivative of the sigmoid to aggressively change low confidence values.  We will use the exact same process as Step 10 to find in what direction our gradient descent should be moving in order to take us closer to the perfect l1 that will contribute to us finding the perfect l2, our ultimate goal.

We want to answer the question, "In what DIRECTION is l1, the desired target value of our hard-working middle layer 1, from l1's latest prediction in this current iteration?"  We want to tweak this middle layer of our NN so it sends a better prediction to l2, making it easier for l2 to better predict target y.  In order to answer this question, we need to find the l1_delta, which tells us how much to adjust the weights to produce large changes in low confidence values and small changes in high confidence values.
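To see both the direction and the size at work, here is a small sketch with made-up numbers (not taken from the training run). It assumes nonlin() is the sigmoid, so that nonlin(l1,deriv=True) returns l1*(1-l1):

```python
import numpy as np

def nonlin(x, deriv=False):
    # Sigmoid. With deriv=True, x is assumed to already be a sigmoid output.
    if deriv:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

# Made-up hidden-layer values: unsure (0.5), fairly sure (0.9), very sure (0.99)
l1 = np.array([[0.5, 0.9, 0.99]])
# Made-up errors of equal size; the sign says which way the target lies
l1_error = np.array([[0.1, -0.1, 0.1]])

slopes = nonlin(l1, deriv=True)
l1_delta = l1_error * slopes

print("slopes:  ", slopes)
print("l1_delta:", l1_delta)
```

The slope of the sigmoid is never negative, so l1_delta always keeps the sign of l1_error: that sign is the DIRECTION. The slope only changes the size of the step, giving the biggest push to the unsure value (0.5) and the smallest to the very sure one (0.99).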


#13) Update Synapses, aka Gradient Descent: Lines 71-74
This final step is the Glory Moment:  all our work is complete, and we reverently carry our hard-earned l1_delta and l2_delta up the steps of the podium to our hallowed leader, Emperor Synapse, the true brains of our operation.  
We compute the update to syn0 by multiplying l1_delta by the input l0.  This causes large changes in components of syn0 that have stronger effects on l1.
We update syn1 and syn0 so they will learn from their mistakes of this iteration, and in the next iteration they will lead us one step closer to that ideal bottom of our bowl, where error is smallest, predictions are most accurate, and joy abounds!

DCQ: why do we multiply the l2_delta by ***l1*** to update syn1?  I would have expected to multiply the *old* syn1 by the l2_delta to update the weights by the corresponding delta in l2_delta, thus creating a new-and-improved syn1 to use in the next iteration.  No?  Why not?
syn1 += l1.T.dot(l2_delta)
syn0 += l0.T.dot(l1_delta)

**AK:  It is more efficient to change weights in the synapse that correspond to larger values of l1 (i.e. if a node of l1 has a large value, a small change in the weights that are multiplied by this value can have a large effect on l2).  The multiplication ensures that the total change applied to the synapse maximizes the impact on l2.  Side note:  the mathematically rigorous answer is that it produces an increment in the direction of steepest descent (opposite of the gradient).**
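AK's side note can be checked numerically. The sketch below assumes the network is minimizing half the sum of squared errors (the tutorial never names a loss) and that nonlin() is the sigmoid. The random stand-in values for l1 and syn1 and the helper loss() are made up for illustration. It bumps each weight in syn1 up and down by a hair, measures how the loss changes, and compares that to l1.T.dot(l2_delta):

```python
import numpy as np

def nonlin(x, deriv=False):
    # Sigmoid. With deriv=True, x is assumed to already be a sigmoid output.
    if deriv:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

def loss(l1, syn1, y):
    # Assumed loss (hypothetical helper): half the sum of squared errors
    l2 = nonlin(l1.dot(syn1))
    return 0.5 * np.sum((y - l2) ** 2)

rng = np.random.RandomState(0)
l1 = rng.random_sample((4, 4))             # stand-in hidden-layer outputs
syn1 = 2 * rng.random_sample((4, 1)) - 1   # stand-in weights
y = np.array([[0.0], [1.0], [1.0], [0.0]])

# The update from Step 13
l2 = nonlin(l1.dot(syn1))
l2_delta = (y - l2) * nonlin(l2, deriv=True)
update = l1.T.dot(l2_delta)

# Measure the gradient of the loss for each weight by central differences
eps = 1e-6
num_grad = np.zeros_like(syn1)
for i in range(syn1.shape[0]):
    bumped_up, bumped_down = syn1.copy(), syn1.copy()
    bumped_up[i, 0] += eps
    bumped_down[i, 0] -= eps
    num_grad[i, 0] = (loss(l1, bumped_up, y) - loss(l1, bumped_down, y)) / (2 * eps)

matches = np.allclose(update, -num_grad, atol=1e-6)
print("update equals the negative gradient:", matches)
```

If this prints True, then what we add to syn1 is the negative of the measured gradient, which is the direction of steepest descent. Multiplying by the old syn1 instead would not give that direction, because the gradient for a weight depends on the l1 value flowing through it.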


#In Closing...
Andrew Trask gave me a fabulous gift when he wrote that memorizing these lines of code leads to mastery, and I agree for two reasons: 

1) When you try to write out this code from memory, you will find that the places where you forget the code are the places where you don't understand the code.  Once you understand this code perfectly, every part of it will make sense to you and therefore you will remember it forever;

2) This code is the foundation on which (perhaps) all Deep Learning networks are built.  If you master this code, every network you learn and every paper you wade through will be clearer and easier because of your work in memorizing this code.

Memorizing this code was made easy for me by making up an absolutely ridiculous story that ties all the concepts together in a fairy-tale mnemonic.  You will remember better if you make your own, but here's mine, to give you an idea.  I count the 13 steps on my fingers as I recite this story out loud:

1) Sigmund Freud (think: Sigmoid Function) absolutely *treasured* his neural network, and he buried it like a pirate's treasure, 

2) "X" marks the spot (Creating X input that will become l0).  

3) "Why," I asked him (Create the y vector of target values), "didn't you plant 

4) Seeds instead?" (Seed your random number generator)  "You could have grown a lovely garden of 

5) Snapdragons," (Create Synapses: Weights) "which could be fertilized by the 

6) firm poop" (For loop) "of the deer that 

7) Feed on the flowers" (Feed Forward Network)!  Then suddenly, an archer 

8) Missed his target (By How Much Missed the Target?) and killed a grazing deer.  As punishment, he was forced to 

9) Print his error (Print Error) 500 times on a blackboard facing the 

10) Direction of his target (In What Direction is y?).  But he noticed behind the 

11) BACK of his target two deer were mating and PROPAGATING their species (Back Propagation) and he shouted for them to stop but they wouldn't take 

12) Direction and ignored him (In what Direction is the l1 target?).  He got so angry that his mind 

13) Snapped and he Descended into Gradient insanity (Update Synapses, Gradient Descent).  

So, this is a very silly story, but I can tell you that it has burned those 13 steps into my brain.  Once I can write down those 13 steps, and I understand the code in each step, writing the code perfectly from memory becomes easy.

I hope you can tell that I love my journey into Deep Learning, and I wish you the same joy I find!
Feel free to email me improvements to this article at: DavidCode1@gmail.com

THE END
