<a href="https://colab.research.google.com/github/davidAcode/davidAcode.github.io/blob/master/0402_Teaching_Deep_Learning_to_the_Marginalized_Without_a_Math_Background.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Building Your First Neural Network: A Simple, Complete Explanation That Skips No Steps
The best teacher is the person who just learned yesterday the stuff you are studying today, because he still remembers what he struggled with and how he overcame it, and he can pass those shortcuts on to you.  That "he" would be me.  Let's jump in:

#1) Congratulations!  You are the wealthy owner of a fabulous pet shop...  
One month ago, you launched a new kitty litter product, cleverly named, "Litter Rip!"  A big part of your success comes from your savvy use of AI to send targeted advertisements to the right potential new customers.  You seek folks whose beloved felines would be pleased to, "let 'er rip," so to speak, upon your new cat toilet.

Your secret weapon is your dataset.  You have data from surveys of your pet shop customers in the past month since you started selling Litter Rip! in your store.  These customer surveys include their answers to four stunningly insightful questions:  
1. Do you own a cat who poops?
2. Do you drink imported beer?
3. In the past month, have you visited our award-winning website, LitterRip!.com?
4. In the past month, have you purchased Litter Rip! for your poopin' puss?

The answers to the first three questions are known as "features" (characteristics) of your past customers; the answer to the fourth question is the outcome we want to predict.  It's important to understand that a neural network always trains on one dataset in order to make predictions on another dataset.  So first, you will train your network by inputting millions of past, overjoyed customers and their Yes/No answers to the first **three** of the above insightful questions/features.  Your neural network will calculate which of these features (or combinations thereof) were most important to past customers who bought Litter Rip! for their puss.  

Most thrillingly, your neural network will train itself, based solely on your old customers' answers to the above three questions, until it becomes **awesome** at predicting which of your old customers probably did buy Litter Rip!  The process is trial-and-error:  the network will predict, then compare its predictions to the list of old customers who actually **did** buy Litter Rip! (i.e., our fourth question above: "Purchased Litter Rip! in the past month?"), and learn from its mistakes over 60,000 iterations.  

Once your network is fabulous at predicting purchasers of Litter Rip! from the past customers database, then, you can turn it loose on your *other* dataset, a list of hot prospects.  From your local veterinarian (who is secretly in love with you, you charmer...) you have obtained a fresh batch of surveys of people who have answered the same first three questions, and your by-now-well-trained network will predict who best to send your targeted ad to.  Pure genius!  OMG, how do you *do* it?

The best way to master Deep Learning is to build a neural network and then memorize the code, because then you can apply these fundamentals to any network you meet in future academic papers or work projects.  It all starts here.  Today you will build your first neural network using these tools:

1.   A real-life example: marketing Litter Rip! to cat owners;
2.   The Big Picture: an analogy of a neural network as a brain, with neurons and synapses;
3.   Visualizing how networks "learn:" a white ball rolling in a red bowl on a table;
4.   The Code: how a computer creates a learning brain with only 21 lines of Python;
5.   Break down this Python code into 13 steps that cover the 5 **Major Themes** of Deep Learning:

> 1.   A network learns by trial-and-error;
2.   Forward Feed computes error as the distance (the height) from the *grid* on which the global minimum lies;
3.   The Sigmoid, a simple activation function, calculates probability and confidence levels;
4.   Back Propagation tells you which parts of your network to tweak to reduce your error; and
5.   Gradient Descent shows how the synapses, rather than the neurons, are the core of your network's "brain."  

I mastered today's material thanks to Andrew Trask and Siraj Raval, as well as my mentor Adam Koenig, a Stanford PhD in aerospace engineering, and a gifted teacher.  Now I'm going to stand on these teachers' shoulders and help you learn it too.

#2) Here is the Big Picture: An Analogy
Here is a diagram of the 3 layer neural network we will build today (For now, just focus on the bottom labels: "Input Layer, Synapses, etc."  We'll get to the labels at the top later):

![alt text](https://lh3.googleusercontent.com/zp68UAhUuc6V5tG_aKC2_74-jIF4cDUZxN3fOJk3Szh2MlWGnpETp2IXAx7CwiWc_AAgIriR4-uz6SyUAqghFztB5ZCQ6JDbPMAuZsbY0WnlZ7XRp9n20C80aRj6RakhEh61HHFCp2qv8spXvUf0d3Qd_Exjia9aejQtqvrzRNw64e8oh4-RhpNUkdptf3C7MAXgGalKacTmaqF2U6qglNUazrFCI4YokrtKm8CD5KwQu2TQiWgyVszBsgtsqRVx9-hJkO4C9uT_jEQ2C7n29wVPvMdumi2pkmjogab8RM-ZXlvLXBmad3rSDAum1ntO6LM5UwzB8mM7PyZpVCuBQCifbFdhowCqn6J6hUdxde-lTv18MgwN-qml_rWuKbR9VT2uNED9WlYMhtpjJ4zfpid7c_QyqcvTEfl-mlhkn64eMzt_WmTYFapa1tQ7lmnbdhmbHK_1uzqGESNSM9yVxrLdJR4LomfcFTGLE7YPlPVCVE8l-CPgJtes252wHeWowsZJ0dy2ZK2h6OCfejtNuWiapWMgdVSs5gYO3M-KprHch8zzQ4oKN3DaciOils-2fzcwD2P6xAZU7HlsWktQOICorsbybYh0SvxXtcaJiAnJ9xKBxypHTWhyHzlC37F3it9UfI9gb-iWBNaO5jJdxAWlol_h8kq2=w973-h535-no)

Let me help you get your bearings in this diagram.  This is a three-layer, feed-forward neural network.  The input layer is on the left: the three circles represent neurons (sometimes known as nodes or features).  You may recall our first three questions above represent our three features.  So for now, think of that column of three questions as representing one customer's responses.  The top circle on the left contains the answer to the question, "Do you own a cat who poops?"  The middle circle of the three contains the answer to, "Do you drink imported beer?"  The bottom circle of the input layer is the feature, "Have you visited our website, LitterRip!.com?"  So, if Customer One responded, "Yes/No/Yes" to the questions, the top circle would contain a 1, the middle circle would contain a 0, and the bottom circle would contain a 1.

The synapses (all those lines) and the hidden layer above are where our brain does its "thinking."  The single circle on the right-hand side (the one still attached to four synapses) is the network's prediction, which says, "Based on the combination of answers input into me, here is the probability that this customer will indeed buy Litter Rip! (or not)."  The predicted probability will be between 0 and 1.  For example, an output prediction of 0.00001 means, "Definitely won't buy,"  0.2 means "Probably won't buy," 0.8 is "Probably will buy," and 0.999 is, "You're damn right they'll buy Litter Rip! with joy in their hearts!" 

The stand-alone circle on the right-hand side, labelled y, is The Truth, which is the answer to the fourth question, "Have you purchased Litter Rip! for your poopin' puss?"  With y, there are only two choices: "0" means they did not buy, and "1" means they bought.  Our neural network will output a prediction of probability, compare it to y to see how accurate it was, and learn from its mistakes in the next iteration.  Trial-and-error.  60,000 times.  In seconds.  That's the power of Deep Learning.

#3) Another Big Picture: Visualizing How Networks Learn
Some people learn visually, so seeing a geometric representation of a neural network below may help them master the material.  But, if you understand nothing of what you see in this diagram, no worries at all.  I will teach you every step of the process below, but I want you to have a clear vision of where we're headed.  

This picture sums up how our neural network takes the input of each customer's three features and outputs a prediction of the probability whether that customer will buy Litter Rip!.  It then compares that prediction to the actual truth (y, the answer to the fourth question) about whether the customer did indeed buy Litter Rip!, and learns from its mistakes to give a better probability in the next iteration.  60,000 times.

![alt text](https://lh3.googleusercontent.com/jIup60T65tIKtXg0B-Np6jeNXk4TvQTRgBI1btNRZUZ4yy_ZEyL1bN3RwiSjzKNcbyXQN6z7vdV55NzGFxJfUpZXkyU6HTmrScht0rbk5BXGC6eO79LrZuuVpJdHE4fr4QYwvdbO)
(taken with gratitude from [Grant Sanderson](https://www.youtube.com/watch?v=IHZwWFHWa-w&t=2s&index=3&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi) Ch. 2)

In the diagram above, our network begins its predictions about kitty litter buyers at the top of the dotted, white line on the surface of the warped, lumpy red bowl.  That point is very important, so let's call it Point A.  Think of Point A as the first prediction the network makes.  This is the first forward feed.  What most AI courses and blogs forget to mention is that Point A has a location defined by two coordinates on the white grid underneath the lumpy red bowl.  Notice that the lumpy bowl is not drifting in space.  It is sitting on a white grid. Think of that white grid as a table top, and our warped red bowl actually sits on the table top, but the bowl's only point of contact with the table top is the very bottom-most dip in the bowl--the point where the dotted white line ends.  (I know the bowl would be too lopsided to actually sit properly, but humor me--this is just a quick-and-dirty analogy).    

Now, pretend Point A is a little, white ball.  It has 3 coordinates: the X and Y axes tell you location on the grid (table top), and the third coordinate is the height from the grid to Point A as it sits on the inner surface of the warped, red bowl.  Here's a super-key insight: the two orange arrows you see, representing the X and Y axes on the grid, are actually the synapses syn0 and syn1 that you will learn about below.  As I eyeball Point A, its coordinates look like (3,3) to me, so let's say syn0 is 3 and syn1 is 3 (I'm oversimplifying here on purpose).

Here's where it gets cool: remember Customer One's Yes/No/Yes responses to the first three questions?  The network takes that 1,0,1 of Customer One, multiplies it by the synapses of syn0 and syn1 (plus some other fancy jazz), and makes a prediction: let's say, 0.7.  In other words, there is a 70% probability that Customer One bought Litter Rip!  Then the network compares that prediction, 0.7, to The Truth, y, which is 1.  Customer One did indeed buy Litter Rip! (a wise choice).  

And now things get super-cool:  Do you see that yellow, vertical arrow?  If I were drawing this diagram, I would have put that yellow arrow right under Point A--namely, at about (3,3) as I eyeball things.  If the vertical, yellow arrow were located just underneath Point A, then it would represent the 3rd coordinate, the height from Point A to its two coordinates on the white grid, syn0 and syn1, or (3,3).  This height equals the error of that first prediction--how much it missed The Truth by.  So the y survey answer of Yes, or 1, minus the prediction of 0.7 equals an error of 0.3.  That yellow arrow would have a "length" of 0.3!  Ain't that cool?  We can see how the network thinks in terms of simple, 3-D geometry.  You just witnessed a complete Forward Feed.  You're a rock star.
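
If it helps to see that arithmetic spelled out, here is a tiny sketch in Python.  The variable names are my own inventions for illustration, and the 0.7 is simply the made-up prediction from the example above, not something a real network produced:

```python
# Illustrative values from the example above (these names are hypothetical).
y_truth = 1.0        # Customer One really did buy Litter Rip!
prediction = 0.7     # the network's first guess for Customer One

# The error is how far the prediction missed The Truth by:
# the "length" of the yellow arrow.
error = y_truth - prediction
print(round(error, 2))   # prints 0.3
```

The `round()` is only there because computers store decimals imprecisely; without it you would see a long tail of digits after the 0.3.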

And it gets better.

The goal of a neural network is to find the fastest, most direct path for the white ball down from the original Point A to the bottom-most dip of the warped, red bowl--the point that sits on the table top.  That is Utopia folks--the place where the yellow arrow, the error in our predictions, would equal zero in length, meaning our predictions have no error and therefore our network would be stunningly accurate.

So, consider our scenario now: we started at the top of that white, dotted line, where Point A's height (the yellow arrow, the amount of error in our first prediction) equals 0.3.  How do we roll the ball down the bowl to get to the bottom, where the yellow arrow (the amount of error) is zero?  In other words, how do we tweak our original synapses of (3,3) so that their coordinates will move on the grid to the point *right* under the bowl's bottom, at (roughly) (3,0)?  So far, we have only made one prediction based on Customer One's responses.  How can we improve each subsequent prediction, based on each customer's responses, until our prediction error is zero, the white ball is at the bottom, and our fabulous network is trained enough to take on our new dataset of hot prospects that were provided by the (smitten) veterinarian?

To repeat: our goal is for our small, white ball of Point A to roll down the lumpy bowl as depicted in the diagram by the dotted white line, which represents our network's journey from its first, not-so-great prediction of 0.7 to the final, 60,000th prediction, which is as accurate as possible.  To find that step-by-step path down the surface of the bowl to the bottom is the process of Gradient Descent.  

Here's a key ingredient of Gradient Descent:  When you start at Point A, it is Back Propagation that tells you the slope of the surface of the bowl at Point A.  Finding the slope of the bowl's surface, where our ball lies, tells us the direction our ball should roll to make the quickest descent to the bottom of the bowl where error is 0 because our ball would be touching the syn0, syn1 grid plane.  Each step of that dotted white line is an iteration of the neural network.

So, what is an iteration?  **That's** a key question.  Our white ball, Point A, rolls down our lumpy red bowl towards the bottom, which sits on the white grid, where "Accurate Prediction" lives.  Each dot of that dotted white path represents a tweak, or update of the weights syn0 and syn1.  Here's the key thing to visualize: the path of our white ball in the three-dimensional bowl is bumpy and erratic.  But if you envision the path of the white ball on the two-dimensional white grid underneath, our white ball is essentially inching its way across that flat grid towards the bottom of the red bowl, where error is zero and accuracy reigns.  

Since syn0 and syn1 are the X and Y axes of the white grid, each adjustment of those coordinates is what brings the white ball closer to the bowl's bottom, i.e., less error and greater accuracy.  The "learning" process of reducing error takes place in the synapses of the neural network pictured in my first diagram above--not the neurons.
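
For those who like to see an idea run, here is a rough sketch of the ball rolling down a bowl.  Fair warning: this is NOT our real network.  I invented a perfectly smooth bowl (`bowl_height`) whose bottom sits at (3,0) and whose height at Point A (3,3) happens to be 0.3, just to match the story above.  The function names and the `step_size` value are assumptions I made up for the illustration:

```python
# A made-up, perfectly smooth bowl (NOT the real network's error surface).
# Its lowest point sits at syn0 = 3, syn1 = 0, where the height (error) is zero.
def bowl_height(s0, s1):
    return ((s0 - 3.0) ** 2 + s1 ** 2) / 30.0

def bowl_slope(s0, s1):
    # Slope of the bowl in each grid direction (the partial derivatives).
    return (s0 - 3.0) / 15.0, s1 / 15.0

s0, s1 = 3.0, 3.0          # Point A's coordinates on the white grid
step_size = 3.0            # hypothetical: how far the ball rolls per iteration

heights = [bowl_height(s0, s1)]
for _ in range(20):
    slope0, slope1 = bowl_slope(s0, s1)
    # Roll downhill: move each coordinate against its slope.
    s0 -= step_size * slope0
    s1 -= step_size * slope1
    heights.append(bowl_height(s0, s1))

print("start height:", round(heights[0], 3))
print("end height:  ", round(heights[-1], 6))
print("end position:", round(s0, 3), round(s1, 3))
```

Each pass through the loop is one "dot" on the dotted white line: the ball's grid coordinates get nudged, and the height (the error) shrinks.  Our real network does the same kind of nudging, only with many more coordinates than two.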

This next diagram may also be helpful in envisioning the geometry of neural networks.  It's essentially another version of the same, red bowl above, but from a slightly different angle:

https://docs.google.com/document/d/1_YXtTs4vPh33avOFi5xxTaHEzRWPx2Hmr4ehAxEdVhg/edit?usp=sharing

Here is a perfect example of why the best teacher is someone who learned yesterday the material you are learning today:  Over the past year of studying AI, I had seen the bowl analogy many times, but NO ONE mentioned the significance of the bowl sitting on the grid, or what the grid was, or the significance of the height from the grid to Point A.  You see, all the experts already know this, and they (mistakenly) assume that you know it too.  So please learn from my mistake, and gain a super-key insight that eluded me for over a year, even though it was right under my nose.  

And that's it.  That is the best geometric representation of what a neural network does, in one picture.  Why doesn't EVERYBODY teach it like this?  If you can SEE what a neural network does in 3-D, it makes it SO much easier to understand why we do all these abstract steps with math and with code.

Again: if you understood nothing of the above diagram and analogy, no worries.  I'm going to fill in all the details below, but at least now you know exactly where we're headed!  Godspeed.

#4) How to Create a Brain with 21 Lines of Code
Now, let's get an overview of our code.  I suggest you open this blog post in two side-by-side windows and show the code in the left window while you scroll through my explanation of it in your right window.  First, I'll show you the entire code we'll be studying today, and underneath that is my detailed step-by-step explanation of what it does.  As you see from the comments below (the lines beginning with a #), I have broken this process of building a neural network down into 13 steps.  Get ready for the wonder of watching a computer learn from its mistakes and recognize patterns!  We're about to give birth to our own little baby brain... :-)

##But first, a word on the concept of a matrix and linear algebra...
You will see the words "matrix" and "matrices" in the code below.  VERY important: the matrix is the engine of our car.  Without matrices, a neural network goes nowhere.  At all.  

A matrix is a set of rows of numbers.  For a quick-and-dirty metaphor, think of a glorified Excel spreadsheet.  Or a database with many rows and many columns of numbers.  The first matrix you will meet below is from our pet shop.  It looks like this:
```
[1,0,1],
[0,1,1],
[0,0,1],
[1,1,1]
```
Each row contains three numbers between brackets, right?  Those are the Yes/No answers to the three questions from our customer survey.  And each row represents one customer's survey.  So, row one means Customer One.  And column one contains the answers to, "Do you own a cat?"  Now, to fill in a little more detail, here's the exact same matrix with some labels: 

![alt text](https://lh3.googleusercontent.com/J6TA3cwHmh8ZdfmY-J64Kj8PMnUP5pQgzZQ8vMjXr2Ujfvys1ufT8SXHBQeyLNZHV1REYKgw5JJ_nZ5xS9bHT1_pmYV2kUyM7tItvue-fSMhs8ddv5nRK-UUoO6WhrWtOJ1ICq2SE-vbn6W1W88c99_0MVYLiAL9cY6KxWKOy5X6yen7LMlWSseA2piU0GeMtPTvgMwf3z-PCQerZBnacUnTIkTjhofVpXt0ItJWpaB89sN8ylGl87D72G8TuKRt2tAWa3Z8AcGza5XPaoDtuUsjKPUbr3_T0PALX-_XrXiCXxtVGwkLEU_zMYaNByeJYBnUfRpwzpB6BxdNqZoNxQQ5t4hOVqeARJftP6WyaCULDSllKbc3tWADFOqyYHaQXXvRNvkfmnWsuSMEdRzLhPCLMcN2TGA7y6wX5FcY_lWcm2GQmsvetzpD_G6M6WeowokRa-FZiUKVPSNKIoJCZS1SrnvCs7wTYmkqFdIacuiEiHhX7-4-b9h9HRKLxCV3En8_xjLLjnu0-UAbWbE4R3_3oVJNS3G72WlqxmBb_pkTSnW9i5aiz3oU8LmG4oMIqZ7xu82IiWIeYNU4zwCvuRt-hiNfYtHAYbm1mPE-aC_imCcgXbzQIncsFXU51rn2QGWIGa1cHeJWLx2Ob6ZZj8YwFasARCXI=w973-h738-no)

Here's a key point that really confused me at first: in our matrix, one customer's data is represented by a ROW of three numbers, right?  And in a neural network diagram like the first picture I showed you at the beginning of this article, the input layer is a column containing three circular "neurons," right?  Well, it's important to notice that each neuron does NOT represent a customer--a ROW of data.  Rather, each neuron represents a FEATURE--a COLUMN of data.  So, within a neuron, we have *all* the customers' answers to the same question/feature, e.g., "Do you own a cat?"  We're only charting four customers, so in my diagram above, you only see four 1's and 0's that are responses to that question.  But if we were charting 1,000,000 customers, that top neuron would contain one million 1's and 0's representing each customer's Yes/No response to feature one, "Do you own a cat?"

So I hope it's becoming clear why we need matrices:  Because we have more than one customer.  In our toy neural network below, we describe four customers, so we need four rows of numbers.

Our network also has more than one survey question.  So we need one column per survey question (aka, per feature), thus we have three columns here representing the responses to the first three questions of our survey (the fourth question appears in a different matrix that we'll see later).  

So our matrix is tiny: 4 rows X 3 columns, known as a "4 by 3."  But the matrices in real neural networks can have millions of customers and hundreds of survey questions (features).  Or the neural networks that do image recognition in photos or video can have billions of rows of "customers" and billions of columns of features.  

In sum, we need matrices to keep all our data straight while we do complex calculations on it, so a matrix organizes our data into nice, neat little rows and columns (usually, not so little at all).  Good enough for now?  Let's move on.   
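
Here is a small sketch of the rows-versus-columns idea, using the same 4 by 3 pet shop matrix shown above.  It relies only on standard NumPy indexing:

```python
import numpy as np

# The same 4-by-3 survey matrix shown above: one row per customer,
# one column per survey question (feature).
X = np.array([[1, 0, 1],
              [0, 1, 1],
              [0, 0, 1],
              [1, 1, 1]])

print(X.shape)    # (rows, columns), i.e. (4, 3)
print(X[0])       # Customer One's three answers (a ROW)
print(X[:, 0])    # every customer's answer to question one (a COLUMN)
```

`X[0]` is what one customer looks like; `X[:, 0]` is what one input neuron "holds."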

I am grateful for Andrew Trask's [blog post](http://iamtrask.github.io/2015/07/12/basic-python-network/) from which the code below is taken (though the comments are mine). Display this in your left window:

```python
#This is the "3 Layer Network" near the bottom of: 
#http://iamtrask.github.io/2015/07/12/basic-python-network/

#First housekeeping: import numpy, a powerful library of math tools.
import numpy as np
#1 Sigmoid Function: changes numbers to probabilities and finds slope to use in gradient descent
def nonlin(x,deriv=False):
  if(deriv==True):
    return x*(1-x)
  
  return 1/(1+np.exp(-x))
#2 X Matrix: This is our feature set from our 4 customers, in language the computer
#understands.  Row 1 is the first customer's set of Yes/No answers to our survey questions:
#"1" means Yes to, "Have cat who poops?" No to "Drink imported beer?" is a "0," and Yes
#to "Visited the LitterRip!.com website?" is the 1.  There are 3 more rows (customers and 
#their responses) below that.  Got it?  4 customers, and their Yes/No responses 
#to 3 questions (the 4th question is used in the next step below).  These are the set of 
#inputs that we will use to train our network.
X = np.array([[1,0,1],
              [0,1,1],
              [0,0,1],
              [1,1,1]])
#3 y Vector: Our set of 4 target values. These are our 4 customers' Yes/No answers 
#to question four of the survey, "Purchased Litter Rip?"  When our neural network
#outputs a prediction, we test it against what really happened.  When our network's
#predictions compare well with these 4 target values, it is now ready to predict
#whether our hot prospects from the (hot) vet will buy Litter Rip!
y = np.array([[1],
             [1],
             [0],
             [0]])
#4 SEED: This is housekeeping. One has to seed the random numbers we will generate
#in the synapses during the training process, to make debugging easier.
np.random.seed(1)

#5 SYNAPSES: aka "Weights." These 2 matrices are the "brain" which predicts, learns
#from trial-and-error, then improves in the next iteration.  syn0 and syn1 are the 
#X and Y axes on the white grid, so each time we tweak these values, we march the 
#grid coordinates of Point A towards the red bowl's bottom, where error is zero.
#It learns, remembers, improves.
syn0 = 2*np.random.random((3,4)) - 1 # 1st layer of weights, Synapse 0, connects l0 to l1.
syn1 = 2*np.random.random((4,1)) - 1 # 2nd layer of weights, Synapse 1 connects l1 to l2.

#6 FOR LOOP: this iterator takes our network through 60,000 predictions, 
#tests, and improvements.
for j in range(60000):
  
  #7 FEED FORWARD NETWORK: Think of l0, l1 and l2 as 3 matrix layers of "neurons" 
  #that combine with the "synapses" matrices in #5 to predict, compare and improve.
  #l0, or X, is the 3 features/questions of our survey, recorded for 4 customers.
  l0=X
  l1=nonlin(np.dot(l0,syn0))
  l2=nonlin(np.dot(l1,syn1))
  
  #8 TARGET values against which we test l2, our prediction, to see how much 
  #we missed it by. y is a 4x1 vector containing our 4 customer responses to question
  #4, "Did you buy Litter Rip?"  When we subtract the l2 vector (our first 4 predictions)
  #from y, our target values, we get l2_error: how much our 4 predictions missed the
  #target by, on this particular iteration.
  l2_error = y - l2
  
  #9 PRINT ERROR--a parlor trick: in 60,000 iterations, j divided by 10,000 leaves 
  #a remainder of 0 only 6 times. We're going to check our data every 10,000 iterations
  #to see if the l2_error (the yellow arrow of height under the white ball)
  #is reducing, and whether we're missing our target y by less with each prediction.
  if (j% 10000)==0:
    print("Avg l2_error after 10,000 more iterations: "+str(np.mean(np.abs(l2_error))))

  #10 In what DIRECTION is y, our desired target value, from our network's latest guess? 
  #This is the beginning of back propagation.  Here we calculate confidence levels.
  #We take the slope of our latest guess (i.e., the slope of the bowl's surface 
  #where the Point A ball sits), and multiply it by how much that latest guess missed
  #our target of y.  In Step 13 we then multiply the resulting l2_delta by l1 to update
  #each weight in our syn1 synapses so that our next prediction will be even better.
  l2_delta = l2_error*nonlin(l2,deriv=True)
  
  #11 BACK PROPAGATION: In Step 7, we "fed forward" our input.  Now we work backwards
  #to find the l1 error (back propagation). l1 error is the difference between the 
  #ideal l1 that would provide the ideal l2 we want, and the most recent computed l1.  
  #To find l1_error, we have to multiply l2_delta (i.e., what we want our l2 to be
  #in the next iteration) by our last iteration of the optimal weights (syn1). 
  l1_error = l2_delta.dot(syn1.T)

  #12 In what DIRECTION is l1, the desired target value of our hard-working middle layer 1,
  #from l1's latest guess?  Pos. or neg.? Similar to #10 above, we want to tweak this 
  #middle layer so it sends a better prediction to l2, so l2 will better predict target y.
  #In other words, add weights that will produce large changes in low confidence values and 
  #small changes in high confidence values.
  l1_delta = l1_error * nonlin(l1,deriv=True)
  
  #13 UPDATE SYNAPSES: aka Gradient Descent. This step is where the synapses, the true
  #"brain" of our network, learn from their mistakes, remember, and improve--learning!
  syn1 += l1.T.dot(l2_delta)
  syn0 += l0.T.dot(l1_delta)

#Print results!
print("Our l2 predictions after all 60,000 iterations of training: ")
print(l2)
```

#5) Explaining the Code in 13 Steps 
Let's go through each of the 13 steps of the code in detail:

##1) The Sigmoid Function, Briefly Mentioned: lines 6-11:
The sigmoid function plays a super-important role in making our network learn, but don't worry if you don't understand it all yet.  This is only our first pass over the material.  I'll explain it in detail below in Step 10.  For now, just do your best:

"nonlin()" is a type of sigmoid function called a logistic function.  Logistic functions are very commonly used in science, statistics, and probability.  This particular Sigmoid function is written in a more complicated way than necessary here because it serves two functions:

1) to take each of the matrices within its parentheses and convert each value to a number between 0 and 1 (aka a statistical probability).  This is done by line 11: `return 1/(1+np.exp(-x))` 

Why do we need statistical probabilities?  Well, remember that our network doesn't predict in just 0's and 1's, right?  Our network's prediction doesn't shout, "YES!  Customer One WILL ABSOLUTELY buy Litter Rip! if she knows what's good for her!"  Rather, our network predicts the probability: "There's a 74% chance Customer One will buy Litter Rip!"

This is an important distinction because if you predict in 0's and 1's, there's no way to improve.  You're either right or wrong.  Period.  But with a probability, there's room for improvement.  You can tweak the system to increase or decrease that probability a few decimal points each time, so you can improve your accuracy.  It's a controlled, incremental process, rather than just blind guessing in the dark.

We will see below that this is very important, because this conversion to a 0-1 number gives us **FOUR** very **big advantages**.  I will discuss these four in detail below, but for now, just know that the sigmoid function converts every number in every matrix within its parentheses into a number between 0 and 1 that falls somewhere on the S-curve illustrated here:

![alt text](https://iamtrask.github.io/img/sigmoid.png)

(taken with gratitude from: 
[Andrew Trask](https://iamtrask.github.io/2015/07/12/basic-python-network/))

So, Part 1 of the Sigmoid function has converted each value in the matrix into a statistical probability, which is also known as a confidence measure.  In other words, the number answers the question, "how confident are we that this number correctly predicts an outcome?"  You may wonder, So what?  Well, our goal is a neural network that confidently makes accurate predictions.  The fastest way to achieve that goal is to fix the non-confident, wishy-washy, low-accuracy predictions, while leaving the good predictions alone.  Remember this concept of wishy-washy, non-confident numbers.  It will be important below.
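
To watch Part 1 at work, here is a quick sketch.  The `sigmoid` function below uses the same formula as line 11 of our code, but the input numbers are ones I made up just to show the squashing effect:

```python
import numpy as np

def sigmoid(x):
    # Same formula as line 11 of the code above.
    return 1 / (1 + np.exp(-x))

raw_values = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])   # made-up inputs
squashed = sigmoid(raw_values)
print(np.round(squashed, 4))
```

Big negative numbers land near 0, big positive numbers land near 1, and 0 lands exactly on 0.5, the most wishy-washy value of all.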

Now for Part 2.  The second part of this sigmoid function is in lines 8 and 9: `if(deriv==True):` followed by `return x*(1-x)`.

When called to do so by `deriv=True` in the code below, line 9 takes the confidence measure from Part 1 and converts it into a slope at a particular point on the Sigmoid S curve, which will be used to tweak the synapse matrices of our network and nudge them towards greater accuracy in prediction.  
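
Here is a small sketch of Part 2 on its own.  The name `sigmoid_slope` and the list of confidence values are my own, chosen for illustration; the formula is the one from line 9:

```python
# Slope of the S-curve, written the same way as line 9 of the code above.
# Note: x here is a value that has ALREADY been through the sigmoid.
def sigmoid_slope(x):
    return x * (1 - x)

for confidence in [0.01, 0.2, 0.5, 0.8, 0.99]:
    print(confidence, "->", round(sigmoid_slope(confidence), 4))
```

Notice the pattern: the slope is largest (0.25) at the wishy-washy value of 0.5, and it shrinks towards zero as the confidence gets close to 0 or 1.  That is why the wishy-washy predictions get the biggest tweaks.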

Let's move on to Step 2:


##2) Creating X input: Lines 12-17
Lines 12-17, step 2, create a 4x3 matrix of input values that we will use to train our network.  X will become layer 0, or l0 of our network, so this is the beginning of the "toy brain" we are creating!  

This is our feature set from our customer surveys, in language the computer understands.  We have four customers who have answered our three questions.  For example, Row 1 below, which is 101, is the first customer's set of Yes/No answers to our survey questions: the first 1 means Yes to, "Have cat who poops?" The 0 means No to "Drink imported beer?" and the final 1 means Yes to "Visited the LitterRip!.com website?"  

There are 3 more rows (customers and their responses) below that.  These are the set of inputs that we will use to train our network.
```
Line 14 creates the X input (which becomes l0, layer 0, in line 38)
X: 
[1 0 1]
[0 1 1]
[0 0 1]
[1 1 1]

```
Think of each row of this matrix as a training example we'll feed into our network, and each column is one node of our input.  So our Matrix X can be visualized as the 4x3 matrix that is l0 in the diagram below:

![alt text](https://lh3.googleusercontent.com/296hKe7aPkmXPxrLiysqwexLQMvpjBuU_WmUxb2EnoXMTCV2NZNyEumfsTcHsCyn17QxW6SB3nAKxzd2Ssk15kYLbPafhETVkzDb8uQOxmiLQwl0EMqucdlr0cQzBCR9q45MfIM-_Uo_qusjxqItwpuVtVL6H37NE1-I4vxRXP_LcTKQAspuXy6dAH1oYhfHUBBIME7CuPajJCAOubo0G56M8qSrh-LReM9RZoOgfL0kwSM7uO5gt4mEnjCiWimCLDXsp38Ng9ZIlHKxsot6G05P9x-msDRhzAL5bzxijuzh4w854bYItZATqfwGJMJgxklpfaqi9JzMz2h_ihzu-Ln3lCyDcZBfBqv5nE0fDEXEozV4ahgkc_d7z3lWQHh7aZI7mW_kE4PrdpSuHrrReBg1Vd0573M4NrardoSIg405xutOYfoKx6fwsbgE6-BaG9nrRQgQHRqu4-4Lv_n9VaG3BPa5gixlNLHnf2v5imGICK1PLgOq9reftrJ9azac7KCAEk3rGRBpTZf3VS2Xdgm909F6KCu-J2RkEMKyLKBn3kTDzO878YzlQtVqy5PRIr1cube5jqVqpxNsJ0vCSP5qNiik2UHBRTPZ0Wx-djomhEwD2483EOuU5qevYa81W6Q8g0gDhqxuUuEFWviFdjklJtKSTWNq=w973-h535-no)
You may wonder, "How does Matrix X become layer 0 in the diagram above?"  We'll get to that soon.  Next, let's create our list of the four correct answers we want our network to be able to predict.

##3) Create y output: Lines 18-24
This code creates "the truth," or "the future."  Here's what I mean: think of our neural network as a psychic that learns by trial-and-error.  Our psychic can "read our palm," the input layer X above, and from that palm reading she can "predict the future."  Her prediction will be layer 2 (l2, a vector), but the actual, true future is the y vector (a vector is a one-column matrix or array) we are creating here.  For each iteration, our psychic neural network will take input X, make a prediction, l2, and compare it to y, the truth, to see how she did.  She'll learn from her mistakes and do better next time.  60,000 times!  If the network is properly trained, the predicted l2 will approach closer-and-closer to the true future, y, with each iteration.  

To tie this back to our pet shop: y records whether each of the four customers in X actually bought Litter Rip! (1 = yes, 0 = no).  In other words, y holds the answers to the fourth survey question.

To use another metaphor, I also like to think of y as our "target" values, and I picture an archery target.  Once our network can correctly predict these 4 target values from the inputs provided by matrix X above, it is ready to predict in real life.  Think of X above, the input layer, layer 0, as the beginning of our NN.  And y is our truth.  Our truth looks like this:



```
Line 21 creates the y vector, a set of target values we strive to predict.
y: 
[0]
[1]
[1]
[0]

```
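
As a quick check on Steps 2 and 3, here is a minimal sketch of how X and y could be built with NumPy.  The values come from the printouts above and the variable names match the ones used in this article, but this snippet is my own illustration, so the line numbers quoted in this article will not match it.

```python
import numpy as np

# Each row is one customer (training example); each column is one survey question.
X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])

# One target per customer: did they buy?  Written as a row, then transposed into a column.
y = np.array([[0, 1, 1, 0]]).T

print(X.shape)  # (4, 3)
print(y.shape)  # (4, 1)
```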


#4) Seed your random numbers: Lines 25-27
This step is housekeeping.  We seed the random number generator so that the "random" weights we generate in the next step come out the same on every run, which makes debugging easier.  You don't have to understand how this code works; you just have to include it.
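
If you are curious anyway, here is a small sketch of what seeding does.  It assumes the generator is seeded with `np.random.seed(1)`; the seed value 1 is my assumption for illustration, and any fixed integer behaves the same way.

```python
import numpy as np

np.random.seed(1)                 # assumed seed value, for illustration only
first_run = np.random.random(3)   # three "random" numbers

np.random.seed(1)                 # same seed again...
second_run = np.random.random(3)  # ...gives exactly the same three numbers

print(np.array_equal(first_run, second_run))  # True
```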

##5) Create "Synapses" of your brain--Weights: Lines 29-31

When you first look at the diagram below, you might assume the brain of the network is the circles, the "neurons" of the neural network brain.  In fact, the real brain of a neural network, the part that actually learns and remembers, is the synapses, those lines that connect the circles in this diagram.  These 2 matrices, syn0 and syn1, are the brain of our NN.  They are the part of our network that learns by trial-and-error: making predictions, improving the next prediction, then remembering the improvements--learning!

Notice how this code, `syn0 = 2*np.random.random((3,4)) - 1` creates a 3x4 matrix and seeds it with random values.  This will be the first layer of synapses, or weights, Synapse 0, that connects l0 to l1.  It looks like this:


```
Line 30: syn0 = 2*np.random.random((3,4)) - 1: creates synapse 0
syn0: 
[ 0.67534974  0.1809666  -0.96032933 -0.91055814]
[-0.94870047 -0.6558582  -0.25683472 -0.61369466]
[ 0.77928043 -0.53186624  0.87700966  0.11388595]
```
Here's where it fits in our diagram below as "syn0 (3x4)":

![alt text](https://lh3.googleusercontent.com/296hKe7aPkmXPxrLiysqwexLQMvpjBuU_WmUxb2EnoXMTCV2NZNyEumfsTcHsCyn17QxW6SB3nAKxzd2Ssk15kYLbPafhETVkzDb8uQOxmiLQwl0EMqucdlr0cQzBCR9q45MfIM-_Uo_qusjxqItwpuVtVL6H37NE1-I4vxRXP_LcTKQAspuXy6dAH1oYhfHUBBIME7CuPajJCAOubo0G56M8qSrh-LReM9RZoOgfL0kwSM7uO5gt4mEnjCiWimCLDXsp38Ng9ZIlHKxsot6G05P9x-msDRhzAL5bzxijuzh4w854bYItZATqfwGJMJgxklpfaqi9JzMz2h_ihzu-Ln3lCyDcZBfBqv5nE0fDEXEozV4ahgkc_d7z3lWQHh7aZI7mW_kE4PrdpSuHrrReBg1Vd0573M4NrardoSIg405xutOYfoKx6fwsbgE6-BaG9nrRQgQHRqu4-4Lv_n9VaG3BPa5gixlNLHnf2v5imGICK1PLgOq9reftrJ9azac7KCAEk3rGRBpTZf3VS2Xdgm909F6KCu-J2RkEMKyLKBn3kTDzO878YzlQtVqy5PRIr1cube5jqVqpxNsJ0vCSP5qNiik2UHBRTPZ0Wx-djomhEwD2483EOuU5qevYa81W6Q8g0gDhqxuUuEFWviFdjklJtKSTWNq=w973-h535-no)

The function np.random.random produces random numbers uniformly distributed between 0 and 1 (with a corresponding mean of 0.5).  But we want this initialization to have a mean of zero.  Why?  So that the initial weight numbers in this matrix do not have an a-priori bias towards values of 1 or 0, because this would imply a confidence that we do not yet have (i.e., in the beginning, the network has no idea what is going on, so it should display no confidence until we update it after each iteration).  

So, how do we convert a set of numbers with an average of 0.5 to a set with a mean of 0?  We first double all the random numbers (resulting in a distribution between 0 and 2 with mean 1) and then we subtract one (resulting in a distribution between -1 and 1 with mean 0).  That's why you see `2*` at the beginning of our equation, and - 1 at the end: `2*np.random.random((3,4)) - 1`
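
Here is a quick sketch you can run to convince yourself that the double-then-subtract trick works.  The seed and the sample size of 100,000 are arbitrary choices of mine, just big enough for the averages to settle down.

```python
import numpy as np

np.random.seed(0)                  # arbitrary seed so the run is repeatable
raw = np.random.random(100_000)    # uniform between 0 and 1, mean near 0.5
centered = 2 * raw - 1             # uniform between -1 and 1, mean near 0

print(raw.min() >= 0, raw.max() < 1)              # True True
print(centered.min() >= -1, centered.max() < 1)   # True True
print(raw.mean(), centered.mean())                # roughly 0.5 and roughly 0
```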

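Notice that we are generating a 3x4 matrix.  Why?  Because l0 (aka our X matrix) is a 4x3, and matrix multiplication requires the inner 2 size numbers to match, i.e., a 4x3 matrix must be multiplied by a 3x_?_ matrix--in this case, a 3x4.  See how those inner two numbers must be the same?  The outer number, 4, is our own choice: it sets how many neurons the hidden layer l1 will have.  The same logic explains syn1 below: it must start with 4 to match the 4 neurons of l1, and it ends with 1 because we want a single prediction per customer.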

Then this line of code, `syn1 = 2*np.random.random((4,1)) - 1` creates a 4x1 vector and seeds it with random values (depicted with 4 question marks in the diagram).  This will be our NN's second layer of weights, Synapse 1, connecting l1 to l2.  Meet syn1:


```
Line 31: syn1 = 2*np.random.random((4,1)) - 1: creating synapse 1
syn1:  
[ -0.39072641]
[  0.43509921]
[ -0.43520534]
[  0.32941201]

```

Keep an eye on the size of each matrix we are creating (i.e., 4x3, 3x4, 4x4, etc.), because this will become *very* important soon.
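
The shape bookkeeping above can be sketched in a few lines.  Zeros are used instead of real data because only the sizes matter here; the last part deliberately tries a mismatched multiplication to show NumPy complaining.

```python
import numpy as np

l0 = np.zeros((4, 3))     # 4 training examples, 3 features each
syn0 = np.zeros((3, 4))   # 3 inputs feeding 4 hidden neurons
syn1 = np.zeros((4, 1))   # 4 hidden neurons feeding 1 output

l1 = np.dot(l0, syn0)     # (4x3) . (3x4) -> 4x4
l2 = np.dot(l1, syn1)     # (4x4) . (4x1) -> 4x1
print(l1.shape, l2.shape) # (4, 4) (4, 1)

# Inner sizes that do not match cannot be multiplied.
try:
    np.dot(l0, syn1)      # (4x3) . (4x1): 3 != 4
    mismatch_ok = True
except ValueError:
    mismatch_ok = False
print(mismatch_ok)        # False
```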



##6) For Loop: Lines 33-34
This is a for loop that takes our network through 60,000 iterations.  For each iteration, our network will take X, our input data, and based on that data, give its best guess at a prediction of what our y output is.  It will then analyze how it did, learn from its mistakes, and give a slightly better prediction on the next iteration.  60,000 times, until it has learned by trial-and-error how to take the X input and predict accurately what the y output is.  Then our network will be ready to take *any* input data you give it and correctly predict its future!
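
To see where we are headed, here is a compact sketch of the whole trial-and-error loop.  It uses the same names as this article (X, y, syn0, syn1, l0, l1, l2, nonlin), but the seed value and the exact update lines are my assumptions, based on the standard version of this little network, so the line numbers quoted in this article will not match it.  The back propagation lines near the bottom are the subject of the later steps.

```python
import numpy as np

def nonlin(x, deriv=False):
    """Sigmoid; with deriv=True, x is assumed to already be a sigmoid output."""
    if deriv:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
y = np.array([[0, 1, 1, 0]]).T

np.random.seed(1)                          # assumed seed
syn0 = 2 * np.random.random((3, 4)) - 1    # weights between l0 and l1
syn1 = 2 * np.random.random((4, 1)) - 1    # weights between l1 and l2

for j in range(60000):
    # Feed forward: make a prediction.
    l0 = X
    l1 = nonlin(np.dot(l0, syn0))
    l2 = nonlin(np.dot(l1, syn1))

    # How far off were we?
    l2_error = y - l2
    if j % 10000 == 0:
        print("Avg l2_error after", j, "iterations:", np.mean(np.abs(l2_error)))

    # Back propagation: work out the tweaks...
    l2_delta = l2_error * nonlin(l2, deriv=True)
    l1_error = l2_delta.dot(syn1.T)
    l1_delta = l1_error * nonlin(l1, deriv=True)

    # ...and apply them to the synapses.
    syn1 += l1.T.dot(l2_delta)
    syn0 += l0.T.dot(l1_delta)

final_error = np.mean(np.abs(l2_error))
```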


#Now Steps 7-13 Will Cover The 5 Major Concepts of Deep Learning:
These 5 major concepts are interrelated and will appear multiple times below.  Here is a rough order of appearance:
1.   Forward Feed computes error as the distance from the white ball to the red bowl's bottom;
2.   A network learns by trial-and-error;
3.   The Sigmoid, a simple activation function, predicts probability and confidence levels;
4.   Back Propagation tells you which parts of your network to tweak to reduce your error; and
5.   Gradient Descent: How the synapses, rather than the neurons, are the core of your network's "brain."  

##7) Forward Feed computes error in our prediction as the yellow arrow: the distance from the white ball to the white grid under red bowl
This is where our network makes its prediction. Think of l0, l1 and l2 as 3 matrices that are the "neurons" that combine with the "synapses" matrices we created in #5 to predict, compare to y, learn from mistakes, improve.  

This is an exciting part of our deep learning process, so I'm going to teach this same deep learning process from three perspectives: 

1.   First, I will tell you a spellbinding fairy tale of feed forward;
2.   Second, I will draw stunningly beautiful pictures of feed forward; and
3.   Third, I will open up the hood and show you the matrix multiplication that is the engine of feed forward.

I'm Irish.  Who doesn't love a good story?  My mentor Adam Koenig suggested the following analogy, which I have ridiculously exaggerated into a fairy tale, because **I am an *artiste*:**

##*The Princess and The Castle*, Chapter 1 of 4: The Feed Forward Network

Imagine yourself as a neural network.  You happen to be a neural network with a valid driver's license, and you're the type of neural network that enjoys fast cars and hot romance.  You eagerly wish to meet The Love Of Your Life.  Well, Miracle of Miracles, you have just found out that if you drive to a certain castle, your Prince/Princess Charming is waiting to meet you for the first time, sweep you off your feet, and live happily ever after.  Joy!

(Hint: Princess' castle = l2 error of zero = yellow arrow height of zero = bottom of red bowl)

Needless to say, you're fairly motivated to find the princess's castle.  After all, the princess is our y vector: she is The Truth, the future we're trying to predict.  Unfortunately, finding her castle is going to require some patience and persistence, because you have already attempted to drive to the castle thousands of times and you keep getting lost.  

(Hint: thousands of trips = iterations, "keep getting lost" = your error in your l2 prediction leaves you a distance from the Castle of Truth, y, the fourth question, which equals the height of the yellow arrow)

But there is fabulous news:  you know that every day, with every driving trip, you're getting closer-and-closer to the Princess (who is of course The Truth, the y we strive to predict).  The bad news is, alas, each time that you don't arrive at her castle, POOF! you wake up the next morning back at your house (Matrix X, Layer 0, our input features, the 3 survey questions) and have to start again from there (a new iteration).  It looks a bit like this:

![alt text](https://lh3.googleusercontent.com/NvKKJXSOSSYZSn0N7swTG5fIigN519BM19xQtFzmnHVPdiaF-TDutS3oRfwTtA9165otPBJplP9nnvbS2g7Ah1FDmJtOAWr6Nk_z3lM_CPkTJoMjs01VN7kGU_B3SaLNQqeo0Ka5r9Jm3B8SKsyzA0UOmWRd1k7MwOUvuIC06Mm-l5b-YiRmypMVpL18X6Y2g-MEhm5ciw_QUHYrYShY0hhszJHkx3UHIIsRlTAB8WnLZG8PNvVOmB_qxxjadYHzufsJU-S62w36nnDA1fey_vHionTEx8v8eTy_qo98rfw24yY6Wlk4DjKSyBYHAcBdz7RJgtTWNp8uiQXDPnMZcLePtdIDQwcg1KJlHlgU9baiNDYH4bAlqETs9IeEpIYUzI3tL2EKXtzcbzXrARi3Lbkb7lEDqbYlk66jWdWjNoL0JCz3RCkyo5KhDzLA-tA2wIsjHZjxOhnfn0kck7hUMQIUIPKl7oBeSW6kFEzcpCcC3d9yEa3TXAw1Rh5bSzyv-V8sEqVUYlh1Con0JP0OX7QypP8-fjZ7-HhmLQzCqXk4uMkAIeGnbsw2JnhILI45NJrEvgSwmlqWfTfukdv1604XWdUMrvTKPTlh6DnqouLwuTrIW2pKmwFdlQBq-hB6YoU2pX0B1fMomw2l7A6WaDvbmSz9G6Bo=w970-h768-no)

Fortunately, this story has a happy ending, but you will have to keep attempting to arrive at her castle and correcting your route for, say, another 58,000 trials (and errors!) before you fall into her arms.  Don't worry, it'll be worth it.

Let's take a look at one of those drives (an iteration) from your house, X, to her castle, y.  Each trip you make is the feed forward pass of lines 36-40, and each day you arrive at a new place, only to discover it is NOT the castle.  Drat!  Of course, you want to figure out how to arrive a little closer to your beloved on the next drive.  I'll explain that in Steps 8 and 10 below.  Stay tuned for *The Princess and the Castle, Chapter 2: Learning from Your Errors.* 

Above is an analogy for the feed forward process.  Below is a view under-the-hood of the math that makes it happen.  

We're going to walk through one example of one weight only, out of the 16.  The weight we will study is the tippy-top line of the 12 lines of syn0, between the top neurons of l0 and l1, and we will call it syn0,1 (technically it is syn0(1,1), for "row 1, column 1 of the matrix syn0").  Here's what it looks like:

![alt text](https://lh3.googleusercontent.com/CDts4VOB4z3Quo58n7Sz9MvVLFZk-UHniYTX4r4CIOP0e0xRjoy9Un67N73KX492-gq91XB5KSMjDY3WBKPNZ5yYrLA6NURJ_crOYKpcQUILfQNavN8_WQDpBWxkkJ3g5mWNDhstKQ1IiJ6Rk_6SifQFk-qiRKH34usHsWj8-HXuBDEiwKcz8Aq-s0Bf3JclMJZhqbpdHllSi-ZH2qFc9UZx_1NVHSw02h33j3Pncnl22AIvw-84YGRk8ALriyYXHLYEiqK_Vuk19bjEQxFSBvOyOeUshMmlStpiFVR452QsWaKahKYF7xemmqT_Ret7ck0uOCFfcN0VLN9UxoJwT1k44C1ugs9HJ5xbhcMXlZcXjA8d8M2L3DVpJqxdLMhEsxHqDv34e3SSb9MUMGyMvrwRQgAAbNPILRBlaJQw_eT_Gr6BBcCyoqWbow9LNorEyv0gHYo-qj3q4naltOM_06lsQ-Cb5369RXff_TT-QjSg59oL45ss-mrD2FtG92n_3Hi9T_XXRAzLWMxeHydPEcPAQQKeudWQ_9RxwhW_tBATGbRpNDA3ELb0gUw3fCy3v7YWLZs26IITnkWbwLHAlN7hYKiQZUaZwgzxMgDX9r2LcV_1MYeqcF4Pspx4NEhXjExK3QzCWx1ZZ9ZLQwuRxOrz-WWAp7MS=w973-h722-no)

Why are the circles representing the neurons of l1 and l2 divided down the middle?  The left-hand side (depicted in the variables I use with an "LH") is the value used as input for the sigmoid function, and the right-hand side is the output of the sigmoid function: l1 or l2.  In this context, recall that the sigmoid is simply taking the product of the previous layer times the previous synapse and "squishing" it down to a value between 0 and 1.

OK, here is Feed Forward using one of our training examples, row 3 of Matrix X, aka l0: [1,0,1].  Here's what it looks like:

![alt text](https://lh3.googleusercontent.com/_1cLOu2Rxc7xchgANum00LLlcGEectc2ffdXbpB4VEZX2cT_8czgF8PebXO7R_9WNj3TBDB6AearSjfszEednS_9GvXQ1RCmfPG9cdOQFbkDsefjx2MPgrgCfuzBPLLcbPEv9ZXl1fjv9_MBGzY5KOtlo0mW2iy1xNcQTkcWmUUyiN5MAKhRVeolOEQ8s-Ct7J0Kgd2YYspn4u5D_EyFuLdlfNTCGPyEdm8YQP5FDPxNNwtzm9Yv_LIFIlpFtF5aOZgzTWoqsG08Tr6MDoXNpLVT6Yk8xKEpLJsZiejHA4e9dzHExEJKPWO2dtUnzDSBBlMs-Vvfw2ECHi6Q5M3WLkuA2aAM72e2pKDgtZnM1b85Zktx54jF-tktJAJg6LAB6YJOsUBQI67tW5O3tt-C3Gu9Td5K0bwtChbvXfFeEMIU5SLp1yKGq-bybnVsUWZ0JegxCZeFjZefvN-2qCcYjJSn92BKS2qHRen0prilTfs9kcNhNAdwaPWM5vh-IOB1085hQHNSdg1Qnk0sZk3djKYnplktOvbNGwXoY0x7-8yfBzk00-awZb-Sqt7ngb_8IvXyRAMo7iS8KgnIKuGOp-yqCvCxt3Zc2hORrWfY06Omz1TZXQmWlYtIyJPGrU6yyqdvxYABz4VlCj-AxXke0WeZMgNgfmdU=w973-h722-no)

Here's what Feed Forward looks like in pseudo-code, and you can follow the forward, left-to-right process in the diagram above.  (Note that I add "LH," meaning left-hand, to specify the left half of the circle we are using in our diagrams to depict a neuron in a given layer; so, "l1_LH" means, "the left half of the circle representing l1," which means, "before the product has been passed through the nonlin() function.")

```
l1_LH = l0 x syn0 (don't forget to add the products of the other l0 values x the other syn0 values) ->  

nonlin(l1_LH) = l1  ->  l2_LH = l1 x syn1 (again, add the products of the other syn1 multiplications) ->  

nonlin(l2_LH) = l2  ->  y-l2 = l2_error
```
Again: nonlin() is the part of the Sigmoid function that renders any number as a value between 0 and 1.  It is the code, `return 1/(1+np.exp(-x))`.  It does not take slope.  But in back prop, we're going to use the *other* part of the Sigmoid function, the part that does take slope, i.e., `return x*(1-x)`, because you will notice that lines 57 and 71 specifically ask the Sigmoid to take slope by passing the argument `deriv=True`.
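
Putting those two pieces together, here is a sketch of what the nonlin() function probably looks like.  The two `return` expressions are quoted from this article; the way they are wrapped into one function with a `deriv` flag is my reconstruction, so treat it as an assumption.

```python
import numpy as np

def nonlin(x, deriv=False):
    if deriv:
        # Slope of the sigmoid, given that x is ALREADY a sigmoid output (like l1 or l2).
        return x * (1 - x)
    # Squish any number into the range 0 to 1.
    return 1 / (1 + np.exp(-x))

print(nonlin(0))                # 0.5
print(nonlin(0.5, deriv=True))  # 0.25
```

Notice that the slope version expects a value that has already been through the sigmoid, which is why back prop passes it l1 or l2 rather than the raw products.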

I artificially assigned syn0,1 a beginning value of 2.  2 is just a value we made up; it could be any number, but hey--ya gotta start somewhere, right?  Let's walk through the math of Feed Forward slowly: 
l0 x syn0 = l1_LH, so in our example 1 x 2 = 2, but don't forget we have to add the other two products of l0 x the corresponding weights of syn0.  In our example, l0,2 x syn0,2 = 0 x something = 0, so it doesn't matter.  But l0,3 x syn0,3 *does* matter because l0,3 = 1, so let's just make up a simple, convenient value for syn0,3 of 3.  Therefore, l0,3 x syn0,3 = 1 x 3 = 3.  Our product of l0,1 x syn0,1 + our product of l0,3 x syn0,3 = 2 + 3 = 5, and 5 is l1_LH.  Next, we have to run l1_LH through our nonlin() function to create a probability between 0 and 1.  Nonlin(l1_LH) uses the code, `return 1/(1+np.exp(-x))`, so in our example that would be: 1/(1+(2.718^-5)) = 0.99, so l1 (the RH side of the l1 node) is 0.99.

So, what just happened above?  The computer used some fancy code, `return 1/(1+np.exp(-x))`, to do what we could do manually with our eyeballs--it told us the corresponding y value of x=5 on the sigmoid curve as pictured in this diagram:

![alt text](https://iamtrask.github.io/img/sigmoid.png)

(taken with gratitude from: 
[Andrew Trask](https://iamtrask.github.io/2015/07/12/basic-python-network/))

Notice that, at 5 on the X axis, the corresponding point on the blue, curved line is about 0.99 on the y axis.  Our code converted 5 into a statistical probability between 0 and 1.  It's helpful to visualize this on the diagram so you know there's no abstract hocus-pocus going on here.  The computer did what we did: it used math to "eyeball what 5 on the X axis would be on the Y axis of our diagram."  Nothing more.

Now, rinse-and-repeat: we'll multiply our l1 value by our syn1,1 value.  l1 x syn1 = l2_LH, which in our example would be 0.99 x 3 (3 is a value we just made up because hey--ya gotta start somewhere) = 2.97.  But again, don't forget that to 2.97 we have to add all the products of all the other l1 neurons times their corresponding syn1 weights, so for simplicity's sake we'll just pretend those all added up to -2.  So you end up with -2 + 2.97 = 0.97, which is l2_LH.  Next we run l2_LH through our fabulous nonlin() function, which would be: 1/(1+2.718^-0.97) = ~0.7, which is l2, which is our very first prediction of what the truth, y, might be!  Congratulations!  You just completed your first forward feed!

Now, let's assemble all our variables in one place, for clarity:
```
l0=1
syn0,1=2
l1_LH=5
l1=0.99
syn1,1=3
l2_LH=0.97
l2=~0.7
y=1 (this is value 3 of vector y, which corresponds to training example #3, row 3 of l0)
l2_error = y-l2 = 1-0.7 = 0.3
```
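
If you would like to check that arithmetic without a calculator, here is a small sketch that replays the same single-path example.  The weights 2 and 3, and the lump sum of -2 for the other l1 x syn1 products, are the made-up values from the walkthrough above, not values from a trained network.  Because the code does not round along the way, its numbers can differ slightly in the second decimal place from hand arithmetic.

```python
import numpy as np

def nonlin(x):
    return 1 / (1 + np.exp(-x))     # sigmoid

l0_row = np.array([1, 0, 1])        # training example #3
# syn0,1 = 2 and syn0,3 = 3; the middle weight is multiplied by 0, so 0 is just a placeholder.
syn0_col1 = np.array([2, 0, 3])

l1_LH = np.dot(l0_row, syn0_col1)   # 1*2 + 0*0 + 1*3 = 5
l1 = nonlin(l1_LH)                  # sigmoid of 5

other_products = -2                 # pretend sum of the other l1 x syn1 products
l2_LH = l1 * 3 + other_products     # syn1,1 = 3
l2 = nonlin(l2_LH)                  # our prediction

y = 1                               # the truth for training example #3
l2_error = y - l2

print(l1_LH, l1, l2_LH, l2, l2_error)
```
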
OK, let's now take a look at the matrix multiplication that makes this all happen (for those of you who are rookies to matrix multiplication and linear algebra, Grant Sanderson teaches it brilliantly, with lovely graphics, in [14 YouTube videos](https://www.youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab).  Watch those first, then return here).

First, on line 39 we multiply the 4x3 l0 and the 3x4 Syn0 to create (hidden layer) l1, a 4x4 matrix.  
```
X (aka,l0):      syn0:
[0 0 1]          [ 5.67534974  5.1809666  -6.96032933 -4.91055814]
[0 1 1]     X    [-3.94870047 -6.6558582  -7.25683472 -4.61369466]   =
[1 0 1]          [ 1.77928043 -2.53186624  2.87700966  7.11388595]
[1 1 1]
(multiply row 1 of l0 by column 1 of syn0, then row 1 by column 2, etc.)

Product of l0 x syn0:
[  1.77928043  -2.53186624   2.87700966   7.11388595]
[ -2.16942003  -9.18772444  -4.37982506   2.50019128]
[  7.45463017   2.64910035  -4.08331967   2.20332781]
[  3.5059297   -4.00675784 -11.34015439  -2.41036686]

Now we pass it through the "nonlin()" function in line 39, which is a fancy math expression you don't need to understand: "1/(1 + 2.718281^-x)=" Just trust me that it gives us values between 0 and 1, and I'll explain it more in Step 10:

This is layer 1, the hidden layer of our neural network:
l1:  
[8.55607991e-01 7.36542129e-02 9.46698170e-01 9.99186935e-01]
[1.02530388e-01 1.02276900e-04 1.23725522e-02 9.24155229e-01]
[9.99421579e-01 9.33955520e-01 1.65721667e-02 9.00547951e-01]
[9.70856017e-01 1.78672362e-02 1.18857985e-05 8.23855801e-02]

Note that each value is written in exponential notation, so the first value is the same as 0.855607991.  And don't make the same mistake I made at first: that hyphen in the "e-01" does NOT mean your number is negative.  If this number were negative, it would be written as, "-8.55607991e-01" with the negative sign out front. Rather, the sign behind the e tells you whether you're moving your decimal point in a negative direction, i.e., to the left, or a positive direction.  For example, "8.55607991e+01" is 85.5607991.
```
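
You do not have to do this multiplication by hand.  As a sketch, the snippet below feeds the same X and the same syn0 values shown above into NumPy and reproduces both the product and l1.  The inline `nonlin` is a bare-bones stand-in for the article's function.

```python
import numpy as np

def nonlin(x):
    return 1 / (1 + np.exp(-x))

l0 = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
syn0 = np.array([[ 5.67534974,  5.1809666,  -6.96032933, -4.91055814],
                 [-3.94870047, -6.6558582,  -7.25683472, -4.61369466],
                 [ 1.77928043, -2.53186624,  2.87700966,  7.11388595]])

product = np.dot(l0, syn0)   # (4x3) . (3x4) -> 4x4
l1 = nonlin(product)         # every value squished between 0 and 1

np.set_printoptions(suppress=True, precision=4)  # plain decimals instead of e-notation
print(product)
print(l1)
```
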
If you find yourself feeling faint at the mere sight of matrix multiplication, fear not.  We're going to start simple, and break down our multiplication into tiny pieces, so you can get a feel for how this works.  Let's take one, single training example from our input.  Until now, we've been talking about l0 as a 4x3 matrix.  For a clearer, simpler example we will take only one row of it.

We're going to take the first row/example from that matrix, which would be [0,0,1].  In other words, a 1x3 matrix.  (Note that this is a different training example from the [1,0,1] we used in the walkthrough above.)  We're going to multiply that by syn0, which would still be a 3x4 matrix, and our new l1 would be a 1x4 matrix.  Here's how that simplified process can be visualized:
```
(multiply row 1 of l0 by column 1 of syn0, then row 1 by column 2, etc.)

row 1 of l0:     col 1 of syn0:
                 [ 5.67534974]
[0 0 1]    X     [-3.94870047]   =   (0 x 5.67534974) + (0 x -3.94870047) + (1 x 1.77928043)   =   1.77928043
                 [ 1.77928043]

Repeat with cols. 2, 3, and 4 of syn0 to fill in the rest of the 1x4 product:
[ 1.77928043  -2.53186624   2.87700966   7.11388595]

Then do the same with rows 2, 3, and 4 of l0 to get the full 4x4 product,
pass that 4x4 product through "nonlin()" and you get the l1 values
l1:  
[8.55607991e-01 7.36542129e-02 9.46698170e-01 9.99186935e-01]
[1.02530388e-01 1.02276900e-04 1.23725522e-02 9.24155229e-01]
[9.99421579e-01 9.33955520e-01 1.65721667e-02 9.00547951e-01]
[9.70856017e-01 1.78672362e-02 1.18857985e-05 8.23855801e-02]
```
Note that, on line 39, we pass the product of l0 and syn0 through the Sigmoid function to produce l1, because we need l1 to have values between 0 and 1, hence: `l1=nonlin(np.dot(l0,syn0))`
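
To see the one-row version in code, here is a sketch that pushes only the first training example through syn0 and then confirms it matches row 1 of the full, stacked calculation.  The syn0 values are the ones printed above; `nonlin` is a minimal stand-in for the article's function.

```python
import numpy as np

def nonlin(x):
    return 1 / (1 + np.exp(-x))

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
syn0 = np.array([[ 5.67534974,  5.1809666,  -6.96032933, -4.91055814],
                 [-3.94870047, -6.6558582,  -7.25683472, -4.61369466],
                 [ 1.77928043, -2.53186624,  2.87700966,  7.11388595]])

row1 = X[0:1]                          # [[0 0 1]], kept 2-D so it is a 1x3 matrix
l1_row1 = nonlin(np.dot(row1, syn0))   # (1x3) . (3x4) -> 1x4
l1_full = nonlin(np.dot(X, syn0))      # (4x3) . (3x4) -> 4x4

print(l1_row1.shape)                        # (1, 4)
print(np.allclose(l1_row1[0], l1_full[0]))  # True
```

Because row 1 is [0 0 1], its product with syn0 simply picks out the third row of syn0, which is a handy way to sanity-check the numbers.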

#How the Sigmoid Function Gives Us Probability and Confidence Levels
It is on line 39 that we see ***Big Advantage #1*** of the ***Four Big Advantages of the Sigmoid Function.***  When we pass the dot product matrix of l0 and syn0 through the `nonlin()` function, the sigmoid converts each value in the matrix into a statistical probability between 0 and 1.  In the case of your superior feline hygiene product, this means, "the closer the value is to 1, the more certain it is that the customer will buy Litter Rip!, whereas the closer the value is to 0, the more certain it is that Litter Rip! will remain untouched by the (ungrateful! unfeeling!) customer."  As I mentioned above, 0.2 means "Probably won't buy," 0.8 is "Probably will buy," and 0.999 is, "You're damn right they'll buy!"

"OK," you might be thinking, "So I can eyeball the values in l1 above and deduce whether they are predicting a probable 1 or a probable 0.  So, what?"  Well, it doesn't matter in lines 39 and 40, but it matters a *ton* when we hit line 61 and beyond.  Stay tuned.

Here's where these values appear in our picture of neurons and synapses:

![alt text](https://lh3.googleusercontent.com/QtRwLfpxhGoXP6_bwxjlKdn0CSirkxyfuR1EKOaG94HSMmR3sjCwX5ueB3SPR6xvq6L997dTjvHKH67pnFGFXIeesj9iMCBGT46RINqOr1OtH0MWdqcvGn6K9NcIMv-ahzl1Yy28FlXF2qqnZ4WZ-rFRu4BDGha1pPgxKbcGaFsoN9tQ5LoS5r66D2Jho5ejjXXUMd3M45OFvoTIEtIk2l2LXdxF7X85yXST5iDiXhMBajh4cp65b7UrOd4qlSEZlk-t7MAE1ZleppKEzTGOFNoXoVfRJQZf1F7KvvlNqldydwn93FWLBB1kwMrLSFZkqvpX8mmc5aQxpJowev6FXJuG2HDSMNhuGtLpWBMJCq3eZM126bo_iAzxvzxakl6BksyUNR9IIqh61dbLYq2EHcxDtP4l68OeI5tWRnFBEU-b_jXWz4BdEkCmEXICgYpaHVdQnnuwiy74KmUjKW3Kvac1xaqb1P4xNl3Mr67COckWvi2sRqPY-scqnpHQ7QCstxLJTopN9QJAXmYO6eC0YFxhg4uFB4nwvep9kWScIJdD-cs0siu6w62VPyx1CEVQk2CbTpt8uCkiRmrUdRhPxV5s29OXhQ8cvS6HtCRaAXcQ8wYayRcRkn8i9dc7LZNQLGqHpO8DAEkQvdVC3b3Gce339fJBgbLj=w970-h710-no)

So, the above diagram gives you a picture of how row 1 of our input X, which is one of four training examples, feeds through its first step in a NN.  But you may recall that we have 4 examples, not one.  The diagram below shows the same training example, [0 0 1] in its natural habitat--stacked on top of the other 3 training examples.  Once you can visualize how the neurons of example 1 (i.e., row 1 of Matrix X) are simply stacked on top of the other 3 rows/training examples of Matrix X, then you have the key to The Kingdom: hopefully, now you could visualize even a typical NN, with thousands or millions of neurons, containing thousands or millions of examples, stacked on top of each other.  The same image above still holds--there's just a whole lot of stacking going on!  These stacks are referred to as a "full batch" configuration, which is a very common model.

![alt text](https://lh3.googleusercontent.com/csNXEukbaM4zJP6fg3AmE6HWsR3QNOK9ZSgY-PJ72a3D_WRqBUEMrVi3S760aHxd0krvHk6_fGM6s3RfWXueb1FZeGv8dw5ClJ3zcBNoPE6QHxQCbdlJM7myAGXuIpUtWqbhnqQ3Y7QFf93klMO3u1OJrhnEMo_ykyNbfrJmBr-Ay7X-uCqV0eUqKZ4FwmurAo1xGAvXZ9uFxNRXWX7w8LWYY9LlAXi7jogTedzApJUcj1fSB4lcGp9FAvBacrFlAqZpMvV8JEv3ad1yczvWCmY2z0CGG8x7krXFEC-OTY04gZNwXfYM5lIzAjPIH12rvUBbBBzEZ4v0sG5rYVlP8xsYEeIFmmEWQDNnjDdftfwc2jcQlq3Xemyh4tOuKg8j-hj4vgQLfamI4TI7KS-cSB9HUZH-VujwGnUTZKtGa150hYtu4AWcUG16h95TirKEXM3RaXPyEQErKCYWRLggb8fblUVqjvIRLz90UxXmgcgoL5cS0e8fHGHJwy8B_udVzK5zAly1-C3fTG0V0T8oUwprUAdaBcWURn2NC1wXX9efF_B3enpLhtk8E40FWcFPk6ae0WXFIPcUBOvSjflw1c98P7Bw_wunp8V8Vxsy-CI_CeK11NX1BHVvmRkcoXYN70LyN4fPd8A3nenmfoEMHEvDPm67esoj=w968-h923-no)

Exactly the same thing happens on line 40, as we take the dot product of 4x4 l1 and 4x1 syn1 and then run that product through the Sigmoid function to produce a 4x1 l2 with each value becoming a statistical probability from 0-1.


```
l1 (4x4):                                                         syn1 (4x1): 
[8.55607991e-01 7.36542129e-02 9.46698170e-01 9.99186935e-01]     [ -9.39072641]
[1.02530388e-01 1.02276900e-04 1.23725522e-02 9.24155229e-01]  X  [  9.43509921]  =
[9.99421579e-01 9.33955520e-01 1.65721667e-02 9.00547951e-01]     [-12.43520534]
[9.70856017e-01 1.78672362e-02 1.18857985e-05 8.23855801e-02]     [ 10.32941201]

Then pass the above 4x1 product through "nonlin()" and you get l2, our prediction:
l2: 
 [1.52039467e-04]
 [9.99781882e-01]
 [9.99801142e-01]
 [3.04170696e-04]
   ```
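
The same check works for this second step.  This sketch takes the l1 and syn1 values printed above and lets NumPy produce l2; `nonlin` is again a minimal stand-in for the article's function.

```python
import numpy as np

def nonlin(x):
    return 1 / (1 + np.exp(-x))

l1 = np.array([[8.55607991e-01, 7.36542129e-02, 9.46698170e-01, 9.99186935e-01],
               [1.02530388e-01, 1.02276900e-04, 1.23725522e-02, 9.24155229e-01],
               [9.99421579e-01, 9.33955520e-01, 1.65721667e-02, 9.00547951e-01],
               [9.70856017e-01, 1.78672362e-02, 1.18857985e-05, 8.23855801e-02]])
syn1 = np.array([[-9.39072641], [9.43509921], [-12.43520534], [10.32941201]])

l2 = nonlin(np.dot(l1, syn1))   # (4x4) . (4x1) -> 4x1 prediction
print(l2.shape)                 # (4, 1)
print(l2)
```
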
We have now completed the Feed Forward portion of our network.  If you can visualize what we have done so far, both in terms of the matrices involved, and also as layers of "neurons" and the "synapses" connecting those neurons, then you have done outstanding work.  Bravo to you.

Our next goal, in Step 8, is to find out by how much we missed our target truth y, the princess' castle.  Well, turns out that in our one-weight example we missed by 0.3.  But any distance between us and our beloved princess is too much, so how can we reduce that l2_error of 0.3 to put us finally in her arms?  The back propagation step below will soon teach us the exact amount we want to increase/decrease syn0,1 in order to decrease l2_error and firmly embrace our beloved.

##8) By How Much Did We Miss the Target? Lines 42-45
```
l2_error = y - l2
```
The 4x1 y vector is our goal, our target.  Given our input X of layer 0, we want to produce an output, layer 2, that is as close to the 4 values of y as possible.  Each one of our 60,000 iterations should bring us, by trial-and-error and learning from our mistakes, closer to the 4 target values of y.  So, for each iteration, we take our best prediction so far, the 4x1 vector l2, and subtract it from the 4x1 vector y.  The remainder is l2_error, i.e., how much each value of l2 missed its target value in y.  

In Step 7, you might say that we made our first try, or trial--as in "trial-and-error."  This is our first attempt at a prediction of what y, the truth, might be.  Step 8 is the exciting first step of figuring out our "error," in the learning process of our NN.  Once we know what we missed by, in the steps below we will seek to correct that error and do better next trial.

But before we move on to more steps, let's take a careful look at what we have in l2_error.  There's a lot of important information here.  y has 4 values, l2 has 4 values, and we subtracted each value of l2 from the corresponding value of y.  We ended up with 4 values in l2_error: 4 "misses" of the y target.

So, what? You may ask.  Well, consider: some of those misses were quite small.  Our l2 prediction was pretty close to correct, so when we subtract that l2 value from its corresponding y value, the remainder in l2_error is a small number; a Small Miss.  But there were also Big Misses.  Pay close attention to those big misses, because they will matter a *lot* in the steps below:

```
l2_error = y - l2      (Note that this example is already after 10,000 iterations, so the numbers are relatively small.)

y:        l2:           l2_error:    Relatively speaking...
[0]      [0.00015]     [-0.00015204] a small miss
[1]  _   [0.9998]   =  [ 0.00021812] a Big Miss 
[1]      [0.9998]      [ 0.00019886] a small miss
[0]      [0.0003]      [-0.00030417] a Big Miss
```
Why do we care about the big misses?  Because correcting the bigger misses improves our network's accuracy faster and cheaper than messing around with the small misses.  If it ain't broke, don't fix it.  Throughout our network, we want to focus on the Big Misses and the Low Confidence/Wishy-Washy/Big Slope Ratio numbers (which I'll explain below).  For now, just remember this key point: the Big Misses matter bigtime.
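
Here is a small sketch of that subtraction using the l2 values from the end of Step 7.  The ranking by absolute size is my own addition, to make it easy to see which examples are the Big Misses.

```python
import numpy as np

y = np.array([[0], [1], [1], [0]])
l2 = np.array([[1.52039467e-04],
               [9.99781882e-01],
               [9.99801142e-01],
               [3.04170696e-04]])

l2_error = y - l2                        # positive: guessed too low; negative: guessed too high
miss_size = np.abs(l2_error).ravel()     # how far off, ignoring direction
biggest_first = np.argsort(-miss_size)   # row indices (counting from 0), biggest miss first

print(l2_error.ravel())
print(biggest_first)   # [3 1 2 0]
```

Counting from 1 instead of 0, that ranking says rows 4 and 2 are the Big Misses, matching the table above.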


##9) Print Error: Lines 47-51
Line 50 is a clever parlor trick to have the computer print out our l2_error every 10,000 iterations.  The line, `if (j% 10000)==0:` means, "If your iterator is at a number of iterations that, when divided by 10,000, leaves no remainder, then..."  `j%10000` would have a remainder of 0 only six times: at 0 iterations, 10,000, 20,000, and so on to 50,000 (the loop stops just short of 60,000).  So this print-out gives us a nice report on the progress of our NN's learning.

The code `+ str(np.mean(np.abs(l2_error))))` simplifies our print-out by taking the absolute value of each of the 4 values, then averaging all 4 into one mean number and printing that.  Here's an example:
```
Avg l2_error after 10,000 iterations: 0.00021829659275871905     (Not bad, huh?  :-)
```
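
Both tricks can be tried on their own.  In this sketch the loop does no training at all; it only shows when the `%` test fires, and then computes the one-number summary for the l2_error values from Step 8.  I am assuming the loop is written as `range(60000)`, which counts from 0 to 59,999.

```python
import numpy as np

# Which iterations trigger the print-out?
checkpoints = [j for j in range(60000) if (j % 10000) == 0]
print(checkpoints)   # [0, 10000, 20000, 30000, 40000, 50000]

# Boil four misses down to one average miss.
l2_error = np.array([[-0.00015204], [0.00021812], [0.00019886], [-0.00030417]])
avg_miss = np.mean(np.abs(l2_error))   # drop the signs, then average
print("Avg l2_error: " + str(avg_miss))
```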

#How Back Propagation Tells You What Will Reduce Your Error (Steps 10-12)
##10) In What DIRECTION is y?  The Precursor to Back Propagation: Lines 53-57  
```
 l2_delta = l2_error*nonlin(l2,deriv=True)
```
Now we have entered the brain of the beast; here is the secret sauce of Deep Learning.

Let's bust two myths, shall we?  
Myth #1: Back Propagation is Super-Hard

False.  Back propagation requires patience and persistence.  If you read a text on back prop once and throw up your hands because you understand nothing, you're done.  But if you watch Grant Sanderson's [video 3 on back prop](https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi) 5 times at reduced speed, then watch [video 4 on the math of back prop](https://www.youtube.com/watch?v=tIeHLnjs5U8&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&t=320s&index=5) five times, you'll be well on your way.  Grit.

Myth #2: To Understand Back Prop, You Need Calculus

There are many online posts which state with great authority that you need multivariable calculus in order to understand AI.  Not true.  Deep Learning author and guru [Andrew Trask](https://www.cs.ox.ac.uk/people/andrew.trask/) says that, even if you took three semesters of college-level calc, only a tiny subset of that material would be useful for learning back propagation: the Chain Rule.  But, even if you took those 3 semesters of college calculus, the chain rule is often presented very differently in college from the way you would use it in back propagation.  

So, bottom line?  Don't make the same mistake I did: I panicked every time I saw the word, "derivative," and it was self-defeating.  You must fight that inner voice saying, "I don't have the background to master this."  There are workarounds--simple ways to do calculus without calling it "calculus."  But there is no workaround for grit. 

Here is my favorite saying: "There is the task, and there is the drama ***about*** the task."  
Leave your drama here now.  Please give me your grit, and your trust.  Let's learn back prop.

#The Big Picture of Back Propagation: In What Direction?  And, by how much?
What is the purpose of back prop?  To find the best set of adjustments with which to tweak our network so that it gives a better prediction in the next iteration.  In other words, certain values in certain matrices in our network need to be adjusted to give a better prediction.  To adjust each of those numbers, we must answer the two key questions:

1) In what direction do I adjust the number?  Do I increase the value, or decrease it?  Positive direction, or negative? and

2) By how much do I increase or decrease the number?  A little, or a lot?

We will examine these two basic questions in great detail below.  But first:
#Exactly what *is* this network we're going to tweak?

When you first look at the diagram below, you might assume the brain of the network is the circles, the "neurons" of the neural network brain.  In fact, the real brain of a neural network, the part that actually learns and remembers, is the synapses, those lines that connect the circles in this diagram.  We have control over 16 variables in our network: 12 variables in the 3x4 matrix syn0, and 4 variables in the 4x1 vector syn1.  Look at this diagram and understand that every line (aka, "edge" or synapse) you see represents one variable, containing one number, aka one weight.  
![alt text](https://lh3.googleusercontent.com/296hKe7aPkmXPxrLiysqwexLQMvpjBuU_WmUxb2EnoXMTCV2NZNyEumfsTcHsCyn17QxW6SB3nAKxzd2Ssk15kYLbPafhETVkzDb8uQOxmiLQwl0EMqucdlr0cQzBCR9q45MfIM-_Uo_qusjxqItwpuVtVL6H37NE1-I4vxRXP_LcTKQAspuXy6dAH1oYhfHUBBIME7CuPajJCAOubo0G56M8qSrh-LReM9RZoOgfL0kwSM7uO5gt4mEnjCiWimCLDXsp38Ng9ZIlHKxsot6G05P9x-msDRhzAL5bzxijuzh4w854bYItZATqfwGJMJgxklpfaqi9JzMz2h_ihzu-Ln3lCyDcZBfBqv5nE0fDEXEozV4ahgkc_d7z3lWQHh7aZI7mW_kE4PrdpSuHrrReBg1Vd0573M4NrardoSIg405xutOYfoKx6fwsbgE6-BaG9nrRQgQHRqu4-4Lv_n9VaG3BPa5gixlNLHnf2v5imGICK1PLgOq9reftrJ9azac7KCAEk3rGRBpTZf3VS2Xdgm909F6KCu-J2RkEMKyLKBn3kTDzO878YzlQtVqy5PRIr1cube5jqVqpxNsJ0vCSP5qNiik2UHBRTPZ0Wx-djomhEwD2483EOuU5qevYa81W6Q8g0gDhqxuUuEFWviFdjklJtKSTWNq=w973-h535-no)
These 16 weights are all we can control.  l0, our input, is fixed and unchanging.  l1 is determined *exclusively* by the weights in syn0 by which you multiply the fixed values of l0.  And l2 is determined *exclusively* by the weights in syn1 by which you multiply l1.  Those 16 lines pictured above, the synapses, the weights, are the only numbers you can tweak to achieve your goal, which is an l2_error that gets smaller and smaller until l2 almost equals y.  l2_error is what we call your "cost," and back propagation is the tool that we use to figure out how to tweak the network to reduce the cost as much as possible, as quickly as possible.
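If you'd like to count those 16 weights yourself, here is a minimal sketch in NumPy.  Please treat the details as assumptions for illustration: the 4-customer input `l0`, the random seed, and the definition of `nonlin()` (the common sigmoid helper from Andrew Trask's post) are stand-ins, not necessarily the exact code used elsewhere in this notebook.

```python
import numpy as np

def nonlin(x, deriv=False):
    # Assumed helper: the sigmoid, or its slope when x is already a sigmoid output
    if deriv:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

np.random.seed(1)  # arbitrary seed, only so the run is repeatable

# Hypothetical input: 4 customers x 3 yes/no features (fixed, never tweaked)
l0 = np.array([[1, 0, 1],
               [0, 1, 1],
               [0, 0, 1],
               [1, 1, 1]])

syn0 = 2 * np.random.random((3, 4)) - 1  # 12 weights connecting l0 to l1
syn1 = 2 * np.random.random((4, 1)) - 1  # 4 weights connecting l1 to l2

l1 = nonlin(l0.dot(syn0))  # depends only on l0 and syn0
l2 = nonlin(l1.dot(syn1))  # depends only on l1 and syn1

total_weights = syn0.size + syn1.size
print(total_weights)       # 16
print(l1.shape, l2.shape)  # (4, 4) (4, 1)
```

Notice that nothing in this sketch ever assigns a new value to `l0`: the only arrays worth changing are `syn0` and `syn1`.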

#Step 1: Confidence Levels Help to Answer, "By How Much Do I Adjust the Numbers of My Network?"
You might call Step 10, "How much do I tweak my network before its next iteration and prediction?"  For statistics and calculus buffs, we could simply say, "In line 57 and following, we compute how much to increment the weights in the opposite direction of the gradient of the error with respect to the weights in the synapses."  Whew!  For the rest of us mere mortals, let's unpack that a bit, by returning to our (spellbinding) fairy tale:

##*The Princess and the Castle, Chapter 2: Learning from Your Errors.*

You may recall, back in Step 7, you made a feed forward pass and drove to l2, your best guess as to where the castle y is located, but you arrived at l2 only to discover you were *closer* to the castle, but not yet arrived.  And you know that soon, you will (POOF!) disappear and wake up the next morning back at your house, l0, and start over.

How can you improve your driving directions to get closer to the love of your life tomorrow?

First, when you arrive at today's destination, you eagerly ask a local knight how far today's arrival place is from the Princess's castle.  This chivalrous knight tells you the distance you are from Castle y (this is the l2_error, or "how much you missed the princess by").  Every day, at the end of each trip, before you disappear for the day, you want to compute **by how much** you want to change today's failed l2 prediction such that tomorrow your l2 prediction will be perfect and you can fall into your beloved's arms.  This is the l2_delta.  It is the amount you want to change today's l2 so that tomorrow, that new-and-improved l2 will hopefully lead to the castle drawbridge!

Note that the l2_delta is NOT the same as l2_error because l2_error only tells you how many miles you are from your princess.  l2_delta also factors in how confident you were in the turn-by-turn directions by which you missed the castle today.  These confidence numbers are the derivatives (forget calculus, you don't need it here, so let's just use the word "slope," as in Good Ol' rise-over-run), or slope of each value of l2.  Think of these slopes as the confidence levels you had in each of the turns in the set of directions we're using for today's trip.  Some of those turns you were super-confident of.  With other turns, you weren't certain if they were right or not.  

But wait: perhaps this concept of using confidence levels to compute where you want to arrive tomorrow seems a bit abstract and confusing?  Actually, you use confidence levels to navigate all the time--you just aren't conscious of it.  

Think about a time when you got lost.  At first, you started out assuming you were on the right route, or you wouldn't have taken that route in the first place.  You started out confident.  But your trip seems to be taking longer than you expected, and you wonder, "Gee, did I miss a turn?"  Less confident.  Then, as time passes and you should have arrived by now, you become more certain you missed that turn.  Low confidence.  And you know you are not at your destination, but you are not sure where your destination is from your current location.  So now, you stop and ask a nice lady for directions, but she tells you more turns and landmarks than you can remember, so you can only follow her directions part-way before you are again unsure how to proceed.  So you ask directions again, but this time you are closer, so the directions are simpler, and you follow them to the letter and arrive joyfully at your destination.

It's very important to notice a couple of things: 

First, you just learned by trial-and-error, and along the way you had varying confidence levels.  A bit later below, I will explain in detail how those confidence levels allow our network to learn by trial-and-error, and then I will explain how our beloved Sigmoid function gives us those all-important confidence levels.  

Second, notice that your trip had two segments--the first segment was your route up to where you asked the nice lady for directions (l1), and segment two was your route from the nice lady to l2, the place you thought was your destination, but you had to ask how far you were from your true destination.  At first, you were sure you were on the correct route. Then, you wondered if you missed a turn. Then you were certain you missed a turn, and stopped to ask directions before proceeding further.  Those 2 segments of your daily trip look like the dog legs pictured here, and each day with your improvements, the dog leg gets a bit straighter.  It's like the process you go through as our romantic, driving, neural network: 

![alt text](https://lh3.googleusercontent.com/F1eSirIXLvjK9Dxs_jYgG_7jo9BYfiKxtYgcXqic8zWZhNVP6POfCTzeqg9tbGO-7vIfw-TenaJ0Rb4OK_7FGL5ilwtx8osTl_LyFhpWixZSVOjmAfzKJuicGJ2CXPt4u-Da7JmR7LEpqwjXFIs84kpQc8rj-NHOHGuwaoKxiB2un3FgMDp0JqLfP73g_Gc7j4GxpmUXOVGHSJY3YQNjDiBeoey-GzEkAIdZaV4ygSb-sIV7gSNfnkMGp0bgy3HeIn_sVGadqjviQswdEHbleQOaKy6lMr2FhRdYQoRZrKwMTRX3ziDaDylQtCVOIygLAzA0ezkhr9V4Aq-qf1kBe_679XfFsRuvE8zjLXnM0D-sqBQAL9fNao9-8gEEHu3Z4tLURbh8ve_IiGYFAOLS4Sedu3jHwexmmFfPs8Zd6UUVEWhiGBhdqS-mpiw8Ptg4t9qsJnH__h4bxhoyziri6MWzR_qkBsTCrKRSnhag0X6qeKQGfcnk_QPQusVnN6JNhIeeJtwLcLBBPEn_sWgBxCMAzeGwniphYXyYEc2_56Nm53u9Glb6bh1oZmPrRLotBqzEAZBS6TexIKkKNfX7WVU83AAWOmUAJBxW74hitwLqski032aBVWK9rUHNekhVIymCow66_09qT8i5okt9ZIT1l8shph6i=w970-h744-no)

Every day, on every trip, you (our 3-layer network) start out with a set of directions to the princess (syn0).  When those directions end, you stop at l1 and ask for further directions (syn1).  These take you to your final destination of the day, your prediction of where you *thought* the princess was.  But no castle stands before you.  So you ask the knight, "How far to the castle? (l2_error)" And because you are a genius, you can multiply the l2_error by how confident you were in each turn of your directions (the derivative, or slope, of l2) and come up with where you want to arrive tomorrow (your l2_delta).  

So you must compute 3 facts before you can learn from them and re-attempt your quest for the princess's castle.  You must know:
1) Your current location (l2);
2) How far you are from the princess's castle (l2_error); and
3) What changes you need to make in your set of turns to increase your certainty that your next driving attempt will get you closer to the castle (l2_delta).

Once you possess these three facts, then you can compute the required changes in the navigation turns (i.e., the weights of the synapses).   This is line 75, the change to the weights of syn1, which is the product of l1 and l2_delta.  The changes in syn1 will help to realize the changes you seek in your next l2 (i.e., to end up closer to that darn castle).
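Here is a rough sketch of those three facts, plus the syn1 change, in code.  The numbers in `l1`, `syn1`, and `y` are made up for illustration, and `nonlin()` is assumed to be the usual sigmoid helper; treat this as one possible way to write it, not the official listing.

```python
import numpy as np

def nonlin(x, deriv=False):
    # Assumed helper: the sigmoid, or its slope when x is already a sigmoid output
    if deriv:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

# Hypothetical values for one day's trip (4 training examples)
l1 = np.array([[0.2, 0.7, 0.5, 0.9],
               [0.6, 0.1, 0.8, 0.3],
               [0.4, 0.5, 0.5, 0.6],
               [0.9, 0.2, 0.1, 0.7]])        # where you stopped to ask directions
syn1 = np.array([[0.5], [-0.3], [0.8], [-0.1]])  # directions from l1 to l2
y = np.array([[1], [1], [0], [0]])               # where the castle really is

l2 = nonlin(l1.dot(syn1))                     # fact 1: your current location
l2_error = y - l2                             # fact 2: how far you missed by
l2_delta = l2_error * nonlin(l2, deriv=True)  # fact 3: miss scaled by confidence

# The change to syn1: one adjustment per weight, built from l1 and l2_delta
syn1_update = l1.T.dot(l2_delta)
print(syn1_update.shape)  # (4, 1), same shape as syn1
```

Because the slope of the sigmoid is never larger than 0.25, every value of `l2_delta` is smaller in size than the matching value of `l2_error`.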

Now, of course the smartypants readers will notice that I have only told you how to improve tomorrow's directions for Part 2 of our journey, from l1 to l2.  Ahh, there's always a stickler, isn't there?  Well, we're going to learn how to improve the directions (syn0) of Part 1 of our journey (l0 to l1) in Step 11 of our process.  Right now, I want to do a deep dive into using confidence levels to compute the l2_delta.  Wake up now, because below is some fascinating and important stuff:

#Step 2: How the Slope-Taking Feature of the Sigmoid Function Gives You Confidence Levels
Here is where you will see the beauty of the Sigmoid function in four *magical* steps.  To me, the genius of our neural network's ability to learn is found largely in these four steps.  We saw how, in line 39, Step 1 was when the `nonlin()` transformed each value of our matrix into a statistical probability (i.e., a prediction) between 0 and 1.  But I have yet to mention that that statistical probability is ***also*** a simple measure of confidence--numbers approaching 1 suggest high confidence that the NN's (neural network's) prediction of "1" is correct.  Numbers approaching 0 suggest high confidence that the NN's prediction of "0" is correct.  This is ***Big Advantage #2*** of the ***Four Big Advantages of the Sigmoid Function:***  If our NN's prediction, the four values of l2, is high-confidence and high-accuracy, that's an outstanding prediction, and we want to leave the syn0 and syn1 weights that produced that outstanding prediction alone.  We don't want to mess with what's working; we want to fix what's NOT working.

Let me explain the above from a different angle: right now, our network is dealing with a pretty abstract problem, i.e., 1's and 0's.  Let's give those 1's and 0's meaning by returning to our pet shop question, "Will this customer buy Litter Rip!?"

"1" means Yes, and "0" means No.  Within this context, the output is simultaneously a prediction and a confidence measure.  An output of 0.999 is the equivalent of the network saying "I am extremely confident the customer will buy Litter Rip!."  A number of 0.001 is the equivalent of "There is no way in Hell the customer will buy Litter Rip!."  Low confidence numbers are in the vicinity of 0.5.  For example, a value of 0.4 would be similar to "The customer might buy Litter Rip!, but I'm not sure."

That's why we focus our attention on the numbers in the middle:  all numbers approaching 0.5 in the middle are wishy-washy, and lacking confidence.  So, how can we tweak our network to produce four l2 values that are both high-confidence and high-accuracy?  
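To see "prediction and confidence in one number" in action, here is a tiny sketch.  The cut-offs of 0.9 and 0.1 and the function name `describe()` are my own hypothetical choices, picked only to make the idea visible; the network itself uses no such thresholds.

```python
import math

def sigmoid(x):
    # Squashes any number into the range 0 to 1
    return 1 / (1 + math.exp(-x))

def describe(p):
    # Hypothetical labels: far from 0.5 means confident, near 0.5 means unsure
    if p > 0.9:
        return "confident Yes"
    if p < 0.1:
        return "confident No"
    return "wishy-washy"

for raw in (7.0, -7.0, -0.4):
    p = sigmoid(raw)
    print(round(p, 3), describe(p))
```

A big positive input lands near 1, a big negative input lands near 0, and an input near zero lands in the uncertain middle.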

The key lies in the values, or ***weights*** of syn0 and syn1.  As I mentioned above, syn0 and syn1 are the center, the absolute *brains* of our neural network.  We are going to take the four values of the l2_error and perform beautiful, elegant math on them to produce an l2_delta.  l2_delta means, basically, "the change we want to see in the output of the network (l2) so that it better resembles y (the truth)."  In other words, l2_delta is the change you want to see in l2 in the next feed-forward pass in the next iteration.

***Get ready for beauty.***

Here is ***Big Advantage #3*** of the ***Four Big Advantages of the Sigmoid Function:*** Do you remember that diagram of the beautiful S-curve of the Sigmoid function that I showed you above?  Well, lo-and-behold, each of the 4 probability/confidence values of l2 lies somewhere on the S curve of the sigmoid graph (pictured again below, but this time with more detail).  If we search for that number (e.g. 0.9) on the Y axis of the graph below, we can see that it corresponds with a point on the S curve roughly where you see the green dot: ![alt text](https://iamtrask.github.io/img/sigmoid-deriv-2.png)
(taken with gratitude from: 
[Andrew Trask](https://iamtrask.github.io/2015/07/12/basic-python-network/))

Did you notice not only the green dot but also the green line through the dot?  That green line is meant to represent the slope of the *tangent* to the curve at the exact point where that dot is.  You don't need to know calculus to take the slope of a curve at a particular point--the computer will do that for you.  But you do have to notice that the S curve above has very shallow slope at both the upper extreme (near 1) and the lower extreme (near 0).  Does that sound familiar?  Wonder of wonders, a shallow slope on the sigmoid curve coincides with high confidence and high accuracy in our predictions!  And you also need to know that a shallow slope on the S-curve comes out to a tiny number for slope.  That's good news.  Why?

Because, when we go to update our synapses, we basically want to leave our high confidence weights alone since they already have good accuracy.  To "leave them alone" means that the tweak we apply to them should be a tiny number, near zero, so the values remain virtually unchanged.  And here comes ***Big Advantage #4*** of the ***Four Big Advantages of the Sigmoid Function:*** Miracle-of-miracles, our high-confidence numbers correspond to shallow slope on the S-curve, which corresponds to tiny slope numbers.  Therefore, multiplying the l2_error by these teeny-tiny slope numbers has exactly the effect we want: the tweaks that reach syn0 and syn1 are tiny, the values in our synapses are left virtually unchanged, and our confident, accurate, high-performing values in l2 remain so.

By the same token, our wishy-washy, indecisive, low-accuracy l2 values, which correspond to points in the middle of the S-curve, are the numbers that have the biggest slope on our S-curve.  What I mean is, the values around 0.5 can be traced on the Y axis of our graph above to the middle of the S-curve, where the slope is steepest, and therefore the value of that slope is a big number.  Those big numbers mean a big change when we multiply them by the corresponding misses in l2_error, as we do in line 61.
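You can check the "confident means shallow, wishy-washy means steep" idea with a few lines of arithmetic.  This sketch relies on a standard fact about the sigmoid: its slope at any point equals `output * (1 - output)`.  The function name `slope_from_output()` is hypothetical, chosen just for this example.

```python
def slope_from_output(out):
    # Slope of the sigmoid curve, written in terms of the sigmoid's own output
    return out * (1 - out)

for out in (0.999, 0.9, 0.5, 0.1, 0.001):
    print(out, round(slope_from_output(out), 6))

steepest = slope_from_output(0.5)  # 0.25, the wishy-washy middle of the S-curve
```

The slope peaks at 0.25 when the output is 0.5 and shrinks toward zero as the output approaches either 0 or 1, which is why confident predictions produce only tiny tweaks.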

In detail now, how do we compute the l2_delta?  

In line 45, we found l2_error, which measures how much our first prediction, l2, missed the target values of y, our truth, our future, and our princess.  You may recall we are particularly interested in the Big Misses.  

In line 57, the first thing we do is use **the second part** of our beloved Sigmoid function, `nonlin(x,deriv=True)`, to find the slope of each of the 4 values in our l2 prediction.  This slope tells us which predictions were confident, and which were (wait for it...) Wishy-Washy.  This is how we find and fix the weakest links in our network, the low-confidence predictions.  Once `nonlin()` hands us our 4 slopes, we multiply those 4 confidence measures by the four misses in `l2_error`, and the product of this multiplication will be `l2_delta`.  Oh, Lordy!  Line 57 is an important step--did you notice that we are multiplying the Big Misses by the Wishy-Washy Predictions (i.e., the l2 predictions that had big slopes)?  Super-duper key point, as I'll explain below.  But first, let's make sure you can visualize what I just said:
```
Below is the element-by-element multiplication of this line of code, in order of operations: l2_delta = l2_error*nonlin(l2,deriv=True)
    
Take l2 predictions, find their slopes, pass them through nonlin(), multiply them by the l2_error, and product is l2_delta

l2 slopes after nonlin():    l2_error:                    l2_delta: 
[0.500038] Wishy Washy!      [-0.00015204] small miss    [-0.0000760] small change
[0.500054] Wishy Washy!   X  [ 0.00021812] Big Miss    = [ 0.0001090] WWxBM=Big Change
[0.500049] Wishy Washy!      [ 0.00019886] small miss    [ 0.0000994] small change
[0.500076] Wishy Washy!      [-0.00030417] Big Miss      [-0.0001521] WWxBM=Big Change
```
Notice that the Big Misses are (relatively speaking) the biggest numbers in l2_error.  And the Wishy-Washy's have the steepest slope, so they are the biggest numbers in `nonlin(l2,deriv=True)`.  So, when we multiply the Big Misses X The Wishy-Washy's, we are multiplying the biggest numbers by the biggest numbers, which will give us--guess what?--the biggest numbers in our vector, l2_delta.  
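If you want to reproduce that multiplication yourself, here is a small sketch that copies the eight numbers straight from the table above.  The one thing to notice is that NumPy's `*` multiplies row by row (element-wise); it is not a dot product.  The variable name `biggest_change_row` is just something I made up for this example.

```python
import numpy as np

# Values copied from the table above
slopes = np.array([[0.500038],
                   [0.500054],
                   [0.500049],
                   [0.500076]])
l2_error = np.array([[-0.00015204],
                     [ 0.00021812],
                     [ 0.00019886],
                     [-0.00030417]])

# Element-wise: row 1 times row 1, row 2 times row 2, and so on
l2_delta = l2_error * slopes
print(l2_delta)

# Which row gets the biggest change (ignoring sign)?  Rows count from 0.
biggest_change_row = int(np.argmax(np.abs(l2_delta)))
print(biggest_change_row)  # 3, the last row, which also had the biggest miss
```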

Why is that fabulous news?  Think of l2_delta as "the change we want to see in l2 in the next iteration."  The **big** l2_delta values are the **big** changes we want to have in the l2 prediction of the next iteration, and we'll make those happen by making **big** tweaks in the corresponding values of syn1 and syn0 below, so in our next feed-forward pass, when we multiply l0xsyn0, the biggest tweaks in syn0 will make a big change to our least accurate, wishy-washy-est values in l1.  So l1 will be improved.  Then, when we multiply the new-and-improved l1 by our better-weighted syn1, that will yield our best l2 prediction yet!  Happy Happy!  Joy Joy!

#Step 3: Gradient Descent Answers, "In What Direction Do I Adjust my Weights?"
When we update our synapse matrix by tweaking its corresponding element (aka, its value or number) with a change built from that large slope number, it's going to give that element a big nudge in the right direction towards confident and accurate prediction.  When I say, "in the right direction," what I mean is that some values of our l2_delta are going to be negative, because the prediction in l2 overshot its target (y - l2 is negative) and we want the next prediction to drop towards 0.  Other values of our l2_delta are going to be positive, because the prediction fell short of its target and we want the tweaks to syn0 and syn1 to push the next prediction up towards 1.

So it's important to notice that there is a sense of "direction" involved here.  When we talk about "what direction is the target y value from our current l2 value?" we mean, do we need a positive l2_delta value to move each weight in syn1 in a positive, larger direction, or a negative l2_delta value to move it in a negative direction?
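Here is a small sketch of that sense of direction for a single weight.  I am assuming the weight is updated by adding `l1_value * l2_delta` to it (the single-number version of "the product of l1 and l2_delta"); the starting weight of 0.5, the activation of 0.9, and the function name `l2_delta_for()` are all hypothetical.

```python
def l2_delta_for(y, l2):
    # Miss (with its sign) times the sigmoid slope at the prediction
    l2_error = y - l2
    slope = l2 * (1 - l2)
    return l2_error * slope

too_low = l2_delta_for(y=1.0, l2=0.3)   # prediction should rise, so delta is positive
too_high = l2_delta_for(y=0.0, l2=0.7)  # prediction should fall, so delta is negative

l1_value = 0.9  # hypothetical activation feeding this weight
weight = 0.5    # hypothetical starting weight in syn1

weight_after_low = weight + l1_value * too_low    # nudged upward
weight_after_high = weight + l1_value * too_high  # nudged downward
print(weight_after_low, weight_after_high)
```

The sign of `l2_delta` is the direction and its size is the "how much," so one number answers both of our key questions at once.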

![alt text](https://mail.google.com/mail/u/0?ui=2&ik=e3f869f938&attid=0.2&permmsgid=msg-a:r4352876950048414936&th=1691255aa52a4d54&view=fimg&sz=s0-l75-ft&attbid=ANGjdJ8FdFORGv3w0jn-Bs8GhlKpg2D1XPRzSF6OaNCqE8hchNYMIAymIg-nK1xCdIsQup54rJmkW2l0qttCzg03Hq8PJOv4KX0ae14e2dkswvLMt74Rzdhwt2ZJQBQ&disp=emb&realattid=ii_jsexnu8o2)
(taken with gratitude from [Grant Sanderson](https://www.youtube.com/watch?v=IHZwWFHWa-w&t=2s&index=3&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi) Ch. 2)

Above is a nice, simple picture of the "rolling ball" of gradient descent.  Line 61 computes each value of l2 as a slope value.  Of the 3 "bowls" pictured in the diagram above, it is clear that the true global minimum is the deepest bowl on the far left.  But, for simplicity's sake, let's pretend that the bowl in the middle is the global minimum.  So, a steep slope downwards to the right (i.e., a negative slope value, as depicted by the green line in the picture) means our ball will roll a lot to the right (i.e., in the positive direction, opposite to the sign of the slope), causing a big, positive adjustment to the corresponding weights of syn1 that will be used in the next iteration to predict l2.  But if, for example, you have a shallow slope downwards to the left, that would mean the prediction value is already accurate and confident, which produces a tiny, positive slope value, so the ball will roll very little to the left (i.e., in the negative direction), thus adjusting by very little the corresponding weight in syn1, so the next iteration's prediction of that value will remain largely unchanged.  This makes sense because the back-and-forth motion of the rolling ball is becoming smaller and smaller before it soon comes to rest at the global minimum, the bottom of the bowl, so there's no need to move much.  It is already close to the ideal.
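To watch a ball roll down a bowl in code, here is a deliberately tiny sketch with one weight and one bowl.  Everything in it is an assumption made for illustration: the bowl shape `(w - 3)**2`, the starting point, and the `learning_rate` (a step-size knob that our notebook's network does not use).

```python
def slope(w):
    # Slope of a hypothetical bowl whose bottom sits at w = 3: cost = (w - 3)**2
    return 2 * (w - 3)

w = 0.0              # the ball starts on the left wall of the bowl
learning_rate = 0.1  # hypothetical step-size knob
steps = []

for _ in range(50):
    step = -learning_rate * slope(w)  # roll against the sign of the slope
    steps.append(step)
    w += step

print(round(w, 3))          # very close to 3, the bottom of the bowl
print(steps[0], steps[-1])  # first roll is big, last roll is tiny
```

At the start the slope is steep and negative, so the ball takes a big step in the positive direction; near the bottom the slope is almost flat, so the steps shrink to almost nothing.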

The above 2-dimensional diagram is a tad oversimplified, so below is a more accurate picture of what gradient descent looks like.  This is another good image, similar to the lumpy red bowl I showed you at the beginning, sitting on the "table top" plane of the syn0, syn1 grid with an arrow showing Feed Forward and a tiny arrow showing slope/gradient descent:  https://docs.google.com/document/d/1_YXtTs4vPh33avOFi5xxTaHEzRWPx2Hmr4ehAxEdVhg/edit?usp=sharing

This is a perfect example of why the best teacher is someone who learned yesterday the material you are learning today.  I only discovered the following insight after a year of studying gradient descent, because all the experts take this point so much for granted that they don't bother mentioning it.  Here is a BIG chance for you to learn from my mistakes, and gain a super-key insight that eluded me for over a year, even though it was right under my nose.  Here goes:

Take another look at the red, 3-D "warped bowl" in the diagram above.  Think of it as just that, a warped bowl.  Notice that the warped bowl is not drifting in space.  It is sitting on a "white table top," that white grid.  Do you notice that the only place our warped bowl actually touches the plane of our white table top is at the global minimum, i.e., the bottom of the lowest dip in the bowl?  Perfect.  Now you have all the info you need for this stunning insight:

The grid of our white table top is the axes of syn0 and syn1!  That means that, for example, every value in syn0 is a point on the X axis of our grid, and every value in syn1 is a point on the Y axis of our grid.  When we do a forward feed through our network, the error we arrive at is simply the height of our "ball" above the syn0, syn1 coordinate on that grid.  Once we have the height above the grid coordinates of the plane below, we know *exactly* where our ball is on the surface of our bowl.  And when we compute gradient descent, it tells us the slope of the surface of the bowl at the exact coordinate of (syn0, syn1, and the value from Forward Feed).  Finding the slope of where our ball is tells us the direction our ball should roll to make the quickest descent to the bottom of the bowl where error is 0 because height is 0 because our ball is touching the syn0, syn1 grid plane.
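Here is a sketch of that picture with a toy network that has only two weights, so the "table top" really is a flat grid.  The single input `x`, the target `y`, the starting coordinate, and the step size are all hypothetical, and the slope is measured numerically (by nudging each weight a hair) rather than by the network's own back prop code.  Our real network has 16 weights, so its bowl lives in more dimensions than anyone can draw.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def height(w0, w1, x=1.0, y=1.0):
    # Forward feed through a toy 2-weight network; the error is the ball's height
    l1 = sigmoid(x * w0)
    l2 = sigmoid(l1 * w1)
    return abs(y - l2)

def slope_at(w0, w1, h=1e-6):
    # Numerical slope of the bowl's surface along each grid axis
    d0 = (height(w0 + h, w1) - height(w0 - h, w1)) / (2 * h)
    d1 = (height(w0, w1 + h) - height(w0, w1 - h)) / (2 * h)
    return d0, d1

here = (0.5, 0.5)        # hypothetical coordinate on the grid
h_here = height(*here)   # how high the ball sits above that coordinate
d0, d1 = slope_at(*here)

step = 0.5               # hypothetical step size
downhill = (here[0] - step * d0, here[1] - step * d1)
h_next = height(*downhill)
print(h_here, h_next)    # the ball is lower after rolling downhill
```

The grid coordinate tells you where the ball is, the forward feed tells you how high it is, and the slope tells you which way is down.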

And that's it.  That is the best geometric representation of what a neural network does, in one picture.  Why doesn't EVERYBODY teach it like this?  If you can SEE what a neural network does in 3-D, it makes it SO much easier to understand why we do all these abstract steps in math and in code.

Take your time with the above points and make sure you understand them.  Do you see why the sigmoid function is a thing of beauty?  It takes any random number in one of our matrices and:

1) turns it into a statistical probability, 

2) which is also a confidence level, 

3) which turns into a big-or-small tweak of our synapses, and 

4) that tweak is always in the direction of greater confidence and accuracy.  

The sigmoid function is the miracle by which mere numbers in this matrix can "learn." A single number, along with its many colleagues in a matrix, can represent probability and confidence, which allows a matrix to "learn" by trial-and-error.  That is a thing of beauty, but there is more elegance to come!  As you learn other networks, you will see there are many functions that "learn" in even more beautiful ways than the sigmoid we have studied here.


##11) Back Propagation: Lines 59-64
```
  #12 In what DIRECTION is l1, the desired target value of our hard-working middle layer 1,
  #from l1's latest guess?  Pos. or neg.? Similar to #10 above, we want to tweak this 
  #middle layer so it sends a better prediction to l2, so l2 will better predict target y.
  #In other words, add weights that will produce large changes in low confidence values and 
  #small changes in high confidence values
  l1_delta = l1_error * nonlin(l1,deriv=True)
```
First, let's do a quick review:
In Step 8, we computed l2_error, which is the difference between the prediction of the network (l2) and the truth (y).  Next, in Step 10 we computed l2_delta by multiplying l2_error times the confidence levels we had in each value of l2.  And soon in Step 13 we will update syn1 by multiplying l2_delta by l1.  This multiplication is needed so that larger changes are applied to weights that have more impact on l2.  However, we do not apply this update until after we are ready to update syn0, for the sake of consistency.    

This is an important distinction.  We know our goal is to get to The Ideal l2 Prediction, which is close to or the same as our truth, y.  But we'd only be using half our horsepower to get there if we only update syn1.  We want to update both syn1 and syn0 in order to maximize our efficiency in creating The Ideal l2.  Finding the l1_delta is the key to updating syn0, and that's our next job now.

In Steps 11 and 12, we are going to compute the l1_delta, but in a slightly different manner.  We are going to use back propagation.    

Continuing your (unmistakeable) genius, now you can work backwards to learn from your mistakes.  Your l2_delta tells you the confidence you had in each of the turns (syn1) that brought you to l2.  So, you go back through your turns and eyeball the low-confidence turns to figure out the first place you went wrong.  In math terms, you will multiply tomorrow's ideal arrival and the confidence you had in today's predictions (l2_delta) by today's not-so-perfect directions (syn1) to figure out the first place you went wrong on today's route (l1_error, which is the distance between today's screwy l1 stop and tomorrow's fabulous l1 stop).  

You can then multiply the l1_error by how confident you were in your *first* set of turns/predictions that got you to l1, and the product will be the l1_delta, with which you can eagerly update syn0 to bring you ever-closer to your betrothed.  I am brushing back the tears, just thinking about it...

Consider this: back when we were finding l2_error, life was easy.  We knew l2, and we knew "The Ideal l2" we were shooting for, which was simply y.  But with l1_error, things are different.  We don't know what "The Ideal l1" is.  There's no y to tell us that.  y only tells us The Ideal l2.  So, how are we going to figure out what The Ideal l1 is?  We must first figure out how much l1 would need to change to produce the desired effect on l2.  We're going to have to take what we *do* know and work backwards.  That's "backwards" as in, "back propagation."  Now, Folks, we're goin' bigtime:
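Here is a rough sketch of working backwards in code.  The input `l0`, the targets `y`, the seed, and the `nonlin()` helper are assumptions for illustration, and the two update lines at the end assume the weights are changed by simple addition.  The thing to watch is the order: `l1_error` is computed from the old `syn1`, before either synapse is touched.

```python
import numpy as np

def nonlin(x, deriv=False):
    # Assumed helper: the sigmoid, or its slope when x is already a sigmoid output
    if deriv:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

np.random.seed(1)  # arbitrary seed, only so the run is repeatable

# Hypothetical data: 4 customers x 3 features, and whether each one bought
l0 = np.array([[1, 0, 1], [0, 1, 1], [0, 0, 1], [1, 1, 1]])
y = np.array([[1], [1], [0], [0]])

syn0 = 2 * np.random.random((3, 4)) - 1
syn1 = 2 * np.random.random((4, 1)) - 1

# Feed forward
l1 = nonlin(l0.dot(syn0))
l2 = nonlin(l1.dot(syn1))

# We know The Ideal l2 (it is y), so the l2 miss is easy
l2_error = y - l2
l2_delta = l2_error * nonlin(l2, deriv=True)

# We do NOT know The Ideal l1, so we work backwards: send l2_delta
# back through syn1 to see how much each l1 value contributed to the miss
l1_error = l2_delta.dot(syn1.T)
l1_delta = l1_error * nonlin(l1, deriv=True)

# Only now update both synapses (assumed additive update)
syn1 += l1.T.dot(l2_delta)
syn0 += l0.T.dot(l1_delta)

print(l1_error.shape, l1_delta.shape)  # (4, 4) (4, 4)
```

There is one `l1_error` value for every value in `l1`, which is exactly what we need in order to tweak every weight in syn0.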


#The Big Picture on Back Prop
So, here's the challenge: every time you tweak one of your 16 lines/weights/variables in syn0 and syn1, it has a ripple effect through the network.  How can you calculate the best value for each of the 16 weights while taking into consideration its ripple effect on all the other 15 weights, all at the same time?  That sounds crazy-complex, right?

Can do.  Let me show you the World's Greatest Parlor Trick.  Those of you who know calculus will understand when I say we are going to use the Chain Rule to take derivatives.  But those of us who don't know calculus are not intimidated in the least because we will use slope, Good Ol' rise-over-run, to juggle 16 bowling pins in the air at the same time.  The secret?  In this context, to find the slope is to take the derivative.  They are exactly the same thing.

Here is our overall question: "When we nudge syn0,1 up or down, how much does that increase or decrease the l2_error?"  In other words, think of a derivative as a "sensitivity," or a relationship, or a ratio: we know that if we wiggle syn0,1, up or down, then the l2_error will wiggle up or down in proportion to that nudge.  But will it move a little?  A lot?  What is the ratio of l2_error's wiggle to syn0,1's wiggle?  How sensitive is l2_error with respect to syn0,1?
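One way to feel this "sensitivity" is simply to wiggle the weight and watch the error.  In this sketch the network is stripped down to a single path with made-up values (`l0=1`, `syn1_1=3`, `y=1`, and a starting weight of 2), and the function name `l2_error_for()` is hypothetical.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def l2_error_for(syn0_1, l0=1.0, syn1_1=3.0, y=1.0):
    # Feed forward along one hypothetical path, then measure the miss
    l1 = sigmoid(l0 * syn0_1)
    l2 = sigmoid(l1 * syn1_1)
    return y - l2

nudge = 0.001
base = l2_error_for(2.0)             # error before the wiggle
wiggled = l2_error_for(2.0 + nudge)  # error after nudging the weight up a hair

# The ratio of l2_error's wiggle to syn0,1's wiggle
sensitivity = (wiggled - base) / nudge
print(sensitivity)  # negative: nudging this weight up makes the error smaller
```

The sign of that ratio tells you the direction to tweak, and its size tells you how strongly the error responds.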

Remember: the goal of back prop is to figure out the amount to tweak each of our 16 weights so that we can reduce our l2_error as much as possible in the next iteration.  But the challenge is that each of these 16 weights affects many of the other 15 weights, so you might say that how much you adjust Weight 1 *depends on* how much you adjust Weight 2, which *depends on* how much you adjust Weight 3, and so on.  Keep this *"depends on"* image in your mind because it will come in handy later in the math.  As an analogy, imagine 16 of the Minions from the *"Despicable Me"* movie, and these 16 minions are aligned in perfect formation and cooperating perfectly as one body in order to serve their master, Felonius Gru.

#The Butterfly Effect
OK.  Our key question is, "How much do I adjust syn0,1 so that syn0,1 working in synch with its other 15 minions, minimizes l2_error so I can arrive at Castle y and fall into the arms of my princess?"  

The answer to that question is kind of like The Butterfly Effect, where the butterfly flapping its wings in New Mexico sets off a chain of events that results in a hurricane in China.  (Did you notice?  "chain of events" sounds a tad like "chain rule" in calculus, no?  Oh Dave, so very clever...).  Let's look at a picture of how this analogy might apply to the series of ratios we must calculate in back propagation (errata: delete "storm clouds off West coast" because syn1 does not change):

![alt text](https://lh3.googleusercontent.com/mkwLywuL4IJFRUlgusaQCf41A3QT1elx7jRvUP--ky2tsiqHCdpvLDhVxzJNa6ajt-vm7dLLAl1IQtxAn4JhYfb0dqEUkj7Kz4rbC-Vfbx3niYg-522ugY36Dv9InwkNMPUfZK-7869YgexDLQkZ3X7dnapAyOAAvXF-_-iVXDfhkV5CkscPD2bAJ7SgbmbJ3Hw6BQk7TWuV5v1ZbnJshsPER1UZ4XVgRR13S-dYOgiO2VV_N-x5pR1ci7gBPOfaeaDVoz8sgvH5KqDWRmU8rN_J80zmZc-oquyt3XCalfaV7Gmt_lX98d2-faDkSTHZly1XCl9XdKDZOnEwyxnn8Uxg5A5AQdDAbtgXzeddzf2Zm0DcU1b8tFr6l-ZL6M-vmrfkJiaP7VUC1hUql7CZ1OuG1ZyrCnN-FUQLcVT-M0-AjkmkSlh12R6jQnsVtU-FIGr8hq0z8koTKbZGUWZNUTHnPv1DqMRPRX4nGan_mUhMY-Ne7VPBrSTVhzkKfe4eigo6F70yh3zjNpZfdHGHIsoEXEMTFjc6N_wT_JWFG7DgvoXR5v-OYK6Hd2csUsJn5RITL4QRpUDuu9ni-IoiUPyCR0g4SAVxoI0CNUVKS9uu_GuaxCK-zotNnYg9U_1sdLQOXpscPrDnAWIJvDr8m0XFXQ9TiR1M=w973-h345-no)

Now, let's walk through the math of our butterfly analogy:  When we increase or decrease the value syn0,1, that's like the butterfly in New Mexico flapping its wings.  For simplicity's sake, let's say our original syn0,1 was 1.  We will now tweak it.  We will increase syn0,1 from 1 to 2.  This increase will now ripple through our chain of events.  Take a look:

![alt text](https://lh3.googleusercontent.com/JR4C73zysUgxPvCsb-TFGPDsvZBn1Amr9fZeIaXvCwp6AEsCEsC6PQPLpbQQzYl7TnSq428w_WD6099oHF_u4c09fB2aDsoBiFJXcTsk4U89whhbG46KfEreGbcQf2a2u0RPLlcZOg_lIucEQJ8CVnyRtYPGvK2oA6IiUe2_Pn8-BpEBN_30zmpnjOhQ5y9ynKrW45WS5YLD2pKOoOd_qGZVHr-Qki9A2SAwTHAsDoCLrOBjGRE9Wq0C9wLp1aIe4Vi4qN3MN_DVyHf92C6HwUvUs5_jO5S6KOlga6tY49IxDrL4rIsw4zow7ts5pipNTsEGf5UCJauCIpYS1vixrGF_A-mQFmbUtjlmvP6C-yi_JQKGWTIsKe1jCMZC43i-LQv45juLPOc9v1mpfiuJwLpX8oXObdLLLhpDgD9Wum697_zHC5MOLSgupZjKdj3GbUKobgK_djnUHxvqa1b-7GR30nEsDIpyMboliS7Qv-_qfwM_o6G0nnteid1Np0XMKBTgV2gcjzevaE-gm0_1Tq94RoIe6ScpIgugU0u9ILCb47inELGH1RZTUPcXe21sv-_nQDM3BZ2sKnYvAu6zrlJ6z3ufdqQkeMm1nwTeqfSXja-6YH3HELs_-Ml8V9-pj3n4DHx-kmFwXSSslvekrmIR1cTnY-TD=w973-h347-no)

Now, follow along with the above diagram and let's combine two analogies:
#The Butterfly Effect Meets the Ripple Effect

Ripple 1, the first ripple effect of our tweak to syn0,1 will cause l1_LH to increase by a certain proportion, aka ratio.  That's the "gust of wind in Nevada."

Since l1_LH is the input of our sigmoid function, to calculate that ratio of change between l1_LH and l1 (aka, "to take the slope") is to measure Ripple 2, the "heavy winds in L.A."  

Then, l2_LH will obviously be affected in proportion to the change in l1 and its subsequent multiplication by syn1,1 (which does not change-- we're leaving syn1,1 and the other 14 weights unchanged for now, to simplify our example), so measuring that ratio of change will give us Ripple 3, the "thunderstorm in Hawaii."  

You can probably guess l2 will change in proportion to the change in l2_LH, so taking the slope of those 2 numbers will give us Ripple 4, the "storm over the Pacific."

Finally, when this new l2 is subtracted from our target y value, the difference, which is l2_error, will change, and this is Ripple Effect #5, the "hurricane in China."  

Our goal is to calculate the ratio by which each ripple ripples, in order to know the amount we want to increase/decrease syn0,1 in order to minimize l2_error on our next iteration.  When we say our neural network "learns," we really mean it reduces l2_error with each iteration such that the network's predictions become more and more accurate each time.  So, tweaking syn0,1 is like tweaking the flap of the butterfly's wings, which ripples through a chain of events right up to the hurricane in China, which in our example is the reduction of l2_error. 
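Here is a sketch of the five ripples as five small ratios multiplied together, checked against a direct nudge of the weight.  It follows a single hypothetical path (all other weights ignored) with made-up values, so the numbers will not match the diagrams; the names `ripple_1` through `ripple_5` are my own labels.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Hypothetical single path through the network
l0, syn0_1, syn1_1, y = 1.0, 2.0, 3.0, 1.0

def forward(w):
    l1_LH = l0 * w           # butterfly flaps: weight times input
    l1 = sigmoid(l1_LH)
    l2_LH = l1 * syn1_1
    l2 = sigmoid(l2_LH)
    return l1, l2, y - l2    # the last value is l2_error, the hurricane

l1, l2, l2_error = forward(syn0_1)

ripple_1 = l0             # how l1_LH responds to syn0,1
ripple_2 = l1 * (1 - l1)  # how l1 responds to l1_LH (sigmoid slope)
ripple_3 = syn1_1         # how l2_LH responds to l1
ripple_4 = l2 * (1 - l2)  # how l2 responds to l2_LH (sigmoid slope)
ripple_5 = -1.0           # how l2_error = y - l2 responds to l2

chain_product = ripple_1 * ripple_2 * ripple_3 * ripple_4 * ripple_5

# Cross-check: nudge the butterfly directly and measure the hurricane
nudge = 1e-6
measured = (forward(syn0_1 + nudge)[2] - l2_error) / nudge
print(chain_product, measured)  # the two ratios agree closely
```

Multiplying the five little ratios gives the same answer as flapping the butterfly's wings and measuring the hurricane, which is the whole idea behind the chain of events.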

Since we're working backwards, we would say, "How much the hurricane l2_error changes *depends on* how much l2_RH changes, which *depends on* how much l2_LH changes, which *depends on* how much l1_RH changes, which depends on how much l1_LH changes, which depends on how much our butterfly, syn0,1 changes."  Let's have a clean, simple look at that in this diagram (errata: add the code, "l2_error = y - l2" pointing to the d l2_error / l2 ratio):

![alt text](https://lh3.googleusercontent.com/vzKvBFBRjYHO7T07_QiKtU95GptvZYxjhTOjIjMWuq7MbQFID4tpIXJkIttdHD1jL3qt8njA0xeeKwje_wI70jgNb5BywonMQmqQoCmRhwTh1gBuFPS1Vhaq0EC2Fei1iYW8JlO2RLuEI5Y7E_TyhaIVDm0nQmprBwXUNhgQm3tVuVbDmZB0DlDZ0hlUJLORfxr8IuxF2t5pv04DyiqnaPprJZUL5B_zqTY8zkCFhUpuNVVibls5ZNadbpReQwPj_tZqdrHjKNTSBWBOzAI6cDW2_h-qs9Ttij8p7IZJd_N_sELv_gxZngc53VVF3IJn8Hby_vtLBORgX_2gjzy8rtxtsvuA4CBGyNBuhyB8L5J5ls-8jw2II1eLC64y1qI5sLhiwlls4KQGQpfeT2JYi9l6mP1Q8xqB_ME0vhckUnCoTGxu5lLQlDOjgANUSmQaj3OXepBKIBE_uM1cTvRQFHFaaQkoxvu5SKe8JW0cZZU7JLYXwO0dGZJKnHGd9ZM1Vwh-cKeoiwbcvD55_J2cKhdew7HXUOToogPWZ_sA-sWPBsrWNPvCuF4d9qpKhhth0Zjd_Hnd9Tg6xDxqzq9ZmptcGIbaUDqoMW-JAeJW7LTT7725HZNTriYzqkjG9Bev7ss-081eBnR-tHnb6sEY65lgqpf0J6xl=w973-h380-no)

So you can see that there is a chain of 5 ratios we need to calculate and multiply together in order to find the ultimate ratio of how much a change in our butterfly, syn0,1 creates the change we want in our hurricane, the l2_error.  How do we calculate those ratios?  For your convenience, here again are the variables we found in the Step 7 Forward Feed:
```
l0=1
syn0,1=2
l1_LH=5
l1=0.98
syn1,1=3
l2_LH=0.94
l2=~0.7
y=1 (this is value 3 of vector y, which corresponds to training example #3, row 3 of l0)
l2_error = y-l2 = 1-0.7 = 0.3
```
Don't make the same mistake I made.  Here's how I first attempted to calculate the ratios (i.e., the wrong way):

![alt text](https://lh3.googleusercontent.com/jDP5tTftL0aljCAULeNE9jW02gV1P0IYL6tWKE8NQbJ7K677UMG4Yrs10kWReFjBAsy_EI7VXYXolBVj0TczQXiiX-D5sps12K5mH2oqzRt_cvZF8dK_3Owyo7gO0LcibE489wYu-O_GMMEAjETAyZIARQvP6iwc2WHw7YC06rTRjj0TInrq_KAF4hvxhC7XXWsOYjRDeJdbLZ3WaA2PmPhzTpg_uO2ShqGyQafNYHZle5Pb8-c45uRsz_YZDHPIJQ_W4XuGbgLgOo5HAZUm_8-vS34D2s9k-8aa68mpFf-p2VDqLS8EmoOzM3Kn5z5YLqpssqKoRGDt-xB4fLaFqZY08AlyAcJfVNEAZIzmxLhPklfs0I5cU70VQEsbNwGlMxYnwNlHx1uIraTpLbc5CIJ_8R2BH5Ee_kBVIV2ucJMkKEoH1YmCC8EMA8Gqo5oFKUKM2GGHHE-AgDxQrW3FPhZ2uwwqFpCnWhDIrzkUTnugna1awZLCuop1qdtWkNBs0V46XHB0xADzEVckGGkkRQSTIB3xHdZDL4xk7ASDUSMr4sWOzXLn-W_-M_4Brxj_38p7Vc5BO_LgOiGzTQqJLZNTLXhC0QtOG33KBbVbGeFBoObGgIejppQVt_ASI-nnynYpHYNSm_NNv3TOQRPq2d_pJClaJAL6=w973-h274-no)

For $250,000 cash and a trip to our bonus round, **What's wrong with the above picture?**

Answer: In my above calculations, I forgot that the goal is to calculate **change**.  I mistakenly believed that each of these 5 ratios is fixed.  In fact, how much a given ratio "B" is changed by the change of the preceding ratio "A" depends on *where* the coordinates of Ratio B fall on the graph of Function B on the grid.  So calculating change is not possible if a ratio is fixed.  It *is* possible if a function is linear or non-linear, but in order to calculate change you must calculate the slope of the ratio.  Why?  Because how much a ratio changes depends on where it falls on the grid in the graph of the function.  Once you know the slope of a function you know how much it varies given its location.  And to calculate slope, you need at least 2 sets of coordinates.  

I want to make sure that's clear for you.  Consider for a moment:  we want to predict the future (so to speak)!  We want to know how much a *change* in syn0,1 will ripple through our network to cause a *change* for the better in our next, future iteration of l2_error.  In other words, we're not just comparing how a single number 2 in syn0,1 affects a single number 0.3 in l2_error.  We want to compare how a *change* in the number 2 in syn0,1 will in turn change the 5 rippling ratios to ultimately create a better, future l2_error.  Specifically, we want to make that 0.3 value smaller, as quickly as possible.  So, we don't want a bunch of ratios of numbers; we want the ratio of **CHANGE** in the numbers, the numbers that change our future.  The delta.  

OK, so hopefully you now agree that we need *two* values for each of our variables.  We already have a current value for each variable.  We need to provide a *second* value for each variable and subtract one value from the other.  The remainder = an "amount of change," or delta, that we can then compare to the amounts of change, or deltas, of the other variables.  It looks like this formula in the top part (ignore the math below for now):

![alt text](https://lh3.googleusercontent.com/klUn2t8vP02Y6orwKA0GIfud3TL6hGH3Cu22pWKhLGwKcxv9VpHVrMSrvcr9qtOhGc6ibuDme3iD3qkL50DpS-4MS1zkSJ-2CVK87RCHvpkUt7R49BloZZRnGK4zCnUJ5Hs7QF_XeTls3_2i2NTpZYEhHyi0KVJgw8DUKy2fMIi046qw5RVjsFgY0MHdsH1lWWo1H0KdZMipUpmEe2wNtxA43ZWfYss44pdZOPCVIs--g4uFE5KxYhYXv7_tnJjpnIC3X86O9JAxeP7Ju6lNPApsKx0IJhi48MWpB9tLafT6izha3kk88qzQ4RcmC9E_osTO997XRirA8dr5uG9Ii8DM4w3aOXwAfMQuGnnrQ1FwY2VpfYWwSJfGaQu-yqYmbxmbILe9Pl4J1aUg6G6nQROVkWyd03pM_XQOI7g1QK3ib6iE7bHro_8R3j8D0ooR4lSB0xPgTIEPavFH9vfNkwZwYsDFYciMniC093QO1vrE5Z1RQhQVlip9TxtR_9zYnb1NT9nRQvUhlO1sQluUGS08t0dCevnZ8mgI-mJsasRxrKPmF9fH6LCQ_tM94nQl9Iy5TPkURLH1NYbQab1TjPVQGqoD_sAV8ss7_1HyVP2IsstatN7R7mHEwKyGPu5xC7PavPUQh0W-TDOE2AFkfUpaR1rp7Rpu=w973-h724-no)

"Current" above refers to the current values we have for each variable.  "nearby" means we want to provide a number pretty close to our current number, for convenience's sake.  That way, when we subtract the current number from the nearby number, it will yield a small number that's easy to calculate in our ratio of change.  And it will give a more accurate slope when your two points are on a curvy graph line.  Let's walk through our 5 ratios together so we can practice finding the "nearby's" for each ratio.  Once we've done a full example, you'll be a giant step closer to understanding back propagation.
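The "current" and "nearby" idea fits in a few lines of Python.  This is a minimal sketch, and the function names `slope` and `sigmoid` are my own labels, not part of the tutorial's code.  It compares a straight line (y = x + 3, which matches our syn0,1 and l1_LH values of 2 and 5) with the curvy sigmoid, to show why "nearby" matters only for the curvy one.

```python
import math

def slope(x_current, y_current, x_nearby, y_nearby):
    """Ratio of change: (change in y) / (change in x) between two points."""
    return (y_nearby - y_current) / (x_nearby - x_current)

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Straight line y = x + 3: near or far, two points give the same slope.
line_far = slope(2, 5, 100, 103)
line_near = slope(2, 5, 2.001, 5.001)

# Curvy line (sigmoid): the farther away the nearby point, the rougher the estimate.
x = 0.94
curve_far = slope(x, sigmoid(x), 0.0, sigmoid(0.0))
curve_near = slope(x, sigmoid(x), x + 0.001, sigmoid(x + 0.001))
exact = sigmoid(x) * (1 - sigmoid(x))   # true slope of the sigmoid at x

print("line: ", line_far, line_near)
print("curve:", curve_far, curve_near, "exact:", exact)
```

On the straight line both estimates agree.  On the S-curve, the estimate from the very close point lands much nearer the true slope than the estimate from the far point.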

![alt text](https://lh3.googleusercontent.com/75_ZzCO0tvIh6-fnUhdzZFmkDgN7NuD5M7UCYqsQh32Dtuq2uGARzWAzErDCQcvfhcO5G-c9aTaELiBShVhZ0GdDXBUzSyjEmrUQM_aboQhxKRwoA68-WaqTdTW7S7slBIGWD6uqT7EKC2_8DGvXpNQp1uCMvZO41IVrQZdtrIrin7rOhkntmCaEkgthFsqLNbZ8qBtBOR_GyqlinLtXbeQXi6gEKO_rFBvMw-J8RskOH700ebxqoO_Lgda-HOY_EWkE-AYZb605V9atohsanr6c2_KSQYcnJKSPRTbTsIC6CBfLsgv8L-9w_o5_sHbI5LZYW2AC48hI9QqpXt8Np2HRb2JsAPR2xctF0Kb7pLccGJCeIxinFk-ehIor56GSiHonA5EyjWzMHlR0wosODXtXpk967Rm4yWIrep8yw-zEB_KBln6yHfqAmMVV3VYioIUpFnA16LRzWFjoR3_n1GYT0_OOr2BOg7L9vk0oa40I3wf3fRLpTSZINzA73j-CZVsgf5K7E0BH_odTWEZettoXgOOP4JW3YAD_HNvN14V8qe40yeAamT5YHjKn3h5iB24zUhVgP2EJRFuj1dCafiC9D8ITycHvM4ZbZ05Vf-k6_nRu3lCORL4_7p5qxmeI3cc6qowihu570GEEF_gs5T1wv2Az5Cso=w973-h473-no)

OK.  Ratio 1, which is Ripple 5, the hurricane in China (since "back propagation" means we're working *backwards* through the 5 ripples of our ripple effect, right?):  d l2_error / d l2.  Where did our "currents" and our "nearby's" come from?  Well, x_current is the l2 we calculated from our forward feed, 0.7.  y_current is y-l2 = 1-0.7 = 0.3, our l2_error.  x_nearby is simply a convenient example we made up.  We know that if l2 were 0.9, which is indeed nearby our x_current of 0.7, then y-0.9 would be 0.1.  Hence, x_nearby = 0.9 and y_nearby = 0.1.  Once you are clear on your 4 variables, the math is easy: (0.1 - 0.3) / (0.9 - 0.7) = -1, so the slope, aka the sensitivity, is -1.  This means that for every 1 you increase l2, the l2_error decreases by 1.  A delta of 1 in our l2 produces a delta of -1 in our l2_error.  Nice.

Ratio 2, which is Ripple 4, the storm over the Pacific:  d l2 / d l2_LH.  We know x_current is l2_LH from our forward feed: 0.94.  y_current is our l2, 0.72 (I added another decimal place).  But how do we find the nearby's?  We eyeball the S-curve of our sigmoid function in the diagram above, to find a convenient pair of coordinates to plug in.  We notice that at 0 on the X axis, Y is 0.5.  Nice, let's use that.  So, our x_nearby becomes 0 and our y_nearby becomes 0.5, and it's all over but the math: (0.5 - 0.72) / (0 - 0.94) = 0.23.  

Why, you might ask, do we eyeball the S-curve of our sigmoid function?  Look at the code in line 57.  It's not asking us to squish the LH side of l2 into a number between 0 and 1 with the `return 1/(1+np.exp(-x))` portion of our sigmoid code.  Rather, this time it's asking us to take the slope (aka derivative) of l2, by calling the `return x*(1-x)` code with `(deriv==True)`.  So, the input is on the X axis (i.e., 0) and the output is on the Y axis (i.e., 0.5).
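Here is a minimal sketch of how those two `return` statements can sit together in one function.  The two formulas are the ones quoted above; the surrounding function body is my reconstruction, so treat it as an assumption rather than the exact original code.

```python
import numpy as np

def nonlin(x, deriv=False):
    # With deriv=True, x is expected to ALREADY be a sigmoid output
    # (such as l2), and we get back the slope of the S-curve at that point.
    if deriv:
        return x * (1 - x)
    # Otherwise, squish x into a number between 0 and 1.
    return 1 / (1 + np.exp(-x))

l2_LH = 0.94
l2 = nonlin(l2_LH)                   # forward feed: about 0.72
slope_l2 = nonlin(l2, deriv=True)    # slope at that point: about 0.20

print(l2, slope_l2)
```

The formula gives a slope of about 0.20, a bit less than the 0.23 we eyeballed.  That gap comes from our nearby point (0, 0.5) being fairly far away on a curvy line.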

Next is Ratio 3, Ripple 3, the thunderstorm in Hawaii: d l2_LH / d l1.  We know x_current is l1, or 0.98 (If you eyeball the S-curve, you'll notice that 0.98 is the Y coordinate of 5, which was our l1_LH product.  Cool, huh?).   y_current is l2_LH, the product of l1 x syn1,1 *plus* the products of the other neurons multiplied by their synapses (which we conveniently made up as -2) = 0.94.  Note that, on this ratio, the code does not ask us to take the derivative (aka slope).  So for our x_nearby and our y_nearby, we can choose any darn number we please.  Let's keep it simple and choose 1's all around.  Subtract, divide, done: (1 - 0.94) / (1 - 0.98) = 3, which is exactly syn1,1.  Hallelujah.

Ratio 4, Ripple 2, the heavy winds in L.A.  Having waded far upriver, we are now nearing the source of our ripples, syn0,1!  Exciting.  Note that the code in line 71 on this one is asking us to take slope, so you know we'll be eyeballing our S-curve again.  x_current is l1_LH, or 5, so look that up on our X axis.  y_current lines up at about 0.98, no?  Done.  Now, since we're taking slope this time, for our x-and-y nearby's we have to find a nice pair of coordinates on our S-curve that make for convenient math.  How 'bout x_nearby as 6?  That would make y_nearby about 0.99.  Lovely.  Do the math and it's 0.01.  Joy.

Final Ratio, #5, Ripple 1, the gust of wind in Nevada.  The source of our mountain stream.  x_current is syn0,1, or 2.  y_current is l1_LH, or 5.  Line 76 of our code is not asking for any slopes/derivatives because it is a linear function (which looks like a straight line on a graph), so the distance between coordinates can be very large and the slope will still be exactly right.  For curvy functions (parabolas, sigmoids, etc) the points should be close together to minimize the effect of curvature on your estimate of the slope.  Thus, for our nearby's on this linear function, we can use numbers that are, well...convenient, rather than nearby.  Hence: x_nearby is 1 and y_nearby is 4 and the math happens to work out (again, conveniently) to 1, which is the l0 mentioned in line 76.  

OK.  We now have our 5 ratios, so let's multiply them together to come up with an answer to our question, "How much will l2_error increase/decrease, depending on how much I increase/decrease syn0,1?"  

```
1 x 0.01 x 3 x 0.23 x -1 = -0.0069 = d l2_error / d syn0,1
```
In other words, for every 1 I increase syn0,1, l2_error changes by -0.0069 (that is, it decreases by 0.0069).
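If you'd like to check the arithmetic, here is a minimal Python sketch that plugs the same "current" and "nearby" pairs into the slope formula and multiplies the five ratios together.  The function name `slope` and the variable names are my own labels.  Because the sketch keeps Ratio 2 at full precision instead of rounding it to 0.23, the product comes out at roughly -0.007 rather than exactly -0.0069.

```python
def slope(x_current, y_current, x_nearby, y_nearby):
    """Ratio of change between a current point and a nearby point."""
    return (y_nearby - y_current) / (x_nearby - x_current)

# Working backwards, from the hurricane to the butterfly.
ratio_1 = slope(0.7, 0.3, 0.9, 0.1)      # d l2_error / d l2
ratio_2 = slope(0.94, 0.72, 0.0, 0.5)    # d l2 / d l2_LH
ratio_3 = slope(0.98, 0.94, 1.0, 1.0)    # d l2_LH / d l1
ratio_4 = slope(5.0, 0.98, 6.0, 0.99)    # d l1 / d l1_LH
ratio_5 = slope(2.0, 5.0, 1.0, 4.0)      # d l1_LH / d syn0,1

# Multiply the chain together: d l2_error / d syn0,1
chain = ratio_5 * ratio_4 * ratio_3 * ratio_2 * ratio_1
print(ratio_1, ratio_2, ratio_3, ratio_4, ratio_5)
print(chain)
```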

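Now let's walk through what would happen if I updated syn1,1 for the next iteration, using line 75 of our code.  Note that the code does not add our ratio of -0.0069 to the weight directly.  It multiplies l1 by l2_delta, which is the l2_error times the slope of l2 (that's `x*(1-x)` with x = l2 = 0.7):
```
syn1 += l1.T.dot(l2_delta) 
Note that "+=" means to add to the existing value, so:
l2_delta = l2_error x slope of l2 = 0.3 x (0.7 x 0.3) = 0.063
3 += 0.98 x 0.063
= 3 + 0.06174
= 3.06174 = our new syn1,1 to be used in our next iteration!
```
The weight goes *up* a little, which pushes l2 up from 0.7 toward our target y of 1, and that is exactly what shrinks l2_error.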
If you have followed things thus far, then you are well on your way to becoming a Back Propagation Rock Star.  If not yet, hey--reread the above several more times and click on the helpful links of the Super Teachers I have cited above.

Here's a question I had when I had arrived at this stage: Why bother taking the slope of l2?  

We take the slope of l2 to fix the most mistaken of our 16 weights faster.  How?  Well, you may recall from our discussion of the Sigmoid function (and the S-curve diagram) above that the slope of l2 is the confidence level of l2.  The smallest slope numbers indicate the highest confidence level.  Therefore, multiplying the corresponding values of the l2_error by these small numbers ain't gonna cause a big change in the l2_delta product, which is good.  We don't want to change those weights much, because we're already pretty confident in the job they're doing.  

But the l2 prediction numbers that we are *least* confident in have the steepest slope, which yields a larger number.  When we multiply that larger number by the l2_error then the resulting l2_delta has a bigger number.  When we update syn1 later on, that bigger multiplier is going to mean a bigger product, and therefore a bigger change, or tweak, in that value.  This is as it should be, because we want to take the weights we have the least confidence in and change them the most.  That's where we will get the biggest "bang for our buck" when it comes to tweaking the 16 weights of our system.  To summarize, taking the slope of l2 gives us the confidence of each l2 prediction, which allows us to home in on the numbers that most need fixing, and fix them the fastest.
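A few made-up predictions make the pattern easy to see.  In this minimal sketch, the helper name `sigmoid_slope` and the list of l2 values are hypothetical, and I hold l2_error fixed at 0.3 purely to isolate the effect of the slope (in a real network each prediction would have its own error).

```python
def sigmoid_slope(output):
    """Slope of the sigmoid given its OUTPUT, same as x*(1-x) in our code."""
    return output * (1 - output)

l2_error = 0.3   # same size of miss every time, to isolate the slope's effect

for l2 in (0.5, 0.7, 0.9, 0.99):   # hypothetical predictions
    l2_delta = l2_error * sigmoid_slope(l2)
    print(f"l2={l2:<5} slope={sigmoid_slope(l2):.4f}  l2_delta={l2_delta:.4f}")
```

The wishy-washy prediction of 0.5 gets the steepest slope and therefore the biggest l2_delta, while the confident 0.99 gets a tiny one.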

The next key step is for you to understand how the computer code is doing the same thing we just did manually with our math equations.  I want to show you how these few lines of code...
```
l2_delta = l2_error*nonlin(l2,deriv=True)
l1_error = l2_delta.dot(syn1.T)
l1_delta = l1_error * nonlin(l1,deriv=True)
syn1 += l1.T.dot(l2_delta)
syn0 += l0.T.dot(l1_delta)
```
...are doing the same thing as this math equation below.

![alt text](https://lh3.googleusercontent.com/NgW0Tm9i3p6SI00P1AYxz5mpXtVNmg49DyHZUTiEHRdvQrzjIP_phX6_-Jm48axERmwUPDTNn1rjIzAq8Af4nm0CEBUD7e6Po2V-CCED42kLSCWDCB9Ts36_ahavGLeXiHBVdlCn72eJvgNyAD3wHYogx53fbhoivDiijFAQqA9sqgbL70x3vfCnjXRCr-LTpxrn1-50fgM-G4lnYvl1eBjscRL-0iJJuvaw1zyehgTDEyMbT4o6CSTNtsc3VSY1zfy5-y_s0J3XgkxT5wEkkWqylGmcY29Mm4QYhAQmnVpO95p7z2yxPCXyMEfe3xhhuZULsgNv6KmovD1TcnSk32Z-UsyCLVsIWSoU9pXN-KIPo7DmnPP8wguwG1aQoGrUzjoCPf8wTcq1bRBjvCWOYvavp1jXJTrrvJ3v-sKRpV94_gKWjCdwADCsXYo16mG2gD09QpOllbt91Myzv7Zq312IFYHEju3Lp-mVfR1mS-l__LaO9TENVD2DkwLDqIDFGEf6Pnumam1x6VU4b-vCVZLUBzaClqewI539lMFW-opF_1CPieAtdwJOjP1173xl4ei1yKBKXELbKjxBhTiHGsJOW-pYjBCc2ZvtfpqJINTd-cpJKhmITbC5OsQ_iPNn-ebkpJ74gvt-uONwHoSlV7waUjLkRcsx=w973-h615-no)

If you compare the code above to the math in the diagram and they don't look consistent to you, that is only because our original code breaks the back propagation process down into several intermediary steps with several extra, intermediary variables.  If you ignore the intermediary variables l2_delta, l1_error, and l1_delta, suddenly things align very nicely.  Look at the line of code beginning with Greek letter Cap delta syn0,1 and follow the arrows from each piece of code to each ratio.  Makes sense?  Lines up nicely, right?  That, my friend, is back propagation.  It took me about one month of daily study to finally arrive at this insight, so don't feel bad if you haven't nailed it yet.  
```
d l2_err / d l2 = -1 (The target value of l2 is 1.  If l2 increases, then the corresponding error decreases.  Increasing l2 by 0.1 causes the error to change by -0.1.  The ratio of the change in error to the change in l2 is -1.)
d l2 / d k2 = slope of l2
d k2 / d l1 = weight syn1,1
d l1 / d k1 = slope of l1
d k1 / d syn0,1 = l0
```
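To convince yourself that the code and the chain of ratios really are the same thing, here is a minimal sketch that runs both on the rounded numbers from our forward feed.  It is a one-example, one-path toy: every array is 1x1, `nonlin` is my reconstruction of the sigmoid function, and `syn0_update` and `chain_update` are names I made up.

```python
import numpy as np

def nonlin(x, deriv=False):
    if deriv:
        return x * (1 - x)       # slope, given a sigmoid output
    return 1 / (1 + np.exp(-x))  # the sigmoid itself

# Rounded values from our forward feed, as 1x1 arrays.
l0 = np.array([[1.0]])
l1 = np.array([[0.98]])
l2 = np.array([[0.7]])
syn1 = np.array([[3.0]])
y = np.array([[1.0]])

# The back propagation lines, with their intermediary variables.
l2_error = y - l2
l2_delta = l2_error * nonlin(l2, deriv=True)
l1_error = l2_delta.dot(syn1.T)
l1_delta = l1_error * nonlin(l1, deriv=True)
syn0_update = l0.T.dot(l1_delta)

# The same thing as one product, skipping the intermediary variables:
# l2_error x slope of l2 x syn1,1 x slope of l1 x l0
chain_update = l2_error * (l2 * (1 - l2)) * syn1 * (l1 * (1 - l1)) * l0

print(syn0_update, chain_update)
```

Both routes give the same tweak for syn0,1.  Notice there is no -1 in the product: because l2_error is defined as y - l2, the code can simply add the result with `+=`.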


##12) In what DIRECTION is the target (ideal) l1?  Lines 66-69
As before, we compute l1_delta by multiplying l1_error by the derivative of the sigmoid to aggressively change low confidence values.  We will use the exact same process as Step 10 to find in what direction our gradient descent should be moving in order to take us closer to the perfect l1 that will contribute to us finding the perfect l2, our ultimate goal.

We want to answer the question, "In what DIRECTION is l1, the desired target value of our hard-working middle layer 1, from l1's latest prediction in this current iteration?"  We want to tweak this middle layer of our network so it sends a better prediction to l2, making it easier for l2 to better predict target y.  In order to answer this question, we need to find the l1_delta, which tells us how much to adjust the weights to produce large changes in low confidence values and small changes in high confidence values.


##13) Gradient Descent: How the synapses, rather than the neurons, are the core of your network's "brain."
Lines 71-74
This final step is the Glory Moment:  all our work is complete, and we reverently carry our hard-earned l1_delta and l2_delta up the steps of the podium to our hallowed leader, Emperor Synapse, the true brains of our operation.  
We compute the update to syn0 by multiplying l1_delta by the input l0.  This causes large changes in components of syn0 that have stronger effects on l1.
We update syn1 and syn0 so they will learn from their mistakes of this iteration, and in the next iteration they will lead us one step closer to that ideal bottom of our bowl, where error is smallest, predictions are most accurate, and joy abounds!

It is efficient to change weights in the synapse that correspond to larger values of l1 (i.e. if a node of l1 has a large value, a small change in the weights that are multiplied by this value can have a large effect on l2).  The multiplication ensures that the total change applied to the synapse maximizes the impact on l2.  In other words, it produces an increment in the direction of steepest descent (opposite of the gradient).
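Here is a minimal sketch of that multiplication for a single training example, assuming four hidden neurons.  Only the 0.98 comes from our worked example; the other three l1 values are made up for illustration, and l2_delta is 0.3 x 0.21 = 0.063, following the `l2_error*nonlin(l2,deriv=True)` line with l2 = 0.7.

```python
import numpy as np

# One training example, four hidden neurons (assumed).
# Only 0.98 is from our worked example; the rest are hypothetical.
l1 = np.array([[0.98, 0.50, 0.10, 0.02]])
l2_delta = np.array([[0.063]])     # l2_error * slope of l2 = 0.3 * 0.21

# One tweak per weight in syn1, shape (4, 1).
syn1_update = l1.T.dot(l2_delta)
print(syn1_update.ravel())
```

The weight attached to the largest l1 value gets the largest tweak, and the weight attached to the nearly silent neuron (0.02) is barely touched.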


#In Closing...
Andrew Trask gave me a fabulous gift when he wrote that memorizing these lines of code leads to mastery, and I agree for two reasons: 

1) When you try to write out this code from memory, you will find that the places where you forget the code are the places where you don't understand the code.  Once you understand this code perfectly, every part of it will make sense to you and therefore you will remember it forever;

2) This code is the foundation on which (perhaps) all Deep Learning networks are built.  If you master this code, every network you learn and every paper you wade through will be clearer and easier because of your work in memorizing this code.

Memorizing this code was made easy for me by making up an absolutely ridiculous story that ties all the concepts together in a fairy-tale mnemonic.  You will remember better if you make your own, but here's mine, to give you an idea.  I count the 13 steps on my fingers as I recite this story out loud:

1) Sigmund Freud (think: Sigmoid Function) absolutely *treasured* his neural network, and he buried it like a pirate's treasure, 

2) "X" marks the spot (Creating X input that will become l0).  

3) "Why," I asked him (Create the y vector of target values), "didn't you plant 

4) Seeds instead?" (Seed your random number generator)  "You could have grown a lovely garden of 

5) Snapdragons," (Create Synapses: Weights) "which could be fertilized by the 

6) firm poop" (For loop) "of the deer that 

7) Feed on the flowers" (Feed Forward Network)!  Then suddenly, an archer 

8) Missed his target (By How Much Missed the Target?) and killed a grazing deer.  As punishment, he was forced to 

9) Print his error (Print Error) 500 times on a blackboard facing the 

10) Direction of his target (In What Direction is y?).  But he noticed behind the 

11) BACK of his target two deer were mating and PROPAGATING their species (Back Propagation) and he shouted for them to stop but they wouldn't take 

12) Direction and ignored him (In what Direction is the l1 target?).  He got so angry that his mind 

13) Snapped and he Descended into Gradient insanity (Update Synapses, Gradient Descent).  

So, this is a very silly story, but I can tell you that it has burned those 13 steps into my brain, and once I can write down those 13 steps, and I understand that code in each step, to write the code perfectly from memory becomes easy.

I hope you can tell that I love my journey into Deep Learning, and I wish you the same joy I find!
Feel free to email me improvements to this article at: DavidCode1@gmail.com

THE END
