<a href="https://colab.research.google.com/github/davidAcode/davidAcode.github.io/blob/master/042519_Teaching_Deep_Learning_to_the_Marginalized_Without_a_Math_Background.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Building Your First Deep Learning Neural Network: 
##A Simple, Complete Explanation That Skips No Steps
Have you heard of "expert blindness"?  It's what happens when an expert teaches a subject like AI to a rookie, but lacks a feel for which concepts might confuse the rookie.  So the expert passes too quickly over complex concepts that need to be broken down into bite-size, user-friendly parts.  Or the expert teaches without analogies, pictures or examples that would help the rookie to grasp the concepts.  The rookie ends up feeling frustrated and overwhelmed.

You see, everyone thinks they want an expert to teach them AI.  But you actually don't want an expert.  You want a **teacher** to teach you AI.

The best teacher is the person who just learned yesterday the stuff you are studying today, because he still remembers what he struggled with and how he overcame it, and he can pass those shortcuts on to you.  That "he" would be me.  I'm not an expert, but I am passionate about teaching.  Let's jump in with a real-life example of Deep Learning:

#1) Congratulations!  You are the wealthy owner of a fabulous pet shop...  
One month ago, you launched a new kitty litter product, cleverly named, "Litter Rip!"  A big part of your success comes from your savvy use of AI to send targeted advertisements to the right potential new customers.  You seek folks whose beloved felines would be pleased to, "let 'er rip," so to speak, upon your new cat toilet.

Your secret weapon is your dataset.  You have data from surveys of your pet shop customers in the past month since you started selling Litter Rip! in your store.  These customer surveys include their answers to four stunningly insightful questions:  
1. Do you own a cat who poops?
2. Do you drink imported beer?
3. In the past month, have you visited our award-winning website, LitterRip!.com?
4. In the past month, have you purchased Litter Rip! for your poopin' puss?

The answers to the first three questions are known as "features" (characteristics) of your past customers; the answer to the fourth is the outcome you want to predict.  First, you will train your network by inputting the data from millions of past, overjoyed customers and their Yes/No answers to the first **three** of the above insightful questions/features.  Your neural network will train itself, making its predictions based solely on your old customers' answers to the first three questions above, until it becomes **awesome** at predicting which of your old customers probably did buy Litter Rip!  You will then use your past customers' answers to the **fourth** question as your answer key (the target values), to see how accurately your network is predicting.  Each time your network uses the first three answers to predict whether a given customer bought Litter Rip!, you compare that prediction to their actual answer to Question Four.  For example, if your network predicts, "Yep, I bet this customer bought Litter Rip!" and that customer's answer to Question Four was indeed Yes, then your network got that one right.

The process is trial-and-error:  the network will predict, then compare its predictions to the old customers' answers to Question Four, and learn from its mistakes over 60,000 iterations.  

It's important to understand that a neural network always trains on one dataset in order to make predictions on another dataset.   Once your network is fabulous at predicting purchasers of Litter Rip! from the past customers' database, then you can turn it loose on your *new* database, a list of hot prospects.  From your local veterinarian (who is secretly in love with you, you charmer...) you have obtained a fresh batch of surveys of people who have answered the same first three questions, and your by-now-well-trained network will predict who best to send your targeted ad to.  Pure genius!  OMG, how do you *do* it?

The best way to master Deep Learning is to build a neural network and then memorize the code, because that way you can apply these fundamentals to any network you meet in future academic papers or work projects.  It all starts here.  Today you will build your first neural network using these tools:

1.   A real-life example: marketing Litter Rip! to cat owners;
2.   The Big Picture: an analogy of a neural network as a brain, with neurons and synapses;
3.   Visualizing how networks "learn:" a white ball rolling inside a red bowl on a table;
4.   The Code: how a computer creates a learning brain with only 21 lines of Python;
5.   Breaking down this Python code into 13 steps that cover the 5 **Major Themes** of Deep Learning (Don't worry if none of these is familiar; I'll explain them all below.):

> 
1.   A network learns by trial-and-error;
2.   Forward Feed computes error as the distance (the height) of the white ball from the white grid on which the red bowl's bottom (the global minimum) lies;
3.   The Sigmoid, a simple activation function, calculates probability and confidence levels;
4.   Back Propagation tells you which parts of your network to tweak to reduce your error on the next iteration; and
5.   Gradient Descent shows how the synapses, rather than the neurons, are the core of your network's "brain."  

I mastered today's material thanks to [Andrew Trask](https://iamtrask.github.io/2015/07/12/basic-python-network/), [Grant Sanderson](https://www.youtube.com/channel/UCYO_jab_esuFRV4b17AJtAw) and [Kalid Azad](https://betterexplained.com/), as well as my mentor Adam Koenig, a Stanford PhD in aerospace engineering, and a gifted teacher.  Now I'm going to stand on these teachers' shoulders and help you learn it too.  We have already begun with #1, our real-life example of using AI for marketing, to target the right ads to the right prospective buyers of (superlative!) kitty litter, and we will weave this (charming) scenario through all of the concepts below so you can see the practical application of the concepts you are learning.  

And by the way, Yes.  I know Survey question #2 asking if the customer drinks imported beer is irrelevant.  I deliberately chose an irrelevant question to show you later on how the network sorts out the most useful features from less-useful features when it is making its predictions.

Let's move on to Step 2 of five: an analogy for how neural networks work:

#2) Here is the Big Picture: An Analogy
Here is a diagram of the 3 layer neural network we will build today (For now, just focus on the bottom labels: "Input Layer, Synapses, etc."  We'll get to the labels at the top later):

![alt text](https://lh3.googleusercontent.com/CkHAz8xdeL8F5_ZqEysrOsFK-mMa2VLn_dqNcFX56tQOSR62eiNbw2OVGVPKehgkrYA8tAy9bXbfC09qglgVAVqZPghLRvJ_xaEzm1nRbltZGCvkCqZ7O59cihqL0723imx9LLdtz26LQNtKaJC13uPuWORKAx3rAfK3ZOz2MIlHLV_MO0b730fUqZLO9HwOtU8sIkFvwyLQp2tMy7G5JTGi-3gAqhspbKI684vLav73AsmHVNwcd4bnBB_0F844ehpNHjKp_ei1zl1g1WFHvM8yPbvU4WsNgBGpiVdrBJmhlFTuAcCjlMBqB8jOHP_9JMXZRaOcBNjo5QSfQ56DILw2C4h5W6g8VHpEv1k4HRWrQ4jqKPXWIeiTQB1ZBrSZwNBJfqK9yxFU1XDgP_N1s6uUGCF3v2Ae5qGiab15bRKUkORmXKN6IAS3kxfVgt3fXypSxr76Gv5Gdg27D3oJ4jmkCiiP8f7rsP6TvRH6PaTGqqcsgxzxx3KeQ9Wd3UdMsDhnt1g_2KMnPK0NvBmx3PFcrNc3dY6L_5fBzrc_-p-5R4Ch1anPROZUKX92Zrp1VFAE-kg2Mb9JdCzzOKopmpL6-SMPxGUQKAUJg2UQJDaI0GfGRM6OGpCZJ19Sl0gVY2unHZmpWroCm21TV_IvF6zzaJxc5vxJ=w905-h510-no)

































Let me help you get your bearings in this diagram.  This is a three-layer, feed-forward neural network.  The input layer is on the left: the three circles represent neurons (sometimes known as nodes or features).  You may recall our first three questions above represent our three features.  So for now, think of that column of three questions as representing one customer's responses.  The top circle on the left contains the answer to the question, "Do you own a cat who poops?"  The middle circle of the three contains the answer to, "Do you drink imported beer?"  The bottom circle of the input layer is the feature, "Have you visited our website, LitterRip!.com?"  So, if Customer One responded, "Yes/No/Yes" to the questions, the top circle would contain a 1, the middle circle would contain a 0, and the bottom circle would contain a 1.
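To make that concrete, here is a minimal sketch of turning one customer's Yes/No answers into the 1's and 0's that sit in the input layer.  The variable names (`answers`, `input_layer`) are mine, chosen only for illustration:

```python
# One customer's answers to the first three survey questions, in order:
# own a cat? / drink imported beer? / visited the website?
answers = ["Yes", "No", "Yes"]

# Encode Yes as 1 and No as 0, one number per input neuron.
input_layer = [1 if answer == "Yes" else 0 for answer in answers]

print(input_layer)  # [1, 0, 1]
```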

The synapses (all those lines) and the hidden layer above are where our brain does its "thinking."  The single circle on the right-hand side (the one still attached to four synapses) is the network's prediction, which says, "Based on the combination of answers input into me, here is the probability that this customer will indeed buy Litter Rip! (or not)."  The predicted probability will be between 0 and 1.  For example, an output prediction of 0.00001 means, "Definitely won't buy,"  0.2 means "Probably won't buy," 0.8 is "Probably will buy," and 0.999 is, "You're damn right they'll buy Litter Rip! with joy in their hearts!" 

The stand-alone circle on the right-hand side, labeled y, is The Actual Truth, which is the answer to the fourth question, "Have you purchased Litter Rip! for your poopin' puss?"  For the circle labeled "y" there are only two choices: "0" means No, they did not buy, and "1" means Yes, they bought.  Our neural network will output a prediction of probability, compare it to y to see how accurate it was, and learn from its mistakes in the next iteration.  Trial-and-error.  60,000 times.  In seconds.  That's the power of Deep Learning.

























#3) Another Big Picture: Visualizing How Networks Learn
Some people learn visually, so seeing a geometric representation of a neural network below may help them master the material.  But, if you feel confused at what you see in this diagram, no worries at all.  I will teach you every step of the process below, but I want you to begin to get a vision of where we're headed.  This picture is by far the most concise summary of Deep Learning that I have seen.  It gives you a visual representation of: 

1.   Feed Forward;
2.   Global Minimum;
3.   Back Propagation; and 
4.   Gradient Descent.
  
If you feel disoriented, don't panic--all will be explained later, and you'll be referring to this diagram many times:

![alt text](https://lh3.googleusercontent.com/jIup60T65tIKtXg0B-Np6jeNXk4TvQTRgBI1btNRZUZ4yy_ZEyL1bN3RwiSjzKNcbyXQN6z7vdV55NzGFxJfUpZXkyU6HTmrScht0rbk5BXGC6eO79LrZuuVpJdHE4fr4QYwvdbO)
(taken with gratitude from [Grant Sanderson](https://www.youtube.com/watch?v=IHZwWFHWa-w&t=2s&index=3&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi) Ch. 2)

































##1) Feed Forward##
In the diagram above, our network begins its predictions about kitty litter buyers at the top of the dotted, white line on the surface of the warped, curvy red bowl.  That point is very important, so let's call it Point A.  Think of Point A as the first prediction the network makes.  This is the first forward feed.  

What most AI courses and blogs forget to mention is that Point A has a location defined by two coordinates on the white grid underneath the curvy red bowl.  This curvy bowl is not drifting in space.  It is sitting on a white grid. Think of that white grid as a tabletop, and our warped red bowl actually sits on the tabletop, but the bowl's only point of contact with the tabletop is the very bottom-most dip in the bowl--the point where the dotted white line ends--the global minimum.  (I know the bowl would be too lopsided to actually sit properly, but humor me--this is just a quick-and-dirty analogy).    

Now, pretend Point A is a little, white ball.  It has 3 coordinates: the X and Y axes tell you location on the grid (tabletop), and the third, Z coordinate is the height from the grid to Point A as it sits on the inner surface of the warped, red bowl.  Here's  a super-key insight: the two orange arrows you see, representing the X and Y axes on the grid, are actually the synapses syn0 and syn1 that you will learn about below.  As I eyeball Point A, its coordinates on the white grid look like (3,3) to me, assuming the orange arrows are the X and Y axes.  So for now, let's say syn0 is 3 and syn1 is 3 (I'm oversimplifying here on purpose).

Here's where it gets cool: remember Customer One's Yes/No/Yes responses to the first three questions?  The network takes that 1,0,1 of Customer One, multiplies it by the synapses of syn0 and syn1 (plus some other fancy jazz, which we'll get to later), and makes a prediction: let's say, 0.5.  That's our first feed forward.  In other words, there is a 50% probability that Customer One bought Litter Rip!  Then the network compares that prediction, 0.5, to The Actual Truth, y, which is 1.  Customer One did indeed buy Litter Rip! (a wise choice), but our first prediction of 0.5 wasn't so accurate.  

And now things get super-cool:  Do you see that yellow, vertical arrow?  If I were drawing this diagram, I would have put that yellow arrow right under Point A--namely, at about (3,3) as I eyeball the grid.  If the vertical, yellow arrow were located just underneath Point A, then it would represent the 3rd coordinate, the height from Point A to its two coordinates on the white grid, syn0 and syn1, or (3,3).  This height equals the error of that first prediction--how much it missed The Actual Truth.  That is to say y, the fourth survey answer of Yes, or 1, minus the prediction of 0.5 equals a miss of 0.5.  That yellow arrow would have a "length" of 0.5!  Ain't that cool?  We can see how the network thinks in terms of simple, 3-D geometry.  You just witnessed your first Forward Feed, complete with yellow arrow measuring l2_error--how much we missed the target of y (which in this case is 1 or Yes).  You're a rock star.
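If you'd like to see that arithmetic in code, here's a rough sketch.  To keep it tiny, I'm pretending the network has just one weight per input and no hidden layer, and I hand-picked three hypothetical weights so the prediction comes out to 0.5.  The real network further down starts from random weights instead:

```python
import numpy as np

def sigmoid(x):
    # Squashes any number into the range 0 to 1.
    return 1 / (1 + np.exp(-x))

customer_one = np.array([1, 0, 1])      # Yes / No / Yes
weights = np.array([0.7, 0.2, -0.7])    # hypothetical, hand-picked values
y = 1                                   # The Actual Truth: Customer One bought

prediction = sigmoid(np.dot(customer_one, weights))  # a toy feed forward
error = y - prediction                  # the "length" of the yellow arrow

print(prediction, error)
```

The weighted sum here works out to 0, and the sigmoid of 0 is 0.5, so the yellow arrow is 1 minus 0.5, or 0.5 long.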











And it gets better.















##2) Finding the Global Minimum##
The goal of a neural network is to find the fastest, most direct path for the white ball to roll down the curvy surface of the red bowl:  down from the original Point A to the bottom-most dip of the warped, red bowl--the point that sits on the tabletop.  That is Utopia, folks--the place where the yellow arrow, the error in our predictions, would equal zero in length, meaning our predictions have no error and therefore our network would be stunningly accurate (the global minimum).  And the fastest, most efficient path for the white ball to roll down the surface of the curvy red bowl is represented by that white dotted line which hugs the surface of the red bowl.  

So, consider our scenario now: we started at the top of that white, dotted line, where Point A's height (the yellow arrow, the amount of error in our first prediction) equals 0.5.  How do we roll the ball down the bowl to get to the bottom, where the yellow arrow (the amount of error) is zero?  In other words, how do we tweak our original synapses of (3,3) so that their coordinates will move on the grid to the point *right* under the bowl's bottom, at (roughly) (3,0)?  So far, we have only made one prediction based on Customer One's responses.  How can we improve each subsequent prediction, based on each customer's responses, until our prediction error is zero, the white ball is at the bottom, and our fabulous network is trained enough to take on our new dataset of hot prospects that were provided by the (smitten) veterinarian?

To repeat: our goal is for our small, white ball of Point A to roll down the curvy bowl as depicted in the diagram by the dotted white line, which represents our network's journey from its first, not-so-great prediction of 0.5 to the final, 60,000th prediction, which is as accurate as possible (meaning, close to 1).  To find that step-by-step path down the surface of the bowl to the bottom is the process of Gradient Descent.  











##3) Back Propagation and 4) Gradient Descent##
The key ingredient of Gradient Descent is Back Propagation:  When you start at Point A, it is Back Propagation that tells you the slope of the surface of the bowl under the white ball.  Finding the slope of the bowl's surface, where our ball lies, tells us the direction our ball should roll to make the quickest descent to the bottom of the bowl where error is 0 (i.e., where our ball would be touching the syn0, syn1 grid plane).  Each point on the dotted line as it descends to the bottom of the bowl is one iteration of the neural network.

So, what is an iteration?  **That's** a key question.  The first part of an iteration is our forward feed, which we saw above.  It is the forward feed that gets our white ball to its starting position at Point A, from whence it rolls down the curvy surface of our red bowl towards the bottom, which sits on the white grid, where "Accurate Prediction" lives (i.e., 0 in the cost function, the global minimum).  Each dot of that dotted white path represents a tweak, or update of the prediction weights assigned to syn0 and syn1.  Each iteration begins with a fresh forward feed that puts the ball in a new position in the bowl.

Here's the key thing to visualize: the path of our white ball as it rolls down the inside surface of our curvy red bowl is erratic.  You can see that the ball rolls over some bumps, rolls into some ravines, changes direction suddenly, etc.  To understand what's happening, let's start with our vertical axis first and imagine the yellow arrow (which is our vertical, Z coordinate and equals the error between our current prediction and The Actual Truth found in y) moves with the ball and always remains under the ball as it rolls erratically down the side of the bowl.  The yellow arrow has many changes in length, and also in the location of its base on the white grid below the bowl.  So, with each lurch the white ball makes along the surface of the red bowl, there is also a corresponding lurch on the white grid below.  As the white ball gets closer to the bottom of the bowl, the yellow arrow gets shorter and shorter until it reaches zero, where there is zero difference between our prediction and The Actual Truth found in y.  That means our prediction is accurate.  So when the yellow arrow equals zero, our horizontal coordinates of syn0 and syn1 on the X and Y axes must be right under the bottom of the bowl.  

To sum up: as the yellow arrow gets shorter and shorter, its horizontal coordinates on the white grid get closer-and-closer to arriving right under the red bowl's bottom (which in this diagram looks to be about (3,0)).  Make sense?

Since syn0 and syn1 are the X and Y axes of the white grid, each adjustment of those coordinates is what brings the white ball closer to the bowl's bottom, i.e., less error and greater accuracy.  The "learning" process of reducing error takes place in the synapses of the neural network pictured in my first diagram above--not the neurons.
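Here is a toy sketch of that downhill walk.  Fair warning: the bowl's formula, `(syn0 - 3)**2 + syn1**2`, is something I made up so that the bottom sits at (3,0) like in the picture, and `step_size` is a hypothetical knob of my own.  In a real network the bowl's shape comes from the data, and whole matrices of weights replace these two numbers:

```python
def bowl_height(syn0, syn1):
    # Made-up error surface: the height is zero only at the bottom, (3, 0).
    return (syn0 - 3) ** 2 + syn1 ** 2

def bowl_slope(syn0, syn1):
    # Slope of the made-up bowl in each direction (its partial derivatives).
    return 2 * (syn0 - 3), 2 * syn1

syn0, syn1 = 3.0, 3.0   # Point A: where the white ball starts
step_size = 0.1         # hypothetical: how far the ball moves per iteration

for iteration in range(100):
    slope0, slope1 = bowl_slope(syn0, syn1)
    # Step against the slope, i.e. downhill.
    syn0 -= step_size * slope0
    syn1 -= step_size * slope1

print(syn0, syn1, bowl_height(syn0, syn1))
```

Each pass through the loop is one dot on the dotted white line: measure the slope under the ball, nudge the two coordinates downhill, repeat.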

















The [diagram at this link](https://docs.google.com/document/d/1_YXtTs4vPh33avOFi5xxTaHEzRWPx2Hmr4ehAxEdVhg/edit?usp=sharing) may also be helpful in envisioning the geometry of neural networks.  It's essentially another version of the same, red bowl above, but from a slightly different angle.

The above two diagrams are a perfect example of why the best teacher is someone who learned yesterday the material you are learning today:  Over the past year of studying AI, I had seen the bowl analogy many times, but NO ONE mentioned the significance of the bowl sitting on the grid, or what the grid was, or the significance of the height from the grid to Point A.  You see, the experts already know this, and they (mistakenly) assume that you know it too.  This is the "expert blindness" I mentioned at the beginning of this article.  So please learn from my mistake, and gain a super-key insight that eluded me for over a year, even though it was right under my nose.  

Hopefully, now that you can SEE what a neural network does in 3-D, it will make it easier to understand why we do all these abstract steps with math and with code.

Again: if you are still confused by the above diagram and analogy, no worries.  I'm going to fill in all the details below, but at least now you know exactly where we're headed!  Godspeed.

#4) How to Create a Brain with 21 Lines of Code
Now, let's get an overview of our code.  I suggest you open this blog post in two side-by-side windows and show the code in the left window while you scroll through my explanation of it in your right window.  First, I'll show you the entire code we'll be studying today, and underneath that is my detailed step-by-step explanation of what it does.  As you see from the comments below (the lines beginning with a # that are interspersed with the lines of actual code), I have broken this process of building a neural network down into 13 steps.  Get ready for the wonder of watching a computer learn from its mistakes and recognize patterns!  We're about to give birth to our own little baby brain... :-)


















##But first, a word on the concept of a matrix and linear algebra...
You will see the word "matrix" or matrices in the code below.  This is VERY important: the matrix is the engine of our car.  Without matrices, a neural network goes nowhere.  At all.  

A matrix is a set of rows of numbers.  For a quick-and-dirty metaphor, think of a glorified XL spreadsheet.  Or a database with many rows and many columns of numbers.  The first matrix you will now meet holds the data from our pet shop customer survey.  It looks like this:
```
[1,0,1],
[0,1,1],
[0,0,1],
[1,1,1]
```
Think of each row as one customer, so there are four customers in the matrix above.  Each row contains three numbers (1's and 0's for Yes and No) between brackets, right?  So, row one above means Customer One answered, "Yes/No/Yes" to the three survey questions.  And column one contains our four customers' responses to question one, "Do you own a cat?"  Now, to fill in a little more detail, here's the exact same matrix as above, but with some labels and some color coding to emphasize my points: 


![alt text](https://lh3.googleusercontent.com/5HoREn3MrqlYiM5nCLBuYKCaG8eP-20obfFBPhLP2oFYTwemUWDeQBJG3jIdIPI7RQY-RUmsEArSj92PqNIDv2zPRu_XRENY4Xp5l9OoD_5wEZ6GqZd-79EPlH-vLeegN72Ifac-uax1W-hyx_CQG2tuXlT1f0_L-_zFO-ClyR9BMB3EmxHwhKvrgmuRRqlVuHgSFRLJPPNMRWPuce8FQk57h38ZlqxeeuApkTuGspzdoxjjrQsjuszSAVKcswu-U4OzhBkAouCLTJBTNE2Rbtaf4jmeVBZ7Y0b5D_t87C0MSAf8WsGCJ_yWzU_ZmAZuFi9ldaaF3ghzONtEANrLp7egvBq5azBG6TG3D8tY22THnIFdWHKxf7Kk9ULYi269lxxkWlu8OBIoy4bJFPklJFrQVUtiDZD6vzx0dwjphwCWuCIS1TaGosqhstnji0ITS8JGQY4gRlAsXRkVCgvPMKjT9r8K9ciTLse1OyvlnK-yExIgxwtnjZeAw9fAMEmNeioanEXs7maMHmykFWzaP2oLZh0ZuBhkGJs87Xes3xI4sivbVmPG-ONQjIj6Q26GgZdfm0PG62lQJLip18Opl7E27uUK2lhKTCq7LQmvriQa1jib6vYRhCmbV1Wkm8eIDDa-r23pz9_R9hhiQ8lHRUzKERXi6Gdx=w905-h510-no)

Hopefully, the above diagram clarifies a key point that really confused me at first: it is the relationship between rows of customers and columns of features.  You have to see a matrix as both.  Let's break it down:

In our matrix, one customer's data is represented by a ROW of three numbers, right?  And in a neural network diagram with all those circular neurons connected by lines of synapses, the input layer is a column containing three circular "neurons," right?  Well, it's important to notice that each neuron does NOT represent a customer--a ROW of data.  Rather, each neuron represents a FEATURE--a COLUMN of data.  So, within a neuron, we have *all* the customers' answers to the same question/feature, e.g., "Do you own a cat?"  We're only charting four customers, so in my diagram above, you only see four 1's and 0's that are responses to that question.  But if we were charting 1,000,000 customers, that top neuron would contain one million 1's and 0's representing each customer's Yes/No response to feature one, "Do you own a cat?"

So I hope it's becoming clear why we need matrices:  Because we have more than one customer.  In our toy neural network below, we describe four customers, so we need four rows of numbers.

Our network also has more than one survey question.  So we need one column per survey question (aka, per feature), thus we have three columns here representing the responses to the first three questions of our survey (the fourth question appears in a different matrix that we'll see later).  

So our matrix is tiny: 4 rows X 3 columns, known as a "4 by 3."  But the matrices in real neural networks can have millions of customers and hundreds of survey questions (features).  Or the neural networks that do image recognition in photos or video can have billions of rows of "customers" and billions of columns of features.  

In sum, we need matrices to keep all our data straight while we do complex calculations on it, so a matrix organizes our data into nice, neat little rows and columns (usually, not so little at all).  Good enough for now?  Let's move on.   
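If it helps to poke at the rows-versus-columns idea yourself, here is a small sketch using numpy (the same library the main code uses).  It builds the same 4 by 3 matrix and pulls out one row and one column:

```python
import numpy as np

# Four customers (rows) by three survey questions (columns).
X = np.array([[1, 0, 1],
              [0, 1, 1],
              [0, 0, 1],
              [1, 1, 1]])

print(X.shape)   # (4, 3): 4 rows, 3 columns
print(X[0])      # row one, Customer One's answers: [1 0 1]
print(X[:, 0])   # column one, everyone's answer to question one: [1 0 0 1]
```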

I am grateful for Andrew Trask's [blog post](http://iamtrask.github.io/2015/07/12/basic-python-network/) from which the code below is taken (though the comments are mine). Display this in your left window:

##IMPORTANT: 

The comments interspersed below with the Python code are a good summary, but they are complex.  Rookies should not feel intimidated if they don't understand.  All will be explained below in detail, so fear not-- later these comments in the code can serve as a nice reference if you need it in the future.
Also, note that I have inserted line numbers in front of the code, for easy reference.  If you copy-and-paste this code, you'll have to delete the line numbers and adjust the spacing.

In [0]:
#This is the "3 Layer Network" near the bottom of: 
#http://iamtrask.github.io/2015/07/12/basic-python-network/

#First, housekeeping: import numpy, a powerful library of math tools.
5 import numpy as np

#1 Sigmoid Function: changes numbers to probabilities and finds slope to use in gradient descent
8 def nonlin(x,deriv=False):
9   if(deriv==True):
10    return x*(1-x)
11  
12  return 1/(1+np.exp(-x))

#2 The X Matrix: This is our feature set from 4 of our customers, in language the computer
#understands.  Row 1 is the first customer's set of Yes/No answers to the first 3 of
#our survey questions: 
#"1" means Yes to, "Have cat who poops?" That "0" means No to "Drink imported beer?"
#The 1 for "Visited the LitterRip!.com website?" means Yes.  There are 3 more rows
#(i.e., 3 more customers and their responses) below that.  
#Got it?  4 customers, and their Yes/No responses 
#to first 3 questions (the 4th question is used in the next step below).  
#These are the set of inputs that we will use to train our network.
23 X = np.array([[1,0,1],
24               [0,1,1],
25               [0,0,1],
26               [1,1,1]])
#3 y Vector: Our testing set of 4 target values. These are our 4 customers' Yes/No answers 
#to question four of the survey, "Actually purchased Litter Rip?"  When our neural network
#outputs a prediction, we test it against their answer to this question 4, which 
#is what really happened.  When our network's
#predictions compare well with these 4 target values, that means the network is 
#accurate and ready to predict from the new dataset, i.e., whether our hot prospects 
#from the (hot) veterinarian will buy Litter Rip!
34 y = np.array([[1],
35               [1],
36               [0],
37               [0]])
#4 SEED: This is housekeeping. One has to seed the random numbers we will generate
#in the synapses during the training process, to make debugging easier.
np.random.seed(1)

#5 SYNAPSES: aka "Weights." These 2 matrices are the "brain" which predicts, learns
#from trial-and-error, then improves in the next iteration.  syn0 and syn1 are the 
#X and Y axes on the white grid under the red bowl, so each time we tweak these 
#values, we march the grid coordinates of Point A towards the red bowl's bottom, 
#where error is zero.
syn0 = 2*np.random.random((3,4)) - 1 # 1st layer of weights, Synapse 0, connects l0 to l1.
syn1 = 2*np.random.random((4,1)) - 1 # 2nd layer of weights, Synapse 1 connects l1 to l2.

#6 FOR LOOP: this iterator takes our network through 60,000 predictions, 
#tests, and improvements.
for j in range(60000):
  
  #7 FEED FORWARD NETWORK: Think of l0, l1 and l2 as 3 matrix layers of "neurons" 
  #that combine with the "synapses" matrices in #5 to predict, compare and improve.
  #l0, or X, is the 3 features/questions of our survey, recorded for 4 customers.
  l0=X
  l1=nonlin(np.dot(l0,syn0))
  l2=nonlin(np.dot(l1,syn1))
  
  #8 TARGET values against which we test l2, our prediction, to see how much 
  #we missed it by. y is a 4x1 vector containing our 4 customer responses to question
  #4, "Did you buy Litter Rip?"  When we subtract the l2 vector (our first 4 predictions)
  #from y, The Actual Truth about who bought, we get l2_error: how much our 4 predictions 
  #missed the target by, on this particular iteration.
  l2_error = y - l2
  
  #9 PRINT ERROR--a parlor trick: in 60,000 iterations, j divided by 10,000 leaves 
  #a remainder of 0 only 6 times. We're going to check our data every 10,000 iterations
  #to see if the l2_error (the yellow arrow of height under the white ball, Point A)
  #is reducing, and whether we're missing our target y by less with each prediction.
  if (j% 10000)==0:
    print("Avg l2_error after 10,000 more iterations: "+str(np.mean(np.abs(l2_error))))

  #10 In what DIRECTION is y, our desired target value, from our network's latest guess? 
  #This is the beginning of back propagation.  Here we calculate confidence levels by
  #taking the slope of each l2 guess.  We then multiply it by how much that latest 
  #guess missed our target of y (aka, the l2_error).  
  
  #To ensure that tweaks in our network are efficient,
  # we compute l2_delta by multiplying each error by the slope of the sigmoid at that value.
  # The terms of l2_error that correspond to high-confidence predictions (close to 0 or 1)
  # are multiplied by a small number (which represents low slope and high confidence),
  # ensuring that the network focuses on improving
  # low-confidence predictions (close to 0.5, steep slope). In line 92 we then 
  # multiply the resulting l2_delta by l1 to update each weight in our syn1 
  #synapses so that our next prediction will be even better.
  l2_delta = l2_error*nonlin(l2,deriv=True)
  
  #11 BACK PROPAGATION: In Step 7, we "fed forward" our input.  Now we work backwards
  #to find the l1 error (back propagation). l1 error is the difference between the 
  #ideal l1 that would provide the ideal l2 we want, and the most recent computed l1.  
  #To find l1_error, we have to multiply l2_delta (i.e., what we want our l2 to be
  #in the next iteration) by our last iteration of the optimal weights (syn1). 
  
  # In other words, to update syn0, we need to account for the effects of 
  # syn1 (current values) on the network's prediction.  We do this by taking the 
  #product of the newly computed l2_delta and the current values of syn1 to 
  #give l1_error, which corresponds to the amount our update to syn0 should change l1.
  l1_error = l2_delta.dot(syn1.T)

  #12 In what DIRECTION is l1, the desired target value of our hard-working middle layer 1,
  #from l1's latest guess?  Pos. or neg.? Similar to #10 above, we want to tweak this 
  #middle layer so it sends a better prediction to l2, so l2 will better predict target y.
  #In other words, tweak the weights in order to produce large changes in low 
  #confidence values and small changes in high confidence values.
  
  #In other words, just like in #10, we multiply l1_error by the slope of the 
  #sigmoid at the value of l1 to ensure that the network applies larger changes 
  #to synapse weights that affect low-confidence (close to 0.5) predictions for l1.
  l1_delta = l1_error * nonlin(l1,deriv=True)
  
  #13 UPDATE SYNAPSES: aka Gradient Descent. This step is where the synapses, the true
  #"brain" of our network, learn from their mistakes, remember, and improve--learning!
  syn1 += l1.T.dot(l2_delta)
  syn0 += l0.T.dot(l1_delta)

#Print results!
print("Our l2 predictions after all 60,000 iterations of training: ")
print(l2)

Avg l2_error after 10,000 more iterations: 0.4964100319027255
Avg l2_error after 10,000 more iterations: 0.008584525653247153
Avg l2_error after 10,000 more iterations: 0.0057894598625078085
Avg l2_error after 10,000 more iterations: 0.004629176776769984
Avg l2_error after 10,000 more iterations: 0.003958765280273646
Avg l2_error after 10,000 more iterations: 0.0035101225678616753
Our l2 predictions after all 60,000 iterations of training: 
[[0.99701711]
 [0.99672209]
 [0.00260572]
 [0.00386759]]


#5) Explaining the Code in 13 Steps 
Let's go through each of the 13 steps of the code in detail:

##1) The Sigmoid Function, Briefly Mentioned: lines 6-11:
The sigmoid function plays a super-important role in making our network learn, but don't worry if you don't understand it all yet.  This is only our first pass over the material.  I'll explain it in detail below in Step 10.  For now, just do your best:

"nonlin()" is a type of sigmoid function called a logistic function.  Logistic functions are very commonly used in science, statistics, and probability.  This particular Sigmoid function is written in a more complicated way than necessary here because it serves two functions:

1) to take a matrix (represented here by a small x) within its parentheses and convert each value to a number between 0 and 1 (aka a statistical probability).  This is done by line 11: `return 1/(1+np.exp(-x))` 

Why do we need statistical probabilities?  Well, remember that our network doesn't predict in just 0's and 1's, right?  Our network's prediction doesn't shout, "YES!  Customer One WILL ABSOLUTELY buy Litter Rip! if she knows what's good for her!"  Rather, our network predicts the probability: "There's a 74% chance Customer One will buy Litter Rip!"

This is an important distinction because if you predict in 0's and 1's, there's no way to improve.  You're either right or wrong.  Period.  But with a probability, there's room for improvement.  You can tweak the system to increase or decrease that probability a few decimal points each time, so you can improve your accuracy.  It's a controlled, incremental process, rather than just blind guessing in the dark.

We will see below that this is very important, because this conversion to a number between zero and one gives us **FOUR** very **big advantages**.  I will discuss these four in detail below, but for now, just know that the sigmoid function converts every number in every matrix within its parentheses into a number between 0 and 1 that falls somewhere on the S-curve illustrated here:

![alt text](https://iamtrask.github.io/img/sigmoid.png)

(taken with gratitude from: 
[Andrew Trask](https://iamtrask.github.io/2015/07/12/basic-python-network/))

So, Part 1 of the Sigmoid function has converted each value in the matrix into a statistical probability.  

Now for Part 2.  The second part of this sigmoid function is in lines 8 and 9: `if(deriv==True):` followed by `return x*(1-x)`.

When called to do so by `deriv=True` in the code below, line 9 takes each value in a given matrix (a value that has already been squished between 0 and 1 by Part 1) and converts it into the slope at that particular point on the Sigmoid S curve.  This slope number is also known as a confidence measure.  In other words, the number answers the question, "How confident are we that this number correctly predicts an outcome?"  A slope near zero means high confidence; a steep slope means low confidence.

You may wonder, so what?  Well, our goal is a neural network that confidently makes accurate predictions.  The fastest way to achieve that goal is to fix the non-confident, wishy-washy, low-accuracy predictions, while leaving the accurate, confident predictions alone.  Remember this concept of wishy-washy, non-confident numbers.  It will be important below.
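
To make the two jobs of `nonlin()` concrete, here is a minimal sketch.  The function body mirrors the two return statements quoted above; the three sample inputs are made-up values I picked only for illustration.

```python
import numpy as np

def nonlin(x, deriv=False):
    # Part 2: x is assumed to ALREADY be a sigmoid output (between 0 and 1);
    # x*(1-x) is then the slope of the S curve at that point
    if deriv:
        return x * (1 - x)
    # Part 1: squish any number into a value between 0 and 1
    return 1 / (1 + np.exp(-x))

raw = np.array([-4.0, 0.0, 4.0])      # made-up raw values
probs = nonlin(raw)                   # every value now lies between 0 and 1
slopes = nonlin(probs, deriv=True)    # slope of the S curve at each of those points

print(probs)
print(slopes)
```

The middle value lands exactly on 0.5, where the slope is at its steepest (0.25).  The two outer values land near 0 and near 1, where the slope is close to zero.  Steep slope = wishy-washy; flat slope = confident.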

Let's move on to Step 2:


##2) Creating X input: Lines 12-17
Lines 12-17, step 2, create a 4x3 matrix of input values that we will use to train our network.  X will become layer 0, or l0 of our network, so this is the beginning of the "toy brain" we are creating!  

This is our feature set from our customer surveys, in language the computer understands:  
```
Line 14 creates the X input (which becomes l0, layer 0, in line 38)
X: 
[1,0,1],
[0,1,1],
[0,0,1],
[1,1,1]
```
We have four customers who have answered our three questions.  For example, Row 1 above, which is 101, is the first customer's set of Yes/No answers to our survey questions.  Think of each row of this matrix as a training example we'll feed into our network, and each column as one node of our input.  So our Matrix X can be visualized as the 4x3 matrix that is l0 in the diagram below:

![alt text](https://lh3.googleusercontent.com/CkHAz8xdeL8F5_ZqEysrOsFK-mMa2VLn_dqNcFX56tQOSR62eiNbw2OVGVPKehgkrYA8tAy9bXbfC09qglgVAVqZPghLRvJ_xaEzm1nRbltZGCvkCqZ7O59cihqL0723imx9LLdtz26LQNtKaJC13uPuWORKAx3rAfK3ZOz2MIlHLV_MO0b730fUqZLO9HwOtU8sIkFvwyLQp2tMy7G5JTGi-3gAqhspbKI684vLav73AsmHVNwcd4bnBB_0F844ehpNHjKp_ei1zl1g1WFHvM8yPbvU4WsNgBGpiVdrBJmhlFTuAcCjlMBqB8jOHP_9JMXZRaOcBNjo5QSfQ56DILw2C4h5W6g8VHpEv1k4HRWrQ4jqKPXWIeiTQB1ZBrSZwNBJfqK9yxFU1XDgP_N1s6uUGCF3v2Ae5qGiab15bRKUkORmXKN6IAS3kxfVgt3fXypSxr76Gv5Gdg27D3oJ4jmkCiiP8f7rsP6TvRH6PaTGqqcsgxzxx3KeQ9Wd3UdMsDhnt1g_2KMnPK0NvBmx3PFcrNc3dY6L_5fBzrc_-p-5R4Ch1anPROZUKX92Zrp1VFAE-kg2Mb9JdCzzOKopmpL6-SMPxGUQKAUJg2UQJDaI0GfGRM6OGpCZJ19Sl0gVY2unHZmpWroCm21TV_IvF6zzaJxc5vxJ=w905-h510-no)

You may wonder, "How does Matrix X become layer 0 in the diagram above?"  We'll get to that soon.  Next, let's create our list of the four correct answers we want our network to be able to predict.

##3) Create y output: Lines 18-24
This is our set of target values: the correct answers (labels) for our four training examples.  It is the Yes/No answers to the fourth question of our survey, namely, "Have you purchased Litter Rip!?"  So, look at the column of 4 numbers below, and you'll see that Customer One answered Yes, Customer Two answered Yes, and so on. 
```
Line 21 creates the y vector, a set of target values we strive to predict.
y: 
[1]
[1]
[0]
[0]
```
Into our network we will input matrix X, which is four customers' responses to the first three questions of our survey.  Our network will output a prediction, l2, and compare it to y (i.e., their answer to Question Four, The Actual Truth about whether they bought Litter Rip!), to see how she did.  She'll learn from her mistakes and do better next time.  60,000 times!  If the network is properly trained, the predicted l2 will come closer and closer to the truth, y, with each iteration.  

To use another metaphor, I also like to think of y as our "target" values, and I picture an archery target.  As our network improves, its arrows hit closer-and-closer to the bullseye.  Once our network can correctly predict these 4 target values from the inputs provided by matrix X above, it is then ready to use our new database of hot prospects to predict in real life.  
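
Here is a small sketch of how X and y might look as NumPy arrays, using the values shown above.  Treat it as an illustration of the shapes; the notebook's own lines 14 and 21 are the authoritative version.

```python
import numpy as np

# Each row is one customer; each column is one survey question (feature)
X = np.array([[1, 0, 1],
              [0, 1, 1],
              [0, 0, 1],
              [1, 1, 1]])

# One target value per customer: 1 = bought Litter Rip!, 0 = did not
y = np.array([[1],
              [1],
              [0],
              [0]])

print(X.shape)  # (4, 3)
print(y.shape)  # (4, 1)
```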


##4) Seed your random numbers: Lines 25-27
This step is housekeeping.  We seed the random number generator so that the "random" starting values for our synapses/weights in the next step are the same on every run, which makes debugging easier.  You don't have to understand how this code works; you just have to include it.

The reason we generate random numbers to populate the synapses is because one has to start somewhere.  So we begin with a set of made-up numbers, and then we tweak each of these numbers incrementally, over 60,000 iterations, until they produce predictions with the smallest possible error.
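
If you're curious what seeding buys you, here is a tiny sketch.  The seed value of 1 is my assumption for this example; any fixed number gives the same kind of repeatability.

```python
import numpy as np

np.random.seed(1)                          # assumed seed value, for illustration
first = 2 * np.random.random((3, 4)) - 1

np.random.seed(1)                          # same seed again...
second = 2 * np.random.random((3, 4)) - 1

# ...so the two sets of "random" starting weights are identical
print(np.array_equal(first, second))       # True
```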

##5) Create "Synapses" of your brain--Weights: Lines 29-31
When you first look at the diagram below, you might assume the brain of the network is the circles, the "neurons" of the neural network brain.  In fact, the real brain of a neural network, the part that actually learns and improves, is the synapses, those lines that connect the circles in this diagram.  These 2 matrices, syn0 and syn1, are the brain of our NN.  They are the part of our network that learns by trial-and-error, making predictions, comparing to their target values in y, then improving their next prediction--learning!

Notice how this code, `syn0 = 2*np.random.random((3,4)) - 1` creates a 3x4 matrix and seeds it with random values.  This will be the first layer of synapses, or weights, Synapse 0, that connects l0 to l1.  It looks like the matrix below (This is part of a simplified set of matrices I will consistently use throughout this blog.):
```
Line 30: syn0 = 2*np.random.random((3,4)) - 1: creates synapse 0
syn0: 
[ 3.66 -2.88   3.26 -1.53]
[-4.84  3.54   2.52 -2.55]      
[ 0.16 -0.66  -2.82  1.87]
```
Now, please learn from a mistake I made: I could not understand why syn0 should be a 3x4 matrix.  I thought it should be a 4x3 matrix, because you have to multiply the l0 4x3 matrix by syn0, so don't we want all the numbers to line up nicely in the rows and columns?  

But that was my mistake: to think that multiplying a 4x3 by a 4x3 lined up the numbers nicely.  Wrong.  In fact, if we want our numbers to line up nicely, we want to multiply a 4x3 by a 3x4.  This is a fundamental and very important rule of matrix multiplication.  Take a close look at the first neuron of the now-familiar diagram below, "Do you own a cat who poops?"  Now, consider:

Inside that neuron are the Yes/No responses from each of four customers.  This is the first column of our 4x3 layer0 matrix:
```
[1]
[0]
[0]
[1]
```
Got it?  Now, notice there are four lines (synapses) that connect the "Cat who poops?" neuron with the four neurons of l1.  That means that EACH of the 1,0,0,1 above has to be multiplied four times by the four different weights that connect "Cat who poops?" to l1.  So, four numbers inside "Cat who poops?" times four weights = 16 values, right?  Yes, l1 is a matrix that is 4x4, so that makes sense.  

And notice that we're going to do the exact same thing with the four numbers inside the second neuron, "Drink imported beer?"  So that's also four numbers times four weights = 16 values.  And we add each of the 16 values to its corresponding value in the 4x4 we already created above.

Rinse and repeat a final time with the four numbers inside the third neuron, "Visited Litter Rip!.com?"  So our final 4x4 l1 matrix has 16 values, and each of those values is the SUM of the three corresponding values from the three sets of multiplication we just completed.  

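Get it?  3 survey questions = 3 input neurons, and each neuron has 4 synapses leading to the 4 neurons of l1, so 3 features times 4 weights = a 3x4 matrix.  (The 4 here is the number of neurons in l1, not the number of customers.  It's only a coincidence that we also have 4 customers.)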
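
If you'd like to check the "three sets of 16 values, summed" idea for yourself, here is a minimal sketch.  It uses the simplified syn0 values from this post and builds the raw l1 values (before they go through the sigmoid) one input neuron at a time.  The variable names `total` and `neuron` are mine, just for illustration.

```python
import numpy as np

l0 = np.array([[1, 0, 1],
               [0, 1, 1],
               [0, 0, 1],
               [1, 1, 1]])

# The simplified syn0 values used throughout this post
syn0 = np.array([[ 3.66, -2.88,  3.26, -1.53],
                 [-4.84,  3.54,  2.52, -2.55],
                 [ 0.16, -0.66, -2.82,  1.87]])

# One 4x4 grid of products per input neuron:
# (the 4 customer answers inside that neuron) x (its 4 outgoing weights)
total = np.zeros((4, 4))
for neuron in range(3):
    total += np.outer(l0[:, neuron], syn0[neuron, :])

# The three grids summed together equal the ordinary matrix product
print(np.allclose(total, l0.dot(syn0)))   # True
```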

Seems complicated?  You'll get used to it.  Besides, the computer does the multiplication for you.  But I want you to understand what's going on beneath the hood.  When you look at a neural network diagram such as below, the lines don't lie (neither do Shakira's hips).  Consider:

If there are four synapses connecting the "Cat who poops?" neuron to all four neurons of the next layer, that means you MUST multiply whatever is inside "Cat who poops?" by four weights.  In this kitty litter example, we know there are four numbers inside "Cat who poops?"  Therefore,  you know you'll end up with a 4x4 matrix, and you know that to arrive there you have to multiply by a 3x4 matrix: i.e., the 3 nodes times the 4 synapses connecting each node to the next layer of 4 neurons.  Look this over for a while and study where each synapse begins and ends until you're clear on the pattern:

![alt text](https://lh3.googleusercontent.com/CkHAz8xdeL8F5_ZqEysrOsFK-mMa2VLn_dqNcFX56tQOSR62eiNbw2OVGVPKehgkrYA8tAy9bXbfC09qglgVAVqZPghLRvJ_xaEzm1nRbltZGCvkCqZ7O59cihqL0723imx9LLdtz26LQNtKaJC13uPuWORKAx3rAfK3ZOz2MIlHLV_MO0b730fUqZLO9HwOtU8sIkFvwyLQp2tMy7G5JTGi-3gAqhspbKI684vLav73AsmHVNwcd4bnBB_0F844ehpNHjKp_ei1zl1g1WFHvM8yPbvU4WsNgBGpiVdrBJmhlFTuAcCjlMBqB8jOHP_9JMXZRaOcBNjo5QSfQ56DILw2C4h5W6g8VHpEv1k4HRWrQ4jqKPXWIeiTQB1ZBrSZwNBJfqK9yxFU1XDgP_N1s6uUGCF3v2Ae5qGiab15bRKUkORmXKN6IAS3kxfVgt3fXypSxr76Gv5Gdg27D3oJ4jmkCiiP8f7rsP6TvRH6PaTGqqcsgxzxx3KeQ9Wd3UdMsDhnt1g_2KMnPK0NvBmx3PFcrNc3dY6L_5fBzrc_-p-5R4Ch1anPROZUKX92Zrp1VFAE-kg2Mb9JdCzzOKopmpL6-SMPxGUQKAUJg2UQJDaI0GfGRM6OGpCZJ19Sl0gVY2unHZmpWroCm21TV_IvF6zzaJxc5vxJ=w905-h510-no)

So, always remember that matrix multiplication requires the inner 2 "matrix size" numbers to match, e.g., a 4x3 matrix must be multiplied by a 3x_?_ matrix--in this case, a 3x4.  See how those inner two numbers (in this case, 3) must be the same?

And maybe you're wondering where the "`2*`" at the beginning of our equation, and the "-1" near the end come from?  I wondered.  Well, the function np.random.random produces random numbers uniformly distributed between 0 and 1 (with a corresponding mean of 0.5).  But we want this initialization to have a mean zero.  Why?  So that the initial weight numbers in this matrix do not have an a-priori bias towards values of 1 or 0, because this would imply a confidence that we do not yet have (i.e. in the beginning, the network has no idea what is going on so it should display no confidence until we update it after each iteration).  

So, how do we convert a set of numbers with an average of 0.5 to a set with a mean of 0?  We first double all the random numbers (resulting in a distribution between 0 and 2 with mean 1), and then we subtract one (resulting in a distribution between -1 and 1 with mean 0).  That's why you see `2*` at the beginning of our equation, and - 1 at the end.  This changes the mean from 0.5 to 0.  Nice: `2*np.random.random((3,4)) - 1`
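
Don't take my word for it.  Here is a quick sketch that draws a large pile of random numbers and checks the mean before and after the "double it, subtract one" trick.  The sample size and the seed are arbitrary choices of mine.

```python
import numpy as np

np.random.seed(1)                    # seed chosen only so the sketch is repeatable
raw = np.random.random(100000)       # uniform between 0 and 1, mean about 0.5
centered = 2 * raw - 1               # uniform between -1 and 1, mean about 0

print(raw.mean(), centered.mean())
print(centered.min(), centered.max())
```

The first mean should come out very close to 0.5 and the second very close to 0, with every centered value sitting between -1 and 1.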

Moving on: Next, this line of code, `syn1 = 2*np.random.random((4,1)) - 1` creates a 4x1 vector and seeds it with random values.  This will be our network's second layer of weights, Synapse 1, connecting l1 to l2.  Meet syn1:

```
Line 31: syn1 = 2*np.random.random((4,1)) - 1: creating synapse 1
syn1:  
[ 12.21]
[ 10.24]
[ -6.31]
[-14.52]

```
It would be a good exercise for you to figure out what size the matrices have to be for this multiplication.  We know l1 is 4x4.  Why is syn1 a 4x1?  Look at the diagram: whatever numbers are in the top neuron of l1 have to be multiplied only once, right?  Because there's only one line (weight) connecting the top neuron of l1 to the single neuron of l2.  And we know that the top neuron of l1 has four values in it (one per customer), right?  Therefore, 4 values x 1 weight = four products, one for each customer.  

Rinse and repeat that process three more times for the other 3 neurons in l1, then add up each customer's four products, and presto: l2 is a 4x1 matrix (known as a vector when there's only one column), with one value per customer.  

Again: always remember that those two inner "size numbers" have to match.  A 4x3 matrix MUST be multiplied by a 3x_?_ matrix.  A 4x4 matrix MUST be multiplied by a 4x_?_ matrix, and so on.
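
Here is a short sketch of that rule in action.  The matrices are filled with placeholder ones, because only the shapes matter here.

```python
import numpy as np

l0 = np.ones((4, 3))       # 4 customers x 3 survey questions (placeholder values)
syn0 = np.ones((3, 4))     # 3 input neurons x 4 hidden neurons
syn1 = np.ones((4, 1))     # 4 hidden neurons x 1 output neuron

l1 = l0.dot(syn0)          # (4x3)(3x4): inner numbers match, result is 4x4
l2 = l1.dot(syn1)          # (4x4)(4x1): inner numbers match, result is 4x1
print(l1.shape, l2.shape)  # (4, 4) (4, 1)

try:
    l0.dot(l0)             # (4x3)(4x3): inner numbers 3 and 4 do not match
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
print(mismatch_raised)     # True
```

That `ValueError` is NumPy's way of telling you the same thing I had to learn the hard way: a 4x3 cannot be multiplied by a 4x3.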

You're doin' great.  I gotta remind you again that you're a rock star.


##6) For Loop: Lines 33-34
This is a for loop that will take our network through 60,000 iterations.  For each iteration, our network will take X, our input data of the customer survey responses, and based on that data, give its best prediction of the probability that that customer purchased Litter Rip!  It will then compare its prediction to The Actual Truth, found in y, learn from its mistakes, and give a slightly better prediction on the next iteration.  60,000 times, until it has learned by trial-and-error how to take the X input and predict accurately what the y target value is.  Then our network will be ready to take *any* input data you give it (such as the surveys from our loving veterinarian) and correctly predict which hot prospects should get a targeted ad!
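
Here is a condensed sketch of what the loop does on each pass, so you can watch the error shrink.  It follows the same steps as the notebook code, but the seed value of 1 and the `errors` list I use to record progress are my own assumptions for this example.

```python
import numpy as np

def nonlin(x, deriv=False):
    if deriv:
        return x * (1 - x)          # slope, given a value already between 0 and 1
    return 1 / (1 + np.exp(-x))     # squish to a value between 0 and 1

X = np.array([[1, 0, 1], [0, 1, 1], [0, 0, 1], [1, 1, 1]])
y = np.array([[1], [1], [0], [0]])

np.random.seed(1)                           # assumed seed, for repeatable results
syn0 = 2 * np.random.random((3, 4)) - 1
syn1 = 2 * np.random.random((4, 1)) - 1

errors = []                                 # hypothetical helper list for tracking progress
for j in range(60000):
    l0 = X                                  # feed forward
    l1 = nonlin(l0.dot(syn0))
    l2 = nonlin(l1.dot(syn1))
    l2_error = y - l2                       # how far off was the prediction?
    if j % 10000 == 0:
        errors.append(np.mean(np.abs(l2_error)))
    l2_delta = l2_error * nonlin(l2, deriv=True)
    l1_error = l2_delta.dot(syn1.T)         # back propagation
    l1_delta = l1_error * nonlin(l1, deriv=True)
    syn1 += l1.T.dot(l2_delta)              # update the synapses
    syn0 += l0.T.dot(l1_delta)

print(errors)
```

The first recorded error should be close to 0.5 (pure guessing), and the last should be a small fraction of that.  Trial, error, improve, repeat.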


#Now Steps 7-13 Will Cover The 5 Major Concepts of Deep Learning:
These 5 major concepts are interrelated and will appear multiple times below.  Here is a rough order of appearance:
1.   Forward Feed computes error as the distance (the height) of the white ball from the white grid on which the red bowl's bottom (the global minimum) lies;
2.   A network learns by trial-and-error;
3.   The Sigmoid, a simple activation function, calculates probability and confidence levels;
4.   Back Propagation tells you which parts of your network to tweak to reduce your error on the next iteration; and
5.   Gradient Descent shows how the synapses, rather than the neurons, are the core of your network's "brain."  

##7) Forward Feed computes error as the distance (the height) of the white ball from the white grid on which the red bowl's bottom (the global minimum) lies
This is where our network makes its prediction.  This is an exciting part of our deep learning process, so I'm going to teach this same concept from three different perspectives: 

1.   First, I will tell you a spellbinding fairytale of feed forward;
2.   Second, I will draw stunningly beautiful pictures of feed forward; and
3.   Third, I will open up the hood and show you the matrix multiplication that is the engine of feed forward.

I'm Irish.  Who doesn't love a good story?  My mentor Adam Koenig suggested the following analogy, which I have ridiculously exaggerated into a fairytale, because **I am an *artiste*:**

##1) *The Princess and The Castle*, Chapter 1 of 4: The Feed Forward Network

Imagine yourself as a neural network.  You happen to be a neural network with a valid driver's license, and you're the type of neural network that enjoys fast cars and hot romance.  You eagerly wish to meet The Love Of Your Life.  Well, Miracle of Miracles, you have just found out that if you drive to a certain castle, your Prince/Princess Charming is waiting to meet you for the first time, sweep you off your feet, and live happily ever after.  Joy!

(Hint: The princess' castle is the utopia where your l2 prediction matches y, the answer to the survey question, "Did you buy Litter Rip!?" In other words, an l2 error of zero, or the yellow arrow with a height of zero, or the white ball arriving at the bottom of the red bowl.  They are all the same thing.)

Needless to say, you're fairly motivated to find the princess's castle.  After all, the princess represents The Actual Truth, the answer to the mystical question, "Did you buy Litter Rip!?"  How romantic!  Unfortunately, finding her castle is going to require some patience and persistence, because you have already attempted to drive to the castle thousands of times and you keep getting lost.  

(Hint: thousands of trips = iterations.  "keep getting lost" means you have some error in your l2 prediction, which leaves you a distance from the Castle of Truth, y.)

But there is fabulous news:  you know that every day, with every driving trip, you're getting closer-and-closer to the Princess (Ahh, the y vector in all her loveliness--stunning...).  The bad news is, alas, each time that you don't arrive at her castle, POOF! you wake up the next morning back at your house (aka Layer 0, our input features, the 3 survey questions) and have to start again from there (a new iteration).  It looks a bit like this:


![alt text](https://lh3.googleusercontent.com/nRe205F0sgwTUasEbQDKFAeURmbTEhJ6HmQXd_bp7SJ9HqgAhZsbU7zjJmzuCOtQXKH4IbtQhDFW9sM6TxRVUrQZckBj32HLRT7GHllUUd-tWux553T-iHx2C7CE1ecTFazwjytbdgODR9KPAi381H4ow0cUxdoVbFgfGy97nAsbbeEBoicTP0CxVHq4tRbBaNPQ3OFNIEqoTKulvce94DRqJH9D1RUnhMSvBsjY6Fc7W34rMHU9tifMNeNetP4sjkWGmIy2A1v5FfT6F4ax-rg1iuqpOQmXjyrVdccmofIAzfw3W9hqg99U9Dae7MI7_1JLsNT31JU28Pzs7Ji1cVDi2Nx9dI77KisaC_1Y1heI6zxQYVDnBa1NXe1H6v6ku1Ode1eI07KsfFTMwHEkClqU5gyEAMdhsSX3cBKrSsIoIqlUdXNnS57A3spD0vIE3sRhrhtp9Zfg1iQhPIg39o7bHD-sPO4iP_rlJvL2Y109xoITCnPmEhe_spZrtCVnel0Yqo3Yy2GOTrEQgl9UPUxPYvyyrzGL1pKvRnMWfw7QQFYwJ0-AsC6-UsSuxt9XIfRlt6pZNx4OGkCYuLBPZ9yBM-41uC_lLuxPY0bT7CffxMVXhmMJh2U5V8CDyXI34Vt8TBiMGEwbZGRMsu05AOrjrz3vFjHl=w905-h510-no)

Fortunately, this story has a happy ending, but you will have to keep attempting to arrive at her house and correcting your route for, say, another 58,000 trials (and errors!) before you fall into her arms.  Don't worry, it'll be worth it.

Let's take a look at one of those drives (an iteration) from your house, X to her castle, y:  each trip you make is the feed forward pass of lines 36-40, and each day you arrive at a new place, only to discover it is NOT the castle.  Drat!  Of course, you want to figure out how to arrive a little closer to your beloved on the next drive.  I'll explain that in Steps 8 and 10 below.  Stay tuned for *The Princess and the Castle, Chapter 2: Learning from Your Errors.* 

Above is an analogy for the feed forward process.  Below is a view under-the-hood of the math that makes it happen.  

##2) Stunningly Beautiful Pictures of Feed Forward
We're going to walk through one example of one weight only, out of the 16.  The weight we will study is the tippy-top line of the 12 lines of syn0, between the top neurons of l0 and l1, and we will call it syn0,1 (technically it is syn0(1,1), for "row 1, column 1 of the matrix syn0").  Here's what it looks like:


![alt text](https://lh3.googleusercontent.com/iaWNJpUAqo9RShRnLkRs8g3xsSEsTInXdAagnJvj9OPshuQfJGfJuO6uCXDOEq6953XOuy7h88ktMGTisPQ28kGiQ9oqGgrVH2Jo-TsMAuhrXmScOF9cDh-SLRw1OiEzIymUECoGE3HPd7q3MWDaQEjhdHnJIenyhFubPnvhr7J5ggv9mw2poRyXvDPBiv1r5vcL2v0cDmH5r7h2bK8EVphbjp49RBkloZcjfsqlbwmnMdP_XGKi14ROp13Dt0ZQ5ewfKeR0Y8Oy5loOJvPxKu4on3Y7ib5zRn-rP0mBYP_gwbAJYs5rsNZpx_6sokql3M4HBKNW-f4Yog7QGfilz_2SdcMZfZboEfsgHy5TyJEh5Q0lmSU0BKAMP6AwWMtvFGdmd5JGV3JpYDDfR5NXW9wIxzkAy0_y30RnAD7bzGqy6fn8cCs0-0f_6tozBSTSg4IZCCv8FrtCDvL0J8flcyn0-xWYYA9oQbIQOwBYKjwlHHbPSJvbHc3YiNb9jmoclUmJhFspnYBi4Eh1K8A9dfKi239DcnEjUtDph26Srzw-3eVx13lwAeNRx_KgOS0zaCD2NkK52UOzfyLX8qSZywysxxqSZPk9xx1X7G6mx5_2uSpw8LytaoKpD4t183d54x8Q0XmtkUU7GjUT3PkN4aE6BiuFI5Cn=w905-h510-no)

Why are the circles representing the neurons of l1 and l2 divided down the middle?  The left-hand side (marked "LH" in the variable names I use) is the value used as input for the sigmoid function, and the right-hand side is the output of the sigmoid function: l1 or l2.  In this context, recall that the sigmoid is simply taking the product of the previous layer times the previous synapse and "squishing" it down to a value between 0 and 1.  It's this part of the code: 
```
return 1/(1+np.exp(-x))
```
OK, here is Feed Forward using one of our training examples, row 1 of l0, aka "Customer One's responses to the 3 survey questions": [1,0,1].  Here's what it looks like:



![alt text](https://lh3.googleusercontent.com/_oeTrJZYC4G7qDQL6c-ZTWHHhm1zbW3Q562EUN0LrKkQ-08_EejUpe8-UG0pUYSKnxorxIZRCPyC3eWCklg1wnG39UgdhsD7J5mDHYNTNzMbS66LHP4x4NhTbbecHu2Q8EErPV19IPkwMxfKUPo5TuR0Q27TiUyFWc8RYnxsj8fV22yXrbfvhwXsCJKWKhwxz4OHXnOTnJj4sO-rNyba0HNvaepk_1PuYeUkcRJ3pV7WJ_C_WdJuwn9sjYAV1Z0M1CkkifIzUXrSDAMZGfbyzX9tN6-RXolTRtBWfw-BmwLwPmBkFI7hegGvlVesFk6XXyKN5ki8PX8VAHgEBFEVwjgm_R0pfuE-_gG7raC3_5UT9Irk8NLXLvwyFdKSqu1N4FBGT1HtH8oGdjNZZYaiU9fVx0RCuJjHF9sIp-3nroUtEy7xSeoUpkoiP1VSkmhk5d004jzMUn3K2fC5gasybHdI8aYJ44gAHO_AKOYKwfnrNTjaDdp7UIB9tKXTMmpn_JYh6W66vTGD23-UaHt5YuOUANKDyvpgPRa1ByY8YnNZoa46RcU1wKQDNFecvROE6EKPkQPz2E-8PjQfgrm6NM9OLaFt0R0R1uZuzuQLse7hVh4ZgI435eb9pZofnfU93QK6-Bemav5l-fTDVGS0Xi0na0Y8Sk09=w905-h510-no)

Here's what Feed Forward looks like in pseudo-code, and you can follow the forward, left-to-right process in the diagram above. (Note that I add "LH," meaning left-hand, to specify the left half of the circle we are using in our diagrams to depict a neuron in a given layer; so, "l1LH" means, "the left half of the circle representing l1," which means, "before the product has been passed through the nonlin() function.")

```
l1_LH = l0 x syn0, so l1_LH = 1 x 3.66 -> (Next, don't forget to add the products of the other l0 values x the other syn0 values. For simplicity right now, just trust me that they add up to be 0.16 total) -> l1_LH = 1 x 3.66 + 0.16 = 3.82 

l1 = nonlin(l1_LH) = nonlin(3.82) -> nonlin() = 1/(1+np.exp(-x)) = 1/(1+2.718^(-3.82)) = 0.98

l2_LH = l1 x syn1 = 0.98 x 12.21 = 11.97 (again, add the products of the other syn1 multiplications--trust me that they total -11.97) -> 11.97 + -11.97 = 0.00

l2 = nonlin(l2_LH) = nonlin(0.00) -> nonlin() = 1/(1+np.exp(-x)) = 1/(1+2.718^(-0.00)) = 0.5

l2_error = y - l2 -> 1 - 0.5 = 0.5 -> l2_error = 0.5
```
So, what just happened above?  I artificially assigned Syn0,1 with a beginning value of 3.66.  3.66 is just a random value we assigned, it could be any number, but hey--ya gotta start somewhere, right?  

#3) Let's walk through the math of Feed Forward slowly: 
l0 x syn0 = l1LH, so in our example 1 x 3.66 = 3.66, but don't forget we have to add the other two products of l0 x the corresponding weights of syn0.  In our example, l0,2 x syn0,2= 0 x something = 0, so it doesn't matter.  But l0,3 x syn0,3 *does* matter because l0,3=1, so let's just make up a simple, convenient value for syn0,3 of 0.16.  Therefore, l0,3 x syn0,3 = 1 x 0.16 = 0.16.  Our product of l0,1 x syn0,1 + our product of l0,3 x syn0,3 = 3.66 + 0.16 = 3.82, and 3.82 is l1_LH.  Next, we have to run l1_LH through our nonlin() function to create a probability between 0 and 1.  Nonlin(l1_LH) uses the code, `return 1/(1+np.exp(-x))`, so in our example that would be: 1/(1+(2.718^-3.82))=0.98, so l1 (the RH side of the l1 node) is 0.98.

So, what just happened with the equation `1/(1+np.exp(-x)) = 1/(1+2.718^(-3.82)) = 0.98` above?  The computer used some fancy code, `return 1/(1+np.exp(-x))`, to do what we could do manually with our eyeballs--it told us the corresponding y value of x = 3.82 on the sigmoid curve as pictured in this diagram:

![alt text](https://iamtrask.github.io/img/sigmoid.png)

(taken with gratitude from: 
[Andrew Trask](https://iamtrask.github.io/2015/07/12/basic-python-network/))


Notice that, at 3.82 on the X axis, the corresponding point on the blue, curved line is about 0.98 on the y axis.  Our code converted 3.82 into a statistical probability between 0 and 1.  It's helpful to visualize this on the diagram so you know there's no abstract hocus-pocus going on here.  The computer did what we did: it used math to "eyeball what 3.82 on the X axis would be on the Y axis of our diagram."  Nothing more.

Again: the feed forward pass uses the part of nonlin() that renders any number as a value between 0 and 1.  It is the code, `return 1/(1+np.exp(-x))`.  It does not compute a slope.  But in back prop, we're going to use the *other* part of the Sigmoid function, the part that does compute a slope, i.e., `return x*(1-x)`, because you will notice that lines 57 and 71 specifically ask the Sigmoid for the slope by passing the argument `deriv=True`.

Now, rinse-and-repeat: we'll multiply our l1 value by our syn1,1 value.  l1 x syn1 = l2LH which in our example would be 0.98 x 12.21 (12.21 is a random number we just assigned because hey--ya gotta start somewhere) = 11.97.  But again, don't forget that to 11.97 we have to add all the products of all the other l1 neurons times their corresponding syn1 weights, so for simplicity's sake trust me that they all added up to -11.97.  So you end up with 11.97 + -11.97 = 0.00, which is l2_LH.  Next we run l2_LH through our fabulous nonlin() function, which would be: 1/(1+2.718^-(0)) = 0.5, which is l2, which is our very first prediction of what the truth, y, might be!  Congratulations!  You just completed your first forward feed!

Now, let's assemble all our variables in one place, for clarity:
```
l0=1
syn0,1=3.66
l1_LH=3.82
l1=0.98
syn1,1=12.21
l2_LH=0
l2=~0.5
y=1 (this is a "Yes" answer to survey Question 4, "Actually bought Litter Rip?")
l2_error = y-l2 = 1-0.5 = 0.5
```
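
If you want to check this arithmetic yourself, here is a minimal sketch of the same single-customer walk-through.  The -11.97 is the "trust me" total from the text, which I plug in as a placeholder rather than compute, and the helper name `sigmoid` is mine.

```python
import numpy as np

def sigmoid(x):
    # same math as the first return statement of nonlin()
    return 1 / (1 + np.exp(-x))

l0_row = np.array([1, 0, 1])                # Customer One's answers
syn0_col = np.array([3.66, -4.84, 0.16])    # the three weights feeding the top l1 neuron

l1_LH = l0_row.dot(syn0_col)                # 1*3.66 + 0*(-4.84) + 1*0.16 = 3.82
l1 = sigmoid(l1_LH)                         # about 0.98

other_products = -11.97                     # placeholder "trust me" total from the text
l2_LH = round(l1, 2) * 12.21 + other_products   # about 0.00
l2 = sigmoid(l2_LH)                         # about 0.5

y = 1
l2_error = y - l2                           # about 0.5
print(l1_LH, l1, l2, l2_error)
```

The printed values will carry more decimal places than my rounded numbers above, but they tell the same story: a first guess of roughly 0.5, and an error of roughly 0.5.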

OK, let's now take a look at the matrix multiplication that makes this all happen (for those of you who are rookies to matrix multiplication and linear algebra, Grant Sanderson teaches it brilliantly, with lovely graphics, in [14 YouTube videos](https://www.youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab).  Watch those first, then return here).

First, on line 39 we multiply the 4x3 l0 and the 3x4 Syn0 to create (hidden layer) l1, a 4x4 matrix.  

![alt text](https://lh3.googleusercontent.com/R5dFziBwELSqSS0llQWua63piEnO-LOkK42Wbev4muMpVtE-Qr_GlwXIcjXX76DHGFEvQf8j1dHgrGBdRYfRJ6sliOpPFJz2-2OoW1ibOM11m0R87qjfZIK6GyhV6jNvw5B-6u8gHUQJ8bFwH9HHhchguvU6rcUCjPl8epbUrwGOFhtVyCqxKy7Vy0ZnMdFSvCbf2Src0edJ154lAtEaxOBffTSQgiDi1AGEcavLnn4ne_nY-ZO2OJpmyTuyP2QNq8K3pP4MCluHAHzusFMvjF2JBH5cVY1mW5QHCq5AcHsvV5KcKax3nz01nBmFdv1VcIFKiZ_yRAWn-6PEFAajTAtdsS7ViddDswA1R16iE94dTH5DI7ZFwBeFHfgR0Ws2HcVuMXYpST1u_MC9vP-YnqMxk8cUzWuJ8iWTCBOMynjLsoe1n896zk4nXA00hrUA8Gtknn82FhM-7Fvj23lzD-Bp3Sizt6x7VyceIUOkYxNDungJA5xltQ6X2s9u48Bv-WOFIQqo8XM8T6xfz2VAW80sduSgzI5IJlli2eE1eSe1fLsdScxgnoFFLiyI064yfkvToed80nu2u5y-oY18kcDgv5c9WKS-5i5egYi9dYtFcJhj841meinaI6QGI1NKKkQixgAb7VMOxUdtCMCE6fs1GU3qOJdD=w905-h293-no)

```
Now we pass it through the "nonlin()" function in line 39, which is a fancy math expression I explained above that "squishes" all values down to values between 0 and 1: "1/(1 + 2.718281^-x)" 

This is layer 1, the hidden layer of our neural network:
l1:  
[0.98 0.03 0.61 0.58]
[0.01 0.95 0.43 0.34]
[0.54 0.34 0.06 0.87]
[0.27 0.50 0.95 0.10]
```
If you find yourself feeling faint at the mere sight of matrix multiplication, fear not.  We're going to start simple, and break down our multiplication into tiny pieces, so you can get a feel for how this works.  Let's take one, single training example from our input.  Row 1 (customer one's survey answers):  `[1,0,1]`  a 1x3 matrix.  We're going to multiply that by syn0, which would still be a 3x4 matrix, and our new l1 would be a 1x4 matrix.  Here's how that simplified process can be visualized:
```
(multiply row 1 of l0 by column 1 of syn0, then row 1 by column 2, etc.)

row 1 of l0:     col 1 of syn0:
                 [  3.66]
[1 0 1]    X     [ -4.84]   =   (1 x 3.66) + (0 x -4.84) + (1 x 0.16)  =  3.82
                 [  0.16]

Repeat with columns 2, 3, and 4 of syn0 to fill in the rest of the row:
row 1 of l0 x syn0  =  [ 3.82 -3.54  0.44  0.34]
Rows 2, 3, and 4 of l0 x syn0 fill in the other three rows the same way.

Then pass the resulting 4x4 product through "nonlin()" and you get the l1 values
l1:  
[0.98 0.03 0.61 0.58]
[0.01 0.95 0.43 0.34]
[0.54 0.34 0.06 0.87]
[0.27 0.50 0.95 0.10]
```
Note that, on line 39, we take the Sigmoid function of the product of l0 and syn0 to create l1, because we need l1 to have values between 0 and 1, hence: `l1=nonlin(np.dot(l0,syn0))`
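
Here is a small sketch that reproduces this step with the simplified syn0 values from this post, so you can see the computer do the whole 4x4 at once.  The name `l1_LH` follows my "left-hand side" naming from the diagrams; it is not a variable in the notebook's code.

```python
import numpy as np

def nonlin(x):
    # only the "squish" half of the notebook's nonlin() is needed for feed forward
    return 1 / (1 + np.exp(-x))

l0 = np.array([[1, 0, 1],
               [0, 1, 1],
               [0, 0, 1],
               [1, 1, 1]])

syn0 = np.array([[ 3.66, -2.88,  3.26, -1.53],
                 [-4.84,  3.54,  2.52, -2.55],
                 [ 0.16, -0.66, -2.82,  1.87]])

l1_LH = np.dot(l0, syn0)      # 4x3 times 3x4 gives a 4x4 matrix of raw sums
l1 = nonlin(l1_LH)            # squish every value to between 0 and 1

print(np.round(l1_LH[0], 2))  # Customer One's row of raw sums
print(np.round(l1, 2))        # compare with the l1 matrix shown above
```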

It is on line 39 that we see ***Big Advantage #1*** of the ***Four Big Advantages of the Sigmoid Function.***  When we pass the dot product matrix of l0 and syn0 through the `nonlin()` function, the sigmoid converts each value in the matrix into a statistical probability between 0 and 1.  In the case of your superior feline hygiene product, this means, "the closer the value is to 1, the more certain it is that the customer will buy Litter Rip!, whereas the closer the value is to 0, the more certain it is that Litter Rip! will remain untouched by the (ungrateful! unfeeling!) customer."  As I mentioned above, 0.2 means "Probably won't buy," 0.8 is "Probably will buy," and 0.999 is, "You're damn right they'll buy!"

"OK," you might be thinking, "So I can eyeball the values in l1 above and deduce whether they are predicting a probable 1 or a probable 0.  So, what?"  Well, it doesn't matter in lines 39 and 40, but it matters a *ton* when we hit line 61 and beyond.  Stay tuned.

#Visualizing Matrix Multiplication in Terms of the Neurons and Synapses of our Network
Here's where these values appear in our picture of neurons and synapses:




![alt text](https://lh3.googleusercontent.com/RDcMpsyX10uMk4_1J_2aRxQWrdb8E-VOS5KivFAaiPSdW2inl4jmF5bE5Dvc5ZdzAgpQF7hSDXlxOvHkXeO4RkdUBdkDFXgA3Fne9ToPW3m5JhozUQBNEJq6DJCHKFP6kT0klhSMUdUR9TA4ibgaEbwWRwQfLFj85irhTE-CGaoPi5xLGhgtrT2lI00tyfFEOkLpiCvstVTkQ50sWPmV0rYRe6G0LCowxD8oJmoviaR8PAjO0oXZ2Cj5_u-iOSvfzBtymu33AxoyL4t0xPQ-HnrZTpl9CmzWOM4wBpnLI6iFVOqJYsFhSxM8Xmfr1h5eynNhtcUpTk52vUMJ4LWZ0xniZefQUJPN2uGcOCLrPc36_TqlwwgSAdrgCQ-74y5HGrJISeZ4FVkC36TcqAK0bam7j2-I9TPC8MeparFuuRJtx_ijRnv4g1glClA6VXBzxpJPWvkdEG0OeVbq1O84wsipap0rpg9xQnpj6RMGdSeyHoljPmm-Sc0Bm-ezSiUR8dEXZkDINBwxDf2K3-6U-joxemv9H0S4XhraFd9Y1gA_J8UGQr84f3s5NeuXjN0tj2Jo7IpiHaip7BB9VT4w7vvZvLU4kWK9Dxr9Ti_YS-sZewuAwPJJAc8vN7pPPWkiy64JDPYvmNmNEF9u9o6LXtTtSpm7cMPT=w905-h510-no)









So, the above diagram gives you a picture of how row one of our input l0 feeds through its first step in a network.  And you remember that row one = Customer One and her three answers to the three survey questions.  But when it comes time to multiply those three numbers by the entire 12 values of syn0, and then do the same for the other three customers and their three values, how do you juggle all those numbers and keep them straight? 

The key is to think of the four customers as stacked on top of each other in a "batch."  So, our top stack is our top row is Customer One, right?  As you saw above, you multiply the three numbers of row one through the entire 12 numbers of syn0, sum them, and you end up with four values on the top stack of l1.  Perfect.  

Now, what's your second stack in the batch?  Well, 0,1,1 is the second row of answers is the Second Customer is the second stack.  Multiply those three numbers through all 12 of the syn0 values, sum them, and you end up with four values that become the second stack (or layer) in the batch of l1 values.  

And so on.  Two more times.  The key is to just take things one stack at a time.  Then you'll be fine, whether it's four stacks in your batch, or four million.  You might say that each feature has its own batch of values to it; in our kitty litter case, each survey question (or feature) has a batch of four answers to it from four customers.  But it could be four million.  This concept of "full batch configuration" is a **very** common model, so that's why I took the time to explain this.  I find it easiest to think of a given feature having its own batch of values.  When you see a feature, you know there's a stack (batch) of values under it.  
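The "one stack at a time" idea can be checked with a small sketch.  Only row two (0, 1, 1) comes from the text above; the other rows of `l0` and the random weights in `syn0` are placeholders I am assuming for illustration:

```python
import numpy as np

# Assumed example input: 4 customers (rows) x 3 survey answers (columns).
# Row two (0, 1, 1) is from the text; the other rows are illustrative.
l0 = np.array([[1, 0, 1],
               [0, 1, 1],
               [0, 0, 1],
               [1, 1, 1]])

# Placeholder weights: 3 inputs feeding 4 hidden neurons (12 values)
rng = np.random.default_rng(0)
syn0 = 2 * rng.random((3, 4)) - 1

# One stack at a time: each customer's 3 answers times all 12 weights
one_at_a_time = np.array([row.dot(syn0) for row in l0])

# The whole batch in a single call
full_batch = np.dot(l0, syn0)

print(full_batch.shape)                        # (4, 4)
print(np.allclose(one_at_a_time, full_batch))  # True
```

Doing the customers one by one and doing the whole batch at once give the same 4x4 result, which is why you can safely think about a single stack and trust the matrix math to handle the rest.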


![alt text](https://lh3.googleusercontent.com/kULygi1IE0ca_2sSJXrJnHp4WAGf7xMRqA8J0QDv4QJ1KaPVWMqTYSqxYA-X429ljcGS-P2H9_55GkvoqYb9N5XhEEYepcCTjvr33Gtrha6R9x6YuVX02FZCCQToMOGa1k7Q-IE5oVf_eZaHLgJ3GKNuQoSLd1JeQA9fN3Nk1W77xq7U37odR8idxpHe-ZcFTfTkK6RBe5nDfut53dVJIPN24hIjnNAezrn-vud9SNKeOSB5U8WSfiGhlfA3Cl4R5zhpYOt-V4r0RXp1BjHtAvhYMXSRvkS36wk0zudPEJpflpPde8E0VoNhOzHAYqbTcL443__LaOV3yOxKzVEArp82Es-9sw1MQP9F30cEEyS6SoqpFLiJUXtO0-UjOosPfvqVBLobBu6QSOaGdpOlB-uWwlstOutts_VPK4HIvwupkNPjbejlCeg6eJtcsekE3xBJfUeiYuY5L00l2XkMfVvGLxTDV7Ncd8-66slWMRMNnS9U-HeVoXH7r-98fLtl66HKUfv6rhoIdEnyyt_4D5oI0Tr3iBIT4ml31DIL7usuuhhGabmo1jKmTJNEAHcseRo-moXIvvJA40wcI8c9D7x9SIX1ivQkkLi0LUZNm1_My_mjMMDBn4dn_b2BwjPaRIAVgURx2DKdDo34JweLjkXeRxB9aiJV=w905-h510-no)










































Exactly the same thing happens on line 40, as we take the dot product of 4x4 l1 and 4x1 syn1 and then run that product through the Sigmoid function to produce a 4x1 l2 with each value becoming a statistical probability from 0-1.


```
l1 (4x4):                          syn1 (4x1):
[0.98 0.03 0.61 0.58]               [ 12.21]
[0.01 0.95 0.43 0.34]       X       [ 10.24]     =   4x1 product
[0.54 0.34 0.06 0.87]               [ -6.31]
[0.27 0.50 0.95 0.10]               [-14.52]

Then pass the above 4x1 product through "nonlin()" and you get l2, our prediction:
l2: 
 [ 0.50]
 [ 0.90]
 [ 0.05]
 [ 0.70]
   ```
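Here is a rough sketch of that same multiplication in NumPy, using the rounded numbers from the example above.  The `sigmoid()` helper is my assumed stand-in for `nonlin()`.  Because the inputs are rounded to two decimals, the printed values land close to, but not exactly on, the l2 column shown:

```python
import numpy as np

l1 = np.array([[0.98, 0.03, 0.61, 0.58],
               [0.01, 0.95, 0.43, 0.34],
               [0.54, 0.34, 0.06, 0.87],
               [0.27, 0.50, 0.95, 0.10]])
syn1 = np.array([[12.21], [10.24], [-6.31], [-14.52]])

def sigmoid(x):
    # Assumed stand-in for nonlin() without the deriv option
    return 1 / (1 + np.exp(-x))

product = np.dot(l1, syn1)   # 4x4 times 4x1 gives 4x1
l2 = sigmoid(product)        # squash each value into the 0-1 range

print(product.shape)
print(np.round(l2, 2))
```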
We have now completed the Feed Forward portion of our network.  I hope you can visualize what we have done so far, namely:
1.   the matrices involved; 
2.   rows as customers;
3.   columns as features; and 
4.   each feature containing a batch of values.

If you can visualize these four items, then you have done outstanding work.  Bravo to you.

Our next goal is to find Step 8: by how much did we miss our target value of 1 or 0 in The Actual Truth y, the princess's castle?  Well, with Customer One/Row One, it turns out we missed by 0.5.  But any distance between us and our beloved princess is too much, so how can we reduce that l2_error of 0.5 to put us finally in her arms?  The back propagation step below will soon teach us the exact amount we want to increase/decrease syn0 and syn1 in order to decrease l2_error and firmly embrace our beloved.










##8) By How Much Did We Miss the Target? Lines 42-45
```
l2_error = y - l2
```
You may recall that the 4x1 y vector contains the answers of four customers to question four, "Have you purchased Litter Rip!?"  It contains four "0" or "1" values to which we want to compare the predictions our network came up with, based on the answers to the first three questions/features.  Each one of our 60,000 iterations should bring us, by trial-and-error and learning from our mistakes, closer to the 4 target values of y.  

In the Step 7 feed forward, you might say that we made our first try, or trial--as in "trial-and-error."  This is our first attempt at a prediction of what y, the truth, might be.  Step 8 is the exciting first step of figuring out our "error"--the first step in the comparing and learning process of our network.  Once we know what we missed by, in the steps below we will seek to correct that error and do better next trial.

Our network took the input of the four responses from the four customers to the three questions and output a probability of whether each customer purchased Litter Rip! or not.  For each iteration, we compare that best prediction so far, the 4x1 vector l2, to the actual customer responses to Question Four, the 1s and 0s in the 4x1 vector y.  By "compare" I mean we subtract each l2 probability from its y value.  The remainder is l2_error, i.e., how much each value of l2 missed its target value in y.

So, what? You may ask.  Well, consider: some of those misses were quite small.  Our l2 prediction was pretty close to correct, so when we subtract that l2 value from its corresponding y value, the remainder in l2_error is a small number; a Small Miss.  
But there were also Big Misses.  Pay close attention to those big misses, because they will matter a *lot* in the steps below.  

```
l2_error = y - l2      

y:        l2:           l2_error:    
[1]      [0.50]      [ 0.50] a Big Miss
[1]  -   [0.90]   =  [ 0.10] a Small Miss 
[0]      [0.05]      [-0.05] a Tiny Miss
[0]      [0.70]      [-0.70] a Very Big Miss
```
Why do we care about the big misses?  Because if we focus on correcting the bigger misses, it improves our network's accuracy faster and cheaper than messing around with the small misses.  If it ain't broke, don't fix it.  Throughout our network, we want to focus on the Big Misses and the Low Confidence/Wishy-Washy/Big Slope Ratio numbers (which I'll explain below).  For now, just remember this key point: the Big Misses matter bigtime.
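The subtraction above can be sketched in a few lines.  The numbers are the example values from this step; the names `miss_size` and `biggest` are just labels I made up for the sketch:

```python
import numpy as np

y = np.array([[1], [1], [0], [0]])
l2 = np.array([[0.50], [0.90], [0.05], [0.70]])

l2_error = y - l2
miss_size = np.abs(l2_error)          # ignore direction, keep size
biggest = int(np.argmax(miss_size))   # row index of the biggest miss

print(l2_error.ravel())
print("Biggest miss is customer", biggest + 1)
```

Notice that the sign of each miss is kept in `l2_error`.  The size tells you how far off you were, and the sign tells you which way, which will matter when we talk about direction.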


##9) Print Error: Lines 47-51
Line 50 is a clever parlor trick to have the computer print out our l2_error every 10,000 iterations.  The line, `if (j% 10000)==0:` means, "If your iterator is at a number of iterations that, when divided by 10,000, leaves no remainder, then..."  ` j%10000 ` would have a remainder of 0 only six times: at 0 iterations, 10,000, 20,000, and so on up to 50,000 (the iterator counts from 0, so it stops at 59,999 and never reaches 60,000).  So this print-out gives us a nice report on the progress of our network's learning.

The code `+ str(np.mean(np.abs(l2_error))))` simplifies our print out by taking the absolute value of each of the 4 values, then averaging all 4 into one mean number and printing that.  Here's an example:
```
Avg l2_error after 10,000 iterations: 0.00021829659275871905     (Not bad, huh?  :-)
```
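To see the parlor trick in isolation, here is a small sketch.  It assumes the loop counts j from 0 up to 59,999, and it reuses the example misses from Step 8 for the averaging part:

```python
import numpy as np

# Which values of j pass the test, assuming the loop runs j = 0 .. 59,999
report_points = [j for j in range(60000) if (j % 10000) == 0]
print(report_points)

# The averaging step, using the example misses from Step 8
l2_error = np.array([[0.50], [0.10], [-0.05], [-0.70]])
avg_miss = np.mean(np.abs(l2_error))   # absolute values first, then the mean
print("Avg l2_error: " + str(avg_miss))
```

Taking the absolute value first matters: without it, the positive and negative misses would partly cancel out and make the network look better than it is.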

#How Back Propagation Tells You What Will Reduce Your Error (Steps 10-12)
##10) In What DIRECTION is y, who lives in the bottom of the red bowl?  The Precursor to Back Propagation: Lines 53-57  
```
 l2_delta = l2_error*nonlin(l2,deriv=True)
```
Now we have entered the brain of the beast; here is the secret sauce of Deep Learning.

Let's bust two myths, shall we?  
Myth #1: Back Propagation is Super-Hard

False.  Back propagation requires patience and persistence.  If you read a text on back prop once and throw up your hands because you understand nothing, you're done.  But if you watch Grant Sanderson's [video 3 on back prop](https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi) 5 times at reduced speed, then watch [video 4 on the math of back prop](https://www.youtube.com/watch?v=tIeHLnjs5U8&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&t=320s&index=5) five times, you'll be well on your way.  Grit.

Myth #2: To Understand Back Prop, You Need Calculus

There are many online posts which state with great authority that you need multivariable calculus in order to understand AI.  Not true.  Deep Learning author and guru [Andrew Trask](https://www.cs.ox.ac.uk/people/andrew.trask/) says that, if you took three semesters of college-level calc, only a tiny subset of that material would be useful for learning back propagation: the Chain Rule.  But, even if you took those 3 semesters of college calculus, the chain rule is often presented very differently in college from the way you would use it in back propagation.  You'd be much better off reading Kalid Azad on [the Chain Rule](https://betterexplained.com/articles/derivatives-product-power-chain/) five times.

So, bottom line?  Don't make the same mistake I did: I panicked every time I saw the word, "derivative," and it was self-defeating.  You must fight that inner voice saying, "I don't have the background to master this."  There are workarounds--simple ways to do calculus without calling it "calculus."  But there is no workaround for grit. 

Here is my favorite saying: "There is the task, and there is the drama ***about*** the task."  
Leave your drama here now.  Please give me your grit, and your trust.  Let's learn back prop.

#The Big Picture of Back Propagation: In What Direction?  And, by how much?
What is the purpose of back prop?  To find the best set of adjustments with which to tweak our network so that it gives a better prediction in the next iteration.  In other words, certain values in certain matrices in our network need to be adjusted to give a better prediction.  To adjust each of those numbers, we must answer the two key questions:

1) In what direction do I adjust the number?  Do I increase the value, or decrease it?  Positive direction, or negative? and

2) By how much do I increase or decrease the number?  A little, or a lot?

We will examine these two basic questions in great detail below.  But first:
#Exactly what ***is*** this network we're going to tweak?

When you first look at the diagram below, you might assume the brain of the network is the circles, the "neurons" of the neural network brain.  In fact, the real brain of a neural network, the part that actually learns and remembers, is the synapses, those lines that connect the circles in this diagram.  We have control over 16 variables in our network: 12 variables in the 3x4 matrix syn0, and 4 variables in the 4x1 vector syn1.  Look at this diagram and understand that every line (aka, "edge" or synapse) you see represents one variable, containing one number, aka one weight.  

![alt text](https://lh3.googleusercontent.com/CkHAz8xdeL8F5_ZqEysrOsFK-mMa2VLn_dqNcFX56tQOSR62eiNbw2OVGVPKehgkrYA8tAy9bXbfC09qglgVAVqZPghLRvJ_xaEzm1nRbltZGCvkCqZ7O59cihqL0723imx9LLdtz26LQNtKaJC13uPuWORKAx3rAfK3ZOz2MIlHLV_MO0b730fUqZLO9HwOtU8sIkFvwyLQp2tMy7G5JTGi-3gAqhspbKI684vLav73AsmHVNwcd4bnBB_0F844ehpNHjKp_ei1zl1g1WFHvM8yPbvU4WsNgBGpiVdrBJmhlFTuAcCjlMBqB8jOHP_9JMXZRaOcBNjo5QSfQ56DILw2C4h5W6g8VHpEv1k4HRWrQ4jqKPXWIeiTQB1ZBrSZwNBJfqK9yxFU1XDgP_N1s6uUGCF3v2Ae5qGiab15bRKUkORmXKN6IAS3kxfVgt3fXypSxr76Gv5Gdg27D3oJ4jmkCiiP8f7rsP6TvRH6PaTGqqcsgxzxx3KeQ9Wd3UdMsDhnt1g_2KMnPK0NvBmx3PFcrNc3dY6L_5fBzrc_-p-5R4Ch1anPROZUKX92Zrp1VFAE-kg2Mb9JdCzzOKopmpL6-SMPxGUQKAUJg2UQJDaI0GfGRM6OGpCZJ19Sl0gVY2unHZmpWroCm21TV_IvF6zzaJxc5vxJ=w905-h510-no)

These 16 weights are all we can control.  

l0, our input, is fixed and unchanging.  l1 is determined *exclusively* by the weights in syn0 by which you multiply the fixed values of l0.  And l2 is determined *exclusively* by the weights in syn1 by which you multiply l1.  Those 16 lines pictured above, the synapses, the weights, are the only numbers you can tweak to achieve your goal, which is an l2_error that gets smaller and smaller until l2 almost equals y.  l2_error is what we call your "cost," and back propagation is the tool that we use to figure out how to tweak the network to reduce the cost as much as possible, as quickly as possible.
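If you want to count the 16 weights for yourself, here is a tiny sketch.  The shapes come from the text; the random starting values between -1 and 1 are an assumption on my part, used only so there is something to count:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed starting point: small random weights between -1 and 1
syn0 = 2 * rng.random((3, 4)) - 1   # 3 inputs  -> 4 hidden neurons
syn1 = 2 * rng.random((4, 1)) - 1   # 4 hidden  -> 1 output

print(syn0.size, syn1.size, syn0.size + syn1.size)   # 12 4 16
```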











































#Step 1: Confidence Levels Help to Answer, "By How Much Do I Adjust the Numbers of My Network?"
You might call Step 10, "How much do I tweak my network before its next iteration and prediction?"  For statistics and calculus buffs, we could simply say, "In line 57 and following, we compute how much to increment the weights in the opposite direction of the gradient of the error with respect to the weights in the synapses."  Whew!  For the rest of us mere mortals, let's unpack that a bit, by returning to our (spellbinding) fairytale:

##*The Princess and the Castle, Chapter 2: Learning from Your Errors.*

You may recall, back in Step 7, you made a feed forward pass and drove to l2, your best guess as to where the castle y is located, but you arrived at l2 only to discover you were *closer* to the castle, but not yet arrived.  And you know that soon, you will (POOF!) disappear and wake up the next morning back at your house, l0, and start over.

How can you improve your driving directions to get closer to the love of your life tomorrow?

First, when you arrive at today's destination, you eagerly ask a local knight how far today's arrival place is from the Princess's castle.  This chivalrous knight tells you the distance you are from Castle y (this is the l2_error, or "how much you missed the princess by").  Every day, at the end of each trip, before you disappear for the day, you want to compute **by how much** you want to change the weights that created today's failed l2 prediction such that tomorrow your l2 prediction will be perfect and you can fall into your beloved's arms.  This is the l2_delta (clearly marked in the diagram below).  It is the amount you want to change today's l2 so that tomorrow, that new-and-improved l2 will hopefully lead to the castle drawbridge!

Note that the l2_delta is NOT the same as l2_error because l2_error only tells you how many miles you are from your princess.  l2_delta also factors in how confident you were in the turn-by-turn directions by which you missed the castle today.  These confidence numbers are the derivatives (forget calculus, you don't need it here, so let's just use the word "slope," as in Good Ol' rise-over-run), or slope of each value of l2.  Think of these slopes as the confidence levels you had in each of the turns in the set of directions we're using for today's trip.  Some of those turns you were super-confident of.  With other turns, you weren't certain if they were right or not.  

But wait: perhaps this concept of using confidence levels to compute where you want to arrive tomorrow seems a bit abstract and confusing?  Actually, you use confidence levels to navigate all the time--you just aren't conscious of it.  

Think about a time when you got lost.  At first, you started out assuming you were on the right route, or you wouldn't have taken that route in the first place.  You started out confident.  But your trip seems to be taking longer than you expected, and you wonder, "Gee, did I miss a turn?"  Less confident.  Then, as time passes and you should have arrived by now, you become more certain you missed that turn.  Low confidence.  And you know you are not at your destination, but you are not sure where your destination is from your current location.  So now, you stop and ask a nice lady for directions, but she tells you more turns and landmarks than you can remember, so you can only follow her directions part-way before you are again unsure how to proceed.  So you ask directions again, but this time you are closer, so the directions are simpler, and you follow them to the letter and arrive joyfully at your destination.

It's very important to notice a couple of things: 

First, you just learned by trial-and-error, and you had varying confidence levels.  A bit later below, I will explain in detail how those confidence levels allow our network to learn by trial-and-error, and then I will explain how our beloved Sigmoid function gives us those all-important confidence levels.  

Second, notice that your trip had two segments--the first segment was your route up to where you asked the nice lady for directions (l1), and segment two was your route from the nice lady to l2, the place you thought was your destination.  But then you realized you arrived somewhere else, and had to ask how far you were from your true destination.  

You see how confidence plays a role in your navigation?  At first, you were sure you were on the correct route. Then, you wondered if you missed a turn. Then you were certain you missed a turn, and stopped to ask directions before proceeding further.  Those 2 segments of your daily trip look like the dog legs pictured here, and each day with your improvements, the dog leg gets a bit straighter.  

This is like the process you go through as our romantic, driving, neural network: 

![alt text](https://lh3.googleusercontent.com/nRe205F0sgwTUasEbQDKFAeURmbTEhJ6HmQXd_bp7SJ9HqgAhZsbU7zjJmzuCOtQXKH4IbtQhDFW9sM6TxRVUrQZckBj32HLRT7GHllUUd-tWux553T-iHx2C7CE1ecTFazwjytbdgODR9KPAi381H4ow0cUxdoVbFgfGy97nAsbbeEBoicTP0CxVHq4tRbBaNPQ3OFNIEqoTKulvce94DRqJH9D1RUnhMSvBsjY6Fc7W34rMHU9tifMNeNetP4sjkWGmIy2A1v5FfT6F4ax-rg1iuqpOQmXjyrVdccmofIAzfw3W9hqg99U9Dae7MI7_1JLsNT31JU28Pzs7Ji1cVDi2Nx9dI77KisaC_1Y1heI6zxQYVDnBa1NXe1H6v6ku1Ode1eI07KsfFTMwHEkClqU5gyEAMdhsSX3cBKrSsIoIqlUdXNnS57A3spD0vIE3sRhrhtp9Zfg1iQhPIg39o7bHD-sPO4iP_rlJvL2Y109xoITCnPmEhe_spZrtCVnel0Yqo3Yy2GOTrEQgl9UPUxPYvyyrzGL1pKvRnMWfw7QQFYwJ0-AsC6-UsSuxt9XIfRlt6pZNx4OGkCYuLBPZ9yBM-41uC_lLuxPY0bT7CffxMVXhmMJh2U5V8CDyXI34Vt8TBiMGEwbZGRMsu05AOrjrz3vFjHl=w905-h510-no)

Every day, on every trip, you (our 3-layer network) start out with a set of directions to the princess (syn0).  When those directions end, you stop at l1 and ask for further directions (syn1).  These take you to your final destination of the day, your prediction of where you *thought* the princess was.  But no castle stands before you.  So you ask the knight, "How far to the castle? (l2_error)" And because you are a genius, you can multiply the l2_error by how confident you were in each turn of your directions (the derivative, or slope, of l2) and come up with where you want to arrive tomorrow (your l2_delta).  

(True Confession: the place where my "driving directions" metaphor is incorrect is when you stop at l1 to ask for further directions.  I imply that the nice lady gave you fresh directions.  In fact, your directions (the values of syn1) were already chosen for you during the update of the last iteration.  So, it would be more accurate to say the nice lady knew you'd arrive at her house, and today she only tells you the directions you told her to remind you of before you disappeared (Poof!) last night.)
















































So you must compute 3 facts before you can learn from them and re-attempt your quest for the princess's castle.  You must know:

1.   Your current location (l2);
2.   How far you are from the princess's castle (l2_error); and
3.   What changes you need to make in your set of turns to increase your certainty that your next driving attempt will get you closer to the castle (l2_delta).

Once you possess these three facts, then you can compute the required changes in the navigation turns (i.e., the weights of the synapses).   This is line 75, the change to the weights of syn1, which is the product of l1 and l2_delta.  The changes in syn1 will help to bring about the changes you seek in your next l2 (i.e., to end up closer to that darn castle).
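Here is a hedged sketch of what "the product of l1 and l2_delta" can look like in NumPy.  The numbers are the rounded example values used in this walkthrough, and transposing l1 so the shapes line up is my assumption about how the update is written:

```python
import numpy as np

l1 = np.array([[0.98, 0.03, 0.61, 0.58],
               [0.01, 0.95, 0.43, 0.34],
               [0.54, 0.34, 0.06, 0.87],
               [0.27, 0.50, 0.95, 0.10]])
l2_delta = np.array([[0.125], [0.009], [-0.003], [-0.150]])
syn1 = np.array([[12.21], [10.24], [-6.31], [-14.52]])

# Assumed form of the update: transpose l1 so the shapes line up,
# (4x4) dot (4x1) -> one adjustment per syn1 weight
syn1_update = l1.T.dot(l2_delta)
syn1_new = syn1 + syn1_update   # same effect as syn1 += syn1_update

print(syn1_update.shape)   # (4, 1)
print(np.round(syn1_update, 3))
```

Each of the four syn1 weights gets its own adjustment, and that adjustment blends the deltas of all four customers, weighted by how strongly each l1 value fed into that synapse.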

Now, of course the smartypants readers will notice that I have only told you how to improve tomorrow's directions for Part 2 of our journey, from l1 to l2.  Ahh, there's always a stickler in the crowd, isn't there?  Well, we're going to learn how to improve the directions (syn0) of Part 1 of our journey (l0 to l1) in Step 11 of our process.  Right now, I want to do a deep dive into using confidence levels to compute the l2_delta.  Wake up now, because below is some fascinating and important stuff:





















#Step 2: How the Slope-Taking Feature of the Sigmoid Function Gives You Confidence Levels
Here is where you will see the beauty of the Sigmoid function in four *magical* steps.  To me, the genius of our neural network's ability to learn is found largely in these four steps.  We already saw Part 1 of the sigmoid: `nonlin()` transformed each value of l2_LH into a statistical probability (i.e., a prediction) between 0 and 1 (aka l2).  This is ***Big Advantage #1*** of the ***Four Big Advantages of the Sigmoid Function.***  

But I have yet to explain how Part 2 of the sigmoid, namely "`nonlin(l2,deriv=True)`" can transform those 4 values in l2 into confidence measures.  This is ***Big Advantage #2.***  If our network's prediction, the four values of l2, is both high-accuracy and high-confidence, that's an outstanding prediction, and we want to leave the syn0 and syn1 weights that produced that outstanding prediction alone.  We don't want to mess with what's working; we want to fix what's NOT working.

OK, fine.  An output of 0.999 is the equivalent of the network saying "I am extremely confident the customer will buy Litter Rip!."  A number of 0.001 is the equivalent of "There is no way in Hell the customer will buy Litter Rip!."  But what about all those numbers in the middle?  Low confidence numbers are in the vicinity of 0.5.  For example, a value of 0.4 would be similar to "The customer might buy Litter Rip!, but I'm not sure."

That's why we focus our attention on the numbers in the middle:  all numbers approaching 0.5 in the middle are wishy-washy, and lacking confidence.  So, how can we tweak our network to produce four l2 values that are both high-confidence and high-accuracy?  

The key lies in the values, or ***weights*** of syn0 and syn1.  As I mentioned above, syn0 and syn1 are the center, the absolute *brains* of our neural network.  We are going to take the four values of the l2_error and perform beautiful, elegant math on them to produce an l2_delta.  l2_delta means, basically, "the change we want to see in the output of the network (l2) so that it better resembles y (the truth about who bought Litter Rip!)."  In other words, l2_delta is the change you want to see in l2 in the next feed-forward pass in the next iteration.

***Get ready for beauty.***

Here is ***Big Advantage #3*** of the ***Four Big Advantages of the Sigmoid Function:*** Do you remember that diagram of the beautiful S-curve of the Sigmoid function that I showed you above?  Well, lo-and-behold, each of the 4 probability values of l2 lies somewhere on the S curve of the sigmoid graph (pictured again below, but this time with more detail).  For example, let's say value 1 of l2 is 0.9.  If we search for 0.9 on the Y axis of the graph below, we can see that it corresponds with a point on the S curve roughly where you see the green dot: ![alt text](https://iamtrask.github.io/img/sigmoid-deriv-2.png)
(taken with gratitude from: 
[Andrew Trask](https://iamtrask.github.io/2015/07/12/basic-python-network/))

Did you notice not only the green dot but also the green line through the dot?  That green line is meant to represent the slope of the *tangent* to the curve at the exact point where that dot is.  You don't need to know calculus to take the slope of a curve at a particular point--the computer will do that for you.  But you do have to notice that the S curve above has very shallow slope at both the upper extreme (near 1) and the lower extreme (near 0).  Does that sound familiar?  Wonder of wonders, a shallow slope on the sigmoid curve coincides with high confidence in our predictions!  (Whether a confident prediction is also an accurate one is a separate question, and that is what l2_error tells us.)

And you also need to know that a shallow slope on the S-curve comes out to a tiny number for slope.  That's good news.  Why?

Because, when we go to update our synapses, we basically want to leave alone the weights that produced our high confidence, high accuracy predictions, since they are already good.  To "leave them alone" means the adjustment we add to them should be tiny, near zero, so the weights remain virtually unchanged.  And here comes ***Big Advantage #4*** of the ***Four Big Advantages of the Sigmoid Function:*** Miracle-of-miracles, our high-confidence numbers correspond to shallow slope on the S-curve, which corresponds to tiny slope numbers.  Therefore, multiplying the error by these teeny-tiny slope numbers before we adjust syn0 and syn1 has exactly the effect we want: the corresponding values in our synapses are left virtually unchanged, so our confident, accurate, high-performing values in l2 remain so.

By the same token, our wishy-washy, indecisive, low-accuracy l2 values, which correspond to points in the middle of the S-curve, are the numbers that have the biggest slope on our S-curve.  What I mean is, the values around 0.5 can be traced on the Y axis of our graph above to the middle of the S-curve, where the slope is steepest, and therefore the value of that slope is a big number.  Those big slope numbers mean a big change when we multiply them by the corresponding misses in l2_error, as we do in line 61.
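You can check the "shallow at the ends, steep in the middle" claim with a short sketch.  The formula `output * (1 - output)` is the standard slope of the sigmoid expressed in terms of its output; the function name `sigmoid_slope` and the sample values are my own choices for illustration:

```python
import numpy as np

def sigmoid_slope(output):
    """Slope of the S-curve at a point, given the sigmoid OUTPUT there."""
    return output * (1 - output)

outputs = np.array([0.001, 0.2, 0.5, 0.8, 0.999])
slopes = sigmoid_slope(outputs)

for out, slope in zip(outputs, slopes):
    print(f"l2 value {out:5.3f} -> slope {slope:.4f}")
```

The slope peaks at 0.25 when the prediction is exactly 0.5, and shrinks toward zero as the prediction approaches 0 or 1.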

In detail now, how do we compute the l2_delta?  

You may recall we already found l2_error, which measures how much our first prediction, l2, missed the target values of y, our truth, our future, and our princess.  We are particularly interested in the Big Misses.  























































In line 57, the first thing we do is use **the second part** of our beloved Sigmoid function, "`nonlin(l2,deriv=True)`" to find the slope of each of the 4 values in our l2 prediction.  This slope tells us which predictions were confident, and which were (wait for it...) Wishy-Washy.  This is how we find and fix the weakest links in our network, the low-confidence predictions.  We then multiply those 4 slopes (or confidence measures) by the four misses in `l2_error`, and the product of this multiplication will be `l2_delta`.  Oh, Lordy!  This is an important step--did you notice that we are multiplying the Big Misses by the Wishy-Washy Predictions (i.e., the l2 predictions that had big slopes)?  That is a super-duper key point, as I'll explain below.  But first, let's make sure you can visualize what I just said:
```
Below is the element-wise multiplication of this line of code, in order of operations: l2_delta = l2_error*nonlin(l2,deriv=True)
    
Take l2 predictions, find their slopes, multiply them by the l2_error, and the product is l2_delta

y:        l2:         l2_error:    
[1]      [0.50]      [ 0.50]
[1]  -   [0.90]   =  [ 0.10]
[0]      [0.05]      [-0.05]
[0]      [0.70]      [-0.70]

l2 slopes after nonlin():    l2_error:                l2_delta: 
[0.25] Not Confident        [ 0.50] Big Miss         [ 0.125] Big change
[0.09] Fairly Confident  X  [ 0.10] Small Miss    =  [ 0.009] Small-ish Change
[0.05] Confident            [-0.05] Tiny miss        [-0.003] Tiny change
[0.21] Not Confident        [-0.70] Very Big Miss    [-0.150] Huge Change
```
Notice that the Big Misses are (relatively speaking) the biggest numbers in l2_error, once you ignore the plus or minus sign.  And the Wishy-Washy's have the steepest slope, so they are the biggest numbers in `nonlin(l2,deriv=True)`.  So, when we multiply the Big Misses X The Wishy-Washy's, we are multiplying the biggest numbers by the biggest numbers, which will give us--guess what?--the biggest numbers in our vector, l2_delta.  

Why is that fabulous news?  Think of l2_delta as "the change we want to see in l2 in the next iteration."  The **big** l2_delta values are the **big** changes we want to have in the l2 prediction of the next iteration, and we'll make those happen by making **big** tweaks in the corresponding values of syn1 and syn0 below.  Those *big tweak* values will be added to the existing values of syn1 (that's what the "+=" operator does).  The result means that the updated set of weights will contribute to a better l2 prediction in the next iteration!  Happy Happy!  Joy Joy!
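The multiplication walked through above can be sketched as follows.  I am assuming `nonlin(l2,deriv=True)` returns `l2 * (1 - l2)`; the unrounded results differ slightly from the rounded figures in the table:

```python
import numpy as np

l2 = np.array([[0.50], [0.90], [0.05], [0.70]])
l2_error = np.array([[0.50], [0.10], [-0.05], [-0.70]])

slopes = l2 * (1 - l2)          # what nonlin(l2, deriv=True) is assumed to return
l2_delta = l2_error * slopes    # element-wise: row by row, not a dot product

print(np.round(slopes, 3).ravel())
print(np.round(l2_delta, 3).ravel())
```

Customer Four, the Very Big Miss with a wishy-washy prediction, ends up with the largest l2_delta in size, while Customer Three, the confident Tiny Miss, barely moves.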





























#Step 3: Gradient Descent Answers, "In What Direction Do I Adjust my Weights?"
When we update our synapse matrix using that large slope number, it's going to give the corresponding element (aka, its value or number) a big nudge in the right direction towards a confident and accurate prediction.  When I say, "in the right direction," what I mean is that some values of our l2_delta are going to be negative, because we want the corresponding l2 predictions to approach 0 (which would produce a confident prediction that the customer did not buy Litter Rip!).  Other values of our l2_delta are going to be positive, because we want them to nudge the weights in syn0 and syn1 so that the corresponding l2 predictions approach 1 (which would imply that the savvy customer bought Litter Rip!).

So it's important to notice that there is a sense of "direction" involved here.  When we talk about "what direction is the target y value from our current l2 value?" we mean, do we need a positive l2_delta value to move the prediction in a positive, larger direction (towards a Yes probability predicted), or a negative l2_delta value to move it in a negative direction (towards a No probability predicted)?
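To make "direction" concrete, here is a toy sketch with a single hypothetical weight instead of our 16.  Everything in it (the `one_update` helper, the starting weight of 0, the input of 1) is made up for illustration; it only shows that a positive delta pushes the prediction up and a negative delta pushes it down:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def one_update(weight, target, x=1.0):
    """One trial-and-error step for a single hypothetical weight."""
    prediction = sigmoid(x * weight)
    error = target - prediction
    delta = error * prediction * (1 - prediction)   # miss times slope
    return weight + x * delta                        # added, as with +=

start = 0.0                                  # prediction starts at 0.5
toward_yes = one_update(start, target=1)     # positive delta, weight goes up
toward_no = one_update(start, target=0)      # negative delta, weight goes down

print(sigmoid(start), sigmoid(toward_yes), sigmoid(toward_no))
```

Same starting point, same size of miss, opposite signs: the sign of the delta alone decides whether the next prediction drifts toward Yes or toward No.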

![alt text](https://mail.google.com/mail/u/0?ui=2&ik=e3f869f938&attid=0.2&permmsgid=msg-a:r4352876950048414936&th=1691255aa52a4d54&view=fimg&sz=s0-l75-ft&attbid=ANGjdJ8FdFORGv3w0jn-Bs8GhlKpg2D1XPRzSF6OaNCqE8hchNYMIAymIg-nK1xCdIsQup54rJmkW2l0qttCzg03Hq8PJOv4KX0ae14e2dkswvLMt74Rzdhwt2ZJQBQ&disp=emb&realattid=ii_jsexnu8o2)
(taken with gratitude from [Grant Sanderson](https://www.youtube.com/watch?v=IHZwWFHWa-w&t=2s&index=3&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi) Ch. 2)






























Above is a nice, simple picture of the "rolling ball" of gradient descent.  The curve the ball rolls along is, you may recall, the cost function of our network, aka the total l2_error (the height of the yellow arrow from the white grid to Point A).  

Line 61 computes each value of l2 as a slope value.  Of the 3 "bowls" pictured in the diagram above, it is clear that the true global minimum is the deepest bowl on the far left.  But, for simplicity's sake, let's pretend that the bowl in the middle is the global minimum.  So, a steep slope downwards to the right (i.e., a negative slope value, as depicted by the green line in the picture) means our ball will roll a lot in the negative direction (i.e., to the right), causing a big, negative adjustment to the corresponding weights of syn1 that will be used in the next iteration to predict l2.  In other words, some of the four l2 values will approach zero and a probability prediction that the customer in question did NOT buy Litter Rip!.

However, if for example you have a shallow slope downwards to the left, that would mean the prediction value is already accurate and confident, which produces a tiny, positive slope value, so the ball will roll very little to the left (i.e., in a positive direction), thus adjusting by very little the corresponding weight in syn1, so the next iteration's prediction of that value will remain largely unchanged.  This makes sense because the back-and-forth motion of the rolling ball is becoming smaller and smaller before it soon comes to rest at the global minimum, the bottom of the bowl, so there's no need to move much.  It is already close to the ideal.
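Here is a deliberately tiny sketch of the rolling ball, using a hypothetical one-weight bowl rather than our real network.  The cost function, the starting point and the `step_size` factor are all assumptions chosen to keep the arithmetic simple:

```python
# Hypothetical one-weight "bowl": cost(w) = (w - 3)**2, lowest at w = 3
def cost(w):
    return (w - 3.0) ** 2

def slope(w):
    return 2.0 * (w - 3.0)

w = 0.0            # where the ball starts
step_size = 0.25   # assumed scaling factor, not part of the network above
path = [w]
for _ in range(6):
    w = w - step_size * slope(w)   # move against the slope, i.e. downhill
    path.append(w)

print([round(p, 3) for p in path])
```

Far from the bottom the slope is steep and the ball takes a big step; as it nears the bottom the slope flattens and each step gets smaller, which is the "no need to move much" behavior described above.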

The above 2-dimensional diagram is a tad oversimplified, so here is a link to a more accurate picture of what gradient descent looks like.  This is another good image, similar to the curvy red bowl I showed you at the beginning, sitting on the "tabletop" plane of the syn0, syn1 grid, with an arrow showing Feed Forward and a tiny arrow showing slope/gradient descent:  https://docs.google.com/document/d/1_YXtTs4vPh33avOFi5xxTaHEzRWPx2Hmr4ehAxEdVhg/edit?usp=sharing

This is a perfect example of why the best teacher is someone who learned yesterday the material you are learning today.  I only discovered the insight that the red, gradient descent bowl was sitting on the darn grid of syn0 and syn1 after a year of studying gradient descent, because all the experts take this point so much for granted that they don't bother mentioning it.  Here is a BIG chance for you to learn from my mistakes, and gain a super-key insight that eluded me for over a year, even though it was right under my nose.  

Take your time with the above points and make sure you understand them.  Do you see why the sigmoid function is a thing of beauty?  It takes any random number in one of our matrices and:


1.   turns it into a statistical probability, 
2.   transforms that value into a confidence level as well, 
3.   which creates a big-or-small tweak of our synapses, and
4.   that tweak, in a positive or negative direction, is (almost) always in the direction of greater confidence and accuracy. 

The sigmoid function is the miracle by which mere numbers in this matrix can "learn." A single number, along with its many colleagues in a matrix, can represent probability and confidence, which allows a matrix to "learn" by trial-and-error.  That is a thing of beauty, but there is more elegance to come!  As you learn other networks, you will see there are many functions that "learn" in even more beautiful ways than the sigmoid we have studied here.
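
Here is a minimal sketch of the two jobs the sigmoid does, using the same two formulas the code excerpts in this article call (`1/(1+np.exp(-x))` for the squish, `x*(1-x)` for the slope).  The three sample inputs are made up for illustration; they are not values from our network.

```python
import numpy as np

def nonlin(x, deriv=False):
    # With deriv=True, x is assumed to ALREADY be a sigmoid output,
    # so x*(1-x) is the slope of the S-curve at that point.
    if deriv:
        return x * (1 - x)
    # Otherwise squish any number into the range 0..1.
    return 1 / (1 + np.exp(-x))

raw = np.array([-4.0, 0.0, 4.0])      # made-up left-hand-side values
prob = nonlin(raw)                    # job 1: turn numbers into probabilities
slope = nonlin(prob, deriv=True)      # job 2: confidence (small slope = confident)

print(prob)
print(slope)
```

Notice that the wishy-washy prediction in the middle (0.5) has the steepest slope, while the two confident predictions near 0 and 1 have tiny slopes.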


##11) Back Propagation: Lines 59-64
```
  l1_delta = l1_error * nonlin(l1,deriv=True)
```
First, let's do a quick review:
In Step 8, we computed l2_error, which is the difference between the network's prediction of probability (aka layer2) that the customer would buy Litter Rip! and the actual, Yes/No, 1/0 response the customer made to Question Four about a purchase of Litter Rip! in y.  Next, in Step 10 we computed l2_delta by multiplying l2_error times the confidence levels we had in each value of l2.  And soon in Step 13 we will update syn1 by adding to syn1 the product of l1 X the l2_delta.  This multiplication is needed so that larger changes are applied to weights that have more impact on improving l2.  However, we do not apply this update until after we are ready to update syn0, for the sake of consistency.    

This is an important distinction.  We know our goal is to get to The Ideal l2 Prediction, which is as close as possible to our actual truth, y.  But we'd only be using half our horsepower to get there if we only update syn1.  We want to update both syn1 and syn0 in order to maximize our efficiency in creating The Ideal l2.  Finding the l1_delta is the key to updating syn0, so that's our next job now.

In Steps 11 and 12, we are going to compute the l1_delta, but in a slightly different manner than we computed l2_delta.  We are going to use back propagation.    

Ahh, you.  You clever network, you.  Continuing your (unmistakeable) genius, now you can work backwards to learn from your mistakes.  Your l2_delta tells you the confidence you had in each of the turns (syn1) that brought you to l2.  So, you go back through your turns and eyeball the low-confidence turns to figure out the first place you went wrong.  In math terms, you will multiply tomorrow's ideal arrival and the confidence you had in today's predictions (l2_delta) by today's not-so-perfect directions (syn1) to figure out the first place you went wrong on today's route (l1_error, which is the distance between today's screwy l1 stop and tomorrow's fabulous l1 stop).  

You can then multiply the l1_error by how confident you were in your *first* set of turns/predictions that got you to l1, and the product will be the l1_delta, with which you can eagerly update syn0 to bring you ever-closer to your betrothed.  I am brushing back the tears, just thinking about it...
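
The two multiplications just described can be sketched with NumPy.  The shapes assume a network with 3 inputs, 4 hidden nodes and 1 output (which gives the 16 weights this article talks about) and four training rows.  The random numbers are placeholders, not the real values of our network.

```python
import numpy as np

rng = np.random.default_rng(1)              # placeholder numbers only
l1 = rng.uniform(0.05, 0.95, (4, 4))        # hidden-layer outputs: 4 rows x 4 nodes
syn1 = rng.uniform(-1, 1, (4, 1))           # weights from l1 to l2
l2_delta = rng.uniform(-0.2, 0.2, (4, 1))   # error x confidence for each row

# Send the l2 blame backwards along the same weights that carried l1 forward.
l1_error = l2_delta.dot(syn1.T)             # one entry per hidden node per row

# Scale each entry by the slope (confidence) of the matching l1 value.
l1_delta = l1_error * (l1 * (1 - l1))

print(l1_error.shape, l1_delta.shape)       # both are 4 x 4
```

Each entry of l1_error is simply one row's l2_delta multiplied by the syn1 weight attached to that hidden node, which is why a heavily weighted node takes more of the blame.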

Consider this: back when we were finding l2_error, life was easy.  We knew l2, and we knew "The Ideal l2" we were shooting for, which was simply y.  But with l1_error, things are different.  We don't know what "The Ideal l1" is.  There's no y to tell us that.  y only tells us The Ideal l2.  So, how are we going to figure out what The Ideal l1 is?  We must first figure out how much l1 would need to change to produce the desired effect on l2.  We're going to have to take what we *do* know and work backwards.  That's "backwards" as in, "back propagation."  Strap in, Folks.  Now we're goin' bigtime:














#The Big Picture on Back Prop
So, here's the challenge: every time you tweak one of your 16 lines/weights/variables in syn0 and syn1, it has a ripple effect throughout the whole network.  How can you calculate the best value for each of the 16 weights while taking into consideration its ripple effect on all the other 15 weights, all at the same time?  That sounds crazy-complex, right?

Can do.  Let me show you the World's Greatest Parlor Trick.  Those of you who know calculus will understand when I say we are going to use the Chain Rule to take derivatives.  But those of us who don't know calculus are not intimidated in the least, right?  That's right, because we will use slope, Good Ol' rise-over-run, to juggle 16 bowling pins in the air at the same time.  The secret?  In this context, to find the slope is to take the derivative.  They are exactly the same thing in this context.  (Psst: Don't tell all the calculus teachers.  They'll be out of a job...)

Here is the key overall question: "When we adjust syn0,1 by nudging it up or down, how much does that increase or decrease the l2_error?"  In other words, think of a derivative as a "sensitivity," or a relationship, or a ratio: we know that if we wiggle syn0,1, up or down, then the l2_error will wiggle up or down in proportion to that nudge.  But will it move a little?  A lot?  What is the ratio of l2_error's wiggle to syn0,1's wiggle?  How sensitive is l2_error with respect to syn0,1?

Remember: the goal of back prop is to figure out the amount to tweak each of our 16 weights so that we can reduce our l2_error as much as possible in the next iteration.  But the challenge is that each of these 16 weights affects many of the other 15 weights, so you might say that how much you adjust Weight 1 *depends on* how much you adjust Weight 2, which *depends on* how much you adjust Weight 3, and so on.  Keep this *"depends on"* image in your mind because it will come in handy later in the math.  As an analogy, imagine these 16 weights as 16 of the Minions from the *"Despicable Me"* movie, and these 16 minions are aligned in perfect formation and cooperating perfectly as one body in order to serve their master, Felonius Gru, whose mission is to arrive at perfect predictions with zero error.

#The Butterfly Effect
OK.  Our key question is, "How much do I adjust syn0,1 so that syn0,1 working in synch with its other 15 minions, minimizes l2_error so I can arrive at Castle y and fall into the arms of my princess?"  

The answer to that question is kind of like The Butterfly Effect, where the butterfly flapping its wings in New Mexico sets off a chain of events that results in a hurricane in China.  (Did you notice?  "chain of events" sounds a tad like "chain rule" in calculus, no?  Oh Dave, so very clever...).  Let's look at a picture of how this analogy might apply to the series of ratios we must calculate in back propagation:



![alt text](https://lh3.googleusercontent.com/KdcvLEaAVBlzJrdks_FXzMHX9u-wvxkPtp6TGKTiHZruX0l827g1i6hZsc20GDZxIMYGTTgU2ynpzg0Spq8cTHSf9F8Bt_Up_kyBgxr5TCN2DzQaOH-X5wCDPNKkFYx0-AoUUokaGO_EMU9vHlbNasiFBN2XuXGetGrKbnKWQhjaS0j21VCvo7G16tRVRpKE9EagNcmz7hiCGY0bDqjFjCjYw8XQgt5aP_ajTozE9Lm7kzVDnEVm8bHV9-jyNFYKBIBYykI83YvF8OrIx7tpJX3HABpk89laGwT-TjBcmBHQxryPbxa3PS5bV8tsSs5U9zJwmLqtFJmY2wOibNRd8QIs-gRyLU2e_Ndi6-MilywEVOe9HM8wQk9O3tV1F5dXhL-0-Mt7IxaQ-lCqFwgZ3nhPb0eAH8nq8igM0Ki-l_lixTK2ta53K7SOw1-J5EAUzHku5lBAiwS8DfkdHtnk8j_MsOFuCJp9iw3r-EO6neuB7PRMOujXlSoZBQoeZoH6qyPXsAlWm1hzuDD5lZmKgR0YC18kVQ7beOeNi5Gc2jUgY0O_31RRfQnihoeeHyhfLGbD88xKoyGtYS8ycbfpOlahxKzGxVAIi6ASMNjhc85ENE2Ek3ts03tHI4_fVqzsbnY-GYqTxiPaNOROk5JFLx-M1GB1FP2x=w905-h369-no)

Now, let's walk through the math of our butterfly analogy:  When we increase or decrease the value syn0,1, that's like the butterfly in New Mexico flapping its wings.  For simplicity's sake, let's say we just increased our syn0,1 to 3.66.  This increase will now ripple through our chain of events.  

Now, follow along with the above diagram and let's combine two analogies:
#The Butterfly Effect Meets the Ripple Effect





























Ripple 1, the first ripple effect of our tweak to syn0,1 will cause l1_LH to increase by a certain proportion, aka ratio.  That's the "gust of wind in Nevada." (see the grey line below)

Since l1_LH is the input of our sigmoid function, to calculate that ratio of change between l1_LH and l1 (aka, "to take the slope") is to measure Ripple 2, the "heavy winds in L.A."  (purple line below)

Then, l2_LH will obviously be affected in proportion to the change in l1 and its subsequent multiplication by syn1,1 (which does not change-- we're leaving syn1,1 and the other 14 weights unchanged for now, to simplify our example), so measuring that ratio of change will give us Ripple 3, the "thunderstorm in Hawaii."  (yellow line below)

You can probably guess l2 will change in proportion to the change in l2_LH, so taking the slope of those 2 numbers will give us Ripple 4, the "storm over the Pacific." (green line below)

Finally, when this new l2 is subtracted from our target y value, the remainder, which is l2_error will change, and this is Ripple Effect #5, the "hurricane in China."  (light blue line below)
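
The five ripples can be traced in a few lines of Python.  This is only a sketch for a single path through the network: `rest0` and `rest1` are hypothetical stand-ins for everything the other 15 weights contribute, chosen so that the starting values match the forward-feed numbers used in this article (l1_LH of 3.82, l2 of 0.50).

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def ripples(syn0_1, l0=1.0, syn1_1=12.21, y=1.0):
    # Hypothetical stand-ins for the contribution of the other 15 weights.
    rest0 = 0.16
    rest1 = -12.21 * sigmoid(3.82)
    l1_LH = l0 * syn0_1 + rest0      # Ripple 1: gust of wind in Nevada
    l1 = sigmoid(l1_LH)              # Ripple 2: heavy winds in L.A.
    l2_LH = l1 * syn1_1 + rest1      # Ripple 3: thunderstorm in Hawaii
    l2 = sigmoid(l2_LH)              # Ripple 4: storm over the Pacific
    l2_error = y - l2                # Ripple 5: hurricane in China
    return l1_LH, l1, l2_LH, l2, l2_error

before = ripples(3.66)
after = ripples(3.67)                # the butterfly flaps: a small nudge to syn0,1

names = ["l1_LH", "l1", "l2_LH", "l2", "l2_error"]
for name, b, a in zip(names, before, after):
    print(f"{name}: {b:.4f} -> {a:.4f}")
```

Every value downstream of the nudge moves a little, and the last one, l2_error, gets smaller.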






























![alt text](https://lh3.googleusercontent.com/H3AZ3eEfKsjodFN6Nd0adMZHO9bewgae5zXfV4DiYYH1F6AanbW3fKaDUImpj1wWtjP08VBYBKIeSIpNQarFRhHqHpeqvF1I_jCI2lUsJsVup7GYrM0WrHLqSylxgPknoGNxVLPsQlaI85Kx6dqdVjOFicRJIR0oTme0rFD9hZ2-Z9tRecgk6J5QTmONCQuepnbmtrRz6LbCq1aabaAKNh6kzSMksuIdGZv4x4aaSKRx6NlVZnSLwmXWUsS7XYXACIWf7pZGuiqCF3YTv8Msl3c3lY3AK391zYodlNDiKp1tJIbQD-DfEQGlDVOqu6ag7Hjx0m1_HphiYrpCau9B86B0GIyw8koQcAibEXE3FfIaHAdxTs0mo-iYYDytPK7xXaLgJUKa4TJz5QDGaEVw9Vt8fmEj87ZgwAeSO9QyRrNVttBEppwjEydTN1t0hdBOzm2FHfMUfYPRtidmyrtYyYA3c1ZHCI00EZT9PAitrrDmDMnejQlKJnq0xKX_PIFLxvEUL-y-9TLDwzsfm-u1pJirc4-iYTXhSp9ypdQnuX2QJynlrRt9hh93yVF48_dxI9IONw-g4Zu-H5dagikBT7VGwjwlIQNWlGib9JXK40RNJPeoiI6jw1eSr6h53HONh6m-PFH8eIrDVZ5Xw9TU69JH0gJzPZlS=w905-h510-no)

Our goal is to calculate the ratio by which each ripple ripples, in order to know the amount we want to increase/decrease syn0,1 in order to minimize l2_error on our next iteration (the dark blue arc line above).  When we say our neural network "learns," we really mean it reduces l2_error with each iteration such that the network's predictions become more and more accurate each time.  So, tweaking syn0,1 is like tweaking the flap of the butterfly's wings, which ripples through a chain of events right up to the hurricane in China, which in our example is the reduction of l2_error. 






Since we're working backwards, we would say, "How much the hurricane l2_error changes *depends on* how much l2 changes, which *depends on* how much l2_LH changes, which *depends on* how much l1 changes, which *depends on* how much l1_LH changes, which *depends on* how much our butterfly, syn0,1 changes."  Let's have a clean, simple look at that in this diagram:

![alt text](https://lh3.googleusercontent.com/sjrkERdgtEpIRHlIzNJqgPwUFbio7--buJJQetEHqTBZMrmBHgoCharDPCfG-f0XvTvA2mgooiFnfT4oHmPkerQNY6vT4icHCtyNRr2mxbXzJlIOpiQsJen-AozAbDD6qNIE2xli-WbyLD7gHNRuW25RYbDHsvfWRNWiD_sbjvTAeD8mSuqWwjUHzjkxjIXay8mnpKMklmtULzBHhYSJjeTh8shbjnMA0O-04wH-XqeLvt8mOCV1NgXf5PWtzYgyE9zJCiQyPq1b8i4O6zLpSzuM5VPkwX-cEic4Ag2z3RmwZHw-05pJUAW9B4NlO_cQe3DFN2vOoSZ8LqoRL49h8yoVIdyqgvO_Oy7R0TTXndxXxdAY_c0ramKXilgUn-wu3V6a0nl-bccZE0ADCjB3_5IrVOGX27T8AYXdLoWkTrNUOS2NS7s6TqjS2PwgBU5cQLdofwdx8mzxGzjCgHggrDbRNYSQGZZPXZUCDz0-ugf82ApAF_G-O7somXPYBleoE2Y_zVaWMw6fd92juYUa5az2J48FpZdifz05LH4gFnys35V_qRGejjR5SPLvPovDwJLzlEnyu6pmyJJOiPzH8Q_DWdw9u6Ev7C9RUvY3FKe-jiEDGIh8si0TwN00n1z34d5gnPpCop3Ha2803hgStYW4g2b2zpQR=w905-h252-no)

If you compare the code above to the math in the diagram and they don't look consistent to you, that is only because our original code breaks the back propagation process down into several intermediary steps with several extra, intermediary variables.  Take a look at the code after I remove those three intermediary variables:
1.   l2_delta;
2.   l1_error; and
3.   l1_delta

Focus on the bottom line of the diagram, with the green arrows pointing upward:
![alt text](https://lh3.googleusercontent.com/VSxM6JaRc7I8CPhCm7AQFQI5bxkWjpMqwSAnQte9g8ZUcoZmRL6Od0osNqA_rq8x77vsZLLH1XkSRcrSF6FOaYTYuWu85EUwi-8fXDsCJ4ArsP1hjdMuIRgKORw-H_r1e-garDwD5UoZL_fvFmvVjpM38vt30KRRvhECGLlA5aHADdTqVCpUB1P2743YQcZhSMcXZzjjD-jylSChF5_wt5WvYQe4klee7QV3j1xbYIdWEVizZs6p7KRxCqdEzbbOjXWKA5pEV9I7-s152H5MYjxAzkRO4ibAfCkcfvVj5m7Pjk8Gh-AQhpW--ZEmTd-MAf07DMNeVcNpnEPg_A2m7ESfkRSCr04NCZwkUnU2q_eJwfJAEIMeJE9eSt9S0tszXG6auHHI7qsG2nrfvYkzple4z55gwfj1jvN4z9pSuENaLojJ3wXb6q6Z5ePWjB3CEBEbda0oqCqamO9nLTM995yRzLyyuuMhKYlIDTfO4BmiLT4ThDpftdo7KppXltN7rp_4Q1seXBeDBv_qWtarkh6lS0G0QKrzGy_BlVxxq1AHVioGP9ZWHZxGyfccjzbDzeXvvxNapN8yE7D1Itpz7PbxdDLxKjAbz5cybr3xUw96kLH4gsFb89f6GVb2v_MBacZcNfpvVwSyEcV2V6TpvQfkhYxk4MgR=w905-h310-no)


Do you see how, suddenly, things align very nicely?  Just follow the arrows from each piece of code to each ratio.  





Makes sense?  Lines up nicely, right?  That, my friend, is back propagation.  It took me about one month of daily study to finally arrive at this insight, so don't feel bad if you haven't nailed it yet.  



So you can see that there is a chain of 5 ratios we need to calculate and multiply together in order to find the ultimate ratio of how much a change in our butterfly, syn0,1 creates the change we want in our hurricane, the l2_error.  How do we calculate those ratios?  For your convenience, here again are the variables we found in the Step 7 Forward Feed:
```
l0= 1
syn0,1= 3.66
l1_LH= 3.82
l1= 0.98
syn1,1= 12.21
l2_LH= 0.00
l2= 0.50
y= 1 (This is a "Yes" answer to survey question 4, "Ever bought Litter Rip?" which corresponds to training example #1, i.e., row 1 of l0)
l2_error = y-l2 = 1-0.5 = 0.5
```
Don't make the same mistake I made.  Here's how I first attempted to calculate the ratios (i.e., the wrong way):

![alt text](https://lh3.googleusercontent.com/S2xDotEdfwmL-c4aT7auhsXMSb_d2kMbDIyHcJBaAV0tBV20GOTSguM5esGwFR5igC7196fIuh-gTYoy2Jx-dfpxcoxYS0cwFXitQHpH1__FoWQsCRDNXZvaAkF15LNRnu5jo1d0611_W6Uzox1a9GWfd12VEHxOtkTZgSm-o5-H_di9Rrdt6wbiX0BZUEFh-HzEmw7KjLwOTCespXW3ttTW8IbjLAtPK_njwHPLUpjVFcf8JvaCXXwvGTnJMuDeH9s0SwGXOwPDa8JkNcuQFaPQOh7wtFPSScG6MecCPunoe1-Z4hA4qPwkT73JGZAUd9LOYyFP9IUKborIZNNDzelBlkyY__Jga36qEweN1aOyVx2klD_5Yr37m7kQtOfb_bv_nn1fBUfK7UVBRbUcLxaua4j-AJ7-COPXFJIN0OJLaB46-tj8TvDRw3PYYfASEVdb12SmdyESpQkgg2B0d7PuXQuTg1SATSBvE4qjMPhJpogBnkRs1_94JvKpQc11iJpuF4KJSOvXRkXDQJwfOs7SzNYZirEjXl9nNUFqEYM4RaeELMb2JEsUWQcYSf0gOvB8LEbMAiZ2L9lNPQTFdHZ8hQxoBY59zOqiA92fClwYHH2Tkooc8eX3YoJDazG6WJXlFDFP3gn6jXgucHH2bBN4Joc_FB0J=w905-h227-no)

For $250,000 cash and a trip to our bonus round, **What's wrong with the above picture?**

Answer: In my above calculations, I forgot that the goal is to calculate **change**.  I mistakenly believed that each of these 5 ratios is fixed.  In fact, how much the output of a given function "B" changes in response to a change in the preceding function "A" depends on *where* the current coordinates fall on the graph of Function B on the grid.  So calculating change is not possible if you treat a ratio as fixed.  It *is* possible for any function, linear or nonlinear, but in order to calculate change you must calculate the slope of the function at that spot.  Why?  Because how much the output changes depends on where you are on the graph of the function.  Once you know the slope of a function at your location, you know how much its output varies there.  And to calculate slope, you need at least 2 sets of coordinates.  

I want to make sure this is clear for you.  Consider for a moment:  we want to predict the future (so to speak).  We want to know how much a *change* in syn0,1 will ripple through our network to cause a *change* for the better in our next, future iteration of l2_error.  In other words, we're not just comparing how a single number 3.66 in syn0,1 affects a single number 0.5 in l2_error.  We want to compare how a *change* in the number 3.66 in syn0,1 will in turn change the 5 rippling ratios to ultimately create a better, future l2_error.  Specifically, we want to make that 0.5 value smaller, as quickly as possible.  So, we don't want a bunch of ratios of numbers; we want the ratio of **CHANGE** in the numbers, the numbers that change our future.  The delta.  

OK, so hopefully you now agree that we need *two* values for each of our variables.  We already have a current value for each variable.  We need to provide a *second* value for each variable and subtract one value from the other.  The remainder = an "amount of change," or delta, that we can then compare to the amounts of change, or deltas, of the other variables.  It looks like this formula in the top part (ignore the math below for now):


![alt text](https://lh3.googleusercontent.com/DjlNxIzJRqvBQAzUAUKft3M7D5RLlyDMLHP763AXcNlAtVtHfSwe6uaiYHmAIVZyNU7zsf1BhlU_QnMn6dqUqgdsK90yn0yAzhGfwavbAP4bko7odIvQkXCFZyEeWTIIEKyGHWSiM9wii8IYXJRIREWbzLLlGAFgZ8Ewq7ANwWxek1I-CqCZp_Oj3_gad1AdEUDHxl4ImaXtB8r7dJIU9CverBUVjISMEfl3ZLdbRwCgiNiPHoeDX4itWJK_vc_ihUYZaKYLlkQFm9JBFn30sAphM4xg6OnCq7yVDhaTQNUDl4aG-SJK2VvvHcHsNFz9Dg-rt0qTIdppsdo7C67sZZ1ZxzHJ--kIVHm2-jCWVwX7hNvdn8ce6vSuZsLqK5NL7CDIF6hNiRqkrIIu3AsfB4dmDMJTNj-YidDcMRFORtFIxT7k6QDDNZoO2TJ787IjI0Evqpwq0-RaZ28jKQYtDrhrSnBFw76dm_ELtea10krvzkLyiELgDCOFYa8i_DCTVU_OIKKhrH_Xw1mrIpSdEYUWdO1ygelOBFzA3Jcd_L2V2hC4-9KLr0dd10u1lNXaEWmSa3Uw8iTFYJzDKXeCgvsN4HW9gi6UItphcVSsTqm3hRGhUQfSQQBwtha6ESEdOcPtjhfW_LfMfCOBmCXnyxa4Ghw-gqp8=w905-h165-no)

"Current" above refers to the current values we have for each variable.  "nearby" means we want to provide a number pretty close to our current number, for convenience's sake.  That way, when we subtract the current number from the nearby number, it will yield a small number that's easy to calculate in our ratio of change.  And it will give a more accurate slope when your two points are on a curvy graph line.  Let's walk through our 5 ratios together so we can practice finding the "nearby's" for each ratio.  Once we've done a full example, you'll be a giant step closer to understanding back propagation.

![alt text](https://lh3.googleusercontent.com/ODiRcFvxzkCO8Lzm5KbJXaihDWxmO_8nErbfKgRJd1LClMounnr8Ap2_qJ_5xjNFFzmE7Eza8bbNAGKckWiRuQzMXCUZHf_ajypoenwee-mjixGNl6yZ_Ra4R52hCS-FRCDKJO8e9DxCGn6DnnwXil814vmp2va8YHKxCS_fCL90Yz0QeLt60MeZxegJLRDX3zvubtyVfb1dwV_JJWBUCQyh2PywndzYDL_y3M_GJqQ1nAymKLnnJJZIRugyyMLAtJDD0YAM_KDuh7oF24i0y9zeQ8yvPKCOa90S3Ik0XiKmFohMiGeGcUOAxiatxPPtpFblWAQvmuYlzxHScNqnxKfAhSdeX0ViUAUmlgy6fiNlP4tiyMUSCp_i4mhTw3froPjcGqJWNR4AHVvS--Ox0pzcSvjyyH1BJssi9vFLw_e5RXeAoE2QFeB3_ohw1a5lX0YzCVEmWWJESxxeKDoMYupayWuWQ8gy_K2QHGk9qbpIYzvsZbe85nSYFZcvRI9CclvEUYtftMOHWrIr0loSEOFAePWQVaXgblLhTejKkwqE7g0SJBynsrf3z4pbC0oPl2fn0MONt4rBdFP7yOvmIKyVulGgt2q6BfMs9Be2geFUmvrElRQ8--tMZqux6VYMYjJ0lcqq7exiLrtByA_ZwtKA1qwWHgFN=w905-h510-no)

OK.  Ratio 1, which is Ripple 5, the hurricane in China (We start in China since "back propagation" means we're working *backwards* through the 5 ripples of our ripple effect, right?):  d l2_error / d l2.  Where did our "currents" and our "nearby's" come from?  

1.   x_current is the l2 we calculated from our forward feed, 0.5;
2.    y_current is y-l2 = 1-0.5 = 0.5, our l2_error;
3.   x_nearby is simply a convenient example we made up.  We know that if l2 were 0.6, which is indeed nearby our x_current of 0.5, then y-0.6 would be 0.4.
4.   Hence, x_nearby = 0.6 and y_nearby = 0.4
5.   Once you are clear on your 4 variables, the math is easy and the slope, aka the sensitivity = -1.  
6.   This means that for every 1 you increase l2, it decreases the l2_error by 1.    A delta of 1 in our l2 produces a delta of -1 in our l2_error.  Nice.
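
Ratio 1 can be checked with a tiny rise-over-run helper.  The function name `slope_between` is just an illustrative choice; the numbers are the ones from the list above.

```python
def slope_between(x_current, y_current, x_nearby, y_nearby):
    # Rise over run: change in y divided by change in x.
    return (y_nearby - y_current) / (x_nearby - x_current)

y_target = 1.0                       # the customer's actual answer
l2_current, l2_nearby = 0.5, 0.6     # current prediction and a made-up nearby one

ratio_1 = slope_between(l2_current, y_target - l2_current,
                        l2_nearby, y_target - l2_nearby)
print(round(ratio_1, 6))             # -1.0
```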

Ratio 2, which is Ripple 4, the storm over the Pacific.  d l2 / d l2_LH.  

1.   We know x_current is l2_LH from our forward feed: 0.00.
2.   y_current is our l2, 0.50.
3.   But how do we find the nearby's?  We eyeball the S-curve of our sigmoid function in the diagram above, to find a convenient ratio to plug in.  
4.   We notice that at 0.1 on the X axis, Y is 0.525.  Nice, let's use that.  
5.   So, our x_nearby becomes 0.1 and our y_nearby becomes 0.525, and it's all over but the math: answer is 0.25.  

Why, you might ask, do we eyeball the S-curve of our sigmoid function?  Look at the corresponding code that the blue arrow points to.  It's not asking us to squish the LH side of l2 into a number between 0 and 1 with the `return 1/(1+np.exp(-x))` portion of our sigmoid code.  Rather, this time it's asking us to take the slope (aka derivative) of l2, by calling the `return x*(1-x)` code with `(deriv==True)`.  So, the input is on the X axis (i.e., 0) and the output is on the Y axis (i.e., 0.5).
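
As a quick check of Ratio 2, here it is computed two ways: rise-over-run between the two points we read off the S-curve, and the `x*(1-x)` shortcut that the derivative branch of the sigmoid code uses.  Both should land at about 0.25.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x_current, x_nearby = 0.0, 0.1
y_current, y_nearby = sigmoid(x_current), sigmoid(x_nearby)   # 0.5 and about 0.525

two_point = (y_nearby - y_current) / (x_nearby - x_current)   # rise over run
shortcut = y_current * (1 - y_current)                        # x*(1-x) applied to l2

print(round(two_point, 3), shortcut)
```

The shortcut needs only the current l2, with no nearby point at all, which is why the code can get away with a single multiplication.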

Next is Ratio 3, Ripple 3, the thunderstorm in Hawaii. d l2_LH / d l1.  
1.   We know x_current is l1, or 0.98. 
2.   y_current is l2_LH, the product when we multiply the entire first row of l1 by the full column of syn1.  Answer is 0.0.
3.   Note that, on this ratio, the code does not ask us to take the derivative (aka slope).  So for our x_nearby and our y_nearby, we can choose any darn number we please.  Let's choose 1 for x_nearby and 0.2442 for y_nearby.  
4.   Subtract, divide, done, and notice that it equals our syn1 first value, 12.21.  Hallelujah.

Ratio 4, Ripple 2, the heavy winds in L.A.  Having waded far upriver, we are now nearing the source of our ripples, syn0,1!  Exciting.  Note that the code in line 71 on this one is asking us to take slope, so you know we'll be eyeballing our S-curve again.  


1.   x_current is l1_LH, or 3.82, so look that up on our X axis.
2.   y_current lines up at about 0.975, no?  Done.  
3.   Now, since we're taking slope this time, for our x-and-y nearby's we have to find a nice pair of coordinates on our S-curve that make for convenient math.  How 'bout x_nearby as 4?  That would make y_nearby as about 0.982.  Lovely.  
4.   Do the math and it's 0.04.  Joy.
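
Reading values off a chart is approximate, and on the flat upper part of the S-curve a small misreading changes the slope noticeably.  As a hedged cross-check, this sketch computes Ratio 4 from exact sigmoid values instead of eyeballed ones.  It comes out closer to 0.02 than to 0.04, which changes nothing about the method, only the precision of the answer.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x_current, x_nearby = 3.82, 4.0
y_current, y_nearby = sigmoid(x_current), sigmoid(x_nearby)

two_point = (y_nearby - y_current) / (x_nearby - x_current)   # rise over run
shortcut = y_current * (1 - y_current)                        # x*(1-x) applied to l1

print(round(y_current, 4), round(y_nearby, 4))
print(round(two_point, 3), round(shortcut, 3))
```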
















Final Ratio, #5, Ripple 1, the gust of wind in Nevada.  We are nearing the "source of our mountain stream."  x_current is syn0,1, or 3.66.  y_current is l1_LH, or 3.82.  Our code is not asking for any slopes/derivatives because it is a linear function (which looks like a straight line on a graph), so the distance between coordinates can be very large and the slope will still be exactly right.  For curvy functions (parabolas, sigmoids, etc) the points should be close together to minimize the effect of curvature on your estimate of the slope.  Thus, for our nearby's on this linear function, we can use numbers that are, well...convenient, rather than nearby.  Hence: x_nearby is 4 and y_nearby is 4.16. And the math happens to work out (again, conveniently) to 1, which is the l0.  
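
Here is a small sketch of why "nearby" does not matter on a straight line.  The `rest` value is a hypothetical stand-in for what the other inputs and weights contribute to l1_LH, chosen so that a syn0,1 of 3.66 gives our 3.82.

```python
def l1_LH(syn0_1, l0=1.0, rest=0.16):
    # A linear function of syn0_1: multiply by the input, add the rest.
    return l0 * syn0_1 + rest

x_current = 3.66
for x_nearby in (4.0, 40.0, 4000.0):
    ratio_5 = (l1_LH(x_nearby) - l1_LH(x_current)) / (x_nearby - x_current)
    print(x_nearby, round(ratio_5, 6))   # the slope is 1.0 every time
```

However far away the second point sits, the slope is the same, and it equals l0.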

OK.  We now have our 5 ratios, so let's multiply them together to come up with an answer to our question, "How much will l2_error increase/decrease, depending on how much I increase/decrease syn0,1?"  
```
1 x 0.04 x 12.21 x 0.25 x -1 = -0.1221 = d l2_error / d syn0,1
```
In other words, for every 1 we increase syn0,1, the l2_error decreases by 0.1221.
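
To double-check the chain, this sketch multiplies the five hand-worked ratios and then compares the result with a direct "nudge test": bump syn0,1 by a hair and watch l2_error.  As before, `rest0` and `rest1` are hypothetical stand-ins for the other 15 weights, chosen to reproduce our forward-feed values.  Expect the nudge test to agree in sign but come out at roughly half the size, because Ratio 4 was eyeballed from a chart rather than computed exactly.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# The five hand-worked ratios from the text, multiplied together.
by_hand = 1 * 0.04 * 12.21 * 0.25 * -1
print(round(by_hand, 4))             # -0.1221

def l2_error(syn0_1, l0=1.0, syn1_1=12.21, y=1.0):
    # Hypothetical stand-ins for the contribution of the other 15 weights.
    rest0 = 0.16
    rest1 = -12.21 * sigmoid(3.82)
    l1 = sigmoid(l0 * syn0_1 + rest0)
    l2 = sigmoid(l1 * syn1_1 + rest1)
    return y - l2

# Nudge test: rise over run with a very small run.
nudge = 1e-6
nudged = (l2_error(3.66 + nudge) - l2_error(3.66)) / nudge
print(round(nudged, 4))
```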

Now let's walk through what would happen, if I updated syn1 for the next iteration, using line 75 of our code:
```
l2 slopes after nonlin():    l2_error:                l2_delta: 
[0.25] Not Confident        [ 0.50] Big Miss         [ 0.125] Big Change
[0.09] Fairly Confident  X  [ 0.10] Small Miss    =  [ 0.009] Small-ish Change
[0.05] Confident            [-0.05] Tiny Miss        [-0.003] Tiny Change
[0.21] Not Confident        [-0.70] Very Big Miss    [-0.150] Huge Change

syn1 += l1.T.dot(l2_delta) 
Note that "+=" means to add to the existing, so:
12.21 += 0.98 x (-0.1221)
= 12.21 + (-0.1197)
= 12.09 = our new syn1 to be used in our next iteration!
```
If you have followed things thus far, then you are well on your way to becoming a Back Propagation Rock Star.  If not yet, hey--no problem!  Just reread the above several more times and click on the helpful links of the Super Teachers I have cited above.

Here's a question I had when I had arrived at this stage: Why bother taking the slope of l2?  

We take the slope of l2 to fix the most mistaken of our 16 weights faster.  How?  Well, you may recall from our discussion of the Sigmoid function (and the S-curve diagram) above that the slope of l2 is the confidence level of l2.  The smallest slope numbers indicate the highest confidence level.  Therefore, multiplying the corresponding values of the l2_error by these small numbers ain't gonna cause a big change in the l2_delta product, which is good.  We don't want to change those weights much, because we're already pretty confident in the job they're doing.  

But the l2 prediction numbers that we are *least* confident in have the steepest slope, which yields a larger number.  When we multiply that larger number by the l2_error then the resulting l2_delta has a bigger number.  When we update syn1 later on, that bigger multiplier is going to mean a bigger product, and therefore a bigger change, or tweak, in that value.  This is as it should be, because we want to take the weights we have the least confidence in and change them the most.  That's where we will get the biggest "bang for our buck" when it comes to tweaking the 16 weights of our system.  To summarize, taking the slope of l2 gives us the confidence of each l2 prediction, which allows us to home in on the numbers that most need fixing, and fix them the fastest.
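
The slope-times-error multiplication is a single element-wise line in NumPy.  The numbers below are the four rows of the slope and error table shown earlier in this step.

```python
import numpy as np

l2_slopes = np.array([[0.25], [0.09], [0.05], [0.21]])     # confidence of each prediction
l2_error = np.array([[0.50], [0.10], [-0.05], [-0.70]])    # how far each one missed

# Element-wise multiply: big miss x low confidence = big change.
l2_delta = l2_error * l2_slopes
print(l2_delta.ravel())
```

The confident, nearly correct third row barely moves, while the unconfident, badly wrong fourth row gets the biggest push.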

The next key step is for you to understand how the computer code is doing the same thing we just did manually with our math equations.  I want to show you how these few lines of code...
```
l2_delta = l2_error*nonlin(l2,deriv=True)
l1_error = l2_delta.dot(syn1.T)
l1_delta = l1_error * nonlin(l1,deriv=True)
syn1 += l1.T.dot(l2_delta)
syn0 += l0.T.dot(l1_delta)
```
...are doing the same thing as this math equation below.

![alt text](https://lh3.googleusercontent.com/ODiRcFvxzkCO8Lzm5KbJXaihDWxmO_8nErbfKgRJd1LClMounnr8Ap2_qJ_5xjNFFzmE7Eza8bbNAGKckWiRuQzMXCUZHf_ajypoenwee-mjixGNl6yZ_Ra4R52hCS-FRCDKJO8e9DxCGn6DnnwXil814vmp2va8YHKxCS_fCL90Yz0QeLt60MeZxegJLRDX3zvubtyVfb1dwV_JJWBUCQyh2PywndzYDL_y3M_GJqQ1nAymKLnnJJZIRugyyMLAtJDD0YAM_KDuh7oF24i0y9zeQ8yvPKCOa90S3Ik0XiKmFohMiGeGcUOAxiatxPPtpFblWAQvmuYlzxHScNqnxKfAhSdeX0ViUAUmlgy6fiNlP4tiyMUSCp_i4mhTw3froPjcGqJWNR4AHVvS--Ox0pzcSvjyyH1BJssi9vFLw_e5RXeAoE2QFeB3_ohw1a5lX0YzCVEmWWJESxxeKDoMYupayWuWQ8gy_K2QHGk9qbpIYzvsZbe85nSYFZcvRI9CclvEUYtftMOHWrIr0loSEOFAePWQVaXgblLhTejKkwqE7g0SJBynsrf3z4pbC0oPl2fn0MONt4rBdFP7yOvmIKyVulGgt2q6BfMs9Be2geFUmvrElRQ8--tMZqux6VYMYjJ0lcqq7exiLrtByA_ZwtKA1qwWHgFN=w905-h510-no)


```
d l2_err / d l2 = -1 (The target value of l2 is 1.  If l2 increases, then the corresponding error decreases.  Increasing l2 by 0.1 causes the error to change by -0.1.  The ratio of the change in error to the change in l2 is -1.)
d l2 / d k2 = slope of l2
d k2 / d l1 = weight syn1,1
d l1 / d k1 = slope of l1
d k1 / d syn0,1 = l0
```
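
If you would like to see the code and the chain of ratios agree on real numbers, here is a hedged sketch.  It runs the back-propagation lines on placeholder data, then nudges one weight in syn0 and measures how the cost responds.  Two things here are my assumptions rather than anything stated above: the data is random, and the cost is taken to be half the sum of the squared l2_error.

```python
import numpy as np

def nonlin(x, deriv=False):
    if deriv:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(0)                    # placeholder data only
l0 = rng.integers(0, 2, (4, 3)).astype(float)     # 4 customers x 3 answers
y = rng.integers(0, 2, (4, 1)).astype(float)      # 4 yes/no targets
syn0 = 2 * rng.random((3, 4)) - 1
syn1 = 2 * rng.random((4, 1)) - 1

def cost(syn0, syn1):
    # Assumed cost: half the sum of squared l2_error over all rows.
    l1 = nonlin(l0.dot(syn0))
    l2 = nonlin(l1.dot(syn1))
    return 0.5 * np.sum((y - l2) ** 2)

# Feed forward, then the back-propagation lines (updates computed, not applied).
l1 = nonlin(l0.dot(syn0))
l2 = nonlin(l1.dot(syn1))
l2_error = y - l2
l2_delta = l2_error * nonlin(l2, deriv=True)
l1_error = l2_delta.dot(syn1.T)
l1_delta = l1_error * nonlin(l1, deriv=True)
syn0_update = l0.T.dot(l1_delta)

# Nudge one weight and measure the slope of the cost by rise over run.
eps = 1e-6
nudged = syn0.copy()
nudged[0, 0] += eps
numeric_slope = (cost(nudged, syn1) - cost(syn0, syn1)) / eps

# The update is the downhill direction, so it should match MINUS the slope.
print(syn0_update[0, 0], -numeric_slope)
```

The two printed numbers should agree to several decimal places, which is the whole point: those few lines of matrix code compute all the chains of ratios at once.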
Nice work.  All the tough stuff is done, now, and we are close to the finish line.  Onward!


##12) In what DIRECTION is the target (ideal) l1?  Lines 66-69
As before, we compute l1_delta by multiplying l1_error by the derivative of the sigmoid to aggressively change low confidence values.  We will use the exact same process as Step 10 to find in what direction our gradient descent should be moving in order to take us closer to the perfect l1 that will contribute to us finding the perfect l2, our ultimate goal.

We want to answer the question, "In what DIRECTION is l1, the desired target value of our hard-working middle layer 1, from l1's latest prediction in this current iteration?"  We want to tweak this middle layer of our network so it sends a better prediction to l2, making it easier for l2 to better predict target y.  In order to answer this question, we need to find the l1_delta, which tells us how much to adjust the weights to produce large changes in low confidence values and small changes in high confidence values.


##13) Gradient Descent: How the synapses, rather than the neurons, are the core of your network's "brain."
Lines 71-74

This final step is the Glory Moment:  all our work is complete, and we reverently carry our hard-earned l1_delta and l2_delta up the steps of the podium to our hallowed leader, Emperor Synapse, the true brains of our operation.  
We compute the update to syn0 by multiplying l1_delta by the input l0.  This causes large changes in components of syn0 that have stronger effects on l1.
We update syn1 and syn0 so they will learn from their mistakes of this iteration, and in the next iteration they will lead us one step closer to that ideal bottom of our bowl, where error is smallest, predictions are most accurate, and joy abounds!

It is efficient to change weights in the synapse that correspond to larger values of l1 (i.e. if a node of l1 has a large value, a small change in the weights that are multiplied by this value can have a large effect on l2).  The multiplication ensures that the total change applied to the synapse maximizes the impact on l2.  In other words, it produces an increment in the direction of steepest descent (opposite of the gradient).
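
To watch the updates do their job, here is a compact sketch of the whole loop.  The data is a placeholder (four made-up rows with three inputs each, not our pet shop survey) and the iteration count is an arbitrary choice; the update lines are the ones discussed in this article.

```python
import numpy as np

def nonlin(x, deriv=False):
    if deriv:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

# Placeholder data (NOT the survey): 4 rows, 3 inputs, 1 target each.
l0 = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

np.random.seed(1)
syn0 = 2 * np.random.random((3, 4)) - 1    # 12 weights
syn1 = 2 * np.random.random((4, 1)) - 1    # 4 weights

errors = []
for j in range(10000):
    # Feed forward
    l1 = nonlin(l0.dot(syn0))
    l2 = nonlin(l1.dot(syn1))
    # How much did we miss, and how confident were we?
    l2_error = y - l2
    errors.append(np.mean(np.abs(l2_error)))
    l2_delta = l2_error * nonlin(l2, deriv=True)
    # Back propagation
    l1_error = l2_delta.dot(syn1.T)
    l1_delta = l1_error * nonlin(l1, deriv=True)
    # Update the synapses
    syn1 += l1.T.dot(l2_delta)
    syn0 += l0.T.dot(l1_delta)

print(errors[0], errors[-1])    # average miss at the start vs. at the end
```

The second number should be much smaller than the first: the ball has rolled most of the way down the bowl.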




#In Closing...
Andrew Trask gave me a fabulous gift when he wrote that memorizing these lines of code leads to mastery, and I agree for two reasons: 

1) When you try to write out this code from memory, you will find that the places where you forget the code are the places where you don't understand the code.  Once you understand this code perfectly, every part of it will make sense to you and therefore you will remember it forever;

2) This code is the foundation on which (perhaps) all Deep Learning networks are built.  If you master this code, every network you learn and every paper you wade through will be clearer and easier because of your work in memorizing this code.

Memorizing this code was made easy for me by making up an absolutely ridiculous story that ties all the concepts together in a fairy-tale mnemonic.  You will remember better if you make your own, but here's mine, to give you an idea.  I count the 13 steps on my fingers as I recite this story out loud:

1) Sigmund Freud (think: Sigmoid Function) absolutely *treasured* his neural network, and he buried it like a pirate's treasure, 

2) "X" marks the spot (Creating X input that will become l0).  

3) "Why," I asked him (Create the y vector of target values), "didn't you plant 

4) Seeds instead?" (Seed your random number generator)  "You could have grown a lovely garden of 

5) Snapdragons," (Create Synapses: Weights) "which could be fertilized by the 

6) Firm poop" (For loop) "of the deer that 

7) Feed on the flowers" (Feed Forward Network)!  Then suddenly, an archer 

8) Missed his target (By How Much Missed the Target?) and killed a grazing deer.  As punishment, he was forced to 

9) Print his error (Print Error) 500 times on a blackboard facing the 

10) Direction of his target (In What Direction is y?).  But he noticed behind the 

11) BACK of his target two deer were mating and PROPAGATING their species (Back Propagation) and he shouted for them to stop but they wouldn't take 

12) Direction and ignored him (In what Direction is the l1 target?).  He got so angry that his mind 

13) Snapped and he Descended into Gradient insanity (Update Synapses, Gradient Descent).  

So, this is a very silly story, but I can tell you that it has burned those 13 steps into my brain, and once I can write down those 13 steps, and I understand that code in each step, to write the code perfectly from memory becomes easy.

I hope you can tell that I love my journey into Deep Learning, and I wish you the same joy I find!
Feel free to email me improvements to this article at: DavidCode1@gmail.com

THE END
