### Part Ⅰ: Understanding a Neural Network

Neural networks are models inspired by how the brain works. Similar to neurons in the brain, our ‘mathematical neurons’ are also, intuitively, connected to each other; they take inputs (dendrites), do some simple computation on them and produce outputs (axons).

<p>
    <img src = "assets/1.png/">
</p>

Most of the time you will see a neural network depicted as in the figure above. But this succinct, simple-looking picture hides a bit of the complexity. Let’s expand it out.

<p>
    <img src = "assets/2.png/">
</p>

Now, let’s go over each node in our graph and see what it represents.

<p>
    <img src = "assets/3.png/">
</p>

These nodes represent the inputs for our first and second features, x₁ and x₂, which together define a single example we feed to the neural network; this is why they are called the “Input Layer”.

<p>
    <img src = "assets/4.png/">
</p>

w₁ and w₂ represent our weights (in some neural network literature they are denoted with the theta symbol, θ). Intuitively, these dictate how much influence each of the input features should have in computing the next node.

**Weights are the main values our neural network has to “learn”. So initially, we will set them to random values and let the “learning algorithm” of our neural network decide the best weights that result in the correct outputs.**

<p>
    <img src = "assets/5.png/">
</p>

This node represents a linear function. Simply put, it takes all the inputs coming into it and creates a linear combination out of them. (By convention, it is understood that a linear combination of weights and inputs is part of each node, except for the input nodes in the input layer, so this node is often omitted in figures, like in Fig.1.)

<p>
    <img src = "assets/6.png/">
</p>

This σ node takes the input and passes it through the following function, called the sigmoid function (because of its S-shaped curve), also known as the logistic function:

<p>
    <img src = "assets/7.png/">
</p>

Sigmoid is one of the many “activation functions” used in neural networks. The job of an activation function is to map its input to a different range. For example, σ(2) ≈ 0.88 and σ(-2) ≈ 0.12, and as z grows more positive or more negative, σ(z) gets closer and closer to 1 or 0 respectively. So, the sigmoid function squashes the output range to (0, 1).
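To make the squashing concrete, here is a minimal sketch of the sigmoid in plain Python. The function name `sigmoid` and the sample values of z are my own choices for illustration.

```python
import math

def sigmoid(z):
    """Logistic function: squashes any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Evaluate the function at a few points on both sides of zero.
for z in (-6, -2, 0, 2, 6):
    print(f"sigmoid({z:+d}) = {sigmoid(z):.3f}")
```

The output never reaches 0 or 1 exactly; it only gets closer as |z| grows.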

In our neural network above, since it is the last node, it also serves as the output. The predicted output is denoted by ŷ. (Note: in some neural network literature this is denoted by ‘h(θ)’, where ‘h’ is called the hypothesis, i.e. this is the hypothesis of the neural network, a.k.a. the output prediction, given parameter θ, where θ are the weights of the neural network.)


Now that we know what each node represents, let’s flex our muscles by computing each node by hand on some dummy data.

<p>
    <img src = "assets/8.png/">
</p>

The data above represents an OR gate (output 1 if any input is 1). Each row of the table represents an ‘example’ we want our neural network to learn from. After learning from the given examples we want our neural network to perform the function of an OR gate: given the input features, x₁ and x₂, output the corresponding y (also called the ‘label’).

<p>
    <img src = "assets/9.png/">
</p>

This OR-gate data is particularly interesting, as it is linearly separable, i.e. we can draw a straight line to separate the green crosses from the red dot.

Data flows from left to right in our neural network. In technical terms, this process is called **‘forward propagation’**; the computations from each node are forwarded to the next node it is connected to.

Let’s go through all the computations our neural network will perform given the first example, x₁=0 and x₂=0. Also, we’ll initialize the weights to w₁=0.1 and w₂=0.6 (recall, these weights have been randomly selected).

<p>
    <img src = "assets/10.png/">
</p>

With our current weights, w₁=0.1 and w₂=0.6, our network’s output is a bit far from where we’d like it to be. The predicted output should be ŷ≈0 for x₁=0 and x₂=0; right now it’s ŷ=0.5.
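The same forward pass can be sketched in a few lines of Python. The helper names `sigmoid` and `forward` are hypothetical, chosen only for this illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x1, x2, w1, w2):
    """Forward propagation for one example through the linear node and the sigmoid node."""
    z = w1 * x1 + w2 * x2   # linear node
    y_hat = sigmoid(z)      # activation (output) node
    return z, y_hat

z, y_hat = forward(x1=0, x2=0, w1=0.1, w2=0.6)
print(z, y_hat)  # 0.0 0.5
```

Because both inputs are zero, the weights have no effect on z here, whatever their values.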

So, how does one tell a neural network how far it is from our desired output? **In comes the Loss Function to the rescue.**

#### Loss Function

**The Loss Function is a simple equation that tells us how far our neural network’s predicted output (ŷ) is from our desired output (y), for ONE example only.**

The derivative of the loss function dictates whether to increase or decrease weights. A positive derivative would mean decrease the weights and negative would mean increase the weights. The steeper the slope the more incorrect the prediction was.

<p>
    <img src = "assets/11.png/">
</p>

The Loss function curve depicted above is an ideal version. In real-world cases, the Loss function may not be so smooth, with some bumps and saddle points along the way to the minimum.

There are many different kinds of loss functions, each essentially calculating the error between the predicted output and the desired output. Here we’ll use one of the simplest loss functions, the squared-error Loss function. It is defined as follows:

<p>
    <img src = "assets/12.png/">
</p>

Taking the square keeps everything nice and positive and the fraction (1/2) is there so that it cancels out when taking the derivative of the squared term (it is common among some machine learning practitioners to leave the fraction out).

Intuitively, the Squared Error Loss function helps us in minimizing the vertical distance between our predictor line (blue line) and the actual data (green dot). Behind the scenes, this predictor line is our z (linear function) node.

<p>
    <img src = "assets/13.png/">
</p>

Now that we know the purpose of a Loss function, let’s calculate the error in our current prediction ŷ=0.5, given y=0.

<p>
    <img src = "assets/14.png/">
</p>

As we can see, the Loss is 0.125. Given this, we can now use the derivative of the Loss function to check whether we need to increase or decrease our weights.
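As a quick check, here is a sketch of the squared-error Loss in Python; the function name `squared_error_loss` is my own.

```python
def squared_error_loss(y_hat, y):
    """Half the squared difference between prediction and label, for one example."""
    return 0.5 * (y_hat - y) ** 2

loss = squared_error_loss(y_hat=0.5, y=0)
print(loss)  # 0.125
```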

**This process is called backpropagation**, as we’ll be doing the opposite of the forward phase. Instead of going from input to output we’ll track backward from output to input. Simply, backpropagation allows us to figure out how much of the Loss each part of the neural network was responsible for.

**To perform backpropagation we’ll employ the following technique: at each node, we only have our local gradient computed (the partial derivatives of that node); then during backpropagation, as we receive numerical values of gradients from upstream, we multiply them with the local gradients and pass the results on to the respective connected nodes.**

<p>
    <img src = "assets/15.png/">
</p>

**This is an application of the chain rule from calculus.**

Since ŷ (predicted label) dictates our Loss and y (actual label) is constant for a single example, we will take the partial derivative of the Loss with respect to ŷ.

<p>
    <img src = "assets/16.png/">
</p>

<p>
    <img src = "assets/17.png/">
</p>

Next, we require the derivative of the sigmoid function, which can be derived as below:

<p>
    <img src = "assets/18.png/">
</p>
<p>
    <img src = "assets/19.png/">
</p>
<p>
    <img src = "assets/20.png/">
</p>
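If you want to convince yourself of the result, the sketch below compares the well-known closed form σ′(z) = σ(z)(1 − σ(z)) against a numerical (central finite-difference) estimate. The helper names and the step size `eps` are assumptions made for this illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    """Closed-form derivative: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def numerical_derivative(f, z, eps=1e-6):
    """Central finite-difference estimate of f'(z)."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)

for z in (-2.0, 0.0, 1.5):
    print(z, sigmoid_derivative(z), numerical_derivative(sigmoid, z))
```

The two columns should agree to many decimal places; at z=0 the derivative is exactly 0.25.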

Let’s use this in the next backward calculation.

<p>
    <img src = "assets/21.png/">
</p>

The backward computations should not propagate all the way to the inputs, as we don’t want to change our input data (i.e. red arrows should not go to green nodes). We only want to change the weights associated with the inputs.

<p>
    <img src = "assets/22.png/">
</p>

**The derivatives of the Loss with respect to the weights, w₁ & w₂, are ZERO! We can’t increase or decrease the weights if their derivatives are zero. So then, how do we get our desired output in this instance if we can’t figure out how to adjust the weights? The key thing to note here is that the local gradients (∂z/∂w₁ and ∂z/∂w₂) are x₁ and x₂, both of which, in this example, happen to be zero (i.e. provide no information).**
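The whole backward pass for this example fits in a few lines. This is only a sketch under the definitions used so far (squared-error Loss, sigmoid output, no bias yet); the variable names are mine.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x1, x2, y = 0, 0, 0          # first example of the OR-gate table
w1, w2 = 0.1, 0.6            # initial weights

# Forward propagation
z = w1 * x1 + w2 * x2
y_hat = sigmoid(z)

# Backpropagation: multiply the upstream gradient by each local gradient
dL_dyhat = y_hat - y                 # derivative of 1/2 * (y_hat - y)^2
dyhat_dz = y_hat * (1 - y_hat)       # derivative of the sigmoid
dL_dz = dL_dyhat * dyhat_dz
dL_dw1 = dL_dz * x1                  # local gradient dz/dw1 is x1
dL_dw2 = dL_dz * x2                  # local gradient dz/dw2 is x2

print(dL_dz, dL_dw1, dL_dw2)  # 0.125 0.0 0.0
```

The gradient arriving at the linear node is not zero (0.125); it is the multiplication by x₁=0 and x₂=0 that wipes it out.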

This brings us to the concept of bias.

#### Bias:

Recall the equation of a line from your high school days.

<p>
    <img src = "assets/23.png/">
</p>

Here b is the bias term. Intuitively, the bias tells us that all outputs computed with x (the independent variable) should have an additive bias of b. So, when x=0 (no information coming from the independent variable) the output should be biased to just b.

Note that without the bias term a line can only pass through the origin (0, 0), and the only differentiating factor between lines would then be the gradient m.

<p>
    <img src = "assets/24.png/">
</p>

So, using this new information let’s add another node to our neural network: the bias node. (In neural network literature, every layer, except the input layer, is assumed to have a bias node, just like the linear node, so this node is also often omitted in figures.)

<p>
    <img src = "assets/25.png/">
</p>


Now let’s do a forward propagation with the same example, x₁=0, x₂=0, y=0, and let’s set the bias to b=0 (the initial bias is commonly set to zero, rather than to a random number), and let the backpropagation of the Loss figure out the bias.

<p>
    <img src = "assets/26.png/">
</p>

Well, the forward propagation with a bias of “b=0” didn’t change our output at all, but let’s do the backward propagation before we make our final judgment.

As before, let’s go through backpropagation in a step-by-step manner.

<p>
    <img src = "assets/27.png/">
</p>

<p>
    <img src = "assets/28.png/">
</p>

<p>
    <img src = "assets/29.png/">
</p>

**Hurrah! We just figured out how much to adjust the bias. Since the derivative of the Loss with respect to the bias (∂L/∂b) is positive 0.125, we will need to adjust the bias by moving in the negative direction of the gradient (recall the curve of the Loss function from before).** This is technically called **gradient descent**, as we are “descending” from the sloping region to a flat region using the direction of the gradient.

<p>
    <img src = "assets/30.png/">
</p>

Now that we’ve slightly adjusted the bias to b=-0.125, let’s test if we’ve done the right thing by doing a forward propagation and checking the new Loss.

<p>
    <img src = "assets/31.png/">
</p>

<p>
    <img src = "assets/32.png/">
</p>

**Now our predicted output is ŷ≈0.469 (rounded to 3 decimal places), a slight improvement from the previous 0.5, and the Loss is down from 0.125 to around 0.11. This slight correction is something that the neural network has ‘learned’ just by comparing its predicted output with the desired output, y, and then moving in the direction opposite to the gradient. Pretty cool, right?**
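Here is the full cycle (forward, backward, update, forward again) as a small Python sketch. It assumes the same example and weights as above and takes one full step against the gradient; all names are my own.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x1, x2, y = 0, 0, 0
w1, w2, b = 0.1, 0.6, 0.0

def predict(bias):
    """Forward propagation with the fixed weights and the given bias."""
    return sigmoid(w1 * x1 + w2 * x2 + bias)

y_hat = predict(b)
loss_old = 0.5 * (y_hat - y) ** 2

# Backpropagation to the bias: the local gradient dz/db is 1
dL_db = (y_hat - y) * y_hat * (1 - y_hat) * 1

# Gradient descent: move against the gradient
b_new = b - dL_db

y_hat_new = predict(b_new)
loss_new = 0.5 * (y_hat_new - y) ** 2

print(dL_db, b_new, round(y_hat_new, 3))  # 0.125 -0.125 0.469
print(loss_new < loss_old)                # True
```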

Now you may be wondering: this is only a small improvement over the previous result, so how do we get to the minimum Loss? Two things come into play:

a) how many iterations of ‘training’ we perform (each training cycle is forward propagation followed by backward propagation and updating the weights through gradient descent), and

b) the learning rate.

Let's look at the **learning rate**:

#### Learning Rate:

Recall how we calculated the new bias above, by moving in the direction opposite to the gradient (i.e. gradient descent).

<p>
    <img src = "assets/33.png/">
</p>

Notice that when we updated the bias we moved 1 step in the opposite direction of the gradient.


<p>
    <img src = "assets/34.png/">
</p>

**We could have moved 0.5, 0.9, 2, 3 or whatever fraction of steps we desired in the opposite direction of the gradient. This ‘number of steps’ is what we define as the learning rate, often denoted with α (alpha).**

<p>
    <img src = "assets/35.png/">
</p>
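In code, the update rule with a learning rate is a one-liner. The sketch below reuses the bias gradient of 0.125 from earlier; the function name `gradient_descent_step` is hypothetical.

```python
def gradient_descent_step(param, grad, alpha):
    """Move the parameter against its gradient, scaled by the learning rate alpha."""
    return param - alpha * grad

b, dL_db = 0.0, 0.125
for alpha in (0.5, 1, 2):
    print(alpha, gradient_descent_step(b, dL_db, alpha))
```

With α=1 this reproduces the earlier update to b=-0.125; α=0.5 moves half as far and α=2 twice as far.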

Learning rate defines how quickly we reach the minimum loss. Let’s visualize below what the learning rate is doing:

<p>
    <img src = "assets/36.png/">
</p>

As you can see, with a lower learning rate (α=0.5) our descent along the curve is slower and we take many steps to reach the minimum point. On the other hand, with a higher learning rate (α=5) we take much bigger steps and reach the minimum point much faster.


The keen-eyed may have noticed that the gradient descent steps (green arrows) keep getting smaller as we get closer and closer to the minimum. Why is that? Recall that the learning rate is multiplied by the gradient at that point along the curve; as we descend from the sloping regions to the flatter regions of the U-shaped curve, near the minimum point, the gradient keeps getting smaller and smaller, and so do the steps. Therefore, changing the learning rate during training is not strictly necessary. (Some variations of gradient descent start with a high learning rate to descend quickly down the slope and then reduce it gradually; this is called **“annealing the learning rate” or “learning rate decay”**.)


So what’s the takeaway? Just set the learning rate as high as possible and reach the optimum loss quickly? NO. The learning rate can be a double-edged sword. With too high a learning rate, the parameters (weights/biases) don’t reach the optimum and instead start to diverge away from it. With too small a learning rate, the parameters take too long to converge to the optimum.

<p>
    <img src = "assets/37.png/">
</p>

A small learning rate (α=5*10⁻¹⁰) resulting in numerous steps to reach the minimum point is self-explanatory: multiplying the gradient by a small number (α) results in a proportionally small step.

A large learning rate (α=50) causing gradient descent to diverge may be confounding, but the answer is quite simple. Note that at each step gradient descent approximates its path downward by moving in straight lines (green arrows in the figures). When the learning rate is too high we force gradient descent to take larger steps. Larger steps tend to overestimate the path downwards and shoot past the minimum point; then, to correct the bad estimate, gradient descent tries to move back towards the minimum point but again overshoots it due to the large learning rate. This cycle of continuous overestimates eventually causes the results to diverge (the Loss after each training cycle increases instead of decreasing).
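The effect is easy to reproduce on a toy problem. The sketch below is not our neural network: it assumes a made-up U-shaped cost J(w) = w² (gradient 2w) and learning rates picked for that toy curve, so the specific values differ from the ones in the figure.

```python
def descend(alpha, w=1.0, steps=20):
    """Run gradient descent on the toy cost J(w) = w**2 and return the cost after every step."""
    costs = [w ** 2]
    for _ in range(steps):
        grad = 2 * w              # derivative of w**2
        w = w - alpha * grad      # gradient descent update
        costs.append(w ** 2)
    return costs

for alpha in (0.001, 0.4, 1.1):
    costs = descend(alpha)
    print(f"alpha={alpha}: cost {costs[0]:.3f} -> {costs[-1]:.3g}")
```

On this curve the tiny rate has barely moved after 20 steps, the moderate rate lands almost exactly on the minimum, and the large rate overshoots further on every step so the cost keeps growing.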

**The learning rate is what’s called a hyper-parameter.** Hyper-parameters are parameters that the neural network can’t learn through backpropagation of gradients; they have to be hand-tuned by the creator of the neural network model, according to the problem and its dataset. (The choice of the Loss function, above, is also a hyper-parameter.)

In short, the goal is not to find the “perfect learning rate” but instead a learning rate large enough that the neural network trains successfully and efficiently without diverging.

So far we’ve only used one example (x₁=0 and x₂=0) to adjust our weights and bias (actually, only our bias up till now), and that reduced the loss on one example from our entire dataset (the OR gate table). But we have more than one example to learn from and we want to reduce our loss across all of them. Ideally, in one training iteration, we would like to reduce our loss across all the training examples. This is called Batch Gradient Descent (or full batch gradient descent), as we use the entire batch of training examples per training iteration to improve our weights and biases. (Other forms are mini-batch gradient descent, where we use a subset of the data set in each iteration, and stochastic gradient descent, where we only use one example per training iteration, as we’ve done so far.)

A training iteration where the neural network goes through all the training examples is called an **Epoch**. If using mini-batches then an epoch would be complete after the neural network goes through all the mini-batches, similarly for stochastic gradient descent where a batch is just one example.
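To see how the three variants differ only in batch size, here is a small sketch that splits the four OR-gate examples into batches and counts the parameter updates per epoch. The helper `batches` and the batch size of 2 for the mini-batch case are assumptions for illustration.

```python
# The four OR-gate examples as ((x1, x2), y) pairs
examples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

def batches(data, batch_size):
    """Yield consecutive slices of the data, each holding at most batch_size examples."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

for name, size in (("batch", 4), ("mini-batch", 2), ("stochastic", 1)):
    updates_per_epoch = len(list(batches(examples, size)))
    print(f"{name}: {updates_per_epoch} update(s) per epoch")
```

In every case one epoch still visits all four examples; only the number of updates made along the way changes.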

Before we proceed further we need to define something called a Cost Function.

#### Cost Function:

When we perform “batch gradient descent” we need to slightly change our Loss function to accommodate not just one example but all the examples in the batch. This adjusted Loss function is called the Cost Function.

Also, note that the curve of the Cost Function is similar to the curve of the Loss function (same U-shape).

Instead of calculating the Loss on one example the cost function calculates average Loss across ALL the examples.

<p>
    <img src = "assets/38.png/">
</p>
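As a sketch, the Cost is just the squared-error Loss averaged over the examples. The function name `cost` and the two sample predictions below are made up for illustration.

```python
def cost(y_hats, ys):
    """Average squared-error Loss over all examples: 1/(2m) * sum of (y_hat - y)^2."""
    m = len(ys)
    return sum(0.5 * (y_hat - y) ** 2 for y_hat, y in zip(y_hats, ys)) / m

# Two hypothetical predictions of 0.5 against labels 0 and 1
print(cost([0.5, 0.5], [0, 1]))  # 0.125
```

With a single example the Cost reduces to the Loss we used before.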

**Intuitively, the Cost function is expanding out the capability of the Loss function. Recall how the Loss function was helping to minimize the vertical distance between a single data point and the predictor line (z). The Cost function is helping to minimize the vertical distance (Squared Error Loss) between multiple data points, concurrently.**

<p>
    <img src = "assets/39.png/">
</p>

During batch gradient descent we’ll use the derivative of the Cost function, instead of the Loss function, to guide our path to minimum cost across all examples. (In some neural network literature, the Cost Function is at times also represented with the letter ‘J’.)

Let’s take a look at how the derivative equation of the Cost function differs from the plain derivative of the Loss function.

<p>
    <img src = "assets/40.png/">
</p>

Taking the derivative of this Cost function, which takes vectors as inputs and sums them, can be a bit dicey. So, let’s start out on a simple example before we generalize the derivative.

<p>
    <img src = "assets/41.png/">
</p>


Nothing new here in the calculation of the Cost. Just as expected the Cost, in the end, is the average of the Loss, but the implementation is now vectorized (we performed vectorized subtraction followed by element-wise exponentiation, called Hadamard exponentiation). Let’s derive the partial derivatives.

<p>
    <img src = "assets/42.png/">
</p>

From this, we can generalize the partial derivative equation.

<p>
    <img src = "assets/43.png/">
</p>

**Right now we should take a moment to note how the derivative of the Loss is different from the derivative of the Cost.**


<p>
    <img src = "assets/44.png/">
</p>

#### There are two ways to perform batch gradient descent:

1. For each training iteration create separate temporary variables (capital deltas, Δ) that will accumulate the gradients (small deltas, δ) for the weights and biases from each of the “m” examples in our training set, then at the end of the iteration update the weights using the average of the accumulated gradients. This is a slow method. (For those familiar with time complexity analysis: the loop over the “m” examples is nested inside the loop over training iterations, so the running time grows with the product of the two.)


<p>
    <img src = "assets/45.png/">
</p>

2. The quicker method is similar to above but instead uses vectorized computations to calculate all the gradients for all the training examples in one go, so the inner loop is removed. Vectorized computations run much quicker on computers. This is the method employed by all the popular neural network frameworks and the one we’ll follow for the rest of this blog.
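The first, loop-based method can be sketched as follows for a single training iteration over the OR-gate data, starting from w₁=0.1, w₂=0.6 and b=0. The accumulator names `Delta_w` and `Delta_b` are my own, and the code assumes the squared-error Loss and sigmoid output used throughout.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

examples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w = [0.1, 0.6]
b = 0.0
m = len(examples)

# Accumulators (the capital deltas) start at zero for this training iteration
Delta_w = [0.0, 0.0]
Delta_b = 0.0

for (x1, x2), y in examples:                      # inner loop over the m examples
    y_hat = sigmoid(w[0] * x1 + w[1] * x2 + b)
    delta_z = (y_hat - y) * y_hat * (1 - y_hat)   # per-example gradient dL/dz
    Delta_w[0] += delta_z * x1
    Delta_w[1] += delta_z * x2
    Delta_b += delta_z

# Average the accumulated gradients before using them in the update
grad_w = [d / m for d in Delta_w]
grad_b = Delta_b / m
print([round(g, 3) for g in grad_w], round(grad_b, 3))  # [-0.048, -0.039] -0.037
```

The vectorized method computes these same averaged gradients without the explicit Python loop over examples.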

For vectorized computations, we’ll make an adjustment to the “Z” node of the neural network computation graph and use the Cost function instead of the Loss function.

<p>
    <img src = "assets/46.png/">
</p>

Note that in the figure above we take the dot product between W and X, each of which can be an appropriately sized matrix or vector. The bias, b, is still a single number (a scalar quantity) here and will be added to the output of the dot product in an element-wise fashion. The predicted output will no longer be just a number, but instead a vector, Ŷ, where each element is the predicted output for its respective example.

Let’s set up our data (X, W, b & Y) before doing forward and backward propagation.


<p>
    <img src = "assets/47.png/">
</p>

We are now finally ready to perform forward and backward propagation using Xₜᵣₐᵢₙ, Yₜᵣₐᵢₙ, W, and b.

(NOTE: All the results below are rounded to 3 decimal places, just for brevity.)

<p>
    <img src = "assets/48.png/">
</p>

How cool is that? We calculated all the forward propagation steps for all the examples in our data set in one go, just by vectorizing our computations.

We can now calculate the Cost on these output predictions.

<p>
    <img src = "assets/49.png/">
</p>
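The vectorized forward pass and Cost can be sketched with NumPy as below. The array layout (one column per example, one row per feature) is my assumption and may differ from the figures; the variable names are mine too.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

# Assumed layout: one column per example, one row per feature
X_train = np.array([[0.0, 0.0, 1.0, 1.0],     # x1 for the four OR-gate examples
                    [0.0, 1.0, 0.0, 1.0]])    # x2
Y_train = np.array([[0.0, 1.0, 1.0, 1.0]])    # labels, shape (1, 4)
W = np.array([[0.1, 0.6]])                    # weights, shape (1, 2)
b = 0.0                                       # scalar bias, broadcast over all examples

Z = W @ X_train + b                           # linear node for all examples at once
Y_hat = sigmoid(Z)                            # predictions, shape (1, 4)
m = Y_train.shape[1]                          # number of examples
cost = float(np.sum((Y_hat - Y_train) ** 2) / (2 * m))   # average squared-error Loss

print(np.round(Y_hat, 3))   # roughly 0.5, 0.646, 0.525, 0.668
print(round(cost, 3))       # 0.089
```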

Our Cost with our current weights, W, turns out to be 0.089. Our goal now is to reduce this Cost using backpropagation and gradient descent. As before, we’ll go through backpropagation in a step-by-step manner.


<p>
    <img src = "assets/50.png/">
</p>
<p>
    <img src = "assets/51.png/">
</p>
<p>
    <img src = "assets/52.png/">
</p>
<p>
    <img src = "assets/53.png/">
</p>
<p>
    <img src = "assets/54.png/">
</p>

<p>
    <img src = "assets/55.png/">
</p>

Voila, we used a vectorized implementation of batch gradient descent to calculate all the gradients in one go.
(Those with a keen eye may be wondering how the local gradients and the final gradients are being calculated in this last step. Don’t worry, I’ll explain the derivation of the gradients in this last step shortly. For now, suffice it to say that the gradients defined in this last step are an optimization over the naive way of calculating ∂Cost/∂W and ∂Cost/∂b.)
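For reference, here is a NumPy sketch of the vectorized backward pass, assuming the same column-per-example layout and the average squared-error Cost. The intermediate names (`dY_hat`, `dZ`, `dW`, `db`) are my own shorthand for the gradients of the Cost with respect to each quantity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0.0, 0.0, 1.0, 1.0],
              [0.0, 1.0, 0.0, 1.0]])      # features, one column per example
Y = np.array([[0.0, 1.0, 1.0, 1.0]])      # labels
W = np.array([[0.1, 0.6]])
b = 0.0
m = X.shape[1]

# Forward propagation
Y_hat = sigmoid(W @ X + b)

# Backpropagation
dY_hat = (Y_hat - Y) / m                  # dCost/dY_hat
dZ = dY_hat * Y_hat * (1.0 - Y_hat)       # multiplied by the sigmoid's local gradient
dW = dZ @ X.T                             # sums dZ * x over the examples, shape (1, 2)
db = float(np.sum(dZ))                    # dz/db is 1 for every example

print(np.round(dW, 3), round(db, 3))      # about [-0.048, -0.039] and -0.037
```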

Let’s update the weights and bias, keeping the learning rate the same as in the non-vectorized implementation from before, i.e. α=1.

<p>
    <img src = "assets/56.png/">
</p>

Now that we have updated the weights and bias, let’s do a forward propagation and calculate the new Cost to check if we’ve done the right thing.

<p>
    <img src = "assets/57.png/">
</p>

<p>
    <img src = "assets/58.png/">
</p>

<p>
    <img src = "assets/59.png/">
</p>

So, we reduced our Cost(Average Loss across all examples) from an initial Cost of around 0.089 to 0.084. We will need to do multiple training iterations before we can converge to a low Cost.
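One complete vectorized training iteration, from the initial parameters to the new Cost, can be sketched like this. It assumes the same data layout as before and α=1; the helper `compute_cost` is a hypothetical name.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_cost(W, b, X, Y):
    """Average squared-error Loss of the current parameters over all examples."""
    m = X.shape[1]
    Y_hat = sigmoid(W @ X + b)
    return float(np.sum((Y_hat - Y) ** 2) / (2 * m))

X = np.array([[0.0, 0.0, 1.0, 1.0],
              [0.0, 1.0, 0.0, 1.0]])
Y = np.array([[0.0, 1.0, 1.0, 1.0]])
W = np.array([[0.1, 0.6]])
b = 0.0
alpha = 1.0
m = X.shape[1]

cost_before = compute_cost(W, b, X, Y)

# Gradients of the Cost with respect to W and b
Y_hat = sigmoid(W @ X + b)
dZ = (Y_hat - Y) / m * Y_hat * (1.0 - Y_hat)
dW = dZ @ X.T
db = float(np.sum(dZ))

# Gradient descent update
W_new = W - alpha * dW
b_new = b - alpha * db

cost_after = compute_cost(W_new, b_new, X, Y)
print(round(cost_before, 3), round(cost_after, 3))  # 0.089 0.084
```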

At this point, I would recommend that you perform the next backpropagation step yourself. The result should be (rounded to 3 decimal places): ∂Cost/∂W = [-0.044, -0.035] and ∂Cost/∂b = [-0.031].

Recall how, before we trained the neural network, we predicted that it could separate the two classes in Figure 9. Well, after about 5000 epochs (full-batch training iterations) the Cost steadily decreases to about 0.0005 and we get the following decision boundary:

<p>
    <img src = "assets/60.png/">
</p>

<p>
    <img src = "assets/61.png/">
</p>

The Cost curve is basically the value of the Cost plotted after a certain number of iterations (epochs). Notice that the Cost curve flattens after about 3000 epochs; this means that the weights and bias of the neural network have converged, so further training will only slightly improve them. Why? Recall the U-shaped Loss curve: as we descend closer and closer to the minimum point (flat region) the gradients become smaller and smaller, so the steps gradient descent takes are very small.
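A training loop along these lines can be sketched with NumPy as below. It assumes α=1, 5000 full-batch epochs and the same starting parameters as above; the function name `train` is hypothetical, and the exact final Cost you get may differ slightly from the figure.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, W, b, alpha=1.0, epochs=5000):
    """Full-batch gradient descent; returns the learned parameters and the Cost per epoch."""
    m = X.shape[1]
    costs = []
    for _ in range(epochs):
        Y_hat = sigmoid(W @ X + b)                          # forward propagation
        costs.append(float(np.sum((Y_hat - Y) ** 2) / (2 * m)))
        dZ = (Y_hat - Y) / m * Y_hat * (1.0 - Y_hat)        # backpropagation
        W = W - alpha * (dZ @ X.T)                          # gradient descent updates
        b = b - alpha * float(np.sum(dZ))
    return W, b, costs

X = np.array([[0.0, 0.0, 1.0, 1.0],
              [0.0, 1.0, 0.0, 1.0]])
Y = np.array([[0.0, 1.0, 1.0, 1.0]])

W, b, costs = train(X, Y, W=np.array([[0.1, 0.6]]), b=0.0)

# Threshold the outputs at 0.5 to read off the learned OR gate
predictions = (sigmoid(W @ X + b) > 0.5).astype(int)
print(costs[0], costs[-1])
print(predictions)
```

Plotting `costs` against the epoch number gives a Cost curve of the kind shown above.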


The Decision Boundary shows the line along which the decision of the neural network changes from one output to the other. We can better visualize this by coloring the area below and above the decision boundary.


<p>
    <img src = "assets/62.png/">
</p>

The red shaded area is the area below the decision boundary, and everything below the decision boundary has an output (ŷ) of 0. Similarly, everything above the decision boundary, shaded green, has an output of 1. In conclusion, our simple neural network has learned a decision boundary by looking at the training data and figuring out how to separate its two output classes (y=1 and y=0) 🙌. Now the output neuron fires up 🔥 (produces 1) whenever x₁ or x₂ or both are 1.

### But how did you calculate the gradients ∂Cost/∂W and ∂Cost/∂b?
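A brief sketch of where those vectorized gradients come from, assuming the Cost is the average squared-error Loss over the m examples, with the linear node followed by the sigmoid as used throughout. The per-example index notation (i) is my own reconstruction, not the original derivation.

```latex
\begin{aligned}
\text{Cost} &= \frac{1}{2m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^{2},
\qquad \hat{y}^{(i)} = \sigma\!\left(z^{(i)}\right),
\qquad z^{(i)} = w_1 x_1^{(i)} + w_2 x_2^{(i)} + b \\[4pt]
\frac{\partial\,\text{Cost}}{\partial z^{(i)}} &=
\underbrace{\frac{1}{m}\left(\hat{y}^{(i)} - y^{(i)}\right)}_{\partial \text{Cost}/\partial \hat{y}^{(i)}}
\cdot
\underbrace{\hat{y}^{(i)}\left(1 - \hat{y}^{(i)}\right)}_{\partial \hat{y}^{(i)}/\partial z^{(i)}} \\[4pt]
\frac{\partial\,\text{Cost}}{\partial w_j} &=
\sum_{i=1}^{m}\frac{\partial\,\text{Cost}}{\partial z^{(i)}}\,x_j^{(i)}
\quad\Longrightarrow\quad
\frac{\partial\,\text{Cost}}{\partial W} = \frac{\partial\,\text{Cost}}{\partial Z}\,X^{\mathsf{T}} \\[4pt]
\frac{\partial\,\text{Cost}}{\partial b} &=
\sum_{i=1}^{m}\frac{\partial\,\text{Cost}}{\partial z^{(i)}}
\end{aligned}
```

Each weight and the bias feed into every example's z, so their gradients are sums over all examples; the 1/m from the Cost turns those sums into averages of the per-example Loss gradients.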



### REFERENCES:

- https://medium.com/towards-artificial-intelligence/nothing-but-numpy-understanding-creating-neural-networks-with-computational-graphs-from-scratch-6299901091b0