In [1]:
import numpy as np

In [4]:
input_vector = np.array([1.72, 1.23])
weights_1 = np.array([1.26, 0])
weights_2 = np.array([2.17, 0.32])

In [5]:
dot1 = input_vector@weights_1

In [6]:
print(f"The dot product is: {dot1}")

The dot product is: 2.1672


In [7]:
dot2 = np.dot(input_vector, weights_2)

In [8]:
print(f"The dot product is: {dot2}")

The dot product is: 4.1259999999999994


Working with neural networks consists of doing operations with vectors. You represent the vectors as multidimensional arrays. Vectors are useful in deep learning mainly because of one particular operation: the dot product. The dot product of two vectors tells you how similar they are in terms of direction and is scaled by the magnitude of the two vectors.
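As a small sketch of the "similarity in direction, scaled by magnitude" idea, the snippet below writes the dot product out as a sum of element-wise products, then divides by both magnitudes to get the cosine of the angle between the vectors. It reuses the values of `input_vector` and `weights_2` from the cells above; the names `a`, `b`, `manual_dot`, and `cosine` are only illustrative.

```python
import numpy as np

a = np.array([1.72, 1.23])  # same values as input_vector above
b = np.array([2.17, 0.32])  # same values as weights_2 above

# The dot product is the sum of the element-wise products.
manual_dot = sum(x * y for x, y in zip(a, b))

# Dividing by both magnitudes leaves only the directional part:
# the cosine of the angle between the two vectors.
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(np.isclose(manual_dot, np.dot(a, b)))
print(cosine)
```

A cosine close to 1 means the vectors point in nearly the same direction, and a value near 0 means they are nearly perpendicular.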

## Making predictions

If you add more layers but keep using only linear operations, the extra layers have no effect: a composition of linear operations is itself a linear operation, so each layer will always have some correlation with the input of the previous layer.
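The claim that stacking purely linear layers adds nothing can be checked numerically. The sketch below assumes two hypothetical weight matrices, `W1` and `W2`, filled with random values, and compares applying them one after the other with applying their product once.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)        # hypothetical input vector
W1 = rng.normal(size=(4, 3))  # hypothetical first-layer weights
W2 = rng.normal(size=(2, 4))  # hypothetical second-layer weights

# Two linear layers applied in sequence...
two_layers = W2 @ (W1 @ x)

# ...give the same result as a single layer whose weights are W2 @ W1.
one_layer = (W2 @ W1) @ x

print(np.allclose(two_layers, one_layer))  # True
```

Because the two results match, the two-layer network can represent nothing that a single linear layer cannot.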

What you want is to find an operation that makes the middle layers sometimes correlate with an input and sometimes not correlate.

You can achieve this behavior by using nonlinear functions. These nonlinear functions are called activation functions. There are many types of activation functions. The ReLU (rectified linear unit), for example, is a function that converts all negative numbers to zero. This means that the network can “turn off” a unit whenever its weighted input is negative, which adds nonlinearity.
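For illustration only, since the network in this notebook uses the sigmoid rather than ReLU, here is a minimal sketch of ReLU built on `np.maximum`. The function name `relu` and the sample `values` are assumptions made for this example.

```python
import numpy as np

def relu(x):
    # Element-wise maximum of 0 and x: negatives become 0, the rest pass through.
    return np.maximum(0, x)

values = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(values))
```

Only the last entry survives unchanged. The negative inputs are clipped to zero, which is the "turn off" behavior described above.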

The network you’re building will use the **sigmoid activation function**.

<img src="https://robocrop.realpython.net/?url=https%3A//files.realpython.com/media/sigmoid_function.f966c820f8c3.png&w=578&sig=559e58b0e39bc1d37841223862ceabbd6ae8be22">

Probability functions give you the probability of occurrence for possible outcomes of an event. The only two possible outputs of the dataset are 0 and 1, and the Bernoulli distribution is a distribution that has two possible outcomes as well. The sigmoid function is a good choice if your problem follows the Bernoulli distribution, so that’s why you’re using it in the last layer of your neural network.

Since the function limits the output to a range of 0 to 1, you’ll use it to predict probabilities. If the output is greater than 0.5, then you’ll say the prediction is 1. If it’s below 0.5, then you’ll say the prediction is 0. This is the flow of the computations inside the network you’re building:

<img src="https://robocrop.realpython.net/?url=https%3A//files.realpython.com/media/network_architecture.406cfcc68417.png&w=700&sig=ce1ed03252df1cdbaa626424ffbb9084ab2b7b5e">

In [10]:
input_vector = np.array([1.66, 1.56])
weights_1 = np.array([1.45, -0.66])
bias = np.array([0.0])

In [11]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

In [12]:
weights1 = np.concatenate((bias, weights_1))
weights1

array([ 0.  ,  1.45, -0.66])

In [15]:
def make_prediction(input_vector, weights, bias):
    layer_1 = np.dot(input_vector, weights) + bias
    layer_2 = sigmoid(layer_1)
    return layer_2

In [16]:
prediction = make_prediction(input_vector, weights_1, bias)
prediction

array([0.7985731])

In [18]:
def predict_class(prediction):
    if prediction > 0.5:
        return 1
    return 0

In [19]:
predict_class(prediction)

1

In [20]:
input_vector = np.array([2, 1.5])
prediction = make_prediction(input_vector, weights_1, bias)
prediction

array([0.87101915])

In [21]:
predict_class(prediction)

1

## Training

To adjust the weights, you’ll use the gradient descent and backpropagation algorithms. Gradient descent is applied to find the direction and the rate to update the parameters. 

### Find the error

To understand the magnitude of the error, you need to choose a way to measure it. The function used to measure the error is called the cost function, or loss function. In this tutorial, you’ll use the **mean squared error (MSE)** as your cost function.
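The cells below compute the squared error for a single prediction. As a sketch of where the "mean" in MSE comes from, the example here averages the squared errors over several predictions. The `predictions` and `targets` arrays are made-up values, not data from this notebook.

```python
import numpy as np

predictions = np.array([0.8, 0.3, 0.6])  # hypothetical model outputs
targets = np.array([1, 0, 1])            # hypothetical true labels

# Square each individual error, then average over all samples.
squared_errors = np.square(predictions - targets)
mse = np.mean(squared_errors)

print(squared_errors)
print(mse)
```

With only one sample, the mean of a single squared error is that squared error, which is why the single-prediction cells skip the averaging step.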

In [23]:
target = 0

In [26]:
mse = np.square(prediction - target)
mse

array([0.75867436])

### Reducing the error

The goal is to change the weights and bias variables so you can reduce the error. To understand how this works, you’ll change only the weights variable and leave the bias fixed for now. You can also get rid of the sigmoid function and use only the result of layer_1. 

You compute the MSE by doing error = np.square(prediction - target). If you treat (prediction - target) as a single variable x, then you have error = np.square(x), which is a quadratic function. Here’s how the function looks if you plot it:

<img src="https://robocrop.realpython.net/?url=https%3A//files.realpython.com/media/quatratic_function.002729dea332.png&w=578&sig=1df4f5711e982f821d54ab9634ac28bd9cd0312d">

The error is given by the y-axis. If you’re at point A and want to reduce the error toward 0, then you need to bring the x value down. On the other hand, if you’re at point B and want to reduce the error, then you need to bring the x value up. To know which direction you should go to reduce the error, you’ll use the derivative. The derivative tells you how the error changes as x changes: its sign gives the direction, and its size gives the rate.

Another word for the derivative is gradient. Gradient descent is the name of the algorithm used to find the direction and the rate to update the network parameters. 
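One way to build confidence that the derivative of `np.square(x)` is `2 * x` is to compare it with a numerical estimate. The sketch below uses a central finite difference. The helper `numerical_derivative` and the step size `h` are assumptions made for this example, and `x` is the value of `prediction - target` from the cells above.

```python
import numpy as np

def error(x):
    return np.square(x)

def numerical_derivative(f, x, h=1e-6):
    # Central difference: slope of the line through two nearby points.
    return (f(x + h) - f(x - h)) / (2 * h)

x = 0.87101915  # prediction - target from the example above

analytic = 2 * x
numeric = numerical_derivative(error, x)

print(analytic, numeric)
```

Both values are positive here, so the error grows as x grows, and reducing the error means moving x down.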

In [28]:
derivative_of_mse = 2 * (prediction - target)
derivative_of_mse

array([1.7420383])

### Update the weights

In [29]:
# Always step against the derivative: subtracting it lowers the error whether it is positive or negative.
weights_1 = weights_1 - derivative_of_mse

In [30]:
prediction = make_prediction(input_vector, weights_1, bias)
prediction

array([0.01496248])

In [32]:
error = (prediction - target) ** 2
error

array([0.00022388])

The error dropped down to almost 0!

In this example, the derivative result was small, but there are some cases where the derivative result is too high. Take the image of the quadratic function as an example. High increments aren’t ideal because you could keep going from point A straight to point B, never getting close to zero. To cope with that, you update the weights with a fraction of the derivative result.

To define a fraction for updating the weights, you use the alpha parameter, also called the learning rate.
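To see why the learning rate matters, the sketch below runs gradient descent on the plain quadratic `x ** 2`, whose derivative is `2 * x`. The helper `descend` and the two alpha values are chosen only for illustration.

```python
def descend(x, alpha, steps):
    """Repeatedly step against the derivative of x ** 2, which is 2 * x."""
    for _ in range(steps):
        x = x - alpha * 2 * x
    return x

start = 1.0

# alpha = 1.0: each step maps x to -x, so x bounces between 1 and -1 forever.
overshoot = descend(start, alpha=1.0, steps=5)

# alpha = 0.1: each step shrinks x by a factor of about 0.8, so it approaches 0.
converge = descend(start, alpha=0.1, steps=50)

print(overshoot)
print(converge)
```

The first run ends at -1.0, which is exactly the jumping from point A to point B described above. The second run ends very close to zero.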

Now you want to know how to change weights_1 and bias to reduce the error. You already saw that you can use derivatives for this, but instead of a function with only a sum inside, now you have a function that produces its result using other functions.

Since you now have this function composition, to take the derivative of the error with respect to the parameters, you’ll need to use the chain rule from calculus. With the chain rule, you take the partial derivatives of each function, evaluate them, and multiply all the partial derivatives to get the derivative you want.

<img src="https://robocrop.realpython.net/?url=https%3A//files.realpython.com/media/partial_derivative_weights_2.c792633559c3.png&w=750&sig=77881b051de83d0af835c87b5abc82dbd340bef5">

### Adjusting the parameters with backpropagation

You want to take the derivative of the error function with respect to the bias, derror_dbias. Then you’ll keep going backward, taking the partial derivatives until you find the bias variable.

Since you are starting from the end and going backward, you first need to take the partial derivative of the error with respect to the prediction. That’s the derror_dprediction in the image below:

<img src="https://robocrop.realpython.net/?url=https%3A//files.realpython.com/media/partial_derivative_bias_2.177c16a60b9d.png&w=750&sig=72cd2e7882a87d1ef09e678d9cfa9517e0ae63c8">

The function that produces the error is a square function, and the derivative of this function is 2 * x, as you saw earlier. You applied the first partial derivative (derror_dprediction) and still didn’t get to the bias, so you need to take another step back and take the derivative of the prediction with respect to the previous layer, dprediction_dlayer1.

The prediction is the result of the sigmoid function. You can take the derivative of the sigmoid function by multiplying sigmoid(x) and 1 - sigmoid(x). This derivative formula is very handy because you can use the sigmoid result that has already been computed to compute the derivative of it. You then take this partial derivative and continue going backward.

Now you’ll take the derivative of layer_1 with respect to the bias. There it is—you finally got to it! The bias variable is an independent variable, so the result after applying the power rule is 1.

In [34]:
def dsigmoid(x):
    return sigmoid(x) * (1 - sigmoid(x))

In [35]:
derror_dprediction = 2 * (prediction - target)

In [37]:
layer_1 = np.dot(input_vector, weights_1) + bias

In [38]:
dprediction_dlayer1 = dsigmoid(layer_1)

In [39]:
dlayer1_dbias = 1

In [40]:
derror_dbias = (derror_dprediction * dprediction_dlayer1 * dlayer1_dbias)

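Follow the same process for the weights. The only factor that changes is the last one: the derivative of layer_1 with respect to the weights is the input vector.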

In [41]:
dlayer1_dweights = input_vector

In [42]:
derror_dweights = (derror_dprediction * dprediction_dlayer1 * dlayer1_dweights)
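The cells above compute the gradients but stop before applying them. The sketch below shows one possible update step, starting from the initial weights used earlier in the notebook. The `learning_rate` value of 0.1 and the helper `squared_error` are assumptions made for this example, not values taken from the source.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

input_vector = np.array([2, 1.5])
weights = np.array([1.45, -0.66])  # initial weights from the cells above
bias = np.array([0.0])
target = 0
learning_rate = 0.1                # hypothetical value for this sketch

def squared_error(weights, bias):
    prediction = sigmoid(np.dot(input_vector, weights) + bias)
    return np.square(prediction - target)

# Forward pass
layer_1 = np.dot(input_vector, weights) + bias
prediction = sigmoid(layer_1)

# Backward pass: multiply the partial derivatives, as the chain rule prescribes.
derror_dprediction = 2 * (prediction - target)
dprediction_dlayer1 = prediction * (1 - prediction)  # reuses the sigmoid output
derror_dbias = derror_dprediction * dprediction_dlayer1 * 1
derror_dweights = derror_dprediction * dprediction_dlayer1 * input_vector

# Update: move each parameter a fraction of its gradient in the opposite direction.
error_before = squared_error(weights, bias)
new_weights = weights - learning_rate * derror_dweights
new_bias = bias - learning_rate * derror_dbias
error_after = squared_error(new_weights, new_bias)

print(error_before, error_after)
```

A single small step lowers the error only slightly. Repeating the step many times is what gradually drives the error down.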

**Reference:** https://realpython.com/python-ai-neural-network/#wrapping-the-inputs-of-the-neural-network-with-numpy