<a href="https://colab.research.google.com/github/aircable/AInotebooks/blob/main/NN_with_PyTorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A Simple Neural Network

In this tutorial we will implement a simple neural network from scratch using PyTorch. The goal is to teach you the basics of PyTorch and how it can be used to build a network by hand. I will go over some of the basic functionality and concepts available in PyTorch that will allow you to build your own neural networks.


The `torch` module provides all the necessary **tensor** operators you will need to implement your first neural network from scratch in PyTorch. In PyTorch everything is a Tensor, so this is the first thing you will need to get used to. Let's import the libraries we will need for this tutorial.

In [None]:
import torch
import torch.nn as nn

## Data
Let's start by creating some sample data using the `torch.tensor` function. In NumPy, this could be done with `np.array`. Both functions serve the same purpose, but in PyTorch everything is a Tensor as opposed to a vector or matrix. We set the element type in PyTorch with the `dtype=torch.xxx` argument.

In the data below, `X` represents the number of hours studied and how much time students spent sleeping, whereas `y` represents grades. The variable `xPredicted` is a single input for which we want to predict a grade using the parameters learned by the neural network. Remember, the neural network wants to learn a mapping between `X` and `y`, so it will try to take a guess from what it has learned from the training data.

<img src="input_table.jpg" alt="input table">

In [None]:
X = torch.tensor(([2, 9], [1, 5], [3, 6]), dtype=torch.float) # 3 X 2 tensor
y = torch.tensor(([92], [100], [89]), dtype=torch.float) # 3 X 1 tensor
xPredicted = torch.tensor(([4, 8]), dtype=torch.float) # 1-D tensor with 2 elements

You can check the size of the tensors we have just created with the `size` method. This is equivalent to the `shape` attribute used in tools such as NumPy and TensorFlow.

In [None]:
print(X.size())
print(y.size())

torch.Size([3, 2])
torch.Size([3, 1])


## Scaling

Below we scale the sample data. When `torch.max` is called with a dimension argument (here `0`), it returns both the maximum values and their indices. We capture the indices in `_` because we only need the maximum values to conduct the scaling. Perfect! Our data is now in a very nice format our neural network will appreciate later on.

In [None]:
# scale units
X_max, _ = torch.max(X, 0)
xPredicted_max, _ = torch.max(xPredicted, 0)

X = torch.div(X, X_max)
xPredicted = torch.div(xPredicted, xPredicted_max)
y = y / 100  # max test score is 100
print(xPredicted)

tensor([0.5000, 1.0000])


Notice that there are two functions `max` and `div` that I didn't discuss above. They do exactly what they imply: `max` finds the maximum value in a vector... I mean tensor; and `div` is basically a nice little function to divide two tensors.
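As a quick check of the behaviour described above, the sketch below calls `torch.max` along dimension 0 on the unscaled data and then divides by the result. The names `raw`, `col_max`, `col_idx`, and `scaled` are chosen for illustration and are not part of the tutorial code.

```python
import torch

# Same unscaled data as the tutorial's X
raw = torch.tensor([[2, 9], [1, 5], [3, 6]], dtype=torch.float)

# With a dim argument, torch.max returns (values, indices)
col_max, col_idx = torch.max(raw, 0)
print(col_max)  # per-column maxima
print(col_idx)  # row index at which each maximum occurs

# Broadcasting divides every row by the per-column maxima
scaled = torch.div(raw, col_max)
print(scaled)
```

After scaling, the largest entry in each column is exactly 1, which is why the scaled inputs all fall between 0 and 1.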

## Model (Computation Graph)
Once the data has been processed and it is in the proper format, all you need to do now is to define your model. Here is where things begin to change a little as compared to how you would build your neural networks using, say, something like Keras or Tensorflow. However, you will realize quickly as you go along that PyTorch doesn't differ much from other deep learning tools. At the end of the day we are constructing a computation graph, which is used to dictate how data should flow and what type of operations are performed on this information.

For illustration purposes, we are building the following neural network or computation graph:


![alt text](https://drive.google.com/uc?export=view&id=1l-sKpcCJCEUJV1BlAqcVAvLXLpYCInV6)

In [None]:
class Neural_Network(nn.Module):
    def __init__(self, ):
        super(Neural_Network, self).__init__()
        # parameters
        # TODO: parameters can be parameterized instead of declaring them here
        self.inputSize = 2
        self.outputSize = 1
        self.hiddenSize = 3

        # weights
        self.W1 = torch.randn(self.inputSize, self.hiddenSize) # 2 X 3 tensor
        self.W2 = torch.randn(self.hiddenSize, self.outputSize) # 3 X 1 tensor

    def forward(self, X):
        self.z = torch.matmul(X, self.W1) # 3 X 3 ".dot" does not broadcast in PyTorch
        self.z2 = self.sigmoid(self.z) # activation function
        self.z3 = torch.matmul(self.z2, self.W2)
        o = self.sigmoid(self.z3) # final activation function
        return o

    def sigmoid(self, s):
        return 1 / (1 + torch.exp(-s))

    def sigmoidPrime(self, s):
        # derivative of sigmoid, expressed in terms of the sigmoid output s
        return s * (1 - s)

    def backward(self, X, y, o):
        self.o_error = y - o # error in output
        self.o_delta = self.o_error * self.sigmoidPrime(o) # derivative of sig to error
        self.z2_error = torch.matmul(self.o_delta, torch.t(self.W2)) # z2 error: how much our hidden layer weights contributed to output error  
        self.z2_delta = self.z2_error * self.sigmoidPrime(self.z2) # applying derivative of sigmoid to z2 error  
        self.W1 += torch.matmul(torch.t(X), self.z2_delta) # adjusting first set (input --> hidden) weights 
        self.W2 += torch.matmul(torch.t(self.z2), self.o_delta) # adjusting second set (hidden --> output) weights
        # Note: in a real-world scenario, you would use an optimizer to update weights

    def train(self, X, y):
        # forward + backward pass for training
        o = self.forward(X)
        self.backward(X, y, o)

    def saveWeights(self, model):
        # we will use the PyTorch internal storage functions
        torch.save(model, "NN")
        # you can reload model with all the weights and so forth with:
        # torch.load("NN")

    def predict(self):
        print ("Predicted data based on trained weights: ")
        print ("Input (scaled): \n" + str(xPredicted))
        print ("Output: \n" + str(self.forward(xPredicted)))


For the purpose of this tutorial, we are not going to be talking math stuff, that's for another day. I just want you to get a gist of what it takes to build a neural network from scratch using PyTorch. Let's break down the model which was declared via the class above.

## Class Header
First, we defined our model via a class because that is the recommended way to build the computation graph. The class header contains the name of the class, `Neural_Network`, and the base class `nn.Module`, which indicates that we are defining our own neural network.

```python
class Neural_Network(nn.Module):
```

## Initialization
The next step is to define the initializations (`def __init__(self)`) that will be performed upon creating an instance of the customized neural network. You can declare the parameters of your model here, but typically, you would declare the structure of your network in this section -- the size of the hidden layers and so forth. Since we are building the neural network from scratch, we explicitly declared the size of the weight matrices: one that stores the parameters from the input to hidden layer, and one that stores the parameters from the hidden to output layer. Both weight matrices are initialized with values randomly chosen from a normal distribution via `torch.randn(...)`. Note that we are not using bias just to keep things as simple as possible.

```python
def __init__(self, ):
    super(Neural_Network, self).__init__()
    # parameters
    # TODO: parameters can be parameterized instead of declaring them here
    self.inputSize = 2
    self.outputSize = 1
    self.hiddenSize = 3

    # weights
    self.W1 = torch.randn(self.inputSize, self.hiddenSize) # 2 X 3 tensor
    self.W2 = torch.randn(self.hiddenSize, self.outputSize) # 3 X 1 tensor
```

## The Forward Function
The `forward` function is where all the magic happens (see below). This is where the data enters and is fed into the computation graph (i.e., the neural network structure we have built). Since we are building a simple neural network with one hidden layer, our forward function looks very simple:

```python
def forward(self, X):
    self.z = torch.matmul(X, self.W1)
    self.z2 = self.sigmoid(self.z) # activation function
    self.z3 = torch.matmul(self.z2, self.W2)
    o = self.sigmoid(self.z3) # final activation function
    return o
```

The `forward` function above takes the input `X` and performs a matrix multiplication (`torch.matmul(...)`) with the first weight matrix `self.W1`. The result is then passed through an activation function, `sigmoid`. The resulting activation matrix is multiplied by the second weight matrix `self.W2`, and another activation is applied, which yields the output of the neural network or computation graph. The process I described above is simply what's known as a feedforward pass. In order for the weights to be optimized during training, we need a backpropagation algorithm.
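To make the shapes in the feedforward pass explicit, here is a minimal standalone sketch. It uses random stand-in data rather than the tutorial's `X`, and the free function `sigmoid` and the names `hidden` and `output` are assumptions made for illustration.

```python
import torch

torch.manual_seed(0)  # fixed seed so the run is repeatable

X = torch.rand(3, 2)    # 3 samples, 2 features (stand-in for the scaled data)
W1 = torch.randn(2, 3)  # input -> hidden weights
W2 = torch.randn(3, 1)  # hidden -> output weights

def sigmoid(s):
    return 1 / (1 + torch.exp(-s))

hidden = sigmoid(torch.matmul(X, W1))       # (3, 2) @ (2, 3) -> (3, 3)
output = sigmoid(torch.matmul(hidden, W2))  # (3, 3) @ (3, 1) -> (3, 1)

print(hidden.shape, output.shape)
```

Each of the three samples ends up with one output value between 0 and 1, matching the 3 X 1 shape of `y`.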

## The Backward Function
The `backward` function contains the backpropagation algorithm, where the goal is to essentially minimize the loss with respect to our weights. In other words, the weights need to be updated in such a way that the loss decreases while the neural network is training (well, that is what we hope for). All this magic is possible with the gradient descent algorithm, which is implemented in the `backward` function. Take a minute or two to inspect what is happening in the code below:

```python
def backward(self, X, y, o):
    self.o_error = y - o # error in output
    self.o_delta = self.o_error * self.sigmoidPrime(o)
    self.z2_error = torch.matmul(self.o_delta, torch.t(self.W2))
    self.z2_delta = self.z2_error * self.sigmoidPrime(self.z2)
    self.W1 += torch.matmul(torch.t(X), self.z2_delta)
    self.W2 += torch.matmul(torch.t(self.z2), self.o_delta)
```

Notice that we are performing a lot of matrix multiplications along with the transpose operations via the `torch.matmul(...)` and `torch.t(...)` operations, respectively. The rest is simply gradient descent -- there is nothing to it.

## Training
All that is left now is to train the neural network. First we create an instance of the computation graph we have just built:

```python
NN = Neural_Network()
```

Then we train the model for `1000` rounds. Notice that in PyTorch `NN(X)` automatically calls the `forward` function so there is no need to explicitly call `NN.forward(X)`.
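The call behaviour mentioned above can be seen with a tiny module. `Doubler` is a hypothetical toy class used only for this illustration; it is not part of the tutorial's network.

```python
import torch
import torch.nn as nn

class Doubler(nn.Module):  # hypothetical toy module
    def forward(self, x):
        return 2 * x

m = Doubler()
x = torch.tensor([1.0, 2.0])

# Calling the instance dispatches to forward()
print(m(x))
print(torch.equal(m(x), m.forward(x)))
```

Calling the instance is the usual style in PyTorch code, which is why the training loop writes `NN(X)`.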

After we have obtained the predicted output for every round of training, we compute the loss with the following code:

```python
    torch.mean((y - NN(X))**2).detach().item()
```

The next step is to start the training (forward + backward) via `NN.train(X, y)`. After we have trained the neural network, we can store the model and output the predicted value of the single instance we declared in the beginning, `xPredicted`.

### Let's train!

Let's add a real time graph for how your neural network learns. It feels like magic sometimes, but at the end of the day, any neural network is simply trying to get loss to 0.

In [1]:
# for plotting we may need matplotlib
!pip install matplotlib



In [None]:
import matplotlib.pyplot as plt

count = []   # iteration numbers
losses = []  # loss value at each iteration

NN = Neural_Network()
for i in range(1000):  # trains the NN 1,000 times
    # mean squared error between the targets and the current predictions
    current_loss = torch.mean((y - NN(X))**2).detach().item()
    count.append(i)
    losses.append(round(current_loss, 6))

    # real time graph of how the NN is learning
    plt.cla()
    plt.title("Loss over Iterations")
    plt.xlabel("Iterations")
    plt.ylabel("Loss")
    plt.plot(count, losses)
    plt.pause(.001)

    if (i % 100) == 0:
        print("#" + str(i) + " Loss: " + str(current_loss))  # mean squared error
    NN.train(X, y)
NN.saveWeights(NN)
NN.predict()

print("Finished training!")

#0 Loss: 0.24544493854045868
#100 Loss: 0.0026628002524375916
#200 Loss: 0.0024748605210334063
#300 Loss: 0.002363199135288596
#400 Loss: 0.0022466194350272417
#500 Loss: 0.0021235516760498285
#600 Loss: 0.001996910898014903
#700 Loss: 0.0018705682596191764
#800 Loss: 0.0017485078424215317
#900 Loss: 0.0016340742586180568
Predicted data based on trained weights: 
Input (scaled): 
tensor([0.5000, 1.0000])
Output: 
tensor([0.9529])
Finished training!



The loss keeps decreasing, which means that the neural network is learning something. 

## Calculations Behind Our Network Training: Forward Propagation
We start with random numbers as the weights of our network.

<img src="weights.png"  width="25%" alt="weights">

Our neural network can be represented with matrices. We take the dot product of each row of the first matrix with each column of the second matrix. What's the dot product? It is the sum of the products of corresponding elements. So, in order to get the first element in the hidden layer matrix, you multiply 2 by .2 and add that to 9 times .8 to get 7.6.

<img src="matrix_multiplication.png"  width="25%" alt="matrix1">

Once you repeat this process for all the columns in the weights matrix, you get all the hidden layer neuron values which are shown in the hidden layer matrix on the right.

```
(2 * .2) + (9 * .8) = 7.6
(2 * .6) + (9 * .3) = 3.9
(2 * .1) + (9 * .7) = 6.5
```
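The three sums above can be reproduced in a few lines of plain Python. The weight values are read off the arithmetic shown; the names `x`, `W1`, and `hidden` are chosen for this sketch.

```python
# First input row and the example 2x3 weight matrix used in the sums above
x = [2, 9]
W1 = [[0.2, 0.6, 0.1],
      [0.8, 0.3, 0.7]]

# One dot product per column of W1: sum of products of matching elements
hidden = [sum(x[i] * W1[i][j] for i in range(len(x))) for j in range(3)]
print([round(h, 1) for h in hidden])  # [7.6, 3.9, 6.5]
```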

This is the fundamental concept behind forward propagation. Now, let's put all the other inputs into the inputs matrix.

<img src="matrix_multiplication_2.png"  width="25%" alt="matrix2">

As you might notice, every time we move down a row in the inputs, we move down a row in the result. 

The values we got in our hidden matrix are in the small font on the top left, why? Well, we must apply the **activation function** on each of these values. In an artificial neural network, an activation function of a neuron defines the output of the neuron given a set of inputs. In other words, activation functions give a network a sense of how activated a neuron is by mapping the input set to some value in between a given lower and upper bound. This is inspired by biology, as certain neurons within our brains are either firing or not depending on stimuli. You may think of a neuron firing as represented with a 1 and a neuron not firing as represented with a 0.

There are many activation functions out there. In this case, we’ll stick to one of the more popular ones — the sigmoid function. The sigmoid function maps all input values to some value between a lower limit of 0 and an upper limit of 1. If the input is very negative, the number will be transformed into a number very close to 0. If the input is very positive, the number will be transformed to a number very close to 1. If the input is close to 0, the number will be transformed into some number in between 0 and 1.

<img src="sigmoid.png"  width="25%" alt="weights">

```
S(7.6) = 0.9994997
S(3.9) = 0.9801597
S(6.5) = 0.9984988
```
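The squashing behaviour and the three values above can be checked with the standard library alone. This is a small sketch; the free function `sigmoid` mirrors the method of the same name in the class.

```python
import math

def sigmoid(s):
    """Logistic function: maps any real number into (0, 1)."""
    return 1 / (1 + math.exp(-s))

# Very negative, zero, the three hidden-layer sums, and very positive
for s in (-10, 0, 3.9, 6.5, 7.6, 10):
    print(s, round(sigmoid(s), 7))
```

An input of 0 lands exactly on 0.5, while the hidden-layer sums are large enough that their activations sit close to 1.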

Now, we need to use matrix multiplication again, with another set of random weights, to calculate our output layer value.

<img src="matrix_multiplication_3.png"  width="25%" alt="matrix2">

```
(.9994 * .4) + (.9802 * .5) + (.9984 * .9) = 1.78842
```

Lastly, to normalize the output, we just apply the activation function again.

```
S(1.78842) = .8567335
```

And, there you go! We just did **ONE** forward propagation! With those weights, our neural network will calculate .85 as our test score! However, our target was .92. Our result wasn’t poor, it just isn’t the best it can be. We just got a little lucky when we chose the random weights for this example.

## The “Learning” of Our Network: Backpropagation
Since we have a random set of weights, we need to alter them to make our neural network guess the correct test scores. This is done through a method called backpropagation.

Backpropagation works by using a loss function to calculate how far the network was from the target output.

What does backpropagation adjust to make the neural network better at guessing the correct values?

- The weights of the network.

### Calculating Error
One way of representing the loss (cost) function is by using the mean squared error function:

$$
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\text{predicted}_i - \text{actual}_i)^2
$$
The mean squared error function is the sum, over all the data points, of the square of the difference between the predicted and actual target variables, divided by the number of data points. 
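As a small worked example of this formula, the sketch below computes the MSE in plain Python. The helper `mse` and the sample values are illustrative assumptions, not output from the tutorial's network.

```python
def mse(predicted, actual):
    """Mean of the squared differences between two equal-length sequences."""
    n = len(predicted)
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n

# Hypothetical predictions against three scaled target scores
predicted = [0.85, 0.80, 0.95]
actual = [0.92, 0.86, 0.89]
print(mse(predicted, actual))
```

A perfect prediction gives an MSE of exactly 0, and larger misses are penalized more heavily because of the square.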

### Bias in Neural Networks
In a neural network, we would ideally want to have a bias. A bias allows us to shift the activation function either to the right or left, which means that we would be able to fit the prediction to the input data much better. The initial bias is dependent on your dataset, but this value should be updated along with the weights during backpropagation.

For the sake of simplicity, we assume bias to be 0 in this tutorial.
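To see what shifting the activation function means, here is a minimal sketch of a single neuron with a bias term. The function `neuron` and the values of `w` and `b` are hypothetical and only illustrate the idea; the tutorial's network does not use them.

```python
import math

def sigmoid(s):
    return 1 / (1 + math.exp(-s))

def neuron(x, w, b):
    """Single neuron: weighted input plus bias, passed through the sigmoid."""
    return sigmoid(w * x + b)

# Without bias the output crosses 0.5 at x = 0
print(neuron(0.0, w=1.0, b=0.0))
# With b = 2 the crossing point moves left to x = -2
print(neuron(-2.0, w=1.0, b=2.0))
# The same input x = 0 now produces a larger activation
print(neuron(0.0, w=1.0, b=2.0))
```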

### Gradient Descent
Now that we have the **loss (cost) function**, our goal is to get it as close as we can to 0. As we are training our network, all we are doing is minimizing the loss function. In other words, we are optimizing to find the local minimum of our loss function.

If you remember, our loss function is dependent on our output, which is dependent on our weights. We could brute force search all possible combinations of weights to minimize loss. However, this would take a really, really, really long time and just isn't practical. We need an intuitive method to find the local minimum of the loss function.

For this optimization, we will use the method of gradient descent. Let's look at the function f(x), where x is our input weight and f(x) is our loss function. What if we could take the derivative at a given weight? This would allow us to understand which way is downhill and whether to make our x (weight) smaller or larger to decrease our loss. In other words, to figure out which direction to alter our weights, we need to find the rate of change of our loss with respect to our weights (a partial derivative!).

If this partial derivative of our loss with respect to our weights is positive, then the cost function is going uphill. If it is negative, then the cost function is going downhill. Therefore, we're able to save time by making sure that we're optimizing in the right direction.

<img src="gradient_descent.png"  width="25%" alt="weights">

Above is an illustration describing our method of gradient descent. Knowing which way to alter our weights lets each update move the outputs closer to the targets, as long as the steps are not too large.
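The downhill idea can be tried on a one-dimensional toy problem. The loss function, starting weight, and `learning_rate` below are assumptions made for this sketch; the tutorial's network applies its updates without an explicit learning rate.

```python
def loss(w):
    return (w - 3.0) ** 2      # toy loss with its minimum at w = 3

def grad(w):
    return 2.0 * (w - 3.0)     # derivative of the toy loss

w = 0.0                        # arbitrary starting weight
learning_rate = 0.1            # assumed step size
for step in range(50):
    # positive slope -> decrease w, negative slope -> increase w
    w -= learning_rate * grad(w)

print(w, loss(w))
```

Each step moves the weight against the sign of the derivative, so it settles near the minimum instead of searching every possible value.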

Here’s how we will calculate the incremental change to our weights:

- Find the **margin of error** of the output layer (o) by taking the difference of the predicted output and the actual output (y)
- Apply the derivative of our sigmoid activation function to the output layer error. We call this result the **delta output sum**.
- Use the **delta output sum** of the output layer error to figure out how much our z2 (hidden) layer contributed to the output error by performing a dot product with our second weight matrix. We can call this the z2 error.
- Calculate the **delta output sum** for the z2 layer by applying the derivative of our sigmoid **activation function** (just like step 2).
- Adjust the weights for the first layer by performing a **dot product of the input layer** with the **hidden (z2) delta output sum**. For the second layer, perform a dot product of the hidden(z2) layer and the **output (o) delta output sum**.

Calculating the delta output sum and then applying the derivative of the sigmoid function are very important to backpropagation. The derivative of the sigmoid, also known as **sigmoid prime**, will give us the rate of change, or slope, of the activation function at the output sum.

We are adding a sigmoidPrime (derivative of sigmoid) function:

```python
def sigmoidPrime(self, s):  #derivative of sigmoid  
    return s * (1 - s)
```

<img src="sigmoid_derivative.png"  width="25%" alt="weights">

Here is an illustration of what the derivative of the sigmoid function looks like. The value is very small when you get far from 0 and gets larger only close to 0. Basically, when the neural network is very sure about a certain neuron, we do not want to change the value of that neuron and passing the value of the neuron through the sigmoid derivative will help with that. On the other hand, if the neural network is not as sure about the neuron, we want to change it more. 
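Note that `sigmoidPrime` expects a value that has already been passed through the sigmoid. The sketch below checks `s * (1 - s)` against a numerical derivative; the snake_case names and the step size `h` are choices made for this illustration.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def sigmoid_prime(s):
    """Derivative of the sigmoid, written in terms of its output s = sigmoid(z)."""
    return s * (1 - s)

z = 0.5
s = sigmoid(z)
analytic = sigmoid_prime(s)

# Central finite difference as an independent check
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)

print(analytic, numeric)
print(sigmoid_prime(sigmoid(0.0)))  # the slope peaks at z = 0 with value 0.25
```

This is why `backward` calls `sigmoidPrime(o)` and `sigmoidPrime(self.z2)`: both arguments are activations, not raw sums.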

### Putting it All Together
Let's implement our backward propagation function by using the method of gradient descent we just discovered.

First, we will find the error in our function by taking the difference of our output layer (o) and the actual value (y). Next we need to figure out how much to change the output layer, so we calculate this delta by multiplying the error of the output layer with the derivative of the sigmoid function. Luckily, we've already defined a function for this.

Once we've figured out the delta output sum for o, we go back to the hidden layer, z2, and calculate its error by taking the dot product of our o_delta and the transpose of the weights we used on it, W2. The reason we use the transpose of the second set of weights is so that we can apply the error of the output to each weight. Remember that o_delta and W2 are both 3x1 matrices; in order to do multiplication via the dot product, W2 is transposed to 1x3 so the resulting matrix for z2_error is 3x3.
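The shape argument above can be checked directly in NumPy. The arrays here hold arbitrary random values; only their shapes matter for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)     # fixed seed; the values themselves are arbitrary

o_delta = rng.normal(size=(3, 1))  # one delta per training example
W2 = rng.normal(size=(3, 1))       # hidden -> output weights

z2_error = o_delta.dot(W2.T)       # (3, 1) @ (1, 3) -> (3, 3)
print(z2_error.shape)

# Without the transpose the inner dimensions (1 and 3) do not match
try:
    o_delta.dot(W2)
except ValueError as err:
    print("shape mismatch:", err)
```

Row `i` of `z2_error` holds the output delta of example `i` spread across the three hidden units in proportion to their weights.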

Next, we do the same thing as we did with the output error and multiply the error by the derivative of the sigmoid function to figure out the change in z2.

Now, we adjust our weights accordingly. Let's adjust the first set of weights by dotting our input with the change in the hidden layer. We take the transpose again to make the multiplication possible. Then we can add the result to the first set of weights, element-wise. We do the same thing to the second set of weights except we use the hidden layer and the output layer to do so. The listing below uses NumPy notation (`.dot` and `.T`); the PyTorch class above expresses the same steps with `torch.matmul` and `torch.t`.

```python
def backward(self, X, y, o):  # backward propagate through the network  
    self.o_error = y - o # error in output  
    self.o_delta = self.o_error*self.sigmoidPrime(o) # applying derivative of sigmoid to error  
    self.z2_error = self.o_delta.dot(self.W2.T) # z2 error: how much our hidden layer weights contributed to output error  
    self.z2_delta = self.z2_error*self.sigmoidPrime(self.z2) # applying derivative of sigmoid to z2 error  
    self.W1 += X.T.dot(self.z2_delta) # adjusting first set (input --> hidden) weights  
    self.W2 += self.z2.T.dot(self.o_delta) # adjusting second set (hidden --> output) weights
```    

### Defining Loss
Here's a quick one. We've already defined the for loop to run our neural network a thousand times. The loss is the mean of the square of the difference between the predicted and the actual output, written here in the NumPy notation used at the end of this tutorial.

```python
NN = Neural_Network()  # NumPy version of the class (requires `import numpy as np`)
for i in range(1000): # trains the NN 1,000 times
  print("Input: \n" + str(X))
  print("Actual Output: \n" + str(y))
  print("Predicted Output: \n" + str(NN.forward(X)))
  print("Loss: \n" + str(np.mean(np.square(y - NN.forward(X))))) # mean squared error
  print("\n")
  NN.train(X, y)
```  

Why take the square? Some of the errors will be negative. So, if we averaged the errors without squaring we might get close to 0 when the real loss is much larger. This Loss value is simply a way to quantify how far we are from the 'perfect' neural network.
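The cancellation effect is easy to reproduce with a handful of made-up errors. The values in `errors` are hypothetical and chosen so that the positive and negative errors balance out.

```python
errors = [0.10, -0.10, 0.05, -0.05]   # hypothetical per-example errors

mean_error = sum(errors) / len(errors)
mean_squared_error = sum(e ** 2 for e in errors) / len(errors)

print(mean_error)          # positive and negative errors cancel
print(mean_squared_error)  # squaring keeps every error's contribution
```

The plain mean suggests a perfect fit even though every prediction is off, while the squared version stays above zero.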

### Predict Function
Now, let’s create a new function that prints our predicted output for `xPredicted`. All we have to run is `forward(xPredicted)` to return an output!

Let's write a predict member function within our class that prints out the input `xPredicted` matrix and the output matrix after it is passed into the `forward()` function.

```python
def predict(self):
  print("Predicted data based on trained weights: ")
  print("Input (scaled): \n" + str(xPredicted))
  print("Output: \n" + str(self.forward(xPredicted)))
```

### Overfitting
Here’s what we got after training the network 150,000 times. Keep in mind that doing this grossly overfits our data: the network essentially memorizes the training set. Models like these aren't very useful for making predictions on data that does not originate from the dataset.

```
# 150000

Input (scaled):
[[0.4 0.9]
 [0.2 0.5]
 [0.6 0.6]]
Actual Output:
[[0.92]
 [0.86]
 [0.89]]
Predicted Output:
[[0.92]
 [0.86]
 [0.89]]
Loss:
5.904467817735647e-17

Predicted data based on trained weights:
Input (scaled):
[[1. 1.]]
Output:
[[0.93545994]]
```



#### That's it. Congratulations! 

You have just learned how to create and train a neural network from scratch using PyTorch. There are so many things you can do with the shallow network we have just implemented. You can add more hidden layers or try to incorporate the bias terms for practice. 

### This is the plain code with just NumPy, no PyTorch.

In [None]:
import numpy as np

# X = (hours studying, hours sleeping), y = score on test
xAll = np.array(([2, 9], [1, 5], [3, 6], [5, 10]), dtype=float) # input data
y = np.array(([92], [86], [89]), dtype=float) # output

# scale units
xAll = xAll/np.amax(xAll, axis=0) # scaling input data
y = y/100 # scaling output data (max test score is 100)

# split data
X = np.split(xAll, [3])[0] # training data
xPredicted = np.split(xAll, [3])[1] # testing data

class Neural_Network(object):
  def __init__(self):
    # parameters
    self.inputSize = 2
    self.outputSize = 1
    self.hiddenSize = 3

    # weights
    self.W1 = np.random.randn(self.inputSize, self.hiddenSize) # (2x3) weight matrix from input to hidden layer
    self.W2 = np.random.randn(self.hiddenSize, self.outputSize) # (3x1) weight matrix from hidden to output layer

  def forward(self, X):
    #forward propagation through our network
    self.z = np.dot(X, self.W1) # dot product of X (input) and first set of 2x3 weights
    self.z2 = self.sigmoid(self.z) # activation function
    self.z3 = np.dot(self.z2, self.W2) # dot product of hidden layer (z2) and second set of 3x1 weights
    o = self.sigmoid(self.z3) # final activation function
    return o

  def sigmoid(self, s):
    # activation function
    return 1/(1+np.exp(-s))

  def sigmoidPrime(self, s):
    #derivative of sigmoid
    return s * (1 - s)

  def backward(self, X, y, o):
    # backward propagate through the network
    self.o_error = y - o # error in output
    self.o_delta = self.o_error*self.sigmoidPrime(o) # applying derivative of sigmoid to error

    self.z2_error = self.o_delta.dot(self.W2.T) # z2 error: how much our hidden layer weights contributed to output error
    self.z2_delta = self.z2_error*self.sigmoidPrime(self.z2) # applying derivative of sigmoid to z2 error

    self.W1 += X.T.dot(self.z2_delta) # adjusting first set (input --> hidden) weights
    self.W2 += self.z2.T.dot(self.o_delta) # adjusting second set (hidden --> output) weights

  def train(self, X, y):
    o = self.forward(X)
    self.backward(X, y, o)

  def saveWeights(self):
    np.savetxt("w1.txt", self.W1, fmt="%s")
    np.savetxt("w2.txt", self.W2, fmt="%s")

  def predict(self):
    print ("Predicted data based on trained weights: ")
    print ("Input (scaled): \n" + str(xPredicted))
    print ("Output: \n" + str(self.forward(xPredicted)))

NN = Neural_Network()
for i in range(1000): # trains the NN 1,000 times
  print ("# " + str(i) + "\n")
  print ("Input (scaled): \n" + str(X))
  print ("Actual Output: \n" + str(y))
  print ("Predicted Output: \n" + str(NN.forward(X)))
  print ("Loss: \n" + str(np.mean(np.square(y - NN.forward(X))))) # mean squared error
  print ("\n")
  NN.train(X, y)

NN.saveWeights()
NN.predict()


## References:
- [PyTorch: Custom nn Modules](https://pytorch.org/tutorials/beginner/pytorch_with_examples.html#pytorch-custom-nn-modules)
- [Build a Neural Network with Numpy](https://enlight.nyc/neural-network)
