# Neural Nets from Scratch

This notebook covers how to implement feed-forward neural networks from scratch, following the book of the same name: https://nnfs.io/


In [31]:
import nnfs
from nnfs.datasets import spiral_data
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
nnfs.init()
sns.set_theme()

# Layers

When data forward-propagates through the network, it passes through successive `layers`.


## Dense layers

In **`Dense layers`**, each neuron computes the **weighted sum** of its `inputs` (the outputs of the previous layer) and its own `weights`, plus a bias.

- Each input value has a weight assigned to it. So, for a single neuron: `len(inputs) == len(weights)`

For each layer, the **input size must match the output size** of the previous layer.
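
As a rough illustration of the shape rule above, the sketch below computes the weighted sums of one dense layer with plain NumPy. The numbers and the names `inputs`, `weights`, and `biases` are made up for this example and are not taken from the book.

```python
import numpy as np

# 3 samples with 4 features each (arbitrary values, for illustration only)
inputs = np.array([[1.0, 2.0, 3.0, 2.5],
                   [2.0, 5.0, -1.0, 2.0],
                   [-1.5, 2.7, 3.3, -0.8]])

# 4 inputs feeding 2 neurons: one column of weights per neuron
weights = np.array([[0.2, 0.5],
                    [0.8, -0.91],
                    [-0.5, 0.26],
                    [1.0, -0.5]])

# one bias per neuron
biases = np.array([[2.0, 3.0]])

# weighted sum for every sample and every neuron
output = np.dot(inputs, weights) + biases

print(output.shape)  # (3, 2): one row per sample, one column per neuron
```

The inner dimensions of the dot product (4 and 4) have to agree, which is the same constraint as "input size must match the output size of the previous layer".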

In [20]:
class DenseLayer:
    
    def __init__(self, n_inputs, n_neurons):
        
        # initialize random weights. 
        # The shape is (input size, # of neurons) instead of (# of neurons, input size) to avoid having to transpose later 
        self.weights = 0.01 * np.random.randn(n_inputs, n_neurons) # 0.01 scales the weights down for faster training
        
        # initialize biases as zero vectors
        self.biases = np.zeros((1, n_neurons))
         
    def forward(self, inputs):
        # weighted sum
        
        self.inputs = inputs
        
        self.output = np.dot(inputs, self.weights) + self.biases
        
    def backward(self, dvalues):
        # gradients on parameters
        self.dweights = np.dot(self.inputs.T, dvalues)
        self.dbiases = np.sum(dvalues, axis=0, keepdims=True)
        
        # gradient on values
        self.dinputs = np.dot(dvalues, self.weights.T)
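
One way to gain some confidence in the `backward` formulas is a numerical gradient check. The sketch below is self-contained and does not use the `DenseLayer` class; the scalar function `scalar_loss` and the random `dvalues` are assumptions made purely for the check.

```python
import numpy as np

rng = np.random.default_rng(0)
inputs = rng.normal(size=(5, 3))     # 5 samples, 3 features
weights = rng.normal(size=(3, 2))    # 3 inputs, 2 neurons
biases = np.zeros((1, 2))
dvalues = rng.normal(size=(5, 2))    # stand-in for the gradient arriving from the next layer


def scalar_loss(w):
    # hypothetical scalar whose gradient w.r.t. the layer output is exactly dvalues
    return np.sum((np.dot(inputs, w) + biases) * dvalues)


# analytic gradient, same formula as in DenseLayer.backward
dweights = np.dot(inputs.T, dvalues)

# numerical gradient by central differences, one weight at a time
eps = 1e-6
numeric = np.zeros_like(weights)
for i in range(weights.shape[0]):
    for j in range(weights.shape[1]):
        w_plus = weights.copy()
        w_plus[i, j] += eps
        w_minus = weights.copy()
        w_minus[i, j] -= eps
        numeric[i, j] = (scalar_loss(w_plus) - scalar_loss(w_minus)) / (2 * eps)

print(np.allclose(dweights, numeric, atol=1e-5))
```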

# Activation functions

`Activation functions` are used to apply transformations (mostly non-linear) to our input data.

Most activations are `non-linear` because we want to solve `non-linear` problems.

## Sigmoid

`Sigmoid` is known as the **squashification** function because it squashes all values into the range (0, 1).

Today, `sigmoid` is not used much inside of hidden layers, but sometimes can be found in the output layer for classification tasks.

<font size="5"> $y = \frac{1}{1+ e^{-x}}$ </font>
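
The notebook does not implement sigmoid, so the following is only a minimal sketch of the formula above. The function name `sigmoid` and the sample inputs are chosen for this example.

```python
import numpy as np


def sigmoid(x):
    # y = 1 / (1 + e^(-x)), applied element-wise
    return 1.0 / (1.0 + np.exp(-x))


x = np.array([-5.0, 0.0, 5.0])
y = sigmoid(x)

print(y)  # values close to 0, exactly 0.5, and close to 1
```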


## ReLU

`ReLU` is super simple: it takes the **max** between 0 and the input data.

`ReLU` is the **most widely used activation** in hidden layers, mainly because of its speed and efficiency

$$ y =   \left\{
\begin{array}{ll}
      x & x>0 \\
      0 & x \le 0\\
\end{array} 
\right.  $$
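
A small sketch of the piecewise definition above, together with its derivative. Treating the derivative at exactly `x = 0` as 0 is a convention assumed here; it matches the `<= 0` mask used in the `ReLU.backward` method of this notebook.

```python
import numpy as np

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

# forward: element-wise max between 0 and the input
y = np.maximum(0, x)

# derivative: 1 where the input was positive, 0 everywhere else
dy_dx = (x > 0).astype(x.dtype)

print(y)
print(dy_dx)
```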


## Softmax

`Softmax`, like sigmoid, squishes values into [0,1]. However, it outputs a `probability distribution`, meaning all the output values sum to 1

The output values can also be thought of as **`confidence scores`**:

- `Higher probability` = higher confidence the model has 
- `Lower probability` = lower confidence the model has 

<font size="5">$s_{i,j} = \frac{e^{z_{i,j}}}{\sum_{l=1}^L e^{z_{i,l}}}$</font>
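
The formula can be tried out on a tiny batch. In this sketch the logits `z` are invented values; the second row is deliberately large to show why implementations usually subtract the row maximum before exponentiating (a shift that leaves the result unchanged).

```python
import numpy as np

z = np.array([[1.0, 2.0, 3.0],
              [1000.0, 1001.0, 1002.0]])  # np.exp(1000.0) alone would overflow

# shift each row so that its largest value becomes 0
shifted = z - np.max(z, axis=1, keepdims=True)

# exponentiate, then normalize each row
exp_values = np.exp(shifted)
probs = exp_values / np.sum(exp_values, axis=1, keepdims=True)

print(probs.sum(axis=1))                # each row sums to 1
print(np.allclose(probs[0], probs[1]))  # True: both rows differ only by a constant shift
```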


In [22]:
class ReLU:
    """Some notes about ReLU:
    1. ReLU is not normalized, meaning values can range over [0, infinity)
    2. ReLU outputs are completely independent of each other (exclusive)
    3. Because of the two reasons above, ReLU cannot be used in the final layer for predicting probabilities (classification)
    4. np.maximum takes the element wise max between two arrays
    """
    
    def forward(self, inputs):
        
        self.inputs = inputs
        
        # any negative values are turned into 0
        self.output = np.maximum(0, inputs)
        
        
        
    def backward(self, dvalues):
        
        self.dinputs = dvalues.copy()
        self.dinputs[self.inputs <= 0] = 0
       
class Softmax:
    """Some notes about softmax:
    1. Softmax returns a probability distribution (all the floats add up to 1)
    2. Each probability score also represents a confidence score (i.e., [.45, .55] means the model has low confidence)
    3. Softmax is almost exclusively used in the output layer
    """
    
    def forward(self, inputs):
        
        """More notes:
        1. axis=1 means the max and the sum are taken along each row (one result per sample), not down the columns
        
        2. keepdims=True keeps the reduced axis with length 1, so the result broadcasts against the inputs
        
        3. we subtract the largest input in each row to prevent exploding values.
            - exploding values = np.exp overflowing when its argument gets large
            - the softmax output is unchanged, because the shift cancels out in the normalization
            
        4. after this subtraction every value is <= 0, so the exponentiated values fall in the range (0, 1]
        
        """
        
        # get unnormalized probabilities
        exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
        
        # normalize them for each sample
        probs = exp_values / np.sum(exp_values, axis=1, keepdims=True)
        
        self.output = probs
        
    def backward(self, dvalues):
        self.dinputs = np.empty_like(dvalues)
        
        for index, (single_output, single_dvalues) in enumerate(zip(self.output, dvalues)):
            
            single_output = single_output.reshape(-1,1)
            
            # calculate jacobian matrix of the output 
            jacobian_matrix = np.diagflat(single_output) - np.dot(single_output, single_output.T)
            
            # calculate sample-wise gradient
            self.dinputs[index] = np.dot(jacobian_matrix, single_dvalues)
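
The per-sample loop in `backward` can also be written without building each Jacobian matrix, because multiplying the Jacobian by a vector simplifies to an element-wise expression. The sketch below is an alternative formulation, not the book's code; it recomputes a small softmax output from random values and compares both versions.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 3))
exp_values = np.exp(logits - logits.max(axis=1, keepdims=True))
output = exp_values / exp_values.sum(axis=1, keepdims=True)
dvalues = rng.normal(size=(4, 3))  # stand-in for the upstream gradient

# loop version, mirroring Softmax.backward
loop_dinputs = np.empty_like(dvalues)
for index, (single_output, single_dvalues) in enumerate(zip(output, dvalues)):
    single_output = single_output.reshape(-1, 1)
    jacobian = np.diagflat(single_output) - np.dot(single_output, single_output.T)
    loop_dinputs[index] = np.dot(jacobian, single_dvalues)

# vectorized version: (diag(s) - s s^T) d  ==  s * (d - sum(s * d))
vec_dinputs = output * (dvalues - np.sum(dvalues * output, axis=1, keepdims=True))

print(np.allclose(loop_dinputs, vec_dinputs))
```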
        
       

# Loss functions

`Loss functions` (also known as **cost functions**) are the mechanism that tells our ML algorithms how wrong the predictions were.

Loss functions are different depending on whether you're doing **regression** or **classification**

Because neural networks output `confidence levels`, simply taking the **argmax** of an output vector will not suffice: the loss also has to reflect how confident each prediction was.


## Categorical Cross Entropy Loss

`Categorical Cross Entropy` (or **CCE**) is used to compare a `target probability distribution` ($y$) and some `predicted probability distribution` ($\hat{y}$).

Often used as loss when **softmax** is in the output layer.

<font size="5"> $L_i = -\sum_j y_{i,j} \log(\hat{y}_{i,j})$</font>

where:
- $L_i$ = sample loss value
- $i$ = $i$-th sample
- $j$ = label/output index
- $y$ = target values
- $\hat{y}$ = predicted values


The equation above, however, can be simplified to:

<font size="5"> $L_i = -\log(\hat{y}_{i,k})$</font>

where:
- $k$ = index of the "true" probability

We can simplify to this equation because the targets (in this case) are `sparse vectors` or `sparse matrices` (one-hot).

This means our targets will only ever be 1 or 0, allowing us to make simplifications.
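
To see that the two forms agree, the sketch below evaluates both on a small hand-made batch. The predictions and labels are invented for illustration.

```python
import numpy as np

# predicted distributions for 2 samples over 3 classes (made-up values)
y_pred = np.array([[0.7, 0.1, 0.2],
                   [0.1, 0.5, 0.4]])

# the same targets, once as one-hot rows and once as class indices
y_true_onehot = np.array([[1, 0, 0],
                          [0, 1, 0]])
y_true_index = np.array([0, 1])

# full form: -sum_j y_ij * log(y_hat_ij)
full = -np.sum(y_true_onehot * np.log(y_pred), axis=1)

# simplified form: -log(y_hat_ik), picking only the true-class probability
simplified = -np.log(y_pred[np.arange(len(y_pred)), y_true_index])

print(full)
print(simplified)
```

Every term with a target of 0 drops out of the sum, so only the log of the true-class probability remains.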

In [10]:
class Loss:
    
    def calculate(self, output, y):
        
        # calculate the sample losses
        sample_losses = self.forward(output,y)
        
        # calculate the mean loss
        data_loss = np.mean(sample_losses)
        
        return data_loss
    
class CategoricalCrossEntropy(Loss):
    
    def forward(self, y_pred, y_true):
        
        # number of samples in a batch
        samples = len(y_pred)
        
        # np.clip turns every value under a_min into a_min, and every value over a_max into a_max
        
        # clip data to prevent log(0), which would give an infinite loss
        # clip both sides to not drag mean towards any value
        y_pred_clipped = np.clip(y_pred, a_min=1e-7, a_max=1-1e-7)
        
        
        #! probabilities for target values:
        
        # only if categorical labels (1D array of class indices)
        if len(y_true.shape) == 1:
            correct_confidences = y_pred_clipped[range(samples), y_true]
        
        # only for one-hot matrices
        elif len(y_true.shape) == 2:
            correct_confidences = np.sum(y_pred_clipped * y_true, axis=1)
            
            
        # losses
        negative_log_likelihoods = -np.log(correct_confidences)
        return negative_log_likelihoods
    
    def backward(self, dvalues, y_true):
        
        samples = len(dvalues)
        labels = len(dvalues[0]) # number of labels in every sample.
        
        # if labels are sparse, make them one-hot
        if len(y_true.shape) == 1:
            y_true = np.eye(labels)[y_true]
            
        # calculate gradient
        self.dinputs = -y_true / dvalues
        
        # normalize gradient
        self.dinputs = self.dinputs / samples
        
class Softmax_with_CCE_loss():
    """Softmax combined with CCE in a single class. This enables a faster backward pass."""
    
    def __init__(self):
        self.softmax = Softmax()
        self.cce = CategoricalCrossEntropy()
        
    def forward(self, inputs, y_true):
        
        self.softmax.forward(inputs)
        
        self.output = self.softmax.output
        
        return self.cce.calculate(self.output, y_true)
    
    def backward(self, dvalues, y_true):
        
        samples = len(dvalues)
        
        if len(y_true.shape) == 2:
            y_true = np.argmax(y_true, axis=1)
            
        self.dinputs = dvalues.copy()
        
        self.dinputs[range(samples), y_true] -= 1
        
        self.dinputs = self.dinputs / samples
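
The combined backward pass works because chaining the CCE gradient through the softmax Jacobian collapses to "predicted probabilities minus one-hot targets". The sketch below checks this numerically on random data. It is self-contained rather than built on the classes above, and it uses a vectorized form of the softmax Jacobian product, which is an assumption of this example.

```python
import numpy as np

rng = np.random.default_rng(1)
logits = rng.normal(size=(4, 3))
y_true = np.array([0, 2, 1, 1])   # class indices, invented for the check
samples = len(logits)

# softmax forward
exp_values = np.exp(logits - logits.max(axis=1, keepdims=True))
probs = exp_values / exp_values.sum(axis=1, keepdims=True)

# separate path: CCE gradient first, then the softmax Jacobian product
one_hot = np.eye(3)[y_true]
dloss = -one_hot / probs / samples
chained = probs * (dloss - np.sum(dloss * probs, axis=1, keepdims=True))

# combined path: subtract 1 at the true class, then average over the batch
combined = probs.copy()
combined[np.arange(samples), y_true] -= 1
combined /= samples

print(np.allclose(chained, combined))
```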
        

# Optimization

After **backpropagation** calculates the gradients of all the functions in the network, the next step is for the `optimizer` to adjust the parameters (weights and biases) of the network.

The most commonly used, albeit possibly outdated, optimizer is called **`Stochastic Gradient Descent`**, or SGD.

Most optimizers used today are just **variants of SGD**

## Stochastic Gradient Descent

`SGD` is an optimizer that updates the parameters using the gradient computed from either a single sample or a batch of samples at a time.

In [28]:
class SGD:
    
    def __init__(self, lr = 1.0):
        self.lr = lr
        
    def update_params(self, layer):
        """Multiplies the negated lr with the gradients stored in the layers and adds the result to the layer's params
        
           We negate the lr because we want to go in the opposite direction of the gradient"""
        layer.weights += -self.lr * layer.dweights
        layer.biases += -self.lr * layer.dbiases
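
The same update rule can be watched on a one-parameter toy problem. The function `f(w) = (w - 3)^2`, the starting point, and the learning rate below are all assumptions chosen for illustration.

```python
# minimize f(w) = (w - 3)^2 with plain gradient descent
w = 0.0    # hypothetical starting value
lr = 0.1   # hypothetical learning rate

for step in range(100):
    grad = 2 * (w - 3.0)   # derivative of f at the current w
    w += -lr * grad        # step against the gradient

print(w)  # very close to the minimum at w = 3
```

Each step shrinks the distance to the minimum by a constant factor here; a learning rate that is too large would instead make the iterates overshoot and diverge.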
        
        

# Code execution
The following cells are for executing and testing the neural network code

-----

The cell below combines what we've done so far:

- Gather our data
- Instantiate two dense layers (one hidden layer and one output layer)
- Assign `ReLU` to **layer 1** and `softmax` to **layer 2**
- Feed our data through each layer
- Feed the softmax output into the CategoricalCrossEntropy loss function

In [33]:
X,y = spiral_data(samples=100, classes=3)

dense1 = DenseLayer(2,64)
relu = ReLU()

dense2 = DenseLayer(64,3)
softmax = Softmax()

loss_activation = Softmax_with_CCE_loss()

optimizer = SGD()



for epoch in range(10001):

    # layer 1 forward pass
    dense1.forward(X)
    relu.forward(dense1.output)

    # layer 2 forward pass
    dense2.forward(relu.output)

    # forward pass through second layer activation function
    # and through the loss function
    loss = loss_activation.forward(dense2.output, y)

    # calculate accuracy from output of softmax and targets
    predictions = np.argmax(loss_activation.output, axis=1)
    if len(y.shape) == 2:
        y = np.argmax(y, axis=1)
        
    accuracy = np.mean(predictions == y)

    if not epoch % 100: # prints every 100 epochs (every time epoch % 100 equals 0)
        print(f"Epoch: {epoch}\nAccuracy: {accuracy}\nLoss: {loss}")


    # backward pass (a.k.a backpropagation)
    loss_activation.backward(loss_activation.output, y)
    dense2.backward(loss_activation.dinputs)
    relu.backward(dense2.dinputs)
    dense1.backward(relu.dinputs)

    optimizer.update_params(dense1)
    optimizer.update_params(dense2)


Epoch: 0
Accuracy: 0.3433333333333333
Loss: 1.0986183881759644
Epoch: 100
Accuracy: 0.4066666666666667
Loss: 1.083395004272461
Epoch: 200
Accuracy: 0.39666666666666667
Loss: 1.0709291696548462
Epoch: 300
Accuracy: 0.41
Loss: 1.069575548171997
Epoch: 400
Accuracy: 0.41333333333333333
Loss: 1.0685349702835083
Epoch: 500
Accuracy: 0.41333333333333333
Loss: 1.0667901039123535
Epoch: 600
Accuracy: 0.41
Loss: 1.0637015104293823
Epoch: 700
Accuracy: 0.4266666666666667
Loss: 1.0575783252716064
Epoch: 800
Accuracy: 0.44666666666666666
Loss: 1.04678213596344
Epoch: 900
Accuracy: 0.4033333333333333
Loss: 1.0477620363235474
Epoch: 1000
Accuracy: 0.39666666666666667
Loss: 1.0402601957321167
Epoch: 1100
Accuracy: 0.42
Loss: 1.032832384109497
Epoch: 1200
Accuracy: 0.44666666666666666
Loss: 1.0229641199111938
Epoch: 1300
Accuracy: 0.47333333333333333
Loss: 1.0182338953018188
Epoch: 1400
Accuracy: 0.44
Loss: 1.016717553138733
Epoch: 1500
Accuracy: 0.4633333333333333
Loss: 1.0010089874267578
Epoch: 1600