# Neural Nets from Scratch

This notebook covers how to implement feed-forward neural networks from scratch, following the book of the same name: https://nnfs.io/


In [4]:
import nnfs
from nnfs.datasets import spiral_data
import numpy as np
import matplotlib.pyplot as plt
nnfs.init()

# Layers

When data forward-propagates through the network, it passes through successive `layers`.


## Dense layers

In **`Dense layers`**, each neuron computes the **weighted sum** of its `inputs` (the outputs of the previous layer) and its own `weights`, then adds a bias.

- Each input feature has one weight per neuron. So, for every neuron: `len(input) = len(weights)`

For each layer, the **input size must match the output size** of the previous layer.
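
As a minimal sketch of the weighted sum described above, the snippet below pushes a batch of 2 samples with 3 features each through a layer of 2 neurons. The input values, weights, and biases are made up for illustration; only the shapes matter.

```python
import numpy as np

# Hypothetical batch: 2 samples, 3 input features each
inputs = np.array([[1.0, 2.0, 3.0],
                   [0.5, -1.0, 2.0]])

# Weights stored as (n_inputs, n_neurons) = (3, 2): one column per neuron
weights = np.array([[0.2, -0.5],
                    [0.8, 0.1],
                    [-0.3, 0.4]])

# One bias per neuron, shaped (1, 2) so it broadcasts over the batch
biases = np.array([[1.0, -1.0]])

# Weighted sum for every sample and every neuron at once
output = np.dot(inputs, weights) + biases

print(output.shape)  # one row per sample, one column per neuron
print(output)
```

The number of rows in `weights` has to equal the number of features per sample, otherwise `np.dot` raises a shape error.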

In [5]:
class DenseLayer:
    
    def __init__(self, n_inputs, n_neurons):
        
        # initialize random weights. 
        # The shape is (input size, # of neurons) instead of (# of neurons, input size) to avoid having to transpose later 
        self.weights = 0.01 * np.random.randn(n_inputs, n_neurons) # 0.01 scales the weights down for faster training
        
        # initialize biases as zero vectors
        self.biases = np.zeros((1, n_neurons))
         
    def forward(self, inputs):
        # weighted sum
        self.output = np.dot(inputs, self.weights) + self.biases

# Activation functions

`Activation functions` are used to apply transformations (mostly non-linear) to our input data.

Most activations are `non-linear` because we want to solve `non-linear` problems.

## Sigmoid

`Sigmoid` is known as the **squashification** function because it squashes all values into the range (0, 1).

Today, `sigmoid` is rarely used in hidden layers, but it can still be found in the output layer for classification tasks.

<font size="5"> $y = \frac{1}{1+ e^{-x}}$ </font>
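
The notebook does not implement sigmoid, so the following is only a minimal sketch of the formula above. The helper name `sigmoid` and the sample inputs are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    # Elementwise logistic function: 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical inputs: a negative value, zero, and a positive value
x = np.array([-5.0, 0.0, 5.0])
y = sigmoid(x)

print(y)  # every value lies strictly between 0 and 1, and sigmoid(0) is 0.5
```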


## ReLU

`ReLU` is super simple: it takes the **max** between 0 and the input data.

`ReLU` is the **most widely used activation** in hidden layers, mainly because of its speed and efficiency

$$ y =   \left\{
\begin{array}{ll}
      x & x>0 \\
      0 & x \le 0\\
\end{array} 
\right.  $$


## Softmax

`Softmax`, like sigmoid, squishes values into [0,1]. However, it outputs a `probability distribution`, meaning all the output values sum to 1

The output values can also be thought of as **`confidence scores`**:

- `Higher probability` = higher confidence the model has 
- `Lower probability` = lower confidence the model has 

<font size="5">$s_{i,j} = \frac{e^{z_{i,j}}}{\sum_{l=1}^L e^{z_{i,l}}}$</font>


In [6]:
class ReLU:
    """Some notes about ReLU:
    1. ReLU is not normalized, meaning values can range over [0, infinity)
    2. ReLU outputs are completely independent of each other (exclusive)
    3. Because of the two reasons above, ReLU cannot be used in the final layer for predicting probabilities (classification)
    4. np.maximum takes the element wise max between two arrays
    """
    
    def forward(self, inputs):
        
        # any negative values are turned into 0
        self.output = np.maximum(0, inputs)
       
class Softmax:
    """Some notes about softmax:
    1. Softmax returns a probability distribution (all the floats add up to 1)
    2. Each probability score also represents a confidence score (i.e., [.45, .55] means the model has low confidence)
    3. Softmax is almost exclusively used in the output layer
    """
    
    def forward(self, inputs):
        
        """More notes:
        1. axis=1 applies the operation along each row, i.e. separately for every sample
        
        2. keepdims=True keeps the reduced axis with size 1, so the result broadcasts against the inputs
        
        3. we subtract the largest input in each row to prevent exploding values.
            - exploding values = np.exp overflows to inf when its input gets large
            - the softmax output is unchanged, because the shift cancels out in the ratio
            
        4. performing this subtraction shifts the inputs into the range (-inf, 0], so the exponentiated values lie in (0, 1]
        
        """
        
        # get unnormalized probabilities
        exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
        
        # normalize them for each sample
        probs = exp_values / np.sum(exp_values, axis=1, keepdims=True)
        
        self.output = probs
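
To see why the maximum is subtracted before exponentiating, here is a small sketch comparing a naive softmax with the shifted version used in `Softmax.forward`. The logits are hypothetical and deliberately large so that the naive version overflows.

```python
import numpy as np

# Hypothetical logits for one sample; large values chosen to force overflow
logits = np.array([[1000.0, 1001.0, 1002.0]])

# Naive softmax: np.exp(1000) overflows to inf, and inf / inf gives nan
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(logits) / np.sum(np.exp(logits), axis=1, keepdims=True)

# Shifted softmax: the largest input becomes 0, so every exponent is <= 0
shifted = logits - np.max(logits, axis=1, keepdims=True)
stable = np.exp(shifted) / np.sum(np.exp(shifted), axis=1, keepdims=True)

print(naive)   # all nan
print(stable)  # valid probabilities that sum to 1
```

The shifted result is the same distribution the naive formula would give if it did not overflow, since multiplying numerator and denominator by the same constant leaves the ratio unchanged.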

# Loss functions

`Loss functions` (also known as **cost functions**) are the mechanism that tells our ML algorithms how wrong the predictions were.

Loss functions are different depending on whether you're doing **regression** or **classification**

Because neural networks output `confidence levels`, simply taking the **argmax** of an output vector will not suffice: the loss should also reflect how confident the model was in the correct class.


## Categorical Cross Entropy Loss

`Categorical Cross Entropy` (or **CCE**) is used to compare a `target probability distribution` ($y$) and some `predicted probability distribution` ($\hat{y}$).

Often used as loss when **softmax** is in the output layer.

<font size="5"> $L_i = -\sum_j y_{i,j} \log(\hat{y}_{i,j})$</font>

where:
- $L_i$ = sample loss value
- $i$ = $i$-th sample
- $j$ = label/output index
- $y$ = target values
- $\hat{y}$ = predicted values


The equation above, however, can be simplified to:

<font size="5"> $L_i = -\log(\hat{y}_{i,k})$</font>

where:
- $k$ = index of the "true" probability

We can simplify to this equation because the targets (in this case) are `one-hot` vectors or matrices.

This means each target value is either 1 or 0, so every term in the sum vanishes except the one for the true class.
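
A quick numerical check of that simplification, using a hypothetical softmax output and one-hot target for a single sample:

```python
import numpy as np

# Hypothetical softmax output for a single sample with 3 classes
y_pred = np.array([0.7, 0.1, 0.2])

# One-hot target: the true class is index 0
y_true = np.array([1, 0, 0])

# Full form: -sum_j y_j * log(y_hat_j)
full_loss = -np.sum(y_true * np.log(y_pred))

# Simplified form: -log(y_hat_k), where k is the index of the true class
k = np.argmax(y_true)
simple_loss = -np.log(y_pred[k])

print(full_loss, simple_loss)  # both forms give the same value
```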

In [3]:
class Loss:
    
    def calculate(self, output, y):
        
        # calculate the sample losses
        sample_losses = self.forward(output,y)
        
        # calculate the mean loss
        data_loss = np.mean(sample_losses)
        
        return data_loss
    
class CategoricalCrossEntropy(Loss):
    
    def forward(self, y_pred, y_true):
        
        # number of samples in a batch
        samples = len(y_pred)
        
        # np.clip will turn every value under the minimum into a_min, and every value over the maximum to a_max
        
        # clip data to prevent taking log(0), which is infinite
        # clip both sides to not drag mean towards any value
        y_pred_clipped = np.clip(y_pred, a_min=1e-7, a_max=1-1e-7)
        
        
        #! probabilities for target values:
        
        # only if categorical labels (1D array of class indices)
        if len(y_true.shape) == 1:
            correct_confidences = y_pred_clipped[range(samples), y_true]
        
        # only for one-hot matrices
        elif len(y_true.shape) == 2:
            correct_confidences = np.sum(y_pred_clipped * y_true, axis=1)
            
            
        # losses
        negative_log_likelihoods = -np.log(correct_confidences)
        return negative_log_likelihoods
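
The effect of the clipping step can be sketched in isolation. The confidences below are hypothetical; the clipping bounds are the same ones used in `CategoricalCrossEntropy.forward` above.

```python
import numpy as np

# Hypothetical predicted confidences for the correct class of three samples
confidences = np.array([0.0, 0.5, 1.0])

# Without clipping, -log(0) is infinite and would make the mean loss infinite
with np.errstate(divide="ignore"):
    raw_losses = -np.log(confidences)

# Same clipping bounds as in CategoricalCrossEntropy.forward
clipped = np.clip(confidences, 1e-7, 1 - 1e-7)
clipped_losses = -np.log(clipped)

print(raw_losses)      # first entry is inf
print(clipped_losses)  # every entry is finite
```

With clipping, a confidence of exactly 0 costs about 16.12 instead of infinity, so a single bad sample can no longer turn the batch mean into `inf`.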

# Code execution

The following cells are for executing and testing the neural network code

In [17]:
X,y = spiral_data(samples=100, classes=3)

# layer one: two input features, 3 neurons
dense1 = DenseLayer(2,3)
relu = ReLU()

# create second Dense layer as the output layer with 3 input features (output of previous layer) and 3 output values
dense2 = DenseLayer(3,3)
softmax = Softmax()

# instantiate our loss function
cce_loss = CategoricalCrossEntropy()

# training data forward pass through Dense layer and forward pass through ReLU
dense1.forward(X)
relu.forward(dense1.output)

# make a forward pass through the second Dense layer
# takes the output of previous layer as input
dense2.forward(relu.output)
softmax.forward(dense2.output)

print("Softmax output:\n",softmax.output[:5])

loss = cce_loss.calculate(softmax.output, y)

print(f"Loss: {loss}")

Softmax output:
 [[0.33333334 0.33333334 0.33333334]
 [0.33333308 0.33333337 0.33333355]
 [0.33333322 0.33333355 0.33333322]
 [0.33333248 0.33333322 0.33333427]
 [0.33333248 0.33333355 0.33333403]]
Loss: 1.0986123085021973
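
The loss printed above is very close to ln 3, which is what categorical cross entropy gives when a model assigns roughly 1/3 to each of 3 classes. With weights initialized near zero and no training yet, that is the expected starting point. A short check, assuming an exactly uniform prediction:

```python
import numpy as np

n_classes = 3

# Loss of a model that predicts the uniform distribution over n_classes
uniform_loss = -np.log(1.0 / n_classes)

print(uniform_loss)  # approximately 1.0986
```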


In [16]:

# ! Accuracy calculation
# !we can calculate the accuracy of our model from the softmax output and targets

# index of the highest confidence for each sample (argmax along the second axis)
predictions = np.argmax(softmax.output, axis=1)

# if targets are one-hot matrices, convert them
if len(y.shape) == 2:
    y = np.argmax(y, axis=1)

accuracy = np.mean(predictions == y)

print(f"Accuracy: {accuracy}")

Accuracy: 0.3433333333333333
