# Fully Connected Neural Network (FCNN) Trained on CIFAR-10

## CIFAR-10 Dataset Overview:

- Number of Classes: 10 (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck)
- Number of Training Samples: 50,000
- Number of Test Samples: 10,000
- Image Size: 32x32 pixels (3 color channels)

## Step 1: Load CIFAR-10 Dataset

To download and load the CIFAR-10 dataset, we'll fetch the archive with the `urllib` module, extract it with `tarfile`, and read the pickled batches into NumPy arrays.

In [1]:
import os
import pickle
import shutil
import tarfile
import urllib.request

import numpy as np

# Download and extract the CIFAR-10 dataset
def download_and_extract_cifar10():
    url = "https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz"
    file_path = "cifar-10-python.tar.gz"
    
    if not os.path.exists(file_path):
        print("Downloading CIFAR-10 dataset...")
        urllib.request.urlretrieve(url, file_path)
    
    if not os.path.exists('cifar-10-batches-py'):
        print("Extracting CIFAR-10 dataset...")
        with tarfile.open(file_path) as tar:
            tar.extractall()

download_and_extract_cifar10()

# Function to load CIFAR-10 data
def load_cifar10_batch(batch_id):
    with open(f'cifar-10-batches-py/data_batch_{batch_id}', mode='rb') as file:
        batch = pickle.load(file, encoding='latin1')
    features = batch['data'].reshape((len(batch['data']), 3, 32, 32)).transpose(0, 2, 3, 1)
    labels = np.array(batch['labels'])
    return features, labels

# Load all the training batches
X_train = []
y_train = []

for i in range(1, 6):
    features, labels = load_cifar10_batch(i)
    X_train.append(features)
    y_train.append(labels)

X_train = np.concatenate(X_train)
y_train = np.concatenate(y_train)

# Load the test batch
def load_cifar10_test():
    with open('cifar-10-batches-py/test_batch', mode='rb') as file:
        batch = pickle.load(file, encoding='latin1')
    features = batch['data'].reshape((len(batch['data']), 3, 32, 32)).transpose(0, 2, 3, 1)
    labels = np.array(batch['labels'])
    return features, labels

X_test, y_test = load_cifar10_test()

# Check the shapes
print(f"Training data shape: {X_train.shape}")
print(f"Test data shape: {X_test.shape}")

Training data shape: (50000, 32, 32, 3)
Test data shape: (10000, 32, 32, 3)


## Step 2: Preprocess the Data

Before feeding the data into the neural network, we'll normalize the pixel values (from 0-255 to 0-1) and flatten the image data.

In [2]:
# Normalize the images
X_train = X_train / 255.0
X_test = X_test / 255.0

# Flatten the images into vectors of 3072 elements (32*32*3)
X_train = X_train.reshape(X_train.shape[0], -1)
X_test = X_test.reshape(X_test.shape[0], -1)

# One-hot encode the labels
def one_hot_encode(y, num_classes):
    one_hot = np.zeros((y.size, num_classes))
    one_hot[np.arange(y.size), y] = 1
    return one_hot

y_train = one_hot_encode(y_train, 10)
y_test = one_hot_encode(y_test, 10)
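
As a quick sanity check, the one-hot encoding can be exercised on a tiny label vector. The sketch below redefines `one_hot_encode` so that it runs on its own; the labels in `toy_labels` are made up for illustration and are not taken from CIFAR-10.

```python
import numpy as np

def one_hot_encode(y, num_classes):
    # Same logic as the notebook cell: one row per label, a single 1 per row
    one_hot = np.zeros((y.size, num_classes))
    one_hot[np.arange(y.size), y] = 1
    return one_hot

# Hypothetical labels, chosen only for illustration
toy_labels = np.array([0, 3, 9])
toy_one_hot = one_hot_encode(toy_labels, 10)

print(toy_one_hot.shape)                        # (3, 10)
print(np.argmax(toy_one_hot, axis=1).tolist())  # [0, 3, 9]
```

Taking the `argmax` of each row recovers the original integer label, which is how `y_test` is converted back to class indices when accuracy is computed.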

## Simple Fully Connected Neural Network for CIFAR-10

### 1. Network Architecture
- **Input Layer**: Each CIFAR-10 image is 32x32 pixels with 3 color channels (RGB), so the input has 3072 features (32 × 32 × 3).
- **Hidden Layer**: There is a single hidden layer with 64 neurons. The first network uses the **sigmoid** activation function; the second network uses **ReLU**.
- **Output Layer**: The output layer has 10 neurons, one for each class in CIFAR-10. We use the **softmax** activation function to convert the output into class probabilities.

### 2. Forward Pass
- The input data (image pixels) is passed through the network:
  - **Input to Hidden Layer**: Each neuron in the hidden layer computes a weighted sum of its inputs, adds a bias, and applies the **sigmoid activation function**. With one flattened image per row of $X$:
    $$
    Z_1 = X \cdot W_1 + b_1
    $$
    $$
    A_1 = \sigma(Z_1) = \frac{1}{1 + e^{-Z_1}}
    $$
  - **Hidden to Output Layer**: The output layer computes a weighted sum of the hidden layer outputs, adds biases, and applies the **softmax function** across the classes $c$:
    $$
    Z_2 = A_1 \cdot W_2 + b_2
    $$
    $$
    A_{2,c} = \text{softmax}(Z_2)_c = \frac{e^{Z_{2,c}}}{\sum_{j} e^{Z_{2,j}}}
    $$

### 3. Prediction
- The output is a probability distribution over the 10 classes. The class with the highest probability is the predicted class.

### 4. Loss Function
- We use **cross-entropy loss** to measure the difference between predicted probabilities and actual class labels:
  $$
  \text{Loss} = - \frac{1}{m} \sum_{i=1}^{m} \sum_{c=1}^{C} y_{i,c} \cdot \log(\hat{y}_{i,c})
  $$
  where:
  - $ m $ is the number of training samples,
  - $ C $ is the number of classes (10 for CIFAR-10),
  - $ y_{i,c} $ is the true label for sample $ i $ and class $ c $,
  - $ \hat{y}_{i,c} $ is the predicted probability for sample $ i $ and class $ c $.

### 5. Backpropagation (Gradient Descent)
- The model computes gradients (errors) for both the hidden layer and output layer, and updates the weights using **gradient descent**:
  - For the weights and biases between the hidden and output layers:
    $$
    W_2 = W_2 - \text{learning rate} \times \frac{\partial \text{Loss}}{\partial W_2}
    $$
    $$
    b_2 = b_2 - \text{learning rate} \times \frac{\partial \text{Loss}}{\partial b_2}
    $$
  - For the weights and biases between the input and hidden layers:
    $$
    W_1 = W_1 - \text{learning rate} \times \frac{\partial \text{Loss}}{\partial W_1}
    $$
    $$
    b_1 = b_1 - \text{learning rate} \times \frac{\partial \text{Loss}}{\partial b_1}
    $$

### 6. Training
- The network is trained over multiple **epochs**. In each epoch:
  1. Perform a **forward pass** to compute the output.
  2. Compute the **loss** using cross-entropy.
  3. Perform **backpropagation** to adjust the weights and biases to minimize the loss.

### 7. Evaluation (Accuracy)
- After training, the model is tested on the test set. **Accuracy** is calculated as the percentage of correctly classified images:
  $$
  \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Test Images}} \times 100
  $$
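
The prediction and accuracy steps can be sketched in a few lines of NumPy. The probabilities and labels below are made-up numbers for a 3-class toy problem, used only to show the mechanics of `argmax` followed by a comparison with the true labels.

```python
import numpy as np

# Hypothetical softmax outputs for 4 samples and 3 classes (made-up numbers)
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
])
true_labels = np.array([0, 1, 2, 1])

predicted = np.argmax(probs, axis=1)               # class with the highest probability
accuracy = np.mean(predicted == true_labels) * 100

print(predicted.tolist())  # [0, 1, 2, 0]
print(accuracy)            # 75.0
```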

# Simple Neural Network with Sigmoid

$$
\begin{array}{ccccc}
\text{Input Layer} & \longrightarrow & \text{Hidden Layer} & \longrightarrow & \text{Output Layer} \\
\left(3072\right)  &                  & \left(64\right)     &                  & \left(10\right) \\
\end{array}
$$


---

# Step 1: Initialization (`__init__`)

In this step, we need to initialize:

**Weights**: These connect the neurons between the input layer, hidden layer, and output layer. You’ll need two weight matrices:
- `W1`: connects the input layer to the hidden layer (dimensions: `input_size x hidden_size`).
- `W2`: connects the hidden layer to the output layer (dimensions: `hidden_size x output_size`).

**Biases**: Bias vectors for each layer:
- `b1`: for the hidden layer (dimensions: `1 x hidden_size`).
- `b2`: for the output layer (dimensions: `1 x output_size`).

The weights are initialized with small random values, typically scaled by a small factor (e.g., `* 0.01`). The randomness matters: if all weights were set to zero (or to any common value), every hidden neuron would receive the same gradient and they would all learn the same function. The small scale matters too: weights that start too large push the sigmoid into its flat regions, where gradients are close to zero and learning is slow.

The biases are typically initialized to zeros.
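
A minimal sketch of this initialization, using the layer sizes from this notebook. It uses NumPy's `default_rng` with an arbitrary seed for reproducibility, which is an assumption of this example; the network class in this notebook calls `np.random.randn` instead.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, used only for reproducibility

input_size, hidden_size, output_size = 3072, 64, 10

W1 = rng.standard_normal((input_size, hidden_size)) * 0.01   # small random weights
b1 = np.zeros((1, hidden_size))                              # zero biases
W2 = rng.standard_normal((hidden_size, output_size)) * 0.01
b2 = np.zeros((1, output_size))

print(W1.shape, b1.shape, W2.shape, b2.shape)
print(W1.std())  # close to 0.01, the chosen scale
```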

## Why Initialize Like This?

If the initial weights are too large, the pre-activations become large in magnitude and the sigmoid saturates, so its derivative (and with it the gradient) becomes very small. In deeper networks, large weights can also make gradients grow uncontrollably as they are multiplied layer by layer. These are the vanishing and exploding gradient problems. Scaling the random initialization reduces the likelihood of either, keeping gradients at reasonable values in early training.

## Why Initialize Bias to Zero?

Role of Bias: The bias term allows the activation of neurons to shift. It acts as an additional parameter that enables the model to fit the data more flexibly. Mathematically, a neuron computes:

$ z = W \cdot X + b $

Here, `b` is the bias. Without the bias, the neuron would always pass through the origin (0,0), limiting its ability to fit data effectively. Bias helps the neuron to make decisions independent of the input by introducing a constant shift.

### Why Zero Initialization for Bias:

Bias does not affect symmetry breaking: Unlike weights, biases don’t impact the symmetry breaking during learning. Symmetry breaking refers to how each neuron needs to learn a unique function, which requires weights to be initialized randomly so that different neurons start off differently. Since biases don't control interactions between neurons but rather shift the activations, initializing them to zero doesn’t affect this.

Gradients for Biases: During backpropagation, the gradient for a bias is simply that layer's error term summed (or averaged) over the samples in the batch. It does not depend on the bias's starting value, so initializing it to zero doesn’t cause issues like symmetry lock (which can happen with zero-initialized weights).

Alternatives to Zero Initialization: While initializing biases to zero is a common practice and works well in most cases, sometimes small random values are used, especially for deep networks. The idea is that starting with a slight bias may speed up convergence for some networks, but zero is the default for simplicity and effectiveness in most networks.

In summary, biases are initialized to zero because:

- They don’t break symmetry, so initializing them to zero is safe.
- They are updated via gradient descent like weights, and zero initialization doesn’t cause any learning issues.
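
The symmetry argument can be checked numerically. The sketch below uses a tiny made-up network (4 inputs, 6 hidden neurons, 3 classes, random stand-in data) and a hypothetical helper `hidden_weight_gradient`. When every hidden neuron starts with the same weights, the gradient columns for all hidden neurons are identical, so the neurons can never diverge; with small random weights they differ. Biases are zero in both cases.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((5, 4))                       # 5 made-up samples, 4 features
y = np.eye(3)[rng.integers(0, 3, size=5)]    # random one-hot labels, 3 classes

def hidden_weight_gradient(W1, W2):
    # Forward pass with zero biases
    A1 = 1 / (1 + np.exp(-(X @ W1)))
    Z2 = A1 @ W2
    exp_z = np.exp(Z2 - Z2.max(axis=1, keepdims=True))
    A2 = exp_z / exp_z.sum(axis=1, keepdims=True)
    # Backward pass down to the gradient of W1
    error_output = A2 - y
    error_hidden = (error_output @ W2.T) * A1 * (1 - A1)
    return X.T @ error_hidden / X.shape[0]

# Symmetric start: identical columns in W1 and identical rows in W2,
# so every hidden neuron is an exact copy of the others
W1_same = np.full((4, 6), 0.5)
W2_same = np.tile([0.1, 0.2, 0.3], (6, 1))
dW1_same = hidden_weight_gradient(W1_same, W2_same)

# Small random start: each hidden neuron begins differently
dW1_rand = hidden_weight_gradient(rng.standard_normal((4, 6)) * 0.01,
                                  rng.standard_normal((6, 3)) * 0.01)

print(np.allclose(dW1_same, dW1_same[:, :1]))  # True: all columns identical
print(np.allclose(dW1_rand, dW1_rand[:, :1]))  # False: columns differ
```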

## Weights as Matrices

In a fully connected neural network, each neuron in one layer is connected to every neuron in the next layer. The weights determine the strength of these connections.

For this network, there are two weight matrices:

## Input Layer to Hidden Layer Weights (`W1`):
- There are `input_size` neurons in the input layer (in this case, 3072 neurons for the 32x32 RGB images).
- There are `hidden_size` neurons in the hidden layer (in this case, 64 neurons).

Therefore, the weight matrix `W1` connects these two layers, and its dimensions are:
$ W_1 \in \mathbb{R}^{\text{input\_size} \times \text{hidden\_size}} $

In this case: $ W_1 \in \mathbb{R}^{3072 \times 64} $

Each element of this matrix corresponds to the weight of the connection between one input neuron and one hidden layer neuron.

## Hidden Layer to Output Layer Weights (`W2`):
- `hidden_size` neurons in the hidden layer (64 neurons).
- `output_size` neurons in the output layer (10 neurons, since you’re doing classification into 10 categories).

The weight matrix `W2` connects these two layers, and its dimensions are:
$ W_2 \in \mathbb{R}^{\text{hidden\_size} \times \text{output\_size}} $

In this case: $ W_2 \in \mathbb{R}^{64 \times 10} $

Each row of the weight matrix corresponds to a different neuron in the previous layer, and each column corresponds to a different neuron in the next layer.

## Why Matrices?

Matrix multiplication allows us to compute the output of all neurons in the next layer simultaneously. When we multiply the input vector by the weight matrix, we effectively compute the weighted sum of all inputs for every neuron in the next layer at once.

So in summary, the weights are matrices where:

- W1 connects the input layer to the hidden layer (3072 x 64).
- W2 connects the hidden layer to the output layer (64 x 10).
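
To illustrate why the matrix form is convenient, the sketch below computes the hidden layer's weighted sums twice for one input: once neuron by neuron, and once with a single matrix product. The input is a random stand-in for a flattened image, and the bias is left out to keep the comparison short.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((1, 3072))                      # one flattened image (random stand-in)
W1 = rng.standard_normal((3072, 64)) * 0.01

# One neuron at a time: dot the input with that neuron's column of W1
z_loop = np.array([[x[0] @ W1[:, j] for j in range(64)]])

# All 64 neurons at once
z_matmul = x @ W1

print(z_matmul.shape)                 # (1, 64)
print(np.allclose(z_loop, z_matmul))  # True
```
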

## Why 3072 Neurons in First Layer

Step-by-step breakdown:

### Image Size (32x32):
- The image size is 32×32 pixels.
- For a grayscale image, each pixel would be represented by a single intensity value (one channel).

###  RGB Channels:
- For an RGB image, each pixel is represented by three color channels (Red, Green, Blue), which means each pixel has three values.
- So, for each pixel, there are three values: one for red, one for green, and one for blue.

### Total Number of Inputs:
- The total number of pixels in a 32x32 image is 32 × 32 = 1024 pixels.
- Since each pixel has 3 color channels (RGB), the total number of input values (features) is 1024 × 3 = 3072.

This means the input layer of the neural network has 3072 neurons, one for each input value (R, G, and B for each pixel).

### Why Flatten?

Most basic neural networks (before convolutional layers) take input as a 1D vector, so the 32x32x3 image is flattened into a 1D array with 3072 elements. Each element in this array corresponds to the intensity of one color channel in one pixel.
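
The flattening step can be sketched on a small stand-in batch with the same `(N, 32, 32, 3)` layout as `X_train`. The array contents are just consecutive integers, chosen so that positions are easy to check.

```python
import numpy as np

# Stand-in batch of 2 images in the layout (N, height, width, channels)
images = np.arange(2 * 32 * 32 * 3).reshape(2, 32, 32, 3)

flat = images.reshape(images.shape[0], -1)
print(flat.shape)  # (2, 3072)

# With NumPy's default row-major order, the R, G, B values of a pixel stay adjacent:
# element (row, col, channel) lands at index (row * 32 + col) * 3 + channel.
row, col, channel = 5, 7, 2
flat_index = (row * 32 + col) * 3 + channel
print(flat[0, flat_index] == images[0, row, col, channel])  # True
```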

# Sigmoid Activation Function

The sigmoid function is a smooth, differentiable function that outputs values between 0 and 1. It's commonly used in the output layer for binary classification, but it can also be used in hidden layers for learning non-linear patterns.

## Mathematical Definition:
The sigmoid function is defined as:
$$
\text{Sigmoid}(z) = \frac{1}{1 + e^{-z}}
$$
Where:
- $ z $ is the input to the neuron (i.e., the weighted sum of inputs plus bias).
- $ e $ is the base of the natural logarithm.

The sigmoid function "squashes" the input to a range between 0 and 1, making it ideal for probability-based outputs (for instance, in binary classification problems).

## Derivative of the Sigmoid Function:
The derivative of the sigmoid function (used in backpropagation) is:
$$
\text{Sigmoid}'(z) = \sigma(z) \cdot (1 - \sigma(z))
$$
Where $ \sigma(z) $ is the sigmoid of $ z $. This derivative will be useful during the backward pass to update the weights.
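
The derivative formula can be verified numerically with a central finite difference. The helper name `sigmoid_derivative_from_z` is hypothetical and is written in terms of the pre-activation $ z $, to keep the comparison with the formula direct.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative_from_z(z):
    # Derivative expressed through the pre-activation z
    s = sigmoid(z)
    return s * (1 - s)

z = np.linspace(-5, 5, 11)
h = 1e-5
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)   # central difference

print(np.allclose(numeric, sigmoid_derivative_from_z(z), atol=1e-8))  # True
print(sigmoid_derivative_from_z(0.0))                                 # 0.25
```

The derivative peaks at 0.25 for $ z = 0 $ and shrinks toward zero for large positive or negative inputs, which is the source of the vanishing gradient problem mentioned for sigmoid.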

## When to Use Sigmoid:
- **Hidden Layers**: Sigmoid can be used in hidden layers, but ReLU has largely replaced it due to the **vanishing gradient problem**.
- **Output Layer**: Sigmoid is commonly used in the output layer when performing **binary classification** (i.e., when predicting probabilities of two classes).


# Softmax Function

The softmax function is used in the output layer of a neural network for **multi-class classification**. It converts raw scores (logits) into probabilities, ensuring that the sum of the output probabilities equals 1. This is useful when predicting which class an input belongs to.

## Mathematical Definition:
The softmax function for a given input $ z_i $ is defined as:

$$
\text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}
$$

Where:
- $ z_i $ is the input to the softmax function (logit) for class $ i $.
- $ e $ is the base of the natural logarithm.
- The denominator $ \sum_{j} e^{z_j} $ is the sum of the exponentials of all the logits, ensuring the outputs sum to 1.

## Explanation:
- The softmax function transforms the logits into a probability distribution, where each element represents the probability of the input belonging to a particular class.
- The exponentiation $ e^{z_i} $ ensures that all values are positive.
- Dividing by the sum of all exponentials normalizes the values, so that they sum to 1.

## Example:
If there are three classes and the logits are $ z = [1.0, 2.0, 0.5] $, applying the softmax function gives probabilities of approximately $ [0.231, 0.629, 0.140] $, which sum to 1.

## Adjustment for Numerical Stability:

When working with the softmax function, exponentiating large values in `z` can lead to overflow, causing numerical issues. To avoid this, before calculating the exponentials, subtract the maximum value in `z` from every element in `z`. This won’t affect the output of the softmax function but will prevent potential overflow.
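
A minimal sketch of the stabilized softmax, operating row-wise on a 2D array of logits. The logits are made up for illustration. Adding 1000 to every logit would overflow a naive `np.exp`, but with the maximum subtracted first the result is unchanged.

```python
import numpy as np

def softmax(z):
    # Subtract the row-wise maximum before exponentiating; the result is unchanged
    exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)

logits = np.array([[2.0, 1.0, 0.1]])       # made-up logits for 3 classes
probs = softmax(logits)
print(np.round(probs, 3).tolist())          # approximately [[0.659, 0.242, 0.099]]

# Shifting every logit by the same constant leaves the probabilities unchanged,
# and very large logits no longer overflow.
shifted = softmax(logits + 1000.0)
print(np.allclose(probs, shifted))          # True
```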


# Forward Pass 

The forward pass involves calculating the activations for each layer of the neural network. Here's the step-by-step breakdown:

1. **Weighted Input to the Hidden Layer**:
   $$
   Z_1 = X \cdot W_1 + b_1
   $$
   Where:
   - $ X $ is the input to the network.
   - $ W_1 $ is the weight matrix connecting the input layer to the hidden layer.
   - $ b_1 $ is the bias vector for the hidden layer.
   - $ Z_1 $ is the linear combination of inputs, weights, and biases at the hidden layer.

2. **Activation of the Hidden Layer** (using the sigmoid function):
   $$
   A_1 = \text{Sigmoid}(Z_1) = \frac{1}{1 + e^{-Z_1}}
   $$
   Where $ A_1 $ is the activation of the hidden layer after applying the sigmoid activation function.

3. **Weighted Input to the Output Layer**:
   $$
   Z_2 = A_1 \cdot W_2 + b_2
   $$
   Where:
   - $ A_1 $ is the activation from the hidden layer.
   - $ W_2 $ is the weight matrix connecting the hidden layer to the output layer.
   - $ b_2 $ is the bias vector for the output layer.
   - $ Z_2 $ is the linear combination of inputs, weights, and biases at the output layer.

4. **Activation of the Output Layer** (using the softmax function):
   $$
   A_2 = \text{Softmax}(Z_2) 
   $$
   Where $ A_2 $ is the final output of the network, after applying the softmax activation function. This gives the final output probabilities, which will sum to 1.

---

This forward pass computes the activations of each layer and ultimately the output of the neural network. The sigmoid function introduces non-linearity in the hidden layer, and the softmax function turns the output layer's scores into probabilities.
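
The four steps can be sketched in NumPy as follows. The input is a random stand-in for a batch of 8 flattened, normalized images, and the weights are freshly initialized, so the output probabilities carry no meaning; the point is the shape of each intermediate result.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(z):
    exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)

# Random stand-in for a batch of 8 flattened, normalized images
X = rng.random((8, 3072))
W1 = rng.standard_normal((3072, 64)) * 0.01
b1 = np.zeros((1, 64))
W2 = rng.standard_normal((64, 10)) * 0.01
b2 = np.zeros((1, 10))

Z1 = X @ W1 + b1      # (8, 64)  weighted input to the hidden layer
A1 = sigmoid(Z1)      # (8, 64)  hidden activations
Z2 = A1 @ W2 + b2     # (8, 10)  weighted input to the output layer
A2 = softmax(Z2)      # (8, 10)  class probabilities

print(Z1.shape, A1.shape, Z2.shape, A2.shape)
print(np.allclose(A2.sum(axis=1), 1.0))  # True: each row sums to 1
```

The bias `b1` has shape `(1, 64)` and is added to every row of `X @ W1` through NumPy broadcasting.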


# Backward Pass Implementation

The backward pass is responsible for calculating the gradients of the loss with respect to the network's weights and biases and updating them to minimize the loss using backpropagation. Here's the step-by-step breakdown:

1. **Output Layer Error**:
   - Compute the error at the output layer, which is the difference between the predicted output (`A2`, stored as `output`) and the true labels (`y`):
   $$
   \text{error\_output} = A2 - y
   $$
   This gives the error for each class in the final output.

2. **Gradient for `W2` and `b2` (Output Layer Weights and Biases)**:
   - The gradient of the loss with respect to the weights connecting the hidden layer to the output layer (`W2`) is computed as:
   $$
   dW2 = A1^T \cdot \text{error\_output}
   $$
   - The gradient of the loss with respect to the bias for the output layer (`b2`) is:
   $$
   db2 = \sum \text{error\_output}
   $$
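   These gradients will be used to update the weights and biases for the output layer. In the code, each gradient is also divided by the batch size `m`, as explained in the mathematical breakdown that follows.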

3. **Hidden Layer Error**:
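   - Backpropagate the error to the hidden layer using the weights of the second layer (`W2`) and the derivative of the activation function (sigmoid in this case), applied element-wise ($ \odot $):
   $$
   \text{error\_hidden} = (\text{error\_output} \cdot W2^T) \odot \text{sigmoid\_derivative}(Z1)
   $$
   This computes the error for the hidden layer neurons. Because $ \sigma'(Z1) = A1 \odot (1 - A1) $, the code evaluates this derivative from the stored activations `A1` rather than from `Z1`.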

4. **Gradient for `W1` and `b1` (Hidden Layer Weights and Biases)**:
   - The gradient of the loss with respect to the weights connecting the input to the hidden layer (`W1`) is:
   $$
   dW1 = X^T \cdot \text{error\_hidden}
   $$
   - The gradient of the loss with respect to the bias for the hidden layer (`b1`) is:
   $$
   db1 = \sum \text{error\_hidden}
   $$

5. **Update Weights and Biases**:
   - Use the computed gradients to update the weights and biases using gradient descent:
   $$
   W1 = W1 - \text{learning\_rate} \cdot dW1
   $$
   $$
   b1 = b1 - \text{learning\_rate} \cdot db1
   $$
   $$
   W2 = W2 - \text{learning\_rate} \cdot dW2
   $$
   $$
   b2 = b2 - \text{learning\_rate} \cdot db2
   $$

This completes the backpropagation process by adjusting the weights and biases to minimize the loss.

---


# Backward Pass: Mathematical Breakdown

The goal of the backward pass is to compute the gradients of the loss function with respect to the weights and biases and then update these parameters using gradient descent.

---

## 1. **Output Layer Error**:
The error at the output layer is the difference between the predicted values and the true labels:
$$
\text{error\_output} = \hat{y} - y
$$
Where:
- $ \hat{y} $ is the predicted output (from the softmax function).
- $ y $ is the true label.

---

## 2. **Gradient of the Loss with Respect to $ W_2 $ (Output Layer Weights)**:
The weight gradient for $ W_2 $ is the dot product of the hidden layer activations $ A_1 $ (transpose) and the output error:
$$
dW_2 = A_1^T \cdot \text{error\_output}
$$
Where:
- $ A_1^T $ is the transpose of the activations from the hidden layer.
- Because the loss is the mean over the $ m $ samples in the batch, its gradient carries a factor $ \frac{1}{m} $, which also keeps the size of the update independent of the batch size:
$$
dW_2 = \frac{1}{m} A_1^T \cdot \text{error\_output}
$$

---

## 3. **Gradient of the Loss with Respect to $ b_2 $ (Output Layer Bias)**:
The bias gradient for $ b_2 $ is the sum of the output error across all examples in the batch:
$$
db_2 = \sum \text{error\_output}
$$
And similarly, this gradient is averaged over the batch size $ m $:
$$
db_2 = \frac{1}{m} \sum \text{error\_output}
$$

---

## 4. **Backpropagate the Error to the Hidden Layer**:
The error at the hidden layer is computed by backpropagating the output error through the weights $ W_2 $ and multiplying element-wise ($ \odot $) by the derivative of the activation function (sigmoid) for the hidden layer:
$$
\text{error\_hidden} = (\text{error\_output} \cdot W_2^T) \odot \sigma'(Z_1)
$$
Where:
- $ W_2^T $ is the transpose of the weights connecting the hidden layer to the output layer.
- $ \sigma'(Z_1) = A_1 \odot (1 - A_1) $ is the derivative of the sigmoid function at $ Z_1 $ (the pre-activation values from the hidden layer), which can be computed directly from the activations $ A_1 $.

---

## 5. **Gradient of the Loss with Respect to $ W_1 $ (Hidden Layer Weights)**:
The weight gradient for $ W_1 $ is computed similarly:
$$
dW_1 = X^T \cdot \text{error\_hidden}
$$
Where:
- $ X^T $ is the transpose of the input data.
- Like before, this gradient is averaged over the batch size $ m $:
$$
dW_1 = \frac{1}{m} X^T \cdot \text{error\_hidden}
$$

---

## 6. **Gradient of the Loss with Respect to $ b_1 $ (Hidden Layer Bias)**:
The bias gradient for $ b_1 $ is the sum of the hidden layer error across all examples in the batch:
$$
db_1 = \sum \text{error\_hidden}
$$
Averaged over the batch size $ m $:
$$
db_1 = \frac{1}{m} \sum \text{error\_hidden}
$$

---

## 7. **Update the Weights and Biases**:
Finally, update the weights and biases using gradient descent:
$$
W_1 = W_1 - \alpha dW_1
$$
$$
b_1 = b_1 - \alpha db_1
$$
$$
W_2 = W_2 - \alpha dW_2
$$
$$
b_2 = b_2 - \alpha db_2
$$
Where:
- $ \alpha $ is the learning rate.
- $ dW_1, db_1, dW_2, db_2 $ are the gradients calculated earlier.
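
One way to gain confidence in these formulas is a numerical gradient check. The sketch below uses a tiny made-up network (5 inputs, 4 hidden neurons, 3 classes, random stand-in data) and hypothetical helper names. It computes the analytic gradients from steps 1 to 6 and compares $ dW_1 $ against central finite differences of the cross-entropy loss.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_in, n_hidden, n_out = 6, 5, 4, 3            # tiny made-up sizes
X = rng.random((m, n_in))
y = np.eye(n_out)[rng.integers(0, n_out, size=m)]
W1 = rng.standard_normal((n_in, n_hidden)) * 0.1
b1 = np.zeros((1, n_hidden))
W2 = rng.standard_normal((n_hidden, n_out)) * 0.1
b2 = np.zeros((1, n_out))

def forward(W1, b1, W2, b2):
    A1 = 1 / (1 + np.exp(-(X @ W1 + b1)))
    Z2 = A1 @ W2 + b2
    exp_z = np.exp(Z2 - Z2.max(axis=1, keepdims=True))
    return A1, exp_z / exp_z.sum(axis=1, keepdims=True)

def loss(W1, b1, W2, b2):
    _, A2 = forward(W1, b1, W2, b2)
    return -np.mean(np.sum(y * np.log(A2), axis=1))

# Analytic gradients, following steps 1 to 6
A1, A2 = forward(W1, b1, W2, b2)
error_output = A2 - y
dW2 = A1.T @ error_output / m
db2 = error_output.sum(axis=0, keepdims=True) / m
error_hidden = (error_output @ W2.T) * A1 * (1 - A1)   # sigma'(Z1) = A1 * (1 - A1)
dW1 = X.T @ error_hidden / m
db1 = error_hidden.sum(axis=0, keepdims=True) / m

# Numerical gradient for W1 by central differences
eps = 1e-6
dW1_numeric = np.zeros_like(W1)
for i in range(n_in):
    for j in range(n_hidden):
        W_plus, W_minus = W1.copy(), W1.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        dW1_numeric[i, j] = (loss(W_plus, b1, W2, b2)
                             - loss(W_minus, b1, W2, b2)) / (2 * eps)

print(np.allclose(dW1, dW1_numeric, atol=1e-7))  # True
```

Agreement here depends on the $ \frac{1}{m} $ factor being present in both the loss and the gradients; dropping it from one side makes the check fail.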

# Why Divide by $ m $ in Backpropagation?

When performing backpropagation and computing the gradients of the loss function, it’s important to divide the gradients by the number of samples in the batch, denoted as $m $. This is done to **average the gradients** over all the training examples in the batch.

## Reasons for Dividing by $ m $:

1. **Averaging the Gradient**:
   - When working with mini-batch gradient descent or full-batch gradient descent, matrix products such as $ A_1^T \cdot \text{error\_output} $ add up the gradient contributions of all the examples in the batch.
   - The cross-entropy loss used here is the **average loss per sample**, so its gradient is that sum divided by $ m $, the batch size.
   - This prevents the gradient from becoming too large when using larger batches and ensures that the step size (controlled by the learning rate) remains consistent, regardless of batch size.

2. **Scaling with Respect to the Batch Size**:
   - Without dividing by $ m $, the magnitude of the gradient would scale directly with the number of examples in the batch. This would cause larger batches to have larger gradient updates, which could destabilize training.
   - By dividing by $ m $, the update becomes independent of the batch size, ensuring that the weight updates are more stable and consistent, regardless of how many samples are processed at once.

3. **Maintaining Stability in Training**:
   - If we didn’t divide by $ m $, the gradients would be much larger for large batches, and smaller for small batches, making it difficult to tune the learning rate.
   - Dividing by $m $ helps **normalize** the gradient, so that the learning rate $ \alpha $ can be used effectively without needing to be adjusted based on the batch size.

#### Example:

Consider a batch size $ m = 64 $. If the per-sample gradients are summed across the batch and used directly, the resulting gradient is on the order of 64 times larger than the gradient from a single typical sample. Dividing by 64 brings the weight update back to the scale of one sample, whatever the batch size.

---

In summary, dividing by $ m $ ensures that:
- The gradient reflects the average per-sample contribution to the loss.
- The weight updates remain stable and are not affected by changes in batch size.
- It becomes easier to tune and apply a consistent learning rate during training.
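
The scaling effect can be demonstrated with a single softmax layer on made-up data. In this sketch (hypothetical helper `summed_gradient`, arbitrary sizes), a batch of 64 samples is repeated 10 times to form a batch of 640: the summed gradient grows by exactly that factor, while the averaged gradient stays the same.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((20, 3)) * 0.01   # one softmax layer with made-up sizes

def summed_gradient(X, y):
    # Sum of the per-sample cross-entropy gradients with respect to W
    Z = X @ W
    exp_z = np.exp(Z - Z.max(axis=1, keepdims=True))
    A = exp_z / exp_z.sum(axis=1, keepdims=True)
    return X.T @ (A - y)

X_small = rng.random((64, 20))
y_small = np.eye(3)[rng.integers(0, 3, size=64)]
X_large = np.tile(X_small, (10, 1))       # the same 64 samples, repeated 10 times
y_large = np.tile(y_small, (10, 1))

g_small = summed_gradient(X_small, y_small)
g_large = summed_gradient(X_large, y_large)

print(np.allclose(g_large, 10 * g_small))        # True: the sum scales with batch size
print(np.allclose(g_large / 640, g_small / 64))  # True: the average does not
```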

---
# Loss Tracking and Computing Loss During Training

To track the performance of the model during training, we can compute the **cross-entropy loss** at the end of each epoch. This gives a measure of how well the predictions match the true labels, and tracking the loss over epochs helps monitor the learning progress.

---

## Cross-Entropy Loss Formula:

For multi-class classification, the cross-entropy loss is computed as:

$$
L = - \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{C} y_{ij} \log(\hat{y}_{ij})
$$

Where:
- $ m $ is the number of training examples (batch size).
- $ C $ is the number of classes.
- $ y_{ij} $ is the true label for sample $ i $ and class $ j $ (`1` for the correct class, `0` for others).
- $ \hat{y}_{ij} $ is the predicted probability for sample $ i $ and class $ j $, which is the output from the softmax function.

---

## Steps for Loss Tracking:

1. **Forward Pass**:
   - After each forward pass during training, compute the predictions $ \hat{y} $ (output) using the softmax function.

2. **Cross-Entropy Loss Calculation**:
   - Compute the cross-entropy loss between the predicted output and the true labels $ y $.

3. **Optional: Print Loss**:
   - Print the loss every few epochs (e.g., every 100 epochs) to track the training progress.

---

## Loss Calculation in Python:

1. Compute the cross-entropy loss:
   - Use the formula: 
   $$
   \text{loss} = -\frac{1}{m} \sum \left[ y \cdot \log(\hat{y}) \right]
   $$
   Where $ y $ is the true labels and $ \hat{y} $ is the predicted output from the softmax function.

2. Print the loss every $n $ epochs (e.g., every 100 epochs) to monitor how the training progresses.
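
A minimal sketch of the loss computation as a standalone function. The name `cross_entropy_loss` and the toy predictions are assumptions made for this example; the small constant `eps` mirrors the guard against `log(0)` used in the training loop.

```python
import numpy as np

def cross_entropy_loss(y_true, y_pred, eps=1e-8):
    # y_true: one-hot labels of shape (m, C); y_pred: softmax outputs of shape (m, C)
    # eps guards against log(0)
    return -np.mean(np.sum(y_true * np.log(y_pred + eps), axis=1))

# Made-up predictions for 2 samples and 3 classes
y_true = np.array([[1, 0, 0],
                   [0, 1, 0]])
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]])

loss = cross_entropy_loss(y_true, y_pred)
print(round(float(loss), 4))  # 0.2899, i.e. -(ln 0.7 + ln 0.8) / 2

# A uniform guess over 10 classes gives ln(10), about 2.3026
uniform = np.full((1, 10), 0.1)
uniform_loss = cross_entropy_loss(np.eye(10)[[3]], uniform)
print(round(float(uniform_loss), 4))  # 2.3026
```

The second value is a useful reference point: an untrained 10-class network that spreads its probability evenly starts with a loss close to $ \ln 10 \approx 2.30 $.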

## Example of Loss Tracking:

- Inside the training loop, after the `forward` method:
   - Compute the loss using the predicted output and the true labels.
   - Print the loss every 100 epochs.

This process will help visualize whether the model is improving its predictions over time.




In [3]:
# Simple Neural Network with Sigmoid
import time

def time_training_numpy(nn, X_train, y_train, epochs):
    """Train `nn` and report the wall-clock training time in seconds."""
    start_time = time.time()

    # Train the neural network
    nn.train(X_train, y_train, epochs)

    # Calculate total training time
    training_time = time.time() - start_time
    print(f"Training Time: {training_time:.2f} seconds")

    return training_time

class SimpleNeuralNetworkSigmoid:
    def __init__(self, input_size, hidden_size, output_size, learning_rate=0.1):
        # Initialize weights and biases
        self.W1 = np.random.randn(input_size, hidden_size)*.01   #input_size x hidden_size
        self.W2 = np.random.randn(hidden_size, output_size)*.01   #hidden_size x output_size
        self.b1= np.zeros((1,hidden_size))  #1 x hidden_size 
        self.b2= np.zeros((1,output_size))  #1 x output_size
        self.learning_rate = learning_rate  # Set the learning rate

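    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def sigmoid_derivative(self, a):
        # `a` is the sigmoid activation sigmoid(z), not the pre-activation z:
        # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)) = a * (1 - a)
        return a * (1 - a)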

    def softmax(self, z):
        exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))
        return exp_z / np.sum(exp_z, axis=1, keepdims=True)


    def forward(self, X):
        self.Z1 = X@self.W1+self.b1 #First Layer Calculation
        self.A1=self.sigmoid(self.Z1) #sigmoid to first layer outputs
        self.Z2 = self.A1@self.W2+self.b2 #Second layer calculation
        self.A2 = self.softmax(self.Z2) #softmax to second layer
        return self.A2

    
    def backward(self, X, y, output):
        self.m = X.shape[0]  # number of samples in the batch

        # Output layer: error and gradients, averaged over the batch
        self.error_output = output - y  # A2 - y (softmax combined with cross-entropy)
        self.dW2 = (self.A1.T @ self.error_output) / self.m  # gradient w.r.t. W2
        self.db2 = np.sum(self.error_output, axis=0, keepdims=True) / self.m  # gradient w.r.t. b2

        # Hidden layer: backpropagate through W2, then through the sigmoid
        # (the derivative is evaluated from the stored activations A1)
        self.error_hidden = (self.error_output @ self.W2.T) * self.sigmoid_derivative(self.A1)
        self.dW1 = (X.T @ self.error_hidden) / self.m  # gradient w.r.t. W1
        self.db1 = np.sum(self.error_hidden, axis=0, keepdims=True) / self.m  # gradient w.r.t. b1

        # Gradient descent update
        self.W1 -= self.learning_rate * self.dW1
        self.b1 -= self.learning_rate * self.db1
        self.W2 -= self.learning_rate * self.dW2
        self.b2 -= self.learning_rate * self.db2

    def train(self, X, y, epochs=1000):
        for epoch in range(epochs):
            # Forward pass to get predictions
            output = self.forward(X)

            # Backward pass to update weights
            self.backward(X, y, output)

            # Cross-entropy loss of the predictions made before this epoch's update;
            # the small added term prevents log(0) issues
            loss = -np.mean(np.sum(y * np.log(output + 1e-8), axis=1))

            # Print loss every 100 epochs
            if epoch % 100 == 0:
                print(f"Epoch {epoch}, Loss: {loss}")

    def predict(self, X):
        output = self.forward(X)
        return np.argmax(output, axis=1)


# Initialize the neural network
nn = SimpleNeuralNetworkSigmoid(input_size=3072, hidden_size=64, output_size=10, learning_rate=0.01)

# Time and train the neural network
time_training_numpy(nn, X_train, y_train, epochs=1000)
# Train the neural network
#nn.train(X_train, y_train, epochs=1000)

# Make predictions
y_pred = nn.predict(X_test)
y_test_labels = np.argmax(y_test, axis=1)

# Calculate accuracy
accuracy = np.mean(y_pred == y_test_labels)
print(f"Accuracy: {accuracy * 100:.2f}%")




Epoch 0, Loss: 2.3036117330186388
Epoch 100, Loss: 2.3017806625581962
Epoch 200, Loss: 2.3005702314823466
Epoch 300, Loss: 2.299148613082996
Epoch 400, Loss: 2.2973543057611088
Epoch 500, Loss: 2.2950474620344625
Epoch 600, Loss: 2.292079741808893
Epoch 700, Loss: 2.288280459795769
Epoch 800, Loss: 2.2834460954982663
Epoch 900, Loss: 2.277345637159762
Training Time: 317.43 seconds
Accuracy: 22.88%


## Simple Neural Network with ReLU Activation

This neural network is a simple fully connected feedforward neural network with the following components:

### 1. **Network Architecture**
- **Input Layer**: The input to the network consists of the flattened pixel values of CIFAR-10 images. Since each image is 32x32 pixels with 3 color channels (RGB), the input size is $ 32 \times 32 \times 3 = 3072 $.
- **Hidden Layer**: There is a hidden layer with 64 neurons. The activation function used in this layer is **ReLU (Rectified Linear Unit)**. ReLU introduces non-linearity and helps the network learn more complex patterns. 
  $$
  \text{ReLU}(z) = \max(0, z)
  $$
- **Output Layer**: The output layer has 10 neurons, corresponding to the 10 classes of CIFAR-10. This layer uses the **softmax** function to output probabilities for each class. Softmax ensures that the output values sum to 1, allowing us to interpret them as probabilities.
  $$
  \text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}
  $$
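
A minimal sketch of ReLU and the derivative that backpropagation needs. The helper `relu_derivative` is an assumption of this example, including its convention of returning 0 at exactly $ z = 0 $, where the derivative is not defined.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def relu_derivative(z):
    # 1 where z > 0, else 0 (the value at exactly z = 0 is a convention; 0 is used here)
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z).tolist())             # [0.0, 0.0, 0.0, 0.5, 2.0]
print(relu_derivative(z).tolist())  # [0.0, 0.0, 0.0, 1.0, 1.0]
```

Unlike the sigmoid derivative, which never exceeds 0.25, the ReLU derivative is exactly 1 for every positive input, so gradients pass through active neurons without shrinking.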

### 2. **Forward Pass**
In the forward pass, the input data is passed through the layers of the network as follows:
- **Input to Hidden Layer**: 
  $$
  Z_1 = X \cdot W_1 + b_1
  $$
  Where:
  - $ X $ is the input data (one flattened image per row).
  - $ W_1 $ is the weight matrix for the input to hidden layer.
  - $ b_1 $ is the bias vector for the hidden layer.
  - $ Z_1 $ is the pre-activation values for the hidden layer.
  
  Then, the **ReLU activation** is applied:
  $$
  A_1 = \text{ReLU}(Z_1)
  $$

- **Hidden to Output Layer**:
  $$
  Z_2 = A_1 \cdot W_2 + b_2
  $$
  Where:
  - $ A_1 $ is the output of the hidden layer after applying ReLU.
  - $ W_2 $ is the weight matrix for the hidden to output layer.
  - $ b_2 $ is the bias vector for the output layer.
  - $ Z_2 $ is the pre-activation values for the output layer.
  
  Then, the **softmax activation** is applied:
  $$
  A_2 = \text{softmax}(Z_2)
  $$
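
The forward pass can be checked on toy data to confirm the shapes at each step. This is a minimal sketch, assuming the row-major convention of the implementation later in this notebook (each row of `X` is one sample, so the products are `X @ W1` and `A1 @ W2`). The batch size of 4 and the random inputs are illustrative assumptions, not CIFAR-10 data.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, for illustration only

# Toy batch: 4 samples with 3072 features each (stand-in for flattened images)
X = rng.random((4, 3072))

# Small random weights and zero biases
W1 = rng.standard_normal((3072, 64)) * 0.01
b1 = np.zeros((1, 64))
W2 = rng.standard_normal((64, 10)) * 0.01
b2 = np.zeros((1, 10))

Z1 = X @ W1 + b1          # (4, 64) hidden pre-activations
A1 = np.maximum(0, Z1)    # (4, 64) ReLU output
Z2 = A1 @ W2 + b2         # (4, 10) output pre-activations

# Softmax, subtracting the row maximum for numerical stability
exp_z = np.exp(Z2 - Z2.max(axis=1, keepdims=True))
A2 = exp_z / exp_z.sum(axis=1, keepdims=True)  # (4, 10) class probabilities

print(Z1.shape, A1.shape, Z2.shape, A2.shape)
print(A2.sum(axis=1))  # each row sums to 1, up to floating-point error
```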

### 3. **Loss Function**
The model uses **cross-entropy loss** to measure how far the predicted probabilities are from the true class labels:
$$
\text{Loss} = - \frac{1}{m} \sum_{i=1}^{m} \sum_{c=1}^{C} y_{i,c} \cdot \log(\hat{y}_{i,c})
$$
Where:
- $m$ is the number of training samples.
- $C$ is the number of classes (10 for CIFAR-10).
- $y_{i,c}$ is the true label for sample $i$ and class $c$.
- $\hat{y}_{i,c}$ is the predicted probability for sample $i$ and class $c$.
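
A small worked example may make the formula more concrete. The sketch below assumes one-hot labels and uses toy probabilities; the helper name `cross_entropy` and the `eps` guard against `log(0)` are illustrative choices.

```python
import numpy as np

def cross_entropy(y_true, y_prob, eps=1e-8):
    """Mean cross-entropy for one-hot y_true and predicted probabilities y_prob."""
    return -np.mean(np.sum(y_true * np.log(y_prob + eps), axis=1))

# Two samples, three classes (toy values)
y_true = np.array([[1, 0, 0],
                   [0, 1, 0]])
y_prob = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]])

loss = cross_entropy(y_true, y_prob)
print(loss)  # about 0.29, i.e. -(ln 0.7 + ln 0.8) / 2

# A uniform guess over 10 classes gives ln(10), about 2.3026
uniform = np.full((1, 10), 0.1)
one_hot = np.eye(10)[[3]]  # one sample whose true class is 3
print(cross_entropy(one_hot, uniform))
```

The uniform-guess value of about 2.30 is a useful reference point: a 10-class network with small random weights starts close to it, which is consistent with the loss printed at epoch 0 in this notebook.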

### 4. **Backpropagation and Gradient Descent**
During backpropagation, the network computes the gradients of the loss with respect to the weights and biases, and updates them using gradient descent:
- **Output Layer Updates**:
  $$
  W_2 = W_2 - \text{learning rate} \times \frac{\partial \text{Loss}}{\partial W_2}
  $$
  $$
  b_2 = b_2 - \text{learning rate} \times \frac{\partial \text{Loss}}{\partial b_2}
  $$
- **Hidden Layer Updates**:
  $$
  W_1 = W_1 - \text{learning rate} \times \frac{\partial \text{Loss}}{\partial W_1}
  $$
  $$
  b_1 = b_1 - \text{learning rate} \times \frac{\partial \text{Loss}}{\partial b_1}
  $$
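
The partial derivatives in these update rules have closed forms for softmax combined with cross-entropy. As a sketch, assuming the row-major convention of the implementation (each row of $X$ is one sample), with $Y$ the one-hot label matrix, $\delta^{(i)}$ the $i$-th row of $\delta$, $\odot$ the element-wise product, and $\mathbb{1}[\cdot]$ the element-wise indicator:

$$
\begin{aligned}
\delta_2 &= A_2 - Y \\
\frac{\partial \text{Loss}}{\partial W_2} &= \frac{1}{m} A_1^{\top} \delta_2, \qquad
\frac{\partial \text{Loss}}{\partial b_2} = \frac{1}{m} \sum_{i=1}^{m} \delta_2^{(i)} \\
\delta_1 &= \left( \delta_2 W_2^{\top} \right) \odot \mathbb{1}[Z_1 > 0] \\
\frac{\partial \text{Loss}}{\partial W_1} &= \frac{1}{m} X^{\top} \delta_1, \qquad
\frac{\partial \text{Loss}}{\partial b_1} = \frac{1}{m} \sum_{i=1}^{m} \delta_1^{(i)}
\end{aligned}
$$

The indicator term is the ReLU derivative, so hidden units that were inactive for a given sample receive no gradient from that sample.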

### 5. **Training**
The network is trained over a number of **epochs**. In each epoch:
1. The input data is passed through the network in the forward pass.
2. The loss is calculated using cross-entropy.
3. The gradients are computed in the backward pass.
4. The weights and biases are updated using gradient descent.

### 6. **Prediction and Accuracy**
- After training, the network is evaluated on the test set. The model outputs class probabilities, and the class with the highest probability is selected as the predicted class.
- **Accuracy** is calculated as the percentage of correctly classified images:
  $$
  \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Test Images}} \times 100
  $$
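
The accuracy calculation can be sketched on a handful of toy predictions. The probabilities and labels below are illustrative values, and the labels are assumed to be one-hot encoded.

```python
import numpy as np

# Toy predicted probabilities for 4 samples over 3 classes (illustrative values)
probs = np.array([[0.1, 0.7, 0.2],
                  [0.6, 0.3, 0.1],
                  [0.2, 0.2, 0.6],
                  [0.5, 0.4, 0.1]])

# One-hot true labels for the same 4 samples
y_true = np.array([[0, 1, 0],
                   [1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0]])

y_pred = np.argmax(probs, axis=1)      # predicted class per sample
y_labels = np.argmax(y_true, axis=1)   # integer labels recovered from one-hot
accuracy = np.mean(y_pred == y_labels) * 100
print(f"Accuracy: {accuracy:.2f}%")    # 3 of 4 correct -> 75.00%
```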

### Key Changes in This Model:
- **ReLU Activation**: ReLU is used in the hidden layer instead of sigmoid. Its gradient is 1 for positive inputs, whereas the sigmoid gradient never exceeds 0.25, so ReLU mitigates the vanishing gradient problem and tends to speed up learning.
- **64-Neuron Hidden Layer**: The single hidden layer with 64 neurons allows the model to learn more abstract representations of the data.

### Summary
This updated neural network with **ReLU activation** and a **64-neuron hidden layer** improves learning by allowing the model to capture more complex patterns in the CIFAR-10 dataset, while also mitigating the vanishing gradient problem associated with sigmoid activation.

# Simple Neural Network with ReLU Activation


$$
\begin{array}{ccccc}
\text{Input Layer} & \longrightarrow & \text{Hidden Layer} & \longrightarrow & \text{Output Layer} \\
\left(3072\right)  &                  & \left(64\right)     &                  & \left(10\right) \\
\end{array}
$$
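
For a sense of model size, the layer widths in the diagram imply a parameter count that can be tallied with a few lines of arithmetic. This sketch assumes one weight matrix and one bias vector per layer, as in the implementation in this notebook.

```python
layer_sizes = [3072, 64, 10]  # input, hidden, output widths from the diagram

total = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    weights = n_in * n_out   # one weight per input-output pair
    biases = n_out           # one bias per output neuron
    print(f"{n_in} -> {n_out}: {weights} weights + {biases} biases")
    total += weights + biases

print(f"Total trainable parameters: {total}")  # 197322
```

Almost all of the parameters sit in the first weight matrix (3072 x 64 = 196,608), which is typical when a fully connected layer is applied directly to image pixels.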


---

# Rectified Linear Unit (ReLU) Activation Function

The **Rectified Linear Unit (ReLU)** is one of the most commonly used activation functions in modern neural networks due to its simplicity and effectiveness.

## ReLU Function

The ReLU function is defined as:
$$
\text{ReLU}(z) = \max(0, z)
$$

This means:
- If $ z > 0 $, the output is $ z $.
- If $ z \leq 0 $, the output is 0.

ReLU introduces non-linearity to the model but is computationally efficient because it only involves comparing the input $ z $ to zero.

#### ReLU Derivative

The derivative of ReLU is useful for backpropagation and is defined as:
$$
\frac{d}{dz} \text{ReLU}(z) = 
\begin{cases} 
1 & \text{if } z > 0 \\
0 & \text{if } z \leq 0 
\end{cases}
$$

This means that the gradient is:
- 1 for positive inputs $ z $.
- 0 for negative inputs $ z $.

Strictly speaking, the derivative is undefined at exactly $ z = 0 $; by convention it is set to 0 there.
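
Both definitions can be sketched in NumPy and checked on a few values. The function names here are illustrative, and the derivative at zero follows the convention of returning 0.

```python
import numpy as np

def relu(z):
    """Element-wise max(0, z)."""
    return np.maximum(0, z)

def relu_derivative(z):
    """1 where z > 0, else 0 (the value at z == 0 is set to 0 by convention)."""
    return (z > 0).astype(z.dtype)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))             # values: 0, 0, 0, 0.5, 2
print(relu_derivative(z))  # values: 0, 0, 0, 1, 1
```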

#### Advantages of ReLU:
1. **Non-Saturating**: Unlike sigmoid or tanh, ReLU doesn't suffer from saturation in the positive regime. This helps to mitigate the **vanishing gradient problem**.
2. **Sparse Activation**: ReLU outputs zero for all negative inputs, so many hidden activations are exactly zero, which can help with the efficiency of learning.
3. **Efficient Computation**: ReLU is computationally simple (just a max operation) and fast to compute.

#### Disadvantages of ReLU:
1. **Dying ReLU Problem**: Sometimes, neurons can "die" during training because they output zero and never activate again, especially with poor weight initialization or high learning rates. This can cause gradients to be zero for some neurons, preventing them from updating their weights.
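
One way to look for dead neurons is to measure how many hidden units never produce a positive pre-activation on a batch. The helper below, `dead_unit_fraction`, is a hypothetical diagnostic and not part of the notebook's network class; the toy values are illustrative.

```python
import numpy as np

def dead_unit_fraction(Z1):
    """Fraction of hidden units whose pre-activation is <= 0 for every sample."""
    dead = np.all(Z1 <= 0, axis=0)  # one boolean per hidden unit (column)
    return dead.mean()

# Toy pre-activations: 3 samples (rows), 4 hidden units (columns)
Z1 = np.array([[ 0.5, -1.0, -0.2,  0.0],
               [-0.3, -2.0,  0.1, -0.4],
               [ 1.2, -0.1, -0.5, -0.9]])

# The second and fourth units never activate, so the fraction is 0.5
print(dead_unit_fraction(Z1))
```

With the class defined in this notebook, the same check could presumably be run by calling `nn.forward(X_train)` and then passing `nn.Z1` to the helper.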


In [6]:
# Simple Neural Network with ReLU


def time_training_numpy(nn, X_train, y_train, epochs):
    import time
    # Start timer
    start_time = time.time()
    
    # Train the neural network
    nn.train(X_train, y_train, epochs)
    
    # End timer
    end_time = time.time()
    
    # Calculate total training time
    training_time = end_time - start_time
    print(f"Training Time: {training_time:.2f} seconds")
    
    return training_time


class SimpleNeuralNetworkRelu:

    def __init__(self, input_size, hidden_size, output_size, learning_rate=0.1):
        # Initialize weights and biases
        self.W1 = np.random.randn(input_size, hidden_size) * 0.01   # input_size x hidden_size
        self.W2 = np.random.randn(hidden_size, output_size) * 0.01  # hidden_size x output_size
        self.b1 = np.zeros((1, hidden_size))  # 1 x hidden_size
        self.b2 = np.zeros((1, output_size))  # 1 x output_size
        self.learning_rate = learning_rate  # Set the learning rate

    def reLU(self, z):
        return np.maximum(0,z)

    def reLU_derivative(self, z):
        return np.where(z > 0, 1, 0)

    def softmax(self, z):
        exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))
        return exp_z / np.sum(exp_z, axis=1, keepdims=True)


    def forward(self, X):
        self.Z1 = X@self.W1+self.b1 #First Layer Calculation
        self.A1=self.reLU(self.Z1) #reLU to first layer outputs
        self.Z2 = self.A1@self.W2+self.b2 #Second layer calculation
        self.A2 = self.softmax(self.Z2) #softmax to second layer
        return self.A2

    
    def backward(self, X, y, output):
        # Output layer gradients (softmax combined with cross-entropy)
        self.m = X.shape[0]  # number of samples in the batch
        self.error_output = output - y  # gradient w.r.t. Z2, before dividing by m
        self.dW2 = (self.A1.T @ self.error_output) / self.m  # gradient w.r.t. W2
        self.db2 = np.sum(self.error_output, axis=0, keepdims=True) / self.m  # gradient w.r.t. b2

        # Hidden layer gradients; the ReLU derivative is evaluated at the pre-activation Z1
        # (equivalent to using A1 here, since A1 > 0 exactly where Z1 > 0)
        self.error_hidden = (self.error_output @ self.W2.T) * self.reLU_derivative(self.Z1)
        self.dW1 = (X.T @ self.error_hidden) / self.m  # gradient w.r.t. W1
        self.db1 = np.sum(self.error_hidden, axis=0, keepdims=True) / self.m  # gradient w.r.t. b1

        # Gradient descent update of all weights and biases
        self.W1 -= self.learning_rate * self.dW1
        self.b1 -= self.learning_rate * self.db1
        self.W2 -= self.learning_rate * self.dW2
        self.b2 -= self.learning_rate * self.db2

    def train(self, X, y, epochs=1000):
        # Full-batch gradient descent: every epoch uses all samples in X
        for epoch in range(epochs):
            # Forward pass to get predictions
            output = self.forward(X)

            # Backward pass to update weights
            self.backward(X, y, output)

            # Cross-entropy loss of the predictions made before this epoch's update;
            # the small added term prevents log(0) issues
            loss = -np.mean(np.sum(y * np.log(output + 1e-8), axis=1))

            # Print loss every 100 epochs
            if epoch % 100 == 0:
                print(f"Epoch {epoch}, Loss: {loss}")

    def predict(self, X):
        output = self.forward(X)
        return np.argmax(output, axis=1)


# Initialize the neural network
nn = SimpleNeuralNetworkRelu(input_size=3072, hidden_size=64, output_size=10, learning_rate=0.01)


# Time and train the neural network (time_training_numpy calls nn.train internally)
time_training_numpy(nn, X_train, y_train, epochs=1000)

# Make predictions
y_pred = nn.predict(X_test)
y_test_labels = np.argmax(y_test, axis=1)

# Calculate accuracy
accuracy = np.mean(y_pred == y_test_labels)
print(f"Accuracy: {accuracy * 100:.2f}%")



Epoch 0, Loss: 2.302996183758482
Epoch 100, Loss: 2.29281813239517
Epoch 200, Loss: 2.2685755959105887
Epoch 300, Loss: 2.218453245999388
Epoch 400, Loss: 2.1554619275496685
Epoch 500, Loss: 2.108303987880914
Epoch 600, Loss: 2.0735175012896208
Epoch 700, Loss: 2.0431460618427173
Epoch 800, Loss: 2.015858378261373
Epoch 900, Loss: 1.9921219009246585
Training Time: 273.41 seconds
Accuracy: 28.75%


# Cleanup: Delete the Dataset After the Network Has Run to Free Up Space

In [13]:
# Cleanup step: delete dataset after NN has run
def cleanup_cifar10():
    data_dir = './cifar-10-batches-py'
    tar_file = 'cifar-10-python.tar.gz'
    
    # Remove the extracted dataset directory
    if os.path.exists(data_dir):
        shutil.rmtree(data_dir)
        print(f"Deleted dataset directory: {data_dir}")
    
    # Optionally, remove the downloaded tar.gz file as well
    if os.path.exists(tar_file):
        os.remove(tar_file)
        print(f"Deleted dataset tar file: {tar_file}")

# Call cleanup after the NN has run
cleanup_cifar10()

Deleted dataset directory: ./cifar-10-batches-py
Deleted dataset tar file: cifar-10-python.tar.gz
