# Fashion MNIST Neural Network

This project tackles a classification task, with the aim of developing skills in building neural networks from scratch, image preprocessing, and data augmentation. The dataset is Fashion MNIST, where each image must be classified into one of 10 classes representing different articles of clothing.

The key learning goals of this project are understanding backpropagation, loss functions, the training loop, and gradients. Other features will also be added to develop further skills.

**Main Objectives**:
- Importing the Data
- Cleaning the Data
- Train/Test Split
- Exploratory Data Analysis (EDA) and Visualization
- Preprocessing the Data
- Data Augmentation
- Training the Model
- Hyperparameter Tuning
- Test set evaluation
- Performance metrics: F1 score, precision, recall, and confusion matrix.
- Identifying which images correspond to the model's most confident wrong and correct predictions, as well as its most uncertain predictions.
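
As a rough illustration of the metrics listed above, the following sketch computes a confusion matrix and per-class precision, recall, and F1 with plain NumPy. The function names `confusion_matrix` and `per_class_metrics` and the toy labels are assumptions made for this example, not part of the project code.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows index the true class, columns index the predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)  # count each (true, predicted) pair
    return cm

def per_class_metrics(cm):
    """Return per-class precision, recall, and F1 (0 where undefined)."""
    tp = np.diag(cm).astype(float)
    predicted = cm.sum(axis=0)  # how often each class was predicted
    actual = cm.sum(axis=1)     # how often each class truly occurs
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return precision, recall, f1

# Toy labels for three classes (hypothetical data)
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

cm = confusion_matrix(y_true, y_pred, n_classes=3)
precision, recall, f1 = per_class_metrics(cm)
print(cm)
print(precision, recall, f1)
```

For Fashion MNIST the same functions would be called with `n_classes=10` and the integer predictions returned by the model.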

**Extra**:
- Implement a **Neural Network** from scratch.
- The network must have 1 input layer, 2 hidden layers, and an output layer.
- Implement the forward propagation and backpropagation algorithms.
- Use mini-batch gradient descent.
- Implement the Adam optimizer, dropout, and layer normalization.
- Add a `zero_grad` step to reset gradients between updates.
- Regularization with weight decay.
- Softmax + Categorical Cross-Entropy
- Early Stopping
- Learning rate scheduler and momentum
- Visualize loss curves
- Implement ReLU
- Modular design: Linear, ReLU, Dropout, and Softmax as separate classes.


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import seaborn as sns
import pandas as pd

## Neural Network Implementation

This section implements a non-modular **Neural Network** consisting of an input layer, one hidden layer, and an output layer. The forward pass and backpropagation are based on this architecture, but this will change with the modular setup.

Mini-batch gradient descent with Adam will be used for optimization, while dropout and weight decay will be used as regularization.
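
A minimal sketch of a single Adam update is given here, since the optimizer is still a stub in the class below. The function name `adam_step`, its defaults (the commonly used `beta1=0.9`, `beta2=0.999`, `eps=1e-8`), and the choice to fold weight decay into the gradient as an L2 term (rather than the decoupled AdamW form) are all assumptions for illustration.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0):
    """One Adam update for a single parameter array; t is the 1-based step count."""
    grad = grad + weight_decay * param       # L2-style weight decay (assumption)
    m = beta1 * m + (1 - beta1) * grad       # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2  # second-moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Toy problem: minimize f(w) = ||w||^2, whose gradient is 2w
w = np.array([1.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)

first_step, _, _ = adam_step(w, 2 * w, m, v, t=1, lr=0.05)

for t in range(1, 501):
    w, m, v = adam_step(w, 2 * w, m, v, t, lr=0.05)
print(np.linalg.norm(w))
```

On the very first step the bias-corrected update is approximately `lr * sign(grad)`, which is why Adam is fairly insensitive to the raw gradient scale. In the network, one pair of `m` and `v` arrays would be kept per parameter (`W1`, `b1`, `W2`, `b2`).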

Early stopping, learning rate scheduler, and momentum will all be used to help convergence. Layer normalization will be added as well.
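
To make the convergence helpers concrete, here is a small sketch of early stopping and a step-decay learning rate schedule. The class `EarlyStopping`, the function `step_decay`, their parameters, and the validation losses are hypothetical and only show one common way these ideas are set up.

```python
class EarlyStopping:
    """Signals a stop once the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

def step_decay(base_lr, epoch, drop=0.5, every=10):
    """Multiply the learning rate by `drop` once every `every` epochs."""
    return base_lr * drop ** (epoch // every)

# Made-up validation losses: improvement stalls after epoch 2
val_losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.6]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, val_loss in enumerate(val_losses):
    if stopper.step(val_loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best, step_decay(0.01, epoch=25))
```

In practice the weights from the best epoch would usually be saved and restored when the stop is triggered.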

Lastly, the softmax and ReLU activation functions have been implemented.

For a single datapoint, here are the forward propagation and backpropagation algorithms:

**Forward Propagation** (datapoint):
$$m = W^{(1)}x + b^{(1)}$$
$$h = ReLU(m)$$
$$z = W^{(2)}h + b^{(2)}$$
$$y = softmax(z)$$
$$L = L_{CE}(y, t)$$

**Backpropagation** (datapoint):
$$\overline{z}=y-t$$
$$\overline{W^{(2)}}=\overline{z}h^T$$
$$\overline{b^{(2)}}=\overline{z}$$
$$\overline{h}=(W^{(2)})^T\overline{z}$$
$$\overline{m}=\overline{h}*ReLU'(m)$$
$$\overline{W^{(1)}}=\overline{m}x^T$$
$$\overline{b^{(1)}}=\overline{m}$$

Here $*$ denotes elementwise multiplication and all vectors are column vectors.
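
The single-datapoint equations can be sanity-checked numerically. The sketch below treats $x$, $t$, and all intermediate quantities as column vectors, so that $\overline{h}=(W^{(2)})^T\overline{z}$, and compares the analytic $\overline{W^{(1)}}$ against central finite differences. The layer sizes, the random seed, and the variable names are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_hidden, k = 5, 4, 3                 # input, hidden, and output sizes (arbitrary)
x = rng.normal(size=(d, 1))              # one datapoint as a column vector
t = np.zeros((k, 1))
t[1, 0] = 1.0                            # one-hot target
W1 = rng.normal(size=(n_hidden, d))
b1 = np.zeros((n_hidden, 1))
W2 = rng.normal(size=(k, n_hidden))
b2 = np.zeros((k, 1))

def forward(W1):
    """Forward pass as a function of W1 only, so W1 can be perturbed."""
    m = W1 @ x + b1
    h = np.maximum(0, m)
    z = W2 @ h + b2
    e = np.exp(z - z.max())              # shift for numerical stability
    y = e / e.sum()
    loss = -np.sum(t * np.log(y))        # categorical cross-entropy
    return m, h, y, loss

# Analytic gradients, following the backpropagation equations
m, h, y, loss = forward(W1)
z_bar = y - t
W2_bar = z_bar @ h.T
b2_bar = z_bar
h_bar = W2.T @ z_bar
m_bar = h_bar * (m > 0)                  # ReLU'(m) is 1 where m > 0, else 0
W1_bar = m_bar @ x.T
b1_bar = m_bar

# Numerical gradient of the loss with respect to W1 (central differences)
eps = 1e-6
W1_numeric = np.zeros_like(W1)
for idx in np.ndindex(W1.shape):
    W_plus = W1.copy()
    W_plus[idx] += eps
    W_minus = W1.copy()
    W_minus[idx] -= eps
    W1_numeric[idx] = (forward(W_plus)[3] - forward(W_minus)[3]) / (2 * eps)

max_abs_error = np.abs(W1_numeric - W1_bar).max()
print(max_abs_error)
```

Each gradient has the same shape as the parameter it belongs to, which is a quick way to catch a misplaced transpose.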

For a batch, the algorithms are formulated as follows:

**Forward Propagation**:
$$M = X(W^{(1)})^T+b^{(1)}$$
$$H = ReLU(M)$$
$$Z = H(W^{(2)})^T+b^{(2)}$$
$$Y = softmax(Z)$$
$$E = \frac{1}{N}\sum_{i}L_{CE}(y^{(i)}, t^{(i)})$$

**Backpropagation:**
$$\overline{Z}=\frac{1}{N}(Y-T)$$
$$\overline{W^{(2)}}=(\overline{Z})^TH$$
$$\overline{b^{(2)}}=(\overline{Z})^T1$$
$$\overline{H}=\overline{Z}W^{(2)}$$
$$\overline{M}=\overline{H}*ReLU'(M)$$
$$\overline{W^{(1)}}=(\overline{M})^TX$$
$$\overline{b^{(1)}}=(\overline{M})^T1$$
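
The batch equations translate almost line by line into NumPy. The sketch below is one possible rendering, assuming one example per row of $X$, one-hot rows in $T$, and 1-D bias vectors that broadcast across rows; it then checks every analytic gradient against central finite differences. The helper names (`forward`, `backward`, `numerical_grad`) and the toy sizes are assumptions, not the notebook's final code.

```python
import numpy as np

def forward(params, X):
    """Batch forward pass; returns pre-activations M, activations H, and probabilities Y."""
    W1, b1, W2, b2 = params
    M = X @ W1.T + b1
    H = np.maximum(0, M)
    Z = H @ W2.T + b2
    Z = Z - Z.max(axis=1, keepdims=True)   # shift rows for numerical stability
    expZ = np.exp(Z)
    Y = expZ / expZ.sum(axis=1, keepdims=True)
    return M, H, Y

def mean_cross_entropy(Y, T):
    """Average categorical cross-entropy over the batch."""
    return -np.sum(T * np.log(Y)) / Y.shape[0]

def backward(params, X, T):
    """Gradients of the mean loss, in the order W1, b1, W2, b2."""
    W1, b1, W2, b2 = params
    N = X.shape[0]
    M, H, Y = forward(params, X)
    Z_bar = (Y - T) / N
    W2_bar = Z_bar.T @ H
    b2_bar = Z_bar.sum(axis=0)             # same as Z_bar^T 1
    H_bar = Z_bar @ W2
    M_bar = H_bar * (M > 0)                # elementwise product with ReLU'(M)
    W1_bar = M_bar.T @ X
    b1_bar = M_bar.sum(axis=0)             # same as M_bar^T 1
    return [W1_bar, b1_bar, W2_bar, b2_bar]

def numerical_grad(params, X, T, eps=1e-6):
    """Central-difference gradients, perturbing one parameter entry at a time."""
    grads = []
    for p in params:
        g = np.zeros_like(p)
        for idx in np.ndindex(p.shape):
            old = p[idx]
            p[idx] = old + eps
            plus = mean_cross_entropy(forward(params, X)[2], T)
            p[idx] = old - eps
            minus = mean_cross_entropy(forward(params, X)[2], T)
            p[idx] = old                   # restore the original value
            g[idx] = (plus - minus) / (2 * eps)
        grads.append(g)
    return grads

rng = np.random.default_rng(0)
N, d, n_hidden, k = 6, 5, 4, 3
X = rng.normal(size=(N, d))
T = np.eye(k)[rng.integers(0, k, size=N)]  # random one-hot targets
params = [rng.normal(size=(n_hidden, d)), np.zeros(n_hidden),
          rng.normal(size=(k, n_hidden)), np.zeros(k)]

analytic = backward(params, X, T)
numeric = numerical_grad(params, X, T)
max_err = max(np.abs(a - n).max() for a, n in zip(analytic, numeric))
print(max_err)
```

A gradient check like this is cheap on tiny inputs and is a useful guard before training on the full dataset.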


In [None]:
class NeuralNetwork():
  """
  This is a custom implementation of a neural network with an input layer, one hidden layer, and an output layer.

  Parameters:
  - self.dropout: the dropout hyperparameter
  - self.alpha: the learning rate
  - self.h1: number of units in the hidden layer
  - self.regularization: specify the kind of regularization (TO DO)
  - self.lamb: the regularization strength
  - self.batch_size: the mini-batch size
  - self.momentum: the momentum coefficient
  - self.epochs: the number of epochs to train for
  - self.W1: the weights for the hidden layer
  - self.W2: the weights for the output layer
  - self.b1: the bias for the hidden layer
  - self.b2: the bias for the output layer
  - self.H: stores the hidden-layer activations (after ReLU) for backprop
  - self.M: stores the hidden-layer pre-activations (before ReLU) for backprop

  """
  def __init__(self, dropout=0.8, learning_rate=0.01, h1=64, regularization=None, lamb=0.2, batch_size=32, momentum=0.9, epochs=100):
    self.W1 = None
    self.W2 = None
    self.b1 = None
    self.b2 = None
    self.momentum = momentum
    self.batch_size = batch_size
    self.dropout = dropout
    self.alpha = learning_rate
    self.regularization = regularization
    self.lamb = lamb
    self.h1 = h1 # no. of units in the hidden layer
    self.epochs = epochs
    self.H = None # hidden-layer activations (after ReLU)
    self.M = None # hidden-layer pre-activations (before ReLU)

  def fit(self, X, T):
    # Constants for Xavier (Glorot) uniform initialization
    c1 = self.xavier_initialization(X.shape[1], self.h1)
    c2 = self.xavier_initialization(self.h1, T.shape[1])

    # Weight initialization; biases are 1-D so they broadcast across batch rows
    self.W1 = np.random.uniform(low=-c1, high=c1, size=(self.h1, X.shape[1]))
    self.W2 = np.random.uniform(low=-c2, high=c2, size=(T.shape[1], self.h1))
    self.b1 = np.zeros(self.h1)
    self.b2 = np.zeros(T.shape[1])

    # Training loop (full-batch for now; mini-batches are not used yet)
    for epoch in range(self.epochs):
      # Forward propagation
      Y = self.forward_prop(X)

      # Compute loss (for monitoring; the gradients do not depend on its value)
      loss = self.compute_loss(Y, T)

      W1_bar, W2_bar, b1_bar, b2_bar = self.backpropagation(X, Y, T)
      self.gradient_descent(W1_bar, W2_bar, b1_bar, b2_bar)

  def forward_prop(self, X):
    # Forward pass; X has one example per row
    M = X@(self.W1.T) + self.b1
    H = self.relu(M)
    Z = H@(self.W2.T) + self.b2
    Y = self.softmax(Z)

    # Store M and H for backpropagation
    self.H = H
    self.M = M

    return Y


  def backpropagation(self, X, Y, T):
    # Backpropagation for the mean cross-entropy loss over the batch
    N = X.shape[0]
    Z_bar = (1/N)*(Y - T)
    W2_bar = (Z_bar.T)@self.H
    b2_bar = Z_bar.sum(axis=0)  # equivalent to (Z_bar.T)@1
    H_bar = Z_bar@self.W2
    M_bar = H_bar*self.relu_derivative(self.M)  # elementwise product
    W1_bar = (M_bar.T)@X
    b1_bar = M_bar.sum(axis=0)  # equivalent to (M_bar.T)@1

    return W1_bar, W2_bar, b1_bar, b2_bar

  def gradient_descent(self, W1_bar, W2_bar, b1_bar, b2_bar):
    # Plain gradient descent step with learning rate alpha
    self.W1 = self.W1 - self.alpha*W1_bar
    self.W2 = self.W2 - self.alpha*W2_bar
    self.b1 = self.b1 - self.alpha*b1_bar
    self.b2 = self.b2 - self.alpha*b2_bar

  def compute_loss(self, Y, T):
    # Computes cost - mean categorical cross-entropy, with T one-hot encoded
    N = Y.shape[0]
    return -np.sum(T*np.log(Y))/N

  def Adam_optimizer(self):
    pass

  def xavier_initialization(self, n_inputs, n_outputs):
    return np.sqrt(6/(n_inputs+n_outputs))

  def predict(self, X):
    # Forward pass without storing intermediates; returns the predicted class index per row
    M = X@(self.W1.T) + self.b1
    H = self.relu(M)
    Z = H@(self.W2.T) + self.b2
    Y = self.softmax(Z)
    return np.argmax(Y, axis=1)

  def relu_derivative(self, o):
    return np.piecewise(o, [o <= 0, o > 0], [0, 1])

  def relu(self, o):
    return np.maximum(0, o)

  def softmax(self, o):
    # Row-wise softmax. Subtracting the row maximum avoids overflow in np.exp
    # without changing the result; keepdims is used for broadcasting purposes.
    e = np.exp(o - np.max(o, axis=1, keepdims=True))
    return e / np.sum(e, axis=1, keepdims=True)

  def mini_batch(self, t, X, N):
    batches = {}
    n_batches = N // self.batch_size
    # Create batches
    for i in range(n_batches):
        batches[i] = [X[i*self.batch_size:(i+1)*self.batch_size], t[i*self.batch_size:(i+1)*self.batch_size]]

    # Last batch should be compiled into its own batch, even if it's less than batch size
    if N % self.batch_size != 0 :
        batches[n_batches] = [X[n_batches*self.batch_size:], t[n_batches*self.batch_size:]]

    return batches

  def layer_normalization(self):
    pass




## Modular Implementation of a Neural Network

This section divides the previous implementation into several classes that can be combined to form the entire network. The main purpose is to start writing modular code and create reusable classes for future projects.
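
One possible shape for such a modular design is sketched below: each layer exposes `forward` and `backward`, caches what it needs, and the network is just a list of layers. The class names, the fused `SoftmaxCrossEntropy` loss (instead of a standalone Softmax class), the toy data, and the learning rate of 0.1 are assumptions for illustration only.

```python
import numpy as np

class Linear:
    """Fully connected layer computing X @ W.T + b."""
    def __init__(self, n_in, n_out, rng):
        limit = np.sqrt(6 / (n_in + n_out))  # Xavier uniform bound
        self.W = rng.uniform(-limit, limit, size=(n_out, n_in))
        self.b = np.zeros(n_out)

    def forward(self, X):
        self.X = X                           # cache the input for backward
        return X @ self.W.T + self.b

    def backward(self, out_bar):
        self.W_bar = out_bar.T @ self.X
        self.b_bar = out_bar.sum(axis=0)
        return out_bar @ self.W              # gradient with respect to the input

class ReLU:
    def forward(self, X):
        self.mask = X > 0
        return X * self.mask

    def backward(self, out_bar):
        return out_bar * self.mask

class SoftmaxCrossEntropy:
    """Softmax fused with mean categorical cross-entropy."""
    def forward(self, Z, T):
        Z = Z - Z.max(axis=1, keepdims=True)
        expZ = np.exp(Z)
        self.Y = expZ / expZ.sum(axis=1, keepdims=True)
        self.T = T
        return -np.sum(T * np.log(self.Y)) / Z.shape[0]

    def backward(self):
        return (self.Y - self.T) / self.T.shape[0]

rng = np.random.default_rng(0)
layers = [Linear(5, 4, rng), ReLU(), Linear(4, 3, rng)]
loss_fn = SoftmaxCrossEntropy()
X = rng.normal(size=(8, 5))                  # toy inputs
T = np.eye(3)[rng.integers(0, 3, size=8)]    # toy one-hot targets

losses = []
for _ in range(50):
    out = X
    for layer in layers:                     # forward through the stack
        out = layer.forward(out)
    losses.append(loss_fn.forward(out, T))
    grad = loss_fn.backward()
    for layer in reversed(layers):           # backward in reverse order
        grad = layer.backward(grad)
    for layer in layers:                     # plain gradient descent update
        if isinstance(layer, Linear):
            layer.W -= 0.1 * layer.W_bar
            layer.b -= 0.1 * layer.b_bar
print(losses[0], losses[-1])
```

With this layout, adding a second hidden layer or a Dropout layer only means inserting another object into the `layers` list.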

## Importing the Data

## Train/Test Split

## Exploratory Data Analysis (EDA)

## Data Augmentation

Common techniques for augmenting images are the following:
- tbd

## Data Preprocessing

## Building and Training the Model

## Hyperparameter Tuning

## Test Set Evaluation

## Other Metrics

## Analysis of Model Prediction

## Testing of Implementations