### Overview of Convolutional Neural Networks

Neural networks can generally be understood in terms of three kinds of layers: input, hidden, and output. The input layer takes structured data and prepares it for the hidden layers. Each hidden layer applies learned weights and a non-linear activation function to the data before passing the result on. The output layer applies its own weights and activation function to return the number of outputs required by the task. For image classification, this is a vector whose values correspond to the probability that the input image belongs to each class, so the argmax of the vector is the predicted class.
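As a small illustration of that last step, the sketch below turns a vector of raw scores into probabilities with softmax and picks the most likely class. The scores and the three-class setup are made up for the example; `softmax` here is a standalone helper, not the layer built later in this notebook.

```python
import numpy as np

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtracting the max avoids overflow in exp
    exp = np.exp(shifted)
    return exp / np.sum(exp)

# Hypothetical raw scores for a 3-class problem
scores = np.array([1.0, 3.0, 0.5])
probs = softmax(scores)

# The predicted class is the index of the largest probability
predicted_class = int(np.argmax(probs))
```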

[Feed Forward Networks (FNNs)](https://www.geeksforgeeks.org/understanding-multi-layer-feed-forward-networks/) are foundational neural networks that fit this description directly: they take a one-dimensional vector (in this case the individual pixel values of an image) as input and pass it through weights and activation functions. In computer vision, FNNs are far outclassed by CNNs, which model image data better by making stronger assumptions. The core assumption made by CNNs is that the information captured by neighboring pixels is highly correlated, which means that (1) features can be captured by looking at regions of neighboring pixels and (2) not every pixel is necessary for image classification. An FNN has no such notion of locality. A 28x28 pixel RGB image is flattened into a 2352x1 array by the FNN's input layer, so the network cannot infer that the first three elements represent the same pixel or that the next three elements represent a neighboring pixel.
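To make the flattening concrete, here is a minimal sketch that assumes a channels-last layout (height, width, channel) and NumPy's default row-major order. The helper `flat_index` is hypothetical and exists only to show where a pixel ends up in the flattened vector.

```python
import numpy as np

height, width, channels = 28, 28, 3
# Placeholder image; only the shape matters here
image = np.zeros((height, width, channels))
flat = image.reshape(-1, 1)  # shape (2352, 1)

def flat_index(row, col, channel):
    """Position of image[row, col, channel] in the flattened vector."""
    return (row * width + col) * channels + channel

# Distance in the flat vector between a pixel and its neighbors
right_neighbor_gap = flat_index(0, 1, 0) - flat_index(0, 0, 0)
below_neighbor_gap = flat_index(1, 0, 0) - flat_index(0, 0, 0)
```

The pixel to the right is 3 elements away, but the pixel directly below is 84 elements away, so spatial closeness is not visible in the flat vector.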

[Convolutional Neural Networks (CNNs)](https://www.geeksforgeeks.org/introduction-convolution-neural-network/) capture image structure using <strong>convolution</strong>, which involves sliding a <strong>kernel</strong> across the input matrix to capture some feature. Kernels are square matrices of learned weights which, in theory, highlight some part of the input image that is pertinent to the classification task. Since convolutions produce redundant information, the resulting matrix is put through a <strong>pooling</strong> layer, which keeps only one summary value (here, the largest) from each region it visits. For example, the 2x2 Max Pool used in this implementation keeps only the maximum of every four elements, producing an output matrix with half the width and height of the original. The pooled matrix is then flattened and passed through a feed forward layer, which is essentially just an FNN. Because the previous steps should have both highlighted important data in the image and discarded redundant elements, the weights of the feed forward layer can learn the classification features more effectively.
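The sliding-kernel idea can be sketched on a toy example. The 4x4 image and the 2x2 kernel below are hand-picked for illustration (a real kernel is learned), and `cross_correlate_valid` is a hypothetical helper that uses no padding and a stride of 1.

```python
import numpy as np

def cross_correlate_valid(image, kernel):
    """Slide `kernel` over `image` and sum the elementwise products at each position."""
    k = kernel.shape[0]
    out_h = image.shape[0] - k + 1
    out_w = image.shape[1] - k + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

# Toy image: dark left half, bright right half
image = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

# Hand-picked kernel that responds to a dark-to-bright vertical edge
kernel = np.array([
    [-1, 1],
    [-1, 1],
], dtype=float)

feature_map = cross_correlate_valid(image, kernel)
```

The resulting 3x3 feature map is nonzero only in its middle column, which is where the kernel straddles the edge.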

In [1]:
! pip install -r requirements.txt



### Fashion MNIST

The [Fashion MNIST](https://github.com/zalandoresearch/fashion-mnist) dataset was modeled on the original [MNIST](http://yann.lecun.com/exdb/mnist/) database of handwritten digits to provide a more challenging image classification task. Like MNIST, Fashion MNIST is a set of 70k (60k train and 10k test) grayscale 28x28 pixel images, but its classes are much more abstract than handwritten digits. Each image is an article of clothing belonging to one of 10 classes: "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat", "Sandal", "Shirt", "Sneaker", "Bag", and "Ankle boot".

In [2]:
# Libraries used in this implementation; torchvision is only used to download the dataset
import numpy as np
from numpy import logical_and, sum as t_sum
from torchvision import datasets
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt

traintensors = datasets.FashionMNIST(root="./data", train=True, download=True)
testtensors = datasets.FashionMNIST(root="./data", train=False, download=True)

# Convert the tensors to numpy arrays
trainset = np.array(traintensors.data)
trainlabels = np.array(traintensors.targets)
testset = np.array(testtensors.data)
testlabels = np.array(testtensors.targets)
classes = ("T-shirt/top", "Trouser", "Pullover", "Dress", "Coat", "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot")

# Take a look at the data
for i in range(9):
    plt.subplot(330 + 1 + i)
    plt.imshow(trainset[i], cmap=plt.get_cmap('gray'))
plt.show()

### Convolutional Layer

The core of a CNN is convolution: a 5x5 <strong>kernel</strong> (other odd-sized square matrices are also possible) is multiplied elementwise with a region of the image and the products are summed. The kernel moves across the image in <strong>strides</strong>, and the sums are placed at the corresponding positions of a 2D array. However, since the kernel is larger than one pixel, the resulting array is smaller than the original image. To produce an array of the same size as the input, we <strong>pad</strong> the image with a border of zeroes. For a 5x5 kernel with a stride of 1, the border must be 2 pixels wide.
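The size arithmetic can be checked with the usual output-size formula, floor((n + 2p - k) / s) + 1, where n is the input size, k the kernel size, p the padding on each side, and s the stride. The function name `conv_output_size` is made up for this sketch.

```python
def conv_output_size(n, kernel_size, padding=0, stride=1):
    """Output width (or height) of a convolution over an input of size n."""
    return (n + 2 * padding - kernel_size) // stride + 1

# 28x28 input with a 5x5 kernel and stride 1
no_padding = conv_output_size(28, kernel_size=5)
same_padding = conv_output_size(28, kernel_size=5, padding=2)
```

Without padding the output shrinks to 24x24; with a border of 2 it stays 28x28.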

#### Convolution

In [3]:
class ConvLayer:
    def __init__(self, input_shape, kernel_size=5, num_kernels=6, padding=0):
        # Get input dimensions
        input_depth, input_height, input_width = input_shape
        self.d = input_depth
        self.h = input_height + kernel_size - 1
        self.w = input_width + kernel_size - 1
        self.input_shape = input_shape
        # Initialize kernels and bias
        self.padding = padding
        self.num_kernels = num_kernels
        self.kernel_size = kernel_size
        self.pad_size = kernel_size // 2
        self.kernel_shape = (self.num_kernels, self.d, self.kernel_size, self.kernel_size)
        self.bias_shape = (self.num_kernels, self.h - self.kernel_size + 1, self.w - self.kernel_size + 1)
        # Dividing mimics Xavier Initialization and reduces variance
        self.kernels = np.random.randn(*self.kernel_shape) / (self.kernel_size * self.kernel_size)
        self.bias = np.random.randn(*self.bias_shape) / (self.h * self.w)
    
    def iter_regions(self, image):
        """
        Generates all possible (kernel_size x kernel_size) image regions (prepadded)
        """
        for i in range(self.h - self.kernel_size + 1):
            for j in range(self.w - self.kernel_size + 1):
                im_region = image[:, i:(i + self.kernel_size), j:(j + self.kernel_size)]
                yield im_region, i, j
    
    def forward(self, input):
        """
        Pad input, get regions, and cross-correlate each region with the kernels
        """
        padded = np.pad(input, ((0,0), (self.pad_size, self.pad_size), (self.pad_size, self.pad_size)), mode="constant", constant_values=self.padding)
        self.prev_input = padded # Save for backpropagation
        self.output = np.copy(self.bias)
        for im_region, i, j in self.iter_regions(padded):
            self.output[:, i, j] += np.sum(im_region * self.kernels, axis=(1, 2, 3))
        return self.output
    
    def backprop(self, d_L_d_out, learn_rate):
        """
        Update kernels and bias, and return input gradient
        """
        # Cross correlation for kernel gradient
        d_L_d_kernels = np.zeros(self.kernels.shape)
        for im_region, i, j in self.iter_regions(self.prev_input):
            for f in range(self.num_kernels):
                d_L_d_kernels[f] += d_L_d_out[f, i, j] * im_region 
        # Full convolution for input gradient
        d_L_d_input = np.zeros(self.input_shape)
        pad_out = np.pad(d_L_d_out, ((0,0), (self.pad_size, self.pad_size), (self.pad_size, self.pad_size)), mode="constant", constant_values=0)
        conv_kernels = np.rot90(np.moveaxis(self.kernels, 0, 1), 2, axes=(2, 3))
        for im_region2, i, j in self.iter_regions(pad_out):
            for d in range(self.d):
                d_L_d_input[d, i, j] += np.sum(im_region2 * conv_kernels[d])
        # Adjust by learn rate
        self.bias -= learn_rate * d_L_d_out
        self.kernels -= learn_rate * d_L_d_kernels
        return d_L_d_input
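The nested Python loops in `iter_regions` are easy to follow but slow. As a cross-check, the same sliding-window sum can be sketched with NumPy's `sliding_window_view` and `einsum`. This is a standalone illustration rather than part of the implementation above: `correlate_loop` and `correlate_vectorized` are hypothetical helper names, the bias term is left out, and `sliding_window_view` assumes NumPy 1.20 or later.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def correlate_loop(padded, kernels):
    """Loop version: padded is (depth, H, W), kernels is (num_kernels, depth, k, k)."""
    num_kernels, _, k, _ = kernels.shape
    out_h = padded.shape[1] - k + 1
    out_w = padded.shape[2] - k + 1
    out = np.zeros((num_kernels, out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            region = padded[:, i:i + k, j:j + k]
            out[:, i, j] = np.sum(region * kernels, axis=(1, 2, 3))
    return out

def correlate_vectorized(padded, kernels):
    """Same result without Python loops."""
    k = kernels.shape[-1]
    # windows has shape (depth, out_h, out_w, k, k)
    windows = sliding_window_view(padded, (k, k), axis=(1, 2))
    # Sum over depth and both kernel axes for every kernel and output position
    return np.einsum("dhwij,ndij->nhw", windows, kernels)

rng = np.random.default_rng(0)
image = rng.standard_normal((1, 28, 28))
kernels = rng.standard_normal((6, 1, 5, 5))
padded = np.pad(image, ((0, 0), (2, 2), (2, 2)))  # zero border of width 2

loop_out = correlate_loop(padded, kernels)
fast_out = correlate_vectorized(padded, kernels)
```

Both versions should agree to floating-point precision and return one 28x28 feature map per kernel.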

In [4]:
class ReLU:
    """
    Simple ReLU activation function
    """
    def __init__(self):
        pass
    
    def forward(self, input):
        self.prev_output = np.maximum(0, input)
        return self.prev_output
    
    def backprop(self, d_L_d_out):
        return d_L_d_out * np.int64(self.prev_output > 0)

### Max Pooling Layer
Since convolutions capture information from neighboring pixels, many elements in the output array are redundant. <strong>Pooling</strong> presents a simple solution: run another square window across the output matrix and at each <strong>stride</strong> keep only the max, min, or average value. In this implementation I use the max value, which keeps the strongest activation in each window. Any of the three choices shrinks the output by the same factor; they differ only in which summary of the window is kept.
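The three pooling variants can be compared on a small, made-up 4x4 feature map. The reshape trick below is one possible way to form non-overlapping 2x2 blocks and assumes the height and width are divisible by 2; it is not how the `MaxPool` class in this notebook is written.

```python
import numpy as np

feature_map = np.array([
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
], dtype=float)

# Split the 4x4 map into 2x2 blocks, indexed as
# (block_row, row_in_block, block_col, col_in_block)
blocks = feature_map.reshape(2, 2, 2, 2)

# Reduce over the two within-block axes
max_pooled = blocks.max(axis=(1, 3))
min_pooled = blocks.min(axis=(1, 3))
avg_pooled = blocks.mean(axis=(1, 3))
```

Each result is 2x2, half the width and height of the input, whichever summary is used.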

In [5]:
class MaxPool:
    def __init__(self, pool_size=2):
        self.size = pool_size
    
    def iter_regions(self, image):
        """
        Same as Conv layer, but with stride of pool_size
        """
        _, h, w = image.shape
        new_h = h // self.size
        new_w = w // self.size
        for i in range(new_h):
            for j in range(new_w):
                im_region = image[:, (i * self.size):(i * self.size + self.size), (j * self.size):(j * self.size + self.size)]
                yield im_region, i, j
    
    def forward(self, input):
        """
        Gets max value in each region
        """
        self.prev_input = input
        num_kernels, h, w = input.shape
        output = np.zeros((num_kernels, h // self.size, w // self.size))
        for im_region, i, j in self.iter_regions(input):
            output[:, i, j] = np.amax(im_region, axis=(1, 2))
        return output
    
    def backprop(self, d_L_d_out):
        """
        Backpropagates gradient to input
        """
        d_L_d_input = np.zeros(self.prev_input.shape)
        for im_region, i, j in self.iter_regions(self.prev_input):
            f, h, w = im_region.shape
            amax = np.amax(im_region, axis=(1, 2))
            for i2 in range(h):
                for j2 in range(w):
                    for f2 in range(f):
                        if im_region[f2, i2, j2] == amax[f2]:
                            d_L_d_input[f2, i * self.size + i2, j * self.size + j2] = d_L_d_out[f2, i, j]
        return d_L_d_input
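The backward pass routes each output gradient to the input position that held the maximum. One detail worth seeing in isolation is what happens on ties: every position equal to the maximum receives the gradient, which is common after ReLU when a whole region is zero. The sketch below reproduces that routing with a boolean mask for the 2x2 case; `maxpool_backward_2x2` is a hypothetical helper that assumes even height and width.

```python
import numpy as np

def maxpool_backward_2x2(prev_input, d_L_d_out):
    """Route each output gradient to the position(s) holding the max of its 2x2 region."""
    f, h, w = prev_input.shape
    # Axes: (filter, block_row, row_in_block, block_col, col_in_block)
    blocks = prev_input.reshape(f, h // 2, 2, w // 2, 2)
    block_max = blocks.max(axis=(2, 4), keepdims=True)
    mask = (blocks == block_max)  # True wherever an element equals its region's max
    grads = mask * d_L_d_out[:, :, np.newaxis, :, np.newaxis]
    return grads.reshape(f, h, w)

# One filter, two 2x2 regions: the first has a unique max, the second is all zeros
prev_input = np.array([[
    [1.0, 5.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
]])
d_L_d_out = np.array([[[10.0, 7.0]]])

d_L_d_input = maxpool_backward_2x2(prev_input, d_L_d_out)
```

In the first region only the position holding 5.0 receives the gradient 10.0; in the all-zero region all four positions receive 7.0.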

In [6]:
class Output:
    def __init__(self, input_len, nodes):
        self.weights = np.random.randn(input_len, nodes) / input_len
        self.bias = np.zeros(nodes)
    
    def forward(self, input):
        """
        Flatten input, matrix multiply with weights, add bias, and get softmax
        """
        # Forward pass
        flat = input.flatten()
        totals = np.dot(flat, self.weights) + self.bias
        exp = np.exp(totals)
        # Saving forward pass for backpropagation
        self.prev_input_shape = input.shape
        self.prev_input = flat
        self.prev_totals = totals
        return exp / np.sum(exp, axis=0)
    
    def backprop(self, d_L_d_out, learn_rate):
        """
        Softmax backprop for output layer
        """
        for i, gradient in enumerate(d_L_d_out):
            # Only the gradient at the correct class is nonzero
            if gradient == 0:
                continue 
            # e^totals
            t_exp = np.exp(self.prev_totals)
            S = np.sum(t_exp)
            # Gradients at i against totals
            d_out_d_t = -t_exp[i] * t_exp / (S ** 2)
            d_out_d_t[i] = t_exp[i] * (S - t_exp[i]) / (S ** 2)
            # Gradients of totals against weights/bias/input
            d_t_d_w = self.prev_input
            d_t_d_b = 1
            d_t_d_inputs = self.weights
            # Gradients of loss against totals
            d_L_d_t = gradient * d_out_d_t
            d_L_d_w = d_t_d_w[np.newaxis].T @ d_L_d_t[np.newaxis]
            d_L_d_b = d_L_d_t * d_t_d_b
            d_L_d_inputs = d_t_d_inputs @ d_L_d_t
            # Update weights and bias
            self.weights -= learn_rate * d_L_d_w
            self.bias -= learn_rate * d_L_d_b
            return d_L_d_inputs.reshape(self.prev_input_shape)
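The chain-rule steps in `backprop` can be checked against a well-known shortcut: for softmax followed by the loss L = -log(out[label]), the gradient of the loss with respect to `totals` is the softmax output minus the one-hot label. The sketch below compares the two numerically on made-up values; `grad_chain_rule` and `grad_shortcut` are hypothetical names, and the first mirrors the formulas used above without the weight updates.

```python
import numpy as np

def softmax(totals):
    exp = np.exp(totals - np.max(totals))
    return exp / np.sum(exp)

def grad_chain_rule(totals, label):
    """dL/dtotals computed step by step, as in the backprop method above."""
    t_exp = np.exp(totals)
    S = np.sum(t_exp)
    out = t_exp / S
    gradient = -1 / out[label]  # dL/d(out[label]) for L = -log(out[label])
    d_out_d_t = -t_exp[label] * t_exp / (S ** 2)
    d_out_d_t[label] = t_exp[label] * (S - t_exp[label]) / (S ** 2)
    return gradient * d_out_d_t

def grad_shortcut(totals, label):
    """Closed form: softmax probabilities minus the one-hot label."""
    probs = softmax(totals)
    probs[label] -= 1
    return probs

# Hypothetical pre-softmax totals for a 4-class problem
totals = np.array([0.2, -1.3, 2.0, 0.5])
long_way = grad_chain_rule(totals, label=2)
short_way = grad_shortcut(totals, label=2)
```

The two gradients should match, and their entries sum to zero because the softmax probabilities sum to one.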

In [9]:
class SimpleCNN:
    """
    Simple CNN using the layers built above.
    Architecture:
    Input -> Conv -> ReLU -> Conv -> ReLU -> MaxPool -> Dense (softmax) -> Output
    """
    def __init__(self, ConvLayer_1, ReLU_1, ConvLayer_2, ReLU_2, MaxPool, Output):
        self.ConvLayer_1 = ConvLayer_1
        self.ReLU_1 = ReLU_1
        self.ConvLayer_2 = ConvLayer_2
        self.ReLU_2 = ReLU_2
        self.MaxPool = MaxPool
        self.OutputLayer = Output
    
    def preprocess(self, data):
        """
        Data generally needs to be reshaped for our purposes
        """
        if len(data.shape) == 3:
            data = data[:, np.newaxis, :, :]
        elif len(data.shape) == 4 and data.shape[3] == 3:
            data = np.moveaxis(data, -1, 1)
        return data
    
    def forward(self, image):
        """
        Forward pass through network, transform image from [0, 255] to [-0.5, 0.5] as standard practice
        """
        input = (image / 255) - 0.5
        out = self.ConvLayer_1.forward(input)
        out = self.ReLU_1.forward(out)
        out = self.ConvLayer_2.forward(out)
        out = self.ReLU_2.forward(out)
        out = self.MaxPool.forward(out)
        out = self.OutputLayer.forward(out)
        return out
    
    def backprop(self, gradient, learn_rate):
        """
        Backpropagation through network
        """
        d_L_d_out = self.OutputLayer.backprop(gradient, learn_rate)
        d_L_d_out = self.MaxPool.backprop(d_L_d_out)
        d_L_d_out = self.ReLU_2.backprop(d_L_d_out)
        d_L_d_out = self.ConvLayer_2.backprop(d_L_d_out, learn_rate)
        d_L_d_out = self.ReLU_1.backprop(d_L_d_out)
        d_L_d_out = self.ConvLayer_1.backprop(d_L_d_out, learn_rate)
        return d_L_d_out

    def avg_f1_score(self, predicted_labels, true_labels, classes):
        """
        Calculate the F1 score for each class and return the average (macro F1)
        F1 score is the harmonic mean of precision and recall
        Precision is True Positives / All Positive Predictions
        Recall is True Positives / All Positive Labels
        """
        f1_scores = []
        for c in classes:
            pred_class = np.array([pred == c for pred in predicted_labels])
            true_class = np.array([lab == c for lab in true_labels])
            precision = (t_sum(logical_and(pred_class, true_class)) / t_sum(pred_class)) if t_sum(pred_class) else 0
            recall = (t_sum(logical_and(pred_class, true_class)) / t_sum(true_class)) if t_sum(true_class) else 0
            # A class with zero precision or recall scores 0 and still counts toward the average
            f1_scores.append(2 * (precision * recall) / (precision + recall) if precision and recall else 0)
        return np.mean(f1_scores)

    def predict(self, dataset, true_labels, classes):
        """
        Predict labels for dataset and return accuracy and F1 score
        """
        preds = []
        acc = 0
        for im, lab in zip(dataset, true_labels):
            preds.append(np.argmax(self.forward(im)))
            acc += (preds[-1] == lab)
        preds = np.array(preds)
        accuracy = acc / len(preds)
        f1 = self.avg_f1_score(preds, true_labels, classes)
        return accuracy, f1

    def train(
        self,
        trainset,
        trainlabels,
        devset,
        devlabels,
        classes=tuple(range(10)),
        epochs=3,
        learn_rate=0.005
    ):
        """
        Training loop for network
        """
        # Preprocess & generate permutation to shuffle data
        trainset = self.preprocess(trainset)
        devset = self.preprocess(devset)
        permutation = np.random.permutation(len(trainset))
        train_data = trainset[permutation]
        train_labels = trainlabels[permutation]
        # Training loop
        print("Training...")
        for epoch in range(epochs):
            losses = []
            for image, label in tqdm(list(zip(train_data, train_labels))):
                # Forward pass
                out = self.forward(image)
                # Calculate loss and gradient
                loss = -np.log(out[label])
                losses.append(loss)
                gradient = np.zeros(10)
                gradient[label] = -1 / out[label]
                # Backpropagation
                self.backprop(gradient, learn_rate)
            print(f"Epoch {epoch + 1}, loss: {np.mean(losses):.3f}")
            print("Evaluating dev...")
            acc, f1 = self.predict(devset, devlabels, classes)
            print(f"Dev Accuracy: {acc:.3f}, Dev F1 Score: {f1:.3f}")
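For reference, macro-averaged F1 can be worked through by hand on a tiny example. The function below is a standalone sketch with made-up labels; `macro_f1` is a hypothetical name, and in this sketch a class with zero precision and recall contributes a score of 0 to the average.

```python
import numpy as np

def macro_f1(predicted_labels, true_labels, classes):
    """Unweighted mean of the per-class F1 scores."""
    predicted_labels = np.asarray(predicted_labels)
    true_labels = np.asarray(true_labels)
    scores = []
    for c in classes:
        true_pos = np.sum((predicted_labels == c) & (true_labels == c))
        pred_pos = np.sum(predicted_labels == c)
        actual_pos = np.sum(true_labels == c)
        precision = true_pos / pred_pos if pred_pos else 0.0
        recall = true_pos / actual_pos if actual_pos else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return float(np.mean(scores))

# Hypothetical predictions for a 3-class problem
true_labels = [0, 0, 1, 1, 2, 2]
predicted = [0, 0, 1, 0, 1, 1]
score = macro_f1(predicted, true_labels, classes=[0, 1, 2])
```

Here class 0 has precision 2/3 and recall 1 (F1 = 0.8), class 1 has precision 1/3 and recall 1/2 (F1 = 0.4), and class 2 is never predicted (F1 = 0), so the macro average is 0.4.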

In [11]:
model = SimpleCNN(
        ConvLayer(input_shape=(1, 28, 28), kernel_size=5, num_kernels=6, padding=0),
        ReLU(),
        ConvLayer(input_shape=(6, 28, 28), kernel_size=5, num_kernels=6, padding=0),
        ReLU(),
        MaxPool(),
        Output(6 * 14 * 14, 10)
)
model.train(trainset, trainlabels, testset, testlabels, epochs=3, learn_rate=0.005)

Training...


Epoch 1, loss: 0.467
Evaluating dev...
Dev Accuracy: 0.873, Dev F1 Score: 0.870


Epoch 2, loss: 0.320
Evaluating dev...
Dev Accuracy: 0.888, Dev F1 Score: 0.886


Epoch 3, loss: 0.285
Evaluating dev...
Dev Accuracy: 0.891, Dev F1 Score: 0.890
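The `Output(6 * 14 * 14, 10)` argument in the model definition follows from how each layer changes the shape of its input. The sketch below traces the shapes through the network; `same_conv_shape` and `pool_shape` are hypothetical helpers that assume zero padding with stride 1 and non-overlapping pooling windows.

```python
def same_conv_shape(shape, num_kernels):
    """A zero-padded, stride-1 convolution keeps height and width; depth becomes num_kernels."""
    _, h, w = shape
    return (num_kernels, h, w)

def pool_shape(shape, pool_size=2):
    """Non-overlapping pooling divides height and width by pool_size."""
    d, h, w = shape
    return (d, h // pool_size, w // pool_size)

shape = (1, 28, 28)                # one grayscale Fashion MNIST image
shape = same_conv_shape(shape, 6)  # after the first conv layer
shape = same_conv_shape(shape, 6)  # after the second conv layer
shape = pool_shape(shape)          # after 2x2 max pooling

# Number of values fed to the output layer once flattened
flat_len = shape[0] * shape[1] * shape[2]
```

The final shape is (6, 14, 14), which flattens to 1176 inputs for the output layer.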


In [13]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dense, Flatten
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.optimizers.legacy import SGD

train_images = (trainset / 255) - 0.5
test_images = (testset / 255) - 0.5

train_images = np.expand_dims(train_images, axis=3)
test_images = np.expand_dims(test_images, axis=3)

model = Sequential([
  Conv2D(6, 5, padding="same", input_shape=(28, 28, 1), use_bias=True, activation='relu'),
  Conv2D(6, 5, padding="same", use_bias=True, activation='relu'),
  MaxPooling2D(pool_size=2),
  Flatten(),
  Dense(10, activation='softmax'),
])

model.compile(SGD(learning_rate=.005), loss='categorical_crossentropy', metrics=['accuracy'])

model.fit(
  train_images,
  to_categorical(trainlabels),
  batch_size=1,
  epochs=3,
  validation_data=(test_images, to_categorical(testlabels)),
)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.src.callbacks.History at 0x2b0c4f3a0>
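When comparing the two models, note that they are close but not identical. One visible difference is the bias: the `ConvLayer` above stores one bias per output position (`bias_shape` is `(num_kernels, height, width)`), while Keras `Conv2D` uses one bias per filter. The parameter counts below are a rough sketch of that difference; the helper names are made up for this example.

```python
def numpy_conv_params(depth, num_kernels, kernel_size, out_h, out_w):
    """ConvLayer above: kernel weights plus one bias per output position."""
    return num_kernels * depth * kernel_size ** 2 + num_kernels * out_h * out_w

def keras_conv_params(depth, num_kernels, kernel_size):
    """Keras Conv2D: kernel weights plus one bias per filter."""
    return num_kernels * depth * kernel_size ** 2 + num_kernels

# Dense output layer: weights plus bias, the same in both models
dense_params = 6 * 14 * 14 * 10 + 10

numpy_total = (
    numpy_conv_params(1, 6, 5, 28, 28)
    + numpy_conv_params(6, 6, 5, 28, 28)
    + dense_params
)
keras_total = (
    keras_conv_params(1, 6, 5)
    + keras_conv_params(6, 6, 5)
    + dense_params
)
```

Under these assumptions the NumPy model has 22228 trainable values and the Keras model 12832, with the whole gap coming from the per-position biases.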