# Matrix Analysis 2022 - EE312
## Week 8 - Image classification with Radial Basis Function (RBF) networks
[LTS2](https://lts2.epfl.ch)

In [None]:
#handle imports
import numpy as np
from sklearn.datasets import fetch_openml
from matplotlib import pyplot as plt
from sklearn.metrics import pairwise_distances
from scipy.spatial import distance_matrix
from scipy.special import softmax

### Image Classification
In this exercise, we will be doing image classification with a simple neural network. For simplicity, let's assume we will be working with black and white images.
Given an input image $i$ represented as a vector of pixel intensities $\mathbf{x}_i \in [0,1]^d$, we want to predict its correct label $\mathbf{y}_i$, which is represented as a one-hot vector in $\{0,1\}^K$, where $K$ is the number of possible categories (classes) that the image may belong to. For example, we may have pictures of cats and dogs, and our goal would be to correctly tag those images as either cat or dog. In that case we would have $K=2$, and the vectors $\begin{pmatrix}0 \\ 1\end{pmatrix}$ and $\begin{pmatrix}1 \\ 0\end{pmatrix}$ to represent the classes of cat and dog.  
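
As a small illustration of the one-hot encoding described above, the following sketch builds label vectors from integer class indices. The names `class_indices`, `K` and `one_hot_labels` and the values used are hypothetical and not part of the notebook.

```python
import numpy as np

# Hypothetical integer class indices for four images, with K = 3 classes
class_indices = np.array([0, 2, 1, 2])
K = 3

# Row k of the K x K identity matrix is the one-hot vector of class k
one_hot_labels = np.eye(K)[class_indices]
print(one_hot_labels)
```

Taking `argmax` along each row recovers the integer class index from its one-hot vector.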

In today's example we will be using the MNIST handwritten digit dataset. It contains images of handwritten numbers from 0 to 9 and our goal is to create a model that can accurately tag each image with its number. Let's load the data first.

In [None]:
### Load the data
# mnist = fetch_openml('mnist_784') # this might not work well on noto. uncomment (and comment the rest) at will.
import requests
r=requests.get('https://os.unil.cloud.switch.ch/swift/v1/lts2-ee312/mnist_data.npz', allow_redirects=True)
with open('mnist_data.npz', 'wb') as f: # save locally
    f.write(r.content)
mnist = np.load('mnist_data.npz')

In the context of classification, neural networks are models that, given one (or multiple) input data points, produce a corresponding label for each input. The model itself consists of parametric functions $g_i$ which are applied sequentially to the input data, resulting in a set of labels which are the model's prediction for the data. For example, in a model that consists of two parametric functions $g_1$ and $g_2$ applied in that order, for a given $\mathbf{x}_i$, we have the predicted label $\hat{\mathbf{y}}_i = g_2(g_1(\mathbf{x}_i))$. The parametric functions are commonly called "layers".

In a standard image classification setup, we are given some training data which we can use to tune the parameters of the parametric functions $g_i$ in order to improve the model's ability to predict the labels correctly. The parameters are generally tuned with respect to some objective (commonly called a loss function). We want to find the parameters of the model that minimize this loss function. Various loss functions can be used, but in general they tend to encode how "wrong" the model is. For example, on a given image $i$ one can use the loss $\mathcal{L}(\hat{\mathbf{y}}_i, \mathbf{y}_i)= \sum_{j=1}^{K}(\hat{y}_{ij} - y_{ij})^2$, which is the sum of the squared differences between the coordinates of the predicted label $\hat{\mathbf{y}}_i$ and those of the actual label $\mathbf{y}_i$.
Minimizing the loss over the whole training set is referred to as "training the model". Furthermore, the goal is that given new data we have not seen before and we have not trained our model with, the model will still be able to classify accurately.
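
To make the per-image loss concrete, here is a minimal sketch that evaluates it for one made-up prediction. The vectors `y_hat` and `y` are hypothetical values chosen for illustration only.

```python
import numpy as np

# Hypothetical prediction and one-hot label for one image, with K = 3 classes
y_hat = np.array([0.1, 0.7, 0.2])
y = np.array([0.0, 1.0, 0.0])

# Sum over the K coordinates of the squared differences
loss = np.sum((y_hat - y) ** 2)
print(loss)
```

A prediction equal to the one-hot label gives a loss of zero, and the loss grows as the predicted coordinates move away from the label.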

Before we go into the details of the model and how we will train it, let's prepare the data.

In [None]:
# Preprocess the data
# mnist_data = mnist.data # openml version
# images = mnist_data.to_numpy() # openml version

images = mnist['data']
num_images = images.shape[0]

train_set_size = 60000
test_set_size = 10000

# scale pixel intensities from [0, 255] to [0, 1]
train_images = images[:train_set_size]/255.

test_images = images[-test_set_size:]/255.

#create one-hot encodings of labels
# mnist_target = mnist.target.to_numpy(dtype=int) # openml version
mnist_target = mnist['target']
# cast to int first, in case the targets are stored as strings or floats
int_targets = np.asarray(mnist_target).astype(int)
num_classes = int(int_targets.max()) + 1
# row k of the identity matrix is the one-hot vector of class k
labels = np.eye(num_classes)[int_targets]

#labels in one-hot
train_labels = labels[:train_set_size]
test_labels = labels[-test_set_size:]

#labels in integer form
int_labels = np.array(mnist_target, dtype=int)
int_labels_train = int_labels[:train_set_size]
int_labels_test = int_labels[-test_set_size:]

In [None]:
# View an image to make sure everything went well
which_one = 5
plt.imshow(train_images[which_one].reshape((28,28)));

### 1. Radial Basis Function (RBF) networks

For our task, we will be using Radial Basis Function (RBF) Networks as our neural network model.
The pipeline, which is presented in the image below, consists of two layers. The first employs non-linear functions $g_1(\mathbf{x};\boldsymbol{\mu}): \mathbb{R}^{n \times d} \rightarrow \mathbb{R}^{n \times c}$.
The second is a linear layer, represented by a matrix of weights $\mathbf{W} \in \mathbb{R}^{c \times K}$, which maps the output of the previous layer to class scores; its role is to predict labels. 

The pipeline proceeds in the following steps:

i) Choose a set of $c$ points $\boldsymbol{\mu}_j\in [0,1]^d$.     
ii) Compute $g_1(\mathbf{x}_i;\boldsymbol{\mu}_j) = \exp\left(-\frac{||\mathbf{x}_i-\boldsymbol{\mu}_j||^2}{\sigma}\right)=a_{ij}$ for all possible pairs of $i$ and $j$. Here $\sigma$ is a hyperparameter that controls the width of the Gaussian.  
iii) Compute the predicted labels $g_2(\mathbf{a}_i)= \mathbf{a}_i^{\top}\mathbf{W}= \hat{\mathbf{y}}_i$. Here $\mathbf{a}_i \in \mathbb{R}^c$ are the outputs of the layer $g_1$ for an input image $i$. $\hat{\mathbf{y}}_i$ is a row vector and $\hat{y}_{ij} = \sum_{m=1}^{c}a_{im}w_{mj}$, $j\in\{1,...,K\}$. 

![RBF_NN.png](images/RBF_NN.png)

Intuitively, the first layer of the RBF network can be viewed as matching the input data with a set of prototypes (templates) through a Gaussian whose width is determined by $\sigma$. The second layer performs a weighted combination of the matching scores of the previous layer to determine the predicted label for a given point.
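
The two layers can be sketched on toy data as follows. This is only an illustration under stated assumptions: the data are random, the Gaussian is taken as $\exp(-||\mathbf{x}_i-\boldsymbol{\mu}_j||^2/\sigma)$, the distances are computed with plain NumPy broadcasting, and all names (`X`, `centers`, `sigma_toy`, `A`, `W`, `Y_hat`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n = 6 points in d = 4 dimensions with entries in [0, 1]
X = rng.random((6, 4))

# Choose c = 3 prototypes among the data points, without repetition
c = 3
centers = X[rng.choice(X.shape[0], size=c, replace=False)]

# Squared Euclidean distance between every point and every prototype, shape (n, c)
sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)

# First layer: Gaussian matching scores a_ij (sigma_toy controls the width)
sigma_toy = 2.0
A = np.exp(-sq_dists / sigma_toy)

# Second layer: linear map from the c matching scores to K = 2 class scores
W = rng.standard_normal((c, 2))
Y_hat = A @ W
print(A.shape, Y_hat.shape)
```

Each matching score lies in $(0, 1]$ and equals 1 exactly when a point coincides with a prototype, which is what makes the first layer behave like template matching.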

**Exercise:** For hyperparameters $c$ and $\sigma$ of your choice, select $c$ prototypes and obtain the output of the first layer of the RBF network. The prototypes can simply be random images from your training set.
The following functions might be helpful:
- [pairwise_distances (from scikit-learn)](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html)
- [random.choice (from numpy)](https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html)

In [None]:
#pick random centers
number_of_centers = 200
sigma = 100


rand_centers = #your code here

In [None]:
def get_activations(imgs, rand_centers, sigma):
    # your code here
    return activations

activations = get_activations(train_images, rand_centers, sigma)

### Training the network

To make things easier, we will fix the parameters $\boldsymbol{\mu}$ and $\sigma$ of the network, i.e., we decide their values beforehand and they remain constant throughout training and testing of the model. Therefore, the only trainable parameters are going to be the weights of the second layer.
To train the model, we are going to use the mean squared loss function that we mentioned earlier. For a training dataset with $n$ images we have

$$ \mathcal{L}(\text{training data}, \text{training labels}) = \frac{1}{2n}\sum_{i=1}^n\mathcal{L}(\hat{\mathbf{y}}_i,\mathbf{y}_i) = \frac{1}{2n}\sum_{i=1}^n ||(\hat{\mathbf{y}}_{i} - \mathbf{y}_{i})||^2.$$




There are two ways of tuning the weights $\mathbf{W}$:  
i) Backpropagation.   
ii) Solve a linear system.

### Training with backpropagation

Backpropagation depends on [gradient descent](https://en.wikipedia.org/wiki/Gradient_descent#Description). The goal is to update the trainable parameters of the network by "moving them" in the direction that will decrease the loss function.
In our case, the weights $w_{kl}$ are updated in the following manner
$$ w_{kl}' = w_{kl}- \gamma \frac{\partial\mathcal{L}(\text{training data}, \text{training labels})}{\partial w_{kl}}, $$
where $\gamma$ is a hyper-parameter called the learning rate. The gradient of the loss points in the direction of steepest ascent, so subtracting it moves the weights of the network in the direction of steepest descent.  
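
The update rule can be illustrated on a one-dimensional toy problem. The objective $f(w) = (w-3)^2$, its gradient `grad_f`, the starting point and the learning rate below are hypothetical choices for illustration and are unrelated to the network's loss.

```python
# Toy objective f(w) = (w - 3)^2, whose gradient is f'(w) = 2 (w - 3)
def grad_f(w):
    return 2.0 * (w - 3.0)

w = 0.0        # initial guess
gamma = 0.1    # learning rate
for _ in range(100):
    # step against the gradient, as in the update rule above
    w = w - gamma * grad_f(w)

print(w)
```

The iterates approach the minimizer $w = 3$. A learning rate that is too large (here, above 1) would make them diverge instead.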

**Question**: For the mean squared error loss, what is the gradient of the loss with respect to the weights $w_{kl}$ of the network?

**Answer**: 

Train the weights of the linear layer using stochastic gradient descent. For $p$ iterations (called epochs), you have to update each weight $w_{kl}$ of the network once for each image, by computing the gradient of the loss with respect to that weight.


In [None]:
# Initial values for hyperparams. Feel free to experiment with them.
weights = (1/28)*np.random.randn(number_of_centers, num_classes)
epochs = 50
learning_rate = 0.1

def get_predictions_loss(activations, weights, labels, int_labels):
    predictions = activations@weights
    num_correct_predictions = ((predictions.argmax(1) - int_labels)==0).sum()
    loss = ((predictions - labels)*(predictions - labels)).sum(1).mean()
    return loss, num_correct_predictions

In [None]:
# compute the gradient for a single input
def compute_gradient(activation, weights, train_label):
    # your code here
    return None
    

#Backpropagation with SGD
for k in range(epochs):
    for counter, activation in enumerate(activations):
        # use the one-hot label that matches the current activation
        gradient = compute_gradient(activation, weights, train_labels[counter])/train_set_size
        weights = weights - learning_rate*gradient
    
    loss_train, num_correct_predictions_train = get_predictions_loss(activations, weights, train_labels, int_labels_train)
    print("Loss:", loss_train)
    print("Number of correct predictions:", num_correct_predictions_train)

**Exercise:** Check how well your network does on the test set. Print its accuracy.

In [None]:
def get_accuracy(predictions, int_labels, set_size):
    num_correct_predictions = ((predictions.argmax(1) - int_labels)==0).sum()
    return num_correct_predictions/set_size

test_activations = get_activations(test_images, rand_centers, sigma)
test_predictions = test_activations@weights
print(f"The accuracy on the test set is: {get_accuracy(test_predictions, int_labels_test, test_set_size)*100} %")  

### Solving the linear system

Since we only have one weight matrix to tune, we can avoid learning with backpropagation entirely. Consider the squared error for the whole dataset and, for simplicity, a one-dimensional binary label $y_i$ for each data point. The squared loss summed over the dataset is
$$ \sum_{i=1}^n (\hat{y}_{i} - y_{i})^2 = ||\mathbf{A}\mathbf{w} - \mathbf{y}||^2. $$
Here $\mathbf{A} \in \mathbb{R}^{n \times c}$ is the matrix that contains the outputs (activations) of the first layer, $\mathbf{w} \in \mathbb{R}^{c}$ is the weight vector and $\mathbf{y} \in \mathbb{R}^{n}$ stacks the labels. From a linear algebra perspective, we are looking for a vector $\mathbf{w}$ that solves the linear system $\mathbf{A}\mathbf{w} = \mathbf{y}$.  
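
As a toy sketch of what minimizing $||\mathbf{A}\mathbf{w} - \mathbf{y}||^2$ looks like numerically, the snippet below uses `np.linalg.lstsq` on a small random system. The names `A_toy`, `y_toy` and `w_toy` and the sizes are hypothetical, and this is one possible approach under the assumption that a least-squares solution is what we are after, not necessarily the intended answer to the exercises that follow.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy system with more equations (n = 20) than unknowns (c = 3)
A_toy = rng.random((20, 3))
y_toy = rng.integers(0, 2, size=20).astype(float)  # binary labels

# Vector minimizing ||A_toy w - y_toy||^2
w_toy, residuals, rank, _ = np.linalg.lstsq(A_toy, y_toy, rcond=None)

# At the minimizer, the residual is orthogonal to the columns of A_toy
print(A_toy.T @ (A_toy @ w_toy - y_toy))
```

The printed vector is numerically zero, which is the normal-equation condition $\mathbf{A}^\top(\mathbf{A}\mathbf{w} - \mathbf{y}) = \mathbf{0}$ for a least-squares minimizer.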

**Question:** Can we find solutions to this system, and if so, how? Justify your answer.

**Answer**: 

**Exercise:** Based on your answer above, compute the weights of the neural network that best classify the data points of the training set.

In [None]:
#calculate the weights of the linear layer
weights_lsq = # your code here

**Exercise:** Using the weights you computed, classify the points in the training set and print the accuracy.

In [None]:
#predict the labels of each image in the training set and compute the accuracy
train_prediction_lsq = activations@weights_lsq
print(f"The accuracy on the training set is: {get_accuracy(train_prediction_lsq, int_labels_train, train_set_size)*100} %")

**Exercise:** Using the weights you computed, classify the points in the test set and print the accuracy.

In [None]:
#calculate the activations of the test set
test_activations = get_activations(test_images, rand_centers, sigma)

In [None]:
#predict the accuracy on the test set
test_predictions_lsq = test_activations@weights_lsq
print(f"The accuracy on the test set is: {get_accuracy(test_predictions_lsq, int_labels_test, test_set_size)*100} %")

### **Open ended**: On the choice of templates. 
Suggest a different or more refined way to select templates for the RBF network and implement it. Check how it compares with your original approach.
Check how it works with the backpropagation and linear system solutions.