# UCLAIS Tutorial Series Challenge 2

We are proud to present you with the second challenge of the 2022-23 UCLAIS tutorial series: the CIFAR-10 image classification problem. You will be introduced to a variety of core concepts in **computer vision** and specifically the implementation of convolutional neural network (CNN) architectures using the popular machine learning package, [TensorFlow](https://www.tensorflow.org/).

This Jupyter notebook will guide you through the various general stages involved in end-to-end machine learning projects, including data visualisation, data preprocessing, model selection, model training and model evaluation. Finally, you will have the opportunity to submit the model you build to [DOXA](https://doxaai.com/) for evaluation on an unseen test set.

This notebook contains blank code blocks for you to fill in and experiment with your own ideas. See the `starter-SOLUTION.ipynb` notebook if you need more guidance.

If you do not already have a DOXA account, you will want to [sign up](https://doxaai.com/sign-up) first before proceeding.

## Background & Motivation

**CIFAR-10**

![title](./media/CIFAR-10.png)

**Background**: Image classification is one of the fundamental tasks in computer vision. It has driven technological advances in many prominent fields and industries, including healthcare, manufacturing and the automotive industry, among many others.

**Objective**: For this challenge, your aim is to build a model that can accurately predict the class to which images drawn from the popular CIFAR-10 dataset belong. The images in the dataset can each belong to one of ten different classes.

**Dataset**: The data is drawn from the [CIFAR-10 dataset](https://www.cs.toronto.edu/~kriz/cifar.html). We have divided it into a **'smaller dataset'** (43.9 MB) and a **'larger dataset'** (146.5 MB), which you can use once you are feeling more comfortable. The small dataset contains 15,000 images (1,500 per class), whereas the large dataset contains 50,000 images (5,000 per class). In other words, both datasets are _balanced_. The partitioned dataset can be downloaded from [Google Drive](https://drive.google.com/drive/folders/11M8y08hEDTmMpVq3tZCU9ajX7Gui_0nN).

## Installing and Importing Useful Packages

To get started, we will install a number of common machine learning packages.

In [None]:
%pip install numpy pandas matplotlib seaborn scikit-learn tensorflow opencv-python doxa-cli gdown

In [None]:
# Import relevant libraries
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import cv2

# Import relevant sklearn classes/functions related to data preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import classification_report, ConfusionMatrixDisplay

# Import relevant TensorFlow classes
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Conv2D, MaxPool2D, Dropout
from tensorflow.keras.optimizers import Adam

%matplotlib inline

## Data Loading
The first step is to gather the data that we will be using. The data can be downloaded directly from [Google Drive](https://drive.google.com/drive/folders/11M8y08hEDTmMpVq3tZCU9ajX7Gui_0nN) or by running the cell below.

In [None]:
# Let's download the dataset if we don't already have it!
if not os.path.exists("data"):
    os.makedirs("data", exist_ok=True)

    !gdown https://drive.google.com/drive/folders/11M8y08hEDTmMpVq3tZCU9ajX7Gui_0nN -O ./data --folder

We will be using the small dataset in this notebook, but feel free to switch to the large dataset once you are feeling comfortable. To do so, comment out the `DATASET = "small"` line in the cell below and uncomment the `DATASET = "large"` line. The larger dataset should allow you to improve your model further, but you may require additional compute power.

In [None]:
# Select either the "small" dataset or the "large" dataset
DATASET = "small"
# DATASET = "large"

# Load the saved .npz file
data_original = np.load(f"./data/train_{DATASET}.npz")["data"]

# Load the labels
labels = np.genfromtxt(f"./data/train_{DATASET}_label.csv").astype("uint8")

In [None]:
# We then make an in-memory copy of the dataset that we can manipulate 
# and experiment with. Just remember to rerun this code block when you
# change your data preprocessing approach!

data = data_original.copy()

## Data Understanding & Visualisation
Before we start training our machine learning model, it is important to look at and understand the dataset that we will be using. This will provide some insight into which model, hyperparameters and loss function are suitable for the problem we are dealing with.

In [None]:
# Find the shape of our features and labels

In [None]:
# Display the labels we will be predicting (integers in the range 0 to 9)

In [None]:
# Print the friendly label names (e.g. `cat`, `dog`)
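
As a rough sketch of the kind of inspection described in this section, the snippet below prints shapes, the pixel range and per-class counts. It runs on a synthetic stand-in rather than the real files: `demo_data`, `demo_labels` and the `(N, 32, 32, 3)` `uint8` layout are assumptions, and `CLASS_NAMES` uses the standard CIFAR-10 ordering, which is assumed (not confirmed) to match the labels in this challenge.

```python
import numpy as np

# Synthetic stand-in for the real arrays; the (N, 32, 32, 3) uint8 layout
# is an assumption about how the challenge data is stored.
rng = np.random.default_rng(0)
demo_data = rng.integers(0, 256, size=(100, 32, 32, 3), dtype=np.uint8)
demo_labels = np.repeat(np.arange(10), 10).astype("uint8")

# Standard CIFAR-10 class names; assumed to correspond to labels 0 to 9 here.
CLASS_NAMES = ["airplane", "automobile", "bird", "cat", "deer",
               "dog", "frog", "horse", "ship", "truck"]

print("Features:", demo_data.shape, demo_data.dtype)
print("Labels:", demo_labels.shape, demo_labels.dtype)
print("Pixel range:", demo_data.min(), "to", demo_data.max())

# Count how many images belong to each class to check that the set is balanced
classes, counts = np.unique(demo_labels, return_counts=True)
for cls, count in zip(classes, counts):
    print(f"{cls}: {CLASS_NAMES[cls]} ({count} images)")
```

With the real data, the same calls can be applied to `data` and `labels` from the loading cell.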

In [None]:
# Plot a number of images using matplotlib
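
One possible way to plot a grid of images is sketched below. It uses random noise in place of real images, and the names `demo_images` and `demo_labels`, as well as the `(N, 32, 32, 3)` layout, are assumptions for illustration only.

```python
import matplotlib.pyplot as plt
import numpy as np

# Random noise standing in for real images; shape (N, 32, 32, 3) is assumed.
rng = np.random.default_rng(0)
demo_images = rng.integers(0, 256, size=(8, 32, 32, 3), dtype=np.uint8)
demo_labels = np.arange(8)

# Lay the images out on a 2 x 4 grid, one image per subplot
fig, axes = plt.subplots(2, 4, figsize=(8, 4))
for ax, image, label in zip(axes.flat, demo_images, demo_labels):
    ax.imshow(image)                 # imshow accepts uint8 RGB arrays directly
    ax.set_title(f"label {label}")
    ax.axis("off")                   # hide the ticks around each image
fig.tight_layout()
```

Replacing `demo_images` and `demo_labels` with slices of `data` and `labels` would show real samples, provided the real array has the assumed layout.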

## Data Preprocessing 

For this step, there are two basic things we can do before we start building our neural network model:

**1. Label Encoding**

As shown in the previous section, our labels are integers in the range of 0 to 9. A classifier with ten output neurons trained with a categorical cross-entropy loss expects each target to be a ten-element vector instead, so it is recommended to one-hot encode the labels using the [LabelBinarizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html) class provided by Scikit-Learn.

**2. Splitting into the Training and Validation Sets**

The next preprocessing step before model training is to split our dataset into training and validation sets. The training set will be used to train our model, while the validation set will be used to evaluate its performance on data it has not seen during training (say, to compare it against other models you might train).

In [None]:
# One-hot encode the labels

# HINT: Use LabelBinarizer class provided by scikit-learn
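
The one-hot encoding step could look roughly like the sketch below. It works on a small toy array, `demo_labels`, which is an illustrative stand-in for the real `labels` array.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Toy integer labels covering all ten classes (a stand-in for `labels`)
demo_labels = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 3, 3])

encoder = LabelBinarizer()
demo_one_hot = encoder.fit_transform(demo_labels)

print(demo_one_hot.shape)  # one row per label, one column per class
print(demo_one_hot[3])     # the row for label 3 has a single 1 in column 3

# inverse_transform maps the one-hot rows back to the integer labels
recovered = encoder.inverse_transform(demo_one_hot)
```

Keeping the fitted `encoder` around is handy later, since it can turn predictions back into the original integer labels.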

In [None]:
# Split our features and output labels into separate training and validation sets

# HINT: Use train_test_split function from scikit-learn
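
A minimal sketch of the split is shown below on synthetic arrays. The names `demo_X` and `demo_y`, the 80/20 ratio and the fixed seed are illustrative choices rather than requirements of the challenge.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and balanced integer labels (10 images per class)
rng = np.random.default_rng(0)
demo_X = rng.integers(0, 256, size=(100, 32, 32, 3), dtype=np.uint8)
demo_y = np.repeat(np.arange(10), 10)

X_train, X_val, y_train, y_val = train_test_split(
    demo_X,
    demo_y,
    test_size=0.2,     # hold back 20% for validation (an arbitrary choice)
    stratify=demo_y,   # keep the class proportions the same in both splits
    random_state=42,   # fixed seed so that the split is reproducible
)
print(X_train.shape, X_val.shape)
```

Stratifying keeps the validation set balanced, which makes per-class metrics easier to compare.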

## Constructing a Convolutional Neural Network

Now that we have completed all of the required preprocessing steps, we can proceed to the most exciting stage: constructing the neural network. For this, we will build a convolutional neural network (CNN), which is a popular network architecture in computer vision.

To construct the neural network, we will be using the functionality provided by TensorFlow, which greatly simplifies the task of building neural networks. The [documentation](https://www.tensorflow.org/api_docs/python/tf/keras/layers) describes all of the building blocks you can take advantage of to build neural networks in detail.

In [None]:
model = Sequential(
    [
        # Feel free to add multiple convolutional layers
        # Flatten
        # Add code for the last output layer
    ]
)
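
For orientation, here is one possible small architecture. It is a sketch and a starting point, not the reference solution: the name `demo_model`, the layer sizes, the dropout rate and the `(32, 32, 3)` input shape are all assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

demo_model = keras.Sequential(
    [
        keras.Input(shape=(32, 32, 3)),  # assumed image shape
        # Two convolution + pooling stages extract increasingly abstract features
        layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
        layers.MaxPool2D((2, 2)),
        layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
        layers.MaxPool2D((2, 2)),
        # Flatten the feature maps into a vector for the dense layers
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.5),  # randomly drops units during training
        # One output neuron per class; softmax yields class probabilities
        layers.Dense(10, activation="softmax"),
    ]
)
demo_model.summary()
```

The call to `.summary()` prints each layer's output shape and parameter count, which is a quick way to check that the layers fit together as intended.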

In [None]:
# Let's have a look at a summary of the model we've created

# HINT: Call .summary() on our model

## Training the Model

This is where the magic happens. We will now train the neural network architecture that we created above on our training set.



In [None]:
# Set all the required hyperparameters before starting the training process
# HINT: Call .compile()

# Start the training and save its progress in a variable called 'history'
# HINT: Call .fit()
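
The compile-and-fit pattern could be sketched as follows. The example trains a deliberately tiny network on random data for two epochs, so its numbers are meaningless. The names `tiny_model`, `X_demo`, `y_demo` and `demo_history`, and every hyperparameter value, are assumptions for illustration.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Random stand-in data: 64 images with one-hot encoded labels
rng = np.random.default_rng(0)
X_demo = rng.random((64, 32, 32, 3)).astype("float32")
y_demo = keras.utils.to_categorical(rng.integers(0, 10, size=64), num_classes=10)

tiny_model = keras.Sequential(
    [
        keras.Input(shape=(32, 32, 3)),
        layers.Conv2D(8, (3, 3), activation="relu"),
        layers.MaxPool2D((2, 2)),
        layers.Flatten(),
        layers.Dense(10, activation="softmax"),
    ]
)

# categorical_crossentropy matches one-hot encoded labels
tiny_model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

# Passing validation_data makes Keras report validation metrics every epoch
demo_history = tiny_model.fit(
    X_demo[:48],
    y_demo[:48],
    validation_data=(X_demo[48:], y_demo[48:]),
    epochs=2,
    batch_size=16,
    verbose=0,
)
print(sorted(demo_history.history.keys()))
```

The `history` attribute of the returned object is a dictionary of per-epoch metrics, which is what the plotting step that follows relies on.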

In [None]:
# Now that we have trained our model, let's plot how it performed on both
# the training and validation sets as the number of epochs increases
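
A sketch of such a plot is given below. To stay self-contained it uses a hand-written dictionary, `demo_history`, shaped like the `history.history` attribute returned by `.fit()`. The numbers in it are made up for illustration and are not results from this challenge.

```python
import matplotlib.pyplot as plt

# Made-up values shaped like `history.history`; illustrative only
demo_history = {
    "accuracy": [0.35, 0.48, 0.56, 0.62],
    "val_accuracy": [0.40, 0.47, 0.51, 0.52],
    "loss": [1.80, 1.45, 1.24, 1.08],
    "val_loss": [1.65, 1.48, 1.40, 1.39],
}
epochs = range(1, len(demo_history["loss"]) + 1)

fig, (ax_acc, ax_loss) = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: accuracy on the training and validation sets
ax_acc.plot(epochs, demo_history["accuracy"], label="training")
ax_acc.plot(epochs, demo_history["val_accuracy"], label="validation")
ax_acc.set(xlabel="epoch", ylabel="accuracy")
ax_acc.legend()

# Right panel: loss on the training and validation sets
ax_loss.plot(epochs, demo_history["loss"], label="training")
ax_loss.plot(epochs, demo_history["val_loss"], label="validation")
ax_loss.set(xlabel="epoch", ylabel="loss")
ax_loss.legend()

fig.tight_layout()
```

A widening gap between the training and validation curves is a common sign of overfitting.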

How has your model performed?

## Analyse the Model

Let's proceed to analyse our model further, in the hope of gaining some insight that can be used to improve our model architecture.

In [None]:
# Let's run our model on the validation set and base our class predictions on
# whichever neuron (of the 10 neurons in the output layer) has the largest value

In [None]:
# Do the same thing for our true labels
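
Turning output probabilities and one-hot labels back into class indices can be sketched with `np.argmax`. The toy arrays below use three classes instead of ten, and `demo_probs` and `demo_true_one_hot` are hypothetical stand-ins for the model's predictions and the validation labels.

```python
import numpy as np

# Toy softmax outputs for four samples over three classes
demo_probs = np.array(
    [
        [0.1, 0.7, 0.2],
        [0.8, 0.1, 0.1],
        [0.2, 0.3, 0.5],
        [0.6, 0.3, 0.1],
    ]
)
# Matching one-hot encoded true labels
demo_true_one_hot = np.array(
    [
        [0, 1, 0],
        [1, 0, 0],
        [0, 0, 1],
        [0, 1, 0],
    ]
)

# axis=1 picks the index of the largest value within each row
pred_classes = np.argmax(demo_probs, axis=1)
true_classes = np.argmax(demo_true_one_hot, axis=1)

accuracy = np.mean(pred_classes == true_classes)
print(pred_classes, true_classes, accuracy)
```

With a trained Keras model, the probabilities would come from calling `.predict()` on the validation features.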

In [None]:
# Plot a confusion matrix of the true and predicted labels
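
A confusion matrix could be produced along the lines below. The toy arrays `demo_true` and `demo_pred` and the three display labels are illustrative assumptions. With the real model, the inputs would be the class indices obtained from the validation set.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Toy true and predicted class indices for three classes
demo_true = np.array([0, 0, 1, 1, 2, 2, 2])
demo_pred = np.array([0, 1, 1, 1, 2, 0, 2])

# Rows correspond to true classes, columns to predicted classes
cm = confusion_matrix(demo_true, demo_pred, labels=[0, 1, 2])
print(cm)

display = ConfusionMatrixDisplay(
    confusion_matrix=cm,
    display_labels=["cat", "dog", "frog"],  # hypothetical names for the toy classes
)
display.plot(cmap="Blues")
```

Off-diagonal cells show which pairs of classes the model confuses, which can hint at where the architecture or the preprocessing needs more work.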

## Preparing our DOXA Submission

Once we are confident in the performance of our model, we can start deploying it to DOXA!

In [None]:
# Create a submission folder by downloading a few required files from GitHub
if not os.path.exists("submission"):
    os.makedirs("submission")
    
    !curl https://raw.githubusercontent.com/UCLAIS/doxa-challenges/main/Challenge-2/submission/doxa.yaml --output submission/doxa.yaml
    !curl https://raw.githubusercontent.com/UCLAIS/doxa-challenges/main/Challenge-2/submission/run.py --output submission/run.py

In [None]:
# Save the CNN model in the submission folder
model.save("submission/model")

## Submitting to DOXA

Before you can submit to DOXA, you must first ensure that you are enrolled for the challenge on the DOXA website. Visit [the challenge page](https://doxaai.com/competition/uclais-2) and click "Enrol" in the top-right corner.

You can then log in using the DOXA CLI by running the following command:

In [None]:
!doxa login

You can then submit your results to DOXA by running the following command:

In [None]:
!doxa upload submission

Yay! You have (probably) just uploaded your model to DOXA! Once DOXA has evaluated your model on an unseen test set, you will be able to see how it performs on the [scoreboard](https://doxaai.com/competition/uclais-2)!