# Session 6 - Benchmark classification on ```cifar-10```

This notebook builds on what we were doing last week with the handwritten digits from the MNIST dataset.

This week, we're working with another famous dataset in computer vision and image processing research - [cifar10](https://www.cs.toronto.edu/~kriz/cifar.html).

In [1]:
# path tools
import os
import cv2
import numpy as np

# data loader
from tensorflow.keras.datasets import cifar10

# machine learning tools
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# classification models
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

2023-03-10 12:24:10.650672: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-10 12:24:10.804715: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/coder/.local/lib/python3.9/site-packages/cv2/../../lib64:
2023-03-10 12:24:10.804737: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2023-03-10 12:24:11.429286: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7';

We're going to load the data using a function from the library ```TensorFlow```, which we'll be looking at in more detail next week. 

For now, we're just using it to fetch the data!

In [2]:
(X_train, y_train), (X_test, y_test) = cifar10.load_data()

Downloading data from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz


**Question:** What is the shape of the data?

In [3]:
X_train.shape # (number of images, height, width, colour channels)

(50000, 32, 32, 3)

Unfortunately, this version of the dataset only stores the labels as integers (0-9) rather than as names, so we need to create our own list of label names.

In [4]:
labels = ["airplane", 
          "automobile", 
          "bird", 
          "cat", 
          "deer", 
          "dog", 
          "frog", 
          "horse", 
          "ship", 
          "truck"]
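
To see how integer targets map onto these names, here is a small sketch. The `y_example` array is made up for illustration; it only mimics the `(n, 1)` column shape in which `cifar10.load_data()` returns its label arrays.

```python
import numpy as np

labels = ["airplane", "automobile", "bird", "cat", "deer",
          "dog", "frog", "horse", "ship", "truck"]

# hypothetical stand-in for y_train: integer class indices in a column of shape (n, 1)
y_example = np.array([[6], [9], [9], [4], [1]], dtype=np.uint8)

# flatten the column to a 1D array, then look up each index in the list of names
y_flat = y_example.ravel()
names = [labels[i] for i in y_flat]
print(names)
```

The position of a name in `labels` has to match the integer used in the dataset, which is why the order of the list matters.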

### Convert all the data to greyscale

In the following cell, I'm converting all of my images to greyscale and then making a ```numpy``` array at the end.

Notice that I'm using something funky here called *[list comprehensions](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions)*.

In [None]:
# example of a list comprehension
colors = ["red", "blue", "green"]
uppers = []

for color in colors:
    upper = color.upper() # make the string upper case
    uppers.append(upper) # append to the list
# the two lines inside the loop could also be compressed into one

# the same thing as a list comprehension
# the set-up is "do this thing for everything in the list", whereas a for loop reads the other way round: "for everything in this list, do this thing"
# the result is built inside [], so there is no need to create an empty list and append to it
uppers = [color.upper() for color in colors]

In [5]:
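# note: Keras returns these images in RGB channel order, so cv2.COLOR_RGB2GRAY would be the matching flag; COLOR_BGR2GRAY swaps the red and blue weights
X_train_grey = np.array([cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) for image in X_train])
X_test_grey = np.array([cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) for image in X_test])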

You can use list comprehensions "nicely" by wrapping multiple steps in a function and then calling that function on each item inside the comprehension. Otherwise, squeezing a multi-step for loop into a list comprehension can get very confusing.
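
A minimal sketch of that pattern, using only ```numpy```. The function name `preprocess`, the random `fake_images` array, and the luminance weights are assumptions made for illustration; this is not the same computation as `cv2.cvtColor`, which returns rounded integer values.

```python
import numpy as np

def preprocess(image):
    """Convert one RGB image to greyscale and scale it to the 0-1 range."""
    # weighted sum over the colour channels (commonly used luminance weights)
    grey = image @ np.array([0.299, 0.587, 0.114])
    return grey / 255.0

# hypothetical stand-in for X_train: 4 random 32x32 RGB images
rng = np.random.default_rng(42)
fake_images = rng.integers(0, 256, size=(4, 32, 32, 3), dtype=np.uint8)

# the comprehension stays short because the steps live inside the function
processed = np.array([preprocess(image) for image in fake_images])
print(processed.shape)
```

Keeping the steps in a named function also makes it easy to reuse exactly the same preprocessing for the training and test images.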

Then, we're going to do some simple scaling by dividing by 255. This leaves the shape unchanged (still 50,000 images of 32x32 pixels) but rescales the pixel values from the range 0-255 to the range 0-1. With inputs on this smaller scale, small changes in the weights have a more pronounced effect.

Rescaling helps the model converge faster and more easily.

In [6]:
X_train_scaled = (X_train_grey)/255.0
X_test_scaled = (X_test_grey)/255.0

### Reshaping the data

Next, we're going to reshape this data. 

In [7]:
nsamples, nx, ny = X_train_scaled.shape # nsamples = 50,000 images, nx = 32 and ny = 32 pixels
X_train_dataset = X_train_scaled.reshape((nsamples,nx*ny)) # new shape is (50000, 32*32): each image is flattened into a 1D array of 1,024 values

In [17]:
nsamples, nx, ny = X_test_scaled.shape
X_test_dataset = X_test_scaled.reshape((nsamples,nx*ny))

In [9]:
X_train_dataset.shape # which means our input data is now ready to go into the classifier models

(50000, 1024)
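
To make the flattening step concrete, here is a small sketch on made-up data. The `fake_scaled` array is a hypothetical stand-in for `X_train_scaled`, with only 5 images.

```python
import numpy as np

# hypothetical stand-in for X_train_scaled: 5 greyscale images of 32x32 pixels
fake_scaled = np.arange(5 * 32 * 32, dtype=float).reshape((5, 32, 32))

# -1 asks numpy to work out the second dimension (32 * 32 = 1024) itself
flat = fake_scaled.reshape((fake_scaled.shape[0], -1))
print(flat.shape)

# images are flattened row by row, so pixel (row, col) lands at index row * 32 + col
same_pixel = flat[0, 3 * 32 + 7] == fake_scaled[0, 3, 7]

# nothing is lost: reshaping back recovers the original images
restored = flat.reshape((-1, 32, 32))
```

The classifier sees each image as one long row of numbers, so any information about which pixels are neighbours is no longer explicit in the input.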

## Simple logistic regression classifier

We define our Logistic Regression classifier as we have done previously. You'll notice that I've set a lot of different parameters here - you can learn more in the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

penalty - with no penalty, the model is free to use all the weights and fit the training data as closely as it can. Adding a penalty (regularisation) shrinks the weights; an L1 penalty in particular pushes some of them to exactly 0, so that only the most important ones are used.

tol - the tolerance for the stopping criterion. After each epoch the solver checks how much the weights have changed; once the change drops below tol, training stops because the model is no longer improving much.

verbose - if False, there is no output on how the training is progressing.

solver - see the documentation for suggestions; sag and saga are recommended for larger datasets, and both also handle multinomial (multi-class) problems.


In [10]:
clf = LogisticRegression(penalty="none", 
                        tol=0.1, 
                        verbose=True, 
                        solver="saga",
                        multi_class="multinomial").fit(X_train_dataset, y_train)

  y = column_or_1d(y, warn=True)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Epoch 1, change: 1.00000000
Epoch 2, change: 0.24794480
Epoch 3, change: 0.16788796
convergence after 4 epochs took 10 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   10.2s finished


In [11]:
y_pred = clf.predict(X_test_dataset)
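
Conceptually, a multinomial logistic regression scores every class with a linear function of the pixel values and turns those scores into probabilities with a softmax; the prediction is the class with the highest probability. The sketch below shows that idea with random weights. The arrays `X`, `W` and `b` are made up for illustration and are not taken from the fitted `clf`.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical values: 3 flattened images with 1024 pixels each, 10 classes
X = rng.random((3, 1024))
W = rng.normal(scale=0.01, size=(1024, 10))  # one column of weights per class
b = np.zeros(10)                             # one intercept per class

# linear score for every image and class, shape (3, 10)
scores = X @ W + b

# softmax: subtract the row maximum first for numerical stability
scores = scores - scores.max(axis=1, keepdims=True)
probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# index of the most probable class for each image
predicted = probs.argmax(axis=1)
print(predicted)
```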

We can then print our classification report, using the label names that we defined earlier.

In [12]:
report = classification_report(y_test, 
                               y_pred, 
                               target_names=labels)
print(report)

              precision    recall  f1-score   support

    airplane       0.34      0.37      0.35      1000
  automobile       0.37      0.39      0.38      1000
        bird       0.27      0.22      0.24      1000
         cat       0.24      0.14      0.18      1000
        deer       0.26      0.19      0.22      1000
         dog       0.29      0.33      0.31      1000
        frog       0.27      0.37      0.31      1000
       horse       0.33      0.30      0.31      1000
        ship       0.35      0.41      0.37      1000
       truck       0.39      0.45      0.42      1000

    accuracy                           0.32     10000
   macro avg       0.31      0.32      0.31     10000
weighted avg       0.31      0.32      0.31     10000
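
To read the report, it helps to see how precision, recall and f1-score are calculated for a single class. The sketch below does this by hand on made-up predictions for a two-class problem; the arrays `y_true` and `y_hat` are hypothetical and unrelated to the results above.

```python
import numpy as np

# made-up true and predicted class indices for 8 images and 2 classes
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 1, 1, 1, 1, 1, 0])

cls = 1  # the class we are scoring
tp = np.sum((y_hat == cls) & (y_true == cls))  # predicted cls and correct
fp = np.sum((y_hat == cls) & (y_true != cls))  # predicted cls but wrong
fn = np.sum((y_hat != cls) & (y_true == cls))  # missed examples of cls

precision = tp / (tp + fp)  # how many predictions of cls were right
recall = tp / (tp + fn)     # how many true examples of cls were found
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)
```

The support column in the report is simply the number of true examples of each class, here 1000 test images per class.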



Start by setting a benchmark with something simple, like Logistic Regression. See how it does, then check whether more complex machine learning models such as neural networks can beat that target.

## Neural network classifier

I've set a couple of different parameters here - you can see more in the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html).

**NB!** This will take a long time to run! On the 32 CPU machine on UCloud, this takes around 30 seconds per iteration.

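learning_rate="adaptive" - the idea is to learn quickly at first and then, once the weights have settled, to go more slowly and carefully, so the model adapts its learning rate as training progresses. Note that, according to the scikit-learn documentation, this setting is only used when solver="sgd"; with the default solver ("adam") it has no effect.

early_stopping - this is connected to tol from before. With early_stopping=True, a small part of the training data (10% by default) is set aside as a validation set, and training stops when the validation score has not improved by at least tol for a number of consecutive iterations (n_iter_no_change, 10 by default).

Output and validation score - in each iteration the model is trained on the training data by minimising the loss, which is the loss value printed in the output. Its accuracy is then checked on the validation set, which gives the validation score. This is repeated for up to 20 iterations (max_iter=20), after which we evaluate how the model does on the totally unseen test data.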
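
The early stopping idea can be sketched in a few lines of plain Python. This is a simplified illustration, not scikit-learn's actual implementation; the function name `stopping_epoch`, the argument name `patience` (playing the role of n_iter_no_change) and the validation scores are all made up.

```python
def stopping_epoch(val_scores, tol=1e-4, patience=10):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("-inf")
    stale = 0  # epochs in a row without a meaningful improvement
    for epoch, score in enumerate(val_scores, start=1):
        if score < best + tol:
            stale += 1   # did not beat the best score by at least tol
        else:
            stale = 0    # real improvement, so reset the counter
        best = max(best, score)
        if stale > patience:
            return epoch
    return None  # never stopped early

# hypothetical validation scores that plateau around 0.30
scores = [0.10, 0.20, 0.30, 0.30, 0.29, 0.30, 0.30, 0.31]
stopped_at = stopping_epoch(scores, tol=0.02, patience=3)
print(stopped_at)
```

With a patience of 10 and at most 20 iterations, as in the cell below, training will often reach the iteration limit before early stopping has a chance to trigger.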

In [13]:
clf = MLPClassifier(random_state=42,
                    hidden_layer_sizes=(64, 10), # two hidden layers with 64 and 10 nodes; results get better with (100, 10), which increases the complexity of the architecture
                    learning_rate="adaptive",
                    early_stopping=True,
                    verbose=True,
                    max_iter=20).fit(X_train_dataset, y_train)

  y = column_or_1d(y, warn=True)


Iteration 1, loss = 2.30872956
Validation score: 0.133000
Iteration 2, loss = 2.15971661
Validation score: 0.239200
Iteration 3, loss = 2.02581278
Validation score: 0.265200
Iteration 4, loss = 1.97076182
Validation score: 0.281800
Iteration 5, loss = 1.93555578
Validation score: 0.302600
Iteration 6, loss = 1.90926190
Validation score: 0.315600
Iteration 7, loss = 1.89160286
Validation score: 0.318800
Iteration 8, loss = 1.87500641
Validation score: 0.322200
Iteration 9, loss = 1.86730610
Validation score: 0.316800
Iteration 10, loss = 1.85845283
Validation score: 0.321200
Iteration 11, loss = 1.84549829
Validation score: 0.331400
Iteration 12, loss = 1.83590762
Validation score: 0.328600
Iteration 13, loss = 1.82908945
Validation score: 0.331400
Iteration 14, loss = 1.82320985
Validation score: 0.330600
Iteration 15, loss = 1.81056794
Validation score: 0.343400
Iteration 16, loss = 1.80707784
Validation score: 0.338400
Iteration 17, loss = 1.79877427
Validation score: 0.339800
Iterat



In [14]:
y_pred = clf.predict(X_test_dataset)

Lastly, we can get our classification report as usual.

In [15]:
report = classification_report(y_test, 
                               y_pred, 
                               target_names=labels)
print(report)

              precision    recall  f1-score   support

    airplane       0.38      0.41      0.40      1000
  automobile       0.40      0.49      0.44      1000
        bird       0.26      0.34      0.30      1000
         cat       0.28      0.11      0.16      1000
        deer       0.27      0.26      0.27      1000
         dog       0.33      0.34      0.34      1000
        frog       0.28      0.29      0.28      1000
       horse       0.45      0.39      0.42      1000
        ship       0.44      0.44      0.44      1000
       truck       0.42      0.47      0.44      1000

    accuracy                           0.35     10000
   macro avg       0.35      0.35      0.35     10000
weighted avg       0.35      0.35      0.35     10000



## Tasks

Take the code outlined in this notebook and turn it into two separate Python scripts, one which performs Logistic Regression classification and one which uses the MLPClassifier on the ```Cifar10``` dataset.

Try to use the things we've spoken about in class:
- Requirements.txt
- Virtual environment
- Setup scripts
- Argparse

This task is [Assignment 2 for Visual Analytics](https://classroom.github.com/a/KLVvny7d).
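
As a possible starting point for the ```argparse``` part, here is a minimal sketch of how one of the scripts could read its settings from the command line. The flag names (`--tol`, `--solver`, `--report_path`) and the default output path are hypothetical choices, not requirements of the assignment.

```python
import argparse

def parse_args(argv=None):
    """Parse command-line options; all flag names here are hypothetical."""
    parser = argparse.ArgumentParser(
        description="Logistic Regression benchmark on CIFAR-10")
    parser.add_argument("--tol", type=float, default=0.1,
                        help="stopping tolerance passed to the classifier")
    parser.add_argument("--solver", type=str, default="saga",
                        help="solver passed to the classifier")
    parser.add_argument("--report_path", type=str,
                        default="out/logreg_report.txt",
                        help="where to save the classification report")
    return parser.parse_args(argv)

# in a real script you would call parse_args() with no argument so that it
# reads sys.argv; a list is passed here only to keep the example reproducible
args = parse_args(["--tol", "0.05"])
print(args.tol, args.solver, args.report_path)
```

The loading, preprocessing, training and reporting steps from this notebook could then be moved into functions that take `args` as input.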