#Documenting your results
This Lab teaches you some of the basics of documenting machine learning research.

Your task is to:

1. Download and describe the CIFAR10 image classification dataset.
2. Design and describe a small CNN that can solve the CIFAR10 problem.
3. Train your model and explain how you trained it.
4. Summarize your results using
 - Test set error (top-1 and top-5)
 - Test set accuracy
 - Confusion matrix
 - Precision/recall and Average Precision

We will be using Keras and [scikit-learn](https://scikit-learn.org/stable/index.html).

##Task 1: The dataset
Describing your dataset is important. Here is a list of things that could be of interest to the reader:

- What are you trying to predict? 
- Where did the dataset come from? (Remember to cite if it's a public dataset)
- How was it collected?
- Why was it collected?
- Why did you choose this dataset, and not that one over there?
- What is the output: categorical {0, 1, ..., K}, continuous scalar in [0, 1], arbitrary real number, etc.?
- Number of classes; what are the classes?
- Number of observations; number of observations per class if unbalanced.
- What is the size of the training set?
- Is there a test set? What is its size?
- Is there a baseline result that you can compare your results with?
- What is the state-of-the-art performance on this data set?

Finally, it might also be a good idea to show some actual observations/examples from the dataset.

###1.1 Download

In [None]:
from __future__ import print_function
import keras
from keras.datasets import cifar10

# The data, split between train and test sets:
(x_train, y_train), (x_test, y_test) = cifar10.load_data()

# Uncomment below to convert class vectors to binary class matrices.
# num_classes = ??? # Number of classes
#y_train = keras.utils.to_categorical(y_train, num_classes)
#y_test = keras.utils.to_categorical(y_test, num_classes)

###1.2 Questions
Try to answer as many of the above questions as possible (you don't need to write your answers down). You will be able to find many answers just by looking at the original source of the CIFAR10 dataset:

https://www.cs.toronto.edu/~kriz/cifar.html
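
Some of the questions above (number of observations, class balance) can also be checked directly on the label arrays. The following is a minimal sketch on synthetic labels; `toy_labels` and `class_counts` are illustrative names, and the toy array only mimics the column-vector layout of `y_train` as returned by `cifar10.load_data()`.

```python
import numpy as np

# Stand-in for the real labels: integer class ids stored as a column vector,
# i.e. shape (N, 1). These are random toy values, not CIFAR10 data.
rng = np.random.default_rng(0)
toy_labels = rng.integers(0, 10, size=(1000, 1))


def class_counts(labels, num_classes):
    """Return the number of observations per class."""
    return np.bincount(labels.ravel(), minlength=num_classes)


counts = class_counts(toy_labels, num_classes=10)
print("observations:", toy_labels.shape[0])
print("per class:   ", counts)
print("balanced:    ", counts.min() == counts.max())
```

On the real data you would call `class_counts(y_train, num_classes)` on the integer labels, i.e. before any conversion to binary class matrices.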


##Task 2: Network Architecture
So you are faced with a machine learning problem. Which model should you use? 

Well, that depends on the type of problem (classification, regression, clustering, etc.) and what the success criteria are (high accuracy, high speed, understanding structure in the data, etc).

###2.1 Design a CNN
Your task is to design a CNN that solves the CIFAR10 problem. Motivate your choice of architecture and hyperparameters (number of layers, number of neurons/kernels in each layer, etc.) and regularization techniques.

I left a template below that you could use - you just need to fill in the gaps (marked with `???`). Feel free to design your own network (look for inspiration in [Lab 2](https://github.com/aivclab/dlcourse/blob/master/Lab2_Solution.ipynb) or [Lab 3](https://github.com/aivclab/dlcourse/blob/master/Lab3_Solution.ipynb)) or search the internet for alternative models. 


In [None]:
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Conv2D, MaxPooling2D

model = Sequential()
model.add(Conv2D(???, (3, 3), padding='same', input_shape=???))
model.add(Activation('relu'))
model.add(Conv2D(???, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(???, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Flatten())
model.add(Dense(???))
model.add(Activation('relu'))
model.add(Dropout(???))
model.add(Dense(???))
model.add(Activation('softmax'))

Always remember to summarize your model in your report. This is easily done in Keras:

In [None]:
model.summary()
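
To sanity-check what `model.summary()` reports, it can help to trace the feature map sizes and parameter counts by hand. The sketch below follows the layer order of the template, assuming 32x32 RGB inputs, stride-1 convolutions and Keras' default pooling stride. The filter and unit counts are hypothetical placeholders, not recommended settings, and all function names are illustrative.

```python
def conv_out(size, kernel, padding):
    """Spatial output size of a stride-1 convolution."""
    return size if padding == "same" else size - kernel + 1


def pool_out(size, pool):
    """Spatial output size of non-overlapping pooling (no padding)."""
    return size // pool


def conv_params(kernel, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return (kernel * kernel * in_channels + 1) * filters


def dense_params(in_units, units):
    """Weights plus one bias per unit."""
    return (in_units + 1) * units


# Hypothetical values for the gaps; these are NOT recommended settings.
f1, f2, f3, hidden, n_classes = 32, 32, 64, 128, 10

size, channels, total = 32, 3, 0          # assumes 32x32 RGB input

total += conv_params(3, channels, f1)     # Conv2D, padding='same'
size, channels = conv_out(size, 3, "same"), f1

total += conv_params(3, channels, f2)     # Conv2D, default padding ('valid')
size, channels = conv_out(size, 3, "valid"), f2
size = pool_out(size, 2)                  # MaxPooling2D

total += conv_params(3, channels, f3)     # Conv2D, padding='same'
size, channels = conv_out(size, 3, "same"), f3
size = pool_out(size, 2)                  # MaxPooling2D

flat = size * size * channels             # Flatten
total += dense_params(flat, hidden)       # Dense
total += dense_params(hidden, n_classes)  # Dense (softmax output)

print("final feature map:", size, "x", size, "x", channels)
print("flattened units:", flat)
print("trainable parameters:", total)
```

With these placeholder values most of the parameters sit in the first dense layer, which is worth keeping in mind when you motivate your architecture.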

##Task 3: Model Training
Train your model and explain how you trained it. This includes:

- Data preprocessing
- Data augmentation?
- Train/validation split
- Choice of optimizer (not covered in lectures yet) and its hyperparameters, including
 - learning rate
 - number of training epochs
 - batch size

Here is a template that you can use (again, gaps are marked with `???`).

**Note:** The template code assumes that you already have a validation set. We will be using the test set provided with CIFAR10 as the validation set. Formally this is not the correct way to use a test set. Instead the validation set should be a random subset of training data (recall that the validation set serves as "unseen data" that you are allowed to use for optimizing your hyperparameters). Then, ONLY after you are done training your final model, you are allowed to evaluate it on your test set. If you wish to use a subset of the training data for validation, there is code for that in [Lab 2](https://github.com/aivclab/dlcourse/blob/master/Lab2_Solution.ipynb).
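
If you prefer the formally correct setup, a random hold-out split can be sketched with plain NumPy as follows. This is an illustration on toy arrays, not the Lab 2 code; `train_val_split`, `val_fraction` and `seed` are hypothetical names.

```python
import numpy as np


def train_val_split(x, y, val_fraction=0.1, seed=0):
    """Randomly hold out a fraction of the training data for validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))          # shuffled sample indices
    n_val = int(len(x) * val_fraction)
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    return (x[train_idx], y[train_idx]), (x[val_idx], y[val_idx])


# Toy data in which each "image" equals its label, so alignment is easy to check.
x_toy = np.arange(100, dtype="float32").reshape(100, 1)
y_toy = np.arange(100).reshape(100, 1)

(x_tr, y_tr), (x_val, y_val) = train_val_split(x_toy, y_toy, val_fraction=0.2)
print(x_tr.shape, x_val.shape)  # (80, 1) (20, 1)
```

The resulting `(x_val, y_val)` pair would then take the place of `(x_test, y_test)` in the `validation_data` argument, leaving the test set untouched until the final evaluation.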

In [None]:
from keras.preprocessing.image import ImageDataGenerator

# Initiate RMSprop optimizer
# (Note: this choice is somewhat random - see options here: https://keras.io/api/optimizers/)
opt = keras.optimizers.RMSprop(learning_rate=???, decay=1e-6)

# Let's train the model using RMSprop
model.compile(loss='categorical_crossentropy',
              optimizer=opt,
              metrics=['accuracy'])

# Data preprocessing (normalization)
# (Note: Consider zero-centering the data - how?)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255

data_augmentation = ???
epochs = ???
batch_size = ???

if not data_augmentation:
    print('Not using data augmentation.')
    model.fit(x_train, y_train,
              batch_size=batch_size,
              epochs=epochs,
              validation_data=(x_test, y_test),
              shuffle=True)
else:
    print('Using real-time data augmentation.')
    # This will do preprocessing and realtime data augmentation:
    datagen = ImageDataGenerator(
        featurewise_center=False,  # set input mean to 0 over the dataset
        samplewise_center=False,  # set each sample mean to 0
        featurewise_std_normalization=False,  # divide inputs by std of the dataset
        samplewise_std_normalization=False,  # divide each input by its std
        zca_whitening=False,  # apply ZCA whitening
        zca_epsilon=1e-06,  # epsilon for ZCA whitening
        rotation_range=???,  # randomly rotate images in the range (degrees, 0 to 180)
        # randomly shift images horizontally (fraction of total width)
        width_shift_range=0.1,
        # randomly shift images vertically (fraction of total height)
        height_shift_range=0.1,
        shear_range=0.,  # set range for random shear
        zoom_range=0.,  # set range for random zoom
        channel_shift_range=0.,  # set range for random channel shifts
        # set mode for filling points outside the input boundaries
        fill_mode='nearest',
        cval=0.,  # value used for fill_mode = "constant"
        horizontal_flip=???,  # randomly flip images
        vertical_flip=False,  # randomly flip images
        # set rescaling factor (applied before any other transformation)
        rescale=None,
        # set function that will be applied on each input
        preprocessing_function=None,
        # image data format, either "channels_first" or "channels_last"
        data_format=None,
        # fraction of images reserved for validation (strictly between 0 and 1)
        validation_split=0)

    # Compute quantities required for feature-wise normalization
    # (std, mean, and principal components if ZCA whitening is applied).
    datagen.fit(x_train)

    # Fit the model on the batches generated by datagen.flow().
    # (In recent Keras versions model.fit accepts generators directly
    # and fit_generator is deprecated.)
    model.fit(datagen.flow(x_train, y_train,
                           batch_size=batch_size),
              epochs=epochs,
              validation_data=(x_test, y_test),
              workers=4)

##Task 4: Summarizing your results
Summarize your results using
 - Test set error (top-1 and top-5)
 - Test set accuracy
 - Confusion matrix
 - Precision/recall curve
 - Average Precision

You might find the code below useful:

In [None]:
# Evaluate the model on the test data using `evaluate`
print('\n# Evaluate on test data')
results = model.evaluate(x_test, y_test, batch_size=???)
print('test loss, test acc:', results)

# Generate predictions (probabilities -- the output of the last layer)
# on new data using `predict`
N = ???
print('\n# Generate predictions for N samples')
predictions = model.predict(x_test[:N])
print('predictions shape:', predictions.shape)

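Note that the whole test dataset might not fit into the memory of the GPU at once. `model.predict` already processes its input in batches (see its `batch_size` argument), but if you run the forward pass in some other way you have to loop over the test set in batches yourself to make predictions for all samples.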

Also note that if the last layer of your CNN model is softmax, then `predictions` contains predicted class probabilities (and not labels).
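
Both notes can be illustrated with a small sketch: looping over the data in batches and turning class probabilities into predicted labels with `argmax`. Here `predict_in_batches` is a hypothetical helper and `fake_predict` is a stand-in for a real model that just returns random probability rows.

```python
import numpy as np


def predict_in_batches(predict_fn, x, batch_size=256):
    """Apply predict_fn to x one chunk at a time and stack the results."""
    outputs = [predict_fn(x[start:start + batch_size])
               for start in range(0, len(x), batch_size)]
    return np.concatenate(outputs, axis=0)


def fake_predict(batch):
    """Hypothetical stand-in for a model: random rows that sum to 1 (10 classes)."""
    rng = np.random.default_rng(len(batch))
    scores = rng.random((len(batch), 10))
    return scores / scores.sum(axis=1, keepdims=True)


x_toy = np.zeros((1000, 32, 32, 3), dtype="float32")
probs = predict_in_batches(fake_predict, x_toy, batch_size=256)

# Predicted label = index of the most probable class for each sample.
labels = probs.argmax(axis=1)
print(probs.shape, labels.shape)  # (1000, 10) (1000,)
```

With a trained Keras model, something like `model.predict_on_batch` could play the role of `fake_predict`.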

###4.1 Test set error
The **top-1 error** is simply the number of wrong predictions divided by the total number of observations (in the test set).

To calculate the **top-5 error** you must consider the predictions for all labels/classes and rank them. The easiest way to rank the predictions is to sort them by the predicted class probabilities. Once ranked, the top-5 error is calculated the same way as the top-1 error, except that a prediction is considered wrong only when the true label is not among the top five predictions.

When sorting the predicted class probabilities, we are interested in the indices rather than the sorted values. You might want to take a look at numpy's [argsort](https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html).
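
The ranking idea can be sketched on a toy problem with three classes, where a top-2 error plays the role of the top-5 error. The helper `top_k_error` is an illustrative name, and it assumes integer labels rather than binary class matrices.

```python
import numpy as np


def top_k_error(probs, y_true, k):
    """Fraction of samples whose true label is not among the k highest scores."""
    y_true = np.asarray(y_true).ravel()
    # argsort sorts ascending, so the last k columns hold the k most probable classes.
    top_k = np.argsort(probs, axis=1)[:, -k:]
    hit = (top_k == y_true[:, None]).any(axis=1)
    return 1.0 - hit.mean()


probs = np.array([
    [0.1, 0.6, 0.3],   # true class 1: ranked first
    [0.5, 0.2, 0.3],   # true class 2: ranked second
    [0.7, 0.2, 0.1],   # true class 2: ranked last
    [0.2, 0.3, 0.5],   # true class 2: ranked first
])
y_true = np.array([1, 2, 2, 2])

print(top_k_error(probs, y_true, k=1))  # 0.5
print(top_k_error(probs, y_true, k=2))  # 0.25
```

If your labels are stored as binary class matrices, `y.argmax(axis=1)` recovers the integer labels first.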

Your task is to calculate the top-1 error and the top-5 error *on the test set*.

In [None]:
# Your code goes here

###4.2 Test set accuracy
The test set accuracy is simply 1 minus the test set (top-1) error.


In [None]:
# Your code goes here

###4.3 Confusion matrix
A confusion matrix is a table that summarizes the performance of a classification model on a set of test data for which the true labels are known. The number of correct and incorrect predictions is summarized and broken down for each label. The diagonal elements of the table (from top-left to bottom-right) represent the number of correct predictions for each label. The off-diagonal elements correspond to incorrect predictions; they show the ways in which the classification model is confused when it makes predictions.

Example:

![alt text](https://scikit-learn.org/stable/_images/sphx_glr_plot_confusion_matrix_001.png)

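You can normalize the entries of the table by dividing the numbers in each row by the sum of the numbers in that row. Each diagonal entry then shows the fraction of observations of that class that were classified correctly. As a general rule of thumb, a score above 0.8 on the diagonal is desired.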

Example:

![alt text](https://scikit-learn.org/stable/_images/sphx_glr_plot_confusion_matrix_002.png)

Calculate and display the normalized confusion matrix. Use one of these sources as inspiration:

- https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
- https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py
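
The row normalization described above can be sketched on a handful of toy labels. The labels here are made up for illustration, and the sketch assumes every class occurs at least once among the true labels (otherwise a row sum would be zero).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy integer labels for a 3-class problem (made up for illustration).
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 0, 2])

# Rows correspond to true labels, columns to predicted labels.
cm = confusion_matrix(y_true, y_pred)

# Divide each row by its sum so that every row adds up to 1.
cm_norm = cm / cm.sum(axis=1, keepdims=True)

print(cm)
print(np.round(cm_norm, 2))
```

For your model, `y_pred` would be the predicted labels (not the class probabilities), and `y_true` the integer test labels.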


In [None]:
# Your code goes here

###4.4 Precision-Recall and Average Precision
Precision-Recall is a useful measure of success of prediction, especially when the classes are very imbalanced. In information retrieval, precision is a measure of result relevancy:

```
precision = #TP/(#TP + #FP)
```

while recall is a measure of how many truly relevant results are returned:

```
recall = #TP/(#TP + #FN)
```

The precision-recall curve shows the tradeoff between precision and recall for different thresholds. A high area under the curve represents both high recall and high precision, where high precision relates to a low false positive rate, and high recall relates to a low false negative rate. High scores for both show that the classifier is returning accurate results (high precision), as well as returning a majority of all positive results (high recall).

A system with high recall but low precision returns many results, but most of its predicted labels are incorrect when compared to the true labels. A system with high precision but low recall is just the opposite, returning very few results, but most of its predicted labels are correct when compared to the true labels. An ideal system with high precision and high recall will return many results, with all results labeled correctly.
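
The two extremes can be made concrete with a small worked example for a single "positive" class. The function `precision_recall` is an illustrative helper, and returning 0.0 when a ratio is undefined is an assumption of this sketch, not a universal convention.

```python
import numpy as np


def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class, computed from TP/FP/FN counts."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall


y_true = [1, 1, 1, 1, 0, 0, 0, 0]

# A "cautious" system returns few positives: TP=1, FP=0, FN=3.
cautious = [1, 0, 0, 0, 0, 0, 0, 0]
p_c, r_c = precision_recall(y_true, cautious, positive=1)

# An "eager" system returns many positives: TP=4, FP=3, FN=0.
eager = [1, 1, 1, 1, 1, 1, 1, 0]
p_e, r_e = precision_recall(y_true, eager, positive=1)

print("cautious: precision", p_c, "recall", r_c)
print("eager:    precision", round(p_e, 3), "recall", r_e)
```

The cautious system reaches precision 1 with recall 1/4, while the eager one reaches recall 1 with precision 4/7.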

Your task is to modify the multi-class part of [this tutorial](https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html) and make it work on your results.

Note that `y_score` represents a *score* for *each* class. The score is predicted by your model and could for instance be the class probabilities.
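
To show the array shapes involved, here is a sketch on synthetic data with three classes. The scores are random numbers with a bonus added for the true class, purely a stand-in for real model outputs, and the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.preprocessing import label_binarize

rng = np.random.default_rng(0)
n_samples, n_classes = 200, 3
y_true = rng.integers(0, n_classes, size=n_samples)

# Hypothetical scores: noise plus a bonus for the true class, normalized per row.
scores = rng.random((n_samples, n_classes))
scores[np.arange(n_samples), y_true] += 1.0
y_score = scores / scores.sum(axis=1, keepdims=True)

# One binary indicator column per class: shape (n_samples, n_classes).
Y_true = label_binarize(y_true, classes=list(range(n_classes)))

# One precision-recall curve and one Average Precision value per class.
for k in range(n_classes):
    precision, recall, _ = precision_recall_curve(Y_true[:, k], y_score[:, k])
    ap = average_precision_score(Y_true[:, k], y_score[:, k])
    print(f"class {k}: AP = {ap:.3f}")

# Micro-average: pool the decisions for all classes into one score.
micro_ap = average_precision_score(Y_true, y_score, average="micro")
print(f"micro-averaged AP = {micro_ap:.3f}")
```

For your results, `y_score` would be the output of `model.predict` on the test set and `Y_true` the test labels as a binary class matrix.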

In [None]:
# Your code goes here

See if you can make sense of the outputs and interpret the results.

##5: Optional task
Visualize the learned representation (i.e., the output of the encoder) using t-SNE. See details in the [Lab 2 solution](https://github.com/aivclab/dlcourse/blob/master/Lab2_Solution.ipynb).
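
As a rough illustration of the mechanics only (this is not the Lab 2 code), scikit-learn's `TSNE` maps a feature matrix to two dimensions. The features below are synthetic clusters standing in for a learned representation, and the parameter values are arbitrary assumptions.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Hypothetical stand-in for learned features: 3 clusters of 64-dimensional vectors.
centers = rng.normal(scale=5.0, size=(3, 64))
labels = np.repeat(np.arange(3), 50)
features = centers[labels] + rng.normal(size=(150, 64))

# Map the features to 2D; perplexity must be smaller than the number of samples.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
print(embedding.shape)  # (150, 2)
```

The two columns of `embedding` can then be drawn as a scatter plot, colored by `labels`.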