Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
NAME = ""
COLLABORATORS = ""

---

# CSE204 - Introduction to Machine Learning - Lab Session 12 - Exam

<img src="https://raw.githubusercontent.com/adimajo/CSE204-2021/master/data/logo.jpg" style="float: left; width: 15%" />

[CSE204-2021](https://moodle.polytechnique.fr/course/view.php?id=12838) Lab session #12 - Exam

Adrien Ehrhardt

Groupe B

## Quiz

The following function `impute_missing` pertains to the moodle quiz question on Missing Value Imputation (12 points). It is separate from the 'main' lab exercises which follow. Complete the function by replacing `# TODO` tags with your code, according to the instructions given.

In [None]:
import numpy as np
from sklearn.decomposition import PCA

In [None]:
def impute_missing(X: np.ndarray, random_number: float = 2.5) -> float:
    """
    Perform PCA missing data imputation

    :param numpy.ndarray X: array with one missing value
    :param float random_number: initialize missing value to this random value
    :return: imputed value
    :rtype: float
    """
    # Identify the row and column of the missing value
    i, j = np.argwhere(np.isnan(X))[0]

    # TODO: Impute 'random_number' to replace the missing value
    # YOUR CODE HERE
    raise NotImplementedError()

    # TODO: Carry out PCA by completing the code below
    # TODO: Reconstruct X from the main component
    # TODO: Impute the value
    pca = PCA(n_components=2).fit(X)
    U = pca.components_.T
    Lambda = pca.explained_variance_
    # YOUR CODE HERE
    raise NotImplementedError()

    return X[i, j]

In [None]:
# Data 
X = np.array([
    [-5,-1.5],
    [-3,-3],
    [-1,-0.5],
    [-1,4],
    [np.nan,2],
    [4,4],
    [8,6.5],
    [9,7],
    [9,-3.5],
])


print(impute_missing(X))

## Overall presentation

Emphasis is put on the quiz, which will be worth $\approx$ 50 % of this exam's grade.

This lab is composed of 3 exercises, granting up to 3, 3 and 5 points.

There are examples of automatic tests that are run against your code. **They are neither exhaustive nor sufficient** (we will run other, hidden tests), **but they are necessary**: they have to pass, otherwise you can be sure *not* to get the points.

You **cannot** Google anything. The exam is open book w.r.t. the lectures. Some help and hints are provided for each question (e.g. which function to use and how), and you can also use the `help(...)` function and consult the documentation linked in this notebook.

- **Do not** delete any pre-existing cell (you can create and delete your own cells for testing).
- **Do not** change the type (Markdown / Code / ...) of any pre-existing cell.
- Run the notebook with the **CSE204** kernel - if you didn't install it beforehand (come on, it's the $12^{th}$ lab!), proceed at your own risk.
- **Do not** rename the file when uploading your work on Moodle.
- **Do not** edit the notebook's or a cell's metadata.

## Imports

In [None]:
import os
import gc
import json
from copy import deepcopy
import itertools
import pickle
import numpy as np
import pandas as pd
import sklearn as sk
from sklearn.tree import DecisionTreeClassifier
import tensorflow as tf
import tensorflow.keras.datasets.mnist as mnist
from tensorflow.keras.layers import Dense, Input, Conv2D, Activation, Dropout, MaxPooling2D,\
    BatchNormalization, Flatten, UpSampling2D, Reshape
from tensorflow.keras import Model
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.preprocessing import image
import matplotlib.pyplot as plt

## Introduction

In this lab, you will get hands-on experience with Unsupervised Learning, namely dimension reduction using PCA and Undercomplete Autoencoders, as well as Supervised Learning, namely multi-label classification.

The goal of dimension reduction is to find a suitable transformation which converts a high-dimensional space into a smaller feature space, such that the important information is not lost, but the visualization and interpretability are easier.

The goal of supervised multi-label classification is to predict the membership of a sample to **possibly several** categories.
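As a minimal sketch of what "possibly several categories" means in practice, the snippet below turns comma-separated genre strings into one 0/1 indicator per label. The movie names and genre strings are hypothetical, and labels are matched as whole tokens so that `Music` is not confused with `Musical`.

```python
# Hypothetical genre strings, mimicking a comma-separated "Genre" field
movie_genres = {
    "movie_a": "Action, Drama",
    "movie_b": "Musical",
    "movie_c": "Comedy, Music",
}

# Sorted list of every label that appears at least once
labels = sorted({g for s in movie_genres.values() for g in s.split(", ")})

# One 0/1 indicator per label and per movie (exact token match, not substring)
indicator = {
    movie: [int(label in s.split(", ")) for label in labels]
    for movie, s in movie_genres.items()
}

print(labels)
for movie, row in indicator.items():
    print(movie, row)
```

Each row can contain several ones, which is what distinguishes multi-label from multi-class classification.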

## Dataset

* Get the data

*If* the following cell does not work, *e.g.* on Windows, copy / paste the link to the dataset in your browser and place the downloaded artifact in the same directory as this notebook.

In [None]:
%%bash

if ! [ -f Movie_Poster_Dataset.zip ]; then
    curl -O https://www.cs.ccu.edu.tw/~wtchu/projects/MoviePoster/Movie_Poster_Dataset.zip
fi

In [None]:
%%bash

if ! [ -f Movie_Poster_Metadata.zip ]; then
    curl -O https://www.cs.ccu.edu.tw/~wtchu/projects/MoviePoster/Movie_Poster_Metadata.zip
fi

* Unzip the data

*If* the following cell does not work, *e.g.* on Windows, unzip the directory using your built-in tools (try double-clicking the zip file).

In [None]:
%%bash

if ! [ -r Movie_Poster_Dataset ]; then
    unzip -o Movie_Poster_Dataset.zip
fi

In [None]:
%%bash

if ! [ -r groundtruth ]; then
    unzip -o Movie_Poster_Metadata.zip
fi

You should now have folders named `Movie_Poster_Dataset` and `groundtruth` in your working directory containing years of pictures of movie posters and associated metadata (in particular their genre(s)).

* List IDs and genres from `groundtruth` to obtain `y` (the categories to predict)
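The metadata files are not valid JSON as they stand, so they are patched with string replacements before parsing. The sketch below applies the same sequence of replacements to a tiny hypothetical sample; the exact layout of the real files (concatenated objects containing `ObjectId(...)` wrappers) is an assumption inferred from those replacements.

```python
import json

# Hypothetical excerpt: two concatenated objects, not a valid JSON document
raw = (
    '{\n'
    '  "_id" : ObjectId("abc"),\n'
    '  "imdbID" : "tt001",\n'
    '  "Genre" : "Action, Drama"\n'
    '}\n'
    '{\n'
    '  "_id" : ObjectId("def"),\n'
    '  "imdbID" : "tt002",\n'
    '  "Genre" : "Comedy"\n'
    '}\n'
)

text = "[" + raw + "]"                   # wrap everything in a JSON array
text = text.replace("}", "},")           # separate the objects with commas
text = text.replace("ObjectId(", "")     # drop the opening of the wrapper
text = text.replace("),", ",")           # drop the closing of the wrapper
text = text.replace(",\n]", "]")         # remove the trailing comma
records = json.loads(text)

print([(r["imdbID"], r["Genre"]) for r in records])
```

This approach is fragile: a `}` or `),` inside a string value would corrupt the result, so it only suits files known to be free of such values.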

In [None]:
files = []
for year in range(1980, 2016):  # loop through years
    encoding = 'utf-8' if year in [1980, 1981] else 'utf-16'
    with open(f'groundtruth/{year}.txt', 'r', encoding=encoding) as f:  # open metadata file
        file = '[' + f.read() + ']'
        file = file.replace('}', '},')
        file = file.replace('ObjectId(', '')
        file = file.replace('),', ',')
        file = file.replace(',\n]', ']')
        data = json.loads(file)  # read metadata file
        files.append(pd.DataFrame(data)[['imdbID', 'Genre']])  # keep only interesting information

In [None]:
genres = pd.concat(files)  # concatenate all years in a single pandas DataFrame

In [None]:
# List all genres in the dataset
possible_genres = np.unique(list(itertools.chain.from_iterable(
    genres.Genre.str.split(', ').to_list())))

In [None]:
possible_genres

In [None]:
genres.set_index('imdbID', inplace=True)  # set imdbID as the DataFrame's index

In [None]:
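# Create one 0/1 column for each genre and each movie.
# Match whole genre names: a substring test would flag 'Music' for every 'Musical'.
for genre in possible_genres:
    genres[genre] = genres.Genre.str.split(', ').apply(lambda g: genre in g) * 1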

In [None]:
genres.drop(columns="Genre", inplace=True)
genres = genres[~genres.index.duplicated(keep='first')]
genres

* Load images in an array

The images are written into a preallocated array, year by year, to go easy on memory.

In [None]:
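def train_test_split(subset: str = "train", max_size=4000):
    """
    Load the posters of one subset into a preallocated array.

    :param str subset: "train" (years 1980-1994), anything else for test (years 2000-2005)
    :param int max_size: maximum number of images the array can hold
    :return: images of shape (n, 200, 200, 3) scaled to [0, 1], and their ids
    """
    start = 0 if subset == "train" else 20
    images_ids = []
    images = np.zeros((max_size, 200, 200, 3))  # preallocated to limit memory use
    compteur = 0  # number of images loaded so far
    for year in range(1980 + start, 1980 + start + 15):
        if year > 2005:
            continue
        print(f'Loading posters from year {year}')
        for filename in os.listdir(f'Movie_Poster_Dataset/{year}'):
            # target_size is (height, width); the channels come from the color mode
            img = image.load_img(f'Movie_Poster_Dataset/{year}/{filename}',
                                 target_size=(200, 200))
            img = image.img_to_array(img)
            img = img / 255  # scale pixel values to [0, 1]
            images[compteur, :, :, :] = img
            images_ids.append(filename.replace('.jpg', ''))
            compteur += 1
    return images[:compteur, :, :, :], np.array(images_ids)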

In [None]:
# Do not run this several times due to memory leaks
gc.collect()
X_train, indices_train = train_test_split()

First poster:

In [None]:
plt.imshow(X_train[0, :, :, :])
plt.axis('off');

Get categories:

In [None]:
y_train = genres.loc[genres.index.isin(indices_train)].values
y_train

In [None]:
X_train.shape, indices_train.shape, y_train.shape

In [None]:
X_test, indices_test = train_test_split("test")

In [None]:
y_test = genres.loc[genres.index.isin(indices_test)].values
y_test

In [None]:
X_test.shape, indices_test.shape, y_test.shape

In [None]:
X_train_reshaped = X_train.reshape(X_train.shape[0], np.prod(X_train.shape[1:]))
X_train_reshaped.shape

In [None]:
X_test_reshaped = X_test.reshape(X_test.shape[0], np.prod(X_test.shape[1:]))
X_test_reshaped.shape

## Exercise 1: PCA (3 points)

Complete the following function.

Recall that:

The Singular Value Decomposition of a matrix $X \in R^{n\times p}$ is given as

$$ X = U \Sigma W^T, $$

where $\Sigma \in R^{n\times p}$ is a rectangular diagonal matrix of non-negative values known as the singular values of $X$ (often denoted by $\sigma(X)$), and $U \in R^{n\times n}$ and $W \in R^{p\times p}$ are orthogonal matrices whose columns are the left and right (respectively) singular vectors of $X$.

We can reduce the dimensionality of our data by truncating the transformed variables to include only a subset of those variables with the highest variance. For example, if we keep the first `n_components` $\leq p$ variables, the reduced transformation reads

$$ T_{n_{components}} = X W_{n_{components}}. $$

We can then decode to reconstruct the images from the lower-dimensional space. Recall that based on the PCA transformation, we can compute the reconstructed images with

$$ \hat{X} = X W_{n_{components}} W_{n_{components}}^T $$
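To make the notation concrete, here is a small sketch of the truncated projection and of the reconstruction on random toy data, using `numpy.linalg.svd`. The toy matrix, the variable names and the centering step are assumptions made for illustration: the formulas above do not mention centering, but PCA is usually computed on centered data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))            # toy data: n = 6 samples, p = 4 features
X_centered = X - X.mean(axis=0)        # assumption: center each feature first

# Thin SVD: X_centered = U @ diag(s) @ Wt, singular values sorted in decreasing order
U, s, Wt = np.linalg.svd(X_centered, full_matrices=False)

k = 2                                   # number of components to keep
W_k = Wt[:k].T                          # p x k matrix of the first k right singular vectors
T_k = X_centered @ W_k                  # n x k reduced representation
X_hat = T_k @ W_k.T                     # n x p reconstruction from k components

print(T_k.shape, X_hat.shape)
print("reconstruction error:", np.linalg.norm(X_centered - X_hat))
```

With `k` equal to `p` the reconstruction is exact (up to floating point error); with smaller `k` the error equals the norm of the discarded singular values.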

<!-- See [lab_session_10](https://adimajo.github.io/CSE204-2021/lab_session_10/lab_session_10.html). -->

In [None]:
def transform_PCA(X_train, X_test, n_components: int = 3):
    """
    Learns principal components analysis on X_train, returns the projections of X_train and X_test 
    on n_components first principal components and W_n_components.
    :param numpy.ndarray X_train: training set of size n x p
    :param numpy.ndarray X_test: testing set of size m x p
    :param int n_components: number of components to retain in the PCA decomposition
    :return: projection of X_train, X_test and W_n_components from SVD
    """
    # YOUR CODE HERE
    raise NotImplementedError()
    return X_train_transformed, X_test_transformed, W_n_components

In [None]:
X_train_PCA_transformed, X_test_PCA_transformed, W_n_components = transform_PCA(X_train_reshaped, X_test_reshaped)
assert X_test_PCA_transformed.shape[1] == 3

#### 2D representation

In [None]:
indices = np.random.randint(0, X_test_PCA_transformed.shape[0], 1000)
plt.scatter(X_test_PCA_transformed[indices, 0],
            X_test_PCA_transformed[indices, 1],
            c=np.argmax(y_test[indices, :], axis=1), cmap='rainbow')
plt.colorbar();

#### Reconstruction

Complete the following function:

In [None]:
def reconstruct_image(X_transformed: np.ndarray, W: np.ndarray):
    """
    Reconstruct an image from a projection onto n_components
    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
new_data = reconstruct_image(X_test_PCA_transformed[:5, :], W_n_components)
assert new_data.shape == (5, 200 * 200 * 3)

In [None]:
fig, ax = plt.subplots(2, 5)
for i in range(5):
    ax[0, i].imshow(X_test[i, :])
    ax[0, i].set_axis_off()
    ax[1, i].imshow(new_data[i, :].reshape(200, 200, 3))
    ax[1, i].set_axis_off()

## Exercise 2: Convolutional Autoencoders (3 points)

Complete the following function by adding:
* An encoder with:
    * An [`Input`](https://keras.io/api/layers/core_layers/input/) layer of `shape` `input_size`;
    * A [`Conv2D`](https://keras.io/api/layers/convolution_layers/convolution2d/) layer of 16 3x3 filters for each channel, with `same` padding and `relu` activation;
    * A [`MaxPooling2D`](https://keras.io/api/layers/pooling_layers/max_pooling2d/) layer of 4x4 `pool_size` with `same` padding;
    * A [`Conv2D`](https://keras.io/api/layers/convolution_layers/convolution2d/) layer of 2 3x3 filters for each channel, with `same` padding and `relu` activation;
    * A [`MaxPooling2D`](https://keras.io/api/layers/pooling_layers/max_pooling2d/) layer of 4x4 `pool_size` with `same` padding;
    * A [`Flatten`](https://keras.io/api/layers/reshaping_layers/flatten/) layer;
    * A [`Dense`](https://keras.io/api/layers/core_layers/dense/) layer with `code_size` nodes;
* A decoder with:
    * A [`Dense`](https://keras.io/api/layers/core_layers/dense/) layer with 338 nodes;
    * A [`Reshape`](https://keras.io/api/layers/reshaping_layers/reshape/) layer with `target_shape` of `(13, 13, 2)`;
    * A [`Conv2D`](https://keras.io/api/layers/convolution_layers/convolution2d/) layer of 2 3x3 filters for each channel, with `same` padding and `relu` activation;
    * An [`UpSampling2D`](https://keras.io/api/layers/reshaping_layers/up_sampling2d/) layer with `size` 4x4;
    * A [`Conv2D`](https://keras.io/api/layers/convolution_layers/convolution2d/) layer of 16 3x3 filters for each channel, with the default `valid` padding (not `same`) and `relu` activation;
    * An [`UpSampling2D`](https://keras.io/api/layers/reshaping_layers/up_sampling2d/) layer with `size` 4x4;
    * A [`Conv2D`](https://keras.io/api/layers/convolution_layers/convolution2d/) layer of 1 3x3 filters for each channel, with `same` padding and `sigmoid` activation.
* [Compile](https://keras.io/api/models/model_training_apis/) the resulting autoencoder with optimizer `Adam` and `mean_squared_error` loss;
* Return the autoencoder and the encoder.
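The sizes 338 and `(13, 13, 2)` in the decoder follow from the spatial sizes produced by the encoder. The sketch below redoes that arithmetic in plain Python, assuming 200x200 inputs (as loaded above), pooling with `same` padding and stride equal to the pool size, and stride-1 convolutions; the helper names are hypothetical.

```python
import math

def pool_same(size: int, pool: int) -> int:
    """Output side of a pooling layer with 'same' padding and stride == pool."""
    return math.ceil(size / pool)

def conv_valid(size: int, kernel: int) -> int:
    """Output side of a stride-1 convolution with 'valid' (no) padding."""
    return size - kernel + 1

# Encoder: 'same' convolutions keep the side unchanged, only pooling shrinks it
side = 200
side = pool_same(side, 4)          # after the first pooling
side = pool_same(side, 4)          # after the second pooling
flat = side * side * 2             # 2 feature maps are flattened
print("encoder output side:", side, "flattened size:", flat)

# Decoder: upsampling multiplies the side, the 'valid' convolution trims it
dec = side
dec = dec * 4                      # first UpSampling2D
dec = conv_valid(dec, 3)           # Conv2D without 'same' padding
dec = dec * 4                      # second UpSampling2D
print("decoder output side:", dec)
```

Under these assumptions, the one decoder convolution without `same` padding is what brings the output back to the input size.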

In [None]:
def convolutional_autoencoder(input_size, code_size: int):
    """
    Instantiates and compiles an autoencoder, returns both the autoencoder and just the encoder

    :param tuple input_size: shape of the input samples
    :param int code_size: size of the new representation space
    :return: autoencoder, encoder
    """
    # YOUR CODE HERE
    raise NotImplementedError()
    return autoencoder, encoder

In [None]:
cnn_autoencoder, cnn_encoder = convolutional_autoencoder(X_train.shape[1:], 3)

The following cell trains the convolutional autoencoder; it takes quite a while to run. If you're confident about your code, you might want to skip this and go to the next section, coming back here when necessary / when you're done. You **can** also train on fewer examples / fewer epochs (modify the cell below) to speed things up and debug your code.

In [None]:
history = cnn_autoencoder.fit(X_train, X_train, epochs=4)
plt.plot(history.history['loss']);

In [None]:
X_train_transformed_autoencoder = cnn_encoder.predict(X_train)
X_test_transformed_autoencoder = cnn_encoder.predict(X_test)
indices = np.random.randint(0, X_test_transformed_autoencoder.shape[0], 1000)
plt.scatter(X_test_transformed_autoencoder[indices, 0],
            X_test_transformed_autoencoder[indices, 1],
            c=np.argmax(y_test[indices, :], axis=1), cmap='rainbow')
plt.colorbar();

In [None]:
fig, ax = plt.subplots(2, 5)
new_data = cnn_autoencoder.predict(X_test[:5, :])  # only 5 posters are displayed

for i in range(5):
    ax[0, i].imshow(X_test[i, :])
    ax[1, i].imshow(new_data[i, :])
    ax[0, i].set_axis_off()
    ax[1, i].set_axis_off()

## Exercise 3: Classification (5 points)

### Using reduced representation and tree-based methods

To predict the genres of each movie given its poster, we will use a binary classification model, such as a decision tree, **for each label**. 

Complete the following function.

*Hint*
* it should return *e.g.* `{'Action': DecisionTreeClassifier(model_kwargs), ...}`;
* you can use `**model_kwargs` to "unpack" a dictionary.
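As a reminder of how `**` unpacking works, here is a minimal sketch with a hypothetical `ToyModel` class standing in for an sklearn estimator: passing `**kwargs` is equivalent to writing each key as a keyword argument, and each call to the class creates a separate instance.

```python
class ToyModel:
    """Hypothetical stand-in for an estimator class taking keyword arguments."""

    def __init__(self, max_depth=None, criterion="gini"):
        self.max_depth = max_depth
        self.criterion = criterion

kwargs = {"max_depth": 5}

a = ToyModel(**kwargs)       # same as ToyModel(max_depth=5)
b = ToyModel(**kwargs)       # a second, independent instance

print(a.max_depth, a.criterion)
print(a is b)
```

Separate instances matter here because every label needs its own model, fitted on its own target column.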

In [None]:
def dict_of_binary_models(labels, model, model_kwargs):
    """
    Constructs a dict with each label as key and a new instance of class model as value.

    :param list labels: list of labels of genres
    :param model: a model Class from sklearn
    :param dict model_kwargs: a dictionary of arguments to pass to the model instantiation
    :return: a dict which key is the label and which value is a new instance of model
    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
dict_of_models = dict_of_binary_models(genres.columns.to_list(),
                                       DecisionTreeClassifier,
                                       {'max_depth': 5})

In [None]:
[print(key, ':', value) for key, value in dict_of_models.items()]
assert len(dict_of_models) == len(genres.columns.to_list())

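Complete the following cell by writing a function `train_binary_models(dict_of_models, X_train, y_train)` which calls `fit` on each model in the dictionary with the appropriate data (column `k` of `y_train` corresponds to the `k`-th label).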

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
train_binary_models(dict_of_models, X_train_PCA_transformed, y_train)

In [None]:
for label in dict_of_models.keys():
    assert hasattr(dict_of_models[label], "classes_")

Complete the following function which predicts each label on `X_test`.

In [None]:
def pred_binary_models(dict_of_models: dict, X_test: np.ndarray):
    """
    Applies predict method for each classifier on X_test

    :param dict dict_of_models: a dictionary (key: class labels, values: sklearn trained models)
    :param numpy.ndarray X_test: testing dataset
    :returns: a dictionary (key: class labels, values: np.ndarray of binary class prediction)
    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
dict_of_predictions = pred_binary_models(dict_of_models, X_test_PCA_transformed)

In [None]:
[print(key, ':', value) for key, value in dict_of_predictions.items()];

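Complete the following cell by writing a function `dict_of_error_rate(dict_of_predictions, y_test)` which computes the error rate for each label and returns them as a dictionary keyed by label.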
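For reference, the error rate of a label is the fraction of samples whose predicted 0/1 value differs from the true one. The sketch below computes it column by column on small made-up arrays; the arrays and variable names are hypothetical and not tied to the lab's data.

```python
import numpy as np

# Hypothetical ground truth and predictions: 4 samples, 2 labels
y_true = np.array([[1, 0],
                   [0, 1],
                   [1, 1],
                   [0, 0]])
y_pred = np.array([[1, 1],
                   [0, 0],
                   [0, 1],
                   [0, 0]])

# Fraction of mismatches per label (one value per column)
error_rate_per_label = np.mean(y_true != y_pred, axis=0)
print(error_rate_per_label)
```

For 0/1 arrays this is the same quantity as summing the absolute differences and dividing by the number of samples.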

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
dict_of_error_rate_pca = dict_of_error_rate(dict_of_predictions, y_test)

In [None]:
[print(key, ':', value) for key, value in dict_of_error_rate_pca.items()];

Let's rerun the whole thing with the convolutional autoencoder approach:

In [None]:
dict_of_models = dict_of_binary_models(genres.columns.to_list(),
                                       DecisionTreeClassifier,
                                       {'max_depth': 5})
train_binary_models(dict_of_models, X_train_transformed_autoencoder, y_train)
for label in dict_of_models.keys():
    assert hasattr(dict_of_models[label], "classes_")
dict_of_predictions = pred_binary_models(dict_of_models, X_test_transformed_autoencoder)
[print(key, ':', value) for key, value in dict_of_predictions.items()];
print('\n')
dict_of_error_rate_autoencoder = dict_of_error_rate(dict_of_predictions, y_test)
[print(key, ':', value) for key, value in dict_of_error_rate_autoencoder.items()];

In [None]:
error_rates = pd.DataFrame([dict_of_error_rate_pca, dict_of_error_rate_autoencoder],
                           index=['PCA', 'Autoencoder']).T
error_rates

### Using CNNs

If the goal is first and foremost to predict the probable genre, why not directly use a CNN with a classification layer?

Complete the following function by adding:
* A [`Conv2D`](https://keras.io/api/layers/convolution_layers/convolution2d/) layer of 32 3x3 filters for each channel, with `same` padding and `relu` activation;
* A [`Conv2D`](https://keras.io/api/layers/convolution_layers/convolution2d/) layer of 32 3x3 filters for each channel, with `same` padding and `relu` activation;
* A [`MaxPooling2D`](https://keras.io/api/layers/pooling_layers/max_pooling2d/) layer of 4x4 `pool_size` with `same` padding;
* A [`Conv2D`](https://keras.io/api/layers/convolution_layers/convolution2d/) layer of 32 3x3 filters for each channel, with `same` padding and `relu` activation;
* A [`Conv2D`](https://keras.io/api/layers/convolution_layers/convolution2d/) layer of 32 3x3 filters for each channel, with `same` padding and `relu` activation;
* A [`MaxPooling2D`](https://keras.io/api/layers/pooling_layers/max_pooling2d/) layer of 4x4 `pool_size` with `same` padding;
* A [`Flatten`](https://keras.io/api/layers/reshaping_layers/flatten/) layer;
* A [`Dense`](https://keras.io/api/layers/core_layers/dense/) layer with 512 nodes and `relu` activation;
* A [`Dense`](https://keras.io/api/layers/core_layers/dense/) layer with `num_classes` nodes and `sigmoid` activation;
* [Compile](https://keras.io/api/models/model_training_apis/) the resulting model with optimizer `SGD` and `binary_crossentropy` loss;
* Return the resulting model.

In [None]:
def cnn_model(input_shape: tuple, num_classes: int):
    """
    Returns a compiled keras Conv2D model
    
    :param tuple input_shape: shape of ONE example as passed to the first layer
    :param int num_classes: number of output classes as passed to the last layer
    """
    model = tf.keras.models.Sequential()
    # model.add(...)
    # YOUR CODE HERE
    raise NotImplementedError()
    # model.compile(...)
    # YOUR CODE HERE
    raise NotImplementedError()
    return model

In [None]:
model = cnn_model(X_train.shape[1:], y_train.shape[1])

The following cell trains the CNN classifier; it takes quite a while to run. If you're confident about your code, you might want to skip this and go to the next section, coming back here when necessary / when you're done. You **can** also train on fewer examples / fewer epochs (modify the cell below) to speed things up and debug your code.

In [None]:
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', min_delta=0, patience=2, mode='auto')  # val_loss is always logged, even without extra metrics

history = model.fit(X_train, y_train, batch_size=32, epochs=5, 
    validation_data=(X_test, y_test), shuffle=True, callbacks=[early_stopping])

plt.plot(history.history['loss']);

In [None]:
# Predict probabilities for each class on X_test
predictions_cnn = model.predict(X_test)

In [None]:
# Convert probabilities into decisions (class membership) with 0.5 threshold
cnn_decisions = (predictions_cnn > 0.5) * 1

In [None]:
# Compute error rates per class
error_rates_cnn = np.sum(np.abs(y_test - cnn_decisions), axis=0) / y_test.shape[0]

In [None]:
# Add to PCA and Autoencoder error rates and display
error_rates['CNN'] = error_rates_cnn
error_rates