Please fill in your name and that of your teammate.

You: Elian Marsault Benichon

Teammate: Nathan Wilson Fouka

# Introduction

Welcome to the eleventh lab. Deep Learning takes the already complex topic of Neural Networks and turns it up a notch. Several notches, in fact. It's hard to find exercises small enough to fit in a single assignment, let alone a *set* of exercises for all of these topics.

So this week the assignment is particularly small, with only 15 points, and should not take you as long as usual to complete. If you are interested in building Deep Learning experience, what you should do instead is take one of the Bonus Questions and solve it yourself. We will fully support any question on any of the topics.

Willing to learn but unsure of the topic? Go for the **Transfer Learning** tutorial: it's the shortest and one of the most marketable skills. Basically you download a network pre-trained on a huge dataset, cut the last (decision) layer(s), add your own (so you decide based on the last feature space), then train _only your new layer(s)_ on your specific task, which is fast and easy. Meanwhile you re-use the larger body that was pre-trained by someone else, likely with a larger budget.

### How to pass the lab?

Below you find the exercise questions. Each question awarding points is numbered and states the number of points like this: **[0pt]**. To answer a question, fill the cell below with your answer (markdown for text, code for implementation). Incorrect or incomplete answers are in principle worth 0 points: partial credit is awarded only at the teacher's discretion. Over-complete answers do not award extra points (though they are appreciated and will be taken into consideration). Save your work frequently! (`ctrl+s`)

**You need at least 10 points (out of 15 available) to pass** (66%).

# 1. History and training strategies

#### 1.1 **[1pt]** Mention 3 reasons why DL did not happen 30 years ago.

1. There was a lack of large labeled datasets that are required to train deep neural networks.
2. There was a lack of computing power compared to today's GPUs to train complex models.
3. We did not have today's knowledge of the most effective deep learning architectures and training techniques.

#### 1.2 **[1pt]** Explain how to train a neural network using supervised learning.

The data needs to be split into 3 sets, following a 60-20-20 or 70-15-15 split for the training, validation, and test sets respectively.  

Then the training data is fed to the neural network. By using backpropagation, we can adjust the model's parameters to minimize the loss function.  
During this training, the performance is evaluated on the validation set and once we have a trained model, we test it on the test set to evaluate the generalization of the model.
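
The steps above can be sketched on a toy problem. This is a minimal illustration, not a deep network: it assumes a single linear neuron, for which backpropagation reduces to the two gradient expressions in `train`. The function names, the toy data and the hyperparameters are all hypothetical choices made for this example.

```python
import random

def split_dataset(samples, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle the samples and split them into train, validation and test lists."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

def mse(w, b, data):
    """Mean squared error of the linear model y = w * x + b."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def train(train_set, val_set, lr=0.5, epochs=300):
    """Fit w and b by gradient descent, reporting the validation loss."""
    w, b = 0.0, 0.0
    n = len(train_set)
    for epoch in range(epochs):
        # Gradients of the MSE loss with respect to w and b
        grad_w = sum(2 * (w * x + b - y) * x for x, y in train_set) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in train_set) / n
        w -= lr * grad_w
        b -= lr * grad_b
        if epoch % 100 == 0:
            print(f"epoch {epoch}: validation loss {mse(w, b, val_set):.4f}")
    return w, b

# Toy labeled data generated from y = 3x + 1
data = [(i / 100, 3 * i / 100 + 1) for i in range(100)]
train_set, val_set, test_set = split_dataset(data)
w, b = train(train_set, val_set)
print(f"test loss: {mse(w, b, test_set):.6f}")
```

The test set is touched only once, after training, so its loss estimates how well the model generalizes.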

#### 1.3 **[1pt]** What is overfitting? How to avoid it?

Overfitting happens when the model is trained to fit the training set too closely, including all the noise and specific fluctuations, instead of learning the underlying pattern of the data.  
As a consequence, the trained model fails to generalize to unseen data.  

To avoid it, we can increase the size of the training data through data augmentation, apply regularization such as dropout layers, or use transfer learning.
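
As an illustration of the dropout technique mentioned above, here is a minimal sketch of "inverted" dropout on a plain list of activations. The function name and signature are hypothetical, and `rate` is assumed to be strictly below 1.

```python
import random

def dropout(activations, rate, training, rng=random):
    """Inverted dropout: during training, zero each activation with
    probability `rate` and rescale the survivors by 1 / (1 - rate)."""
    if not training or rate == 0.0:
        return list(activations)
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(0)
hidden = [1.0] * 10
print(dropout(hidden, rate=0.5, training=True, rng=rng))   # mix of 0.0 and 2.0
print(dropout(hidden, rate=0.5, training=False, rng=rng))  # unchanged at inference
```

Rescaling the surviving activations keeps their expected sum the same in training and inference, which is why the layer can simply be switched off at test time.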

# 2. Deep Convolutional Neural Networks

#### 2.1 **[2pt]** Calculate the dimension of the feature space of the third layer of LeNet-5 (16 filters, slide 28). Explain your reasoning.

- Remember it uses Valid Convolution for padding.
- The filter size is really $(5\times5\times6)$ since it takes all channels at a time.
- The number of Filters is the number of neurons.

The third layer applies 16 filters of size 5 x 5 x 6, and each filter produces one 10 x 10 feature map.  
The input to this layer is 14 x 14 (output of the previous layer). Valid convolution means no padding, so applying a 5 x 5 filter at stride 1 reduces each spatial dimension by 5 - 1 = 4, giving 14 - 4 = 10.  
The resulting feature space therefore has dimension 10 x 10 x 16.
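
The same arithmetic can be checked with a small helper. This is a sketch using the standard output-size formula for a convolution along one dimension; the function name is a hypothetical choice.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output size of a convolution along one spatial dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

# Valid convolution means padding=0
side = conv_output_size(input_size=14, kernel_size=5)
num_filters = 16
print((side, side, num_filters))  # (10, 10, 16)
```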


#### 2.2 **[2pt]** Explain in English the results of the Microsoft Tay Twitter chatbot experiment. Propose a safer alternative experiment protocol.

In the Microsoft Tay Twitter experiment, users fed the chatbot offensive and hateful language, which it learned from and replicated. Anyone could send the bot anything, and it had no ethical or moral safeguards to filter incoming tweets.  

A safer protocol could restrict interactions to a curated set of individuals rather than anonymous internet users. Human administrators could also oversee the interactions, or a separate model could be placed in front of the bot to filter out spam and offensive language before it is used for learning.
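
The filtering step of such a protocol could look like the sketch below. It is purely illustrative: `moderate`, `keyword_filter` and the placeholder blocklist are hypothetical, and a real deployment would use a trained classifier plus human review rather than keyword matching.

```python
def moderate(messages, is_offensive):
    """Split incoming messages into those safe to learn from
    and those held back for human review."""
    accepted, held = [], []
    for message in messages:
        (held if is_offensive(message) else accepted).append(message)
    return accepted, held

# Hypothetical stand-in for a trained toxicity classifier
BLOCKLIST = {"badword", "spamlink"}

def keyword_filter(message):
    return any(word in BLOCKLIST for word in message.lower().split())

incoming = ["great movie tonight", "you are a badword"]
accepted, held = moderate(incoming, keyword_filter)
print("used for learning:", accepted)
print("held for review:", held)
```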

# 3. Generative Adversarial Networks

#### BONUS **[ZERO pt]** GANs are amazing tools and a great topic, but they are complex enough that implementing a decent example would require a lab by itself. So here is a [great tutorial](https://colab.research.google.com/github/tensorflow/gan/blob/master/tensorflow_gan/examples/colab_notebooks/tfgan_tutorial.ipynb). If you choose to play with it, share your progress on Moodle and we'll support you!

# 4. Transfer Learning

If I were to only do **one** Bonus Question in the entire course, and was interested in taking a job using Deep Learning after university, I would do this one here. It is too much work to _require_ of everyone, but it is probably the most valuable exercise in this whole assignment if you wish to do it.

Transfer Learning is easily the most useful and powerful technique to know when you first get a job that expects you to apply Deep Learning -- granted, IF you know how NNs work, as required for this course. It allows you to simply download enormous networks that have been trained on supercomputers using unbelievably large datasets, then specialize them to your specific problem and use their results for free.
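
The principle (freeze the pre-trained body, train only a new decision layer) can be sketched without any deep learning library. This is a toy under stated assumptions: `FROZEN_BODY` is a hypothetical stand-in for a downloaded network, the head is a single logistic-regression neuron, and the task and hyperparameters are invented for the example.

```python
import math
import random

# Stand-in for a pre-trained body: fixed weights mapping 2 inputs to 2 features.
# The values are hypothetical; a real body would be a downloaded deep network.
FROZEN_BODY = [[1.0, 1.0], [1.0, -1.0]]

def extract_features(x):
    """Frozen feature extractor: its weights are never updated."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_BODY]

def predict(head, x):
    """New decision layer: logistic regression on the frozen features."""
    weights, bias = head
    z = sum(w * f for w, f in zip(weights, extract_features(x))) + bias
    return 1.0 / (1.0 + math.exp(-z))

def train_head(data, lr=0.5, epochs=500):
    """Stochastic gradient descent on the head parameters only."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            error = predict((weights, bias), x) - y  # log-loss gradient w.r.t. z
            features = extract_features(x)
            weights = [w - lr * error * f for w, f in zip(weights, features)]
            bias -= lr * error
    return weights, bias

# Toy task: the label is 1 when x0 > x1
rng = random.Random(0)
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(100)]
data = [((x0, x1), 1 if x0 > x1 else 0) for x0, x1 in points]
head = train_head(data)
accuracy = sum((predict(head, x) > 0.5) == bool(y) for x, y in data) / len(data)
print(f"training accuracy with a frozen body: {accuracy:.2f}")
```

Only the three head parameters are learned here, which is why training on top of a frozen body is fast even when the body itself is enormous.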

#### BONUS **[ZERO pt]** Follow [this tutorial](https://keras.io/guides/transfer_learning/) on Transfer Learning.

# 5. Transformers

#### 5.1 **[5pt]** Run the simple Transformer tutorial below to train a model on movie reviews. Explain which parts of the Transformer are involved in the following lines of the tutorial cell, and what they do: 33, 37, 66, 106, 109 (include each parameter).

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import tensorflow as tf
sns.set(rc={'figure.figsize':(8,6)}, style="whitegrid")

In [2]:
"""
Title: Text classification with Transformer
Author: [Apoorv Nandan](https://twitter.com/NandanApoorv)
Date created: 2020/05/10
Last modified: 2020/05/10
Description: Implement a Transformer block as a Keras layer
and use it for text classification.
Accelerator: GPU
https://github.com/keras-team/keras-io/blob/master/examples/nlp/text_classification_with_transformer.py
"""
"""
## Setup
"""

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.preprocessing import sequence

# The IMDB Large Movie Review Dataset
# https://ai.stanford.edu/~amaas/data/sentiment/
# https://www.tensorflow.org/api_docs/python/tf/keras/datasets/imdb
from tensorflow.keras.datasets import imdb

"""
## Implement a Transformer block as a layer
"""


class TransformerBlock(layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super().__init__()
        self.att = layers.MultiHeadAttention(num_heads=num_heads,
                                             key_dim=embed_dim)
        self.ffn = keras.Sequential(
            [
                layers.Dense(ff_dim, activation="relu"),
                layers.Dense(embed_dim),
            ]
        )
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, inputs, training):
        attn_output = self.att(inputs, inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)


"""
## Implement embedding layer

Two separate embedding layers, one for tokens, one for token
index (positions).
"""


class TokenAndPositionEmbedding(layers.Layer):
    def __init__(self, maxlen, vocab_size, embed_dim):
        super().__init__()
        self.token_emb = layers.Embedding(input_dim=vocab_size,
                                          output_dim=embed_dim)
        self.pos_emb = layers.Embedding(input_dim=maxlen,
                                        output_dim=embed_dim)

    def call(self, x):
        maxlen = tf.shape(x)[-1]
        positions = tf.range(start=0, limit=maxlen, delta=1)
        positions = self.pos_emb(positions)
        x = self.token_emb(x)
        return x + positions


"""
## Download and prepare dataset
"""

vocab_size = 20000  # Only consider the top 20k words
maxlen = 200  # Only consider the first 200 words of each movie review
dataset = imdb.load_data(num_words=vocab_size)
(x_train, y_train), (x_val, y_val) = dataset
print(len(x_train), "Training sequences")
print(len(x_val), "Validation sequences")
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_val = sequence.pad_sequences(x_val, maxlen=maxlen)

"""
## Create classifier model using transformer layer

Transformer layer outputs one vector for each time
step of our input sequence. Here, we take the mean
across all time steps and use a feed forward network
on top of it to classify text.
"""


embed_dim = 32  # Embedding size for each token
num_heads = 2  # Number of attention heads
ff_dim = 32  # Hidden layer size in feed forward network inside transformer

inputs = layers.Input(shape=(maxlen,))
embedding_layer = TokenAndPositionEmbedding(maxlen, vocab_size, embed_dim)
x = embedding_layer(inputs)
transformer_block = TransformerBlock(embed_dim, num_heads, ff_dim)
x = transformer_block(x)
x = layers.GlobalAveragePooling1D()(x)
x = layers.Dropout(0.1)(x)
x = layers.Dense(20, activation="relu")(x)
x = layers.Dropout(0.1)(x)
outputs = layers.Dense(2, activation="softmax")(x)

model = keras.Model(inputs=inputs, outputs=outputs)


"""
## Train and Evaluate
"""

batch_size=32
epochs=2

model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"]
)
history = model.fit(
    x_train,
    y_train,
    batch_size=batch_size,
    epochs=epochs,
    validation_data=(x_val, y_val)
)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


Exception: URL fetch failure on https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz: None -- [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1000)

In [None]:
Line 33: The Multi-Head Attention part of the transformer is involved here.  
This line creates the self-attention layer, which lets each position of the input sequence attend to every other position.  
The parameter num_heads sets the number of attention heads, i.e. the attention operations performed in parallel.  
The parameter key_dim sets the size of each head's query and key vectors (and of its value vectors, since value_dim defaults to key_dim). It is set to the variable embed_dim, which is the embedding dimension of the input tokens.  

Line 37: The Feed-forward part of the transformer is involved here.  
This line creates a Dense layer as part of the feed-forward network.  
The parameter ff_dim determines the number of units in the Dense layer.  
The parameter activation determines the activation function of the Dense layer which is set to rectified linear unit (relu).

Line 66: The input embedding part of the transformer is involved here.  
This line maps each token to a dense vector representation by creating an embedding layer for the input tokens.  
The parameter input_dim determines the size of the vocabulary or unique tokens in the input data.  
The parameter output_dim determines the dimensionality of the embedding vectors and is set to the variable embed_dim introduced before.

Line 106: The inputs of the transformer architecture are defined here.  
This line determines the shape of the input data by creating an input layer.  
The parameter shape is set to (maxlen,), so each input is a sequence of maxlen token indices; the batch dimension is left implicit.

Line 109: A single Transformer encoder block is instantiated here: multi-head self-attention followed by the feed-forward network, each with dropout, a residual connection and layer normalization. There is no decoder in this model.  
The argument embed_dim sets the size of each head's query, key and value vectors in the self-attention layer, as well as the output size of the feed-forward network.  
The argument num_heads sets the number of heads in the Multi-Head Attention part of the transformer.  
The argument ff_dim defines the dimensionality of the hidden layer in the feed-forward network.
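
To make the self-attention mechanism described above concrete, here is a minimal single-head sketch in plain Python. It is a simplification: the real `MultiHeadAttention` layer first applies learned projections to obtain queries, keys and values, whereas this example assumes identity projections and feeds the token vectors in directly. All names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention for one head.

    Each argument is a list of vectors, one per sequence position.
    Returns the attended vectors and the attention weights.
    """
    scale = math.sqrt(len(keys[0]))  # square root of the key dimension
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors
        outputs.append([sum(wj * v[d] for wj, v in zip(w, values))
                        for d in range(len(values[0]))])
    return outputs, weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and tells how much one position attends to every position of the sequence, itself included.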

#### 5.2 **[3pt]** Use the trained model to generate a review of 100 words.

- Remember that your `model` is, after all, still just a neural network.
- Think what input you need to pass at each time step, you already did sequential modeling with RNNs last week.
- Think explicitly about the conversions between text and embedded vectors, both for inputs and outputs.
- For your reference, here is some code to read a (decoded) review from the dataset.

In [None]:
# Source: Tensorflow IMDB dataset documentation
# https://www.tensorflow.org/api_docs/python/tf/keras/datasets/imdb/get_word_index

# Use the default parameters to keras.datasets.imdb.load_data
start_char = 1
oov_char = 2
index_from = 3
# Retrieve the training sequences.
(x_train, _), _ = keras.datasets.imdb.load_data(
    start_char=start_char, oov_char=oov_char, index_from=index_from
)
# Retrieve the word index file mapping words to indices
word_index = keras.datasets.imdb.get_word_index()
# Reverse the word index to obtain a dict mapping indices to words
# And add `index_from` to indices to sync with `x_train`
inverted_word_index = dict(
    (i + index_from, word) for (word, i) in word_index.items()
)
# Update `inverted_word_index` to include `start_char` and `oov_char`
inverted_word_index[start_char] = "[START]"
inverted_word_index[oov_char] = "[OOV]"
# Decode the first sequence in the dataset
decoded_sequence = " ".join(inverted_word_index[i] for i in x_train[0])
print(decoded_sequence)

In [None]:
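# Note: the trained `model` is a sentiment classifier. It outputs 2 class
# probabilities, not a distribution over the vocabulary, so it cannot predict
# the next word directly. As a workaround (an assumption, not the tutorial's
# method), we sample candidate words at each step and keep the one that the
# classifier scores as most positive.
rng = np.random.default_rng(0)
n_words = 100       # Length of the generated review
n_candidates = 64   # Candidate next words scored per step
seed_len = 10       # Words copied from a real review to start from

# Seed with the start of a random training review, dropping padding (0) and
# ids outside the model's vocabulary (x_train was reloaded without a limit)
review_index = rng.integers(len(x_train))
tokens = [int(i) for i in x_train[review_index] if 0 < i < vocab_size][:seed_len]

while len(tokens) < n_words:
    # Random ids of real words (ids up to index_from are reserved)
    candidates = rng.integers(index_from + 1, vocab_size, size=n_candidates)
    # One padded input sequence per candidate: current tokens + candidate
    batch = sequence.pad_sequences(
        [tokens + [int(c)] for c in candidates], maxlen=maxlen
    )
    # Column 1 holds the probability of the positive class
    positive_probs = model.predict(batch, verbose=0)[:, 1]
    tokens.append(int(candidates[np.argmax(positive_probs)]))

generated_review = " ".join(inverted_word_index.get(i, "[OOV]") for i in tokens)
print(generated_review)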

**BONUS [ZERO pt]: Follow this complete tutorial on Transformers.**  
https://www.tensorflow.org/text/tutorials/transformer

# At the end of the exercise

Bonus question with no points! Answering it has no influence on your score, neither for the assignment nor for the exam, so really feel free to ignore it with no consequence. But solving it will reward you with skills that make the next lectures easier, give you real applications, and serve as good practice for the exam.

The solution to this question will not be included in the regular lab solutions PDF, but you are welcome to open a discussion on Moodle: we will support you in addressing it, and you may meet other students who choose to solve it, and find a teammate for the next assignment who is willing to do things for fun and not only for the score :)

#### BONUS **[ZERO pt]** You now know the basis for time series prediction using recurrent networks. Why don't you try your hand at predicting the evolution of the current COVID-19 situation? Specifically look at the Reproduction number, which is the base for the exponential growth of the infection. You can find the main data from JHU CSSE [here](https://github.com/CSSEGISandData/COVID-19), then the data for Switzerland [here](https://github.com/openZH/covid_19) (specifically Fribourg [here](https://github.com/openZH/covid_19/blob/master/fallzahlen_kanton_total_csv_v2/COVID19_Fallzahlen_Kanton_FR_total.csv)), some work from ETHZ [here](https://bsse.ethz.ch/cevo/research/sars-cov-2/real-time-monitoring-in-switzerland.html), and an example for advanced visualization [here](https://opensource.com/article/20/4/python-data-covid-19). Feel free to share your conclusions and opinions on it on the forum.

### Final considerations

- You now know more about Transformer networks using Keras: this allows you to tackle several Natural Language Processing tasks, which are highly marketable at this time.
- You should now have a deeper understanding of convolutions, especially of how things that appear small and easy (such as padding and striding) can lead to quite complex changes of behavior. For example, can we apply "same" padding with a filter of an even shape (e.g. 4 x 4, 6 x 6 etc.)? Would it be possible to pad the input such that, using a stride > 1, we get a matrix with the same shape as the input? This reasoning is important because these "sizes" in the network are hyperparameters, which means that you are responsible for setting them correctly.
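
To experiment with the padding questions above, the sketch below computes how much padding "same" mode needs along one dimension. It assumes the convention used by TensorFlow, where the output size is `ceil(input_size / stride)` and any odd padding amount puts the extra cell at the end; the function name is hypothetical.

```python
import math

def same_padding(input_size, kernel_size, stride=1):
    """Return (output_size, pad_before, pad_after) for 'same' padding
    along one dimension, assuming output_size = ceil(input_size / stride)."""
    output_size = math.ceil(input_size / stride)
    total = max((output_size - 1) * stride + kernel_size - input_size, 0)
    pad_before = total // 2
    return output_size, pad_before, total - pad_before

print(same_padding(14, 5))            # odd filter: symmetric padding
print(same_padding(14, 4))            # even filter: asymmetric padding
print(same_padding(14, 5, stride=2))  # stride 2: the output shrinks
```

Comparing the three calls shows how the filter parity and the stride change the padding and the output size.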