# Data Science

Before starting the exercises, make sure to enable the GPU:


*   Go to the runtime menu (↑)
*   Select *Change runtime type*
*   Select GPU as *Hardware accelerator*


## Deep Learning 2 (1 hour 30 min)

### Exercise 1 (10 min) -- Some more questions about Deep Learning

Edit this cell to write **T** (True) or **F** (False) *before* each assertion

1. **T** A bad choice of optimization parameters can cause underfitting
1. **F** Decreasing the minibatch size helps reduce noise in the gradients
1. **F** If a CNN is overfitting, doubling the number of filters will usually make it overfit twice as much
1. **T** The learning rate and the batch size often have an impact on generalization
1. **F** Decreasing the learning rate of the optimizer helps overcome overfitting
1. **T** Inserting normalization layers (e.g. BatchNorm) helps overcome underfitting
1. **T** Typical neural network architectures are over-parameterized, meaning that they have enough trainable parameters to fit arbitrary noise labels.

### Exercise 2 (20 min) -- Gradient Descent and momentum in numpy

While building a minimal neural network (a linear model) in numpy, we define the parameters $\theta$ as a weight vector $W$ of shape `(8,)` and a bias vector $b$ of shape `(1,)`. You are given the functions to compute the loss and the gradient of the loss w.r.t. the parameters.

1. Write a function that performs gradient descent over `n_iter` steps with a given `learning_rate`
2. Write a function that performs momentum gradient descent over `n_iter` steps with a given `learning_rate` and `momentum`

*Hint*: formula for the momentum update:

$$ \mathbf{v} \leftarrow \mu \cdot \mathbf{v} + \nabla $$
$$ \mathbf{\theta} \leftarrow \theta - \eta \cdot \mathbf{v} $$

where:

- $\theta$ is the vector of trainable parameters
- $\eta$ is the `learning_rate` coefficient
- $\mu$ is the `momentum` coefficient
- $\nabla$ is the gradient (usually of the loss function at the current parameter values; here, the output of `gradients(params)`)
- $\mathbf{v}$ is the tensor of velocities and has the same shape as the parameters tensor $\theta$. $\mathbf{v}$ is initialized to zero.

In [12]:
import numpy as np

n_samples = 100
n_features = 8

rng = np.random.RandomState(seed=0)
X = rng.randn(n_samples, n_features)
w_true = rng.randn(n_features)
b_true = rng.randn(1)
noise = rng.randn(n_samples) / 10
y = X @ w_true + b_true + noise

In [13]:
def loss(params):
    y_pred = X @ params[0] + params[1]
    return np.mean(0.5 * (y - y_pred) ** 2, axis=0)


def gradients(params):
    y_pred = X @ params[0] + params[1]
    diff = y_pred - y
    return [np.mean(X * diff.reshape(-1, 1), axis=0),
            np.mean(diff, axis=0)]

In [14]:
init_params = [np.zeros(shape=(n_features,)),
               np.zeros(shape=(1,))]

loss(init_params)

4.387651045016995

In [15]:
gradients(init_params)

[array([-1.38796564, -1.12056059, -0.35814543,  0.65666592, -1.5200169 ,
        -0.79684241, -0.69080678, -0.08511958]), 2.114556483761852]
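
Before using `gradients` in an optimizer, it can be reassuring to compare it with a finite-difference approximation of `loss`. The block below is a minimal sketch of such a check: it rebuilds the same synthetic data, and `numerical_gradients` is a hypothetical helper introduced only for this illustration (it is not part of the exercise).

```python
import numpy as np

# Same synthetic data as in the cells above.
n_samples, n_features = 100, 8
rng = np.random.RandomState(seed=0)
X = rng.randn(n_samples, n_features)
w_true = rng.randn(n_features)
b_true = rng.randn(1)
noise = rng.randn(n_samples) / 10
y = X @ w_true + b_true + noise


def loss(params):
    y_pred = X @ params[0] + params[1]
    return np.mean(0.5 * (y - y_pred) ** 2, axis=0)


def gradients(params):
    y_pred = X @ params[0] + params[1]
    diff = y_pred - y
    return [np.mean(X * diff.reshape(-1, 1), axis=0),
            np.mean(diff, axis=0)]


def numerical_gradients(params, eps=1e-6):
    """Central finite differences of `loss` (hypothetical helper)."""
    num_grads = []
    for i, p in enumerate(params):
        g = np.zeros_like(p)
        for j in range(p.size):
            plus = [q.copy() for q in params]
            minus = [q.copy() for q in params]
            plus[i][j] += eps
            minus[i][j] -= eps
            g[j] = (loss(plus) - loss(minus)) / (2 * eps)
        num_grads.append(g)
    return num_grads


# Check at a random point rather than at zero.
params = [rng.randn(n_features), rng.randn(1)]
analytic = gradients(params)
numeric = numerical_gradients(params)
max_err = max(np.max(np.abs(a - n)) for a, n in zip(analytic, numeric))
print(f"max abs difference: {max_err:.2e}")
```

Because the loss is quadratic, central differences are exact up to rounding error, so the difference should be tiny.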

In [20]:
learning_rate = 0.1
momentum = 0.5

In [21]:
def gradient_descent(init_params, n_iter=5):
    params = [p.copy() for p in init_params]
    for step in range(n_iter):
        # Plain gradient descent: step against the gradient of the loss.
        new_gradients = gradients(params)
        params[0] = params[0] - learning_rate * new_gradients[0]
        params[1] = params[1] - learning_rate * new_gradients[1]
    return params


final_params = gradient_descent(init_params, n_iter=15)
loss(final_params)

0.10512056065005256

In [25]:
velocity_params = [np.zeros(shape=(n_features,)),
                   np.zeros(shape=(1,))]


def momentum_gradient_descent(init_params, init_velocity, n_iter=5):
    params = [p.copy() for p in init_params]
    v_params = [v.copy() for v in init_velocity]
    for step in range(n_iter):
        new_gradients = gradients(params)
        # Velocity update: v <- mu * v + gradient
        v_params[0] = momentum * v_params[0] + new_gradients[0]
        v_params[1] = momentum * v_params[1] + new_gradients[1]
        # Parameter update: theta <- theta - eta * v
        params[0] = params[0] - learning_rate * v_params[0]
        params[1] = params[1] - learning_rate * v_params[1]
    return params


final_params = momentum_gradient_descent(init_params, velocity_params, n_iter=15)
loss(final_params)

0.008094608613871714
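
To see how the two optimizers behave step by step rather than only at the end, one option is to record the loss after every update. The sketch below assumes the same data, `learning_rate = 0.1` and `momentum = 0.5` as above; `run` is a hypothetical helper that covers both cases, since setting the momentum coefficient to zero reduces the update to plain gradient descent.

```python
import numpy as np

# Same synthetic data as in the cells above.
n_samples, n_features = 100, 8
rng = np.random.RandomState(seed=0)
X = rng.randn(n_samples, n_features)
w_true = rng.randn(n_features)
b_true = rng.randn(1)
noise = rng.randn(n_samples) / 10
y = X @ w_true + b_true + noise


def loss(params):
    y_pred = X @ params[0] + params[1]
    return np.mean(0.5 * (y - y_pred) ** 2, axis=0)


def gradients(params):
    y_pred = X @ params[0] + params[1]
    diff = y_pred - y
    return [np.mean(X * diff.reshape(-1, 1), axis=0),
            np.mean(diff, axis=0)]


def run(n_iter, learning_rate, momentum):
    """Momentum gradient descent; momentum=0 gives plain gradient descent."""
    params = [np.zeros(n_features), np.zeros(1)]
    velocities = [np.zeros_like(p) for p in params]
    history = [loss(params)]
    for _ in range(n_iter):
        grads = gradients(params)
        velocities = [momentum * v + g for v, g in zip(velocities, grads)]
        params = [p - learning_rate * v for p, v in zip(params, velocities)]
        history.append(loss(params))
    return history


gd_history = run(n_iter=15, learning_rate=0.1, momentum=0.0)
momentum_history = run(n_iter=15, learning_rate=0.1, momentum=0.5)
for step in (0, 5, 10, 15):
    print(step, gd_history[step], momentum_history[step])
```

Both runs start from the same loss. On this well-conditioned problem the momentum run should reach a lower loss after 15 steps, in line with the two final values reported above.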

### Exercise 3 (30 min) -- Natural Language Classifier

Alice wants to classify the topic of tweets. She is interested in knowing whether a tweet deals with `politics`, `technology`, `religion` or none of the three. She assumes that exactly one of these possibilities holds for a given tweet.

Say she has a dataset of 10K tweets with their corresponding labels.

##### 3.1 Describe all the preprocessing steps Alice should do before feeding the data to train the model described below:

*Edit this cell and write me!*

1. Import the pandas library and load the tweets into a dataframe.
2. Remove punctuation from the text, using a predefined punctuation list such as Python's `string.punctuation` (``!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~``).
3. Lowercase all text.
4. Tokenize the text (word or sentence tokenization) to introduce some structure.
5. Remove stopwords, which add little value to the analysis. The nltk library provides a list of English stopwords.
6. Filter out URLs, HTML tags and rare words, which tweets may also include.
7. Standardize the text with stemming, which reduces words to their base form.
8. Alternatively, lemmatize: a more practical text standardization that preserves the meaning. Lemmatization is dictionary-based and can be slow.
9. Transform the text data into numerical values. Plenty of methods exist, such as bag of words, which encodes each document as a vector over the whole vocabulary (1 if the word is present, 0 if not). Other methods, such as TF-IDF, take into account word frequency over the collection of documents. For the embedding-based models below, each tweet is instead mapped to a sequence of integer word indices, padded or truncated to a fixed length.
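
The steps above can be sketched in plain Python for the case where the model expects padded sequences of integer word indices. This is only an illustration under stated assumptions: the helper names, the reserved indices `PAD = 0` and `UNK = 1`, and the two example tweets are all made up for this sketch, and stopword removal, stemming and lemmatization are left out.

```python
import re
import string
from collections import Counter

PAD, UNK = 0, 1  # reserved indices (an assumption of this sketch)


def clean_and_tokenize(tweet):
    """Lowercase, drop URLs and punctuation, then split on whitespace."""
    tweet = tweet.lower()
    tweet = re.sub(r"https?://\S+", " ", tweet)
    tweet = tweet.translate(str.maketrans("", "", string.punctuation))
    return tweet.split()


def build_vocabulary(tokenized_tweets, max_words):
    """Map the most frequent words to integer indices starting at 2."""
    counts = Counter(tok for toks in tokenized_tweets for tok in toks)
    most_common = counts.most_common(max_words - 2)
    return {word: idx + 2 for idx, (word, _) in enumerate(most_common)}


def encode(tokens, vocabulary, max_len):
    """Convert tokens to indices, then truncate or right-pad to max_len."""
    ids = [vocabulary.get(tok, UNK) for tok in tokens][:max_len]
    return ids + [PAD] * (max_len - len(ids))


# Made-up example tweets.
tweets = [
    "New phone launch today! https://example.com #tech",
    "The election debate was heated...",
]
tokenized = [clean_and_tokenize(t) for t in tweets]
vocabulary = build_vocabulary(tokenized, max_words=20000)
encoded = [encode(toks, vocabulary, max_len=50) for toks in tokenized]
print(tokenized[0])
print(encoded[0][:8])
```

In practice the vocabulary would be built on the training split only, so that words seen only at test time fall back to `UNK`.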


##### 3.2 Analysis of the model

Add a **T** (True) or **F** (False) before each of the following statements

- **T** 4 is a correct number of classes for this problem
- **T** 20000 is a suitable vocabulary size
- **F** 5 is a suitable embedding dimension
- **F** 50 is a suitable sequence length
- **F** The first model only has parameters in the Dense layer
- **F** The first model takes into account the order of the words
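
The last two statements can be checked numerically. The sketch below uses a random matrix as a stand-in for the trained embedding (an assumption made only for illustration). It shows that averaging embeddings over the sequence axis, which is what `GlobalAveragePooling1D` computes, gives the same vector for a shuffled sequence, and it counts the parameters of the embedding and `Dense` layers of the first model.

```python
import numpy as np

MAX_NB_WORDS, EMBEDDING_DIM, N_CLASSES = 20000, 5, 4

rng = np.random.RandomState(0)
# Random stand-in for the trainable embedding matrix.
embedding_matrix = rng.randn(MAX_NB_WORDS, EMBEDDING_DIM)

sequence = rng.randint(0, MAX_NB_WORDS, size=50)
shuffled = rng.permutation(sequence)  # same words, different order

# Average pooling over the sequence axis.
pooled = embedding_matrix[sequence].mean(axis=0)
pooled_shuffled = embedding_matrix[shuffled].mean(axis=0)
same_representation = np.allclose(pooled, pooled_shuffled)

# Parameter counts for model 1.
embedding_params = MAX_NB_WORDS * EMBEDDING_DIM      # lookup table
dense_params = EMBEDDING_DIM * N_CLASSES + N_CLASSES  # weights + biases

print(same_representation, embedding_params, dense_params)
# → True 100000 24
```

Almost all trainable parameters of the first model sit in the embedding table, and word order is lost by the averaging step.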

##### 3.3 Write the second and third models in the `elif` statements below

- model 2: based on LSTM with a sensible number of hidden units
- model 3: based on several Convolutions1D and MaxPoolings with sensible numbers of parameters

In [43]:
from tensorflow.keras.layers import Dense, Input, Flatten, Convolution1D, MaxPooling1D
from tensorflow.keras.layers import GlobalAveragePooling1D, Embedding, LSTM
from tensorflow.keras.models import Model
from tensorflow.keras import optimizers

MAX_NB_WORDS = 20000
EMBEDDING_DIM = 5
MAX_SEQUENCE_LENGTH = 50
model_num = 2
N_CLASSES = 4

# input: a sequence of MAX_SEQUENCE_LENGTH integers
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')

embedding_layer = Embedding(MAX_NB_WORDS, EMBEDDING_DIM,
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=True)
embedded_sequences = embedding_layer(sequence_input)

if model_num == 1:
    x = GlobalAveragePooling1D()(embedded_sequences)
    predictions = Dense(N_CLASSES, activation='softmax')(x)
elif model_num == 2:
    x = LSTM(128)(embedded_sequences)
    x = Dense(100, activation='relu', name="dense_final")(x)
    predictions = Dense(N_CLASSES, activation='softmax')(x)

elif model_num == 3:
    x = Convolution1D(128, 5, activation='relu', name='conv1')(embedded_sequences)
    x = MaxPooling1D()(x)
    x = Convolution1D(128, 5, activation='relu', name='conv2')(x)
    x = MaxPooling1D()(x)
    x = Flatten(name="flatten1")(x)
    x = Dense(100, activation='relu', name="dense_final")(x)
    predictions = Dense(N_CLASSES, activation='softmax')(x)

model = Model(sequence_input, predictions)
model.compile(loss="categorical_crossentropy",
              optimizer='adam', metrics=['acc'])

Use the following random input to check that your code can run without failing with randomly initialized weights.

In [44]:
import numpy as np

batch_size = 3
random_batch = np.random.randint(low=0, high=MAX_NB_WORDS,
                                 size=(batch_size, MAX_SEQUENCE_LENGTH))
model.predict(random_batch)



array([[0.24971864, 0.25047985, 0.25000224, 0.24979931],
       [0.24929018, 0.25098228, 0.24994725, 0.24978033],
       [0.25053275, 0.24918959, 0.25025746, 0.2500202 ]], dtype=float32)

##### 3.4 Using a Transformer model instead

Add a **T** (True) or **F** (False) before each of the following statements

- **T/F** The self-attention mechanism makes it possible to take into account the order of the words
- **T/F** It is necessary to add a positional embedding in transformer architectures when working with sequential data
- **T/F** It is possible to use a pre-trained architecture and fine-tune it for a classification task such as the one above
- **T/F** Transformer architectures can be used in sequence-to-sequence settings such as Machine Translation
- **T/F** The self-attention mechanism does not bring additional parameters to the transformer architecture

### Exercise 4 (30 min) -- Captioning

##### 4.1 Consider the following image-captioning model, which takes as input an image, and produces as output a sentence describing the image.

<img src="https://i.imgur.com/QRS2Hrj.png" style="width: 600px;" />

Notes:
- All the images in the training set are 224x224 RGB color images.
- Each sentence in the training set is an English sentence of maximum length 20, with words indexed as integers in a vocabulary of size 1000. The special symbols `<s>` (start of sequence) and `<eos>` (end of sequence) count towards the 20-word length.
- Henri does not one-hot encode the text part of the training data: he feeds the model directly with arrays of integer values as representation for the sequences.
- During training, we use teacher forcing, which means we pass as input both the image and the shifted output text, and predict the next word.
- The ResNet is pre-trained on ImageNet and outputs a vector representation of each image in dimension 2048; a linear projection then maps it to dimension 128.
- For simplicity, we add the $h$ (`img_features` in the code below) image representation to the decoder's hidden activation $h_i^{dec}$ at each time step instead of just the first one, using `RepeatVector` 
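
To make the tensor shapes in these notes concrete, here is a small numpy walk-through. It is a sketch with random arrays standing in for the ResNet features, the projection and the embedding table, and a batch size of 2 chosen arbitrarily. It is not the Keras model itself, only the same sequence of shape transformations.

```python
import numpy as np

BATCH, SEQ_LENGTH, EMBEDDING_DIM, RESNET_DIM, VOCAB = 2, 20, 128, 2048, 1000
rng = np.random.RandomState(0)

# Stand-in for the pooled ResNet output: one vector per image.
resnet_features = rng.randn(BATCH, RESNET_DIM)             # (2, 2048)

# Linear projection without bias, as in Dense(128, use_bias=False).
projection = rng.randn(RESNET_DIM, EMBEDDING_DIM) * 0.01
img_features = resnet_features @ projection                # (2, 128)

# RepeatVector: copy the image vector at every time step.
img_repeated = np.repeat(img_features[:, None, :], SEQ_LENGTH, axis=1)

# Embedding lookup on integer word indices (no one-hot encoding needed).
word_ids = rng.randint(0, VOCAB, size=(BATCH, SEQ_LENGTH))
embedding_matrix = rng.randn(VOCAB, EMBEDDING_DIM)
embedded_text = embedding_matrix[word_ids]                 # (2, 20, 128)

# Element-wise sum fed to the RNN decoder.
rnn_input = embedded_text + img_repeated
print(img_features.shape, img_repeated.shape, rnn_input.shape)
# → (2, 128) (2, 20, 128) (2, 20, 128)
```

The sum is only possible because the projection and the word embedding share the same dimension (128 here).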

In [None]:
import tensorflow as tf
from tensorflow.keras.applications.resnet50 import ResNet50
from tensorflow.keras.layers import Dense, Input, Flatten, SimpleRNN, RepeatVector, Lambda
from tensorflow.keras.layers import GlobalAveragePooling2D, Embedding, Dot, Reshape, Softmax
from tensorflow.keras.models import Model
from tensorflow.keras import optimizers

base_model = ResNet50(include_top=True)

In [None]:
input_img = base_model.layers[0].input

MAX_NB_WORDS = 1000
EMBEDDING_DIM = 128
SEQ_LENGTH = 20

# Image features: from the pre-trained resnet
img_features = base_model.layers[-2].output
img_features = Dense(EMBEDDING_DIM, use_bias=False)(img_features)
img_features = RepeatVector(SEQ_LENGTH)(img_features)

# Input text embedding
input_text = Input(shape=(SEQ_LENGTH,), dtype='int32')
embedding_layer = Embedding(MAX_NB_WORDS, EMBEDDING_DIM,
                            input_length=SEQ_LENGTH)
embedded_text = embedding_layer(input_text)

# Combining the two and producing the output
rnn_input = embedded_text + img_features
output_seq = SimpleRNN(EMBEDDING_DIM)(rnn_input)
output_seq = Dense(MAX_NB_WORDS, activation="softmax")(output_seq)

model = Model([input_img, input_text], output_seq)

##### 4.2 Analysis of the model

Edit this cell to add a **T** (True) or **F** (False) before each of the following statements

- **T/F** `base_model.layers[-2].output` has spatial dimensions
- **T/F** After the `Dense` Layer `img_features` has spatial dimensions
- **T/F** After the `RepeatVector` Layer, `img_features` has sequential dimensions
- **T/F** The linear projection has parameters
- **T/F** It is possible to fine tune the ResNet50 parameters when training the RNN
- **T/F** The hidden-to-hidden parameter matrix of the RNN has a shape of (128, 1000)
- **T/F** It is possible to add an attention mechanism to focus on specific parts of the picture
- **T/F** This model is a conditional language model

##### 4.3 Train / Test time

At train time, we use both the image and a shifted version of the expected target sequence fed as input for the RNN decoder ("teacher forcing").
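
As an illustration of what "shifted" means here, the sketch below builds the decoder input and the prediction target from a single caption. The token ids (`1` for `<s>`, `2` for `<eos>`, `0` for padding) and the helper name are assumptions made for this example, not values taken from the dataset.

```python
def teacher_forcing_pair(caption_ids, seq_length, pad_id=0):
    """Return (decoder_input, decoder_target) for one caption.

    The input drops the last token and the target drops the first,
    so position t of the target is the word that follows position t
    of the input.
    """
    decoder_input = caption_ids[:-1]
    decoder_target = caption_ids[1:]
    padding = [pad_id] * (seq_length - len(decoder_input))
    return decoder_input + padding, decoder_target + padding


# Hypothetical caption: <s> w45 w7 w312 <eos>
caption = [1, 45, 7, 312, 2]
decoder_input, decoder_target = teacher_forcing_pair(caption, seq_length=8)
print(decoder_input)   # → [1, 45, 7, 312, 0, 0, 0, 0]
print(decoder_target)  # → [45, 7, 312, 2, 0, 0, 0, 0]
```

During training the decoder always sees the ground-truth previous word, whatever it would have predicted itself.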

Edit this cell to add a **T** (True) or **F** (False) before each of the following statements

- **T/F** At test time, it is possible to use this same strategy (input both the image and the shifted output)
- **T/F** It is possible that we just capture the language structure, and barely take into account the image
- **T/F** At test time, we can decode the first word, then the second given the first, etc... (greedy decoding)
- **T/F** A beam search would probably improve results over a greedy decoding
- **T/F** For shorter sequences, there will be a stronger vanishing gradients problem
- **T/F** Replacing the RNN with an LSTM will help generalization, but will make the training more difficult
- **T/F** Replacing the basic RNN model (while preserving its 128 hidden dimension) with an LSTM would reduce the number of parameters