Author: François Mercier

Goals: 
- Convert the preprocessed data into a TF dataloader

# Imports

Additional requirements for this notebook (not part of main requirements)
```
pip install --no-index matplotlib 
pip install --no-index scikit-learn
pip install --no-index seaborn
pip install fastprogress
```

In [1]:
%matplotlib inline
%reload_ext autoreload
%autoreload 2

In [2]:
import sklearn
import matplotlib.pyplot as plt 
import seaborn as sns
from pathlib import Path
import pandas as pd
import numpy as np
import json
import pickle

from fastprogress import progress_bar

In [3]:
from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf
tf.__version__

'2.0.0'

In [4]:
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

In [5]:
import sys
sys.path.append("..")  # Required so that the utilities package (`tools`) is on the path
from tools import tokenizer

In [6]:
pd.set_option('display.max_columns', 999)
pd.set_option('display.max_colwidth', 999)
pd.set_option('display.max_rows', 999)

In [7]:
data_path = Path(r"/project/cq-training-1/project2/teams/team03/data/preprocessed_15032020")
files = list(data_path.glob("*"))
files

[PosixPath('/project/cq-training-1/project2/teams/team03/data/preprocessed_15032020/token_to_word_en.pickle'),
 PosixPath('/project/cq-training-1/project2/teams/team03/data/preprocessed_15032020/train_lang1_en_numericalized.pickle'),
 PosixPath('/project/cq-training-1/project2/teams/team03/data/preprocessed_15032020/word_to_token_fr.pickle'),
 PosixPath('/project/cq-training-1/project2/teams/team03/data/preprocessed_15032020/unaligned_fr_numericalized.pickle'),
 PosixPath('/project/cq-training-1/project2/teams/team03/data/preprocessed_15032020/word_to_token_en.pickle'),
 PosixPath('/project/cq-training-1/project2/teams/team03/data/preprocessed_15032020/train_lang2_fr_numericalized.pickle'),
 PosixPath('/project/cq-training-1/project2/teams/team03/data/preprocessed_15032020/token_to_word_fr.pickle'),
 PosixPath('/project/cq-training-1/project2/teams/team03/data/preprocessed_15032020/unaligned_en_numericalized.pickle')]

# Bilingual dataloader

In [8]:
with open(data_path/"train_lang1_en_numericalized.pickle", 'rb') as handle:
    train_lang1_en_numericalized = pickle.load(handle)
    
with open(data_path/"train_lang2_fr_numericalized.pickle", 'rb') as handle:
    train_lang2_fr_numericalized = pickle.load(handle)
    
with open(data_path/"word_to_token_en.pickle", 'rb') as handle:
    word_to_token_en = pickle.load(handle)
    
with open(data_path/"word_to_token_fr.pickle", 'rb') as handle:
    word_to_token_fr = pickle.load(handle)

with open(data_path/"token_to_word_fr.pickle", 'rb') as handle:
    token_to_word_fr = pickle.load(handle)

In [9]:
gen_ds = zip(train_lang1_en_numericalized, train_lang2_fr_numericalized)

In [10]:
def my_generator(train_lang1_en_numericalized=train_lang1_en_numericalized, 
                 train_lang2_fr_numericalized=train_lang2_fr_numericalized,
                ):
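    # Placeholder ids chosen so that, after the +3 shift below, PAD=0, BOS=1, EOS=2
    bos, eos = -2, -1
    for i in range(len(train_lang1_en_numericalized)):
        # Shift every id by 3 to free 0, 1 and 2 for the special tokens
        en = np.array([bos] + train_lang1_en_numericalized[i] + [eos]) + 3
        fr = np.array([bos] + train_lang2_fr_numericalized[i] + [eos]) + 3
        inputs = (en, fr)
        # The target is the decoder input shifted left by one step (teacher forcing)
        output = fr[1:]
        yield (inputs, output)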
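
As a minimal sketch of the id scheme used by `my_generator`, the snippet below applies the same shift to a toy sentence. The sentence ids and the helper name `add_special_tokens` are made up for illustration; only the `-2`/`-1` placeholders and the `+3` shift come from the cell above.

```python
# Illustrative only: toy ids, not taken from the real pickles.
PAD, BOS, EOS = 0, 1, 2  # ids reserved by the +3 shift


def add_special_tokens(sentence, shift=3):
    """Wrap a numericalized sentence with BOS/EOS and shift word ids by `shift`."""
    bos, eos = -2, -1  # placeholders that land on 1 and 2 after the shift
    return [t + shift for t in [bos] + sentence + [eos]]


fr = add_special_tokens([0, 7, 4])  # word ids 0, 7, 4 become 3, 10, 7
decoder_input = fr                  # what the notebook feeds to the decoder
decoder_target = fr[1:]             # same sequence shifted left by one step

print(decoder_input)   # [1, 3, 10, 7, 2]
print(decoder_target)  # [3, 10, 7, 2]
```

Since word id 0 becomes 3, id 0 stays free for padding, which is what the `Masking` layers further down assume.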

In [12]:
batch_size = 16

ds = tf.data.Dataset.from_generator(my_generator, 
                                    output_types=((tf.int32, tf.int32), tf.int32), 
                                    output_shapes=((tf.TensorShape([None]), tf.TensorShape([None])), 
                                                   tf.TensorShape([None])))
ds = ds.prefetch(tf.data.experimental.AUTOTUNE)
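# Keep the shuffled order fixed across iterations so that take()/skip() below yield disjoint splits
ds = ds.shuffle(seed=42, buffer_size=256, reshuffle_each_iteration=False)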
#ds = ds.map(lambda x, y: ((tf.minimum(x[0], 10000 - 1), tf.minimum(x[1], 10000 - 1)), tf.minimum(y, 10000 - 1))) # Only to test performance with lower vocab size (and GPU mem)
ds = ds.padded_batch(batch_size=batch_size, padded_shapes=(([128], [128]), 128)) # Batch size for K20

# Hold out about 5000 sentence pairs for testing, like XNLI: https://www.nyu.edu/projects/bowman/xnli/
test_dataset = ds.take(int(5000 / batch_size))#.cache()
train_dataset = ds.skip(int(5000 / batch_size)).cache()
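
What `padded_batch` does to each example can be sketched in plain Python, assuming zero as the padding value (the default for integer tensors). `pad_to` is a hypothetical helper, and the length is 8 instead of 128 to keep the output readable. As far as I know, `padded_batch` raises an error rather than truncating when a sequence is longer than the padded shape, so the sketch does the same.

```python
def pad_to(sequence, length, pad_value=0):
    """Right-pad `sequence` with `pad_value` up to `length` (hypothetical helper)."""
    if len(sequence) > length:
        raise ValueError("sequence is longer than the padded shape")
    return list(sequence) + [pad_value] * (length - len(sequence))


batch = [[1, 3, 10, 7, 2], [1, 5, 2]]       # toy examples, already shifted by 3
padded = [pad_to(seq, 8) for seq in batch]  # 8 instead of 128 for readability

print(padded)
# [[1, 3, 10, 7, 2, 0, 0, 0], [1, 5, 2, 0, 0, 0, 0, 0]]
```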


In [13]:
%%time
for element in test_dataset.take(1): 
    print(element[0][0].shape, element[0][1].shape, element[1].shape)

(16, 128) (16, 128) (16, 128)
CPU times: user 76.6 ms, sys: 10.6 ms, total: 87.2 ms
Wall time: 141 ms


In [14]:
len(word_to_token_fr)

91269

# Seq2Seq at word level

In [15]:
# hparams
latent_dim = 256

max_len = 128

vocab_size_en = len(word_to_token_en) + 3
vocab_size_fr = len(word_to_token_fr) + 3
#vocab_size_en = 10000
#vocab_size_fr = 10000



# Define an input sequence and process it.
encoder_inputs = tf.keras.layers.Input(shape=(max_len,))
encoder_masked_inputs = tf.keras.layers.Masking()(encoder_inputs) # Assuming PAD is zeros
encoder_embeddings = tf.keras.layers.Embedding(vocab_size_en, 300)
encoder = tf.keras.layers.LSTM(latent_dim, return_state=True, name="encoder")
encoder_outputs, state_h, state_c = encoder(encoder_embeddings(encoder_masked_inputs))
# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]

# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = tf.keras.layers.Input(shape=(max_len,))
decoder_masked_inputs = tf.keras.layers.Masking()(decoder_inputs) # Assuming PAD is zeros
decoder_embeddings = tf.keras.layers.Embedding(vocab_size_fr, 300)
# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_lstm = tf.keras.layers.LSTM(latent_dim, return_sequences=True, return_state=True, name="decoder")
decoder_outputs, _, _ = decoder_lstm(decoder_embeddings(decoder_masked_inputs),
                                     initial_state=encoder_states)
decoder_dense = tf.keras.layers.Dense(vocab_size_fr, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = tf.keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, 128)]        0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            [(None, 128)]        0                                            
__________________________________________________________________________________________________
masking (Masking)               (None, 128)          0           input_1[0][0]                    
__________________________________________________________________________________________________
masking_1 (Masking)             (None, 128)          0           input_2[0][0]                    
______________________________________________________________________________________________
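
Since the summary above is cut off before the parameter counts, here is a rough back-of-the-envelope sketch for the French side of the model. It uses the standard parameter formulas for embedding, LSTM and dense layers together with the hyperparameters of this cell and the vocabulary size printed earlier (91269). The last figure assumes float32 outputs and is only an estimate of one batch of softmax activations, not a measurement.

```python
latent_dim = 256
embedding_dim = 300
batch_size, max_len = 16, 128
vocab_size_fr = 91269 + 3  # len(word_to_token_fr) plus the 3 special tokens

embedding_params = vocab_size_fr * embedding_dim
# An LSTM has 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * ((embedding_dim + latent_dim) * latent_dim + latent_dim)
# Softmax layer: one weight per (hidden unit, word) pair plus one bias per word
dense_params = latent_dim * vocab_size_fr + vocab_size_fr
# Size of one batch of softmax outputs, assuming 4 bytes per float32 value
softmax_bytes = batch_size * max_len * vocab_size_fr * 4

print(embedding_params, lstm_params, dense_params)
# 27381600 570368 23456904
print(softmax_bytes)
# 747700224
```

The decoder embedding and the softmax layer dominate the LSTM by almost two orders of magnitude, which is why capping the vocabulary at 10000 is an option when GPU memory is tight.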

In [16]:
# Run training
model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(train_dataset, validation_data=test_dataset, validation_steps=int(5000 / batch_size), epochs=10)

Epoch 1/10
Epoch 2/10
 24/376 [>.............................] - ETA: 7:14 - loss: 1.1154 - accuracy: 0.8417

KeyboardInterrupt: 
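
The 376 steps per epoch shown in the progress bar are consistent with the split defined earlier. The sketch below redoes the arithmetic, assuming the corpus holds 11000 sentence pairs (the value that `len(refs)` reports later in this notebook) and that `padded_batch` keeps the last partial batch.

```python
import math

n_pairs = 11000   # assumed corpus size, taken from len(refs)
batch_size = 16

n_batches = math.ceil(n_pairs / batch_size)  # the last partial batch is kept
n_test_batches = int(5000 / batch_size)      # same expression as in the notebook
n_train_batches = n_batches - n_test_batches

print(n_batches, n_test_batches, n_train_batches)  # 688 312 376
```

The test split therefore contains 312 batches, i.e. 4992 sentence pairs rather than exactly 5000.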

# BLEU score

In [17]:
refs = []
for sen in train_lang2_fr_numericalized:
    refs += [" ".join([token_to_word_fr[t] for t in sen])]

In [19]:
# For an unknown reason, `predict` requires float inputs (unlike training)
preds = model.predict(test_dataset.take(1).map(lambda x, y: ((tf.cast(x[0], tf.float32), tf.cast(x[1], tf.float32)), tf.cast(y, tf.float32))))


token_to_word_fr_with_special_tokens = {(k+3): v for k, v in token_to_word_fr.items()}
token_to_word_fr_with_special_tokens[0] = "<MASK>"
token_to_word_fr_with_special_tokens[1] = "<BOS>"
token_to_word_fr_with_special_tokens[2] = "<EOS>"

# Caution: this rebinds `sys`, shadowing the module imported earlier
sys = []
for sen in preds:
    sys += [" ".join([token_to_word_fr_with_special_tokens[t.argmax()] for t in sen if t.argmax() != 0])]

In [27]:
np.random.choice(refs), np.random.choice(sys)

("Vraiment , j ' étais le plus jeune membre de n ' importe quelle délégation dans la convention de 1980 qui a élu Ronald Reagan pour être le candidat Républicain pour la présidentielle .",
 "Il , ' est , , , le la la de la . . la la . . la ' est de . la la . . la la . . la . la ' est de . <EOS>")

In [21]:
len(refs), len(sys)

(11000, 16)

In [28]:
import sacrebleu

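bleu_scores = []
for i in range(len(sys)):
    # Caution: `refs` follows the original corpus order while `sys` comes from a shuffled batch,
    # so refs[i] is most likely not the reference translation of sys[i]
    bleu_scores += [sacrebleu.corpus_bleu([sys[i]], [[refs[i]]]).score]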
    
np.mean(bleu_scores)

1.810378014096578
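
For BLEU to be meaningful, each hypothesis has to be scored against the reference of the same sentence. One way to get that pairing, sketched below under assumptions, is to decode the targets `y` of the very batch that was passed to `model.predict` instead of indexing into `refs`. The helper `ids_to_text` and the toy vocabulary are hypothetical; the id convention (0 = padding, 1 = BOS, 2 = EOS, word ids shifted by 3) follows the generator defined earlier.

```python
PAD, BOS, EOS = 0, 1, 2


def ids_to_text(ids, token_to_word, shift=3):
    """Decode shifted ids into a sentence, stopping at EOS and skipping PAD/BOS."""
    words = []
    for i in ids:
        if i == EOS:
            break
        if i in (PAD, BOS):
            continue
        words.append(token_to_word[i - shift])
    return " ".join(words)


# Toy stand-ins for token_to_word_fr, one row of y and one row of preds.argmax(-1)
toy_vocab = {0: "le", 1: "chat", 2: "dort"}
target_ids = [3, 4, 5, 2, 0, 0]     # "le chat dort" <EOS> <PAD> <PAD>
predicted_ids = [3, 3, 5, 2, 4, 4]  # anything after <EOS> is ignored

reference = ids_to_text(target_ids, toy_vocab)
hypothesis = ids_to_text(predicted_ids, toy_vocab)
print(reference)   # le chat dort
print(hypothesis)  # le le dort
```

Lists of hypotheses and references built this way stay aligned by construction and can be scored in one call with `sacrebleu.corpus_bleu(hypotheses, [references])`.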

# Transformers

In [32]:
from transformers import XLMConfig, TFXLMModel

# See documentation: https://huggingface.co/transformers/model_doc/xlm.html

# Initializing an XLM configuration
configuration = XLMConfig()

# Initializing a model from the configuration
del model
model = TFXLMModel(configuration)

# Accessing the model configuration
configuration = model.config
configuration

XLMConfig {
  "architectures": null,
  "asm": false,
  "attention_dropout": 0.1,
  "bos_index": 0,
  "bos_token_id": null,
  "causal": false,
  "do_sample": false,
  "dropout": 0.1,
  "emb_dim": 2048,
  "embed_init_std": 0.02209708691207961,
  "end_n_top": 5,
  "eos_index": 1,
  "eos_token_ids": null,
  "finetuning_task": null,
  "gelu_activation": true,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1"
  },
  "init_std": 0.02,
  "is_decoder": false,
  "is_encoder": true,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1
  },
  "lang_id": 0,
  "layer_norm_eps": 1e-12,
  "length_penalty": 1.0,
  "mask_index": 5,
  "mask_token_id": 0,
  "max_length": 20,
  "max_position_embeddings": 512,
  "model_type": "xlm",
  "n_heads": 16,
  "n_langs": 1,
  "n_layers": 12,
  "num_beams": 1,
  "num_labels": 2,
  "num_return_sequences": 1,
  "output_attentions": false,
  "output_hidden_states": false,
  "output_past": true,
  "pad_index": 2,
  "pad_token_id": null,
  "pruned_heads": {},
  "repeti

In [33]:
next(dl.take(1).__iter__())[0]

<tf.Tensor: id=8234, shape=(4, 128), dtype=int32, numpy=
array([[   27,    56, 60004,  3553,    33,    83,   126,    50,     1,
         2031,  1123,  2391,  4442,  4443,     4,  1730,    14,     1,
          362,     4,  4442,    16,  3303,   197,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0

In [34]:
model(next(dl.__iter__())[0])

(<tf.Tensor: id=12270, shape=(4, 128, 2048), dtype=float32, numpy=
 array([[[-1.1322917e+00,  1.3965550e-01,  6.6679060e-01, ...,
           3.9226633e-02,  1.3610057e+00, -2.4725671e-01],
         [-1.1042662e+00,  2.9468575e-01,  5.9892792e-01, ...,
           3.7019411e-01,  1.1967196e+00, -5.8942002e-01],
         [-1.2960956e+00,  5.7720596e-01,  6.6599417e-01, ...,
          -8.7684974e-02,  1.0786055e+00,  2.5409577e-02],
         ...,
         [-1.2445209e+00,  5.4877132e-01,  9.6160740e-01, ...,
           5.0965804e-01,  1.1313727e+00, -1.9816267e-01],
         [-1.1881863e+00,  5.9808564e-01,  6.7106557e-01, ...,
          -1.0582311e-03,  1.2208782e+00, -9.9302456e-02],
         [-1.4462609e+00,  2.9336625e-01,  6.5152121e-01, ...,
           1.8172280e-01,  1.3121840e+00, -4.1789383e-01]],
 
        [[-1.1204789e+00,  8.8081703e-02,  6.9771290e-01, ...,
           1.6565101e-01,  1.3074118e+00, -1.6417824e-01],
         [-1.2055515e+00,  3.5257468e-01,  8.8713181e-01, ...,

In [35]:
model.summary()

Model: "tfxlm_model_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
transformer (TFXLMMainLayer) multiple                  667088896 
Total params: 667,088,896
Trainable params: 667,088,896
Non-trainable params: 0
_________________________________________________________________


In [20]:
model.transformer

<transformers.modeling_tf_xlm.TFXLMMainLayer at 0x2b05dedae210>

In [14]:
from transformers import BertTokenizer, TFBertForSequenceClassification
import tensorflow_datasets

# Load dataset, tokenizer, model from pretrained model/vocabulary
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = TFBertForSequenceClassification.from_pretrained('bert-base-cased')
data = tensorflow_datasets.load('glue/mrpc')


# Prepare training: Compile tf.keras model with optimizer, loss and learning rate schedule
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

# Train and evaluate using tf.keras.Model.fit()
history = model.fit(train_dataset, epochs=2, steps_per_epoch=115,
                    validation_data=valid_dataset, validation_steps=7)

ModuleNotFoundError: No module named 'transformers'

# Conclusion 

English monolingual
- Using the sentencizer increases the sentence count from 470k to 492k
- Must be adapted to look like the English bilingual corpus (remove case and punctuation)
- Many new tokens
- Stats at token level similar to bilingual
- Fits in memory

English bilingual
- No punctuation and no case
- Max sequence length at character level: 512 (character-level generation is possible)
- Fits in memory

French monolingual
- Using the sentencizer increases the sentence count from 470k to 496k
- The French monolingual corpus doesn't need special preprocessing
- Many new tokens
- Fits in memory


French bilingual
- Punctuation and case are kept
- Some inconsistencies ("l'" and "l '"), but they can be fixed with the tokenizer
- Max sequence length at character level: 562 (character-level generation is possible)
- Fits in memory


Spacy tokenizer
- Keeps the structure consistent between monolingual and bilingual corpora
- Quite fast (30 min for one monolingual corpus, <1 min for one bilingual corpus)


Outputs
- 2 files for English corpora, and 1 file for their vocab (1 extra file for the reverse dictionary)
- 2 files for French corpora, and 1 file for their vocab (1 extra file for the reverse dictionary)

Possible improvements
- Generate a character-level dataset
- Use a BPE dataset
- Parallelize the tokenizer (currently single-threaded)