# Preprocessing

Motivated by [fast.ai](https://www.fast.ai)

This notebook takes the aligned sentence pairs from the [2015 EMNLP shared task](http://www.statmt.org/wmt15/translation-task.html) (specifically the [Giga French English corpus](http://www.statmt.org/wmt10/training-giga-fren.tar)) and tokenizes them.

Only who, what, why, when and where questions are tokenized: a sentence pair is kept if the English sentence starts with `Wh` and ends with `?`.
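
The selection rule can be sketched as follows. This is a minimal illustration rather than the code inside `QuestionTokenizer`; the function name `is_wh_question` and the sample pairs are hypothetical.

```python
def is_wh_question(sentence: str) -> bool:
    """Return True if the sentence starts with 'Wh' and ends with '?'."""
    stripped = sentence.strip()
    return stripped.startswith('Wh') and stripped.endswith('?')


# Hypothetical aligned (English, French) pairs, for illustration only.
pairs = [
    ("What is light?", "Qu'est-ce que la lumière ?"),
    ("Light is a wave.", "La lumière est une onde."),
    ("Where did we come from?", "D'où venons-nous ?"),
]

# Keep a pair only when its English side passes the filter.
questions = [(en, fr) for en, fr in pairs if is_wh_question(en)]
```

Note that a rule like this is case-sensitive and skips questions that begin with other words, such as "How".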

In [1]:
from pathlib import Path
import numpy as np
import pickle

In [2]:
from translate.data.process import QuestionTokenizer

In [3]:
french, english = Path('data/giga-fren.release2.fixed.fr'), Path('data/giga-fren.release2.fixed.en')

The tokenizer's constructor takes a tuple `(path_to_english_sentences, path_to_french_sentences)`. In addition, the number of processes, the parallelism, and the number of chunks into which the data is split can be configured.

In [4]:
tokenizer = QuestionTokenizer((english, french))

Loaded 49881 questions
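
As a rough illustration of what splitting the data into chunks and processing them in parallel can look like, here is a minimal sketch. It is not `QuestionTokenizer`'s implementation: `split_into_chunks` and `tokenize_chunk` are hypothetical helpers, the tokenizer is a whitespace stand-in, and a thread pool is used in place of worker processes to keep the example short.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, num_chunks):
    """Split a list into at most num_chunks contiguous, roughly equal slices."""
    if not items:
        return []
    chunk_size = -(-len(items) // num_chunks)  # ceiling division
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def tokenize_chunk(sentences):
    """Stand-in tokenizer: lowercase and split on whitespace."""
    return [sentence.lower().split() for sentence in sentences]


sentences = ["What is light ?", "Who are we ?", "Where did we come from ?"]
chunks = split_into_chunks(sentences, num_chunks=2)

# Executor.map yields results in input order, so sentence order is preserved.
with ThreadPoolExecutor(max_workers=2) as pool:
    tokenized_chunks = list(pool.map(tokenize_chunk, chunks))

# Flatten the per-chunk results back into one list of token lists.
tokenized = [tokens for chunk in tokenized_chunks for tokens in chunk]
```

Preserving order matters here because the English and French sentences are matched by position.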


The tokenizer's `preprocess` method caps the vocabulary at 40,000 entries by default. Pass the `vocab_size` argument to change this.

In [5]:
english, french = tokenizer.preprocess()

Tokenizing english questions
Processed 10000 articles
Processed 20000 articles
Processed 30000 articles
Processed 40000 articles
Tokenizing french questions
Processed 10000 articles
Processed 20000 articles
Processed 30000 articles
Processed 40000 articles
Tokenized questions!
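
To make the effect of a vocabulary cap concrete, here is a minimal sketch of one common way to build a capped word-to-integer mapping. It is an assumption about the general technique, not the library's code: `build_word2int`, `encode` and the `_unk_` token name are hypothetical.

```python
from collections import Counter


def build_word2int(tokenized, vocab_size, specials=('_unk_',)):
    """Map the most frequent tokens to integer ids, capped at vocab_size entries."""
    counts = Counter(token for sentence in tokenized for token in sentence)
    # Reserve the first ids for special tokens, then fill by descending frequency.
    keep = [word for word, _ in counts.most_common(vocab_size - len(specials))]
    return {word: idx for idx, word in enumerate(list(specials) + keep)}


def encode(sentence, word2int, unk='_unk_'):
    """Replace each token with its id; out-of-vocabulary tokens map to unk."""
    return [word2int.get(token, word2int[unk]) for token in sentence]


tokenized = [['what', 'is', 'light', '?'], ['what', 'is', 'the', 'location', '?']]
word2int = build_word2int(tokenized, vocab_size=4)
ints = [encode(sentence, word2int) for sentence in tokenized]
```

With a cap of 4, only the three most frequent tokens get their own id and every rarer token collapses to the unknown id.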


In [6]:
en_ints, en_dict = english
fr_ints, fr_dict = french

In [7]:
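# Sentences differ in length, so store them as object arrays
# (reload with np.load(..., allow_pickle=True)).
np.save('en_ints.npy', np.array(en_ints, dtype=object))
np.save('fr_ints.npy', np.array(fr_ints, dtype=object))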

In [8]:
en_path = Path('en_word2int.pickle')
fr_path = Path('fr_word2int.pickle')

with en_path.open(mode='wb') as en_file:
    pickle.dump(en_dict, en_file, protocol=pickle.HIGHEST_PROTOCOL)
    
with fr_path.open(mode='wb') as fr_file:
    pickle.dump(fr_dict, fr_file, protocol=pickle.HIGHEST_PROTOCOL)
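
Later steps need to read these files back. The sketch below shows a save and reload round trip on toy data in a temporary directory, so it does not touch the real files. The toy values and file names are hypothetical, and it assumes the integer sequences have different lengths, which means NumPy stores them as an object array that must be loaded with `allow_pickle=True`.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np

# Toy stand-ins for the notebook's en_ints and en_dict.
toy_ints = [[1, 2, 3], [1, 4]]
toy_dict = {'_unk_': 0, 'what': 1, 'is': 2, 'light': 3, '?': 4}

with tempfile.TemporaryDirectory() as tmp:
    ints_path = Path(tmp) / 'toy_ints.npy'
    dict_path = Path(tmp) / 'toy_word2int.pickle'

    # Ragged lists need dtype=object to be stored as a 1-D array of lists.
    np.save(ints_path, np.array(toy_ints, dtype=object))
    with dict_path.open(mode='wb') as dict_file:
        pickle.dump(toy_dict, dict_file, protocol=pickle.HIGHEST_PROTOCOL)

    # Object arrays are pickled inside the .npy file, hence allow_pickle=True.
    loaded_ints = np.load(ints_path, allow_pickle=True)
    with dict_path.open(mode='rb') as dict_file:
        loaded_dict = pickle.load(dict_file)

# Convert back to plain Python lists for comparison.
restored = [list(sentence) for sentence in loaded_ints]
```

Only load pickled data from sources you trust, since unpickling can execute arbitrary code.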

A quick check that the French and English sentences are still aligned after tokenization:

In [9]:
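def read_sentence(word2int, ints):
    """Decode a sequence of integer ids back into a space-separated sentence."""
    # Invert the word -> id mapping so ids can be looked up.
    int2word = {int(idx): word for word, idx in word2int.items()}

    return " ".join([int2word[i] for i in ints])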

In [11]:
for i in range(5):
    print(read_sentence(fr_dict, fr_ints[i]))
    print(read_sentence(en_dict, en_ints[i]))
    print('-------')

qu’ est -ce que la lumière ? xeos
what is light ? xeos
-------
où sommes - nous ? xeos
who are we ? xeos
-------
t_up d' où venons - nous ? xeos
where did we come from ? xeos
-------
que ferions - nous sans elle ? xeos
what would we do without it ? xeos
-------
quelle sont les coordonnées ( latitude et longitude ) de badger , à terre - neuve - etlabrador ? xeos
what is the absolute location ( latitude and longitude ) of badger , newfoundland and labrador ? xeos
-------
