# Neural translation - seq2seq
## French questions to English questions

**Difficulties:**

1. Output of arbitrary length
1. Order of tokens in the input and the output is not the same

In [1]:
%matplotlib inline
%reload_ext autoreload
%autoreload 2

In [2]:
from fastai.text import *

## Dataset
http://www.statmt.org/wmt15/translation-task.html

The data was obtained by crawling millions of websites and applying simple heuristics, such as replacing *en* with *fr*.

In [3]:
PATH = Path('data/translate/')
TMP_PATH = PATH/'tmp'
TMP_PATH.mkdir(exist_ok=True)

In [4]:
filename = 'giga-fren.release2.fixed'

In [5]:
en_fn = PATH/f'{filename}.en'
fr_fn = PATH/f'{filename}.fr'

Training a full translation model takes a long time. In this example we therefore focus only on questions that start with *What*, *Where*, *Wh...* etc. and end with a *?*.

Compiling the regular expressions once makes repeated matching faster.

In [6]:
# raw strings, so that \? is passed to the regex engine unchanged
re_enquest = re.compile(r'^(Wh[^?.!]+\?)')
re_frquest = re.compile(r'^([^?.!]+\?)')
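As a quick check of what the two patterns accept, here is a minimal sketch that applies them to a few made-up sentence pairs. The sample strings are illustrative and not taken from the corpus; only the patterns come from the cell above.

```python
import re

# Same patterns as in the cell above, written as raw strings.
re_enquest = re.compile(r'^(Wh[^?.!]+\?)')
re_frquest = re.compile(r'^([^?.!]+\?)')

# Hypothetical sample pairs (English, French).
pairs = [
    ('What is light ?', "Qu'est-ce que la lumière?"),
    ('Light is a wave.', 'La lumière est une onde.'),
    ('Who are we? We are astronomers.', 'Qui sommes-nous? Nous sommes astronomes.'),
    ('Is it far?', 'Est-ce loin?'),
]

# A pair is kept only if both sides match; group() returns the text
# up to and including the first question mark.
matches = [(re_enquest.search(en), re_frquest.search(fr)) for en, fr in pairs]
questions = [(e.group(), f.group()) for e, f in matches if e and f]
print(questions)
```

Note that the last pair is dropped even though both sides are questions, because the English pattern only accepts sentences starting with *Wh*.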

In [7]:
lines = ((re_enquest.search(enquest), re_frquest.search(frquest)) for enquest, frquest in zip(open(en_fn, encoding='utf-8'), open(fr_fn, encoding='utf-8')))

In [8]:
for i, l in enumerate(lines):
    print(l)
    if i == 20:
        break

(None, None)
(None, None)
(None, None)
(None, None)
(None, None)
(<_sre.SRE_Match object; span=(0, 15), match='What is light ?'>, <_sre.SRE_Match object; span=(0, 25), match='Qu’est-ce que la lumière?'>)
(None, None)
(None, None)
(None, None)
(None, None)
(None, None)
(None, None)
(None, <_sre.SRE_Match object; span=(0, 72), match="Astronomes Introduction Vidéo d'introduction Qu'e>)
(None, None)
(None, None)
(None, None)
(None, None)
(None, None)
(None, None)
(<_sre.SRE_Match object; span=(0, 11), match='Who are we?'>, <_sre.SRE_Match object; span=(0, 15), match='Où sommes-nous?'>)
(<_sre.SRE_Match object; span=(0, 23), match='Where did we come from?'>, <_sre.SRE_Match object; span=(0, 17), match="D'où venons-nous?">)


In [9]:
questions = [(e.group(), f.group()) for e, f in lines if e and f]

In [10]:
pickle.dump(questions, (PATH/'fr-en-questions.pkl').open('wb'))

In [6]:
questions = pickle.load((PATH/'fr-en-questions.pkl').open('rb'))

In [7]:
questions[:3]

[('What would we do without it?', 'Que ferions-nous sans elle ?'),
 ('What is the absolute location (latitude and longitude) of Badger, Newfoundland and Labrador?',
  'Quelle sont les coordonnées (latitude et longitude) de Badger, à Terre-Neuve-etLabrador?'),
 ('What is the major aboriginal group on Vancouver Island?',
  'Quel est le groupe autochtone principal sur l’île de Vancouver?')]

In [8]:
len(questions)

52328

In [9]:
en_questions, fr_questions = zip(*questions)

### Tokenization

`Tokenizer` is a fastai wrapper around *spaCy*; `proc_all_mp` spreads the work across multiple processor cores for speed.

In [16]:
en_tok = Tokenizer.proc_all_mp(partition_by_cores(en_questions))

In [17]:
fr_tok = Tokenizer.proc_all_mp(partition_by_cores(fr_questions), 'fr')

In [18]:
en_tok[0]

['what', 'would', 'we', 'do', 'without', 'it', '?']

In [19]:
fr_tok[0]

['que', 'ferions', '-', 'nous', 'sans', 'elle', '?']

#### Average length of the questions

In [20]:
np.mean([len(q) for q in en_tok])

13.345895123069868

In [21]:
np.mean([len(q) for q in fr_tok])

16.26809738572084

#### Discard questions that are too long

In [22]:
keep = np.array([len(q) < 30 for q in en_tok])

In [23]:
en_tok = np.array(en_tok)[keep]
fr_tok = np.array(fr_tok)[keep]

In [24]:
pickle.dump(en_tok, (PATH/'en_tok.pkl').open('wb'))
pickle.dump(fr_tok, (PATH/'fr_tok.pkl').open('wb'))

In [10]:
en_tok = pickle.load((PATH/'en_tok.pkl').open('rb'))
fr_tok = pickle.load((PATH/'fr_tok.pkl').open('rb'))

### Numericalization

In [26]:
def toks2idxs(tok, pre):
    freq = Counter(t for q in tok for t in q)
    itos = [s for s, c in freq.most_common(40000)]
    itos.insert(0, '_bos_')  # beginning of sequence token
    itos.insert(1, '_pad_')  # padding token
    itos.insert(2, '_eos_')  # end of sequence token
    itos.insert(3, '_unk_')  # unknown token
    stoi = collections.defaultdict(lambda: 3, {t:i for i,t in enumerate(itos)})  # if string not found, set to '_unk_'
    indcs = np.array([([stoi[t] for t in q] + [2]) for q in tok])
    np.save(TMP_PATH/f'{pre}_indcs.npy', indcs)
    pickle.dump(itos, open(TMP_PATH/f'{pre}_itos_pkl', 'wb'))
    return indcs, itos, stoi
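To make the mapping concrete, here is a small sketch of the same idea on a toy corpus, without the file I/O. The helper name `build_vocab` and the toy questions are assumptions made for this illustration; the special-token indices follow the function above.

```python
from collections import Counter, defaultdict

def build_vocab(tok, max_vocab=40000):
    """Return (itos, stoi) with the four special tokens at indices 0-3."""
    freq = Counter(t for q in tok for t in q)
    itos = ['_bos_', '_pad_', '_eos_', '_unk_']
    itos += [s for s, c in freq.most_common(max_vocab)]
    # Unknown strings fall back to index 3 ('_unk_').
    stoi = defaultdict(lambda: 3, {t: i for i, t in enumerate(itos)})
    return itos, stoi

# Hypothetical tokenized questions.
toy_tok = [['what', 'is', 'light', '?'], ['what', 'is', 'sound', '?']]
itos, stoi = build_vocab(toy_tok)

# Append index 2 ('_eos_') to every question, as toks2idxs does.
idxs = [[stoi[t] for t in q] + [2] for q in toy_tok]

# A word that was never seen ('gravity') maps to '_unk_'.
unseen = [stoi[t] for t in ['what', 'is', 'gravity', '?']] + [2]
decoded = ' '.join(itos[i] for i in unseen)
print(decoded)
```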

In [27]:
en_indcs, en_itos, en_stoi = toks2idxs(en_tok, 'en')

In [28]:
fr_indcs, fr_itos, fr_stoi = toks2idxs(fr_tok, 'fr')

In [12]:
def load_indcs(pre):
    indcs = np.load(TMP_PATH/f'{pre}_indcs.npy')
    itos = pickle.load(open(TMP_PATH/f'{pre}_itos_pkl', 'rb'))
    stoi = collections.defaultdict(lambda: 3, {t:i for i,t in enumerate(itos)})
    return indcs, itos, stoi

In [13]:
en_indcs, en_itos, en_stoi = load_indcs('en')
fr_indcs, fr_itos, fr_stoi = load_indcs('fr')

In [14]:
' '.join([en_itos[i] for i in en_indcs[0]])

'what would we do without it ? _eos_'

In [15]:
' '.join([fr_itos[i] for i in fr_indcs[0]])

'que ferions - nous sans elle ? _eos_'

### Word vectors

In [33]:
# ! pip install git+https://github.com/facebookresearch/fastText.git

In [16]:
import fastText as ft

In [17]:
en_vecs = ft.load_model(str((PATH/'word_vectors'/'wiki.en.bin')))

In [19]:
fr_vecs = ft.load_model(str(PATH/'word_vectors'/'wiki.fr.bin'))

In [20]:
def get_vecs(lang, ft_vecs):
    vec_dict = {w: ft_vecs.get_word_vector(w) for w in ft_vecs.get_words()}
    pickle.dump(vec_dict, open(PATH/f'wiki.{lang}.pkl', 'wb'))
    return vec_dict

In [None]:
en_vec_dict = get_vecs('en', en_vecs)
fr_vec_dict = get_vecs('fr', fr_vecs)

In [21]:
en_vec_dict = pickle.load(open(PATH/'wiki.en.pkl','rb'))
fr_vec_dict = pickle.load(open(PATH/'wiki.fr.pkl','rb'))

In [22]:
ft_words = en_vecs.get_words(include_freq=True)

In [23]:
ft_word_dict = {k:v for k,v in zip(*ft_words)}

In [24]:
ft_words = sorted(ft_word_dict.keys(), key=lambda x: ft_word_dict[x])  # sorted by frequency

In [25]:
ft_words[-20:]

['by',
 'as',
 'for',
 's',
 'on',
 'was',
 'is',
 'a',
 'to',
 '(',
 ')',
 "'",
 'and',
 'in',
 '-',
 'of',
 '</s>',
 'the',
 '.',
 ',']

In [26]:
len(ft_words)

2519370

#### Dimensionality of the word vectors

In [37]:
dim_en_vec = len(en_vec_dict['and'])
dim_fr_vec = len(fr_vec_dict['and'])

In [58]:
dim_en_vec, dim_fr_vec

(300, 300)

#### Mean and standard deviation of the word vectors

In [39]:
# keys are words, values are the vectors of size 300
# note: this rebinds en_vecs from the fastText model to a plain array
en_vecs = np.stack(list(en_vec_dict.values()))

In [41]:
en_vecs.shape

(2519370, 300)

In [42]:
en_vecs.mean(), en_vecs.std()

(0.0075652334, 0.29283327)

### Model data object

`en_indcs` are the numericalized questions.

**Percentile:**
*Given, for example, a sample of shoe sizes, the empirical 0.35-quantile is the shoe size s such that 35% of the shoe sizes in the sample are smaller than s and 65% are larger than s.* [source](https://de.wikipedia.org/wiki/Empirisches_Quantil)
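The following sketch shows why a high percentile is a more useful length cutoff than the maximum. The `percentile` helper is a hypothetical pure-Python stand-in that interpolates linearly between ranks, which, as far as I know, is also the default behaviour of `np.percentile`; the lengths are made up.

```python
import math

def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between ranks."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

# Hypothetical question lengths with a single outlier.
lengths = [5, 6, 7, 8, 9, 10, 11, 12, 13, 40]

cutoff = int(percentile(lengths, 90))
untouched = sum(n <= cutoff for n in lengths) / len(lengths)
print(cutoff, max(lengths), untouched)
```

With a cutoff taken from the percentile, the one very long question no longer dictates the sequence length for everything else, while most questions stay intact.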

In [47]:
en_len_99 = int(np.percentile([len(o) for o in en_indcs], 99))
fr_len_97 = int(np.percentile([len(o) for o in fr_indcs], 97))

In [48]:
en_len_99, fr_len_97

(29, 33)

#### Truncate the questions

In [49]:
en_indcs_trunc = np.array([q[:en_len_99] for q in en_indcs])
fr_indcs_trunc = np.array([q[:fr_len_97] for q in fr_indcs])

A `Dataset` object needs a `__getitem__` and a `__len__` method. The class below is generic: it works for any pair of equal-length arrays, not just for translation data.

In [262]:
class Seq2SeqDataset(Dataset):
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __getitem__(self, idx):
        return A(self.x[idx], self.y[idx])
    def __len__(self):
        return len(self.x)
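To see the two-method protocol in isolation, here is a sketch of the same class without the fastai/PyTorch base class or the `A` helper. The name `PairDataset` and the index lists are assumptions for illustration only.

```python
class PairDataset:
    """Minimal stand-in: anything with __getitem__ and __len__ can be indexed like a dataset."""
    def __init__(self, x, y):
        assert len(x) == len(y), "inputs and targets must align"
        self.x, self.y = x, y

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

    def __len__(self):
        return len(self.x)

# Hypothetical numericalized pairs (French indices, English indices).
fr = [[4, 5, 2], [6, 7, 8, 2]]
en = [[9, 2], [10, 11, 2]]

ds = PairDataset(fr, en)     # French in, English out
first = ds[0]
n = len(ds)

rev = PairDataset(en, fr)    # swapping the arguments reverses the direction
print(first, n, rev[1])
```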

In [56]:
??A()  # returns a np.array if len == 1 else returns a list of np.arrays

In [57]:
np.random.seed(42)

#### Split into train and validation set

In [58]:
trn_keep = np.random.rand(len(en_indcs_trunc)) > 0.1

In [59]:
en_trn, fr_trn = en_indcs_trunc[trn_keep], fr_indcs_trunc[trn_keep]

In [64]:
en_val, fr_val = en_indcs_trunc[~trn_keep], fr_indcs_trunc[~trn_keep]  # ~ inverts the boolean mask

In [66]:
len(en_trn), len(en_val)

(45218, 5039)
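The split above relies on one random boolean mask applied to both languages. A sketch of that idea with the standard library instead of NumPy follows; the integer stand-ins for the sequences and the size `n` are assumptions for illustration.

```python
import random

rng = random.Random(42)
n = 1000  # hypothetical number of question pairs

# True means "keep for training"; roughly 90% of the draws exceed 0.1.
trn_keep = [rng.random() > 0.1 for _ in range(n)]

en = list(range(n))               # stand-ins for the English sequences
fr = [i + n for i in range(n)]    # stand-ins for the French sequences

# The same mask is applied to both sides, so the pairs stay aligned.
en_trn = [e for e, k in zip(en, trn_keep) if k]
fr_trn = [f for f, k in zip(fr, trn_keep) if k]
en_val = [e for e, k in zip(en, trn_keep) if not k]
fr_val = [f for f, k in zip(fr, trn_keep) if not k]

print(len(en_trn), len(en_val))
```

Because each pair is assigned independently, the validation share is only approximately 10%, which matches the slightly uneven counts in the output above.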

#### Create datasets for French-to-English translation. Swap the arguments to create an English-to-French model.

In [263]:
trn_ds = Seq2SeqDataset(fr_trn, en_trn)
val_ds = Seq2SeqDataset(fr_val, en_val)

In [264]:
' '.join([fr_itos[o] for o in fr_trn[0]])

'que ferions - nous sans elle ? _eos_'

In [265]:
' '.join([en_itos[o] for o in en_trn[0]])

'what would we do without it ? _eos_'

Everything still looks as expected :)

In [73]:
bs = 125

Since we want to fully utilize the GPU's capabilities, we train in batches. The length of a minibatch tensor is set by the sequence length of the longest question in that batch. The other questions are padded. To save time and memory, we want to avoid very long and very short questions in one batch because that would mean lots of padding. For the validation set we simply sort the questions by length. For training we use the `SortishSampler`, which groups *longer* questions together and *shorter* questions together while preserving some randomness.

For language models it is better to pad before the start of the sequence, because the final hidden state is used to predict the next token or to classify, and it should not be computed from padding.

For sequence-to-sequence models it is better to pad after the end of the sequence.
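The two padding directions and the cost of mixing lengths can be sketched as follows. The helpers `pad_batch` and `n_pad` and the example lengths are assumptions for this illustration; the padding index 1 is the position of `_pad_` in the vocabulary built earlier.

```python
PAD = 1  # index of '_pad_' in the vocabulary

def pad_batch(seqs, pad_idx=PAD, pre_pad=False):
    """Pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(s) for s in seqs)
    if pre_pad:
        return [[pad_idx] * (max_len - len(s)) + s for s in seqs]
    return [s + [pad_idx] * (max_len - len(s)) for s in seqs]

def n_pad(batches):
    """Count the padding tokens needed across a list of batches."""
    return sum(max(len(s) for s in b) * len(b) - sum(len(s) for s in b)
               for b in batches)

batch = [[4, 5, 6, 2], [7, 2]]
post = pad_batch(batch)                 # padding after '_eos_' (seq2seq)
pre = pad_batch(batch, pre_pad=True)    # padding before the start (language model)

# Hypothetical lengths: two batches of two questions each.
seqs = [[0] * n for n in [3, 20, 4, 21]]
mixed = [seqs[0:2], seqs[2:4]]          # short and long in the same batch
by_len = sorted(seqs, key=len)
grouped = [by_len[0:2], by_len[2:4]]    # similar lengths together

print(post, pre, n_pad(mixed), n_pad(grouped))
```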

In [266]:
trn_sampler = SortishSampler(en_trn, key=lambda x: len(en_trn[x]), bs=bs)
val_sampler = SortSampler(en_val, key=lambda x:len(en_val[x]))
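The idea behind the two samplers can be sketched in a few lines. This is a simplified illustration and not fastai's implementation: the function names, the `chunk_mult` parameter, the descending sort order and the random lengths are all assumptions of this sketch.

```python
import random

def sort_sampler(lengths):
    """Return all indices, ordered from longest to shortest sequence."""
    return sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)

def sortish_sampler(lengths, bs, chunk_mult=4, seed=0):
    """Shuffle, sort within large chunks, cut into batches, shuffle the batches."""
    rng = random.Random(seed)
    idxs = list(range(len(lengths)))
    rng.shuffle(idxs)
    chunk = bs * chunk_mult
    batches = []
    for start in range(0, len(idxs), chunk):
        part = sorted(idxs[start:start + chunk],
                      key=lambda i: lengths[i], reverse=True)
        batches += [part[j:j + bs] for j in range(0, len(part), bs)]
    rng.shuffle(batches)
    return [i for b in batches for i in b]

# Hypothetical sequence lengths.
len_rng = random.Random(1)
lengths = [len_rng.randint(3, 30) for _ in range(40)]

val_order = sort_sampler(lengths)
trn_order = sortish_sampler(lengths, bs=4)
print(trn_order[:4], [lengths[i] for i in trn_order[:4]])
```

Each batch then contains questions of similar length, while the order of the batches still changes from epoch to epoch if the seed is varied.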

Both samplers simply return indices:

In [267]:
i = next(iter(trn_sampler))

In [176]:
' '.join([en_itos[o] for o in en_trn[i]])

'what , in your view , would be the impact of such minimum requirements on job creation as well as on the protection of workers ? _eos_'

In [177]:
' '.join([fr_itos[o] for o in fr_trn[i]])

"quelle serait , selon vous , l' incidence de ces obligations minimales sur la création d' emplois et la protection des travailleurs ? _eos_"

In [268]:
trn_dl = DataLoader(trn_ds, bs, transpose=True, transpose_y=True, num_workers=1, pad_idx=1, pre_pad=False, sampler=trn_sampler)


In [269]:
val_dl = DataLoader(val_ds, int(1.5*bs), transpose=True, transpose_y=True, num_workers=1, pad_idx=1, pre_pad=False, sampler=val_sampler)

Reminder: the `ModelData` object combines the training and validation data loaders and a path for storing temporary files. Once you have a `ModelData` object you can create a learner and then call `fit`.

In [270]:
modeldata = ModelData(PATH, trn_dl, val_dl)

**Let's look at a few example batches:**

In [275]:
it = iter(trn_dl)
its = [next(it) for i in range(3)]

In [277]:
[(len(x), len(y)) for x,y in its]

[(29, 13), (33, 25), (25, 13)]

In [317]:
for x, y in its:
    print(' '.join([fr_itos[o] for o in x[:,0]]))
    print(' '.join([en_itos[o] for o in y[:,0]]))
    print()

quelle province ou quel territoire est le plus sûr ? _eos_ _pad_ _pad_ _pad_ _pad_ _pad_ _pad_ _pad_ _pad_ _pad_ _pad_ _pad_ _pad_ _pad_ _pad_ _pad_ _pad_ _pad_ _pad_
which province or territory is the safest in terms of murder ? _eos_

lorsqu' ils sont mutés à l' extérieur du groupe ca , les ca devraient ils s' attendre à être mutés à un poste ex moins 1 ou ex moins 2 ? _eos_ _pad_
when deploying from the ca group , should cas expect to be deployed to an ex minus 1 or ex minus 2 position ? _eos_

qui est responsable de la collecte , de l' analyse et de la diffusion des données épidémiologiques ? _eos_ _pad_ _pad_ _pad_ _pad_ _pad_ _pad_
who is responsible for collecting , analyzing and disseminating epidemiological data ? _eos_
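Because the data loaders are created with `transpose=True`, the first axis of a batch is the position in the sequence and the second axis is the question, which is why `len(x)` above is a sequence length and `x[:,0]` is the first question. A sketch with nested lists (the index values are made up) shows the layout:

```python
PAD = 1  # index of '_pad_'

# Hypothetical padded batch: one row per question.
padded = [
    [4, 5, 6, 2],    # question 0
    [7, 2, PAD, PAD],  # question 1, padded after '_eos_' (index 2)
    [8, 9, 2, PAD],  # question 2
]                    # shape: (batch size 3, sequence length 4)

# Transpose to (sequence length, batch size): one row per time step.
time_major = [list(step) for step in zip(*padded)]

seq_len, batch_size = len(time_major), len(time_major[0])

# Column 0 is the first question, the counterpart of x[:,0].
first_question = [step[0] for step in time_major]
print(seq_len, batch_size, first_question)
```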

