# Neural translation - seq2seq
## French questions to English questions

**Difficulties:**

1. The output has arbitrary length, which need not match the length of the input
1. The order of tokens in the output is not the same as in the input

In [1]:
%matplotlib inline
%reload_ext autoreload
%autoreload 2

In [2]:
from fastai.text import *

## Dataset
http://www.statmt.org/wmt15/translation-task.html

The corpus was obtained by crawling millions of web pages and pairing them with simple heuristics, such as replacing *en* with *fr* to find the French counterpart of an English page.

In [3]:
PATH = Path('data/translate/')
TMP_PATH = PATH/'tmp'
TMP_PATH.mkdir(exist_ok=True)

In [4]:
filename = 'giga-fren.release2.fixed'

In [5]:
en_fn = PATH/f'{filename}.en'
fr_fn = PATH/f'{filename}.fr'

Training a full translation model takes a long time. In this example we therefore restrict ourselves to questions that start with *Wh* (*What*, *Where*, *Who*, ...) and end with *?*.

The regular expressions are compiled once and reused for every line, which is faster than passing the pattern string each time.

In [6]:
# Raw strings keep '\?' from being read as an (invalid) string escape
re_enquest = re.compile(r'^(Wh[^?.!]+\?)')
re_frquest = re.compile(r'^([^?.!]+\?)')
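
To see what these two patterns accept, here is a small standalone check. The sample sentences are made up for illustration and are not taken from the corpus. The French pattern accepts any first sentence that ends in `?`, so the *Wh* restriction comes only from the English side.

```python
import re

# Same patterns as above, written as raw strings.
re_enquest = re.compile(r'^(Wh[^?.!]+\?)')
re_frquest = re.compile(r'^([^?.!]+\?)')

# Hypothetical sample lines, for illustration only.
samples = [
    ('What is light ?', 'Qu’est-ce que la lumière?'),
    ('Light is a wave. What else?', 'La lumière est une onde. Quoi d’autre?'),
    ('Is it far?', 'Est-ce loin?'),
]

for en, fr in samples:
    e, f = re_enquest.search(en), re_frquest.search(fr)
    # Print the matched text, or None when the line is rejected.
    print(e.group() if e else None, '|', f.group() if f else None)
```

Only the first pair matches on both sides. The second is rejected because the question is not the first sentence, and the third because the English side does not start with *Wh*.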

In [7]:
lines = ((re_enquest.search(enquest), re_frquest.search(frquest)) for enquest, frquest in zip(open(en_fn, encoding='utf-8'), open(fr_fn, encoding='utf-8')))

In [8]:
for i, l in enumerate(lines):
    print(l)
    if i == 20:
        break

(None, None)
(None, None)
(None, None)
(None, None)
(None, None)
(<_sre.SRE_Match object; span=(0, 15), match='What is light ?'>, <_sre.SRE_Match object; span=(0, 25), match='Qu’est-ce que la lumière?'>)
(None, None)
(None, None)
(None, None)
(None, None)
(None, None)
(None, None)
(None, <_sre.SRE_Match object; span=(0, 72), match="Astronomes Introduction Vidéo d'introduction Qu'e>)
(None, None)
(None, None)
(None, None)
(None, None)
(None, None)
(None, None)
(<_sre.SRE_Match object; span=(0, 11), match='Who are we?'>, <_sre.SRE_Match object; span=(0, 15), match='Où sommes-nous?'>)
(<_sre.SRE_Match object; span=(0, 23), match='Where did we come from?'>, <_sre.SRE_Match object; span=(0, 17), match="D'où venons-nous?">)


In [9]:
questions = [(e.group(), f.group()) for e, f in lines if e and f]

In [10]:
pickle.dump(questions, (PATH/'fr-en-questions.pkl').open('wb'))

In [11]:
questions = pickle.load((PATH/'fr-en-questions.pkl').open('rb'))

In [12]:
questions[:3]

[('What would we do without it?', 'Que ferions-nous sans elle ?'),
 ('What is the absolute location (latitude and longitude) of Badger, Newfoundland and Labrador?',
  'Quelle sont les coordonnées (latitude et longitude) de Badger, à Terre-Neuve-etLabrador?'),
 ('What is the major aboriginal group on Vancouver Island?',
  'Quel est le groupe autochtone principal sur l’île de Vancouver?')]

In [13]:
len(questions)

52328

In [14]:
en_questions, fr_questions = zip(*questions)
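
`zip(*pairs)` turns a list of pairs into two parallel tuples. A minimal sketch on made-up pairs (the names `pairs`, `en` and `fr` are illustrative only):

```python
# Hypothetical question pairs, for illustration only.
pairs = [
    ('What is it?', "Qu'est-ce que c'est?"),
    ('Who are we?', 'Qui sommes-nous?'),
]

# Unpacking the list into zip() regroups the first and second elements.
en, fr = zip(*pairs)

print(en)
print(fr)
```

Position `i` in `en` still corresponds to position `i` in `fr`, which every later step relies on.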

### Tokenization

`Tokenizer` is a fastai wrapper around *spaCy*; `proc_all_mp` spreads the work across multiple CPU cores for speed.

In [17]:
en_tok = Tokenizer.proc_all_mp(partition_by_cores(en_questions))

In [18]:
fr_tok = Tokenizer.proc_all_mp(partition_by_cores(fr_questions), 'fr')

In [19]:
en_tok[0]

['what', 'would', 'we', 'do', 'without', 'it', '?']

In [20]:
fr_tok[0]

['que', 'ferions', '-', 'nous', 'sans', 'elle', '?']

#### Average length of the questions

In [21]:
np.mean([len(q) for q in en_tok])

13.345895123069868

In [23]:
np.mean([len(q) for q in fr_tok])

16.26809738572084

#### Discard questions that are too long

In [24]:
keep = np.array([len(q) < 30 for q in en_tok])

In [27]:
en_tok = np.array(en_tok)[keep]
fr_tok = np.array(fr_tok)[keep]
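
The mask is computed from the English lengths only and then applied to both languages, so the pairs stay aligned, but a French question may still exceed 30 tokens. The sketch below shows the same idea on toy data. It filters with plain lists instead of `np.array(...)[keep]`, because recent NumPy versions refuse to build an array from lists of unequal length unless `dtype=object` is given. All names and data here are hypothetical.

```python
import numpy as np

# Toy tokenized pairs (hypothetical data).
en_toy = [['what', 'is', 'it', '?'],
          ['where'] * 35 + ['?'],
          ['who', 'are', 'we', '?']]
fr_toy = [["qu'", 'est', '-', 'ce', '?'],
          ['où'] * 40 + ['?'],
          ['qui', 'sommes', '-', 'nous', '?']]

max_len = 30
# One boolean per pair, based on the English side only.
keep = np.array([len(q) < max_len for q in en_toy])

# Apply the same mask to both languages so the pairs stay aligned.
en_kept = [q for q, k in zip(en_toy, keep) if k]
fr_kept = [q for q, k in zip(fr_toy, keep) if k]

print(f'kept {keep.sum()} of {len(keep)} pairs')
```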

In [30]:
pickle.dump(en_tok, (PATH/'en_tok.pkl').open('wb'))
pickle.dump(fr_tok, (PATH/'fr_tok.pkl').open('wb'))

In [31]:
en_tok = pickle.load((PATH/'en_tok.pkl').open('rb'))
fr_tok = pickle.load((PATH/'fr_tok.pkl').open('rb'))

### Numericalization

In [50]:
def toks2idxs(tok, pre):
    freq = Counter(t for q in tok for t in q)
    itos = [s for s, c in freq.most_common(40000)]
    itos.insert(0, '_bos_')  # beginning of sequence token
    itos.insert(1, '_pad_')  # padding token
    itos.insert(2, '_eos_')  # end of sequence token
    itos.insert(3, '_unk_')  # unknown token
    stoi = collections.defaultdict(lambda: 3, {t:i for i,t in enumerate(itos)})  # if string not found, set to '_unk_'
    indcs = np.array([([stoi[t] for t in q] + [2]) for q in tok])
    np.save(TMP_PATH/f'{pre}_indcs.npy', indcs)
    pickle.dump(itos, open(TMP_PATH/f'{pre}_itos_pkl', 'wb'))
    return indcs, itos, stoi
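
The core of `toks2idxs` can be tried on a toy corpus without touching the disk. This is a sketch under simplifying assumptions: the vocabulary is capped at 3 tokens instead of 40000 so that some tokens fall out of vocabulary, and the names `toy_tok`, `ids` and `decoded` are made up for the example.

```python
from collections import Counter, defaultdict

# Toy tokenized corpus (hypothetical data).
toy_tok = [['what', 'is', 'it', '?'], ['what', 'is', 'light', '?']]

freq = Counter(t for q in toy_tok for t in q)

# Special tokens occupy indices 0-3, followed by the most frequent tokens.
itos = ['_bos_', '_pad_', '_eos_', '_unk_'] + [s for s, c in freq.most_common(3)]

# Tokens missing from the vocabulary map to index 3 ('_unk_').
stoi = defaultdict(lambda: 3, {t: i for i, t in enumerate(itos)})

# Convert each question to indices and append index 2 ('_eos_').
ids = [[stoi[t] for t in q] + [2] for q in toy_tok]

# Map the indices back to strings to inspect the result.
decoded = [' '.join(itos[i] for i in q) for q in ids]
print(decoded)
```

Both *it* and *light* occur only once, fall outside the 3-token vocabulary, and decode to `_unk_`. Note that looking up a missing key in a `defaultdict` also stores it, so `stoi` grows as unknown tokens are queried.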

In [51]:
en_indcs, en_itos, en_stoi = toks2idxs(en_tok, 'en')

In [52]:
fr_indcs, fr_itos, fr_stoi = toks2idxs(fr_tok, 'fr')

In [53]:
def load_indcs(pre):
    indcs = np.load(TMP_PATH/f'{pre}_indcs.npy')
    itos = pickle.load(open(TMP_PATH/f'{pre}_itos_pkl', 'rb'))
    stoi = collections.defaultdict(lambda: 3, {t:i for i,t in enumerate(itos)})
    return indcs, itos, stoi

In [54]:
en_indcs, en_itos, en_stoi = load_indcs('en')
fr_indcs, fr_itos, fr_stoi = load_indcs('fr')

In [55]:
' '.join([en_itos[i] for i in en_indcs[0]])

'what would we do without it ? _eos_'

In [56]:
' '.join([fr_itos[i] for i in fr_indcs[0]])

'que ferions - nous sans elle ? _eos_'