# Machine Translation using Transformers

[Video](https://www.youtube.com/watch?v=KzfyftiH7R8&list=PLtmWHNX-gukKocXQOkQjuVxglSDYWsSh9&index=18)

In [1]:
from fastai.text import *

In [15]:
path = Path("/ml/data/fastai/translation")  # Config().data_path()/'giga-fren'

## Data

To download the data and build the question-pair CSV used below, run the following code once:

```python
! wget https://s3.amazonaws.com/fast-ai-nlp/giga-fren.tgz -P {path}
! tar xf {path}/giga-fren.tgz -C {path} 

#with open(path/'giga-fren/giga-fren.release2.fixed.fr') as f:
#    fr = f.read().split('\n')

#with open(path/'giga-fren/giga-fren.release2.fixed.en') as f:
#    en = f.read().split('\n')

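# Raw strings avoid invalid-escape warnings for `\?`
re_eq = re.compile(r'^(Wh[^?.!]+\?)')  # English: line opens with a "Wh..." question
re_fq = re.compile(r'^([^?.!]+\?)')    # French: line opens with any question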
en_fname = path/'giga-fren/giga-fren.release2.fixed.en'
fr_fname = path/'giga-fren/giga-fren.release2.fixed.fr'

lines = ((re_eq.search(eq), re_fq.search(fq)) 
         for eq, fq in zip(open(en_fname, encoding='utf-8'), open(fr_fname, encoding='utf-8')))
qs = [(e.group(), f.group()) for e,f in lines if e and f]

df = pd.DataFrame({'en': [e for e,_ in qs], 'fr': [f for _,f in qs]}, columns=['en', 'fr'])
df.to_csv(path/'questions_easy.csv', index=False)

del lines, qs, df # free RAM or restart the nb (`en` and `fr` only exist if the commented-out reads above are used)
```
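To make the filtering step concrete, here is a minimal sketch of what the two regular expressions keep. The sample sentence pairs are hypothetical (they are not taken from giga-fren); only the patterns come from the code above.

```python
import re

# Same patterns as in the preprocessing code, written as raw strings.
re_eq = re.compile(r'^(Wh[^?.!]+\?)')  # English: starts with "Wh", first sentence ends in "?"
re_fq = re.compile(r'^([^?.!]+\?)')    # French: first sentence ends in "?"

# Hypothetical aligned lines, for illustration only.
pairs = [
    ("What is the cause? It is unknown.", "Quelle est la cause ? Elle est inconnue."),
    ("The cause is unknown.", "La cause est inconnue."),
    ("Why now? Nobody knows.", "Pourquoi maintenant ? Personne ne sait."),
]

matches = ((re_eq.search(en), re_fq.search(fr)) for en, fr in pairs)
# Keep a pair only when both sides open with a question; `group()` returns
# just the matched question, so any trailing text is dropped.
qs = [(e.group(), f.group()) for e, f in matches if e and f]

for en_q, fr_q in qs:
    print(en_q, "|", fr_q)
```

Only the first and third pairs survive, and each is trimmed to its leading question. This is why the resulting dataset is small and regular compared with the full corpus.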

In [108]:
def seq2seq_collate(samples:BatchSamples, pad_idx:int=1, pad_first:bool=True, backwards:bool=False):
    "Function that collects samples and adds padding. Reverses token order if needed"
    samples = to_data(samples)
    max_len_x, max_len_y = max([len(s[0]) for s in samples]), max([len(s[1]) for s in samples])
    res_x = torch.zeros(len(samples), max_len_x).long() + pad_idx
    res_y = torch.zeros(len(samples), max_len_y).long() + pad_idx
    if backwards: pad_first = not pad_first
    
    for i, s in enumerate(samples):
        if pad_first:
            res_x[i, -len(s[0]):], res_y[i, -len(s[1]):] = LongTensor(s[0]), LongTensor(s[1])
        else:
            res_x[i, :len(s[0])], res_y[i, :len(s[1])] = LongTensor(s[0]), LongTensor(s[1])
    if backwards:
        res_x, res_y = res_x.flip(1), res_y.flip(1)
    return res_x, res_y
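
The padding logic of `seq2seq_collate` can be illustrated without fastai or PyTorch. The sketch below is a simplified stand-in that uses plain Python lists; `pad_batch` is a hypothetical helper, not part of fastai, and it omits the `backwards` option.

```python
def pad_batch(samples, pad_idx=1, pad_first=True):
    "Pad (x, y) pairs of token-id lists to the longest x and the longest y in the batch."
    max_x = max(len(x) for x, _ in samples)
    max_y = max(len(y) for _, y in samples)

    def pad(seq, max_len):
        padding = [pad_idx] * (max_len - len(seq))
        # pad_first=True puts the padding before the tokens, otherwise after them
        return padding + seq if pad_first else seq + padding

    xs = [pad(x, max_x) for x, _ in samples]
    ys = [pad(y, max_y) for _, y in samples]
    return xs, ys

# Two hypothetical (source, target) pairs of token ids.
samples = [([2, 5, 11, 9], [2, 7, 9]),
           ([2, 5, 9],     [2, 7, 8, 6, 9])]

xs_last, ys_last = pad_batch(samples, pad_first=False)
xs_first, ys_first = pad_batch(samples, pad_first=True)
print(xs_last, ys_last)
print(xs_first, ys_first)
```

Note that inputs and targets are padded independently, so the two tensors in a batch can have different widths. `Seq2SeqDataBunch.create` passes `pad_first=False`, which corresponds to the trailing-padding case.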
        

In [109]:
class Seq2SeqDataBunch(TextDataBunch):
    @classmethod
    def create(cls, train_ds, valid_ds, test_ds=None, path:PathOrStr='.', bs:int=32, val_bs:int=None, pad_idx=1,
               pad_first=False, device:torch.device=None, no_check:bool=False, backwards:bool=False, **dl_kwargs) -> DataBunch:
        datasets = cls._init_ds(train_ds, valid_ds, test_ds)
        val_bs = ifnone(val_bs, bs)
        collate_fn = partial(seq2seq_collate, pad_idx=pad_idx, pad_first=pad_first, backwards=backwards)
        train_sampler = SortishSampler(datasets[0].x, key=lambda t: len(datasets[0][t][0].data), bs=bs//2)
        
        # bugfix: torch's DataLoader does not accept `dl_tfms`, so drop it if present
        new_dl_kwargs = dl_kwargs.copy()
        new_dl_kwargs.pop('dl_tfms', None)
        
        train_dl = DataLoader(datasets[0], batch_size=bs, sampler=train_sampler, drop_last=True, **new_dl_kwargs)
        dataloaders = [train_dl]
        for ds in datasets[1:]:
            lengths = [len(t) for t in ds.x.items]
            sampler = SortSampler(ds.x, key=lengths.__getitem__)
            dataloaders.append(DataLoader(ds, batch_size=val_bs, sampler=sampler, **new_dl_kwargs))
        return cls(*dataloaders, path=path, device=device, collate_fn=collate_fn, no_check=no_check)
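
The samplers above group items of similar length into the same batch. A rough sketch of why that matters: the hypothetical helper `padding_needed` below counts how many pad tokens are required when items are batched in a given order. The lengths are made up, and this is not how fastai implements its samplers, only an illustration of the effect.

```python
def padding_needed(lengths, order, bs):
    "Count pad tokens when items are batched in `order` and padded to each batch's max length."
    total = 0
    for i in range(0, len(order), bs):
        batch = [lengths[j] for j in order[i:i + bs]]
        total += sum(max(batch) - n for n in batch)
    return total

# Hypothetical sentence lengths, alternating short and long.
lengths = [5, 30, 7, 28, 6, 29, 8, 27]

original_order = list(range(len(lengths)))
# Same key style as the validation sampler above: index -> length.
sorted_order = sorted(original_order, key=lengths.__getitem__, reverse=True)

print(padding_needed(lengths, original_order, bs=2))
print(padding_needed(lengths, sorted_order, bs=2))
```

With these lengths, batching in the original order wastes far more positions on padding than batching by sorted length, which is the motivation for sorting (validation) or roughly sorting (training) before batching.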

In [110]:
class Seq2SeqTextList(TextList):
    _bunch = Seq2SeqDataBunch
    _label_cls = TextList

In [111]:
df = pd.read_csv(path/'questions_easy.csv')

In [112]:
src = Seq2SeqTextList.from_df(df, path=path, cols="fr").split_by_rand_pct().label_from_df(cols='en', label_cls=TextList)

In [113]:
src

LabelLists;

Train: LabelList (41865 items)
x: Seq2SeqTextList
xxbos xxmaj qu’est - ce que la lumière ?,xxbos xxmaj d'où venons - nous ?,xxbos xxmaj que xxunk - nous sans elle ?,xxbos xxmaj quelle sont les coordonnées ( latitude et longitude ) de xxmaj xxunk , à xxmaj terre - xxmaj neuve - xxunk ?,xxbos xxmaj quel est le groupe autochtone principal sur l’île de xxmaj vancouver ?
y: TextList
xxbos xxmaj what is light ?,xxbos xxmaj where did we come from ?,xxbos xxmaj what would we do without it ?,xxbos xxmaj what is the absolute location ( latitude and longitude ) of xxmaj badger , xxmaj newfoundland and xxmaj xxunk ?,xxbos xxmaj what is the major aboriginal group on xxmaj vancouver xxmaj island ?
Path: /ml/data/fastai/translation;

Valid: LabelList (10466 items)
x: Seq2SeqTextList
xxbos à combien xxunk le salaire initial ?,xxbos xxmaj quels types d'aide financière ou de services sont offerts ?,xxbos xxmaj pourquoi un camion en infraction est - il immobilisé dans un xxmaj etat membre et

In [114]:
np.percentile([len(o) for o in src.train.x.items] + [len(o) for o in src.valid.x.items], 90)

29.0

In [115]:
np.percentile([len(o) for o in src.train.y.items] + [len(o) for o in src.valid.y.items], 90)

26.0

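**Since 90% of the French sentences have at most 29 tokens and 90% of the English ones at most 26, we remove any pair in which either sentence is longer than 30 tokens:**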

In [116]:
src = src.filter_by_func(lambda x,y: len(x) > 30 or len(y) > 30)

In [117]:
len(src.train), len(src.valid)

(37883, 9504)
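
The predicate passed to `filter_by_func` describes the pairs to discard, not the pairs to keep. The drop from 41865 to 37883 training items is consistent with that reading. A minimal sketch of the same rule on plain lists, where `too_long` and the sample pairs are hypothetical:

```python
MAX_LEN = 30

def too_long(x, y, max_len=MAX_LEN):
    "True when either side of the pair exceeds `max_len` tokens."
    return len(x) > max_len or len(y) > max_len

# Hypothetical (source, target) token-id lists of various lengths.
pairs = [
    (list(range(12)), list(range(10))),  # both short: kept
    (list(range(31)), list(range(25))),  # source too long: dropped
    (list(range(30)), list(range(30))),  # exactly 30 tokens: kept
    (list(range(8)),  list(range(40))),  # target too long: dropped
]

kept = [(x, y) for x, y in pairs if not too_long(x, y)]
print([(len(x), len(y)) for x, y in kept])
```

Because the comparison is strict (`> 30`), pairs with exactly 30 tokens are kept. This matches the batch shape shown below, where the longest sequences have length 30.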

In [118]:
data = src.databunch()

In [121]:
x, y = next(iter(data.train_dl))

In [122]:
x.shape, y.shape

(torch.Size([64, 30]), torch.Size([64, 30]))

In [129]:
from IPython.display import display, HTML

def visualize_tensor(t):
    display(HTML(pd.DataFrame(np.array(t)).to_html(index=False, header=None)))

In [132]:
visualize_tensor(x)

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29
2,5,51,57,2677,39,311,1227,10,45,112,1497,18,19,145,11,57,52,51,57,145,5693,1799,754,10,164,30,155,486,9
2,5,385,44,968,115,72,5,33,178,10,140,1087,11,31,31,1204,363,10,46,1134,16,386,10,380,15,10,1931,1388,9
2,5,26,79,163,6326,11,49,37,119,16,154,15,10,13,741,21,1091,14,167,804,17,1734,427,30,12,2603,10,88,9
2,5,1705,13,5397,10,0,86,0,294,12,539,5181,10,791,7989,10282,18,19,133,11,43,1111,97,39,226,10,0,1336,9
2,5,10,36,478,14,5,237,10,568,15,781,247,148,11,49,942,81,830,16,843,25,234,10,2066,16,329,45,2181,9
2,5,27,20,12,629,25,2895,75,13,516,10,13,68,1380,18,13,1542,10,13,68,1380,15,13,1542,1978,10,13,68,9
2,5,27,1575,24,744,3501,4345,21,13,5,1477,16,1814,3584,91,11,31,4944,12,45,15,86,38,12,316,743,15,32,9
2,5,0,11,43,216,35,14,649,15,69,3508,11,43,3509,1007,17,54,83,21,12,562,0,15,81,402,10,13,273,9
2,5,27,449,865,3337,11,31,147,221,10,681,13,151,25,267,42,12,5,344,174,24,3193,10,80,35,12,1996,1053,9
2,5,33,152,748,16,502,35,46,611,7087,38,10,891,17,3888,16,434,208,56,789,3166,18,789,4747,18,1780,789,47,9


In [131]:
visualize_tensor(y)

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29
2,5,11,44,39,3826,4815,1008,14,76,13,2681,13,77,65,687,12,158,16,77,1978,9,1,1,1,1,1,1,1,1
2,5,11,284,12,388,45,24,76,13,726,33,2766,5,564,14,5,1070,428,9,1,1,1,1,1,1,1,1,1,1
2,5,11,65,94,14,347,136,17,340,13,791,10,490,210,13,1493,81,16,86,796,9,1,1,1,1,1,1,1,1
2,5,11,15,53,864,7219,348,18,353,10,5003,12,10,0,56,0,16,7466,2777,513,554,9,1,1,1,1,1,1,1
2,5,11,136,17,129,16,10,218,10,0,14,10,6,268,424,10,331,16,212,13,893,81,223,330,9,1,1,1,1
2,5,11,17,10,4517,3602,78,1161,58,868,18,1161,58,1159,14,801,58,1159,9,1,1,1,1,1,1,1,1,1,1
2,5,11,6895,48,12,10,5,3437,265,5,1308,719,63,24,131,73,14,56,25,812,607,14,22,9,1,1,1,1,1
2,5,11,45,23,389,28,10,833,14,75,45,195,13,170,105,23,3048,13,107,126,1921,14,434,126,273,9,1,1,1
2,5,38,444,442,36,24,109,16,212,13,1665,10,369,12,10,5,622,5,1609,5,1039,54,10,5,258,5,312,9,1
2,5,11,239,45,10,590,31,28,32,104,12,2547,25,1030,16,10,0,477,553,575,18,4887,25,837,9,1,1,1,1


In [135]:
data.show_batch()

| text | target |
|---|---|
| xxbos xxmaj quels autres facteurs les gouvernements doivent - ils prendre en compte dans leurs décisions concernant l'utilisation à des fins de prévention xxunk xxunk avec les xxunk publics ? | xxbos xxmaj what else do governments need to consider in making a decision about whether or not to provide publicly funded antivirals for prevention ? |
| xxbos xxmaj quel sera le sort réservé aux suppléments en vertu du projet de loi xxup c-51 , et le dosage fera - t - il l'objet d'une réglementation ? | xxbos xxmaj what is the future under xxmaj bill xxup c-51 for supplements and will there be regulated doses ? |
| xxbos xxmaj quels arguments pourraient - ils avancer afin de promouvoir l ’ « étude de l’avenir » en plus de « l’étude du passé » ( l’histoire ) ? | xxbos xxmaj what arguments could they make for " future studies " in addition to " past studies " ( history ) ? |
| xxbos • xxmaj quel est le pourcentage des participants qui ont continué à travailler pour xxunk de leur projet ou à faire du travail bénévole pour celui - ci ? | xxbos xxmaj what are the xxunk rates among participants by region and type of work placement / host ? |
| xxbos xxmaj quelles incidences ont les articles 60 , xxunk , xxunk et xxunk de la xxmaj loi sur les langues officielles sur la demande de renseignements du demandeur ? | xxbos xxmaj what impact do sections 60 , 72 , xxunk and xxunk of the xxmaj official xxmaj languages xxmaj act have upon the applicant ’s request for information ? |


## Transformer model