# Seq2Seq Translation - LINUX
*LINUX - the previous notebooks have all been run on a Windows laptop with an RTX 2070 Max-Q graphics card. For this notebook and the next, we will be using a Linux machine hosted on Paperspace. The main reason for the switch is that fastText does not provide supported binaries for Windows.*

In this notebook we will be tackling the task of translation. We will be translating French to English, specifically translating questions.

This task is an example of Sequence to Sequence (seq2seq). Seq2Seq can be more challenging than classification, since the output has a variable length that generally differs from the length of the input.

In [1]:
from fastai.text import *

## Download and preprocess our data
We will begin by reducing the original dataset to questions. You only need to execute this once.

In [2]:
path = Config().data_path()

In [None]:
# downloading the data
# !wget https://s3.amazonaws.com/fast-ai-nlp/giga-fren.tgz -P {path}

In [7]:
path.ls()

[WindowsPath('C:/Users/dmber/.fastai/data/giga-fren.tgz'),
 WindowsPath('C:/Users/dmber/.fastai/data/human_numbers'),
 WindowsPath('C:/Users/dmber/.fastai/data/human_numbers.tgz'),
 WindowsPath('C:/Users/dmber/.fastai/data/imdb.tgz'),
 WindowsPath('C:/Users/dmber/.fastai/data/imdb_sample.tgz')]

In [9]:
# !tar xf {path}/giga-fren.tgz -C {path}

In [11]:
path.ls()

[WindowsPath('C:/Users/dmber/.fastai/data/giga-fren'),
 WindowsPath('C:/Users/dmber/.fastai/data/giga-fren.tgz'),
 WindowsPath('C:/Users/dmber/.fastai/data/human_numbers'),
 WindowsPath('C:/Users/dmber/.fastai/data/human_numbers.tgz'),
 WindowsPath('C:/Users/dmber/.fastai/data/imdb.tgz'),
 WindowsPath('C:/Users/dmber/.fastai/data/imdb_sample.tgz')]

In [13]:
path = Config().data_path()/'giga-fren'
path.ls()

[WindowsPath('C:/Users/dmber/.fastai/data/giga-fren/giga-fren.release2.fixed.en'),
 WindowsPath('C:/Users/dmber/.fastai/data/giga-fren/giga-fren.release2.fixed.fr')]

In [15]:
# re_eq = re.compile('^(Wh[^?.!]+\?)')
# re_fq = re.compile('^([^?.!]+\?)')
# en_fname = path/'giga-fren.release2.fixed.en'
# fr_fname = path/'giga-fren.release2.fixed.fr'

In [16]:
# lines = ((re_eq.search(eq), re_fq.search(fq)) 
#         for eq, fq in zip(open(en_fname, encoding='utf-8'), open(fr_fname, encoding='utf-8')))
# qs = [(e.group(), f.group()) for e,f in lines if e and f]

In [17]:
# qs = [(q1,q2) for q1,q2 in qs]
# df = pd.DataFrame({'fr': [q[1] for q in qs], 'en': [q[0] for q in qs]}, columns = ['en', 'fr'])
# df.to_csv(path/'questions_easy.csv', index=False)
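
To see what the two regular expressions in the commented-out cells keep, here is a small standalone sketch. Only the two patterns come from the notebook; the helper name `extract_question_pair` and the sample sentences are made up for illustration.

```python
import re

# Patterns from the notebook: an English line must start with "Wh..." and
# reach a "?" before any other sentence-ending punctuation; a French line
# only needs to reach a "?" before any "." or "!".
re_eq = re.compile(r'^(Wh[^?.!]+\?)')
re_fq = re.compile(r'^([^?.!]+\?)')

def extract_question_pair(en_line, fr_line):
    """Return (english_question, french_question), or None if either side fails to match."""
    e, f = re_eq.search(en_line), re_fq.search(fr_line)
    return (e.group(), f.group()) if e and f else None

# Hypothetical sample lines (not taken from the dataset files).
pair = extract_question_pair("What is light ? It is energy.",
                             "Qu’est-ce que la lumière? C'est de l'énergie.")
not_wh = extract_question_pair("How are you?", "Comment allez-vous?")
not_first = extract_question_pair("Where is it. Really?", "Où est-ce?")
```

Only the leading question of each line is kept, and a pair is dropped unless both the English and the French line match. This is why the resulting file is called `questions_easy.csv`: it contains only short "Wh" questions.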

In [18]:
path.ls()

[WindowsPath('C:/Users/dmber/.fastai/data/giga-fren/giga-fren.release2.fixed.en'),
 WindowsPath('C:/Users/dmber/.fastai/data/giga-fren/giga-fren.release2.fixed.fr'),
 WindowsPath('C:/Users/dmber/.fastai/data/giga-fren/questions_easy.csv')]

## Load our data into a DataBunch
What do our questions look like?

In [19]:
df = pd.read_csv(path/'questions_easy.csv')
df.head()

| | en | fr |
|---|---|---|
| 0 | What is light ? | Qu’est-ce que la lumière? |
| 1 | Who are we? | Où sommes-nous? |
| 2 | Where did we come from? | D'où venons-nous? |
| 3 | What would we do without it? | Que ferions-nous sans elle ? |
| 4 | What is the absolute location (latitude and lo... | Quelle sont les coordonnées (latitude et longi... |


In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52331 entries, 0 to 52330
Data columns (total 2 columns):
en    52331 non-null object
fr    52331 non-null object
dtypes: object(2)
memory usage: 817.8+ KB


In [21]:
# Lowercasing everything to make it simple
df['en'] = df['en'].apply(lambda x:x.lower())
df['fr'] = df['fr'].apply(lambda x:x.lower())

df.head()

| | en | fr |
|---|---|---|
| 0 | what is light ? | qu’est-ce que la lumière? |
| 1 | who are we? | où sommes-nous? |
| 2 | where did we come from? | d'où venons-nous? |
| 3 | what would we do without it? | que ferions-nous sans elle ? |
| 4 | what is the absolute location (latitude and lo... | quelle sont les coordonnées (latitude et longi... |


### Collate
Since our inputs and targets have different lengths, we need a collate function that pads the inputs and targets in each batch with the padding index (`pad_idx`, 1 by default).

After padding, all input sequences in a batch share one length and all target sequences share another.

In [23]:
def seq2seq_collate(samples, pad_idx=1, pad_first=True, backwards=False):
    "Function that collects samples and adds padding. Flips token order if needed"
    samples = to_data(samples)
    max_len_x,max_len_y = max([len(s[0]) for s in samples]),max([len(s[1]) for s in samples])
    res_x = torch.zeros(len(samples), max_len_x).long() + pad_idx
    res_y = torch.zeros(len(samples), max_len_y).long() + pad_idx
    if backwards: pad_first = not pad_first
    for i,s in enumerate(samples):
        if pad_first: 
            res_x[i,-len(s[0]):],res_y[i,-len(s[1]):] = LongTensor(s[0]),LongTensor(s[1])
        else:         
            res_x[i,:len(s[0])],res_y[i,:len(s[1])] = LongTensor(s[0]),LongTensor(s[1])
    if backwards: res_x,res_y = res_x.flip(1),res_y.flip(1)
    return res_x,res_y
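
The padding logic can be tried without PyTorch. The following is a simplified pure-Python sketch that uses lists instead of tensors and leaves out the `backwards` option. The function name `pad_batch` and the toy token ids are assumptions made for illustration.

```python
def pad_batch(samples, pad_idx=1, pad_first=True):
    """Pad (x, y) token-id pairs so all x share one length and all y share another."""
    max_x = max(len(x) for x, _ in samples)
    max_y = max(len(y) for _, y in samples)

    def pad(seq, size):
        # Build the filler once, then put it before or after the real tokens.
        fill = [pad_idx] * (size - len(seq))
        return fill + list(seq) if pad_first else list(seq) + fill

    xs = [pad(x, max_x) for x, _ in samples]
    ys = [pad(y, max_y) for _, y in samples]
    return xs, ys

# Two toy samples with made-up token ids; inputs and targets differ in length.
batch = [([2, 10, 11], [2, 20]), ([2, 12], [2, 21, 22, 23])]
xs_front, ys_front = pad_batch(batch, pad_first=True)
xs_back, ys_back = pad_batch(batch, pad_first=False)
```

Inputs are padded to the longest input in the batch and targets to the longest target, so the two sides can still end up with different widths.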

In [25]:
doc(to_data)

In [32]:
doc(Dataset)

### The ```Dataset```
```Dataset``` is the base class for storing datasets. Any dataset we create should be a subclass of it.

In [34]:
# train_dl, valid_dl are all of DataLoader types
doc(DataLoader)

### The ```DataLoader```
This gives us a way to iterate over a dataset. Therefore, we must first create the ```dataset``` and then create the ```dataloaders```.

DataLoader(dataset, batch_size=1, shuffle=False, sampler=None, batch_sampler=None, num_workers=0, collate_fn='default_collate', pin_memory=True, drop_last=False, timeout=0, worker_init_fn=None)
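
The relationship between a dataset and a loader can be sketched in plain Python. This is not how PyTorch implements `DataLoader`; `PairDataset` and `iterate_batches` are hypothetical names, and the sketch covers only in-order batching with an optional `collate_fn` and `drop_last`.

```python
class PairDataset:
    """Minimal dataset: it only needs to support len() and integer indexing."""
    def __init__(self, xs, ys):
        assert len(xs) == len(ys)
        self.xs, self.ys = xs, ys

    def __len__(self):
        return len(self.xs)

    def __getitem__(self, i):
        return self.xs[i], self.ys[i]

def iterate_batches(dataset, batch_size, collate_fn=list, drop_last=False):
    """Yield collated batches in index order, like a loader without shuffling."""
    batch = []
    for i in range(len(dataset)):
        batch.append(dataset[i])
        if len(batch) == batch_size:
            yield collate_fn(batch)
            batch = []
    # Emit the final, smaller batch unless the caller asked to drop it.
    if batch and not drop_last:
        yield collate_fn(batch)

ds = PairDataset(["a", "b", "c", "d", "e"], [1, 2, 3, 4, 5])
batches = list(iterate_batches(ds, batch_size=2))
full_only = list(iterate_batches(ds, batch_size=2, drop_last=True))
```

The `collate_fn` hook is the point where a padding function such as `seq2seq_collate` plugs in, turning a list of samples into a single batch.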

In [29]:
doc(DataBunch)

### The ```DataBunch```
This puts it all together, hence it is always called last when creating a dataset with fastai.

It combines all your dataloaders into a single object.

In [35]:
# Creating our own TextDataBunch
class Seq2SeqDataBunch(TextDataBunch):
    "Create a `TextDataBunch` suitable for training a seq2seq model."
    @classmethod
    def create(cls, train_ds, valid_ds, test_ds=None, path:PathOrStr='.', bs:int=32, val_bs:int=None, pad_idx=1,
               dl_tfms=None, pad_first=False, device:torch.device=None, no_check:bool=False, backwards:bool=False, **dl_kwargs) -> DataBunch:
        "Function that transforms the `datasets` into a `DataBunch` for seq2seq training. Passes `**dl_kwargs` on to `DataLoader()`"
        datasets = cls._init_ds(train_ds, valid_ds, test_ds)
        val_bs = ifnone(val_bs, bs)
        collate_fn = partial(seq2seq_collate, pad_idx=pad_idx, pad_first=pad_first, backwards=backwards)
        train_sampler = SortishSampler(datasets[0].x, key=lambda t: len(datasets[0][t][0].data), bs=bs//2)
        train_dl = DataLoader(datasets[0], batch_size=bs, sampler=train_sampler, drop_last=True, **dl_kwargs)
        dataloaders = [train_dl]
        for ds in datasets[1:]:
            lengths = [len(t) for t in ds.x.items]
            sampler = SortSampler(ds.x, key=lengths.__getitem__)
            dataloaders.append(DataLoader(ds, batch_size=val_bs, sampler=sampler, **dl_kwargs))
        return cls(*dataloaders, path=path, device=device, collate_fn=collate_fn, no_check=no_check)

In [37]:
cls??

Source:
    @line_magic
    def clear(self, arg_s):
        """Clear the terminal."""
        if os.name == 'posix':
            self.shell.system("clear")
        else:
            self.shell.system("cls")
File:   c:\users\dmber\anaconda3\lib\site-packages\ipykernel\zmqshell.py


In [38]:
doc(SortishSampler)
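
The reason for sampling by length is that batches of similar-length sequences need far less padding. The sketch below is a plain-Python illustration of that effect and does not reproduce fastai's `SortishSampler`, which also adds randomness. The helper `padding_needed` and the toy lengths are assumptions.

```python
def padding_needed(lengths, order, bs):
    """Count pad tokens when batches of size `bs` are taken in the given index order."""
    total = 0
    for start in range(0, len(order), bs):
        batch = [lengths[i] for i in order[start:start + bs]]
        # Every sequence is padded up to the longest one in its batch.
        total += sum(max(batch) - n for n in batch)
    return total

# Made-up sequence lengths mixing short and long questions.
lengths = [3, 28, 5, 30, 4, 27]
original_order = list(range(len(lengths)))
sorted_order = sorted(original_order, key=lengths.__getitem__, reverse=True)

unsorted_pad = padding_needed(lengths, original_order, bs=3)
sorted_pad = padding_needed(lengths, sorted_order, bs=3)
```

With these toy lengths, grouping by length cuts the padding from 77 tokens to 8, which means less wasted computation per batch.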

Next we need a subclass of ```TextList``` that will use this ```DataBunch``` class in the call to ```.databunch``` and will use ```TextList``` to label.

In [39]:
class Seq2SeqTextList(TextList):
    _bunch = Seq2SeqDataBunch
    _label_cls = TextList

We are now ready to use the datablock API

In [42]:
# Creating a Seq2Seq TextList
src = Seq2SeqTextList.from_df(df, path=path, cols='fr').split_by_rand_pct(seed=42).label_from_df(cols='en', label_cls=TextList)

In [43]:
# checking the length of both x, y => 90th percentile for each
x_per = np.percentile([len(o) for o in src.train.x.items] + [len(o) for o in src.valid.x.items], 90) 
y_per = np.percentile([len(o) for o in src.train.y.items] + [len(o) for o in src.valid.y.items], 90) 
print(f'x: {x_per}, y: {y_per}')

x: 28.0, y: 23.0


We will now remove pairs in which either the source or the target is more than 30 tokens long.

In [44]:
src

LabelLists;

Train: LabelList (41865 items)
x: Seq2SeqTextList
xxbos qu’est - ce que la lumière ?,xxbos où sommes - nous ?,xxbos d'où venons - nous ?,xxbos que ferions - nous sans elle ?,xxbos quel est le groupe autochtone principal sur l’île de vancouver ?
y: TextList
xxbos what is light ?,xxbos who are we ?,xxbos where did we come from ?,xxbos what would we do without it ?,xxbos what is the major aboriginal group on vancouver island ?
Path: C:\Users\dmber\.fastai\data\giga-fren;

Valid: LabelList (10466 items)
x: Seq2SeqTextList
xxbos quels pourraient être les effets sur l’instrument de xxunk et sur l’aide humanitaire qui ne sont pas co - xxunk ?,xxbos quand la source primaire a - t - elle été créée ?,xxbos pourquoi tant de soldats ont - ils fait xxunk de ne pas voir ce qui s'est passé le 4 et le 16 mars ?,xxbos quels sont les taux d'impôt sur le revenu au canada pour 2007 ?,xxbos pourquoi le programme devrait - il intéresser les employeurs et les fournisseurs de services ?
y: TextLi

In [45]:
# filtering - removing pairs where either the source or the target is longer than 30 tokens
src = src.filter_by_func(lambda x,y: len(x) > 30 or len(y) > 30)
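
The predicate passed to `filter_by_func` marks pairs to drop rather than pairs to keep. This sketch assumes the fastai v1 behaviour for labelled lists, where items for which the function returns `True` are removed. The helper `drop_if` and the dummy token lists are made up for illustration.

```python
def drop_if(pairs, should_drop):
    """Return only the pairs for which should_drop(x, y) is False."""
    return [(x, y) for x, y in pairs if not should_drop(x, y)]

max_len = 30
# Dummy token sequences; only their lengths matter here.
pairs = [
    ([0] * 12, [0] * 10),   # both short: kept
    ([0] * 31, [0] * 10),   # source too long: dropped
    ([0] * 30, [0] * 30),   # exactly 30 on both sides: kept
    ([0] * 8,  [0] * 45),   # target too long: dropped
]
kept = drop_if(pairs, lambda x, y: len(x) > max_len or len(y) > max_len)
kept_lengths = [(len(x), len(y)) for x, y in kept]
```

Because the comparison is strict, a pair with exactly 30 tokens on either side survives the filter.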

In [71]:
# creating our databunch
data = src.databunch(num_workers=0)

In [48]:
# saving
data.save()

In [72]:
data

Seq2SeqDataBunch;

Train: LabelList (41865 items)
x: Seq2SeqTextList
xxbos contenu introduction aux its saviez - vous que … ?,xxbos contenu introduction aux its saviez - vous que … ?,xxbos contenu introduction aux its saviez - vous que … ?,xxbos contenu introduction aux its saviez - vous que … ?,xxbos contenu introduction aux its saviez - vous que … ?
y: TextList
xxbos what 's inside introduction to xxunk did you know … ?,xxbos what 's inside introduction to xxunk did you know … ?,xxbos what 's inside introduction to xxunk did you know … ?,xxbos what 's inside introduction to xxunk did you know … ?,xxbos what 's inside introduction to xxunk did you know … ?
Path: C:\Users\dmber\.fastai\data\giga-fren;

Valid: LabelList (10466 items)
x: Seq2SeqTextList
xxbos quand demander une protection à l’étranger ?,xxbos quand demander une protection à l’étranger ?,xxbos quand demander une protection à l’étranger ?,xxbos quand demander une protection à l’étranger ?,xxbos quand demander une protectio

In [None]:
# # loading the data if we need to quickly
# data = load_data(path)

In [73]:
len(df)

52331

In [77]:
len(src.train) + len(src.valid)

52331

Note: On Windows, you may run into a problem when creating the databunch object. This is in fact a problem with the PyTorch implementation on Windows. A workaround is to set ```num_workers=0```. However, if you then call ```show_batch()``` on the databunch object, you will notice that elements are duplicated.

It may be wise to do all the data preprocessing on a Linux/Unix machine before working on Windows. Just save the databunch object and load it into your working directory later.

# Creating our Model
## PreTrained Embeddings
Moving forward, we will download the word embeddings (crawl vectors) from fastText. fastText has pre-trained word vectors for 157 languages, trained on Common Crawl and Wikipedia. These models were trained using CBOW.

Learn more about word embeddings here: https://www.youtube.com/watch?v=25nC0n9ERq4&list=PLtmWHNX-gukLQlMvtRJ19s7-8MrnRV6h6&index=10&t=0s

To install fastText, run the following commands:
```
$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ pip install .
```

In [None]:
# note to self
# DELETE giga-fren.tgz
# DELETE giga-fren Directory