> spaCy and torchtext
<a href="https://towardsdatascience.com/how-to-use-torchtext-for-neural-machine-translation-plus-hack-to-make-it-5x-faster-77f3884d95">
A really good article on using torchtext for neural machine translation</a>

In [31]:
path = '../data/nlp/transformer/fr-en'
!ls {path}

europarl-v7.fr-en.en europarl-v7.fr-en.fr


In [33]:
europarl_en = open(f'{path}/europarl-v7.fr-en.en', encoding='utf-8').read().split('\n')
europarl_fr = open(f'{path}/europarl-v7.fr-en.fr', encoding='utf-8').read().split('\n')

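# One-time setup. The 'en' / 'fr' shortcut names below require spaCy 2.x;
# spaCy 3 uses full model names such as en_core_web_sm and fr_core_news_sm.
#!python3 -m spacy download en
#!python3 -m spacy download fr
#!pip install torchtext==0.6.0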
import spacy
import torchtext
from torchtext.data import Field, BucketIterator, TabularDataset

en = spacy.load('en')
fr = spacy.load('fr')

def tokenize_en(sentence):
    return [tok.text for tok in en.tokenizer(sentence)]

def tokenize_fr(sentence):
    return [tok.text for tok in fr.tokenizer(sentence)]

# Define the Fields first; at the end, the vocabularies are built from these two Fields
EN_TEXT = Field(tokenize=tokenize_en)
FR_TEXT = Field(tokenize=tokenize_fr, init_token='<sos>', eos_token='<eos>')
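
The target Field is configured with `init_token` and `eos_token`, so every French sentence gets wrapped in start and end markers once it is processed. A minimal sketch of that effect, in plain Python: the helper `add_boundary_tokens` is a hypothetical name, and whitespace splitting stands in for the spaCy tokenizer, so this only illustrates the idea and is not torchtext's implementation.

```python
def add_boundary_tokens(tokens, init_token="<sos>", eos_token="<eos>"):
    """Wrap a token list in start/end markers (hypothetical helper)."""
    return [init_token] + list(tokens) + [eos_token]

# Assumption: str.split() replaces the spaCy tokenizer for this toy example.
french_tokens = "Reprise de la session".split()
wrapped = add_boundary_tokens(french_tokens)

print(wrapped)
# ['<sos>', 'Reprise', 'de', 'la', 'session', '<eos>']
```

The English Field has no such markers, which is the usual setup when English is the source side and the decoder only needs to know where a French sentence starts and stops.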

Perhaps counter-intuitively, the easiest way to work with torchtext is to convert your data to a tabular format such as CSV, whatever the original format of your data file.

The reason is the versatility of torchtext's TabularDataset class, which builds datasets from CSV, TSV, or JSON files.
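
To see why a tabular file is a safe intermediate format for parallel text, here is a small sketch using only the standard library. The two sentence pairs are toy data made up for illustration, and the in-memory buffer stands in for a file on disk. The point is that the CSV writer quotes fields containing commas, so each sentence survives the round trip as a single cell.

```python
import csv
import io

# Toy parallel sentences (illustrative only); the second pair contains commas.
pairs = [
    ("Resumption of the session", "Reprise de la session"),
    ("In the meantime, we wait", "En attendant, nous patientons"),
]

buffer = io.StringIO()  # stands in for a file such as train.csv
writer = csv.writer(buffer)
writer.writerow(["English", "French"])  # header row
writer.writerows(pairs)

buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(rows[1]["English"])
# In the meantime, we wait
```

Note that the file starts with a header row; any reader that consumes it has to be told to skip that row or it will be treated as data.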

In [36]:
import pandas as pd

# one column per language, ready to be written out as CSV
raw_data = {'English': europarl_en, 'French': europarl_fr}

df = pd.DataFrame(raw_data, columns=["English", "French"])

# remove very long sentences and sentence pairs whose lengths differ by a
# factor of 1.5 or more (length is approximated by counting spaces)
df['eng_len'] = df['English'].str.count(' ')
df['fr_len'] = df['French'].str.count(' ')
df = df.query('fr_len < 80 & eng_len < 80')
df = df.query('fr_len < eng_len * 1.5 & fr_len * 1.5 > eng_len')
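
The two `query` calls can be hard to read at a glance, so here is the same rule as a plain-Python predicate applied to a few toy pairs. `keep_pair` is a hypothetical helper written for this sketch; it mirrors the space-counting and the two conditions above but is not part of the notebook's pipeline.

```python
def keep_pair(english, french, max_len=80, ratio=1.5):
    """Return True if a sentence pair passes both length filters (hypothetical helper)."""
    eng_len = english.count(" ")  # number of spaces, as in the DataFrame columns
    fr_len = french.count(" ")
    if eng_len >= max_len or fr_len >= max_len:
        return False
    # Neither side may be 1.5x (or more) longer than the other.
    return fr_len < eng_len * ratio and fr_len * ratio > eng_len

print(keep_pair("Resumption of the session", "Reprise de la session"))  # True
print(keep_pair("a b c d e f g", "x y"))                                # False
print(keep_pair("Yes", "Oui"))                                          # False
print(keep_pair("", ""))                                                # False
```

One side effect worth knowing: because the counts are spaces rather than words, single-word pairs and empty lines both have a length of 0 and fail the strict inequalities, so they are dropped as well.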

In [37]:
df.head()

Unnamed: 0,English,French,eng_len,fr_len
0,Resumption of the session,Reprise de la session,3,3
1,I declare resumed the session of the European ...,Je déclare reprise la session du Parlement eur...,37,32
2,"Although, as you will have seen, the dreaded '...","Comme vous avez pu le constater, le grand ""bog...",30,36
3,You have requested a debate on this subject in...,Vous avez souhaité un débat à ce sujet dans le...,18,18
4,"In the meantime, I should like to observe a mi...","En attendant, je souhaiterais, comme un certai...",39,37


In [38]:
from sklearn.model_selection import train_test_split

# create train and validation set
train, val = train_test_split(df, test_size=0.1)

train.to_csv('train.csv', index=False)
val.to_csv('val.csv', index=False)

data_fields = [('English', EN_TEXT), ('French', FR_TEXT)]
# the CSV files written above start with a header row, so skip it
train, val = TabularDataset.splits(path='./', train='train.csv', validation='val.csv',
                                   format='csv', fields=data_fields, skip_header=True)

FR_TEXT.build_vocab(train, val)
EN_TEXT.build_vocab(train, val)

In [42]:
print(EN_TEXT.vocab.stoi['the'])

2
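
The index 2 is consistent with a vocabulary that places the special tokens first and then orders the remaining tokens by frequency, with "the" presumably being the most frequent English token. A simplified sketch of that idea follows; `build_vocab` here is a stand-alone function written for illustration (not the torchtext method), and the two toy sentences are assumptions.

```python
from collections import Counter

def build_vocab(tokenized_sentences, specials=("<unk>", "<pad>")):
    """Build token<->index mappings: specials first, then tokens by frequency."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    # Most frequent first; ties are broken alphabetically.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ordered]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos

sentences = [["the", "session", "of", "the", "parliament"], ["the", "debate"]]
stoi, itos = build_vocab(sentences)

print(stoi["the"])                           # 2
print(itos[stoi["session"]])                 # session
print(stoi.get("unseen", stoi["<unk>"]))     # 0
```

`stoi` ("string to index") and `itos` ("index to string") are inverse lookups, and unknown tokens fall back to the `<unk>` index.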


In [45]:
train_iter = BucketIterator(train, batch_size=5, 
                            sort_key=lambda x: len(x.French), shuffle=True)
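
The `sort_key` matters because batching sentences of similar length keeps padding to a minimum. The sketch below shows that effect on toy data. It is a deliberately simplified illustration (sort everything, then cut into batches), not the actual BucketIterator algorithm, and both helper names are hypothetical.

```python
def bucket_batches(examples, batch_size, sort_key):
    """Group examples of similar length together (simplified bucketing)."""
    ordered = sorted(examples, key=sort_key)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

def padding_needed(batch):
    """Count the pad tokens needed to bring every sentence to the batch maximum."""
    longest = max(len(sentence) for sentence in batch)
    return sum(longest - len(sentence) for sentence in batch)

# Toy "sentences" with lengths 3, 20, 4, 18, 5 and 19 tokens.
sentences = [["tok"] * n for n in (3, 20, 4, 18, 5, 19)]

naive = [sentences[0:3], sentences[3:6]]  # batches taken in file order
bucketed = bucket_batches(sentences, batch_size=3, sort_key=len)

print(sum(padding_needed(b) for b in naive))     # 48
print(sum(padding_needed(b) for b in bucketed))  # 6
```

Less padding means fewer wasted computations per batch, which is the main motivation for bucketing.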

In [46]:
batch = next(iter(train_iter))
print(batch.English)  # shape [seq_len, batch_size]: each column is one sentence, each entry a vocabulary index

tensor([[  156, 44910,   347,  2321,    27],
        [   11, 21399,     2,   104, 11051],
        [ 1619,     7,    78,   655,   125],
        [ 1495,    11,   423,    63,     2],
        [ 4303,   303,     3,    35,  4322],
        [    3,     5,   470,   519,  1957],
        [   13, 10597,   654,    21,     5],
        [   21,   159,    24,    46,   391],
        [  325,   295,    11,    11,     7],
        [  125,   735,  2463,  2672,   147],
        [    2,     2,  1397,  2002,   108],
        [ 3068,  3041,     3,  3518,  5682],
        [   76,   189,   371,     4,    29],
        [ 8537,    29,    18,     1,     2],
        [    7,  1441,  8661,     1,  3222],
        [ 8166, 19753,  1540,     1,   988],
        [   18,     3,     7,     1,     4],
        [10817,    25,  3490,     1,     1],
        [   73,   224, 17539,     1,     1],
        [   25,    19,   116,     1,     1],
        [ 1370,  1440,   329,     1,     1],
        [    6,    11,    10,     1,     1],
        [ 
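
The trailing 1s in the printed columns are consistent with shorter sentences being padded with the `<pad>` index up to the length of the longest sentence in the batch. A plain-Python sketch of how a batch ends up in this column-per-sentence layout is shown below; the helper name and the toy indices are assumptions, and the pad index of 1 is taken from the output above.

```python
def pad_and_transpose(batch, pad_index=1):
    """Pad sentences to equal length, then lay them out one sentence per column."""
    longest = max(len(seq) for seq in batch)
    padded = [seq + [pad_index] * (longest - len(seq)) for seq in batch]
    # Transpose: rows become token positions, columns become sentences.
    return [list(position) for position in zip(*padded)]

# Three toy sentences, already converted to vocabulary indices.
batch_indices = [[5, 8, 3], [7, 2], [9]]
grid = pad_and_transpose(batch_indices)

for row in grid:
    print(row)
# [5, 7, 9]
# [8, 2, 1]
# [3, 1, 1]
```

Reading down a column recovers one sentence followed by its padding.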

In [49]:
print(EN_TEXT.vocab.itos[44910])

Silent


Building batches this way is fairly slow. There is a hack to make the process faster; see the article linked at the top for details.