# Torchtext tutorial

[Original video](https://youtu.be/KRgq4VnCr7I)

Torchtext resources:
  * [PyTorch docs](https://torchtext.readthedocs.io/en/latest)
  * [A Tutorial on Torchtext](https://anie.me/On-Torchtext)
  * [Ben Trevett GitHub](https://github.com/bentrevett)
  * [How to use TorchText for neural machine translation, plus hack to make it 5x faster](https://towardsdatascience.com/how-to-use-torchtext-for-neural-machine-translation-plus-hack-to-make-it-5x-faster-77f3884d95)
  * [A Comprehensive Tutorial to Torchtext](https://mlexplained.com/2018/02/08/a-comprehensive-tutorial-to-torchtext/) (link not working at the time of writing)

## Custom Datasets and loading JSON/CSV/TSV files

In [8]:
!wget https://raw.githubusercontent.com/aladdinpersson/Machine-Learning-Collection/master/ML/Pytorch/more_advanced/torchtext/mydata/test.csv
!wget https://raw.githubusercontent.com/aladdinpersson/Machine-Learning-Collection/master/ML/Pytorch/more_advanced/torchtext/mydata/test.json
!wget https://raw.githubusercontent.com/aladdinpersson/Machine-Learning-Collection/master/ML/Pytorch/more_advanced/torchtext/mydata/test.tsv
!wget https://raw.githubusercontent.com/aladdinpersson/Machine-Learning-Collection/master/ML/Pytorch/more_advanced/torchtext/mydata/train.csv
!wget https://raw.githubusercontent.com/aladdinpersson/Machine-Learning-Collection/master/ML/Pytorch/more_advanced/torchtext/mydata/train.json
!wget https://raw.githubusercontent.com/aladdinpersson/Machine-Learning-Collection/master/ML/Pytorch/more_advanced/torchtext/mydata/train.tsv


In [9]:
# torchtext 0.9.0 moved Field, TabularDataset and BucketIterator to torchtext.legacy.
# Either import them from there:
#   import torchtext.legacy as torchtext
# or downgrade to torchtext==0.8.0:
#   !pip install -U torchtext==0.8.0

import torchtext
torchtext.__version__

'0.9.0'
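
Since the right import path depends on the installed release, the choice can be made explicit. The helper below is only a sketch: `legacy_module_for` is a hypothetical name, and the version boundaries (legacy namespace added in 0.9.0, removed again in 0.12.0) are stated as an assumption, so check them against the release notes of the version you use.

```python
def legacy_module_for(version: str):
    """Return the module path assumed to hold Field/TabularDataset/BucketIterator.

    Hypothetical helper. Assumed boundaries:
    < 0.9 -> torchtext.data, 0.9 to 0.11 -> torchtext.legacy.data, >= 0.12 -> None.
    """
    # Keep only the numeric "major.minor" part, e.g. "0.9.0+cu111" -> (0, 9)
    major, minor = (int(part) for part in version.split("+")[0].split(".")[:2])
    if (major, minor) < (0, 9):
        return "torchtext.data"
    if (major, minor) < (0, 12):
        return "torchtext.legacy.data"
    return None


for v in ("0.8.0", "0.9.0", "0.12.0"):
    print(v, "->", legacy_module_for(v))
```

With the `'0.9.0'` shown above, the helper points at `torchtext.legacy.data`, which is what the imports in the next cell use.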

In [32]:
import spacy
import torch
import torch.nn as nn
import torch.optim as optim
import pandas as pd

from tqdm.notebook import tqdm
from sklearn.model_selection import train_test_split
from torchtext.legacy.datasets import Multi30k
from torchtext.legacy.data import Field, TabularDataset, BucketIterator

# STEPS:
# 1. Specify how preprocessing should be done -> Fields
# 2. Use Dataset to load the data -> TabularDataset (JSON/CSV/TSV files)
# 3. Construct an iterator to do batching & padding -> BucketIterator
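
To make the three steps concrete, here is a rough sketch of the same idea in plain Python, without torchtext. It is not how torchtext is implemented; the function names, the records and the whitespace tokenizer are all made up for illustration. Vocabulary building and padding are left out on purpose.

```python
import json

# Step 1: a "field" is a preprocessing recipe (here: lowercase + whitespace split)
def preprocess_text(text):
    return text.lower().split()

def preprocess_label(value):
    return int(value)

# Maps a key in the file to (attribute name, preprocessing function).
# Keys and attribute names are made up for this sketch.
fields = {"quote": ("q", preprocess_text), "score": ("s", preprocess_label)}

# Step 2: a "dataset" turns each JSON line into one preprocessed example
def load_examples(lines, fields):
    examples = []
    for line in lines:
        record = json.loads(line)
        examples.append({attr: fn(record[key]) for key, (attr, fn) in fields.items()})
    return examples

# Step 3: an "iterator" yields the examples in groups of batch_size
def batches(examples, batch_size):
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]

# Made-up records in JSON-lines form (one object per line)
raw_lines = [
    '{"quote": "Stay hungry", "score": 1}',
    '{"quote": "Nothing ever works", "score": 0}',
    '{"quote": "Keep going", "score": 1}',
]

examples = load_examples(raw_lines, fields)
for batch in batches(examples, batch_size=2):
    print(batch)
```

The real classes do more (vocabulary lookup, padding, moving tensors to a device), but the division of labour is the same.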

In [7]:
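# simple alternative: whitespace tokenizer (does not split punctuation from words)
# tokenize = lambda x: x.split()

# !pip install spacy
# !python -m spacy download en

# 'en' is a spaCy 2.x shortcut link; spaCy 3.x needs the full name, e.g. 'en_core_web_sm'
spacy_en = spacy.load('en')

# spaCy tokenizer: also separates punctuation from words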
def tokenize(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]

quote = Field(sequential=True, use_vocab=True, tokenize=tokenize, lower=True)
score = Field(sequential=False, use_vocab=False)

fields = {'quote': ('q', quote), 'score': ('s', score)}

train_data, test_data = TabularDataset.splits(path='.',
                                              train='train.json',
                                              # validation='validation.json',
                                              test='test.json',
                                              format='json',
                                              fields=fields)

# train_data, test_data = TabularDataset.splits(path='.',
#                                               train='train.csv',
#                                               # validation='validation.csv',
#                                               test='test.csv',
#                                               format='csv',
#                                               fields=fields)

# train_data, test_data = TabularDataset.splits(path='.',
#                                               train='train.tsv',
#                                               # validation='validation.tsv',
#                                               test='test.tsv',
#                                               format='tsv',
#                                               fields=fields)

print(train_data[0].__dict__.keys())
print(train_data[0].__dict__.values())

quote.build_vocab(
    train_data,
    max_size=10000,
    min_freq=1,
    # GloVe trained on 6 billion tokens, 100-dimensional vectors;
    # the download (glove.6B.zip) is about 862 MB
    vectors='glove.6B.100d',
)

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# experiment with batch_sizes here
train_iterator, test_iterator = BucketIterator.splits((train_data, test_data),
                                                      batch_sizes=(2, 2),
                                                      device=device)

print('-----')
for batch in train_iterator:
    print(batch.q)
    print(batch.s)

dict_keys(['q', 's'])
dict_values([['you', 'must', 'own', 'everything', 'in', 'your', 'world', '.', 'there', 'is', 'no', 'one', 'else', 'to', 'blame', '.'], 1])



-----
tensor([[35, 14],
        [23, 25],
        [26,  7],
        [18,  5],
        [19, 10],
        [36, 15],
        [34, 21],
        [ 2,  3],
        [33,  7],
        [20,  5],
        [24, 32],
        [ 6, 30],
        [16,  8],
        [ 8, 17],
        [12,  4],
        [ 2, 13],
        [ 1,  6],
        [ 1,  2]])
tensor([1, 1])
tensor([[29],
        [31],
        [ 3],
        [11],
        [28],
        [22],
        [ 4],
        [27],
        [ 9]])
tensor([0])
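
The printed tensors have the shape `[sequence length, batch size]`: every column is one quote, and shorter quotes are filled up with the index 1. The sketch below reproduces that layout in plain Python. `pad_batch` is a hypothetical helper and the token indices are made up; the pad index 1 is taken from the output above.

```python
PAD_INDEX = 1  # assumption: matches the padding value visible in the printed batch

def pad_batch(sequences, pad_index=PAD_INDEX):
    """Pad every sequence to the longest one, then lay the batch out time-major."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_index] * (max_len - len(seq)) for seq in sequences]
    # Transpose: row t holds the t-th token of every example -> [seq_len, batch]
    return [list(step) for step in zip(*padded)]

# Two made-up, already numericalized sentences of different length
batch = pad_batch([[5, 9, 4], [7, 2, 8, 6, 3]])
for row in batch:
    print(row)
```

This layout is why the model further down reads the batch size from `x.size(1)` rather than `x.size(0)`.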


### Training a simple LSTM on this toy data of ours

In [11]:
class RNN_LSTM(nn.Module):
    def __init__(self, input_size, embed_size, hidden_size, num_layers):
        super(RNN_LSTM, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        self.embedding = nn.Embedding(input_size, embed_size)
        self.rnn = nn.LSTM(embed_size, hidden_size, num_layers)
        self.fc_out = nn.Linear(hidden_size, 1)

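    def forward(self, x):
        # x: (seq_len, batch_size) token indices, as produced by the iterator
        # Set initial hidden and cell states: (num_layers, batch_size, hidden_size)
        h0 = torch.zeros(self.num_layers, x.size(1), self.hidden_size, device=x.device)
        c0 = torch.zeros(self.num_layers, x.size(1), self.hidden_size, device=x.device)

        embedded = self.embedding(x)               # (seq_len, batch_size, embed_size)
        outputs, _ = self.rnn(embedded, (h0, c0))  # (seq_len, batch_size, hidden_size)
        # Use the output at the last time step (a padded position for shorter sequences)
        prediction = self.fc_out(outputs[-1, :, :])  # (batch_size, 1)

        return prediction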


# Hyperparameters
input_size = len(quote.vocab)
hidden_size = 512
num_layers = 2
embedding_size = 100
learning_rate = 0.005
num_epochs = 10

# Initialize network
model = RNN_LSTM(input_size, embedding_size, hidden_size, num_layers).to(device)

print(len(quote.vocab))
print(quote.vocab.vectors.size())

# (NOT COVERED IN YOUTUBE VIDEO): Load the pretrained embeddings onto our model
pretrained_embeddings = quote.vocab.vectors
model.embedding.weight.data.copy_(pretrained_embeddings)

# Loss and optimizer
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Train Network
for epoch in range(1, num_epochs+1):
    loop = tqdm(train_iterator, leave=False)
    loop.set_description(f'Epoch [{epoch}/{num_epochs}]')

    for batch_idx, batch in enumerate(loop):
        # Move the batch to the GPU if one is available
        data = batch.q.to(device=device)
        targets = batch.s.to(device=device)

        # forward
        scores = model(data)
        loss = criterion(scores.squeeze(1), targets.type_as(scores))

        # backward
        optimizer.zero_grad()
        loss.backward()

        # gradient descent
        optimizer.step()

37
torch.Size([37, 100])
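
The training loop passes raw scores (logits) to `nn.BCEWithLogitsLoss`, which applies the sigmoid internally and averages over the batch by default. As a rough illustration of what is computed for each example, here is the numerically stable form of that loss in plain Python. The logits and targets are made up, and `bce_with_logits` is a hypothetical helper, not PyTorch's implementation.

```python
import math

def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy on a raw score (logit)."""
    # Equivalent to -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))]
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Made-up model outputs and labels for a batch of two
logits = [2.0, -1.0]
targets = [1.0, 0.0]

losses = [bce_with_logits(x, y) for x, y in zip(logits, targets)]
print(sum(losses) / len(losses))               # mean over the batch
print([round(sigmoid(x), 3) for x in logits])  # probabilities for inspection
```

At prediction time the sigmoid has to be applied explicitly to turn the model output into a probability.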



## Built-in Datasets with Example

In [19]:
!python -m spacy download de

✔ Download and installation successful
You can now load the model via spacy.load('de_core_news_sm')
✔ Linking successful
/usr/local/lib/python3.7/dist-packages/de_core_news_sm -->
/usr/local/lib/python3.7/dist-packages/spacy/data/de
You can now load the model via spacy.load('de')


In [28]:
spacy_en = spacy.load('en')
spacy_de = spacy.load('de')

def tokenize_en(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]


def tokenize_de(text):
    return [tok.text for tok in spacy_de.tokenizer(text)]


english = Field(sequential=True, use_vocab=True, tokenize=tokenize_en, lower=True)
german = Field(sequential=True, use_vocab=True, tokenize=tokenize_de, lower=True)

# translate German (de) to English (en): German is the source, English the target
train_data, validation_data, test_data = Multi30k.splits(
    exts=('.de', '.en'), fields=(german, english))

# build vocabulary
english.build_vocab(train_data, max_size=9001, min_freq=2)
german.build_vocab(train_data, max_size=9001, min_freq=2)

train_iterator, validation_iterator, test_iterator = BucketIterator.splits(
    (train_data, validation_data, test_data), batch_size=64, device=device)

for idx, batch in enumerate(train_iterator):
    print(batch)
    if idx >= 3: break


[torchtext.legacy.data.batch.Batch of size 64 from MULTI30K]
	[.src]:[torch.LongTensor of size 25x64]
	[.trg]:[torch.LongTensor of size 24x64]

[torchtext.legacy.data.batch.Batch of size 64 from MULTI30K]
	[.src]:[torch.LongTensor of size 25x64]
	[.trg]:[torch.LongTensor of size 26x64]

[torchtext.legacy.data.batch.Batch of size 64 from MULTI30K]
	[.src]:[torch.LongTensor of size 28x64]
	[.trg]:[torch.LongTensor of size 25x64]

[torchtext.legacy.data.batch.Batch of size 64 from MULTI30K]
	[.src]:[torch.LongTensor of size 21x64]
	[.trg]:[torch.LongTensor of size 22x64]
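
Each batch above is padded to its own longest sentence, which is why the first dimension changes from batch to batch (25, 28, 21, ...). The idea behind bucketing is to put sentences of similar length into the same batch so that less padding is needed. The sketch below illustrates only that idea with made-up sentence lengths. It is a simplification, not torchtext's actual batching logic, and `padding_needed` and `chunk` are hypothetical helpers.

```python
import random

def padding_needed(batches):
    """Count pad tokens if every batch is padded to its own longest sequence."""
    return sum(max(batch) * len(batch) - sum(batch) for batch in batches)

def chunk(lengths, batch_size):
    """Split a list of sentence lengths into consecutive batches."""
    return [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]

random.seed(0)
# Made-up sentence lengths standing in for a real corpus
lengths = [random.randint(5, 40) for _ in range(256)]

random_batches = chunk(lengths, batch_size=64)
bucketed_batches = chunk(sorted(lengths), batch_size=64)

print("padding with arbitrary batches:    ", padding_needed(random_batches))
print("padding with length-sorted batches:", padding_needed(bucketed_batches))
```

Fewer pad tokens means less wasted computation per batch.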


In [31]:
print(english.vocab.stoi['the'])
print(english.vocab.stoi['hello'])

print(english.vocab.itos[4110])
print(english.vocab.itos[5])

5
4110
hello
the
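
`stoi` maps a string to its index and `itos` maps an index back to its string. The sketch below shows how such a vocabulary could be built from token counts with `max_size` and `min_freq`. It is a simplified stand-in, not the torchtext source: `build_vocab` here is a hypothetical function, the corpus is made up, and putting `<unk>` and `<pad>` at indices 0 and 1 is an assumption (consistent with the pad index 1 seen earlier).

```python
from collections import Counter

def build_vocab(tokenized_sentences, max_size=None, min_freq=1,
                specials=("<unk>", "<pad>")):
    """Simplified vocabulary: specials first, then tokens by descending frequency."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    # Sort alphabetically first so that ties in frequency are broken consistently
    ordered = sorted(counts.items(), key=lambda kv: kv[0])
    ordered.sort(key=lambda kv: kv[1], reverse=True)

    itos = list(specials)
    for token, freq in ordered:
        too_rare = freq < min_freq
        full = max_size is not None and len(itos) >= max_size + len(specials)
        if too_rare or full:
            break
        itos.append(token)
    stoi = {token: index for index, token in enumerate(itos)}
    return stoi, itos

# Made-up corpus
corpus = [["a", "dog", "runs"], ["a", "cat", "sleeps"], ["a", "dog", "sleeps"]]
stoi, itos = build_vocab(corpus, max_size=3, min_freq=2)

print(itos)
print(stoi.get("cat", stoi["<unk>"]))  # rare words fall back to <unk>
```

Frequent tokens end up with small indices, which fits the output above, where `'the'` has a much smaller index than `'hello'`.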


## From Text Files to Dataset

[Neural Machine Translation](https://nlp.stanford.edu/projects/nmt/)

WMT'14 English-German data [Medium]: [train.en](https://nlp.stanford.edu/projects/nmt/data/wmt14.en-de/train.en) and [train.de](https://nlp.stanford.edu/projects/nmt/data/wmt14.en-de/train.de)

IWSLT'15 English-Vietnamese data [Small]: [train.en](https://nlp.stanford.edu/projects/nmt/data/iwslt15.en-vi/train.en) and [train.vi](https://nlp.stanford.edu/projects/nmt/data/iwslt15.en-vi/train.vi)

In [46]:
# The English-German training set is too large, so download the newstest2012 test set instead
!wget https://nlp.stanford.edu/projects/nmt/data/wmt14.en-de/newstest2012.en
!wget https://nlp.stanford.edu/projects/nmt/data/wmt14.en-de/newstest2012.de

--2021-03-25 18:59:32--  https://nlp.stanford.edu/projects/nmt/data/wmt14.en-de/newstest2012.en
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 406049 (397K) [text/plain]
Saving to: ‘newstest2012.en’


2021-03-25 18:59:33 (517 KB/s) - ‘newstest2012.en’ saved [406049/406049]

--2021-03-25 18:59:33--  https://nlp.stanford.edu/projects/nmt/data/wmt14.en-de/newstest2012.de
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 470350 (459K) [text/plain]
Saving to: ‘newstest2012.de’


2021-03-25 18:59:35 (547 KB/s) - ‘newstest2012.de’ saved [470350/470350]



In [47]:
# load data from two text files where each line is a sentence
with open('newstest2012.en', encoding='utf8') as f:
    english_txt = f.read().split('\n')
with open('newstest2012.de', encoding='utf8') as f:
    german_txt = f.read().split('\n')

# keep lines 1..999 of each file (same slice for both, so sentence pairs stay aligned)
raw_data = {'English': english_txt[1:1000],
            'German': german_txt[1:1000]}

df = pd.DataFrame(raw_data, columns=['English', 'German'])

df.head()

|   | English | German |
|---|---------|--------|
| 0 | Today , the Ukraine parliament dismissed , wit... | Das ukrainische Parlament verweigerte heute de... |
| 1 | The amendment that would lead to freeing the i... | Die Neuregelung , die den Weg zur Befreiung de... |
| 2 | In October , Tymoshenko was sentenced to seven... | Timoshenko war im Oktober wegen des Abschlusse... |
| 3 | The verdict is not yet final ; the court will ... | Das Urteil ist noch nicht rechtskräftig , im D... |
| 4 | Tymoshenko claims the verdict is a political r... | Timoshenko selbst bezeichnet das Urteil als po... |
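
The raw lines contain HTML entities such as `&quot;` and `&apos;`, and the marker `##AT##-##AT##`, which appears to stand for a hyphen joining two tokens (both are visible in the `train.csv` preview further down). The notebook keeps them as they are. If you prefer plain text, a cleaning step could look like the sketch below. `clean_line` is a hypothetical helper, the sample sentence is stitched together from fragments of the data, and whether cleaning helps depends on your model.

```python
import html

def clean_line(text):
    """Undo the escaping visible in the downloaded files (hypothetical helper)."""
    # Assumption: "##AT##-##AT##" marks a hyphen that joins the neighbouring tokens
    text = text.replace(" ##AT##-##AT## ", "-")
    # Turn HTML entities such as &quot; and &apos; back into plain characters
    return html.unescape(text)

sample = "&quot; We are not an island . &quot; said the Frankfurt ##AT##-##AT## based bank"
print(clean_line(sample))
```

It could be applied to both columns with `df['English'].map(clean_line)` and `df['German'].map(clean_line)` before the split.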


In [48]:
# split into train (80%) and test (20%) sets
train, test = train_test_split(df, test_size=0.2)

# save both splits as JSON (one record per line) and CSV, which torchtext can read
train.to_json('train.json', orient='records', lines=True)
test.to_json('test.json', orient='records', lines=True)

train.to_csv('train.csv', index=False)
test.to_csv('test.csv', index=False)

!head train.csv

English,German
"The announcement made US investors sceptical . &quot; We are not an island . We are dependent , &quot; said Steve Goldman of Goldman Management .","Das stimmte die US ##AT##-##AT## Investoren skeptisch : &quot; Wir sind keine Insel , sondern abhängig &quot; , sagte Steve Goldman von Goldman Management ."
"While claiming that Pelta has been &apos; disqualified from returning trustworthiness to Czech football &apos; , he considers the other rival , Paclík , a strong candidate that has things to offer .","Während er von Pelta sagt , &quot; er sei von vornherein disqualifiziert , dem tschechischen Fußball zu neuem Vertrauen zu verhelfen &quot; , hält er den zweiten Rivalen - Paclík für einen starken Kandidaten , der dem Fußball wirklich etwas anzubieten habe ."
&quot; So I started collecting beautiful baths . &quot;,"&quot; Also fing ich an , schöne Bäder zu sammeln . &quot;"
"The regulator is demanding that the Frankfurt ##AT##-##AT## based bank take responsibility for &qu

### Same as in the previous example

In [49]:
spacy_en = spacy.load('en')
spacy_de = spacy.load('de')

def tokenize_en(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]


def tokenize_de(text):
    return [tok.text for tok in spacy_de.tokenizer(text)]


english = Field(sequential=True, use_vocab=True, tokenize=tokenize_en, lower=True)
german = Field(sequential=True, use_vocab=True, tokenize=tokenize_de, lower=True)

fields = {'English': ('eng', english),
          'German': ('ger', german)}

train_data, test_data = TabularDataset.splits(path='',
                                              train='train.json',
                                              test='test.json',
                                              format='json',
                                              fields=fields)

english.build_vocab(train_data, max_size=1000, min_freq=2)
german.build_vocab(train_data, max_size=1000, min_freq=2)

train_iterator, test_iterator = BucketIterator.splits(
    (train_data, test_data), batch_size=32, device=device)

for idx, batch in enumerate(train_iterator):
    print(batch)
    if idx > 3: break


[torchtext.legacy.data.batch.Batch of size 32]
	[.eng]:[torch.LongTensor of size 49x32]
	[.ger]:[torch.LongTensor of size 49x32]

[torchtext.legacy.data.batch.Batch of size 32]
	[.eng]:[torch.LongTensor of size 68x32]
	[.ger]:[torch.LongTensor of size 66x32]

[torchtext.legacy.data.batch.Batch of size 32]
	[.eng]:[torch.LongTensor of size 45x32]
	[.ger]:[torch.LongTensor of size 44x32]

[torchtext.legacy.data.batch.Batch of size 32]
	[.eng]:[torch.LongTensor of size 51x32]
	[.ger]:[torch.LongTensor of size 48x32]

[torchtext.legacy.data.batch.Batch of size 32]
	[.eng]:[torch.LongTensor of size 60x32]
	[.ger]:[torch.LongTensor of size 69x32]
