## General Overview of TorchText's abilities

- **File Loading**: load in the corpus from various formats.

- **Tokenization**: break sentences into a list of words.

- **Numericalize/Indexify**: map each word in the corpus to an integer index.

- **Word Vector**: either initialize the word vectors randomly or load them from a pretrained embedding. A pretrained embedding must be "trimmed", meaning we keep in memory only the vectors for words in our vocabulary.

- **Batching**: generate batches of training samples (padding normally happens here).
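
To make the padding step concrete, here is a minimal pure-Python sketch of what happens to one batch. The helper `pad_batch` and the toy indices are illustrative assumptions, not TorchText code; the pad index of 1 is chosen to match the batches printed later in this notebook.

```python
PAD_INDEX = 1  # assumed pad index, matching the batches printed in this notebook

def pad_batch(sequences, pad_index=PAD_INDEX):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_index] * (max_len - len(seq)) for seq in sequences]

batch = pad_batch([[14, 378, 3], [420, 82], [286]])
print(batch)
```

Every row ends up with the same length, which is what allows a batch to be stored as a single rectangular tensor.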

### Stuff that it can't do
- **Train/Val/Test split**: separate your data into fixed train/val/test sets.
- **Embedding Lookup**: map each sentence (which contains word indices) to fixed-dimension word vectors.
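
Since the split is done outside TorchText in this notebook, it may help to see roughly what a fixed three-way split involves. This is a sketch using only the standard library; the function name `train_val_test_split`, the fractions, and the seed are all assumptions chosen for illustration.

```python
import random

def train_val_test_split(rows, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once with a fixed seed, then slice into three disjoint parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_frac)
    n_val = int(len(rows) * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test

# Toy sentence pairs standing in for the real corpus.
pairs = [(f"en {i}", f"de {i}") for i in range(1000)]
train_rows, val_rows, test_rows = train_val_test_split(pairs)
print(len(train_rows), len(val_rows), len(test_rows))
```

Fixing the seed keeps the split identical across runs, so results stay comparable between experiments.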

In [1]:
# Field helps us to specify how the preprocessing should be done.
# TabularDataset helps us load data from JSON/CSV/TSV files.
# BucketIterator helps us batch and pad the data.
from torchtext.data import Field, TabularDataset, BucketIterator
import spacy
import pandas as pd
from sklearn.model_selection import train_test_split

### What we're looking to accomplish:
- **Tokenization** - Break up a sentence into tokens.
- **Build Vocab** - Build a one-to-one mapping from each unique word (vocab) in the sentence / dataset to an index.
- **Numericalize text through vocab lookup** - Substitute each word in the sentence with its index in the vocab.
- **Embedding lookup** - Replace each index in a sequence with a semantically meaningful embedding instead.
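
The four steps above can be sketched end to end in plain Python. Everything here is a simplified stand-in: a whitespace tokenizer replaces spaCy, the vocab is a plain dict, and the "embedding" is a toy table of 2-dimensional vectors (in practice a layer such as `torch.nn.Embedding` would play that role).

```python
from collections import Counter

def tokenize(text):
    # Whitespace tokenizer standing in for spaCy; lowercases the text first.
    return text.lower().split()

corpus = ["the cat sat", "the dog sat down"]
tokenized = [tokenize(sentence) for sentence in corpus]

# Build vocab: special tokens first, then words by descending frequency
# (ties broken alphabetically).
counts = Counter(tok for sent in tokenized for tok in sent)
itos = ["<unk>", "<pad>"] + sorted(counts, key=lambda w: (-counts[w], w))
stoi = {word: idx for idx, word in enumerate(itos)}

def numericalize(tokens):
    # Words missing from the vocab fall back to the <unk> index.
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]

# Embedding lookup: a toy table with one 2-d vector per vocabulary entry.
embedding_table = [[float(idx), float(idx) / 10] for idx in range(len(itos))]

indices = numericalize(tokenize("the bird sat"))
vectors = [embedding_table[i] for i in indices]
print(indices)
print(vectors)
```

Note that "bird" never appeared in the toy corpus, so it is mapped to the `<unk>` index rather than raising an error.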

## Load in data, split into train and test

In [2]:
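# Use context managers so that the files are closed once they have been read.
with open('WMT-English-German-Data/train.en', encoding='utf8') as f:
    english_txt = f.read().split('\n')
with open('WMT-English-German-Data/train.de', encoding='utf8') as f:
    german_txt = f.read().split('\n')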

In [3]:
# Splitting by newlines alone causes some mismatch at around the 2000-line mark in the dataset.
# Let us restrict ourselves to the first 1000 pairs of data instead (these we know align exactly).
raw_data = {'English': english_txt[:1000],
            'German': german_txt[:1000]}

In [4]:
# construct dataframe so that we may pass this to sklearn for train_test_split
translation_data = pd.DataFrame(raw_data, columns=['English', 'German'])

In [5]:
train, test = train_test_split(translation_data, test_size=0.2)

In [6]:
# Save the splits as JSON files (one record per line) so that we can load them into TabularDataset later.
train.to_json('WMT-English-German-Data/translation-train.json', orient='records', lines=True)
test.to_json('WMT-English-German-Data/translation-test.json', orient='records', lines=True)
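
With `orient='records'` and `lines=True`, pandas writes one JSON object per line rather than a single enclosing list. The sketch below reproduces that layout with the standard `json` module; the two sentence pairs are made-up examples, and only the keys `English` and `German` come from this notebook.

```python
import json

# Made-up records; the keys mirror the DataFrame columns used above.
records = [
    {"English": "Hello world.", "German": "Hallo Welt."},
    {"English": "Thank you.", "German": "Danke."},
]

# Serialize: one JSON object per line, no enclosing list.
json_lines = "\n".join(json.dumps(rec, ensure_ascii=False) for rec in records)
print(json_lines)

# Parse it back line by line.
parsed = [json.loads(line) for line in json_lines.split("\n") if line]
```

Because each line is a complete JSON document, a reader can process the file one example at a time without loading everything at once.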

## 1. Tokenize

In [7]:
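# Let's load in our spaCy tokenizers.
# Note: the 'en' / 'de' shortcut names only work on older spaCy releases; newer ones
# need the full package names (e.g. 'en_core_web_sm', 'de_core_news_sm').
spacy_en = spacy.load('en')
spacy_de = spacy.load('de')

# Let's define the tokenizing routines using the tokenizers loaded above.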
def tokenize_en(text):
    return [token.text for token in spacy_en.tokenizer(text)]

def tokenize_de(text):
    return [token.text for token in spacy_de.tokenizer(text)]

In [8]:
# Let's create Field objects for the source and target languages.
# These fields will hold the vocabulary for the respective languages,
# as well as the directives we pass in on how to tokenize / numericalize
# the text data (here we use spaCy).
english = Field(sequential=True, use_vocab=True, tokenize=tokenize_en, lower=True, batch_first=True)
german = Field(sequential=True, use_vocab=True, tokenize=tokenize_de, lower=True, batch_first=True)

fields = {'English': ('en', english), 'German': ('de', german)}

In [9]:
# Let us now construct our datasets based on the fields we just created.
# The text is tokenized at this point; conversion to indices happens later, when batches are built.
train_data, test_data = TabularDataset.splits(path='',
                                              train='WMT-English-German-Data/translation-train.json',
                                              test='WMT-English-German-Data/translation-test.json',
                                              format='json',
                                              fields=fields)

In [10]:
# Even though we set use_vocab=True, we can't build a vocab without loading in the data.
# Now that we've linked the columns of the train and test data with their respective fields,
# we are in a position to ask each field to build its vocab from the training data.
english.build_vocab(train_data, max_size=10000, min_freq=2)
german.build_vocab(train_data, max_size=10000, min_freq=2)
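
To illustrate how `min_freq` and `max_size` trim a vocabulary, here is a simplified pure-Python sketch. The function below is a hypothetical stand-in, not TorchText's implementation; it assumes two special tokens placed first and that `max_size` counts only regular words.

```python
from collections import Counter

def build_vocab(tokenized_sentences, max_size=None, min_freq=1,
                specials=("<unk>", "<pad>")):
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    # Most frequent first; ties broken alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Drop rare words, then cap the number of remaining words.
    kept = [word for word, freq in ranked if freq >= min_freq]
    if max_size is not None:
        kept = kept[:max_size]
    itos = list(specials) + kept
    stoi = {word: idx for idx, word in enumerate(itos)}
    return itos, stoi

sentences = [["a", "b", "a"], ["a", "c", "b"], ["d"]]
itos, stoi = build_vocab(sentences, max_size=2, min_freq=2)
print(itos)
```

Words that fall below `min_freq` (here "c" and "d") get no index of their own, so at lookup time they would be treated as unknown.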

In [11]:
# Now that the datasets are cleaned up, we can set up an iterator
# that'll serve us batches of data.
train_iterator, test_iterator = BucketIterator.splits((train_data, test_data),
                                                      batch_size=32,
                                                      device="cuda")
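
The idea behind bucketing is to put sentences of similar length into the same batch so that less padding is needed. In TorchText this is typically driven by a `sort_key` argument, which the cell above does not pass. The sketch below compares the padding cost of naive and length-sorted batching on toy data; the helpers `chunk` and `padding_needed` are illustrative assumptions.

```python
def padding_needed(batches):
    """Total number of pad tokens required to make every batch rectangular."""
    return sum(max(map(len, b)) * len(b) - sum(map(len, b)) for b in batches)

def chunk(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

# Toy "sentences": only their lengths matter here.
lengths = [3, 50, 4, 48, 5, 52, 2, 49]
sentences = [[7] * n for n in lengths]

naive = chunk(sentences, batch_size=4)
bucketed = chunk(sorted(sentences, key=len), batch_size=4)

print(padding_needed(naive), padding_needed(bucketed))
```

Sorting by length before batching removes most of the padding in this toy case, at the cost of batches that are no longer a uniform random sample.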

In [12]:
for batch in train_iterator:
    print(batch.en)
    print(batch.en.shape)
    print(batch.de)
    print(batch.de.shape)
    # You see that most of the sequences end with '1's.
    # This is the number that the "pad token" got mapped to.
    # The iterator object took care of that automatically for each batch that we iterate over.
    break

tensor([[ 14, 378,   3,  ...,   1,   1,   1],
        [420,  82,   3,  ...,   1,   1,   1],
        [286,   6, 163,  ...,   1,   1,   1],
        ...,
        [443,  13, 109,  ...,   1,   1,   1],
        [367,   6,  87,  ...,   4,   1,   1],
        [160, 602, 409,  ...,   1,   1,   1]], device='cuda:0')
torch.Size([32, 53])
tensor([[1538,  256,  178,  ...,    1,    1,    1],
        [ 398,   83,   54,  ...,    1,    1,    1],
        [ 286,    5,  553,  ...,    1,    1,    1],
        ...,
        [ 111,  370, 1286,  ...,    1,    1,    1],
        [ 252,    5,   61,  ...,    1,    1,    1],
        [ 149,  386,    0,  ...,    1,    1,    1]], device='cuda:0')
torch.Size([32, 56])
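
To read a batch like the ones printed above, the indices can be mapped back to tokens. A built vocab exposes an index-to-string list (`itos`) for this; the sketch below uses a small hypothetical `itos` table instead, and assumes the default ordering in which `<unk>` is index 0 and `<pad>` is index 1.

```python
PAD_INDEX = 1  # pad index seen in the printed batches

# Hypothetical index-to-string table; a real one would come from the built vocab.
itos = ["<unk>", "<pad>", "the", "cat", "sat", "."]

def decode(row, itos, pad_index=PAD_INDEX):
    """Map indices back to tokens, dropping padding."""
    return [itos[idx] for idx in row if idx != pad_index]

row = [2, 3, 0, 5, 1, 1]
print(decode(row, itos))
```

Under that assumption, the `0` in the last German row of the output would be an out-of-vocabulary word, for example one seen fewer than `min_freq` times in the training data.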
