## Exploring the data

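In this notebook we will explore the WikiText-2 dataset with `torchtext`: load it, build a vocabulary, and iterate over batches for language modeling.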

Sources

[1] https://pytorch.org/tutorials/beginner/transformer_tutorial.html

[2] https://mlexplained.com/2018/02/15/language-modeling-tutorial-in-torchtext-practical-torchtext-part-2/



In [1]:
import torch
import torchtext


We have the following datasets available for this task:

- Penn Treebank (originally created for POS tagging)
- WikiText

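Before loading our dataset, we define how it will be tokenized and preprocessed. To do this, `torchtext` uses `data.Field`. By default, a `Field` tokenizes by splitting on whitespace (Python's `str.split`); [`spaCy`](https://spacy.io/api/tokenizer) tokenization is used only when `tokenize='spacy'` is passed, in which case `tokenizer_language` selects the spaCy model.

We also set an `init_token` and an `eos_token` to mark the beginning and the end of a sentence.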

In [2]:
from torchtext import data

TEXT = data.Field(
    tokenizer_language='en',
    lower=True,
    init_token='<sos>',
    eos_token='<eos>',
    batch_first=True,
)
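To make the preprocessing concrete, here is a minimal sketch in plain Python of what a field configured this way does to one line of raw text, assuming whitespace tokenization (no `tokenize` argument) and `lower=True`. The helper `preprocess_line` is hypothetical and is not part of `torchtext`.

```python
def preprocess_line(line, lower=True):
    """Hypothetical helper: whitespace-tokenize a line, then optionally lowercase it."""
    tokens = line.split()  # split on any run of whitespace
    if lower:
        tokens = [tok.lower() for tok in tokens]
    return tokens


raw = "It is the third game in the Valkyria series ."
tokens = preprocess_line(raw)
print(tokens)
```

Because WikiText already separates punctuation with spaces, a plain whitespace split is enough to isolate tokens such as `.` in this example.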

Now we can load our dataset.

In [3]:
from torchtext.datasets import WikiText2
 
train, valid, test = WikiText2.splits(TEXT) 


In [4]:
len(train), len(valid), len(test)

(1, 1, 1)

This might seem odd: we have only one example per split. However, each example is simply the concatenation of all the text in that split.

In [9]:
print(f"Words in Dataset {len(train[0].text) / 1e6:.2f}M")

Words in Dataset 2.09M
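The idea of a single, very long example can be sketched in plain Python. The helper `concat_lines` below is hypothetical (it is not the `torchtext` implementation); it assumes that every line of the raw file is tokenized, lowercased, and followed by an `<eos>` marker, which would explain the `<eos>` tokens that appear in the token stream.

```python
def concat_lines(lines, eos_token="<eos>"):
    """Hypothetical helper: flatten lines into one token stream, marking each line end."""
    stream = []
    for line in lines:
        stream.extend(line.lower().split())
        stream.append(eos_token)  # an empty line contributes only the marker
    return stream


toy_lines = [
    "",
    "= Valkyria Chronicles III =",
    "",
    "It is the third game in the Valkyria series .",
]
toy_split = [concat_lines(toy_lines)]  # one "example" holding the whole text

print(len(toy_split), len(toy_split[0]))
print(toy_split[0][:8])
```

The toy split has length 1 just like `train`, while the token count lives in the single example it contains.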


In [5]:
from pprint import pprint as pp

pp(train[0].text[:100], compact=True)

['<eos>', '=', 'valkyria', 'chronicles', 'iii', '=', '<eos>', '<eos>', 'senjō',
 'no', 'valkyria', '3', ':', '<unk>', 'chronicles', '(', 'japanese', ':',
 '戦場のヴァルキュリア3', ',', 'lit', '.', 'valkyria', 'of', 'the', 'battlefield', '3',
 ')', ',', 'commonly', 'referred', 'to', 'as', 'valkyria', 'chronicles', 'iii',
 'outside', 'japan', ',', 'is', 'a', 'tactical', 'role', '@-@', 'playing',
 'video', 'game', 'developed', 'by', 'sega', 'and', 'media.vision', 'for',
 'the', 'playstation', 'portable', '.', 'released', 'in', 'january', '2011',
 'in', 'japan', ',', 'it', 'is', 'the', 'third', 'game', 'in', 'the',
 'valkyria', 'series', '.', '<unk>', 'the', 'same', 'fusion', 'of', 'tactical',
 'and', 'real', '@-@', 'time', 'gameplay', 'as', 'its', 'predecessors', ',',
 'the', 'story', 'runs', 'parallel', 'to', 'the', 'first', 'game', 'and',
 'follows', 'the']


## Generating vocabulary

Let's build the vocabulary from our training set. We also load pretrained word vectors for its tokens using GloVe (`glove.6B.200d`: trained on 6 billion tokens, 200-dimensional embeddings).

Downloading the file (nearly 1 GB) takes a couple of minutes, so be patient :-)

In [67]:
TEXT.build_vocab(train, vectors="glove.6B.200d")

print(f"We have {len(TEXT.vocab)} tokens in our vocabulary")

We have 28914 tokens in our vocabulary
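Conceptually, a vocabulary is a pair of lookup tables between tokens and integer indices. The sketch below builds one with `collections.Counter`; the helper `build_vocab`, its special tokens, and its ordering (most frequent first, ties broken alphabetically) are assumptions made for illustration and may differ from what `build_vocab` in `torchtext` does internally.

```python
from collections import Counter


def build_vocab(tokens, specials=("<unk>", "<pad>")):
    """Hypothetical helper: return (itos, stoi) lookup tables for a token list."""
    counts = Counter(tokens)
    # Most frequent tokens first; ties are broken alphabetically.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ordered]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi


tokens = "the game is the third game in the series .".split()
itos, stoi = build_vocab(tokens)

# Tokens missing from the vocabulary fall back to the index of <unk>.
ids = [stoi.get(tok, stoi["<unk>"]) for tok in ["the", "game", "valkyria"]]
print(itos)
print(ids)
```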


## Iterator


In [68]:
device = "cuda" if torch.cuda.is_available() else "cpu"

train_iter, valid_iter, test_iter = data.BPTTIterator.splits(
    (train, valid, test),
    batch_size=32,
    bptt_len=30, # this is where we specify the sequence length
    device=device,
    repeat=False)

What exactly do the batches look like?
To get a batch, we first need a true iterator from `train_iter` (this might be confusing), which we obtain with `iter`. We then get the first batch by calling `next` on `it`.
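The difference between an iterable and an iterator is plain Python, and a toy class shows it. `ToyBatches` is a hypothetical stand-in for a batch iterator: it can be looped over, but it only hands out items through the iterator that `iter` returns.

```python
class ToyBatches:
    """Hypothetical iterable: every call to __iter__ starts a fresh pass over the data."""

    def __init__(self, n_batches):
        self.n_batches = n_batches

    def __iter__(self):
        for i in range(self.n_batches):
            yield f"batch {i}"


toy_iter = ToyBatches(3)

it = iter(toy_iter)               # a true iterator over the batches
first = next(it)                  # first batch
second = next(it)                 # the same iterator keeps its position
restarted = next(iter(toy_iter))  # a new iterator starts from the beginning
print(first, second, restarted)
```

Calling `next(toy_iter)` directly would raise a `TypeError`, because the object itself defines `__iter__` but not `__next__`.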

In [69]:
it = iter(train_iter)
batch = next(it)

batch


[torchtext.data.batch.Batch of size 32]
	[.text]:[torch.cuda.LongTensor of size 32x30 (GPU 0)]
	[.target]:[torch.cuda.LongTensor of size 32x30 (GPU 0)]

So we have two tensors of integer token indices (`LongTensor`), each of shape 32x30: 32 rows because of the batch size and 30 columns because of the BPTT length.

To recover the words (instead of plain numbers), we use `TEXT.vocab.itos` (integer to string), which maps each index to its respective token.

In [78]:
for text, target in zip(batch.text, batch.target):
    # Map every index in the row back to its token
    tokens = [TEXT.vocab.itos[t] for t in text]
    target_tokens = [TEXT.vocab.itos[t] for t in target]
    print("="*80)
    print("Sentence:")
    print(tokens)
    print("Target:")
    print(target_tokens, "\n\n")

Sentence:
['<eos>', '=', 'valkyria', 'chronicles', 'iii', '=', '<eos>', '<eos>', 'senjō', 'no', 'valkyria', '3', ':', '<unk>', 'chronicles', '(', 'japanese', ':', '戦場のヴァルキュリア3', ',', 'lit', '.', 'valkyria', 'of', 'the', 'battlefield', '3', ')', ',', 'commonly']
Target:
['=', 'valkyria', 'chronicles', 'iii', '=', '<eos>', '<eos>', 'senjō', 'no', 'valkyria', '3', ':', '<unk>', 'chronicles', '(', 'japanese', ':', '戦場のヴァルキュリア3', ',', 'lit', '.', 'valkyria', 'of', 'the', 'battlefield', '3', ')', ',', 'commonly', 'referred'] 


Sentence:
['authority', '"', '.', 'it', 'is', 'balaguer', 'who', 'guides', 'much', 'of', 'the', 'action', 'in', 'the', 'last', 'sections', 'of', 'the', 'book', '.', '<eos>', '<eos>', '=', '=', '=', '<unk>', '=', '=', '=', '<eos>']
Target:
['"', '.', 'it', 'is', 'balaguer', 'who', 'guides', 'much', 'of', 'the', 'action', 'in', 'the', 'last', 'sections', 'of', 'the', 'book', '.', '<eos>', '<eos>', '=', '=', '=', '<unk>', '=', '=', '=', '<eos>', '<eos>'] 


[output truncated]

We can observe that `target` is just the sentence shifted one position to the left: each target token is the token that follows the corresponding input token.
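The relationship between input and target can be reproduced on a toy stream of integers. The helper `bptt_batches` below is a hypothetical, simplified sketch of BPTT-style batching, not the `torchtext` implementation: it cuts the stream into `batch_size` contiguous rows, drops any leftover tokens (an assumption made here for brevity), and slides a window of `bptt_len` tokens along the rows.

```python
def bptt_batches(ids, batch_size, bptt_len):
    """Hypothetical helper: yield (text, target) pairs, each batch_size x seq_len."""
    n_cols = len(ids) // batch_size  # tokens per row; the remainder is dropped
    rows = [ids[r * n_cols:(r + 1) * n_cols] for r in range(batch_size)]
    for start in range(0, n_cols - 1, bptt_len):
        seq_len = min(bptt_len, n_cols - 1 - start)
        text = [row[start:start + seq_len] for row in rows]
        # The target window starts one token later than the input window.
        target = [row[start + 1:start + 1 + seq_len] for row in rows]
        yield text, target


ids = list(range(20))  # stand-in for a numericalized token stream
text, target = next(bptt_batches(ids, batch_size=2, bptt_len=4))
print(text)
print(target)
```

Each row covers a different contiguous region of the stream, which matches the printed batch above, where the second sentence comes from a different part of the text than the first.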