# Torchtext

- What is it for?
    - Raw data (csv, tsv) $\rightarrow$ Datasets (preprocessed blocks of data)
    - Datasets $\rightarrow$ Iterators (handle numericalizing, batching, packaging, etc.)

<img src="torchtext.png">

(image taken from [here](https://mlexplained.com/2018/02/08/a-comprehensive-tutorial-to-torchtext/))

## Raw data (csv, tsv) $\rightarrow$ Datasets

- `Field`: how you want a certain column/field to be processed
    - E.g., `"I want to be tokenized" -> ["I", "want", "to", "be", "tokenized"]`
    - Documentation: you should totally read [it](https://torchtext.readthedocs.io/en/latest/data.html#field)!
    - Common arguments:
        - `sequential` (**True**): whether the datatype represents sequential data; if **False**, no tokenization is applied.
        - `init_token` (**None**): a token that will be prepended to every example; **None** for no `init_token`.
        - `eos_token` (**None**): a token that will be appended to every example; **None** for no `eos_token`.
        - `lower` (**False**): whether to lowercase the text.
        - `tokenize` (`str.split`): the function used to tokenize strings.
        - `pad_token` (`"<pad>"`): the padding token.
        - `unk_token` (`"<unk>"`): the token used to represent out-of-vocabulary (OOV) words.

In [4]:
from torchtext.data import Field

s = 'I want to be tokenized'

# First, a quick recap of lambda expressions
def tokenizer_explicit(x):
    return x.split(sep=' ')

# which is equivalent to the anonymous function: lambda x: x.split(sep=' ')

WORD_1 = Field(tokenize=tokenizer_explicit)
#WORD_1 = Field(tokenize=lambda x: x.split(sep=' '))
#WORD_1 = Field(tokenize=str.split)
print(WORD_1.preprocess(s))

WORD_2 = Field(sequential=True)
#WORD_2 = Field(sequential=True, init_token="<start>", eos_token="<end>")
print(WORD_2.preprocess(s))

WORD_3 = Field(sequential=False)
print(WORD_3.preprocess(s))

WORD_4 = Field(sequential=True, lower=True)
print(WORD_4.preprocess(s))

['I', 'want', 'to', 'be', 'tokenized']
['I', 'want', 'to', 'be', 'tokenized']
I want to be tokenized
['i', 'want', 'to', 'be', 'tokenized']
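
Note that `preprocess` only tokenizes (and optionally lowercases) a string; `init_token`, `eos_token`, and `pad_token` only come into play later, when examples are padded into a batch. The following is a minimal pure-Python sketch of that padding step, not torchtext's actual implementation; `pad_batch` is a hypothetical helper name introduced here for illustration.

```python
def pad_batch(batch, init_token=None, eos_token=None, pad_token="<pad>"):
    """Pad tokenized examples to a common length (hypothetical helper)."""
    max_len = max(len(tokens) for tokens in batch)
    padded = []
    for tokens in batch:
        row = []
        if init_token is not None:
            row.append(init_token)
        row.extend(tokens)
        if eos_token is not None:
            row.append(eos_token)
        # Fill the remainder so that every row ends up with the same length.
        row.extend([pad_token] * (max_len - len(tokens)))
        padded.append(row)
    return padded


batch = [["I", "want", "to", "be", "tokenized"], ["Hello", "world"]]
padded = pad_batch(batch, init_token="<start>", eos_token="<end>")
for row in padded:
    print(row)
```

Both rows come out with the same length, which is what allows a batch to be stored as a single tensor later on.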


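- `NestedField`: a nested field holds another field (called the nesting field) and accepts an untokenized string, which it first splits with its own `tokenize`; each resulting token is then processed by the nesting field.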
    - E.g., `"I want to be nested-tokenized" -> [["I"], ["w", "a", "n", "t"], ["t", "o"]...]`
    - [Documentation](https://torchtext.readthedocs.io/en/latest/data.html#nestedfield)
    - Common arguments:
        - `nesting_field` (**Field**): a field contained in this nested field.
        - `tokenize` (`str.split`): the function used to split the string into tokens before each token is passed to the nesting field.

In [2]:
from torchtext.data import NestedField

s = 'I want to be nested-tokenized'

NESTING_FIELD = Field(tokenize=list)
WORD_5 = NestedField(nesting_field=NESTING_FIELD)
print(WORD_5.preprocess(s))

s = 'I_want_to_be_nested-tokenized'
NESTING_FIELD = Field(tokenize=list)
WORD_6 = NestedField(nesting_field=NESTING_FIELD, tokenize=lambda x: x.split(sep='_'))
print(WORD_6.preprocess(s))

[['I'], ['w', 'a', 'n', 't'], ['t', 'o'], ['b', 'e'], ['n', 'e', 's', 't', 'e', 'd', '-', 't', 'o', 'k', 'e', 'n', 'i', 'z', 'e', 'd']]
[['I'], ['w', 'a', 'n', 't'], ['t', 'o'], ['b', 'e'], ['n', 'e', 's', 't', 'e', 'd', '-', 't', 'o', 'k', 'e', 'n', 'i', 'z', 'e', 'd']]


### Your turn

#### Q1
```
"I want to be character-tokenized" -> ["i", "w", "a", "n", "t", "t", "o"...]
```

In [3]:
s = 'I want to be character-tokenized'

# Split into words, then into characters (c avoids shadowing the outer variable s)
WORD = Field(tokenize=lambda x: [c for w in x.split(sep=' ') for c in w], lower=True)
print(WORD.preprocess(s))

['i', 'w', 'a', 'n', 't', 't', 'o', 'b', 'e', 'c', 'h', 'a', 'r', 'a', 'c', 't', 'e', 'r', '-', 't', 'o', 'k', 'e', 'n', 'i', 'z', 'e', 'd']


#### Q2

```
"I want to be bi-gramized" -> [("i", "want"), ("want", "to"), ("to", "be")...]
```

In [4]:
s = 'I want to be bi-gramized'

def tokenizer(s):
    words = s.split(sep=' ')
    return [(words[i].lower(), words[i+1].lower()) for i in range(len(words)-1)]

BI_GRAM = Field(tokenize=tokenizer)
print(BI_GRAM.preprocess(s))

[('i', 'want'), ('want', 'to'), ('to', 'be'), ('be', 'bi-gramized')]


- Dataset: `TabularDataset` (columns stored in CSV, TSV, or JSON format), `SequenceTaggingDataset` (paired sequences of words and tags)

```
1,"Something wise.",Someone sometime ->

WORDS: ["Something", "wise."]
NAME: "Someone sometime"
```
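
To make the mapping above concrete, here is a rough sketch using only the standard `csv` module. It is an illustration of the idea, not how `TabularDataset` is implemented: plain functions stand in for the fields, and a `None` entry means the column is dropped. Whitespace splitting keeps punctuation attached to the word.

```python
import csv
import io

raw = '1,"Something wise.",Someone sometime\n'

# (column name, processing function); None means "drop this column".
# The functions are stand-ins for fields and are assumptions of this sketch.
columns = [("NUM", None),
           ("WORDS", str.split),
           ("NAME", lambda x: x)]

# csv.reader takes care of the quoted second column.
row = next(csv.reader(io.StringIO(raw)))
example = {name: process(value)
           for (name, process), value in zip(columns, row)
           if process is not None}
print(example)
```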


In [5]:
from torchtext.data import TabularDataset

tv_datafields = [('NUM', None),
                 ('WORDS', WORD_2),
                 ('NAME', WORD_3)]

trn, dev = TabularDataset.splits(
    path='.',
    train='train.csv',
    validation='dev.csv',
    format='csv',
    skip_header=True,
    fields=tv_datafields)

tst_datafields = [('WORDS', WORD_2),
                  ('NAME', WORD_3)]

tst = TabularDataset(
    path='./test.csv',
    format='csv',
    skip_header=False,
    fields=tst_datafields)

ex = next(iter(trn))
print(ex.__dict__.keys())
print(ex.WORDS)
print(ex.NAME)

ex = next(iter(tst))
print(ex.__dict__.keys())
print(ex.WORDS)
print(ex.NAME)

# Build vocab from training set
WORD_2.build_vocab(trn)
WORD_3.build_vocab(trn)
print(WORD_2.vocab.stoi)
print(WORD_3.vocab.stoi)

dict_keys(['WORDS', 'NAME'])
['The', 'greatest', 'glory', 'in', 'living', 'lies', 'not', 'in', 'never', 'falling,', 'but', 'in', 'rising', 'every', 'time', 'we', 'fall.']
Nelson Mandela
dict_keys(['WORDS', 'NAME'])
['You', 'miss', '100%', 'of', 'the', 'shots', 'you', "don't", 'take.']
Wayne Gretzky
defaultdict(<bound method Vocab._default_unk_index of <torchtext.vocab.Vocab object at 0x129a60d90>>, {'<unk>': 0, '<pad>': 1, 'in': 2, 'you': 3, 'Always': 4, 'Just': 5, 'Let': 6, 'Life': 7, 'Spread': 8, 'The': 9, 'absolutely': 10, 'are': 11, 'busy': 12, 'but': 13, 'come': 14, 'else.': 15, 'ever': 16, 'every': 17, 'everyone': 18, 'everywhere': 19, 'fall.': 20, 'falling,': 21, 'glory': 22, 'go.': 23, 'greatest': 24, 'happens': 25, 'happier.': 26, 'is': 27, 'leaving': 28, 'lies': 29, 'like': 30, 'living': 31, 'love': 32, 'making': 33, 'never': 34, 'no': 35, 'not': 36, 'one': 37, 'other': 38, 'plans.': 39, 'remember': 40, 'rising': 41, 'that': 42, 'time': 43, 'to': 44, 'unique.': 45, 'we': 46, 
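
The `stoi` mapping printed above appears to place the special tokens first, then the remaining tokens by decreasing frequency, with ties in alphabetical order. A minimal pure-Python sketch of that idea follows; `build_stoi` and `numericalize` are hypothetical helper names, and this is not torchtext's own code.

```python
from collections import Counter


def build_stoi(examples, specials=("<unk>", "<pad>")):
    """Map each token to an index: specials first, then by frequency."""
    counter = Counter(tok for ex in examples for tok in ex)
    # Most frequent first; ties are broken alphabetically.
    ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ordered]
    return {tok: idx for idx, tok in enumerate(itos)}


def numericalize(tokens, stoi, unk_token="<unk>"):
    """Replace tokens by indices; unseen tokens map to the unk index."""
    return [stoi.get(tok, stoi[unk_token]) for tok in tokens]


examples = [["to", "be", "or", "not", "to", "be"], ["to", "live"]]
stoi = build_stoi(examples)
print(stoi)
print(numericalize(["to", "be", "happy"], stoi))
```

Because the vocabulary is built from the training set only, words that occur only in the development or test data end up as the unknown token.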

## Datasets $\rightarrow$ Iterators
- `Iterator`: Defines an iterator that loads batches of data from a dataset.
    - [Documentation](https://torchtext.readthedocs.io/en/latest/data.html#iterators)
    - Common arguments:
        - `dataset`
        - `batch_size`
        - `sort_key`: a key to use for sorting examples.
        - `shuffle`: whether to shuffle examples between epochs.
        - `sort`: whether to sort examples according to `sort_key`.
        - `sort_within_batch`: whether to sort within each batch.
        - `device` (**cpu**)

- `BucketIterator`: Defines an iterator that batches examples of similar lengths together. Minimizes amount of padding needed while producing freshly shuffled batches for each new epoch.
    - [Documentation](https://torchtext.readthedocs.io/en/latest/data.html#bucketiterator)
    - Common arguments:
        - `dataset`
        - `batch_size`
        - `sort_key`
        - `shuffle`
        - `sort`
        - `device`

In [6]:
from torchtext.data import Iterator, BucketIterator

trn_iter = BucketIterator(trn,
                          batch_size=2,
                          sort_key=lambda ex: len(ex.WORDS),  # bucket by number of tokens
                          shuffle=True)

dev_iter, tst_iter = Iterator.splits((dev, tst),
                                     batch_sizes=(1, 1),
                                     sort=False,
                                     shuffle=False)

ex = next(iter(trn_iter))
print(ex)
print('==========')
print(ex.WORDS)
print(ex.WORDS.size())  # (sequence_length, batch_size)
print('----------')
print(ex.NAME)
print(ex.NAME.size())  # (batch_size)


[torchtext.data.batch.Batch of size 2]
	[.WORDS]:[torch.LongTensor of size 15x2]
	[.NAME]:[torch.LongTensor of size 2]
==========
tensor([[ 4,  8],
        [40, 32],
        [42, 19],
        [ 3,  3],
        [11, 23],
        [10,  6],
        [45, 35],
        [ 5, 37],
        [30, 16],
        [18, 14],
        [15, 44],
        [ 1,  3],
        [ 1, 49],
        [ 1, 28],
        [ 1, 26]])
torch.Size([15, 2])
----------
tensor([2, 3])
torch.Size([2])
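
The idea behind bucketing can be sketched in plain Python: sort examples by length, cut the sorted list into batches, then shuffle the order of the batches. This is a simplified illustration under those assumptions rather than the actual `BucketIterator` algorithm; `bucket_batches` and `padding_needed` are hypothetical helpers.

```python
import random


def bucket_batches(examples, batch_size, sort_key=len, seed=0):
    """Group examples of similar length, then shuffle the batch order."""
    ordered = sorted(examples, key=sort_key)
    batches = [ordered[i:i + batch_size]
               for i in range(0, len(ordered), batch_size)]
    random.Random(seed).shuffle(batches)
    return batches


def padding_needed(batches):
    """Count the pad tokens required to make each batch rectangular."""
    return sum(max(len(ex) for ex in b) - len(ex)
               for b in batches for ex in b)


examples = [["a"] * n for n in (2, 9, 3, 8, 1, 10)]
naive = [examples[i:i + 2] for i in range(0, len(examples), 2)]
bucketed = bucket_batches(examples, batch_size=2)
print(padding_needed(naive), padding_needed(bucketed))
```

On this toy data, batching in the original order needs three times as much padding as the bucketed version.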


### Your turn
(thanks Miikka for the data and idea)

In [7]:
with open('data/uralic.train', encoding='utf-8') as f:
    n = 0
    for line in f:
        n += 1
        print(line.strip())
        if n == 10:
            break

сюлонь	myv
вольпасьын	kpv
туялысьлы	kpv
курскоень	myv
кӧнтусь	kpv
todistuksensa	fin
пойпанго	myv
крезьгуръёсыз	udm
korrigeerimiseks	est
чойяс	kpv


The output format for each data point (line) should be
```
{WORD: 'todistuksensa',
 CHAR: ['t', 'o', 'd', 'i', 's', 't', 'u'...],
 LANG: 'fin'}
```
Also, `WORD` and `CHAR` need `"<start>"` prepended and `"<end>"` appended.

In [8]:
PAD = '<pad>'
UNK = '<unk>'
START = '<start>'
END = '<end>'

WORD = Field(sequential=False,
             init_token=START,
             eos_token=END,
             pad_token=PAD,
             unk_token=UNK)

CHAR = Field(sequential=True,
             tokenize=list,
             lower=False,
             init_token=START,
             eos_token=END,
             pad_token=PAD,
             unk_token=UNK)

LANG = Field(sequential=False)

print(WORD.preprocess('afgaaninvinttikoiria'))
print(CHAR.preprocess('afgaaninvinttikoiria'))
print(LANG.preprocess('fin'))

afgaaninvinttikoiria
['a', 'f', 'g', 'a', 'a', 'n', 'i', 'n', 'v', 'i', 'n', 't', 't', 'i', 'k', 'o', 'i', 'r', 'i', 'a']
fin


In [9]:
datafields = [(('word', 'char'), (WORD, CHAR)),
              ('lang', LANG)]
              
train, develop, test = TabularDataset.splits(
    path='data',
    train='uralic.train', validation="uralic.dev", test='uralic.test',
    format='tsv',
    skip_header=False,
    fields=datafields)

ex = next(iter(train))
print(ex.word)
print(ex.char)
print(ex.lang)

WORD.build_vocab(train)
CHAR.build_vocab(train)
LANG.build_vocab(train)

сюлонь
['с', 'ю', 'л', 'о', 'н', 'ь']
myv
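
The tuple entry `(('word', 'char'), (WORD, CHAR))` sends one column through two fields. The sketch below mimics this in plain Python as an illustration only: simple functions stand in for the `WORD`, `CHAR`, and `LANG` fields, and the single-name `lang` entry is written as a one-element tuple to keep the loop uniform.

```python
line = "todistuksensa\tfin"

# One entry per column: (attribute names, processing functions).
# The functions are stand-ins for fields and are assumptions of this sketch.
columns = [(("word", "char"), (lambda x: x, list)),
           (("lang",), (lambda x: x,))]

example = {}
for (names, processors), value in zip(columns, line.split("\t")):
    # The same raw column value is processed once per attached name.
    for name, process in zip(names, processors):
        example[name] = process(value)
print(example)
```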


In [10]:
train_iter = BucketIterator(train,
                            batch_size=5,
                            sort_key=lambda ex: len(ex.char),  # bucket by number of characters
                            shuffle=True)

dev_iter, test_iter = Iterator.splits((develop, test),
                                       batch_sizes=(1, 1),
                                       sort=False,
                                       shuffle=False)

ex = next(iter(train_iter))
print(ex)
print('==========')
print(ex.word)
print(ex.word.size())
print('----------')
print(ex.char)
print(ex.char.size())
print('----------')
print(ex.lang)
print(ex.lang.size())


[torchtext.data.batch.Batch of size 5]
	[.lang]:[torch.LongTensor of size 5]
	[.word]:[torch.LongTensor of size 5]
	[.char]:[torch.LongTensor of size 13x5]
==========
tensor([ 308, 3586, 1715,  578, 5653])
torch.Size([5])
----------
tensor([[ 2,  2,  2,  2,  2],
        [11, 16,  9, 47, 35],
        [20, 25, 23, 22, 31],
        [ 6, 14, 47, 44,  7],
        [47, 49, 23, 39, 25],
        [ 6, 56, 41,  4,  7],
        [39, 10, 23, 19,  3],
        [ 3, 25,  9,  6,  1],
        [ 1,  7,  3, 23,  1],
        [ 1,  3,  1, 13,  1],
        [ 1,  1,  1, 22,  1],
        [ 1,  1,  1,  6,  1],
        [ 1,  1,  1,  3,  1]])
torch.Size([13, 5])
----------
tensor([1, 6, 1, 2, 4])
torch.Size([5])
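
Since sequential tensors have shape (sequence_length, batch_size), each column of `ex.char` is one word, read from top to bottom. A small sketch of decoding such a batch back into tokens is shown below, using nested lists instead of tensors and a toy vocabulary that is purely hypothetical; only the positions of `<pad>`, `<start>`, and `<end>` (indices 1, 2, 3) mirror the output above.

```python
# Toy vocabulary (hypothetical); indices 1, 2, 3 mirror <pad>, <start>, <end>.
itos = ["<unk>", "<pad>", "<start>", "<end>", "a", "b", "c"]

# Shape (sequence_length, batch_size): each COLUMN is one example.
batch = [[2, 2],
         [4, 6],
         [5, 3],
         [3, 1]]


def decode(batch, itos, pad_index=1):
    """Turn a (sequence_length, batch_size) index grid back into tokens."""
    columns = zip(*batch)  # transpose to (batch_size, sequence_length)
    return [[itos[i] for i in col if i != pad_index] for col in columns]


decoded = decode(batch, itos)
print(decoded)
```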
