# Torchtext

- What is it for?
    - Raw data (csv, tsv) $\rightarrow$ Datasets (preprocessed blocks of data)
    - Datasets $\rightarrow$ Iterators (handle numericalizing, batching, packaging, etc.)

<img src="torchtext.png">

(image taken from [here](https://mlexplained.com/2018/02/08/a-comprehensive-tutorial-to-torchtext/))

## Raw data (csv, tsv) $\rightarrow$ Datasets

- `Field`: defines how a given column (field) of the raw data should be processed
    - E.g., `"I want to be tokenized" -> ["I", "want", "to", "be", "tokenized"]`
    - Documentation: you should totally read [it](https://torchtext.readthedocs.io/en/latest/data.html#field)!
    - Common arguments:
        - `sequential` (**True**): whether the field holds sequential data; if **False**, no tokenization is applied.
        - `init_token` (**None**): a token that will be prepended to every example; **None** for no `init_token`.
        - `eos_token` (**None**): a token that will be appended to every example; **None** for no `eos_token`.
        - `lower` (**False**): whether to lowercase the text.
        - `tokenize` (`str.split`): the function used to split a string into tokens.
        - `pad_token` (**"&lt;pad>"**): padding token.
        - `unk_token` (**"&lt;unk>"**): the token used to represent out-of-vocabulary (OOV) words.
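
To make the arguments above concrete, here is a plain-Python sketch of roughly what preprocessing a single string amounts to. It is not torchtext's implementation: `sketch_preprocess` is a hypothetical helper that only mimics the `sequential`, `tokenize`, and `lower` arguments. The `init_token`, `eos_token`, and `pad_token` arguments are left out on purpose, on the assumption that they only matter once examples are padded into batches.

```python
def sketch_preprocess(text, sequential=True, tokenize=str.split, lower=False):
    """Hypothetical stand-in for a field's preprocessing step (not torchtext code)."""
    # Non-sequential data is left untokenized.
    tokens = tokenize(text) if sequential else text
    if lower:
        # Lowercase every token, or the whole string if it was not tokenized.
        tokens = [t.lower() for t in tokens] if sequential else tokens.lower()
    return tokens


s = 'I want to be tokenized'
print(sketch_preprocess(s))                    # ['I', 'want', 'to', 'be', 'tokenized']
print(sketch_preprocess(s, lower=True))        # ['i', 'want', 'to', 'be', 'tokenized']
print(sketch_preprocess(s, sequential=False))  # I want to be tokenized
```
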

In [None]:
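# Field lives in torchtext.data up to torchtext 0.8; it moved to
# torchtext.legacy.data in 0.9 and was removed in 0.12.
from torchtext.data import Field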

s = 'I want to be tokenized'

# First, a short recap of lambda expressions
def tokenizer_explicit(x):
    return x.split(sep=' ')

# which is equivalent to the nameless function: lambda x: x.split(sep=' ')

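WORD_1 = Field(tokenize=tokenizer_explicit)
print(WORD_1.preprocess(s))

WORD_2 = Field(tokenize=lambda x: x.split(sep=' '))
print(WORD_2.preprocess(s))

# sequential=False (an illustrative choice): the string is kept whole, untokenized
WORD_3 = Field(sequential=False)
print(WORD_3.preprocess(s))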

WORD_4 = Field()
print(WORD_4.preprocess(s))

- `NestedField`: a field that holds another field (called the nesting field). It accepts an untokenized string, tokenizes it, and passes each token to the nesting field for further processing.
    - E.g., `"I want to be nested-tokenized" -> [["I"], ["w", "a", "n", "t"], ["t", "o"]...]`
    - [Documentation](https://torchtext.readthedocs.io/en/latest/data.html#nestedfield)
    - Common arguments:
        - `nesting_field` (**Field**): a field contained in this nested field.
        - `tokenize` (`str.split`): the function used to split the string into tokens before each token is handed to the nesting field.
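
The two levels of tokenization can be sketched in plain Python. This is only an illustration of the idea, not how torchtext implements it; `sketch_nested_preprocess` and its parameter names are hypothetical.

```python
def sketch_nested_preprocess(text, tokenize=str.split, nesting_tokenize=list):
    """Hypothetical helper: split into words, then split every word again."""
    # Outer level: the nested field's own tokenizer.
    # Inner level: the nesting field's tokenizer (here: characters).
    return [nesting_tokenize(word) for word in tokenize(text)]


print(sketch_nested_preprocess('I want to'))
# [['I'], ['w', 'a', 'n', 't'], ['t', 'o']]

# A different outer tokenizer, e.g. for underscore-separated input
print(sketch_nested_preprocess('I_want_to', tokenize=lambda x: x.split('_')))
# [['I'], ['w', 'a', 'n', 't'], ['t', 'o']]
```
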

In [None]:
from torchtext.data import NestedField

s = 'I want to be nested-tokenized'

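# The nesting field splits every word into characters
NESTING_FIELD = Field(tokenize=list)
WORD_5 = NestedField(NESTING_FIELD)
print(WORD_5.preprocess(s))

s = 'I_want_to_be_nested-tokenized'
NESTING_FIELD = Field(tokenize=list)
# The outer tokenizer now splits on underscores instead of whitespace
WORD_6 = NestedField(NESTING_FIELD, tokenize=lambda x: x.split('_'))
print(WORD_6.preprocess(s))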

### Your turn

#### Q1
```
"I want to be character-tokenized" -> ["i", "w", "a", "n", "t", "t", "o"...]
```

In [None]:
s = 'I want to be character-tokenized'

WORD = Field()
print(WORD.preprocess(s))

#### Q2

```
"I want to be bi-gramized" -> [("i", "want"), ("want", "to"), ("to", "be")...]
```

In [None]:
s = 'I want to be bi-gramized'

BI_GRAM = Field()
print(BI_GRAM.preprocess(s))

- `Dataset`: e.g., `TabularDataset` (for csv, tsv, or json files) and `SequenceTaggingDataset`

```
1,"Something wise.",Someone sometime ->

WORDS: ["Something", "wise."]
NAME: "Someone sometime"
```
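
The mapping above can be mimicked with the standard library alone. The sketch below is not what `TabularDataset` does internally; it only shows the idea of reading one csv row, skipping the id column, tokenizing one column, and keeping the other whole. The dictionary layout is an assumption made for illustration.

```python
import csv
import io

raw = '1,"Something wise.",Someone sometime\n'

# csv.reader takes care of the quoted second column.
row = next(csv.reader(io.StringIO(raw)))
_id, words, name = row

example = {
    'WORDS': words.split(),  # sequential column: tokenized
    'NAME': name,            # non-sequential column: kept as one string
}
print(example)
# {'WORDS': ['Something', 'wise.'], 'NAME': 'Someone sometime'}
```
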


In [None]:
from torchtext.data import TabularDataset

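# Each entry is (attribute name, Field); None skips a column (here: the leading id).
# Assumption: WORD_2 handles the tokenized column and WORD_3 the untokenized one.
tv_datafields = [('id', None), ('WORDS', WORD_2), ('NAME', WORD_3)]

# NOTE: the path and file names below are hypothetical placeholders.
trn, dev = TabularDataset.splits(
    path='data', train='train.csv', validation='dev.csv',
    format='csv', fields=tv_datafields)

tst_datafields = [('id', None), ('WORDS', WORD_2), ('NAME', WORD_3)]

tst = TabularDataset(path='data/test.csv', format='csv', fields=tst_datafields)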

ex = next(iter(trn))
print(ex.__dict__.keys())
print(ex.WORDS)
print(ex.NAME)

ex = next(iter(tst))
print(ex.__dict__.keys())
print(ex.WORDS)
print(ex.NAME)

# Build the vocabularies from the training set only
WORD_2.build_vocab(trn)
WORD_3.build_vocab(trn)
print(WORD_2.vocab.stoi)
print(WORD_3.vocab.stoi)
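
For intuition, `stoi` is a string-to-index mapping. A minimal plain-Python sketch of how such a mapping could be built is shown below. The helper `sketch_build_stoi` is hypothetical, and the ordering (special tokens first, then tokens by descending frequency with alphabetical tie-breaking) is an assumption about typical behaviour rather than a guarantee about torchtext.

```python
from collections import Counter


def sketch_build_stoi(tokenized_examples, specials=('<unk>', '<pad>')):
    """Hypothetical helper: map every token seen in training to an integer index."""
    counts = Counter(tok for ex in tokenized_examples for tok in ex)
    # Most frequent tokens first; ties are broken alphabetically.
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    itos = list(specials) + ordered
    return {tok: idx for idx, tok in enumerate(itos)}


stoi = sketch_build_stoi([['to', 'be', 'or', 'not', 'to', 'be']])
print(stoi)
# {'<unk>': 0, '<pad>': 1, 'be': 2, 'to': 3, 'not': 4, 'or': 5}

# Words never seen in training fall back to the index of '<unk>'
print(stoi.get('question', stoi['<unk>']))  # 0
```
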

## Datasets $\rightarrow$ Iterators
- `Iterator`: Defines an iterator that loads batches of data from a dataset.
    - [Documentation](https://torchtext.readthedocs.io/en/latest/data.html#iterators)
    - Common arguments:
        - `dataset`
        - `batch_size`
        - `sort_key`: a key to use for sorting examples.
        - `shuffle`: whether to shuffle examples between epochs.
        - `sort`: whether to sort examples according to `sort_key`.
        - `sort_within_batch`: whether to sort within each batch.
        - `device` (**cpu**)

- `BucketIterator`: Defines an iterator that batches examples of similar lengths together. Minimizes amount of padding needed while producing freshly shuffled batches for each new epoch.
    - [Documentation](https://torchtext.readthedocs.io/en/latest/data.html#bucketiterator)
    - Common arguments:
        - `dataset`
        - `batch_size`
        - `sort_key`
        - `shuffle`
        - `sort`
        - `device`
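
Why does batching examples of similar length reduce padding? The toy comparison below counts the pad tokens needed with and without sorting by length. It is a simplified sketch: `pad_count` and `chunk` are hypothetical helpers, and a real `BucketIterator` additionally shuffles the resulting batches between epochs.

```python
def pad_count(batches):
    """Pad tokens needed if every batch is padded to its longest example."""
    return sum(
        max(len(ex) for ex in batch) * len(batch) - sum(len(ex) for ex in batch)
        for batch in batches
    )


def chunk(examples, batch_size):
    """Cut a list of examples into consecutive batches."""
    return [examples[i:i + batch_size] for i in range(0, len(examples), batch_size)]


# Toy "examples": token lists of lengths 1, 7, 2, 8, 1, 9
examples = [['w'] * n for n in (1, 7, 2, 8, 1, 9)]

plain = chunk(examples, batch_size=2)
bucketed = chunk(sorted(examples, key=len), batch_size=2)

print(pad_count(plain))     # 20
print(pad_count(bucketed))  # 6
```
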

In [None]:
from torchtext.data import Iterator, BucketIterator

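# batch_size=2 is an arbitrary choice for illustration
trn_iter = BucketIterator(trn, batch_size=2, sort_key=lambda ex: len(ex.WORDS), shuffle=True)

dev_iter, tst_iter = Iterator.splits((dev, tst), batch_size=2, sort=False, shuffle=False)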

ex = next(iter(trn_iter))
print(ex)
print('==========')
print(ex.WORDS)
print(ex.WORDS.size())  # (sequence_length, batch_size)
print('----------')
print(ex.NAME)
print(ex.NAME.size())  # (batch_size)
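
The `(sequence_length, batch_size)` layout can be reproduced by hand. The sketch below pads a batch of token lists, maps tokens to indices, and transposes the result. It uses nested lists instead of tensors, and both `sketch_numericalize` and the small `stoi` mapping are made up for illustration.

```python
def sketch_numericalize(batch, stoi, pad_token='<pad>', unk_token='<unk>'):
    """Hypothetical helper: pad a batch of token lists and map tokens to indices.

    Returns a nested list laid out as (sequence_length, batch_size).
    """
    max_len = max(len(tokens) for tokens in batch)
    padded = [tokens + [pad_token] * (max_len - len(tokens)) for tokens in batch]
    ids = [[stoi.get(tok, stoi[unk_token]) for tok in tokens] for tokens in padded]
    # Transpose from (batch_size, sequence_length) to (sequence_length, batch_size).
    return [list(column) for column in zip(*ids)]


stoi = {'<unk>': 0, '<pad>': 1, 'Something': 2, 'wise.': 3}
batch = [['Something', 'wise.'], ['Something', 'else', 'wise.']]

matrix = sketch_numericalize(batch, stoi)
print(matrix)                       # [[2, 2], [3, 0], [1, 3]]
print(len(matrix), len(matrix[0]))  # 3 2
```
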

### Your turn
(thanks Miikka for the data and idea)

In [None]:
with open('data/uralic.train', encoding='utf-8') as f:
    n = 0
    for line in f:
        n += 1
        print(line.strip())
        if n == 10:
            break

The output format for each data point (line) should be
```
{WORD: 'todistuksensa',
 CHAR: ['t', 'o', 'd', 'i', 's', 't', 'u'...],
 LANG: 'fin'}
```
Also, `WORD` and `CHAR` need to be prepended with `"<start>"` and appended with `"<end>"`.

In [None]:
PAD = '<pad>'
UNK = '<unk>'
START = '<start>'
END = '<end>'

WORD = Field()

CHAR = Field()

LANG = Field()

print(WORD.preprocess('afgaaninvinttikoiria'))
print(CHAR.preprocess('afgaaninvinttikoiria'))
print(LANG.preprocess('fin'))

In [None]:
datafields = []
              
train, develop, test = TabularDataset

ex = next(iter(train))
print(ex.word)
print(ex.char)
print(ex.lang)

# build the vocab

In [None]:
train_iter = 

dev_iter, test_iter = 

ex = next(iter(train_iter))
print(ex)
print('==========')
print(ex.word)
print(ex.word.size())
print('----------')
print(ex.char)
print(ex.char.size())
print('----------')
print(ex.lang)
print(ex.lang.size())