## Purpose
This notebook is a step-by-step introduction to torchtext.

#### Tokenize and numericalize
We first define a `Field`, the torchtext datatype that describes how to process text.

In [25]:
import torchtext
import torch

In [None]:
field = torchtext.data.Field()

txt = 'This is simple sentence .'

# tokenize
tokens = field.tokenize(txt)
print(f'tokens: {tokens}')

# build the vocabulary
field.build_vocab(tokens)

Let's have a look at the vocabulary frequencies.

In [None]:
field.vocab.freqs


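The counts are at the character level. The tokenizer itself splits on words, but `build_vocab` expects an iterable of examples, each of which is an iterable of tokens, so every token string was treated as an example and counted character by character. Let's tokenize again and pass a list of token lists.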

In [None]:
field = torchtext.data.Field()
tokens = field.tokenize(txt)

# we use a list of list of tokens
field.build_vocab([tokens])

# check the word frequencies
print(f'frequencies: {field.vocab.freqs}')

# words in vocab
print(f'vocab words: {list(field.vocab.stoi.keys())}')
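
The difference between the two `build_vocab` calls can be reproduced without torchtext. The sketch below is only an illustration built on `collections.Counter`, not torchtext's actual implementation; `count_tokens` is a hypothetical helper that assumes every element of its input is one example made of tokens.

```python
from collections import Counter

def count_tokens(examples):
    """Count tokens, treating each element of `examples` as one example."""
    counter = Counter()
    for example in examples:
        # Counter.update iterates over its argument, so a bare string
        # is consumed character by character.
        counter.update(example)
    return counter

tokens = 'This is simple sentence .'.split()

char_counts = count_tokens(tokens)    # flat list: each string acts as an example
word_counts = count_tokens([tokens])  # list of lists: one example of five tokens

print(char_counts.most_common(3))
print(word_counts)
```

Wrapping the token list in another list is what turns the character counts into word counts.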

Now let us numericalize the tokens.

In [None]:
data = field.numericalize([tokens])
print('data.shape: ', data.shape)
data
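
To make the numericalization step concrete, here is a minimal pure-Python sketch. The helpers `build_stoi` and `numericalize` are hypothetical, and the index ordering (special tokens first, then descending frequency with alphabetical tie-breaking) is an assumption of this sketch, so the actual indices in `field.vocab.stoi` may differ. The result uses the same layout as above: one row per time step and one column per example.

```python
from collections import Counter

def build_stoi(examples, specials=('<unk>', '<pad>')):
    """Hypothetical helper: map each token to an integer index."""
    counter = Counter(tok for example in examples for tok in example)
    # Assumed ordering: specials first, then by descending frequency,
    # ties broken alphabetically.
    ordered = sorted(counter, key=lambda tok: (-counter[tok], tok))
    return {tok: idx for idx, tok in enumerate(list(specials) + ordered)}

def numericalize(examples, stoi):
    """Return a [seq_len][batch] nested list (examples must be equal length)."""
    unk = stoi['<unk>']
    rows = zip(*examples)  # transpose so that each row is one time step
    return [[stoi.get(tok, unk) for tok in row] for row in rows]

tokens = 'This is simple sentence .'.split()
stoi = build_stoi([tokens])
data = numericalize([tokens], stoi)

print(len(data), len(data[0]))  # sequence length, batch size
print(data)
```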

### Batchifying for a language model
When training a transformer language model, how do we create batches and feed the data efficiently during training?
This was not very clear to me from the tutorial at https://pytorch.org/tutorials/beginner/transformer_tutorial.html

In [4]:
txt = 'abcdefghijklmnopqrstuvwxyz'
txt = ' '.join([o.upper() for o in txt])
print(txt)

rfield = torchtext.data.ReversibleField(init_token='<sos>',
                                      eos_token='<eos>',
                                      lower=True)

tokens = rfield.tokenize(txt)
rfield.build_vocab([tokens])
data = rfield.numericalize([tokens])
data.shape

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z


torch.Size([26, 1])

In [5]:
def batchify2(data, bsz):
    '''Input is a tensor of token indices.'''
    # number of rows each of the bsz columns will hold
    nbatch = data.size(0) // bsz

    # trim off any extra elements that wouldn't cleanly fit (remainders)
    data = data.narrow(0, 0, nbatch * bsz)

    # lay the data out as bsz contiguous columns: shape [nbatch, bsz]
    data = data.view(bsz, -1).t().contiguous()
    return data

In [7]:
batches = batchify2(data, bsz=4)
print(batches.shape)
rfield.reverse(batches)

torch.Size([6, 4])


['A B C D E F', 'G H I J K L', 'M N O P Q R', 'S T U V W X']
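
The layout that `batchify2` produces is easier to see with plain lists. The sketch below is a hypothetical list-based analogue (`batchify_list` is not part of torchtext or PyTorch): it drops the remainder, cuts the sequence into `bsz` contiguous chunks, and places each chunk in its own column.

```python
def batchify_list(seq, bsz):
    """Hypothetical list-based analogue of batchify2."""
    nbatch = len(seq) // bsz   # rows per column
    seq = seq[:nbatch * bsz]   # drop the remainder
    # cut into bsz contiguous chunks; each chunk becomes one column
    columns = [seq[c * nbatch:(c + 1) * nbatch] for c in range(bsz)]
    # transpose: row r holds the r-th element of every column
    return [list(row) for row in zip(*columns)]

letters = list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
rows = batchify_list(letters, bsz=4)

for row in rows:
    print(' '.join(row))
```

Reading down a column gives consecutive letters, while each row mixes four independent streams. With 26 letters and `bsz=4`, the last two letters do not fit and are discarded.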

In [8]:
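bptt = 3  # maximum sequence length of one batch

def get_batch(source, i):
    print(f'source.shape: {source.shape}')
    # the last row has no next-token target, so stop one row early
    seq_len = min(bptt, len(source) - 1 - i)
    data = source[i:i+seq_len]
    # targets are the inputs shifted by one row, flattened for the loss
    target = source[i+1:i+1+seq_len].view(-1)
    return data, target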

In [23]:
x, y = get_batch(batches, 0)
print(f'x.shape - {x.shape}')
print(f'y.shape - {y.shape}')
print(f'x: {rfield.reverse(x)}')
print(f'y: {rfield.reverse(y.view(y.shape[0],-1))}')

source.shape: torch.Size([6, 4])
x.shape - torch.Size([3, 4])
y.shape - torch.Size([12])
x: ['A B C', 'G H I', 'M N O', 'S T U']
y: ['B H N T C I O U D J P V']


<img src="img/lm_batches.jpg" width="480">
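
A single call to `get_batch` only returns one slice. A training loop would typically step through the rows `bptt` at a time; the sketch below shows that pattern with plain lists. `get_batch_list` is a hypothetical analogue of `get_batch`, and the loop bounds are an assumption of this sketch, chosen so that the last row is only ever used as a target.

```python
def get_batch_list(rows, i, bptt):
    """Hypothetical list-based analogue of get_batch."""
    seq_len = min(bptt, len(rows) - 1 - i)
    data = rows[i:i + seq_len]
    # targets are the next rows, flattened row by row
    target = [tok for row in rows[i + 1:i + 1 + seq_len] for tok in row]
    return data, target

# the 6 x 4 layout produced by batchify2 above
columns = ['ABCDEF', 'GHIJKL', 'MNOPQR', 'STUVWX']
rows = [list(row) for row in zip(*columns)]

bptt = 3
batches_seen = []
# step through the rows bptt at a time; the last row is only ever a target
for i in range(0, len(rows) - 1, bptt):
    x, y = get_batch_list(rows, i, bptt)
    batches_seen.append((x, y))
    print(f'i={i}: x has {len(x)} rows, y has {len(y)} targets')
```

The final slice is shorter than `bptt` because only two input rows remain that still have a next-row target.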

In [26]:
def _generate_square_subsequent_mask(sz):
    # True on and below the diagonal (lower triangle), False elsewhere
    return torch.tril(torch.ones(sz, sz)) == 1.0

_generate_square_subsequent_mask(bptt)

tensor([[ True, False, False],
        [ True,  True, False],
        [ True,  True,  True]])
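
To see what such a mask does to attention weights, here is a small pure-Python sketch. It assumes that `True` means "this position may be attended to", matching the comment in the cell above; library APIs may use the opposite convention for boolean masks, so check before passing one directly. `causal_mask` and `masked_softmax` are hypothetical helpers written for this illustration.

```python
import math

def causal_mask(sz):
    """Lower-triangular mask: position `row` may attend to `col` <= `row`."""
    return [[col <= row for col in range(sz)] for row in range(sz)]

def masked_softmax(scores, mask):
    """Row-wise softmax where masked-out scores are replaced by -inf."""
    weights = []
    for score_row, mask_row in zip(scores, mask):
        masked = [s if keep else float('-inf')
                  for s, keep in zip(score_row, mask_row)]
        peak = max(masked)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

mask = causal_mask(3)
# uniform scores, so the allowed positions share the weight equally
weights = masked_softmax([[0.0] * 3 for _ in range(3)], mask)

for row in weights:
    print([round(w, 3) for w in row])
```

Each position puts zero weight on later positions, so a token can never look at the targets it is supposed to predict.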