# Prprocess custom text data using Torchtext

Illustrates the usage of torchtext on a dataset that is not built-in. We will preprocess a dataset that can be further utilized to train a sequeence to sequence model for machine translation without using legacy version of torchtext.

The notebook will cover:
- read a dataset
- Tokenize sentence
- apply transformation to sentence
- perform bucketing batching

Link: https://pytorch.org/tutorials/beginner/torchtext_custom_dataset_tutorial.html

In [3]:
import torchdata.datapipes as dp
import torchtext.transforms as T
import spacy
from torchtext.vocab import build_vocab_from_iterator

In [4]:
eng = spacy.load("en_core_web_sm")
de = spacy.load("de_core_news_sm")

Load the dataset

In [17]:
FILE_PATH = 'deu.txt'
data_pip = dp.iter.IterableWrapper([FILE_PATH])
data_pip = dp.iter.FileOpener(data_pip, mode='rb')
data_pip = data_pip.parse_csv(skip_lines=0, delimiter='\t', as_tuple=True)

In the above code block, we are doing following things:
1. We are creating an iterable of filenames
2. We pass the iterable to fileopener which then opens the files in read mode
3. We call a function to parse the file, which again returns an iterable of tuples representing each rows of the tab delimited file

We can verify if the iterable has the pair of sentences as shown below

In [18]:
for sample in data_pip:
    print(sample)
    break

('Go.', 'Geh.', 'CC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #8597805 (Roujin)')


note that we also have attribution details along with pairs of sentences. We will write a small function to remove the attribution details

In [19]:
def removeAttribution(row):
    """Function to keep the first element in a tumple"""
    return row[:2]

In [20]:
data_pip = data_pip.map(removeAttribution)

verify the data_pipe only contains pair of sentences

In [21]:
for sample in data_pip:
    print(sample)
    break

('Go.', 'Geh.')


Next we need to define a few function to perform tokenization:

In [22]:
def engTokenize(text):
    """Tokenize an english text and return a list of tokens"""
    return [token.text for token in eng.tokenizer(text)]

def deTokenize(text):
    """Tokenize a german text and return a list of tokens"""
    return [token.text for token in de.tokenizer(text)]

Test out the functions above to confirm they work

In [23]:
print(engTokenize("have a good date!!!"))

['have', 'a', 'good', 'date', '!', '!', '!']


In [24]:
print(deTokenize("haben sie einen guten tag!!!"))

['haben', 'sie', 'einen', 'guten', 'tag', '!', '!', '!']


## Building the Voacbulary

Lets us consider an english sentence as the source and german sentence as the target.

Vocabulary can be considered as the set of unique words we have in the dataset. We will build vocabulary for both our source and target now.

We neeed to define a function to get tokens from elements of tuples in the iterator

In [25]:
def getTokens(data_iter, place):
    """
    Function to yield tokens from an iterator. Since our iterator contains tuples of sentences
    (sorce and target), "place" parameters defines for which index to return the token for. 
    "place = 0" for source and "place=1" for traget
    """
    for english, german in data_iter:
        if place == 0:
            yield english(english)
        else:
            yield deTokenize(german)

Now we will build vocabulary for source

In [26]:
source_vocab = build_vocab_from_iterator(
    getTokens(data_pip, 2),
    min_freq=2,
    specials=['<pad>', '<sos>', '<eos>', '<unk>'],
    special_first=True
)

source_vocab.set_default_index(source_vocab['<unk>'])

The code above builds the vocabulary from the iterator:
1. getokens function places=0 as we need vocabulary for source sentences
2. set min_feq=2 means the function will skip those words that occurs less than 2 times
3. We specify some special tokens
    - sos for the start of sentence
    - eos for the end of sentence
    - unk for unknown words
    - pad is padding token
4. set special_first=true which means pad will get index 0, sos will be index 1, eos index 2 and unk will get index 3 in the vocabulary
5. set the default index as index of unk. If some words is not in the vocabulary we will use unk instead of that unknown word

Similarly, we will build vocabulary for target sentences 

In [27]:
target_vocab = build_vocab_from_iterator(
    getTokens(data_pip, 1),
    min_freq=2,
    specials=['<pad>', '<sos>', '<eos>', '<unk>'],
    special_first=True
)

target_vocab.set_default_index(target_vocab['<unk>'])

Example above shows how we can add special tokens to our vocabulary. Special tokens may change based on the requirements.

Now we can verify that special tokens are placed at the beginning and then othe words. in the below code reutrns a list which tokens at index based on vocabulary

In [28]:
print(source_vocab.get_itos()[:9])

['<pad>', '<sos>', '<eos>', '<unk>', '.', ',', 'Tom', 'Ich', '?']


## Numericalize sentence using vocabulary

After building the vocabulary, we need to convert our sentences to corresponding indcies.

We will need additional functions for this.