# Preprocess custom text data using torchtext

This notebook illustrates the usage of torchtext on a dataset that is not built in. We will preprocess a dataset that can then be used to train a sequence-to-sequence model for machine translation, without using the legacy version of torchtext.

The notebook will cover how to:
- Read a dataset
- Tokenize sentences
- Apply transformations to sentences
- Perform bucket batching

Link: https://pytorch.org/tutorials/beginner/torchtext_custom_dataset_tutorial.html

In [3]:
import torchdata.datapipes as dp
import torchtext.transforms as T
import spacy
from torchtext.vocab import build_vocab_from_iterator

In [4]:
eng = spacy.load("en_core_web_sm")
de = spacy.load("de_core_news_sm")

Load the dataset

In [17]:
FILE_PATH = 'deu.txt'
data_pip = dp.iter.IterableWrapper([FILE_PATH])
data_pip = dp.iter.FileOpener(data_pip, mode='rb')
data_pip = data_pip.parse_csv(skip_lines=0, delimiter='\t', as_tuple=True)

In the code block above, we do the following:
1. Create an iterable of filenames.
2. Pass the iterable to FileOpener, which opens each file in binary read mode ('rb').
3. Call parse_csv to parse the file, which again returns an iterable, this time of tuples, each representing one row of the tab-delimited file.
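
To make the three steps concrete, here is a minimal standard-library sketch of the same idea. It is not how torchdata implements its datapipes; the helper name `iter_rows` and the two-line sample file are hypothetical, and the attribution text is shortened for illustration.

```python
import csv
import os
import tempfile


def iter_rows(paths, delimiter="\t"):
    """Lazily yield every row of every file in `paths` as a tuple."""
    for path in paths:  # step 1: iterate over the filenames
        # step 2: open the file for reading
        with open(path, mode="r", encoding="utf-8", newline="") as handle:
            # step 3: parse each tab-delimited line into a tuple
            for row in csv.reader(handle, delimiter=delimiter):
                yield tuple(row)


# Hypothetical sample data in the same three-column layout as deu.txt
sample_text = (
    "Go.\tGeh.\tCC-BY 2.0 (sample attribution)\n"
    "Hi.\tHallo!\tCC-BY 2.0 (sample attribution)\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write(sample_text)
    rows = list(iter_rows([sample_path]))

print(rows[0])
```

Like the datapipe, the generator only reads rows when something iterates over it, so nothing is loaded until the rows are actually requested.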

We can verify that the iterable yields pairs of sentences, as shown below:

In [18]:
for sample in data_pip:
    print(sample)
    break

('Go.', 'Geh.', 'CC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #8597805 (Roujin)')


Note that each row also contains attribution details along with the pair of sentences. We will write a small function to remove the attribution details.

In [19]:
def removeAttribution(row):
    """Keep the first two elements of a tuple (the sentence pair)"""
    return row[:2]

In [20]:
data_pip = data_pip.map(removeAttribution)

Verify that data_pip now contains only pairs of sentences:

In [21]:
for sample in data_pip:
    print(sample)
    break

('Go.', 'Geh.')
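
The effect of mapping a function over rows can be reproduced on a plain Python list, which may help when debugging without the data file. This is only a sketch: the rows are hypothetical samples, `remove_attribution` is a renamed copy of the function above, and Python's built-in `map` stands in for the datapipe's `.map`.

```python
def remove_attribution(row):
    """Keep the first two elements of a row (the sentence pair)."""
    return row[:2]


# Hypothetical rows in the same shape as the parsed file
rows = [
    ("Go.", "Geh.", "CC-BY 2.0 (sample attribution)"),
    ("Hi.", "Hallo!", "CC-BY 2.0 (sample attribution)"),
]

# The built-in map is lazy: the function runs only as items are requested
pairs = map(remove_attribution, rows)
first_pair = next(pairs)
print(first_pair)
```

Slicing with `row[:2]` also works on rows that have fewer than three fields, so a line with no attribution column would pass through unchanged rather than raise an error.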


Next, we need to define a few functions to perform tokenization:

In [22]:
def engTokenize(text):
    """Tokenize an English text and return a list of tokens"""
    return [token.text for token in eng.tokenizer(text)]

def deTokenize(text):
    """Tokenize a German text and return a list of tokens"""
    return [token.text for token in de.tokenizer(text)]

Test out the functions above to confirm they work

In [23]:
print(engTokenize("have a good date!!!"))

['have', 'a', 'good', 'date', '!', '!', '!']


In [24]:
print(deTokenize("haben sie einen guten tag!!!"))

['haben', 'sie', 'einen', 'guten', 'tag', '!', '!', '!']
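
Since each item in the data pipe is an (English, German) pair, the two tokenizers would typically be applied side by side. The sketch below shows one possible way to do that; `tokenize_pair` is a hypothetical helper, not part of torchtext, and a plain whitespace split stands in for the spaCy tokenizers so the block runs without the language models installed.

```python
def tokenize_pair(pair, source_tokenize, target_tokenize):
    """Tokenize both sentences of a (source, target) pair."""
    source_text, target_text = pair
    return source_tokenize(source_text), target_tokenize(target_text)


def whitespace_tokenize(text):
    """Stand-in tokenizer (assumption): splits on whitespace only."""
    return text.split()


# Hypothetical pair with punctuation already separated by spaces
tokens = tokenize_pair(
    ("have a good day !", "haben sie einen guten tag !"),
    whitespace_tokenize,
    whitespace_tokenize,
)
print(tokens)
```

In the notebook itself, engTokenize and deTokenize would be passed in place of `whitespace_tokenize`. Unlike the stand-in, they separate punctuation that is attached to a word.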
