# Prprocess custom text data using Torchtext

Illustrates the usage of torchtext on a dataset that is not built-in. We will preprocess a dataset that can be further utilized to train a sequeence to sequence model for machine translation without using legacy version of torchtext.

The notebook will cover:
- read a dataset
- Tokenize sentence
- apply transformation to sentence
- perform bucketing batching

Link: https://pytorch.org/tutorials/beginner/torchtext_custom_dataset_tutorial.html

In [1]:
import torchdata.datapipes as dp
import torchtext.transforms as T
import spacy
from torchtext.vocab import build_vocab_from_iterator



In [2]:
eng = spacy.load("en_core_web_sm")
de = spacy.load("de_core_news_sm")

Load the dataset

In [3]:
FILE_PATH = 'deu.txt'
data_pipe = dp.iter.IterableWrapper([FILE_PATH])
data_pipe = dp.iter.FileOpener(data_pipe, mode='rb')
data_pipe = data_pipe.parse_csv(skip_lines=0, delimiter='\t', as_tuple=True)

In the above code block, we are doing following things:
1. We are creating an iterable of filenames
2. We pass the iterable to fileopener which then opens the files in read mode
3. We call a function to parse the file, which again returns an iterable of tuples representing each rows of the tab delimited file

We can verify if the iterable has the pair of sentences as shown below

In [4]:
for sample in data_pipe:
    print(sample)
    break

('Go.', 'Geh.', 'CC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #8597805 (Roujin)')


note that we also have attribution details along with pairs of sentences. We will write a small function to remove the attribution details

In [5]:
def removeAttribution(row):
    """Function to keep the first element in a tumple"""
    return row[:2]

In [6]:
data_pipe = data_pipe.map(removeAttribution)

verify the data_pipe only contains pair of sentences

In [7]:
for sample in data_pipe:
    print(sample)
    break

('Go.', 'Geh.')


Next we need to define a few function to perform tokenization:

In [8]:
def engTokenize(text):
    """Tokenize an english text and return a list of tokens"""
    return [token.text for token in eng.tokenizer(text)]

def deTokenize(text):
    """Tokenize a german text and return a list of tokens"""
    return [token.text for token in de.tokenizer(text)]

Test out the functions above to confirm they work

In [9]:
print(engTokenize("have a good date!!!"))

['have', 'a', 'good', 'date', '!', '!', '!']


In [10]:
print(deTokenize("haben sie einen guten tag!!!"))

['haben', 'sie', 'einen', 'guten', 'tag', '!', '!', '!']


## Building the Voacbulary

Lets us consider an english sentence as the source and german sentence as the target.

Vocabulary can be considered as the set of unique words we have in the dataset. We will build vocabulary for both our source and target now.

We neeed to define a function to get tokens from elements of tuples in the iterator

In [11]:
def getTokens(data_iter, place):
    """
    Function to yield tokens from an iterator. Since our iterator contains tuples of sentences
    (sorce and target), "place" parameters defines for which index to return the token for. 
    "place = 0" for source and "place=1" for traget
    """
    for english, german in data_iter:
        if place == 0:
            yield english(english)
        else:
            yield deTokenize(german)

Now we will build vocabulary for source

In [12]:
source_vocab = build_vocab_from_iterator(
    getTokens(data_pipe, 2),
    min_freq=2,
    specials=['<pad>', '<sos>', '<eos>', '<unk>'],
    special_first=True
)

source_vocab.set_default_index(source_vocab['<unk>'])

The code above builds the vocabulary from the iterator:
1. getokens function places=0 as we need vocabulary for source sentences
2. set min_feq=2 means the function will skip those words that occurs less than 2 times
3. We specify some special tokens
    - sos for the start of sentence
    - eos for the end of sentence
    - unk for unknown words
    - pad is padding token
4. set special_first=true which means pad will get index 0, sos will be index 1, eos index 2 and unk will get index 3 in the vocabulary
5. set the default index as index of unk. If some words is not in the vocabulary we will use unk instead of that unknown word

Similarly, we will build vocabulary for target sentences 

In [13]:
target_vocab = build_vocab_from_iterator(
    getTokens(data_pipe, 1),
    min_freq=2,
    specials=['<pad>', '<sos>', '<eos>', '<unk>'],
    special_first=True
)

target_vocab.set_default_index(target_vocab['<unk>'])

Example above shows how we can add special tokens to our vocabulary. Special tokens may change based on the requirements.

Now we can verify that special tokens are placed at the beginning and then othe words. in the below code reutrns a list which tokens at index based on vocabulary

In [14]:
print(source_vocab.get_itos()[:9])

['<pad>', '<sos>', '<eos>', '<unk>', '.', ',', 'Tom', 'Ich', '?']


## Numericalize sentence using vocabulary

After building the vocabulary, we need to convert our sentences to corresponding indcies.

We will need additional functions for this.

In [15]:
def getTransform(vocab):
    """
    Create transforms based on given vocabulary. 
    The return transform is applied to sequence of tokens
    """
    text_transform = T.Sequential(
        # converts the sentences to indices based on given vocabulary
        T.VocabTransform(vocab=vocab),
        # add sos at beginning of each sentence. 1 because of the index of sos in vocabulary is 
        # 1 as seen in previous section
        T.AddToken(1, begin=True),
        # add eos at beginning of each sentence. 2 because of the index of eos
        T.AddToken(2, begin=False)
    )
    return text_transform

lets see how to use the above function. It returns an object of transforms which will use on our sentences. 

In [16]:
temp_list = list(data_pipe)
some_sentence = temp_list[798][0]
print("some sentence=", end="")
print(some_sentence)
transform_sentence = getTransform(source_vocab)(engTokenize(some_sentence))
print("Transform sentence=", end="")
print(transform_sentence)
index_to_string = source_vocab.get_itos()
for index in transform_sentence:
    print(index_to_string[index], end=" ")

some sentence=I changed.
Transform sentence=[1, 19362, 3, 4, 2]
<sos> I <unk> . <eos> 

explaination above the above code:
- take a source sentence from list that we create from data_pipe
- we get a transform based on source vocabulary and apply it to a tokenized sentence
    - transforms take list of words and not sentence
- we get the mapping of index to string and then use it get the transformed sentence

now we will use datapipe functions to apply trnasform to all our sentences. We will need additional functions for this.

In [17]:
def applyTransform(sequence_pair):
    """ 
    apply transforms to sequence of tokens in a sequence pair
    """
    return (
        getTransform(source_vocab)(engTokenize(sequence_pair[0])),
        getTransform(target_vocab)(deTokenize(sequence_pair[1]))
    )

In [18]:
data_pipe = data_pipe.map(applyTransform) # apply the function to each elment
temp_list = list(data_pipe)
print(temp_list[0])

([1, 3, 4, 2], [1, 743, 4, 2])


## Make batches (with bucket batch)

Generally, we train models in batches. While working for sequence to sequence models, its recommended to keep the length of sequence in batch similar. 

For that we will use bucketbatch function of data pip

In [19]:
def sortBucket(bucket):
    """ 
    function to sort a given bucket. 
    Here we want to sort based on length of source and target sequence
    """
    return sorted(bucket, key=lambda x: (len(x[0]), len([1])))

apply the bucketbatch function

In [20]:
data_pipe = data_pipe.bucketbatch(
    batch_size = 4, batch_num=5,  bucket_num=1,
    use_in_batch_shuffle=False, sort_key=sortBucket
)

above code will:
- keepp the batch size to 4
- batch num is the number of batches to keep in a bucket
- bucket num is the number of buckets to keep in a pool for shuffling
- sort key specifices the function that takes a bucket and sorts it

Lets us consider a batch of sorce sentences as x and a batch of targets sentences as y. Generally, while training a model, we predict on a batch of x and compare the result with y.

But a batch in our data_pip is of the form 

In [21]:
print(list(data_pipe)[0])

[([1, 3, 24, 2], [1, 16076, 24, 2]), ([1, 3, 3, 4, 2], [1, 1596, 31, 24, 2]), ([1, 3, 3, 4, 2], [1, 3780, 31, 24, 2]), ([1, 3, 3, 4, 2], [1, 375, 12, 31, 24, 2])]


now we will convert them into the form ((X_1,X_2,X_3,X_4), (y_1,y_2,y_3,y_4))

In [23]:
def seperateSourceTarget(sequence_pairs):
    """ 
    input of form [(X_1,y_1), (X_2,y_2), (X_3,y_3), (X_4,y_4)]
    output of form ((X_1,X_2,X_3,X_4), (y_1,y_2,y_3,y_4))
    """
    source, target = zip(*sequence_pairs)
    return source, target

In [24]:
data_pipe = data_pipe.map(seperateSourceTarget)


In [25]:
print(list(data_pipe)[0])

(([1, 19362, 3, 4, 2], [1, 19362, 3, 4, 2], [1, 19362, 3, 4, 2], [1, 19362, 3, 4, 2]), ([1, 7, 43, 10746, 4, 2], [1, 7, 8885, 4, 2], [1, 7, 43, 3, 4, 2], [1, 7, 74, 4, 2]))


## Padding 

As mentioned earlier, while building vocabulary, we need to pad shorter sentences in a batch to maek all the sequences in a batch of equal length.

We can perform padding as the following