<h1>Torchtext Built-in Datasets</h1>








<h3><span style='color:yellow'>Torchtext offers a variety of datasets suitable for numerous NLP tasks. These datasets can be found at: </span><a href="https://pytorch.org/text/stable/datasets.html#machine-translation" style="color:blue; font-family: 'Arial', sans-serif;">https://pytorch.org/text/stable/datasets.html#machine-translation</a></h3>

<h3><span style='color:yellow'>In this tutorial, we will explore the Multi30k dataset, which is used in machine translation applications.</span></h3>

<h3><span style='color:yellow'>The objective of this tutorial is to construct dataset objects for German-English translation.</span></h3>




In [2]:
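# Import the libraries
# Note: Field and BucketIterator belong to the legacy torchtext API. They were moved to
# torchtext.legacy in torchtext 0.9 and removed in 0.12, so an older release is required.
import torchtext
from torchtext.datasets import Multi30k
from torchtext.data import Field, TabularDataset, BucketIterator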

import spacy

#!python -m spacy download en_core_web_sm
#!python -m spacy download de_core_news_sm

spacy_ger=spacy.load('de_core_news_sm')
spacy_en=spacy.load('en_core_web_sm')

In [3]:
# Define the German tokenizer function: split a string into a list of token strings
def ger_tokenizer(text):
    return [token.text for token in spacy_ger.tokenizer(text)]


# Define the English tokenizer function: split a string into a list of token strings
def en_tokenize(text):
    return [token.text for token in spacy_en.tokenizer(text)]

In [4]:
# Define the source and target fields
german=Field(sequential=True,use_vocab=True,tokenize=ger_tokenizer,lower=True)
english=Field(sequential=True,use_vocab=True,tokenize=en_tokenize,lower=True)




In [None]:
# Dataset split
# Note: you might run into a problem when downloading the dataset.
train_data,validation_data,test_data=Multi30k.splits(
    exts=('.de','.en'),  # file extensions of the source and target languages; check the dataset website
    fields=(german,english)
    )

In [None]:
# Building the vocabularies
german.build_vocab(train_data,max_size=10000,min_freq=2)
english.build_vocab(train_data,max_size=10000,min_freq=2)
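
To make the effect of `max_size` and `min_freq` concrete, here is a minimal pure-Python sketch of the idea behind vocabulary building. The helper `build_vocab_sketch` and the toy sentences are hypothetical and are not part of torchtext; the sketch assumes that special tokens come first and that `max_size` does not count them.

```python
from collections import Counter

def build_vocab_sketch(tokenized_sentences, max_size=None, min_freq=1,
                       specials=("<unk>", "<pad>")):
    """Hypothetical helper illustrating frequency-based vocabulary building."""
    # Count how often each token appears across all sentences
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    # Most frequent tokens first; ties are broken alphabetically
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials)  # index -> string, special tokens first
    for token, freq in ordered:
        if freq < min_freq:
            break  # all remaining tokens are too rare
        if max_size is not None and len(itos) - len(specials) >= max_size:
            break  # vocabulary is full (specials are not counted)
        itos.append(token)
    stoi = {tok: i for i, tok in enumerate(itos)}  # string -> index
    return itos, stoi

# Toy data (assumed for illustration only)
sentences = [["a", "man", "walks"], ["a", "dog", "runs"], ["a", "man", "runs"]]
itos, stoi = build_vocab_sketch(sentences, max_size=10000, min_freq=2)
print(itos)  # tokens seen fewer than 2 times ("walks", "dog") are dropped
```

With `min_freq=2`, tokens that occur only once never receive an index of their own, so at lookup time they would fall back to the unknown token.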

In [None]:
# BucketIterator: batches examples of similar length together to minimize padding
train_iterator,validation_iterator,test_iterator=BucketIterator.splits(
    datasets=(train_data,validation_data,test_data),
    batch_size=64,
    device='cuda')
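
The reason for bucketing can be shown with a simplified pure-Python sketch: grouping sequences of similar length keeps the amount of padding per batch small. The function `bucket_batches` is a hypothetical illustration. The real `BucketIterator` also shuffles, so this is not its actual implementation.

```python
def bucket_batches(examples, batch_size, pad_token="<pad>"):
    """Hypothetical sketch: sort by length, cut into batches, pad each batch."""
    ordered = sorted(examples, key=len)  # similar lengths end up adjacent
    batches = []
    for start in range(0, len(ordered), batch_size):
        chunk = ordered[start:start + batch_size]
        width = max(len(seq) for seq in chunk)  # longest sequence in this batch
        # Pad every sequence in the batch to the same width
        batches.append([seq + [pad_token] * (width - len(seq)) for seq in chunk])
    return batches

# Toy tokenized sentences of lengths 1, 5, 2 and 4 (assumed for illustration)
examples = [["a"], ["a", "b", "c", "d", "e"], ["a", "b"], ["a", "b", "c", "d"]]
batches = bucket_batches(examples, batch_size=2)
for batch in batches:
    print(batch)
```

Without sorting, the one-token sentence would share a batch with the five-token sentence and need four padding tokens. After bucketing, each batch needs only one.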

In [None]:
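# Iterate over the training batches and inspect their contents
for batch in train_iterator:
    print(batch)
    print(batch.src)  # source (German) token indices, shape: (sequence length, batch size)
    print(batch.trg)  # target (English) token indices, shape: (sequence length, batch size)
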

In [None]:
# Run the following lookups either in this notebook or from a command window
english.vocab.stoi['the']  # stoi stands for string to index
english.vocab.itos[4]  # itos stands for index to string (4 was the index of 'the' in this run)
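
The two lookup tables are what turn a tokenized sentence into indices and back. The sketch below uses a small hand-written vocabulary rather than the one built from Multi30k. The helpers `numericalize` and `denumericalize` are hypothetical names, and mapping unseen tokens to index 0 is an assumption of this example.

```python
# Toy vocabulary (assumed for illustration only)
itos = ["<unk>", "<pad>", "a", "man", "runs"]   # index -> string
stoi = {tok: i for i, tok in enumerate(itos)}   # string -> index

def numericalize(tokens, stoi, unk_index=0):
    """Map each token to its index; unseen tokens map to unk_index."""
    return [stoi.get(tok, unk_index) for tok in tokens]

def denumericalize(indices, itos):
    """Map each index back to its token string."""
    return [itos[i] for i in indices]

ids = numericalize(["a", "man", "jumps"], stoi)
tokens = denumericalize(ids, itos)
print(ids)
print(tokens)
```

The round trip is lossy for out-of-vocabulary words: "jumps" is not in the toy vocabulary, so it comes back as the unknown token.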
