<h1>Torchtext with text file datasets</h1>








<h3><span style='color:yellow'>This tutorial shows how to load datasets stored in local text files.</span></h3>

<h3><span style='color:yellow'>The task at hand is German-English translation, using a subset of 40 samples from the WMT dataset.</span></h3>


<h3><span style='color:yellow'>Each text file holds one sentence per row, and the translation of a sentence is found in the other file at the same row index.</span></h3>
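
The row-aligned layout described above can be sketched as follows. This is only an illustration: it writes two tiny hypothetical files (`toy_en.txt` and `toy_de.txt`, with made-up sentences) to a temporary directory and pairs them by row index. The helper `read_parallel` is an assumed name, and none of this is the WMT data.

```python
import tempfile
from pathlib import Path


def read_parallel(src_path, tgt_path):
    """Pair line i of src_path with line i of tgt_path (hypothetical helper)."""
    src_lines = Path(src_path).read_text(encoding="utf-8").splitlines()
    tgt_lines = Path(tgt_path).read_text(encoding="utf-8").splitlines()
    # Row alignment only makes sense if both files have the same number of rows
    if len(src_lines) != len(tgt_lines):
        raise ValueError("files are not row-aligned")
    return list(zip(src_lines, tgt_lines))


with tempfile.TemporaryDirectory() as tmp:
    # Made-up toy files standing in for the English and German text files
    en_file = Path(tmp) / "toy_en.txt"
    de_file = Path(tmp) / "toy_de.txt"
    en_file.write_text("good morning\nthank you\n", encoding="utf-8")
    de_file.write_text("guten Morgen\ndanke\n", encoding="utf-8")
    pairs = read_parallel(en_file, de_file)

print(pairs)
```

Checking that both files have the same number of rows up front is a cheap way to catch a misaligned corpus before any training starts.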




In [2]:
# Importing libraries
import pandas as pd
import spacy
from sklearn.model_selection import train_test_split
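# Note: Field, TabularDataset and BucketIterator are in torchtext.data only for torchtext < 0.9.
# Versions 0.9 to 0.11 expose them under torchtext.legacy.data; later releases removed them.
from torchtext.data import Field, TabularDataset, BucketIterator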


In [7]:
# Opening the dataset files (a trailing newline in a file yields an empty last row)
with open("./datastes/text data/WMT_en.txt", encoding="utf-8") as f:
    english_text = f.read().split("\n")  # split on newlines: one sentence per row
with open("./datastes/text data/WMT_de.txt", encoding="utf-8") as f:
    german_txt = f.read().split("\n")  # split on newlines: one sentence per row

# Creating a dictionary with one entry per language
raw_data = {'German': german_txt, 'English': english_text}

# Loading the data into a DataFrame
df = pd.DataFrame(raw_data, columns=["English", "German"])
df.head(2)

,English,German
0,iron cement is a ready for use paste which is ...,iron cement ist eine gebrauchs ##AT##-##AT## f...
1,iron cement protects the ingot against the hot...,Nach der Aushärtung schützt iron cement die Ko...


In [12]:
# Splitting the data into train and test sets
train, test = train_test_split(df, test_size=0.2, random_state=1234)

# TabularDataset reads its data from JSON, CSV, or TSV files, so the splits are written to disk first.
# orient='records' formats each DataFrame row as a dictionary;
# lines=True writes one JSON object per line.
train.to_json("./datastes/text data/WMT_train.json", orient='records', lines=True)
test.to_json("./datastes/text data/WMT_test.json", orient='records', lines=True)

# We can also convert it to CSV as follows
#train.to_csv("./datastes/text data/WMT_train.csv",index=False)
#test.to_csv("./datastes/text data/WMT_test.csv",index=False)
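
For reference, `orient='records'` combined with `lines=True` gives the JSON Lines layout: one JSON object per row, one row per line. The sketch below reproduces that layout with the standard `json` module on two made-up rows. The rows and the `to_json_lines` helper are assumptions for illustration, and the exact spacing and escaping may differ from what pandas writes.

```python
import json

# Two made-up rows using the same column names as the DataFrame
rows = [
    {"English": "good morning", "German": "guten Morgen"},
    {"English": "thank you", "German": "danke"},
]


def to_json_lines(records):
    """Serialize each record as one JSON object on its own line (hypothetical helper)."""
    return "\n".join(json.dumps(record) for record in records)


text = to_json_lines(rows)
print(text)

# Reading the layout back: every line parses independently of the others
parsed = [json.loads(line) for line in text.splitlines()]
```

Because each line is a complete JSON object, a reader can process the file row by row, and each row's keys can be mapped to fields by name.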


In [20]:
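# Tokenization: load the spaCy models for both languages
spacy_de = spacy.load("de_core_news_sm")
spacy_en = spacy.load("en_core_web_sm")

def german_tokenizer(text):
    # Split a German sentence into a list of token strings
    return [token.text for token in spacy_de.tokenizer(text)]

def english_tokenizer(text):
    # Split an English sentence into a list of token strings
    return [token.text for token in spacy_en.tokenizer(text)]

# Each Field tokenizes its text, lowercases the tokens, and maps them to vocabulary indices
german = Field(sequential=True, use_vocab=True, tokenize=german_tokenizer, lower=True)
english = Field(sequential=True, use_vocab=True, tokenize=english_tokenizer, lower=True)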
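
As a rough picture of what these `Field` settings do to a single sentence: the text is split by the `tokenize` callable and, because `lower=True`, each token is lowercased. The sketch below mimics this with a plain whitespace tokenizer standing in for spaCy. `whitespace_tokenizer` and `preprocess` are hypothetical helpers, and spaCy's real tokenizer splits punctuation differently.

```python
def whitespace_tokenizer(text):
    """Stand-in for the spaCy tokenizers: split on whitespace only."""
    return text.split()


def preprocess(text, tokenize, lower=True):
    """Tokenize one sentence, then optionally lowercase every token (hypothetical helper)."""
    tokens = tokenize(text.rstrip("\n"))
    return [token.lower() for token in tokens] if lower else tokens


tokens = preprocess("Iron Cement ist eine Paste", whitespace_tokenizer)
print(tokens)  # ['iron', 'cement', 'ist', 'eine', 'paste']
```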

In [29]:
# TabularDataset
path = './datastes/text data/'
# Maps each JSON key to (attribute name on the batch, Field used to process it)
fields = {'German': ('de', german), 'English': ('en', english)}

train_data,test_data=TabularDataset.splits(
    path=path,
    train='WMT_train.json',
    test='WMT_test.json',
    format='json',
    fields=fields
)


In [30]:
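# Building the vocabularies from the training split only:
# keep at most 10,000 tokens, and only tokens that occur at least twice
german.build_vocab(train_data, max_size=10000, min_freq=2)
english.build_vocab(train_data, max_size=10000, min_freq=2)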
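
To make the effect of `max_size` and `min_freq` concrete, here is a minimal sketch of frequency-based vocabulary building with `collections.Counter`. It is not torchtext's implementation: `build_vocab_sketch` is a hypothetical helper, the toy corpus is made up, and the special tokens `<unk>` and `<pad>` at indices 0 and 1 are assumed to match the legacy `Field` defaults.

```python
from collections import Counter


def build_vocab_sketch(token_lists, max_size=None, min_freq=1,
                       specials=("<unk>", "<pad>")):
    """Return (itos, stoi) for tokens that are frequent enough (hypothetical helper)."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    # Most frequent tokens first; ties are broken alphabetically
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    itos = list(specials)
    for token, freq in ranked:
        if freq < min_freq:
            break  # every remaining token is rarer still
        if max_size is not None and len(itos) - len(specials) >= max_size:
            break  # size cap reached (special tokens are not counted)
        itos.append(token)
    stoi = {token: index for index, token in enumerate(itos)}
    return itos, stoi


# Made-up toy corpus of already tokenized sentences
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "bird"]]
itos, stoi = build_vocab_sketch(corpus, max_size=10000, min_freq=2)
print(itos)  # ['<unk>', '<pad>', 'the', 'sat']

# Tokens seen only once fall back to the unknown index
cat_index = stoi.get("cat", stoi["<unk>"])
```

With only 40 samples, `min_freq=2` drops every token that appears once, so a sizeable share of words is likely to map to `<unk>`.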

In [61]:
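# BucketIterator: groups examples of similar length into batches to reduce padding
train_iterator, test_iterator = BucketIterator.splits(
    (train_data, test_data),
    batch_size=32,
    # sort_key measures example length; the test iterator sorts by default
    # and is expected to fail on iteration without it
    sort_key=lambda x: len(x.de),
    device='cpu')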



In [65]:
for batch in train_iterator:
    print(batch.en)  # English token indices, shape [sequence length, batch size]

tensor([[  2,   2,  38,  ...,  17,  28,  38],
        [  9,   9,   8,  ...,   0,  12,  19],
        [ 52,  52,   6,  ...,   6,  33,  41],
        ...,
        [ 71,  71,   1,  ...,   1,   1,   1],
        [109, 109,   1,  ...,   1,   1,   1],
        [ 14,  14,   1,  ...,   1,   1,   1]])
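
In the tensor printed above, each column is one sentence and each row is one position in the sequence (the `Field` default, `batch_first=False`). Shorter sentences are filled up with the padding index, assumed here to be 1, while 0 is assumed to be the unknown-token index. The pure-Python sketch below reproduces that layout on made-up index lists; `pad_and_transpose` is a hypothetical helper, not a torchtext function.

```python
def pad_and_transpose(sequences, pad_id=1):
    """Pad index lists to equal length, then lay them out as [seq_len, batch] (hypothetical helper)."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # Transpose from [batch, seq_len] to [seq_len, batch]
    return [list(step) for step in zip(*padded)]


# Three made-up sentences of different lengths, already mapped to indices
toy_batch = [[2, 9, 52, 71], [38, 8], [17, 0, 6]]
layout = pad_and_transpose(toy_batch)
for row in layout:
    print(row)
```

Reading down a column recovers one sentence followed by its padding, which is why the trailing rows of the printed tensor are mostly 1s.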




In [63]:
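import torch
# Use the first GPU if one is available, otherwise fall back to the CPU
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

for batch in train_iterator:
    data = batch.de.to(device)  # move the German batch to the selected device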

