Neural Machine Translation **(NMT)** is an approach to Machine Translation that uses an artificial neural network to predict the likelihood of a sequence of words, typically modelling entire sentences in a single integrated model.

The sequence-to-sequence (seq2seq) model in this post uses an encoder-decoder architecture built on LSTM (Long Short-Term Memory), a type of RNN. The encoder neural network encodes the input-language sequence into a single vector called a ***Context Vector***.

This *Context Vector* is said to contain the abstract representation of the input language sequence.

This vector is then passed to the decoder neural network, which generates the output-language translation one word at a time.
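The encoder-to-decoder handoff described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the model trained later in the post: the class names `TinyEncoder` and `TinyDecoder`, the layer sizes, and the choice of index 0 as the `<sos>` token are all assumptions made for the example, and the weights are untrained, so the generated indices are meaningless.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Hypothetical encoder: embeds the source tokens and runs an LSTM over them."""
    def __init__(self, vocab_size, emb_dim, hid_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim)

    def forward(self, src):
        # src: (src_len, batch)
        embedded = self.embedding(src)              # (src_len, batch, emb_dim)
        _, (hidden, cell) = self.lstm(embedded)     # each: (1, batch, hid_dim)
        return hidden, cell                         # together, the "context vector"

class TinyDecoder(nn.Module):
    """Hypothetical decoder: consumes one target token per call."""
    def __init__(self, vocab_size, emb_dim, hid_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, token, hidden, cell):
        # token: (batch,) holding the previously generated word index
        embedded = self.embedding(token.unsqueeze(0))            # (1, batch, emb_dim)
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))
        logits = self.out(output.squeeze(0))                     # (batch, vocab_size)
        return logits, hidden, cell

torch.manual_seed(0)
SRC_VOCAB, TRG_VOCAB, EMB, HID = 20, 25, 8, 16   # illustrative sizes only
encoder = TinyEncoder(SRC_VOCAB, EMB, HID)
decoder = TinyDecoder(TRG_VOCAB, EMB, HID)

src = torch.randint(0, SRC_VOCAB, (7, 2))        # 7 source tokens, batch of 2
hidden, cell = encoder(src)                      # encode the whole sentence once

token = torch.zeros(2, dtype=torch.long)         # assumption: index 0 stands for <sos>
steps = []
with torch.no_grad():
    for _ in range(5):                           # decode one word at a time
        logits, hidden, cell = decoder(token, hidden, cell)
        token = logits.argmax(dim=1)             # greedy choice of the next word
        steps.append(token)
generated = torch.stack(steps)                   # (5, batch)
```

The point to notice is that the encoder runs once over the full input, while the decoder is called in a loop and feeds its own previous prediction back in as the next input.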

***TorchText*** is a library for preparing text data for a variety of NLP tasks. It provides the tools needed to preprocess textual data.

1. Field :
This is a class in torchtext with which we specify how the preprocessing should be done on our data corpus.

2. TabularDataset :
Using this class, we can define a dataset from columns stored in CSV, TSV, or JSON format; each column is processed by its Field, which is what maps the tokens to integers.

3. BucketIterator :
Using this class, we can group examples of approximately equal length into the same batch, so that little padding is needed, and produce the batches used for model training.
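To make these three roles concrete, here is a plain-Python sketch of the steps they automate: tokenizing, building a vocabulary, mapping tokens to integers, and padding a batch to a common length. It does not use torchtext; the helper names (`tokenize`, `numericalize`, `pad_batch`) and the two example sentences are made up for illustration, and whitespace splitting stands in for a real tokenizer.

```python
def tokenize(text):
    # Stand-in for a real tokenizer: lowercase and split on whitespace.
    return text.lower().split()

corpus = ["Ein Mann geht", "Ein Hund läuft schnell davon"]
tokenized = [tokenize(sentence) for sentence in corpus]

# Vocabulary: special tokens first, then every word seen in the corpus.
specials = ["<unk>", "<pad>", "<sos>", "<eos>"]
words = sorted({tok for sent in tokenized for tok in sent})
itos = specials + words                          # index -> string
stoi = {tok: i for i, tok in enumerate(itos)}    # string -> index

def numericalize(tokens):
    # Wrap the sentence in <sos>/<eos>; unknown words fall back to <unk>.
    body = [stoi.get(tok, stoi["<unk>"]) for tok in tokens]
    return [stoi["<sos>"]] + body + [stoi["<eos>"]]

def pad_batch(sequences):
    # Right-pad every sequence to the length of the longest one.
    max_len = max(len(seq) for seq in sequences)
    return [seq + [stoi["<pad>"]] * (max_len - len(seq)) for seq in sequences]

batch = pad_batch([numericalize(tokens) for tokens in tokenized])
for row in batch:
    print(row)
```

After padding, every row has the same length, which is what allows a batch to be stacked into a single tensor.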

In [2]:
!pip install -U torchtext==0.6.0
import torch
import torch.nn as nn
import torch.optim as optim
from torchtext.datasets import Multi30k
from torchtext.data import Field, BucketIterator
import numpy as np
import pandas as pd
import spacy, random

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting torchtext==0.6.0
  Downloading torchtext-0.6.0-py3-none-any.whl (64 kB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: sentencepiece, torchtext
  Attempting uninstall: torchtext
    Found existing installation: torchtext 0.14.0
    Uninstalling torchtext-0.14.0:
      Successfully uninstalled torchtext-0.14.0
Successfully installed sentencepiece-0.1.97 torchtext-0.6.0




In [3]:
# Download and load spaCy's small English and German pipelines (used here for tokenization).
!python -m spacy download en_core_web_sm --quiet
!python -m spacy download de_core_news_sm --quiet
spacy_german = spacy.load("de_core_news_sm")
spacy_english = spacy.load("en_core_web_sm")

2023-01-17 09:16:12.568043: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
2023-01-17 09:16:28.516773: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
✔ Download and installation successful
You can now load the package via spacy.load('de_core_news_sm')


***Data Preprocessing***

In [10]:
def tokenize_german(text):
  return [token.text for token in spacy_german.tokenizer(text)]

def tokenize_english(text):
  return [token.text for token in spacy_english.tokenizer(text)]

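# Each Field lowercases the text and wraps every sentence in <sos> ... <eos>.
german = Field(tokenize=tokenize_german, lower=True, init_token="<sos>",
               eos_token="<eos>")

english = Field(tokenize=tokenize_english, lower=True, init_token="<sos>",
               eos_token="<eos>")

# German (.de) is the source language and English (.en) is the target.
train_data, valid_data, test_data = Multi30k.splits(exts = (".de", ".en"),
                                                    fields = (german, english))

# Build the vocabularies from the training split only.
german.build_vocab(train_data, max_size=10000, min_freq=3)
english.build_vocab(train_data, max_size=10000, min_freq=3)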

print(f"Unique tokens in source (de) vocabulary: {len(german.vocab)}")
print(f"Unique tokens in target (en) vocabulary: {len(english.vocab)}")

Unique tokens in source (de) vocabulary: 5374
Unique tokens in target (en) vocabulary: 4556
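Both vocabularies come out well below the `max_size=10000` cap, which suggests that `min_freq=3` is the setting doing most of the pruning here. The sketch below imitates that filtering in plain Python to show how the two parameters interact. The function `build_vocab` and the toy sentences are hypothetical and only approximate what torchtext does internally.

```python
from collections import Counter

def build_vocab(tokenized_sentences, max_size=None, min_freq=1,
                specials=("<unk>", "<pad>", "<sos>", "<eos>")):
    """Toy vocabulary builder: frequency filter first, then a size cap."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    # Most frequent words first; ties are broken alphabetically.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = [tok for tok, freq in ordered if freq >= min_freq]
    if max_size is not None:
        kept = kept[:max_size]      # the cap applies to regular words only
    itos = list(specials) + kept
    return {tok: idx for idx, tok in enumerate(itos)}

sentences = [
    ["a", "dog", "runs"],
    ["a", "dog", "sleeps"],
    ["a", "cat", "runs"],
]
vocab = build_vocab(sentences, max_size=10000, min_freq=2)

print(len(vocab))                        # 4 specials + 3 words seen at least twice
print(vocab.get("cat", vocab["<unk>"]))  # a rare word falls back to the <unk> index
```

Words that fail the frequency threshold are not lost from the data; they are simply represented by `<unk>` when sentences are converted to integers.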


***Data Processing***

In [11]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
BATCH_SIZE = 32

train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size=BATCH_SIZE,
    sort_within_batch=True,
    sort_key=lambda x: len(x.src),
    device=device,
)
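The reason for batching by length is easier to see with numbers. The following plain-Python sketch compares the padding required when batches are taken in the original order with the padding required when examples are first sorted by length. It is a simplification written for this post: `bucket_batches` and `padding_needed` are hypothetical helpers, and the library's own batching logic is more involved than a single global sort.

```python
import random

def bucket_batches(lengths, batch_size, seed=0):
    """Group example indices so that each batch holds similar lengths."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    random.Random(seed).shuffle(batches)   # shuffle batch order, not batch contents
    return batches

def padding_needed(lengths, batches):
    """Count the pad tokens required to square off every batch."""
    return sum(max(lengths[i] for i in batch) - lengths[i]
               for batch in batches for i in batch)

lengths = [3, 12, 4, 11, 5, 13, 3, 12]     # sentence lengths in tokens
naive = [list(range(i, i + 4)) for i in range(0, len(lengths), 4)]
bucketed = bucket_batches(lengths, batch_size=4)

print("naive padding:   ", padding_needed(lengths, naive))
print("bucketed padding:", padding_needed(lengths, bucketed))
```

In this toy case the naive batches need 37 pad tokens and the bucketed ones need 9, so far fewer positions in each batch are spent on padding.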


In [12]:
train_iterator

<torchtext.data.iterator.BucketIterator at 0x7efd09788e50>