# 1 - Sequence to Sequence Learning with Neural Networks
In this notebook we implement the paper Sequence to Sequence Learning with Neural Networks.
Reference: https://github.com/bentrevett/pytorch-seq2seq/blob/master/1%20-%20Sequence%20to%20Sequence%20Learning%20with%20Neural%20Networks.ipynb

Preparing Data

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim


In [2]:
from torchtext.datasets import TranslationDataset, Multi30k
from torchtext.data import Field, BucketIterator

In [3]:
import spacy
import random, math, time

In [4]:
SEED = 1234
random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

spaCy has a model for each language ("de" for German and "en" for English), which needs to be loaded so that we can access its tokenizer.

Note: the models must first be downloaded using the following on the command line:

python -m spacy download en

python -m spacy download de

In [5]:
spacy_de = spacy.load('de')
spacy_en = spacy.load('en')

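The authors of the paper we are implementing find it beneficial to reverse the order of the input, which they believe "introduces many short term dependencies in the data that make the optimization problem much easier". The German (source) tokenizer therefore returns its tokens in reverse order, while the English (target) tokenizer keeps the original order.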

In [6]:
def tokenize_de(text):
    return [tok.text for tok in spacy_de.tokenizer(text)][::-1]

In [7]:
def tokenizer_en(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]
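
As a rough illustration of the reversal applied to the source side, the sketch below uses a plain whitespace split in place of the spaCy tokenizer. This is an assumption made only to keep the example free of dependencies, and reverse_tokenize is a hypothetical helper rather than part of the notebook.

```python
def reverse_tokenize(text):
    """Split on whitespace and return the tokens in reverse order."""
    # Assumption: str.split() stands in for spacy_de.tokenizer here.
    return text.split()[::-1]


tokens = reverse_tokenize("zwei junge weiße männer sind im freien")
print(tokens)
# ['freien', 'im', 'sind', 'männer', 'weiße', 'junge', 'zwei']
```

The last word of the sentence becomes the first token the encoder reads, which matches the reversed 'src' list printed for the first training example further down.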

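TorchText's Fields define how data should be processed. Here each Field tokenizes the text with the matching function above, lowercases it, and adds a start-of-sequence token (<sos>) and an end-of-sequence token (<eos>). The full list of possible arguments is given in the TorchText documentation.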

In [8]:
SRC = Field(tokenize=tokenize_de,
            init_token='<sos>',
            eos_token='<eos>',
            lower=True)
TRG = Field(tokenize=tokenizer_en,
            init_token='<sos>',
            eos_token='<eos>',
            lower=True)
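
To make the effect of these arguments concrete, here is a minimal pure-Python sketch of the processing they request. It is not TorchText's actual implementation: apply_field is a hypothetical helper, and a whitespace split is assumed as the tokenizer.

```python
def apply_field(text, tokenize, init_token="<sos>", eos_token="<eos>", lower=True):
    """Tokenize, optionally lowercase, then wrap with start/end tokens."""
    tokens = tokenize(text)
    if lower:
        tokens = [tok.lower() for tok in tokens]
    return [init_token] + tokens + [eos_token]


# Assumption: str.split stands in for the spaCy-based tokenizers.
processed = apply_field("Two young men", str.split)
print(processed)
# ['<sos>', 'two', 'young', 'men', '<eos>']
```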

In [9]:
train_data, valid_data, test_data = Multi30k.splits(exts=('.de', '.en'),
                                                    fields=(SRC, TRG))

In [10]:
print(f"Number of training examples: {len(train_data.examples)}")
print(f"Number of validation examples: {len(valid_data.examples)}")
print(f"Number of testing examples: {len(test_data.examples)}")

Number of training examples: 29000
Number of validation examples: 1014
Number of testing examples: 1000


In [11]:
print(vars(train_data.examples[0]))

{'src': ['.', 'büsche', 'vieler', 'nähe', 'der', 'in', 'freien', 'im', 'sind', 'männer', 'weiße', 'junge', 'zwei'], 'trg': ['two', 'young', ',', 'white', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.']}


In [12]:
SRC.build_vocab(train_data, min_freq=2)
TRG.build_vocab(train_data, min_freq=2)

In [13]:
print(f"Unique tokens in source (de) vocabulary: {len(SRC.vocab)}")
print(f"Unique tokens in target (en) vocabulary: {len(TRG.vocab)}")

Unique tokens in source (de) vocabulary: 7855
Unique tokens in target (en) vocabulary: 5893
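
The min_freq=2 argument keeps only tokens that occur at least twice in the training data; rarer tokens are mapped to the unknown token. The sketch below imitates that behaviour with collections.Counter. The helpers build_vocab and numericalize are hypothetical, and the alphabetical ordering is a simplification (TorchText orders by frequency).

```python
from collections import Counter


def build_vocab(token_lists, min_freq=2,
                specials=("<unk>", "<pad>", "<sos>", "<eos>")):
    """Return (itos, stoi) keeping only tokens seen at least min_freq times."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    kept = sorted(tok for tok, c in counts.items() if c >= min_freq)
    itos = list(specials) + kept
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi


def numericalize(tokens, stoi):
    """Map tokens to indices, falling back to <unk> for unseen tokens."""
    unk = stoi["<unk>"]
    return [stoi.get(tok, unk) for tok in tokens]


corpus = [["two", "men", "walk"], ["two", "dogs", "walk"]]
itos, stoi = build_vocab(corpus)
print(itos)                                         # ['<unk>', '<pad>', '<sos>', '<eos>', 'two', 'walk']
print(numericalize(["two", "cats", "walk"], stoi))  # [4, 0, 5]
```

In this toy corpus "men" and "dogs" each appear once, so they are dropped from the vocabulary and would be numericalized as index 0, just like the unseen word "cats".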


In [14]:
import os
os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

In [15]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [16]:
BATCH_SIZE = 128
train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size=BATCH_SIZE,
    device=device)

In [17]:
train_iterator

<torchtext.data.iterator.BucketIterator at 0x7f7a6acab278>
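
BucketIterator batches examples of similar length together so that little padding is needed. The following is only a simplified sketch of that idea, not the library's algorithm: bucket_batches is a hypothetical helper that sorts by length, cuts fixed-size chunks, and pads each chunk to its longest example.

```python
def bucket_batches(examples, batch_size, pad_token="<pad>"):
    """Group token lists of similar length and pad each group to equal length."""
    ordered = sorted(examples, key=len)
    batches = []
    for start in range(0, len(ordered), batch_size):
        chunk = ordered[start:start + batch_size]
        max_len = max(len(ex) for ex in chunk)
        # Pad every example in the chunk up to the chunk's longest length.
        batches.append([ex + [pad_token] * (max_len - len(ex)) for ex in chunk])
    return batches


examples = [["a"], ["b", "c", "d", "e"], ["f", "g"], ["h", "i", "j"]]
batches = bucket_batches(examples, batch_size=2)
for batch in batches:
    print(batch)
# [['a', '<pad>'], ['f', 'g']]
# [['h', 'i', 'j', '<pad>'], ['b', 'c', 'd', 'e']]
```

Only two padding tokens are needed here; batching the examples in their original order would have required five.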

Building seq2seq model

In [None]:
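class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()

        self.input_dim = input_dim
        self.emb_dim = emb_dim
        self.hid_dim = hid_dim
        self.n_layers = n_layers

        # Map token indices to dense vectors of size emb_dim.
        self.embedding = nn.Embedding(input_dim, emb_dim)
        # Multi-layer LSTM; this dropout is applied between LSTM layers.
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout)
        # Dropout applied to the embedded input.
        self.dropout = nn.Dropout(dropout)

    def forward(self, src):
        # src: [src_len, batch_size]
        embedded = self.dropout(self.embedding(src))
        # embedded: [src_len, batch_size, emb_dim]
        outputs, (hidden, cell) = self.rnn(embedded)
        # hidden, cell: [n_layers, batch_size, hid_dim]
        return hidden, cell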
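
As a quick check of the tensor shapes flowing through the encoder, the sketch below runs the same embedding and LSTM layers on random token indices. All sizes are hypothetical values chosen only for illustration, and the layers are built directly rather than through the Encoder class so that the snippet runs on its own.

```python
import torch
import torch.nn as nn

# Hypothetical small sizes, chosen only for illustration.
INPUT_DIM, EMB_DIM, HID_DIM, N_LAYERS = 50, 8, 16, 2
SRC_LEN, BATCH = 7, 4

embedding = nn.Embedding(INPUT_DIM, EMB_DIM)
rnn = nn.LSTM(EMB_DIM, HID_DIM, N_LAYERS, dropout=0.5)

# A fake batch of token indices with shape [src_len, batch_size].
src = torch.randint(0, INPUT_DIM, (SRC_LEN, BATCH))
embedded = embedding(src)                # [src_len, batch_size, emb_dim]
outputs, (hidden, cell) = rnn(embedded)

print(outputs.shape)  # torch.Size([7, 4, 16])
print(hidden.shape)   # torch.Size([2, 4, 16])
print(cell.shape)     # torch.Size([2, 4, 16])
```

The encoder returns only hidden and cell, one state per LSTM layer; outputs, which holds the top-layer hidden state at every time step, is discarded.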
    