# 1-Sequence to Sequence Learning with Neural Network
In this notebook we are implementing Sequence to Sequence Learning with Neural Networks paper. https://arxiv.org/abs/1409.3215

Reference: https://github.com/bentrevett/pytorch-seq2seq/blob/master/1%20-%20Sequence%20to%20Sequence%20Learning%20with%20Neural%20Networks.ipynb

Preparing Data

In [9]:
import torch
import torch.nn as nn
import torch.optim as optim


In [10]:
from torchtext.datasets import TranslationDataset, Multi30k
from torchtext.data import Field, BucketIterator

In [11]:
import spacy
import random, math, time

In [12]:
import os
os.environ['CUDA_DEVICE_ORDER']='PCI_BUS_ID'
os.environ['CUDA_VISIBLE_DEVICES']='0'
torch.set_num_threads(4)

In [13]:
SEED=1234
random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic= True

spaCy has model for each language ("de" for German and "en" for English) which need to be loaded so we can access the tokenizer of each model.

Note: the models must first be downloaded using the following on the command line:

python -m spacy download en

python -m spacy download de

In [14]:
spacy_de=spacy.load('de')
spacy_en=spacy.load('en')

In the paper we are implementing, they find it beneficial to reverse the order of the input which they believe "introduces many short term dependencies in the data that make the optimization problem much easier".

In [15]:
def tokenize_de(text):
    return [tok.text for tok in spacy_de.tokenizer(text)][::-1]

In [16]:
def tokenizer_en(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]

TorchText's Fields handle how data should be processed. You can read all of the possible arguments here.

In [17]:
SRC = Field(tokenize=tokenize_de,
            init_token='<sos>',
           eos_token='<eos>',
           lower=True)
TRG=Field(tokenize=tokenizer_en,
         init_token='<sos>',
         eos_token='<eos>',
         lower=True)

In [18]:
train_data, valid_data, test_data= Multi30k.splits(exts=('.de','.en'),fields=(SRC,TRG))

In [19]:
print(f"Number of training examples: {len(train_data.examples)}")
print(f"Number of validation examples: {len(valid_data.examples)}")
print(f"Number of testing examples: {len(test_data.examples)}")

Number of training examples: 29000
Number of validation examples: 1014
Number of testing examples: 1000


In [20]:
print(vars(train_data.examples[0]))

{'src': ['.', 'büsche', 'vieler', 'nähe', 'der', 'in', 'freien', 'im', 'sind', 'männer', 'weiße', 'junge', 'zwei'], 'trg': ['two', 'young', ',', 'white', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.']}


In [21]:
SRC.build_vocab(train_data, min_freq=2)
TRG.build_vocab(train_data,min_freq=2)

In [22]:
print(f"Unique tokens in source (de) vocabulary: {len(SRC.vocab)}")
print(f"Unique tokens in target (en) vocabulary: {len(TRG.vocab)}")

Unique tokens in source (de) vocabulary: 7855
Unique tokens in target (en) vocabulary: 5893


In [23]:
import os
os.environ['CUDA_DEVICE_ORDER']='PCI_BUS_ID'
os.environ['CUDA_VISIBLE_DEVICES']='0'

In [24]:
device=torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [25]:
BATCH_SIZE=128
train_iterator, valid_iterator, test_iterator =BucketIterator.splits((train_data, valid_data, test_data),batch_size=BATCH_SIZE,
                                                                    device=device)

In [26]:
train_iterator

<torchtext.data.iterator.BucketIterator at 0x7fcb673cef98>

Building seq2seq model

In [27]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        
        self.input_dim=input_dim
        self.emb_dim=emb_dim
        self.hid_dim=hid_dim
        self.n_layers=n_layers
        self.dropout=dropout
        
        self.embedding=nn.Embedding(input_dim, emb_dim)
        self.rnn= nn.LSTM(emb_dim,hid_dim,n_layers, dropout=dropout)
        self.dropout=nn.Dropout(dropout)
        
    def forward(self,src):
        embedded=self.dropout(self.embedding(src))
        outputs,(hidden,cell)=self.rnn(embedded)
        return hidden, cell
    

In [28]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        
        self.emb_dim=emb_dim
        self.hid_dim=hid_dim
        self.output_dim=output_dim
        self.n_layers=n_layers
        self.dropout=dropout
        
        self.embedding=nn.Embedding(output_dim, emb_dim)
        
        self.rnn=nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout)
        
        self.out=nn.Linear(hid_dim, output_dim)
        
        self.dropout=nn.Dropout(dropout)
        
    def forward(self, input, hidden, cell):
        input=input.unsqueeze(0)
        embedded=self.dropout(self.embedding(input))
        output, (hidden,cell)=self.rnn(embedded,(hidden, cell))
        prediction =self.out(output.squeeze(0))
        return prediction, hidden, cell

In [29]:
class Seq2Seq(nn.Module):
    def __init__ (self, encoder, decoder, device):
        super().__init__()
        self.encoder= encoder
        self.decoder= decoder
        self.device= device
        
        assert encoder.hid_dim == decoder.hid_dim, \
        "Hidden dimension of encoder and decoder must be equal!"
        assert encoder.n_layers == decoder.n_layers, \
        "Encoder and decoder must have equal number of layers!"
        
    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        batch_size= trg.shape[1]
        max_len=trg.shape[0]
        trg_vocab_size=self.decoder.output_dim
        
        outputs=torch.zeros(max_len, batch_size, trg_vocab_size).to(self.device)
        
        hidden, cell=self.encoder(src)
        
        input= trg[0,:]
        
        for t in range(1, max_len):
            output, hidden, cell= self.decoder(input, hidden, cell)
            outputs[t]=output
            teacher_force= random.random()< teacher_forcing_ratio
            top1=output.max(1)[1]
            input=(trg[t] if teacher_force else top1)
        return outputs

Training the Seq2Seq model

In [30]:
INPUT_DIM=len(SRC.vocab)
OUTPUT_DIM=len(TRG.vocab)
ENC_EMB_DIM=256
DEC_EMB_DIM=256
HID_DIM=512
N_LAYERS=2
ENC_DROPOUT=0.5
DEC_DROPOUT=0.5

enc=Encoder(INPUT_DIM, ENC_EMB_DIM,HID_DIM,N_LAYERS,ENC_DROPOUT)
dec=Decoder(INPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT)
model= Seq2Seq(enc, dec, device).to(device)

In [31]:
def init_weights(m):
    for name, param in m.named_parameters():
        nn.init.uniform_(param.data, -0.08, 0.08)

model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(7855, 256)
    (rnn): LSTM(256, 512, num_layers=2, dropout=0.5)
    (dropout): Dropout(p=0.5)
  )
  (decoder): Decoder(
    (embedding): Embedding(7855, 256)
    (rnn): LSTM(256, 512, num_layers=2, dropout=0.5)
    (out): Linear(in_features=512, out_features=7855, bias=True)
    (dropout): Dropout(p=0.5)
  )
)

In [32]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 15,407,791 trainable parameters


In [33]:
optimizer=optim.Adam(model.parameters())

In [34]:
PAD_IDX=TRG.vocab.stoi['<pad>']
criterion=nn.CrossEntropyLoss(ignore_index=PAD_IDX)

In [35]:
def train(model, iterator, optimizer, criterion, clip):
    model.train()
    epoch_loss=0
    for i, batch in enumerate(iterator):
        src=batch.src
        trg=batch.trg
        
        optimizer.zero_grad()
        
        output=model(src,trg)
        
        output=output[1:].view(-1,output.shape[-1])
        trg=trg[1:].view(-1)
        
        loss=criterion(output, trg)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(),clip)
        
        optimizer.step()
        
        epoch_loss +=loss.item()
        
    return epoch_loss/len(iterator)

In [36]:
def evaluate(model, iterator, criterion):
    model.eval()
    
    epoch_loss=0
    
    with torch.no_grad():
        
        for i, batch in enumerate(iterator):
            src= batch.src
            trg= batch.trg
            
            output= model(src, trg, 0)
            
            output=output[1:].view(-1, output.shape[-1])
            trg=trg[1:].view(-1)
            
            loss=criterion(output, trg)
            
            epoch_loss+= loss.item()
    return epoch_loss/len(iterator)

In [37]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time-start_time
    elapsed_mins= int(elapsed_time/60)
    elapsed_secs= int(elapsed_time -(elapsed_mins*60))
    return elapsed_mins, elapsed_secs

In [40]:
N_EPOCHS=100
CLIP=1
best_valid_loss=float('inf')
for epoch in range(N_EPOCHS):
    start_time=time.time()
    train_loss=train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss=evaluate(model, valid_iterator, criterion)
    
    end_time=time.time()
    
    epoch_mins, epoch_secs= epoch_time(start_time, end_time)
    
    if valid_loss< best_valid_loss:
        best_valid_loss=valid_loss
        torch.save(model.state_dict(),'../tut1-model.pt')
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 0m 36s
	Train Loss: 2.588 | Train PPL:  13.307
	 Val. Loss: 3.665 |  Val. PPL:  39.045
Epoch: 02 | Time: 0m 37s
	Train Loss: 2.530 | Train PPL:  12.554
	 Val. Loss: 3.666 |  Val. PPL:  39.079
Epoch: 03 | Time: 0m 36s
	Train Loss: 2.443 | Train PPL:  11.508
	 Val. Loss: 3.660 |  Val. PPL:  38.845
Epoch: 04 | Time: 0m 36s
	Train Loss: 2.375 | Train PPL:  10.755
	 Val. Loss: 3.659 |  Val. PPL:  38.837
Epoch: 05 | Time: 0m 36s
	Train Loss: 2.305 | Train PPL:  10.029
	 Val. Loss: 3.641 |  Val. PPL:  38.119
Epoch: 06 | Time: 0m 36s
	Train Loss: 2.245 | Train PPL:   9.436
	 Val. Loss: 3.725 |  Val. PPL:  41.469
Epoch: 07 | Time: 0m 37s
	Train Loss: 2.168 | Train PPL:   8.738
	 Val. Loss: 3.720 |  Val. PPL:  41.264
Epoch: 08 | Time: 0m 37s
	Train Loss: 2.118 | Train PPL:   8.318
	 Val. Loss: 3.730 |  Val. PPL:  41.684
Epoch: 09 | Time: 0m 37s
	Train Loss: 2.061 | Train PPL:   7.852
	 Val. Loss: 3.692 |  Val. PPL:  40.133
Epoch: 10 | Time: 0m 36s
	Train Loss: 1.998 | Train PPL

Epoch: 80 | Time: 0m 37s
	Train Loss: 0.452 | Train PPL:   1.571
	 Val. Loss: 5.461 |  Val. PPL: 235.233
Epoch: 81 | Time: 0m 37s
	Train Loss: 0.448 | Train PPL:   1.565
	 Val. Loss: 5.529 |  Val. PPL: 251.815
Epoch: 82 | Time: 0m 36s
	Train Loss: 0.447 | Train PPL:   1.564
	 Val. Loss: 5.487 |  Val. PPL: 241.594
Epoch: 83 | Time: 0m 36s
	Train Loss: 0.435 | Train PPL:   1.545
	 Val. Loss: 5.585 |  Val. PPL: 266.406
Epoch: 84 | Time: 0m 37s
	Train Loss: 0.436 | Train PPL:   1.546
	 Val. Loss: 5.575 |  Val. PPL: 263.656
Epoch: 85 | Time: 0m 36s
	Train Loss: 0.424 | Train PPL:   1.528
	 Val. Loss: 5.611 |  Val. PPL: 273.371
Epoch: 86 | Time: 0m 37s
	Train Loss: 0.417 | Train PPL:   1.518
	 Val. Loss: 5.711 |  Val. PPL: 302.230
Epoch: 87 | Time: 0m 36s
	Train Loss: 0.421 | Train PPL:   1.524
	 Val. Loss: 5.585 |  Val. PPL: 266.473
Epoch: 88 | Time: 0m 37s
	Train Loss: 0.410 | Train PPL:   1.507
	 Val. Loss: 5.644 |  Val. PPL: 282.472
Epoch: 89 | Time: 0m 36s
	Train Loss: 0.414 | Train PPL

In [43]:
model.load_state_dict(torch.load('../tut1-model.pt'))
test_loss =evaluate(model, test_iterator,criterion)

In [44]:
test_loss

3.6609056293964386