<a href="https://colab.research.google.com/github/faezesarlakifar/SBU-NLP-Lab-summer-school/blob/main/seq2seq_pytorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sequence to Sequence Learning with Neural Networks

Acknowledgement: this notebook is adapted from https://github.com/bentrevett/pytorch-seq2seq

Note: this notebook is intended only for learning how a seq2seq model works.

In [None]:
!pip install -U torch==1.8.0 torchtext==0.9.0

# Reload environment
exit()

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import torchtext
import torch
from torchtext.legacy import data
from torchtext.legacy import datasets

TEXT = data.Field()
LABEL = data.LabelField(dtype = torch.long)
legacy_train, legacy_test = datasets.IMDB.splits(TEXT, LABEL)  # datasets here refers to torchtext.legacy.datasets

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in

import time, random, math, string

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import torch
import torch.nn as nn
import torch.optim as optim
from torchtext.legacy.datasets import Multi30k
from torchtext.legacy.data import Field, BucketIterator

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

## Introduction

In this notebook, we will start with a simple model to understand the general concepts by implementing the model from the [Sequence to Sequence Learning with Neural Networks](https://arxiv.org/abs/1409.3215) paper.

The most common sequence-to-sequence (seq2seq) models are encoder-decoder models, which (commonly) use a recurrent neural network (RNN) to encode the source (input) sentence into a single vector (an abstract representation of the entire input sentence).

This vector is then decoded by a second RNN, which learns to output the target (output) sentence by generating it one word at a time.

![](https://github.com/bentrevett/pytorch-seq2seq/raw/3a8dc5515ff28cb059532439c5687126dd30015f/assets/seq2seq1.png)
The image above shows an example translation. The input sentence, "guten morgen", is fed into the encoder (green) one word at a time. We also append a start-of-sequence (<sos\>) and end-of-sequence (<eos\>) token to the start and end of the sentence, respectively. At each time-step, the input to the encoder RNN is both the current word $x_t$ and the hidden state from the previous time-step $h_{t-1}$. You can think of the hidden state as a vector representation of the sentence so far. The RNN can be represented as a function of both $x_t$ and $h_{t-1}$:
$$
h_t = EncoderRNN(x_t, h_{t-1}) \tag{1}
$$
Here, we have $X=\{x_1, x_2, \cdots, x_T\}$ where $x_1$ = <sos\>, $x_2$ = guten, etc. The initial hidden state $h_0$ is usually either initialized to zeros or a learned parameter.

Once the final word $x_T$ has been passed into the RNN, we use the final hidden state $h_T$ as the context vector, i.e. $h_T = z$.
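As a rough illustration of equation (1), the sketch below runs a toy scalar RNN over a short sequence and keeps the last hidden state as the context vector. The `rnn_step` cell, its weights, and the numeric stand-ins for the words are assumptions made for the example; the notebook itself uses an LSTM over embedding vectors.

```python
import math

def rnn_step(x_t, h_prev, w_x=0.5, w_h=0.8):
    """One step of a toy scalar RNN: h_t = tanh(w_x * x_t + w_h * h_prev)."""
    return math.tanh(w_x * x_t + w_h * h_prev)

def encode(inputs, h0=0.0):
    """Apply the recurrence to every input and return all hidden states h_1..h_T."""
    hidden_states = []
    h = h0  # initial hidden state, here initialized to zero
    for x_t in inputs:
        h = rnn_step(x_t, h)
        hidden_states.append(h)
    return hidden_states

# Hypothetical numeric stand-ins for <sos>, guten, morgen, <eos>.
source = [1.0, 0.3, -0.7, 0.2]
states = encode(source)
context = states[-1]  # z = h_T, the final hidden state
print(len(states), context)
```

Every hidden state depends on all earlier inputs through `h_prev`, which is why the last one can serve as a summary of the whole sentence.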

With our context vector $z$, we can start decoding it to get the target sentence, "good morning". Again, we append start and end of sequence tokens to the target sentence. At each time-step, the input to the decoder RNN (blue) is the current word, $y_t$, as well as the hidden state from the previous time-step, $s_{t-1}$, where the initial decoder hidden state $s_0 = z = h_T$, i.e. the initial decoder hidden state is the final encoder hidden state. Similar to the encoder, we can represent the decoder as:
$$
s_t = DecoderRNN(y_t, s_{t-1}) \tag{2}
$$
In the decoder, we need to go from the hidden state to an actual word, therefore at each time-step we use $s_t$ to predict (by passing it through a Linear layer, shown in purple) what we think is the next word in the sequence $\hat{y}_t$.
$$
\hat{y}_t = f(s_t) \tag{3}
$$
The words in the decoder are always generated one after another, with one per time-step. We always use <sos\> for the first input to the decoder $y_1$, but for subsequent inputs $y_{t > 1}$, we will sometimes use the actual, ground truth next word in the sequence, $y_t$, and sometimes use the word predicted by our decoder, $\hat{y}_{t-1}$. This is called teacher forcing.
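The choice between the ground-truth word and the predicted word can be sketched in a few lines. The helper name `next_decoder_input` and the example words are hypothetical; the comparison against a ratio mirrors the `teacher_forcing_ratio` argument used later in this notebook.

```python
import random

def next_decoder_input(ground_truth, predicted, teacher_forcing_ratio, rng=random):
    """Return the ground-truth token with probability teacher_forcing_ratio,
    otherwise return the model's own previous prediction."""
    use_teacher = rng.random() < teacher_forcing_ratio
    return ground_truth if use_teacher else predicted

# With a ratio of 0.5, roughly half of the decoder inputs come from the ground truth.
rng = random.Random(0)
choices = [next_decoder_input("good", "nice", 0.5, rng) for _ in range(1000)]
print(choices.count("good") / len(choices))
```

A ratio of 1.0 always feeds the ground truth and a ratio of 0.0 always feeds the model's own output, which is the setting used for evaluation.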

When training/testing our model, we always know how many words are in our target sentence, so we stop generating words once we hit that many. During inference (i.e. real world usage) it is common to keep generating words until the model outputs an <eos\> token or until a certain number of words have been generated.
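The inference-time stopping rule can be sketched as a small loop. Here `step_fn` is a hypothetical stand-in for one decoder step that maps the previous token to the next one; a real decoder would also carry hidden and cell states.

```python
def greedy_decode(step_fn, sos="<sos>", eos="<eos>", max_len=10):
    """Generate tokens until step_fn emits eos or max_len tokens have been produced."""
    tokens = []
    current = sos
    for _ in range(max_len):
        current = step_fn(current)
        if current == eos:
            break
        tokens.append(current)
    return tokens

# Hypothetical lookup table standing in for a trained decoder step.
toy_model = {"<sos>": "good", "good": "morning", "morning": "<eos>"}
print(greedy_decode(toy_model.get))
```

The `max_len` cap matters because nothing guarantees that an imperfect model ever emits the end-of-sequence token.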

Once we get our prediction $\hat{Y} = \{\hat{y}_1, \hat{y}_2, \cdots, \hat{y}_T\}$, we compare it against our actual target sentence $Y = \{y_1, y_2, \cdots, y_T\}$, to calculate our loss. We then use this loss to update all of the parameters in our model.

## Preparing Data

In [None]:
tokenizer = lambda x: str(x).translate(str.maketrans('', '', string.punctuation)).strip().split()
reverse_tokenizer = lambda x: tokenizer(x)[::-1]

SRC = Field(tokenize=reverse_tokenizer, init_token='<sos>', eos_token='<eos>', lower=True)
TRG = Field(tokenize=tokenizer, init_token='<sos>', eos_token='<eos>', lower=True)

train_data, valid_data, test_data = Multi30k.splits(exts=('.de', '.en'),
                                                   fields=(SRC, TRG))

In [None]:
print(f"Number of training examples: {len(train_data.examples)}")
print(f"Number of validation examples: {len(valid_data.examples)}")
print(f"Number of test examples: {len(test_data.examples)}")

Number of training examples: 29000
Number of validation examples: 1014
Number of test examples: 1000


In [None]:
print(vars(train_data.examples[0]))

{'src': ['büsche', 'vieler', 'nähe', 'der', 'in', 'freien', 'im', 'sind', 'männer', 'weiße', 'junge', 'zwei'], 'trg': ['two', 'young', 'white', 'males', 'are', 'outside', 'near', 'many', 'bushes']}


In [None]:
SRC.build_vocab(train_data, min_freq=2)
TRG.build_vocab(train_data, min_freq=2)

print(f"Unique tokens in source (de) vocabulary: {len(SRC.vocab)}")
print(f"Unique tokens in target (en) vocabulary: {len(TRG.vocab)}")

Unique tokens in source (de) vocabulary: 7805
Unique tokens in target (en) vocabulary: 5940


In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

BATCH_SIZE = 128
# We use a BucketIterator instead of the standard Iterator because it creates batches in a way that minimizes the amount
# of padding in both the source and target sentences.
train_iter, valid_iter, test_iter = BucketIterator.splits((train_data, valid_data, test_data),
                                                          batch_size=BATCH_SIZE, device=device)
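To see why grouping sentences of similar length reduces padding, the sketch below compares naive batching with batching after sorting by length. This is a simplified illustration with made-up sentence lengths, not the actual `BucketIterator` algorithm, which also shuffles within and across buckets.

```python
def padding_needed(batches):
    """Count the pad tokens needed so every sentence in a batch matches the longest one."""
    return sum(max(map(len, b)) * len(b) - sum(map(len, b)) for b in batches)

def make_batches(sentences, batch_size):
    """Split a list of sentences into consecutive batches."""
    return [sentences[i:i + batch_size] for i in range(0, len(sentences), batch_size)]

# Hypothetical token lists with lengths 2, 9, 3, 8, 2, 9.
lengths = [2, 9, 3, 8, 2, 9]
sentences = [["tok"] * n for n in lengths]

unsorted_batches = make_batches(sentences, batch_size=2)
bucketed_batches = make_batches(sorted(sentences, key=len), batch_size=2)
print(padding_needed(unsorted_batches), padding_needed(bucketed_batches))
```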

In [None]:
for batch in train_iter:
    # Inspect the first batch only; src and trg are shaped [sen_len, batch_size].
    print(batch)
    break


[torchtext.legacy.data.batch.Batch of size 128 from MULTI30K]
	[.src]:[torch.LongTensor of size 30x128]
	[.trg]:[torch.LongTensor of size 34x128]


## Use Bert Tokenizer and Model

In [None]:
!pip install transformers
import transformers
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### Check the output shape of nn.Embedding

In [None]:
# an Embedding module containing 10 tensors of size 3
embedding = nn.Embedding(10, 3)
# a batch of 2 samples of 4 indices each
input = torch.LongTensor([[1,2,4,5],[4,3,2,9]])
print(embedding(input))

tensor([[[-0.1132, -1.2442,  2.1493],
         [-0.9966,  1.7183, -1.1258],
         [-0.4949, -0.8033, -0.1762],
         [-1.4038,  0.7248, -0.0567]],

        [[-0.4949, -0.8033, -0.1762],
         [ 0.4174,  0.0824, -0.7034],
         [-0.9966,  1.7183, -1.1258],
         [-0.6664, -0.4560,  0.7699]]], grad_fn=<EmbeddingBackward>)
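The output above has shape [2, 4, 3]: one 3-dimensional vector per index, for 2 samples of 4 indices each. A framework-free sketch of the same lookup, assuming the embedding is just a table of random vectors (`make_embedding` and `embed` are hypothetical helper names):

```python
import random

def make_embedding(num_embeddings, embedding_dim, seed=0):
    """Build a table with one random vector per vocabulary index."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(embedding_dim)]
            for _ in range(num_embeddings)]

def embed(table, batch_of_indices):
    """Replace every index with its row of the table."""
    return [[table[i] for i in row] for row in batch_of_indices]

table = make_embedding(10, 3)
batch = [[1, 2, 4, 5], [4, 3, 2, 9]]
out = embed(table, batch)
# batch size, indices per sample, embedding size
print(len(out), len(out[0]), len(out[0][0]))
```

The same index always maps to the same vector, which is why the rows for indices 2 and 4 appear in both samples of the printed tensor.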


### Check Bert Tokenization and word ids

In [None]:
text = "this is a test sentence."
marked_text = "[CLS] " + text + " [SEP]"

tokenized_text = tokenizer.tokenize(marked_text)

print (tokenized_text)

['[CLS]', 'this', 'is', 'a', 'test', 'sentence', '.', '[SEP]']


In [None]:
test_token_ids = tokenizer.convert_tokens_to_ids(tokenized_text)

for tup in zip(tokenized_text, test_token_ids):
    print('{:<12} {:>6,}'.format(tup[0], tup[1]))

[CLS]           101
this          2,023
is            2,003
a             1,037
test          3,231
sentence      6,251
.             1,012
[SEP]           102


## Create BERT Embedding Functions

In [None]:
model = bert_model

In [None]:
def bert_text_preparation(text, tokenizer):

    print(text)

    temp = text

    marked_text = "[CLS] " + temp + " [SEP]"
    tokenized_text = tokenizer.tokenize(marked_text)
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
    segments_ids = [1]*len(indexed_tokens)


    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensors = torch.tensor([segments_ids])

    return tokenized_text, tokens_tensor, segments_tensors

In [None]:
def Bert_Embeddings(tokens_tensor, segments_tensors, model):
    # model: a transformers BertModel

    with torch.no_grad():
        # Pass the segment ids by keyword (the second positional argument is the attention mask)
        # and request the per-layer hidden states, which are not returned by default.
        outputs = model(tokens_tensor, token_type_ids=segments_tensors, output_hidden_states=True)
        # Removing the first hidden state
        # The first state is the input state (the output of the embedding layer)
        hidden_states = outputs.hidden_states[1:]

    # Getting embeddings from the final BERT layer
    token_embeddings = hidden_states[-1]
    # Dropping the batch dimension: [1, n_tokens, 768] -> [n_tokens, 768]
    token_embeddings = torch.squeeze(token_embeddings, dim=0)
    # Converting torch tensors to lists
    list_token_embeddings = [token_embed.tolist() for token_embed in token_embeddings]

    return list_token_embeddings

In [None]:
def Get_Batch_Bert_Embeddings(texts):
    target_sentences_embeddings = []

    for text_ in texts:
        tokenized_text, tokens_tensor, segments_tensors = bert_text_preparation(text_, tokenizer)
        # Use bert_model explicitly: the global name `model` is rebound to the Seq2Seq model later on.
        list_token_embeddings = Bert_Embeddings(tokens_tensor, segments_tensors, bert_model)

        target_sentences_embeddings.append(list_token_embeddings)

    # to tensor: zero-pad shorter sentences so the batch is rectangular,
    # giving [n_texts, max_n_tokens, 768]
    target_sentences_embeddings_tensor = torch.nn.utils.rnn.pad_sequence(
        [torch.tensor(emb) for emb in target_sentences_embeddings], batch_first=True)

    return target_sentences_embeddings, target_sentences_embeddings_tensor

In [None]:
def get_Bert_ids_from_texts(texts):
    token_ids = []

    for text in texts:
        marked_text = "[CLS] " + text + " [SEP]"
        tokenized_text = tokenizer.tokenize(marked_text)
        text_token_ids = tokenizer.convert_tokens_to_ids(tokenized_text)
        token_ids.append(text_token_ids)

    # to tensor
    tokens_tensor = torch.tensor(token_ids)
    return tokens_tensor

In [None]:
def get_texts_from_Bert_ids(token_ids):
    outputs = []

    for token_id in token_ids:
        output = tokenizer.convert_ids_to_tokens(token_id)
        print("output is:", output)
        outputs.append(output)

    # to tensor
    #tokens_tensor = torch.tensor(final_results)
    return outputs

## Test Functions

In [None]:
texts = ['you are my best friend!', 'my class is starting now.']
ids = get_Bert_ids_from_texts(texts)
print(ids)
out = get_texts_from_Bert_ids(ids)
print(out)

tensor([[ 101, 2017, 2024, 2026, 2190, 2767,  999,  102],
        [ 101, 2026, 2465, 2003, 3225, 2085, 1012,  102]])
output is: ['[CLS]', 'you', 'are', 'my', 'best', 'friend', '!', '[SEP]']
output is: ['[CLS]', 'my', 'class', 'is', 'starting', 'now', '.', '[SEP]']
[['[CLS]', 'you', 'are', 'my', 'best', 'friend', '!', '[SEP]'], ['[CLS]', 'my', 'class', 'is', 'starting', 'now', '.', '[SEP]']]


In [None]:
test_texts = ['you are my best friend', 'my class is here']

"""
for text in test_texts:
    x, y, z = bert_text_preparation(text, tokenizer)
"""
embeddings, embeddings_tensor = Get_Batch_Bert_Embeddings(test_texts)

print(embeddings)

## Building the Seq2Seq Model

We will build our model in three parts: The encoder, the decoder, and a seq2seq model that encapsulates the encoder and decoder.

### Encoder
First, the encoder, a 2-layer LSTM. The paper we are implementing uses a 4-layer LSTM, but in the interest of training time we cut this down to 2 layers. The concept of a multi-layer RNN is easy to expand from 2 to 4 layers.

For a multi-layer RNN, the input sentence, $X$, goes into the first (bottom) layer of the RNN, and the hidden states, $H=\{h_1, h_2, \cdots,h_T\}$, output by this layer are used as inputs to the RNN in the layer above. Thus, representing each layer with a superscript, the hidden states in the first layer are given by:
$$
h_t^1 = EncoderRNN^1(x_t, h_{t-1}^1) \tag{4}
$$
The hidden states in the second layer are given by:
$$
h_t^2 = EncoderRNN^2(h_t^1, h_{t-1}^2) \tag{5}
$$
Using a multi-layer RNN also means we need an initial hidden state as input per layer, $h_0^l$, and we also output a context vector per layer, $z^l$.
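Equations (4) and (5) can be sketched with two stacked toy cells, where the second layer consumes the hidden state of the first. The scalar `cell` and its weights are assumptions for illustration; the actual encoder uses `nn.LSTM`, which also carries a cell state per layer.

```python
import math

def cell(x, h_prev, w_x, w_h):
    """Toy scalar recurrent cell: tanh(w_x * x + w_h * h_prev)."""
    return math.tanh(w_x * x + w_h * h_prev)

def two_layer_encode(inputs, h0=(0.0, 0.0)):
    """Run a 2-layer stacked recurrence and return one context value per layer."""
    h1, h2 = h0  # one initial hidden state per layer
    for x_t in inputs:
        h1 = cell(x_t, h1, w_x=0.5, w_h=0.8)  # eq. (4): layer 1 reads the input x_t
        h2 = cell(h1, h2, w_x=0.9, w_h=0.4)   # eq. (5): layer 2 reads h_t^1
    return h1, h2  # z^1 and z^2

z1, z2 = two_layer_encode([1.0, 0.3, -0.7, 0.2])
print(z1, z2)
```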

In [None]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()

        self.hid_dim = hid_dim
        self.n_layers = n_layers

        """
        change Embedding from torch.nn.Embedding into Bert Embedding
        """
        #self.embedding = nn.Embedding(input_dim, emb_dim)

        self.rnn = nn.LSTM(emb_dim, hid_dim, num_layers=n_layers, dropout=dropout)

        self.dropout = nn.Dropout(dropout)

    def forward(self, src):
        # src : [sen_len, batch_size]

        """
        change Embedding from torch.nn.Embedding into Bert Embedding
        """
        # Get_Batch_Bert_Embeddings returns (nested list, tensor); only the tensor can go through dropout.
        _, embedded = Get_Batch_Bert_Embeddings(src)
        embedded = self.dropout(embedded)
        #embedded = self.dropout(self.embedding(src))

        # embedded : [sen_len, batch_size, emb_dim]
        outputs, (hidden, cell) = self.rnn(embedded)
        # outputs = [sen_len, batch_size, hid_dim * n_directions]
        # hidden = [n_layers * n_direction, batch_size, hid_dim]
        # cell = [n_layers * n_direction, batch_size, hid_dim]
        return hidden, cell

### Decoder

Next, we will build our decoder, which will also be a 2-layer LSTM.

![](https://github.com/bentrevett/pytorch-seq2seq/raw/3a8dc5515ff28cb059532439c5687126dd30015f/assets/seq2seq3.png)

We can use the following equations to explain the decoder model.
$$
(s_t^1, c_t^1) = DecoderLSTM^1(y_t, (s_{t-1}^1, c_{t-1}^1))
$$
$$
(s_t^2, c_t^2) = DecoderLSTM^2(s_t^1, (s_{t-1}^2, c_{t-1}^2))
$$
Remember that the initial hidden and cell states of our decoder are our context vectors, which are the final hidden and cell states of our encoder from the same layer, i.e. $(s_0^l, c_0^l) = z^l = (h_T^l, c_T^l)$.

We then pass the hidden state from the top layer of the RNN, $s_t^2$, through a linear layer, $f$, to make a prediction of what the next token in the target (output) sequence should be, $\hat{y}_{t+1}$:
$$
\hat{y}_{t+1} = f(s_t^2)
$$
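The prediction step $f$ is a linear map from the hidden state to one score per vocabulary word, followed by picking the highest score. A minimal sketch with a hypothetical 3-word vocabulary and made-up weights:

```python
def linear(s, weights, bias):
    """Compute W s + b, giving one score per vocabulary entry."""
    return [sum(w * x for w, x in zip(row, s)) + b for row, b in zip(weights, bias)]

def predict_token(s, weights, bias, itos):
    """Return the highest-scoring token together with all scores."""
    scores = linear(s, weights, bias)
    best = max(range(len(scores)), key=scores.__getitem__)
    return itos[best], scores

# Hypothetical vocabulary, weights and top-layer hidden state s_t^2.
itos = ["<eos>", "good", "morning"]
weights = [[0.1, -0.2], [0.7, 0.3], [-0.4, 0.9]]
bias = [0.0, 0.1, -0.1]
token, scores = predict_token([1.0, 0.5], weights, bias, itos)
print(token, scores)
```

In the decoder below, `fc_out` plays the role of `linear` and `argmax(1)` in the Seq2Seq class plays the role of the `max` call.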

In [None]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()

        self.output_dim = output_dim
        self.emb_dim = emb_dim
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        """
        change Embedding from torch.nn.Embedding into Bert Embedding
        """
        #self.embedding = nn.Embedding(output_dim, emb_dim)

        self.rnn = nn.LSTM(emb_dim, hid_dim, num_layers=self.n_layers, dropout=dropout)

        self.fc_out = nn.Linear(hid_dim, output_dim)

        self.dropout = nn.Dropout(dropout)

    def forward(self, input, hidden, cell):

        # input = [batch_size]
        # hidden = [n_layers * n_dir, batch_size, hid_dim]
        # cell = [n_layers * n_dir, batch_size, hid_dim]

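        input = input.unsqueeze(0)
        # input : [1, batch_size]

        """
        change Embedding from torch.nn.Embedding into Bert Embedding
        """
        # Get_Batch_Bert_Embeddings returns (nested list, tensor); only the tensor can go through dropout.
        _, embedded = Get_Batch_Bert_Embeddings(input)
        embedded = self.dropout(embedded)
        #embedded = self.dropout(self.embedding(input))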
        # embedded = [1, batch_size, emb_dim]

        output, (hidden, cell) = self.rnn(embedded, (hidden, cell))
        # output = [seq_len, batch_size, hid_dim * n_dir]
        # hidden = [n_layers * n_dir, batch_size, hid_dim]
        # cell = [n_layers * n_dir, batch_size, hid_dim]

        # seq_len and n_dir will always be 1 in the decoder
        prediction = self.fc_out(output.squeeze(0))
        # prediction = [batch_size, output_dim]
        return prediction, hidden, cell

### Seq2Seq
For the final part of the implementation, we will implement the seq2seq model. It will handle:

- receiving the input/source sentence

- using the encoder to produce the context vectors

- using the decoder to produce the predicted output/target sentence

Our full model will look like this:

![](https://github.com/bentrevett/pytorch-seq2seq/raw/3a8dc5515ff28cb059532439c5687126dd30015f/assets/seq2seq4.png)

In [None]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()

        self.encoder = encoder
        self.decoder = decoder
        self.device = device

        assert encoder.hid_dim == decoder.hid_dim, \
            'hidden dimensions of encoder and decoder must be equal.'
        assert encoder.n_layers == decoder.n_layers, \
            'n_layers of encoder and decoder must be equal.'

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        # src = [sen_len, batch_size]
        # trg = [sen_len, batch_size]
        # teacher_forcing_ratio : the probability to use the teacher forcing.
        batch_size = trg.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim

        # tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)

        # last hidden state of the encoder is used as the initial hidden state of the decoder
        hidden, cell = self.encoder(src)

        # first input to the decoder is the <sos> token.
        input = trg[0, :]
        for t in range(1, trg_len):
            # insert input token embedding, previous hidden and previous cell states
            # receive output tensor (predictions) and new hidden and cell states.
            output, hidden, cell = self.decoder(input, hidden, cell)

            # place predictions in a tensor holding predictions for each token
            outputs[t] = output

            # decide if we are going to use teacher forcing or not.
            teacher_force = random.random() < teacher_forcing_ratio

            # get the highest predicted token from our predictions.
            top1 = output.argmax(1)
            # update input : use ground_truth when teacher_force
            input = trg[t] if teacher_force else top1

        return outputs

## Configuration

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f'device: {device}')

train_on_gpu = torch.cuda.is_available()

if not train_on_gpu:
    print('CUDA is not available.  Training on CPU ...')
else:
    print('CUDA is available!  Training on GPU ...')

device: cpu
CUDA is not available.  Training on CPU ...


## Training the Seq2Seq model


In [None]:
!pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html

In [None]:
# First initialize our model.

import torch.cuda

INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
N_LAYERS = 2
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5
torch.set_default_tensor_type('torch.FloatTensor')
encoder = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT)
decoder = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT)


model = Seq2Seq(encoder, decoder, device).to(device)

In [None]:
def init_weights(m):
    for name, param in m.named_parameters():
        nn.init.uniform_(param.data, -0.08, 0.08)

model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (rnn): LSTM(256, 512, num_layers=2, dropout=0.5)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (rnn): LSTM(256, 512, num_layers=2, dropout=0.5)
    (fc_out): Linear(in_features=512, out_features=5940, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

In [None]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 10,403,636 trainable parameters


In [None]:
optimizer = optim.Adam(model.parameters())

TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]

criterion = nn.CrossEntropyLoss(ignore_index=TRG_PAD_IDX)

Next, we'll define our training loop.

First, we'll set the model into "training mode" with model.train() (this turns on dropout and batch normalization), and then iterate through our data iterator.

As stated before, our decoder loop starts at 1, not 0. This means the 0th element of our outputs tensor remains all zeros. So our trg & outputs look something like:
$$
trg = [\langle sos \rangle, y_1, y_2, y_3, \langle eos \rangle]
$$
$$
output = [0, \hat{y}_1, \hat{y}_2, \hat{y}_3, \langle eos \rangle]
$$
Here, when we calculate the loss, we cut off the first element of each tensor to get:
$$
trg = [y_1, y_2, y_3, \langle eos \rangle]
$$
$$
output = [\hat{y}_1, \hat{y}_2, \hat{y}_3, \langle eos \rangle]
$$
At each iteration:

- get the source and target sentences from the batch, X and Y

- zero the gradients calculated from the last batch

- feed the source and target into the model to get the output $\hat{y}$

- as the loss function only works on 2d inputs with 1d targets we need to flatten each of them with .view
    - we slice off the first column of the output and target tensors as mentioned above.
    
- calculate the gradients with loss.backward()

- clip the gradients to prevent them from exploding (a common issue in RNNs)

- update the parameters of our model by doing an optimizer.step()

- sum the loss value to a running total.

Finally, we return the loss averaged over all batches.
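The slicing, flattening and padding-aware loss described in the steps above can be sketched without a framework. The nested lists stand in for tensors, and `sequence_loss` is a hypothetical helper; it averages over non-pad targets, which is what `nn.CrossEntropyLoss(ignore_index=...)` does with its default mean reduction.

```python
import math

def cross_entropy(logits, target):
    """Negative log-softmax of the target class for one position."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[target]

def sequence_loss(outputs, trg, pad_idx):
    """outputs: [trg_len][batch][vocab] scores; trg: [trg_len][batch] token ids."""
    # Drop time step 0 (the <sos> slot), then flatten time and batch into one axis.
    flat_out = [scores for step in outputs[1:] for scores in step]
    flat_trg = [tok for step in trg[1:] for tok in step]
    # Skip padded positions, then average over the remaining ones.
    losses = [cross_entropy(o, t) for o, t in zip(flat_out, flat_trg) if t != pad_idx]
    return sum(losses) / len(losses)

# Hypothetical batch: vocab of 3 ids where 0 is the pad id, trg_len 3, batch size 2.
trg = [[1, 1], [2, 2], [2, 0]]
outputs = [[[0.0, 0.0, 0.0] for _ in range(2)] for _ in range(3)]
loss = sequence_loss(outputs, trg, pad_idx=0)
print(loss)
```

With all scores equal, every non-pad position contributes log(3), so the padded position visibly has no effect on the mean.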

In [None]:
def train(model, iterator, optimizer, criterion, clip):

    model.train()

    epoch_loss = 0

    for i, batch in enumerate(iterator):
        src = batch.src
        trg = batch.trg

        optimizer.zero_grad()
        # trg = [sen_len, batch_size]
        # output = [trg_len, batch_size, output_dim]
        output = model(src, trg)
        output_dim = output.shape[-1]

        # transform our output: slice off the first element along the time dimension, and flatten the output into 2 dims.
        output = output[1:].view(-1, output_dim)
        trg = trg[1:].view(-1)
        # trg = [(trg_len-1) * batch_size]
        # output = [(trg_len-1) * batch_size, output_dim]

        loss = criterion(output, trg)

        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)

        optimizer.step()

        epoch_loss += loss.item()

    return epoch_loss / len(iterator)

In [None]:
def evaluate(model, iterator, criterion):

    model.eval()

    epoch_loss = 0

    with torch.no_grad():

        for i, batch in enumerate(iterator):

            src = batch.src
            trg = batch.trg

            output = model(src, trg, 0) # turn off teacher forcing.

            # trg = [sen_len, batch_size]
            # output = [sen_len, batch_size, output_dim]
            output_dim = output.shape[-1]

            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)

            loss = criterion(output, trg)

            epoch_loss += loss.item()

    return epoch_loss / len(iterator)

In [None]:
# a helper that tells us how long an epoch takes.
def epoch_time(start_time, end_time):

    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time  / 60)
    elapsed_secs = int(elapsed_time -  (elapsed_mins * 60))
    return  elapsed_mins, elapsed_secs

In [None]:
N_EPOCHS = 10

CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()

    train_loss = train(model, train_iter, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_iter, criterion)

    end_time = time.time()
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'Seq2SeqModel.pt')
    print(f"Epoch: {epoch+1:02} | Time {epoch_mins}m {epoch_secs}s")
    print(f"\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}")
    print(f"\tValid Loss: {valid_loss:.3f} | Valid PPL: {math.exp(valid_loss):7.3f}")

In [None]:
def test():
    best_model = Seq2Seq(encoder, decoder, device).to(device)
    best_model.load_state_dict(torch.load('Seq2SeqModel.pt'))

    # Evaluate the model that the checkpoint was just loaded into.
    test_loss = evaluate(best_model, test_iter, criterion)

    print(f"Test Loss : {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f}")

test()

Test Loss : 4.053 | Test PPL:  57.566
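The PPL value printed above is the perplexity, computed in the code as the exponential of the mean cross-entropy loss. A small sketch of that relationship (the function name `perplexity` is just for illustration):

```python
import math

def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)

print(f"{perplexity(4.053):.2f}")

# A model that guesses uniformly over a vocabulary of V words has perplexity V.
vocab_size = 5940
print(perplexity(math.log(vocab_size)))
```

Read this way, a perplexity near 57 means the model is on average about as uncertain as a uniform choice among roughly 57 words, compared with 5940 for a uniform guess over the target vocabulary. The small gap to the printed 57.566 comes from the loss being rounded to three decimals.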


## Summary

Although this is a traditional Seq2Seq model, as the original author notes, it contains many useful tricks worth learning for a beginner in PyTorch and deep learning. I list them below as notes:

### Data Preparing Part

- In the original paper, the authors find it beneficial to reverse the order of the input, which they believe "introduces many short term dependencies in the data that make the optimization problem much easier".

In other words, reversing the source sentence before encoding it helps when building the context vector.

- When building the vocabulary, we can use the min_freq parameter to exclude rare words from the vocabulary.

e.g. SRC.build_vocab(train_data, min_freq=2)
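Both data preparation tricks can be sketched on a tiny made-up corpus. The `tokenize` function follows the notebook's punctuation-stripping tokenizer; `build_vocab` is a hypothetical simplification, and the index order it produces is not claimed to match torchtext's.

```python
import string
from collections import Counter

def tokenize(text):
    """Strip punctuation, lowercase, and split on whitespace."""
    return text.translate(str.maketrans("", "", string.punctuation)).lower().split()

def build_vocab(tokenized_sentences, min_freq,
                specials=("<unk>", "<pad>", "<sos>", "<eos>")):
    """Keep only tokens that occur at least min_freq times, after the special tokens."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    kept = sorted(tok for tok, n in counts.items() if n >= min_freq)
    return {tok: i for i, tok in enumerate(list(specials) + kept)}

corpus = ["Guten Morgen!", "Guten Abend.", "Guten Morgen, Welt"]
tokenized = [tokenize(s) for s in corpus]
reversed_source = [sent[::-1] for sent in tokenized]  # source-side reversal
vocab = build_vocab(tokenized, min_freq=2)
print(reversed_source[2], vocab)
```

Words below the threshold, such as "abend" here, are left out of the vocabulary and would be looked up as the unknown token.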

### Seq2Seq Model Part

- In this notebook, we learn how to build a deep learning pipeline: separate the model into different parts and then combine them.

- It is a traditional Seq2Seq model. The encoder produces the context vectors, given by the final hidden and cell states of each LSTM layer, and the decoder initializes its $h_0, c_0$ from this encoder output. At each time-step $t$, the decoder generates the word at position $t+1$ of the target sentence, and its next input is chosen between the ground-truth word and the previous prediction according to the teacher forcing ratio.

- This is the first time I have seen the teacher forcing ratio. It is a simple but useful way to control how often the decoder is fed the ground-truth word instead of its own prediction.

### Train Part

- Clip: as mentioned before, we should call torch.nn.utils.clip_grad_norm_(model.parameters(), clip) before optimizer.step() to avoid exploding gradients. Here we set clip=1.
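What norm-based clipping does can be sketched on a plain list of gradient values. This is a simplified version of the idea behind `clip_grad_norm_`, not its implementation: it treats all gradients as one flat vector and rescales them when their L2 norm exceeds `max_norm`.

```python
import math

def clip_grad_norm(grads, max_norm):
    """Rescale grads so their global L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

clipped, norm_before = clip_grad_norm([3.0, 4.0], max_norm=1.0)
print(norm_before, clipped)
```

The direction of the gradient is preserved and only its length shrinks, so large updates are damped without changing which way the parameters move.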