# 3 - Neural Machine Translation by Jointly Learning to Align and Translate

In this implementation, we will be using sequence-to-sequence models. We will be implementing the attention mechanism described in the paper [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473).

This model differs from the previous implementations in three ways:

1. We will use the Multi30k dataset, which is widely used as a benchmark for comparing machine translation models.
2. We will use a bidirectional RNN, a small addition to the RNN layers used previously.
3. We will use an attention mechanism to improve translation quality.

## Introduction

A general encoder-decoder model is shown below:

![](figures/Encoder_decoder_testing.png)
Figure: Sequence to Sequence Model

The previous approaches can suffer from the following problems:

1. The context vector may not capture the entire sequence correctly, and may lose information from early time steps.
2. The encoder's output at each time step, which holds vital information about that step, is not used at all.

# Importing requirements

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

from torchtext.datasets import TranslationDataset, Multi30k
from torchtext.data import Field, BucketIterator

import spacy

import random
import math
import time

In [2]:
# setting manual seeds for reproducibility
SEED = 1234
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
# Choose the GPU if available, else fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Dataset

Up to the previous iteration we used French-to-English translation pairs from http://www.manythings.org/anki/fra-eng.zip, with a few thousand training samples. In this implementation, to keep the preprocessing simple, I will use the Multi30k dataset. Multi30k is a slightly larger dataset from the WMT 2016 multimodal task, an extension of the Flickr30k image-caption dataset. It provides 29,000 training, 1,014 validation, and 1,000 test samples. The present task is German-to-English translation. The attention model has many parameters, and it is difficult to train such a model on a very small dataset, which is why I am using a slightly larger one. After this implementation you will see that the attention model can generate meaningful translations.

# Pre-processing 
For preprocessing, torchtext is used to greatly simplify the pipeline; it has a built-in Multi30k data loader. First we load the German and English spaCy models. In the previous implementation we tokenized text only by spaces. In this implementation we use separate spaCy tokenizers for English and German, because language-specific tokenization helps a lot.
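
To see why splitting on spaces alone is limiting, here is a minimal sketch that compares `str.split` with a small regex tokenizer. The regex tokenizer is only a stand-in for illustration: `simple_tokenize` is a hypothetical helper, not spaCy's tokenizer, and real language-specific rules are far richer.

```python
import re

def simple_tokenize(text):
    # Hypothetical stand-in for a language-aware tokenizer:
    # words and punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "Mein Name ist Sunil und ich bin Datenwissenschaftler."

space_tokens = sentence.split(" ")
regex_tokens = simple_tokenize(sentence)

print(space_tokens[-1])   # the final period stays glued to the last word
print(regex_tokens[-2:])  # the last word and the period are separate tokens
```

With space splitting, `Datenwissenschaftler.` and `Datenwissenschaftler` would count as two different vocabulary entries, which is one reason a proper tokenizer helps.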

In [3]:
spacy_de = spacy.load('de')
spacy_en = spacy.load('en')

We create a tokenizer function for each language.

In [4]:
def tokenize_de(text):
    """
    Tokenizes German text from a string into a list of strings
    """
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    """
    Tokenizes English text from a string into a list of strings
    """
    return [tok.text for tok in spacy_en.tokenizer(text)]

In [5]:
SRC = Field(tokenize = tokenize_de, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True)

TRG = Field(tokenize = tokenize_en, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True)

In [6]:
train_data, valid_data, test_data = Multi30k.splits(exts = ('.de', '.en'), 
                                                    fields = (SRC, TRG))

In [7]:
SRC.build_vocab(train_data, min_freq = 2)
TRG.build_vocab(train_data, min_freq = 2)
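
The `min_freq = 2` argument keeps only tokens that appear at least twice in the training data; rarer tokens are mapped to `<unk>`. The sketch below illustrates the idea in plain Python. `build_vocab` and `numericalize` are hypothetical, simplified helpers (not torchtext's implementation), and the order of the special tokens is an assumption.

```python
from collections import Counter

def build_vocab(tokenized_sentences, min_freq=2,
                specials=("<unk>", "<pad>", "<sos>", "<eos>")):
    # Hypothetical, simplified helper: count tokens and keep the frequent ones.
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    itos = list(specials) + sorted(t for t, c in counts.items() if c >= min_freq)
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

corpus = [["a", "man", "plays", "guitar"],
          ["a", "man", "runs"],
          ["a", "dog", "runs"]]

itos, stoi = build_vocab(corpus, min_freq=2)

def numericalize(tokens):
    # Tokens that did not make it into the vocabulary fall back to <unk>.
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]

print(itos)
print(numericalize(["a", "dog", "plays"]))
```

Here `dog` and `plays` each occur only once, so both are replaced by the `<unk>` index.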

In [8]:
BATCH_SIZE = 32

train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE,
    device = device)

# Seq2Seq Model

## Encoder

Before going further, let us first understand bidirectional RNN layers. Up to the previous implementation we used only a unidirectional RNN, which reads the sequence in a single direction. In this implementation I am using a bidirectional RNN: each layer has two RNNs, one running forward over the sequence and one running backward. A bidirectional RNN does not require any extra lines of code, only the parameter `bidirectional = True`. After this we can pass the embedded sentence to the RNN as in the previous implementations.

In the encoder and decoder I have used dropout to regularize learning. Dropout is active only during training.


|   German	|   English	|
|:---:|:---:|
|  <sos> Mein Name ist Sunil und ich bin Datenwissenschaftler. <eos>	| <sos>  My name is Sunil and I am a data scientist. <eos>	|

Where 

    <sos> : `Start of sequence` indicator
    <eos> :  `End of sequence` indicator
    
Mathematically, a bidirectional RNN can be represented as:

$$\begin{align*}
h_t^\rightarrow &= \text{EncoderRNN}^\rightarrow(x_t^\rightarrow,h_{t-1}^\rightarrow)\\
h_t^\leftarrow &= \text{EncoderRNN}^\leftarrow(x_t^\leftarrow,h_{t-1}^\leftarrow)
\end{align*}$$

For the German-to-English example above, the first and second input tokens for the forward and backward RNNs are:

Forward RNN : $x_0^\rightarrow = \text{<sos>}, x_1^\rightarrow = \text{Mein}$

Backward RNN : $x_0^\leftarrow = \text{<eos>}, x_1^\leftarrow = \text{Datenwissenschaftler}$.
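
The two reading directions can be sketched with a toy scalar RNN in plain Python. This is only an illustration under stated assumptions: each token is pretended to be embedded as a single number, the cell is a simple `tanh` recurrence rather than a GRU, and the weights (`w_x`, `w_h`) are made-up values.

```python
import math

def run_rnn(inputs, w_x, w_h):
    # Toy scalar cell: h_t = tanh(w_x * x_t + w_h * h_{t-1}), starting from h_0 = 0.
    h = 0.0
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

tokens = ["<sos>", "Mein", "Name", "<eos>"]
embedded = [0.1, 0.7, -0.3, 0.2]   # one made-up number per token

# The forward RNN reads <sos> ... <eos>.
forward_states = run_rnn(embedded, w_x=0.5, w_h=0.8)
# The backward RNN has its own weights and reads <eos> ... <sos>;
# its states are flipped back so index i again refers to token i.
backward_states = run_rnn(embedded[::-1], w_x=0.4, w_h=0.6)[::-1]

# Per-token output: forward and backward states side by side.
outputs = list(zip(forward_states, backward_states))

z_forward = forward_states[-1]    # forward state after reading the last token
z_backward = backward_states[0]   # backward state after reading the first token
print(outputs)
```

Note that the final backward state sits at the position of the first token, because that is the last token the backward RNN reads.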

As shown below, the layer has forward and backward units. Each unit takes one input and generates an output $o$ and an updated hidden state $h$.

![](figures/Bidirectional_lstm.png)
Figure: Showing bidirectional RNN

The figure above shows a bidirectional RNN. A single layer has both forward and backward RNN units; for clarity, the layer is shown unrolled. Each word of the input, in embedded form, is read in both the forward and the backward direction. At each timestep $t$, a forward output $O_t^{\rightarrow}$ and a backward output $O_t^{\leftarrow}$ are generated. The final hidden states of the forward and backward runs are denoted $z^\rightarrow=h_T^\rightarrow$ and $z^\leftarrow=h_T^\leftarrow$ respectively.


This time, to keep things simple, we pass only the embedded input to the GRU and leave the initialization of the forward ($h_0^\rightarrow$) and backward ($h_0^\leftarrow$) hidden states to the GRU itself, which defaults to zeros.

Finally, the hidden states from the forward and backward runs are concatenated to get the context vector $Z$ for layer $l$: $Z_l = (z^\rightarrow, z^\leftarrow)$

The encoder is very similar to our previous implementation. It returns two outputs:
1. Outputs for each timestep will be of shape  **[src sent len, batch size, hid dim * num directions]**
2. Concatenated hidden states of shape **[n layers * num directions, batch size, hid dim]**
 
In the code  **[-2, :, :]** gives the top layer forward RNN hidden state after the final time-step (i.e. after it has seen the last word in the sentence) and **[-1, :, :]** gives the top layer backward RNN hidden state after the final time-step (i.e. after it has seen the first word in the sentence).
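
The indexing can be illustrated without tensors. The sketch below assumes the stacking order described in the encoder code comments, `[forward_1, backward_1, forward_2, backward_2, ...]`, and uses plain strings in place of hidden-state tensors; `stacked_layout` is a hypothetical helper.

```python
def stacked_layout(n_layers):
    # Mimics the order of the first dimension of `hidden`:
    # one forward and one backward entry per layer.
    layout = []
    for layer in range(1, n_layers + 1):
        layout.append(f"forward_{layer}")
        layout.append(f"backward_{layer}")
    return layout

one_layer = stacked_layout(1)    # the single-layer GRU used in this notebook
two_layers = stacked_layout(2)   # a deeper encoder, for comparison

print(one_layer[-2], one_layer[-1])
print(two_layers[-2], two_layers[-1])
```

Negative indices always pick the top layer, so the same code keeps working if more layers are added later.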

The decoder is not bidirectional, and in the original paper the authors use only one of the hidden states to initialize it. Instead, here I concatenate the forward and backward hidden states, pass them through a linear layer $g$, and apply an activation on top:


$$C = \text{ReLU}(g(h_T^\rightarrow, h_T^\leftarrow))$$

$C$ stands for the final context vector.
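
As a small worked example of this formula, the sketch below concatenates two toy hidden states, applies a linear map, and then a ReLU. The vectors, weights, and the helper names `linear` and `relu` are made up for illustration; in the notebook this role is played by a learned `nn.Linear` layer.

```python
def linear(x, weight, bias):
    # y = W x + b for one input vector; `weight` has one row per output unit.
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(x):
    return [max(0.0, xi) for xi in x]

h_forward = [0.5, -1.0]    # toy final forward hidden state (enc hid dim = 2)
h_backward = [0.25, 2.0]   # toy final backward hidden state

concat = h_forward + h_backward   # length = enc hid dim * 2 = 4

# Made-up weights for g: 4 inputs mapped to dec hid dim = 3 outputs.
weight = [[1.0, 0.0, 0.0, 0.0],
          [0.0, 1.0, 0.0, 0.0],
          [0.0, 0.0, 1.0, 1.0]]
bias = [0.0, 0.0, -0.25]

context = relu(linear(concat, weight, bias))
print(context)   # [0.5, 0.0, 2.0]
```

The second entry is clipped to zero by the ReLU, and the output length matches the decoder hidden size rather than the concatenated encoder size.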


In [9]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout):
        super().__init__()
        
        self.input_dim = input_dim
        self.emb_dim = emb_dim
        self.enc_hid_dim = enc_hid_dim
        self.dec_hid_dim = dec_hid_dim
        self.dropout = dropout
        
        self.embedding = nn.Embedding(input_dim, emb_dim)
        
        self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional = True)
        
        self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src):
        
        #src = [src sent len, batch size]
        
        embedded = self.dropout(self.embedding(src))
        
        #embedded = [src sent len, batch size, emb dim]
        
        outputs, hidden = self.rnn(embedded)
                
        #outputs = [src sent len, batch size, hid dim * num directions]
        #hidden = [n layers * num directions, batch size, hid dim]
        
        #hidden is stacked [forward_1, backward_1, forward_2, backward_2, ...]
        #outputs are always from the last layer
        
        #hidden [-2, :, : ] is the last of the forwards RNN 
        #hidden [-1, :, : ] is the last of the backwards RNN
        
        #initial decoder hidden is final hidden state of the forwards and backwards 
        #  encoder RNNs fed through a linear layer
        hidden = torch.relu(self.fc(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)))
        
        #outputs = [src sent len, batch size, enc hid dim * 2]
        #hidden = [batch size, dec hid dim]
    
        return outputs, hidden

### Attention

Next is the attention layer. It takes the encoder hidden state and the encoder outputs as input. The attention mechanism involves several operations, but the main idea is to combine the encoder outputs ($O$) with the context vector ($C$, the final encoder hidden state; at later steps, the previous decoder hidden state) to produce the attention weights. These weights indicate how much each source token should contribute when generating the next target word. The attention weights are applied to the encoder outputs by batch-wise matrix multiplication, and the resulting weighted vector is combined with the embedding of the previously generated decoder token ($O_{t-1}$) to generate the next token $O_t$ at time $t$.

![](figures/Attention.png)
Figure :  Attention Mechanism

The overall procedure shown in the figure can be summarized step by step:
1. Take the encoder outputs and the encoder hidden state.
2. Repeat the encoder hidden state source-sequence-length times.
3. Concatenate the two after the proper permutation.
4. Pass the result through a linear layer with an activation, permute it, and carry out batch-wise matrix multiplication (BMM) with the learnable vector `v`.
5. Apply a softmax; the resulting weights are called the attention weights.
6. The attention weights undergo batch-wise matrix multiplication (BMM) with the encoder outputs, and the resulting weighted vector is given to the decoder.
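
The steps above can be sketched for a single example (no batch dimension) in plain Python. This is a minimal illustration, not the notebook's implementation: the helper names, the tiny dimensions, and all weights (`W`, `b`, `v`) are made-up assumptions, and lists stand in for tensors.

```python
import math

def softmax(scores):
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(hidden, encoder_outputs, W, b, v):
    scores = []
    for o in encoder_outputs:
        concat = hidden + o               # pair the hidden state with each encoder output
        energy = [max(0.0, sum(w * x for w, x in zip(row, concat)) + bi)
                  for row, bi in zip(W, b)]   # linear layer followed by ReLU
        scores.append(sum(vi * e for vi, e in zip(v, energy)))  # dot product with v
    return softmax(scores)                # one weight per source token, summing to 1

def weighted_context(weights, encoder_outputs):
    # Weighted sum of the encoder outputs.
    dim = len(encoder_outputs[0])
    return [sum(a * o[d] for a, o in zip(weights, encoder_outputs)) for d in range(dim)]

hidden = [1.0, 0.0]
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three source tokens
W = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 1.0]]
b = [0.0, 0.0]
v = [1.0, 1.0]

a = attention_weights(hidden, encoder_outputs, W, b, v)
context = weighted_context(a, encoder_outputs)
print(a, context)
```

With these toy numbers the third source token receives the largest weight, so the context vector is pulled toward its encoder output.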



In [10]:
class Attention(nn.Module):
    def __init__(self, enc_hid_dim, dec_hid_dim):
        super().__init__()
        
        self.enc_hid_dim = enc_hid_dim
        self.dec_hid_dim = dec_hid_dim
        
        self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim)
        self.v = nn.Parameter(torch.rand(dec_hid_dim))
        
    def forward(self, hidden, encoder_outputs):
        
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [src sent len, batch size, enc hid dim * 2]
        batch_size = encoder_outputs.shape[1]
        src_len = encoder_outputs.shape[0]
        
        #repeat encoder hidden state src_len times
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)
        
        encoder_outputs = encoder_outputs.permute(1, 0, 2)

        
        #hidden = [batch size, src sent len, dec hid dim]
        #encoder_outputs = [batch size, src sent len, enc hid dim * 2]
        
        energy = torch.relu(self.attn(torch.cat((hidden, encoder_outputs), dim = 2))) 

        
        #energy = [batch size, src sent len, dec hid dim]
        
        energy = energy.permute(0, 2, 1)
        
        #energy = [batch size, dec hid dim, src sent len]
        
        #v = [dec hid dim]
        
        v = self.v.repeat(batch_size, 1).unsqueeze(1)
        
        #v = [batch size, 1, dec hid dim]
                
        attention = torch.bmm(v, energy).squeeze(1)
        
        #attention= [batch size, src len]
        
        return F.softmax(attention, dim=1)

### Decoder

The decoder contains the attention layer. The decoder takes the input token, the encoder hidden state / context vector ($C$), and the encoder outputs.
For example, with a batch size of 4, at the first time step the indices corresponding to the `<sos>` token are given as $[idx[<sos>], idx[<sos>], idx[<sos>], idx[<sos>]]$, which has shape
[1, 4]. With an embedding dimension of 10, each index is represented by a 10-dimensional dense vector, so after the embedding layer the resulting shape is [1, 4, 10]. This is concatenated with the attention-weighted encoder outputs and serves as the input to the decoder RNN.

The decoder performs the following steps:
1. The attention-weighted vector is concatenated with the embedding of the previously generated token. To generate the first token, the embeddings of the `<sos>` token are provided as input. The decoder then outputs the next token.
2. Under the teacher forcing training scheme, the ground-truth token is given as input to the next timestep with probability `teacher_forcing_ratio`; otherwise the decoder's own prediction is used.
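
In the `Seq2Seq` code, the choice between the ground-truth token and the model's own prediction is controlled by `teacher_forcing_ratio`. A minimal sketch of that choice follows; `next_decoder_input` is a hypothetical helper and strings stand in for token indices.

```python
import random

def next_decoder_input(target_token, predicted_token, teacher_forcing_ratio, rng):
    # With probability teacher_forcing_ratio feed the ground-truth token,
    # otherwise feed back the model's own prediction.
    teacher_force = rng.random() < teacher_forcing_ratio
    return target_token if teacher_force else predicted_token

rng = random.Random(1234)
choices = [next_decoder_input("truth", "prediction", 0.5, rng) for _ in range(1000)]
share_truth = choices.count("truth") / len(choices)

always_truth = next_decoder_input("truth", "prediction", 1.0, rng)
never_truth = next_decoder_input("truth", "prediction", 0.0, rng)
print(share_truth, always_truth, never_truth)
```

A ratio of 1.0 always feeds the ground truth, and a ratio of 0 (as used during evaluation) always feeds the prediction.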


In [11]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout, attention):
        super().__init__()

        self.emb_dim = emb_dim
        self.enc_hid_dim = enc_hid_dim
        self.dec_hid_dim = dec_hid_dim
        self.output_dim = output_dim
        self.dropout = dropout
        self.attention = attention
        
        self.embedding = nn.Embedding(output_dim, emb_dim)
        
        self.rnn = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim)
        
        self.out = nn.Linear((enc_hid_dim * 2) + dec_hid_dim + emb_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, encoder_outputs):
             
        #input = [batch size]
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [src sent len, batch size, enc hid dim * 2]
        
        input = input.unsqueeze(0)
        
        #input = [1, batch size]
        
        embedded = self.dropout(self.embedding(input))
        
        #embedded = [1, batch size, emb dim]
        
        a = self.attention(hidden, encoder_outputs)
                
        #a = [batch size, src len]
        
        a = a.unsqueeze(1)
        
        #a = [batch size, 1, src len]
        
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        
        #encoder_outputs = [batch size, src sent len, enc hid dim * 2]
        
        weighted = torch.bmm(a, encoder_outputs)
        
        #weighted = [batch size, 1, enc hid dim * 2]
        
        weighted = weighted.permute(1, 0, 2)
        
        #weighted = [1, batch size, enc hid dim * 2]
        
        rnn_input = torch.cat((embedded, weighted), dim = 2)
        
        #rnn_input = [1, batch size, (enc hid dim * 2) + emb dim]
            
        output, hidden = self.rnn(rnn_input, hidden.unsqueeze(0))
        
        #output = [sent len, batch size, dec hid dim * n directions]
        #hidden = [n layers * n directions, batch size, dec hid dim]
        
        #sent len, n layers and n directions will always be 1 in this decoder, therefore:
        #output = [1, batch size, dec hid dim]
        #hidden = [1, batch size, dec hid dim]
        #this also means that output == hidden
        assert (output == hidden).all()
        
        embedded = embedded.squeeze(0)
        output = output.squeeze(0)
        weighted = weighted.squeeze(0)
        
        output = self.out(torch.cat((output, weighted, embedded), dim = 1))
        
        #output = [bsz, output dim]
        
        return output, hidden.squeeze(0)


### Seq2Seq
The `Seq2Seq` class encapsulates the three modules discussed above: the encoder, the attention layer, and the decoder.

Briefly going over all of the steps:
- the `outputs` tensor is created to hold all predictions, $\hat{Y}$
- the source sequence, $X$, is fed into the encoder to receive $C$ and $O$
- the initial decoder hidden state is set to be the `context` vector, $C$
- we use a batch of `<sos>` tokens as the first `input`, $y_1$
- we then decode with the help of a loop:

In [12]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        
        #src = [src sent len, batch size]
        #trg = [trg sent len, batch size]
        #teacher_forcing_ratio is probability to use teacher forcing
        #e.g. if teacher_forcing_ratio is 0.75 we use teacher forcing 75% of the time
        
        batch_size = src.shape[1]
        max_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        #tensor to store decoder outputs
        outputs = torch.zeros(max_len, batch_size, trg_vocab_size).to(self.device)
        
        #encoder_outputs is all hidden states of the input sequence, back and forwards
        #hidden is the final forward and backward hidden states, passed through a linear layer
        encoder_outputs, hidden = self.encoder(src)
                
        #first input to the decoder is the <sos> tokens
        output = trg[0,:]
        
        for t in range(1, max_len):
            output, hidden = self.decoder(output, hidden, encoder_outputs)
            outputs[t] = output
            teacher_force = random.random() < teacher_forcing_ratio
            top1 = output.max(1)[1]
            output = (trg[t] if teacher_force else top1)

        return outputs

## Training the Seq2Seq Model

We initialise our parameters, encoder, decoder and seq2seq model

In [13]:
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
ENC_HID_DIM = 512
DEC_HID_DIM = 512
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

attn = Attention(ENC_HID_DIM, DEC_HID_DIM)
enc = Encoder(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DEC_DROPOUT, attn)

model = Seq2Seq(enc, dec, device).to(device)

We use a simplified version of the weight initialization scheme used in the paper. Here, we initialize all biases to zero and draw all weights from a normal distribution with mean 0 and standard deviation 0.01, i.e. $\mathcal{N}(0, 0.01^2)$.
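
The `nn.init.normal_` call used for this takes a standard deviation (`std=0.01`), not a variance. The sketch below draws samples with the standard library's `random.gauss`, which takes the same pair of arguments (mean, standard deviation), to show the scale of the resulting weights. The sample size and seed are arbitrary choices.

```python
import random
import statistics

rng = random.Random(0)
# random.gauss(mu, sigma): the second argument is the standard deviation.
samples = [rng.gauss(0.0, 0.01) for _ in range(10000)]

sample_mean = statistics.mean(samples)
sample_std = statistics.pstdev(samples)
print(f"mean={sample_mean:.5f} std={sample_std:.5f}")
```

Almost all initial weights therefore fall within a few hundredths of zero.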

In [14]:
def init_weights(m):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)
            
model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(7855, 256)
    (rnn): GRU(256, 512, bidirectional=True)
    (fc): Linear(in_features=1024, out_features=512, bias=True)
    (dropout): Dropout(p=0.5)
  )
  (decoder): Decoder(
    (attention): Attention(
      (attn): Linear(in_features=1536, out_features=512, bias=True)
    )
    (embedding): Embedding(5893, 256)
    (rnn): GRU(1280, 512)
    (out): Linear(in_features=1792, out_features=5893, bias=True)
    (dropout): Dropout(p=0.5)
  )
)

Calculate the number of parameters. We get an increase of almost 50% in the amount of parameters from the last model. 

In [15]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 20,518,917 trainable parameters
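
The reported total can be reproduced by hand from the layer sizes in the model printout. The sketch below does that arithmetic; the helper names are hypothetical, and the GRU formula assumes PyTorch's layout of three stacked weight blocks (reset, update, candidate) with two bias vectors per direction.

```python
def gru_params(input_size, hidden_size, bidirectional=False):
    # Per direction: 3 blocks, each with input weights, hidden weights and 2 biases.
    per_direction = 3 * hidden_size * (input_size + hidden_size + 2)
    return per_direction * (2 if bidirectional else 1)

def linear_params(in_features, out_features):
    return in_features * out_features + out_features   # weights + bias

src_vocab, trg_vocab = 7855, 5893          # embedding sizes from the printout
emb, enc_hid, dec_hid = 256, 512, 512

encoder = (src_vocab * emb
           + gru_params(emb, enc_hid, bidirectional=True)
           + linear_params(enc_hid * 2, dec_hid))
attention = linear_params(enc_hid * 2 + dec_hid, dec_hid) + dec_hid   # attn layer + v
decoder = (trg_vocab * emb
           + gru_params(enc_hid * 2 + emb, dec_hid)
           + linear_params(enc_hid * 2 + dec_hid + emb, trg_vocab))

total = encoder + attention + decoder
print(f"{total:,}")   # 20,518,917
```

More than half of the parameters sit in the decoder's final `out` layer, because it maps 1,792 features onto the whole target vocabulary.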


We create an optimizer.

In [16]:
optimizer = optim.Adam(model.parameters())

We initialize the loss function.

In [17]:
PAD_IDX = TRG.vocab.stoi['<pad>']

criterion = nn.CrossEntropyLoss(ignore_index = PAD_IDX)
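
The effect of `ignore_index` can be sketched in plain Python: positions whose target is the padding index are skipped, and the loss is averaged over the remaining positions only. `cross_entropy_ignore_pad` is a hypothetical helper written for illustration, and the padding index of 1 is an assumption of this toy example.

```python
import math

def cross_entropy_ignore_pad(logits, targets, pad_idx):
    # logits: one score vector per position; targets: one class index per position.
    total, counted = 0.0, 0
    for scores, target in zip(logits, targets):
        if target == pad_idx:
            continue                      # padded positions contribute nothing
        m = max(scores)
        log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_sum_exp - scores[target]   # negative log-probability of the target
        counted += 1
    return total / counted                # mean over non-padded positions

PAD = 1
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 2.0], [5.0, 0.0, 0.0]]
targets = [0, 2, PAD]                     # the last position is padding

loss = cross_entropy_ignore_pad(logits, targets, PAD)
loss_without_pad_rows = cross_entropy_ignore_pad(logits[:2], targets[:2], PAD)
print(loss, loss_without_pad_rows)
```

Both values are identical, so however confident or wrong the model is at a padded position, it does not affect the loss.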

We then create the training loop...

In [18]:
def train(model, iterator, optimizer, criterion, clip):
    
    model.train()
    
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):
        
        src = batch.src
        trg = batch.trg
        
        optimizer.zero_grad()
        
        output = model(src, trg)
        
        #trg = [trg sent len, batch size]
        #output = [trg sent len, batch size, output dim]
        
        output = output[1:].view(-1, output.shape[-1])
        trg = trg[1:].view(-1)
        
        #trg = [(trg sent len - 1) * batch size]
        #output = [(trg sent len - 1) * batch size, output dim]
        
        loss = criterion(output, trg)
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()        
    return epoch_loss / len(iterator)

...and the evaluation loop, remembering to set the model to `eval` mode and turn off teacher forcing.

In [19]:
def evaluate(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            src = batch.src
            trg = batch.trg

            output = model(src, trg, 0) #turn off teacher forcing

            #trg = [trg sent len, batch size]
            #output = [trg sent len, batch size, output dim]

            output = output[1:].view(-1, output.shape[-1])
            trg = trg[1:].view(-1)

            #trg = [(trg sent len - 1) * batch size]
            #output = [(trg sent len - 1) * batch size, output dim]

            loss = criterion(output, trg)

            epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

Finally, define a timing function.

In [20]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

Then, we train our model, saving the parameters that give us the best validation loss.

In [21]:
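N_EPOCHS = 10
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()
    
    # use the timing function defined above
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    # keep only the parameters with the best validation loss so far
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut3-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')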

Epoch: 01 | Time: 1m 26s
	Train Loss: 4.288 | Train PPL:  72.829
	 Val. Loss: 3.759 |  Val. PPL:  42.912
Epoch: 02 | Time: 1m 27s
	Train Loss: 2.973 | Train PPL:  19.547
	 Val. Loss: 3.304 |  Val. PPL:  27.218
Epoch: 03 | Time: 1m 22s
	Train Loss: 2.424 | Train PPL:  11.296
	 Val. Loss: 3.145 |  Val. PPL:  23.227
Epoch: 04 | Time: 1m 22s
	Train Loss: 2.059 | Train PPL:   7.842
	 Val. Loss: 3.107 |  Val. PPL:  22.344
Epoch: 05 | Time: 1m 26s
	Train Loss: 1.794 | Train PPL:   6.011
	 Val. Loss: 3.235 |  Val. PPL:  25.411
Epoch: 06 | Time: 1m 23s
	Train Loss: 1.607 | Train PPL:   4.988
	 Val. Loss: 3.278 |  Val. PPL:  26.523
Epoch: 07 | Time: 1m 26s
	Train Loss: 1.461 | Train PPL:   4.310
	 Val. Loss: 3.365 |  Val. PPL:  28.934
Epoch: 08 | Time: 1m 23s
	Train Loss: 1.342 | Train PPL:   3.827
	 Val. Loss: 3.472 |  Val. PPL:  32.210
Epoch: 09 | Time: 1m 24s
	Train Loss: 1.252 | Train PPL:   3.498
	 Val. Loss: 3.607 |  Val. PPL:  36.842
Epoch: 10 | Time: 1m 23s
	Train Loss: 1.168 | Train PPL
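
The checkpointing rule can be replayed on the validation losses printed above. The sketch below uses only the nine values that are visible in the log (epochs 1 to 9) and marks where the notebook would save the model.

```python
# Validation losses for epochs 1-9, copied from the log above.
valid_losses = [3.759, 3.304, 3.145, 3.107, 3.235, 3.278, 3.365, 3.472, 3.607]

best_valid_loss = float("inf")
best_epoch = None
for epoch, valid_loss in enumerate(valid_losses, start=1):
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        best_epoch = epoch      # the point where torch.save would be called

print(best_epoch, best_valid_loss)   # 4 3.107
```

The training loss keeps falling after epoch 4 while the validation loss rises, which is a sign of overfitting and the reason the saved parameters come from an early epoch.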

Finally, we test the model on the test set using these "best" parameters.

In [22]:
model.load_state_dict(torch.load('tut3-model.pt'))

test_loss = evaluate(model, test_iterator, criterion)

print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')

| Test Loss: 3.170 | Test PPL:  23.814 |
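
The perplexity (PPL) reported next to each loss is simply the exponential of the average cross-entropy loss. A quick check with the rounded test loss from the output above:

```python
import math

test_loss = 3.170                 # rounded value taken from the output above
test_ppl = math.exp(test_loss)
print(f"{test_ppl:.3f}")
```

The result is close to the reported 23.814 but not identical, because the printed loss is rounded to three decimals before exponentiation here.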


We've improved on the previous model, but this came at the cost of doubling the training time.

In the next notebook, we'll be using the same architecture but using a few tricks that are applicable to all RNN architectures - packed padded sequences and masking. We'll also implement code which will allow us to look at what words in the input the RNN is paying attention to when decoding the output.

# Inspecting outputs
The functions below help generate human-readable output, showing both the predicted and the original target sentence.

In [23]:
def visual_decoding(output: torch.Tensor):
    visual_output = [] 
    for i in range(0,output.shape[0]):
        sentence  = []
        for each_idx in output[i]:
            word = TRG.vocab.itos[each_idx]
            if word == "<eos>":
                break
            sentence.append(word)
        visual_output.append(' '.join(sentence))        
                
    return (visual_output)

In [24]:
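# Decode one training batch greedily and print prediction vs. target.
with torch.no_grad():
    for i, batch in enumerate(train_iterator):
        src = batch.src
        trg = batch.trg
        # Teacher forcing is off (ratio 0), so the second argument only sets the
        # decode length and supplies the first <sos> input (assuming <sos> has the
        # same index in both vocabularies); no target tokens are used.
        output = model(src, src, 0)
        # predicted: greedy choice at each step. Row 0 of `output` is never filled
        # by the model, so its argmax is index 0 and decodes as the first vocab entry.
        predicted_argmax = torch.argmax(output, dim=2)
        predicted_output = visual_decoding(predicted_argmax.permute(1,0))
        
        trg_output = visual_decoding(trg.permute(1,0))
        
        for predicted, original in zip(predicted_output, trg_output):
            print(predicted, " ||| ", original)
        
        break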


<unk> a man wearing a and and tie plays a guitar on a sidewalk .  |||  <sos> a man in a vest and tie is playing the guitar on a sidewalk .
<unk> the black dog is running through the water .  |||  <sos> the black dog runs through the water .
<unk> a young newlywed couple cuts their cake .  |||  <sos> a young newlywed couple cuts their wedding cake that their reception .
<unk> child in a gray hooded sweatshirt stands at the bottom of a red slide .  |||  <sos> child in a gray hoodie standing on the bottom of a red plastic slide .
<unk> a man with glasses is sitting behind a table with memorabilia memorabilia memorabilia memorabilia .  |||  <sos> a man in glasses is sitting behind a table laden with military memorabilia .
<unk> two men are sawing a wood and trying to complete project while a project while a man is a of a cigarette .  |||  <sos> two men are sawing a board of wood trying to complete their project , while one of those two men smokes a cigarette .
<unk> a girl in pink twirls a