This assignment explores two key concepts – sub-word modeling and convolutional networks – and applies them to the NMT system we built in the previous assignment. The Assignment 4 NMT model can be thought of as four stages:

1. Embedding layer: Converts raw input text (for both the source and target sentences) to a sequence of dense word vectors via lookup.
2. Encoder: An RNN that encodes the source sentence as a sequence of encoder hidden states.
3. Decoder: An RNN that operates over the target sentence and attends to the encoder hidden states to produce a sequence of decoder hidden states.
4. Output prediction layer: A linear layer with softmax that produces a probability distribution for the next target word on each decoder timestep.

- In Section 1 of this assignment, we replace (1) with a character-based convolutional encoder.
- In Section 2, we enhance (4) by adding a character-based LSTM decoder.

# Section 1

![](../images/ex5_1.png)

## code for VocabEntry

In [3]:
from collections import Counter
from docopt import docopt
from itertools import chain
import json
import torch
from typing import List
from utils import read_corpus, pad_sents, pad_sents_char



In [75]:
class VocabEntry(object):
    """ Vocabulary Entry, i.e. structure containing either
    src or tgt language terms.
    """

    def __init__(self, word2id=None):
        """ Init VocabEntry Instance.
        @param word2id (dict): dictionary mapping words 2 indices
        """
        if word2id:
            self.word2id = word2id
        else:
            self.word2id = dict()
            self.word2id['<pad>'] = 0  # Pad Token
            self.word2id['<s>'] = 1  # Start Token
            self.word2id['</s>'] = 2  # End Token
            self.word2id['<unk>'] = 3  # Unknown Token
        self.unk_id = self.word2id['<unk>']
        self.id2word = {v: k for k, v in self.word2id.items()}

        ## Additions to the A4 code:
        self.char_list = list(
            """ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789,;.!?:'\"/\\|_@#$%^&*~`+-=<>()[]""")

        self.char2id = dict()  # Converts characters to integers
        self.char2id['∏'] = 0  # <pad> token
        self.char2id['{'] = 1  # start of word token
        self.char2id['}'] = 2  # end of word token
        self.char2id['Û'] = 3  # <unk> token
        for i, c in enumerate(self.char_list):
            self.char2id[c] = len(self.char2id)
        self.char_pad = self.char2id['∏']
        self.char_unk = self.char2id['Û']
        self.start_of_word = self.char2id["{"]
        self.end_of_word = self.char2id["}"]
        assert self.start_of_word + 1 == self.end_of_word

        self.id2char = {v: k for k, v in self.char2id.items()}  # Converts integers to characters
        ## End additions to the A4 code

    def __getitem__(self, word):
        """ Retrieve word's index. Return the index for the unk
        token if the word is out of vocabulary.
        @param word (str): word to look up.
        @returns index (int): index of word
        """
        return self.word2id.get(word, self.unk_id)

    def __contains__(self, word):
        """ Check if word is captured by VocabEntry.
        @param word (str): word to look up
        @returns contains (bool): whether word is contained
        """
        return word in self.word2id

    def __setitem__(self, key, value):
        """ Raise error, if one tries to edit the VocabEntry.
        """
        raise ValueError('vocabulary is readonly')

    def __len__(self):
        """ Compute number of words in VocabEntry.
        @returns len (int): number of words in VocabEntry
        """
        return len(self.word2id)

    def __repr__(self):
        """ Representation of VocabEntry to be used
        when printing the object.
        """
        return 'Vocabulary[size=%d]' % len(self)

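    def id2word(self, wid):
        """ Return mapping of index to word.
        NOTE: __init__ also assigns a dict to self.id2word, and that instance attribute
        shadows this method, so look words up with vocab.id2word[wid] instead.
        @param wid (int): word index
        @returns word (str): word corresponding to index
        """
        return self.id2word[wid]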

    def add(self, word):
        """ Add word to VocabEntry, if it is previously unseen.
        @param word (str): word to add to VocabEntry
        @return index (int): index that the word has been assigned
        """
        if word not in self:
            wid = self.word2id[word] = len(self)
            self.id2word[wid] = word
            return wid
        else:
            return self[word]

    def words2charindices(self, sents):
        """ Convert list of sentences of words into list of list of list of character indices.
        @param sents (list[list[str]]): sentence(s) in words
        @return word_ids (list[list[list[int]]]): sentence(s) in indices
        """
        return [[[self.char2id.get(c, self.char_unk) for c in ("{" + w + "}")] for w in s] for s in sents]

    def words2indices(self, sents):
        """ Convert list of sentences of words into list of list of indices.
        @param sents (list[list[str]]): sentence(s) in words
        @return word_ids (list[list[int]]): sentence(s) in indices
        """
        return [[self[w] for w in s] for s in sents]

    def indices2words(self, word_ids):
        """ Convert list of indices into words.
        @param word_ids (list[int]): list of word ids
        @return sents (list[str]): list of words
        """
        return [self.id2word[w_id] for w_id in word_ids]

    def to_input_tensor_char(self, sents: List[List[str]], device: torch.device) -> torch.Tensor:
        """ Convert list of sentences (words) into tensor with necessary padding for
        shorter sentences.

        @param sents (List[List[str]]): list of sentences (words)
        @param device: device on which to load the tensor, i.e. CPU or GPU

        @returns sents_var: tensor of (max_sentence_length, batch_size, max_word_length)
        """
        ### YOUR CODE HERE for part 1e
        ### TODO:
        ###     - Use `words2charindices()` from this file, which converts each character to its corresponding index in the
        ###       character-vocabulary.
        list_of_indices = self.words2charindices(sents) # list of list of list
        ###     - Use `pad_sents_char()` from utils.py, which pads all words to max_word_length of all words in the batch,
        ###       and pads all sentences to max length of all sentences in the batch. Read __init__ to see how to get
        ###       index of character-padding token
        sents_var = torch.tensor(pad_sents_char(list_of_indices,self.char_pad),dtype=torch.long, device=device).permute(1,0,2)
        sents_var = sents_var.contiguous()
        ###     - Connect these two parts to convert the resulting padded sentences to a torch tensor.
        return sents_var
        ### HINT:
        ###     - You may find .contiguous() useful after reshaping. Check the following links for more details:
        ###         https://pytorch.org/docs/stable/tensors.html#torch.Tensor.contiguous
        ###         https://pytorch.org/docs/stable/tensors.html#torch.Tensor.view

        ### END YOUR CODE

    def to_input_tensor(self, sents: List[List[str]], device: torch.device) -> torch.Tensor:
        """ Convert list of sentences (words) into tensor with necessary padding for 
        shorter sentences.

        @param sents (List[List[str]]): list of sentences (words)
        @param device: device on which to load the tensor, i.e. CPU or GPU

        @returns sents_var: tensor of (max_sentence_length, batch_size)
        """
        word_ids = self.words2indices(sents)
        sents_t = pad_sents(word_ids, self['<pad>'])
        sents_var = torch.tensor(sents_t, dtype=torch.long, device=device)
        return torch.t(sents_var)

    @staticmethod
    def from_corpus(corpus, size, freq_cutoff=2):
        """ Given a corpus construct a Vocab Entry.
        @param corpus (list[str]): corpus of text produced by read_corpus function
        @param size (int): # of words in vocabulary
        @param freq_cutoff (int): if word occurs n < freq_cutoff times, drop the word
        @returns vocab_entry (VocabEntry): VocabEntry instance produced from provided corpus
        """
        vocab_entry = VocabEntry()
        word_freq = Counter(chain(*corpus))
        valid_words = [w for w, v in word_freq.items() if v >= freq_cutoff]
        print('number of word types: {}, number of word types w/ frequency >= {}: {}'
              .format(len(word_freq), freq_cutoff, len(valid_words)))
        top_k_words = sorted(valid_words, key=lambda w: word_freq[w], reverse=True)[:size]
        for word in top_k_words:
            vocab_entry.add(word)
        return vocab_entry

In [76]:
temp = VocabEntry()
temp.words2charindices([['I','love','you'],['I','know']])

[[[1, 12, 2], [1, 41, 44, 51, 34, 2], [1, 54, 44, 50, 2]],
 [[1, 12, 2], [1, 40, 43, 44, 52, 2]]]

In [81]:
[temp.id2char[i] for i in [1, 12, 2]]

['{', 'I', '}']

In [77]:
[temp.id2char[i] for i in [1, 40, 43, 44, 52, 2]]

['{', 'k', 'n', 'o', 'w', '}']

In [78]:
temp1 = temp.to_input_tensor_char([['I','loveee','you'],['I','know']],torch.device('cuda:0'))
temp1

tensor([[[ 1, 12,  2,  0,  0,  0,  0,  0],
         [ 1, 12,  2,  0,  0,  0,  0,  0]],

        [[ 1, 41, 44, 51, 34, 34, 34,  2],
         [ 1, 40, 43, 44, 52,  2,  0,  0]],

        [[ 1, 54, 44, 50,  2,  0,  0,  0],
         [ 0,  0,  0,  0,  0,  0,  0,  0]]], device='cuda:0')

In [82]:
temp1.shape

torch.Size([3, 2, 8])
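
`pad_sents_char` lives in utils.py and is not shown in this notebook. The sketch below is only an assumption about its behavior, inferred from the padded tensor above: every word is padded to the longest word in the batch, and every sentence is padded with all-pad words to the longest sentence. The name `pad_sents_char_sketch` is hypothetical and the real helper may differ (for example, by using a fixed maximum word length).

```python
from typing import List


def pad_sents_char_sketch(sents: List[List[List[int]]], char_pad: int) -> List[List[List[int]]]:
    """Hypothetical stand-in for utils.pad_sents_char (assumed behavior, not the real code)."""
    max_word_len = max(len(w) for s in sents for w in s)
    max_sent_len = max(len(s) for s in sents)
    padded = []
    for s in sents:
        # pad each word on the right with the character pad index
        words = [w + [char_pad] * (max_word_len - len(w)) for w in s]
        # pad the sentence with words made entirely of pad characters
        words += [[char_pad] * max_word_len for _ in range(max_sent_len - len(s))]
        padded.append(words)
    return padded


# character indices for [['I','loveee','you'],['I','know']], taken from the outputs above
batch = [[[1, 12, 2], [1, 41, 44, 51, 34, 34, 34, 2], [1, 54, 44, 50, 2]],
         [[1, 12, 2], [1, 40, 43, 44, 52, 2]]]
padded = pad_sents_char_sketch(batch, char_pad=0)
print(len(padded), len(padded[0]), len(padded[0][0]))  # (batch_size, max_sentence_length, max_word_length)
```

The result is batch-major, which is why `to_input_tensor_char()` applies `.permute(1,0,2)` to obtain `(max_sentence_length, batch_size, max_word_length)`.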

In [79]:
temp1[1:]

tensor([[[ 1, 41, 44, 51, 34, 34, 34,  2],
         [ 1, 40, 43, 44, 52,  2,  0,  0]],

        [[ 1, 54, 44, 50,  2,  0,  0,  0],
         [ 0,  0,  0,  0,  0,  0,  0,  0]]], device='cuda:0')

In [8]:
temp.char2id['∏']

0

## code for highway

In [9]:
import torch.nn as nn
import torch.nn.functional as F
import torch

![](../images/ex5_2.png)

In [10]:
class Highway(nn.Module):
    def __init__(self,e_word):
        super().__init__()
        self.e_word = e_word
        self.w_proj = nn.Linear(e_word,e_word)
        self.w_gate = nn.Linear(e_word,e_word)
        # init linear weight and bias?
    def forward(self,x_conv_out):
        """
        Shape flow of the whole character-CNN embedding pipeline:

        raw input x_padded: (max_sentence_length,bs,max_word_length aka m),
        which should be the output of to_input_tensor_char()
        
        --char_emb()-->
        x_emb: (max_sentence_length,bs,max_word_length,e_char)
        where e_char is the size of the character embedding
        
        --reshape()-->
        x_reshaped: (max_sentence_length,bs,e_char,max_word_length)
        
        --cnn()-->
        x_conv: (max_sentence_length,bs,e_word,max_word_length+2*padding-k+1)
        where k is the kernel size, padding is the Conv1d padding and e_word is the desired word embedding size
        
        --relu_and_globalmaxpool()-->
        x_conv_out: (max_sentence_length,bs,e_word)
        
        --high_way()-->
        x_highway: (max_sentence_length,bs,e_word)
        
        --dropout()-->
        x_word_emb: (max_sentence_length,bs,e_word)
        
        input: x_conv_out shape (max_sentence_length,bs,e_word)
        output: x_highway shape (max_sentence_length,bs,e_word) (no dropout applied)
        """
        
        x_proj = F.relu(self.w_proj(x_conv_out))
        x_gate = torch.sigmoid(self.w_gate(x_conv_out))
        x_highway = x_gate * x_proj + (1-x_gate) * x_conv_out
        return x_highway

In [11]:
# test highway
temp_highway = Highway(2)
temp_conv_out = torch.randn(4,3,2)
temp_result = temp_highway(temp_conv_out)

In [12]:
temp_result.shape

torch.Size([4, 3, 2])
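
The shape check above does not show what the gate does. As a rough sanity check, the sketch below re-implements the same three lines as `Highway.forward` with standalone layers and forces the gate to its two extremes. The bias values of -30 and 30 are arbitrary choices for illustration: with the gate closed the input should pass through unchanged, and with the gate open the output should equal the ReLU projection.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
e_word = 4
w_proj = nn.Linear(e_word, e_word)
w_gate = nn.Linear(e_word, e_word)
x = torch.randn(2, 3, e_word)  # (max_sentence_length, bs, e_word)


def highway(x_conv_out):
    # same computation as Highway.forward above
    x_proj = F.relu(w_proj(x_conv_out))
    x_gate = torch.sigmoid(w_gate(x_conv_out))
    return x_gate * x_proj + (1 - x_gate) * x_conv_out


with torch.no_grad():
    w_gate.weight.zero_()
    w_gate.bias.fill_(-30.0)  # sigmoid(-30) is ~0: gate closed, input is carried through
    carried = highway(x)
    w_gate.bias.fill_(30.0)   # sigmoid(30) is ~1: gate open, projection is used
    transformed = highway(x)
    expected_proj = F.relu(w_proj(x))

print(torch.allclose(carried, x, atol=1e-5))
print(torch.allclose(transformed, expected_proj, atol=1e-5))
```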

## code for cnn

In [13]:
class CNN(nn.Module):
    def __init__(self,e_char,e_word,k=5,padding=1):
        super().__init__()
        self.conv1d = nn.Conv1d(e_char, e_word, kernel_size = k, padding = padding)
        self.mp1d = nn.AdaptiveMaxPool1d(1)
        self.e_word = e_word
    def forward(self,x_reshaped):
        """
        input: x_reshaped: (max_sentence_length,bs,e_char,max_word_length)
        
        output:  x_conv_out: (max_sentence_length,bs,e_word)
            - e_word is the desired word embedding size
        """
#         x_conv_out2 = []
#         for each_sen in torch.split(x_reshaped,1,dim=0):
#             each_sen = each_sen.squeeze(dim=0) # bs,e_char,max_word_length
            
#             x_conv = self.conv1d(each_sen) # (bs,e_word,max_word_length-k+1). 
#             #relu
#             result = F.relu(x_conv) # (bs,e_word,max_word_length-k+1)
#             #maxpool
#             result = self.mp1d(result).squeeze(2) # (bs,e_word,1) to (bs,e_word) after squeezing
            
#             x_conv_out2.append(result)
            
#         x_conv_out2 = torch.stack(x_conv_out2,dim=0)
        
        # merge the first two dimensions so a single conv1d call handles every word, avoiding the loop above
        sent_length,bs = x_reshaped.shape[0],x_reshaped.shape[1]
        new_view = (sent_length * bs,x_reshaped.shape[2],x_reshaped.shape[3])        
        x_reshaped2 = x_reshaped.view(new_view)
#         (max_sentence_length * bs ,e_char,max_word_length)
        
        x_conv = self.conv1d(x_reshaped2)  # (sent_length*bs,e_word,max_word_length+2*padding-k+1)
        x_conv_out = F.relu(x_conv)
        x_conv_out = self.mp1d(x_conv_out).squeeze(-1) # (sent_length*bs,e_word,1) to (sent_length*bs,e_word)
        x_conv_out = x_conv_out.view(sent_length,bs,self.e_word)
        
        return x_conv_out.contiguous()
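
The commented-out loop and the flattened version in `forward` should produce the same result, because `Conv1d` treats each item of the batch dimension independently. The sketch below checks this on random data with a standalone `nn.Conv1d`. The sizes are arbitrary, and `torch.max(..., dim=-1)[0]` stands in for `AdaptiveMaxPool1d(1)` followed by `squeeze`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
sent_len, bs, e_char, m, e_word, k = 3, 2, 5, 8, 4, 5
conv = nn.Conv1d(e_char, e_word, kernel_size=k, padding=1)
x_reshaped = torch.randn(sent_len, bs, e_char, m)

# loop version: one conv1d call per sentence position
looped = torch.stack(
    [torch.max(F.relu(conv(x_reshaped[t])), dim=-1)[0] for t in range(sent_len)],
    dim=0,
)

# flattened version: merge (sent_len, bs) into one batch dimension, convolve once
flat = x_reshaped.reshape(sent_len * bs, e_char, m)
batched = torch.max(F.relu(conv(flat)), dim=-1)[0].view(sent_len, bs, e_word)

print(batched.shape)
print(torch.allclose(looped, batched, atol=1e-5))
```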

In [14]:
# test cnn
temp_conv = nn.Conv1d(3,4,2) #in_channels,out_channels,kernel_size

In [15]:
temp_w = temp_conv.weight.data
temp_b = temp_conv.bias.data

In [16]:
temp_x = torch.randn(1, 3, 2) # bs,in_channels aka emb size,number_of_items

In [17]:
temp3 = temp_conv(temp_x)
temp3

tensor([[[ 0.2154],
         [-0.8181],
         [-0.2951],
         [ 0.4281]]], grad_fn=<SqueezeBackward1>)

In [18]:
temp3.shape # bs,out_channels,new_number_of_items

torch.Size([1, 4, 1])

In [19]:
# manual calculation
(temp_w[0] * temp_x[0]).sum() + temp_b[0]

tensor(0.2154)

In [20]:
# testing cnn + maxpool

In [21]:
temp_conv = nn.Conv1d(3,4,2)
temp_x = torch.randn(2, 3, 4)
temp3 = temp_conv(temp_x)

In [22]:
temp3,temp3.shape

(tensor([[[-0.0903,  0.2882,  0.1889],
          [ 0.0381,  0.4108,  0.0909],
          [ 0.0468, -0.0842,  0.1885],
          [-0.0687, -0.6686,  0.0187]],
 
         [[ 0.0435,  0.7521,  0.0692],
          [ 0.3959,  0.1063, -0.1477],
          [ 1.0713, -0.3758, -0.7133],
          [ 0.6434, -0.2213, -0.8657]]], grad_fn=<SqueezeBackward1>),
 torch.Size([2, 4, 3]))

In [23]:
temp_mp = nn.AdaptiveMaxPool1d(1)

In [24]:
temp4 = temp_mp(temp3)
temp4,temp4.shape

(tensor([[[0.2882],
          [0.4108],
          [0.1885],
          [0.0187]],
 
         [[0.7521],
          [0.3959],
          [1.0713],
          [0.6434]]], grad_fn=<SqueezeBackward1>),
 torch.Size([2, 4, 1]))

In [25]:
temp4.squeeze(-1).shape

torch.Size([2, 4])

In [26]:
# test everything

In [27]:
temp_cnn = CNN(3,4,2)
temp_x = torch.randn(3,1,3,4)

In [28]:
temp_final = temp_cnn(temp_x)

In [31]:
temp_final.shape

torch.Size([3, 1, 4])
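
The window counts seen in the tests above follow the usual output-length rule for a stride-1, dilation-1 `Conv1d`: `L_out = L_in + 2*padding - k + 1`. The helper below is a small illustration of that rule. The function name is made up for this note and it is not part of the assignment code.

```python
def conv1d_out_len(m: int, k: int, padding: int = 0) -> int:
    """Output width of a 1-D convolution with stride 1 and dilation 1."""
    return m + 2 * padding - k + 1


print(conv1d_out_len(2, 2))             # 1, matches torch.Size([1, 4, 1]) above
print(conv1d_out_len(4, 2))             # 3, matches torch.Size([2, 4, 3]) above
print(conv1d_out_len(8, 5, padding=1))  # 6, the CNN defaults on an 8-character word
print(conv1d_out_len(3, 5, padding=1))  # 1, the shortest real word, e.g. "{I}"
print(conv1d_out_len(2, 5, padding=1))  # 0, no valid window is left
```

With the `CNN` defaults (`k=5`, `padding=1`), a one-letter word wrapped in `{` and `}` still yields exactly one window, so the max-pool always has something to pool over as long as no word in the batch is empty.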

In [32]:
# temp_final[0].shape

# temp_final[0] == temp_final[1]

## code for ModelEmbeddings

In [33]:
import torch.nn as nn

# Do not change these imports; your module names should be
#   `CNN` in the file `cnn.py`
#   `Highway` in the file `highway.py`
# Uncomment the following two imports once you're ready to run part 1(j)

from cnn import CNN
from highway import Highway


# End "do not change"

![](../images/ex5_3.png)

In [34]:
class ModelEmbeddings(nn.Module):
    """
    Class that converts input words to their CNN-based embeddings.
    """

    def __init__(self, word_embed_size, vocab):
        """
        Init the Embedding layer for one language
        @param word_embed_size (int): Embedding size (dimensionality) for the output word
        aka e_word
        
        @param vocab (VocabEntry): VocabEntry object. See vocab.py for documentation.

        Hints: - You may find len(self.vocab.char2id) useful when creating the embedding
        """
        super(ModelEmbeddings, self).__init__()
        self.word_embed_size = word_embed_size
        self.vocab = vocab
        self.e_char = 50
        self.char_emb = nn.Embedding(len(vocab.char2id),self.e_char,padding_idx=vocab.char_pad)
        self.highway = Highway(self.word_embed_size)
        self.cnn = CNN(self.e_char,self.word_embed_size)
        self.dropout = nn.Dropout(p=0.3)
    def forward(self, x_padded):
        """
        Looks up character-based CNN embeddings for the words in a batch of sentences.
        @param x_padded: Tensor of integers of shape (sentence_length, batch_size, max_word_length) where
            each integer is an index into the character vocabulary
        @returns x_word_emb: Tensor of shape (sentence_length, batch_size, word_embed_size), containing the
            CNN-based embeddings for each word of the sentences in the batch
        """
        
        
#         raw_input x_padded: (max_sentence_length,bs,max_word_length aka m)
#             - each integer is an index into the character vocabulary
#             - this should be output of to_input_tensor_char()
        
#         --char_emb()-->
#         x_emb: (max_sentence_length,bs,max_word_length,e_char)
#             - with e_char is size of character embedding.      
        x_emb = self.char_emb(x_padded)
        
#         --reshape()-->
#         x_reshaped: (max_sentence_length,bs,e_char,max_word_length)
        x_reshaped = x_emb.permute(0,1,3,2)
    
#         --cnn()-->
#         x_conv: (max_sentence_length,bs,e_word,max_word_length+2*padding-k+1)
#             - where k is the kernel size, padding is the Conv1d padding, and e_word is the desired word embedding size
#             - the CNN module merges the first two dimensions, so no per-sentence loop is needed
#         --relu_and_globalmaxpool()-->
#         x_conv_out: (max_sentence_length,bs,e_word)
        x_conv_out = self.cnn(x_reshaped)

#         --high_way()-->
#         x_highway: (max_sentence_length,bs,e_word)
        x_highway = self.highway(x_conv_out)
#         --dropout()-->
#         x_word_emb: (max_sentence_length,bs,e_word)
        x_word_emb = self.dropout(x_highway)
        return x_word_emb
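
`ModelEmbeddings` passes `padding_idx=vocab.char_pad` to `nn.Embedding`. The sketch below illustrates what that argument does on a toy embedding table. The sizes (10 characters, 4 dimensions) are arbitrary and not the values used in the assignment: the pad index maps to an all-zero vector, and its row receives no gradient.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
char_pad = 0
char_emb = nn.Embedding(10, 4, padding_idx=char_pad)

# (sentence_length=1, batch_size=1, max_word_length=5); the last two characters are padding
x_padded = torch.tensor([[[1, 5, 2, 0, 0]]])
x_emb = char_emb(x_padded)            # (1, 1, 5, 4)
x_reshaped = x_emb.permute(0, 1, 3, 2)  # (1, 1, 4, 5), channels before width for Conv1d

x_emb.sum().backward()
print(x_emb[0, 0, 3:])                  # zero vectors at the padded positions
print(char_emb.weight.grad[char_pad])   # zero gradient for the pad row
```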


## code for new nmt_model

The model itself needs only small changes to a long codebase, so see nmt_model.py for the updated code.

That file also contains further notes on how char_decoder is used and on the tensor shapes involved.

# Section 2: CharDecoderLSTM

## code for char_decoder

In [61]:
"""
CS224N 2019-20: Homework 5
"""

import torch
import torch.nn as nn
import torch.nn.functional as F

![](../images/ex5_4.png)

In [None]:
class CharDecoder(nn.Module):
    def __init__(self, hidden_size, char_embedding_size=50, target_vocab=None):
        """ Init Character Decoder.

        @param hidden_size (int): Hidden size of the decoder LSTM
        @param char_embedding_size (int): dimensionality of character embeddings
        @param target_vocab (VocabEntry): vocabulary for the target language. See vocab.py for documentation.
        """
        super(CharDecoder, self).__init__()
        self.target_vocab = target_vocab
        self.charDecoder = nn.LSTM(char_embedding_size, hidden_size)
        self.char_output_projection = nn.Linear(hidden_size, len(self.target_vocab.char2id))
        self.decoderCharEmb = nn.Embedding(len(self.target_vocab.char2id), char_embedding_size,
                                           padding_idx=self.target_vocab.char_pad)

    def forward(self, input, dec_hidden=None):
        """ Forward pass of character decoder.

        @param input (Tensor): tensor of integers, shape (length, batch_size)
        @param dec_hidden (tuple(Tensor, Tensor)): internal state of the LSTM before reading the input characters. 
        A tuple of two tensors of shape (1, batch, hidden_size)

        @returns scores (Tensor): called s_t in the PDF, 
        shape (length, batch_size, vocab_size), where vocab_size = len(self.target_vocab.char2id)
        
        @returns dec_hidden (tuple(Tensor, Tensor)): internal state of the LSTM after reading the input characters. 
        A tuple of two tensors of shape (1, batch, hidden_size)
        """
        ### YOUR CODE HERE for part 2a
        inp_emb = self.decoderCharEmb(input) #(length,bs,char_emb_size)
        outp,new_dec_hidden = self.charDecoder(inp_emb,dec_hidden)
        # outp shape: (length,bs,hidden_size)
        scores = self.char_output_projection(outp) #(length,bs,vocab_size)
        return scores,new_dec_hidden
    
        ### END YOUR CODE

    def train_forward(self, char_sequence, dec_hidden=None):
        """ Forward computation during training.

        @param char_sequence (Tensor): tensor of integers, shape (length, batch_size).
        "length" here is max_word_length, i.e. the number of characters in the longest word in the batch
        "batch_size" here is max_sent_length * bs
        Note that "length" here and in forward() need not be the same.
        @param dec_hidden (tuple(Tensor, Tensor)): initial internal state of the LSTM, obtained from the output of the word-level decoder. A tuple of two tensors of shape (1, batch_size, hidden_size)

        @returns The cross-entropy loss (Tensor), computed as the *sum* of cross-entropy losses of all the words in the batch.
        """
        ### YOUR CODE HERE for part 2b
        ### TODO - Implement training forward pass.
        ###
        
        X = char_sequence[:-1] # inputs: x_1 ... x_n (everything except the last character)
        s,dec_hidden = self.forward(X,dec_hidden)
        
        vocab_size = len(self.target_vocab.char2id)
        target = char_sequence[1:].contiguous().view(-1) # targets: x_2 ... x_{n+1}, flattened for the loss
        scores = s.contiguous().view(-1,vocab_size)
        
        loss   = nn.CrossEntropyLoss(
            reduction= "sum", # When compute loss_char_dec, we take the sum, not average
            ignore_index=self.target_vocab.char_pad # not take into account pad character when compute loss
        )
        return loss(scores, target)
        ### Hint: - Make sure padding characters do not contribute to the cross-entropy loss. 
        ### Check vocab.py to find the padding token's index.
        ###       - char_sequence corresponds to the sequence x_1 ... x_{n+1} 
        ### (e.g., <START>,m,u,s,i,c,<END>). 
        ### Read the handout about how to construct input and target sequence of CharDecoderLSTM.
        ###       - Carefully read the documentation for nn.CrossEntropyLoss 
        ### and our handout to see what this criterion has already included:
        ###             https://pytorch.org/docs/stable/nn.html#crossentropyloss

        ### END YOUR CODE

    def decode_greedy(self, initialStates, device, max_length=21):
        """ Greedy decoding
        Called only at inference time, and only when the word-level model produces an <UNK>.
        For each <UNK> position in the translation, the word-based decoder's combined
        output vector initializes the CharDecoderLSTM's h0 and c0, and the CharDecoderLSTM then
        generates a sequence of characters.

        @param initialStates (tuple(Tensor, Tensor)): initial internal state of the LSTM, 
        a tuple of two tensors of size (1, batch_size, hidden_size)
        NOTE: this is the combined output vector (an activation) of the word-level decoder,
            selected at the positions where the word-level model produced <UNK>.
        We start from a (1,bs) tensor filled with the start_of_word index and feed the
        CharDecoderLSTM's own prediction back in, one character at a time, for max_length steps,
        producing a list of bs words to replace these UNKs.
        
        @param device: torch.device (indicates whether the model is on CPU or GPU)
        @param max_length (int): maximum length of words to decode

        @returns decodedWords (List[str]): a list (of length batch_size) of strings, 
                            each of which has length <= max_length.
                              The decoded strings should NOT contain the start-of-word and end-of-word characters.
        """
        bs = initialStates[0].shape[1]
        dec_hidden = initialStates
        inp = torch.LongTensor([[self.target_vocab.start_of_word]*bs]).to(device) # shape (1,bs). 
        
        # inp always has sequence length 1 (here and inside the loop below) because dec_hidden
        # carries the LSTM state forward from step to step.
        # The alternative is to feed (1,bs), then (2,bs), ... (max_length-1,bs) from scratch
        # without passing dec_hidden, which would waste computation.
        
        results=torch.LongTensor([[0]*bs]).to(device) # (1,bs)
        for s in range(max_length):
            scores,dec_hidden = self.forward(inp,dec_hidden) #(1,bs,vocab_size)
            inp = F.softmax(scores,dim=2).argmax(dim=2) # (1,bs); softmax is monotonic, so scores.argmax(dim=2) gives the same indices
            # at s=0, inp is identical for every item (a (1,bs) tensor filled with the start_of_word index),
            # but the predictions still differ across the batch because each item has its own dec_hidden
            # dec_hidden is a tuple of two tensors, each of shape (1,bs,hidden_size)
            results = torch.cat([results,inp.detach()])

        decodedWords = []
        results = torch.transpose(results[1:],0,1).detach().cpu().numpy() # (bs,max_length)
        for row in results:
            row_str=[]
            for idx in row:
                if idx == self.target_vocab.end_of_word: break
                row_str.append(self.target_vocab.id2char[idx])
            decodedWords.append(''.join(row_str))


        return decodedWords

        ### YOUR CODE HERE for part 2c
        ### TODO - Implement greedy decoding.
        ### Hints:
        ###      - Use initialStates to get batch_size.
        ###      - Use target_vocab.char2id and target_vocab.id2char 
        ### to convert between integers and characters
        ###      - Use torch.tensor(..., device=device) to turn a list of character indices into a tensor.
        ###      - You may find torch.argmax useful
        ###      - We use curly brackets as start-of-word and end-of-word characters. 
        ### That is, use the character '{' for <START> and '}' for <END>.

        ### END YOUR CODE
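
To make the input/target split and the loss settings in `train_forward` concrete, the sketch below runs the same slicing and the same `nn.CrossEntropyLoss` configuration on a toy batch. Everything here is an assumption for illustration: the character ids are made up (1 = start of word, 2 = end of word, 0 = pad), and random scores stand in for the output of `CharDecoder.forward`. It also checks the loss against a manual sum of negative log-probabilities over the non-pad targets.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
pad, vocab_size = 0, 6

# (length, batch_size): two words; the second is one character shorter, so it ends in a pad
char_sequence = torch.tensor([[1, 1],
                              [4, 5],
                              [5, 2],
                              [2, 0]])
inputs = char_sequence[:-1]   # x_1 ... x_n, fed to the LSTM
targets = char_sequence[1:]   # x_2 ... x_{n+1}, what the LSTM should predict

# random stand-in for the scores returned by CharDecoder.forward(inputs, dec_hidden)
scores = torch.randn(inputs.shape[0], inputs.shape[1], vocab_size)

loss_fn = nn.CrossEntropyLoss(reduction="sum", ignore_index=pad)
loss = loss_fn(scores.reshape(-1, vocab_size), targets.reshape(-1))

# manual check: sum of -log p(target) over the positions whose target is not the pad index
log_probs = torch.log_softmax(scores, dim=-1)
picked = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
mask = (targets != pad).float()
manual = -(picked * mask).sum()

print(loss.item(), manual.item())
```

Only five of the six target positions contribute, because the final target of the second word is the pad index.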

Final BLEU score after running on full dataset

Train:
- epoch 29, iter 196000, cum. loss 82.27, cum. ppl 54.79 cum. examples 64000
- begin validation ...
- validation: iter 196000, dev. ppl 66.432232
- hit patience 1
-...
- epoch 29, iter 196330, avg. loss 86.20, avg. ppl 60.32 cum. examples 10537, speed 5536.83 words/sec, time elapsed 23065.94 sec
- reached maximum number of epochs!


Test:
- Corpus BLEU: 36.56285318055767 which is > 36 ==> full 6 points


In [60]:
# minor testing
#(length,bs,vocab_size)
temp = torch.randn(2,4,3)
temp

tensor([[[ 1.3297, -1.0496, -0.6587],
         [-0.1881, -1.0775,  0.0590],
         [ 1.9649,  0.3657,  0.3198],
         [-0.4647, -0.6989,  0.4196]],

        [[-0.6974,  1.5548, -0.4080],
         [ 0.2737, -0.2731,  1.8505],
         [ 0.7109, -0.6390, -0.6627],
         [ 0.4823,  1.2234, -1.4263]]])

In [64]:
temp2=F.softmax(temp,dim=2).argmax(dim=2)
temp2,temp2.shape

(tensor([[0, 2, 0, 2],
         [1, 2, 0, 1]]),
 torch.Size([2, 4]))
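
Since softmax preserves the ordering of the scores along the dimension it normalizes, the `F.softmax` call in `decode_greedy` is not needed to pick the most likely character. The sketch below reuses the values printed in the test above and confirms that taking `argmax` directly on the raw scores gives the same indices.

```python
import torch
import torch.nn.functional as F

# (length, bs, vocab_size), values copied from the test cell above
scores = torch.tensor([[[ 1.3297, -1.0496, -0.6587],
                        [-0.1881, -1.0775,  0.0590],
                        [ 1.9649,  0.3657,  0.3198],
                        [-0.4647, -0.6989,  0.4196]],

                       [[-0.6974,  1.5548, -0.4080],
                        [ 0.2737, -0.2731,  1.8505],
                        [ 0.7109, -0.6390, -0.6627],
                        [ 0.4823,  1.2234, -1.4263]]])

with_softmax = F.softmax(scores, dim=2).argmax(dim=2)
without_softmax = scores.argmax(dim=2)

print(without_softmax)
print(torch.equal(with_softmax, without_softmax))
```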