In [1]:
from collections import Counter
from gensim.models import Word2Vec
from random import random
from nltk import word_tokenize
from nltk.translate.bleu_score import sentence_bleu
from torch import nn
from torch.autograd import Variable

import numpy as np
import torch
import torch.nn.functional as F

use_cuda = torch.cuda.is_available()
print(use_cuda)



True


# Data Acquisition

For this assignment, you must download the data and extract it into `data/`. The dataset contains two files, both containing a single caption on each line. We should have 415,795 sentences in the training captions and 500 sentences in the validation captions.

To download the data, run the following directly on your server: `wget https://s3-us-west-2.amazonaws.com/cpsc532l-data/a3_data.zip`

In [2]:
# Load the data into memory, skipping blank lines and closing the files afterwards.
with open("data/mscoco_train_captions.txt") as f:
    train_sentences = [line.strip() for line in f if line.strip() != '']
with open("data/mscoco_val_captions.txt") as f:
    val_sentences = [line.strip() for line in f if line.strip() != '']

# Make sure every caption ends with a period.
train_sentences = [s if s.endswith('.') else s + '.' for s in train_sentences]
val_sentences = [s if s.endswith('.') else s + '.' for s in val_sentences]
        
print(len(train_sentences))
print(len(val_sentences))
print(train_sentences[0])

414143
500
A very clean and well decorated empty bathroom.


# Preprocessing

The code provided below creates word embeddings for you to use. After creating the vocabulary, we construct both one-hot embeddings and word2vec embeddings. 

All of the packages used should already be installed on your Azure servers; however, you will have to download an NLTK corpus (`punkt`). To do this, follow the instructions below:

1. SSH to your Azure server
2. Open a Python interpreter
3. `import nltk`
4. `nltk.download()`

    You should now see something that looks like:

    ```
    >>> nltk.download()
    NLTK Downloader
    ---------------------------------------------------------------------------
        d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
    ---------------------------------------------------------------------------
    Downloader> 

    ```

5. `d punkt`
6. Provided the download finished successfully, you may now exit out of the Python interpreter and close the SSH connection.

Please look through the functions provided below **carefully**, as you will need to use all of them at some point in your assignment.
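
As a toy illustration of the vocabulary and one-hot steps that the next cell performs, the sketch below builds a frequency-ranked vocabulary with `<UNK>` at index 0, numberizes a sentence, and looks up one-hot rows. The three-sentence corpus, the vocabulary size of 6, the whitespace `split()` tokenizer, and the nested-list identity matrix are stand-ins chosen so the snippet runs without NLTK or NumPy; only the indexing logic mirrors the real code.

```python
from collections import Counter

# Hypothetical toy corpus; the notebook uses the MSCOCO captions instead.
corpus = ["a dog runs .", "a cat sleeps .", "a dog sleeps ."]
tokenized = [["<SOS>"] + s.split() + ["<EOS>"] for s in corpus]

vocabulary_size = 6  # assumed small value for the example
counts = Counter(word for sentence in tokenized for word in sentence)
# Index 0 is reserved for <UNK>; the rest are the most frequent words.
vocabulary = ["<UNK>"] + [w for w, _ in counts.most_common(vocabulary_size - 1)]
word2index = {word: index for index, word in enumerate(vocabulary)}

# Identity matrix as nested lists: row i is the one-hot vector for word i.
one_hot = [[1.0 if r == c else 0.0 for c in range(vocabulary_size)]
           for r in range(vocabulary_size)]

def numberize(sentence):
    """Map a sentence to vocabulary indices; unseen words fall back to 0 (<UNK>)."""
    tokens = ["<SOS>"] + sentence.lower().split() + ["<EOS>"]
    return [word2index.get(token, 0) for token in tokens]

indices = numberize("A bird sleeps .")
vectors = [one_hot[i] for i in indices]
print(indices)
```

Words outside the vocabulary (here "bird") share the single `<UNK>` row, which is why the real vocabulary size of 1000 bounds the one-hot dimension regardless of how many distinct words the captions contain.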

In [3]:
sentences = train_sentences

# Lower-case the sentences, tokenize them, and add <SOS> and <EOS> tokens
sentences = [["<SOS>"] + word_tokenize(sentence.lower()) + ["<EOS>"] for sentence in sentences]

# Create the vocabulary. Note that we add an <UNK> token to represent words not in our vocabulary.
vocabularySize = 1000
word_counts = Counter([word for sentence in sentences for word in sentence])
vocabulary = ["<UNK>"] + [e[0] for e in word_counts.most_common(vocabularySize-1)]
word2index = {word:index for index,word in enumerate(vocabulary)}
one_hot_embeddings = np.eye(vocabularySize)

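# Build the word2vec embeddings. Row 0 is a zero vector reserved for <UNK>.
# Note: `size=` and `wv.syn0` are the pre-4.0 gensim names; gensim 4+ uses `vector_size=` and `wv.vectors`.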
wordEncodingSize = 300
filtered_sentences = [[word for word in sentence if word in word2index] for sentence in sentences]
w2v = Word2Vec(filtered_sentences, min_count=0, size=wordEncodingSize)
w2v_embeddings = np.concatenate((np.zeros((1, wordEncodingSize)), w2v.wv.syn0))

# Define the max sequence length to be the longest sentence in the training data. 
maxSequenceLength = max([len(sentence) for sentence in sentences])

def preprocess_numberize(sentence):
    """
    Given a sentence, in the form of a string, this function will preprocess it
    into a list of numbers (each an index into the vocabulary).
    """
    tokenized = word_tokenize(sentence.lower())
        
    # Add the <SOS>/<EOS> tokens and numberize (all unknown words are represented as <UNK>).
    tokenized = ["<SOS>"] + tokenized + ["<EOS>"]
    numberized = [word2index.get(word, 0) for word in tokenized]
    
    return numberized

def preprocess_one_hot(sentence):
    """
    Given a sentence, in the form of a string, this function will preprocess it
    into a numpy array of one-hot vectors.
    """
    numberized = preprocess_numberize(sentence)
    
    # Represent each word as its one-hot embedding
    one_hot_embedded = one_hot_embeddings[numberized]
    
    return one_hot_embedded

def preprocess_word2vec(sentence):
    """
    Given a sentence, in the form of a string, this function will preprocess it
    into a numpy array of word2vec embeddings.
    """
    numberized = preprocess_numberize(sentence)
    
    # Represent each word as its word2vec embedding
    w2v_embedded = w2v_embeddings[numberized]
    
    return w2v_embedded

def compute_bleu(reference_sentence, predicted_sentence):
    """
    Given a reference sentence and a predicted sentence, compute the BLEU similarity between them.
    """
    reference_tokenized = word_tokenize(reference_sentence.lower())
    predicted_tokenized = word_tokenize(predicted_sentence.lower())
    return sentence_bleu([reference_tokenized], predicted_tokenized)


# 1. Building a Language Decoder

We now implement a language decoder. For now, the decoder takes a single training sample at a time (as opposed to batching). We also avoid defining the embeddings as part of the model and instead pass in already-embedded inputs. Learning the embeddings inside the model is sometimes useful, since it tunes them for the task, but we skip it here for the sake of simplicity and speed.

Remember to use LSTM hidden units!

In [4]:
print(maxSequenceLength)
print(w2v_embeddings.shape)
print(w2v_embeddings)

print(vocabulary[0:5])
print(train_sentences[2])
print(preprocess_one_hot(train_sentences[0]))
print(preprocess_one_hot(train_sentences[0]).shape)

59
(1000, 300)
[[ 0.          0.          0.         ...  0.          0.
   0.        ]
 [ 0.64252836 -0.04464054 -0.02437208 ...  0.45854273 -0.40346768
  -0.58654529]
 [ 0.29306009  0.19502763  0.38849041 ...  0.50763428 -0.24476127
  -0.47664955]
 ...
 [-0.57009608  0.53415418 -0.37802663 ... -0.00389384 -0.61393547
   0.01855109]
 [ 0.53694671 -0.15880792  0.51926643 ... -0.20401719 -0.10407556
  -0.26090741]
 [ 0.403263   -1.45369279  0.12051474 ... -0.27986085  0.36217818
  -0.32311931]]
['<UNK>', 'a', '.', '<SOS>', '<EOS>']
A blue and white bathroom with butterfly themed wall tiles.
[[0. 0. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 1. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
(11, 1000)


In [11]:
input_sentence = preprocess_one_hot(train_sentences[0])
input_sentence = torch.from_numpy(input_sentence[0])
input_sentence = Variable(input_sentence.float())
input_sentence = input_sentence.cuda()
input_sentence = input_sentence.view(1, 1, 1000)

lstm = nn.LSTM(1000, 300).cuda()
output, hidden = lstm(input_sentence)

linear = nn.Linear(300, 1000).double().cuda()
output = linear(output.double().cuda())
print(output)

loss = nn.CrossEntropyLoss()
input = Variable(torch.randn(3, 5), requires_grad=True)
target = Variable(torch.LongTensor(3).random_(5))
print(input)
print(target)
output = loss(input, target)
torch.cuda.current_device()


Variable containing:
( 0  ,.,.) = 
1.00000e-02 *
  1.6582 -5.8084 -5.8018  ...   1.0613  5.2599  2.9784
[torch.cuda.DoubleTensor of size 1x1x1000 (GPU 0)]

Variable containing:
 0.2716  1.6143 -0.2492 -0.4618 -0.7349
-0.2343 -2.1428  0.6304 -1.6372  1.3402
-0.6010  1.5082 -1.9139  0.3596 -0.7784
[torch.FloatTensor of size 3x5]

Variable containing:
 2
 1
 3
[torch.LongTensor of size 3]



0

In [4]:
class DecoderLSTM(nn.Module):
    # Your code goes here
    def __init__(self, input_size, hidden_size, output_size):
        super(DecoderLSTM, self).__init__()
        self.hidden_size = hidden_size

        self.lstm = nn.LSTM(input_size, hidden_size).double().cuda()
        self.linear = nn.Linear(hidden_size, output_size).double().cuda()
        self.softmax = nn.LogSoftmax(dim=2).double().cuda()

    def forward(self, input, hidden):
        output, hidden = self.lstm(input, hidden)
        output = self.linear(output)
        output = self.softmax(output)
        return output, hidden

    def initHidden(self):
        result = Variable(torch.zeros(1, 1, self.hidden_size))
        return result.double().cuda()


# 2. Training a Language Decoder

We must now train the language decoder we implemented above. An important thing to pay attention to is the [inputs for an LSTM](http://pytorch.org/docs/master/nn.html#torch.nn.LSTM).
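
One detail worth checking before writing the training loop: the decoder above ends in a `LogSoftmax` layer, so the matching criterion is the negative log-likelihood, whereas `CrossEntropyLoss` applies the log-softmax itself. The pure-Python sketch below shows that the two routes give the same number. The logits and target index are made-up values, and the helper names are hypothetical, not part of PyTorch.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]

def nll(log_probs, target):
    """Negative log-likelihood of the target class."""
    return -log_probs[target]

def cross_entropy(logits, target):
    """Cross-entropy computed directly from raw scores."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]

logits = [2.0, -1.0, 0.5, 0.1]  # hypothetical scores for a 4-word vocabulary
target = 2                      # hypothetical index of the correct next word

two_step = nll(log_softmax(logits), target)  # LogSoftmax followed by NLL
one_step = cross_entropy(logits, target)     # cross-entropy on raw scores
print(two_step, one_step)
```

In PyTorch terms, this is the `LogSoftmax` plus `NLLLoss` pairing, with the loss expecting an input of shape `(N, C)` and a target of shape `(N)`.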

In [5]:
import time
import math


def asMinutes(s):
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)


def timeSince(since, percent):
    now = time.time()
    s = now - since
    es = s / (percent)
    rs = es - s
    return '%s (- %s)' % (asMinutes(s), asMinutes(rs))

In [None]:
def train(target_variable, 
          decoder, 
          decoder_optimizer, 
          criterion, 
          embeddings=one_hot_embeddings): 
    """
    Given a single training sample, go through a single step of training.
    """
    
    decoder_optimizer.zero_grad()

    # target_variable has (batch_size, n_words, n_vocab)
    target_length = target_variable.size()[1]

    loss = 0

    # The first step starts from zero-initialised hidden and cell states.
    decoder_input = target_variable[0][1] # Index 0 is <SOS>, so index 1 is the first real word
    prev_hidden = (decoder.initHidden(), decoder.initHidden())
    predicted_word_index = 0

    for index_word in range(2, target_length):
        decoder_input = decoder_input.view(1, 1, vocabularySize)
        decoder_output, prev_hidden = decoder(decoder_input, prev_hidden)
        
        topv, topi = decoder_output.data.topk(1)
        predicted_word_index = int(topi[0][0][0])
        # print('sum:', decoder_output.sum().data[0])
        # print(index_word, predicted_word_index, topv[0][0][0])
        # The next input: without teacher forcing, this is the predicted word
        decoder_input = torch.from_numpy(embeddings[predicted_word_index])
        decoder_input = Variable(decoder_input).cuda()
        
        # Convert the one-hot target to a class index to match the loss API:
        # NLLLoss/CrossEntropyLoss take an input of shape (N, C) and a target of shape (N).
        _, actual_word_index = target_variable[0][index_word].data.topk(1)
        actual_word_index = Variable(actual_word_index)

        # Compare current output to next "target" input
        loss += criterion(decoder_output.view(1, decoder_output.size(2)), actual_word_index)
        
        # Stop on EOS
        # if predicted_word_index == word2index['<EOS>']:
        #   break
            
    
    # Last word in sentence is fed x=0
    # zeros = Variable(torch.zeros(1, 1, vocabularySize).double()).cuda()
    # decoder_output, _ = decoder(zeros, prev_hidden)
    # loss += criterion(decoder_output, zeros) # What should this be?
    
    loss.backward()
    decoder_optimizer.step()

    # index_word holds the position of the last word processed, with or without
    # an early break on <EOS>, and is used to normalise the loss
    # (it is one more than the number of loss terms accumulated).
    return loss.data[0] / index_word
    

# Train the model and monitor the loss. Remember to use the Adam optimizer and CrossEntropyLoss.
decoder = DecoderLSTM(vocabularySize, 300, vocabularySize)
decoder_optimizer = torch.optim.Adam(decoder.parameters(), lr=0.0001)
criterion = nn.NLLLoss()  # The decoder ends in LogSoftmax, so pair it with NLLLoss (together equivalent to CrossEntropyLoss)

n_iters = len(train_sentences)
print_every = 1000
print_loss_total = 0
start = time.time()
for s_index in range(1, n_iters):
    input_sentence = preprocess_one_hot(train_sentences[s_index])
    n_words = input_sentence.shape[0]
    input_sentence = torch.from_numpy(input_sentence)
    input_sentence = input_sentence.view(1, n_words, vocabularySize)
    input_sentence = Variable(input_sentence).cuda()
    loss = train(input_sentence, decoder, decoder_optimizer, criterion)
    
    print_loss_total += loss
    
    if s_index % print_every == 0:
        print_loss_avg = print_loss_total / print_every
        print_loss_total = 0
        print('%s (%d %d%%) %.4f' % (timeSince(start, s_index / n_iters),
                                     s_index, s_index / n_iters * 100, print_loss_avg))


0m 37s (- 261m 20s) (1000 0%) 4.5910
1m 16s (- 262m 17s) (2000 0%) 4.1580
1m 55s (- 264m 13s) (3000 0%) 4.0781
2m 35s (- 264m 54s) (4000 0%) 4.0283
3m 13s (- 263m 40s) (5000 1%) 3.9425
3m 51s (- 262m 50s) (6000 1%) 3.9330
4m 31s (- 263m 1s) (7000 1%) 3.9186
5m 10s (- 262m 41s) (8000 1%) 3.8757
5m 49s (- 262m 6s) (9000 2%) 3.8457
6m 28s (- 261m 39s) (10000 2%) 3.8771
7m 7s (- 261m 18s) (11000 2%) 3.8434
7m 47s (- 261m 13s) (12000 2%) 3.8914
8m 27s (- 260m 45s) (13000 3%) 3.8548
9m 5s (- 260m 4s) (14000 3%) 3.8475
9m 44s (- 259m 10s) (15000 3%) 3.8205
10m 23s (- 258m 36s) (16000 3%) 3.8072
11m 2s (- 257m 53s) (17000 4%) 3.8749
11m 40s (- 257m 5s) (18000 4%) 3.8085
12m 20s (- 256m 43s) (19000 4%) 3.8706
13m 0s (- 256m 15s) (20000 4%) 3.8384
13m 39s (- 255m 38s) (21000 5%) 3.8192
14m 18s (- 254m 58s) (22000 5%) 3.7802
14m 56s (- 254m 5s) (23000 5%) 3.7696
15m 35s (- 253m 30s) (24000 5%) 3.8240
16m 14s (- 252m 47s) (25000 6%) 3.8611
16m 52s (- 251m 57s) (26000 6%) 3.8123
17m 30s (- 251m 9s)

In [76]:
"""
Models
    1. './model/decoder_noEOS_23000_3_48'
"""
PATH = './model/decoder_noEOS_23000_3_48'  # path taken from the model list in the docstring above
torch.save(decoder.state_dict(), PATH)

# 3. Building Language Decoder MAP Inference

We now define a method to perform inference with our decoder and test it with a few different starting words. This code will be fairly similar to your training function from part 2.
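
A minimal sketch of the greedy (MAP) decoding loop, with the trained LSTM replaced by a hypothetical lookup table of next-word probabilities so that it runs on its own. The names `toy_step`, `TOY_PROBS`, `greedy_inference`, and the five-word vocabulary are illustrative assumptions; in the assignment, the step would embed the previous word, call the decoder, and carry its hidden state forward.

```python
import math

vocab = ["<UNK>", "<SOS>", "<EOS>", "the", "dog"]
word2idx = {w: i for i, w in enumerate(vocab)}

# Hypothetical next-word distributions keyed by the previous word.
TOY_PROBS = {
    "the": [0.05, 0.0, 0.05, 0.1, 0.8],
    "dog": [0.1, 0.0, 0.7, 0.1, 0.1],
}
UNIFORM = [0.2] * len(vocab)

def toy_step(prev_index, hidden):
    """Stand-in for decoder(input, hidden): returns (log-probs, new hidden)."""
    probs = TOY_PROBS.get(vocab[prev_index], UNIFORM)
    log_probs = [math.log(p) if p > 0 else float("-inf") for p in probs]
    return log_probs, hidden

def greedy_inference(step, init_word, max_length=10):
    """Repeatedly feed back the most probable word until <EOS> or max_length."""
    index = word2idx.get(init_word, 0)
    words, hidden = [init_word], None
    for _ in range(max_length):
        log_probs, hidden = step(index, hidden)
        # MAP choice: the index with the highest log-probability.
        index = max(range(len(log_probs)), key=log_probs.__getitem__)
        if vocab[index] == "<EOS>":
            break
        words.append(vocab[index])
    return " ".join(words)

print(greedy_inference(toy_step, "the"))  # prints "the dog"
```

Because the choice at every step is deterministic, the same starting word always yields the same sentence, which is the behaviour the sampling variant in part 4 changes.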

In [None]:
def inference(decoder, init_word, embeddings=one_hot_embeddings, max_length=maxSequenceLength):
    # Your code goes here

print(inference(decoder, init_word="the"))
print(inference(decoder, init_word="man"))
print(inference(decoder, init_word="woman"))
print(inference(decoder, init_word="dog"))

# 4. Building Language Decoder Sampling Inference

We must now modify the method defined in part 3 to sample from the distribution output by the LSTM rather than taking the most probable word.

It might be useful to take a look at the output of your model and (depending on your implementation) modify it so that the outputs sum to 1. 
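
The hint above can be sketched as follows: if the model emits log-probabilities (as a `LogSoftmax` layer does), exponentiating them recovers a distribution that sums to 1, which can then be sampled instead of taking the argmax. The log-probability vector here is a made-up example, the helper names are hypothetical, and `random.choices` stands in for whatever sampler you prefer (for tensors, `torch.multinomial` is one option).

```python
import math
import random

def to_probabilities(log_probs):
    """Exponentiate log-probabilities and renormalise to guard against rounding."""
    probs = [math.exp(lp) for lp in log_probs]
    total = sum(probs)
    return [p / total for p in probs]

def sample_index(log_probs, rng=random):
    """Draw one vocabulary index in proportion to its probability."""
    probs = to_probabilities(log_probs)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical decoder output over a 4-word vocabulary.
log_probs = [math.log(p) for p in (0.1, 0.6, 0.25, 0.05)]

rng = random.Random(0)  # seeded only to make the example repeatable
draws = [sample_index(log_probs, rng) for _ in range(1000)]
print({i: draws.count(i) for i in range(4)})
```

Over many draws the counts track the underlying probabilities, so likely words still dominate while less likely ones appear occasionally.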

In [None]:
def sampling_inference(decoder, init_word, embeddings=one_hot_embeddings, max_length=maxSequenceLength):
    # Your code goes here

# Print the results of sampling_inference, drawing 5 samples per initial word
# (i.e. run the code below 5 times)
print(sampling_inference(decoder, init_word="the"))
print(sampling_inference(decoder, init_word="man"))
print(sampling_inference(decoder, init_word="woman"))
print(sampling_inference(decoder, init_word="dog"))

# 5. Building Language Encoder

We now build a language encoder, which will encode an input word by word and ultimately output a hidden state that can then be used by our decoder.

In [91]:
class EncoderLSTM(nn.Module):
    # Your code goes here

# Initialize the encoder with a hidden size of 300. 

# 6. Connecting Encoder to Decoder and Training End-to-End

We now connect our newly created encoder with our decoder, to train an end-to-end seq2seq architecture. 

It's likely that you'll be able to re-use most of your code from part 2. For our purposes, the only interaction between the encoder and the decoder is that the *last hidden state of the encoder is used as the initial hidden state of the decoder*. 

In [None]:
# Your code goes here

# 7. Testing 

We must now define a method that allows us to do inference using the seq2seq architecture. We then run the 500 validation captions through this method, and ultimately compare the **reference** and **generated** sentences using our **BLEU** similarity score method defined above, to identify the average BLEU score.
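
To make the metric less of a black box, the sketch below implements the clipped ("modified") n-gram precision that BLEU is built from, plus the averaging over sentence pairs. This is only one component of BLEU, not a replacement for `sentence_bleu`: it omits the brevity penalty and the geometric mean over n-gram orders. The caption pairs and helper names are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, candidate, n):
    """Clipped n-gram precision of a candidate against a single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total

def average(scores):
    """Mean of a list of per-sentence scores (0.0 for an empty list)."""
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical (reference, generated) caption pairs, already tokenized.
pairs = [
    ("a dog runs on the grass .".split(), "a dog runs on grass .".split()),
    ("a man rides a bike .".split(), "the the the".split()),
]
unigram_scores = [modified_precision(ref, hyp, 1) for ref, hyp in pairs]
print(unigram_scores, average(unigram_scores))
```

Since the default `sentence_bleu` combines the 1-gram to 4-gram precisions geometrically, a zero at any order pulls the sentence score toward 0, which is worth keeping in mind when interpreting the average over short captions.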

In [116]:
def seq2seq_inference(sentence, embeddings=one_hot_embeddings, max_length=maxSequenceLength):
    # Your code goes here

In [None]:
# Perform inference for all validation sequences and report the average BLEU score
    # Your code goes here

# 8. Encoding as Generic Feature Representation

We now use the final hidden state of our encoder to identify the nearest neighbor among the training sentences for each sentence in our validation data.

It would be effective to first define a method that would generate all of the hidden states and store these hidden states **on the CPU**, and then loop over the generated hidden states to identify/output the nearest neighbors.
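
A minimal sketch of the lookup step, assuming the hidden states have already been computed and copied to the CPU as plain lists of floats. The text does not specify a distance, so Euclidean distance is an assumption here (cosine similarity is another common choice), and the vectors, captions, and helper names are made up for illustration.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_neighbor(query, candidates):
    """Return (index, distance) of the candidate vector closest to the query."""
    distances = [euclidean(query, c) for c in candidates]
    best = min(range(len(distances)), key=distances.__getitem__)
    return best, distances[best]

# Hypothetical 3-dimensional hidden states; real ones would have 300 dimensions.
train_captions = ["a dog runs .", "a cat sleeps .", "a man rides a bike ."]
train_hidden = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.2], [-0.5, 0.3, 0.8]]
val_hidden = [0.1, 0.9, 0.1]

index, distance = nearest_neighbor(val_hidden, train_hidden)
print(train_captions[index], round(distance, 3))
```

With several hundred thousand training sentences, a vectorised version over a single stacked matrix would be far faster than this per-vector loop; the loop is kept only to make the logic explicit.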

In [130]:
def final_encoder_hidden(sentence):
    # Your code goes here

# Now run all training data and validation data to store hidden states
    # Your code goes here

In [None]:
# Now get nearest neighbors and print

# 9. Effectiveness of word2vec

We now repeat everything done above using word2vec embeddings in place of one-hot embeddings. This will require re-running steps 1-8.