# Learning sentence representations from natural language inference data
### Advanced Topics in Computational Semantics - Practical I - Ard Snijders - 12854913

In [2]:
import torch
from torchtext import data
from torchtext import datasets
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
torch.set_printoptions(threshold=100)
import pickle

import nltk.tokenize 
nltk.download('punkt')
import argparse
import numpy as np
import sys
from pprint import pprint
import random

import os
import logging
import senteval
import warnings
import pandas as pd
warnings.simplefilter(action='ignore', category=FutureWarning)

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/ardsnijders/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# Notebook overview

This interactive notebook comprises four major sections:

1. Overview of models
2. Loading pre-trained models for inference
3. Results on SNLI and SentEval
4. Error Analysis (including a mini perturbation study!)

# 1. Model Overview


Before we can load the pretrained models, we must first define their classes. To my knowledge, a model checkpoint cannot be loaded without first creating an instance of the model, and creating an instance requires the class to be defined somewhere. Since I could not import the classes from another module in a Jupyter notebook, they are defined here.

For classification we use a simple Multi-Layer Perceptron (MLP); the size of its input depends on the encoder model coupled to it. The MLP takes as its input the combination of the sentence vectors u and v (the concatenation of u, v, |u - v| and u * v). It has a single hidden layer of size 512, after which the result is projected onto a final 3-way output layer, with one unit for each label. I chose to use a LeakyReLU activation function between the two linear projections, as I found it slightly improved my accuracy.

In [3]:
class MLPClassifier(nn.Module):
    """
    Class for Multi-Layer Perceptron Classifier
    """
    def __init__(self, input_dim, hidden_dim):
        """
        Initializes the network
        :param input_dim: dimension of first layer
        :param hidden_dim: dimension of hidden layer
        """
        super(MLPClassifier, self).__init__()
        self.network = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.LeakyReLU(),
            nn.Linear(hidden_dim, 3),
        )

    def forward(self, input):
        """
        Method to do the forward pass
        :param input: batch of concatenations of [u, v, |u-v|, u*v]
        :return: [batch_size, 3] tensor with predictions
        """

        output = self.network(input)
        return output


## Average Embedding Sentence Encoder
The simplest sentence representation can be achieved by summing the embeddings of each word in the sentence, and then dividing this by the length of the sequence to get the average embedding for the sentence.

In [4]:
class AverageEmbeddings(nn.Module):
    """
    Simple network that takes average of all word embeddings in the sentence
    """
    def __init__(self, vocab_size, emb_dim):

        super(AverageEmbeddings, self).__init__()
        self.embedding = nn.EmbeddingBag(num_embeddings=vocab_size, embedding_dim=emb_dim, mode="mean")
        self.embedding.weight.requires_grad=False

    def forward(self, sentences, sequence_lengths):
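        """
        The forward pass of the model
        :param sentences: [batch_size, seq_len] tensor of word ids for hypothesis/premise sentences
        :param sequence_lengths: not used by this encoder; kept so that all encoders share one call signature
        :return: [batch_size, emb_dim] tensor with the average embedding of each sentence
        """
        # With 2D input, EmbeddingBag treats every row as one bag, so no offsets are passed
        embedded = self.embedding(sentences)
        return embedded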
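
To make the averaging concrete, here is a minimal sketch with toy sizes (the vocabulary size, embedding dimension and word ids are made up for illustration). It checks that `nn.EmbeddingBag` in `mean` mode gives the same result as looking up each word vector and averaging by hand. Note that with 2D input every position in a row is averaged, so any padding ids in a batch would contribute to the mean as well.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes, chosen only for illustration
vocab_size, emb_dim = 10, 4
bag = nn.EmbeddingBag(num_embeddings=vocab_size, embedding_dim=emb_dim, mode="mean")

# One batch with two "sentences" of three word ids each
sentences = torch.tensor([[1, 2, 3],
                          [4, 5, 6]])

with torch.no_grad():
    averaged = bag(sentences)                    # shape [2, emb_dim]
    by_hand = bag.weight[sentences].mean(dim=1)  # look up vectors, then average over words

print(averaged.shape)
print(torch.allclose(averaged, by_hand, atol=1e-6))
```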

## LSTM Sentence Encoder

We can incorporate more linguistic information by using a sequential model such as the LSTM. The LSTM takes as its input a sequence of embeddings and processes it sequentially, meaning that words are fed through the network one at a time; each pass for a single word is called a timestep, and for a sequence of K words the LSTM will go through a series of K timesteps to process that sequence. 

In [71]:
class LSTM_Encoder(nn.Module):
    """
    Class for all LSTM encoder models
    """
    def __init__(self, vocab_size, hidden_dim, emb_dim, num_layers, bidirectional, maxpool):
        """
        Initializes the LSTM network
        :param vocab_size: size of the vocabulary for initializing nn.Embedding
        :param hidden_dim: dimensionality of hidden layer in LSTM
        :param emb_dim: dimensionality of word embeddings
        :param num_layers: number of "stacked" LSTM layers (always 1)
        :param bidirectional: flag to turn on/off bidirectional encoding
        :param maxpool: flag to turn on/off max-pooling over the hidden states of all timesteps
        """
        super(LSTM_Encoder, self).__init__()
        self.bidirectional = bidirectional
        self.maxpool = maxpool

        # Initialize embedding layers. Turn off gradient tracking as we don't want to update the embeddings.
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.embedding.weight.requires_grad=False

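        # LSTM encoder; hidden and cell states are initialized to zeros by default.
        # Note: nn.LSTM applies dropout only between stacked layers, so with num_layers=1
        # the dropout setting has no effect and PyTorch emits a warning.
        self.encoder = nn.LSTM(emb_dim, hidden_dim, num_layers, bidirectional=bidirectional, dropout=0.5)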
 

    def forward(self, sentences, sequence_lengths):
        """
        Performs the forward pass through the network
        :param sentences: batch of sequences representing hypotheses or premises sentences
        :param sequence_lengths: tensor with the number of (non-padding) tokens per sentence
        :return:
        - if unidirectional: ([batch_size, hidden_dim]) shaped tensor with last hidden state of network
        - if bidirectional: ([batch_size, 2*hidden_dim]) shaped tensor with concatenated forward/backward hidden states
        - if bidirectional and max_pooling: ([batch_size, 2*hidden_dim]) tensor with max-pool on features for all t
        """
        embedded = self.embedding(sentences)

        # Pack the padded batch into a PackedSequence so that the LSTM skips the padding positions
        packed_embedded = torch.nn.utils.rnn.pack_padded_sequence(embedded, batch_first=True, 
                                                                  lengths=sequence_lengths, 
                                                                  enforce_sorted=False)
        output, (hidden, cell) = self.encoder(packed_embedded)

        # Pad the packed output which contains the hidden states
        # Function returns a tuple with padded outputs and tensor with lengths of each sequence in the batch
        output_padded, seq_lengths = torch.nn.utils.rnn.pad_packed_sequence(output, batch_first=True, 
                                                                            padding_value=-9999)
        # Bidirectional LSTM w/ Max-pooling on word level features
        if self.bidirectional and self.maxpool:
            batch = output_padded
            batch = [(torch.max(sent, dim=0, keepdim=True)[0]).squeeze() for sent in batch]
            
            # Combine list of tensors back into tensor of tensors of shape ([batch_size, hidden_dim*2])
            batch = torch.stack(batch, dim=0)
            result = batch
        
        # Bidirectional LSTM
        elif self.bidirectional and not self.maxpool:
            forward_h_s = hidden[0]
            backward_h_s = hidden[1]
            result = torch.cat((forward_h_s, backward_h_s), 1)
        
        # Uni-directional LSTM
        else:
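            # hidden has shape [1, batch_size, hidden_dim]; drop only the first dimension,
            # so that a batch of size 1 keeps its batch dimension
            result = hidden.squeeze(0)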

        return result
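
The max-pooling branch above can be hard to picture, so here is a minimal sketch on random toy data (all sizes and lengths are assumptions for illustration, not the values used in training). It packs a padded batch, runs a bidirectional LSTM, pads the output with a very negative value and takes the maximum over time, so that padding positions can never win the max. It also shows the alternative of concatenating the final forward and backward hidden states.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

# Toy sizes (illustrative only)
batch_size, max_len, emb_dim, hidden_dim = 3, 5, 4, 6
lengths = torch.tensor([5, 3, 2])  # true number of tokens per sentence

embedded = torch.randn(batch_size, max_len, emb_dim)
lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=1, bidirectional=True, batch_first=True)

with torch.no_grad():
    packed = pack_padded_sequence(embedded, lengths, batch_first=True, enforce_sorted=False)
    packed_out, (hidden, cell) = lstm(packed)
    padded_out, out_lengths = pad_packed_sequence(packed_out, batch_first=True,
                                                  padding_value=-9999.0)

    # Max over the time dimension: one value per feature, per sentence
    pooled = padded_out.max(dim=1).values                     # [batch_size, 2 * hidden_dim]

    # Without max-pooling: concatenate final forward and backward hidden states
    last_states = torch.cat((hidden[0], hidden[1]), dim=1)    # [batch_size, 2 * hidden_dim]

print(pooled.shape, last_states.shape)
```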

## Training details and hyperparameters

I largely chose to stick with the hyperparameters provided by Conneau et al. (2017). That is:

- MLP with a single hidden layer of dim 512
- LSTM with 2048 hidden units
- Initial learning rate of 0.1. Learning rate is divided by 5 when validation accuracy stops increasing. 

I also diverge from them in some respects:

- LeakyReLU in the MLP seemed to give slightly better performance on the validation set
- p = 0.5 dropout for the `nn.LSTM` encoder in the LSTM network seemed to give similar improvements

I trained all models for 25 epochs on Lisa; however, this turned out to be largely overkill, as most models seemed to have converged after about half of that. For each model I settled on the checkpoint after which I no longer saw significant increases in validation accuracy:
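
The learning-rate rule listed above can be sketched as follows. This is only an illustration, not the training code: the function name `decay_schedule` is hypothetical, and it assumes that "stops increasing" means the validation accuracy of an epoch does not exceed the best value seen so far.

```python
def decay_schedule(val_accuracies, initial_lr=0.1, factor=5.0):
    """Return the learning rate in effect after each epoch's validation step."""
    lr = initial_lr
    best = float("-inf")
    lrs = []
    for acc in val_accuracies:
        if acc > best:
            best = acc      # validation accuracy improved: keep the learning rate
        else:
            lr /= factor    # no improvement: divide the learning rate by 5
        lrs.append(lr)
    return lrs


# Made-up validation accuracies, for illustration only
schedule = decay_schedule([70.0, 72.0, 71.5, 73.0])
print(schedule)
```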

- average embeddings = 15 epochs
- uni_LSTM = 12 epochs
- bi_LSTM = 13 epochs
- bi_LSTM w/ MP = 12 epochs

# 2. Loading pre-trained models for inference 

### Loading GloVe vectors used during training

I pickled the GloVe vectors that were used to initialize the embedding weights during training; we need to load them before we can initialize and load the desired models.

In [6]:
def load_vectors():   
    with open('glovevecs.pickle', 'rb') as handle:
        GloveVecs = pickle.load(handle)
    
    return GloveVecs
GloveVecs = load_vectors()

### Loading sentence encoder & MLP classifier
In order to load the sentence encoder and MLP classifier, we must first create instances of each, with the appropriate parameters (e.g. the dimensionality of the input of the MLP should match the output of the encoder). We then load the model checkpoints. This function returns the pretrained encoder model and corresponding classifier.

In [7]:
def load_model(model_type, emb_dim, hidden_dim, epochs, bidirectional, maxpool):
    """
    params:
    model_type: str parameter that specifies whether to load average embedding ('AVERAGE') or LSTM ('LSTM')
    emb_dim: int parameter that specifies dimensionality of word embeddings
    hidden_dim: int parameter that specifies number of hidden units in the LSTM
    epochs: int parameter to specify for which training epoch we want to load the models.
            the checkpoints used in this notebook are epoch 15 (average embeddings), 12 (uni-LSTM),
            13 (bi-LSTM) and 12 (bi-LSTM with max-pooling).
    bidirectional: bool parameter to specify whether we want to load unidirectional or bidirectional LSTM
    maxpool: bool parameter to specify whether we want bi-LSTM with or without maxpooling

    returns: encoder and classifier instances loaded with weights from the given training epoch
    """

    # Create instances of desired encoder and corresponding classifier, then load checkpoint files.
    if model_type == 'AVERAGE':

        # Instantiate sentence encoder, populate with GloVe embeddings.
        # The embeddings are frozen, so no encoder checkpoint is loaded here.
        model = AverageEmbeddings(vocab_size=vocab_size, emb_dim=emb_dim)
        model.embedding.weight.data.copy_(GloveVecs)

        # Instantiate classifier
        input_dim = int(emb_dim * 4)
        classifier = MLPClassifier(input_dim=input_dim, hidden_dim=512)

        # Load checkpoint file for MLP
        classifier_path = "./models/{}/classifier_checkpoint_epoch_{}.pt".format(model_type, epochs)
        classifier.load_state_dict(torch.load(classifier_path, map_location=torch.device('cpu')))

    elif model_type == 'LSTM':

        # Instantiate sentence encoder, populate with GloVe embeddings.
        model = LSTM_Encoder(vocab_size=vocab_size, hidden_dim=hidden_dim, emb_dim=emb_dim,
                             num_layers=1, bidirectional=bidirectional, maxpool=maxpool)
        model.embedding.weight.data.copy_(GloveVecs)

        # Load checkpoint file for LSTM encoder
        model_path = "./models/{}_bidirectional:_{}_maxpool:_{}/model_checkpoint_epoch_{}.pt".format(
                model_type, bidirectional, maxpool, epochs)
        model.load_state_dict(torch.load(model_path, map_location=torch.device('cpu')))

        # Instantiate classifier
        if not bidirectional:
            input_dim = int(4 * hidden_dim)
        else:
            input_dim = int(4 * hidden_dim * 2)
        classifier = MLPClassifier(input_dim=input_dim, hidden_dim=512)

        # Load checkpoint file for MLP
        classifier_path = "./models/{}_bidirectional:_{}_maxpool:_{}/classifier_checkpoint_epoch_{}.pt".format(
                model_type, bidirectional, maxpool, epochs)
        classifier.load_state_dict(torch.load(classifier_path, map_location=torch.device('cpu')))

    else:
        raise ValueError("Unknown model_type: {}".format(model_type))

    return model, classifier
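
The requirement that the classifier input matches the encoder output comes down to simple arithmetic: the classifier sees four vectors of the sentence dimension concatenated together. A minimal sketch of that calculation is given below; the helper name `classifier_input_dim` is hypothetical and not part of the notebook's code.

```python
def classifier_input_dim(model_type, emb_dim, hidden_dim, bidirectional):
    """Size of the vector fed to the MLP for a given encoder configuration."""
    if model_type == "AVERAGE":
        sentence_dim = emb_dim                      # average of word embeddings
    elif model_type == "LSTM":
        # A bidirectional LSTM concatenates forward and backward states
        sentence_dim = 2 * hidden_dim if bidirectional else hidden_dim
    else:
        raise ValueError("Unknown model_type: {}".format(model_type))
    # Four sentence-sized vectors are concatenated before classification
    return 4 * sentence_dim


print(classifier_input_dim("AVERAGE", 300, 2048, False))  # 1200
print(classifier_input_dim("LSTM", 300, 2048, False))     # 8192
print(classifier_input_dim("LSTM", 300, 2048, True))      # 16384
```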

## Loading the vocabulary

We need to load the vocabulary used during training, since we need to know which id to assign to each word in our new sentences. We also need the vocabulary size to initialize the embedding layers of our encoder instances with the right dimensionality. I extracted the word-to-id mapping (a defaultdict) via the `stoi` attribute of the `torchtext.vocab.Vocab` object; the `Vocab` object is itself an attribute of the `torchtext.data.Field` object that was used to build the vocabulary during training.

In [8]:
def load_vocab():   
    with open('vocabulary.pickle', 'rb') as handle:
        vocab = pickle.load(handle)
    return vocab

In [18]:
vocab = load_vocab()
print('Vocabulary size:', len(vocab.keys()), 'unique words.')

vocab_size = len(vocab.keys())

Vocabulary size: 36544 unique words.


## Converting sentence pairs into sequence tensors
The *sent2seq* function serves to convert sentences into sequences of word_ids that correspond to the ids the model was trained on, employing the vocab object that was used during training, such that sequences can be handled appropriately by the embedding layers of the encoders.

Because my original models expect batches of datapoints, and we're only feeding in pairs of sentences created "on the spot", I use *artificial_batch* to 'upsample' the example sentences into a single batch of 64 identical sequence tensors per sentence, so that I don't have to alter my model architecture too much.

In [19]:
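def artificial_batch(sequence, unsq):
    # sequence is either a list of word ids (unsq=True) or a single length value (unsq=False)
    seq_tensor = torch.tensor(sequence, dtype=torch.long)
    seq_tensor = seq_tensor.unsqueeze(0)
    # Repeat 64 times: [64, seq_len] for a list of ids, [64, 1] for a single length
    seq_tensor = seq_tensor.repeat(64, 1)
    if not unsq:
        # Lengths should be a flat [64] tensor
        seq_tensor = seq_tensor.squeeze(1)
    return seq_tensor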

def sent2seq(sentences, vocab):
    # Input = pair of strings where strings are premise and hypothesis sentences, respectively
    # Empty list to store sequences of ids
    prep_sentences = []
    sequence_lengths = []  
    for sentence in sentences:     
        # Preprocess sentence: tokenization and lowercasing. 
        sentence = nltk.tokenize.word_tokenize(sentence.lower())        
        # Create list for sequence for current sentence
        prep_sentence = []     
        # Convert tokens to a sequence of word ids by look-up in vocab
        # If word not in vocab, assign value associated with '<unk>'   
        for word in sentence:
            if word in vocab:
                prep_sentence.append(vocab[word])
            else:
                prep_sentence.append(vocab['<unk>'])            
        # Convert list into [64, <len_list>] tensor for both sentence and seq lengths
        seq_batch = artificial_batch(prep_sentence, unsq=True)
        len_batch = artificial_batch(len(prep_sentence), unsq=False)       
        # Append batch tensors to lists
        prep_sentences.append(seq_batch)
        sequence_lengths.append(len_batch)
             
    return prep_sentences, sequence_lengths
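
As a quick sanity check on the shapes involved, the sketch below reproduces the upsampling for one already-converted sentence. The word ids are only placeholders for illustration; the chained calls are a compact equivalent of what `artificial_batch` does, not a replacement for it.

```python
import torch

# Placeholder word ids for a four-word sentence
word_ids = [4, 7, 6, 2975]

# 64 identical copies of the sentence: shape [64, seq_len]
seq_batch = torch.tensor(word_ids, dtype=torch.long).unsqueeze(0).repeat(64, 1)

# 64 identical copies of the sentence length: shape [64]
len_batch = torch.tensor(len(word_ids), dtype=torch.long).unsqueeze(0).repeat(64)

print(seq_batch.shape)   # torch.Size([64, 4])
print(len_batch.shape)   # torch.Size([64])
```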

In [20]:
# Output for example premise and hypothesis:
sentences = ['The man is sleeping in his bed', 'The man is awake']
prep_sentences, sequence_lengths = sent2seq(sentences, vocab)

print('Premise:',sentences[0])
print("Corresponding premise sequence:", (prep_sentences[0][0].tolist()), '\n')
print('Hypothesis:', sentences[1])
print("Corresponding hypothesis sequence:", prep_sentences[1][0].tolist())

Premise: The man is sleeping in his bed
Corresponding premise sequence: [4, 7, 6, 151, 5, 21, 314] 

Hypothesis: The man is awake
Corresponding hypothesis sequence: [4, 7, 6, 2975]


## Predict function

This function takes in prepared sentences and passes them to the desired encoder/classifier to generate a prediction.

In [21]:
def predict(prep_sentences, sequence_lengths, model, classifier):
    # Extract premise and hypothesis sequence and sequence lengths
    premise = prep_sentences[0]
    hypothesis = prep_sentences[1]
    premise_len = sequence_lengths[0]
    hypothesis_len = sequence_lengths[1]

    # 'Notify' model layers that we're no longer in training mode
    model.eval()
    classifier.eval()

    # Disable gradient tracking for inference. torch.no_grad() only takes effect
    # when used as a context manager (or decorator), not as a bare call.
    with torch.no_grad():
        # Pass sequences through given encoder
        u = model(sentences=premise, sequence_lengths=premise_len)
        v = model(sentences=hypothesis, sequence_lengths=hypothesis_len)

        # Compute the joint sentence representation
        differences = torch.abs(u-v)
        products = torch.mul(u, v)
        final_vectors = torch.cat((u, v, differences, products), 1)

        # Pass resulting vector through the MLP classifier
        predictions = classifier(final_vectors)

        # Take the highest prediction, retrieve corresponding index
        _, index = predictions.max(dim=1)

    return index
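
The joint representation used in `predict` can be illustrated on random toy vectors (the sizes and logits below are made up for illustration). The sketch shows that concatenating u, v, |u - v| and u * v yields a vector four times the sentence dimension, and how the predicted class index is read off the classifier output.

```python
import torch

torch.manual_seed(0)

# Toy sizes (illustrative only)
batch_size, sent_dim = 2, 5
u = torch.randn(batch_size, sent_dim)   # premise representations
v = torch.randn(batch_size, sent_dim)   # hypothesis representations

# Concatenate along the feature dimension
features = torch.cat((u, v, torch.abs(u - v), u * v), dim=1)
print(features.shape)   # torch.Size([2, 20])

# Hand-written logits standing in for the classifier output
logits = torch.tensor([[0.2, 1.5, -0.3],
                       [2.0, 0.1, 0.4]])
predicted = logits.argmax(dim=1)
print(predicted.tolist())   # [1, 0]
```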

## Inference function
Now that we can load our models and convert our sentences into sequences, we can generate predictions for any given pair of sentences by first passing them through *sent2seq*, and then feeding the result through the pretrained models.

In [22]:
def inference(sentences, model_type, emb_dim, hidden_dim, epochs, bidirectional, maxpool): 
    #load vocab
    vocab = load_vocab()

    #prepare sentences by converting them into sequences with wordids.
    prep_sentences, sequence_lengths = sent2seq(sentences, vocab)
    
    #load the desired model and classifier
    model, classifier = load_model(model_type, emb_dim, hidden_dim, epochs, bidirectional, maxpool)
    model = model.eval()
    
    #feed prepared sentences through model to get a prediction
    prediction = predict(prep_sentences, sequence_lengths, model, classifier)

    # 64 identical predictions are unnecessary here, so we just take one value
    prediction = prediction[0].item()
    
    label_dict = {0:'entailment', 1:'contradiction', 2:'neutral'}
    
    return label_dict[prediction]

### Example: Bi-LSTM with Max-Pooling

In [72]:
relations = {
    'contradiction_1' : ['Bob is awake', 'Bob is sleeping'],
    'contradiction_2' : ['A car is driving on the road', 'The car is standing still'],
    'contradiction_3' : ['A man inspects the uniform of a figure in some East Asian country.', 'The man is sleeping'],
    'contradiction_4' : ['A black race car starts up in front of a crowd of people.', 
                       'A man is driving down a lonely road.'],
    'entailment_1' : ['Bob is awake', 'Bob is lying in his bed'],
    'entailment_3' : ['A soccer game with multiple males playing.', 'Some men are playing a sport.'],
    'entailment_4' : ['At the end of Pennsylvania Avenue, people began to line up for a White House tour.', 
                    'People formed a line at the end of Pennsylvania Avenue.'],
    'neutral_1' : ['Bob is awake', 'It is sunny outside'],
    'neutral_2' : ['A car is driving on the road', 'It is sunny outside'],
    'neutral_3' : ['A smiling costumed woman is holding an umbrella.', 'Some men are playing a sport.'],
    'neutral_4' : ['An older and younger man smiling.', 
                 'Two men are smiling and laughing at the cats playing on the floor.'],
    'neutral_5' : ['A car is driving on the road', 'The car is speeding on the highway']}
for key in relations.keys():
    print(relations[key],
        '\n True Relation: '+ key[:-2] + '. Predicted relation: ',
        inference(relations[key], 
          model_type='LSTM', 
          emb_dim=300, 
          hidden_dim=2048, 
          epochs=12, 
          bidirectional=True, 
          maxpool=True), '\n')



['Bob is awake', 'Bob is sleeping'] 
 True Relation: contradiction. Predicted relation:  contradiction 

['A car is driving on the road', 'The car is standing still'] 
 True Relation: contradiction. Predicted relation:  contradiction 

['A man inspects the uniform of a figure in some East Asian country.', 'The man is sleeping'] 
 True Relation: contradiction. Predicted relation:  contradiction 

['A black race car starts up in front of a crowd of people.', 'A man is driving down a lonely road.'] 
 True Relation: contradiction. Predicted relation:  contradiction 

['Bob is awake', 'Bob is lying in his bed'] 
 True Relation: entailment. Predicted relation:  contradiction 

['A soccer game with multiple males playing.', 'Some men are playing a sport.'] 
 True Relation: entailment. Predicted relation:  entailment 

['At the end of Pennsylvania Avenue, people began to line up for a White House tour.', 'People formed a line at the end of Pennsylvania Avenue.'] 
 True Relation: entailment. Pr

# 3. Results for SNLI / SentEval

In the tables below I report the performance of the models. I report accuracy on both the validation and test set of the SNLI natural language inference dataset. I also report the micro- and macro-accuracies on SentEval, as calculated from the model performance on a range of tasks, which are listed in Table 2.

Model | dev (NLI) | test (NLI) | micro (SE) | macro (SE) | epoch checkpoint
--- | --- | --- | --- | --- | ---
Average-Emb | 72.27 | 72.60 | 83.62 | 78.00 | 15
Uni-LSTM | 81.70 | 81.54 | 84.06 | 81.27 | 12
Bi-LSTM | 81.97 | 81.85 | 84.02 | 81.20 | 13
Bi-LSTM-MP | __84.70__ | __84.61__ | __86.06__ | __82.88__ | 12

Table 1: __Performance on SNLI and SentEval.__
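
For readers unfamiliar with the two SentEval averages, here is a minimal sketch of how they differ. It assumes the convention of Conneau et al. (2017), where the macro average is the plain mean over tasks and the micro average weights each task by its number of samples. The task accuracies and sample counts below are hypothetical and are not the values behind Table 1.

```python
def macro_average(accuracies):
    """Unweighted mean over tasks: every task counts equally."""
    return sum(accuracies) / len(accuracies)


def micro_average(accuracies, n_samples):
    """Mean over tasks weighted by the number of samples in each task."""
    total = sum(n_samples)
    return sum(acc * n for acc, n in zip(accuracies, n_samples)) / total


# Hypothetical task accuracies and sample counts, for illustration only
accuracies = [80.0, 90.0]
n_samples = [100, 300]

macro = macro_average(accuracies)
micro = micro_average(accuracies, n_samples)
print(macro)   # 85.0
print(micro)   # 87.5
```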



Model | MR | CR | MPQA | SUBJ | SST2 | TREC | MRPC | SICK-E | STS14 (Pearson) | STS14 (Spearman)
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
Average-Emb | 54.11 | 79.63 | 84.38 | 99.6 | 78.03 | 82.2 | 70.26 | 75.77 | 0.4963 | 0.5184
Uni-LSTM | __71.86__ | 78.17 | 84.96 | 99.6 | 76.72 | 81.6 | 73.1 | 84.17 | 0.5771 | 0.5578
Bi-LSTM | 71.68 | 78.2 | 85.01 | 99.6 | 76.94 | 81.4 | 72.93 | 83.8 | 0.5787 | 0.5587
Bi-LSTM-MP | 69.81 | __80.9__ | __85.62__ | 99.6 | __79.52__ | __87.6__ | __75.07__ | __84.92__ | __0.6383__ | __0.6238__

Table 2: __Test accuracies on individual SentEval tasks.__

### Performance analysis

Considering the results above, we can note a couple of things. The average embedding encoder has the lowest performance; the uni- and bi-LSTM encoders perform better but are otherwise fairly similar to each other (81.54% and 81.85% test accuracy, respectively); and the Bi-LSTM with max-pooling outperforms the other models, with a test accuracy of 84.61%. This makes sense intuitively: the average embedding encoder does not take word order into account, and by assigning equal importance to each word in the sequence it generates a comparatively simplistic sentence representation.

The uni-LSTM is more sophisticated in this respect; sequential by design, the LSTM architecture has the capacity to 'remember' what it has seen in previous timesteps, which allows it to learn long distance dependencies, for instance. The bi-LSTM takes this one step further, as it processes a sentence in both a forward and backward direction, allowing it to construct a representation based on both the 'past' and the 'future' context. 

In this respect, I had expected the bi-LSTM to outperform the uni-LSTM slightly, but their performances are fairly similar. Nevertheless, both the uni-LSTM and bi-LSTM outperform the average embedding encoder across all relations.

Finally, the bi-LSTM with max-pooling extends the bi-LSTM model by incorporating a max-pooling operation in its forward pass; here, it considers the word-level hidden states for both the forward and backward directions, such that the final sentence representation is composed of the largest values from all the word-level hidden states. This model performs better than the uni-LSTM and vanilla bi-LSTM (by roughly 3 points on SNLI), which seems to suggest that the word-level hidden states encapsulate some additional information compared to just the final hidden states.

The performance on SNLI seems to be mirrored by the results on SentEval; except on the MR task, the Bi-LSTM with max-pooling performs at least as well as the other encoder models on every task. This seems to su

### SentEval code

Below you can see the main code that was used to generate embeddings for feeding into the SentEval framework.

In [None]:
encoder, _ = load_model('AVERAGE', emb_dim=300, hidden_dim=2048, epochs=15, bidirectional=False, maxpool=False)

def batcher(params, batch):
    """
    params: senteval parameters.
    batch: numpy array of text sentences (of size params.batch_size)
    output: numpy array of sentence embeddings (of size params.batch_size)
    """
    sequences, seq_lengths = sent2ids(batch)
    embeddings = encoder(sentences=sequences, sequence_lengths=seq_lengths).detach()
    return embeddings

logging.basicConfig(format='%(asctime)s : %(message)s', level=logging.DEBUG)
PATH_TO_DATA = '/Users/ardsnijders/Desktop/ATCS/SentEval/data'
params = {'task_path': PATH_TO_DATA, 'usepytorch': False, 'kfold': 5}

se = senteval.engine.SE(params, batcher)

# if __name__ == "__main__":
#     transfer_tasks = ['MR', 'CR', 'MPQA', 'SUBJ', 'SST2', 'TREC',
#                           'MRPC', 'SICKEntailment', 'STS14']
#     results = se.eval(transfer_tasks)

# 4. Error Analysis

In order to get a better idea of where the models perform well, I loaded the test set into a DataFrame and performed inference for all models. This allows us to see more easily how well the models were able to predict specific relations.

In [76]:
pd.set_option('display.width', 10000)
pd.set_option('display.max_columns', 500)
df = pd.read_csv('snli_1.0_test.txt', sep='\t')
df = df[['sentence1', 'sentence2', 'gold_label']]
df.rename(columns={'sentence1': 'Premise', 'sentence2': 'Hypothesis'}, inplace=True)
# delete the entries where the gold label == '-'
df = df[df['gold_label'] != '-']
df_ex = df.head(5)
df_ex.style.set_properties(subset=['Premise', 'Hypothesis'], **{'width': '300px'})

Unnamed: 0,Premise,Hypothesis,gold_label
0,This church choir sings to the masses as they sing joyous songs from the book at a church.,The church has cracks in the ceiling.,neutral
1,This church choir sings to the masses as they sing joyous songs from the book at a church.,The church is filled with song.,entailment
2,This church choir sings to the masses as they sing joyous songs from the book at a church.,A choir singing at a baseball game.,contradiction
3,"A woman with a green headscarf, blue shirt and a very big grin.",The woman is young.,neutral
4,"A woman with a green headscarf, blue shirt and a very big grin.",The woman is very happy.,entailment


In [77]:
print('Number of omitted rows:', 10000-len(df))

Number of omitted rows: 176


## Some functions for generating predictions from DataFrame

### In a nutshell: 

*create_batches* creates iterables using both the premise and hypothesis columns in the dataframe.

Then, *sent2ids* processes these batches, converts sentences to sequences of ids and records their lengths. This function is also used by the *batcher* function which is used for SentEval.

Next, *get_predictions* uses batches of sequences and lengths to generate predictions for the desired model.

Finally, *calc_accuracy* computes the accuracy of a given model's predictions for a dataframe of choice.

In [86]:
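def create_batches(batch_size, column, dataframe):
    # Wrap each sentence in a list, so that sent2ids can treat DataFrame batches
    # and SentEval batches (lists of tokens) in the same way
    sentences = [[sentence] for sentence in dataframe[column]]
    # Slice by position rather than by DataFrame index: the index has gaps after
    # filtering out rows, and slicing also keeps the final, possibly smaller, batch
    val_iter = [sentences[i:i + batch_size] for i in range(0, len(sentences), batch_size)]
    return val_iter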

def sent2ids(batch):   
    # Tokenize & lowercase
    batch = [sent if sent != [] else ['.'] for sent in batch]
    batch = [' '.join(sentence) for sentence in batch]
    batch = [nltk.tokenize.word_tokenize(sentence.lower()) for sentence in batch]
    
    # Record the sentence lengths
    seq_lengths = [len(sent) for sent in batch] 
    # Record max seq_length
    batch_size = len(seq_lengths)
    max_len = max(seq_lengths)
    
    # Create a tensor with ones of shape batch_size, max_seq_len
    sequences = torch.ones((batch_size, max_len),dtype=torch.long)
    # Create a tensor with sequence lengths
    seq_lengths = torch.tensor(seq_lengths, dtype=torch.long)
    
    # Convert to sequences with ids, populate sequences tensor
    for i, sentence in enumerate(batch):
        for j, word in enumerate(sentence):
            if word in vocab:
                sequences[i][j] = vocab[word]
            else:
                sequences[i][j] = vocab['<unk>']              
    # Delete rows where seq_length = 0, as these are empty sentences
    sequences = sequences[seq_lengths>0]
    
    # Adjust the seq_lengths accordingly
    seq_lengths = seq_lengths[seq_lengths>0]
    
    return sequences, seq_lengths

def get_predictions(premise_iter, hypothesis_iter, model_type, emb_dim, hidden_dim, epochs, bidirectional, maxpool):

    # Load the desired model and classifier once, rather than once per batch
    model, classifier = load_model(model_type, emb_dim, hidden_dim, epochs, bidirectional, maxpool)
    label_dict = {0:'entailment', 1:'contradiction', 2:'neutral'}

    result = []
    for i in range(len(premise_iter)):
        # sent2ids looks up word ids in the global vocab loaded earlier
        premises, premise_lens = sent2ids(premise_iter[i])
        hypotheses, hypothesis_lens = sent2ids(hypothesis_iter[i])

        prep_sentences = [premises, hypotheses]
        sequence_lengths = [premise_lens, hypothesis_lens]

        predictions = predict(prep_sentences=prep_sentences,
                              sequence_lengths=sequence_lengths,
                              model=model, classifier=classifier)

        # Convert predicted class indices into label strings
        for prediction in predictions:
            result.append( label_dict[prediction.item()] )

    return result

def calc_accuracy(model, dataframe):
    
    correct = len( dataframe[ dataframe[model]==dataframe['gold_label'] ] )
    acc = round( 100 * (correct / len(dataframe)), 2)
    return acc

# Generating Predictions from DataFrame

In the cells below, I create batch objects from the premise and hypothesis columns in the dataframe containing the test set examples. I then pass these on to get_predictions to predict the relations between sentences, for each sentence encoder.

In [89]:
# Create batch objects with size 100
premise_iter = create_batches(100, 'Premise', df)
hypothesis_iter = create_batches(100, 'Hypothesis',df)

The cells below were run to get predictions using the Average Embedding encoder, Uni-LSTM, Bi-LSTM, and Bi-LSTM-MP, respectively.

In [90]:
average    = get_predictions(premise_iter, hypothesis_iter,
                          model_type='AVERAGE', emb_dim=300, 
                          hidden_dim=2048, epochs=15, 
                          bidirectional=False, maxpool=False)

In [91]:
uni_lstm   = get_predictions(premise_iter, hypothesis_iter,
                          model_type='LSTM', emb_dim=300, 
                          hidden_dim=2048, epochs=12, 
                          bidirectional=False, maxpool=False)

In [92]:
bi_lstm    = get_predictions(premise_iter, hypothesis_iter,
                          model_type='LSTM', emb_dim=300, 
                          hidden_dim=2048, epochs=13, 
                          bidirectional=True, maxpool=False)

In [93]:
bi_lstm_mp = get_predictions(premise_iter, hypothesis_iter, 
                          model_type='LSTM', 
                          emb_dim=300, hidden_dim=2048, 
                          epochs=12, bidirectional=True, maxpool=True)

### Adding the previously computed relation predictions to separate columns in the DataFrame

In [94]:
df['AVERAGE'] = average
df['UNI_LSTM'] = uni_lstm
df['BI_LSTM'] = bi_lstm
df['BI_LSTM_MP'] = bi_lstm_mp

### Creating separate dataframes for each relation to allow calculation of per-relation accuracies

In [95]:
contradictions = df[df['gold_label']=='contradiction']
entailments = df[df['gold_label']=='entailment']
neutrals = df[df['gold_label']=='neutral']

### Computing classification accuracy for each model, for each relation

In [96]:
print('Total number of examples in test set: ' + str(len(df)) +'\n')

# Compute classification accuracy for each relation, for each model
models = ['AVERAGE', 'UNI_LSTM', 'BI_LSTM', 'BI_LSTM_MP']
categories = [contradictions, entailments, neutrals]
for model in models:
    print('Overall classification accuracy using '+model+' encoder: '+str(calc_accuracy(model,df)) +'%')
    for category in categories:
        print('Accuracy using ' + 
              str(model) + ' encoder on predicting relation: ' + 
              category.gold_label.iloc[0] + 
              ': ' + str(calc_accuracy(model, category))
              + '%')
    print('\n')

Total number of examples in test set: 9824

Overall classification accuracy using AVERAGE encoder: 72.6%
Accuracy using AVERAGE encoder on predicting relation: contradiction: 70.37%
Accuracy using AVERAGE encoder on predicting relation: entailment: 78.95%
Accuracy using AVERAGE encoder on predicting relation: neutral: 68.19%


Overall classification accuracy using UNI_LSTM encoder: 81.54%
Accuracy using UNI_LSTM encoder on predicting relation: contradiction: 84.37%
Accuracy using UNI_LSTM encoder on predicting relation: entailment: 82.84%
Accuracy using UNI_LSTM encoder on predicting relation: neutral: 77.32%


Overall classification accuracy using BI_LSTM encoder: 81.85%
Accuracy using BI_LSTM encoder on predicting relation: contradiction: 83.78%
Accuracy using BI_LSTM encoder on predicting relation: entailment: 83.49%
Accuracy using BI_LSTM encoder on predicting relation: neutral: 78.19%


Overall classification accuracy using BI_LSTM_MP encoder: 84.61%
Accuracy using BI_LSTM_MP enco
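
The per-relation accuracies printed above only say how often a model is right for each gold label, not which wrong label it picks instead. A minimal sketch of how both could be computed in one pass is given below. The helper names `per_label_accuracy` and `confusion_counts` and the toy labels are assumptions made for illustration; they are not part of the notebook's own code.

```python
from collections import Counter

def per_label_accuracy(gold, predicted):
    """Return {gold label: accuracy in %} over paired label sequences."""
    totals, hits = Counter(), Counter()
    for g, p in zip(gold, predicted):
        totals[g] += 1
        if g == p:
            hits[g] += 1
    return {label: round(100 * hits[label] / totals[label], 2) for label in totals}

def confusion_counts(gold, predicted):
    """Count how often each (gold, predicted) combination occurs."""
    return Counter(zip(gold, predicted))

# Toy labels, made up for illustration (not taken from the test set)
gold = ['entailment', 'entailment', 'neutral', 'neutral', 'contradiction']
pred = ['entailment', 'neutral', 'neutral', 'entailment', 'contradiction']

acc = per_label_accuracy(gold, pred)
confusions = confusion_counts(gold, pred)
print(acc)
print(confusions[('neutral', 'entailment')])  # gold 'neutral' predicted as 'entailment'
```

On the DataFrame used here, the same helpers could presumably be called as `per_label_accuracy(df['gold_label'], df['BI_LSTM_MP'])`, since iterating over a pandas Series yields its values.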

## Relation-wise results
### Entailment
Most models seem to perform best when predicting 'entailment'; notably, the average embedding encoder reaches an accuracy of almost 80% for this category. For the uni- and bi-LSTMs, there is no clear difference in performance between 'entailment' and 'contradiction'. One possible reason why this relation is "easy" to predict is that most of the entailment pairs in the data consist of two sentences which are virtually identical, semantically speaking. Moreover, the hypothesis in these cases is usually a very succinct version of the premise. For instance, consider the following sentence pair:

Premise: *An old man with a package poses in front of an advertisement*

Hypothesis: *A man poses in front of an ad*

In this case, the hypothesis does not present any information that is not already explicitly mentioned in the premise, which might make it easy for models to infer that the relation is entailment. Most of the entailment problems in the data seem to be formulated like this: long, detailed premises with short, succinct hypotheses.
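
One rough way to check the impression that entailed hypotheses are mostly short restatements of the premise is to measure lexical overlap and relative length. The sketch below is only an illustration: `tokens` and `hypothesis_overlap` are hypothetical helpers, and plain whitespace splitting is an assumption that differs from the tokenization used for the models.

```python
import string

def tokens(sentence):
    """Lowercase, split on whitespace and strip surrounding punctuation."""
    stripped = (tok.strip(string.punctuation) for tok in sentence.lower().split())
    return [tok for tok in stripped if tok]

def hypothesis_overlap(premise, hypothesis):
    """Fraction of hypothesis tokens that also occur in the premise."""
    premise_vocab = set(tokens(premise))
    hyp_tokens = tokens(hypothesis)
    if not hyp_tokens:
        return 0.0
    return sum(tok in premise_vocab for tok in hyp_tokens) / len(hyp_tokens)

premise = 'An old man with a package poses in front of an advertisement.'
hypothesis = 'A man poses in front of an ad.'

overlap = hypothesis_overlap(premise, hypothesis)
length_ratio = len(tokens(hypothesis)) / len(tokens(premise))
print(overlap)       # 0.875: only 'ad' does not literally occur in the premise
print(length_ratio)  # 8 hypothesis tokens against 12 premise tokens
```

Averaging these two quantities per gold label would show whether entailment pairs really have higher overlap and shorter hypotheses than the other relations; I have not run that comparison here.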

It seems that the models struggle more when the hypothesis contains information which can be *inferred* from the premise using world knowledge, but is not stated in the premise explicitly. Consider the following sentence pair:

Premise: *A land rover is being driven across a river*

Hypothesis: *A land rover is splashing water as it crosses a river*

Anyone who has knowledge about 'water' and its behaviour in the real world knows that it will 'splash' when disturbed - in this case, by driving through it. For machines, however, such deductions are not necessarily as straightforward; rather, they might 'perceive' the phenomenon of 'splashing' to be new information that does not follow from the premise phrase 'driven across a river' (even though, arguably, the words are semantically close). As a result, all models incorrectly predict 'neutral' for this pair.

In this respect, the fact that most models perform best on entailment problems might be a little misleading, in the sense that most of the entailment problems (I say most, though I obviously haven't looked at all 180k+ of them - I just base this on scanning the dataframes) are formulated rather simplistically; when faced with more difficult or ambiguous entailment problems, the models will possibly struggle more.

In [97]:
df_entail = df[(df.gold_label=='entailment')].head(10)# & (df.gold_label != df.AVERAGE)]
df_entail.style.set_properties(subset=['Premise', 'Hypothesis'], **{'width': '300px'})

Unnamed: 0,Premise,Hypothesis,gold_label,AVERAGE,UNI_LSTM,BI_LSTM,BI_LSTM_MP
1,This church choir sings to the masses as they sing joyous songs from the book at a church.,The church is filled with song.,entailment,entailment,entailment,entailment,entailment
4,"A woman with a green headscarf, blue shirt and a very big grin.",The woman is very happy.,entailment,neutral,entailment,neutral,neutral
6,An old man with a package poses in front of an advertisement.,A man poses in front of an ad.,entailment,entailment,entailment,entailment,entailment
10,A statue at a museum that no seems to be looking at.,There is a statue that not many people seem to be interested in.,entailment,entailment,neutral,entailment,neutral
12,A land rover is being driven across a river.,A Land Rover is splashing water as it crosses a river.,entailment,neutral,neutral,neutral,neutral
13,A land rover is being driven across a river.,A vehicle is crossing a river.,entailment,entailment,entailment,entailment,entailment
16,A man playing an electric guitar on stage.,A man playing guitar on stage.,entailment,entailment,entailment,entailment,entailment
18,A blond-haired doctor and her African american assistant looking threw new medical manuals.,A doctor is looking at a book,entailment,entailment,entailment,entailment,entailment
22,"One tan girl with a wool hat is running and leaning over an object, while another person in a wool hat is sitting on the ground.",A tan girl runs leans over an object,entailment,entailment,neutral,neutral,entailment
26,A young family enjoys feeling ocean waves lap at their feet.,A family is at the beach.,entailment,entailment,entailment,entailment,entailment


### Contradiction

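Most models seem to do slightly worse when predicting 'contradiction' than when predicting 'entailment'. The exceptions are the uni- and bi-LSTMs, which, as mentioned, have comparable accuracies for the two relations (84.37% vs. 82.84% and 83.78% vs. 83.49%, respectively).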

In [99]:
df_contra = df[(df.gold_label=='contradiction')].head(10)# & (df.gold_label != df.AVERAGE)]
df_contra.style.set_properties(subset=['Premise', 'Hypothesis'], **{'width': '300px'})

Unnamed: 0,Premise,Hypothesis,gold_label,AVERAGE,UNI_LSTM,BI_LSTM,BI_LSTM_MP
2,This church choir sings to the masses as they sing joyous songs from the book at a church.,A choir singing at a baseball game.,contradiction,contradiction,contradiction,contradiction,contradiction
5,"A woman with a green headscarf, blue shirt and a very big grin.",The woman has been shot.,contradiction,entailment,neutral,neutral,neutral
8,An old man with a package poses in front of an advertisement.,A man walks by an ad.,contradiction,entailment,contradiction,contradiction,contradiction
11,A statue at a museum that no seems to be looking at.,Tons of people are gathered around the statue.,contradiction,neutral,contradiction,contradiction,contradiction
14,A land rover is being driven across a river.,A sedan is stuck in the middle of a river.,contradiction,contradiction,contradiction,contradiction,contradiction
15,A man playing an electric guitar on stage.,A man playing banjo on the floor.,contradiction,contradiction,contradiction,contradiction,contradiction
19,A blond-haired doctor and her African american assistant looking threw new medical manuals.,A man is eating pb and j,contradiction,contradiction,contradiction,contradiction,contradiction
21,"One tan girl with a wool hat is running and leaning over an object, while another person in a wool hat is sitting on the ground.",A boy runs into a wall,contradiction,contradiction,contradiction,contradiction,contradiction
25,A young family enjoys feeling ocean waves lap at their feet.,A family is out at a restaurant.,contradiction,contradiction,contradiction,contradiction,contradiction
29,A couple walk hand in hand down a street.,A couple is sitting on a bench.,contradiction,contradiction,contradiction,contradiction,contradiction


### Neutral

Loosely speaking, 'neutral' appears to be the most difficult category. A possible reason is that, intuitively, this relation is also the most difficult to judge in general - in my opinion, there's comparatively more ambiguity involved, especially for sentences where it holds that "*X* being true in both the premise and the hypothesis need not entail *Y* in the hypothesis, even if *Y* does not contradict *X*". I suppose getting these right requires more 'inference' ability; in this respect some of them are almost trick questions. For instance, consider the following pair of sentences, from the dev set:

Premise: *A group of people sit on benches at a park outside a building.*

Hypothesis: *The people sit together.*

All models predict 'entailment' for this relation, but it is labelled 'neutral'; arguably, a group of people sitting on benches does not necessarily entail that they are sitting *together*. Possibly, the models make a connection between the words 'group' and 'together', which one could argue are semantically closely related; in this respect, the hypothesis of sitting together might be entailed by the premise that there is a group of people, explaining why all models predict 'entailment'.

Some of the other 'neutral' relations are more straightforward:

Premise: *A man sits in a chair on the sidewalk in front of a huge display of brightly colored artwork.*

Hypothesis: *The man was selling the artwork*

Possibly, because the word *selling* is somewhat semantically distant from most of the words in the premise, all models are able to quite easily determine that the hypothesis cannot be entailed from the premise; furthermore, there are no words in the hypothesis that would indicate a contradiction. All models correctly predicted 'neutral' for this pair.

In [49]:
df_neutral = df[(df.gold_label=='neutral')].head(10)# & (df.gold_label != df.AVERAGE)]
df_neutral.style.set_properties(subset=['Premise', 'Hypothesis'], **{'width': '300px'})

Unnamed: 0,Premise,Hypothesis,gold_label,AVERAGE,UNI_LSTM,BI_LSTM,BI_LSTM_MP
0,This church choir sings to the masses as they sing joyous songs from the book at a church.,The church has cracks in the ceiling.,neutral,contradiction,contradiction,contradiction,contradiction
3,"A woman with a green headscarf, blue shirt and a very big grin.",The woman is young.,neutral,entailment,neutral,neutral,neutral
7,An old man with a package poses in front of an advertisement.,A man poses in front of an ad for beer.,neutral,neutral,neutral,neutral,neutral
9,A statue at a museum that no seems to be looking at.,The statue is offensive and people are mad that it is on display.,neutral,neutral,contradiction,neutral,neutral
17,A man playing an electric guitar on stage.,A man is performing for cash.,neutral,neutral,neutral,neutral,neutral
20,A blond-haired doctor and her African american assistant looking threw new medical manuals.,A doctor is studying,neutral,entailment,neutral,neutral,entailment
23,"One tan girl with a wool hat is running and leaning over an object, while another person in a wool hat is sitting on the ground.",A man watches his daughter leap,neutral,contradiction,contradiction,contradiction,neutral
24,A young family enjoys feeling ocean waves lap at their feet.,A young man and woman take their child to the beach for the first time.,neutral,neutral,neutral,neutral,neutral
28,A couple walk hand in hand down a street.,The couple is married.,neutral,neutral,neutral,neutral,neutral
34,A man reads the paper in a bar with green lighting.,The man is reading the sportspage.,neutral,entailment,neutral,neutral,neutral


# 4.1 Mini Perturbation Study: How much does word order really matter?

I wanted to get a better idea of what types of information the different encoders depend on most when generating sentence embeddings for relation classification. I do this by introducing a perturbation in the data *at test time*: shuffling the words within each sentence, such that the sentences no longer have any meaningful word order. Namely, I made a copy of the original DataFrame, shuffled the words in the sentences, and ran inference for all 4 models.

In [42]:
# Copy the relevant columns (explicit .copy(), so the edits below do not act on a slice of df)
df2 = df[['Premise', 'Hypothesis', 'gold_label']].copy()

def scramble(sentence):
    # Randomly reorder the whitespace-separated tokens of a sentence
    shuffled_sent = sentence.split()
    random.shuffle(shuffled_sent)
    return ' '.join(shuffled_sent)

columns = ['Premise', 'Hypothesis']

# Shuffle sentences
for column in columns:
    df2[column] = df2[column].apply(scramble)
    
# Create batch objects with size 100
premise_iter = create_batches(100, 'Premise', df2)
hypothesis_iter = create_batches(100, 'Hypothesis', df2)

# Generate predictions for every encoder
average_2    = get_predictions(premise_iter, hypothesis_iter,
                          model_type='AVERAGE', emb_dim=300, 
                          hidden_dim=2048, epochs=15, 
                          bidirectional=False, maxpool=False)

bi_lstm_2    = get_predictions(premise_iter, hypothesis_iter,
                          model_type='LSTM', emb_dim=300, 
                          hidden_dim=2048, epochs=13, 
                          bidirectional=True, maxpool=False)

uni_lstm_2   = get_predictions(premise_iter, hypothesis_iter,
                          model_type='LSTM', emb_dim=300, 
                          hidden_dim=2048, epochs=12, 
                          bidirectional=False, maxpool=False)

bi_lstm_mp_2 = get_predictions(premise_iter, hypothesis_iter, 
                          model_type='LSTM', 
                          emb_dim=300, hidden_dim=2048, 
                          epochs=12, bidirectional=True, maxpool=True)



In [43]:
# Add predictions for each model by creating new columns in df2
df2['AVERAGE'] = average_2
df2['UNI_LSTM'] = uni_lstm_2
df2['BI_LSTM'] = bi_lstm_2
df2['BI_LSTM_MP'] = bi_lstm_mp_2

# Create sub-DataFrames for each relation
contradictions = df2[df2['gold_label']=='contradiction']
entailments = df2[df2['gold_label']=='entailment']
neutrals = df2[df2['gold_label']=='neutral']


In [74]:
df2_ex = df2.head(5)
df2_ex.style.set_properties(subset=['Premise', 'Hypothesis'], **{'width': '300px'})

Unnamed: 0,Premise,Hypothesis,gold_label,AVERAGE,UNI_LSTM,BI_LSTM,BI_LSTM_MP
0,the as sing joyous church the to a sings they songs from choir at book church. masses This,cracks church the ceiling. has The in,neutral,contradiction,contradiction,neutral,contradiction
1,from the book masses songs sings a sing church. as choir at they church This the joyous to,church song. filled is with The,entailment,entailment,entailment,entailment,entailment
2,masses they church songs church. joyous book to as at choir sings sing from This a the the,game. A at a baseball singing choir,contradiction,contradiction,contradiction,entailment,contradiction
3,"A big and shirt a woman headscarf, a with blue very green grin.",is The young. woman,neutral,entailment,entailment,neutral,neutral
4,"green big with A woman headscarf, a grin. shirt a blue very and",woman is happy. very The,entailment,neutral,entailment,neutral,neutral


In [45]:
print('Total number of examples scrambled in test set: ' + str(len(df2)) +'\n')

# Compute classification accuracy for each relation, for each model
models = ['AVERAGE', 'UNI_LSTM', 'BI_LSTM', 'BI_LSTM_MP']
categories = [contradictions, entailments, neutrals]
for model in models:
    print('Overall classification accuracy using '+model+' encoder: '+str(calc_accuracy(model,df2)) +'%')
    for category in categories:
        print('Accuracy using ' + 
              str(model) + ' encoder on predicting relation: ' + 
              category.gold_label.iloc[0] + 
              ': ' + str(calc_accuracy(model, category))
              + '%')
    print('\n')

Total number of examples scrambled in test set: 9824

Overall classification accuracy using AVERAGE encoder: 72.6%
Accuracy using AVERAGE encoder on predicting relation: contradiction: 70.37%
Accuracy using AVERAGE encoder on predicting relation: entailment: 78.95%
Accuracy using AVERAGE encoder on predicting relation: neutral: 68.19%


Overall classification accuracy using UNI_LSTM encoder: 72.78%
Accuracy using UNI_LSTM encoder on predicting relation: contradiction: 74.08%
Accuracy using UNI_LSTM encoder on predicting relation: entailment: 71.29%
Accuracy using UNI_LSTM encoder on predicting relation: neutral: 73.04%


Overall classification accuracy using BI_LSTM encoder: 71.06%
Accuracy using BI_LSTM encoder on predicting relation: contradiction: 71.42%
Accuracy using BI_LSTM encoder on predicting relation: entailment: 74.17%
Accuracy using BI_LSTM encoder on predicting relation: neutral: 67.44%


Overall classification accuracy using BI_LSTM_MP encoder: 78.65%
Accuracy using BI_LS

# Mini Perturbation Study: Results

As expected, shuffling the words of the sentences does not affect the performance of the average embedding encoder; it performs exactly the same as on the unaltered sentences (72.6% overall). Furthermore, we can see a stark drop in the performance of both the uni-LSTM and the bi-LSTM (from 81.54% to 72.78% and from 81.85% to 71.06% overall, respectively); their accuracies become comparable to that of the average embedding encoder. 
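
The reason the average embedding encoder is unaffected can be made concrete with a toy example: a mean over word vectors only depends on which words occur, not on where they occur. The sketch below uses made-up 3-dimensional vectors instead of the 300-dimensional embeddings and a hypothetical `average_encoding` helper; it is an illustration of the principle, not the encoder used above.

```python
import math
import random

# Toy 3-dimensional "embeddings" (made-up values standing in for 300-d vectors)
toy_embeddings = {
    'a':     [0.1, 0.3, -0.2],
    'man':   [0.7, -0.4, 0.5],
    'poses': [-0.6, 0.2, 0.9],
    'ad':    [0.4, 0.8, -0.1],
}

def average_encoding(sentence):
    """Mean of the word vectors. math.fsum is exactly rounded, so the
    summation order cannot introduce floating-point differences."""
    vectors = [toy_embeddings[word] for word in sentence.split()]
    return [math.fsum(dim) / len(vectors) for dim in zip(*vectors)]

def scramble(sentence, rng):
    """Shuffle the whitespace-separated tokens of a sentence."""
    words = sentence.split()
    rng.shuffle(words)
    return ' '.join(words)

rng = random.Random(0)  # seeded only to make this toy reproducible
original = 'a man poses ad'
shuffled = scramble(original, rng)

# Same bag of words, hence the same averaged representation
print(sorted(shuffled.split()) == sorted(original.split()))
print(average_encoding(original) == average_encoding(shuffled))
```

With ordinary floating-point summation the two means could differ in the last few bits, which would still be far too small to change a prediction in practice.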

Arguably, this suggests that, as expected, the uni-LSTM and bi-LSTM outperform the average embedding encoder on unaltered sentences largely by virtue of being able to account for word order. To verify this, one could also train these encoders on shuffled sentences and check whether their performance then still matches that of the average embedding encoder. It also makes sense that they 

More surprising, however, is the performance of the BI_LSTM_MP encoder: despite not being able to rely on word order in the shuffled sentences, it still reaches the highest overall accuracy (78.65%) and outperforms the other models at (almost) all levels (though interestingly, the average embedding encoder, at 78.95%, scores clearly higher on 'entailment' than both the uni-LSTM and the bi-LSTM here). This seems to suggest that taking the max-pool over word-level features allows the model to construct a slightly more meaningful sentence representation than using only the last hidden state(s) - even in the absence of word order. 
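
One possible intuition for this robustness is that an element-wise maximum over time steps ignores *where* in the sequence a strong feature occurs, whereas a representation built from the last state depends entirely on what comes last. The toy below illustrates only that property. It does not model an LSTM: in the real encoder the per-token hidden states themselves change when the word order changes, so this is an assumption-laden sketch rather than an explanation of the measured results.

```python
def max_pool(states):
    """Element-wise maximum over time steps."""
    return [max(dim) for dim in zip(*states)]

def last_state(states):
    """Representation that keeps only the final time step."""
    return states[-1]

# Made-up per-token feature vectors (one row per time step)
states = [[0.2, 0.9, -0.3],
          [0.8, -0.1, 0.4],
          [-0.5, 0.3, 0.7]]

# The same vectors presented in a different order
reordered = [states[2], states[0], states[1]]

print(max_pool(states) == max_pool(reordered))      # True: order does not matter
print(last_state(states) == last_state(reordered))  # False: depends on the final step
```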

We can also observe that compared to the previous results (for unaltered sentences), there no longer seem to be performance patterns which hold for all models (i.e. models consistently doing better on 'entailment', and worse on 'neutral', as was discussed previously).
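
To put the overall numbers side by side, the snippet below computes the drop in overall accuracy per encoder. The values are copied from the two result printouts in this notebook; nothing is recomputed from the data.

```python
# Overall accuracies (%) copied from the printed results for unaltered and shuffled sentences
original  = {'AVERAGE': 72.6, 'UNI_LSTM': 81.54, 'BI_LSTM': 81.85, 'BI_LSTM_MP': 84.61}
scrambled = {'AVERAGE': 72.6, 'UNI_LSTM': 72.78, 'BI_LSTM': 71.06, 'BI_LSTM_MP': 78.65}

# Drop in percentage points caused by shuffling the words at test time
drops = {model: round(original[model] - scrambled[model], 2) for model in original}

for model, drop in drops.items():
    print(f'{model:<11} {original[model]:6.2f} -> {scrambled[model]:6.2f}  (drop: {drop:.2f} points)')
```

The bi-LSTM loses the most (10.79 points), followed by the uni-LSTM (8.76) and the BI_LSTM_MP encoder (5.96), while the average embedding encoder loses nothing.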