# Natural Language Inference using Neural Networks
Adam Ek

----------------------------------

The lab is an exploration and learning exercise to be done in a group and also in discussion with the teachers and other students.

Before starting, please read the instructions on [how to work on group assignments](https://github.com/sdobnik/computational-semantics/blob/master/README.md).

Write all your answers and the code in the appropriate boxes below.

----------------------------------

In this lab we'll work with neural networks for natural language inference. Our task is: given a premise sentence P and hypothesis H, what entailment relationship holds between them? Is H entailed by P, contradicted by P or neutral towards P?

Given a sentence P, if H describes something that is definitely true given P, then it is an **entailment**. If H describes something that is *maybe* true given P, it is **neutral**, and if H describes something that is definitely *false* given P, it is a **contradiction**.

# 1. Data

We will explore natural language inference using neural networks on the SNLI dataset, described in [1]. The dataset can be downloaded [here](https://nlp.stanford.edu/projects/snli/). We prepared a "simplified" version, with only the relevant columns [here](https://gubox.box.com/s/idd9b9cfbks4dnhznps0gjgbnrzsvfs4).

The (simplified) data is organized as follows (tab-separated values):
* Column 1: Premise
* Column 2: Hypothesis
* Column 3: Relation
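
As a rough illustration of this layout, the sketch below parses three tab-separated rows with Python's standard `csv` module. The sentences are hypothetical examples written for this illustration (they are not taken from SNLI), and `read_pairs` is a helper name invented here; the lab itself loads the data with torchtext.

```python
import csv
import io

# Hypothetical rows in the three-column layout described above
# (premise <TAB> hypothesis <TAB> relation); not taken from SNLI.
sample = (
    "A man is playing a guitar on stage.\tA person is playing an instrument.\tentailment\n"
    "A man is playing a guitar on stage.\tThe man is performing for a large crowd.\tneutral\n"
    "A man is playing a guitar on stage.\tThe man is asleep at home.\tcontradiction\n"
)

def read_pairs(handle):
    """Yield (premise, hypothesis, relation) tuples from a tab-separated file."""
    # QUOTE_NONE treats quotation marks inside sentences as ordinary characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for premise, hypothesis, relation in reader:
        yield premise, hypothesis, relation

pairs = list(read_pairs(io.StringIO(sample)))
for premise, hypothesis, relation in pairs:
    print(f"{relation:13s} | {premise} -> {hypothesis}")
```

The three rows share one premise, so they also show how the same premise can stand in each of the three relations to different hypotheses.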

Like in the previous lab, we'll use torchtext to build a dataloader. You can essentially do the same thing as you did in the last lab, but with our new dataset. **[1 mark]**

In [1]:
import torch
import torch.nn as nn
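# Field/BucketIterator/TabularDataset are the legacy torchtext API (moved to torchtext.legacy in 0.9, removed in 0.12)
from torchtext.data import Field, BucketIterator, TabularDataset
# fall back to the CPU when no GPU is available
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')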

In [2]:
def dataloader(path_to_snli, batch_size=8):
    
    #Fields: premise sent, hypothesis sent, relation label
    Tokens = Field(tokenize=lambda x:x.split(), lower=True, batch_first=True) #TODO lowercase?
    Labels = Field(batch_first=True)
    
    fields = [('premise',Tokens),('hypothesis',Tokens),('label',Labels)]
    
    #Process from csv files
    train,test = TabularDataset.splits(
            path=path_to_snli, train='simple_snli_1.0_train.csv', test='simple_snli_1.0_test.csv',
            format='csv', fields=fields, skip_header=False, 
            csv_reader_params = {'delimiter':'\t','quotechar':'、'})
    
    #Build vocab
    Labels.build_vocab(train) # nr of classes
    Tokens.build_vocab(train)

    #Batch iterator
    train_iter, test_iter = BucketIterator.splits(
            (train,test), batch_size=batch_size, shuffle=True, device=device,
             sort_within_batch=True, sort_key=lambda x: len(x.premise)+len(x.hypothesis))
    
    return train_iter, test_iter, Tokens.vocab, Labels.vocab


In [3]:
train_iter, test_iter, vocab, labels = dataloader('simple-snli')

In [4]:
[labels.itos[i] for i in range(len(labels))]  # most golds & predictions should be indexed 2~4

['<unk>', '<pad>', 'entailment', 'contradiction', 'neutral', '-']

# 2. Model

In this part, we'll build the model for predicting the relationship between H and P.

We will process each sentence using an LSTM. Then, we will construct some representation of the sentence. When we have a representation for H and P, we will combine them into one vector which we can use to predict the relationship.

We will train a model described in [2], the BiLSTM with max-pooling model. The procedure for the model is roughly:

    1) Encode the Hypothesis and the Premise using one shared bidirectional LSTM (or two different LSTMs)
    2) Perform max pooling over the tokens in the premise and the hypothesis
    3) Combine the encoded premise and encoded hypothesis into one representation
    4) Predict the relationship 

### Creating a representation of a sentence

Let's first consider step 2 where we perform max/mean pooling. There is a function in pytorch for this, but we'll implement it from scratch. 

Consider the general case: we want to apply some function $f$ along dimension $d$, and we want to do this for every $d$. As an example, take the matrix $S$ of size ``(N, D)``, where N is the number of words and D the number of dimensions:

$S = \begin{bmatrix}
    s_{11} & s_{12} & s_{13} & \dots  & s_{1d} \\
    s_{21} & s_{22} & s_{23} & \dots  & s_{2d} \\
    \vdots & \vdots & \vdots & \ddots & \vdots \\
    s_{n1} & s_{n2} & s_{n3} & \dots  & s_{nd}
\end{bmatrix}$

What we want to do is apply our function $f$ on each dimension, taking the input $s_{1d}, s_{2d}, ..., s_{nd}$ and generating the output $x_d$. 

You will implement the max pooling method. When performing max pooling, $max$ is the function that selects the _maximum_ value from a vector and $x$ is the output; thus, for each dimension $d$ in our output $x$ we get:

\begin{equation}
    x_d = max(s_{1d}, s_{2d}, ..., s_{nd})
\end{equation}


This operation reduces a batch of size ``(batch_size, num_words, dimensions)`` to ``(batch_size, dimensions)``, meaning that we have created a sentence representation from the word representations in the sentence.
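
To make the equation concrete, here is a minimal sketch of the same reduction on plain nested Python lists rather than tensors. The function name `max_pool` and the numbers are made up for illustration; the lab solution works on PyTorch tensors instead.

```python
def max_pool(batch):
    """Reduce [batch][num_words][dimensions] to [batch][dimensions] with a per-dimension max."""
    pooled = []
    for sentence in batch:
        # zip(*sentence) groups the values of each dimension across all words.
        pooled.append([max(column) for column in zip(*sentence)])
    return pooled

# One sentence with N=3 words and D=2 dimensions (made-up values).
batch = [[[1.0, 5.0],
          [4.0, 2.0],
          [3.0, 0.0]]]
print(max_pool(batch))  # [[4.0, 5.0]]
```

The first output value is the maximum of the first column (1.0, 4.0, 3.0) and the second is the maximum of the second column (5.0, 2.0, 0.0), which is exactly $x_d$ for $d = 1, 2$.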

Create a function that takes as input a tensor of size ``(batch_size, num_words, dimensions)``, performs max pooling, and returns the result (the output should be of size ``(batch_size, dimensions)``). [**4 Marks**]

In [5]:
def pooling(input_tensor):
    indices = torch.argmax(input_tensor, dim=1)
    output_tensor = torch.gather(input_tensor, dim=1, index=indices.unsqueeze(dim=1)).squeeze(dim=1)
    return output_tensor

### Combining sentence representations

Next, we need to combine the premise and hypothesis into one representation. We will do this by concatenating four tensors (the final size of our tensor $X$ should be ``(batch_size, 4d)`` where ``d`` is the number of dimensions that you use): 

$$X = [P; H; |P-H|; P \cdot H]$$

Here we concatenate P, H, the absolute value of P minus H, and the element-wise product of P and H, then return the result.
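
A small worked example may help check the order and size of the four parts. The sketch below uses plain Python lists for a single pair of vectors with `d = 2`; the helper name `combine` and the values are invented for illustration, and the tensor version for the lab concatenates along the feature dimension instead.

```python
def combine(p, h):
    """Return [p; h; |p - h|; p * h] for two equal-length vectors."""
    assert len(p) == len(h)
    diff = [abs(a - b) for a, b in zip(p, h)]  # element-wise |P - H|
    prod = [a * b for a, b in zip(p, h)]       # element-wise P * H
    return p + h + diff + prod                 # list concatenation, length 4d

p = [1.0, -2.0]
h = [3.0, 0.5]
print(combine(p, h))  # [1.0, -2.0, 3.0, 0.5, 2.0, 2.5, 3.0, -1.0]
```

With `d = 2` the result has `4d = 8` entries, matching the expected size ``(batch_size, 4d)`` for a batch of one.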

Implement the function. **[2 marks]**

In [6]:
def combine_premise_and_hypothesis(premise, hypothesis):
    
    #input: BxD, BxD
    p,h = premise, hypothesis
    elements = [p, h, abs(p-h), p*h] # BxD, BxD, BxD, BxD

    output = torch.cat(elements, dim=1) # Concat vectors in dim1 (skip dim0, ie batchsize)
    
    return output # Bx(D*4)

### Creating the model

Finally, we can build the model according to the procedure given previously by using the functions we defined above. Additionally, you should use *dropout* in the model. For efficiency purposes, it's acceptable to only train the model with either max or mean pooling.

Implement the model [**6 marks**]

In [7]:
class SNLIModel(nn.Module):
    def __init__(self, vocab_dim, num_labels, h_dim):
        super(SNLIModel, self).__init__()
        self.embeddings = nn.Embedding(vocab_dim, h_dim)
        self.rnn = nn.LSTM(h_dim, h_dim, batch_first=True, bidirectional=True )
        self.classifier = nn.Linear(8*h_dim, num_labels) 
        self.dropout = nn.Dropout(0.2) 
        
    def forward(self, premise, hypothesis):
       #1)Encode the Hypothesis and the Premise using one shared bidirectional LSTM
        #Embed premise & hypothesis sentences
        p, h = self.embeddings(premise), self.embeddings(hypothesis) #BxNxD
        
        #Encode through BiLSTM
        p,(_, _) = self.rnn(p)# BxNxHD*2
        h,(_, _) = self.rnn(h)
        p, h = self.dropout(p), self.dropout(h)
        
       #2)Perform max over the tokens in the premise and the hypothesis
        # Max pool the embeddings
        # pooling() relies on torch.argmax/torch.gather, which run in C++ and are much faster than a Python loop
        p_pooled, h_pooled = pooling(p), pooling(h) #BxNxHD*2 => BxHD*2
       
       #3)Combine the encoded premise and encoded hypothesis into one representation
        ph_representation = combine_premise_and_hypothesis(p_pooled, h_pooled) #BxD => B x HD*2*4
        
       #4)Predict the relationship 
        predictions = self.classifier(ph_representation)
        
        return predictions

# 3. Training and testing

As before, implement the training and testing of the model. SNLI can take a very long time to train, so I suggest you only run it for one or two epochs. **[2 marks]** 

**Tip for efficiency:** *when developing your model, try training and testing the model on one batch (for each epoch) of data to make sure everything works! It's very annoying if you train for N epochs to find out that something went wrong when testing the model, or to find that something goes wrong when moving from epoch 0 to epoch 1.*
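
One possible way to follow this tip is to cap the number of batches per epoch while developing. The sketch below is an assumption about how that could be done, not part of the lab code: `limit_batches`, `DEBUG` and `fake_batches` are hypothetical names, and the list of strings stands in for `train_iter`.

```python
from itertools import islice

def limit_batches(iterator, max_batches=None):
    """Yield at most max_batches items; None means no limit."""
    return iterator if max_batches is None else islice(iterator, max_batches)

# Stand-in for a data iterator; in the lab this would be train_iter.
fake_batches = [f"batch_{i}" for i in range(100)]

DEBUG = True  # hypothetical flag: set to False for a full run
for epoch in range(2):
    for batch in limit_batches(fake_batches, 1 if DEBUG else None):
        # The forward pass, loss and optimizer step would go here.
        print(epoch, batch)
```

Running two short epochs this way exercises both the training loop and the transition from one epoch to the next before committing to a long run.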

In [8]:
train_iter, test_iter, vocab, labels = dataloader('simple-snli', batch_size=512)

In [11]:
epochs,learning_rate = 3, 0.001

snli_model = SNLIModel( len(vocab), len(labels), 384)
snli_model.to(device)
loss_function = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(snli_model.parameters(), lr=learning_rate)

#Training model
print('Training model...')
from statistics import mean
total_loss = [] 
for _ in range(epochs):
    
    total_loss.clear()
    
    for i, batch in enumerate(train_iter):
        
        premise, hypothesis = batch.premise, batch.hypothesis
        label = batch.label  # gold labels of batch
        
        output = snli_model(premise, hypothesis)
        loss = loss_function(output, label.view(-1)) # modelout:BxL, target:B
        total_loss += [loss.item()]

        print(f'Average total loss: {mean(total_loss)}', end='\r')
        
        # compute gradients; # update parameters; # reset gradients
        loss.backward();     optimizer.step();    optimizer.zero_grad()
    
    print()

Training model...
Avg loss: 0.7520143614258877
Avg loss: 0.5964578376814377
Avg loss: 0.5097863134672476


In [15]:
# Test with test_iter
test_loss = []
snli_model.eval()
for i, batch in enumerate(test_iter):
    
    premise, hypothesis = batch.premise, batch.hypothesis
    label = batch.label  # gold labels of batch

    with torch.no_grad(): # dont collect gradients when testing
        output = snli_model(premise, hypothesis)
    batch_loss = loss_function(output, label.view(-1))
    test_loss += [batch_loss.item()]

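    # note: len(test_loss)*len(batch) uses the current batch's size, so the count is off when the last batch is smaller
    print('Average test loss:', mean(test_loss), '  Tested on :', len(test_loss)*len(batch), end='\r')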

Average test loss: 0.711665254831314   Tested on : 5440

In [13]:
confusion = {'Golds':[], 'Predicted': []}
results=[]

for i, batch in enumerate(test_iter):
    
    premise, hypothesis = batch.premise, batch.hypothesis
    label = batch.label  # gold labels of batch
    
    with torch.no_grad(): # dont collect gradients when testing
        output = snli_model(premise, hypothesis)
        
    for b in range( len(batch) ):
        
        goldlabel = labels.itos[label[b]]
        prediction = labels.itos[ torch.argmax(output[b]) ]
        
        results.append( int(goldlabel==prediction) )
        confusion['Golds'].append(goldlabel)
        confusion['Predicted'].append(prediction)
        
import pandas as pd
df = pd.DataFrame(confusion, columns=['Golds','Predicted'])
matrix = pd.crosstab(df['Golds'], df['Predicted'], rownames=['Golds'], colnames=['Predicted'])
print( 'Accuracy:', mean(results) )
matrix

Accuracy: 0.7369


| Golds (rows) / Predicted (columns) | contradiction | entailment | neutral |
|---|---|---|---|
| - | 43 | 39 | 94 |
| contradiction | 2630 | 119 | 488 |
| entailment | 318 | 2192 | 858 |
| neutral | 464 | 208 | 2547 |
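
As a sanity check, the reported accuracy can be recomputed from the confusion matrix above, together with a per-class recall. This is only a sketch: the counts are copied by hand from the matrix, and the dictionary layout `confusion[gold][predicted]` is a choice made for this example.

```python
# Counts copied from the confusion matrix above: confusion[gold][predicted].
confusion = {
    "-":             {"contradiction": 43,   "entailment": 39,   "neutral": 94},
    "contradiction": {"contradiction": 2630, "entailment": 119,  "neutral": 488},
    "entailment":    {"contradiction": 318,  "entailment": 2192, "neutral": 858},
    "neutral":       {"contradiction": 464,  "entailment": 208,  "neutral": 2547},
}

total = sum(sum(row.values()) for row in confusion.values())
# A prediction is correct when the predicted label equals the gold label.
correct = sum(row.get(gold, 0) for gold, row in confusion.items())
accuracy = correct / total
print(f"accuracy: {accuracy:.4f}")  # accuracy: 0.7369

# Recall per gold class: correct predictions / all examples with that gold label.
recall = {gold: row.get(gold, 0) / sum(row.values()) for gold, row in confusion.items()}
for gold, value in recall.items():
    print(f"recall[{gold}]: {value:.3f}")
```

Examples with the gold label `-` are never predicted as `-` in this matrix, so they always count as errors in the accuracy above.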


Suggest a _baseline_ that we can compare our model against **[2 marks]**

A simple (but weak) baseline is the most-common-class baseline: always predict whichever label (contradiction, entailment or neutral) is most frequent, so that the baseline accuracy equals that label's relative frequency. For example, if 50% of the test data is labeled neutral, the baseline accuracy would be 50%.
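
For the test set used here, this baseline can be estimated from the gold-label counts, which are the row sums of the confusion matrix reported earlier. The sketch below is an illustration under that assumption; strictly speaking, the majority label should be chosen on the training data, which is not shown in this notebook.

```python
from collections import Counter

# Gold-label counts on the test set, taken as row sums of the confusion matrix above.
gold_counts = Counter({"-": 176, "contradiction": 3237, "entailment": 3368, "neutral": 3219})

majority_label, majority_count = gold_counts.most_common(1)[0]
baseline_accuracy = majority_count / sum(gold_counts.values())
print(majority_label, round(baseline_accuracy, 4))  # entailment 0.3368
```

Against this figure of roughly 34%, the model's accuracy of about 74% is a clear improvement.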

Suggest some ways (other than using a baseline) in which we can analyse the model's performance **[4 marks]**.

We could, for example, evaluate the model on another dataset, such as one with longer sentences. Another way would be to compare its performance with that of simpler models or of pretrained models.

We can also compare predictions with human-judged datasets (which would require our model to be trained on the same dataset that the human-judged set is derived from), as well as with other models that were trained on the same dataset.

Suggest some ways to improve the model **[3 marks]**.

In our model, we max-pooled the premise and hypothesis separately before combining them, so some information may be filtered out before the combination. Switching the order of max pooling and combination might let us capture the interaction between the two representations first, before pooling extracts and filters it. Alternatively, we can apply the method mentioned in Talman et al. (2019) and feed each layer both the pooled output from the previous layer and the original input, so that the information doesn't get too diluted in the deeper layers.

Also, Bowman et al. (2015) mentioned the importance of the syntactic structure of sentences when determining the relationship between premise and hypothesis.

We think this aligns with how humans judge the premise vs the hypothesis: we picture what the sentences describe and decide whether one entails or contradicts the other, instead of just comparing the texts themselves (that would be like someone who doesn't understand English but reads Latin script studying our dataset and eventually working out a pattern for classification, despite not understanding what the sentences mean).

Therefore, it may be useful to parse the raw sentences into syntax trees before embedding them, to help establish the relations between the roles described in a sentence. Such parses are provided in the SNLI 1.0 dataset in the sentence_binary_parse and sentence_parse columns. It would be interesting to compare the performance of training on raw sentences with training on parsed trees.

Another way is to use a pre-trained model and tokenizer, e.g. BERT, for the embeddings, because judging whether the premise entails the hypothesis requires world knowledge, especially when determining hierarchical relationships between things (e.g. a rabbit is an animal, but not vice versa).


### Readings

[1] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). 

[2] Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364.