# A6: Natural Language Inference using Neural Networks

by Adam Ek, Bill Noble, and others.

The lab is an exploration and learning exercise to be done in a group and also in discussion with the teachers and other students.

Write all your answers and the code in the appropriate boxes below.


In this lab we will work with neural networks for natural language inference. Our task is: given a premise sentence P and hypothesis H, what entailment relationship holds between them? Is H entailed by P, contradicted by P or neutral towards P?

Given a sentence P, if H describes something that is definitely true given P, it is an **entailment**. If H describes something that is *maybe* true given P, it is **neutral**, and if H describes something that is definitely *false* given P, it is a **contradiction**.

## 1. Data

We will explore natural language inference using neural networks on the SNLI dataset, described in [1]. The dataset can be downloaded [here](https://nlp.stanford.edu/projects/snli/), but unfortunately that dataset is down at the moment. Instead, we will use the version uploaded to [HuggingFace](https://huggingface.co/datasets/stanfordnlp/snli) available through the `datasets` library. See the [documentation](https://huggingface.co/docs/datasets/v2.19.0/loading#hugging-face-hub) for loading a dataset from the HuggingFace hub.

The (simplified) data is organized as follows:

* Column 1: Premise
* Column 2: Hypothesis
* Column 3: Relation

Like in the previous lab, we'll need to build a dataloader. You can adapt your code from the previous lab to the new dataset. **[1 mark]**

In [45]:
#   LOADING OUR DATA - SNLI DATASET &
#   DOING OUR IMPORTS:
from datasets import load_dataset,load_dataset_builder
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm
import torch.nn as nn
from transformers import BertTokenizer
import numpy as np

dataset = load_dataset('stanfordnlp/snli')
ds_builder = load_dataset_builder('stanfordnlp/snli')
ex = dataset['train'][0]
#print(ex)
print(ds_builder.info)

DatasetInfo(description='', citation='', homepage='', license='', features={'premise': Value(dtype='string', id=None), 'hypothesis': Value(dtype='string', id=None), 'label': ClassLabel(names=['entailment', 'neutral', 'contradiction'], id=None)}, post_processed=None, supervised_keys=None, task_templates=None, builder_name='parquet', dataset_name='snli', config_name='plain_text', version=0.0.0, splits={'test': SplitInfo(name='test', num_bytes=1260154, num_examples=10000, shard_lengths=None, dataset_name='snli'), 'validation': SplitInfo(name='validation', num_bytes=1264286, num_examples=10000, shard_lengths=None, dataset_name='snli'), 'train': SplitInfo(name='train', num_bytes=65953155, num_examples=550152, shard_lengths=None, dataset_name='snli')}, download_checksums={'hf://datasets/stanfordnlp/snli@cdb5c3d5eed6ead6e5a341c8e56e669bb666725b/plain_text/test-00000-of-00001.parquet': {'num_bytes': 411531, 'checksum': None}, 'hf://datasets/stanfordnlp/snli@cdb5c3d5eed6ead6e5a341c8e56e669bb666

In [3]:
# CREATING OUR DATASET FOR THE DATALOADER (will be easier since we are using a built-in dataset from HuggingFace):
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')   
dataset = dataset.filter(lambda example: example["label"] != -1)
dataset.set_format(type='torch', columns=['premise', 'hypothesis', 'label'])
# our datasets:
train_dataset = dataset['train']
validation_dataset = dataset['validation']
test_dataset = dataset['test']


def collate_function(batch):
    """Merges a list of samples to form a mini-batch.
    This function will be used as an argument for the DataLoader below."""
    batch_premise = [example['premise'] for example in batch]   # a list of batch size premises.
    batch_hypothesis = [example['hypothesis'] for example in batch] # a list of batch size hypotheses.
    batch_label = [example['label'] for example in batch] # a list of batch size labels (as tensors).

    return {'premise': batch_premise, 'hypothesis': batch_hypothesis, 'label': batch_label}


Filter:   0%|          | 0/9824 [00:00<?, ? examples/s]

Filter:   0%|          | 0/9842 [00:00<?, ? examples/s]

Filter:   0%|          | 0/549367 [00:00<?, ? examples/s]

Notice that the dataset comes as a dictionary-like object with three splits: `'test'`, `'train'`, and `'validation'`. Each item is a dictionary containing a `'premise'`, `'hypothesis'`, and `'label'`.
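The `'label'` field is stored as an integer. The following is a minimal sketch of how it could be mapped back to a name, assuming the `ClassLabel` order printed in the `DatasetInfo` output above (`entailment`, `neutral`, `contradiction`) and assuming that `-1` marks examples without a gold label, which is what the filter in the loading cell removes. `LABEL_NAMES` and `decode_label` are hypothetical helper names, not part of the `datasets` library.

```python
# Label order assumed from the ClassLabel names printed by ds_builder.info above.
LABEL_NAMES = ['entailment', 'neutral', 'contradiction']

def decode_label(label_id):
    """Map an integer SNLI label to its name; -1 is treated as 'no gold label'."""
    label_id = int(label_id)  # also accepts 0-dim tensors or numpy integers
    if label_id == -1:
        return None
    return LABEL_NAMES[label_id]

# A made-up item in the same shape as one dataset row (not taken from SNLI).
example = {'premise': 'A dog runs through a field.',
           'hypothesis': 'An animal is outside.',
           'label': 0}
print(decode_label(example['label']))  # entailment
```

Keeping this mapping in one place helps avoid mismatches between the integer labels and the names passed to reporting functions later on.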

In [4]:
# smaller datasets for debugging purposes
train_dataset = train_dataset.filter(lambda e, i: i<1000, with_indices=True)
validation_dataset = validation_dataset.filter(lambda e, i: i<1000, with_indices=True)
test_dataset = test_dataset.filter(lambda e, i: i<1000, with_indices=True)

Filter:   0%|          | 0/549367 [00:00<?, ? examples/s]

Filter:   0%|          | 0/9842 [00:00<?, ? examples/s]

Filter:   0%|          | 0/9824 [00:00<?, ? examples/s]

In [27]:
# DATALOADERS FOR OUR TRAIN, VALIDATION AND TEST DATA:
dataloader_train = DataLoader(train_dataset, batch_size=32, shuffle=True, collate_fn=collate_function)
dataloader_validation = DataLoader(validation_dataset, batch_size=32, shuffle=False, collate_fn=collate_function)  # no shuffling needed for evaluation
dataloader_test = DataLoader(test_dataset, batch_size=32, shuffle=False, collate_fn=collate_function)

for batch in dataloader_train:
    print(batch)
    break  

{'premise': ['A boy is jumping on skateboard in the middle of a red bridge.', 'A woman and a child holding on to the railing while on trolley.', 'A white and brown dog is leaping through the air.', 'A man in a gold skirt sits in front of the computer.', 'A car sinking in water.', 'Two dogs biting another dog in a field.', 'A young woman packs belongings into a black suitcase.', 'Two young girls are playing outside in a non-urban environment.', 'A woman wearing a green and pink dress is dancing with someone wearing a blue top with white pants.', 'There is a woman holding a baby, along with a man with a save the children bag.', "A model posing to look as if she's a real female soccer player.", 'Two people dancing, wearing dance costumes.', 'Four guys in wheelchairs on a basketball court two are trying to grab a basketball in midair.', 'A man holds a clipboard and a pen as a woman looks at them.', "Woman in white in foreground and a man slightly behind walking with a sign for John's Pizza

## 2. Tokenization

This data does not come pre-tokenized. Instead of training our own tokenizer, we can use the BERT tokenizer like in the previous assignment. Even though we aren't using BERT, the tokenizer works with any model. See the documentation on [using a pretrained tokenizer](https://huggingface.co/docs/tokenizers/en/quicktour#using-a-pretrained-tokenizer). **[1 mark]**
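To make batch padding and attention masks concrete, here is a toy sketch that does not use the BERT tokenizer at all. It assumes the sentences have already been mapped to integer ids and that `0` is the padding id. `pad_to_longest` is a hypothetical helper, not a function from `transformers`.

```python
def pad_to_longest(batch_ids, pad_id=0):
    """Pad every id sequence to the length of the longest one in the batch.

    Returns (input_ids, attention_mask), where the mask is 1 for real tokens
    and 0 for padding positions.
    """
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(list(ids) + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask

# Made-up ids; 101 and 102 stand in for the [CLS] and [SEP] tokens.
ids, mask = pad_to_longest([[101, 7, 8, 9, 102], [101, 7, 102]])
print(ids)   # [[101, 7, 8, 9, 102], [101, 7, 102, 0, 0]]
print(mask)  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

Because padding is done per batch, premises and hypotheses in the same batch can end up with different sequence lengths, which is fine as long as each is pooled separately.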

In [36]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def tokenize_word_to_idx(data_loader):
    """Goes through a dataloader's batches, transforms tokens into indices, creates a tensor out of the 
    premise and hypothesis. Labels are already tensors so they simply get appended into tokenized_batch as they are."""
    tokenized_batch = []  # tokenized batches get appended here
    for batch in data_loader:
        tokenized_premises = tokenizer(batch['premise'],padding='longest',truncation=True,return_tensors="pt")
        tokenized_hypotheses = tokenizer(batch['hypothesis'],padding='longest',truncation=True,return_tensors="pt")
        # Appending tokenized premise, hypothesis, and label to the tokenized batch
        tokenized_batch.append({
            'premises_input_ids': tokenized_premises['input_ids'],
            'premises_attention_mask': tokenized_premises['attention_mask'],
            #'premises_token_type_ids': tokenized_premises['token_type_ids'],
            'hypotheses_input_ids': tokenized_hypotheses['input_ids'],
            'hypotheses_attention_mask': tokenized_hypotheses['attention_mask'],
            #'hypotheses_token_type_ids': tokenized_hypotheses['token_type_ids'],
            'label': batch['label']
            })

    #print(f"This is the [0] from the tokenized batch: {tokenized_batch[0]}")        
    return tokenized_batch

training_data = tokenize_word_to_idx(dataloader_train)
validation_data = tokenize_word_to_idx(dataloader_validation)
testing_data = tokenize_word_to_idx(dataloader_test)





In [64]:
for batch in training_data[:2]:
    print(batch)

{'premises_input_ids': tensor([[  101,  1037,  2158,  1999,  1037,  2630,  3797,  7719,  2648,  2894,
          2007,  1037,  7433,  6277,  4201,  2041,  1999,  2392,  1997,  2032,
          1012,   102,     0,     0,     0,     0,     0,     0,     0,     0],
        [  101,  1037,  2450,  2007,  1037,  2304,  6598,  7365,  2627,  2019,
          7254,  3185, 13082,  1012,   102,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
        [  101,  1037,  2158,  2006,  1037,  2395,  1999,  1037,  4408,  1056,
          1011,  3797,  4324,  2070,  4066,  1997, 13855,  2875,  1037,  2450,
          1999,  1037,  5061,  1056,  1011,  3797,  1998, 13178,  1012,   102],
        [  101,  1037,  2158,  1998,  1037,  2450,  2024,  3061,  2279,  2000,
         10801,  1010,  3331,  2096,  2178,  2158,  3504,  2012,  2060, 10801,
          1012,   102,     0,     0,     0,     0,     0,     0,     0,     0]]), 'premises_attention_mas

## 2. Model

In this part, we'll build the model for predicting the relationship between H and P.

We will process each sentence using an LSTM. Then, we will construct some representation of the sentence. When we have a representation for H and P, we will combine them into one vector which we can use to predict the relationship.

We will train a model described in [2], the BiLSTM with max-pooling model. The procedure for the model is roughly:

    1) Encode the Hypothesis and the Premise using one shared bidirectional LSTM (or two different LSTMs)
    2) Perform max pooling over the tokens in the premise and the hypothesis
    3) Combine the encoded premise and encoded hypothesis into one representation
    4) Predict the relationship 
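The four steps above can be sketched end to end as follows. This is only an illustration of the shapes involved, not the reference implementation from [2]: the class name `BiLSTMMaxPoolSketch`, the sizes, and the use of `0` as padding id are assumptions, and dropout is left out for brevity.

```python
import torch
import torch.nn as nn

class BiLSTMMaxPoolSketch(nn.Module):
    """Illustrative only: shared BiLSTM, max pooling, [P; H; |P-H|; P*H], linear classifier."""
    def __init__(self, vocab_size, embedding_dim, hidden_size, num_labels):
        super().__init__()
        # padding_idx=0 is an assumption about the tokenizer's padding id.
        self.embeddings = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.encoder = nn.LSTM(embedding_dim, hidden_size, batch_first=True, bidirectional=True)
        # Bidirectional: each sentence vector has 2 * hidden_size dimensions,
        # so the combined vector has 4 * (2 * hidden_size) dimensions.
        self.classifier = nn.Linear(4 * 2 * hidden_size, num_labels)

    def encode(self, token_ids):
        states, _ = self.encoder(self.embeddings(token_ids))  # (batch, words, 2 * hidden)
        return states.max(dim=1).values                        # (batch, 2 * hidden)

    def forward(self, premise_ids, hypothesis_ids):
        p = self.encode(premise_ids)        # step 1 and 2 for the premise
        h = self.encode(hypothesis_ids)     # step 1 and 2 for the hypothesis
        combined = torch.cat((p, h, torch.abs(p - h), p * h), dim=1)  # step 3
        return self.classifier(combined)    # step 4: one score per label

sketch = BiLSTMMaxPoolSketch(vocab_size=100, embedding_dim=16, hidden_size=8, num_labels=3)
premise_ids = torch.randint(1, 100, (4, 7))      # 4 sentences of 7 token ids
hypothesis_ids = torch.randint(1, 100, (4, 5))   # 4 sentences of 5 token ids
logits = sketch(premise_ids, hypothesis_ids)
print(logits.shape)  # torch.Size([4, 3])
```

Note that with a bidirectional LSTM the classifier input is `8 * hidden_size`, whereas a unidirectional encoder would need `4 * hidden_size`.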

### Creating a representation of a sentence

Let's first consider step 2, where we perform pooling. There is a built-in function in PyTorch for this, but we'll implement it from scratch.

Let's consider the general case: we want to apply some function $f$ along dimension $i$, for every $i$. As an example, consider the matrix $S$ with size ``(N, D)``, where N is the number of words and D is the number of dimensions:

$S = \begin{bmatrix}
    s_{11} & s_{12} & s_{13} & \dots  & s_{1d} \\
    s_{21} & s_{22} & s_{23} & \dots  & s_{2d} \\
    \vdots & \vdots & \vdots & \ddots & \vdots \\
    s_{n1} & s_{n2} & s_{n3} & \dots  & s_{nd}
\end{bmatrix}$

What we want to do is apply our function $f$ on each dimension, taking the input $s_{1d}, s_{2d}, ..., s_{nd}$ and generating the output $x_d$. 

You will implement the max pooling method. When performing max pooling, $\max$ is the function that selects the _maximum_ value from a vector and $x$ is the output; thus, for each dimension $d$ in our output $x$ we get:

\begin{equation}
    x_d = \max(s_{1d}, s_{2d}, \dots, s_{nd})
\end{equation}

This operation will reduce a batch of size ``(batch_size, num_words, dimensions)`` to ``(batch_size, dimensions)`` meaning that we now have created a sentence representation based on the content of the representation at each token position. 
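As a small worked example of this reduction, the sketch below applies the equation with explicit loops to one made-up sentence of 3 words and 2 dimensions, and compares the result with `torch.max`. The function name `max_pool_with_loops` is hypothetical and the loop version is only meant to mirror the formula, not to be efficient.

```python
import torch

def max_pool_with_loops(batch):
    """Max over the word axis, written with explicit loops to mirror the equation."""
    batch_size, num_words, dims = batch.shape
    pooled = torch.empty(batch_size, dims)
    for b in range(batch_size):
        for d in range(dims):
            # x_d = max(s_1d, s_2d, ..., s_nd) for sentence b
            pooled[b, d] = max(batch[b, n, d].item() for n in range(num_words))
    return pooled

# One sentence with 3 words and 2 dimensions (made-up numbers).
S = torch.tensor([[[1.0, 9.0],
                   [4.0, 2.0],
                   [3.0, 5.0]]])
print(max_pool_with_loops(S))        # tensor([[4., 9.]])
print(torch.max(S, dim=1).values)    # same values from the built-in
```

The first output dimension takes its maximum from the second word and the second dimension from the first word, so the pooled vector can mix information from different token positions.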

Create a function that takes as input a tensor of size ``(batch_size, num_words, dimensions)``, performs max pooling, and returns the result (the output should be of size ``(batch_size, dimensions)``). **[4 marks]**

In [14]:
import torch

def max_pooling(input_tensor):
  output_tensor,_ = torch.max(input_tensor, dim=1)
  return output_tensor

test_unpooled = torch.rand(32, 100, 512)
test_pooled = max_pooling(test_unpooled)
#print(test_pooled.size()) # should be torch.Size([32, 512])
#for batch in training_data[:5]:
#  print(batch['premises_input_ids'])
#  print(max_pooling(batch['premises_input_ids']))

### Combining sentence representations

Next, we need to combine the premise and hypothesis into one representation. We will do this by concatenating four tensors (the final size of our tensor $X$ should be ``(batch_size, 4d)`` where ``d`` is the number of dimensions that you use): 

$$X = [P; H; |P-H|; P \cdot H]$$

Here we concatenate P, H, the absolute value of P minus H, and the element-wise product of P and H, then return the result.
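A tiny worked example of the formula, using made-up vectors with `d = 2` so that each of the four parts is easy to read off:

```python
import torch

P = torch.tensor([[1.0, 2.0]])   # one premise vector, d = 2 (made-up numbers)
H = torch.tensor([[3.0, 1.0]])   # one hypothesis vector

# [P; H; |P - H|; P * H], concatenated along the feature dimension
X = torch.cat((P, H, torch.abs(P - H), P * H), dim=1)
print(X)        # tensor([[1., 2., 3., 1., 2., 1., 3., 2.]])
print(X.shape)  # torch.Size([1, 8]), i.e. (batch_size, 4d)
```

The difference and product terms give the classifier direct access to how the two sentence vectors relate in each dimension, rather than leaving it to infer this from the raw concatenation alone.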

Implement the function. **[4 marks]**

In [15]:
def combine_premise_and_hypothesis(hypothesis, premise):
    output = torch.cat((premise, hypothesis, torch.abs(premise - hypothesis), premise * hypothesis), dim=1)
    return output

test_hypothesis = test_pooled.clone()
test_premise = test_pooled.clone()
test_combined = combine_premise_and_hypothesis(test_hypothesis, test_premise)
print(test_combined.size()) # should be torch.Size([32, 2048])

torch.Size([32, 2048])


<div class="comment"> `max_pooling` and `combine_premise_and_hypothesis` will be used as part of the model! They are expected to get an input *after* running the premise and hypothesis through the embedding and RNN. Doing this adds a new dimension -- the dimension of the sentence representation with the `hidden_size` of the model. When you run `max_pooling` on the raw input tokens there is no hidden dimension, so it reduces each sentence to a single value instead of a `hidden_size` vector.
</div>
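The shape difference described in the comment can be checked directly. In this sketch `token_ids` and `encoded` are random stand-ins (assumed shapes only) for a batch of raw token ids and for the RNN output of the same batch:

```python
import torch

token_ids = torch.randint(0, 100, (32, 20))   # (batch_size, num_words): raw token ids
encoded = torch.rand(32, 20, 512)             # (batch_size, num_words, hidden_size): stand-in for RNN output

# Max over dim=1 on raw ids leaves a single number (the largest id) per sentence.
print(torch.max(token_ids, dim=1).values.shape)  # torch.Size([32])
# Max over dim=1 on encoded states leaves one hidden_size vector per sentence.
print(torch.max(encoded, dim=1).values.shape)    # torch.Size([32, 512])
```

Only the second result is a usable sentence representation, which is why pooling belongs after the embedding and RNN layers.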

### Creating the model

Finally, we can build the model according to the procedure given previously by using the functions we defined above. Additionally, you should use *dropout* in the model. For efficiency purposes, it's acceptable to train the model with only one of max or mean pooling.

Implement the model [**8 marks**]

In [49]:
class SNLIModel(nn.Module):
    def __init__(self, vocab_size, hidden_size, embedding_dim, labels_size,dropout_rate=0.5):
        super(SNLIModel,self).__init__()
        self.embeddings = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)
        self.rnn = nn.LSTM(embedding_dim, hidden_size, num_layers=1, bidirectional=False, batch_first=True)
        self.dropout = nn.Dropout(p=dropout_rate)
        self.classifier = nn.Linear(in_features=hidden_size*4, out_features=labels_size)
        
    def forward(self, premise, hypothesis):
        p = self.embeddings(premise)
        h = self.embeddings(hypothesis)
        
        p_out,_= self.rnn(p)
        h_out,_ = self.rnn(h)

        p_pooled = max_pooling(p_out)
        h_pooled = max_pooling(h_out)

        p_pooled = self.dropout(p_pooled)
        h_pooled = self.dropout(h_pooled)
        
        ph_representation = combine_premise_and_hypothesis(h_pooled,p_pooled)
        ph_representation = self.dropout(ph_representation)
        predictions = self.classifier(ph_representation)
        
        return predictions

In [50]:
model = SNLIModel(tokenizer.vocab_size,512,512,3)


## 3. Training

As before, implement the training and testing of the model. SNLI can take a very long time to train, so I suggest you only run it for one or two epochs. **[10 marks]** 

**Tip for efficiency:** *when developing your model, try training and testing the model on one batch (for each epoch) of data to make sure everything works! It's very annoying if you train for N epochs to find out that something went wrong when testing the model, or to find that something goes wrong when moving from epoch 0 to epoch 1.*
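One possible way to follow this tip is a small smoke test that runs both the training step and the evaluation step on a single batch for a couple of epochs. This is a sketch under assumptions: `smoke_test_one_batch` is a hypothetical helper, and the toy linear model and random data below only stand in for the real model and one real batch.

```python
import math
import torch
import torch.nn as nn

def smoke_test_one_batch(model, inputs, labels, epochs=2, lr=0.01):
    """Train and evaluate on a single batch for a few epochs to catch crashes early.

    `inputs` is a tuple of tensors that is unpacked into the model call,
    e.g. (premises, hypotheses) for a two-input model.
    """
    loss_function = nn.CrossEntropyLoss()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    losses = []
    for _ in range(epochs):
        model.train()
        optimizer.zero_grad()
        loss = loss_function(model(*inputs), labels)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
        # Run the evaluation path too, so epoch-boundary bugs show up immediately.
        model.eval()
        with torch.no_grad():
            preds = model(*inputs).argmax(dim=1)
        assert preds.shape == labels.shape
    return losses

# Stand-in model and data (hypothetical); swap in the real model and one real batch.
toy_model = nn.Linear(10, 3)
toy_inputs = (torch.rand(8, 10),)
toy_labels = torch.randint(0, 3, (8,))
losses = smoke_test_one_batch(toy_model, toy_inputs, toy_labels)
print(losses)
```

If this runs without errors and the losses are finite, the full training loop is much less likely to fail after hours of training.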

In [54]:
from sklearn.metrics import classification_report
import torch.optim as optim
from torch.nn import CrossEntropyLoss

epochs = 8

#We were unable to access the server these last couple of days so we used only a small portion of the dataset and didn't get a chance to test the whole dataset.

loss_function = CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(),lr=0.003)
model.train()  # enable dropout for training; evaluate_model() switches to eval mode later

for epoch in range(epochs):
    total_loss = 0
    for batch in training_data:
        optimizer.zero_grad()
        premises = batch['premises_input_ids']
        hypothesis = batch['hypotheses_input_ids']
        labels = torch.tensor(batch['label'])     
        #print(labels)
        outputs = model(premises,hypothesis)
        #print(outputs)
        loss = loss_function(outputs,labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f'Epoch {epoch + 1}, Loss : {total_loss/len(training_data)}')    
# test model after all epochs are completed
def evaluate_model(model, data):
    model.eval()
    all_preds = []
    all_labels = []
    with torch.no_grad():
        for batch in data:
            premises = batch['premises_input_ids']
            hypothesis = batch['hypotheses_input_ids']
            labels = torch.tensor(batch['label']) 
            outputs = model(premises, hypothesis)
            _, preds = torch.max(outputs, dim=1)
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())  
    return all_preds, all_labels

#evaluate the model on the test set
test_preds, test_labels = evaluate_model(model, testing_data)

#compute precision, recall, and F1 score
#target_names follow the dataset's ClassLabel order: 0 = entailment, 1 = neutral, 2 = contradiction
print(classification_report(test_labels, test_preds, target_names=['entailment', 'neutral', 'contradiction']))


Epoch 1, Loss : 0.04223360407377186
Epoch 2, Loss : 0.04066039649023878
Epoch 3, Loss : 0.033160697478706425
Epoch 4, Loss : 0.013330697315950601
Epoch 5, Loss : 0.002337848700278755
Epoch 6, Loss : 0.0009390919954057608
Epoch 7, Loss : 0.0004217031505504565
Epoch 8, Loss : 0.0002617694609625687
               precision    recall  f1-score   support

   entailment       0.54      0.65      0.59       344
      neutral       0.55      0.41      0.47       327
contradiction       0.50      0.52      0.51       329

     accuracy                           0.53      1000
    macro avg       0.53      0.53      0.52      1000
 weighted avg       0.53      0.53      0.52      1000



In [81]:
print(max_prediction)
print(labels)
print(torch.sum(max_prediction == labels).item())

tensor([2, 1, 0, 0])
tensor([2, 1, 0, 0])
4


In [83]:
print(len(testing_data)*32)

1000


## 4. Testing

Test the model on the test set. For each example in the test set, compute a prediction from the model (`entailment`, `contradiction` or `neutral`). Compute precision, recall, and F1 score for each label. **[10 marks]**

Suggest a _baseline_ that we can compare our model against **[2 marks]**

In [48]:
num_classes = 3  # Assuming three classes: entailment, contradiction, neutral
num_samples = 1000  # Number of samples in the test set

# Generate random predictions
random_preds = np.random.randint(num_classes, size=num_samples)

# Collect the true labels from the test set
true_labels = []
for batch in testing_data:
    labels = torch.tensor(batch['label'])
    true_labels.extend(labels.cpu().numpy())

# Compute precision, recall, and F1 score for the random baseline
# target_names follow the dataset's ClassLabel order: 0 = entailment, 1 = neutral, 2 = contradiction
print("Random Baseline Classification Report:")
print(classification_report(true_labels, random_preds, target_names=['entailment', 'neutral', 'contradiction']))

Random Baseline Classification Report:
               precision    recall  f1-score   support

   entailment       0.37      0.34      0.35       344
      neutral       0.33      0.35      0.34       327
contradiction       0.33      0.33      0.33       329

     accuracy                           0.34      1000
    macro avg       0.34      0.34      0.34      1000
 weighted avg       0.34      0.34      0.34      1000



In [None]:
#Because the labels in the corpus are closely balanced, we opted for a random-labeling baseline instead of a most-common-label baseline. In the run above the random baseline reaches 34% accuracy
#and our model reaches 53%, which is about 19 percentage points better.
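For comparison, the most-common-label baseline mentioned in the answer could be computed roughly as follows. This is a sketch on made-up label lists; `majority_baseline_accuracy` is a hypothetical helper name.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Predict the most frequent training label for every test example."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(1 for gold in test_labels if gold == majority_label)
    return majority_label, correct / len(test_labels)

# Made-up label lists for illustration (0 = entailment, 1 = neutral, 2 = contradiction).
toy_train = [0, 0, 1, 2, 0, 1]
toy_test = [0, 1, 2, 0]
print(majority_baseline_accuracy(toy_train, toy_test))  # (0, 0.5)
```

With the class supports shown in the reports above (the largest class has 344 of 1000 test examples), predicting any single label for every example would reach at most 34.4% accuracy on this test subset, so the two baselines are close here.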

**Your answer should go here**

Suggest some ways (other than using a baseline) in which we can analyse the model's performance **[3 marks]**.

In [None]:
#A qualitative analysis (that is, a manual examination of the labels in the dataset) might also give us insight into how
#the model works and makes its predictions.
#We can also compare our model's performance to other models used for similar NLI tasks.
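A qualitative analysis of this kind could start from the misclassified examples, grouped by gold and predicted label. The sketch below uses made-up sentences and predictions purely to show the idea; `group_errors` is a hypothetical helper, and the label order is assumed to follow the dataset's `ClassLabel` names.

```python
from collections import defaultdict

LABEL_NAMES = ['entailment', 'neutral', 'contradiction']  # order assumed from the dataset's ClassLabel

def group_errors(premises, hypotheses, gold, predicted):
    """Collect misclassified pairs, keyed by (gold label, predicted label)."""
    errors = defaultdict(list)
    for p, h, g, y in zip(premises, hypotheses, gold, predicted):
        if g != y:
            errors[(LABEL_NAMES[g], LABEL_NAMES[y])].append((p, h))
    return dict(errors)

# Made-up sentences and predictions, only to show the output format.
premises = ['A man plays a guitar.', 'A child sleeps.', 'Two dogs run.']
hypotheses = ['A person makes music.', 'A child is awake.', 'The dogs are brown.']
gold = [0, 2, 1]
predicted = [0, 1, 0]
errs = group_errors(premises, hypotheses, gold, predicted)
for (g, y), pairs in errs.items():
    print(f'gold={g}, predicted={y}: {len(pairs)} example(s)')
```

Reading through the pairs in the largest error groups can show whether the model struggles with particular phenomena, such as negation or added detail in the hypothesis.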

**Your answer should go here**

Suggest some ways to improve the model **[3 marks]**.

In [None]:
#We can make the LSTM bidirectional instead of unidirectional, increase the hidden size and add another layer.

**Your answer should go here**

## Readings

[1] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). 

[2] Conneau, A., Kiela, D., Schwenk, H., Barrault, L., & Bordes, A. (2017). Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364.

## Statement of contribution

Briefly state how many times you have met for discussions, who was present, to what degree each member contributed to the discussion and the final answers you are submitting.

In [None]:
#We met 3 times for 3-4 hours each time and everyone was present and contributed equally.

## Marks

This assignment has a total of 23 marks.