# Learning sentence representations from Natural Language Inference (NLI) data

This notebook contains the result analysis for the first ATCS Practical involving sentence representation learning using NLI and evaluating the obtained sentence encoders on Facebook Research's SentEval multi-task evaluation framework.

## Sentence Representation Training 

The different implemented encoders were trained on the Stanford Natural Language Inference (SNLI) Corpus. This section will test the different implemented models on the SNLI dataset.

### Libraries and Seeding

In [1]:
#Import Relevant libraries
import nltk
import torch
import numpy as np
import pandas as pd
from IPython.display import display
import json

#User Libraries
from models import *
from data import *
from evaluation import *


nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/lcur1136/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
#Seed 
seed = 1234

np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)


torch.backends.cudnn.deterministic=True
torch.backends.cudnn.benchmark = False

#We recommend using cuda
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print("Using device: " + str(device))

Using device: cpu


### Dataset and Vocabulary

In [3]:
dataset = CustomDataset(dataset_name="snli", tokenizer_cls=NLTKTokenizer)
_, _, test_data = dataset.get_data()

embedding_path = "dataset/glove.840B.300d.txt"
vocab_path = "dataset/vocab.pickle"
dataset_vocab_path = "dataset/dataset_vocab.pickle"

dataset_vocab = dataset.get_vocab(splits=["train"], vocab_path=dataset_vocab_path)

vocab, featureVectors = load_embeddings(path=embedding_path, tokenizer_cls=NLTKTokenizer, dataset_vocab=dataset_vocab, vocab_path=vocab_path, use_tqdm=True)

Found cached dataset snli (/home/lcur1136/.cache/huggingface/datasets/snli/plain_text/1.0.0/1f60b67533b65ae0275561ff7828aad5ee4282d0e6f844fd148d05d3c6ea251b)


  0%|          | 0/3 [00:00<?, ?it/s]

Loading cached processed dataset at /home/lcur1136/.cache/huggingface/datasets/snli/plain_text/1.0.0/1f60b67533b65ae0275561ff7828aad5ee4282d0e6f844fd148d05d3c6ea251b/cache-26da30925fbad333.arrow
Loading cached processed dataset at /home/lcur1136/.cache/huggingface/datasets/snli/plain_text/1.0.0/1f60b67533b65ae0275561ff7828aad5ee4282d0e6f844fd148d05d3c6ea251b/cache-030550b06704f0a9.arrow
Loading cached processed dataset at /home/lcur1136/.cache/huggingface/datasets/snli/plain_text/1.0.0/1f60b67533b65ae0275561ff7828aad5ee4282d0e6f844fd148d05d3c6ea251b/cache-5f71913ab959d4e9.arrow
Loading cached processed dataset at /home/lcur1136/.cache/huggingface/datasets/snli/plain_text/1.0.0/1f60b67533b65ae0275561ff7828aad5ee4282d0e6f844fd148d05d3c6ea251b/cache-d4ba7e085145c696.arrow
Loading cached processed dataset at /home/lcur1136/.cache/huggingface/datasets/snli/plain_text/1.0.0/1f60b67533b65ae0275561ff7828aad5ee4282d0e6f844fd148d05d3c6ea251b/cache-a34bde53a6f7997e.arrow
Loading cached processed 

Loading saved Vocabulary from dataset/dataset_vocab.pickle
Loading saved Vocabulary from dataset/vocab.pickle


### Training and Validation results

This section will show the training and validation results for the different implemented models

In [4]:
#TODO Show results
test_accuracies = {
    "AWE":0.607,
    "LSTM":0.715,
    "BiSLTM":0.727,
    "BiLSTM-Max":0.803
}

val_accuracies = {
    "AWE":0.605,
    "LSTM":0.718,
    "BiSLTM":0.728,
    "BiLSTM-Max":0.805
}

df = pd.DataFrame({'Encoder': list(test_accuracies.keys()), 
                   'Validation Accuracies': list(val_accuracies.values()),
                   'Testing Accuracies': list(test_accuracies.values())})
df = df[['Encoder', 'Validation Accuracies', 'Testing Accuracies']]
display(df)

Unnamed: 0,Encoder,Validation Accuracies,Testing Accuracies
0,AWE,0.605,0.607
1,LSTM,0.718,0.715
2,BiSLTM,0.728,0.727
3,BiLSTM-Max,0.805,0.803


### Model Evaluation

Running the cells below will evaluate the different models on the SNLI Test split. It is recommended to use a machine with a cuda-enabled GPU.

In [None]:
#Around 

criterion = nn.CrossEntropyLoss()
batch_size = 64


model_paths = {
    "AWE":"models/AWESentenceEncoder_300_0.60_2023-04-15-16-47-23.pt",
    "LSTM":"models/LSTMEncoder_2048_0.72_2023-04-15-17-14-48.pt",
    "BiLSTM":"models/BiLSTMEncoder_4096_0.73_2023-04-15-18-23-21.pt",
    "BiLSTM-Max":"models/BiLSTMEncoder_pooling-max_4096_0.80_2023-04-15-19-55-20.pt"
}

snli_results = {key: None for key in model_paths}

#### AWE

In [None]:
#Around 20 seconds on CPU.
model = "AWE"
snli_results[model] = test_model_snli(model_paths[model], test_data, criterion, device=device)
print("Test Accuracy: " + str(snli_results[model]))

#### LSTM

In [None]:
#Around 7 minutes on CPU.
model = "LSTM"
snli_results[model] = test_model_snli(model_paths[model], test_data, criterion, device=device)
print("Test Accuracy: " + str(snli_results[model]))

#### BiLSTM

In [None]:
model = "BiLSTM"
snli_results[model] = test_model_snli(model_paths[model], test_data, criterion, device=device)
print("Test Accuracy: " + str(snli_results[model]))

#### BiLSTM with Max pooling

In [None]:
model = "BiLSTM-Max"
snli_results[model] = test_model_snli(model_paths[model], test_data, criterion, device=device)
print("Test Accuracy: " + str(snli_results[model]))

### Model Inference

In this section we will demonstrate how to predict entailment with our model given a hypotheis and premise. The output can have three different values:
- 0: The hypothesis entails the premise.
- 1: The premise and hypothesis neither entail nor contradict each other.
- 2: The hypothesis contradicts the premise.



In [12]:
model_paths = {
    "AWE":"models/AWESentenceEncoder_300_0.60_2023-04-15-16-47-23.pt",
    "LSTM":"models/LSTMEncoder_2048_0.72_2023-04-15-17-14-48.pt",
    "BiLSTM":"models/BiLSTMEncoder_4096_0.73_2023-04-15-18-23-21.pt",
    "BiLSTM-Max":"models/BiLSTMEncoder_pooling-max_4096_0.80_2023-04-15-19-55-20.pt"
}

labels = ["entailment", "contradiction", "neutral"]

model_to_use = "BiLSTM-Max" #Select a model from "AWE", "LSTM", "BiLSTM" or "BiLSTM-Max", you could also add your own path to model_paths
model_path = model_paths[model_to_use]

premise = "Singel library is always so full."
hypothesis = "There are never free seats."

#Load Model
model = torch.load(model_path, map_location=device)

prediction = snli_inference(premise, hypothesis, model, vocab, device).tolist()[0]

print(f"Premise: {premise}\nHypothesis: {hypothesis}\nEntailment: {labels[prediction]}")

Premise: Singel library is always so full.
Hypothesis: There are never free seats.
Entailment: 0


### Result Visualisation

In [None]:
df = pd.DataFrame(snli_results.items(), columns=['Encoder', 'Testing Accuracy'])
df['Testing Accuracy'] = df['Testing Accuracy'].round(3)
display(df)

### Result Analysis

## SentEval Multi-Task Evaluation

In this section we will use the SentEval framework to evaluate our sentence encoders on 10 different transfer tasks.

The tasks are the following:

1. **MR (Movie Review)**: Binary sentiment classification task where the goal is to predict whether a given movie review is positive or negative.
2. **CR (Customer Review)**: Binary sentiment classification task where the goal is to predict whether a given customer review is positive or negative.
3. **SUBJ (Subjectivity)**: Binary classification task where the goal is to predict whether a given sentence is subjective or objective.
4. **MPQA (Opinion polarity)**: Binary classification task where the goal is to predict whether a given sentence expresses a positive or negative opinion.
5. **SST2 (Stanford Sentiment Treebank)**: Binary sentiment classification task where the goal is to predict whether a given sentence has a positive or negative sentiment.
6. **TREC (Question classification)**: Multi-class classification task where the goal is to classify a given question into one of six types: "what", "who", "where", "when", "why", and "how".
7. **MRPC (Microsoft Research Paraphrase Corpus)**: Binary classification task where the goal is to predict whether a pair of sentences are semantically equivalent or not.
8. **SICKRelatedness**: Regression task where the goal is to predict the relatedness score between two sentences on a scale of 1 to 5.
9. **SICKEntailment**: Binary classification task where the goal is to predict whether a given pair of sentences entails each other, contradicts each other, or neither.
10. **STS14 (Semantic Textual Similarity)**: Regression task where the goal is to predict the similarity score between two sentences on a scale of 0 to 5.

### Results

The evaluation of the different models can be performed following the instructions found in the [README](README.md) file. We will visualize the saved results for the different models across the evaluation tasks.

#### Result Loading

In [5]:
result_paths = {
    "AWE":"results/AWESentenceEncoder_sentEval.pt",
    "LSTM":"results/LSTMEncoder_sentEval.pt",
    "BiLSTM":"results/BiLSTMEncoder_sentEval.pt",
    "BiLSTM-Max":"results/BiLSTMEncoder_pooling-max_sentEval.pt"
}

results = {key: None for key in result_paths}

for model in result_paths:
    results[model] = torch.load(result_paths[model], map_location=torch.device(device))

#### Micro and Macro accuracy calculations

In [None]:
print()

In [14]:
def get_micro_macro(results):
    sum_dev = 0
    sum_samples = 0
    n_tasks = len(results)

    for task in results:
        cont = False
        if "devacc" not in results[task]:
            print(f"No devacc in {task}")
            cont = True
        if "ndev" not in results[task]:
            print(f"No ndev in {task}")
            cont = True
        if cont:
            continue
        sum_dev+=results[task]["devacc"]
        sum_samples+=results[task]["ndev"]

    macro =  (sum_dev / n_tasks) * 100
    micro = (sum_dev / sum_samples) * 100

    return micro, macro

micro_acc = {key: None for key in results}
macro_acc = {key: None for key in results}

for model in results:
    micro_acc[model], macro_acc[model] = get_micro_macro(results[model])

df = pd.DataFrame({'Model': list(results.keys()), 
                   'Micro': list(micro_acc.values()),
                   'Macro': list(macro_acc.values())})
df = df[['Model', 'Micro', 'Macro']]
display(df)



No devacc in SICKRelatedness
No devacc in STS14
No ndev in STS14
No devacc in SICKRelatedness
No devacc in STS14
No ndev in STS14
No devacc in SICKRelatedness
No devacc in STS14
No ndev in STS14
No devacc in SICKRelatedness
No devacc in STS14
No ndev in STS14


Unnamed: 0,Model,Micro,Macro
0,AWE,1.319048,6060.1
1,LSTM,1.239057,5692.6
2,BiLSTM,1.320462,6066.6
3,BiLSTM-Max,1.38859,6379.6


Function for Printing Results

In [9]:
def numpy_converter(obj):
    """Converts numpy types to native Python types."""
    if isinstance(obj, np.generic):
        return obj.item()
    elif isinstance(obj, np.ndarray):
        return obj.tolist()

print(json.dumps(results["AWE"], indent=4, default=numpy_converter))

{
    "MR": {
        "devacc": 71.95,
        "acc": 70.9,
        "ndev": 10662,
        "ntest": 10662
    },
    "CR": {
        "devacc": 76.75,
        "acc": 75.87,
        "ndev": 3775,
        "ntest": 3775
    },
    "SUBJ": {
        "devacc": 87.72,
        "acc": 87.09,
        "ndev": 10000,
        "ntest": 10000
    },
    "MPQA": {
        "devacc": 83.29,
        "acc": 82.99,
        "ndev": 10606,
        "ntest": 10606
    },
    "SST2": {
        "devacc": 74.31,
        "acc": 75.4,
        "ndev": 872,
        "ntest": 1821
    },
    "TREC": {
        "devacc": 62.55,
        "acc": 65.0,
        "ndev": 5452,
        "ntest": 500
    },
    "MRPC": {
        "devacc": 72.84,
        "acc": 71.07,
        "f1": 80.27,
        "ndev": 4076,
        "ntest": 1725
    },
    "SICKRelatedness": {
        "devpearson": 0.721680682218488,
        "pearson": 0.7388506149153962,
        "spearman": 0.6703121379698931,
        "mse": 0.46596709897600425,
        "yhat":

#### Result Analysis

TODO: Write

## Further Research Questions

TODO: Write