# Learning sentence representations from Natural Language Inference (NLI) data

This notebook contains the result analysis for the first ATCS Practical involving sentence representation learning using NLI and evaluating the obtained sentence encoders on Facebook Research's SentEval multi-task evaluation framework.

## Sentence Representation Training 

The different implemented encoders were trained on the Stanford Natural Language Inference (SNLI) Corpus. This section will test the different implemented models on the SNLI dataset.

### Libraries and Seeding

In [1]:
#Import Relevant libraries
import nltk
import torch
import numpy as np
import pandas as pd
from IPython.display import display
import json

#User Libraries
from models import *
from data import *
from evaluation import *


nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/lcur1136/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
#Seed 
seed = 1233

np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)


torch.backends.cudnn.deterministic=True
torch.backends.cudnn.benchmark = False

#We recommend using cuda
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print("Using device: " + str(device))

Using device: cpu


### Dataset and Vocabulary

In [27]:
dataset = CustomDataset(dataset_name="snli", tokenizer_cls=NLTKTokenizer)
_, _, test_data = dataset.get_data()

embedding_path = "dataset/glove.840B.300d.txt"
vocab_path = "dataset/vocab.pickle"
dataset_vocab_path = "dataset/dataset_vocab.pickle"

dataset_vocab = dataset.get_vocab(splits=["train"], vocab_path=dataset_vocab_path)

vocab, featureVectors = load_embeddings(path=embedding_path, tokenizer_cls=NLTKTokenizer, dataset_vocab=dataset_vocab, vocab_path=vocab_path, use_tqdm=True)

Found cached dataset snli (/home/lcur1136/.cache/huggingface/datasets/snli/plain_text/1.0.0/1f60b67533b65ae0275561ff7828aad5ee4282d0e6f844fd148d05d3c6ea251b)


  0%|          | 0/3 [00:00<?, ?it/s]

Loading cached processed dataset at /home/lcur1136/.cache/huggingface/datasets/snli/plain_text/1.0.0/1f60b67533b65ae0275561ff7828aad5ee4282d0e6f844fd148d05d3c6ea251b/cache-26da30925fbad333.arrow
Loading cached processed dataset at /home/lcur1136/.cache/huggingface/datasets/snli/plain_text/1.0.0/1f60b67533b65ae0275561ff7828aad5ee4282d0e6f844fd148d05d3c6ea251b/cache-030550b06704f0a9.arrow
Loading cached processed dataset at /home/lcur1136/.cache/huggingface/datasets/snli/plain_text/1.0.0/1f60b67533b65ae0275561ff7828aad5ee4282d0e6f844fd148d05d3c6ea251b/cache-5f71913ab959d4e9.arrow
Loading cached processed dataset at /home/lcur1136/.cache/huggingface/datasets/snli/plain_text/1.0.0/1f60b67533b65ae0275561ff7828aad5ee4282d0e6f844fd148d05d3c6ea251b/cache-d4ba7e085145c696.arrow
Loading cached processed dataset at /home/lcur1136/.cache/huggingface/datasets/snli/plain_text/1.0.0/1f60b67533b65ae0275561ff7828aad5ee4282d0e6f844fd148d05d3c6ea251b/cache-a34bde53a6f7997e.arrow
Loading cached processed 

Loading saved Vocabulary from dataset/dataset_vocab.pickle
Loading saved Vocabulary from dataset/vocab.pickle


### Training and Validation results

This section will show the training and validation results for the different implemented models

In [4]:
#TODO Show results
test_accuracies = {
    "AWE":0.607,
    "LSTM":0.717,
    "BiSLTM":0.728,
    "BiLSTM-Max":0.805
}

val_accuracies = {
    "AWE":0.606,
    "LSTM":0.718,
    "BiSLTM":0.728,
    "BiLSTM-Max":0.806
}

df = pd.DataFrame({'Encoder': list(test_accuracies.keys()), 
                   'Validation Accuracies': list(val_accuracies.values()),
                   'Testing Accuracies': list(test_accuracies.values())})
df = df[['Encoder', 'Validation Accuracies', 'Testing Accuracies']]
display(df)

Unnamed: 0,Encoder,Validation Accuracies,Testing Accuracies
0,AWE,0.605,0.607
1,LSTM,0.718,0.715
2,BiSLTM,0.728,0.727
3,BiLSTM-Max,0.805,0.803


### Model Evaluation

Running the cells below will evaluate the different models on the SNLI Test split. It is recommended to use a machine with a cuda-enabled GPU.

In [25]:
#Around 

criterion = nn.CrossEntropyLoss()
batch_size = 64


model_paths = {
    "AWE":"models/AWESentenceEncoder_complex_300_0.61_2023-04-19-15-56-22.pt",
    "LSTM":"models/LSTMEncoder_complex_2048_0.72_2023-04-19-17-12-22.pt",
    "BiLSTM":"models/BiLSTMEncoder_complex_4096_0.73_2023-04-19-18-47-14.pt",
    "BiLSTM-Max":"models/BiLSTMEncoder_pooling-max_complex_4096_0.81_2023-04-19-21-49-39.pt"
}

snli_results = {key: None for key in model_paths}

#### AWE

In [28]:
#Around 20 seconds on CPU.
model = "AWE"
snli_results[model] = test_model_snli(model_paths[model], test_data, criterion, device=device)
print("Test Accuracy: " + str(snli_results[model]))

KeyboardInterrupt: 

#### LSTM

In [None]:
#Around 7 minutes on CPU.
model = "LSTM"
snli_results[model] = test_model_snli(model_paths[model], test_data, criterion, device=device)
print("Test Accuracy: " + str(snli_results[model]))

#### BiLSTM

In [None]:
model = "BiLSTM"
snli_results[model] = test_model_snli(model_paths[model], test_data, criterion, device=device)
print("Test Accuracy: " + str(snli_results[model]))

#### BiLSTM with Max pooling

In [None]:
model = "BiLSTM-Max"
snli_results[model] = test_model_snli(model_paths[model], test_data, criterion, device=device)
print("Test Accuracy: " + str(snli_results[model]))

### Model Inference

In this section we will demonstrate how to predict entailment with our model given a hypotheis and premise. The output can have three different values:
- 0: The hypothesis entails the premise.
- 1: The premise and hypothesis neither entail nor contradict each other.
- 2: The hypothesis contradicts the premise.



In [7]:
labels = ["entailment", "contradiction", "neutral"]

model_to_use = "BiLSTM-Max" #Select a model from "AWE", "LSTM", "BiLSTM" or "BiLSTM-Max", you could also add your own path to model_paths
model_path = model_paths[model_to_use]

premise = "Singel library is always so full."
hypothesis = "There are never free seats."

#Load Model
model = torch.load(model_path, map_location=device)

prediction = snli_inference(premise, hypothesis, model, vocab, device).tolist()[0]

print(f"Premise: {premise}\nHypothesis: {hypothesis}\nResult: {labels[prediction]}")

Premise: Singel library is always so full.
Hypothesis: There are never free seats.
Entailment: entailment


### Result Visualisation
(For models tested running the Model Evaluation cells)

In [8]:
df = pd.DataFrame(snli_results.items(), columns=['Encoder', 'Testing Accuracy'])
df['Testing Accuracy'] = df['Testing Accuracy'].round(3)
display(df)

Unnamed: 0,Encoder,Testing Accuracy
0,AWE,0.607
1,LSTM,
2,BiLSTM,
3,BiLSTM-Max,


### Result Analysis

The table above will not show any results if the evaluation was not ran for each notebook, nevertheless, we can refer to the table in the above section "Training and Validation results" which contains the training and validation accuracies for the different models. The classifier used was composed of a single hidden layer MLP with Tanh non-linearities as defined in the paper.  Furthermore, we utilized GloVe embedding as in the original paper, aligning them with all of the splits of the SNLI dataset. In reality, it doesn't make much of a difference using only the training set for alignment as, due to Heap's Law, the amount of new words that we observe in the validation and test set are not many, given the large training set available. 

The results obtained differ from the original paper's, obtaining lower results for each of the models. Nevertheless, we do observe similar trends to the ones present in the reproduced paper. Mainly, it can be observed how the BiLSTM encoder with max pooling outperforms all the others, with the BiLSTM, LSTM and AWE encoders performing worse in that order. The trends show how using more complex encoders allow the encoders to learn better sentence representatons that, in turn, allow the NLI classifier to perform better on this specific task. It is also interesting to observe how averaging word embeddings obtains an accuracy of 0.6 in the task. GloVe embeddings are much more generic, and using trainable encoders allow us to obtain more contextualized embeddings that allow the encoders to better deal with things like word sense ambiguity which, when using solely GloVe embeddings, could be difficult to deal with.

We also expected our bidirectonal LSTM models to outperform the Unidirectional LSTM models. The later stages of the LSTM encoder lose information coming from the start of the sequence, which in our specific task, may negatively impact the final performance of the model. This is why Bidirectional LSTM encoders were expected to perform better, which is shown given the results. Furthermore, it was also expected that Max-Pooling would increase model performance, as by only taking the final hidden states in both directions could shift the focus of the classifier to the start and end of the sequences, whereas by performing Max Pooling, we are basically giving the classifier a sentence representation that combines the most meaningful features for the sequence.

Lets analyse two different examples and try to analyse why our models could fail:

Premise - “Two men sitting in the sun”
Hypothesis - “Nobody is sitting in the shade”
Label - Neutral (likely predicts contradiction)

This example coulb be challenging for a model like AWE as it will be very difficult to capture the negation in the hypothesis. Furthermore the Unidirectional LSTM encoder may also fail in this example as the negation appears in the beggining of the sentence, which adds to the difficulty of correctly interpreting it. In addition, the LSTM encoder will likely focus on the end of the sentences, the last words of these are "sun" and "shade", which could maybe also cause the model to create sentence representations that are contradictive. The other two encoders may also fail in this example if they dont properly capture the negation in the hypothesis.

Premise - “A man is walking a dog”
Hypothesis - “No cat is outside”
Label - Neutral (likely predicts contradiction)

In this second example, AWE and LSTM encoders are likely going to predict a contradiction as the words cat and dog are likely going to be interpreted in the same way, and of course, we could think of cat contradicting dod. The BiLSTM encoders with and without max pooling are likely going to perform better, capturing the context of the entire sentence. The negation in this case does not highly affect the result of the prediction but it could be the case that for the simple Bidirectional LSTM, the meaning of the words "dog" and "cat", are not properly captured in the last hidden states of both encoders.

Finally, the low accuracy can be probably attributed to the implementation of the Data Processing pipeline, as all of the dataloaders have been manually implemented, which looking back, was probably not the best decision to start with but nevertheless allowed me to become a lot more comfortable with having to build my own training pipelines. An alternative would have been to use PyTorch's DataLoader functions and making a custom collate_fn to obtain the data.

## SentEval Multi-Task Evaluation

In this section we will use the SentEval framework to evaluate our sentence encoders on 10 different transfer tasks.

The tasks are the following:

1. **MR (Movie Review)**: Binary sentiment classification task where the goal is to predict whether a given movie review is positive or negative.
2. **CR (Customer Review)**: Binary sentiment classification task where the goal is to predict whether a given customer review is positive or negative.
3. **SUBJ (Subjectivity)**: Binary classification task where the goal is to predict whether a given sentence is subjective or objective.
4. **MPQA (Opinion polarity)**: Binary classification task where the goal is to predict whether a given sentence expresses a positive or negative opinion.
5. **SST2 (Stanford Sentiment Treebank)**: Binary sentiment classification task where the goal is to predict whether a given sentence has a positive or negative sentiment.
6. **TREC (Question classification)**: Multi-class classification task where the goal is to classify a given question into one of six types: "what", "who", "where", "when", "why", and "how".
7. **MRPC (Microsoft Research Paraphrase Corpus)**: Binary classification task where the goal is to predict whether a pair of sentences are semantically equivalent or not.
8. **SICKRelatedness**: Regression task where the goal is to predict the relatedness score between two sentences on a scale of 1 to 5.
9. **SICKEntailment**: Binary classification task where the goal is to predict whether a given pair of sentences entails each other, contradicts each other, or neither.
10. **STS14 (Semantic Textual Similarity)**: Regression task where the goal is to predict the similarity score between two sentences on a scale of 0 to 5.

### Results

The evaluation of the different models can be performed following the instructions found in the [README](README.md) file. We will visualize the saved results for the different models across the evaluation tasks.

#### Result Loading

In [3]:
result_paths = {
    "AWE":"results/AWESentenceEncoder_complex_sentEval_2023-04-20-02-11-18.pt",
    "LSTM":"results/LSTMEncoder_complex_sentEval_2023-04-20-04-26-37.pt",
    "BiLSTM":"results/BiLSTMEncoder_complex_sentEval_2023-04-20-06-36-48.pt",
    "BiLSTM-Max":"results/BiLSTMEncoder_pooling-max_complex_sentEval_2023-04-20-08-50-55.pt"
}

results = {key: None for key in result_paths}

for model in result_paths:
    results[model] = torch.load(result_paths[model], map_location=torch.device(device))

#### Micro and Macro accuracy calculations

In [4]:
def calculate_micro_macro(results):
    sum_dev = 0
    sum_samples = 0
    sum_micro = 0
    n_tasks = 0

    for task in results:
        cont = False
        if "devacc" not in results[task] or "ndev" not in results[task]:
            continue
        else:
            n_tasks+=1
        sum_dev+=results[task]["devacc"]
        sum_samples+=results[task]["ndev"]
        sum_micro+=results[task]["devacc"] * results[task]["ndev"]

    macro = 0
    if n_tasks > 0:
        macro =  (sum_dev / n_tasks)

    micro = (sum_micro / sum_samples)

    return micro, macro


micro_acc = {key: None for key in results}
macro_acc = {key: None for key in results}

for model in results:
    micro_acc[model], macro_acc[model] = calculate_micro_macro(results[model])

df = pd.DataFrame({'Model': list(results.keys()), 
                   'Micro': list(micro_acc.values()),
                   'Macro': list(macro_acc.values())})
df['Micro'] = df['Micro'].round(1)
df['Macro'] = df['Macro'].round(1)
df = df[['Model', 'Micro', 'Macro']]
display(df)

Unnamed: 0,Model,Micro,Macro
0,AWE,77.5,75.8
1,LSTM,71.9,71.1
2,BiLSTM,76.6,75.5
3,BiLSTM-Max,80.5,79.3


Function for Printing Results

In [7]:
def result_to_str(result):
    def numpy_converter(obj):
        """Converts numpy types to native Python types."""
        if isinstance(obj, np.generic):
            return obj.item()
        elif isinstance(obj, np.ndarray):
            return obj.tolist()
    return json.dumps(result, indent=4, default=numpy_converter)

#Example Use
#print(result_to_str(results["AWE"]))

#### Result Analysis

The results for the micro and macro accuracies for the different models are much closer to those from the original paper. Like in the paper, the BiLSTM model with Max pooling performs the best. As expected, our BiLSTM model is the second best but what is surprising is that our AWE encoder comes very close in performance to our BiLSTM encoder. Finally, our LSTM encoder performs the worst out of all of them. One reason for our LSTM encoder to perform so poorly is that the model has overfit to the SNLI task, meaning that the sentence embeddings do not capture general-purpose information about them, leading to poor generalizeability of the embeddings accross tasks that require different semantic information from the embeddings. This could be a valid explanation as for example, our AWE encoder has no learnable weights and thus can't over specialize to the task, making it very generic. Furthermore, we can see that our BiLSTM encoder with Max-Pooling performs very well on average, meaning that it was able to capture general-purpose information in the generated sentence embeddings.

## Further Research Questions

After analysing the results from SentEval. We came up with the hypothesis that our LSTM encoder had over-specialized to the NLI task. These results sparked an interest in seeing how the architecture of the Classifier model used for NLI affects the capability of learning general purpose information accross the different encoders. Up until this moment, the encoders were trained using a classifier with one hidden layer of size 512. We decided to investigate using different classifier architectures, and examining the effect that these had on both SNLI accuracies and SentEval accuracies.

We tested three additional Classifier architectures:

1. **Linear-3 Clasifier**:  This classifier is the same as the original one but with no non-linear activation functions, the motivation behind trying this was to see whether using a much simpler classifier, enforces the encoder to capture more meaningful sentence information.
2. **Linear-2 Classifier**:  This classifier is just composed of two linear layers, reducing the number of learnable parameters for the clasifier and hopefully exacerbating the effect observed in the Linear-3 classifier.
3. **Non-Linear-4 Classifier**: We add an additional linear layer to the original classifier (which can be referred to as Non-Linear-3 classifier). This will allow us to observe the effect of having a more sophisticated classifier, we suppose that having a larger classifier will result in the classifier learning some of the general-purpose information that should be learnt by the encoder, resulting in a less effective encoder when evaluated on different tasks.

For the Non-Linear-4 classifier, training and evaluation was not performed on the simple BiLSTM classifier and AWE classifier due to time constraints. For the AWE classifier, all SentEval results should remain the same as it has no learnable parameters, but we expected the encoders trained with more complex models to perform better on SNLI and worse on SentEval. 

### SNLI Training Results

In [24]:
test_accuracies_non_linear = {
    "AWE":0.607,
    "LSTM":0.717,
    "BiSLTM":0.728,
    "BiLSTM-Max":0.805
}

val_accuracies_non_linear = {
    "AWE":0.606,
    "LSTM":0.718,
    "BiSLTM":0.728,
    "BiLSTM-Max":0.806
}

test_accuracies_linear_3 = {
    "AWE":0.608,
    "LSTM":0.715,
    "BiSLTM":0.730,
    "BiLSTM-Max":0.805
}

val_accuracies_linear_3 = {
    "AWE":0.604,
    "LSTM":0.717,
    "BiSLTM":0.728,
    "BiLSTM-Max":0.806
}

test_accuracies_linear_2 = {
    "AWE":0.607,
    "LSTM":0.715,
    "BiSLTM":0.727,
    "BiLSTM-Max":0.803
}

val_accuracies_linear_2 = {
    "AWE":0.605,
    "LSTM":0.718,
    "BiSLTM":0.728,
    "BiLSTM-Max":0.805
}


test_accuracies_non_linear_4 = {
    "LSTM":0.717,
    "BiLSTM-Max":0.804
}

val_accuracies_non_linear_4 = {
    "LSTM":0.715,
    "BiLSTM-Max":0.807
}

# create dataframes for each architecture
df_non_linear = pd.DataFrame({'Model': list(test_accuracies_non_linear.keys()), 
                              'Non-Linear-3 val': list(val_accuracies_non_linear.values()),
                              'Non-Linear-3 test': list(test_accuracies_non_linear.values()),
                              'Non-Linear-4 val': [val_accuracies_non_linear_4.get(key, "N/A") for key in test_accuracies_non_linear.keys()],
                              'Non-Linear-4 test': [test_accuracies_non_linear_4.get(key, "N/A") for key in test_accuracies_non_linear.keys()]
                             })

df_linear_3 = pd.DataFrame({'Model': list(test_accuracies_linear_3.keys()), 
                            'Linear-3 val': list(val_accuracies_linear_3.values()),
                            'Linear-3 test': list(test_accuracies_linear_3.values())})

df_linear_2 = pd.DataFrame({'Model': list(test_accuracies_linear_2.keys()), 
                            'Linear-2 val': list(val_accuracies_linear_2.values()),
                            'Linear-2 test': list(test_accuracies_linear_2.values())})

# merge the dataframes on the 'Model' column
merged_df = pd.merge(df_non_linear, df_linear_3, on='Model')
merged_df = pd.merge(merged_df, df_linear_2, on='Model')
merged_df = merged_df.round(2)

# reorder columns
merged_df = merged_df[['Model', 'Non-Linear-3 val', 'Non-Linear-3 test', 'Non-Linear-4 val', 'Non-Linear-4 test', 'Linear-3 val', 'Linear-3 test', 'Linear-2 val', 'Linear-2 test']]

# display the dataframe
display(merged_df)


Unnamed: 0,Model,Non-Linear-3 val,Non-Linear-3 test,Non-Linear-4 val,Non-Linear-4 test,Linear-3 val,Linear-3 test,Linear-2 val,Linear-2 test
0,AWE,0.61,0.61,,,0.6,0.61,0.6,0.61
1,LSTM,0.72,0.72,0.715,0.717,0.72,0.72,0.72,0.72
2,BiSLTM,0.73,0.73,,,0.73,0.73,0.73,0.73
3,BiLSTM-Max,0.81,0.8,0.807,0.804,0.81,0.8,0.8,0.8


### SentEval Result Loading

In [23]:
result_paths_non_linear = result_paths

result_paths_linear_3 = {
    "AWE":"results_linear_3/AWESentenceEncoder_sentEval_2023-04-20-01-50-20.pt",
    "LSTM":"results_linear_3/LSTMEncoder_sentEval_2023-04-20-04-10-23.pt",
    "BiLSTM":"results_linear_3/BiLSTMEncoder_sentEval_2023-04-20-06-05-11.pt",
    "BiLSTM-Max":"results_linear_3/BiLSTMEncoder_pooling-max_sentEval_2023-04-20-08-20-35.pt"
}

result_paths_linear_2 = {
    "AWE":"results_linear_2/AWESentenceEncoder_sentEval.pt",
    "LSTM":"results_linear_2/LSTMEncoder_sentEval.pt",
    "BiLSTM":"results_linear_2/BiLSTMEncoder_sentEval.pt",
    "BiLSTM-Max":"results_linear_2/BiLSTMEncoder_pooling-max_sentEval.pt"
}

result_paths_non_linear_4 = {
    "LSTM":"results_nonlinear_4/LSTMEncoder_complex_sentEval_2023-04-20-16-45-53.pt",
    "BiLSTM-Max":"results_nonlinear_4/BiLSTMEncoder_pooling-max_complex_sentEval_2023-04-20-16-47-11.pt"
}



#Must Have the same keys
results_non_linear = {key: None for key in result_paths}
result_linear_3 = {key: None for key in result_paths}
result_linear_2 = {key: None for key in result_paths}
result_non_linear_4 = {key: None for key in result_paths_non_linear_4}

for model in result_paths:
    results_non_linear[model] = torch.load(result_paths_non_linear[model], map_location=torch.device(device))
    result_linear_3[model] = torch.load(result_paths_linear_3[model], map_location=torch.device(device))
    result_linear_2[model] = torch.load(result_paths_linear_2[model], map_location=torch.device(device))
    if model in result_paths_non_linear_4:
        result_non_linear_4[model] = torch.load(result_paths_non_linear_4[model], map_location=torch.device(device))

micro_acc_non_linear = {key: None for key in results_non_linear}
macro_acc_non_linear = {key: None for key in results_non_linear}
micro_acc_linear_3 = {key: None for key in result_linear_3}
macro_acc_linear_3 = {key: None for key in result_linear_3}
micro_acc_linear_2 = {key: None for key in result_linear_2}
macro_acc_linear_2 = {key: None for key in result_linear_2}

micro_acc_non_linear_4 = {key: None for key in result_non_linear_4}
macro_acc_non_linear_4 = {key: None for key in result_non_linear_4}



for model in results_non_linear:
    micro_acc_non_linear[model], macro_acc_non_linear[model] = calculate_micro_macro(results_non_linear[model])
    micro_acc_linear_3[model], macro_acc_linear_3[model] = calculate_micro_macro(result_linear_3[model])
    micro_acc_linear_2[model], macro_acc_linear_2[model] = calculate_micro_macro(result_linear_2[model])

    if model in micro_acc_non_linear_4:
        micro_acc_non_linear_4[model], macro_acc_non_linear_4[model] = calculate_micro_macro(result_non_linear_4[model])



micro_df = pd.DataFrame({'Model': list(results.keys()), 
                         'Non-Linear-3': list(micro_acc.values()),
                         'Non-Linear-4': [micro_acc_non_linear_4.get(model, 'N/A') for model in results.keys()],
                         'Linear-3': list(micro_acc_linear_3.values()),
                         'Linear-2': list(micro_acc_linear_2.values())})
micro_df = micro_df.round(2)
micro_df = micro_df[['Model', 'Non-Linear-3', 'Non-Linear-4', 'Linear-3', 'Linear-2']]


macro_df = pd.DataFrame({'Model': list(results.keys()), 
                         'Non-Linear-3': list(macro_acc.values()),
                         'Non-Linear-4': [macro_acc_non_linear_4.get(model, 'N/A') for model in results.keys()],
                         'Linear-3': list(macro_acc_linear_3.values()),
                         'Linear-2': list(macro_acc_linear_2.values())})
macro_df = macro_df.round(2)
macro_df = macro_df[['Model', 'Non-Linear-3', 'Non-Linear-4', 'Linear-3', 'Linear-2']]



### SentEval Result Visualization

#### Macro Accuracies

In [21]:
display(macro_df)

Unnamed: 0,Model,Non-Linear-3,Non-Linear-4,Linear-3,Linear-2
0,AWE,75.75,,75.75,75.75
1,LSTM,71.06,70.89875,71.06,71.16
2,BiLSTM,75.53,,75.67,75.83
3,BiLSTM-Max,79.32,79.70875,79.48,79.74


#### Micro Accuracies

In [19]:
display(micro_df)

Unnamed: 0,Model,Non-Linear-3,Non-Linear-4,Linear-3,Linear-2
0,AWE,77.45,,77.45,77.45
1,LSTM,71.88,71.55242,71.74,71.74
2,BiLSTM,76.58,,76.65,76.55
3,BiLSTM-Max,80.47,80.614457,80.34,80.56


### Result Analysis

As expected, when using an AWE encoder, the results for SentEval are constant indepentently of the model used as they have no trainable weights. For the LSTM encoder, we see that when using both linear classifiers and a larger non-linear classifier, the performance on SentEval decreases ever so slightly. It is interesting to see that for the BiLSTM using linear layers actually makes the encoders perform better on SentEval. We could reason that this could be attributed to the encoder being more complex than the LSTM encoder, but again, more testing should be performed to reach a meaningful conclusion.

And finally, for our BiLSTM encoder with max pooling we observe that we get a slightly worse performance with our smaller linear classifier, whereas with the rest of the classifiers performance on SentEval improves slightly.

With the limited amount of testing performed and the low variability in the obtained results, we could continue the experiments by using a much larger classifier and concentrating on the Linear-2 model while also analysing the individual accuracies on each separate task. Furthermore, we should perform multiple trainings with different seeds to ensure certainty in our results. Sadly, due to the limited time available for this assignment, further testing could not be performed and thus our analysis resulted in inconclusive results.