# Learning sentence representations from natural language inference data (SNLI)

## Results of all models on both SNLI and SentEval

In [1]:
checkpoint = "best_model_dir_1234/baseline_best.pth"
encoder = "baseline"

In [3]:
!python eval.py $checkpoint --senteval-vocab --encoder $encoder --snli --senteval

04/22 10:08:06 PM: Printing arguments : Namespace(checkpoint='best_model_dir_1234/baseline_best.pth', senteval_vocab=True, encoder='baseline', snli=True, senteval=True, seed=1111, num_workers=4, batch_size=64, sent_eval_path='./SentEval/data', tokenize=False)
04/22 10:08:06 PM: Setting seed...
04/22 10:08:06 PM: Vocab already exists. Loading from disk...
04/22 10:08:15 PM: Loading the model checkpoint from best_model_dir_1234/baseline_best.pth
04/22 10:08:15 PM: Loading the model checkpoint trained in SNLI dataset
04/22 10:08:15 PM: Evaluating the model on SNLI dataset
04/22 10:08:16 PM: Batch: 1/154, Loss: 0.7540, Acc: 64.0625
04/22 10:08:16 PM: Batch: 11/154, Loss: 0.7281, Acc: 68.7500
04/22 10:08:16 PM: Batch: 21/154, Loss: 0.8359, Acc: 56.2500
04/22 10:08:16 PM: Batch: 31/154, Loss: 0.6027, Acc: 78.1250
04/22 10:08:16 PM: Batch: 41/154, Loss: 0.7169, Acc: 73.4375
04/22 10:08:16 PM: Batch: 51/154, Loss: 0.8181, Acc: 60.9375
04/22 10:08:16 PM: Batch: 61/154, Loss: 0.8406, Acc: 56.250

In [None]:
checkpoint = "best_model_dir_1234/unilstm_best.pth"
encoder = "unilstm"

In [None]:
!python eval.py $checkpoint --senteval-vocab --encoder $encoder --snli --senteval

In [None]:
checkpoint = "best_model_dir_1234/bilstm_best.pth"
encoder = "bilstm"

In [None]:
!python eval.py $checkpoint --senteval-vocab --encoder $encoder --snli --senteval

In [None]:
checkpoint = "best_model_dir_1234/bilstm-max_best.pth"
encoder = "bilstm-max"

In [None]:
!python eval.py $checkpoint --senteval-vocab --encoder $encoder --snli --senteval

#### Performance of sentence encoder architectures on SNLI and (aggregated) SentEval tasks (reported with macro and micro accuracies)

The table reports the accuracy of each model on the SNLI dev/validation and test sets, together with the micro- and macro-averaged results on the SentEval tasks (computed from 'devacc').

| **Model** | **SNLI Dev** | **SNLI Test** | **Micro** | **Macro** |
|---|---|---|---|---|
| Baseline | 66.0536% | 66.1950% | 82.4852% | 80.5863% |
| LSTM | 80.4003% | 79.9165% | 80.2680% | 79.3212% |
| BiLSTM-Concat | 79.9431% | 79.7638% | 83.0806% | 82.0075% |
| BiLSTM-Max | 84.2105% | 84.0798% | 84.5913% | 83.4000% |
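
The notebook does not show how the micro and macro columns are computed, so the following is only a minimal sketch of one common convention, assumed here: the macro average is the plain mean of the per-task dev accuracies, while the micro average weights each task by its number of dev samples. The function names, the task names and the `(devacc, ndev)` numbers are hypothetical and are not taken from the runs above.

```python
def macro_average(task_results):
    """Unweighted mean of the per-task dev accuracies."""
    return sum(acc for acc, _ in task_results.values()) / len(task_results)


def micro_average(task_results):
    """Mean of the per-task dev accuracies, weighted by dev-set size."""
    total = sum(n for _, n in task_results.values())
    return sum(acc * n for acc, n in task_results.values()) / total


# Hypothetical (devacc, ndev) pairs, invented for illustration only.
example = {
    "task_a": (80.0, 1000),
    "task_b": (90.0, 3000),
}

print(f"macro: {macro_average(example):.2f}")  # macro: 85.00
print(f"micro: {micro_average(example):.2f}")  # micro: 87.50
```

Under this convention, a model that does well on the larger tasks gets a micro average above its macro average, which matches the pattern in every row of the table.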

## Analysis of the Results

### Model performance

The BiLSTM encoder with max pooling is the best-performing model on both SNLI and SentEval, as anticipated. This can be credited to its architecture. A bidirectional LSTM captures information from both the preceding and the succeeding context, which helps it learn richer contextual sentence representations. Max pooling lets the model keep the most salient feature in each dimension of the input sequence, making it more robust to variation in sentence length and structure. In contrast, the baseline relies solely on averaging word embeddings and is clearly inferior to BiLSTM-Concat and BiLSTM-Max on SNLI (about 66% versus about 80% and 84% on the test set), although it stays competitive on SentEval, where it even scores above the unidirectional LSTM. Finally, the unidirectional LSTM performs similarly to BiLSTM-Concat on SNLI, but its simplicity limits it on the transfer-task benchmark (SentEval), where it has the lowest micro and macro averages. This can be attributed to its single-direction mechanism, which yields a simpler contextualized representation.

### Failures

Whenever a deeper understanding of language is required, all models tend to fail (see the section below on predicting entailment for tricky premise and hypothesis pairs). For instance, negation, complex sentence structure, and long-range dependencies (e.g., long-range coreference) cause failures. The baseline model has the hardest time in such cases, as it relies exclusively on averaged word embeddings, which carry no information about context or word order.
The unidirectional LSTM is expected to struggle with dependencies that can only be resolved by considering context from both directions.
Finally, the BiLSTM-Concat and BiLSTM-Max models leverage bidirectionality and are the most robust methods, but they still leave room for improvement. Long-range dependencies and semantic relationships between premise and hypothesis that require intricate handling can still confuse them.

### Analysis of sentence embeddings

All models learn fixed-dimensional sentence embeddings designed to capture the meaning and structure of a sentence. The baseline model captures the average of word meanings, thereby losing information about word order and context. The LSTM, BiLSTM-Concat, and BiLSTM-Max models are better at capturing the sequential character of sentences, the context in which words appear, and some word order. However, even the most complex BiLSTM models lose some information. For instance, they struggle to capture the intricacies of specific syntactic or semantic links (as seen in the section below on tricky entailment prediction).
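
The loss of word order in the baseline can be seen in a small sketch. It uses invented two-dimensional toy vectors rather than the embeddings used in this project, and the helper names `mean_pool` and `max_pool` are hypothetical. Note that max pooling over static vectors is order-invariant as well; in the BiLSTM-Max encoder the pooled vectors would instead be LSTM hidden states, which do depend on the order of the words.

```python
def mean_pool(vectors):
    """Average the token vectors dimension-wise (the baseline's strategy)."""
    n = len(vectors)
    return [sum(dim) / n for dim in zip(*vectors)]


def max_pool(vectors):
    """Take the dimension-wise maximum over the token vectors."""
    return [max(dim) for dim in zip(*vectors)]


# Toy 2-d "embeddings", invented for illustration only.
toy = {"dog": [1.0, 0.0], "bites": [0.0, 2.0], "man": [3.0, 1.0]}

a = [toy[w] for w in "dog bites man".split()]
b = [toy[w] for w in "man bites dog".split()]

# Both sentences receive the same averaged representation.
print(mean_pool(a) == mean_pool(b))  # True
# Max pooling keeps the largest value seen in each dimension.
print(max_pool(a))  # [3.0, 2.0]
```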

### Predicting Entailment

#### Choosing the BiLSTM-Max model and loading the checkpoint

In [2]:
checkpoint = "best_model_dir_1234/bilstm-max_best.pth"
encoder = "bilstm-max"

In [3]:
premise_1 = "Two men sitting in the sun"
hypothesis_1 = "Nobody is sitting in the shade"

In [4]:
! python predict.py $checkpoint --encoder $encoder --premise "$premise_1" --hypothesis "$hypothesis_1"

04/22 09:20:35 PM: Printing arguments : Namespace(checkpoint='best_model_dir_1234/bilstm-max_best.pth', encoder='bilstm-max', seed=1234, premise='Two men sitting in the sun', hypothesis='Nobody is sitting in the shade')
04/22 09:20:35 PM: Setting seed...
04/22 09:20:35 PM: Vocab already exists. Loading from disk...
04/22 09:20:43 PM: Loading the model checkpoint from best_model_dir_1234/bilstm-max_best.pth
04/22 09:20:44 PM: Loading the model checkpoint trained in SNLI dataset
04/22 09:20:44 PM: Premise: Two men sitting in the sun
04/22 09:20:44 PM: Hypothesis: Nobody is sitting in the shade
04/22 09:20:44 PM: Entailment prediction: Contradiction


In [5]:
premise_2 = "A man is walking a dog"
hypothesis_2 = "No cat is outside"

In [6]:
! python predict.py $checkpoint --encoder $encoder --premise "$premise_2" --hypothesis "$hypothesis_2"

04/22 09:20:48 PM: Printing arguments : Namespace(checkpoint='best_model_dir_1234/bilstm-max_best.pth', encoder='bilstm-max', seed=1234, premise='A man is walking a dog', hypothesis='No cat is outside')
04/22 09:20:48 PM: Setting seed...
04/22 09:20:48 PM: Vocab already exists. Loading from disk...
04/22 09:20:57 PM: Loading the model checkpoint from best_model_dir_1234/bilstm-max_best.pth
04/22 09:20:57 PM: Loading the model checkpoint trained in SNLI dataset
04/22 09:20:58 PM: Premise: A man is walking a dog
04/22 09:20:58 PM: Hypothesis: No cat is outside
04/22 09:20:58 PM: Entailment prediction: Contradiction


One probable explanation for the incorrect entailment labels is the presence of negation in the hypotheses, which may lead the model to focus on the opposite of the premise's meaning. The model may be overly sensitive to negation words such as "nobody" and "no" in the hypothesis, causing it to assume that a contradiction exists. Furthermore, the model struggles to recognize how the entities in the two sentences relate to each other, such as "men" and "nobody" or "dog" and "cat." This difficulty in capturing semantic relationships between entities may cause the model to misjudge the relationship between the premise and the hypothesis.

### Further Research Question
#### How does the length of the sentences (premise and hypothesis) affect the model?

In [4]:
! python eval_dataset_quartiles.py $checkpoint --encoder $encoder

04/22 09:21:52 PM: Printing arguments : Namespace(checkpoint='best_model_dir_1234/bilstm-max_best.pth', encoder='bilstm-max', seed=1234, num_workers=4, batch_size=64, quartile_list=[0.1, 0.9])
04/22 09:21:52 PM: Setting seed...
04/22 09:21:52 PM: Vocab already exists. Loading from disk...
04/22 09:22:01 PM: Loading the model checkpoint from best_model_dir_1234/bilstm-max_best.pth
04/22 09:22:02 PM: Loading the model checkpoint trained in SNLI dataset
Map: 100%|████████████████████████| 9824/9824 [00:00<00:00, 13293.10 examples/s]
Filter: 100%|█████████████████████| 9824/9824 [00:00<00:00, 38436.60 examples/s]
Filter: 100%|█████████████████████| 9824/9824 [00:00<00:00, 38473.92 examples/s]
Filter: 100%|█████████████████████| 9824/9824 [00:00<00:00, 38915.85 examples/s]
04/22 09:22:04 PM: Shortest dataset size: 822
04/22 09:22:04 PM: Middle dataset size: 7897
04/22 09:22:04 PM: Longest dataset size: 1105
04/22 09:22:04 PM: average sentence length (sum of premise + hypothesis) of the data

For this additional research question, we use the BiLSTM-Max encoder, as it is the best-performing model on the SNLI and SentEval tasks.
The BiLSTM-Max model performs better on shorter inputs, where length is the sum of the premise and hypothesis lengths. It reaches an accuracy of 84.4282% on the shortest subset (average length 12.77) and an almost identical 84.4118% on the middle subset (average length 22.60). The largest drop occurs on the longest subset (average length 39.17), where accuracy falls to 81.4480%. Our implementation uses NumPy quantiles (0.1 and 0.9, as shown by `quartile_list` in the log above) to create the subsets with smaller and larger average lengths.
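
The splitting step can be sketched as follows. This is not the code of `eval_dataset_quartiles.py`, which is not shown here. It is a minimal standard-library illustration that assumes whitespace tokenization, linear interpolation between order statistics (the default behaviour of `numpy.quantile`), and inclusive boundaries for the shortest and longest subsets. The function names and the toy data are hypothetical.

```python
def quantile(values, q):
    """Linearly interpolated quantile of a non-empty list, for 0 <= q <= 1."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def split_by_length(pairs, low_q=0.1, high_q=0.9):
    """Partition (premise, hypothesis) pairs into shortest/middle/longest."""
    # Length is the number of whitespace tokens in premise plus hypothesis.
    lengths = [len(p.split()) + len(h.split()) for p, h in pairs]
    low, high = quantile(lengths, low_q), quantile(lengths, high_q)
    shortest, middle, longest = [], [], []
    for pair, n in zip(pairs, lengths):
        if n <= low:
            shortest.append(pair)
        elif n >= high:
            longest.append(pair)
        else:
            middle.append(pair)
    return shortest, middle, longest


# Toy data: ten pairs whose combined lengths are 2, 3, ..., 11 tokens.
pairs = [(" ".join(["w"] * k), "w") for k in range(1, 11)]
shortest, middle, longest = split_by_length(pairs)
print(len(shortest), len(middle), len(longest))  # 1 8 1
```

Because each pair falls into exactly one subset, the subset sizes always add up to the size of the full dataset, as they do in the log above (822 + 7897 + 1105 = 9824).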

These findings indicate that the BiLSTM-Max encoder is better at encoding shorter sentences than longer ones, which is unsurprising. As sentences grow longer, the model struggles to capture all of the necessary information and word associations. Longer inputs are more likely to cause the model to lose important information due to the inherent limits of LSTMs in dealing with long-range dependencies.
Our results indicate that the model's capacity to handle long sentences could be improved. Possible alternatives include adopting more complex models, such as Transformer-based architectures, to obtain richer contextualized representations.