# ATCS Practical 1 - Demonstration & Analysis Notebook

This notebook contains a demonstration of the 4 models that have been pretrained on the SNLI task, with 4 different encoder architectures:
- **Baseline**: averaging word embeddings as the sentence representation.
- **LSTM**: unidirectional LSTM over the word embeddings, with the last hidden state as the sentence representation.
- **BiLSTM**: bidirectional LSTM, with the concatenation of the last hidden states of the forward and backward layers as the sentence representation.
- **BiLSTM-max**: similar to BiLSTM, but max pooling is applied to the word-level hidden states instead of taking the last hidden states.

After loading the corresponding models from the saved checkpoints, we can select a model and feed it two sentences: a premise and a hypothesis.
The NLI classifier will then predict the relation between the sentences - either **entailment**, **neutral** or **contradiction**.

Additionally, we summarize the performance of the models on the validation and test sets of SNLI, as well as on the transfer tasks of the [SentEval](https://github.com/facebookresearch/SentEval/) toolkit.
Finally, we perform a short error analysis to identify the strengths and weaknesses of each model, in order to find which model is most suitable in each context.

First, we have to define some variables for the following cells to be able to locate the required files.
- `DATA_DIR` is the directory containing the *aligned* GloVe embeddings, which are loaded dynamically into the encoders.
- `LOGS_DIR` is the directory containing the model checkpoints and the TensorBoard event files.
- `MODEL_ORDER` is the row order used to display the results.

In [1]:
# The path of the data directory (where the ALIGNED glove embeddings are)
DATA_DIR = "./data"

# The path of the tensorboard logs directory
LOGS_DIR = "./lisa_logs"

# The order in which the models should appear in the tables
MODEL_ORDER = ['baseline', 'lstm', 'bilstm', 'bilstm-max']

## Package Imports
Next, we need to import all required libraries and packages. All of these should be installed with the provided `environment.yml` file.

Additionally, we download the English model for the SpaCy tokenizer, which is used internally to prepare the batches.

In [2]:
from collections import defaultdict
from encoders import *
from glove import GloVeEmbeddings
from IPython.display import display
from models import Classifier
from pandas.io.formats.style import Styler
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
from torchtext.data.utils import get_tokenizer
from typing import List, Tuple

import glob, os, re, spacy, torch
import pandas as pd

if not spacy.util.is_package("en_core_web_sm"):
    print("Downloading SpaCy English model (small)")
    spacy.cli.download("en_core_web_sm")

## Model Demonstration
We can use the pretrained models to showcase how they would be used in a production system for inference.

Let's start by loading the GloVe embeddings from the previously specified data directory, and instantiate the tokenizer to be used later on.

In [3]:
glove = GloVeEmbeddings(DATA_DIR)
tokenizer = get_tokenizer('spacy', language='en_core_web_sm')

Reading pre-trained GloVe embeddings from disk


Next, we can search in the specified log directory to find all saved checkpoints, which the pretrained models will be loaded from.

In [4]:
CKPTS_GLOB = "*/*/checkpoints/*.ckpt"
CKPTS_PATTERN = r"([^\/]+)\/version_\d+.*\.ckpt"

EMBED_DIM = 300
LSTM_STATE_DIM = 2048

models = {}
for ckpt_name in glob.glob(os.path.join(LOGS_DIR, CKPTS_GLOB)):
    # Extract model name from checkpoint name
    res = re.search(CKPTS_PATTERN, ckpt_name)
    model_name = res.group(1)

    if model_name == "baseline":
        repr_dim = EMBED_DIM
        encoder = BaselineEncoder()
    else:
        repr_dim = LSTM_STATE_DIM

        if model_name == "lstm":
            encoder = LSTMEncoder(EMBED_DIM, LSTM_STATE_DIM)
        elif model_name == "bilstm":
            repr_dim *= 2
            encoder = BiLSTMEncoder(EMBED_DIM, LSTM_STATE_DIM)
        elif model_name == "bilstm-max":
            repr_dim *= 2
            encoder = MaxBiLSTMEncoder(EMBED_DIM, LSTM_STATE_DIM)
        else:
            print(f"Encountered unsupported encoder architecture '{model_name}'")
            continue

    model_args = {"embeddings": glove.vectors, "encoder": encoder}
    model = Classifier.load_from_checkpoint(ckpt_name, **model_args)
    model.load_embeddings(glove.vectors)
    models[model_name] = model
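
To make the name extraction above concrete, the following sketch applies `CKPTS_PATTERN` to a single path. The path is hypothetical (the actual checkpoint file names are not shown in this notebook), and the `None` check is an addition that the loading loop itself does not perform.

```python
import re

CKPTS_PATTERN = r"([^\/]+)\/version_\d+.*\.ckpt"

# Hypothetical checkpoint path, made up for illustration only
example_path = "./lisa_logs/bilstm-max/version_0/checkpoints/example.ckpt"

match = re.search(CKPTS_PATTERN, example_path)
# re.search returns None when nothing matches, so guard before calling group()
model_name = match.group(1) if match else None
print(model_name)  # bilstm-max
```

Note that the pattern requires a forward slash right before `version_`, so a path that uses a different separator would not match.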

We simply define a function that takes a model name, a premise and a hypothesis as arguments, and returns the predicted relation class.

Inside this function, we can see the preprocessing pipeline that was applied to the dataset before training the NLI classifier.

In [5]:
INT_TO_CLASS = {
    0: "entailment",
    1: "neutral",
    2: "contradiction"
}

@torch.no_grad()
def inference(model_name: str, premise: str, hypothesis: str) -> str:
    if model_name not in models:
        raise Exception(f"Unknown encoder type '{model_name}'!")

    # Load model from dict
    model = models[model_name]

    # Lowercase + tokenize
    premise = tokenizer(premise.lower())
    hypothesis = tokenizer(hypothesis.lower())

    # Convert list of tokens to list of IDs
    premise = [glove.get_id(t) for t in premise]
    hypothesis = [glove.get_id(t) for t in hypothesis]

    # Convert to tensors with an extra dimension (batch_size=1)
    p = torch.IntTensor(premise).unsqueeze(0)
    h = torch.IntTensor(hypothesis).unsqueeze(0)

    # Count length of each sentence
    p_len = torch.LongTensor([len(premise)])
    h_len = torch.LongTensor([len(hypothesis)])

    logits = model(p, h, p_len, h_len)
    category = INT_TO_CLASS[logits.argmax().item()]
    return category
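
The preprocessing steps inside `inference` can be tried out without the checkpoints or the embeddings. The sketch below is a simplified stand-in: `TOY_VOCAB`, `toy_tokenize` and `toy_encode` are hypothetical names, the tokenizer is a crude replacement for SpaCy, and mapping unknown words to an `<unk>` ID is an assumption (how `glove.get_id` treats out-of-vocabulary tokens is not shown here).

```python
from typing import Dict, List

# Hypothetical toy vocabulary; the notebook uses glove.get_id instead
TOY_VOCAB: Dict[str, int] = {"<unk>": 0, "the": 1, "dog": 2, "is": 3, "eating": 4, ".": 5}

def toy_tokenize(sentence: str) -> List[str]:
    # Lowercase, then split off periods as separate tokens (stand-in for SpaCy)
    return sentence.lower().replace(".", " .").split()

def toy_encode(sentence: str) -> List[int]:
    # Map each token to its ID, falling back to the <unk> ID (an assumption)
    return [TOY_VOCAB.get(tok, TOY_VOCAB["<unk>"]) for tok in toy_tokenize(sentence)]

premise_ids = toy_encode("The dog is eating.")
hypothesis_ids = toy_encode("The dog sleeps.")

# The sentence lengths are passed to the model alongside the ID sequences
print(premise_ids, len(premise_ids))        # [1, 2, 3, 4, 5] 5
print(hypothesis_ids, len(hypothesis_ids))  # [1, 2, 0, 5] 4
```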

Finally, we have an **interactive** cell, meaning that we can play around with the 3 values.
- `MODEL_NAME` specifies which model to use; thus, it has to be one of the 4 supported architectures.
- `PREMISE` corresponds to the first of the two sentences that will be fed to the model.
- `HYPOTHESIS` corresponds to the second sentence.

In [6]:
# must be one of: 'baseline', 'lstm', 'bilstm', 'bilstm-max'
MODEL_NAME = 'bilstm-max'

PREMISE = 'The dog is eating.'
HYPOTHESIS = 'The dog sleeps.'

inference(MODEL_NAME, PREMISE, HYPOTHESIS)

'contradiction'

As we can see, our BiLSTM-max model correctly inferred that the sentences `The dog is eating.` and `The dog sleeps.` contradict each other.

Try out different sentences to see how the model's prediction changes.

We will perform a more extensive analysis focusing on the errors later on.

## Results Overview
Let's recreate the two tables from the original paper, which showcase the performance of our models in both the original and the transfer tasks.

First, we define a few functions that are used to modify the display style of the dataframes we will create later on (purely for aesthetic reasons).

In [7]:
def highlight_max(s: pd.Series) -> List[str]:
    is_max = s == s.max()
    return ['font-weight: bold' if cell else '' for cell in is_max]

def format_df(df: pd.DataFrame) -> Styler:
    dfs = Styler(df)
    dfs = dfs.apply(highlight_max)
    dfs = dfs.format("{:2.2f}")
    dfs = dfs.set_table_styles([
        dict(selector='thead th', props=[('text-align', 'center'), ('vertical-align', 'bottom')]),
        dict(selector='td', props=[('text-align', 'center'), ('padding', '0.5em 1.5em')]),
    ])
    return dfs

### Performance Comparison

Using the specified logs directory, we access the TensorBoard event files and extract the recorded validation/test accuracies for the SNLI dataset.

In [8]:
LOGS_GLOB = "*/*"
LOGS_PATTERN = r"([^\/]+)\/((?:version_\d+)|(?:eval))"

nli_df = pd.DataFrame(columns=['dev', 'test'])
for log_name in glob.glob(os.path.join(LOGS_DIR, LOGS_GLOB)):
    # Extract model & version name from logfile name
    res = re.search(LOGS_PATTERN, log_name)
    model_name = res.group(1)
    is_test = res.group(2) == "eval"

    # Read the TFEvents file
    ea = EventAccumulator(log_name)
    ea.Reload()

    if is_test:
        # Read the test_acc value
        acc = ea.Scalars('test_acc')[0].value
    else:
        # Read all val_acc values and pick the maximum
        acc = max(map(lambda e: e.value, ea.Scalars('val_acc')))

    # Convert accuracy to percentage
    acc *= 100

    col_name = 'test' if is_test else 'dev'
    if model_name not in nli_df.index:
        acc_df = pd.DataFrame.from_dict({col_name: [acc]})
        acc_df.index = [model_name]

        nli_df = pd.concat((nli_df, acc_df))
    else:
        nli_df.at[model_name, col_name] = acc

We create a helper function that calculates the micro and macro accuracy of the validation sets used in the SentEval tasks.

The input of this function is a dataframe corresponding to the SentEval results for a given model, along with its name.

In [9]:
def calculate_accuracy(df: pd.DataFrame, name: str) -> pd.DataFrame:
    # Filter out columns that don't have a validation accuracy
    # This is the case in non-classification tasks, such as SICK-R and STS14
    df = df.loc[:, df.loc['devacc'].notnull()]

    # Extract the validation accuracy for each task
    val_acc = df.loc['devacc']

    # Calculate the weighing factor for micro-accuracy
    n_val = df.loc['ndev']
    weight = n_val / n_val.sum()

    # Calculate the macro and micro accuracy
    macro = val_acc.mean()
    micro = (val_acc * weight).sum()

    # Return metrics as dataframe
    acc_dict = {'micro': [micro], 'macro': [macro]}
    acc_df = pd.DataFrame.from_dict(acc_dict)
    acc_df.index = [name]
    return acc_df
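
As a small worked example of the difference between the two metrics, the sketch below computes both for two made-up tasks. The task names, accuracies and validation-set sizes are hypothetical and are not taken from the SentEval results.

```python
# Hypothetical validation accuracies (%) and validation-set sizes for two tasks
dev_acc = {"task_a": 80.0, "task_b": 90.0}
n_dev = {"task_a": 1000, "task_b": 3000}

# Macro accuracy: unweighted mean over tasks
macro = sum(dev_acc.values()) / len(dev_acc)

# Micro accuracy: each task is weighted by its share of validation samples
total = sum(n_dev.values())
micro = sum(dev_acc[task] * n_dev[task] / total for task in dev_acc)

print(macro, micro)  # 85.0 87.5
```

The micro accuracy is pulled towards the larger task, whereas the macro accuracy treats every task equally.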

Using the specified logs directory once again, we access the SentEval JSON files and extract the (micro/macro) validation accuracy.

In [10]:
RESULTS_GLOB = "results_*.json"
RESULTS_PATTERN = r"results_([^\.]+)\.json"

transfer_df = pd.DataFrame()
for results_file in glob.glob(os.path.join(LOGS_DIR, RESULTS_GLOB)):
    # Extract model name from file name
    res = re.search(RESULTS_PATTERN, results_file)
    model_name = res.group(1)

    # Convert json to dataframe
    df = pd.read_json(results_file)
    # Calculate accuracies and create dataframe row
    model_accs = calculate_accuracy(df, model_name)

    # Append row to transfer results dataframe
    transfer_df = pd.concat((transfer_df, model_accs))

Finally, we concatenate the two "subtables" to produce Table 3 from the original paper.

In [11]:
performance_df = pd.concat((nli_df, transfer_df), axis=1, keys=['NLI', 'Transfer'])
# reindex returns a new dataframe, so the result has to be assigned
performance_df = performance_df.reindex(MODEL_ORDER)

format_df(performance_df)

| Model | NLI dev | NLI test | Transfer micro | Transfer macro |
|---|---|---|---|---|
| baseline | 65.72 | 65.33 | 80.70 | 79.09 |
| lstm | 81.45 | 81.26 | 78.19 | 77.56 |
| bilstm | 80.87 | 80.66 | 81.07 | 80.54 |
| bilstm-max | 84.37 | 83.85 | 82.53 | 82.06 |


From this table we can see that the BiLSTM architecture with max-pooling is the one that performs the best, both in the SNLI task and on SentEval.

However, it is interesting to notice that even the baseline approach reaches close to 80% accuracy in the transfer tasks (80.70 micro, 79.09 macro).
More precisely, it even outperforms the LSTM (78.19 micro, 77.56 macro), despite requiring a much less resource-demanding pre-training.

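Finally, the performance gains from applying max-pooling instead of using the last hidden states are notable: compared to the BiLSTM, accuracy rises by more than 3 points on SNLI (80.87 to 84.37 on dev, 80.66 to 83.85 on test) and by about 1.5 points on the transfer tasks.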
Intuitively, it seems as if the max-pooling operation is able to "focus" on the important words in the sentence, which improves the representation.
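
To illustrate how the pooling strategies differ, here is a minimal sketch in plain Python. The hidden states are made-up numbers, the function names are hypothetical, and the example is simplified to a single direction (the actual BiLSTM encoders concatenate forward and backward states). In the notebook's baseline, the averaging is applied to word embeddings rather than to hidden states.

```python
from typing import List

# Hypothetical hidden states for a 3-token sentence, 2 dimensions per token
hidden_states: List[List[float]] = [
    [1.0, 5.0],
    [3.0, 0.0],
    [2.0, 1.0],
]

def mean_pool(states: List[List[float]]) -> List[float]:
    # Average each dimension over all tokens
    return [sum(dim) / len(states) for dim in zip(*states)]

def last_state(states: List[List[float]]) -> List[float]:
    # Keep only the state of the final token
    return states[-1]

def max_pool(states: List[List[float]]) -> List[float]:
    # Keep the largest value of each dimension over all tokens
    return [max(dim) for dim in zip(*states)]

print(mean_pool(hidden_states))   # [2.0, 2.0]
print(last_state(hidden_states))  # [2.0, 1.0]
print(max_pool(hidden_states))    # [3.0, 5.0]
```

In this toy case the max-pooled vector keeps the strongest activation of each dimension no matter which token produced it, while the last state only reflects the final token.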

### SentEval Comparison

By iterating over the SentEval results again, we now aggregate all performance information across all models to generate Table 4 from the original paper.

In [12]:
RESULTS_GLOB = "results_*.json"
RESULTS_PATTERN = r"results_([^\.]+)\.json"

senteval_df = pd.DataFrame()
for results_file in glob.glob(os.path.join(LOGS_DIR, RESULTS_GLOB)):
    # Extract model name from file name
    res = re.search(RESULTS_PATTERN, results_file)
    model_name = res.group(1)

    # Convert json to dataframe
    df = pd.read_json(results_file)

    # Select accuracy for classification tasks (except MRPC)
    df_class = df.loc[['acc'], df.loc['acc'].notnull()].drop('MRPC', axis=1)
    df_class.index = [model_name]

    # Select accuracy and F1 score for the MRPC task
    mrpc_cols = pd.MultiIndex.from_product((['MRPC'], ['acc','f1']))
    mrpc_vals = df.loc[['acc', 'f1'], 'MRPC'].array
    df_mrpc = pd.DataFrame([mrpc_vals], columns=mrpc_cols, index=[model_name])

    # Select pearson value for the SICK-R task
    df_sickr = pd.DataFrame(df.loc[['pearson'], 'SICKRelatedness'])
    df_sickr.index = [model_name]

    # Select pearson value(s) for the STS14 task
    dict_sts14 = df.loc['all', 'STS14']['pearson']
    sts14_cols = pd.MultiIndex.from_product((['STS14'], dict_sts14.keys()))
    df_sts14 = pd.DataFrame([dict_sts14.values()], columns=sts14_cols, index=[model_name])

    # Concat all tasks to one dataframe row
    scores_df = pd.concat((df_class, df_mrpc, df_sickr, df_sts14), axis=1)

    # Append row to SentEval scores dataframe
    senteval_df = pd.concat((senteval_df, scores_df), axis=0)

TASK_ORDER = [
    'MR', 'CR', 'SUBJ', 'MPQA', 'SST2', 'TREC',
    ('MRPC', 'acc'), ('MRPC', 'f1'),
    'SICKRelatedness', 'SICKEntailment',
    ('STS14', 'mean'), ('STS14', 'wmean')
]

TASK_NAMES = TASK_ORDER.copy()
TASK_NAMES[4] = 'SST'
TASK_NAMES[6] = 'MRPC<br>accuracy' ; TASK_NAMES[7] = 'MRPC<br>F1-score'
TASK_NAMES[-4] = 'SICK-R' ; TASK_NAMES[-3] = 'SICK-E'
TASK_NAMES[-2] = 'STS14<br>average<br>pearson' ; TASK_NAMES[-1] = 'STS14<br>weighted<br>pearson'

senteval_df = senteval_df.reindex(MODEL_ORDER).T.reindex(TASK_ORDER).T
senteval_df.columns = TASK_NAMES  # rename columns to match the paper's task names

format_df(senteval_df)

| Model | MR | CR | SUBJ | MPQA | SST | TREC | MRPC accuracy | MRPC F1-score | SICK-R | SICK-E | STS14 average pearson | STS14 weighted pearson |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| baseline | 75.11 | 79.31 | 90.57 | 84.77 | 78.25 | 81.00 | 72.52 | 81.24 | 0.80 | 78.73 | 0.45 | 0.46 |
| lstm | 72.49 | 76.72 | 86.61 | 85.03 | 76.72 | 78.60 | 72.41 | 81.51 | 0.86 | 84.45 | 0.55 | 0.56 |
| bilstm | 73.02 | 78.49 | 89.95 | 84.98 | 78.42 | 86.40 | 71.13 | 80.08 | 0.87 | 83.84 | 0.56 | 0.58 |
| bilstm-max | 75.40 | 81.14 | 91.45 | 85.47 | 79.19 | 87.20 | 74.09 | 81.72 | 0.88 | 85.24 | 0.63 | 0.65 |


In this more detailed table, we can verify that the BiLSTM architecture with max-pooling is indeed the best among the ones evaluated, as it achieves the highest score on every task.

However, it is still surprising to see the baseline encoder surpass the other two LSTM approaches in a few tasks (for example MR, CR and SUBJ).
Moreover, the baseline encoder appears to generalize quite well to the transfer tasks that are dissimilar to SNLI,
while it lags behind the other models in tasks such as SICK-R, SICK-E and STS14, which fall under the categories of semantic similarity and NLI.

Overall, the differences in performance between the uni- and the bi-directional LSTM encoders are small, which is puzzling
given the theoretical advantage of processing a piece of text in both left-to-right and right-to-left directions. It is possible that in
the latter case, the additional information sometimes induces confusion rather than resolving ambiguities.

## Error Analysis
Finally, we can take a closer look into some more concrete examples that showcase where each model outperforms the rest, and where each model fails.

Before we begin with our analysis, let's define a helper function that will call the inference method for all models.

In [13]:
def query_models(*pairs: Tuple[str, str]):
    indices, rows = [], defaultdict(list)
    for premise, hypothesis in pairs:
        indices += [f'{premise}<br>{hypothesis}']
        for model in MODEL_ORDER:
            rows[model] += [inference(model, premise, hypothesis)]

    df = pd.DataFrame.from_dict(rows)
    df.index = indices

    df = df.style.set_table_styles([
        dict(selector='th, td', props=[('text-align', 'center'), ('padding', '0.5em 1em')]),
    ])

    display(df)

In the following examples, we will be querying the models with three pairs of sentences:
- the first one having a true class of **entailment**,
- the second one being a **contradiction**, and
- the last one being a **neutral** pair.

It should be noted that these pairs were not engineered to be fully representative of each domain, as that would require some
closer insight into the actual sentences that comprise them. However, they can be considered possible sentences that could appear
in the corresponding datasets, so they should give us a general idea of how the models perform on each one.

### Movie Reviews (MR)

In [14]:
query_models(
    ("It was entertaining and full of suspense.", "The movie was enjoyable and kept me hooked!"),
    ("Boring, I would not recommend it...", "I definitely suggest you watch it!"),
    ("I really liked the cinematography.", "The actor was handsome.")
)

| Premise<br>Hypothesis | baseline | lstm | bilstm | bilstm-max |
|---|---|---|---|---|
| It was entertaining and full of suspense.<br>The movie was enjoyable and kept me hooked! | neutral | entailment | contradiction | contradiction |
| Boring, I would not recommend it...<br>I definitely suggest you watch it! | entailment | contradiction | contradiction | contradiction |
| I really liked the cinematography.<br>The actor was handsome. | neutral | neutral | neutral | neutral |


In this domain, it appears that all models are able to detect neutrality, and all LSTM architectures detect contradiction.

However, for entailment only the unidirectional LSTM was able to correctly identify the true class.

All cases should be fairly simple for a human annotator, so it is surprising that the encoders fail to capture that information.

Given the ~75% accuracy reported for MR in the table of the previous section, these samples confirm that there is room for improvement overall.

### Customer Reviews (CR)

In [15]:
query_models(
    ("Nice design, but the functionality is limited.", "It is aesthetically pleasing, but I cannot do much with it."),
    ("This is revolutionary!", "I find it completely useless."),
    ("I recommend this product.", "I bought it because I couldn't find any alternatives.")
)

| Premise<br>Hypothesis | baseline | lstm | bilstm | bilstm-max |
|---|---|---|---|---|
| Nice design, but the functionality is limited.<br>It is aesthetically pleasing, but I cannot do much with it. | neutral | entailment | neutral | contradiction |
| This is revolutionary!<br>I find it completely useless. | contradiction | contradiction | contradiction | contradiction |
| I recommend this product.<br>I bought it because I couldn't find any alternatives. | neutral | neutral | contradiction | contradiction |


In these cases we see even less accurate results, which is not fully in accordance with the slightly higher performance reported in the table.

While the contradiction is easy to predict, the entailment pair is once again correctly identified only by the LSTM.

This is surprising, given that the entailment pair is seemingly simple.

On an interesting note, the neutral pair is not detected by the BiLSTM architectures, despite them being more complex in theory.

Overall, so far we have seen a slight bias towards classifying pairs as a contradiction, which could mean that the models were not punished
for doing so during training, possibly due to bias-confirming samples.
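
To put a rough number on this impression, the sketch below tallies the predictions from the MR and CR tables above. The labels are copied from those tables; with only 24 predictions, the count is anecdotal rather than a measurement of bias.

```python
from collections import Counter

# Predictions copied from the MR and CR tables, one list per sentence pair
# (column order: baseline, lstm, bilstm, bilstm-max)
predictions = {
    "MR": [
        ["neutral", "entailment", "contradiction", "contradiction"],
        ["entailment", "contradiction", "contradiction", "contradiction"],
        ["neutral", "neutral", "neutral", "neutral"],
    ],
    "CR": [
        ["neutral", "entailment", "neutral", "contradiction"],
        ["contradiction", "contradiction", "contradiction", "contradiction"],
        ["neutral", "neutral", "contradiction", "contradiction"],
    ],
}

# Count how often each class was predicted across both domains and all models
counts = Counter(
    label
    for rows in predictions.values()
    for row in rows
    for label in row
)

print(counts["contradiction"], counts["neutral"], counts["entailment"])  # 12 9 3
```

Since each true class accounts for 8 of the 24 predictions, a perfectly accurate set of models would predict every class 8 times; here contradiction is predicted 12 times and entailment only 3 times.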

### Subjectivity Status (SUBJ)

In [16]:
query_models(
    ("Everyone agrees that he is the best.", "His skills are admirable."),
    ("This is so pretty!", "It hardly draws my attention."),
    ("I love the colors of the rainbow.", "The sky is blue.")
)

| Premise<br>Hypothesis | baseline | lstm | bilstm | bilstm-max |
|---|---|---|---|---|
| Everyone agrees that he is the best.<br>His skills are admirable. | neutral | entailment | entailment | entailment |
| This is so pretty!<br>It hardly draws my attention. | entailment | contradiction | contradiction | contradiction |
| I love the colors of the rainbow.<br>The sky is blue. | contradiction | contradiction | contradiction | neutral |


In this domain we see more accurate predictions for entailment and contradiction, in line with the roughly 87-91% SUBJ accuracy reported for the LSTM-based models.

However, for the selected samples the baseline does not identify any of the pairs correctly, in contrast to its reported accuracy.

It is also interesting to see that the BiLSTM encoder with max-pooling is the only one that properly detects neutrality.
One possible explanation could be that the words used in both sentences were semantically similar (referring to colors),
which caused the other architectures to miss the fact that the two sentences were about different entities.

### Opinion Polarity (MPQA)

In [17]:
query_models(
    ("Everyone appraised his good work.", "His efforts were highly appreciated."),
    ("Their actions were harshly criticized.", "They received positive feedback."),
    ("It was a good movie.", "The cinema was spacious.")
)

| Premise<br>Hypothesis | baseline | lstm | bilstm | bilstm-max |
|---|---|---|---|---|
| Everyone appraised his good work.<br>His efforts were highly appreciated. | neutral | entailment | entailment | entailment |
| Their actions were harshly criticized.<br>They received positive feedback. | entailment | neutral | contradiction | contradiction |
| It was a good movie.<br>The cinema was spacious. | contradiction | neutral | neutral | neutral |


In this case we see great performance for the BiLSTM approaches, as they are able to identify the class of all three pairs.

However, given the simplicity of all sentences used, one would anticipate better performance for all models.

One possible explanation could be that because the sentences could be viewed as semantically related, given that they are
referring to characteristics or audience (dis)approval of a certain entity, the encoders have a hard time distinguishing
between cases where the two contradict each other or are simply pointing to different objects.

### Sentiment Analysis (SST)

In [18]:
query_models(
    ("I am feeling passionate today!", "I have so much motivation to get work done!"),
    ("I like sunbathing.", "I hate the sunlight."),
    ("This is a nice piece of cake.", "The house looks scary.")
)

| Premise<br>Hypothesis | baseline | lstm | bilstm | bilstm-max |
|---|---|---|---|---|
| I am feeling passionate today!<br>I have so much motivation to get work done! | neutral | neutral | neutral | neutral |
| I like sunbathing.<br>I hate the sunlight. | neutral | contradiction | contradiction | contradiction |
| This is a nice piece of cake.<br>The house looks scary. | contradiction | contradiction | contradiction | contradiction |


In this domain we see the most unexpected results, with all classifiers failing to identify the entailment pair.

Additionally, they seem heavily biased towards contradiction, as they also misjudge the neutral pair.

Although the LSTM encoders identify the true contradiction pair, the baseline approach misses it and predicts "neutral" instead.

### Question Classification (TREC)

In [19]:
query_models(
    ("Is this the tallest mountain?", "Is this the highest peak?"),
    ("Is the sky blue?", "Is the sky black?"),
    ("Who won the championship?", "Who is the president of the United States?")
)

| Premise<br>Hypothesis | baseline | lstm | bilstm | bilstm-max |
|---|---|---|---|---|
| Is this the tallest mountain?<br>Is this the highest peak? | neutral | entailment | entailment | entailment |
| Is the sky blue?<br>Is the sky black? | entailment | contradiction | contradiction | contradiction |
| Who won the championship?<br>Who is the president of the United States? | contradiction | contradiction | neutral | contradiction |


In this case, we observe results that mostly confirm the anticipated performance given the detailed SentEval table.

The LSTM architectures are mostly performing well, with some of them failing at the neutral case.

On the contrary, the baseline approach fails to correctly identify any of the given pairs.

Again, we see that some models have a difficult time discerning between similar semantic categories and the true relation of two sentences.

### Paraphrase Detection (MRPC)

In [20]:
query_models(
    ("The city vividly celebrates Christmas.", "There are festive events throughout the city."),
    ("The football stadium was packed.", "Nobody attended the football match."),
    ("The museum was definitely worth visiting.", "The sea was crystal clear.")
)

| Premise<br>Hypothesis | baseline | lstm | bilstm | bilstm-max |
|---|---|---|---|---|
| The city vividly celebrates Christmas.<br>There are festive events throughout the city. | entailment | entailment | entailment | entailment |
| The football stadium was packed.<br>Nobody attended the football match. | neutral | contradiction | contradiction | contradiction |
| The museum was definitely worth visiting.<br>The sea was crystal clear. | contradiction | contradiction | contradiction | contradiction |


While entailment and contradiction are generally easy to detect for almost all models, we see that none of them detect the neutral pair.

In general, the models seem to be biased _against_ the latter case, as we only see one prediction for "neutral", which was not even a correct classification.

### Semantic Similarity (SICK-R / STS14)

In [21]:
query_models(
    ("I am solving exercises for revision.", "I am studying for my exam."),
    ("I like sports!", "I am not very athletic."),
    ("Chemistry is my favorite class!", "My portfolio is empty.")
)

| Premise<br>Hypothesis | baseline | lstm | bilstm | bilstm-max |
|---|---|---|---|---|
| I am solving exercises for revision.<br>I am studying for my exam. | neutral | contradiction | contradiction | neutral |
| I like sports!<br>I am not very athletic. | neutral | contradiction | contradiction | contradiction |
| Chemistry is my favorite class!<br>My portfolio is empty. | contradiction | contradiction | contradiction | contradiction |


Once again, we see a scenario where contradiction dominates the predictions of all models.

This further demonstrates that the models come to their breaking point when trying to compare sentences that are semantically close,
as the similarity may often overshadow the subtle information that could shift the meaning from entailment to contradiction.

### Natural Language Inference (SICK-E)

In [22]:
query_models(
    ("Two kids are kicking a ball.", "The kids are playing football."),
    ("The dog is eating food.", "The dog is sleeping."),
    ("The weather is nice today!", "I ate pasta for lunch.")
)

| Premise<br>Hypothesis | baseline | lstm | bilstm | bilstm-max |
|---|---|---|---|---|
| Two kids are kicking a ball.<br>The kids are playing football. | contradiction | neutral | neutral | neutral |
| The dog is eating food.<br>The dog is sleeping. | contradiction | contradiction | contradiction | contradiction |
| The weather is nice today!<br>I ate pasta for lunch. | contradiction | contradiction | contradiction | contradiction |


This is very similar to the previous case, only that now the results are more unexpected, as these sentences are the ones that most closely resemble
the task on which these models were trained. Once again, contradiction is dominant, while the models hesitate to label the
entailment pair as such, and instead mostly opt for a "neutral" classification.