# Demo notebook

In this notebook you can run the models on the SNLI dataset and explore some shortcomings of the different models.

Let's get to it!
First, let's import the necessary libraries.

In [1]:
import os
import spacy
import pytorch_lightning as pl
import numpy as np
from easydict import EasyDict as edict

import torch
import torch.nn as nn

import torchtext
from torchtext.data import Field
from torchtext.datasets import SNLI

from train import InferSent
from dataset import create_iterators

Load the data and the `Field` objects that let us preprocess the input text.

In [2]:
TEXT, LABEL, train_iter, val_iter, test_iter = create_iterators(batch_size = 4, return_label = True) # feel free to change the batch size to your computer's capabilities 



Load the models

In [5]:
biLSTMPool_PATH = "/Users/blazejmanczak/Desktop/School/Year1/Block5/acts/Practical/saved_models/biLstmPool/biLstmPool824.ckpt"
biLstmPool = InferSent.load_from_checkpoint(biLSTMPool_PATH)

[INFO]: Freezing the embeddings


In [4]:
uniLSTM_PATH ="saved_models/uniLSTM/uniLSTM_8152.ckpt"
uniLstm = InferSent.load_from_checkpoint(uniLSTM_PATH)

[INFO]: Freezing the embeddings


In [4]:
word_embs_PATH = "saved_models/word_embs/word_embs642.ckpt"
#hparams_file = "saved_models/word_embs/hparams.yaml"
word_embs = InferSent.load_from_checkpoint(checkpoint_path = word_embs_PATH)

## Model evaluation 

We evaluate the models on the SNLI task as well as with the SentEval framework.
By running `python eval.py --checkpoint_path your_checkpoint_path` we can obtain the results on the SentEval tasks.

An example evaluation on SNLI can be found further down.
To save time, the table below summarizes the results:

| Model | dim | NLI dev | NLI test | Transfer micro | Transfer macro |
| --- | --- | --- | --- | --- | --- |
| AWE | 300 |  | 0.64 | 84.12 | 79.78 |
| uniLSTM | 2048 | 0.64 | 0.64 | 83.38 | 80.05 |
| bi-LSTM | 4096 | 0.64 | 0.64 | 85.94 | 82.63 |
| biLSTM-max | 4096 | 82.56 | 82.40 | 86.98 | 83.82 |

One can see that for NLI the test accuracies are very close to the validation accuracies. This might be caused by conservative early stopping with a patience of only 3. The training plots suggest that training for a couple more epochs might have been beneficial on the SNLI dataset.

However, this early stopping seems to benefit the performance on the transfer tasks.
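
To make the role of the patience parameter concrete, here is a minimal sketch of patience-based early stopping in plain Python. The helper `early_stop_epoch` and the validation curve are made up for illustration; this is not the training code used for the models in this notebook.

```python
def early_stop_epoch(val_accs, patience=3):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation accuracy has failed to improve
    for `patience` consecutive epochs. If that never happens, the index
    of the last epoch is returned.
    """
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, acc in enumerate(val_accs):
        if acc > best:
            best = acc
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_accs) - 1


# Hypothetical validation curve: a plateau followed by a late improvement.
val_accs = [0.70, 0.78, 0.80, 0.80, 0.79, 0.80, 0.83]

stop_patience_3 = early_stop_epoch(val_accs, patience=3)
stop_patience_4 = early_stop_epoch(val_accs, patience=4)
print("patience=3 stops at epoch", stop_patience_3)
print("patience=4 stops at epoch", stop_patience_4)
```

On this toy curve a patience of 3 stops during the plateau, while a patience of 4 survives long enough to see the late improvement. Whether the real training runs behave like this depends on their actual validation curves.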


In [None]:
AWE, uni-LSTM, bi-LSTM, bi-LSTM Max

In [4]:
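### Model evaluation - SNLI

def evaluate_SNLI(model, iterator):
    """Returns the accuracy of `model` over all examples in `iterator`."""

    correct = 0
    count = 0 # number of examples seen

    with torch.no_grad(): # no gradients are needed for evaluation
        for batch in iterator:
            preds = model(batch)
            labels = batch.label - 1 # vocab index 0 is <unk>, so shift the labels to 0..2
            # sum (not mean) the correct predictions, so that dividing by count gives the accuracy
            correct += (preds.argmax(dim=-1) == labels).float().sum()
            count += batch.premise[1].shape[0] # add the batch size

    return correct/count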
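
Accuracy over a dataset is the number of correct predictions divided by the number of examples. The sketch below uses made-up predictions and a hypothetical helper `accuracy_from_batches` to show the accumulation pattern in plain Python, and why averaging per-batch accuracies gives a different number when batches differ in size.

```python
def accuracy_from_batches(batches):
    """Accuracy over all examples, given (predictions, labels) per batch."""
    correct = 0
    total = 0
    for preds, labels in batches:
        # Count correct predictions in this batch.
        correct += sum(p == y for p, y in zip(preds, labels))
        total += len(labels)
    return correct / total


# Two hypothetical batches; the last one is smaller, as often happens
# at the end of an epoch.
batches = [
    ([0, 1, 2, 1], [0, 1, 2, 2]),  # 3 of 4 correct
    ([2, 0], [1, 1]),              # 0 of 2 correct
]

overall = accuracy_from_batches(batches)   # 3 correct out of 6 examples
mean_of_batch_means = (3 / 4 + 0 / 2) / 2  # weights both batches equally

print("per-example accuracy:", overall)
print("mean of batch means: ", mean_of_batch_means)
```

The two numbers agree only when every batch has the same size, so counting correct predictions and dividing once at the end is the safer pattern.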
        

In [None]:
evaluate_SNLI(word_embs, test_iter)



In [15]:
train_iter

<torchtext.data.iterator.BucketIterator at 0x7f835e758d10>

### Model evaluation - SentEval

By running `python eval.py --checkpoint_path your_checkpoint_path` we get the dictionary with model performance for each dataset. To get the micro and macro aggregate scores we can run the following function:

In [18]:
sentEval_awe = {
                    'MR': {'devacc': 63.29, 'acc': 61.43, 'ndev': 74, 'ntest': 74},
                    'CR': {'devacc': 80.18, 'acc': 80.69, 'ndev': 3775, 'ntest': 3775},
                    'MPQA': {'devacc': 87.88, 'acc': 87.76, 'ndev': 10606, 'ntest': 10606},
                    'SUBJ': {'devacc': 99.6, 'acc': 99.6, 'ndev': 5020, 'ntest': 5020},
                    'SST2': {'devacc': 78.67, 'acc': 79.74, 'ndev': 872, 'ntest': 1821},
                    'TREC': {'devacc': 75.09, 'acc': 84.0, 'ndev': 5452, 'ntest': 500},
                    'MRPC': {'devacc': 72.94, 'acc': 72.12, 'f1': 80.84, 'ndev': 4076, 'ntest': 1725},
                    'SICKEntailment': {'devacc': 80.6, 'acc': 78.2, 'ndev': 500, 'ntest': 4927},
                    'SICKRelatedness': {'devpearson': 0.7978672893567289, 'pearson': 0.7992130424967328,
                                        'spearman': 0.7187625218088702, 'mse': 0.36772601686375367,'ndev': 500, 'ntest': 4927}}

sentEval_uni = {
                   'MR': {'devacc': 68.36, 'acc': 64.82, 'ndev': 74, 'ntest': 74},
                   'CR': {'devacc': 79.12, 'acc': 78.54, 'ndev': 3775, 'ntest': 3775},
                   'MPQA': {'devacc': 88.13, 'acc': 88.25, 'ndev': 10606, 'ntest': 10606},
                   'SUBJ': {'devacc': 99.61, 'acc': 99.58, 'ndev': 5020, 'ntest': 5020},
                   'SST2': {'devacc': 78.21, 'acc': 79.3, 'ndev': 872, 'ntest': 1821},
                   'TREC': {'devacc': 71.0, 'acc': 82.4, 'ndev': 5452, 'ntest': 500},
                   'MRPC': {'devacc': 72.96, 'acc': 71.25, 'f1': 79.76, 'ndev': 4076, 'ntest': 1725}, 
                   'SICKEntailment': {'devacc': 83.0, 'acc': 84.49, 'ndev': 500, 'ntest': 4927}, 
                   'SICKRelatedness': {'devpearson': 0.8571529614695601, 'pearson': 0.8623872347038861,
                                       'spearman': 0.798903697075263, 'mse': 0.26324771018968957, 'ndev': 500, 'ntest': 4927}}
sentEval_bi = {
                    'MR': {'devacc': 71.22, 'acc': 75.71, 'ndev': 74, 'ntest': 74},
                    'CR': {'devacc': 79.78, 'acc': 79.87, 'ndev': 3775, 'ntest': 3775},
                    'MPQA': {'devacc': 88.11, 'acc': 87.93, 'ndev': 10606, 'ntest': 10606},
                    'SUBJ': {'devacc': 99.6, 'acc': 99.6, 'ndev': 5020, 'ntest': 5020},
                    'SST2': {'devacc': 79.13, 'acc': 80.12, 'ndev': 872, 'ntest': 1821},
                    'TREC': {'devacc': 83.33, 'acc': 88.4, 'ndev': 5452, 'ntest': 500},
                    'MRPC': {'devacc': 74.44, 'acc': 71.88, 'f1': 80.34, 'ndev': 4076, 'ntest': 1725},
                    'SICKEntailment': {'devacc': 85.4, 'acc': 84.11, 'ndev': 500, 'ntest': 4927},
                    'SICKRelatedness': {'devpearson': 0.8673686615426182, 'pearson': 0.871446993869367,
                                        'spearman': 0.8117278599098882, 'mse': 0.24626234335946046, 'ndev': 500, 'ntest': 4927}}

sentEval_biPool = {
                    'MR': {'devacc': 72.74, 'acc': 70.71, 'ndev': 74, 'ntest': 74},
                     'CR': {'devacc': 82.58, 'acc': 82.3, 'ndev': 3775, 'ntest': 3775},
                     'MPQA': {'devacc': 88.88, 'acc': 89.04, 'ndev': 10606, 'ntest': 10606},
                     'SUBJ': {'devacc': 99.6, 'acc': 99.6, 'ndev': 5020, 'ntest': 5020},
                     'SST2': {'devacc': 80.62, 'acc': 81.05, 'ndev': 872, 'ntest': 1821},
                     'TREC': {'devacc': 84.78, 'acc': 89.2, 'ndev': 5452, 'ntest': 500},
                     'MRPC': {'devacc': 75.19, 'acc': 74.09, 'f1': 81.69, 'ndev': 4076, 'ntest': 1725},
                     'SICKEntailment': {'devacc': 86.2, 'acc': 85.69, 'ndev': 500, 'ntest': 4927},
                     'SICKRelatedness': {'devpearson': 0.8877437481402046, 'pearson': 0.8847378389815521,
                                         'spearman': 0.8247103664125432, 'mse': 0.2215079583066481, 'ndev': 500, 'ntest': 4927}}




In [28]:
def eval_senteval_dic(dic):
    
    datasets = list(dic.keys())
    
    accs = []
    counts = []
    result = {"micro": 0, "macro": 0}
    
    for ds in datasets:
        temp_dic = dic[ds]
        
        try:
            accs.append(temp_dic["devacc"])
            counts.append(temp_dic["ndev"])
            #accs.append(temp_dic["acc"])
            #counts.append(temp_dic["ntest"])
        except KeyError: # for tasks that don't have "devacc" as a key (e.g. SICKRelatedness)
            continue
            
    result["macro"] = np.mean(accs)
    result["micro"] = np.average(accs, weights = counts)
    
    return result

print("Baseline of averaging the GloVe 840 word embeddings with SpaCy tokenization", eval_senteval_dic(sentEval_awe))
print("Uni-directional LSTM Transfer accuracies", eval_senteval_dic(sentEval_uni))
print("Bi-directional LSTM Transfer accuracies", eval_senteval_dic(sentEval_bi))
print("Bi-directional LSTM with max pooling Transfer accuracies", eval_senteval_dic(sentEval_biPool))


Baseline of averaging the GloVe 840 word embeddings with SpaCy tokenization {'micro': 84.11537777777777, 'macro': 79.78125}
Uni-directional LSTM Transfer accuracies {'micro': 83.37980905349795, 'macro': 80.04875}
Bi-directional LSTM Transfer accuracies {'micro': 85.93779094650208, 'macro': 82.62625}
Bi-directional LSTM with max pooling Transfer accuracies {'micro': 86.97518288065842, 'macro': 83.82374999999999}
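
The difference between the two aggregates is easiest to see on a tiny example. The sketch below recomputes both in plain Python for two made-up tasks; the helper `micro_macro` and the task names are hypothetical, and only the `devacc` / `ndev` keys follow the dictionaries above.

```python
def micro_macro(results):
    """Macro: unweighted mean of task accuracies.
    Micro: mean weighted by the number of dev examples per task."""
    accs = [r["devacc"] for r in results.values()]
    counts = [r["ndev"] for r in results.values()]
    macro = sum(accs) / len(accs)
    micro = sum(a * n for a, n in zip(accs, counts)) / sum(counts)
    return {"micro": micro, "macro": macro}


# Made-up tasks: a small hard one and a large easy one.
toy_results = {
    "small_task": {"devacc": 60.0, "ndev": 100},
    "large_task": {"devacc": 90.0, "ndev": 900},
}

scores = micro_macro(toy_results)
print(scores)
```

The macro score treats both tasks equally, while the micro score is pulled towards the large task. This is why the micro and macro columns can rank models differently.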


In [30]:
## Off the record: AWE without tokenization
dic = {
    'MRPC': {'ntest': 1725, 'f1': 81.21, 'acc': 72.64, 'devacc': 72.82, 'ndev': 4076},
    'CR': {'ndev': 3775, 'acc': 79.63, 'devacc': 80.29, 'ntest': 3775}, 
    'MPQA': {'ndev': 10606, 'acc': 88.0, 'devacc': 87.82, 'ntest': 10606},
    'SICKEntailment': {'ndev': 500, 'acc': 79.01, 'devacc': 81.0, 'ntest': 4927},
    'SST2': {'ndev': 872, 'acc': 79.85, 'devacc': 79.01, 'ntest': 1821}, 
    'SUBJ': {'ndev': 10000, 'acc': 91.69, 'devacc': 91.77, 'ntest': 10000}, 
    'MR': {'ndev': 10662, 'acc': 78.05, 'devacc': 78.01, 'ntest': 10662}, 
    'TREC': {'ndev': 5452, 'acc': 84.8, 'devacc': 76.1, 'ntest': 500}}

transfer_tasks = ['MR', 'CR', 'MPQA', 'SUBJ', 'SST2', 'TREC',
                      'MRPC', 'SICKEntailment', "SICKRelatedness"]
new_dic = {key:val for key,val in dic.items() if key in transfer_tasks }

print("Bonus: AWE without tokenization: ", eval_senteval_dic(new_dic))

Bonus: AWE without tokenization:  {'micro': 82.82142067344317, 'macro': 80.85249999999999}


Somewhat unexpectedly, we see that simply averaging the GloVe word embeddings is a very strong baseline on the transfer tasks, performing similarly to the uni-directional LSTM.
One should note that the difference in accuracies on SNLI is much larger. This suggests that the highly parametrized models capture not only general-purpose sentence representations but also exploit some biases and artifacts of the dataset.

## Running the model on our own example
It would be nice to see the model in action on an arbitrary example. For that we need a small utility function that preprocesses the strings we supply.

In [5]:
def process_example(premise:str, hypothesis:str, model):
    """
    Processes one example.
    """
    
    premise = TEXT.process([TEXT.preprocess(premise)])
    hypothesis = TEXT.process([TEXT.preprocess(hypothesis)])
    
    d = edict({"premise": premise, "hypothesis":hypothesis})
    result = model(d).argmax(dim=-1)
    
    return result, LABEL.vocab.itos[1:][result.item()] # [1:] to omit the <unk> token

Now let's try it! Feel free to try your own!

In [None]:
premise = "I like this course."
hypothesis = "Amsterdam is a pretty city of course."

# premise = "Your premise"
# hypothesis = "Your hypothesis"

process_example(premise, hypothesis,
                 model= word_embs)

#process_example(premise, hypothesis, # kernel dies on my computer
#                model= biLstmPool)

## Error analysis: quantitative

Different model architectures come with different shortcomings. One interesting analysis is the impact of the length of the premise and hypothesis on a model's performance.

Can our models reliably encode unusually short or long sentences?
Let's find out!

In [36]:
import matplotlib.pyplot as plt
from collections import Counter

In [98]:
lengths_premise = torch.Tensor([])
lengths_hypothesis = torch.Tensor([])

for batch in train_iter:
    lengths_premise = torch.cat((lengths_premise, batch.premise[1]))
    lengths_hypothesis = torch.cat((lengths_hypothesis, batch.hypothesis[1]))

print("Premise mean and std length ", torch.mean(lengths_premise).item(), torch.std(lengths_premise).item())
print("Hypothesis mean and std length ", torch.mean(lengths_hypothesis).item(), torch.std(lengths_hypothesis).item())

print("-"*20)
print("10 and 90 percentile of the premise length", np.percentile(lengths_premise.numpy(), 10), np.percentile(lengths_premise.numpy(), 90))
print("10 and 90 percentile of the hypothesis length", np.percentile(lengths_hypothesis.numpy(), 10), np.percentile(lengths_hypothesis.numpy(), 90))


Premise mean and std length  14.144338607788086 6.079199314117432
Hypothesis mean and std length  8.275193214416504 3.232281446456909
--------------------
10 and 90 percentile of the premise length 8.0 22.0
10 and 90 percentile of the hypothesis length 5.0 12.0
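
The 10th and 90th percentiles serve as cut-offs for "unusually short" and "unusually long" sentences. As a rough sketch of what such a cut-off means, the plain-Python helper below (hypothetical, with made-up lengths) computes a percentile by linear interpolation between neighbouring sorted values, which is also the default rule in `np.percentile`.

```python
def percentile(values, q):
    """q-th percentile (0..100) with linear interpolation between sorted values."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Made-up sentence lengths (in tokens), not taken from SNLI.
lengths = [3, 5, 6, 7, 8, 9, 10, 12, 15, 20, 31]

p10 = percentile(lengths, 10)
p90 = percentile(lengths, 90)

# Examples at or beyond the cut-offs form the "short" and "long" buckets.
short_bucket = [n for n in lengths if n <= p10]
long_bucket = [n for n in lengths if n >= p90]

print("10th / 90th percentile:", p10, p90)
print("short:", short_bucket, "long:", long_bucket)
```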


In [103]:
def masked_accuracy(model, test_iter, threshold, length_source = "premise", comparison = "smaller"):
    """Calculates the accuracy for examples with length greater/smaller than (or equal to) threshold."""

    acc = 0
    count = 0

    for batch in test_iter:

        if length_source == "premise":
            if comparison == "smaller":
                mask = batch.premise[1] <= threshold
            else:
                mask = batch.premise[1] >= threshold
        else:
            if comparison == "smaller":
                mask = batch.hypothesis[1] <= threshold
            else:
                mask = batch.hypothesis[1] >= threshold

        if mask.any():
            batch = edict({"premise": (batch.premise[0][:, mask], batch.premise[1][mask]),
                          "hypothesis": ( batch.hypothesis[0][:, mask], batch.hypothesis[1][mask] ),
                          "label": batch.label[mask]})

            count += mask.sum()

            preds = model(batch).argmax(dim = -1)
            labels = batch.label - 1

            acc += (preds == labels).float().sum()

        #if count > 1000:
        #    break
        
    #print("Accuracy:", (acc/count *100).item(), "%")
    return (acc/count).item()    
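
The same idea, stripped of tensors and batching, can be sketched in a few lines of plain Python. The helper `accuracy_by_length` and the example triples are made up for illustration; each triple is `(length, prediction, label)`.

```python
def accuracy_by_length(examples, threshold, comparison="smaller"):
    """Accuracy over the examples whose length is <= or >= threshold.

    Returns None if no example falls into the selected bucket.
    """
    if comparison == "smaller":
        selected = [e for e in examples if e[0] <= threshold]
    elif comparison == "larger":
        selected = [e for e in examples if e[0] >= threshold]
    else:
        # Fail loudly on unknown values instead of guessing a direction.
        raise ValueError(f"unknown comparison: {comparison!r}")

    if not selected:
        return None

    correct = sum(pred == label for _, pred, label in selected)
    return correct / len(selected)


# Made-up (length, prediction, label) triples.
examples = [(4, 0, 0), (5, 1, 1), (9, 2, 2), (13, 1, 0), (20, 2, 1), (25, 0, 0)]

short_acc = accuracy_by_length(examples, threshold=5, comparison="smaller")
long_acc = accuracy_by_length(examples, threshold=13, comparison="larger")
print("short:", short_acc, "long:", long_acc)
```

Raising on an unrecognised `comparison` value is a deliberate choice in this sketch: a misspelled option is reported immediately rather than being treated as one of the two directions.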

In [104]:
models = [uniLstm]
values = [8, 22, 5, 12] # 10th and 90th percentile of the premise length, then of the hypothesis length
length_source = ["premise", "premise", "hypothesis", "hypothesis"]
comparison = ["smaller", "larger", "smaller", "larger"]


for model in models:
    print("Running for model", model)
    for threshold,source, comp in zip(values, length_source, comparison):
        print(f"Accuracy for parameters threshold={threshold},source={source}, comparison={comp}")
        print(masked_accuracy(model, test_iter, threshold, source, comp))
    print("\n")
    print("-"*50)
    print("\n")
    

Running for model InferSent(
  (model): LstmEncoders(
    (embedding): Embedding(33672, 300)
    (lstm): LSTM(300, 2048)
  )
  (loss_module): CrossEntropyLoss()
  (dense): Linear(in_features=8192, out_features=512, bias=True)
  (classify): Linear(in_features=512, out_features=3, bias=True)
)
Accuracy for parameters threshold=8,source=premise, comparison smaller
Accuracy: 81.04906463623047 %
tensor(0.8105)
Accuracy for parameters threshold=22,source=premise, comparison larger
Accuracy: 77.4307632446289 %
tensor(0.7743)
Accuracy for parameters threshold=5,source=hypothesis, comparison smaller
Accuracy: 84.81548309326172 %
tensor(0.8482)
Accuracy for parameters threshold=12,source=hypothesis, comparison largerr
Accuracy: 75.4619369506836 %
tensor(0.7546)


In [107]:
models = [word_embs]
values = [8, 22, 5, 12] # 10th and 90th percentile of the premise length, then of the hypothesis length
length_source = ["premise", "premise", "hypothesis", "hypothesis"]
comparison = ["smaller", "larger", "smaller", "larger"]


for model in models:
    print("Running for model", model)
    for threshold,source, comp in zip(values, length_source, comparison):
        print(f"Accuracy for parameters threshold={threshold},source={source}, comparison={comp}")
        print(masked_accuracy(model, test_iter, threshold, source, comp))
    print("\n")
    print("-"*50)
    print("\n")
    

Running for model InferSent(
  (model): AvgWordEmbeddings(
    (embedding): Embedding(33635, 300)
  )
  (loss_module): CrossEntropyLoss()
  (dense): Linear(in_features=1200, out_features=512, bias=True)
  (classify): Linear(in_features=512, out_features=3, bias=True)
)
Accuracy for parameters threshold=8,source=premise, comparison=smaller




Accuracy: 64.5516128540039 %
tensor(0.6455)
Accuracy for parameters threshold=22,source=premise, comparison=larger
Accuracy: 63.05244445800781 %
tensor(0.6305)
Accuracy for parameters threshold=5,source=hypothesis, comparison=smaller
Accuracy: 65.69873046875 %
tensor(0.6570)
Accuracy for parameters threshold=12,source=hypothesis, comparison=largerr
Accuracy: 61.12343215942383 %
tensor(0.6112)


--------------------------------------------------




Across the models we see the same general trend: on longer sentences, especially longer hypotheses, the models perform worse. This is likely because long-distance relationships are harder to encode, and possibly because longer sentences tend to be more convoluted.

As expected, the uni-directional LSTM performs worse than its bi-directional counterparts, which have a greater capacity for encoding longer sequences.

## Error analysis: qualitative

The authors of the InferSent paper hypothesised that the best-performing model, the bi-directional LSTM with max-pooling, performed better because of its capability to make sharp choices about which part of the sentence is more important than others.

This can be tested by engineering examples that mostly point to one sentence relation but have a subtle part such as a negation or word sense ambiguity.
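
To illustrate what a "sharp choice" could look like, the sketch below compares max-pooling and mean-pooling over a handful of hidden states in plain Python. The hidden-state values are invented for the example and do not come from any of the trained models. The helpers `max_pool` and `mean_pool` are hypothetical.

```python
def max_pool(hidden_states):
    """Element-wise max over time steps.

    Returns the pooled vector and, per dimension, the index of the
    time step that supplied the maximum.
    """
    pooled, winners = [], []
    for d in range(len(hidden_states[0])):
        column = [h[d] for h in hidden_states]
        best_t = max(range(len(column)), key=column.__getitem__)
        pooled.append(column[best_t])
        winners.append(best_t)
    return pooled, winners


def mean_pool(hidden_states):
    """Element-wise mean over time steps."""
    n = len(hidden_states)
    return [sum(h[d] for h in hidden_states) / n for d in range(len(hidden_states[0]))]


tokens = ["the", "game", "is", "not", "soccer"]
# Invented 3-dimensional hidden states, one per token.
hidden_states = [
    [0.1, 0.0, 0.2],
    [0.3, 0.1, 0.1],
    [0.0, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.8, 0.2, 0.3],
]

pooled, winners = max_pool(hidden_states)
averaged = mean_pool(hidden_states)

print("max-pooled:", pooled)
print("selected tokens per dimension:", [tokens[t] for t in winners])
print("mean-pooled:", averaged)
```

In this toy setup the strong activation on "not" survives max-pooling untouched in the second dimension, whereas mean-pooling dilutes it across all five tokens. Whether the trained encoder actually behaves this way on negations is exactly what the engineered examples are meant to probe.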

In [115]:
#premise = "A soccer game with multiple males playing"
#hypothesis = "Some men are playing sport which is not soccer"

premise = "In due course we will know the exam grades"
hypothesis = "We will have to wait forever for our course grades"

#premise = "Men are playing a funny game involving smiling cats. "
#hypothesis = "Two men are smiling and laughing at the cats playing on the floor."

#premise = "A man inspects the uniform of a figure in some East Asian country."
#hypothesis = "The man is sleeping"


process_example(premise, hypothesis,
                model= uniLstm)

(tensor([1]), 'contradiction')

In [37]:
LABEL.build_vocab(train)

In [41]:
LABEL.vocab.stoi

defaultdict(<bound method Vocab._default_unk_index of <torchtext.vocab.Vocab object at 0x7ff4ae3a40d0>>,
            {'<unk>': 0, 'entailment': 1, 'contradiction': 2, 'neutral': 3})

In [79]:
LABEL.vocab.itos[1:]

['entailment', 'contradiction', 'neutral']
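
Because index 0 of the label vocabulary is reserved for `<unk>`, vocabulary indices run from 1 to 3 while the classifier outputs classes 0 to 2. A small plain-Python sketch of this off-by-one mapping, using the vocabulary order shown above (the helper `to_class_index` is hypothetical):

```python
# Label vocabulary in the order printed above.
itos = ["<unk>", "entailment", "contradiction", "neutral"]
stoi = {label: index for index, label in enumerate(itos)}

# Class names as seen by the classifier: drop the <unk> entry.
class_names = itos[1:]


def to_class_index(vocab_index):
    """Shift a vocabulary index (1..3) to a class index (0..2)."""
    return vocab_index - 1


for label in class_names:
    class_index = to_class_index(stoi[label])
    print(label, "-> vocab index", stoi[label], "-> class index", class_index)
```

This is the reason for both the `batch.label - 1` shift during evaluation and the `itos[1:]` slice when turning a prediction back into a label name.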

In [83]:
premise = "A soccer game with multiple males playing"
hypothesis = "Some men are playing sport which is not soccer"

#premise = "A man inspects the uniform of a figure in some East Asian country."
#hypothesis = "The man is sleeping"

#premise = "He play piano"
#hypothesis = "He does not play a piano"

process_example(premise, hypothesis,
                model= uniLstm)

(tensor([0]), 'entailment')
