In [1]:
import checklist
from checklist.editor import Editor
from checklist.perturb import Perturb
from checklist.test_types import MFT, INV, DIR
from checklist.test_suite import TestSuite
from checklist.expect import Expect
from tqdm import tqdm

In [2]:
import sys
import spacy
import numpy as np
processor = spacy.load('en_core_web_sm')

from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
import torch

In [3]:
model_path = '../trained_model_snli/'
model = AutoModelForSequenceClassification.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)


In [4]:
from transformers import pipeline

In [5]:
import transformers
transformers.__version__

'4.12.5'

In [6]:
pipe = pipeline('text-classification', model=model,
                        tokenizer=tokenizer, device=0)
pipe_all = pipeline('text-classification', model=model,
                        tokenizer=tokenizer, device=0, return_all_scores=True)

In [7]:
from datasets import load_dataset

dev_dataset = load_dataset('snli', split='validation')
dev_df = dev_dataset.to_pandas()
dev_df.head()

Reusing dataset snli (/home/eric/.cache/huggingface/datasets/snli/plain_text/1.0.0/1f60b67533b65ae0275561ff7828aad5ee4282d0e6f844fd148d05d3c6ea251b)


Unnamed: 0,premise,hypothesis,label
0,Two women are embracing while holding to go pa...,The sisters are hugging goodbye while holding ...,1
1,Two women are embracing while holding to go pa...,Two woman are holding packages.,0
2,Two women are embracing while holding to go pa...,The men are fighting outside a deli.,2
3,"Two young children in blue jerseys, one with t...",Two kids in numbered jerseys wash their hands.,0
4,"Two young children in blue jerseys, one with t...",Two kids at a ballgame wash their hands.,1


In [8]:
parsed_qs = [(row.premise, row.hypothesis) for _, row in dev_df.iterrows()]

Preprocess all premises and hypotheses with spaCy. This may take some time.

In [9]:
processed_p = list(tqdm(processor.pipe(dev_df.premise, batch_size=64)))
processed_h = list(tqdm(processor.pipe(dev_df.hypothesis, batch_size=64)))
parsed_qs_spacy = [(p, q) for (p, q) in zip(processed_p, processed_h)]

10000it [00:10, 999.33it/s]
10000it [00:06, 1541.28it/s]


In [10]:
parsed_qs_spacy[0]

(Two women are embracing while holding to go packages.,
 The sisters are hugging goodbye while holding to go packages after just eating lunch.)

# Task and Model: SNLI

For the purpose of this tutorial, we'll use natural language inference on SNLI as an example, with a model that we finetuned locally and load from `../trained_model_snli/`.
Each example is a (premise, hypothesis) pair labeled 0 (entailment), 1 (neutral), or 2 (contradiction).
We run the model through a [Huggingface Pipeline](https://huggingface.co/transformers/main_classes/pipelines.html).

# Top-Down approach: the CheckList matrix

## Capabilities x Test Types

In tutorial #3, we talked about specific test types.  
In order to guide test ideation, it's useful to think of CheckList as a matrix of Capabilities x Test Types.  
*Capabilities* refers to general-purpose linguistic capabilities, which manifest in one way or another in almost any NLP application.   
We suggest that anyone CheckListing a model go through *at least* the following capabilities, trying to create MFTs, INVs, and DIRs for each if possible.
1. **Vocabulary + POS:** important words or groups of words (by part-of-speech) for the task
2. **Taxonomy**: synonyms, antonyms, word categories, etc
3. **Robustness**: to typos, irrelevant additions, contractions, etc
4. **Named Entity Recognition (NER)**: person names, locations, numbers, etc
5. **Fairness**
6. **Temporal understanding**: understanding order of events and how they impact the task
7. **Negation**
8. **Coreference** 
9. **Semantic Role Labeling (SRL)**: understanding roles such as agent, object, passive/active, etc
10. **Logic**: symmetry, consistency, conjunctions, disjunctions, etc

Notice that we are framing this as a very top-down approach: you start with a list of capabilities and try to think of what kinds of tests can be created, based on the three test types. We'll talk about how to incorporate some bottom-up thinking later on.
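
One way to keep track of coverage is to store the matrix explicitly. The following is a minimal sketch in plain Python and is not part of CheckList itself: the helper names `build_matrix` and `empty_cells` are hypothetical, the capability names are shortened, and the registered tests are the ones this notebook defines for the NER and Negation capabilities.

```python
CAPABILITIES = [
    "Vocabulary + POS", "Taxonomy", "Robustness", "NER", "Fairness",
    "Temporal", "Negation", "Coreference", "SRL", "Logic",
]
TEST_TYPES = ["MFT", "INV", "DIR"]


def build_matrix(tests):
    """Map each (capability, test type) cell to the names of its tests."""
    matrix = {cap: {tt: [] for tt in TEST_TYPES} for cap in CAPABILITIES}
    for name, capability, test_type in tests:
        matrix[capability][test_type].append(name)
    return matrix


def empty_cells(matrix):
    """List the (capability, test type) cells that have no test yet."""
    return [(cap, tt) for cap in CAPABILITIES for tt in TEST_TYPES
            if not matrix[cap][tt]]


# (test name, capability, test type), mirroring the tests in this notebook
tests = [
    ("same adjectives, different people", "NER", "MFT"),
    ("Negation contradiction", "Negation", "MFT"),
    ("Change gender in both", "NER", "INV"),
    ("Change gender in one of the questions", "NER", "DIR"),
]

matrix = build_matrix(tests)
missing = empty_cells(matrix)
print(len(missing), "of", len(CAPABILITIES) * len(TEST_TYPES), "cells still empty")
```

Listing the empty cells is a cheap way to spot capabilities that have not been considered yet.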

We won't try to create tests for **all** of these capabilities (but we do have notebooks with tests for all of them in the repo), just one as an example. 
Anyway, let's create a test suite (used to save and aggregate tests):

In [11]:
suite = TestSuite()
editor = Editor()

## Capability: NER

Let's start with the NER capability.  
How do named entities impact natural language inference?


In [12]:
i = 0
print(dev_df.iloc[i].label)
parsed_qs[i]

1


('Two women are embracing while holding to go packages.',
 'The sisters are hugging goodbye while holding to go packages after just eating lunch.')

### MFT
The model should predict contradiction when the premise and the hypothesis are about different people.  
Let's write an MFT where the two people have the same last name but different first names.  
Instead of running the test now, we'll add it to the suite and run all tests later.

In [13]:
t = editor.template((
    '{first_name} {last_name} is {mask} at {mask}',
    '{first_name2} {last_name} is {mask} at {mask}',
    ),
    remove_duplicates=True, 
    nsamples=300)
test = MFT(**t, labels=2, name='same adjectives, different people', capability = 'NER',
           description='Different first name, same adjective and last name')
suite.add(test, overwrite=True)
print(t.data[0])
print(t.data[1])

('Robin Perry is staying at home', 'Julia Perry is staying at home')
('Grace Ross is not at Disneyland', 'Rebecca Ross is not at Disneyland')
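
To make explicit what this MFT checks, here is a rough sketch in plain Python that does not use CheckList. It builds pairs that share a last name but differ in first name, then measures how often a predictor fails to return the expected label. The name lists, the fixed predicate, and the helpers `make_pairs`, `fail_rate`, and `always_entail` are made up for illustration; in the notebook, `editor.template` samples names and fills `{mask}` itself.

```python
from itertools import permutations

FIRST_NAMES = ["Robin", "Julia", "Grace"]   # illustrative values only
LAST_NAMES = ["Perry", "Ross"]              # illustrative values only
PREDICATE = "is staying at home"            # stands in for the {mask} fills


def make_pairs(first_names, last_names, predicate):
    """Build (premise, hypothesis) pairs: same last name, different first names."""
    pairs = []
    for last in last_names:
        for first, first2 in permutations(first_names, 2):
            pairs.append((f"{first} {last} {predicate}",
                          f"{first2} {last} {predicate}"))
    return pairs


def fail_rate(pairs, predict, expected_label):
    """Fraction of pairs whose predicted label differs from the expected one."""
    fails = [pair for pair in pairs if predict(pair) != expected_label]
    return len(fails) / len(pairs)


def always_entail(pair):
    """Toy predictor that always answers 0 (entailment)."""
    return 0


pairs = make_pairs(FIRST_NAMES, LAST_NAMES, PREDICATE)
print(pairs[0])
print(fail_rate(pairs, always_entail, expected_label=2))
```

A predictor that ignores names entirely fails every case here, which is exactly the behavior the MFT is designed to expose.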


In [14]:
t = editor.template(('{first_name} likes to do {mask} in {mask}', '{first_name} does not like to do {mask} in {mask}'),
                remove_duplicates=True, 
                nsamples=300)
test = MFT(**t, labels=2, name='Negation contradiction', description='', capability='Negation')
suite.add(test, overwrite=True)
print(t.data[0])
print(t.data[1])

('Jason likes to do chores in advance', 'Jason does not like to do chores in advance')
('Alice likes to do dishes in public', 'Alice does not like to do dishes in public')


In [15]:
suite.tests

OrderedDict([('same adjectives, different people',
              <checklist.test_types.MFT at 0x7f2471dc8e50>),
             ('Negation contradiction',
              <checklist.test_types.MFT at 0x7f2479b38a10>)])

### INV
If the premise and the hypothesis both contain a gendered word, swapping the gender in both should not change the predicted label.  
Let's write an INV for this.

Since we are dealing with premise/hypothesis pairs, we have to write a wrapper that applies the swap to both sentences:

In [16]:
import re
def change_gender(text):
    # not perfect... there is some ambiguity in that her -> his or him depending on context
    female_words = ['woman', 'women', 'she', 'her', 'hers', 'girl', 'girls', 'sister',  'sisters',  'daughter', 'daughters']
    male_words   = ['man',   'men',   'he',  'him', 'his',  'boy',  'boys',  'brother', 'brothers', 'son',      'sons']
    
    # completely swapping gender doesn't work yet. 
    ret = []
    for i, word in enumerate(male_words):
        swapped = re.sub(r'\b%s\b' % word, female_words[i], text, flags=re.I)
        if swapped != text:
            ret += [swapped]
    for i, word in enumerate(female_words):
        swapped = re.sub(r'\b%s\b' % word, male_words[i], text, flags=re.I)
        if swapped != text:
            ret += [swapped]
    return ret

change_gender("Two women having drinks with men and smoking cigarettes at the bar.")

['Two women having drinks with women and smoking cigarettes at the bar.',
 'Two men having drinks with men and smoking cigarettes at the bar.']

In [17]:
def change_gender_on_both(qs):
    q1, q2 = qs
    c1 = change_gender(q1.text)
    c2 = change_gender(q2.text)
    # Pair the swaps by position. Note that this does not check that the same
    # word was swapped in both sentences; zip also drops any unpaired swaps.
    return [(s1, s2) for s1, s2 in zip(c1, c2)]
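
Because the wrapper pairs swaps by position, a swap of one word in the premise can end up paired with a swap of a different word in the hypothesis. A stricter variant is sketched below under the assumption that a perturbation should be kept only when the same word is replaced in both sentences. The helpers `swap_word` and `change_gender_aligned` are hypothetical and operate on plain strings rather than spaCy documents.

```python
import re

FEMALE = ['woman', 'women', 'she', 'her', 'hers', 'girl', 'girls',
          'sister', 'sisters', 'daughter', 'daughters']
MALE = ['man', 'men', 'he', 'him', 'his', 'boy', 'boys',
        'brother', 'brothers', 'son', 'sons']


def swap_word(text, src, dst):
    """Replace whole-word `src` with `dst`; return None if `src` is absent."""
    swapped = re.sub(r'\b%s\b' % re.escape(src), dst, text, flags=re.I)
    return swapped if swapped != text else None


def change_gender_aligned(premise, hypothesis):
    """Keep only swaps where the same word is replaced in both sentences."""
    out = []
    for src, dst in list(zip(MALE, FEMALE)) + list(zip(FEMALE, MALE)):
        p = swap_word(premise, src, dst)
        h = swap_word(hypothesis, src, dst)
        if p is not None and h is not None:
            out.append((p, h))
    return out


aligned = change_gender_aligned('Three young girls are walking.',
                                'A group of girls walk.')
unaligned = change_gender_aligned('Two women talk to men.', 'The girls talk.')
print(aligned)
print(unaligned)
```

The trade-off is fewer test cases: pairs that use different gendered words in the two sentences are dropped instead of being perturbed inconsistently.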

In [18]:
t = Perturb.perturb(parsed_qs_spacy, change_gender_on_both, nsamples=200)
test = INV(**t, name='Change gender in both', capability='NER',
          description='')
# test.run(new_pp)
# test.summary(3)
suite.add(test, overwrite=True)
print(t.data[0])
#print(t.data[0])

[('Three young girls are walking hand in hand in a crowd of people.', 'A group of girls try to make their way through the crowd at a concert.'), ('Three young boys are walking hand in hand in a crowd of people.', 'A group of boys try to make their way through the crowd at a concert.')]


### DIR
Conversely, if both sentences contain a gendered word and the model does not predict contradiction for the pair, swapping the gender in *only one* of the sentences should change the prediction to contradiction.  
Let's write this as a DIR test:

In [19]:
def change_gender_on_one(qs):
    q1, q2 = qs
    c1 = change_gender(q1.text)
    c2 = change_gender(q2.text)
    # there needs to be a gendered word in both sentences
    if not c1 or not c2:
        return

    ret = []
    ret.extend([(q1_changed, str(q2)) for q1_changed in c1])
    ret.extend([(str(q1), q2_changed) for q2_changed in c2])
    return ret

We'll write an expectation function in two steps.  
First, we want the prediction after the change to be 2 (contradiction).  
Second, we only want to include examples where the original prediction is not already 2. We do this with a slice wrapper:

In [20]:
# we want changes to make the case go towards 2 (contradiction). 
expect_fn = Expect.eq(2)
expect_fn = Expect.slice_orig(expect_fn, lambda orig, *args: orig != 2)
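
In plain Python, the two steps amount to roughly the following. This is only a sketch of the logic, not CheckList's implementation; `dir_outcome` and `summarize` are hypothetical helpers, and label 2 is taken to mean contradiction as in the cell above.

```python
CONTRADICTION = 2


def dir_outcome(orig_pred, new_pred, target=CONTRADICTION):
    """Return None if the example is filtered out, else True (pass) or False (fail)."""
    if orig_pred == target:
        # Slice step: the original pair is already predicted as the target
        # label, so the perturbation has nothing to change and is skipped.
        return None
    # Expectation step: the perturbed pair should now be predicted as the target.
    return new_pred == target


def summarize(orig_preds, new_preds):
    """Count filtered, passed, and failed examples."""
    outcomes = [dir_outcome(o, n) for o, n in zip(orig_preds, new_preds)]
    kept = [o for o in outcomes if o is not None]
    return {
        "filtered": len(outcomes) - len(kept),
        "passed": sum(1 for o in kept if o),
        "failed": sum(1 for o in kept if not o),
    }


# Toy predictions: original labels and labels after swapping gender in one sentence
stats = summarize(orig_preds=[0, 1, 2, 0], new_preds=[2, 1, 2, 0])
print(stats)
```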


Let's put it all together into a test:

In [21]:
t = Perturb.perturb(parsed_qs_spacy, change_gender_on_one, nsamples=200)
name = 'Change gender in one of the questions'
desc = 'Take non-contradictions. Change gender in one to make contradictions.'
test = DIR(**t, expect=expect_fn, name=name, description=desc, capability='NER')
suite.add(test)
print(t.data[0][0])
print(t.data[0][1])
print(t.data[0][2])

('Two young girls are playing large stringed instruments behind music stands, with a window in the background.', 'The girls are musicians.')
('Two young boys are playing large stringed instruments behind music stands, with a window in the background.', 'The girls are musicians.')
('Two young girls are playing large stringed instruments behind music stands, with a window in the background.', 'The boys are musicians.')


# Running the suite, seeing results

When running the prediction, the Huggingface pipeline returns, for each input, a dict with the predicted label and its probability:

In [23]:
example = ('The woman likes driving', 'The woman likes cars')
pipe([[example]])

[{'label': 'LABEL_1', 'score': 0.38768288493156433}]

We write a simple wrapper to make the output compatible with CheckList:

In [24]:
def pred_and_conf(data):
    data = [[d] for d in data]
    raw_preds = pipe_all(data)
    pp = np.array([[p[0]['score'], p[1]['score'], p[2]['score']] for p in raw_preds])
    preds = np.argmax(pp, axis=1)
    return preds, pp
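
The wrapper above assumes that each item in the pipeline output lists the scores in label order. A slightly more defensive variant, sketched here without the model, sorts by the numeric suffix of the label first. The `raw_preds` value is a hand-written stand-in shaped like the pipeline output (`LABEL_0`, `LABEL_1`, `LABEL_2`) with invented scores, and `to_preds_and_conf` is a hypothetical helper.

```python
def to_preds_and_conf(raw_preds):
    """Turn per-example lists of {'label', 'score'} dicts into labels and score rows."""
    preds, confs = [], []
    for scores in raw_preds:
        # Order the scores by label index so that position i holds LABEL_i.
        ordered = sorted(scores, key=lambda s: int(s["label"].rsplit("_", 1)[1]))
        conf = [s["score"] for s in ordered]
        confs.append(conf)
        # Index of the highest score, i.e. an argmax without numpy.
        preds.append(max(range(len(conf)), key=conf.__getitem__))
    return preds, confs


# Invented scores, deliberately out of label order in the first example
raw_preds = [
    [{"label": "LABEL_2", "score": 0.1},
     {"label": "LABEL_0", "score": 0.6},
     {"label": "LABEL_1", "score": 0.3}],
    [{"label": "LABEL_0", "score": 0.2},
     {"label": "LABEL_1", "score": 0.1},
     {"label": "LABEL_2", "score": 0.7}],
]

preds, confs = to_preds_and_conf(raw_preds)
print(preds)
print(confs)
```

In the notebook the results are returned as numpy arrays, so the lists would be wrapped in `np.array` before being handed to `suite.run`.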

In [25]:
suite.run(pred_and_conf, overwrite=True)

Running same adjectives, different people
Predicting 298 examples
Running Negation contradiction
Predicting 300 examples
Running Change gender in both
Predicting 428 examples
Running Change gender in one of the questions
Predicting 706 examples


We can see a (text) summary of the results by calling `suite.summary()`

In [26]:
suite.summary()

NER

same adjectives, different people
Test cases:      298
Fails (rate):    132 (44.3%)

Example fails:
0.6 0.2 0.1 ('Victoria Howard is speaking at BYU', 'Maria Howard is speaking at BYU')
----
0.4 0.2 0.3 ('Melissa Bell is home at Thanksgiving', 'Marilyn Bell is home at Thanksgiving')
----
0.6 0.2 0.2 ('Margaret Jackson is away at work', 'Jennifer Jackson is away at work')
----


Change gender in both
Test cases:      200
Fails (rate):    24 (12.0%)

Example fails:
0.3 0.1 0.7 ('A young women, in a black shirt, is holding a bike, while a young boy is holding a skateboard.', 'A young woman is holding a bike while a young man is holding a skateboard.')
1.0 0.0 0.0 ('A young women, in a black shirt, is holding a bike, while a young girl is holding a skateboard.', 'A young woman is holding a bike while a young woman is holding a skateboard.')
0.9 0.0 0.0 ('A young men, in a black shirt, is holding a bike, while a young boy is holding a skateboard.', 'A young man is holding a bike while 

Or, if we're using Jupyter, we can use a nifty visualization that has all of the tests we created in a matrix.  
You can navigate the matrix and see results for individual tests (*The screenshot below is based on our locally finetuned model, so the numbers may not match your results.*).

In [35]:
# from IPython.display import HTML, Image
# with open('visual_table_summary.gif','rb') as f:
#     display(Image(data=f.read(), format='png'))
suite.visual_summary_table()

Please wait as we prepare the table data...


SuiteSummarizer(stats={'npassed': 0, 'nfailed': 0, 'nfiltered': 0}, test_infos=[{'name': 'same adjectives, dif…

## Bonus: testing Taxonomy

Let's create a few additional tests for the Taxonomy capability.

In [36]:
tmp = []
x = editor.suggest('How can I become more {mask}?')
x += editor.suggest('How can I become less {mask}?')
for a in set(x):
    e = editor.synonyms('How can I become {moreless} %s?' % a, a, moreless=['more', 'less'])
    if e:
#         print(a, [b[0][0] for b in e] )
        tmp.append([a] + e)
#         opps.append((a, e[0][0][0]))
print(', '.join([str(tuple(x)) for x in tmp][:50]))

('grateful', 'thankful'), ('cautious', 'conservative', 'timid'), ('conservative', 'cautious'), ('organised', 'organized', 'direct', 'engineer'), ('stressed', 'stress'), ('mean', 'average', 'hateful'), ('smart', 'wise', 'chic', 'bright'), ('honest', 'good', 'true', 'reliable', 'honorable', 'fair'), ('hateful', 'mean'), ('worried', 'upset'), ('ambitious', 'challenging'), ('dependent', 'qualified'), ('rigid', 'strict', 'stiff', 'fixed'), ('demanding', 'exact'), ('important', 'significant', 'authoritative'), ('organized', 'organised', 'direct'), ('corrupt', 'corrupted'), ('miserable', 'poor', 'pathetic', 'suffering', 'wretched', 'low'), ('enlightened', 'educated', 'clear'), ('tolerant', 'resistant', 'liberal', 'kind'), ('vocal', 'outspoken'), ('thankful', 'grateful'), ('educated', 'enlightened'), ('intimidating', 'daunting'), ('unhappy', 'distressed'), ('thoughtful', 'attentive'), ('committed', 'attached'), ('strict', 'rigid', 'stern'), ('charitable', 'benevolent', 'sympathetic'), ('courag

Out of all of those, let's pick a few:

In [45]:
synonyms = [ ('spiritual', 'religious'), ('angry', 'furious'), ('organized', 'organised'),
            ('vocal', 'outspoken'), ('grateful', 'thankful'), ('intelligent', 'smart'),
            ('humble', 'modest'), ('courageous', 'brave'), ('happy', 'joyful'), ('scared', 'frightened'),
           ]

With these, we can create a simple MFT, where we expect the model to recognize these synonyms.  


In [46]:
t = editor.template(
    (
    'How can I become {moreless} {x[0]}?',
    'How can I become {moreless} {x[1]}?',
    ),
    x=synonyms,
    moreless=['more', 'less'],
    remove_duplicates=True, 
    nsamples=200)
name = 'How can I become more {synonym}?' 
desc = 'different (simple) templates where words are replaced with their synonyms'
test = MFT(**t, labels=1, name=name, capability = 'Taxonomy',
          description=desc)
suite.add(test)

Let's do the same with antonyms:

In [47]:
opps = []
x = editor.suggest('How can I become more {mask}?')
x += editor.suggest('How can I become less {mask}?')
for a in set(x):
    e = editor.antonyms('How can I become {moreless} %s?' % a, a, moreless=['more', 'less'])
    if e:
#         print(a, [b[0][0] for b in e] )
        opps.append([a] + e)
#         opps.append((a, e[0][0][0]))
print(','.join([str(tuple(x)) for x in opps]))

('cautious', 'brave'),('conservative', 'progressive', 'liberal'),('smart', 'stupid'),('conspicuous', 'invisible'),('pessimistic', 'optimistic'),('dependent', 'independent'),('impatient', 'patient'),('powerless', 'powerful'),('corrupt', 'straight'),('unhappy', 'happy'),('courageous', 'fearful'),('evil', 'good'),('visible', 'invisible'),('optimistic', 'pessimistic'),('bad', 'good'),('fat', 'lean', 'thin'),('hungry', 'thirsty'),('individual', 'common'),('irresponsible', 'responsible'),('passive', 'active'),('insecure', 'secure'),('uncomfortable', 'comfortable'),('defensive', 'offensive'),('shy', 'confident'),('negative', 'positive'),('invisible', 'visible'),('active', 'passive'),('humble', 'proud'),('hopeful', 'hopeless'),('progressive', 'conservative'),('difficult', 'easy'),('specific', 'general'),('positive', 'negative'),('organic', 'functional'),('rude', 'civil', 'polite'),('emotional', 'intellectual'),('stupid', 'smart', 'intelligent')


In [48]:
opps

[['cautious', 'brave'],
 ['conservative', 'progressive', 'liberal'],
 ['smart', 'stupid'],
 ['conspicuous', 'invisible'],
 ['pessimistic', 'optimistic'],
 ['dependent', 'independent'],
 ['impatient', 'patient'],
 ['powerless', 'powerful'],
 ['corrupt', 'straight'],
 ['unhappy', 'happy'],
 ['courageous', 'fearful'],
 ['evil', 'good'],
 ['visible', 'invisible'],
 ['optimistic', 'pessimistic'],
 ['bad', 'good'],
 ['fat', 'lean', 'thin'],
 ['hungry', 'thirsty'],
 ['individual', 'common'],
 ['irresponsible', 'responsible'],
 ['passive', 'active'],
 ['insecure', 'secure'],
 ['uncomfortable', 'comfortable'],
 ['defensive', 'offensive'],
 ['shy', 'confident'],
 ['negative', 'positive'],
 ['invisible', 'visible'],
 ['active', 'passive'],
 ['humble', 'proud'],
 ['hopeful', 'hopeless'],
 ['progressive', 'conservative'],
 ['difficult', 'easy'],
 ['specific', 'general'],
 ['positive', 'negative'],
 ['organic', 'functional'],
 ['rude', 'civil', 'polite'],
 ['emotional', 'intellectual'],
 ['stupid', 

In [49]:
antonyms = [
    ('progressive', 'conservative'), ('religious', 'secular'), ('positive', 'negative'),
    ('defensive', 'offensive'), ('rude', 'polite'), ('optimistic', 'pessimistic'),
    ('stupid', 'smart'), ('negative', 'positive'), ('unhappy', 'happy'),
    ('active', 'passive'), ('impatient', 'patient'), ('powerless', 'powerful'),
    ('visible', 'invisible'), ('fat', 'thin'), ('bad', 'good'),
    ('cautious', 'brave'), ('hopeful', 'hopeless'), ('insecure', 'secure'),
    ('humble', 'proud'), ('passive', 'active'), ('dependent', 'independent'),
    ('pessimistic', 'optimistic'), ('irresponsible', 'responsible'), ('courageous', 'fearful'),
]

In [50]:
t = editor.template([(
    'How can I become more {x[0]}?',
    'How can I become less {x[1]}?',
    ),
    (
    'How can I become less {x[0]}?',
    'How can I become more {x[1]}?',
    )],
    unroll=True,
    x=antonyms,
    remove_duplicates=True, 
    nsamples=300)
name = 'How can I become more X = How can I become less antonym(X)' 
desc = ''
test = MFT(**t, labels=1, name=name, capability = 'Taxonomy',
          description=desc)
suite.add(test)
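
For intuition, the two templates above can be expanded by hand. The sketch below uses plain string formatting rather than `editor.template`, and only three of the antonym pairs; `expand_antonyms` is a hypothetical helper.

```python
def expand_antonyms(pairs):
    """For each (x, y) antonym pair, emit the 'more x / less y' and 'less x / more y' cases."""
    cases = []
    for x, y in pairs:
        cases.append((f"How can I become more {x}?", f"How can I become less {y}?"))
        cases.append((f"How can I become less {x}?", f"How can I become more {y}?"))
    return cases


sample = [("optimistic", "pessimistic"), ("cautious", "brave"), ("impatient", "patient")]
cases = expand_antonyms(sample)
for case in cases:
    print(case)
```

Each antonym pair yields two test cases, one per template, which is what `unroll=True` is used for in the cell above.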

It would be easy to turn the synonym one into an INV as well (we do this in another notebook), but let's end here after we run the suite again and see new results.

In [51]:
suite.run(pred_and_conf, overwrite=True)

Running same adjectives, different people
Predicting 298 examples
Running Change same name in both questions
Predicting 2065 examples




Running Change name in one of the questions
Predicting 3912 examples
Running Comparison between two entities is not the same as asking about one
Predicting 376 examples
Running How can I become more {synonym}?
Predicting 200 examples
Running How can I become more X = How can I become less antonym(X)
Predicting 600 examples


In [52]:
suite.summary()

Vocabulary

Comparison between two entities is not the same as asking about one
Test cases:      376
Fails (rate):    0 (0.0%)




Taxonomy

How can I become more {synonym}?
Test cases:      200
Fails (rate):    34 (17.0%)

Example fails:
0.2 ('How can I become more spiritual?', 'How can I become more religious?')
----
0.0 ('How can I become less vocal?', 'How can I become less outspoken?')
----
0.0 ('How can I become less vocal?', 'How can I become less outspoken?')
----


How can I become more X = How can I become less antonym(X)
Test cases:      600
Fails (rate):    344 (57.3%)

Example fails:
0.0 ('How can I become less optimistic?', 'How can I become more pessimistic?')
----
0.0 ('How can I become more cautious?', 'How can I become less brave?')
----
0.0 ('How can I become more impatient?', 'How can I become less patient?')
----




NER

same adjectives, different people
Test cases:      298
Fails (rate):    74 (24.8%)

Example fails:
0.9 ('Is Harry Cooper Missing?', 'Is Donald Co

In [36]:
suite.visual_summary_table()

Please wait as we prepare the table data...


SuiteSummarizer(stats={'npassed': 0, 'nfailed': 0, 'nfiltered': 0}, test_infos=[{'name': 'same adjectives, dif…