In [1]:
import checklist
from checklist.editor import Editor
from checklist.perturb import Perturb
from checklist.test_types import MFT, INV, DIR
from checklist.test_suite import TestSuite
from checklist.expect import Expect
from tqdm import tqdm
import itertools

In [2]:
import sys
import spacy
import numpy as np
processor = spacy.load('en_core_web_sm')

from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
import torch

In [3]:
model_path = '../trained_model_snli/'
model = AutoModelForSequenceClassification.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)


In [4]:
from transformers import pipeline

In [5]:
import transformers
transformers.__version__

'4.12.5'

In [6]:
pipe = pipeline('text-classification', model=model,
                        tokenizer=tokenizer, device=0)
pipe_all = pipeline('text-classification', model=model,
                        tokenizer=tokenizer, device=0, return_all_scores=True)

In [7]:
from datasets import load_dataset

dev_dataset = load_dataset('snli', split='validation')
dev_df = dev_dataset.to_pandas()
dev_df.head()

Reusing dataset snli (/home/eric/.cache/huggingface/datasets/snli/plain_text/1.0.0/1f60b67533b65ae0275561ff7828aad5ee4282d0e6f844fd148d05d3c6ea251b)


Unnamed: 0,premise,hypothesis,label
0,Two women are embracing while holding to go pa...,The sisters are hugging goodbye while holding ...,1
1,Two women are embracing while holding to go pa...,Two woman are holding packages.,0
2,Two women are embracing while holding to go pa...,The men are fighting outside a deli.,2
3,"Two young children in blue jerseys, one with t...",Two kids in numbered jerseys wash their hands.,0
4,"Two young children in blue jerseys, one with t...",Two kids at a ballgame wash their hands.,1


In [8]:
parsed_qs = [(row.premise, row.hypothesis) for _, row in dev_df.iterrows()]

Preprocess all premises and hypotheses with spaCy. This may take some time.

In [9]:
processed_p = list(tqdm(processor.pipe(dev_df.premise, batch_size=64)))
processed_h = list(tqdm(processor.pipe(dev_df.hypothesis, batch_size=64)))
parsed_qs_spacy = [(p, q) for (p, q) in zip(processed_p, processed_h)]

10000it [00:10, 998.62it/s]
10000it [00:06, 1580.81it/s]


In [10]:
parsed_qs_spacy[0]

(Two women are embracing while holding to go packages.,
 The sisters are hugging goodbye while holding to go packages after just eating lunch.)

# Top-Down approach: the CheckList matrix

## Capabilities x Test Types

In tutorial #3, we talked about specific test types.  
In order to guide test ideation, it's useful to think of CheckList as a matrix of Capabilities x Test Types.  
*Capabilities* refers to general-purpose linguistic capabilities, which manifest in one way or another in almost any NLP application.   
We suggest that anyone CheckListing a model go through *at least* the following capabilities, trying to create MFTs, INVs, and DIRs for each if possible.
1. **Vocabulary + POS:** important words or groups of words (by part-of-speech) for the task
2. **Taxonomy**: synonyms, antonyms, word categories, etc
3. **Robustness**: to typos, irrelevant additions, contractions, etc
4. **Named Entity Recognition (NER)**: person names, locations, numbers, etc
5. **Fairness**
6. **Temporal understanding**: understanding order of events and how they impact the task
7. **Negation**
8. **Coreference** 
9. **Semantic Role Labeling (SRL)**: understanding roles such as agent, object, passive/active, etc
10. **Logic**: symmetry, consistency, conjunctions, disjunctions, etc

Notice that we are framing this as a very top-down approach: you start with a list of capabilities and think about what kinds of tests can be created for each, based on the three test types. We'll talk about how to incorporate some bottom-up thinking later on.
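As a planning aid, the matrix can be written down literally. The following is a minimal sketch in plain Python, not part of CheckList itself; the two planned entries and the helper `empty_cells` are hypothetical and only illustrate how empty cells point at tests still worth thinking about.

```python
# Capabilities and test types as listed in this tutorial.
capabilities = ['Vocabulary + POS', 'Taxonomy', 'Robustness', 'NER', 'Fairness',
                'Temporal', 'Negation', 'Coreference', 'SRL', 'Logic']
test_types = ['MFT', 'INV', 'DIR']

# One cell per (capability, test type); each cell holds the names of planned tests.
matrix = {(cap, tt): [] for cap in capabilities for tt in test_types}

# Hypothetical planning entries, for illustration only.
matrix[('NER', 'MFT')].append('same last name, different first name')
matrix[('Negation', 'MFT')].append('likes vs. does not like')

def empty_cells(matrix):
    """Return the (capability, test type) cells that still have no test."""
    return [cell for cell, tests in matrix.items() if not tests]

print(len(matrix))               # 30 cells in total
print(len(empty_cells(matrix)))  # 28 cells still to think about
```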

We won't create tests for **all** of these capabilities here (the repo has notebooks with tests for all of them), just a few as examples. 
Let's start by creating a test suite, which is used to save and aggregate tests:

In [11]:
suite = TestSuite()
editor = Editor()

## Capability: NER

Let's start with the NER capability.  
How do named entities affect natural language inference (NLI)? 


In [12]:
i = 0
print(dev_df.iloc[i].label)
parsed_qs[i]

1


('Two women are embracing while holding to go packages.',
 'The sisters are hugging goodbye while holding to go packages after just eating lunch.')

### MFT
The model should predict contradiction (label 2) when the premise and hypothesis differ only in the person's name.   
Let's write an MFT with two people who share a last name but have different first names.  
Instead of running the test now, we'll add it to the suite and run all tests later.

In [13]:
t = editor.template((
    '{first_name} {last_name} is {mask} at {mask}',
    '{first_name2} {last_name} is {mask} at {mask}',
    ),
    remove_duplicates=True, 
    nsamples=300)
test = MFT(**t, labels=2, name='same adjectives, different people', capability = 'NER',
           description='Different first name, same adjective and last name')
suite.add(test, overwrite=True)
print(t.data[0])
print(t.data[1])


('Rachel Wright is still at home', 'Howard Wright is still at home')
('Stephanie Gordon is editor at Medium', 'Christopher Gordon is editor at Medium')
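To make explicit what the NER template above expands to, here is a minimal sketch in plain Python. It does not use CheckList: the tiny lexicons and the `fillers` list are hypothetical stand-ins (in the notebook, the `{mask}` slots are filled by the `Editor`), and `sample_pairs` is an illustrative helper name.

```python
import random

# Tiny hypothetical lexicons; CheckList ships much larger built-in ones.
first_names = ['Rachel', 'Howard', 'Stephanie', 'Christopher']
last_names = ['Wright', 'Gordon']
# Hypothetical fill-ins standing in for the two {mask} slots.
fillers = [('still', 'home'), ('editor', 'Medium')]

def sample_pairs(n, seed=0):
    """Sample n distinct (premise, hypothesis) pairs that differ only in the first name."""
    rng = random.Random(seed)
    pairs = set()
    while len(pairs) < n:
        first, other = rng.sample(first_names, 2)  # two distinct first names
        last = rng.choice(last_names)
        word, place = rng.choice(fillers)
        premise = f'{first} {last} is {word} at {place}'
        hypothesis = f'{other} {last} is {word} at {place}'
        pairs.add((premise, hypothesis))
    return sorted(pairs)

pairs = sample_pairs(5)
labels = [2] * len(pairs)  # every pair is expected to be a contradiction
for pair in pairs:
    print(pair)
```

Using a set mirrors the effect of `remove_duplicates=True`: the same filled-in pair is never counted twice.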


In [14]:
t = editor.template(('{first_name} likes to do {mask} in {mask}', '{first_name} does not like to do {mask} in {mask}'),
                remove_duplicates=True, 
                nsamples=300)
test = MFT(**t, labels=2, name='Negation contradiction', description='', capability='Negation')
suite.add(test, overwrite=True)
print(t.data[0])
print(t.data[1])

('Jason likes to do chores in advance', 'Jason does not like to do chores in advance')
('Alice likes to do dishes in public', 'Alice does not like to do dishes in public')


### INV
If a premise and hypothesis refer to the same gendered entity, swapping the gender in both should not change the predicted label.  
Let's write an INV for this.

Since we are dealing with premise-hypothesis pairs, we have to write a wrapper that makes sure the swap is applied to both sentences:

In [15]:
import re
def change_gender(text):
    # not perfect... there is some ambiguity in that her -> his or him depending on context
    female_words = ['woman', 'women', 'she', 'her', 'hers', 'girl', 'girls', 'sister',  'sisters',  'daughter', 'daughters']
    male_words   = ['man',   'men',   'he',  'him', 'his',  'boy',  'boys',  'brother', 'brothers', 'son',      'sons']
    
    # completely swapping gender doesn't work yet. 
    ret = []
    for i, word in enumerate(male_words):
        swapped = re.sub(r'\b%s\b' % word, female_words[i], text, flags=re.I)
        if swapped != text:
            ret += [swapped]
    for i, word in enumerate(female_words):
        swapped = re.sub(r'\b%s\b' % word, male_words[i], text, flags=re.I)
        if swapped != text:
            ret += [swapped]
    return ret

change_gender("Two women having drinks with men and smoking cigarettes at the bar.")

['Two women having drinks with women and smoking cigarettes at the bar.',
 'Two men having drinks with men and smoking cigarettes at the bar.']
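The output above shows the limitation mentioned in the code comment: each variant swaps one word list at a time, so a sentence with both "women" and "men" never gets fully swapped. One possible alternative, sketched here under our own assumptions and not used in the rest of this notebook, is a single-pass substitution with a replacement callback. The name `swap_all_gender` is hypothetical, and the ambiguous "her" is arbitrarily mapped to "him".

```python
import re

# Hypothetical one-to-one mapping; 'her' is ambiguous (him/his) and is mapped to 'him' here.
PAIRS = [('woman', 'man'), ('women', 'men'), ('she', 'he'), ('her', 'him'),
         ('hers', 'his'), ('girl', 'boy'), ('girls', 'boys'), ('sister', 'brother'),
         ('sisters', 'brothers'), ('daughter', 'son'), ('daughters', 'sons')]

# Build a lookup that works in both directions.
SWAP = {}
for female, male in PAIRS:
    SWAP[female] = male
    SWAP[male] = female

# One alternation over all words, matched on word boundaries, case-insensitively.
PATTERN = re.compile(r'\b(%s)\b' % '|'.join(sorted(SWAP, key=len, reverse=True)),
                     flags=re.I)

def swap_all_gender(text):
    """Swap every gendered word in a single pass, keeping initial capitalization."""
    def repl(match):
        word = match.group(0)
        swapped = SWAP[word.lower()]
        return swapped.capitalize() if word[0].isupper() else swapped
    return PATTERN.sub(repl, text)

print(swap_all_gender("Two women having drinks with men and smoking cigarettes at the bar."))
# Two men having drinks with women and smoking cigarettes at the bar.
```

Because every match is replaced exactly once, a word that was just swapped is never swapped back, which is the problem with chaining several `re.sub` calls.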

In [16]:
def change_gender_on_both(qs):
    q1, q2 = qs
    c1 = change_gender(q1.text)
    c2 = change_gender(q2.text)
    # to keep things simple, only keep pairs where each sentence yields exactly one swapped variant
    if len(c1) == 1 and len(c2) == 1:
        return [(c1[0], c2[0])]
    # no usable perturbation for this pair
    return []

In [17]:
t = Perturb.perturb(parsed_qs_spacy, change_gender_on_both, nsamples=200)
test = INV(**t, name='Change gender in both', capability='NER',
          description='')
# test.run(new_pp)
# test.summary(3)
suite.add(test, overwrite=True)
print(t.data[0])
#print(t.data[0])

[('Three young girls are walking hand in hand in a crowd of people.', 'A group of girls try to make their way through the crowd at a concert.'), ('Three young boys are walking hand in hand in a crowd of people.', 'A group of boys try to make their way through the crowd at a concert.')]
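To illustrate what an invariance check amounts to, here is a simplified sketch in plain Python. It is not CheckList's actual implementation: the function name `invariance_holds` and the exact tolerance rule (a label flip is tolerated only if the original label's probability moved by at most `tolerance`) are assumptions made for illustration.

```python
def invariance_holds(orig_conf, new_conf, tolerance=0.1):
    """Return True if the perturbed prediction is considered unchanged."""
    orig_pred = max(range(len(orig_conf)), key=orig_conf.__getitem__)
    new_pred = max(range(len(new_conf)), key=new_conf.__getitem__)
    if new_pred == orig_pred:
        return True
    # The label flipped: tolerate it only if the original label's probability barely moved.
    return abs(new_conf[orig_pred] - orig_conf[orig_pred]) <= tolerance

# Same label before and after the perturbation: the invariance holds.
print(invariance_holds([0.1, 0.8, 0.1], [0.1, 0.7, 0.2]))  # True
# Label flips from 2 to 1 and the probability of label 2 drops sharply: a failure.
print(invariance_holds([0.0, 0.2, 0.8], [0.4, 0.5, 0.1]))  # False
```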


### DIR
Conversely, if a gendered word appears in both sentences of a pair that the model does not predict as a contradiction, and we swap it in *only one* of the sentences, the prediction should change to contradiction.  
Let's write this as a DIR test:

In [18]:
def change_gender_on_one(qs):
    q1, q2 = qs
    c1 = change_gender(q1.text)
    c2 = change_gender(q2.text)
    # there needs to be gendered word in both
    if len(c1) == 1 and len(c2) == 1:
        ret = []
        ret.extend([(c1[0], str(q2))])
        ret.extend([(str(q1), c2[0])])
        return ret
    return []

We'll write an expectation function in two steps.  
First, we want the prediction after the change to be 2 (contradiction).  
Second, we only want to include examples where the original prediction is not 2. We do this with a slice wrapper:

In [19]:
# we want changes to make the case go towards 2 (contradiction). 
expect_fn = Expect.eq(2)
expect_fn = Expect.slice_orig(expect_fn, lambda orig, *args: orig != 2)
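The combined behaviour of the two steps can be sketched in plain Python. This is an illustration of the logic only, not how `Expect` is implemented; `dir_expectation` is a hypothetical name, and returning `None` for filtered cases is an assumption of this sketch.

```python
def dir_expectation(orig_pred, perturbed_preds, target=2):
    """Mimic 'expect label == target, but only for cases whose original label != target'.

    Returns None when the test case is filtered out, otherwise True (pass) or False (fail).
    """
    if orig_pred == target:
        return None  # sliced away: the original was already a contradiction
    return all(pred == target for pred in perturbed_preds)

print(dir_expectation(2, [2, 2]))  # None
print(dir_expectation(0, [2, 2]))  # True
print(dir_expectation(1, [2, 0]))  # False
```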


Let's put it all together into a test:

In [20]:
t = Perturb.perturb(parsed_qs_spacy, change_gender_on_one, nsamples=200)
name = 'Change gender in one of the questions'
desc = 'Take non-contradictions. Change gender in one to make contradictions.'
test = DIR(**t, expect=expect_fn, name=name, description=desc, capability='NER')
suite.add(test, overwrite=True)
print(t.data[0][0])
print(t.data[0][1])
print(t.data[0][2])

('Two young girls are playing large stringed instruments behind music stands, with a window in the background.', 'The girls are musicians.')
('Two young boys are playing large stringed instruments behind music stands, with a window in the background.', 'The girls are musicians.')
('Two young girls are playing large stringed instruments behind music stands, with a window in the background.', 'The boys are musicians.')


# Running the suite, seeing results

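With `return_all_scores=True`, the Hugging Face pipeline returns, for each example, a list of dicts with a label and a score for every class.

We write a simple wrapper to make the output compatible with CheckList, returning the predicted labels together with the per-class confidences: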

In [21]:
def pred_and_conf(data):
    data = [[d] for d in data]
    raw_preds = pipe_all(data)
    pp = np.array([[p[0]['score'], p[1]['score'], p[2]['score']] for p in raw_preds])
    preds = np.argmax(pp, axis=1)
    return preds, pp
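To show the conversion without loading a model, here is a minimal sketch that applies the same idea to a mocked pipeline output. The label names `LABEL_0` to `LABEL_2` and the scores are made up for illustration, and `to_preds_and_confs` is a hypothetical helper that uses plain lists instead of NumPy arrays.

```python
# Mocked output for two examples: one list of {'label', 'score'} dicts per example.
raw_preds = [
    [{'label': 'LABEL_0', 'score': 0.1},
     {'label': 'LABEL_1', 'score': 0.2},
     {'label': 'LABEL_2', 'score': 0.7}],
    [{'label': 'LABEL_0', 'score': 0.6},
     {'label': 'LABEL_1', 'score': 0.3},
     {'label': 'LABEL_2', 'score': 0.1}],
]

def to_preds_and_confs(raw_preds):
    """Turn per-class score dicts into (predicted labels, per-class confidences)."""
    confs = [[entry['score'] for entry in example] for example in raw_preds]
    preds = [row.index(max(row)) for row in confs]  # index of the highest score
    return preds, confs

preds, confs = to_preds_and_confs(raw_preds)
print(preds)  # [2, 0]
```

Like the wrapper above, this sketch relies on the scores arriving in label order (class 0 first).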

In [22]:
suite.run(pred_and_conf, overwrite=True)

Running same adjectives, different people
Predicting 297 examples
Running Negation contradiction
Predicting 300 examples
Running Change gender in both
Predicting 400 examples
Running Change gender in one of the questions
Predicting 600 examples


We can see a (text) summary of the results by calling `suite.summary()`:

In [23]:
suite.summary()

NER

same adjectives, different people
Test cases:      297
Fails (rate):    109 (36.7%)

Example fails:
0.5 0.3 0.2 ('Helen Gordon is back at ABC', 'Julia Gordon is back at ABC')
----
0.7 0.2 0.1 ('Samuel Price is staying at home', 'Frank Price is staying at home')
----
0.8 0.1 0.0 ('Dick Kennedy is home at night', 'Jack Kennedy is home at night')
----


Change gender in both
Test cases:      200
Fails (rate):    4 (2.0%)

Example fails:
0.0 0.2 0.8 ('A young child sits on some pillows with a green tray of chopped-up food in front of him.', 'A young man prepares to eat a meal.')
0.4 0.5 0.1 ('A young child sits on some pillows with a green tray of chopped-up food in front of her.', 'A young woman prepares to eat a meal.')

----
0.0 1.0 0.0 ('A man in a black hoodie watching a man in a red cap.', 'The man in the black hoodie is a robber.')
0.0 0.5 0.5 ('A woman in a black hoodie watching a woman in a red cap.', 'The woman in the black hoodie is a robber.')

----
0.0 0.0 1.0 ('A young b

Or, if we're using Jupyter, we can use a nifty visualization that shows all of the tests we created in a matrix.  
You can navigate the matrix and see results for individual tests (*The screenshot below is based on our locally fine-tuned model, so the numbers may not match your results.*).

In [24]:
# from IPython.display import HTML, Image
# with open('visual_table_summary.gif','rb') as f:
#     display(Image(data=f.read(), format='png'))
suite.visual_summary_table()

Please wait as we prepare the table data...


SuiteSummarizer(stats={'npassed': 0, 'nfailed': 0, 'nfiltered': 0}, test_infos=[{'name': 'same adjectives, dif…

## Taxonomy

Let's create a few additional tests for the Taxonomy capability.

In [25]:
tmp = []
x = editor.suggest('He is trying to become more {mask}.')
x += editor.suggest('He is trying to become less {mask}.')
for a in set(x):
    e = editor.synonyms('He is trying to become {moreless} %s?' % a, a, moreless=['more', 'less'])
    if e:
#         print(a, [b[0][0] for b in e] )
        tmp.append([a] + e)
#         opps.append((a, e[0][0][0]))
print(', '.join([str(tuple(x)) for x in tmp][:50]))

('understanding', 'savvy', 'read', 'understand'), ('engaged', 'busy'), ('religious', 'spiritual'), ('activist', 'militant'), ('professional', 'pro'), ('attentive', 'thoughtful'), ('rational', 'intellectual'), ('scared', 'frightened'), ('popular', 'democratic'), ('militant', 'competitive', 'activist'), ('dominant', 'prevalent'), ('crazy', 'sick', 'mad', 'disturbed', 'wild'), ('focused', 'center'), ('active', 'dynamic', 'alive'), ('productive', 'rich', 'fat'), ('civilized', 'polite', 'cultured'), ('fluent', 'liquid', 'fluid', 'smooth'), ('serious', 'sober', 'dangerous', 'severe', 'good'), ('ambitious', 'challenging'), ('transparent', 'lucid'), ('effective', 'efficient', 'good'), ('rude', 'natural', 'crude', 'primitive'), ('overweight', 'heavy'), ('conservative', 'cautious', 'bourgeois'), ('sober', 'serious'), ('humble', 'modest', 'small'), ('intelligent', 'healthy', 'sound'), ('reliable', 'authentic', 'honest', 'true'), ('annoying', 'irritating'), ('loyal', 'patriotic', 'firm', 'fast'), 

Out of all of those, let's pick a few:

In [26]:
synonyms = [ ('spiritual', 'religious'), ('angry', 'furious'), ('radical', 'revolutionary'),
            ('vocal', 'outspoken'), ('grateful', 'thankful'), ('intelligent', 'smart'),
            ('humble', 'modest'), ('courageous', 'brave'), ('happy', 'joyful'), ('scared', 'frightened'),
            ('strict', 'rigid'), ('careful', 'measured')
           ]


antonyms = [('progressive', 'conservative'),('religious', 'secular'),('positive', 'negative'),
            ('defensive', 'offensive'),('rude',  'polite'),('optimistic', 'pessimistic'),
            ('stupid', 'smart'),('negative', 'positive'),('unhappy', 'happy'),('active', 'passive'),
            ('impatient', 'patient'),('powerless', 'powerful'),('visible', 'invisible'),('fat', 'thin'),
            ('bad', 'good'),('cautious', 'brave'), ('hopeful', 'hopeless'),('insecure', 'secure'),
            ('humble', 'proud'),('passive', 'active'),('dependent', 'independent'),('pessimistic', 'optimistic'),
            ('irresponsible', 'responsible'),('courageous', 'fearful')]

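With these, we can create simple MFTs: wanting to be *more* of a word and *less* of its synonym should be a contradiction, while *more* of a word and *less* of its antonym should be an entailment.  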


In [27]:
# moreless {synonym} should be contradiction

t = editor.template([
    (
    '{first_name} wants to be more {x[0]}?',
    '{first_name} wants to be less {x[1]}?'
    ),
      (
    '{first_name} wants to be less {x[0]}?',
    '{first_name} wants to be more {x[1]}?'
    )
    ],
    unroll=True,
    x=synonyms,
    remove_duplicates=True, 
    nsamples=200)
name = 'wants to be more/less {synonym}?' 
desc = 'different (simple) templates where words are replaced with their synonyms'
test = MFT(**t, labels=2, name=name, capability = 'Taxonomy',
          description=desc)
suite.add(test, overwrite=True)



# moreless {antonym} should be entailment
t = editor.template([
    (
    '{first_name} wants to be more {x[0]}',
    '{first_name} wants to be less {x[1]}'
    ),
      (
    '{first_name} wants to be less {x[0]}',
    '{first_name} wants to be more {x[1]}'
    )
    ],
    unroll=True,
    x=antonyms,
    remove_duplicates=True, 
    nsamples=200)
name = 'wants to be more/less {antonym}?' 
desc = 'different (simple) templates where words are replaced with their antonym'
test = MFT(**t, labels=0, name=name, capability = 'Taxonomy',
          description=desc)
suite.add(test, overwrite=True)
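The way the two templates are unrolled over the word pairs can be sketched in plain Python. This is an illustration under simplifying assumptions, not CheckList's `editor.template`: the `names` list is a hypothetical stand-in for `{first_name}`, there is no sampling, and `expand` is an illustrative helper. The labels follow the tests above (2 for synonyms, 0 for antonyms).

```python
synonyms = [('humble', 'modest'), ('happy', 'joyful')]
antonyms = [('rude', 'polite'), ('bad', 'good')]
names = ['Sue', 'Harry']  # hypothetical stand-in for {first_name}

# The two (premise, hypothesis) templates: more/less and less/more.
templates = [
    ('{name} wants to be more {a}', '{name} wants to be less {b}'),
    ('{name} wants to be less {a}', '{name} wants to be more {b}'),
]

def expand(pairs, label):
    """Fill every template with every name and word pair, attaching the expected label."""
    examples = []
    for name in names:
        for a, b in pairs:
            for premise, hypothesis in templates:
                examples.append((premise.format(name=name, a=a),
                                 hypothesis.format(name=name, b=b),
                                 label))
    return examples

synonym_cases = expand(synonyms, label=2)  # expected: contradiction
antonym_cases = expand(antonyms, label=0)  # expected: entailment
print(synonym_cases[0])
# ('Sue wants to be more humble', 'Sue wants to be less modest', 2)
```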

In [28]:
# most/least {synonym} should be contradiction

t = editor.template([
    (
    '{first_name} is the most {x[0]} person in {mask}',
    '{first_name} is the least {x[1]} person in {mask}'
    ),
      (
    '{first_name} is the least {x[0]} person in {mask}.',
    '{first_name} is the most  {x[1]} person in {mask}.'
    )
    ],
    unroll=True,
    x=synonyms,
    remove_duplicates=True, 
    nsamples=200)
name = 'is the most/least {synonym} in {mask}' 
desc = 'different (simple) templates where words are replaced with their synonyms'
test = MFT(**t, labels=2, name=name, capability = 'Taxonomy',
          description=desc)
suite.add(test, overwrite=True)



# most/least {antonym} should be entailment
t = editor.template([
    (
    '{first_name} is the most {x[0]} person in {mask}',
    '{first_name} is the least {x[1]} person in {mask}'
    ),
      (
    '{first_name} is the least {x[0]} person in {mask}.',
    '{first_name} is the most  {x[1]} person in {mask}.'
    )
    ],
    unroll=True,
    x=antonyms,
    remove_duplicates=True, 
    nsamples=200)
name = 'is the most/least {antonym} in {mask}' 
desc = 'different (simple) templates where words are replaced with their antonym'
test = MFT(**t, labels=0, name=name, capability = 'Taxonomy',
          description=desc)
suite.add(test, overwrite=True)

It is also easy to turn the synonym test into an INV: we replace words with their synonyms in real pairs and expect the prediction to stay the same. After that, we run the suite again and look at the new results.

In [29]:
import re
def replace_pairs(pairs):
    def replace_z(text):
        ret = []
        for x, y in pairs:
            t = re.sub(r'\b%s\b' % x, y, text )
            if t != text:
                ret.append(t)
            if y == 'smart':
                continue
            t = re.sub(r'\b%s\b' % y, x, text )
            if t != text:
                ret.append(t)
        return list(set(ret))
    return replace_z
def apply_and_pair(fn):
    def ret_fn(text):
        ret = fn(text)
        return [(text, r) for r in ret]
    return ret_fn


def apply_to_each_and_product(fn):
    def apply_to_one(x):
        p = fn(x)
        if not p:
            p = []
        return list(set([x] + p))
    def ret_fn(pair):
        p1 = apply_to_one(pair[0])
        p2 = apply_to_one(pair[1])
        return [x for x in itertools.product(p1, p2) if x != pair]
    return ret_fn



name = '(INV) Replace synonyms in real pairs'
desc = ''
t = Perturb.perturb(parsed_qs, apply_to_each_and_product(replace_pairs(synonyms)), nsamples=1000, keep_original=True)
test = INV(t.data, threshold=0.1, name=name, description=desc, capability='Taxonomy')
# test.run(new_pp, n=500, seed=1)
# test.summary(n=3)
suite.add(test, overwrite=True)
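To see what the product wrapper produces, here is a self-contained sketch on a single toy pair. It is a simplified variant of the helpers above, not a copy: it omits the special case for `'smart'`, escapes the words with `re.escape`, and sorts results so the output is deterministic. The names `each_and_product` and the toy sentences are hypothetical.

```python
import itertools
import re

def replace_pairs(pairs):
    """Build a function that swaps each word with its partner, in both directions."""
    def replace(text):
        out = set()
        for x, y in pairs:
            for src, dst in ((x, y), (y, x)):
                new = re.sub(r'\b%s\b' % re.escape(src), dst, text)
                if new != text:
                    out.add(new)
        return sorted(out)
    return replace

def each_and_product(fn):
    """Perturb each side of a pair, then combine all variants, dropping the original pair."""
    def apply(pair):
        first = sorted(set([pair[0]] + fn(pair[0])))
        second = sorted(set([pair[1]] + fn(pair[1])))
        return [p for p in itertools.product(first, second) if p != pair]
    return apply

perturb = each_and_product(replace_pairs([('happy', 'joyful')]))
pair = ('A happy man smiles.', 'The man is joyful.')
variants = perturb(pair)
for variant in variants:
    print(variant)
```

For this pair the sketch yields three variants: the premise changed, the hypothesis changed, and both changed.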

In [30]:
suite.run(pred_and_conf, overwrite=True)

Running same adjectives, different people
Predicting 297 examples
Running Negation contradiction
Predicting 300 examples
Running Change gender in both
Predicting 400 examples
Running Change gender in one of the questions
Predicting 600 examples
Running wants to be more/less {synonym}?
Predicting 400 examples
Running wants to be more/less {antonym}?
Predicting 400 examples
Running is the most/least {synonym} in {mask}
Predicting 400 examples




Running is the most/least {antonym} in {mask}
Predicting 400 examples
Running (INV) Replace synonyms in real pairs
Predicting 92 examples


In [31]:
suite.summary()

Taxonomy

wants to be more/less {synonym}?
Test cases:      400
Fails (rate):    0 (0.0%)


wants to be more/less {antonym}?
Test cases:      400
Fails (rate):    400 (100.0%)

Example fails:
0.0 0.0 1.0 ('Sue wants to be less bad', 'Sue wants to be more good')
----
0.0 0.0 1.0 ('Marilyn wants to be less passive', 'Marilyn wants to be more active')
----
0.0 0.0 1.0 ('Sharon wants to be more humble', 'Sharon wants to be less proud')
----


is the most/least {synonym} in {mask}
Test cases:      400
Fails (rate):    13 (3.2%)

Example fails:
0.5 0.1 0.4 ('Catherine is the most careful person in London', 'Catherine is the least measured person in London')
----
0.5 0.1 0.4 ('Victoria is the most careful person in life', 'Victoria is the least measured person in life')
----
0.5 0.1 0.4 ('Harry is the most humble person in NYC', 'Harry is the least modest person in NYC')
----


is the most/least {antonym} in {mask}
Test cases:      400
Fails (rate):    395 (98.8%)

Example fails:
0.0 0.0 0.9 

## Robustness

In [32]:
def wrap_apply_to_each(fn, both=False, *args, **kwargs):
    def new_fn(qs, *args, **kwargs):
        q1, q2 = qs
        ret = []
        fnq1 = fn(q1, *args, **kwargs)
        fnq2 = fn(q2, *args, **kwargs)
        if type(fnq1) != list:
            fnq1 = [fnq1]
        if type(fnq2) != list:
            fnq2 = [fnq2]
        ret.extend([(x, str(q2)) for x in fnq1])
        ret.extend([(str(q1), x) for x in fnq2])
        if both:
            ret.extend([(x, x2) for x, x2 in itertools.product(fnq1, fnq2)])
        return [x for x in ret if x[0] and x[1]]
    return new_fn
def wrap_apply_to_both(fn, *args, **kwargs):
    def new_fn(qs, *args, **kwargs):
        q1, q2 = qs
        ret = []
        fnq1 = fn(q1, *args, **kwargs)
        fnq2 = fn(q2, *args, **kwargs)
        if type(fnq1) != list:
            fnq1 = [fnq1]
        if type(fnq2) != list:
            fnq2 = [fnq2]
        ret.extend([(x, x2) for x, x2 in itertools.product(fnq1, fnq2)])
        return [x for x in ret if x[0] and x[1]]
    return new_fn

Typos

In [33]:
t = Perturb.perturb(parsed_qs, wrap_apply_to_each(Perturb.add_typos), nsamples=500)
test = INV(t.data, name='add one typo', capability='Robustness', description='')
# test.run(new_pp)
# test.summary(3)
suite.add(test, overwrite=True)
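As an illustration of the typo perturbation, the sketch below applies one typo to each side of a pair in turn. It assumes a typo is a swap of two adjacent characters, which is our understanding of what `Perturb.add_typos` does; `swap_adjacent` and `typo_on_each` are hypothetical names, and a seeded `random.Random` is used so the result is reproducible.

```python
import random

def swap_adjacent(text, rng):
    """Introduce one typo by swapping a random character with its right neighbour."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return ''.join(chars)

def typo_on_each(pair, seed=0):
    """Return two variants: typo in the premise only, then typo in the hypothesis only."""
    rng = random.Random(seed)
    premise, hypothesis = pair
    return [(swap_adjacent(premise, rng), hypothesis),
            (premise, swap_adjacent(hypothesis, rng))]

pair = ('Two women are embracing.', 'Two women are hugging.')
variants = typo_on_each(pair)
for variant in variants:
    print(variant)
```

Each sampled pair therefore contributes the original plus two perturbed versions, so 500 samples correspond to 1500 predictions.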

Contractions

In [34]:
t = Perturb.perturb(parsed_qs, wrap_apply_to_each(Perturb.contractions, both=True), nsamples=500)
test = INV(**t, name='contractions', capability='Robustness', description='')
# test.run(new_pp)
# test.summary(3)
suite.add(test, overwrite=True)

In [35]:
suite.run(pred_and_conf, overwrite=True)
suite.summary()

Running same adjectives, different people
Predicting 297 examples
Running Negation contradiction
Predicting 300 examples
Running Change gender in both
Predicting 400 examples
Running Change gender in one of the questions
Predicting 600 examples
Running wants to be more/less {synonym}?
Predicting 400 examples
Running wants to be more/less {antonym}?
Predicting 400 examples
Running is the most/least {synonym} in {mask}
Predicting 400 examples
Running is the most/least {antonym} in {mask}
Predicting 400 examples
Running (INV) Replace synonyms in real pairs
Predicting 92 examples
Running add one typo
Predicting 1500 examples
Running contractions
Predicting 1038 examples
Taxonomy

wants to be more/less {synonym}?
Test cases:      400
Fails (rate):    0 (0.0%)


wants to be more/less {antonym}?
Test cases:      400
Fails (rate):    400 (100.0%)

Example fails:
0.0 0.0 1.0 ('Caroline wants to be less progressive', 'Caroline wants to be more conservative')
----
0.0 0.0 1.0 ('Hugh wants to be l

## Logic

In [36]:
# subset of:
#', '.join([str(x) for x in editor.suggest('The woman likes to {mask} and {mask}.')])


fverb = [('joke', 'sing'), ('jump', 'climb'), ('yell', 'point'), ('rock', 'ride'), ('run', 'travel'), 
         ('fight', 'lose'), ('cook', 'wine'), ('smoke', 'watch'), ('party', 'eat'), ('nap', 'drink'), 
         ('sweat', 'drink'), ('run', 'exercise'), ('cook', 'jam'), ('lie', 'rape'), ('clean', 'iron'), 
         ('bake', 'garden'), ('drink', 'live'), ('read', 'create'), ('cook', 'build'), ('go', 'flirt'), 
         ('talk', 'explain'), ('sit', 'browse'), ('fight', 'run'), ('crash', 'drive'), ('cry', 'argue'), 
         ('flirt', 'dance'), ('write', 'photograph'), ('drink', 'workout'), ('walk', 'fish'), ('drive', 'travel'),
         ('nap', 'talk'), ('drink', 'crash'), ('dance', 'fart'), ('talk', 'act'), ('shave', 'pray'), 
         ('stay', 'play'), ('dress', 'shower'), ('travel', 'train'), ('talk', 'hug'), ('nap', 'shower'), 
         ('party', 'sleep'), ('eat', 'smell'), ('stay', 'work'), ('come', 'leave'), 
         ('drink', 'complain'), ('cook', 'barbecue'), ('fish', 'skate'), ('speak', 'read'), 
         ('knit', 'dye'), ('shave', 'groom'), ('hunt', 'party'), ('live', 'eat'), ('dance', 'hug'),
         ('laugh', 'love'), ('lie', 'kill'), ('hug', 'pinch'), ('go', 'run'), ('pool', 'party'),
         ('dress', 'serve'), ('fight', 'wrestle'), ('sleep', 'cook'), ('fish', 'drink'), ('try', 'win'), 
         ('shop', 'party'), ('dance', 'skate'), ('talk', 'please'), ('climb', 'run'), ('read', 'rap'), 
         ('drink', 'quarrel'), ('joke', 'smile'), ('run', 'fast'), ('cook', 'gamble'), ('drink', 'deal'), 
         ('gamble', 'smoke'), ('smoke', 'relax'), ('hug', 'chat'), ('lie', 'swear'), ('smoke', 'fish'), 
         ('steal', 'kill'), ('climb', 'dance'), ('work', 'write'), ('hunt', 'hike'), ('laugh', 'relax'), 
         ('sing', 'eat'), ('go', 'walk'), ('scream', 'spit'), ('exercise', 'train'), ('smoke', 'cough'),
         ('drink', 'knit'), ('relax', 'reflect'), ('talk', 'speak'), ('fish', 'explore'), 
         ('hide', 'watch'), ('sing', 'entertain'), ('jump', 'slide'), ('drink', 'compete'), ('eat', 'learn'), 
         ('eat', 'wash'), ('sleep', 'sing'), ('cook', 'flirt'), ('tease', 'annoy'), ('sleep', 'fish'), 
         ('swim', 'paddle'), ('cook', 'think'), ('party', 'gossip'), ('dance', 'joke'), ('sit', 'drive'),
         ('cook', 'swim'), ('eat', 'garden'), ('sing', 'fly'), ('shave', 'eat'), ('sit', 'chew'), 
         ('play', 'laugh'), ('clean', 'dress'), ('eat', 'date'), ('text', 'Facebook'), ('drink', 'celebrate'), 
         ('surf', 'hike'), ('make', 'bake'), ('fly', 'swim'), ('drink', 'garden'),  
         ('paint', 'glue'), ('hide', 'wait'), ('dress', 'laugh'), ('shave', 'bald'), ('cook', 'recycle'), 
         ('run', 'write'), ('stand', 'cry'), ('sit', 'bleed'), ('stay', 'linger'), ('kill', 'eat'), 
         ('talk', 'type'), ('sleep', 'laugh'), ('chat', 'read'), ('smile', 'blush'), ('fight', 'kill'), 
         ('come', 'ride'), ('talk', 'study'), ('write', 'drink'), ('hide', 'talk'), ('dress', 'work'), 
         ('dance', 'DJ'), ('eat', 'kiss'), ('kiss', 'swear'), ('write', 'study'), ('dance', 'walk'),
         ('relax', 'eat'), ('drink', 'win'), ('stay', 'wait'), ('shoot', 'shoot'), ('cook', 'ski'),
         ('hide', 'kill'), ('eat', 'spend'), ('stretch', 'sweat'), ('chat', 'write'), ('play', 'run'), 
         ('relax', 'play'), ('read', 'vote'), ('eat', 'snack'), ('eat', 'dream'), ('rock', 'play'),
         ('kiss', 'lick'), ('yell', 'stomp'), ('chat', 'talk'), ('try', 'cook'), ('sit', 'pee'), 
         ('paint', 'model'), ('cook', 'invent'), ('pray', 'talk'), ('cry', 'smile'), ('kill', 'shoot'), 
         ('fly', 'surf'), ('dance', 'type'), ('cut', 'twist'), ('shower', 'drive'), ('stop', 'wait'),
         ('cry', 'fight'), ('drink', 'explore'), ('read', 'photograph'), ('knit', 'sing'), 
         ('fight', 'dance'), ('sit', 'pace'), ('sing', 'flirt'), ('run', 'hunt'), ('hug', 'play'), 
         ('sew', 'design'), ('live', 'breathe'), ('paint', 'design'), ('rock', 'groove'), ('relax', 'think'), 
         ('cook', 'tie'), ('flirt', 'joke'), ('sit', 'float'), ('come', 'flirt'), ('talk', 'bake'),
         ('sleep', 'fart'), ('mix', 'tell'), ('travel', 'hunt'), ('shoot', 'kill'), ('hunt', 'hide'), 
         ('hug', 'apologize'), ('yell', 'punch'), ('read', 'hunt'), ('yell', 'laugh'), ('sleep', 'chill'), 
         ('cry', 'sob'), ('stand', 'wait'), ('scream', 'bang'), ('shave', 'change'), ('mix', 'switch'), 
         ('sing', 'skate'), ('read', 'drive'), ('read', 'surf'), ('lie', 'twist'), ('crash', 'dance'),
         ('read', 'gamble'), ('sit', 'rock'), ('walk', 'explore'), ('fish', 'wrestle'), ('surf', 'surf'), 
         ('relax', 'chat'), ('move', 'explore'), ('dance', 'color'), ('dance', 'think'), ('chat', 'drink')]

In [37]:
fverb[0]

('joke', 'sing')

In [38]:
# A & B -> A
t = editor.template((
    'The woman likes to {fverb[0]} and {fverb[1]}',
    'She likes to {fverb[0]}',
    ),
    fverb=fverb,
    remove_duplicates=True, 
    nsamples=400)
test = MFT(**t, labels=0, name='both A and B entailment', capability = 'Logic', 
           description='A & B implies truth of A')
suite.add(test, overwrite=True)
print(t.data[0])
print(t.data[1])


# A & B contradicts ~A
t = editor.template((
    'The woman likes to {fverb[0]} and {fverb[1]}',
    "She doesn't like to {fverb[0]}"
    ),
    fverb=fverb,
    remove_duplicates=True, 
    nsamples=400)
test = MFT(**t, labels=2, name='both A and B contradiction ', capability = 'Logic', 
           description='A & B implies not A is false')
suite.add(test, overwrite=True)
print(t.data[0])
print(t.data[1])


# ~A & ~B contradicts A
t = editor.template((
    'The woman likes to neither {fverb[0]} nor {fverb[1]}',
    'She likes to {fverb[0]}',
    ),
    fverb=fverb,
    remove_duplicates=True, 
    nsamples=400)
test = MFT(**t, labels=2, name='neither nor contradiction', capability = 'Logic', 
           description='~A & ~B contradicts A')
suite.add(test, overwrite=True)
print(t.data[0])
print(t.data[1])



# ~A & ~B -> ~A
t = editor.template((
    'The woman likes to neither {fverb[0]} nor {fverb[1]}',
    "She doesn't like to {fverb[0]}"
    ),
    fverb=fverb,
    remove_duplicates=True, 
    nsamples=400)
test = MFT(**t, labels=0, name='neither nor entailment', capability = 'Logic', 
           description='~A & ~B implies ~A')
suite.add(test, overwrite=True)
print(t.data[0])
print(t.data[1])

('The woman likes to fly and swim', 'She likes to fly')
('The woman likes to kill and eat', 'She likes to kill')
('The woman likes to flirt and dance', "She doesn't like to flirt")
('The woman likes to try and win', "She doesn't like to try")
('The woman likes to neither jump nor climb', 'She likes to jump')
('The woman likes to neither come nor leave', 'She likes to come')
('The woman likes to neither fish nor drink', "She doesn't like to fish")
('The woman likes to neither laugh nor love', "She doesn't like to laugh")
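The four logic tests above follow one pattern, which can be summarized in a small plain-Python sketch. It only restates the templates and expected labels used in this notebook (0 for entailment, 2 for contradiction); `logic_cases` is a hypothetical helper, and only two verb pairs are used.

```python
verb_pairs = [('joke', 'sing'), ('jump', 'climb')]

ENTAILMENT, CONTRADICTION = 0, 2

def logic_cases(a, b):
    """Build the four (premise, hypothesis, expected label) cases for one verb pair."""
    both = f'The woman likes to {a} and {b}'
    neither = f'The woman likes to neither {a} nor {b}'
    likes = f'She likes to {a}'
    dislikes = f"She doesn't like to {a}"
    return [
        (both, likes, ENTAILMENT),        # A & B entails A
        (both, dislikes, CONTRADICTION),  # A & B contradicts ~A
        (neither, likes, CONTRADICTION),  # ~A & ~B contradicts A
        (neither, dislikes, ENTAILMENT),  # ~A & ~B entails ~A
    ]

cases = [case for a, b in verb_pairs for case in logic_cases(a, b)]
for case in cases[:4]:
    print(case)
```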


In [39]:
suite.run(pred_and_conf, overwrite=True)
suite.summary()

Running same adjectives, different people
Predicting 297 examples
Running Negation contradiction
Predicting 300 examples
Running Change gender in both
Predicting 400 examples
Running Change gender in one of the questions
Predicting 600 examples
Running wants to be more/less {synonym}?
Predicting 400 examples
Running wants to be more/less {antonym}?
Predicting 400 examples
Running is the most/least {synonym} in {mask}
Predicting 400 examples
Running is the most/least {antonym} in {mask}
Predicting 400 examples
Running (INV) Replace synonyms in real pairs
Predicting 92 examples
Running add one typo
Predicting 1500 examples
Running contractions
Predicting 1038 examples
Running both A and B entailment
Predicting 400 examples
Running both A and B contradiction 
Predicting 400 examples
Running neither nor contradiction
Predicting 400 examples
Running neither nor entailment
Predicting 400 examples
Taxonomy

wants to be more/less {synonym}?
Test cases:      400
Fails (rate):    0 (0.0%)


want