# Running a CheckList test suite for SQuAD
Source: code adapted, with some changes, from https://github.com/marcotcr/checklist/blob/115f123de47ab015b2c3a6baebaffb40bab80c9f/notebooks/tutorials/5.%20Testing%20transformer%20pipelines.ipynb

In [1]:
from checklist.test_suite import TestSuite
import numpy as np
from transformers import pipeline

In [2]:

model = pipeline('question-answering', model="distilbert-base-cased-distilled-squad")
model({
    'context': 'A new strain of flu that has the potential to become a pandemic has been identified by scientists.',
    'question': 'What has been discovered by scientists?'
})

{'score': 0.38112872838974,
 'start': 0,
 'end': 19,
 'answer': 'A new strain of flu'}
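
The `start` and `end` fields in the result above are character offsets into the context string. As a small check that needs no model download, the sketch below copies the context and the returned dictionary from the cell above and slices the context with those offsets; the variable names `context`, `result`, and `span` are only illustrative.

```python
# Context and result copied verbatim from the cell above.
context = ('A new strain of flu that has the potential to become a pandemic '
           'has been identified by scientists.')
result = {'score': 0.38112872838974,
          'start': 0,
          'end': 19,
          'answer': 'A new strain of flu'}

# Slicing the context with the returned offsets recovers the answer text.
span = context[result['start']:result['end']]
print(repr(span))
```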

In [3]:
suite_path = './squad_suite2.pkl'
suite = TestSuite.from_file(suite_path)

In [5]:
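def predconfs(context_question_pairs):
    """Return one answer string and one confidence per (context, question) pair."""
    preds = []
    confs = []
    for c, q in context_question_pairs:
        try:
            p = model(question=q, context=c, truncation=True)
        except Exception:
            # On failure, record a placeholder prediction and move on to the next pair.
            print('Failed', q)
            preds.append(' ')
            confs.append(1)
            continue
        preds.append(p['answer'])
        confs.append(p['score'])
    return preds, np.array(confs)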

def format_squad_with_context(x, pred, conf, label=None, *args, **kwargs):
    c, q = x
    ret = 'C: %s\nQ: %s\n' % (c, q)
    if label is not None:
        ret += 'A: %s\n' % label
    ret += 'P: %s\n' % pred
    return ret
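
As used here, the function passed to `suite.run` takes a list of (context, question) pairs and returns a list of predictions together with an array of confidences, one entry per pair, including pairs where the model call fails. The sketch below exercises that contract without loading a transformer. `make_predconfs` and `stub_model` are hypothetical names introduced only for this illustration, and the stub simply answers with the first word of the context.

```python
import numpy as np

def make_predconfs(qa_fn):
    """Wrap a question-answering callable in a predconfs-style function."""
    def predconfs(context_question_pairs):
        preds, confs = [], []
        for c, q in context_question_pairs:
            try:
                p = qa_fn(question=q, context=c)
            except Exception:
                # Keep outputs aligned with inputs: one placeholder per failure.
                preds.append(' ')
                confs.append(1)
                continue
            preds.append(p['answer'])
            confs.append(p['score'])
        return preds, np.array(confs)
    return predconfs

def stub_model(question, context):
    # Hypothetical stand-in for the pipeline: fails on an empty context,
    # otherwise "answers" with the first word of the context.
    if not context:
        raise ValueError('empty context')
    return {'answer': context.split()[0], 'score': 0.5}

pairs = [
    ('Sarah is hotter than Rose.', 'Who is hotter?'),
    ('', 'Who is cooler?'),  # triggers the failure branch
]
preds, confs = make_predconfs(stub_model)(pairs)
print(preds, confs)
```

Both outputs have the same length as `pairs`, so each prediction still lines up with the example that produced it.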

In [6]:
%%time
suite.run(predconfs, n=100, overwrite=True)

Running A is COMP than B. Who is more COMP?
Predicting 100 examples
Running A is COMP than B. Who is less COMP?
Predicting 100 examples
Running Intensifiers (very, super, extremely) and reducers (somewhat, kinda, etc)?
Predicting 1200 examples
Running size, shape, age, color
Predicting 400 examples
Running Profession vs nationality
Predicting 1000 examples
Running Animal vs Vehicle
Predicting 400 examples
Running Animal vs Vehicle v2
Predicting 400 examples
Running Synonyms
Predicting 400 examples
Running A is COMP than B. Who is antonym(COMP)? B
Predicting 400 examples
Running A is more X than B. Who is more antonym(X)? B. Who is less X? B. Who is more X? A. Who is less antonym(X)? A.
Predicting 1600 examples
Running Question typo
Predicting 200 examples
Running Question contractions
Predicting 200 examples
Running Add random sentence to context
Predicting 300 examples
Running Change name everywhere
Predicting 1100 examples
Running Change location everywhere
Predicting 1100 examples
R

In [7]:
suite.summary(n=0)

Vocabulary

A is COMP than B. Who is more COMP?
Test cases:      497
Test cases run:  100
Fails (rate):    3 (3.0%)


A is COMP than B. Who is less COMP?
Test cases:      497
Test cases run:  100
Fails (rate):    100 (100.0%)


Intensifiers (very, super, extremely) and reducers (somewhat, kinda, etc)?
Test cases:      496
Test cases run:  100
Fails (rate):    100 (100.0%)




Taxonomy

size, shape, age, color
Test cases:      500
Test cases run:  100
Fails (rate):    100 (100.0%)


Profession vs nationality
Test cases:      500
Test cases run:  100
Fails (rate):    18 (18.0%)


Animal vs Vehicle
Test cases:      500
Test cases run:  100
Fails (rate):    60 (60.0%)


Animal vs Vehicle v2
Test cases:      498
Test cases run:  100
Fails (rate):    63 (63.0%)


Synonyms
Test cases:      449
Test cases run:  100
Fails (rate):    12 (12.0%)


A is COMP than B. Who is antonym(COMP)? B
Test cases:      498
Test cases run:  100
Fails (rate):    100 (100.0%)


A is more X than B. Who is more ant
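
The failure rate in each block is the number of fails divided by the number of test cases run. To rank tests by that rate, one option is to parse the printed text. This is a minimal sketch that assumes the four-line block layout shown above; `parse_summary` and `BLOCK` are hypothetical helpers, not part of CheckList, and `summary_text` holds two blocks copied from the output.

```python
import re

# Two blocks copied from the summary output above.
summary_text = """\
A is COMP than B. Who is more COMP?
Test cases:      497
Test cases run:  100
Fails (rate):    3 (3.0%)


A is COMP than B. Who is less COMP?
Test cases:      497
Test cases run:  100
Fails (rate):    100 (100.0%)
"""

# One test block: a name line followed by the three count lines.
BLOCK = re.compile(
    r'^(?P<name>.+)\n'
    r'Test cases:\s+(?P<total>\d+)\n'
    r'Test cases run:\s+(?P<run>\d+)\n'
    r'Fails \(rate\):\s+(?P<fails>\d+) \(',
    re.MULTILINE,
)

def parse_summary(text):
    """Return one dict per test block, sorted by failure rate, highest first."""
    rows = []
    for m in BLOCK.finditer(text):
        run, fails = int(m['run']), int(m['fails'])
        rate = fails / run if run else 0.0
        rows.append({'name': m['name'], 'run': run, 'fails': fails, 'rate': rate})
    return sorted(rows, key=lambda r: r['rate'], reverse=True)

rows = parse_summary(summary_text)
for r in rows:
    print(f"{r['rate']:6.1%}  {r['name']}")
```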

In [8]:
suite.summary(format_example_fn=format_squad_with_context)

Vocabulary

A is COMP than B. Who is more COMP?
Test cases:      497
Test cases run:  100
Fails (rate):    3 (3.0%)

Example fails:
C: Sarah is hotter than Rose.
Q: Who is hotter?
A: Sarah
P: Sarah is hotter than Rose

----
C: Janet is cooler than Harriet.
Q: Who is cooler?
A: Janet
P: Harriet

----
C: Claire is thinner than Anna.
Q: Who is thinner?
A: Claire
P: Anna

----


A is COMP than B. Who is less COMP?
Test cases:      497
Test cases run:  100
Fails (rate):    100 (100.0%)

Example fails:
C: Eric is weaker than Leslie.
Q: Who is less weaker?
A: Leslie
P: Eric

----
C: Catherine is older than William.
Q: Who is less older?
A: William
P: Catherine

----
C: Jean is smarter than Ian.
Q: Who is less smarter?
A: Ian
P: Jean

----


Intensifiers (very, super, extremely) and reducers (somewhat, kinda, etc)?
Test cases:      496
Test cases run:  100
Fails (rate):    100 (100.0%)

Example fails:
C: Charlie is enthusiastic about the project. Dave is particularly enthusiastic about the pro

In [10]:
!jupyter nbextension install --py --sys-prefix checklist.viewer

Installing C:\Users\fgmal\AppData\Local\Programs\Python\Python311\Lib\site-packages\checklist\viewer\static -> viewer
Up to date: C:\Users\fgmal\AppData\Local\Programs\Python\Python311\share\jupyter\nbextensions\viewer\bundle.js
Up to date: C:\Users\fgmal\AppData\Local\Programs\Python\Python311\share\jupyter\nbextensions\viewer\bundle.js.map
Up to date: C:\Users\fgmal\AppData\Local\Programs\Python\Python311\share\jupyter\nbextensions\viewer\custom.js
Up to date: C:\Users\fgmal\AppData\Local\Programs\Python\Python311\share\jupyter\nbextensions\viewer\extension.js
Up to date: C:\Users\fgmal\AppData\Local\Programs\Python\Python311\share\jupyter\nbextensions\viewer\index.js
Up to date: C:\Users\fgmal\AppData\Local\Programs\Python\Python311\share\jupyter\nbextensions\viewer\index.js.map
Up to date: C:\Users\fgmal\AppData\Local\Programs\Python\Python311\share\jupyter\nbextensions\viewer\__init__.py
Up to date: C:\Users\fgmal\AppData\Local\Programs\Python\Python311\share\jupyter\nbextensions\

In [11]:
!jupyter nbextension enable checklist.viewer --py --sys-prefix

Enabling notebook extension viewer/extension...
      - Validating: ok


In [12]:
suite.visual_summary_table()

Please wait as we prepare the table data...


SuiteSummarizer(stats={'npassed': 0, 'nfailed': 0, 'nfiltered': 0}, test_infos=[{'name': 'A is COMP than B. Wh…