In [1]:
from traitlets.config.manager import BaseJSONConfigManager
from pathlib import Path
path = Path.home() / ".jupyter" / "nbconfig"
cm = BaseJSONConfigManager(config_dir=str(path))
cm.update(
    "rise",
    {
        "theme": None,
        "transition": None,
        "start_slideshow_at": "selected",
        "leap_motion": {
            "naturalSwipe"  : True,     # Invert swipe gestures
            "pointerOpacity": 0.5,      # Set pointer opacity to 0.5
            "pointerColor"  : "#d80000" # Red pointer
        },
        "header": "<h3>Francisco Perez-Sorrosal</h3>",
        "footer": "<h3>Machine Learning/Deep Learning</h3>",
        "scroll": True,
        "enable_chalkboard": True
     }
)

{'start_slideshow_at': 'selected',
 'leap_motion': {'naturalSwipe': True,
  'pointerOpacity': 0.5,
  'pointerColor': '#d80000'},
 'header': '<h3>Francisco Perez-Sorrosal</h3>',
 'footer': '<h3>Machine Learning/Deep Learning</h3>',
 'scroll': True,
 'enable_chalkboard': True}

In [None]:
pip install emoji --upgrade

In [None]:
import emoji
print(emoji.emojize('Presenting stuff is easy!!! :thumbs_up:'))

In [None]:
# Emojis http://getemoji.com/

# Beyond Accuracy: Behavioral Testing of NLP models with CheckList

## Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, Sameer Singh

### [https://arxiv.org/abs/2005.04118](https://arxiv.org/abs/2005.04118)

### [pdf](https://arxiv.org/pdf/2005.04118.pdf)


---

Francisco Perez-Sorrosal | 14 Sep 2020


# Summary (What do the authors sell)

**Best paper award in ACL 2020**

Presents **Checklist**, a _methodology and a tool_ for testing NLP models

💡 Based on behavioral testing (software engineering)

💡 Composed of:

1. matrix of general linguistic capabilities and test types (to ease comprehensive test ideation)

2. tool to generate test cases

💡 Illustrates the methodology on 3 use cases

💡 Analyzes user behaviour while using the tool to justify its utility

💡 Reminds me of the early papers on SW quality or papers from the [SEI](https://www.sei.cmu.edu/publications/technical-papers/index.cfm) but applied to NLP

- [The CERT Function Extraction Experiment: Quantifying FX Impact on Software  Comprehension and Verification, Collins et al, 2005](https://kilthub.cmu.edu/articles/The_CERT_Function_Extraction_Experiment_Quantifying_FX_Impact_on_Software_Comprehension_and_Verification/6585095/1)

# (General) Current Evaluation Scenario for Models

💡Standard paradigm for evaluation:

 - Using train-validation-test splits to estimate the model's accuracy (or any other metric)
    
💡While this is useful, eval DSs are:
    
 - often not comprehensive
 - contain the same biases as the training data

💡So, by summarizing the model performance as a single aggregate statistic:
 
 - The performance on the real-world data can be overestimated
 - It is difficult to figure out where the model is failing,
 - and what is more, how to **fix it**


## Authors' Claims and Findings

- Commercial/Research models claim to reach human-level performance (accuracy)

### However, Checklist...
- Reveals a variety of severe bugs/linguistic phenomena not detected in commercial and research models
- Examples:

  1. negation
  2. named entities
  3. coreferences
  4. semantic role labeling
  5. et cetera
  
- Model developers generate double the tests they were generating before

# Behavioral Testing

💡 Behavior Driven Development (BDD) evolved from Test Driven Development (TDD).

💡 __GOAL__: Decouple tests from implementation

💡 BDD relies on human-readable descriptions of SW requirements to assess software.

 - Defines use cases as stories in English-like DSL
 - **Story**: Transfers change balances
   1. As a bank user
   2. When I send money from one of my accounts to another
   3. I want the account balances to update

### Example

 - Assuming I have `$100` in my balance of account A <p> And the balance of my account B is `$10`
    
 - If I transfer `$50` from my Account A to my account B
    
 - Then, in the end, I should have `$50` in the balance of my account A <p> And `$60` in the balance of my account B
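
The Given/When/Then scenario above can be sketched as a plain Python test. This is only an illustration: the `Account` class and the `transfer` function are hypothetical names invented here, and a real BDD setup would normally bind the English steps to code through a dedicated framework.

```python
from dataclasses import dataclass


@dataclass
class Account:
    """Hypothetical bank account holding a balance in dollars."""
    balance: int


def transfer(source: Account, target: Account, amount: int) -> None:
    """Move `amount` from `source` to `target`, refusing overdrafts."""
    if amount > source.balance:
        raise ValueError("insufficient funds")
    source.balance -= amount
    target.balance += amount


def test_transfer_changes_balances() -> None:
    # Given: $100 in account A and $10 in account B
    account_a, account_b = Account(100), Account(10)
    # When: I transfer $50 from A to B
    transfer(account_a, account_b, 50)
    # Then: A holds $50 and B holds $60
    assert account_a.balance == 50
    assert account_b.balance == 60


test_transfer_changes_balances()
```

The test only talks about observable balances, not about how `transfer` is implemented, which is the decoupling BDD aims for.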

# Scenarios They Choose to Demonstrate Applicability

- Sentiment Analysis

- Duplicate Question Detection

- Machine Comprehension



__In all of them the model is treated as a _Black-Box_, which allows comparison of different model implementations, models trained on different data, etc.__


## Real-world Field Studies Description

### It looks like they do a kind of a clinical-study here

1. Sentiment Analysis Model
2. Other NLP practitioners


# Example

### Capability Matrix

- Potential tests structured as a matrix
    - Rows -> capabilities
    - Cols -> test types

![Capability Matrix](images/cap_matrix.png)

The matrix works as a guide, prompting users to test each capability with different test types.


# Example: Model’s Negation capability test

![Capability Matrix](images/cap_matrix.png)

---

- Minimum Functionality test (MFT), i.e. simple test cases designed to target a specific behavior
  1. Generate simple examples filling in a template (“I {NEGATION} {POS_VERB} the {THING}.”) with pre-built lexicons
  2. Compute the model’s failure rate on such examples
![Test Case A](images/test_case_a.png)
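
A minimal sketch of the two MFT steps, using only the standard library rather than the CheckList tool itself. The lexicons and the `toy_model` classifier are assumptions made up for illustration; the toy model only understands a negation word placed directly before the positive verb, so it fails on longer negations.

```python
# Hand-written lexicons (assumptions, not the paper's pre-built ones)
NEGATIONS = ["didn't", "don't", "can't say I"]
POS_VERBS = ["love", "like", "enjoy"]
THINGS = ["food", "flight", "service"]

POSITIVE_WORDS = set(POS_VERBS)
SIMPLE_NEGATIONS = {"didn't", "don't"}


def toy_model(sentence: str) -> str:
    """Toy classifier: flips polarity only if the previous word is a simple negation."""
    words = sentence.lower().rstrip(".").split()
    for i, word in enumerate(words):
        if word in POSITIVE_WORDS:
            negated = i > 0 and words[i - 1] in SIMPLE_NEGATIONS
            return "negative" if negated else "positive"
    return "neutral"


def failure_rate(model, sentences, expected_label):
    """Return the fraction of sentences whose prediction differs from the expected label."""
    failures = [s for s in sentences if model(s) != expected_label]
    return len(failures) / len(sentences), failures


# Step 1: fill in the template "I {NEGATION} {POS_VERB} the {THING}."
sentences = [
    f"I {negation} {verb} the {thing}."
    for negation in NEGATIONS
    for verb in POS_VERBS
    for thing in THINGS
]

# Step 2: every negated positive sentence is expected to be negative
rate, failures = failure_rate(toy_model, sentences, "negative")
print(f"{len(sentences)} cases, failure rate {rate:.0%}")
```

Here the failures all come from the "can't say I" negation, which is the kind of targeted insight a single aggregate accuracy number would hide.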

# Example: Named entity recognition (NER) 

![Capability Matrix](images/cap_matrix.png)

---

 - Tested with an Invariance test (INV) 
   1. perturbations that should not change the output of the model
![Sentiment Analysis INV](images/test_case_b.png)

 - Changing location should not change sentiment
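
An INV test needs no gold labels, because each perturbed prediction is compared with the model's own prediction on the original input. The sketch below assumes a hypothetical `toy_model` with a deliberate spurious cue (it treats "Boston" as negative) and a small hand-written list of locations; neither comes from the paper or the CheckList tool.

```python
import re

LOCATIONS = ["Chicago", "Dallas", "Boston"]  # assumed lexicon for illustration


def swap_location(text: str, new_location: str) -> str:
    """Replace any known location in `text` with `new_location`."""
    pattern = r"\b(" + "|".join(LOCATIONS) + r")\b"
    return re.sub(pattern, new_location, text)


def toy_model(text: str) -> str:
    """Toy classifier with a planted bug: the word 'Boston' triggers 'negative'."""
    lowered = text.lower()
    if "delayed" in lowered or "boston" in lowered:
        return "negative"
    return "positive" if "great" in lowered else "neutral"


def invariance_failures(model, text: str):
    """Return the perturbations whose prediction differs from the original one."""
    original = model(text)
    failures = []
    for location in LOCATIONS:
        perturbed = swap_location(text, location)
        if perturbed == text:
            continue  # nothing changed, so there is nothing to compare
        prediction = model(perturbed)
        if prediction != original:
            failures.append((perturbed, original, prediction))
    return failures


failures = invariance_failures(toy_model, "Great flight to Chicago, thanks!")
for perturbed, before, after in failures:
    print(f"{before} -> {after}: {perturbed}")
```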

# Example: Vocabulary

![Capability Matrix](images/cap_matrix.png)

---

 - Use a Directional Expectation test (DIR) 
   - perturbations to the input with known expected results 

![Test Case A](images/test_case_c.png)

 - Add negative phrases and check sentiment does not become more positive
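
A DIR test can be sketched the same way: append a negative phrase and check that the positive score does not rise by more than a tolerance. The scoring function, the phrase list, and the tolerance value below are all assumptions for illustration; the toy scorer counts "best" as positive even inside "Not the best", which is the bug the test exposes.

```python
import re

POSITIVE = {"great", "good", "best"}
NEGATIVE = {"lame", "awful", "bad"}
NEGATIVE_PHRASES = ["You are lame.", "It was awful.", "Not the best, to be honest."]


def positive_score(text: str) -> float:
    """Toy positivity score in [0, 1] based on keyword counts (ignores negation)."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(word in POSITIVE for word in words)
    neg = sum(word in NEGATIVE for word in words)
    return min(1.0, max(0.0, 0.5 + 0.1 * (pos - neg)))


def directional_failures(score_fn, text: str, phrases, tolerance: float = 0.05):
    """Return perturbations whose score increases by more than `tolerance`."""
    base = score_fn(text)
    failures = []
    for phrase in phrases:
        perturbed = f"{text} {phrase}"
        new_score = score_fn(perturbed)
        if new_score > base + tolerance:
            failures.append((perturbed, base, new_score))
    return failures


failures = directional_failures(positive_score, "The crew was good.", NEGATIVE_PHRASES)
for perturbed, before, after in failures:
    print(f"{before:.1f} -> {after:.1f}: {perturbed}")
```

Unlike INV, the expectation here is a direction (the score must not go up), not equality with the original prediction.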

# Tool Description

- Users have to __identify the language capabilities__ of the tasks at hand
- then __create tests__ to evaluate the model

## Capabilities

1. e.g. Sentiment Analysis

  - Identify words with positive, neutral, negative sentiment
  - Check the behaviour of those words in examples "I had a good flight", "I didn't have a good flight"

![Sentiment Analysis](images/test_case_a.png)

2. Duplicate question detection

  - Understand when modifiers differentiate questions
  - e.g. "accredited" word in "Is John a teacher?", "Is John an accredited teacher?"
  
3. Machine comprehension

  - Relate comparatives and superlatives
  - e.g. Context: Mary is smarter than John. -> Q: "Who's the smartest?" -> A: "Mary"
  
### (Suggested) Capabilities

1. Vocabulary + Part of Speech
2. Taxonomy (synonyms/antonyms)
3. Robustness (to typos)
4. NER
5. Fairness
6. Temporal (e.g. for a sequence of events, which is first, last, etc.)
7. Negation
8. Coreference
9. Semantic Role Labeling (agent, object, etc)
10. Logic (conjunction, disjunction, symmetry, etc)


## Test Types

![Capability Matrix](images/cap_matrix.png)

1. Minimum Functionality Tests
  - Equivalent to Unit Tests in SWE
  - Consist of creating small, focused __test datasets__
  - Useful for __detecting when models use shortcuts when handling complex inputs instead of generalizing__
  - Vocabulary and PoS example
  
2. Invariance
  - Equivalent to Metamorphic tests in SWE
    ![Metamorphic test example](images/metamorphic_test.png)  
  - Add label-preserving perturbations to the input whilst expecting the same output
  - Capability dependent
  - Applicable to unlabeled data
  - Examples:
    1. For Robustness -> Adding new typos
    2. For NER -> changing location
    
    ![Sentiment Analysis INV](images/test_case_b.png)


3. Directional Expectation
  - Similar to INV test type, but label is expected to change in some way ??? <- Not very clear what it means
  - Example:
    - A sentiment does not become more positive when adding extra stuff
    ![Sentiment Analysis DIR](images/test_case_c.png)
  - Applicable to unlabeled data


# Test Cases Generation (https://github.com/marcotcr/checklist)

1. From Scratch

  - Creativity implied
  - Will represent a solid test base

2. Perturbing an existing DS


3. Templates

    - Example:
      "I didn't love the food" -> "I {NEGATION} {POS_VERB} the {THING}" where:
      1. {NEGATION} is "can't" "didn't"...
      2. {POS_VERB} is "love", "like"...
      3. {THING} is "food", "flight", "service"...
    - Then use a cartesian product to generate all the possible test cases
    
    - Can be expanded with a masked language model like RoBERTa to generate stuff automatically
      1. e.g. "I really {mask} the flight" is filled by RoBERTa with "enjoyed", "regret", "loved"...
    ![Tool Mask](images/tool_mask.png)      
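
The cartesian-product step can be sketched with the standard library alone. The `expand_template` helper below is a hypothetical name, not the CheckList API, and it does not cover the masked-language-model suggestions.

```python
from itertools import product
from string import Formatter


def expand_template(template: str, **lexicons):
    """Return every filling of `template`, one per combination of lexicon entries."""
    # Collect placeholder names in order of appearance, without duplicates
    fields = []
    for _, name, _, _ in Formatter().parse(template):
        if name and name not in fields:
            fields.append(name)
    combos = product(*(lexicons[name] for name in fields))
    return [template.format(**dict(zip(fields, combo))) for combo in combos]


cases = expand_template(
    "I {NEGATION} {POS_VERB} the {THING}.",
    NEGATION=["can't", "didn't"],
    POS_VERB=["love", "like"],
    THING=["food", "flight", "service"],
)
print(len(cases))  # 2 * 2 * 3 = 12
print(cases[0])    # I can't love the food.
```

The number of cases grows multiplicatively with each lexicon, so a few short word lists already yield a sizeable test set.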

# Sentiment Analysis Results

![Tests for Sentiment Analysis](images/table1.png)

# Quora Question Pair

![Quora Question Pair](images/table2.png)

# Machine Comprehension

![Machine Comprehension](images/table3.png)

# User Studies

* A kind of clinical study to show that their approach works
* Two scenarios:
    - Commercial system
    - User study: MFT for Newbies

## Commercial System
* **Goal**: Compare to a well-established baseline: Commercial Microsoft Sentiment Analysis Model
    - Already battletested (bugs found/fixed...)
    - More comprehensive evaluation than research systems
* **Methodology**
    - Users invited to 5 hour session
    - Checklist methodology explained first
    - Use the methodology to write their own tests (with help from Checklist team to reduce learning curve)
    - Brainstorm session: 30 tests: 20 MFT, 5 INV, 5 DIR
    - Implemented 20 of them (66%)
* **Findings**
    - Checklist team:
        - Large overlap in tests:
            - "Students" end up implementing most of the tests the Checklist team implemented in advance (Maybe induced unconscious bias???)
            - But also new capabilities
    - MS team:
        - They tested new capabilities (This may be due to just brainstorming with new ppl)
        - They tested capabilities they considered but not in their benchmarks (Due to lack of time maybe???)
        - They tested things they already covered, but much more thoroughly/systematically

## User Study: MFT
* **Goal**: Users with NO experience can gain insights and find bugs
* **Methodology**
    - 18 participants: 8 industry/10 academia
    - Intermediate NLP experience
    - BERT fine-tuned on QQP. Access to validation set to create tests.
    - Write only MFT (just a subset of Checklist)
    - Users divided into 3 categories:
        1. Unaided (**Control group**) -> No instructions
        2. Cap only -> Short descriptions of capabilities in 
        3. Cap with templ -> 2. + template and fill-in tools
* **Findings**
    - Authors claim users write many more tests
    - Users tested more capabilities with cap only and cap + templ
        - They tested all the capabilities in Table 2 + extra new capabilities (Not exactly true: 10/7.8 (GT/Tested))
        - Unaided and Cap only users couldn't find more bugs because they lacked test-case variety
    - Subjective evaluation of severity of bugs found... (well... ok...)
        - They claim that the severity of bugs discovered by Unaided ppl is lower (to me, this is due to the fact that they tested fewer capabilities, and they probably started with the easiest ones)

# User Study Results

![User Study](images/table4.png)

# Take Aways

💡 Ensuring that models reach *benchmark accuracy* **is not enough to evaluate model quality in NLP**

💡 They use SWE techniques to illustrate the "bad" quality of models that have passed the existing benchmarks in 3 different tasks

💡 They claim their methodology and tools are easy to follow/use

💡 Utility is guaranteed

 - Found errors in battle-tested public commercial models
 - Shown how users (both experts and newcomers) can benefit from the framework almost immediately
     
💡 Goal: Improve quality of current NLP model evaluations, so...

 - The tool and everything described in the paper is open sourced: https://github.com/marcotcr/checklist
 - They expect the community to grow by sharing experiences through new test suites and capabilities

# Reflections/Open Questions

- I think this is a pretty interesting paper:

  - When we focus on a single measure and make it the goal, we distort the measure (this is called Campbell's law in education. Originally: "But when test scores become the goal of the teaching process, they both lose their value as indicators of educational status and distort the educational process in undesirable ways")
  - This methodology helps to alleviate/avoid that


- As happened with TDD/BDD in SWE, in most companies/environments probably:

  1. Ppl will be reluctant to apply this as they think they already know the underlying problem (Dunning-Kruger effect?)
  2. Extra burden that will be avoided by managers (again D-K effect)


- Who would be responsible for implementing this in real projects?
  - Data Scientists???
  - Project Managers???
  - Engineers???
  - Donald The Duck???
  - Rene Descartes???
  - My 96-year-old aunt???
  
  - *PC Answer*: "All of them, of course!" -> **Reality**: None of them will do it


- Probably the stakeholders involved will continue living in their "island of happiness"
  - Embarrassing questions can come out of these tests, so...
  - "If you don't do those tests, you don't have a problem"

# PDF Paper

In [2]:
from IPython.display import IFrame
IFrame("./2005.04118.pdf", width=1500, height=1200)