In [1]:
from traitlets.config.manager import BaseJSONConfigManager
from pathlib import Path
path = Path.home() / ".jupyter" / "nbconfig"
cm = BaseJSONConfigManager(config_dir=str(path))
cm.update(
    "rise",
    {
        "theme": None,
        "transition": None,
        "start_slideshow_at": "selected",
        "leap_motion": {
            "naturalSwipe"  : True,     # Invert swipe gestures
            "pointerOpacity": 0.5,      # Set pointer opacity to 0.5
            "pointerColor"  : "#d80000" # Red pointer
        },
        "header": "<h3>Francisco Perez-Sorrosal</h3>",
        "footer": "<h3>Deep Learning Reading Group</h3>",
        "scroll": True,
        "enable_chalkboard": True
     }
)

{'start_slideshow_at': 'selected',
 'leap_motion': {'naturalSwipe': True,
  'pointerOpacity': 0.5,
  'pointerColor': '#d80000'},
 'header': '<h3>Francisco Perez-Sorrosal</h3>',
 'footer': '<h3>Deep Learning Reading Group</h3>',
 'scroll': True,
 'enable_chalkboard': True}

In [None]:
%pip install emoji --upgrade

In [None]:
import emoji
print(emoji.emojize('Presenting stuff is easy!!! :thumbs_up:'))

In [None]:
# Emojis http://getemoji.com/

# Beyond Accuracy: Behavioral Testing of NLP models with CheckList

## Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, Sameer Singh

### [https://arxiv.org/abs/2005.04118](https://arxiv.org/abs/2005.04118)

### [pdf](https://arxiv.org/pdf/2005.04118.pdf)

---

Francisco Perez-Sorrosal | 7 Sep 2020


# Summary (What do they sell in the paper)

**Best paper award at ACL 2020**

Presents **CheckList**, a _methodology and a tool_ for testing NLP models

💡 Based on behavioral testing (software engineering)

💡 Composed of:

1. matrix of general linguistic capabilities and test types (to ease comprehensive test ideation)

2. tool to generate test cases

💡 Illustrates the methodology on 3 use cases

# (General) Current Evaluation Scenario for Models

💡 Standard paradigm for evaluation:

 - Use train-validation-test splits to estimate the model's accuracy (or any other metric)
    
💡 While this is useful, evaluation datasets:
    
 - are often not comprehensive
 - contain the same biases as the training data

💡 So, by summarizing the model performance as a single aggregate statistic:
 
 - Performance on real-world data can be overestimated
 - It is difficult to figure out where the model is failing
 - and, what is more, how to **fix it**
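A minimal sketch of why a single aggregate number can mislead. The records and the capability tags (`plain`, `negation`) are made up for illustration; they are not data from the paper.

```python
from collections import defaultdict

# Hypothetical evaluation records: (capability tag, gold label, predicted label).
records = [
    ("plain", "pos", "pos"),
    ("plain", "neg", "neg"),
    ("plain", "pos", "pos"),
    ("plain", "neg", "neg"),
    ("plain", "pos", "pos"),
    ("plain", "neg", "neg"),
    ("plain", "pos", "pos"),
    ("plain", "neg", "neg"),
    ("negation", "neg", "pos"),
    ("negation", "pos", "neg"),
]


def accuracy(rows):
    """Fraction of rows whose prediction matches the gold label."""
    return sum(gold == pred for _, gold, pred in rows) / len(rows)


def accuracy_by_slice(rows):
    """Accuracy computed separately for each capability tag."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    return {name: accuracy(group) for name, group in groups.items()}


overall = accuracy(records)
per_slice = accuracy_by_slice(records)
print("overall:", overall)
print("per slice:", per_slice)
```

The overall score looks acceptable, while the per-slice view shows that every negation example is wrong.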


## Authors' Claims and Findings

- Commercial and research models claim human-level performance

### However, CheckList:
- Reveals a variety of severe bugs, tied to specific linguistic phenomena, that went undetected in commercial and research models
- Examples:

  1. negation
  2. named entities
  3. coreferences
  4. semantic role labeling
  5. et cetera
  
- With CheckList, model developers create twice as many tests as they did without it

# Behavioral Testing

💡 Behavior Driven Development (BDD) grew out of Test Driven Development (TDD).

💡 __GOAL__: Decouple tests from implementation

💡 BDD relies on human-readable descriptions of SW requirements to assess software.

 - Defines use cases as stories in an English-like DSL
 - **Story**: Transfers change balances
   1. As a bank user
   2. When I send money from one of my accounts to another
   3. I want the account balances to update

### Example

 - Assuming I have `$100` in my balance of account A <p> And the balance of my account B is `$10`
    
 - If I transfer `$50` from my Account A to my account B
    
 - Then, in the end, I should have `$50` in the balance of my account A <p> And `$60` in the balance of my account B
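The scenario above can be sketched as a plain Python test. `Account` and `transfer` are hypothetical names invented for this illustration; the point is that the test only checks observable behavior (the balances), not how the transfer is implemented.

```python
class Account:
    """Hypothetical account holding a balance in dollars."""

    def __init__(self, balance):
        self.balance = balance


def transfer(source, target, amount):
    """Move `amount` from `source` to `target`, rejecting overdrafts."""
    if amount > source.balance:
        raise ValueError("insufficient funds")
    source.balance -= amount
    target.balance += amount


def test_transfer_updates_both_balances():
    # Given: $100 in account A and $10 in account B
    account_a = Account(100)
    account_b = Account(10)

    # When: $50 is transferred from A to B
    transfer(account_a, account_b, 50)

    # Then: A holds $50 and B holds $60
    assert account_a.balance == 50
    assert account_b.balance == 60


test_transfer_updates_both_balances()
print("scenario passed")
```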

# Scenarios They Choose to Demonstrate Applicability

- Sentiment Analysis

- Duplicate Question Detection

- Machine Comprehension



__In all of them the model is treated as a _Black-Box_, which allows comparison of different model implementations, models trained on different data, etc.__


## Real-world Field Studies Description

### It looks like they ran a kind of clinical study here

1. Sentiment Analysis Model
2. Other NLP practitioners


# Example

### Capability Matrix

- Potential tests structured as a matrix
    - Rows -> capabilities
    - Cols -> test types

![Capability Matrix](images/cap_matrix.png)

The matrix works as a guide, prompting users to test each capability with different test types.


# Example: Model’s Negation capability test

![Capability Matrix](images/cap_matrix.png)

---

- Minimum Functionality test (MFT), i.e. simple test cases designed to target a specific behavior
  1. Generate simple examples filling in a template (“I {NEGATION} {POS_VERB} the {THING}.”) with pre-built lexicons
  2. Compute the model’s failure rate on such examples
![Test Case A](images/test_case_a.png)
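A rough sketch of the two MFT steps in plain Python; it does not use the CheckList tool itself. The lexicons are small made-up examples, and `toy_model` is a hypothetical keyword-based stand-in for a real sentiment model.

```python
from itertools import product

TEMPLATE = "I {NEGATION} {POS_VERB} the {THING}."

# Illustrative lexicons (assumptions, not the ones shipped with the tool).
LEXICONS = {
    "NEGATION": ["don't", "didn't"],
    "POS_VERB": ["love", "like", "enjoy"],
    "THING": ["food", "flight", "service"],
}


def fill_template(template, lexicons):
    """Yield one sentence per combination of lexicon entries."""
    keys = list(lexicons)
    for values in product(*(lexicons[key] for key in keys)):
        yield template.format(**dict(zip(keys, values)))


def toy_model(text):
    """Hypothetical model: predicts 'positive' on any positive verb, ignoring negation."""
    positive_words = {"love", "like", "enjoy"}
    tokens = text.lower().rstrip(".").split()
    return "positive" if any(tok in positive_words for tok in tokens) else "negative"


def failure_rate(model, cases, expected_label):
    """Share of test cases where the model does not return the expected label."""
    failures = sum(model(case) != expected_label for case in cases)
    return failures / len(cases)


cases = list(fill_template(TEMPLATE, LEXICONS))
rate = failure_rate(toy_model, cases, expected_label="negative")
print(len(cases), "cases, failure rate:", rate)
```

Because the toy model ignores negation, it fails every generated case, which is exactly the kind of bug an MFT is meant to surface.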

# Example: Named entity recognition (NER) 

![Capability Matrix](images/cap_matrix.png)

---

 - Tested with an Invariance test (INV) 
   1. perturbations that should not change the output of the model
![Test Case B](images/test_case_b.png)

 - Changing location should not change sentiment
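A minimal sketch of an invariance check, again in plain Python rather than with the CheckList tool. The location list, the sentences, and the deliberately biased `toy_model` are all assumptions made for illustration.

```python
LOCATIONS = ["Chicago", "Dallas", "Brazil", "Turkey"]


def perturb_location(text, locations):
    """Return copies of `text` with each mentioned location swapped for another one."""
    variants = []
    for loc in locations:
        if loc in text:
            for other in locations:
                if other != loc:
                    variants.append(text.replace(loc, other))
    return variants


def toy_model(text):
    """Hypothetical model with a spurious association for one location."""
    if "Turkey" in text:
        return "negative"
    return "positive" if "great" in text else "neutral"


def invariance_violations(model, sentences, locations):
    """Collect (original, variant) pairs where the prediction changes."""
    violations = []
    for sentence in sentences:
        original_pred = model(sentence)
        for variant in perturb_location(sentence, locations):
            if model(variant) != original_pred:
                violations.append((sentence, variant))
    return violations


sentences = [
    "The flight to Chicago was great.",
    "I am flying to Dallas tomorrow.",
]
violations = invariance_violations(toy_model, sentences, LOCATIONS)
for original, variant in violations:
    print(original, "->", variant)
```

No expected label is needed here: the test only requires that the prediction stays the same after the perturbation.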

# Example: Vocabulary

![Capability Matrix](images/cap_matrix.png)

---

 - Use a Directional Expectation test (DIR) 
   - perturbations to the input with known expected results 

![Test Case C](images/test_case_c.png)

 - Add negative phrases and check sentiment does not become more positive
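A minimal sketch of a directional expectation check in plain Python. The negative phrases, the tolerance value, and the word-counting `toy_score` function are hypothetical; a real test would call the model under evaluation instead.

```python
import re

NEGATIVE_PHRASES = ["You are lame.", "I abhor you.", "This is not good at all."]


def toy_score(text):
    """Hypothetical positive-sentiment score in [0, 1] based on word counts."""
    positive_words = {"great", "good", "love"}
    negative_words = {"lame", "awful"}
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(tok in positive_words for tok in tokens)
    neg = sum(tok in negative_words for tok in tokens)
    return min(1.0, max(0.0, 0.5 + 0.1 * (pos - neg)))


def directional_failures(score_fn, sentences, phrases, tolerance=0.05):
    """Return perturbed inputs whose positive score rose by more than `tolerance`."""
    failures = []
    for sentence in sentences:
        base = score_fn(sentence)
        for phrase in phrases:
            perturbed = f"{sentence} {phrase}"
            if score_fn(perturbed) > base + tolerance:
                failures.append(perturbed)
    return failures


sentences = ["The flight was great.", "The service was fine."]
failures = directional_failures(toy_score, sentences, NEGATIVE_PHRASES)
for text in failures:
    print("became more positive:", text)
```

The toy scorer treats "not good" as positive, so appending that phrase raises the score and the check flags it.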

# Tool Description

- Users have to identify the __language capabilities__ of the tasks at hand
- then create tests to evaluate the model

## Capabilities



In [11]:
from IPython.display import IFrame
IFrame("./2005.04118.pdf", width=1500, height=1200)