# Robustness Checking
This tutorial walks through how to use `DaCy`/`SpaCy` augmenters to evaluate the robustness of any NLP pipeline. As an example, we start by evaluating SpaCy small and DaCy small on the test set of [DaNE](https://github.com/alexandrainst/danlp/blob/master/docs/docs/datasets.md#dane). DaNE is the Danish Dependency Treebank annotated with part-of-speech tags, dependency relations, and named entities. Lastly, we show how to use this framework on any other type of model, using [DaNLP's BERT](https://github.com/alexandrainst/danlp/blob/master/docs/docs/tasks/ner.md#-bert-bert) as an example.

Let us start off by installing the required packages and loading the models and dataset we wish to test on.


## Installing packages

In [12]:
#!pip install dacy
#!python -m spacy download da_core_news_sm


## Loading models and data

In [14]:
import spacy
import dacy

from dacy.datasets import dane

# load the DaNE test set
test = dane(splits=["test"])

# load models
spacy_small = spacy.load("da_core_news_sm")
dacy_small = dacy.load("small")

## Estimating performance
Evaluating models that are already in the `SpaCy` framework is straightforward. Call the `score` function on your NLP pipeline and choose which metrics you want to compute. `score` is a wrapper around `spacy.scorer.Scorer` that outputs a nicely formatted dataframe. By default, `score` calculates performance for NER, POS, tokenization, and dependency parsing; this can be changed with the `score_fn` argument.

In [15]:
from dacy.score import score

spacy_baseline = score(test, apply_fn=spacy_small, score_fn=["ents", "pos"])
dacy_baseline = score(test, apply_fn=dacy_small, score_fn=["ents", "pos"])
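
The `ents_p`, `ents_r`, and `ents_f` columns in the resulting dataframes are precision, recall, and F1 for named entities. As a rough illustration of what these numbers mean, the sketch below computes them by exact match on `(start, end, label)` spans. The function `span_prf` and the toy spans are hypothetical and only for illustration; they are not part of DaCy or SpaCy, whose scorer handles more cases (for example per-type scores).

```python
from typing import Dict, Set, Tuple

# (start token index, end token index, entity label)
EntSpan = Tuple[int, int, str]


def span_prf(gold: Set[EntSpan], pred: Set[EntSpan]) -> Dict[str, float]:
    """Exact-match precision, recall and F1 over entity spans (illustrative only)."""
    tp = len(gold & pred)  # spans with correct boundaries and correct label
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"ents_p": p, "ents_r": r, "ents_f": f}


# toy data: one of two predictions matches one of three gold entities
gold = {(0, 2, "PER"), (5, 6, "LOC"), (8, 10, "ORG")}
pred = {(0, 2, "PER"), (5, 6, "ORG")}
scores = span_prf(gold, pred)
print(scores)
```

Note that a span with the right boundaries but the wrong label, such as `(5, 6, "ORG")` here, counts as both a false positive and a missed gold entity.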

In [16]:
spacy_baseline

Unnamed: 0,wall_time,ents_p,ents_r,ents_f,ents_per_type_MISC_p,ents_per_type_MISC_r,ents_per_type_MISC_f,ents_per_type_PER_p,ents_per_type_PER_r,ents_per_type_PER_f,...,ents_per_type_LOC_f,ents_per_type_ORG_p,ents_per_type_ORG_r,ents_per_type_ORG_f,ents_excl_MISC_ents_p,ents_excl_MISC_ents_r,ents_excl_MISC_ents_f,pos_acc,tag_acc,k
0,2.503153,0.715746,0.62724,0.668577,0.660377,0.578512,0.61674,0.795699,0.822222,0.808743,...,0.679426,0.72619,0.378882,0.497959,0.73107,0.640732,0.682927,0.948357,0.948357,0


In [17]:
dacy_baseline

Unnamed: 0,wall_time,ents_p,ents_r,ents_f,ents_per_type_MISC_p,ents_per_type_MISC_r,ents_per_type_MISC_f,ents_per_type_PER_p,ents_per_type_PER_r,ents_per_type_PER_f,...,ents_per_type_LOC_f,ents_per_type_ORG_p,ents_per_type_ORG_r,ents_per_type_ORG_f,ents_excl_MISC_ents_p,ents_excl_MISC_ents_r,ents_excl_MISC_ents_f,pos_acc,tag_acc,k
0,9.524686,0.752768,0.731183,0.741818,0.621212,0.677686,0.648221,0.865591,0.894444,0.879781,...,0.792271,0.734513,0.515528,0.605839,0.795122,0.745995,0.769776,0.0,0.977471,0


### Estimating robustness and biases
To obtain performance estimates on augmented data, simply provide a list of augmenters as the `augmenters` argument. 

In [18]:
from dacy.augmenters import create_pers_augmenter
from dacy.datasets import female_names
from spacy.training.augment import create_lower_casing_augmenter

In [19]:
lower_aug = create_lower_casing_augmenter(level=1)
female_name_dict = female_names()
# Augmenter that replaces person names with random Danish female names (keep_name=False).
# The original name is not forced to match the size of the pattern (force_pattern_size=False);
# the replacement follows one of the two defined patterns.
female_aug = create_pers_augmenter(
    female_name_dict,
    patterns=["fn,ln", "abbpunct,ln"],
    force_pattern_size=False,
    keep_name=False,
)

spacy_aug = score(
    test,
    apply_fn=spacy_small,
    score_fn=["ents", "pos"],
    augmenters=[lower_aug, female_aug],
)
dacy_aug = score(
    test,
    apply_fn=dacy_small,
    score_fn=["ents", "pos"],
    augmenters=[lower_aug, female_aug],
)

In [20]:
import pandas as pd

pd.concat([spacy_baseline, spacy_aug])

Unnamed: 0,wall_time,ents_p,ents_r,ents_f,ents_per_type_MISC_p,ents_per_type_MISC_r,ents_per_type_MISC_f,ents_per_type_PER_p,ents_per_type_PER_r,ents_per_type_PER_f,...,ents_per_type_LOC_f,ents_per_type_ORG_p,ents_per_type_ORG_r,ents_per_type_ORG_f,ents_excl_MISC_ents_p,ents_excl_MISC_ents_r,ents_excl_MISC_ents_f,pos_acc,tag_acc,k
0,2.503153,0.715746,0.62724,0.668577,0.660377,0.578512,0.61674,0.795699,0.822222,0.808743,...,0.679426,0.72619,0.378882,0.497959,0.73107,0.640732,0.682927,0.948357,0.948357,0
0,2.603184,0.708738,0.261649,0.382199,0.757143,0.438017,0.554974,0.698413,0.244444,0.36214,...,0.462585,0.681818,0.093168,0.163934,0.683824,0.212815,0.324607,0.923838,0.923838,0
0,2.329518,0.689507,0.577061,0.628293,0.683168,0.570248,0.621622,0.753086,0.677778,0.71345,...,0.669856,0.67033,0.378882,0.484127,0.691257,0.578947,0.630137,0.9469,0.9476,0


In [21]:
pd.concat([dacy_baseline, dacy_aug])

Unnamed: 0,wall_time,ents_p,ents_r,ents_f,ents_per_type_MISC_p,ents_per_type_MISC_r,ents_per_type_MISC_f,ents_per_type_PER_p,ents_per_type_PER_r,ents_per_type_PER_f,...,ents_per_type_LOC_f,ents_per_type_ORG_p,ents_per_type_ORG_r,ents_per_type_ORG_f,ents_excl_MISC_ents_p,ents_excl_MISC_ents_r,ents_excl_MISC_ents_f,pos_acc,tag_acc,k
0,9.524686,0.752768,0.731183,0.741818,0.621212,0.677686,0.648221,0.865591,0.894444,0.879781,...,0.792271,0.734513,0.515528,0.605839,0.795122,0.745995,0.769776,0.0,0.977471,0
0,8.417889,0.628809,0.40681,0.494015,0.51145,0.553719,0.531746,0.68,0.472222,0.557377,...,0.685393,0.608696,0.086957,0.152174,0.695652,0.366133,0.47976,0.0,0.954046,0
0,7.461947,0.724395,0.697133,0.710502,0.630769,0.677686,0.653386,0.830409,0.788889,0.809117,...,0.796117,0.65873,0.515528,0.578397,0.7543,0.702517,0.727488,0.0,0.977218,0


In the second row, we see that `SpaCy small` is very vulnerable to lower casing: NER F1 drops from 0.67 to 0.38, driven by recall falling from 0.63 to 0.26. `DaCy small` is slightly more robust to lower casing, but still suffers (F1 drops from 0.74 to 0.49). Changing names (third row) also leads to a drop in performance for both models.
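
To compare models with different baselines, it can help to look at the relative drop rather than the absolute scores. The following minimal sketch uses the `ents_f` values copied from the tables above (baseline row and lower-casing row); the helper `relative_drop` is a hypothetical name introduced here for illustration.

```python
# ents_f values copied from the tables above
baseline_f1 = {"SpaCy small": 0.668577, "DaCy small": 0.741818}
lowercased_f1 = {"SpaCy small": 0.382199, "DaCy small": 0.494015}


def relative_drop(before: float, after: float) -> float:
    """Fraction of the baseline score that is lost after augmentation."""
    return (before - after) / before


drops = {
    model: relative_drop(baseline_f1[model], lowercased_f1[model])
    for model in baseline_f1
}
for model, drop in drops.items():
    print(f"{model}: {drop:.1%} relative drop in NER F1")
```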

To better estimate the effect of stochastic augmenters, such as those changing names or adding keystroke errors, we can use the `k` argument in `score` to run the augmenter multiple times.

In [22]:
from dacy.augmenters import create_keyboard_augmenter

key_05_aug = create_keyboard_augmenter(
    doc_level=1, char_level=0.05, keyboard="QWERTY_DA"
)

spacy_key = score(
    test, apply_fn=spacy_small, score_fn=["ents", "pos"], augmenters=[key_05_aug], k=5
)
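
To see why repeated runs give different results, consider a toy version of a keystroke augmenter. The sketch below is not DaCy's implementation: the neighbour map is a hypothetical, heavily truncated stand-in for a real keyboard layout, and `keystroke_noise` is a name made up for this example. It only illustrates the idea of replacing each character with a neighbouring key with probability `char_level`.

```python
import random
from typing import Dict, List

# Hypothetical, truncated neighbour map; a real keyboard layout covers all keys.
NEIGHBOURS: Dict[str, List[str]] = {
    "a": ["q", "s", "z"],
    "e": ["w", "r", "d"],
    "n": ["b", "m", "h"],
    "r": ["e", "t", "f"],
    "s": ["a", "d", "w"],
}


def keystroke_noise(text: str, char_level: float, rng: random.Random) -> str:
    """Replace each mapped character by a neighbouring key with probability char_level."""
    out = []
    for ch in text:
        if ch in NEIGHBOURS and rng.random() < char_level:
            out.append(rng.choice(NEIGHBOURS[ch]))
        else:
            out.append(ch)
    return "".join(out)


rng = random.Random(0)  # seeded only to make the example reproducible
sentence = "Hans Christian Andersen blev født i Odense"
noisy_runs = [keystroke_noise(sentence, char_level=0.05, rng=rng) for _ in range(5)]
for run in noisy_runs:
    print(run)
```

Because each run samples different characters to corrupt, a single evaluation is a noisy estimate, which is what running the augmenter `k` times is meant to address.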

In [23]:
spacy_key

Unnamed: 0,wall_time,ents_p,ents_r,ents_f,ents_per_type_MISC_p,ents_per_type_MISC_r,ents_per_type_MISC_f,ents_per_type_PER_p,ents_per_type_PER_r,ents_per_type_PER_f,...,ents_per_type_LOC_f,ents_per_type_ORG_p,ents_per_type_ORG_r,ents_per_type_ORG_f,ents_excl_MISC_ents_p,ents_excl_MISC_ents_r,ents_excl_MISC_ents_f,pos_acc,tag_acc,k
0,2.139297,0.610879,0.523297,0.563707,0.611111,0.454545,0.521327,0.66,0.733333,0.694737,...,0.565657,0.569767,0.304348,0.396761,0.610825,0.542334,0.574545,0.852554,0.852554,0
1,2.016432,0.596195,0.505376,0.547042,0.523256,0.371901,0.434783,0.662983,0.666667,0.66482,...,0.631579,0.548387,0.31677,0.401575,0.612403,0.542334,0.575243,0.843463,0.843463,1
2,2.024655,0.589041,0.539427,0.563143,0.519231,0.446281,0.48,0.682292,0.727778,0.704301,...,0.580952,0.544554,0.341615,0.419847,0.60688,0.565217,0.585308,0.852157,0.852157,2
3,1.888028,0.613821,0.541219,0.575238,0.598039,0.504132,0.547085,0.664921,0.705556,0.684636,...,0.59,0.578947,0.341615,0.429687,0.617949,0.551487,0.58283,0.848536,0.848536,3
4,1.842343,0.579655,0.541219,0.559778,0.513761,0.46281,0.486957,0.647059,0.733333,0.6875,...,0.583732,0.557895,0.329193,0.414062,0.597087,0.562929,0.579505,0.840738,0.840738,4
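
With one row per run, the results can be summarised as a mean and a standard deviation. The minimal sketch below uses the five `ents_f` values copied from the table above and only the standard library; if `score` returns a pandas dataframe as described earlier, `spacy_key["ents_f"].mean()` and `spacy_key["ents_f"].std()` should give the same summary.

```python
from statistics import mean, stdev

# ents_f for the five runs (k = 0..4), copied from the table above
ents_f_runs = [0.563707, 0.547042, 0.563143, 0.575238, 0.559778]

avg = mean(ents_f_runs)
sd = stdev(ents_f_runs)  # sample standard deviation across runs
print(f"ents_f over {len(ents_f_runs)} runs: {avg:.3f} +/- {sd:.3f}")
```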


In this manner, evaluating performance on augmented data for SpaCy pipelines is as easy as defining the augmenters and calling a single function. In the `dacy_paper_replication.py` script you can find the exact script used to evaluate the robustness of Danish NLP models in the [DaCy paper]().

## Evaluating custom models
Evaluating models outside the `SpaCy` framework requires the user to write an `apply_fn` that takes an iterable of SpaCy `Example`s as input, applies the model to each of them, and returns a list of `Example`s.

The following shows how to write one for DaNLP's BERT named entity recognition model. The helper `add_iob` adds the predicted entities to the predicted `Doc`.

In [24]:
# !pip install danlp==0.0.11
# !pip install gensim==3.8.3



In [28]:
from danlp.models import load_bert_ner_model

from typing import List, Iterable
from spacy.lang.da import Danish
from spacy.tokens import Doc, Span
from spacy.training import Example

# load model
bert_model = load_bert_ner_model()

# instantiate empty Danish Spacy NLP pipeline for tokenization
nlp_da = Danish()


def apply_bert_model(examples: Iterable[Example]) -> List[Example]:
    e = []
    for example in examples:
        doc = nlp_da(example.reference.text)  # tokenize using SpaCy
        tokens, labels = bert_model.predict([t.text for t in doc])
        doc = add_iob(doc, labels)
        e.append(Example(doc, example.reference))
    return e


def add_iob(doc: Doc, iob: List[str]) -> Doc:
    """A helper function for adding IOB tags to a Doc.

    Args:
        doc (Doc): A SpaCy Doc
        iob (List[str]): A list of IOB tags (e.g. "B-PER", "I-PER", "O"), one per token

    Returns:
        Doc: The Doc with its entities set from the IOB tags
    """
    ent = []
    for i, label in enumerate(iob):

        # turn IOB labels into spans
        if label == "O":
            continue
        iob_, ent_type = label.split("-", 1)
        if (i - 1 >= 0 and iob_ == "I" and iob[i - 1] == "O") or (
            i == 0 and iob_ == "I"
        ):
            iob_ = "B"
        if iob_ == "B":
            start = i
        if i + 1 >= len(iob) or iob[i + 1].split("-")[0] != "I":
            ent.append(Span(doc, start, i + 1, label=ent_type))
    doc.set_ents(ent)
    return doc
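
To make the tag-to-span logic easier to follow without a `Doc`, here is a minimal pure-Python sketch of the same idea that returns `(start, end, label)` tuples. The function `iob_to_spans` is a hypothetical name for this example. Unlike the helper above, it also starts a new span when the entity type changes between consecutive tags; that behaviour is an assumption made for this sketch.

```python
from typing import List, Optional, Tuple


def iob_to_spans(iob: List[str]) -> List[Tuple[int, int, str]]:
    """Convert per-token IOB tags to (start, end, label) spans, end exclusive."""
    spans: List[Tuple[int, int, str]] = []
    start: Optional[int] = None
    ent_type: Optional[str] = None
    for i, label in enumerate(iob):
        if label == "O":
            # close the currently open entity, if any
            if start is not None:
                spans.append((start, i, ent_type))
            start, ent_type = None, None
            continue
        prefix, label_type = label.split("-", 1)
        # a B tag, an I tag with nothing open, or a type change starts a new entity
        if prefix == "B" or start is None or label_type != ent_type:
            if start is not None:
                spans.append((start, i, ent_type))
            start, ent_type = i, label_type
    # close an entity that runs to the end of the sequence
    if start is not None:
        spans.append((start, len(iob), ent_type))
    return spans


labels = ["B-PER", "I-PER", "O", "I-LOC", "O", "B-ORG", "B-ORG"]
spans = iob_to_spans(labels)
print(spans)
```

The stray `I-LOC` after an `O` is treated as the beginning of an entity, mirroring how `add_iob` rewrites such tags to `B`.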

In [29]:
danlp_bert = score(test, apply_fn=apply_bert_model, score_fn=["ents"])

In [30]:
danlp_bert

Unnamed: 0,wall_time,ents_p,ents_r,ents_f,ents_per_type_MISC_p,ents_per_type_MISC_r,ents_per_type_MISC_f,ents_per_type_PER_p,ents_per_type_PER_r,ents_per_type_PER_f,ents_per_type_LOC_p,ents_per_type_LOC_r,ents_per_type_LOC_f,ents_per_type_ORG_p,ents_per_type_ORG_r,ents_per_type_ORG_f,ents_excl_MISC_ents_p,ents_excl_MISC_ents_r,ents_excl_MISC_ents_f,k
0,42.290364,0.855072,0.634409,0.728395,0.0,0.0,0.0,0.917582,0.927778,0.922652,0.788991,0.895833,0.839024,0.821138,0.627329,0.711268,0.855072,0.810069,0.831962,0


If you are in doubt about how to create an apply function for your model, you can find more inspiration in [`papers/DaCy../apply_fns`](https://github.com/centre-for-humanities-computing/DaCy/tree/main/papers/DaCy-A-Unified-Framework-for-Danish-NLP/apply_fns). This folder contains apply functions for DaNLP's BERT, Flair, NERDA, and Polyglot. Otherwise, feel free to open an issue on GitHub.