# Robustness Checking
This tutorial walks through how to use `DaCy`/`SpaCy` augmenters to evaluate the robustness of any NLP model. We'll start by evaluating SpaCy small and DaCy small on the test set of DaNE, then show how to use the same framework on any other type of model, using DaNLP's BERT as an example.



In [None]:
import spacy
import dacy
import pandas as pd

from spacy.training import Corpus, Example
from spacy.tokens import Doc, Span

from dacy.score import score
from dacy.datasets import dane

In [None]:
test = dane(splits=["test"])
spacy_small = spacy.load("da_core_news_sm")
# the short names "small", "medium" and "large" can be used instead of the full model name da_dacy_SIZE_tft-0.0.0
dacy_small = dacy.load("small")

Evaluating models already in the `SpaCy` framework is straightforward. Call the `score` function on your nlp pipeline and choose which metrics you want to calculate. `score` is a wrapper around `spacy.scorer.Scorer` that outputs a nicely formatted dataframe. By default, `score` calculates performance for NER, POS, tokenization, and dependency parsing; this can be changed with the `score_fn` argument.

In [None]:
spacy_baseline = score(test, apply_fn=spacy_small, score_fn=["ents", "pos"])
dacy_baseline = score(test, apply_fn=dacy_small, score_fn=["ents", "pos"])



In [None]:
spacy_baseline

In [None]:
dacy_baseline

To obtain performance estimates on augmented data, simply provide the augmenter(s) in the `augmenters` argument.

In [None]:
from dacy.augmenters import create_pers_augmenter
from dacy.datasets import male_names
from spacy.training.augment import create_lower_casing_augmenter

In [None]:
lower_aug = create_lower_casing_augmenter(level=1)
male_name_dict = male_names()
# Augmenter that replaces names with random Danish male names (keep_name=False).
# The length of each name is kept as is (force_pattern_size=False), while the
# name itself is reformatted to follow one of the two defined patterns.
male_aug = create_pers_augmenter(male_name_dict, 
                                 patterns=["fn,ln","abbpunct,ln"], 
                                 force_pattern_size=False,
                                 keep_name=False)

spacy_aug = score(test, 
                  apply_fn=spacy_small,
                  score_fn=["ents", "pos"],
                  augmenters=[lower_aug, male_aug])
dacy_aug = score(test,
                 apply_fn=dacy_small,
                 score_fn=["ents", "pos"],
                 augmenters=[lower_aug, male_aug])


In [None]:
pd.concat([spacy_baseline, spacy_aug])

In [None]:
pd.concat([dacy_baseline, dacy_aug])

In the second row, we see that `SpaCy small` is very vulnerable to lower casing: NER recall drops from 0.63 to 0.09. `DaCy small` is slightly more robust to lower casing, but still suffers. Changing names also leads to a drop in performance for both models.
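
To put a drop like this on a common scale across models and metrics, it can help to look at the relative drop as well as the absolute one. The following is a minimal sketch in plain Python; `relative_drop` is a hypothetical helper written for this illustration and is not part of DaCy. The two numbers are the SpaCy small NER recall values quoted above.

```python
def relative_drop(baseline: float, augmented: float) -> float:
    """Return the fraction of the baseline score lost after augmentation."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - augmented) / baseline


# NER recall for SpaCy small quoted above: 0.63 at baseline, 0.09 when lower-cased.
drop = relative_drop(0.63, 0.09)
print(f"absolute drop: {0.63 - 0.09:.2f}, relative drop: {drop:.1%}")
# prints "absolute drop: 0.54, relative drop: 85.7%"
```

In other words, lower casing removes roughly 86% of the recall the model had on the original text.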

To better estimate the effect of stochastic augmenters, such as those changing names or adding keystroke errors, we can use the `k` argument in `score` to run the augmenter multiple times.

In [None]:
from dacy.augmenters import create_keyboard_augmenter

key_05_aug = create_keyboard_augmenter(doc_level=1, char_level=0.05, keyboard="QWERTY_DA")

spacy_key = score(test, 
                  apply_fn=spacy_small,
                  score_fn=["ents", "pos"],
                  augmenters=[key_05_aug],
                  k=5)

In [None]:
spacy_key
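
Why repeat a stochastic augmenter at all? A single run gives one random draw, so the score you see partly reflects luck. The toy sketch below illustrates the idea in plain Python with no DaCy involved. `keystroke_noise`, `toy_accuracy`, and `repeated_scores` are hypothetical helpers made up for this illustration, the noise model is much cruder than a keyboard-aware augmenter, and how `score` itself reports the `k` runs is not shown here.

```python
import random
import statistics
from typing import List


def keystroke_noise(text: str, char_level: float, rng: random.Random) -> str:
    """Replace each letter by a random lowercase letter with probability char_level."""
    alphabet = "abcdefghijklmnopqrstuvwxyzæøå"
    return "".join(
        rng.choice(alphabet) if ch.isalpha() and rng.random() < char_level else ch
        for ch in text
    )


def toy_accuracy(original: str, noisy: str) -> float:
    """Share of whitespace-separated tokens that are unchanged after augmentation."""
    pairs = list(zip(original.split(), noisy.split()))
    return sum(a == b for a, b in pairs) / len(pairs)


def repeated_scores(text: str, char_level: float, k: int, seed: int = 0) -> List[float]:
    """Apply the stochastic augmenter k times and collect one score per run."""
    rng = random.Random(seed)
    return [toy_accuracy(text, keystroke_noise(text, char_level, rng)) for _ in range(k)]


text = "Hans Christian Andersen blev født i Odense i 1805"
scores = repeated_scores(text, char_level=0.05, k=5)
print(f"mean={statistics.mean(scores):.3f}, sd={statistics.stdev(scores):.3f}")
```

The spread across runs shows how much a single evaluation could have been off, which is the motivation for setting `k` above 1.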

In this manner, evaluating performance on augmented data for SpaCy pipelines is as easy as defining the augmenters and calling a single function. In the `dacy_paper_replication.py` script you can find the exact script used to evaluate the robustness of Danish NLP models in the DaCy paper.

# Custom Models
Evaluating models that are not in the `SpaCy` framework requires writing an `apply_fn` that takes a SpaCy `Example` as input, applies your model to it, turns the predictions into a `Doc`, and returns an `Example`.

The following shows how to write one for DaNLP's BERT named entity recognition model. `add_iob` adds the entities to the predicted `Doc`.

In [None]:
from danlp.models import load_bert_ner_model
from typing import List
from spacy.lang.da import Danish
# load model
bert_model = load_bert_ner_model()
# instantiate an empty Danish SpaCy NLP pipeline
nlp_da = Danish()

def apply_bert_model(example: Example) -> Example:
    doc = nlp_da(example.reference.text)
    tokens, labels = bert_model.predict([t.text for t in doc])
    doc = add_iob(doc, labels)
    return Example(doc, example.reference)


def add_iob(doc: Doc, iob: List[str]) -> Doc:
    """Add IOB tags to a Doc as entity spans.

    Args:
        doc (Doc): A SpaCy Doc
        iob (List[str]): A list of IOB tags, one per token in the doc

    Returns:
        Doc: The doc with its entities set from the IOB tags
    """
    ent = []
    for i, label in enumerate(iob):

        # turn IOB labels into spans
        if label == "O":
            continue
        # split on the first hyphen only, so entity types containing "-" still unpack
        iob_, ent_type = label.split("-", 1)
        if (i - 1 >= 0 and iob_ == "I" and iob[i - 1] == "O") or (
            i == 0 and iob_ == "I"
        ):
            iob_ = "B"
        if iob_ == "B":
            start = i
        if i + 1 >= len(iob) or iob[i + 1].split("-")[0] != "I":
            ent.append(Span(doc, start, i + 1, label=ent_type))
    doc.set_ents(ent)
    return doc
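
To see the IOB-to-span step in isolation, here is a minimal sketch in plain Python that works on tag lists only and returns `(start, end, label)` triples with an exclusive end, the same indexing `Span(doc, start, end)` uses. `iob_to_spans` is a hypothetical helper for illustration, not part of DaCy or DaNLP. It follows the same idea as `add_iob`, with one assumption of its own: an `I-` tag whose label differs from the open entity starts a new entity.

```python
from typing import List, Optional, Tuple


def iob_to_spans(iob: List[str]) -> List[Tuple[int, int, str]]:
    """Convert a list of IOB tags to (start, end, label) triples, end exclusive."""
    spans: List[Tuple[int, int, str]] = []
    start: Optional[int] = None
    current: Optional[str] = None  # label of the entity that is currently open
    for i, tag in enumerate(iob):
        if tag == "O":
            # an "O" tag closes any open entity
            if current is not None:
                spans.append((start, i, current))
            start, current = None, None
            continue
        prefix, label = tag.split("-", 1)
        # "B" always opens a new entity; a stray "I" (after "O", at position 0,
        # or with a different label) is treated as "B"
        if prefix == "B" or current != label:
            if current is not None:
                spans.append((start, i, current))
            start, current = i, label
    # close an entity that runs to the end of the sequence
    if current is not None:
        spans.append((start, len(iob), current))
    return spans


tags = ["B-PER", "I-PER", "O", "I-LOC", "O", "B-ORG"]
print(iob_to_spans(tags))
# prints [(0, 2, 'PER'), (3, 4, 'LOC'), (5, 6, 'ORG')]
```

Checking a converter like this on a few hand-written tag sequences is a cheap way to catch off-by-one errors before plugging a model into `score`.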

In [None]:
danlp_bert = score(test, apply_fn=apply_bert_model, score_fn=["ents"])

If you are in doubt about how to create an apply function for your model, you can find inspiration in `papers/DaCy../apply_fns`. This folder contains apply functions for DaNLP's BERT, Flair, NERDA, and Polyglot. Otherwise, feel free to open an issue on GitHub.