# Tutorial: Evaluating Robustness
This tutorial walks through how to use `Augmenty`/`SpaCy` augmenters to evaluate the robustness of any NLP pipeline. As an example, we'll start out by evaluating SpaCy small and DaCy small on the test set of [DaNE](https://github.com/alexandrainst/danlp/blob/master/docs/docs/datasets.md#dane). DaNE is the Danish Dependency Treebank annotated with part-of-speech tags, dependency relations and named entities. Lastly, we will show how to use this framework on any other type of model, using [DaNLP's BERT](https://github.com/alexandrainst/danlp/blob/master/docs/docs/tasks/ner.md#-bert-bert) as an example.

Let us start off by installing the required packages and loading the models and dataset we wish to test on.

## Installing packages

To get started we will first need to install a few packages:

```bash
# install models
pip install dacy
python -m spacy download da_core_news_sm

# install augmentation library
pip install "augmenty>=1.0.2,<1.1.0"
```

## Loading models and data

In [7]:
import spacy
import dacy

from dacy.datasets import dane

# load the DaNE test set
test = dane(splits=["test"])

# load models
spacy_small = spacy.load("da_core_news_sm")
dacy_small = dacy.load("small")



## Estimating performance
Evaluating models already in the `SpaCy` framework is straightforward. Call the `score` function on your nlp pipeline and choose which metrics you want to calculate. `score` is a wrapper around `spacy.scorer.Scorer` that outputs a nicely formatted dataframe. By default, `score` calculates performance for NER, POS, tokenization, and dependency parsing; this can be changed with the `score_fn` argument.
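
To make the `ents_p`, `ents_r` and `ents_f` columns concrete, here is a minimal sketch of span-level precision, recall and F1 in plain Python. It assumes that an entity only counts as correct when start, end and label all match exactly; the names `span_prf` and `f_score` and the toy spans are made up for illustration and are not part of `dacy` or `spacy`.

```python
from typing import Set, Tuple

# (start token, end token, label); an illustrative representation only
EntitySpan = Tuple[int, int, str]


def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def span_prf(gold: Set[EntitySpan], pred: Set[EntitySpan]) -> Tuple[float, float, float]:
    """Precision, recall and F1 under exact span matching."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f_score(precision, recall)


# Toy data: one of two predictions is exactly right, three gold entities exist.
gold = {(0, 2, "PER"), (5, 6, "LOC"), (8, 9, "ORG")}
pred = {(0, 2, "PER"), (5, 6, "ORG")}
precision, recall, f1 = span_prf(gold, pred)
print(precision, recall, f1)

# Sanity check against the SpaCy small baseline reported in this tutorial:
# ents_p=0.720408 and ents_r=0.632616 should give roughly ents_f=0.673664.
print(round(f_score(0.720408, 0.632616), 4))
```

The same harmonic-mean relation can be used to cross-check any `*_p`, `*_r`, `*_f` triple in the dataframes.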

In [8]:
from dacy.score import score

spacy_baseline = score(test, apply_fn=spacy_small, score_fn=["ents", "pos"])
dacy_baseline = score(test, apply_fn=dacy_small, score_fn=["ents", "pos"])



In [9]:
spacy_baseline

Unnamed: 0,wall_time,ents_p,ents_r,ents_f,ents_per_type_MISC_p,ents_per_type_MISC_r,ents_per_type_MISC_f,ents_per_type_LOC_p,ents_per_type_LOC_r,ents_per_type_LOC_f,...,ents_per_type_ORG_f,ents_per_type_PER_p,ents_per_type_PER_r,ents_per_type_PER_f,ents_excl_MISC_ents_p,ents_excl_MISC_ents_r,ents_excl_MISC_ents_f,pos_acc,tag_acc,k
0,1.638683,0.720408,0.632616,0.673664,0.649485,0.520661,0.577982,0.653846,0.708333,0.68,...,0.551724,0.793651,0.833333,0.813008,0.737913,0.663616,0.698795,0.949103,0.949103,0


In [10]:
dacy_baseline

Unnamed: 0,wall_time,ents_p,ents_r,ents_f,ents_per_type_MISC_p,ents_per_type_MISC_r,ents_per_type_MISC_f,ents_per_type_LOC_p,ents_per_type_LOC_r,ents_per_type_LOC_f,...,ents_per_type_PER_f,ents_per_type_ORG_p,ents_per_type_ORG_r,ents_per_type_ORG_f,ents_excl_MISC_ents_p,ents_excl_MISC_ents_r,ents_excl_MISC_ents_f,pos_acc,tag_acc,k
0,12.794494,0.774312,0.756272,0.765186,0.656,0.677686,0.666667,0.736364,0.84375,0.786408,...,0.90027,0.773109,0.571429,0.657143,0.809524,0.778032,0.793466,0.98002,0.0,0


### Estimating robustness and biases
To obtain performance estimates on augmented data, simply provide a list of augmenters as the `augmenters` argument. 

In [11]:
from augmenty.span.entities import create_per_replace_augmenter_v1
from dacy.datasets import female_names
from spacy.training.augment import create_lower_casing_augmenter

In [12]:

lower_aug = create_lower_casing_augmenter(level=1)
female_name_dict = female_names()
# Augmenter that replaces person names with random Danish female names.
# Each replacement name follows one of the three patterns defined below.

patterns = [["firstname"], ["firstname", "lastname"], ["firstname", "firstname", "lastname"]]
female_aug = create_per_replace_augmenter_v1(female_name_dict, patterns, level=0.1)

spacy_aug = score(
    test,
    apply_fn=spacy_small,
    score_fn=["ents", "pos"],
    augmenters=[lower_aug, female_aug],
)
dacy_aug = score(
    test,
    apply_fn=dacy_small,
    score_fn=["ents", "pos"],
    augmenters=[lower_aug, female_aug],
)



In [13]:
import pandas as pd

pd.concat([spacy_baseline, spacy_aug])

Unnamed: 0,wall_time,ents_p,ents_r,ents_f,ents_per_type_MISC_p,ents_per_type_MISC_r,ents_per_type_MISC_f,ents_per_type_LOC_p,ents_per_type_LOC_r,ents_per_type_LOC_f,...,ents_per_type_ORG_f,ents_per_type_PER_p,ents_per_type_PER_r,ents_per_type_PER_f,ents_excl_MISC_ents_p,ents_excl_MISC_ents_r,ents_excl_MISC_ents_f,pos_acc,tag_acc,k
0,1.638683,0.720408,0.632616,0.673664,0.649485,0.520661,0.577982,0.653846,0.708333,0.68,...,0.551724,0.793651,0.833333,0.813008,0.737913,0.663616,0.698795,0.949103,0.949103,0
0,1.679577,0.673267,0.243728,0.357895,0.741935,0.380165,0.502732,0.653846,0.354167,0.459459,...,0.153846,0.626866,0.233333,0.340081,0.642857,0.20595,0.311958,0.920288,0.920288,0
0,1.375843,0.720408,0.632616,0.673664,0.649485,0.520661,0.577982,0.653846,0.708333,0.68,...,0.551724,0.793651,0.833333,0.813008,0.737913,0.663616,0.698795,0.949103,0.949103,0


In [14]:
pd.concat([dacy_baseline, dacy_aug])

Unnamed: 0,wall_time,ents_p,ents_r,ents_f,ents_per_type_MISC_p,ents_per_type_MISC_r,ents_per_type_MISC_f,ents_per_type_LOC_p,ents_per_type_LOC_r,ents_per_type_LOC_f,...,ents_per_type_PER_f,ents_per_type_ORG_p,ents_per_type_ORG_r,ents_per_type_ORG_f,ents_excl_MISC_ents_p,ents_excl_MISC_ents_r,ents_excl_MISC_ents_f,pos_acc,tag_acc,k
0,12.794494,0.774312,0.756272,0.765186,0.656,0.677686,0.666667,0.736364,0.84375,0.786408,...,0.90027,0.773109,0.571429,0.657143,0.809524,0.778032,0.793466,0.98002,0.0,0
0,13.063498,0.727088,0.639785,0.680648,0.614754,0.619835,0.617284,0.714286,0.78125,0.746269,...,0.805797,0.686869,0.42236,0.523077,0.764228,0.645309,0.699752,0.974477,0.0,0
0,12.604465,0.774312,0.756272,0.765186,0.656,0.677686,0.666667,0.736364,0.84375,0.786408,...,0.90027,0.773109,0.571429,0.657143,0.809524,0.778032,0.793466,0.98002,0.0,0


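The rows are, in order, the baseline, the lower-casing augmenter, and the name augmenter. In the second row, we see that `SpaCy small` is very vulnerable to lower casing: overall NER recall (`ents_r`) drops from 0.63 to 0.24. `DaCy small` is more robust to lower casing, with recall dropping from 0.76 to 0.64, but still suffers. In the output above, however, the third row (name replacement at `level=0.1`) matches the baseline for both models, so this particular run shows no drop from changing names.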
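
To compare models on an equal footing, it can help to express the drop relative to each model's own baseline. The sketch below does this for overall NER recall (`ents_r`) under lower casing, with the numbers copied from the two tables above; the helper `relative_drop` is a made-up name for illustration.

```python
# ents_r values copied from the baseline and lower-casing rows above
baseline_recall = {"SpaCy small": 0.632616, "DaCy small": 0.756272}
lowercase_recall = {"SpaCy small": 0.243728, "DaCy small": 0.639785}


def relative_drop(before: float, after: float) -> float:
    """Fraction of the baseline score that is lost after augmentation."""
    return (before - after) / before


drops = {
    model: relative_drop(baseline_recall[model], lowercase_recall[model])
    for model in baseline_recall
}
for model, drop in drops.items():
    print(f"{model}: recall falls by {drop:.0%} of its baseline value")
```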

To better estimate the effect of stochastic augmenters, such as those changing names or adding keystroke errors, we can use the `k` argument in `score` to run the augmenter multiple times.
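
The reason repeated runs matter is that a stochastic augmenter produces a different corrupted text every time it is applied. The following is a rough, self-contained sketch of that idea, not Augmenty's implementation: `keystroke_noise` and the tiny `NEIGHBOURS` map are invented for illustration, and `level` is assumed here to be the per-character replacement probability.

```python
import random
from typing import Dict, List

# Tiny, made-up neighbour map; a real keyboard layout would cover every key.
NEIGHBOURS: Dict[str, str] = {
    "a": "qsz",
    "e": "wrd",
    "n": "bm",
    "s": "adw",
    "t": "ry",
}


def keystroke_noise(text: str, level: float, rng: random.Random) -> str:
    """Replace each mapped character by a neighbouring key with probability `level`."""
    chars: List[str] = []
    for ch in text:
        if ch in NEIGHBOURS and rng.random() < level:
            chars.append(rng.choice(NEIGHBOURS[ch]))
        else:
            chars.append(ch)
    return "".join(chars)


sentence = "hun rejste til esbjerg"
# Five runs with different seeds, analogous to running an augmenter k=5 times.
variants = [
    keystroke_noise(sentence, level=0.5, rng=random.Random(seed)) for seed in range(5)
]
for variant in variants:
    print(variant)
```

Each variant would lead to a slightly different score, which is why a single run gives only a noisy estimate.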

In [18]:
from augmenty.character.replace import create_keystroke_error_augmenter_v1

key_05_aug = create_keystroke_error_augmenter_v1(level=0.5, keyboard="da_qwerty.v1")

spacy_key = score(
    test, apply_fn=spacy_small, score_fn=["ents", "pos"], augmenters=[key_05_aug], k=5
)

In [19]:
spacy_key

Unnamed: 0,wall_time,ents_p,ents_r,ents_f,ents_per_type_MISC_p,ents_per_type_MISC_r,ents_per_type_MISC_f,ents_per_type_LOC_p,ents_per_type_LOC_r,ents_per_type_LOC_f,...,ents_per_type_ORG_f,ents_per_type_PER_p,ents_per_type_PER_r,ents_per_type_PER_f,ents_excl_MISC_ents_p,ents_excl_MISC_ents_r,ents_excl_MISC_ents_f,pos_acc,tag_acc,k
0,4.706608,0.110333,0.112903,0.111603,0.04,0.041322,0.04065,0.101852,0.114583,0.107843,...,0.089655,0.162679,0.188889,0.174807,0.130045,0.132723,0.13137,0.331013,0.331013,0
1,3.668705,0.118068,0.11828,0.118174,0.08,0.082645,0.081301,0.127451,0.135417,0.131313,...,0.089744,0.160221,0.161111,0.160665,0.129032,0.128146,0.128588,0.329741,0.329741,1
2,4.472808,0.086342,0.098566,0.09205,0.038217,0.049587,0.043165,0.099099,0.114583,0.10628,...,0.079755,0.122549,0.138889,0.130208,0.102083,0.112128,0.10687,0.326347,0.326347,2
3,5.064881,0.119816,0.139785,0.129032,0.060403,0.07438,0.066667,0.165289,0.208333,0.184332,...,0.101266,0.146018,0.183333,0.162562,0.13745,0.157895,0.146965,0.336281,0.336281,3
4,4.583604,0.116239,0.121864,0.118985,0.051471,0.057851,0.054475,0.085714,0.09375,0.089552,...,0.100334,0.179612,0.205556,0.19171,0.135857,0.139588,0.137698,0.330561,0.330561,4
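
With `k=5` we get one row per run, so a natural next step is to summarise the runs by their mean and spread. The sketch below does this with the standard library, using the `ents_f` and `pos_acc` values copied from the table above.

```python
from statistics import mean, stdev

# Values copied from the five rows of `spacy_key` shown above
runs = {
    "ents_f": [0.111603, 0.118174, 0.09205, 0.129032, 0.118985],
    "pos_acc": [0.331013, 0.329741, 0.326347, 0.336281, 0.330561],
}

summary = {
    metric: {"mean": mean(values), "sd": stdev(values)}
    for metric, values in runs.items()
}
for metric, stats in summary.items():
    print(f"{metric}: mean={stats['mean']:.3f}, sd={stats['sd']:.3f}")
```

Since `score` returns a pandas dataframe, something like `spacy_key[["ents_f", "pos_acc"]].agg(["mean", "std"])` should give the same summary directly.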


In this manner, evaluating performance on augmented data for SpaCy pipelines is as easy as defining the augmenters and calling a single function. In the `dacy_paper_replication.py` script you can find the exact script used to evaluate the robustness of Danish NLP models in the [DaCy paper]().

## Evaluating custom models
Evaluating models that are not in the `SpaCy` framework requires the user to write an `apply_fn` that takes an iterable of SpaCy `Example`s as input, applies the model to them, and returns a list of `Example`s.
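
As a minimal sketch of that contract, the example below wraps a toy "model" in an apply function. Everything model-related is hypothetical: `toy_tagger` simply tags every capitalised word as `PER`, and `apply_toy` is an illustrative name. Only the spaCy calls (`spacy.blank`, `Doc`, `Span`, `Example`) are real.

```python
from typing import Iterable, List

import spacy
from spacy.tokens import Doc, Span
from spacy.training import Example

# A blank Danish pipeline: tokenizer only, no trained components
nlp = spacy.blank("da")


def toy_tagger(words: List[str]) -> List[str]:
    """Hypothetical stand-in for a real model: capitalised words become PER."""
    return ["B-PER" if word[:1].isupper() else "O" for word in words]


def apply_toy(examples: Iterable[Example]) -> List[Example]:
    """Apply the toy model and pair each prediction with its gold reference."""
    output = []
    for example in examples:
        words = [token.text for token in example.reference]
        spaces = [bool(token.whitespace_) for token in example.reference]
        tags = toy_tagger(words)

        # Build a fresh Doc holding the predictions
        doc = Doc(nlp.vocab, words=words, spaces=spaces)
        ents = [
            Span(doc, i, i + 1, label=tag.split("-", 1)[1])
            for i, tag in enumerate(tags)
            if tag != "O"
        ]
        doc.set_ents(ents)

        # Example(predicted, reference)
        output.append(Example(doc, example.reference))
    return output


reference = nlp("Mette bor i Aarhus")
result = apply_toy([Example(nlp.make_doc(reference.text), reference)])
predicted_ents = [(ent.text, ent.label_) for ent in result[0].predicted.ents]
print(predicted_ents)
```

A function with this shape can then be passed as `apply_fn` to `score`, in the same way as the spaCy pipelines earlier in the tutorial.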

The following shows how to write one for the NERDA model for named entity recognition. Notice that we replace NERDA's tokenizer (NLTK) with the spaCy tokenizer, as this turns out to give better performance.

We will start out by installing the package and downloading the model. Then we will define an apply function that converts the model's tags to spaCy annotations.

In [17]:
# !pip install NERDA

In [17]:
from NERDA.precooked import DA_BERT_ML
import ssl

model = DA_BERT_ML()
# downloading the DaNLP and NERDA models may fail with an SSL certificate error;
# this workaround disables certificate verification, so use it only for trusted sources
ssl._create_default_https_context = ssl._create_unverified_context
model.download_network()
model.load_network()

Device automatically set to: cpu


Some weights of the model checkpoint at bert-base-multilingual-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



        Model loaded. Please make sure, that you're running the latest version 
        of 'NERDA' otherwise the model is not guaranteed to work.
        


In [20]:
from typing import Iterable, List
from spacy.tokens import Doc, Span
from spacy.training import Example

# set up a danish tokenization pipeline
nlp_da = spacy.blank("da")


def add_iob(doc: Doc, iob: List[str]) -> Doc:
    """A helper function for adding IOB tags to a Doc.

    Args:
        doc (Doc): A SpaCy doc
        iob (List[str]): A list of IOB tags, one per token (e.g. "B-PER", "I-PER", "O")

    Returns:
        Doc: The doc with its entities set from the IOB tags
    """
    ent = []
    for i, label in enumerate(iob):

        # turn IOB labels into spans
        if label == "O":
            continue
        iob_, ent_type = label.split("-")
        if (i - 1 >= 0 and iob_ == "I" and iob[i - 1] == "O") or (
            i == 0 and iob_ == "I"
        ):
            iob_ = "B"
        if iob_ == "B":
            start = i
        if i + 1 >= len(iob) or iob[i + 1].split("-")[0] != "I":
            ent.append(Span(doc, start, i + 1, label=ent_type))
    doc.set_ents(ent)
    return doc


def apply_nerda(examples: Iterable[Example]) -> List[Example]:
    sentences = []
    docs_y = []
    for example in examples:
        # tokenization
        # NERDA uses NLTK for tokenization,
        # but it turns out that the spaCy tokenizer gives better results
        sentences.append([t.text for t in nlp_da(example.reference.text)])
        docs_y.append(example.reference)

    # ner
    labels = model.predict(sentences=sentences)

    examples_ = []
    for doc_y, label, words in zip(docs_y, labels, sentences):
        # pad with "O" if the model returned fewer labels than there are tokens
        if len(label) < len(words):
            label += ["O"] * (len(words) - len(label))

        doc = Doc(nlp_da.vocab, words=words)
        doc = add_iob(doc, iob=label)
        examples_.append(Example(doc, doc_y))
    return examples_
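
The IOB-to-span step is the part most likely to need adapting for another model, so it can be useful to test it in isolation. Below is a dependency-free sketch of that conversion. It is not the `add_iob` function above: `iob_to_spans` is an illustrative name, it returns plain `(start, end, label)` tuples instead of spaCy `Span`s, and it additionally closes a span when the entity type changes mid-sequence.

```python
from typing import List, Optional, Tuple


def iob_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Convert IOB tags to (start, end, label) spans, with end exclusive."""
    spans: List[Tuple[int, int, str]] = []
    start: Optional[int] = None
    label = ""
    for i, tag in enumerate(tags):
        prefix, _, ent_type = tag.partition("-")
        # Close the open span if the entity ends or its type changes
        if start is not None and (prefix != "I" or ent_type != label):
            spans.append((start, i, label))
            start = None
        # Open a span on "B", or on a stray "I" with no open span
        if prefix == "B" or (prefix == "I" and start is None):
            start, label = i, ent_type
    # Close a span that runs to the end of the sequence
    if start is not None:
        spans.append((start, len(tags), label))
    return spans


tags = ["B-PER", "I-PER", "O", "I-LOC", "B-ORG"]
spans = iob_to_spans(tags)
print(spans)
```

Treating a stray `I-` tag as the start of a new entity mirrors the repair that `add_iob` applies when an `I-` tag follows `O` or opens the sentence.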

In [21]:
nerda = score(test, apply_fn=apply_nerda, score_fn=["ents"])

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [22]:
nerda

Unnamed: 0,wall_time,ents_p,ents_r,ents_f,ents_per_type_LOC_p,ents_per_type_LOC_r,ents_per_type_LOC_f,ents_per_type_MISC_p,ents_per_type_MISC_r,ents_per_type_MISC_f,ents_per_type_PER_p,ents_per_type_PER_r,ents_per_type_PER_f,ents_per_type_ORG_p,ents_per_type_ORG_r,ents_per_type_ORG_f,ents_excl_MISC_ents_p,ents_excl_MISC_ents_r,ents_excl_MISC_ents_f,k
0,195.918393,0.819231,0.763441,0.790353,0.747826,0.895833,0.815166,0.756757,0.694215,0.724138,0.942197,0.905556,0.923513,0.768595,0.57764,0.659574,0.836186,0.782609,0.808511,0


If you are unsure how to write an apply function for your model, you can find more inspiration in [`papers/DaCy../apply_fns`](https://github.com/centre-for-humanities-computing/DaCy/tree/main/papers/DaCy-A-Unified-Framework-for-Danish-NLP/apply_fns). This folder contains apply functions for DaNLP's BERT, Flair, NERDA, and Polyglot. Otherwise, feel free to open an issue on GitHub.