# Named Entity Recognition

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/KennethEnevoldsen/DaCy/blob/master/docs/performance.ner.ipynb)


This page examines the performance of competing models for Danish named entity recognition across multiple datasets. Performance is not limited to 
accuracy, but also includes domain generalization, biases and robustness. This page is also a notebook, which you can open to replicate the results.

## State-of-the-Art comparison
To our knowledge, there exist three datasets for Danish named entity recognition:

1) DaNE {cite}`hvingelby2020dane`, which uses the simple annotation scheme of CoNLL 2003 {cite}`missing` with the entities *person*, *location*, *organization*, and *miscellaneous*.
2) DANSK {cite}`missing`, which uses an extensive annotation scheme similar to that of OntoNotes 5.0 {cite}`missing`, including more than 16 entity types.
3) DAN+ {cite}`missing`, which also uses the annotation scheme of CoNLL 2003 but allows for nested entities, for instance *Aarhus Universitet*, where *Aarhus* is a location and *Aarhus Universitet* is an organization.

In this comparison we examine performance on DaNE and DANSK. As no known models have been trained on Danish nested entities, we do not compare performance on DAN+.


```{admonition} Measuring Performance
Typically, when measuring performance on these benchmarks, it is normal to feed the model the gold standard tokens. While this allows for easier comparisons of modules and architectures, it inflates the performance metrics. Further, it does not properly reflect what you are really interested in:
*the performance you can expect when you apply the model to data of a similar type*. Therefore we estimate performance assuming the model is given no prior knowledge of the data, and only the raw text is fed to the model. Thus the performance metrics might differ slightly from those reported elsewhere, e.g. by DaNLP.
```
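To illustrate why scoring on raw text is stricter, the following is a minimal sketch of exact-match, entity-level F1 computed over character offsets. It is not the scorer used for the tables on this page; the function name `span_f1`, the shortened example sentence, and the prediction are illustrative assumptions.

```python
# Entities are (start_char, end_char, label) triples over the raw text, so the
# comparison does not depend on how either side tokenized the text.
Span = tuple[int, int, str]


def span_f1(gold: set[Span], predicted: set[Span]) -> float:
    """Exact-match, entity-level F1 between two sets of character spans."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


text = "Peter Schmeichel kan vinde den kommende kamp mod England."
gold = {(0, 16, "PER"), (49, 56, "LOC")}
# Hypothetical prediction that only recovers the first name of the person.
predicted = {(0, 5, "PER"), (49, 56, "LOC")}

print(span_f1(gold, predicted))
```

A span with the wrong boundaries counts as both a false positive and a false negative, which is one reason scores on raw text tend to be lower than scores on gold tokens.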

### DaNE: Simple Named Entity Recognition
As already stated, DaNE uses the annotation scheme of the CoNLL 2003 dataset, which is as follows {cite}`hvingelby2020dane`:


| Entity | Description |
|--------------|-------------|
| LOC          | includes locations like cities, roads and mountains, as well as both public and commercial places like specific buildings or meeting points, but also abstract places. |
| PERSON | consists of names of people, fictional characters, and animals. The names include aliases. |
| ORG | can be summarized as all sorts of organizations and collections of people, ranging from companies, brands, political movements, governmental bodies and clubs. |
| MISC | is a broad category of e.g. events, languages, titles and religions, but this tag also includes words derived from one of the four tags as well as words for which one part is from one of the three other tags. |

Here is an example from the dataset:

In [1]:
import spacy 
from spacy.tokens import Span
from spacy import displacy
text = """To kendte russiske historikere Andronik Mirganjan og Igor Klamkin tror ikke, at Rusland kan udvikles uden en "jernnæve"."""
nlp = spacy.blank("da")
doc = nlp(text)
doc.ents = [ # type: ignore
    Span(doc, 2, 3, label="MISC"),
    Span(doc, 4, 6, label="PERSON"),
    Span(doc, 7, 9, label="PERSON"),
    Span(doc, 13, 14, label="LOC"),
]

displacy.render(doc, style="ent")

The table below shows the performance of Danish language processing pipelines scored on the DaNE test set. The best scores in each category are highlighted with bold and the second best is underlined.

In [None]:
from pathlib import Path
import pandas as pd
from performance_testing_utils.ner_sota_utils import apply_models, MDL_GETTER_DICT
from dacy.datasets import dane
from spacy.training import Example


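from spacy.language import Language


def apply_models(
    mdl_name: str, nlp: Language, examples: list[Example]
) -> list[Example]:
    """Run `nlp` on the raw text of each example and attach the prediction."""
    texts = [example.reference.text for example in examples]
    docs = nlp.pipe(texts)
    for doc, example in zip(docs, examples):
        example.predicted = doc
    # Scoring into a DataFrame is left to the helper imported from ner_sota_utils.
    return examples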




In [2]:
from pathlib import Path
import pandas as pd
from performance_testing_utils.ner_sota_utils import apply_models, MDL_GETTER_DICT, create_table
from dacy.datasets import dane
from spacy.training import Example

force=False
save_folder = Path("performance_tables/ner")
save_folder.mkdir(exist_ok=True, parents=True)
nlp = spacy.blank("da")
examples: list[Example] = list(dane(splits=["test"])(nlp))  # type: ignore


tables = []
for model_name, getter in MDL_GETTER_DICT.items():
    print("Running model:", model_name)
    model_name_ = model_name.replace("/", "_")
    save_path = save_folder / f"{model_name_}_sota_dane.csv"
    if not save_path.exists() or force:
        nlp = getter()
        result_df = apply_models(model_name, nlp, examples)
        result_df.to_csv(save_path, index=False)
    else:
        print("- Already exists, loading in dataframe")
        result_df = pd.read_csv(save_path)
    tables.append(result_df)

df = pd.concat(tables)


Running model: saattrupdan/nbailab-base-ner-scandi
- Already exists, loading in dataframe
Running model: da_dacy_large_trf-0.2.0
- Already exists, loading in dataframe
Running model: da_dacy_medium_trf-0.2.0
- Already exists, loading in dataframe
Running model: da_dacy_small_trf-0.2.0
- Already exists, loading in dataframe
Running model: alexandrainst/da-ner-base
- Already exists, loading in dataframe
Running model: da_core_news_trf-3.5.0
- Already exists, loading in dataframe
Running model: da_core_news_lg-3.5.0
- Already exists, loading in dataframe
Running model: da_core_news_md-3.5.0
- Already exists, loading in dataframe
Running model: da_core_news_sm-3.5.0
- Already exists, loading in dataframe


In [3]:
create_table(df)

| Model | Average F1 | Misc. F1 | Organization F1 | Location F1 | Person F1 |
|-------|------------|----------|-----------------|-------------|-----------|
| saattrupdan/nbailab-base-ner-scandi | **86.2 (82.3, 89.3)** | 78.8 (72.6, 87.3) | **80.3 (74.7, 85.9)** | <u>88.3 (82.7, 93.4)</u> | **95.1 (92.0, 97.5)** |
| da_dacy_large_trf-0.2.0 | <u>85.4 (81.6, 88.7)</u> | **79.7 (71.7, 85.3)** | <u>78.9 (73.1, 85.2)</u> | **89.4 (83.5, 94.4)** | <u>92.6 (89.6, 95.3)</u> |
| da_dacy_medium_trf-0.2.0 | 84.7 (80.5, 88.7) | <u>79.0 (72.5, 84.9)</u> | 78.6 (70.8, 84.9) | 86.4 (79.4, 92.6) | 92.3 (88.8, 95.6) |
| da_dacy_small_trf-0.2.0 | 82.4 (79.5, 85.1) | 75.1 (68.6, 81.1) | 75.2 (71.3, 79.9) | 83.8 (78.2, 88.4) | 92.2 (90.0, 95.0) |
| alexandrainst/da-ner-base | 70.4 (66.6, 74.1) |  | 64.7 (56.0, 71.0) | 84.9 (77.3, 91.1) | 90.0 (87.0, 93.1) |
| da_core_news_trf-3.5.0 | 78.7 (74.2, 82.1) | 69.0 (59.8, 75.8) | 67.8 (60.1, 73.8) | 81.8 (72.7, 88.8) | 91.3 (88.1, 94.8) |
| da_core_news_lg-3.5.0 | 74.7 (71.2, 77.7) | 64.2 (54.9, 72.2) | 63.1 (55.7, 70.1) | 81.6 (75.6, 88.5) | 85.6 (80.5, 89.6) |
| da_core_news_md-3.5.0 | 70.9 (67.1, 74.2) | 61.6 (52.8, 69.8) | 58.5 (51.8, 65.8) | 75.6 (67.7, 82.1) | 82.8 (79.1, 86.7) |
| da_core_news_sm-3.5.0 | 64.1 (60.0, 67.5) | 58.1 (50.4, 66.5) | 49.0 (42.0, 56.7) | 61.1 (53.3, 69.5) | 79.8 (74.7, 83.9) |


```{note}
Note that `saattrupdan/nbailab-base-ner-scandi` is available in DaCy using `nlp.add_pipe("dacy/ner")`
```

```{admonition} You are missing a model
:class: note

These tables are continually updated and thus we try to limit the number of models to only the most relevant Danish models. Therefore models like Polyglot with strict requirements and consistently worse performance are excluded. If you want to see a specific model, please open an issue on GitHub.
```




### DANSK: Fine-grained Named Entity Recognition

DANSK consists of texts from the Danish Gigaword Corpus {cite}`missing`, annotated across a wide variety of domains including conversational, legal, news, social media, web content, wikis and books. DANSK includes the following labels:


|  Entity        |             Description                                         |
| -------- | ---------------------------------------------------- |
| PERSON   | People, including fictional                          |
| NORP     | Nationalities or religious or political groups       |
| FACILITY | Building, airports, highways, bridges, etc.          |
| ORGANIZATION | Companies, agencies, institutions, etc.              |
| GPE      | Countries, cities, states.                           |
| LOCATION | Non-GPE locations, mountain ranges, bodies of water  |
| PRODUCT  | Vehicles, weapons, foods, etc. (not services)        |
| EVENT    | Named hurricanes, battles, wars, sports events, etc. |
| WORK OF ART | Titles of books, songs, etc.                         |
| LAW      | Named documents made into laws                       |
| LANGUAGE | Any named language                                   |

As well as annotation for the following concepts:

|   Entity       |   Description                                         |
| -------- | ------------------------------------------- |
| DATE     | Absolute or relative dates or periods       |
| TIME     | Times smaller than a day                    |
| PERCENT  | Percentage (including "%")                  |
| MONEY    | Monetary values, including unit             |
| QUANTITY | Measurements, as of weight or distance      |
| ORDINAL  | "first", "second"                           |
| CARDINAL | Numerals that do not fall under another type |


Here we have opted for an interactive chart rather than a table, as a table quickly becomes unwieldy with this many labels. You can select the label you want to compare the models on, and you can hover over the dots to see the exact values.

In [4]:
from performance_testing_utils.ner_sota_utils import apply_models, MDL_FINE_GETTER_DICT, dansk, create_dansk_viz

force=False
train, dev, test = dansk()
examples = [Example(x, x) for x in test]

tables = []
for model_name, getter in MDL_FINE_GETTER_DICT.items():
    print("Running model:", model_name)
    model_name_ = model_name.replace("/", "_")
    save_path = save_folder / f"{model_name_}_sota_dansk.csv"
    if not save_path.exists() or force:
        nlp = getter()
        result_df = apply_models(model_name, nlp, examples, decimals=1)
        result_df.to_csv(save_path, index=False)
    else:
        print("- Already exists, loading in dataframe")
    result_df = pd.read_csv(save_path)  # always reload from CSV so all tables share the same representation
    tables.append(result_df)

df = pd.concat(tables)


Found cached dataset parquet (/Users/au561649/.cache/huggingface/datasets/chcaa___parquet/chcaa--DANSK-ec592bb9b8d7fe08/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)


Running model: da_dacy_large_ner_fine_grained-0.1.0
- Already exists, loading in dataframe
Running model: da_dacy_medium_ner_fine_grained-0.1.0
- Already exists, loading in dataframe
Running model: da_dacy_small_ner_fine_grained-0.1.0
- Already exists, loading in dataframe


In [6]:
create_dansk_viz(df)

AttributeError: 'float' object has no attribute 'split'

## Biases

To examine the biases in Danish models, we use augmentation to replace names in the Danish dataset DaNE {cite}`hvingelby2020dane`. This approach
is similar to that introduced in the initial DaCy paper {cite}`enevoldsen2021dacy`.

Here is a short example of what the augmentation might look like:


````{admonition} Example

```{admonition} Original
:class: note


Peter Schmeichel mener også, at det danske landshold anno 2021 tilhører verdenstoppen og kan vinde den kommende kamp mod England.
```

```{admonition} Female name augmentation
:class: important

Anne Østergaard mener også, at det danske landshold anno 2021 tilhører verdenstoppen og kan vinde den kommende kamp mod England.
```
````
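The idea behind the name augmentation can be sketched in a few lines of plain Python. This is not the augmenter used for the results on this page; the function `replace_person_names`, the name list, and the shortened sentence are hypothetical and only illustrate that the entity labels are kept while the surface form changes.

```python
import random

# Hypothetical name list; the real augmenters draw from much larger lists.
FEMALE_NAMES = ["Anne Østergaard", "Mette Hansen"]


def replace_person_names(text, entities, names, rng):
    """Swap every PER span for a sampled name and shift later offsets.

    `entities` are (start_char, end_char, label) triples sorted by start.
    """
    new_text, new_entities, shift = text, [], 0
    for start, end, label in entities:
        # Account for length changes caused by earlier replacements.
        start, end = start + shift, end + shift
        if label == "PER":
            name = rng.choice(names)
            new_text = new_text[:start] + name + new_text[end:]
            shift += len(name) - (end - start)
            end = start + len(name)
        new_entities.append((start, end, label))
    return new_text, new_entities


text = "Peter Schmeichel kan vinde den kommende kamp mod England."
entities = [(0, 16, "PER"), (49, 56, "LOC")]
augmented_text, augmented_entities = replace_person_names(
    text, entities, FEMALE_NAMES, random.Random(0)
)
for start, end, label in augmented_entities:
    print(label, augmented_text[start:end])
```

Because the labels are carried over unchanged, any drop in performance on the augmented text can be attributed to the new names rather than to a change in the annotation.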


In [None]:
from pathlib import Path
import pandas as pd
from performance_testing_utils.ner_bias_utils import apply_models, MDL_GETTER_DICT, create_table, get_augmenters
from dacy.datasets import dane

force = False
augmenters = get_augmenters()
save_folder = Path("performance_tables/ner")
save_folder.mkdir(exist_ok=True, parents=True)
dataset = dane(splits = "test")

tables = []
for model_name, getter in MDL_GETTER_DICT.items():
    print(model_name)
    model_name_ = model_name.replace("/", "_")
    save_path = save_folder / f"{model_name_}_bias.csv"
    if not save_path.exists() or force:
        nlp = getter()
        result_df = apply_models([(model_name, nlp)], dataset, augmenters, n_rep=20)  # type: ignore
        result_df.to_csv(save_path, index=False)
    else:
        print("- Already exists, loading in dataframe")
        result_df = pd.read_csv(save_path)
    tables.append(result_df)

df = pd.concat(tables)


In [None]:
create_table(df, augmenters=augmenters)

## Generalization
To examine model generalization, we utilize the [DANSK](https://huggingface.co/datasets/chcaa/DANSK) dataset. This dataset is annotated across many different domains including fiction, web content, social media, wikis, news, legal and conversational data. The original dataset includes annotations corresponding to the OntoNotes standard (see [getting started](https://centre-for-humanities-computing.github.io/DaCy/tutorials/basic.html#fine-grained-ner) for the full list). To test generalization, we here convert the annotations to the CoNLL-2003 format using the labels `Person`, `Location`, and `Organization`. As in CoNLL-2003, `Location` includes cities, roads, mountains, abstract places, specific buildings, and meeting points. Thus `GPE` (geo-political entity) annotations were converted to `Location`. The `MISC` category in CoNLL-2003 is a diverse category meant to denote all names not in other categories (encapsulating both events and adjectives, such as "2004 World Cup" and "Italian"), and is therefore not included.
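The conversion described above can be sketched as a simple label mapping. This is not the `convert_to_conll_2003` helper used in the notebook; the mapping table and the function `to_conll_2003` are assumptions, in particular the choice to drop every label that is not listed.

```python
# Mapping described in the text: GPE is folded into Location.
# Assumption: every label not listed here (DATE, EVENT, NORP, ...) is dropped.
ONTONOTES_TO_CONLL = {
    "PERSON": "PER",
    "ORGANIZATION": "ORG",
    "LOCATION": "LOC",
    "GPE": "LOC",
}


def to_conll_2003(entities):
    """Convert (start_char, end_char, label) triples to CoNLL-2003 style labels."""
    return [
        (start, end, ONTONOTES_TO_CONLL[label])
        for start, end, label in entities
        if label in ONTONOTES_TO_CONLL
    ]


fine_grained = [(0, 16, "PERSON"), (49, 56, "GPE"), (60, 64, "DATE")]
print(to_conll_2003(fine_grained))
```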

In [None]:
from performance_testing_utils.generalization_utils import dansk, convert_to_conll_2003, MDL_GETTER_DICT, evaluate_generalization, create_generation_viz

train, dev, test = dansk()
convert_to_conll_2003(train)
convert_to_conll_2003(dev)
convert_to_conll_2003(test)

dataset = train + dev + test

assert set([e.label_ for doc in dataset for e in doc.ents]) == set(["PER", "LOC", "ORG"])

save_folder = Path("performance_tables/ner")
save_folder.mkdir(exist_ok=True, parents=True)

tables = []
# create domains datasets
domains = {}
for doc in dataset:
    domain = doc._.meta["dagw_domain"]
    if domain not in domains:
        domains[domain] = []
    domains[domain].append(doc)

for mdl, getter in MDL_GETTER_DICT.items():
    mdl_name = mdl.replace("/", "_")
    save_path = save_folder / f"{mdl_name}_generalization.csv"
    if not save_path.exists():
        nlp = getter()
        result_df = evaluate_generalization(mdl_name =mdl, mdl=nlp, domains_dataset_dict=domains)
        result_df.to_csv(save_path, index=False)
    else:
        print(f"- {mdl} already exists, loading in dataframe")
    result_df = pd.read_csv(save_path) # always load in dataframe to ensure the same representation
    tables.append(result_df)


In [None]:
df = pd.concat(tables)
chart = create_generation_viz(df)
chart

In [None]:
df

In [None]:
df = pd.concat(tables)
df = df[df["Domain"] != "danavis"]
df = df[df["Domain"] != "dannet"]
df = df[df["Domain"].notnull()]

# convert CI to numeric from string; split() without arguments tolerates repeated spaces
df["Average F1 CI"] = df["Average F1 CI"].apply(lambda x: x[1:-1].split())
df["Average F1 CI Lower"] = df["Average F1 CI"].apply(lambda x: x[0])
df["Average F1 CI Upper"] = df["Average F1 CI"].apply(lambda x: x[1])
df["Average F1 CI Lower"] = pd.to_numeric(df["Average F1 CI Lower"])
df["Average F1 CI Upper"] = pd.to_numeric(df["Average F1 CI Upper"])
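Parsing the interval from its text form is fragile, so a small helper can make the intent explicit. The sketch below assumes the interval was written to CSV as text such as `[0.5  0.85]` (possibly with repeated spaces) and that missing intervals are read back as float NaN; the function `parse_ci` is hypothetical and not part of the notebook's utilities.

```python
import math


def parse_ci(value):
    """Parse a confidence interval stored as text into a (lower, upper) tuple."""
    if not isinstance(value, str):
        # Missing values are assumed to come back from the CSV as float NaN.
        return (math.nan, math.nan)
    lower, upper = value.strip("[]").replace(",", " ").split()
    return (float(lower), float(upper))


print(parse_ci("[0.5  0.85]"))
print(parse_ci(float("nan")))
```

Returning NaN for missing intervals avoids calling string methods on floats when a model has no score for a given domain.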


In [None]:
import altair as alt

selection = alt.selection_point(
    fields=["Domain"],
    bind="legend",
    value=[{"Domain": "All"}],
)
bind_checkbox = alt.binding_checkbox(
    name="Scale point size by number of documents: ",
)
param_checkbox = alt.param(bind=bind_checkbox)

base = (
    alt.Chart(df)
    .mark_point(filled=True)
    .encode(
        x=alt.X("Average F1", title="F1"),
        y=alt.Y("Model", sort=list(MDL_GETTER_DICT.keys())),
        color="Domain",
        size=alt.condition(param_checkbox, "Number of docs", alt.value(100), legend=None),
        tooltip=[
            "Model",
            "Domain",
            "Average F1",
            "Person F1",
            "Location F1",
            "Organization F1",
        ],
        opacity=alt.condition(selection, alt.value(1), alt.value(0.0)),
    )
)
error_bars = (
    alt.Chart(df)
    .mark_errorbar(ticks=False)
    .encode(
        # x='Average F1 CI Lower',
        x=alt.X("Average F1 CI Lower", title="F1"),
        x2="Average F1 CI Upper",
        # y="Model",
        y=alt.Y("Model", sort=list(MDL_GETTER_DICT.keys())),
        color="Domain",
        opacity=alt.condition(selection, alt.value(1), alt.value(0.0)),
    )
)

chart =  base + error_bars

chart.add_params(selection, param_checkbox).properties(width=800, height=400)


In [None]:
sp_sm = spacy.load("da_core_news_sm")
sp_lg = spacy.load("da_core_news_lg")

mdls = [
    ("spaCy (da_core_news_sm)", sp_sm),
    ("spaCy (da_core_news_lg)", sp_lg),
]
from spacy.training import Example
from spacy.scorer import Scorer
import random
import numpy as np

scorer = Scorer()

def no_misc_getter(doc, attr):
    for ent in doc.ents:
        if ent.label_ != "MISC":
            yield ent

def bootstrap(examples, n_rep=100):
    scores = []
    for i in range(n_rep):
        sample = random.choices(examples, k=len(examples))
        score = scorer.score_spans(sample, getter=no_misc_getter, attr="ents")
        scores.append(score)
    return scores

def compute_mean_and_ci(scores):

    ent_f = [score["ents_f"] for score in scores]
    per_f = [score["ents_per_type"].get("PER", {"f": None})["f"] for score in scores]
    loc_f = [score["ents_per_type"].get("LOC", {"f": None})["f"] for score in scores]
    org_f = [score["ents_per_type"].get("ORG", {"f": None})["f"] for score in scores]

    nam = ["Average F1", "Person F1", "Location F1", "Organization F1"]

    d = {}
    for n, f in zip(nam, [ent_f, per_f, loc_f, org_f]):
        f = [x for x in f if x is not None]
        if len(f) == 0:
            d[n] = {
                "mean": None,
                "ci": None
            }
            continue
        d[n] = {
            "mean": np.mean(f),
            "ci": np.percentile(f, [2.5, 97.5])
        }
    return d


all_examples = {}
rows= []
for mdl_name, mdl in mdls:
    all_examples[mdl_name] = []
    for domain in domains:
        docs = domains[domain]
        model_pred = mdl.pipe([doc.text for doc in docs])
        examples = [Example(predicted=x, reference=y) for x, y in zip(model_pred, docs)]
        all_examples[mdl_name].extend(examples)

        bs_score = bootstrap(examples)
        score = compute_mean_and_ci(bs_score)


        row = {
            "Model": mdl_name,
            "Domain": domain,
            "Average F1": score["Average F1"]["mean"],
            "Person F1": score["Person F1"]["mean"],
            "Location F1": score["Location F1"]["mean"],
            "Organization F1": score["Organization F1"]["mean"],
            "Average F1 CI": score["Average F1"]["ci"],

            "Number of docs": len(docs),
            
        }
        rows.append(row)

# across domains
for mdl in all_examples:
    examples = all_examples[mdl]
    bs_score = bootstrap(examples)
    score = compute_mean_and_ci(bs_score)

    row = {
        "Model": mdl,
        "Domain": "All",
        "Average F1": score["Average F1"]["mean"],
        "Person F1": score["Person F1"]["mean"],
        "Location F1": score["Location F1"]["mean"],
        "Organization F1": score["Organization F1"]["mean"],
        "Average F1 CI": score["Average F1"]["ci"],
        "Number of docs": len(examples),
    }
    rows.append(row)

# write to file
import pandas as pd
df = pd.DataFrame(rows)
df.to_csv("ner_performance.csv", index=False)

In [None]:
import pandas as pd
import altair as alt

df = pd.DataFrame(rows)

# filter out domains
df = df[df["Domain"] != "danavis"]
df = df[df["Domain"] != "dannet"]
df = df[df["Domain"].notnull()]

df['Average F1 CI Lower'] = df['Average F1 CI'].apply(lambda x: x[0])
df['Average F1 CI Upper'] = df['Average F1 CI'].apply(lambda x: x[1])
df['Average F1 CI Lower'] = pd.to_numeric(df['Average F1 CI Lower'])
df['Average F1 CI Upper'] = pd.to_numeric(df['Average F1 CI Upper'])



selection = alt.selection_point(fields=['Domain'], bind='legend', value=[{'Domain': 'All'}]) # does not work

bind_checkbox = alt.binding_checkbox(name='Scale point size by number of documents: ')
param_checkbox = alt.param(bind=bind_checkbox)

base = alt.Chart(df).mark_point(filled=True).encode(
    # x='Average F1',
    x=alt.X('Average F1', title="F1"),
    y='Model',
    color='Domain',
    size=alt.condition(
        param_checkbox,
        'Number of docs',
        alt.value(100)
    ),
    tooltip=["Model", "Domain", "Average F1", "Person F1", "Location F1", "Organization F1"],
    opacity=alt.condition(selection, alt.value(1), alt.value(0.0))
)
error_bars = alt.Chart(df).mark_errorbar(ticks=False).encode(
    # x='Average F1 CI Lower',
    x = alt.X('Average F1 CI Lower', title="F1"),
    x2='Average F1 CI Upper',
    y='Model',
    color='Domain',
    opacity=alt.condition(selection, alt.value(1), alt.value(0.0))
)

chart = error_bars + base

chart.add_params(selection, param_checkbox).properties(
    width=800,
    height=400
)

```{note}
The F1 in the figure denotes the mean bootstrapped F1 score with a 95% confidence interval. The F1 score is calculated on all of the DANSK dataset.
```
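For readers unfamiliar with bootstrapping, the following is a minimal, dependency-free sketch of the procedure: resample the examples with replacement, score each resample, and report the mean together with the 2.5th and 97.5th percentiles. The function `bootstrap_mean_ci` and the toy metric are assumptions, and the percentiles are picked by index rather than by interpolation, so the numbers can differ slightly from `np.percentile`.

```python
import random
import statistics


def bootstrap_mean_ci(items, metric, n_rep=100, seed=0):
    """Return the mean bootstrapped score and an approximate 95% interval."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rep):
        # Resample with replacement, keeping the original sample size.
        sample = rng.choices(items, k=len(items))
        scores.append(metric(sample))
    scores.sort()
    lower = scores[int(0.025 * (n_rep - 1))]
    upper = scores[int(0.975 * (n_rep - 1))]
    return statistics.mean(scores), (lower, upper)


# Toy stand-in for per-document correctness: 80 hits and 20 misses.
outcomes = [1] * 80 + [0] * 20
mean_score, (ci_lower, ci_upper) = bootstrap_mean_ci(outcomes, statistics.mean)
print(mean_score, ci_lower, ci_upper)
```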

## Robustness

In the paper [DaCy: A Unified Framework for Danish NLP](https://github.com/centre-for-humanities-computing/DaCy/blob/main/papers/DaCy-A-Unified-Framework-for-Danish-NLP/readme.md) we conducted a series of augmentations on the DaNE test set to estimate the robustness and biases of DaCy and other Danish language processing pipelines. This page presents only parts of the paper. We recommend reading the paper for a more thorough and nuanced overview.

Let's start by examining a couple of the augmentations, namely changing out names or introducing plausible keystroke errors.

````{admonition} Example

```{note} Original

Peter Schmeichel mener også, at det danske landshold anno 2021 tilhører verdenstoppen og kan vinde den kommende kamp mod England.
```

```{important} Female name augmentation

Anne Østergaard mener også, at det danske landshold anno 2021 tilhører verdenstoppen og kan vinde den kommende kamp mod England.
```
````

The underlying assumption of making these augmentations is that the tags of the tokens do not change with augmentation. In our case, this means that "Anne Østergaard" is still a person and that "vonde" can still be considered a verb based on its context.

Based on this, if a model performs worse on a certain set of names or with minor spelling variations or errors, we can conclude that the model is vulnerable to such input. For instance, if the model has a hard time when æ, ø, and å are replaced with ae, oe, and aa, it might not be ideal to apply to historic texts.
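As a concrete illustration of the spelling variation mentioned above, here is a minimal sketch that spells out æ, ø, and å. It is not the augmenter used for the results; the function name `spell_out_danish_letters` is hypothetical, and a real augmenter would also need to shift entity offsets, since the replacements change the text length.

```python
# Map each Danish letter to its two-letter spelling, preserving capitalization.
AE_OE_AA = str.maketrans(
    {"æ": "ae", "ø": "oe", "å": "aa", "Æ": "Ae", "Ø": "Oe", "Å": "Aa"}
)


def spell_out_danish_letters(text: str) -> str:
    """Replace æ, ø, and å with ae, oe, and aa."""
    return text.translate(AE_OE_AA)


print(spell_out_danish_letters("Anne Østergaard mener også"))
```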

Text with 5% keystroke errors is still readable. However, 15% keystroke errors test the limit of what humans and models can reasonably be expected to comprehend, as the following example shows.

```{important}
**15% keystroke errors**

Peter Schmeichel mejer ogsp, at ddt danske landshoof anbo 202q tilhårer gerfenatop0en of lan vinde sen kpmkendw lamp mod England.
```
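Keystroke errors of this kind can be simulated by swapping a character for a neighbouring key with some probability. The sketch below is not the augmenter used here; the neighbour map is a hypothetical, heavily truncated stand-in for a full keyboard layout, and applying the rate per eligible character is an assumption.

```python
import random

# Hypothetical, truncated neighbour map; a real layout covers every key.
NEIGHBOURS = {
    "a": "qwsz", "d": "sfec", "e": "wrds", "k": "jlim", "m": "nkj",
    "n": "bmhj", "o": "iplk", "r": "etfd", "s": "adwx", "t": "rygf",
}


def keystroke_errors(text, rate, rng):
    """Replace eligible characters with a neighbouring key with probability `rate`."""
    out = []
    for char in text:
        candidates = NEIGHBOURS.get(char.lower())
        if candidates and rng.random() < rate:
            out.append(rng.choice(candidates))
        else:
            out.append(char)
    return "".join(out)


sentence = "Peter Schmeichel kan vinde den kommende kamp mod England."
print(keystroke_errors(sentence, 0.15, random.Random(0)))
```

Only substitutions are modelled, so the text length and all entity offsets stay unchanged, which keeps the gold annotations valid after augmentation.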




The following tables show a detailed breakdown of performance for named entity recognition, part-of-speech tagging, and dependency parsing. These show some general trends, some of which include:

- Spelling variations and abbreviated first names consistently reduce performance of all models on all tasks.
- Even simple replacements of æ, ø, and å with ae, oe, and aa lead to notable performance degradation.
- In general, larger models handle augmentations better than small models with DaCy large performing the best.
- The BiLSTM-based models (Stanza and Flair) perform competitively under augmentations and are only consistently outperformed by DaCy large.


In [None]:
from performance_testing_utils.ner_bias_utils import apply_models, MDL_GETTER_DICT, create_table
from performance_testing_utils.ner_robustness_utils import get_augmenters

force = False
augmenters = get_augmenters()
save_folder = Path("performance_tables/ner")
save_folder.mkdir(exist_ok=True, parents=True)
dataset = dane(splits = "test")

tables = []
for model_name, getter in MDL_GETTER_DICT.items():
    print(model_name)
    model_name_ = model_name.replace("/", "_")
    save_path = save_folder / f"{model_name_}_robustness.csv"  # separate file so cached bias results are not reused
    if not save_path.exists() or force:
        nlp = getter()
        result_df = apply_models([(model_name, nlp)], dataset, augmenters, n_rep=20)  # type: ignore
        result_df.to_csv(save_path, index=False)
    else:
        print("- Already exists, loading in dataframe")
        result_df = pd.read_csv(save_path)
    tables.append(result_df)

df = pd.concat(tables)

In [None]:
create_table(df, augmenters=augmenters)

So what do these augmentations look like? The following shows an example of the augmentations using a sample from DaNE.


In [None]:
from dacy.datasets import dane
import spacy
import augmenty

nlp = spacy.blank("da")
test_corpus = dane(splits = "test")

example = next(test_corpus(nlp)) # extract first example
doc = example.reference  # extract the reference/gold standard document

print(doc)

for augmentation_name, augmenter in augmenters:
    print(augmentation_name)
    augmented_docs = augmenty.docs([doc], augmenter, nlp)
    for augmented_doc in augmented_docs:
        print("\t-", augmented_doc)



# References

```{bibliography}
```