# Named Entity Recognition


This page examines the performance of competing models for Danish named entity recognition over multiple datasets. Performance is not limited to 
accuracy, but also includes domain generalization, biases and robustness. This page is also a notebook, which can be opened and run to replicate the results.

## State-of-the-Art comparison
To our knowledge, there exist three datasets for Danish named entity recognition:

1) DaNE {cite}`hvingelby2020dane`, which uses the simple annotation scheme of CoNLL 2003 {cite}`sang2003introduction` with the entities *person*, *location*, *organization*, and *miscellaneous*.
2) [DANSK](https://huggingface.co/datasets/chcaa/DANSK), which uses an extensive annotation scheme similar to that of OntoNotes 5.0 {cite}`weischedel2013ontonotes`, including more than 16 entity types.
3) and DAN+ {cite}`plank2021dan+`, which also uses the annotation scheme of CoNLL 2003 but allows for nested entities, for instance *Aarhus Universitet*, where *Aarhus* is a location and *Aarhus Universitet* is an organization.

In this comparison we examine performance on DaNE and DANSK. As no known models have been trained on Danish nested entities, we do not compare performance on DAN+.
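
To illustrate why nested annotation needs separate handling, the sketch below represents the *Aarhus Universitet* example as character spans and checks whether one span lies inside another. The tuple layout and the helper `has_nested_spans` are illustrative assumptions and not the DAN+ data format.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char, label)

text = "Aarhus Universitet"
# flat annotation: only the outermost entity is kept
flat: List[Span] = [(0, 18, "ORG")]
# nested annotation: the inner location is annotated as well
nested: List[Span] = [(0, 18, "ORG"), (0, 6, "LOC")]


def has_nested_spans(spans: List[Span]) -> bool:
    """Return True if one span lies inside another span of the list."""
    for i, (start_a, end_a, _) in enumerate(spans):
        for j, (start_b, end_b, _) in enumerate(spans):
            if i != j and start_a <= start_b and end_b <= end_a:
                return True
    return False


print(has_nested_spans(flat))    # False
print(has_nested_spans(nested))  # True
```

A model that emits one label per token can only produce the flat variant, which is why the nested DAN+ annotations are not directly comparable.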


```{admonition} Measuring Performance
Typically, when measuring performance on these benchmarks, the model is fed the gold-standard tokens. While this allows for easier comparison of modules and architectures, it inflates the performance metrics. Further, it does not properly reflect what you are really interested in:
*the performance you can expect when you apply the model to data of a similar type*. Therefore, we assume the model is given no prior knowledge of the data, and only the raw text is fed to the model. Thus the performance metrics might differ slightly from those reported by e.g. DaNLP.
```
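
To make the raw-text setting concrete, the sketch below scores predictions against gold annotations using character offsets, so the score does not depend on the model sharing the gold tokenization. It assumes exact-match span scoring; the helper `span_f1` and the example offsets are hypothetical and not part of the evaluation code used on this page.

```python
from typing import Iterable, Set, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char, label)


def span_f1(gold: Iterable[Span], pred: Iterable[Span]) -> float:
    """Exact-match F1 over labelled character spans."""
    gold_set: Set[Span] = set(gold)
    pred_set: Set[Span] = set(pred)
    true_pos = len(gold_set & pred_set)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_set)
    recall = true_pos / len(gold_set)
    return 2 * precision * recall / (precision + recall)


text = "Igor Klamkin bor i Rusland."
gold = [(0, 12, "PER"), (19, 26, "LOC")]
# a model run on raw text may cut a span differently than the gold tokens do
pred = [(0, 4, "PER"), (19, 26, "LOC")]

score = span_f1(gold, pred)
print(f"{score:.2f}")
```

Here the truncated person span counts as both a false positive and a false negative, which is the kind of error that feeding gold tokens can hide.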

### DaNE: Simple Named Entity Recognition
As already stated, DaNE uses the annotation scheme of the CoNLL 2003 dataset, which is as follows {cite}`hvingelby2020dane`:




| Entity | Description |
|--------------|-------------|
| LOC          | includes locations like cities, roads and mountains, as well as both public and commercial places like specific buildings or meeting points, but also abstract places. |
| PERSON | consists of names of people, fictional characters, and animals. The names include aliases. |
| ORG | can be summarized as all sorts of organizations and collections of people, ranging from companies, brands, political movements, governmental bodies and clubs. |
| MISC | is a broad category of e.g. events, languages, titles and religions, but this tag also includes words derived from one of the four tags as well as words for which one part is from one of the three other tags. |

Here is an example from the dataset:

In [1]:
import spacy
from spacy.tokens import Span
from spacy import displacy

text = """To kendte russiske historikere Andronik Mirganjan og Igor Klamkin tror ikke, at Rusland kan udvikles uden en "jernnæve"."""
nlp = spacy.blank("da")
doc = nlp(text)
doc.ents = [  # type: ignore
    Span(doc, 2, 3, label="MISC"),
    Span(doc, 4, 6, label="PERSON"),
    Span(doc, 7, 9, label="PERSON"),
    Span(doc, 13, 14, label="LOC"),
]

In [24]:
displacy.render(doc, style="ent")

The table below shows the performance of Danish language processing pipelines scored on the DaNE test set. The best score in each category is highlighted in bold and the second best is underlined.

In [2]:
from evaluation.models import MODELS
from evaluation.utils import apply_models

In [2]:
dane = {}
for mdl_name, model_getter in MODELS.items():
    mdl_results = apply_models(mdl_name, model_getter, dataset="dane", splits=["test"])
    dane[mdl_name] = mdl_results["test"]


dane (test): Loading prediction for da_dacy_large_trf-0.2.0
dane (test): Loading prediction for da_dacy_medium_trf-0.2.0
dane (test): Loading prediction for da_dacy_small_trf-0.2.0
dane (test): Loading prediction for da_dacy_large_ner_fine_grained-0.1.0
dane (test): Loading prediction for da_dacy_medium_ner_fine_grained-0.1.0
dane (test): Loading prediction for da_dacy_small_ner_fine_grained-0.1.0
dane (test): Loading prediction for saattrupdan/nbailab-base-ner-scandi
dane (test): Loading prediction for alexandrainst/da-ner-base
dane (test): Loading prediction for da_core_news_trf-3.5.0
dane (test): Loading prediction for da_core_news_lg-3.5.0
dane (test): Loading prediction for da_core_news_md-3.5.0
dane (test): Loading prediction for da_core_news_sm-3.5.0
dane (test): Loading prediction for openai/gpt-3.5-turbo (02/05/23)
dane (test): Running openai/gpt-4 (02/05/23)


In [3]:
# normalize labels to match the dataset
for mdl in dane:
    if "openai" not in mdl:
        continue
    examples = dane[mdl]["examples"]
    mapping = {
        "PERSON": "PER",
        "ORGANISATION": "ORG",
        "LOCATION": "LOC",
    }
    for e in examples:
        ents = e.x.ents
        for ent in ents:
            ent.label_ = mapping[ent.label_]

        e.x.ents = ents
    

In [3]:
import pandas as pd
from evaluation.utils import create_dataframe

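def _leading_score(s: pd.Series) -> pd.Series:
    """Parse the leading F1 value of strings such as "85.4 (81.2, 88.9)"."""
    return pd.to_numeric(s.astype(str).str.split().str[0], errors="coerce")


def highlight_max(s: pd.Series) -> list:
    """Highlight the maximum in a Series with bold text."""
    # compare parsed scores: raw strings compare lexicographically, which
    # would rank a missing value ("nan") above every number
    scores = _leading_score(s)
    is_max = scores == scores.max()
    return ["font-weight: bold" if v else "" for v in is_max]


def underline_second_max(s: pd.Series) -> list:
    """Underline the second maximum in a Series."""
    scores = _leading_score(s)
    ranked = scores.dropna().sort_values(ascending=False)
    if len(ranked) < 2:
        return [""] * len(s)
    is_second_max = scores == ranked.iloc[1]
    return ["text-decoration: underline" if v else "" for v in is_second_max]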


def create_table(
    df: pd.DataFrame,
    caption="F1 score with 95% confidence interval calculated using bootstrapping with 100 samples.",
):
    # replace index with range
    df.index = range(len(df))  # type: ignore

    col_names = [("", "Models")] + [("F1", col) for col in df.columns[1:]]
    super_header = pd.MultiIndex.from_tuples(col_names)
    df.columns = super_header

    s = df.style.apply(highlight_max, axis=0, subset=df.columns[1:])
    s = s.apply(underline_second_max, axis=0, subset=df.columns[1:])

    # Add a caption
    s = s.set_caption(caption)

    # Center the header and left align the model names
    s = s.set_properties(subset=df.columns[1:], **{"text-align": "right"})

    super_header_style = [
        {"selector": ".level0", "props": [("text-align", "center")]},
        {"selector": ".col_heading", "props": [("text-align", "center")]},
    ]
    # Apply the CSS style to the styler
    s = s.set_table_styles(super_header_style)  # type: ignore
    s = s.set_properties(subset=[("", "Models")], **{"text-align": "left"})
    # remove the index
    s = s.hide(axis="index")

    # smaller font size
    s = s.set_table_attributes('style="font-size: 0.65em"')
    return s

In [5]:
from multiprocessing import Pool
with Pool(8) as p:
    tables = p.starmap(
        create_dataframe,
        [(dane[mdl]["examples"], mdl, 1, 500) for mdl in dane if "fine_grained" not in mdl],
    )

In [6]:
df = pd.concat(tables)
# sort columns
df = df[["Models", "Average", "Location", "Person", "Organization", "Misc."]]
df_average = df["Average"]
create_table(
    df,
    "F1 score with 95% confidence interval calculated using bootstrapping with 500 samples.",
)

| Models | Average F1 | Location F1 | Person F1 | Organization F1 | Misc. F1 |
|--------|------------|-------------|-----------|-----------------|----------|
| da_dacy_large_trf-0.2.0 | 85.4 (81.2, 88.9) | 89.5 (84.0, 94.7) | 92.6 (89.0, 95.4) | 79.0 (72.5, 84.6) | 79.0 (70.8, 86.0) |
| da_dacy_medium_trf-0.2.0 | 84.9 (81.0, 88.5) | 86.8 (81.2, 92.3) | 92.7 (89.2, 95.6) | 78.7 (71.8, 85.0) | 78.7 (70.6, 86.1) |
| da_dacy_small_trf-0.2.0 | 82.7 (79.3, 85.9) | 84.2 (78.3, 89.8) | 92.2 (88.5, 95.1) | 75.9 (69.3, 81.7) | 75.7 (68.8, 81.8) |
| saattrupdan/nbailab-base-ner-scandi | 86.3 (82.4, 89.7) | 88.6 (83.0, 93.3) | 95.1 (92.4, 97.8) | 80.3 (73.6, 85.8) | 78.6 (69.4, 86.0) |
| alexandrainst/da-ner-base | 70.7 (66.2, 75.2) | 84.8 (77.8, 91.0) | 90.3 (86.3, 93.9) | 64.7 (57.0, 71.3) | |
| da_core_news_trf-3.5.0 | 79.0 (75.1, 82.3) | 82.1 (75.5, 88.5) | 91.6 (88.2, 94.5) | 68.0 (61.0, 75.2) | 69.0 (61.1, 77.3) |
| da_core_news_lg-3.5.0 | 74.6 (70.8, 78.1) | 81.6 (75.3, 88.2) | 85.5 (81.1, 89.9) | 62.7 (54.8, 70.3) | 64.4 (55.9, 72.8) |
| da_core_news_md-3.5.0 | 71.2 (66.9, 75.2) | 76.8 (69.9, 83.6) | 82.6 (77.8, 87.0) | 58.2 (49.6, 66.7) | 61.8 (52.6, 70.6) |
| da_core_news_sm-3.5.0 | 64.4 (59.7, 68.5) | 61.6 (52.2, 69.9) | 80.1 (74.9, 85.1) | 49.0 (39.0, 57.5) | 58.4 (49.8, 67.1) |
| openai/gpt-3.5-turbo (02/05/23) | 57.5 (52.3, 62.2) | 50.7 (41.9, 59.2) | 81.9 (76.8, 86.5) | 55.7 (47.1, 63.7) | |
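
The intervals in parentheses come from bootstrapping. The sketch below shows one common variant, the percentile bootstrap over documents, to make the idea concrete. It is an assumption about the general technique and not the implementation behind `create_dataframe`; the helper names and the per-document counts are made up for illustration.

```python
import random
from typing import List, Tuple

# (true positives, false positives, false negatives) per document; toy numbers
Counts = Tuple[int, int, int]


def f1_from_counts(docs: List[Counts]) -> float:
    """Micro-averaged F1 from summed per-document counts."""
    tp = sum(d[0] for d in docs)
    fp = sum(d[1] for d in docs)
    fn = sum(d[2] for d in docs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1(docs: List[Counts], n_rep: int = 500, alpha: float = 0.05, seed: int = 0):
    """Return the point estimate and a percentile confidence interval."""
    rng = random.Random(seed)
    # resample documents with replacement and rescore each resample
    scores = sorted(
        f1_from_counts(rng.choices(docs, k=len(docs))) for _ in range(n_rep)
    )
    lower = scores[int((alpha / 2) * n_rep)]
    upper = scores[int((1 - alpha / 2) * n_rep) - 1]
    return f1_from_counts(docs), lower, upper


docs = [(3, 1, 0), (2, 0, 1), (4, 1, 1), (1, 0, 0), (5, 2, 1)]
point, lower, upper = bootstrap_f1(docs)
print(f"{point * 100:.1f} ({lower * 100:.1f}, {upper * 100:.1f})")
```

Resampling whole documents rather than individual entities keeps entities from the same text together, which is one reasonable design choice among several.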


It is worth mentioning that while `da_dacy_large_trf-0.2.0` and `saattrupdan/nbailab-base-ner-scandi` perform similarly, each has its own strengths and weaknesses. The large DaCy model is a multi-task model that performs named entity recognition as only one of its many tasks, so if you also need one of those other tasks we recommend that model. On the other hand, `nbailab-base-ner-scandi` is trained on multiple Scandinavian languages and thus might be ideal if your dataset contains these languages as well. `saattrupdan/nbailab-base-ner-scandi` is available in DaCy using `nlp.add_pipe("dacy/ner")`.

```{admonition} You are missing a model
:class: note

These tables are continually updated and thus we try to limit the number of models to only the most relevant Danish models. Therefore models like Polyglot with strict requirements and consistently worse performance are excluded. If you want to see a specific model, please open an issue on GitHub.
```

### DANSK: Fine-grained Named Entity Recognition

DANSK is annotated on texts from the Danish Gigaword Corpus {cite}`derczynski2021danish` and covers a wide variety of domains, including conversational, legal, news, social media, web content, wikis, and books. DANSK includes the following labels:





|  Entity        |             Description                                         |
| -------- | ---------------------------------------------------- |
| PERSON   | People, including fictional                          |
| NORP     | Nationalities or religious or political groups       |
| FACILITY | Buildings, airports, highways, bridges, etc.         |
| ORGANIZATION | Companies, agencies, institutions, etc.              |
| GPE      | Countries, cities, states.                           |
| LOCATION | Non-GPE locations, mountain ranges, bodies of water  |
| PRODUCT  | Vehicles, weapons, foods, etc. (not services)        |
| EVENT    | Named hurricanes, battles, wars, sports events, etc. |
| WORK OF ART | Titles of books, songs, etc.                         |
| LAW      | Named documents made into laws                       |
| LANGUAGE | Any named language                                   |

As well as annotation for the following concepts:

|   Entity       |   Description                                         |
| -------- | ------------------------------------------- |
| DATE     | Absolute or relative dates or periods       |
| TIME     | Times smaller than a day                    |
| PERCENT  | Percentage (including "%")                  |
| MONEY    | Monetary values, including unit             |
| QUANTITY | Measurements, as of weight or distance      |
| ORDINAL  | "first", "second"                           |
| CARDINAL | Numerals that do not fall under another type |


Here we have opted for an interactive chart rather than a table, as a table quickly becomes unwieldy with this number of labels. You can select the label you want to compare the models on, and you can hover over the dots to see the exact values.

In [4]:
from functools import partial
from evaluation.models import openai_model_loader_fine_ner
MODELS_ = MODELS.copy()
MODELS_["openai/gpt-3.5-turbo (02/05/23)"] = partial(openai_model_loader_fine_ner, model="gpt-3.5-turbo")
MODELS_["openai/gpt-4 (02/05/23)"] = partial(openai_model_loader_fine_ner, model="gpt-4")


In [6]:
dansk = {}
for mdl_name, model_getter in MODELS_.items():
    if "openai" in mdl_name:
        splits=["test"]
    else:
        splits=["train", "dev", "test"]
    mdl_results = apply_models(
        mdl_name, model_getter, dataset="dansk", splits=splits
    )
    dansk[mdl_name] = mdl_results

dansk (train): Loading prediction for da_dacy_large_trf-0.2.0
dansk (dev): Loading prediction for da_dacy_large_trf-0.2.0
dansk (test): Loading prediction for da_dacy_large_trf-0.2.0
dansk (train): Loading prediction for da_dacy_medium_trf-0.2.0
dansk (dev): Loading prediction for da_dacy_medium_trf-0.2.0
dansk (test): Loading prediction for da_dacy_medium_trf-0.2.0
dansk (train): Loading prediction for da_dacy_small_trf-0.2.0
dansk (dev): Loading prediction for da_dacy_small_trf-0.2.0
dansk (test): Loading prediction for da_dacy_small_trf-0.2.0
dansk (train): Loading prediction for da_dacy_large_ner_fine_grained-0.1.0
dansk (dev): Loading prediction for da_dacy_large_ner_fine_grained-0.1.0
dansk (test): Loading prediction for da_dacy_large_ner_fine_grained-0.1.0
dansk (train): Loading prediction for da_dacy_medium_ner_fine_grained-0.1.0
dansk (dev): Loading prediction for da_dacy_medium_ner_fine_grained-0.1.0
dansk (test): Loading prediction for da_dacy_medium_ner_fine_grained-0.1.0

Found cached dataset parquet (/Users/au561649/.cache/huggingface/datasets/chcaa___parquet/chcaa--DANSK-8622a47955f5c4cb/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)


In [None]:
with Pool(8) as p:
    tables = p.starmap(
        create_dataframe,
        [(dansk[mdl]["test"]["examples"], mdl, 1, 100, 2000) for mdl in dansk if "fine_grained" in mdl],
    )


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

In [None]:
import altair as alt


def create_dansk_viz(df: pd.DataFrame):
    plot_df = df.melt(
        id_vars=["Models"],
        var_name="Label",
        value_name="F1 string",
    )

    # Convert the score value to a float
    plot_df["F1"] = plot_df["F1 string"].apply(
        lambda x: float(x.split()[0]) if not isinstance(x, float) else x
    )
    plot_df["CI Lower"] = plot_df["F1 string"].apply(
        lambda x: float(x.split("(")[1].split(",")[0])
    )
    plot_df["CI Upper"] = plot_df["F1 string"].apply(
        lambda x: float(x.split(",")[1].split(")")[0])
    )

    selection = alt.selection_point(
        fields=["Label"],
        bind="legend",
        value=[{"Label": "Average"}],
    )

    base = (
        alt.Chart(plot_df)
        .mark_point(filled=True, size=100)
        .encode(
            x=alt.X("F1", title="F1"),
            y="Models",
            color="Label",
            tooltip=[
                "Models",
                "Label",
                alt.Tooltip("F1 string", title="F1"),
            ],
            opacity=alt.condition(selection, alt.value(1), alt.value(0.0)),
            # only show the tooltip when the label is selected
        )
    )
    error_bars = (
        alt.Chart(plot_df)
        .mark_errorbar(ticks=False)
        .encode(
            x=alt.X("CI Lower", title="F1"),
            x2="CI Upper",
            y="Models",
            color="Label",
            opacity=alt.condition(selection, alt.value(1), alt.value(0.0)),
        )
    )

    chart = base + error_bars

    return chart.add_params(selection).properties(width=400, height=300)

In [None]:
dansk_df = pd.concat(tables)
create_dansk_viz(dansk_df)

In [None]:
_df = dansk_df
_df = _df.set_index("Models")
ent_columns = sorted(
    [
        "Event",
        "Organization",
        "Language",
        "Person",
        "Ordinal",
        "NORP",
        "Work of Art",
        "Facility",
        "Law",
        "Location",
        "Product",
        "GPE",
    ]
)
non_ent_columns = sorted(["Cardinal", "Date", "Money", "Percent", "Quantity", "Time"])
columns_to_keep = ent_columns + non_ent_columns + ["Average"]

_df = _df[columns_to_keep]

In [None]:
table = _df.T
iidx = pd.MultiIndex.from_arrays(
    [
        ["Entities"] * len(ent_columns)
        + ["Non-Entities"] * len(non_ent_columns)
        + ["Average"],
        ent_columns + non_ent_columns + ["Average"],
    ]
)
table.index = iidx

mdl_names = ["Large 0.1.0", "Medium 0.1.0", "Small 0.1.0"]
header = pd.MultiIndex.from_arrays(
    [["Fine-grained Models"] * len(mdl_names), mdl_names]
)
table.columns = header

In [None]:
# convert to latex using styler
style = table.style.format_index(escape="latex", axis=1).format_index(
    escape="latex", axis=0
)


# highlight the maximum
def italicize_second_max(s: pd.Series) -> list:
    """Italicize the second maximum in a Series."""
    is_second_max = s == s.sort_values(ascending=False).iloc[1]
    # check if the second maximum is the same as the maximum
    same_as_max = s == s.max()

    if same_as_max.sum() > 1:
        # if there are more than one maximum, don't italicize
        return ["font-style: normal" for v in is_second_max]
    return ["font-style: italic" if v else "" for v in is_second_max]


style = style.apply(highlight_max, axis=1)
# style = style.apply(underline_second_max, axis=1)
style = style.apply(italicize_second_max, axis=1)

# apply the CSS style
super_header_style = [
    {"selector": ".level0", "props": [("text-align", "center")]},
    {"selector": ".col_heading", "props": [("text-align", "center")]},
]
style = style.set_table_styles(super_header_style)  # type: ignore


# add caption
caption = "F1 score with 95% confidence interval calculated using bootstrapping with 100 samples."
style = style.set_caption(caption)
# font size
style = style.set_table_attributes('style="font-size: 0.8em"')
style

# latex = style.to_latex(
#         hrules=True,
#         convert_css=True,
#     )

# print(latex)

| Type | Label | Fine-grained Large 0.1.0 | Fine-grained Medium 0.1.0 | Fine-grained Small 0.1.0 |
|------|-------|--------------------------|---------------------------|--------------------------|
| Entities | Event | 43.5 (27.0, 56.0) | 64.2 (50.0, 79.4) | 46.1 (27.8, 62.4) |
| Entities | Facility | 69.8 (54.3, 84.4) | 72.3 (56.2, 84.6) | 55.5 (36.2, 70.5) |
| Entities | GPE | 90.6 (87.2, 93.1) | 88.0 (82.7, 92.1) | 79.6 (73.0, 84.6) |
| Entities | Language | 74.5 (60.0, 83.3) | 51.9 (23.3, 100.0) | 45.9 (13.3, 93.3) |
| Entities | Law | 54.2 (38.1, 72.5) | 59.3 (37.4, 77.3) | 57.6 (39.6, 75.1) |
| Entities | Location | 75.3 (66.9, 83.8) | 72.5 (62.1, 80.8) | 65.6 (55.4, 74.1) |
| Entities | NORP | 84.8 (76.9, 90.8) | 78.2 (68.6, 85.8) | 73.3 (62.9, 81.5) |
| Entities | Ordinal | 37.8 (22.5, 51.2) | 68.7 (49.1, 82.6) | 68.5 (47.6, 83.1) |
| Entities | Organization | 79.5 (74.9, 83.1) | 80.5 (78.1, 84.2) | 79.1 (75.7, 82.3) |
| Entities | Person | 85.9 (82.7, 88.8) | 84.8 (80.6, 88.2) | 86.8 (83.2, 90.1) |


## Domain Generalization
For the domain generalization benchmark we use the [DANSK](https://huggingface.co/datasets/chcaa/DANSK) dataset. This dataset is annotated across many different domains, including fiction, web content, social media, wikis, news, legal and conversational data.
As some models are trained on DANSK (`da_dacy_{size}_ner_fine_grained-{version}`), these models are tested on the test set using all of the labels.
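
As a rough picture of what a per-domain evaluation involves, the sketch below groups documents by domain and computes a micro-averaged F1 per group, plus an `All` group over every document. It is only an illustration: the tuple layout, the helper names and the counts are hypothetical, and this is not the implementation of `evaluate_generalization`.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (domain, true positives, false positives, false negatives); toy numbers
Doc = Tuple[str, int, int, int]


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts, defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_per_domain(docs: List[Doc]) -> Dict[str, float]:
    """Sum the counts per domain (and overall) and score each group."""
    totals = defaultdict(lambda: [0, 0, 0])
    for domain, tp, fp, fn in docs:
        for key in (domain, "All"):
            totals[key][0] += tp
            totals[key][1] += fp
            totals[key][2] += fn
    return {key: f1(*counts) for key, counts in totals.items()}


docs = [
    ("news", 8, 1, 1),
    ("news", 6, 2, 0),
    ("legal", 3, 2, 3),
    ("social media", 2, 3, 3),
]
scores = f1_per_domain(docs)
for domain, score in scores.items():
    print(f"{domain}: {score:.3f}")
```

Because domains differ in size, the `All` score is dominated by the larger domains, which is why the per-domain breakdown is worth inspecting separately.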

In [14]:
from evaluation.utils import evaluate_generalization

In [15]:
tables = []
for mdl_name in dansk:
    if "fine_grained" not in mdl_name:
        continue

    table = evaluate_generalization(examples=dansk[mdl_name]["test"]["examples"], mdl_name=mdl_name, n_rep=100, n_samples=1000)
    tables.append(table)


In [16]:
df = pd.concat(tables)
df = df[df["Domain"] != "dannet"]
df = df[df["Domain"].notnull()]

In [17]:
# create altair viz
selection = alt.selection_point(
    fields=["Domain"],
    bind="legend",
    value=[{"Domain": "All"}],
)
bind_checkbox = alt.binding_checkbox(
    name="Scale point size by number of documents: ",
)
param_checkbox = alt.param(bind=bind_checkbox)

sort_order = list(dansk.keys())

base = (
    alt.Chart(df)
    .mark_point(filled=True)
    .encode(
        x=alt.X("Average", title="F1", scale=alt.Scale(domain=[0.0, 1.0])),
        y=alt.Y("Model", sort=sort_order),
        color="Domain",
        size=alt.condition(
            param_checkbox, "Number of docs", alt.value(100), legend=None
        ),
        tooltip=[
            "Model",
            "Domain",
            "Average F1",
        ],
        opacity=alt.condition(selection, alt.value(1), alt.value(0.0)),
    )
)
error_bars = (
    alt.Chart(df)
    .mark_errorbar(ticks=False)
    .encode(
        x=alt.X("Average Lower CI", title="F1"),
        x2="Average Upper CI",
        y=alt.Y("Model", sort=sort_order),
        color="Domain",
        opacity=alt.condition(selection, alt.value(1), alt.value(0.0)),
    )
)

chart = base + error_bars

chart = chart.add_params(selection, param_checkbox).properties(
    title="DANSK test set performance",
)


chart.properties(width=400, height=300)


### Domain generalization using CoNLL-2003 format
To test the generalization, we convert the annotations to the CoNLL-2003 format using the labels `Person`, `Location`, and `Organization`. As in CoNLL-2003, `Location` includes cities, roads, mountains, abstract places, specific buildings, and meeting points. Thus `GPE` (geo-political entity) was converted to `Location`. The `MISC` category in CoNLL-2003 is a diverse category meant to denote all names not in other categories (encapsulating both e.g. events and adjectives such as "2004 World Cup" and "Italian"), and is therefore not included.
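
The label conversion described above can be sketched as a simple mapping. This is an illustration and not the implementation of `convert_to_conll_2003`: the helper `to_conll_2003` is hypothetical, the short target labels follow the `PER`/`ORG`/`LOC` convention used earlier on this page, and dropping every label without a listed counterpart is an assumption.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char, label)

# DANSK labels with a CoNLL-2003 counterpart; GPE is folded into LOC
DANSK_TO_CONLL = {
    "PERSON": "PER",
    "ORGANIZATION": "ORG",
    "LOCATION": "LOC",
    "GPE": "LOC",
}


def to_conll_2003(spans: List[Span]) -> List[Span]:
    """Relabel spans that have a counterpart and drop all other spans."""
    converted = []
    for start, end, label in spans:
        target = DANSK_TO_CONLL.get(label)
        if target is not None:
            converted.append((start, end, target))
    return converted


spans = [(0, 7, "GPE"), (15, 27, "PERSON"), (40, 44, "DATE"), (50, 62, "EVENT")]
print(to_conll_2003(spans))
```

Applying the same conversion to both the gold annotations and the predictions keeps the comparison fair for models trained on either scheme.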

In [18]:
from evaluation.utils import convert_to_conll_2003, create_row_conll2003

In [19]:
tables = []
for mdl_name in dansk:
    if "fine_grained" in mdl_name:
        continue
    # copy the list so that extending it does not mutate the cached test split
    examples = list(dansk[mdl_name]["test"]["examples"])
    examples += dansk[mdl_name]["dev"]["examples"]
    examples += dansk[mdl_name]["train"]["examples"]

    
    examples = convert_to_conll_2003(examples)
    table = evaluate_generalization(mdl_name, examples, n_rep=100, n_samples=1000, create_row_fn=create_row_conll2003)
    tables.append(table)

tables = pd.concat(tables, axis=0)

In [20]:
df = tables
df = df[df["Domain"] != "dannet"]   # type: ignore
df = df[df["Domain"].notnull()]

In [21]:
# create altair viz
selection = alt.selection_point(
    fields=["Domain"],
    bind="legend",
    value=[{"Domain": "All"}],
)
bind_checkbox = alt.binding_checkbox(
    name="Scale point size by number of documents: ",
)
param_checkbox = alt.param(bind=bind_checkbox)

sort_order = list(dansk.keys())

base = (
    alt.Chart(df)
    .mark_point(filled=True)
    .encode(
        x=alt.X("Average", title="F1", scale=alt.Scale(domain=[0.0, 1.0])),
        y=alt.Y("Model", sort=sort_order),
        color="Domain",
        size=alt.condition(
            param_checkbox, "Number of docs", alt.value(100), legend=None
        ),
        tooltip=[
            "Model",
            "Domain",
            "Average F1",
            "Person F1",
            "Location F1",
            "Organization F1",
        ],
        opacity=alt.condition(selection, alt.value(1), alt.value(0.0)),
    )
)
error_bars = (
    alt.Chart(df)
    .mark_errorbar(ticks=False)
    .encode(
        x=alt.X("Average Lower CI", title="F1"),
        x2="Average Upper CI",
        y=alt.Y("Model", sort=sort_order),
        color="Domain",
        opacity=alt.condition(selection, alt.value(1), alt.value(0.0)),
    )
)

chart = base + error_bars

chart = chart.add_params(selection, param_checkbox).properties(
    width=400, height=300,
    title="Generalization to Unseen Domains",
)


chart

## Biases

To examine the biases in Danish models, we use augmentation to replace names in the Danish dataset DaNE {cite}`hvingelby2020dane`. This approach
is similar to that introduced in the initial DaCy paper {cite}`enevoldsen2021dacy`.

Here is a short example of what the augmentation might look like:


````{admonition} Example

```{admonition} Original
:class: note


Peter Schmeichel mener også, at det danske landshold anno 2021 tilhører verdenstoppen og kan vinde den kommende kamp mod England.
```

```{admonition} Female name augmentation
:class: important

Anne Østergaard mener også, at det danske landshold anno 2021 tilhører verdenstoppen og kan vinde den kommende kamp mod England.
```
````
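
A minimal sketch of such a name replacement is shown below. It swaps each person span for a name drawn from a list and shifts the offsets of all following spans accordingly. The helper `replace_person_names`, the one-item name list and the span offsets are illustrative assumptions and not the augmenter used to build the `gender_bias_dane` data; the sketch also assumes that spans do not overlap.

```python
import random
from typing import List, Sequence, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char, label)


def replace_person_names(
    text: str, spans: Sequence[Span], names: Sequence[str], seed: int = 0
) -> Tuple[str, List[Span]]:
    """Replace every PER span with a random name and realign all spans."""
    rng = random.Random(seed)
    pieces: List[str] = []
    new_spans: List[Span] = []
    cursor = 0  # position reached in the original text
    shift = 0   # accumulated length difference of the replacements
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])
        surface = rng.choice(names) if label == "PER" else text[start:end]
        pieces.append(surface)
        new_start = start + shift
        new_spans.append((new_start, new_start + len(surface), label))
        shift += len(surface) - (end - start)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), new_spans


text = (
    "Peter Schmeichel mener også, at det danske landshold anno 2021 "
    "tilhører verdenstoppen og kan vinde den kommende kamp mod England."
)
loc_start = text.index("England")
spans = [(0, 16, "PER"), (loc_start, loc_start + len("England"), "LOC")]

new_text, new_spans = replace_person_names(text, spans, names=["Anne Østergaard"])
print(new_text)
print(new_spans)
```

Keeping the gold spans aligned with the augmented text is what allows the same scoring procedure to be reused for every name list.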


In [22]:
gbiases = {}
for mdl_name, model_getter in MODELS.items():
    if "fine_grained" in mdl_name:
        continue
    mdl_results = apply_models(
        mdl_name, model_getter, dataset="gender_bias_dane", splits=["test"]
    )
    gbiases[mdl_name] = mdl_results

gender_bias_dane (test): Loading prediction for saattrupdan/nbailab-base-ner-scandi
gender_bias_dane (test): Loading prediction for da_dacy_large_trf-0.2.0
gender_bias_dane (test): Loading prediction for da_dacy_medium_trf-0.2.0
gender_bias_dane (test): Loading prediction for da_dacy_small_trf-0.2.0
gender_bias_dane (test): Loading prediction for alexandrainst/da-ner-base
gender_bias_dane (test): Loading prediction for da_core_news_trf-3.5.0
gender_bias_dane (test): Loading prediction for da_core_news_lg-3.5.0
gender_bias_dane (test): Loading prediction for da_core_news_md-3.5.0
gender_bias_dane (test): Loading prediction for da_core_news_sm-3.5.0


In [23]:
from collections import defaultdict
def augmentation_specific_examples(examples):
    aug_group = defaultdict(list)
    for example in examples:
        aug_name = example.y._.meta["augmenter"]
        aug_group[aug_name].append(example)
    return aug_group

In [24]:
tables = []
for mdl in gbiases:
    print(mdl)
    examples = gbiases[mdl]["test"]["examples"]

    aug_group = augmentation_specific_examples(examples)
    for aug_name, _examples in aug_group.items():
        _examples = convert_to_conll_2003(_examples) # also removes misc.
        table = create_dataframe(_examples, mdl, n_rep=100, n_samples=1000)
        table["Augmentation"] = aug_name
        tables.append(table)

saattrupdan/nbailab-base-ner-scandi
da_dacy_large_trf-0.2.0
da_dacy_medium_trf-0.2.0
da_dacy_small_trf-0.2.0
alexandrainst/da-ner-base
da_core_news_trf-3.5.0
da_core_news_lg-3.5.0
da_core_news_md-3.5.0
da_core_news_sm-3.5.0


In [25]:
df = pd.concat(tables)


In [26]:
# create the table
def create_table(df, model_order: list[str], baseline=df_average):

    table_df = df[["Models", "Augmentation", "Average"]]

    table_df = table_df.pivot(index="Models", columns="Augmentation", values="Average")

    # order the models column
    table_df = table_df.reindex(model_order)

    # add baseline
    table_df["Baseline"] = list(baseline)
    # order the columns
    table_df = table_df[["Baseline"] + list(table_df.columns[:-1])]


    # create augmentation superheader

    aug_superheader = [("", "Baseline")]
    for aug_name in table_df.columns[1:]:
        aug_superheader.append(("Augmentation", aug_name))

    aug_superheader = pd.MultiIndex.from_tuples(aug_superheader)
    table_df.columns = aug_superheader
    df = table_df.reset_index()
    s = df.style.apply(highlight_max, axis=0, subset=df.columns[1:])
    s = s.apply(underline_second_max, axis=0, subset=df.columns[1:])

    # Add a caption
    s = s.set_caption("F1 score for each augmentation with 95% confidence interval calculated over 100 repetitions")

    # Center the header and left align the model names
    s = s.set_properties(subset=df.columns[1:], **{"text-align": "right"})

    super_header_style = [
        {"selector": ".level0", "props": [("text-align", "center")]},
        {"selector": ".col_heading", "props": [("text-align", "center")]},
    ]
    # Apply the CSS style to the styler
    s = s.set_table_styles(super_header_style)  # type: ignore
    # s = s.set_properties(subset=[("", "Models")],
    #                       **{"text-align": "left"})
    # remove the index
    s = s.hide(axis="index")
    # smaller font
    s = s.set_table_attributes('style="font-size: 0.65em"')
    return s



In [27]:
create_table(df, model_order=list(gbiases.keys()))

| Models | Baseline | Augmentation: Danish Names | Augmentation: Female Names | Augmentation: Male Names | Augmentation: Muslim Names |
|--------|----------|----------------------------|----------------------------|--------------------------|----------------------------|
| saattrupdan/nbailab-base-ner-scandi | 86.3 (82.4, 89.7) | 89.0 (86.8, 91.1) | 88.9 (86.9, 91.1) | 88.9 (86.9, 91.1) | 88.1 (85.9, 90.4) |
| da_dacy_large_trf-0.2.0 | 85.4 (81.2, 88.9) | 87.7 (85.2, 90.4) | 87.8 (85.2, 90.2) | 87.5 (84.3, 90.3) | 85.6 (82.9, 88.3) |
| da_dacy_medium_trf-0.2.0 | 84.9 (81.0, 88.5) | 86.2 (83.9, 88.8) | 86.1 (83.8, 89.1) | 86.1 (83.6, 89.2) | 84.2 (81.7, 87.4) |
| da_dacy_small_trf-0.2.0 | 82.7 (79.3, 85.9) | 82.4 (79.6, 85.3) | 82.2 (79.9, 84.7) | 82.1 (79.2, 85.2) | 81.2 (78.6, 83.7) |
| alexandrainst/da-ner-base | 70.7 (66.2, 75.2) | 81.5 (78.2, 84.4) | 81.6 (78.3, 84.4) | 81.5 (78.2, 84.4) | 79.8 (76.7, 82.4) |
| da_core_news_trf-3.5.0 | 79.0 (75.1, 82.3) | 80.7 (77.2, 83.1) | 80.9 (78.1, 83.8) | 80.6 (77.3, 83.8) | 78.7 (75.8, 81.1) |
| da_core_news_lg-3.5.0 | 74.6 (70.8, 78.1) | 78.3 (75.5, 80.7) | 78.5 (75.9, 81.1) | 78.4 (75.4, 81.2) | 68.2 (65.4, 71.2) |
| da_core_news_md-3.5.0 | 71.2 (66.9, 75.2) | 75.7 (71.9, 78.7) | 75.6 (72.2, 79.1) | 75.5 (72.3, 78.9) | 64.6 (60.5, 68.1) |
| da_core_news_sm-3.5.0 | 64.4 (59.7, 68.5) | 58.8 (55.5, 62.0) | 59.1 (56.2, 62.6) | 59.1 (56.4, 62.3) | 53.4 (50.2, 56.4) |


## Robustness

In the paper *'DaCy: A Unified Framework for Danish NLP'* {cite}`enevoldsen2021dacy` we conducted a series of augmentations on the DaNE test set to estimate the robustness and biases of DaCy and other Danish language processing pipelines. This page covers only parts of the paper. We recommend reading the paper for a more thorough and nuanced overview.

The augmentations used in this test are performed on the DaNE test set and include the following:

- **Spelling Error**: Intended to simulate domains with inconsistent spelling, OCR errors, conversational data, etc. The augmentation includes a series of smaller augmentations:
  - Keystroke error: The augmentation is used to introduce errors by replacing a character with a character that is close on the keyboard.
  - Character swap: The augmentation is used to introduce errors by swapping two neighboring characters.
  - Token swap: The augmentation is used to introduce errors by swapping two neighboring tokens.
- **Inconsistent Casing**: This augmentation is used to simulate inconsistent casing in the language and uses two different methods by either randomly capitalizing or lowercasing tokens.
- **Synonym Augmentation**: This augmentation is used to simulate the variation and slight grammatical errors in the language and uses two different methods:
  - Wordnet Synonym replacement: The augmentation replaces a token with a synonym in WordNet while respecting its syntactic role.
  - Embedding Synonym replacement: This augmentation replaces a token with a synonym which tends to appear in similar contexts.
- **Inconsistent Spacing**: This augmentation is used to simulate inconsistent spacing in the language and uses two different methods by either randomly adding or removing spaces.
- **Historical Spelling**: This augmentation is used to simulate historical spelling in Danish including ASCII spellings of the letters Æ (Ae), Ø (Oe), and Å (Aa) as well as uppercasing nouns.

For all augmentations, the probability is set so that 5% of the positions where the targeted augmentation can take place are augmented. The augmentations are performed using [augmenty](https://kennethenevoldsen.github.io/augmenty/index.html).
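
To make the list above more concrete, the following is a minimal sketch of two of the augmentations (character swap and the ASCII part of historical spelling) in plain Python. It is not how augmenty implements them: the function names `char_swap`, `ascii_spelling` and `augment` are hypothetical, and applying the augmentation per token with probability `level` is a simplifying assumption.

```python
import random


def char_swap(token, rng):
    """Swap two neighbouring characters at a random position."""
    if len(token) < 2:
        return token
    i = rng.randrange(len(token) - 1)
    return token[:i] + token[i + 1] + token[i] + token[i + 2:]


def ascii_spelling(token, rng=None):
    """Spell æ, ø and å as ae, oe and aa (`rng` is unused, kept for a uniform signature)."""
    mapping = {"æ": "ae", "ø": "oe", "å": "aa", "Æ": "Ae", "Ø": "Oe", "Å": "Aa"}
    return "".join(mapping.get(ch, ch) for ch in token)


def augment(tokens, fn, level=0.05, seed=0):
    """Apply `fn` to each token independently with probability `level` (an assumption)."""
    rng = random.Random(seed)
    return [fn(tok, rng) if rng.random() < level else tok for tok in tokens]


tokens = "hun løste gåden på Århus Universitet".split()
print(augment(tokens, ascii_spelling, level=1.0))
# ['hun', 'loeste', 'gaaden', 'paa', 'Aarhus', 'Universitet']
print(augment(tokens, char_swap, level=0.05))
```

Note that the sketch leaves the token boundaries and labels untouched, which mirrors the assumption that the annotations survive the augmentation. Uppercasing of nouns, which is also part of the historical spelling augmentation, is left out because it requires part-of-speech information.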

The underlying assumption of these augmentations is that the annotations of the tokens do not change when the text is augmented. This naturally does not always hold: a single letter can change the meaning of a sentence, as in *"hun læste gåden"* (*she read the puzzle*) and *"hun løste gåden"* (*she solved the puzzle*). So while we expect performance to drop, the size of the drop is interesting to examine, especially in comparison to the other models.

In [28]:
robustness = {}
for mdl_name, model_getter in MODELS.items():
    if "fine_grained" in mdl_name:
        continue
    mdl_results = apply_models(
        mdl_name, model_getter, dataset="robustness_dane", splits=["test"]
    )
    robustness[mdl_name] = mdl_results

robustness_dane (test): Loading prediction for saattrupdan/nbailab-base-ner-scandi
robustness_dane (test): Loading prediction for da_dacy_large_trf-0.2.0
robustness_dane (test): Loading prediction for da_dacy_medium_trf-0.2.0
robustness_dane (test): Loading prediction for da_dacy_small_trf-0.2.0
robustness_dane (test): Loading prediction for alexandrainst/da-ner-base
robustness_dane (test): Loading prediction for da_core_news_trf-3.5.0
robustness_dane (test): Loading prediction for da_core_news_lg-3.5.0
robustness_dane (test): Loading prediction for da_core_news_md-3.5.0
robustness_dane (test): Loading prediction for da_core_news_sm-3.5.0


In [29]:
tables = []
for mdl in robustness:
    print(mdl)
    examples = robustness[mdl]["test"]["examples"]

    aug_group = augmentation_specific_examples(examples)
    for aug_name, _examples in aug_group.items():
        _examples = convert_to_conll_2003(_examples) # also removes misc.
        table = create_dataframe(_examples, mdl, n_rep=100, n_samples=1000)
        table["Augmentation"] = aug_name
        tables.append(table)

saattrupdan/nbailab-base-ner-scandi
da_dacy_large_trf-0.2.0
da_dacy_medium_trf-0.2.0
da_dacy_small_trf-0.2.0
alexandrainst/da-ner-base
da_core_news_trf-3.5.0
da_core_news_lg-3.5.0
da_core_news_md-3.5.0
da_core_news_sm-3.5.0
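
The `n_rep=100` and `n_samples=1000` arguments suggest that the intervals reported in the tables are bootstrap estimates. `create_dataframe` is defined elsewhere, so this is an assumption; the sketch below only illustrates the general idea of a percentile bootstrap over per-document counts. The names `micro_f1` and `bootstrap_f1` and the toy counts are hypothetical.

```python
import random


def micro_f1(counts):
    """Micro-averaged F1 from (true positive, false positive, false negative) triples."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1(doc_counts, n_rep=100, n_samples=1000, seed=0):
    """Resample documents with replacement and return (mean, lower, upper) F1."""
    rng = random.Random(seed)
    estimates = sorted(
        micro_f1(rng.choices(doc_counts, k=n_samples)) for _ in range(n_rep)
    )
    lower = estimates[int(0.025 * n_rep)]
    upper = estimates[min(n_rep - 1, int(0.975 * n_rep))]
    return sum(estimates) / n_rep, lower, upper


# Made-up per-document counts, purely for illustration.
toy_counts = [(2, 0, 0), (1, 1, 0), (0, 0, 1), (3, 1, 1)]
mean_f1, lower, upper = bootstrap_f1(toy_counts)
print(f"{100 * mean_f1:.1f} ({100 * lower:.1f}, {100 * upper:.1f})")
```

The printed string follows the same `mean (lower, upper)` layout as the table cells, which makes it easy to compare against the reported numbers if the same procedure is indeed used.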


In [30]:
robustness_df = pd.concat(tables)

create_table(robustness_df, model_order=list(robustness.keys()))

Models,Baseline,Historical Spelling,Inconsistent Casing,Inconsistent Spacing,Spelling Error,Synonym replacement
saattrupdan/nbailab-base-ner-scandi,"86.3 (82.4, 89.7)","81.9 (79.1, 85.0)","86.5 (84.4, 89.0)","78.8 (75.7, 81.6)","73.3 (69.9, 76.8)","87.1 (84.9, 89.6)"
da_dacy_large_trf-0.2.0,"85.4 (81.2, 88.9)","86.0 (82.8, 88.9)","86.9 (83.9, 89.4)","69.7 (66.4, 72.4)","59.7 (56.4, 63.9)","85.9 (82.9, 88.8)"
da_dacy_medium_trf-0.2.0,"84.9 (81.0, 88.5)","69.6 (66.7, 72.1)","83.7 (81.3, 86.3)","70.5 (66.6, 74.0)","65.4 (62.6, 68.5)","85.1 (82.5, 88.3)"
da_dacy_small_trf-0.2.0,"82.7 (79.3, 85.9)","51.7 (49.1, 54.6)","81.1 (78.6, 83.5)","64.3 (60.4, 67.2)","63.1 (59.9, 66.5)","83.4 (81.0, 85.7)"
alexandrainst/da-ner-base,"70.7 (66.2, 75.2)","78.7 (75.3, 81.6)","80.8 (77.6, 83.2)","63.4 (59.4, 66.3)","49.9 (47.3, 53.6)","80.1 (77.1, 82.8)"
da_core_news_trf-3.5.0,"79.0 (75.1, 82.3)","75.1 (72.4, 77.3)","81.3 (78.5, 84.1)","58.9 (55.8, 62.3)","41.2 (38.5, 44.0)","80.4 (77.6, 83.3)"
da_core_news_lg-3.5.0,"74.6 (70.8, 78.1)","47.0 (44.5, 49.7)","74.5 (71.6, 77.7)","51.1 (48.1, 53.8)","44.9 (42.0, 47.9)","76.3 (73.6, 79.1)"
da_core_news_md-3.5.0,"71.2 (66.9, 75.2)","48.7 (45.7, 51.6)","71.6 (68.2, 75.4)","51.1 (47.6, 54.3)","41.8 (38.8, 44.7)","72.8 (69.2, 76.1)"
da_core_news_sm-3.5.0,"64.4 (59.7, 68.5)","31.9 (29.6, 34.1)","61.5 (58.1, 64.6)","46.6 (43.7, 50.4)","49.6 (46.5, 53.0)","64.8 (61.4, 68.1)"
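
Since the size of the drop relative to the baseline is the quantity of interest, it can help to compute it directly from the table cells. The sketch below does this for two rows copied from the table above; the helper names `point_estimate` and `drop` are hypothetical, and only the point estimates are compared, not the intervals.

```python
import re

# Cells copied from the table above for two of the models.
scores = {
    "saattrupdan/nbailab-base-ner-scandi": {
        "Baseline": "86.3 (82.4, 89.7)",
        "Spelling Error": "73.3 (69.9, 76.8)",
    },
    "da_core_news_trf-3.5.0": {
        "Baseline": "79.0 (75.1, 82.3)",
        "Spelling Error": "41.2 (38.5, 44.0)",
    },
}


def point_estimate(cell):
    """Extract the leading number from a cell such as '86.3 (82.4, 89.7)'."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)", cell)
    if match is None:
        raise ValueError(f"no number found in {cell!r}")
    return float(match.group(1))


def drop(row, augmentation):
    """Absolute drop from the baseline, rounded to one decimal."""
    return round(point_estimate(row["Baseline"]) - point_estimate(row[augmentation]), 1)


for model, row in scores.items():
    print(model, drop(row, "Spelling Error"))
# saattrupdan/nbailab-base-ner-scandi 13.0
# da_core_news_trf-3.5.0 37.8
```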


## Inference Speed


While performance is naturally important, it is also important to know why you might choose one model over another. One of the main reasons for choosing a smaller model is inference speed. The following table shows the inference speed of the different models. The inference speed is measured in words per second (WPS) on an Apple M1 Pro with 16 GB of memory running macOS 13.3.1 (i.e. a high-end consumer laptop). The models are tested on the test set of DaNE.

```{admonition} GPU Acceleration
:class: note

These benchmarks do not use GPU acceleration. With GPU acceleration the inference speed would be much higher, and larger models would benefit more from it.
```
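
As a rough illustration of how a words-per-second figure can be obtained, the sketch below times an arbitrary prediction function over a list of texts. It is not the timing code used for this page: `words_per_second` and `dummy_predict` are hypothetical names, and words are counted by whitespace splitting, whereas the notebook counts the gold tokens of the DaNE test set.

```python
import time


def words_per_second(predict, docs):
    """Time `predict` over `docs` and return (words per second, elapsed seconds)."""
    n_words = sum(len(doc.split()) for doc in docs)
    start = time.perf_counter()
    for doc in docs:
        predict(doc)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading from the clock on very small inputs.
    wps = n_words / elapsed if elapsed > 0 else float("inf")
    return wps, elapsed


def dummy_predict(text):
    """Stand-in for a real NER pipeline; it only splits on whitespace."""
    return text.split()


docs = ["Aarhus Universitet ligger i Aarhus .", "Hun løste gåden ."] * 1000
wps, elapsed = words_per_second(dummy_predict, docs)
print(f"{wps:.1f} words per second in {elapsed:.4f} seconds")
```

For a real comparison, the models should be loaded before the timer starts and run on the same machine, since loading time and hardware differences would otherwise dominate the measurement.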

In [31]:
dane = {}
for mdl_name, model_getter in MODELS.items():
    mdl_results = apply_models(mdl_name, model_getter, dataset="dane", splits=["test"])
    dane[mdl_name] = mdl_results["test"]

dane (test): Loading prediction for saattrupdan/nbailab-base-ner-scandi
dane (test): Loading prediction for da_dacy_large_trf-0.2.0
dane (test): Loading prediction for da_dacy_medium_trf-0.2.0
dane (test): Loading prediction for da_dacy_small_trf-0.2.0
dane (test): Loading prediction for da_dacy_large_ner_fine_grained-0.1.0
dane (test): Loading prediction for da_dacy_medium_ner_fine_grained-0.1.0
dane (test): Loading prediction for da_dacy_small_ner_fine_grained-0.1.0
dane (test): Loading prediction for alexandrainst/da-ner-base
dane (test): Loading prediction for da_core_news_trf-3.5.0
dane (test): Loading prediction for da_core_news_lg-3.5.0
dane (test): Loading prediction for da_core_news_md-3.5.0
dane (test): Loading prediction for da_core_news_sm-3.5.0


In [32]:
rows = []
n_words = None
for mdl_name, model_getter in MODELS.items():
    total_time = dane[mdl_name]["time_in_seconds"]
    if n_words is None:
        examples = dane[mdl_name]["examples"]
        n_words = sum(len(e.y) for e in examples)
    wps = n_words / total_time
    rows.append({"Model": mdl_name, "Words per second": wps, "Total time (sec)": total_time})

speed = pd.DataFrame(rows)

In [33]:
# create the table
style = speed.style.set_caption("Inference speed on DaNE test set")


def highlight_min(s):
    """highlight the minimum in a series with bold"""
    is_min = s == s.min()
    return ["font-weight: bold" if v else "" for v in is_min]

def highlight_max(s):
    """highlight the maximum in a series with bold"""
    is_max = s == s.max()
    return ["font-weight: bold" if v else "" for v in is_max]

style = style.apply(highlight_min, axis=0, subset=["Total time (sec)"])
style = style.apply(highlight_max, axis=0, subset=["Words per second"])

style = style.set_properties(subset=["Words per second", "Total time (sec)"], **{"text-align": "right"})
# set decimal places
style = style.format({"Words per second": "{:.1f}", "Total time (sec)": "{:.2f}"})

style = style.hide(axis="index")
style = style.set_properties(subset=["Model"], **{"text-align": "left"})
style


Model,Words per second,Total time (sec)
saattrupdan/nbailab-base-ner-scandi,1438.8,6.97
da_dacy_large_trf-0.2.0,353.3,28.37
da_dacy_medium_trf-0.2.0,770.2,13.01
da_dacy_small_trf-0.2.0,2024.6,4.95
da_dacy_large_ner_fine_grained-0.1.0,567.9,17.65
da_dacy_medium_ner_fine_grained-0.1.0,1670.3,6.0
da_dacy_small_ner_fine_grained-0.1.0,5717.6,1.75
alexandrainst/da-ner-base,1618.7,6.19
da_core_news_trf-3.5.0,1125.1,8.91
da_core_news_lg-3.5.0,31364.7,0.32


Note that the `da_dacy_{size}_trf-{version}` models from DaCy and the `da_core_news_{size}-{version}` models from spaCy are multi-task models and therefore perform multiple tasks at once. This means that their inference speed is not directly comparable to that of the other models.

## References

```{bibliography} ../../references.bib