# Named Entity Recognition

## Biases

To examine the biases in Danish models, we use augmentation to replace names in the Danish dataset DaNE {cite}`hvingelby2020dane`. This approach
is similar to the one introduced in the initial DaCy paper {cite}`enevoldsen2021dacy`.

Here is a short example of what the augmentation might look like:


````{admonition} Example

```{admonition} Original
:class: note


Peter Schmeichel mener også, at det danske landshold anno 2021 tilhører verdenstoppen og kan vinde den kommende kamp mod England.
```

```{admonition} Female name augmentation
:class: important

Anne Østergaard mener også, at det danske landshold anno 2021 tilhører verdenstoppen og kan vinde den kommende kamp mod England.
```
````
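
The name replacement above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the augmenter used to produce the results on this page: the function `replace_person_names`, the `(start, end, label)` character-offset format, and the shortened example sentence are all hypothetical.

```python
import random


def replace_person_names(text, entities, names, seed=0):
    """Replace every PER span in `text` with a name drawn from `names`.

    `entities` is a list of (start, end, label) character offsets.
    Returns the new text and the entity offsets adjusted to it.
    (Hypothetical helper, for illustration only.)
    """
    rng = random.Random(seed)
    pieces, new_entities = [], []
    cursor = 0  # position in the original text
    length = 0  # length of the text built so far
    for start, end, label in sorted(entities):
        # copy the untouched text that precedes the entity
        chunk = text[cursor:start]
        pieces.append(chunk)
        length += len(chunk)
        # swap in a new name for persons, keep other entities as they are
        span = rng.choice(names) if label == "PER" else text[start:end]
        pieces.append(span)
        new_entities.append((length, length + len(span), label))
        length += len(span)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), new_entities


text = "Peter Schmeichel mener også, at det danske landshold kan vinde den kommende kamp mod England."
loc_start = text.index("England")
entities = [
    (0, len("Peter Schmeichel"), "PER"),
    (loc_start, loc_start + len("England"), "LOC"),
]
augmented, new_entities = replace_person_names(text, entities, names=["Anne Østergaard"])
print(augmented)
print(new_entities)
```

The labels are carried over unchanged and only the offsets move, which mirrors the idea that the gold annotation stays valid after the name is swapped.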


In [4]:
import spacy
import dacy
from dacy.datasets import dane
from ner_biases_utils import apply_models, create_table, get_augmenters
import spacy_wrap
from pathlib import Path
import pandas as pd

# augmenters = get_augmenters()
augmenters = []

# if csv already exists, load it
save_path = Path('./tables/ner_bias_results.csv')
if save_path.exists():
    result_df = pd.read_csv(save_path)
else:
    # !spacy download da_core_news_sm
    # !spacy download da_core_news_md
    # !spacy download da_core_news_lg
    # !spacy download da_core_news_trf
    sp_sm = spacy.load("da_core_news_sm")
    sp_md = spacy.load("da_core_news_md")
    sp_lg = spacy.load("da_core_news_lg")
    sp_trf = spacy.load("da_core_news_trf")
    # dacy_sm = dacy.load("da_dacy_small_trf-0.2.0")
    # dacy_md = dacy.load("da_dacy_medium_trf-0.2.0")
    # dacy_lg = dacy.load("da_dacy_large_trf-0.2.0")
    # dacy_fg_ner_sm = dacy.load('da_dacy_small_ner_fine_grained-0.1.0')
    # dacy_fg_ner_md = dacy.load('da_dacy_medium_ner_fine_grained-0.1.0')
    # dacy_fg_ner_lg = dacy.load('da_dacy_large_ner_fine_grained-0.1.0')

    daner_base = spacy.blank("da")
    config = {"model": {"name": "alexandrainst/da-ner-base"}, "predictions_to": ["ents"]}
    daner_base.add_pipe("token_classification_transformer", config=config)

    scandiner = spacy.blank("da")
    scandiner.add_pipe("dacy/ner")

    models = [
        ("spaCy (da_core_news_sm)", sp_sm),
        ("spaCy (da_core_news_md)", sp_md),
        ("spaCy (da_core_news_lg)", sp_lg),
        ("spaCy (da_core_news_trf)", sp_trf),
        # ("DaCy (da_dacy_small_trf-0.2.0)", dacy_sm),
        # ("DaCy (da_dacy_medium_trf-0.2.0)", dacy_md),
        # ("DaCy (da_dacy_large_trf-0.2.0)", dacy_lg),
        # ("DaCy (da_dacy_small_ner_fine_grained-0.1.0)", dacy_fg_ner_sm),
        # ("DaCy (da_dacy_medium_ner_fine_grained-0.1.0)", dacy_fg_ner_md),
        # ("DaCy (da_dacy_large_ner_fine_grained-0.1.0)", dacy_fg_ner_lg),
        ("alexandrainst/da-ner-base", daner_base),
        ("saattrupdan/nbailab-base-ner-scandi", scandiner),
    ]

    dataset = dane(splits="test")
    result_df = apply_models(models, dataset, augmenters, n_rep=20)
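    # save to csv at the same path that is checked above, so later runs reuse the results
    save_path.parent.mkdir(parents=True, exist_ok=True)
    result_df.to_csv(save_path)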

s = create_table(result_df, augmenters)
s

| Model | Baseline |
| --- | --- |
| alexandrainst/da-ner-base | 70.43 |
| saattrupdan/nbailab-base-ner-scandi | 86.1 |
| spaCy (da_core_news_lg) | 74.46 |
| spaCy (da_core_news_md) | 71.02 |
| spaCy (da_core_news_sm) | 64.32 |
| spaCy (da_core_news_trf) | 78.78 |


## Generalization
To examine model generalization, we use the [DANSK](https://huggingface.co/datasets/chcaa/DANSK) dataset. This dataset is annotated across many different domains, including fiction, web content, social media, wikis, news, legal, and conversational data. The original dataset includes annotations following the OntoNotes standard (see [getting started](https://centre-for-humanities-computing.github.io/DaCy/tutorials/basic.html#fine-grained-ner) for the full list). To test generalization, we convert the annotations to the CoNLL-2003 format using the labels `Person`, `Location`, and `Organization`. As in CoNLL-2003, `Location` includes cities, roads, mountains, abstract places, specific buildings, and meeting points. Thus, the `GPE` (geo-political entity) label was converted to `Location`. The `MISC` category in CoNLL-2003 is a diverse category meant to denote all names not in other categories (covering, for example, both events and adjectives such as "2004 World Cup" and "Italian"), and is therefore not included.

In [1]:
# !pip install datasets
# !pip install altair

import spacy
from spacy.tokens import Doc
from typing import List, Optional
from datasets import load_dataset
import warnings
nlp = spacy.blank("da")

def dansk(splits: Optional[List[str]]= None, **kwargs):
    if splits is None:
        splits = ["train", "dev", "test"]
        
    if Doc.has_extension("meta"):
        warnings.warn("Overwriting existing meta extension")
    Doc.set_extension("meta", default={}, force=True)


    nlp = spacy.blank("da")
    def convert_to_doc(example):
        doc = Doc(nlp.vocab).from_json(example)
        # set metadata
        for k in ['dagw_source', 'dagw_domain', 'dagw_source_full']:
            doc._.meta[k] = example[k]
        return doc

    return_ds = []
    for split in splits:
        ds = load_dataset("chcaa/DANSK", split=split, **kwargs)
        docs = [convert_to_doc(example) for example in ds]
        return_ds.append(docs)
    return return_ds

In [2]:
train, dev, test = dansk()

Found cached dataset parquet (/Users/au561649/.cache/huggingface/datasets/chcaa___parquet/chcaa--DANSK-f6b5c98c643cd000/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)


In [3]:
set([e.label_ for doc in train for e in doc.ents])

{'CARDINAL',
 'DATE',
 'EVENT',
 'FACILITY',
 'GPE',
 'LANGUAGE',
 'LAW',
 'LOCATION',
 'MONEY',
 'NORP',
 'ORDINAL',
 'ORGANIZATION',
 'PERCENT',
 'PERSON',
 'PRODUCT',
 'QUANTITY',
 'TIME',
 'WORK OF ART'}

In [4]:
# convert to Conll-2003 format
def convert_to_conll_2003(docs, mapping={"PERSON": "PER", "GPE": "LOC", "LOCATION": "LOC", "ORGANIZATION": "ORG"}):
    for doc in docs:
        ents = doc.ents
        ents = [e for e in ents if e.label_ in mapping]
        # relabel the kept entities with their CoNLL-2003 tag (e.g. GPE -> LOC)
        for ent in ents:
            ent.label_ = mapping[ent.label_]
        doc.ents = ents

convert_to_conll_2003(train)
convert_to_conll_2003(dev)
convert_to_conll_2003(test)

dataset = train + dev + test


In [5]:
assert set([e.label_ for doc in train for e in doc.ents]) == set(["PER", "LOC", "ORG"])

In [6]:
domains = {}
for doc in dataset:
    domain = doc._.meta["dagw_domain"]
    if domain not in domains:
        domains[domain] = []
    domains[domain].append(doc)


In [43]:
sp_sm = spacy.load("da_core_news_sm")
sp_lg = spacy.load("da_core_news_lg")

mdls = [
    ("spaCy (da_core_news_sm)", sp_sm),
    ("spaCy (da_core_news_lg)", sp_lg),
]
from spacy.training import Example
from spacy.scorer import Scorer
import random
import numpy as np

scorer = Scorer()

def no_misc_getter(doc, attr):
    for ent in doc.ents:
        if ent.label_ != "MISC":
            yield ent

def bootstrap(examples, n_rep=100):
    scores = []
    for i in range(n_rep):
        sample = random.choices(examples, k=len(examples))
        score = scorer.score_spans(sample, getter=no_misc_getter, attr="ents")
        scores.append(score)
    return scores

def compute_mean_and_ci(scores):

    ent_f = [score["ents_f"] for score in scores]
    per_f = [score["ents_per_type"].get("PER", {"f": None})["f"] for score in scores]
    loc_f = [score["ents_per_type"].get("LOC", {"f": None})["f"] for score in scores]
    org_f = [score["ents_per_type"].get("ORG", {"f": None})["f"] for score in scores]

    nam = ["Average F1", "Person F1", "Location F1", "Organization F1"]

    d = {}
    for n, f in zip(nam, [ent_f, per_f, loc_f, org_f]):
        f = [x for x in f if x is not None]
        if len(f) == 0:
            d[n] = {
                "mean": None,
                "ci": None
            }
            continue
        d[n] = {
            "mean": np.mean(f),
            "ci": np.percentile(f, [2.5, 97.5])
        }
    return d


all_examples = {}
rows= []
for mdl_name, mdl in mdls:
    all_examples[mdl_name] = []
    for domain in domains:
        docs = domains[domain]
        model_pred = mdl.pipe([doc.text for doc in docs])
        examples = [Example(predicted=x, reference=y) for x, y in zip(model_pred, docs)]
        all_examples[mdl_name].extend(examples)

        bs_score = bootstrap(examples)
        score = compute_mean_and_ci(bs_score)


        row = {
            "Model": mdl_name,
            "Domain": domain,
            "Average F1": score["Average F1"]["mean"],
            "Person F1": score["Person F1"]["mean"],
            "Location F1": score["Location F1"]["mean"],
            "Organization F1": score["Organization F1"]["mean"],
            "Average F1 CI": score["Average F1"]["ci"],

            "Number of docs": len(docs),
            
        }
        rows.append(row)

# across domains
for mdl in all_examples:
    examples = all_examples[mdl]
    bs_score = bootstrap(examples)
    score = compute_mean_and_ci(bs_score)

    row = {
        "Model": mdl,
        "Domain": "All",
        "Average F1": score["Average F1"]["mean"],
        "Person F1": score["Person F1"]["mean"],
        "Location F1": score["Location F1"]["mean"],
        "Organization F1": score["Organization F1"]["mean"],
        "Average F1 CI": score["Average F1"]["ci"],
        "Number of docs": len(examples),
    }
    rows.append(row)

# write to file
import pandas as pd
df = pd.DataFrame(rows)
df.to_csv("ner_performance.csv", index=False)

In [87]:
import pandas as pd
import altair as alt

df = pd.DataFrame(rows)

# filter out domains
df = df[df["Domain"] != "danavis"]
df = df[df["Domain"] != "dannet"]
df = df[df["Domain"].notnull()]

df['Average F1 CI Lower'] = df['Average F1 CI'].apply(lambda x: x[0])
df['Average F1 CI Upper'] = df['Average F1 CI'].apply(lambda x: x[1])
df['Average F1 CI Lower'] = pd.to_numeric(df['Average F1 CI Lower'])
df['Average F1 CI Upper'] = pd.to_numeric(df['Average F1 CI Upper'])



selection = alt.selection_point(fields=['Domain'], bind='legend', value=[{'Domain': 'All'}]) # does not work

bind_checkbox = alt.binding_checkbox(name='Scale point size by number of documents: ')
param_checkbox = alt.param(bind=bind_checkbox)

base = alt.Chart(df).mark_point(filled=True).encode(
    # x='Average F1',
    x=alt.X('Average F1', title="F1"),
    y='Model',
    color='Domain',
    size=alt.condition(
        param_checkbox,
        'Number of docs',
        alt.value(100)
    ),
    tooltip=["Model", "Domain", "Average F1", "Person F1", "Location F1", "Organization F1"],
    opacity=alt.condition(selection, alt.value(1), alt.value(0.0))
)
error_bars = alt.Chart(df).mark_errorbar(ticks=False).encode(
    # x='Average F1 CI Lower',
    x = alt.X('Average F1 CI Lower', title="F1"),
    x2='Average F1 CI Upper',
    y='Model',
    color='Domain',
    opacity=alt.condition(selection, alt.value(1), alt.value(0.0))
)

chart = error_bars + base

chart.add_params(selection, param_checkbox).properties(
    width=800,
    height=400
)

```{note}
The F1 in the figure is the mean bootstrapped F1 score, shown with a 95% confidence interval. It is calculated on the full DANSK dataset (train, dev, and test splits combined).
```
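
To make the bootstrapped interval concrete, here is a minimal, self-contained sketch of a percentile bootstrap on toy data. It is not the scoring code used for the figure: the per-document `(tp, fp, fn)` counts are made up, and `micro_f1` and `bootstrap_f1` are hypothetical helpers that stand in for the spaCy scorer.

```python
import random

import numpy as np


def micro_f1(counts):
    """Micro-averaged F1 from per-document (tp, fp, fn) counts."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1(counts, n_rep=100, seed=0):
    """Mean F1 and 95% percentile interval over `n_rep` resamples."""
    rng = random.Random(seed)
    # resample documents with replacement, keeping the sample size fixed
    scores = [micro_f1(rng.choices(counts, k=len(counts))) for _ in range(n_rep)]
    lower, upper = np.percentile(scores, [2.5, 97.5])
    return float(np.mean(scores)), (float(lower), float(upper))


# toy per-document counts, invented for illustration
toy_counts = [(3, 1, 0), (2, 0, 1), (0, 1, 1), (4, 0, 0), (1, 1, 2)]
mean_f1, (ci_low, ci_high) = bootstrap_f1(toy_counts)
print(f"F1 = {mean_f1:.2f} [{ci_low:.2f}, {ci_high:.2f}]")
```

Resampling whole documents rather than individual entities keeps the entities of a document together, so the interval reflects variation across documents.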

## Robustness

In the paper [DaCy: A Unified Framework for Danish NLP](https://github.com/centre-for-humanities-computing/DaCy/blob/main/papers/DaCy-A-Unified-Framework-for-Danish-NLP/readme.md), we conduct a series of augmentations on the DaNE test set to estimate the robustness and biases of DaCy and other Danish language processing pipelines. This page presents only parts of the paper. We recommend reading the paper for a more thorough and nuanced overview.

Let's start by examining a couple of the augmentations, namely replacing names and introducing plausible keystroke errors.

````{admonition} Example

```{note} Original

Peter Schmeichel mener også, at det danske landshold anno 2021 tilhører verdenstoppen og kan vinde den kommende kamp mod England.
```

```{important} Female name augmentation

Anne Østergaard mener også, at det danske landshold anno 2021 tilhører verdenstoppen og kan vinde den kommende kamp mod England.
```
````

The underlying assumption behind these augmentations is that the tags of the tokens do not change with augmentation. In our case, this means that "Anne Østergaard" is still a person and that "vonde" (a keystroke-error variant of "vinde") can still be considered a verb based on its context.

Based on this, if a model performs worse on a certain set of names or with minor spelling variations or errors, we can conclude that the model is vulnerable to such input. For instance, if the model has a hard time when æ, ø, and å are replaced with ae, oe, and aa, it might not be ideal to apply to historic texts.
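
As a rough sketch of the æ, ø, and å replacement, the substitution can be written with `str.translate`. The function name `replace_danish_chars` and the handling of capital letters are assumptions made for this example, not the augmenter used in the paper.

```python
# Mapping for the Danish letters; the capitalised variants are an assumption.
DANISH_CHAR_MAP = {
    "æ": "ae", "ø": "oe", "å": "aa",
    "Æ": "Ae", "Ø": "Oe", "Å": "Aa",
}


def replace_danish_chars(text):
    """Replace æ, ø, and å with ae, oe, and aa (hypothetical helper)."""
    return text.translate(str.maketrans(DANISH_CHAR_MAP))


print(replace_danish_chars("Anne Østergaard mener også, at det danske landshold tilhører verdenstoppen."))
# Anne Oestergaard mener ogsaa, at det danske landshold tilhoerer verdenstoppen.
```

Because each letter becomes two characters, any character-offset annotations would need to be shifted accordingly, while token-level tags stay the same.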

Text with 5% keystroke errors is still readable. However, 15% keystroke errors, as in the example below, test the limit of what humans and models can reasonably be expected to comprehend.

```{important}
**15% keystroke errors**

Peter Schmeichel mejer ogsp, at ddt danske landshoof anbo 202q tilhårer gerfenatop0en of lan vinde sen kpmkendw lamp mod England.
```
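
A keystroke-error augmentation of this kind could look roughly like the sketch below. It is a simplified illustration: the `KEY_NEIGHBOURS` map is a small, hypothetical excerpt of a QWERTY-style layout, `keystroke_errors` is not the function used in the paper, and named entities are not treated specially here.

```python
import random

# Partial, hypothetical map from a key to some of its keyboard neighbours.
KEY_NEIGHBOURS = {
    "a": "qsz", "d": "sfe", "e": "wrd", "i": "uok", "k": "jli",
    "n": "bmh", "o": "ipl", "r": "etf", "s": "adw", "t": "ryg",
}


def keystroke_errors(text, rate, seed=0):
    """Replace each mapped character with a neighbouring key with probability `rate`."""
    rng = random.Random(seed)
    chars = []
    for ch in text:
        neighbours = KEY_NEIGHBOURS.get(ch.lower())
        if neighbours and rng.random() < rate:
            new = rng.choice(neighbours)
            # keep the original casing of the replaced character
            chars.append(new.upper() if ch.isupper() else new)
        else:
            chars.append(ch)
    return "".join(chars)


sentence = "det danske landshold kan vinde den kommende kamp mod England."
print(keystroke_errors(sentence, rate=0.05))
print(keystroke_errors(sentence, rate=0.15))
```

Since characters are substituted one for one, the text length and token boundaries are preserved, which keeps the original annotations aligned with the augmented text.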




The following tables show a detailed breakdown of performance for named entity recognition, part-of-speech tagging, and dependency parsing. They show some general trends, including:

- Spelling variations and abbreviated first names consistently reduce performance of all models on all tasks.
- Even simple replacements of æ, ø, and å with ae, oe, and aa lead to notable performance degradation.
- In general, larger models handle augmentations better than small models with DaCy large performing the best.
- The BiLSTM-based models (Stanza and Flair) perform competitively under augmentations and are only consistently outperformed by DaCy large.


# References

```{bibliography}
```