# Named Entity Recognition (NER) Comparison: spaCy vs Stanza

This notebook demonstrates how to extract named entities from sentences using two different NLP libraries: **spaCy** and **Stanza**. We use a token-level test set and compare the results of both systems.

## 1. Load the NER Test Data

We load the token-level NER test set, which contains columns for sentence ID, token ID, token, and BIO NER tag.

In [None]:
import pandas as pd # Use pandas to read the dataset
ner_test = pd.read_csv(r'NER-test.tsv', sep="\t") # Read the dataset
ner_test.head() # Look at the first 5 rows to ensure the data is read correctly
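
Token-level files can trip up pandas' default parsing: a token that is a lone double quote may be read as the start of a quoted field, and a token such as `N/A` may be read as a missing value. The sketch below shows one way to guard against both. The sample rows, the `token_id` column name, and the tag values are assumptions for illustration, not the contents of `NER-test.tsv`.

```python
import csv
import io

import pandas as pd

# Hypothetical excerpt in the layout described above (not the real test data).
sample_tsv = (
    "sentence_id\ttoken_id\ttoken\tBIO_NER_tag\n"
    "0\t0\tNASA\tB-ORG\n"
    "0\t1\tsaid\tO\n"
    "0\t2\t\"\tO\n"
    "0\t3\tN/A\tO\n"
    "0\t4\t\"\tO\n"
)

sample = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    quoting=csv.QUOTE_NONE,   # treat quote characters as ordinary text
    keep_default_na=False,    # keep tokens like "N/A" as strings
)

print(sample["token"].tolist())
```

If `ner_test['token'].isna().any()` is true after loading the real file, reloading it with these two options is worth trying.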

## 2. Reconstruct Sentences

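Since the data is tokenized, we group tokens by `sentence_id` and join them with single spaces to reconstruct the sentences for NER processing. The result only approximates the original text: punctuation stays detached from the preceding word (for example `Paris .`), and each library re-tokenizes the string with its own tokenizer.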

In [None]:
# Group the tokens by sentence ID and join them into one space-separated sentence.
# map(str, ...) guards against tokens that pandas did not read as strings.
sentences = ner_test.groupby('sentence_id')['token'].apply(lambda tokens: ' '.join(map(str, tokens))).reset_index()
sentences.columns = ['sentence_id', 'sentence'] # Rename the columns
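
Because each sentence is rebuilt with single spaces, the character offsets of every token are predictable. The sketch below shows how a character span (such as one reported by an NER model) could be mapped back to token indices. The helper names `token_char_offsets` and `char_span_to_token_span` are hypothetical and not part of the notebook.

```python
def token_char_offsets(tokens):
    """Return the (start, end) character offsets of each token in ' '.join(tokens)."""
    offsets = []
    pos = 0
    for tok in tokens:
        offsets.append((pos, pos + len(tok)))
        pos += len(tok) + 1  # +1 for the single joining space
    return offsets


def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span to the (first, last + 1) indices of the tokens it overlaps."""
    covered = [i for i, (s, e) in enumerate(offsets) if s < end_char and e > start_char]
    return (covered[0], covered[-1] + 1) if covered else None


tokens = ["Barack", "Obama", "visited", "Paris", "."]
sentence = " ".join(tokens)
offsets = token_char_offsets(tokens)

print(offsets)
print(char_span_to_token_span(offsets, 0, 12))   # characters of "Barack Obama"
print(char_span_to_token_span(offsets, 21, 26))  # characters of "Paris"
```

Comparing token index spans instead of entity strings would make an evaluation robust to tokenizer differences, at the cost of carrying offsets through the pipeline.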

## 3. Import and Initialize NLP Libraries

We import and initialize the spaCy and Stanza pipelines for English. These will be used to extract named entities from each sentence.

In [None]:
import spacy
import stanza

In [None]:
nlp_spacy = spacy.load("en_core_web_sm") # Load the English language model

In [None]:
stanza.download("en") # Download the English models (only needed once)
nlp_stanza = stanza.Pipeline("en", processors="tokenize,ner") # Build the English pipeline with only the processors NER needs

## 4. Define Entity Extraction Functions

We define helper functions to extract entities from a sentence using each library. The functions return a list of (entity text, entity label) pairs.

In [None]:
def extract_entities_spacy(text): # Function to extract entities using spaCy 
    doc = nlp_spacy(text)
    return [(ent.text, ent.label_) for ent in doc.ents]

In [None]:
def extract_entities_stanza(text): # Function to extract entities using stanza
    doc = nlp_stanza(text)
    return [(ent.text, ent.type) for sent in doc.sentences for ent in sent.ents]
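
Entity text alone cannot distinguish two mentions of the same string within one sentence. A possible extension, sketched below, keeps character offsets as well. It relies on the `start_char` and `end_char` attributes that spaCy and Stanza entity spans both expose. The helper name `spans_with_offsets` and the stand-in objects are assumptions made so that the snippet runs without loading either model.

```python
from types import SimpleNamespace


def spans_with_offsets(ents, label_attr):
    """Return (start_char, end_char, label, text) for each entity-like object.

    label_attr is "label_" for spaCy spans and "type" for Stanza spans.
    """
    return [(e.start_char, e.end_char, getattr(e, label_attr), e.text) for e in ents]


# Stand-ins that mimic the attributes of real entity spans (illustration only).
fake_spacy_ents = [SimpleNamespace(text="Barack Obama", label_="PERSON", start_char=0, end_char=12)]
fake_stanza_ents = [SimpleNamespace(text="Paris", type="GPE", start_char=21, end_char=26)]

print(spans_with_offsets(fake_spacy_ents, "label_"))
print(spans_with_offsets(fake_stanza_ents, "type"))
```

With the real pipelines the calls would look like `spans_with_offsets(nlp_spacy(text).ents, "label_")` and `spans_with_offsets(nlp_stanza(text).ents, "type")`.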

## 5. Apply NER Systems

We apply both spaCy and Stanza NER to each sentence and store the results in new columns.

In [None]:
sentences['spacy_entities'] = sentences['sentence'].apply(extract_entities_spacy) # Apply the spaCy function to each sentence
sentences['stanza_entities'] = sentences['sentence'].apply(extract_entities_stanza) # Apply the stanza function to each sentence

## 6. Display and Compare Results

We display the sentences alongside the entities extracted by each system. This allows for direct comparison and further analysis.

In [None]:
for _, row in sentences.iterrows():
    print(f"Sentence: {row['sentence']}\n") # Print the sentence
    print("spaCy entities:")
    for ent in row['spacy_entities']: # Iterate over the spaCy entities and print them
        print(f"  {ent[0]} ({ent[1]})")
    print("Stanza entities:")
    for ent in row['stanza_entities']: # Iterate over the stanza entities and print them
        print(f"  {ent[0]} ({ent[1]})")
    print("-" * 60)

## 7. Direct Comparison of spaCy and Stanza Results

The following sections present a direct comparison between the named entity recognition results produced by spaCy and Stanza. We will analyze agreements, disagreements, and unique findings from each system side by side.

The first table lists every extracted entity with the label assigned by each model, side by side.

In [None]:
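# NOTE: comparison_df is used below but never defined in this notebook. This reconstruction
# assumes a long format: one row per predicted entity, tagged with the system that found it.
comparison_df = pd.DataFrame(
    [(row['sentence_id'], row['sentence'], system, text, label)
     for _, row in sentences.iterrows()
     for system, col in [('spaCy', 'spacy_entities'), ('Stanza', 'stanza_entities')]
     for text, label in row[col]],
    columns=['sentence_id', 'sentence', 'system', 'entity', 'label']
)

# Pivot the comparison DataFrame to see spaCy and Stanza side by side
pivot_df = comparison_df.pivot_table(
    index=['sentence_id', 'sentence', 'entity'],
    columns='system',
    values='label',
    aggfunc='first'
).reset_index()

pd.set_option('display.max_colwidth', None)
display(pivot_df)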
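
To see what the pivot produces, here is a toy long-format table run through the same `pivot_table` call. The rows are hypothetical values chosen for illustration, not results from the test set.

```python
import pandas as pd

# Hypothetical predictions: both systems find "Paris", only spaCy finds "Obama".
toy = pd.DataFrame([
    {"sentence_id": 0, "entity": "Paris", "system": "spaCy", "label": "GPE"},
    {"sentence_id": 0, "entity": "Paris", "system": "Stanza", "label": "GPE"},
    {"sentence_id": 0, "entity": "Obama", "system": "spaCy", "label": "PERSON"},
])

toy_pivot = toy.pivot_table(
    index=["sentence_id", "entity"],
    columns="system",
    values="label",
    aggfunc="first",  # keeps one label if the same entity occurs twice in a sentence
).reset_index()

print(toy_pivot)
```

An entity that only one system found shows up with a missing value in the other system's column, which is what the later "only found by" filters rely on. Because of `aggfunc='first'`, repeated mentions of the same string in one sentence collapse into a single row.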

Next, we count how many entities each model found.

In [None]:
# Count entities found by each model
n_spacy = comparison_df[comparison_df['system'] == 'spaCy'].shape[0]
n_stanza = comparison_df[comparison_df['system'] == 'Stanza'].shape[0]

print(f"Entities found by spaCy: {n_spacy}")
print(f"Entities found by Stanza: {n_stanza}")

For entities found by both models, we count how often the two labels agree and how often they differ.

In [None]:
# Keep only entities found by both systems: how='any' drops a row if either label is missing.
# (With how='all', entities found by one system would be counted as disagreements.)
filtered = pivot_df.dropna(subset=['spaCy', 'Stanza'], how='any')
agreement = filtered[filtered['spaCy'] == filtered['Stanza']]
disagreement = filtered[filtered['spaCy'] != filtered['Stanza']]

print(f"Entities with SAME label: {len(agreement)}")
print(f"Entities with DIFFERENT label: {len(disagreement)}")
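
Rows where one system found nothing need care when counting disagreements, because a missing value compares as "not equal" to any label. The toy table below (hypothetical values, not results from the test set) illustrates the difference between a naive comparison and one restricted to entities found by both systems.

```python
import pandas as pd

toy = pd.DataFrame({
    "entity": ["Paris", "Obama", "Apple"],
    "spaCy": ["GPE", "PERSON", "ORG"],
    "Stanza": ["GPE", float("nan"), "PERSON"],  # Stanza missed "Obama" in this toy case
})

# Naive comparison: the missing label counts as a disagreement.
naive_diff = toy[toy["spaCy"] != toy["Stanza"]]

# Restrict to entities that both systems found before comparing labels.
both = toy.dropna(subset=["spaCy", "Stanza"], how="any")
true_diff = both[both["spaCy"] != both["Stanza"]]

print(naive_diff["entity"].tolist())
print(true_diff["entity"].tolist())
```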

We also count the entities that only one of the two models found.

In [None]:
# Entities found only by spaCy
only_spacy = pivot_df[(pivot_df['spaCy'].notna()) & (pivot_df['Stanza'].isna())]
# Entities found only by Stanza
only_stanza = pivot_df[(pivot_df['Stanza'].notna()) & (pivot_df['spaCy'].isna())]

print(f"Entities found ONLY by spaCy: {len(only_spacy)}")
print(f"Entities found ONLY by Stanza: {len(only_stanza)}")

The same comparisons, shown as tables:

In [None]:
print("Table of agreement:")
display(agreement)

print("Table of disagreement:")
display(disagreement)

print("Table of entities only found by spaCy:")
display(only_spacy)

print("Table of entities only found by Stanza:")
display(only_stanza)

Distribution of entity types

In [None]:
# Distribution of entity types per model
spacy_types = comparison_df[comparison_df['system'] == 'spaCy']['label'].value_counts()
stanza_types = comparison_df[comparison_df['system'] == 'Stanza']['label'].value_counts()

print("spaCy entity type distribution:")
print(spacy_types)
print("\nStanza entity type distribution:")
print(stanza_types)

## 8. Performance Analysis of Both Models

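Since `NER-test.tsv` includes gold BIO tags in the `BIO_NER_tag` column, we can use them to measure how accurate each model is.

We count true positives, false positives, and false negatives at the entity level and report precision, recall, and F1. A predicted entity counts as a true positive only if both its text and its label exactly match a gold entity in the same sentence, so the scores also depend on the models and the gold data using the same label names and token boundaries.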
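
For reference, the three scores are the standard ones computed from the counts. The second line works them through for illustrative counts ($TP = 8$, $FP = 2$, $FN = 4$), which are made up for the example and are not results from this notebook.

$$
P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2PR}{P + R} = \frac{2\,TP}{2\,TP + FP + FN}
$$

$$
P = \frac{8}{10} = 0.80, \qquad R = \frac{8}{12} \approx 0.67, \qquad F_1 = \frac{16}{22} \approx 0.73
$$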

In [None]:
def bio_to_spans(tokens, tags): # Function to convert BIO tags to entity spans
    """Convert BIO tags to entity spans: (start, end, label, text)"""
    spans = []
    start = None
    label = None
    for i, tag in enumerate(tags):
        if tag.startswith('B-'):
            if start is not None:
                spans.append((start, i, label, ' '.join(tokens[start:i])))
            start = i
            label = tag[2:]
        elif tag.startswith('I-'):
            continue
        else:  # tag == 'O'
            if start is not None:
                spans.append((start, i, label, ' '.join(tokens[start:i])))
                start = None
                label = None
    if start is not None:
        spans.append((start, len(tags), label, ' '.join(tokens[start:len(tags)])))
    return spans

# Build gold spans for each sentence
gold_spans = {}
for sid, group in ner_test.groupby('sentence_id'):
    tokens = group['token'].tolist()
    tags = group['BIO_NER_tag'].tolist()
    spans = bio_to_spans(tokens, tags)
    gold_spans[sid] = set((span[3], span[2]) for span in spans)  # (text, label)

# Attach the gold (text, label) sets to the reconstructed sentences
sentences['gold_spans'] = sentences['sentence_id'].map(gold_spans)

# Evaluate for each model
def evaluate(pred_col):
    tp = 0  # true positives
    fp = 0  # false positives
    fn = 0  # false negatives
    for _, row in sentences.iterrows():
        gold = row['gold_spans']
        pred = set((text, label) for text, label in row[pred_col])
        tp += len(gold & pred)
        fp += len(pred - gold)
        fn += len(gold - pred)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
    return tp, fp, fn, precision, recall, f1

spacy_tp, spacy_fp, spacy_fn, spacy_prec, spacy_rec, spacy_f1 = evaluate('spacy_entities')
stanza_tp, stanza_fp, stanza_fn, stanza_prec, stanza_rec, stanza_f1 = evaluate('stanza_entities')

print("spaCy:")
print(f"  TP: {spacy_tp}, FP: {spacy_fp}, FN: {spacy_fn}")
print(f"  Precision: {spacy_prec:.2f}, Recall: {spacy_rec:.2f}, F1: {spacy_f1:.2f}")

print("\nStanza:")
print(f"  TP: {stanza_tp}, FP: {stanza_fp}, FN: {stanza_fn}")
print(f"  Precision: {stanza_prec:.2f}, Recall: {stanza_rec:.2f}, F1: {stanza_f1:.2f}")
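
The `bio_to_spans` function above assumes well-formed tags: an `I-` tag that follows `O` is ignored, and an `I-` tag whose label differs from the open span is absorbed into it. If the gold data might contain such sequences, a more lenient parser is one option. The sketch below is an alternative, not the notebook's method. It treats a stray or mismatched `I-` tag as the start of a new span, and the function name `bio_to_spans_lenient` is hypothetical.

```python
def bio_to_spans_lenient(tokens, tags):
    """Convert BIO tags to (start, end, label, text) spans, tolerating ill-formed I- tags."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" closes a trailing span
        prefix, _, tag_label = tag.partition("-")
        # An I- tag continues the open span only if the labels match.
        if prefix == "I" and start is not None and tag_label == label:
            continue
        if start is not None:
            spans.append((start, i, label, " ".join(tokens[start:i])))
            start, label = None, None
        if prefix in ("B", "I"):  # B- always opens a span; so does a stray I-
            start, label = i, tag_label
    return spans


toy_tokens = ["New", "York", "and", "Paris", "France"]
toy_tags = ["I-LOC", "I-LOC", "O", "B-LOC", "I-ORG"]  # deliberately ill-formed

print(bio_to_spans_lenient(toy_tokens, toy_tags))
```

Which convention is appropriate depends on how the gold data was annotated, so it is worth checking a few ill-formed sequences by hand before switching parsers.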

In [None]:
import matplotlib.pyplot as plt

metrics = ['TP', 'FP', 'FN']
spacy_scores = [spacy_tp, spacy_fp, spacy_fn]
stanza_scores = [stanza_tp, stanza_fp, stanza_fn]

x = range(len(metrics))
plt.figure(figsize=(7,4))
plt.bar(x, spacy_scores, width=0.35, label='spaCy', align='center')
plt.bar([i + 0.35 for i in x], stanza_scores, width=0.35, label='Stanza', align='center')
plt.xticks([i + 0.175 for i in x], metrics)
plt.ylabel('Count')
plt.title('NER Entity Counts: TP, FP, FN')
plt.legend()
plt.show()

The same comparison in terms of precision, recall, and F1:

In [None]:
metrics = ['Precision', 'Recall', 'F1']
spacy_scores = [spacy_prec, spacy_rec, spacy_f1]
stanza_scores = [stanza_prec, stanza_rec, stanza_f1]

x = range(len(metrics))
plt.figure(figsize=(7,4))
plt.bar(x, spacy_scores, width=0.35, label='spaCy', align='center')
plt.bar([i + 0.35 for i in x], stanza_scores, width=0.35, label='Stanza', align='center')
plt.xticks([i + 0.175 for i in x], metrics)
plt.ylim(0, 1)
plt.ylabel('Score')
plt.title('NER Model Performance: Precision, Recall, F1')
plt.legend()
plt.show()