<a href="https://colab.research.google.com/github/farrelrassya/python-natural-language-Processing-cookbook/blob/main/chapter%2005%20-%20Information%20Extraction%20/%20Chapter_05_Information_Extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter 5 — Getting Started with Information Extraction

**Information extraction** is the task of pulling specific, structured facts from unstructured text. Instead of reading an entire news article, you can automatically extract the companies, people, dates, and key topics mentioned in it.

This chapter covers six progressively sophisticated techniques:

| # | Recipe | Approach | Key Idea |
|---|--------|----------|----------|
| 1 | **Regular Expressions** | Pattern matching | Hand-crafted patterns for emails and URLs |
| 2 | **Levenshtein Distance** | String similarity | Find closest match to a misspelled query |
| 3 | **Keyword Extraction** | TF-IDF scoring | Rank words by document-specific importance |
| 4 | **spaCy NER** | Pre-trained models | Off-the-shelf named entity recognition |
| 5 | **Custom spaCy NER** | Supervised training | Train your own entity recognizer |
| 6 | **Fine-tuning BERT for NER** | Transfer learning | Adapt a pre-trained transformer |

Each recipe builds on a core insight: **the more domain knowledge you encode (or learn from data), the better your extraction becomes.** Regex encodes exact patterns; TF-IDF encodes corpus statistics; spaCy encodes linguistic structure; BERT encodes deep contextual semantics.

## 0 — Environment Setup

We install all required packages and download the datasets in a single section so the notebook is fully self-contained on Google Colab.

In [1]:

# 0.1  Install packages

import os
os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"

!pip install -q \
    datasets \
    langdetect \
    nltk \
    scikit-learn \
    sentence-transformers \
    spacy \
    python-Levenshtein \
    evaluate \
    seqeval \
    accelerate

# Download spaCy models
!python -m spacy download en_core_web_sm -q
!python -m spacy download en_core_web_lg -q


In [2]:

# 0.2  Core imports & configuration

import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import re
import json

import nltk
nltk.download("punkt",     quiet=True)
nltk.download("punkt_tab", quiet=True)
nltk.download("stopwords", quiet=True)

import spacy

small_model = spacy.load("en_core_web_sm")
large_model = spacy.load("en_core_web_lg")

print("Setup complete.")


Setup complete.


In [3]:

# 0.3  Download datasets from the book's GitHub repository

import urllib.request

REPO = ("https://raw.githubusercontent.com/PacktPublishing/"
        "Python-Natural-Language-Processing-Cookbook-Second-Edition/main/data")

os.makedirs("data", exist_ok=True)

files_to_download = [
    "DataScientist.csv",
    "music_ner.csv",
    "music_ner_bio.bio",
]

for fname in files_to_download:
    url  = f"{REPO}/{fname}"
    dest = f"data/{fname}"
    if not os.path.exists(dest):
        print(f"Downloading {fname}...")
        urllib.request.urlretrieve(url, dest)
    else:
        print(f"Already exists: {fname}")

# Verify downloads
for fname in files_to_download:
    size = os.path.getsize(f"data/{fname}")
    print(f"  {fname}: {size:,} bytes")


Downloading DataScientist.csv...
Downloading music_ner.csv...
Downloading music_ner_bio.bio...
  DataScientist.csv: 15,101,495 bytes
  music_ner.csv: 40,138 bytes
  music_ner_bio.bio: 48,430 bytes


We download three data files directly from the book's GitHub repository: `DataScientist.csv` (Kaggle job descriptions, used for regex and Levenshtein recipes), `music_ner.csv` (music entity annotations for spaCy NER), and `music_ner_bio.bio` (the same data in IOB format for BERT fine-tuning). The BBC News dataset is loaded via Hugging Face in Recipe 3.

## 0.4 — Shared Utility Functions

All helper functions that the original cookbook stored in external notebooks are defined inline here.

In [4]:

# 0.4  Shared utility functions

from nltk import word_tokenize
from nltk.corpus import stopwords
from string import punctuation

STOP_WORDS = list(stopwords.words("english")) + ["``", "'s"]

def get_list_of_items(df, column_name):
    """Flatten a column of lists into a single deduplicated list."""
    values = df[column_name].values
    values = [item for sublist in values for item in sublist]
    return list(set(values))

def get_emails(df):
    """Extract all email addresses from the Job Description column."""
    email_regex = r"[^\s:|()\']+@[a-zA-Z0-9\.]+\.[a-zA-Z]+"
    df["emails"] = df["Job Description"].apply(
        lambda x: re.findall(email_regex, str(x)))
    emails = get_list_of_items(df, "emails")
    return emails

print("Utility functions defined.")


Utility functions defined.


---
## Recipe 1 — Using Regular Expressions

Regular expressions (regex) define **search patterns** using special character sequences. They are the workhorse of rule-based information extraction — fast, deterministic, and requiring no training data.

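Conceptually, a regex describes a finite-state automaton that scans the input string character by character, moving between states as the pattern dictates. (Python's `re` module implements matching with a backtracking engine rather than a pure automaton, which is why some pathological patterns can be slow.) For simple extraction tasks like emails and URLs, regex is often the fastest path from raw text to structured data.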

The email pattern we will use decomposes as:

$$\underbrace{\texttt{[}\hat{}\texttt{\textbackslash s:|()']}\texttt{+}}_{\text{username}} \;\texttt{@}\; \underbrace{\texttt{[a-zA-Z0-9\textbackslash.]+}}_{\text{domain}} \;\texttt{\textbackslash.}\; \underbrace{\texttt{[a-zA-Z]+}}_{\text{TLD}}$$


Each bracketed group defines a **character class**, and the quantifiers `+` (one or more) and `*` (zero or more) control repetition.

In [6]:

# 1.1  Load the job descriptions dataset

data_file = "data/DataScientist.csv"
df = pd.read_csv(data_file, encoding="utf-8")
print(f"Loaded {len(df):,} job descriptions")
print(f"Columns: {list(df.columns)}")
print()
print(df[["Job Title", "Company Name"]].head())


Loaded 3,909 job descriptions
Columns: ['Unnamed: 0', 'index', 'Job Title', 'Salary Estimate', 'Job Description', 'Rating', 'Company Name', 'Location', 'Headquarters', 'Size', 'Founded', 'Type of ownership', 'Industry', 'Sector', 'Revenue', 'Competitors', 'Easy Apply']

                           Job Title                       Company Name
0              Senior Data Scientist                      Hopper\r\n3.5
1  Data Scientist, Product Analytics                     Noom US\r\n4.5
2               Data Science Manager                           Decode_M
3                       Data Analyst            Sapphire Digital\r\n3.4
4             Director, Data Science  United Entertainment Group\r\n3.4


The dataset contains Data Scientist job postings scraped from various job boards. Each row includes a `Job Description` field — a free-text block that may contain contact emails, application URLs, and other structured information buried inside prose. Our task is to extract these automatically.

In [7]:

# 1.2  Extract email addresses using regex

email_regex = r"[^\s:|()\']+@[a-zA-Z0-9\.]+\.[a-zA-Z]+"

df["emails"] = df["Job Description"].apply(
    lambda x: re.findall(email_regex, str(x)))

emails = get_list_of_items(df, "emails")
print(f"Unique emails found: {len(emails)}")
print()
# Show a sample
for e in sorted(emails)[:10]:
    print(f"  {e}")


Unique emails found: 220

  ADACoordinator@bmd.hctx.net
  AMunoz4@dhs.lacounty.gov
  Accommodation.Reques@am.jll.com
  Aleo431@KellyScientific.com
  Alok.Kumar@artech.com
  Amit@apninc.com
  Application_Accommodation@colpal.com
  Brooke.Schoen@kellyservices.com
  Candidate.Accommodations@Disney.com
  Careers.APJ@sap.com


The regex works by matching three components separated by `@` and `.`:

- **Username** `[^\s:|()\']+` — one or more characters that are *not* whitespace, colons, pipes, parentheses, or apostrophes. The caret `^` inside brackets negates the character class.
- **Domain** `[a-zA-Z0-9\.]+` — one or more alphanumeric characters or dots (the dot is escaped with `\` because `.` alone matches *any* character in regex).
- **TLD** `[a-zA-Z]+` — one or more letters (no digits in standard top-level domains like `.com`, `.org`).

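This pattern captures most of the plainly formatted addresses found in job postings, but it is loose in some places and strict in others. Because the username class is negated, it accepts almost any character (including `+`, so `user+tag@gmail.com` does match), which also lets stray punctuation such as a leading `<` leak into a match. The domain class, by contrast, allows only letters, digits, and dots, so hyphenated domains (e.g., `name@my-company.com`) and internationalized domain names are missed. In production you would use a more comprehensive RFC 5322-compliant pattern or a dedicated email parsing library.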
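
A quick way to see where the pattern's boundaries lie is to run it on a few hand-made strings. The sketch below reuses the pattern from cell 1.2; the sample addresses are invented for illustration and do not come from the dataset.

```python
import re

# Same pattern as in cell 1.2
email_regex = r"[^\s:|()\']+@[a-zA-Z0-9\.]+\.[a-zA-Z]+"

# Hand-made samples (hypothetical, not taken from the dataset)
samples = [
    "Contact: hr@example.org.",   # trailing sentence period
    "user+tag@gmail.com",         # plus sign in the username
    "jane.doe@my-company.com",    # hyphen in the domain
    "<bob@example.com>",          # angle brackets around the address
]

results = {s: re.findall(email_regex, s) for s in samples}
for sample, found in results.items():
    print(f"{sample!r:30} -> {found}")
```

The hyphenated domain produces no match because `-` is not in the domain class, while the opening angle bracket is kept because the username class excludes only whitespace and a handful of punctuation characters. The plus-tagged address matches for the same reason.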

In [8]:

# 1.3  Extract URLs using regex

url_regex = (
    r"(http[s]?://(www\.)?[A-Za-z0-9\-_\.]+\.[A-Za-z]+"
    r"/?[A-Za-z0-9$\-_/\.?&=%]*)"
)

df["urls"] = df["Job Description"].apply(
    lambda x: [
        m.group(1)
        for m in re.finditer(url_regex, str(x))
    ]
)
urls = get_list_of_items(df, "urls")

print(f"Unique URLs found: {len(urls)}")
print()
for u in sorted(urls)[:10]:
    print(f"  {u}")


Unique URLs found: 304

  http://adminguide.stanford.edu
  http://adminguide.stanford.edu.
  http://adminrecords.ucsd.edu/PPM/docs/230-311.html
  http://aice-eval.org/members/.
  http://bit.ly/1mzJQeL
  http://bit.ly/amazon-scot
  http://business.pinto.co
  http://camdenkelly.com/jobs
  http://ccmb.usc.edu
  http://cherokee-cna.com/Pages/Home.aspx


URL regex is significantly more complex than email regex because URLs have more structural components: protocol (`http` or `https`), optional `www.` prefix, domain, and an arbitrarily long path with query parameters.

The key design decisions in our pattern:

- We require `http[s]?://` — this anchors the match and avoids false positives from domain-like strings in prose.
- The `?` quantifier after `s` makes HTTPS optional: matches both `http://` and `https://`.
- The path component uses `*` (zero or more) rather than `+` because many URLs have no path beyond the domain.

**Production note:** For production-grade URL extraction, consider using Python's `urllib.parse` module or the `validators` library. Regex is fast but brittle — a single edge case (unicode domains, port numbers, fragments) can break the pattern. The 80/20 rule applies: regex gets you 80% of the way in 20% of the effort; the remaining 20% of edge cases require 80% of the effort.
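
The output above contains entries such as `http://adminguide.stanford.edu.` where a sentence-ending period was captured as part of the path. A minimal sketch of one way to handle this, assuming it is acceptable to strip trailing punctuation after matching (the helper name `extract_urls` and the sample sentences are illustrative, not part of the original recipe):

```python
import re
from urllib.parse import urlparse

# Same pattern as in cell 1.3
url_regex = (
    r"(http[s]?://(www\.)?[A-Za-z0-9\-_\.]+\.[A-Za-z]+"
    r"/?[A-Za-z0-9$\-_/\.?&=%]*)"
)

def extract_urls(text, strip_chars=".,;:!?)"):
    """Find URLs, then drop punctuation that trails each match."""
    return [m.group(1).rstrip(strip_chars)
            for m in re.finditer(url_regex, text)]

# Trailing sentence punctuation is removed after matching
cleaned = extract_urls("See http://adminguide.stanford.edu. for details")
print(cleaned)

# A port number falls outside the pattern's character classes,
# so the match stops at the end of the domain
truncated = extract_urls("API at https://example.com:8080/v1/jobs#top")
print(truncated)

# urllib.parse splits an already-known URL into its components
parts = urlparse("https://example.com:8080/v1/jobs#top")
print(parts.hostname, parts.port, parts.path, parts.fragment)
```

Stripping is a heuristic: a URL that legitimately ends in `)` or `.` would be shortened too, so the character set is worth tuning to the corpus. Note that `urllib.parse` parses URLs but does not find them in running text, so it complements the regex rather than replacing it.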

---
## Recipe 2 — Finding Similar Strings: Levenshtein Distance

When extracting information from messy real-world text, misspellings are inevitable. A customer might type `rohitt.macdonald@prelim.com` when they mean `rohit.mcdonald@prolim.com`. Exact string matching would fail; we need **fuzzy matching**.

The **Levenshtein distance** (also called **edit distance**) counts the minimum number of single-character edits (insertions, deletions, substitutions) needed to transform one string into another:

$$d_{\text{Lev}}(a, b) = \begin{cases}
|a| & \text{if } |b| = 0 \\
|b| & \text{if } |a| = 0 \\
d_{\text{Lev}}(\text{tail}(a), \text{tail}(b)) & \text{if } a[0] = b[0] \\
1 + \min \begin{cases}
d_{\text{Lev}}(\text{tail}(a), b) & \text{(delete)} \\
d_{\text{Lev}}(a, \text{tail}(b)) & \text{(insert)} \\
d_{\text{Lev}}(\text{tail}(a), \text{tail}(b)) & \text{(substitute)}
\end{cases} & \text{otherwise}
\end{cases}$$

This is computed efficiently using dynamic programming in $O(|a| \times |b|)$ time.
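
To make the dynamic-programming claim concrete, here is a minimal pure-Python sketch that keeps only two rows of the table at a time. It is meant as an illustration of the recurrence, not as a replacement for the optimized `Levenshtein` package used in the cells that follow.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance computed row by row with dynamic programming."""
    if len(a) < len(b):
        a, b = b, a                        # keep the stored row short
    previous = list(range(len(b) + 1))     # distances from "" to b[:j]
    for i, ch_a in enumerate(a, start=1):
        current = [i]                      # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,           # delete ch_a
                current[j - 1] + 1,        # insert ch_b
                previous[j - 1] + cost,    # match or substitute
            ))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))    # 3
print(levenshtein("rohitt.macdonald@prelim.com",
                  "rohit.mcdonald@prolim.com"))  # 3
```

Each cell depends only on its left, upper, and upper-left neighbours, which is why two rows suffice and memory stays at $O(\min(|a|, |b|))$ while time remains $O(|a| \times |b|)$.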

In [9]:

# 2.1  Extract emails and set up Levenshtein matching

import Levenshtein

# Re-extract emails from a clean copy of the dataframe
df_lev = pd.read_csv("data/DataScientist.csv", encoding="utf-8")
emails = get_emails(df_lev)

print(f"Total unique emails extracted: {len(emails)}")
print(f"Sample: {sorted(emails)[:5]}")


Total unique emails extracted: 220
Sample: ['ADACoordinator@bmd.hctx.net', 'AMunoz4@dhs.lacounty.gov', 'Accommodation.Reques@am.jll.com', 'Aleo431@KellyScientific.com', 'Alok.Kumar@artech.com']


In [10]:

# 2.2  Find closest email using Levenshtein distance

def find_levenshtein(input_string, df):
    col_name = "distance_to_" + input_string
    df[col_name] = df["emails"].apply(
        lambda x: Levenshtein.distance(input_string, x))
    return df

def get_closest_email_lev(df, email):
    df = find_levenshtein(email, df)
    col_name = "distance_to_" + email
    min_idx = df[col_name].idxmin()
    return df.loc[min_idx]["emails"], df.loc[min_idx][col_name]

email_df = pd.DataFrame(emails, columns=["emails"])
input_string = "rohitt.macdonald@prelim.com"

closest, distance = get_closest_email_lev(email_df, input_string)
print(f"Input (misspelled) : {input_string}")
print(f"Closest match      : {closest}")
print(f"Levenshtein distance: {distance}")


Input (misspelled) : rohitt.macdonald@prelim.com
Closest match      : rohit.mcdonald@prolim.com
Levenshtein distance: 3


The algorithm correctly identifies `rohit.mcdonald@prolim.com` as the closest match despite **three edits**: removing the extra `t` in `rohitt`, removing the `a` in `macdonald`, and changing `e` to `o` in `prelim`/`prolim`. The Levenshtein distance gives an absolute count of edits, which is useful for thresholding ("reject matches with more than $k$ edits") but less useful for comparing strings of different lengths: a 3-edit difference means something very different for a 5-character string than for a 50-character string.

In [11]:

# 2.3  Jaro and Jaro-Winkler similarity

def find_jaro(input_string, df):
    col_name = "jaro_to_" + input_string
    df[col_name] = df["emails"].apply(
        lambda x: Levenshtein.jaro(input_string, x))
    return df

def get_closest_email_jaro(df, email):
    df = find_jaro(email, df)
    col_name = "jaro_to_" + email
    max_idx = df[col_name].idxmax()
    return df.loc[max_idx]["emails"], df.loc[max_idx][col_name]

email_df2 = pd.DataFrame(emails, columns=["emails"])
closest_j, score_j = get_closest_email_jaro(email_df2, input_string)
print(f"Jaro similarity match : {closest_j}  (score: {score_j:.4f})")

# Jaro-Winkler: extra weight on matching prefix
jw_score = Levenshtein.jaro_winkler(
    "rohit.mcdonald@prolim.com", "rohit.mcdonald@prolim.org")
print(f"\nJaro-Winkler example:")
print(f"  rohit.mcdonald@prolim.com  vs  rohit.mcdonald@prolim.org")
print(f"  Score: {jw_score:.4f}")


Jaro similarity match : rohit.mcdonald@prolim.com  (score: 0.8802)

Jaro-Winkler example:
  rohit.mcdonald@prolim.com  vs  rohit.mcdonald@prolim.org
  Score: 0.9680


**Jaro similarity** returns a normalized score in $[0, 1]$ where 1 means identical strings. Unlike Levenshtein distance, it accounts for string length, making it better for comparing strings of different sizes. The formula considers the number of matching characters (within a window) and the number of transpositions.

**Jaro-Winkler** extends Jaro by adding a bonus for strings that share a common prefix, the intuition being that misspellings are more common at the end of words than at the beginning. Notice the score of 0.968 for two emails that differ only in the TLD (`.com` vs `.org`): the long matching prefix dominates.
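
The two measures can be sketched in a few lines of pure Python. This follows the textbook definitions (match window of half the longer length minus one, prefix bonus capped at four characters with weight 0.1); those constants are assumptions about the conventional defaults rather than something stated in this chapter, and the library's own implementation may differ in details.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: matches within a window, minus transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1: str, s2: str, prefix_weight=0.1, max_prefix=4) -> float:
    """Jaro similarity plus a bonus for a shared prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1[:max_prefix], s2[:max_prefix]):
        if c1 != c2:
            break
        prefix += 1
    return base + prefix * prefix_weight * (1.0 - base)

print(f"{jaro('MARTHA', 'MARHTA'):.4f}")
print(f"{jaro_winkler('rohit.mcdonald@prolim.com', 'rohit.mcdonald@prolim.org'):.4f}")
```

For the two email addresses, 23 of the 25 characters match with no transpositions, giving a Jaro score of about 0.947; the four-character prefix bonus lifts it to 0.968, in line with the library output above.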

**When to use which:**
- **Levenshtein** — when you need an absolute edit count for thresholding
- **Jaro** — when comparing strings of varying lengths
- **Jaro-Winkler** — when prefix matches are especially important (names, codes, identifiers)

---

## Recipe 3 — Extracting Keywords with TF-IDF

Keyword extraction identifies the most **informative** words in a document — the ones that distinguish it from other documents in the corpus. We use **TF-IDF** (Term Frequency--Inverse Document Frequency):

$$\text{tfidf}(w, d) = \underbrace{\text{tf}(w, d)}_{\substack{\text{How often } w \\ \text{appears in } d}} \times \underbrace{\log \frac{N}{\text{df}(w)}}_{\substack{\text{How rare } w \\ \text{is across corpus}}}$$

Words with high TF-IDF are frequent in the target document but rare across the corpus — they capture what makes this document *unique*. Common words like "the" have high TF but low IDF; rare but relevant words like "saxophone" or "parliament" have the high TF-IDF scores we want.
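
A small worked example makes the formula tangible. The sketch below applies it by hand to a three-sentence toy corpus; the sentences are invented for illustration, and the natural logarithm is an assumption since the formula does not fix a base.

```python
import math
from collections import Counter

# Toy corpus (hypothetical sentences, already lowercased and tokenised)
corpus = [
    "the band played the saxophone".split(),
    "the parliament passed the bill".split(),
    "the band toured the country".split(),
]
N = len(corpus)

# df(w): number of documents that contain w at least once
df = Counter(word for doc in corpus for word in set(doc))

def tfidf(word, doc):
    tf = doc.count(word)            # raw count in this document
    idf = math.log(N / df[word])    # natural log of N / df(w)
    return tf * idf

scores = {w: tfidf(w, corpus[0]) for w in set(corpus[0])}
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word:10s} {score:.3f}")
```

In the first document, "the" appears twice but occurs in every document, so its IDF is $\log(3/3) = 0$ and its score vanishes; "saxophone" appears once but in only one document, so it scores $\log 3 \approx 1.099$.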

In [12]:

# 3.1  Load the BBC News dataset

from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer

train_dataset = load_dataset("SetFit/bbc-news", split="train")
test_dataset  = load_dataset("SetFit/bbc-news", split="test")
train_df = train_dataset.to_pandas()
test_df  = test_dataset.to_pandas()

print(f"Training articles: {len(train_df):,}")
print(f"Test articles    : {len(test_df):,}")
print()
print("Class distribution:")
print(train_df.groupby("label_text")["text"].count())




Training articles: 1,225
Test articles    : 1,000

Class distribution:
label_text
business         286
entertainment    210
politics         242
sport            275
tech             212
Name: text, dtype: int64


In [13]:

# 3.2  Fit TF-IDF vectorizer on training corpus

vectorizer = TfidfVectorizer(
    stop_words="english",
    min_df=2,        # ignore terms in fewer than 2 documents
    max_df=0.95      # ignore terms in more than 95% of documents
)
vectorizer.fit(train_df["text"])

vocab_size = len(vectorizer.vocabulary_)
print(f"Vocabulary size: {vocab_size:,} terms")


Vocabulary size: 12,801 terms


The `min_df=2` and `max_df=0.95` parameters act as frequency filters: terms appearing in fewer than 2 documents are likely noise (typos, rare proper nouns), while terms in more than 95% of documents are effectively stopwords for this corpus. Together with the built-in English stopword list, these filters reduce the vocabulary to the most discriminative terms.
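
The effect of the two thresholds can be illustrated without scikit-learn. The following pure-Python sketch mimics the rule (an integer `min_df` is an absolute document count, a float `max_df` is a proportion of the corpus); the function name and the toy documents are hypothetical, and real tokenization and stopword handling are omitted.

```python
from collections import Counter

def filter_vocabulary(tokenised_docs, min_df=2, max_df=0.95):
    """Keep terms with min_df <= document frequency <= max_df * N."""
    n_docs = len(tokenised_docs)
    df = Counter(term for doc in tokenised_docs for term in set(doc))
    return sorted(
        term for term, count in df.items()
        if min_df <= count <= max_df * n_docs
    )

docs = [
    "said the minister of finance".split(),
    "said the striker after the match".split(),
    "said the minister after the vote".split(),
]
kept = filter_vocabulary(docs)
print(kept)
```

Here "said" and "the" occur in all three documents and exceed the 95% ceiling, while words such as "finance" or "vote" occur only once and fall below the floor, leaving just the mid-frequency terms.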

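The fitted vectorizer stores an IDF weight for every term in the vocabulary. With scikit-learn's default `smooth_idf=True`, the weight is $\text{idf}(w) = \ln \frac{1 + N}{1 + \text{df}(w)} + 1$, a smoothed variant of the textbook formula above that avoids division by zero and never drops to zero. When we transform a new document, TF is computed on the fly and multiplied by the stored IDF, and the resulting vector is L2-normalized (the default `norm="l2"`) to produce the final TF-IDF vector.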

In [14]:

# 3.3  Define keyword extraction functions

def sort_data_tfidf_score(coord_matrix):
    """Sort a COO matrix by TF-IDF score (descending)."""
    tuples = zip(coord_matrix.col, coord_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

def get_keyword_strings(vectorizer, num_words, sorted_vector):
    """Convert top-scoring indices back to word strings."""
    words = []
    feature_names = vectorizer.get_feature_names_out()
    for (item_index, score) in sorted_vector[:num_words]:
        words.append(feature_names[item_index])
    return words

def get_keywords_simple(vectorizer, input_text, num_output_words=10):
    """Extract top keywords from a single text."""
    vector = vectorizer.transform([input_text])
    sorted_vec = sort_data_tfidf_score(vector.tocoo())
    return get_keyword_strings(vectorizer, num_output_words, sorted_vec)

# Test on the first article
article_text = test_df.iloc[0]["text"]
article_label = test_df.iloc[0]["label_text"]
keywords = get_keywords_simple(vectorizer, article_text)

print(f"Article category: {article_label}")
print(f"First 200 chars : {article_text[:200]}...")
print(f"\nExtracted keywords: {keywords}")


Article category: entertainment
First 200 chars : carry on star patsy rowlands dies actress patsy rowlands  known to millions for her roles in the carry on films  has died at the age of 71.  rowlands starred in nine of the popular carry on films  alo...

Extracted keywords: ['carry', 'theatre', 'scholarship', 'appeared', 'films', 'mrs', 'agent', 'drama', 'died', 'school']


The extracted keywords reveal the article's topic at a glance. Here, an entertainment article about Carry On actress Patsy Rowlands yields keywords such as *carry, theatre, films, drama, died*, a concise summary of the content.

The power of TF-IDF keyword extraction is its **unsupervised nature**: no labeled data is needed, just a reference corpus to compute IDF weights. The trade-off is that it captures only statistical salience, not semantic importance — a word can have a high TF-IDF score simply because it is rare in the corpus, even if it is not central to the article's meaning.

In [15]:

# 3.4  Keyword extraction with n-grams and noun chunks

from nltk.corpus import stopwords as sw_module

# Create a vectorizer over unigrams, bigrams, and trigrams, fitted on the 'text' column
stop_words_custom = list(sw_module.words("english"))
if "the" in stop_words_custom:
    stop_words_custom.remove("the")  # keep 'the' for entity matching

trigram_vectorizer = TfidfVectorizer(
    stop_words=stop_words_custom,
    min_df=2,
    ngram_range=(1, 3),
    max_df=0.95
)
trigram_vectorizer.fit(train_df["text"])

def get_keyword_strings_all(vectorizer, sorted_vector):
    words = []
    feature_names = vectorizer.get_feature_names_out()
    for (item_index, score) in sorted_vector:
        words.append(feature_names[item_index])
    return words

def get_keywords_complex(vectorizer, input_text, spacy_model,
                         num_words=70):
    keywords = []
    doc = spacy_model(input_text)
    vector = vectorizer.transform([input_text])
    sorted_vec = sort_data_tfidf_score(vector.tocoo())
    ngrams = get_keyword_strings_all(vectorizer, sorted_vec)
    ents = [chunk.text.lower() for chunk in doc.noun_chunks]
    for i in range(min(num_words, len(ngrams))):
        kw = ngrams[i]
        if (kw.lower() in ents
            and not kw.isdigit()
            and kw not in keywords):
            keywords.append(kw)
    return keywords

# Test on the first article
keywords_complex = get_keywords_complex(
    trigram_vectorizer, test_df.iloc[0]["text"], small_model)
print(f"Complex keywords: {keywords_complex[:10]}")


Complex keywords: ['carry', 'films', 'stage', 'several years', 'saturday morning', 'star', 'film', 'london', 'beauty', 'the good']


The advanced extractor combines **TF-IDF n-gram scoring** with **spaCy noun chunk filtering**. This two-step process ensures that extracted phrases are both statistically important (high TF-IDF) and linguistically valid (recognized as noun phrases by spaCy's dependency parser).

Single-word keywords like *scholarship* are useful but lack context. Multi-word phrases can capture concepts that individual words cannot express: the output above includes *several years* and *saturday morning* alongside single words. Not every surviving phrase is meaningful (*the good* is a fragment), so the list still benefits from a quick review. The n-gram range `(1, 3)` allows the vectorizer to score unigrams, bigrams, and trigrams simultaneously.
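
To see what an n-gram range of `(1, 3)` produces, the sketch below enumerates the candidates for a short token list. It is a simplified stand-in for the vectorizer's internal analyzer (no stopword removal or lowercasing), and the helper name and sample phrase are illustrative.

```python
def ngrams(tokens, ngram_range=(1, 3)):
    """Return all contiguous n-grams for n in the inclusive range."""
    low, high = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(low, high + 1)
        for i in range(len(tokens) - n + 1)
    ]

tokens = "early saturday morning show".split()
candidates = ngrams(tokens)
print(candidates)
```

Four tokens yield four unigrams, three bigrams, and two trigrams. The candidate set grows roughly threefold with this range, which is one reason the `min_df` filter matters more for n-gram vocabularies.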

**Strategic insight:** In production, keyword extraction powers search indexing, content tagging, topic dashboards, and recommendation systems. The n-gram + noun chunk approach is a strong baseline that requires no labeled data and scales to millions of documents.

---

## Recipe 4 — Named Entity Recognition with spaCy

**Named Entity Recognition (NER)** identifies and classifies mentions of real-world entities in text: people, organizations, locations, dates, monetary amounts, and more. spaCy ships with pre-trained NER models that work out of the box.

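Under the hood, spaCy's NER model uses a **transition-based parser** with a CNN feature extractor. It processes the text left-to-right and, at each token, decides whether to BEGIN a new entity, continue IN the current one, or mark the token as OUT of any entity. (spaCy's actual transition system refines this into the BILUO scheme, which adds explicit LAST and UNIT actions for the final token of an entity and for single-token entities.) The standard entity types include: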

| Label | Description | Example |
|-------|-------------|---------|
| `PERSON` | Named person | *Tim Cook* |
| `ORG` | Organization | *Apple* |
| `GPE` | Geopolitical entity | *US* |
| `DATE` | Date/time expression | *the past year* |
| `CARDINAL` | Numeral (not ordinal) | *12* |
| `PERCENT` | Percentage | *2.7%* |
| `NORP` | Nationality/religion/political group | *Chinese* |
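
The per-token decisions described above can be pictured as a tag sequence. Below is a minimal sketch that converts token-level entity spans into BILUO tags (Begin, In, Last, Unit, Out), the five-tag scheme that extends plain begin/in/out; the function and the sample tokens are illustrative and are not spaCy's own code.

```python
def biluo_tags(tokens, entities):
    """Tag tokens given entities as (start, end_exclusive, label)."""
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        if end - start == 1:
            tags[start] = f"U-{label}"        # single-token entity
        else:
            tags[start] = f"B-{label}"        # first token
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"        # inside tokens
            tags[end - 1] = f"L-{label}"      # last token
    return tags

tokens = ["said", "chief", "executive", "Tim", "Cook", "of", "Apple"]
entities = [(3, 5, "PERSON"), (6, 7, "ORG")]
tags = biluo_tags(tokens, entities)
for token, tag in zip(tokens, tags):
    print(f"{token:10s} {tag}")
```

Encoding the last and single-token positions explicitly gives the model a clearer signal about where an entity ends, which is the usual motivation for BILUO over simpler schemes.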

In [16]:

# 4.1  Named entity extraction from an article

article = (
    "iPhone 12: Apple makes jump to 5G. "
    "Apple has confirmed its iPhone 12 handsets will be its first to work on "
    "faster 5G networks. The company has also extended the range to include a "
    'new "Mini" model that has a smaller 5.4in screen. '
    "The US firm bucked a wider industry downturn by increasing its handset "
    "sales over the past year. But some experts say the new features give "
    "Apple its best opportunity for growth since 2014, when it revamped its "
    "line-up with the iPhone 6. "
    '"5G will bring a new level of performance for downloads and uploads, '
    "higher quality video streaming, more responsive gaming, real-time "
    'interactivity and so much more," said chief executive Tim Cook. '
    "The iPhone 12 and 12 Pro will go on sale on 23 October, with the Mini "
    "and Pro Max following on 13 November. The standard model will cost from "
    "$799 and the Mini from $699 in the US. "
    "Networks are going to have to offer eye-wateringly attractive deals, "
    'and the way they are going to do that is on great tariffs and attractive '
    'trade-in deals, predicted Ben Wood from the consultancy CCS Insight. '
    "Apple typically unveils its new iPhones in September, but opted for a "
    "later date this year. It has not said why, but it was widely speculated "
    "to be related to disruption caused by the coronavirus pandemic. "
    "The firm shares ended the day 2.7% lower. This has been linked to "
    "reports that several Chinese internet platforms opted not to carry the "
    "livestream, although it was still widely viewed and commented on via "
    "the social media network Sina Weibo."
)

doc = small_model(article)
print(f"Entities found by en_core_web_sm: {len(doc.ents)}")
print()
print(f"{'Entity':<30} {'Start':>5} {'End':>5}  Label")
print("-" * 60)
small_model_ents = []
for ent in doc.ents:
    print(f"{ent.text:<30} {ent.start_char:>5} {ent.end_char:>5}  {ent.label_}")
    small_model_ents.append(str(ent))


Entities found by en_core_web_sm: 33

Entity                         Start   End  Label
------------------------------------------------------------
12                                 7     9  CARDINAL
Apple                             11    16  ORG
Apple                             35    40  ORG
12                                66    68  CARDINAL
first                             90    95  ORDINAL
5                                114   115  CARDINAL
Mini                             185   189  PERSON
5.4                              216   219  CARDINAL
US                               234   236  GPE
the past year                    312   325  DATE
Apple                            370   375  ORG
2014                             414   418  DATE
6                                465   466  CARDINAL
5                                469   470  CARDINAL
Tim Cook                         657   665  PERSON
12                               685   687  CARDINAL
23 October                       711

The small model identifies a rich set of entities: **Apple** as an `ORG`, **Tim Cook** and **Ben Wood** as `PERSON`, **US** as a `GPE`, dates like **2014**, **September**, **23 October**, and monetary values like **$799**. It also catches `NORP` entities like **Chinese** (nationalities/political groups).

Notice some interesting edge cases: the model treats **12** in "iPhone 12" as a `CARDINAL` number rather than part of a product name, tags the product name **Mini** as a `PERSON`, and captures only **Sina** from "Sina Weibo". These are limitations of the pre-trained model, which likely saw few or no examples of these specific entities during training. Custom training (Recipe 5) addresses exactly this kind of domain-specific gap.

In [17]:

# 4.2  Compare small vs. large spaCy model

doc_lg = large_model(article)
print(f"Entities found by en_core_web_lg: {len(doc_lg.ents)}")

large_model_ents = [str(ent) for ent in doc_lg.ents]

in_small_not_large = set(small_model_ents) - set(large_model_ents)
in_large_not_small = set(large_model_ents) - set(small_model_ents)

print(f"\nIn small model only: {in_small_not_large}")
print(f"In large model only: {in_large_not_small}")


Entities found by en_core_web_lg: 33

In small model only: {'6', 'Sina'}
In large model only: {'iPhone 12', 'the day', 'Sina Weibo'}


The large model (`en_core_web_lg`, ~560 MB) ships with 300-dimensional static word vectors, whereas the small model ships without static word vectors and relies only on the token representations it learned during training. This gives the large model better coverage of rare words and entity types. The differences are typically subtle: a few entities recognized by one model but not the other.

In this run both models report 33 entities. The entity texts that appear only in the large model's output are **iPhone 12**, **Sina Weibo**, and **the day**, while **6** and the partial span **Sina** appear only in the small model's output. Neither model is universally better; the choice depends on your accuracy requirements vs. memory constraints.

**Production trade-off:** The small model loads in $\sim$50 MB and processes text faster; the large model uses $\sim$560 MB but provides marginally better NER accuracy. For batch processing where accuracy matters, use the large model. For real-time APIs where latency is critical, the small model is often sufficient.

---

## Recipe 5 — Training Your Own NER Model with spaCy

Pre-trained models cover general entity types (people, organizations, locations), but many domains need **custom entities**. In music, you might want to tag **Artists** and **Works of Art (WoA)**. In legal text, you might need **Case Numbers** and **Statutes**. In biomedical text, **Genes** and **Proteins**.

spaCy's training pipeline lets you create a custom NER model from annotated data. The architecture is the same transition-based parser used by the pre-trained models — we just teach it new entity types.

In [18]:

# 5.1  Load and inspect the music NER dataset

from spacy.tokens import DocBin
from sklearn.model_selection import train_test_split

music_ner_df = pd.read_csv("data/music_ner.csv")
print(f"Total annotations: {len(music_ner_df)}")
print(f"Unique sentences : {music_ner_df['id'].nunique()}")
print(f"Columns: {list(music_ner_df.columns)}")
print()

# Clean labels
def change_label(label):
    return label.replace("_deduced", "")

music_ner_df["label"] = music_ner_df["label"].apply(change_label)

print("Label distribution:")
print(music_ner_df["label"].value_counts())
print()
print(music_ner_df.head(10))


Total annotations: 427
Unique sentences : 226
Columns: ['id', 'text', 'start_offset', 'end_offset', 'label']

Label distribution:
label
WoA              156
Artist           154
Artist_or_WoA     61
Artist_known      47
WoA_known          9
Name: count, dtype: int64

      id                                               text  start_offset  \
0  13434  i love radioheads kid a something similar | ki...             7   
1  13434  i love radioheads kid a something similar | ki...            61   
2  13435                anything similar to i fight dragons            20   
3  13436                music similar to ccrs travelin band            17   
4  13437                 songs similar to blackout by boris            17   
5  13437                 songs similar to blackout by boris            29   
6  13438  similar to zoosters breakout by hans zimmer bu...            11   
7  13438  similar to zoosters breakout by hans zimmer bu...            32   
8  13439                    songs simil

The dataset contains 427 entity annotations across 226 unique sentences about music. Each row represents one entity span: the sentence text, the character offsets (start/end), and the entity label. Sentences with multiple entities appear in multiple rows.
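
Because each annotation is stored as character offsets, slicing the sentence text recovers the entity string. The sketch below uses one sentence from the preview above. The start offsets (17 and 29) appear in the printed table; the end offsets and the labels are assumptions for illustration, since those columns are cut off in the output.

```python
# One sentence from the dataset preview above.
text = "songs similar to blackout by boris"

# (start_offset, end_offset, label). Start offsets come from the preview;
# end offsets (exclusive) and labels are assumed for this illustration.
annotations = [
    (17, 25, "WoA"),
    (29, 34, "Artist"),
]

# Slice the text with each pair of offsets to recover the entity string.
entities = [(text[start:end], label) for start, end, label in annotations]
for surface, label in entities:
    print(f"{surface!r} -> {label}")
```

Running the same slice over every row is a quick sanity check that the offsets in a new annotation file point at the text you expect.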

After cleaning, five labels remain. The three main types are **Artist** (musician/band names), **WoA** (Works of Art: song/album titles), and **Artist_or_WoA** (ambiguous cases); the smaller **Artist_known** and **WoA_known** groups are kept as separate labels. The `_deduced` suffix in some labels indicates entities inferred by annotators rather than directly stated; we strip this suffix for cleaner categories.

With only 226 sentences, this is a **small dataset** by NER standards. We should expect modest performance, but even a small custom model can be useful for bootstrapping: use it to pre-annotate more data, have humans correct the annotations, and retrain iteratively.

In [19]:

# 5.2  Prepare spaCy DocBin training data

label_list_ner = ["Artist", "WoA", "Artist_or_WoA"]

ids = list(set(music_ner_df["id"].values))
train_ids, test_ids = train_test_split(ids, test_size=0.25,
                                       random_state=42)
print(f"Training sentences: {len(train_ids)}")
print(f"Test sentences    : {len(test_ids)}")

train_db = DocBin()
test_db  = DocBin()

skipped = 0
for doc_id in ids:
    entity_rows = music_ner_df[music_ner_df["id"] == doc_id]
    text = entity_rows.iloc[0]["text"]
    doc = small_model.make_doc(text)
    ents = []
    valid = True
    for _, row in entity_rows.iterrows():
        span = doc.char_span(
            row["start_offset"], row["end_offset"],
            label=row["label"], alignment_mode="contract")
        if span is None:
            valid = False
            skipped += 1
            break
        ents.append(span)
    if not valid:
        continue
    try:
        doc.ents = ents
    except ValueError:
        skipped += 1
        continue

    if doc_id in train_ids:
        train_db.add(doc)
    else:
        test_db.add(doc)

train_db.to_disk("data/music_ner_train.spacy")
test_db.to_disk("data/music_ner_test.spacy")

print(f"\nDocBin created: {len(train_db)} train, {len(test_db)} test")
if skipped > 0:
    print(f"Skipped {skipped} entries with alignment issues")


Training sentences: 169
Test sentences    : 57

DocBin created: 169 train, 57 test


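We convert the tabular annotations into spaCy's `DocBin` format. For each sentence, we create a `Doc` object and build the annotated entity spans with `doc.char_span()`. The `alignment_mode="contract"` parameter handles character offsets that do not align with token boundaries: instead of requiring an exact match, it shrinks the span to the tokens that lie entirely inside the given offsets. If no token fits, `char_span()` returns `None`, and the loop skips that sentence. Assigning `doc.ents` raises a `ValueError` when spans overlap, which is why that step is wrapped in `try`/`except`.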

We use `make_doc()` instead of the full `nlp()` pipeline for speed — we only need tokenization, not the full NER/POS/dependency pipeline. The train/test split is done at the **sentence level** (not the entity level) to prevent data leakage: all entities from a given sentence go into either training or test, never both.
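
To see what the alignment modes do, the following pure-Python sketch mimics the behaviour on whitespace tokens. It is an approximation for illustration, not spaCy's implementation: `token_offsets` and `align_span` are hypothetical helpers, and spaCy's real tokenizer splits text differently.

```python
import re

def token_offsets(text):
    """Return (start, end) character offsets of whitespace-separated tokens."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def align_span(text, start, end, mode="strict"):
    """Approximate char-span alignment; returns the covered text or None."""
    tokens = token_offsets(text)
    if mode == "strict":
        # Both offsets must fall exactly on token boundaries.
        starts = {s for s, _ in tokens}
        ends = {e for _, e in tokens}
        if start not in starts or end not in ends:
            return None
    if mode == "expand":
        # Keep every token that overlaps the character range.
        picked = [(s, e) for s, e in tokens if e > start and s < end]
    else:
        # strict and contract: keep tokens lying entirely inside the range.
        picked = [(s, e) for s, e in tokens if s >= start and e <= end]
    if not picked:
        return None
    return text[picked[0][0]:picked[-1][1]]

text = "songs similar to blackout by boris"
print(align_span(text, 17, 25, "strict"))    # offsets match token boundaries
print(align_span(text, 15, 28, "strict"))    # misaligned start offset
print(align_span(text, 15, 28, "contract"))  # shrinks to whole tokens inside
print(align_span(text, 15, 28, "expand"))    # grows to all overlapping tokens
print(align_span(text, 17, 24, "contract"))  # no whole token fits
```

The last call shows why the notebook still checks for `None`: contracting a span that is one character too short leaves no complete token.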

In [20]:

# 5.3  Generate spaCy NER training config

ner_config = """[paths]
train = "data/music_ner_train.spacy"
dev = "data/music_ner_test.spacy"
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["tok2vec","ner"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 96
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,2500,2500,2500]
include_static_vectors = false

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null
before_update = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = true

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0
ents_per_type = null

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]
"""

with open("data/spacy_config_ner.cfg", "w") as f:
    f.write(ner_config.strip())

print("spaCy NER config written to data/spacy_config_ner.cfg")


spaCy NER config written to data/spacy_config_ner.cfg


In [21]:

# 5.4  Train the spaCy NER model

from spacy.cli.train import train as spacy_train

os.makedirs("models/spacy_music_ner", exist_ok=True)

print("Training spaCy NER model (this may take a few minutes)...\n")
spacy_train(
    "data/spacy_config_ner.cfg",
    output_path="models/spacy_music_ner",
    overrides={
        "paths.train": "data/music_ner_train.spacy",
        "paths.dev":   "data/music_ner_test.spacy"
    }
)
print("\nTraining complete.")


Training spaCy NER model (this may take a few minutes)...

ℹ Saving to output directory: models/spacy_music_ner
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     64.18    0.00    0.00    0.00    0.00
 11     200        242.48   4496.09   22.58   28.00   18.92    0.23
 25     400        234.44    382.47   24.63   27.17   22.52    0.25
 43     600        129.44     54.84   18.65   21.95   16.22    0.19
 64     800         68.05     23.57   22.68   26.51   19.82    0.23
 90    1000        140.03     39.05   20.32   25.00   17.12    0.20
122    1200         87.18     21.52   17.35   20.00   15.32    0.17
160    1400        826.20     98.08   24.86   31.08   20.72    0.25
207    1600        418.07     75.94   2

The NER training uses a **transition-based parser** architecture with a Tok2Vec feature extractor. The model reads the sentence token by token and, at each step, chooses an action that builds entity spans incrementally. spaCy's NER uses the BILUO scheme for these actions: BEGIN a multi-token entity, continue INside it, mark its LAST token, tag a single-token entity as a UNIT, or mark the token as OUTside any entity. The loss is computed over these transition decisions.
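
The tagging scheme behind these decisions can be illustrated without spaCy. The sketch below converts token-level entity spans into BILUO tags, the scheme spaCy's training utilities use (for example `spacy.training.offsets_to_biluo_tags`). The helper `spans_to_biluo` and the example annotation are hypothetical, and the helper works on token indices rather than character offsets.

```python
def spans_to_biluo(n_tokens, spans):
    """Convert (start_token, end_token_exclusive, label) spans to BILUO tags."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = f"U-{label}"        # single-token entity
        else:
            tags[start] = f"B-{label}"        # first token
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"        # middle tokens
            tags[end - 1] = f"L-{label}"      # last token
    return tags

tokens = "songs similar to zoosters breakout by hans zimmer".split()
# Assumed annotation for illustration: tokens 3-4 are a work, 6-7 an artist.
spans = [(3, 5, "WoA"), (6, 8, "Artist")]
biluo_tags = spans_to_biluo(len(tokens), spans)
for token, tag in zip(tokens, biluo_tags):
    print(f"{token:<10} {tag}")
```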

The training log reports the tok2vec and NER losses along with entity-level precision, recall, and F1 on the dev set (here, the same file we call the test set). The NER loss spikes early, falls, and then fluctuates rather than decreasing steadily. With only $\sim$170 training sentences, scores are modest: in the portion of the log shown above, dev F1 stays between roughly 17% and 25%. Per-label scores are not printed during training, but labels with very few examples, such as `WoA_known` (9 annotations), give the model little to learn from.

In [22]:

# 5.5  Test the trained NER model

nlp_ner = spacy.load("models/spacy_music_ner/model-last")

# Test on a few examples
test_texts = [
    "music similar to morphine robocobra quartet featuring elements like saxophone prominent bass",
    "I love listening to Abbey Road by The Beatles on vinyl",
    "Have you heard the new album by Radiohead called OK Computer",
]

for text in test_texts:
    doc = nlp_ner(text)
    print(f"Text: {text}")
    if doc.ents:
        for ent in doc.ents:
            print(f"  -> {ent.text} [{ent.label_}]")
    else:
        print("  -> No entities detected")
    print()


Text: music similar to morphine robocobra quartet featuring elements like saxophone prominent bass
  -> morphine [WoA]
  -> robocobra quartet featuring elements [WoA]
  -> saxophone prominent bass [WoA]

Text: I love listening to Abbey Road by The Beatles on vinyl
  -> Abbey Road [WoA]
  -> The Beatles on vinyl [Artist_known]

Text: Have you heard the new album by Radiohead called OK Computer
  -> album [WoA]
  -> Radiohead [Artist_known]
  -> called [WoA]



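The model's predictions will vary with the random seed and training dynamics. In the run above it finds some correct spans (`Abbey Road`, `Radiohead`) but also makes clear boundary and type errors (`The Beatles on vinyl`, `album`, `called`). Note that the training sentences are lowercase, while the second and third test sentences use standard capitalization, so they differ from what the model saw during training. With more training data, performance should improve considerably: NER models typically need **thousands** of annotated sentences to achieve production-quality F1 scores above 80%.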

**Iterative improvement strategy:**
1. Train an initial model on small annotated data (what we did here)
2. Use the model to pre-annotate a larger unlabeled corpus
3. Have human annotators correct the pre-annotations (much faster than annotating from scratch)
4. Retrain on the expanded dataset
5. Repeat until performance plateaus

In [23]:

# 5.6  Evaluate the model with spaCy's evaluate command

from spacy.cli.evaluate import evaluate as spacy_evaluate

print("=== spaCy Evaluation ===")
results = spacy_evaluate(
    "models/spacy_music_ner/model-last",
    "data/music_ner_test.spacy"
)


=== spaCy Evaluation ===


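As called here, `spacy_evaluate` prints nothing (the cell output above is empty); it returns the scores as a dictionary, which we store in `results`. Print that dictionary to see entity-level precision, recall, and F1, including the breakdown by entity type. These are **strict** metrics: a prediction must match the gold span *exactly* (same start, same end, same label) to count as correct. Partial matches score zero.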

This strict evaluation is appropriate for NER because downstream tasks (knowledge base population, information retrieval, question answering) typically need exact entity boundaries. A system that tags "The Beat" instead of "The Beatles" provides incorrect information despite being close.
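
Strict matching is easy to reproduce by hand. The following sketch scores a hypothetical prediction against a gold annotation using exact `(start, end, label)` comparison; `strict_prf` is an illustrative helper, not spaCy's scorer, and the spans are token indices chosen for this example.

```python
def strict_prf(gold, predicted):
    """Entity-level precision, recall, F1 with exact (start, end, label) matching."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Token-index spans for
# "i love listening to abbey road by the beatles on vinyl"
gold = [(4, 6, "WoA"), (7, 9, "Artist")]
# Hypothetical prediction: the work is exact, the artist span runs too far.
predicted = [(4, 6, "WoA"), (7, 11, "Artist")]

precision, recall, f1 = strict_prf(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

The over-long artist span overlaps the gold span almost entirely, yet it counts as both a false positive and a missed entity.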

---

## Recipe 6 — Fine-tuning BERT for NER

**Fine-tuning** a pre-trained language model like BERT means taking a model that already understands language (from pre-training on billions of words) and teaching it a specific task with a small labeled dataset. The pre-trained knowledge transfers: BERT has already learned, for example, which words tend to be names and that "by" often precedes an artist name. Because we use `bert-base-uncased`, which lowercases its input, capitalization is not available as a cue; this matches our lowercase training data.

The key difference from training spaCy from scratch: BERT starts with **deep contextual representations** of every token, built from 12 transformer layers with $\sim$110M parameters. We add a thin classification head on top and fine-tune the entire model on our music NER data.

The model predicts one of 5 IOB labels per token:

| Tag | Meaning | Example |
|-----|---------|---------|
| `O` | Outside any entity | *"music similar to"* |
| `B-Artist` | Begin Artist entity | ***The*** *Beatles* |
| `I-Artist` | Inside Artist entity | *The* ***Beatles*** |
| `B-WoA` | Begin Work of Art | ***Abbey*** *Road* |
| `I-WoA` | Inside Work of Art | *Abbey* ***Road*** |

In [24]:

# 6.1  Load and preprocess IOB-format data

from datasets import (
    Dataset, Features, Value, ClassLabel,
    Sequence, DatasetDict
)
from transformers import AutoTokenizer

# Re-process music NER data for entity lookup
music_ner_df2 = pd.read_csv("data/music_ner.csv")
music_ner_df2["label"] = music_ner_df2["label"].apply(
    lambda x: x.replace("_deduced", ""))
music_ner_df2["text"] = music_ner_df2["text"].apply(
    lambda x: x.replace("|", ","))

ids = list(set(music_ner_df2["id"].values))
docs = {}
for doc_id in ids:
    entity_rows = music_ner_df2[music_ner_df2["id"] == doc_id]
    text = entity_rows.iloc[0]["text"]
    doc = small_model(text)
    ents = []
    for _, row in entity_rows.iterrows():
        span = doc.char_span(
            row["start_offset"], row["end_offset"],
            label=row["label"], alignment_mode="contract")
        if span is not None:
            ents.append(span)
    try:
        doc.ents = ents
    except ValueError:
        pass
    docs[doc.text] = doc

print(f"Processed {len(docs)} documents for entity lookup")


Processed 226 documents for entity lookup


In [25]:

# 6.2  Load IOB data and split into train/test

data_file = "data/music_ner_bio.bio"
tag_mapping = {"O": 0, "B-Artist": 1, "I-Artist": 2,
               "B-WoA": 3, "I-WoA": 4}

with open(data_file) as f:
    data = f.read()

tokens_list = []
ner_tags_list = []
spans_list = []
sentences = data.strip().split("\n\n")

for sentence in sentences:
    if not sentence.strip():
        continue
    words = []
    tags = []
    word_tag_pairs = sentence.strip().split("\n")
    for pair in word_tag_pairs:
        parts = pair.split("\t")
        if len(parts) != 2:
            continue
        word, tag = parts
        words.append(word)
        tags.append(tag_mapping.get(tag, 0))

    sentence_text = " ".join(words)
    this_spans = []
    if sentence_text in docs:
        for ent in docs[sentence_text].ents:
            this_spans.append(f"{ent.label_}: {ent.text}")

    tokens_list.append(words)
    ner_tags_list.append(tags)
    spans_list.append(this_spans)

print(f"Total sentences: {len(tokens_list)}")

# Split
indices = list(range(len(spans_list)))
train_idx, test_idx = train_test_split(indices, test_size=0.1,
                                       random_state=42)

train_tokens = [tokens_list[i] for i in train_idx]
test_tokens  = [tokens_list[i] for i in test_idx]
train_tags   = [ner_tags_list[i] for i in train_idx]
test_tags    = [ner_tags_list[i] for i in test_idx]
train_spans  = [spans_list[i] for i in train_idx]
test_spans   = [spans_list[i] for i in test_idx]

print(f"Training: {len(train_tokens)}  |  Test: {len(test_tokens)}")


Total sentences: 599
Training: 539  |  Test: 60


The IOB (Inside-Outside-Beginning) format is the standard for token-level NER annotations. Each token gets exactly one tag. The `B-` prefix marks the first token of an entity, `I-` marks continuation tokens, and `O` marks non-entity tokens. This scheme handles multi-word entities naturally: "The Beatles" becomes `B-Artist I-Artist`.
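
Going the other way, from tags back to entity spans, is a short loop. The sketch below is a minimal decoder for the IOB scheme described above; `iob_to_spans` is a hypothetical helper, and it treats a stray `I-` tag without a preceding `B-` as the start of a new entity, which is one lenient convention among several.

```python
def iob_to_spans(tokens, tags):
    """Group IOB tags into (entity_text, label) spans."""
    spans, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        # A new entity starts on B-, or on I- whose type differs from the open one.
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != label)
        if tag == "O" or starts_new:
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
        if tag != "O":
            if not current:
                label = tag[2:]
            current.append(token)
    if current:
        spans.append((" ".join(current), label))
    return spans

tokens = "songs similar to abbey road by the beatles".split()
tags = ["O", "O", "O", "B-WoA", "I-WoA", "O", "B-Artist", "I-Artist"]
decoded = iob_to_spans(tokens, tags)
print(decoded)
```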

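We split at 90/10 rather than 75/25 because we have limited data and want to maximize training signal. The trade-off is a test set of only 60 sentences, so the evaluation scores are noisy. The fixed random state makes the split reproducible. Note that this IOB file has 599 sentences and only two entity types (`Artist` and `WoA`), so it is not the same data as the 226-sentence CSV used in Recipe 5.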

In [26]:

# 6.3  Create HuggingFace Dataset objects

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
label_names = ["O", "B-Artist", "I-Artist", "B-WoA", "I-WoA"]

training_df = pd.DataFrame({
    "tokens": train_tokens,
    "ner_tags": train_tags,
    "spans": train_spans
})
test_df_bert = pd.DataFrame({
    "tokens": test_tokens,
    "ner_tags": test_tags,
    "spans": test_spans
})

training_df["text"] = training_df["tokens"].apply(lambda x: " ".join(x))
test_df_bert["text"] = test_df_bert["tokens"].apply(lambda x: " ".join(x))

features = Features({
    "tokens": Sequence(Value("string")),
    "ner_tags": Sequence(ClassLabel(names=label_names)),
    "spans": Sequence(Value("string")),
    "text": Value("string"),
})

training_dataset = Dataset.from_pandas(training_df, features=features)
test_dataset_bert = Dataset.from_pandas(test_df_bert, features=features)
dataset = DatasetDict({
    "train": training_dataset,
    "test": test_dataset_bert
})

print(dataset)


DatasetDict({
    train: Dataset({
        features: ['tokens', 'ner_tags', 'spans', 'text'],
        num_rows: 539
    })
    test: Dataset({
        features: ['tokens', 'ner_tags', 'spans', 'text'],
        num_rows: 60
    })
})


In [28]:

# 6.4  Tokenize and align labels for BERT subword tokenization

def tokenize_adjust_labels(all_samples):
    """Align NER labels with BERT subword tokens."""
    tokenized = tokenizer(all_samples["text"], truncation=True, padding=True)
    all_adjusted_labels = []

    for k in range(len(tokenized["input_ids"])):
        prev_wid = -1
        word_ids = tokenized.word_ids(batch_index=k)
        existing_labels = all_samples["ner_tags"][k]
        i = -1
        adjusted = []
        for wid in word_ids:
            if wid is None:
                adjusted.append(-100)  # special tokens
            elif wid != prev_wid:
                i += 1
                if i < len(existing_labels):
                    adjusted.append(existing_labels[i])
                else:
                    adjusted.append(0)
                prev_wid = wid
            else:
                # subword continuation: copy parent label
                if i < len(existing_labels):
                    adjusted.append(existing_labels[i])
                else:
                    adjusted.append(0)
        all_adjusted_labels.append(adjusted)

    tokenized["labels"] = all_adjusted_labels
    return tokenized

tokenized_dataset = dataset.map(tokenize_adjust_labels, batched=True)
print("Tokenization complete.")
print(f"Train features: {tokenized_dataset['train'].column_names}")


Tokenization complete.
Train features: ['tokens', 'ner_tags', 'spans', 'text', 'input_ids', 'token_type_ids', 'attention_mask', 'labels']


BERT's **WordPiece tokenizer** splits words into subword units: "Radiohead" might become `["radio", "##head"]`. This creates a mismatch with our token-level NER labels, which have one label per original word.

The `tokenize_adjust_labels` function resolves this by:
1. Mapping each subword back to its original word using `word_ids()`
2. Assigning the parent word's label to the first subword
3. Copying the same label to continuation subwords (`##head` gets the same label as `radio`)
4. Assigning `-100` to special tokens (`[CLS]`, `[SEP]`, padding) — this tells PyTorch's cross-entropy loss to **ignore** these positions during backpropagation

This alignment step is critical: without it, the label sequence would not match the length of the subword sequence, and the loss would also be computed on `[CLS]`, `[SEP]`, and padding positions.
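
The alignment logic can be shown in isolation. The sketch below takes a list shaped like the output of `word_ids()` (written out by hand here, so the subword split is an assumption) and indexes the word-level labels by word id. `align_labels` is a hypothetical helper; unlike the function above, it does not rely on a running counter.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss

def align_labels(word_ids, word_labels):
    """Give every subword the label of the word it came from."""
    aligned = []
    for wid in word_ids:
        if wid is None:
            aligned.append(IGNORE_INDEX)       # [CLS], [SEP], padding
        else:
            aligned.append(word_labels[wid])   # first subword and continuations
    return aligned

# Words: "songs by radiohead"; labels use the notebook's ids (0 = O, 1 = B-Artist).
word_labels = [0, 0, 1]
# Assumed tokenization: [CLS] songs by radio ##head [SEP] [PAD]
word_ids = [None, 0, 1, 2, 2, None, None]

aligned = align_labels(word_ids, word_labels)
print(aligned)
```

Indexing by word id only works if the tokenizer received the same word list the labels refer to, which with Hugging Face tokenizers is usually done by passing the token lists with `is_split_into_words=True`.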

In [31]:

# 6.5  Train the fine-tuned BERT NER model

from transformers import (
    AutoModelForTokenClassification,
    TrainingArguments,
    Trainer,
    DataCollatorForTokenClassification
)
from evaluate import load as load_metric

data_collator = DataCollatorForTokenClassification(tokenizer)
metric = load_metric("seqeval")

def compute_metrics(eval_data):
    predictions, labels = eval_data
    predictions = np.argmax(predictions, axis=2)

    # Remove special tokens (label == -100)
    paired = [
        [(p, l) for p, l in zip(pred, lab) if l != -100]
        for pred, lab in zip(predictions, labels)
    ]
    true_preds  = [[label_names[p] for p, l in sent] for sent in paired]
    true_labels = [[label_names[l] for p, l in sent] for sent in paired]

    results = metric.compute(predictions=true_preds,
                             references=true_labels)
    flat = {
        "overall_precision": results["overall_precision"],
        "overall_recall":    results["overall_recall"],
        "overall_f1":        results["overall_f1"],
        "overall_accuracy":  results["overall_accuracy"],
    }
    for k, v in results.items():
        if isinstance(v, dict) and "f1" in v:
            flat[f"{k}_f1"] = v["f1"]
    return flat

model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(label_names))

training_args = TrainingArguments(
    output_dir="./fine_tune_bert_output",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=7,
    weight_decay=0.01,
    logging_steps=1000,
    save_strategy="no",
    report_to="none",        # disable wandb/mlflow
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    data_collator=data_collator,
    processing_class=tokenizer,
    compute_metrics=compute_metrics,
)

print("Training BERT NER model (7 epochs)...\n")
trainer.train()


BertForTokenClassification LOAD REPORT from: bert-base-uncased
Key                                        | Status     | 
-------------------------------------------+------------+-
cls.predictions.transform.LayerNorm.bias   | UNEXPECTED | 
cls.predictions.bias                       | UNEXPECTED | 
bert.pooler.dense.bias                     | UNEXPECTED | 
cls.seq_relationship.weight                | UNEXPECTED | 
cls.predictions.transform.dense.weight     | UNEXPECTED | 
cls.predictions.transform.dense.bias       | UNEXPECTED | 
cls.predictions.transform.LayerNorm.weight | UNEXPECTED | 
bert.pooler.dense.weight                   | UNEXPECTED | 
cls.seq_relationship.bias                  | UNEXPECTED | 
classifier.bias                            | MISSING    | 
classifier.weight                          | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those params were newly initialized be

Training BERT NER model (7 epochs)...



| Epoch | Training Loss | Validation Loss | Overall Precision | Overall Recall | Overall F1 | Overall Accuracy | Artist F1 | WoA F1 |
|-------|---------------|-----------------|-------------------|----------------|------------|------------------|-----------|--------|
| 1 | No log | 0.43374 | 0.175676 | 0.2 | 0.18705 | 0.841187 | 0.234234 | 0.0 |
| 2 | No log | 0.335728 | 0.408451 | 0.446154 | 0.426471 | 0.90925 | 0.510638 | 0.238095 |
| 3 | No log | 0.269069 | 0.532258 | 0.507692 | 0.519685 | 0.91274 | 0.556962 | 0.458333 |
| 4 | No log | 0.230879 | 0.714286 | 0.769231 | 0.740741 | 0.935428 | 0.780488 | 0.679245 |
| 5 | No log | 0.236466 | 0.776119 | 0.8 | 0.787879 | 0.942408 | 0.833333 | 0.708333 |
| 6 | No log | 0.233594 | 0.779412 | 0.815385 | 0.796992 | 0.942408 | 0.85 | 0.716981 |
| 7 | No log | 0.237721 | 0.782609 | 0.830769 | 0.80597 | 0.942408 | 0.870588 | 0.693878 |


TrainOutput(global_step=238, training_loss=0.2509446825299944, metrics={'train_runtime': 1647.0213, 'train_samples_per_second': 2.291, 'train_steps_per_second': 0.145, 'total_flos': 84725729928840.0, 'train_loss': 0.2509446825299944, 'epoch': 7.0})

The fine-tuning process adapts all 110M BERT parameters plus the new classification head to our music NER task. Key hyperparameters:

- **Learning rate $2 \times 10^{-5}$**: much smaller than typical training from scratch ($10^{-3}$) because we are making small adjustments to already-good representations. Too high a learning rate can overwrite the pre-trained knowledge ("catastrophic forgetting").
- **7 epochs**: with only $\sim$540 training sentences and a batch size of 16, each epoch is just 34 optimizer steps (238 in total), so we train for multiple passes to give the model enough updates. Because `logging_steps=1000` exceeds the total number of steps, the Training Loss column shows "No log".
- **Weight decay 0.01**: regularization that shrinks the weights slightly at each step, similar in spirit to L2 regularization, to limit overfitting on the small dataset.
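
The step count reported in the training output (`global_step=238`) follows directly from the dataset size and these hyperparameters. A quick check, assuming a single device and no gradient accumulation:

```python
import math

train_sentences = 539   # from the dataset summary above
batch_size = 16         # per_device_train_batch_size
epochs = 7              # num_train_epochs

# The last batch of each epoch is smaller, so round up.
steps_per_epoch = math.ceil(train_sentences / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)
```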

The `seqeval` metric evaluates at the **entity level** (matching full spans), not the token level; only `overall_accuracy` is a token-level score. Entity-level scoring is the standard for NER evaluation because correct individual tokens are of limited use if the entity boundaries are wrong.

In [32]:

# 6.6  Evaluate the fine-tuned model

print("=== BERT NER Evaluation ===")
eval_results = trainer.evaluate()
for k, v in eval_results.items():
    if isinstance(v, float):
        print(f"  {k}: {v:.4f}")
    else:
        print(f"  {k}: {v}")


=== BERT NER Evaluation ===


  eval_loss: 0.2377
  eval_overall_precision: 0.7826
  eval_overall_recall: 0.8308
  eval_overall_f1: 0.8060
  eval_overall_accuracy: 0.9424
  eval_Artist_f1: 0.8706
  eval_WoA_f1: 0.6939
  eval_runtime: 5.1907
  eval_samples_per_second: 11.5590
  eval_steps_per_second: 0.7710
  epoch: 7.0000


In this run, the fine-tuned BERT model reaches an **overall entity F1 of about 81%**, with the `Artist` label performing best ($\sim$87% F1) and `WoA` more modestly ($\sim$69% F1). This is far above the dev F1 of roughly 17--25% seen in the spaCy training log, and it illustrates the power of **transfer learning**: BERT's pre-trained knowledge about language structure and word meanings gives it a significant head start.

The two numbers are not a controlled comparison, however. The spaCy model was trained from scratch on 169 sentences with five labels, while BERT was fine-tuned on 539 sentences with two entity types, and the two models were evaluated on different test sets (57 and 60 sentences). Part of the gap therefore comes from the data rather than from the pre-trained representations alone.
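
The precision, recall, and F1 in the evaluation output are tied together by simple arithmetic. The reported values are consistent with 54 correct entities out of 69 predicted and 65 gold entities; these counts are inferred from the scores and are not printed by the notebook.

```python
# Counts inferred from the reported scores (an assumption, see above).
correct, predicted, gold = 54, 69, 65

precision = correct / predicted   # share of predicted entities that are right
recall = correct / gold           # share of gold entities that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

With a test set this small, a single additional correct or incorrect entity moves each score by more than a percentage point.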

In [33]:

# 6.7  Save and test on new text

trainer.save_model("models/bert_fine_tuned")
print("Model saved to models/bert_fine_tuned")


Model saved to models/bert_fine_tuned


In [34]:

# 6.8  Run inference with a pipeline

from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

model_loaded = AutoModelForTokenClassification.from_pretrained(
    "models/bert_fine_tuned")
tokenizer_loaded = AutoTokenizer.from_pretrained(
    "models/bert_fine_tuned")

pipe = pipeline(
    task="token-classification",
    model=model_loaded.to("cpu"),
    tokenizer=tokenizer_loaded,
    aggregation_strategy="simple"
)

test_texts = [
    "music similar to morphine robocobra quartet featuring elements like saxophone prominent bass",
    "i really enjoy listening to abbey road by the beatles",
    "have you heard ok computer by radiohead it is a masterpiece",
]

id2label = {0: "O", 1: "B-Artist", 2: "I-Artist", 3: "B-WoA", 4: "I-WoA"}

for text in test_texts:
    print(f"Text: {text}")
    results = pipe(text)
    for r in results:
        label = r["entity_group"]
        # Map LABEL_N to readable names
        if label.startswith("LABEL_"):
            idx = int(label.split("_")[1])
            label = id2label.get(idx, label)
            # Convert B-/I- to base label
            if label.startswith(("B-", "I-")):
                label = label[2:]
        print(f"  -> {r['word']:<30} [{label}]  (score: {r['score']:.3f})")
    print()


Text: music similar to morphine robocobra quartet featuring elements like saxophone prominent bass
  -> music similar to               [O]  (score: 0.999)
  -> morphine roboco                [Artist]  (score: 0.834)
  -> ##bra quartet                  [Artist]  (score: 0.456)
  -> featuring elements like saxophone prominent bass [O]  (score: 0.999)

Text: i really enjoy listening to abbey road by the beatles
  -> i really enjoy listening to    [O]  (score: 0.998)
  -> abbey                          [WoA]  (score: 0.906)
  -> road                           [WoA]  (score: 0.714)
  -> by                             [O]  (score: 0.997)
  -> the                            [Artist]  (score: 0.964)
  -> beatles                        [Artist]  (score: 0.915)

Text: have you heard ok computer by radiohead it is a masterpiece
  -> have you heard                 [O]  (score: 0.812)
  -> ok                             [WoA]  (score: 0.950)
  -> computer                       [WoA]  (score: 0.964)

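With `aggregation_strategy="simple"`, the pipeline merges adjacent tokens that carry the same predicted label into one group and averages their scores. Because the model was saved without label names, the pipeline sees only the generic labels `LABEL_0` to `LABEL_4`, so it cannot tell that `B-WoA` and `I-WoA` belong together: "abbey" and "road" come out as separate groups, and `O` stretches are reported as well. The `simple` strategy also works on subword tokens rather than whole words, which is why "robocobra" is split into "roboco" and "##bra". Passing `id2label` and `label2id` when creating the model would let the pipeline merge `B-`/`I-` tags into single entity spans and drop `O` tokens.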
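
The inference code above maps `LABEL_N` names back to tags by hand. A minimal sketch of the alternative, assuming the same `label_names` list as in section 6.3, is to build the mappings once and store them in the model configuration. `base_type` is a hypothetical helper, and the `from_pretrained` call is shown only as a comment because it needs the model files.

```python
label_names = ["O", "B-Artist", "I-Artist", "B-WoA", "I-WoA"]

# Mappings between class ids and tag names.
id2label = dict(enumerate(label_names))
label2id = {name: idx for idx, name in id2label.items()}

def base_type(tag):
    """Strip the B-/I- prefix so both tags map to one entity type."""
    return tag[2:] if tag.startswith(("B-", "I-")) else tag

print(id2label)
print(base_type(id2label[3]), base_type(id2label[4]))

# These dictionaries can be passed when the model is created, e.g.
# AutoModelForTokenClassification.from_pretrained(
#     "bert-base-uncased", num_labels=len(label_names),
#     id2label=id2label, label2id=label2id)
```

With the names stored in the configuration, a saved model carries its own label vocabulary, so downstream code does not need a separate lookup table.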

Notice how the BERT model uses contextual clues: in "abbey road by the beatles," the word "by" suggests that what follows is likely an artist and what precedes is likely a work of art. A fixed pattern could capture this one construction, but BERT's bidirectional attention lets it weigh such cues across many different phrasings without hand-written rules.

---

## Summary and Key Takeaways

This chapter progressed from simple pattern matching to deep transfer learning, covering the full spectrum of information extraction techniques:

**1. Regex is fast but fragile.** For well-structured patterns (emails, URLs, phone numbers), regex is unbeatable in speed and simplicity. But it cannot handle ambiguity, context, or variation — the moment you need to understand *meaning*, you need something more.

**2. String similarity handles the messy real world.** Levenshtein distance, Jaro, and Jaro-Winkler similarity provide graceful degradation in the face of typos and misspellings. In production, combine exact matching with fuzzy matching: try exact first, fall back to fuzzy with a confidence threshold.

**3. TF-IDF keyword extraction is a powerful unsupervised baseline.** No labeled data needed, scales to millions of documents, and the n-gram + noun chunk variant produces surprisingly readable keyword sets. Use it for content tagging, search indexing, and exploratory analysis.

**4. Pre-trained NER models cover common entity types.** spaCy's models handle people, organizations, locations, and dates well enough for many applications. Always evaluate on your specific domain before deploying — accuracy on news text does not guarantee accuracy on medical records or legal contracts.

**5. Transfer learning is the key to custom NER.** Fine-tuning BERT on just 539 sentences reached $\sim$81% entity F1 in our run, far above the spaCy model trained from scratch (dev F1 of roughly 17--25% in the training log shown, on a smaller dataset with more labels). This is the most important practical lesson: when you have limited labeled data, start with a pre-trained model and fine-tune.

**6. Data quantity is the bottleneck.** Both the spaCy and BERT NER models are limited by their small training sets (169 and 539 sentences, respectively). In production, invest in annotation tooling and iterative data collection: the model architecture matters less than the quality and quantity of your training data.