# Central Bank-BERT for Named Entity Recognition (NER)

## Description

A **domain-adapted BERT model (Central Bank-BERT)** was fine-tuned for **Named Entity Recognition (NER)** in central banking discourse. The model automatically identifies and labels key entities in central bank speeches and related documents, focusing on three categories of interest:

* **AUTHOR / SPEAKER** – the individual delivering the speech or statement
* **POSITION** – the official title or role of the speaker (e.g., Governor, Deputy Governor, Board Member)
* **AFFILIATION** – the institution or organization associated with the speaker (e.g., Bank of Japan, European Central Bank, Bank of England)

**COUNTRY** is not one of the three target categories, since the country can be reliably **inferred from the central bank affiliation**.
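
As a minimal sketch of that inference step, a lookup table keyed on the normalized affiliation can supply the country. The mapping and the `country_from_affiliation` helper below are illustrative assumptions, not part of the released pipeline; a real table would need an entry for every institution in the corpus.

```python
from typing import Optional

# Hypothetical lookup: normalized affiliation -> country (illustrative entries only).
AFFILIATION_TO_COUNTRY = {
    "bank of japan": "Japan",
    "bank of england": "United Kingdom",
    "european central bank": "Euro area",
}

def country_from_affiliation(affiliation: str) -> Optional[str]:
    """Return the country for a predicted affiliation, or None if it is unknown."""
    # Lowercase and collapse whitespace so minor formatting differences still match.
    key = " ".join(affiliation.lower().split())
    return AFFILIATION_TO_COUNTRY.get(key)

print(country_from_affiliation("Bank of  Japan"))   # Japan
print(country_from_affiliation("Unknown Bank"))     # None
```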

---

## Data

* **Source**: **BIS database of central bank speeches (1996–2024)**
* **Corpus Size**: 19,609 speeches, split into 17,648 for training and 1,961 held out for validation.
* **Input Field**: *Speech descriptions*, which typically contain a short speech title along with the name, position, and institutional affiliation of the speaker.

**Annotation Process**:

1. A subset of short speech descriptions was **manually annotated** with entity spans for Author, Position, and Affiliation.
2. This annotated subset was used to **train an initial NER model**.
3. The model was then applied to the larger dataset (1996–2024) to generate preliminary labels.
4. All generated labels were **manually reviewed and corrected**, ensuring complete and consistent annotation across the entire corpus of available speeches.

This approach combined **manual expertise** with **machine-assisted annotation**, making it feasible to build a large-scale, high-quality dataset covering nearly three decades of central bank communication.

---

## Data Preparation

1. **Normalization**: Lowercasing, removal of diacritics, and unification of punctuation.
2. **Alias resolution**: Institution abbreviations normalized (e.g., “BOJ” → “Bank of Japan”, “ECB” → “European Central Bank”).
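3. **Entity alignment**: Annotated entities are located in the normalized description by exact substring match first, then by sliding-window fuzzy matching (RapidFuzz `token_set_ratio`, threshold 92).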
4. **BIO Encoding**:

   * Tokenization with the *BERT WordPiece tokenizer* (`bert-base-uncased`).
   * Conversion of annotations into **BIO tags** (`B-`, `I-`, `O`) at token level.
   * Construction of a training file in **JSONL format** with `tokens` and `ner_tags`.
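
The span-to-BIO step can be sketched as follows. This is a simplified illustration, not the notebook's code: a regex tokenizer stands in for WordPiece, spans are found by exact `str.find`, and `char_spans_to_bio` is a hypothetical helper name. As in the notebook, a token is only labeled while it is still `O`, so earlier spans win on overlap.

```python
import json
import re

def char_spans_to_bio(text, spans):
    """Convert (start, end, label) character spans into token-level BIO tags."""
    # Each token keeps its character offsets, playing the role of offset_mapping.
    tokens = [(m.group(), m.start(), m.end())
              for m in re.finditer(r"\w+|[^\w\s]", text)]
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        # Indices of tokens that overlap the character span.
        idx = [i for i, (_, s, e) in enumerate(tokens) if s < end and e > start]
        for n, i in enumerate(idx):
            if tags[i] == "O":  # earlier spans win on overlap
                tags[i] = ("B-" if n == 0 else "I-") + label
    return [t for t, _, _ in tokens], tags

text = "mr yi gang, governor of the people's bank of china"
spans = []
for phrase, label in [("yi gang", "AUTHOR"),
                      ("governor", "POSITION"),
                      ("people's bank of china", "AFFILIATION")]:
    start = text.find(phrase)
    spans.append((start, start + len(phrase), label))

tokens, tags = char_spans_to_bio(text, spans)
# One JSONL record with the same two fields as the training file.
print(json.dumps({"tokens": tokens, "ner_tags": tags}))
```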

---

## Model Training

* **Base model**: [`bilalzafar/CentralBank-BERT`](https://huggingface.co/bilalzafar/CentralBank-BERT), a domain-adapted BERT trained on central banking corpora.
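* **Task head**: Token classification layer with `num_labels = 7` (`O` plus `B-`/`I-` tags for Author, Position, and Affiliation). The notebook sets `num_labels=len(tags)`, and the logged run, which also tagged COUNTRY, used 9 labels.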
* **Token alignment**: Word-to-token mapping with subword label propagation (`-100` used for ignored positions).
* **Training setup**:

  * Optimizer: AdamW with weight decay `0.01`
  * Learning rate: `2e-5`
  * Batch size: `16` (train & eval)
  * Epochs: `3`
  * Mixed precision (`fp16`) when available
  * Evaluation with `seqeval` metrics (precision, recall, F1)
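
The token-alignment rule can be illustrated without loading a tokenizer. The sketch below assumes a `word_ids` list shaped like the one a fast tokenizer's `word_ids()` returns (`None` for special and padding tokens); the label ids and the `align_labels` name are hypothetical.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Give every subword the label id of its source word; ignore special tokens."""
    return [ignore_index if wid is None else word_labels[wid] for wid in word_ids]

# Layout: [CLS] yi gang govern ##or [SEP] [PAD]
# "governor" (word 2) is assumed to split into two subwords for illustration.
word_ids = [None, 0, 1, 2, 2, None, None]
word_labels = [1, 5, 3]  # hypothetical ids for B-AUTHOR, I-AUTHOR, B-POSITION

print(align_labels(word_ids, word_labels))  # [-100, 1, 5, 3, 3, -100, -100]
```

Positions set to `-100` are skipped by the cross-entropy loss. A common alternative labels only the first subword of each word and sets the remaining subwords to `-100`.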

---

## Results

The model was trained on **17,648 annotated speeches** with a **1,961-speech validation set**. Evaluation metrics are reported using **entity-level precision, recall, and F1-score** from the `seqeval` library.

**Final Validation Performance (Epoch 3):**

| Entity Type     | Precision  | Recall     | F1-score   | Support |
| --------------- | ---------- | ---------- | ---------- | ------- |
| **Affiliation** | 0.9850     | 0.9862     | 0.9856     | 1,734   |
| **Author**      | 0.9816     | 0.9912     | 0.9864     | 1,936   |
| **Position**    | 0.9735     | 0.9846     | 0.9790     | 1,942   |
| **Overall**     | **0.9798** | **0.9862** | **0.9830** | —       |

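* **Accuracy (token-level):** 0.9956
* **Overall F1 (micro-averaged by `seqeval`):** 0.983
* The **Overall** row comes from the logged run, whose micro-average also counts the 333 COUNTRY entities tagged in the training notebook.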

The results show **high precision and recall across all three categories**, confirming that the model provides reliable structured metadata extraction from central bank communications.
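
The overall figures can be cross-checked from the table itself. The sketch below recomputes F1 as the harmonic mean of the overall precision and recall (the pooled, micro-style score) and, for comparison, the unweighted mean of the three listed per-type F1 scores. All inputs are copied from the table; nothing is re-evaluated.

```python
# Values copied from the validation table above.
f1_by_type = {"Affiliation": 0.9856, "Author": 0.9864, "Position": 0.9790}
overall_precision, overall_recall = 0.9798, 0.9862

# Pooled (micro-style) F1: harmonic mean of overall precision and recall.
micro_f1 = 2 * overall_precision * overall_recall / (overall_precision + overall_recall)

# Macro F1 over the three listed types: unweighted mean of per-type F1.
macro_f1 = sum(f1_by_type.values()) / len(f1_by_type)

print(f"pooled F1 = {micro_f1:.4f}")                 # pooled F1 = 0.9830
print(f"macro F1 (listed types) = {macro_f1:.4f}")   # macro F1 (listed types) = 0.9837
```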




In [None]:
!pip -q install -U transformers datasets evaluate seqeval rapidfuzz

In [None]:
# ==========================================================
#  Build BIO file
# ==========================================================
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

import pandas as pd, re, unicodedata, json, os
from rapidfuzz import fuzz, process
from transformers import AutoTokenizer
from tqdm.auto import tqdm

# ---------- paths ----------------------------------------------------
BASE = "/content/drive/MyDrive/CB-BERT-NER"
CSV  = os.path.join(BASE, "bis_speeches_data.csv")
OUT  = os.path.join(BASE, "ner_train.jsonl")

cols = ["description", "author", "affiliation", "position", "country"]
df   = pd.read_csv(CSV, usecols=cols)

# ---------- normalisation helpers -----------------------------------
def strip_acc(s):
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if unicodedata.category(c) != "Mn")

def norm(s):
    if pd.isna(s): return ""
    s = strip_acc(str(s))
    s = re.sub(r"[‐-–—]", "-", s)
    s = re.sub(r"[‘’´`]",  "'", s)
    return re.sub(r"\s+", " ", s.lower()).strip(" ,.;:")

for c in cols:
    df[c+"_n"] = df[c].apply(norm)

affil_alias = {
    "boj": "bank of japan",
    "ecb": "european central bank",
    "boe": "bank of england",
}
country_alias = {
    "uk": "united kingdom",
    "u.k.": "united kingdom",
    "us": "united states",
    "u.s.": "united states",
}
def alias(s):
    return affil_alias.get(s) or country_alias.get(s) or s

# ---------- fuzzy span finder ---------------------------------------
def span_fuzzy(entity, raw, thr=92):
    if not entity or pd.isna(entity): return None
    ent = alias(norm(entity))
    txt = norm(raw)

    # direct
    i = txt.find(ent)
    if i >= 0: return i, i+len(ent)

    # sliding-window fuzzy
    window = [txt[i:j]
              for i in range(len(txt))
              for j in range(i+len(ent)-3, i+len(ent)+3)
              if j <= len(txt)]
    best, score, _ = process.extractOne(ent, window,
                                        scorer=fuzz.token_set_ratio) or ("",0,None)
    if score >= thr:
        k = txt.find(best)
        if k >= 0: return k, k+len(best)
    return None

# ---------- tokenizer & BIO labeller --------------------------------
tok = AutoTokenizer.from_pretrained("bert-base-uncased")

def bio_from_row(r):
    text = norm(r["description"])
    enc  = tok(text, add_special_tokens=False, return_offsets_mapping=True)
    tokens, offs = enc.tokens(), enc["offset_mapping"]
    labs = ["O"] * len(tokens)

    def tag(col, tag_name):
        span = span_fuzzy(r[col], r["description"])
        if not span: return
        cs, ce = span
        t_idx = [i for i,(s,e) in enumerate(offs) if not (e<=cs or s>=ce)]
        if not t_idx: return
        # label only where still "O"
        if labs[t_idx[0]] == "O":
            labs[t_idx[0]] = f"B-{tag_name}"
        for i in t_idx[1:]:
            if labs[i] == "O":
                labs[i] = f"I-{tag_name}"

    # order matters: earlier tags win
    tag("author",      "AUTHOR")
    tag("position",    "POSITION")
    tag("affiliation", "AFFILIATION")
    tag("country",     "COUNTRY")
    return tokens, labs

# ---------- build JSONL ---------------------------------------------
recs = []
for _, row in tqdm(df.iterrows(), total=len(df), desc="Building BIO"):
    t, l = bio_from_row(row)
    recs.append({"tokens": t, "ner_tags": l})

with open(OUT, "w") as f:
    for r in recs:
        f.write(json.dumps(r) + "\n")

print(f"✅  Wrote {len(recs)} sentences → {OUT}")


Mounted at /content/drive



✅  Wrote 19609 sentences → /content/drive/MyDrive/CB-BERT-NER/ner_train.jsonl
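
Because `tag()` only labels tokens that are still `O`, overlapping spans can in principle leave an `I-` tag that does not follow a `B-` or `I-` tag of the same type. A quick sanity check over the written records might look like the sketch below; `bio_violations` is a hypothetical helper that is not part of the original notebook, and the sample records are made up for illustration.

```python
def bio_violations(tags):
    """Return indices where an I- tag does not continue an entity of the same type."""
    bad = []
    prev = "O"
    for i, tag in enumerate(tags):
        # An I-X tag is valid only directly after B-X or I-X.
        if tag.startswith("I-") and prev[2:] != tag[2:]:
            bad.append(i)
        prev = tag
    return bad

# Made-up records in the same shape as the lines of ner_train.jsonl.
records = [
    {"tokens": ["mr", "yi", "gang"], "ner_tags": ["O", "B-AUTHOR", "I-AUTHOR"]},
    {"tokens": ["deputy", "governor"], "ner_tags": ["B-AUTHOR", "I-POSITION"]},
]
for n, rec in enumerate(records):
    print(n, bio_violations(rec["ner_tags"]))
```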


In [None]:
# ---------------------------------------------------------------
#  Fine-tune CentralBank-BERT-MLM for NER  (AUTHOR / AFFILIATION / POSITION)
# ---------------------------------------------------------------
# !pip -q install -U transformers datasets evaluate seqeval

from google.colab import drive
drive.mount("/content/drive", force_remount=True)

import os, json, numpy as np, evaluate, torch
from datasets import Dataset
from transformers import (
    AutoTokenizer, AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    TrainingArguments, Trainer
)

BASE_DIR   = "/content/drive/MyDrive/CB-BERT-NER"
JSONL_PATH = os.path.join(BASE_DIR, "ner_train.jsonl")
MODEL_NAME = "bilalzafar/cb-bert-mlm"  # earlier repo name; the model is now published as CentralBank-BERT
OUT_DIR    = os.path.join(BASE_DIR, "cb-bert-ner")

# 1. Load BIO jsonl ➜ HuggingFace Dataset
print("🔍 Reading", JSONL_PATH)
with open(JSONL_PATH) as fh:
    data = [json.loads(l) for l in fh]
ds = Dataset.from_list(data)

# 2. Tag maps
tags   = sorted({t for ex in ds for t in ex["ner_tags"]})
tag2id = {t:i for i,t in enumerate(tags)}
id2tag = {i:t for t,i in tag2id.items()}
print("🏷 Tags:", tags)

ds = ds.map(lambda e: {"ner_tags": [tag2id[t] for t in e["ner_tags"]]})
split = ds.train_test_split(test_size=0.1, seed=42)
train_ds, val_ds = split["train"], split["test"]
print(f"📊  train={len(train_ds)}  val={len(val_ds)}")

# 3. Tokeniser + correct alignment
tok = AutoTokenizer.from_pretrained(MODEL_NAME)

def tokenize_and_align(examples):
    enc = tok(
        examples["tokens"],
        is_split_into_words=True,
        truncation=True,
        padding="max_length",
        max_length=128,
    )

    aligned = []
    for i in range(len(enc["input_ids"])):               # each sentence
        word_ids    = enc.word_ids(batch_index=i)
        word_labels = examples["ner_tags"][i]
        cur = []
        for wid in word_ids:
            if wid is None:
                # special tokens and padding are ignored by the loss
                cur.append(-100)
            else:
                # every subword inherits the label of its source word
                cur.append(word_labels[wid])
        aligned.append(cur)

    enc["labels"] = aligned
    return enc

cols = train_ds.column_names
train_ds = train_ds.map(tokenize_and_align, batched=True, remove_columns=cols)
val_ds   = val_ds.map  (tokenize_and_align, batched=True, remove_columns=cols)

# 4. Model
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(tags),
    id2label=id2tag,
    label2id=tag2id
)

# 5. Seqeval metric
seqeval = evaluate.load("seqeval")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, -1)

    true_preds, true_labels = [], []
    for p, l in zip(preds, labels):
        tp, tl = [], []
        for pi, li in zip(p, l):
            if li != -100:
                tp.append(id2tag[pi])
                tl.append(id2tag[li])
        true_preds.append(tp)
        true_labels.append(tl)
    return seqeval.compute(predictions=true_preds, references=true_labels)

# 6. Trainer
args = TrainingArguments(
    OUT_DIR,
    eval_strategy      ="epoch",
    save_strategy      ="epoch",
    learning_rate      =2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size =16,
    num_train_epochs   =3,
    weight_decay       =0.01,
    fp16               =torch.cuda.is_available(),
    report_to          ="none",
    logging_steps      =250,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    tokenizer=tok,
    data_collator=DataCollatorForTokenClassification(tok),
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.save_model(OUT_DIR)
tok.save_pretrained(OUT_DIR)
print("✅ Model saved to", OUT_DIR)


Mounted at /content/drive
🔍 Reading /content/drive/MyDrive/CB-BERT-NER/ner_train.jsonl
🏷 Tags: ['B-AFFILIATION', 'B-AUTHOR', 'B-COUNTRY', 'B-POSITION', 'I-AFFILIATION', 'I-AUTHOR', 'I-COUNTRY', 'I-POSITION', 'O']


📊  train=17648  val=1961


Some weights of BertForTokenClassification were not initialized from the model checkpoint at bilalzafar/cb-bert-mlm and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Epoch | Training Loss | Validation Loss | Overall Precision | Overall Recall | Overall F1 | Overall Accuracy |
| ----- | ------------- | --------------- | ----------------- | -------------- | ---------- | ---------------- |
| 1     | 0.0334        | 0.023597        | 0.975284          | 0.982338       | 0.978798   | 0.994313         |
| 2     | 0.0216        | 0.0193          | 0.980412          | 0.985029       | 0.982715   | 0.994991         |
| 3     | 0.0152        | 0.017772        | 0.979779          | 0.986207       | 0.982983   | 0.995555         |

Per-entity precision / recall / F1 by epoch (rounded to four decimals; `n` is the entity count in the validation set):

| Epoch | Affiliation (n = 1,734)  | Author (n = 1,936)       | Country (n = 333)        | Position (n = 1,942)     |
| ----- | ------------------------ | ------------------------ | ------------------------ | ------------------------ |
| 1     | 0.9810 / 0.9850 / 0.9830 | 0.9821 / 0.9892 / 0.9856 | 0.9424 / 0.9339 / 0.9382 | 0.9690 / 0.9815 / 0.9752 |
| 2     | 0.9810 / 0.9850 / 0.9830 | 0.9831 / 0.9907 / 0.9869 | 0.9846 / 0.9610 / 0.9726 | 0.9765 / 0.9835 / 0.9800 |
| 3     | 0.9850 / 0.9862 / 0.9856 | 0.9816 / 0.9912 / 0.9864 | 0.9787 / 0.9670 / 0.9728 | 0.9735 / 0.9846 / 0.9790 |


✅ Model saved to /content/drive/MyDrive/CB-BERT-NER/cb-bert-ner


In [None]:
from transformers import pipeline

# path to the folder you just saved
model_dir = "/content/drive/MyDrive/CB-BERT-NER/cb-bert-ner"

ner = pipeline(
    "token-classification",
    model     = model_dir,
    tokenizer = model_dir,
    aggregation_strategy="simple"
)

text = "Speech by Mr Yi Gang, Governor of the People's Bank of China, at the IMF Annual Meeting."
for ent in ner(text):
    print(f"{ent['entity_group']:12}  {ent['word']:<25}  score={ent['score']:.3f}")


Device set to use cuda:0


AUTHOR        yi gang                    score=0.997
POSITION      governor                   score=0.999
AFFILIATION   people ' s bank of china   score=0.999
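
The aggregated pipeline output still carries WordPiece spacing around punctuation (`people ' s`). A small post-processing step, sketched below, can tidy the strings and collect one value per entity type. `tidy` and `entities_to_record` are hypothetical helpers; the sample list mirrors the printed output above rather than calling the model.

```python
import re

def tidy(word: str) -> str:
    """Remove the spaces that token decoding leaves around apostrophes and hyphens."""
    return re.sub(r"\s*(['-])\s*", r"\1", word)

def entities_to_record(entities):
    """Keep the highest-scoring span per entity type as a flat dict."""
    best = {}
    for ent in entities:
        label = ent["entity_group"].lower()
        if label not in best or ent["score"] > best[label][1]:
            best[label] = (tidy(ent["word"]), float(ent["score"]))
    return {label: word for label, (word, _) in best.items()}

# Same fields as the dicts returned with aggregation_strategy="simple".
sample = [
    {"entity_group": "AUTHOR", "word": "yi gang", "score": 0.997},
    {"entity_group": "POSITION", "word": "governor", "score": 0.999},
    {"entity_group": "AFFILIATION", "word": "people ' s bank of china", "score": 0.999},
]
print(entities_to_record(sample))
```

Collapsing spaces around every hyphen is a simplification; text that uses a spaced hyphen as a separator would need a narrower rule.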
