# Certificate Field Extraction via Token Classification

**Purpose:**  
Fine-tune a SciBERT token-classification model to extract key fields from text versions of incorporation certificates:

- **Company Name**  
- **Date**  
- **Document Type**  
- **Preferred Stocks**  
- **Priority Order**  
- **Liquidation Value**

**Inputs:**  
- Metadata CSVs:  
  - `VC Research (Batch 2) - Batch 2 Main.csv`  
  - `VC Research (Batch 2) - Key for Data.csv`  
- Plain-text files in `Batch2_text_readable/`

**Outputs:**  
1. A trained SciBERT token-classification model.  
2. Quantitative evaluation metrics (precision, recall, F1).  
3. Qualitative NER output via a Transformers pipeline.  
4. DataFrame comparing ground-truth vs. predicted field values.

---

## Table of Contents

1. [Environment Setup & Imports](#setup)  
2. [Paths & Configuration](#config)  
3. [Data Loading & Filtering](#load)  
4. [Span Generation for Ground Truth](#spans)  
5. [Tokenization & Dataset Preparation](#dataset)  
6. [Model Initialization & Training](#train)  
7. [Evaluation & Qualitative Inference](#eval)  
8. [Compare True vs. Predicted Fields](#compare)  
9. [Next Steps & Extensions](#next)


In [2]:
# 1. Environment Setup & Imports

import re
from pathlib import Path

import pandas as pd               # for DataFrame operations
import torch                      # for tensors and model operations
from torch.utils.data import Dataset as TorchDataset

from sklearn.model_selection import train_test_split
from collections import defaultdict

# Hugging Face Transformers for token-classification
from transformers import (
    AutoConfig,
    AutoTokenizer,
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
    pipeline
)

# Configure logging (optional)
import logging
logging.basicConfig(level=logging.WARNING)


In [3]:
# 2. Paths & Configuration

# Paths to metadata CSVs
MAIN_CSV   = Path(r"D:\vc-research\vc-research\VC Research (Batch 2) - Batch 2 Main.csv")
KEY_CSV    = Path(r"D:\vc-research\vc-research\VC Research (Batch 2) - Key for Data.csv")

# Directory of readable text files from Batch 2
TXT_DIR    = Path(r"D:\vc-research\vc-research\Batch2_text_readable")

# Pretrained model for token classification
MODEL_NAME = "allenai/scibert_scivocab_uncased"

In [4]:
# 3. Data Loading & Filtering

# Load main metadata file
df_main = pd.read_csv(MAIN_CSV)
df_main.columns = df_main.columns.str.strip()

def find_filename_column(cols):
    """
    Identify which column contains filenames; default to 'File Name'
    or any column containing 'file' in its name.
    """
    if 'File Name' in cols:
        return 'File Name'
    for c in cols:
        if 'file' in c.lower():
            return c
    if 'Unnamed: 0' in cols:
        return 'Unnamed: 0'
    raise KeyError(f"No filename column found. Columns: {cols}")

# Create a 'fname' column matching our .txt filenames
file_col = find_filename_column(df_main.columns)
df_main['fname'] = (
    df_main[file_col]
      .astype(str)
      .str.strip()
      .apply(lambda x: Path(x).stem + '.txt')
)

# Read in all text files into a dict: { filename : full_text }
txt_files = list(TXT_DIR.glob("*.txt"))
texts = {p.name: p.read_text(encoding='utf-8', errors='ignore') for p in txt_files}

# Filter metadata to only those with an existing text file
df = df_main[df_main['fname'].isin(texts)]


In [5]:
# 4. Span Generation for Ground Truth

def find_span(text: str, value: str):
    """
    Return (start, end) indices of the first occurrence of `value` in `text`,
    or None if not found.
    """
    idx = text.find(value)
    return (idx, idx + len(value)) if idx >= 0 else None

# Prepare training examples: each with text + lists of character spans + labels
examples = []
FIELDS = [
    'Company Name', 'Date', 'Document Type',
    'Preferred Stocks', 'Priority Order', 'Liquidation Value'
]

for _, row in df.iterrows():
    doc_text = texts[row['fname']]
    span_starts, span_ends, span_labels = [], [], []

    for field in FIELDS:
        value = row.get(field)
        if pd.isna(value):
            continue
        # Handle comma-separated multiple values
        tokens = str(value).split(',') if ',' in str(value) else [str(value)]
        for tok in tokens:
            tok = tok.strip()
            span = find_span(doc_text, tok)
            if span:
                s, e = span
                span_starts.append(s)
                span_ends.append(e)
                # Replace spaces in label with underscore for BIO tagging
                span_labels.append(field.replace(' ', '_'))

    # Only include docs where at least one span was found
    if span_starts:
        examples.append({
            'text': doc_text,
            'span_starts': span_starts,
            'span_ends': span_ends,
            'span_labels': span_labels
        })


In [6]:
# 5. Tokenization & Dataset Preparation

# Load model config to get max sequence length
config    = AutoConfig.from_pretrained(MODEL_NAME)
max_len   = config.max_position_embeddings or 512

# Initialize tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Build label vocabulary in BIO format
unique_fields = sorted({lbl for ex in examples for lbl in ex['span_labels']})
bio_labels    = ['O'] + [f"{p}-{fld}" for fld in unique_fields for p in ('B','I')]
label2id      = {lab: i for i, lab in enumerate(bio_labels)}
id2label      = {i: lab for lab, i in label2id.items()}

# Tokenize and align labels to token offsets
encodings = []
for ex in examples:
    enc = tokenizer(
        ex['text'],
        padding='max_length',
        truncation=True,
        max_length=max_len,
        return_offsets_mapping=True
    )
    offsets = enc.pop('offset_mapping')
    labels = [label2id['O']] * max_len

    # Assign B- and I-labels to tokens overlapping each ground-truth span
    for start, end, fld in zip(ex['span_starts'], ex['span_ends'], ex['span_labels']):
        for i, (off_s, off_e) in enumerate(offsets):
            if off_e <= start:
                continue
            if off_s >= end:
                break
            tag = 'B' if off_s == start else 'I'
            labels[i] = label2id[f"{tag}-{fld}"]

    enc['labels'] = labels
    encodings.append(enc)

# Custom Dataset wrapping our encodings
class NERDataset(TorchDataset):
    def __init__(self, encs): 
        self.encs = encs
    def __len__(self): 
        return len(self.encs)
    def __getitem__(self, idx): 
        return {k: torch.tensor(v) for k, v in self.encs[idx].items()}

# Split into train/eval
train_encs, eval_encs = train_test_split(encodings, test_size=0.1, random_state=42)
train_dataset = NERDataset(train_encs)
eval_dataset  = NERDataset(eval_encs)


config.json:   0%|          | 0.00/385 [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


vocab.txt: 0.00B [00:00, ?B/s]

In [7]:
# 6. Model Initialization & Training

# Load pre-trained SciBERT for token classification
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(bio_labels),
    id2label=id2label,
    label2id=label2id
)

# Collator that pads inputs & labels
data_collator = DataCollatorForTokenClassification(tokenizer)

# Training hyperparameters
training_args = TrainingArguments(
    output_dir='out_ner',
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    logging_dir='logs',
    logging_steps=50,
    save_steps=100,
    do_train=True,
    do_eval=True
)

# Trainer API
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator
)

# Train the model
trainer.train()

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


pytorch_model.bin:   0%|          | 0.00/442M [00:00<?, ?B/s]

Some weights of BertForTokenClassification were not initialized from the model checkpoint at allenai/scibert_scivocab_uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  trainer = Trainer(
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/442M [00:00<?, ?B/s]



Step,Training Loss
50,0.1118


TrainOutput(global_step=84, training_loss=0.07871079870632716, metrics={'train_runtime': 532.743, 'train_samples_per_second': 0.625, 'train_steps_per_second': 0.158, 'total_flos': 87014179998720.0, 'train_loss': 0.07871079870632716, 'epoch': 3.0})

In [8]:
# 7. Evaluation & Qualitative Inference

# 7.1 Quantitative Evaluation
eval_metrics = trainer.evaluate(eval_dataset=eval_dataset)
print("=== Trainer.evaluate() metrics ===")
print(eval_metrics)

# Compute span-level precision/recall/F1 with seqeval
from seqeval.metrics import precision_score, recall_score, f1_score, classification_report

pred_logits, true_label_ids, _ = trainer.predict(eval_dataset)
pred_label_ids = torch.argmax(torch.tensor(pred_logits), dim=-1).tolist()

true_tags = [[id2label[i] for i in seq] for seq in true_label_ids]
pred_tags = [[id2label[i] for i in seq] for seq in pred_label_ids]

print("\n=== Span-level Metrics ===")
print("Precision:", precision_score(true_tags, pred_tags))
print("Recall:   ", recall_score(true_tags, pred_tags))
print("F1:       ", f1_score(true_tags, pred_tags))
print("\n", classification_report(true_tags, pred_tags))

# 7.2 Qualitative Inference via NER Pipeline
tokenizer.model_max_length = max_len
ner_pipe = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
    device=0 if torch.cuda.is_available() else -1
)

def extract_fields(text: str) -> dict:
    """
    Run the NER pipeline on raw text and aggregate tokens by field.
    """
    entities = ner_pipe(text)
    fields = defaultdict(list)
    for ent in entities:
        grp = ent.get("entity_group", ent.get("entity"))
        if '-' in grp:
            tag, fld = grp.split('-', 1)
        else:
            fld = grp
        fields[fld].append(ent["word"])
    return {fld: " ".join(tokens) for fld, tokens in fields.items()}

print("\n=== Sample Inference Results ===")
for fname, raw in texts.items():
    print(f"--- {fname} ---")
    out = extract_fields(raw)
    if out:
        for fld, txt in out.items():
            print(f" {fld}: {txt}")
    else:
        print(" (no entities found)")
    print()



=== Trainer.evaluate() metrics ===
{'eval_loss': 0.026672448962926865, 'eval_runtime': 3.3936, 'eval_samples_per_second': 3.831, 'eval_steps_per_second': 1.179, 'epoch': 3.0}


ModuleNotFoundError: No module named 'seqeval'

In [None]:
# 8. Compare True vs. Predicted Fields

# Identify which fields we have ground-truth columns for
candidate = [
    'Company Name','Date','Document Type',
    'Preferred Stocks','Priority Order','Liquidation Value'
]
orig_fields = [f for f in candidate if f in df.columns]
pred_fields = [f.replace(' ', '_') for f in orig_fields]

# Select a subset of files for evaluation
eval_files = train_test_split(df['fname'], test_size=0.9, random_state=42)[0]

# Build comparison rows
rows = []
for fname in eval_files:
    true_row = df[df['fname']==fname].iloc[0]
    pred_row = extract_fields(texts[fname])
    row = {'fname': fname}
    for orig, lab in zip(orig_fields, pred_fields):
        row[f"{lab}_true"] = true_row.get(orig, "")
        row[f"{lab}_pred"] = pred_row.get(lab, "")
    rows.append(row)

df_eval = pd.DataFrame(rows)
display(df_eval)


Unnamed: 0,fname,Company_Name_true,Company_Name_pred,Date_true,Date_pred,Document_Type_true,Document_Type_pred
0,223_2007-08-03_Certificates of Incorporation.txt,,advion biosciences inc. .,,,,
1,192_2005-09-27_Certificates of Incorporation.txt,"Advanced BioHealing, Inc.",advanced biohealing inc. .,2005-09-27,,Amended and Restated Certificate of Incorporation,
2,189_2005-12-20_Certificates of Incorporation.txt,"Adspace Networks, Inc.",. .,2005-12-20,,Amended and Restated Certificate of Incorporation,
3,200_2008-08-22_Certificates of Incorporation.txt,"Advanced Electron Beams, Inc.",advanced electron beams inc.,2008-08-22,,Amended and Restated Certificate of Incorporation,amended and restated certificate of incorporation
4,188_2010-11-08_Certificates of Incorporation.txt,"Semantic Sugar, Inc.",semantic sugar inc.,2010-11-08,,Certificate of Amendment to the Restated Certi...,
5,181_2007-10-29_Certificates of Incorporation.txt,"Adknowledge, Inc.",adknowledge inc.,2007-10-29,,Amended and Restated Certificate of Incorporat...,
6,200_2013-01-30_Certificates of Incorporation.txt,"Advanced Electron Beams, Inc.",advanced electron beams inc.,2013-01-30,,Certificate of Dissolution,
7,234_2012-02-14_Certificates of Incorporation.txt,,aerohive networks inc. ##oh networks,,,,
8,136_2007-02-14_Certificates of Incorporation.txt,"Actmis Pharamaceuticals, Inc.",. inc.,2007-02-14,,Amended and Restated Certificate of Incorporation,
9,169_2011-08-16_Certificates of Incorporation.txt,"Adchemy, Inc.",adchemy inc.,2011-08-16,,Amended and Restated Certificate of Incorporation,


## Next Steps & Extensions

- **Logging & Error Handling:** record missing spans or tokenization errors.  
- **Hyperparameter Tuning:** experiment with learning rates, batch sizes, epochs.  
- **Thresholding:** refine NER pipeline aggregation & threshold settings.  
- **Integration:** wrap into an API for automated certificate processing.  
- **Unit Tests:** validate span alignment and label correctness with pytest.