<a href="https://colab.research.google.com/github/amalsalilan/B3-Developing-Named-Entity-Recognition-NER-Models-for-Financial-Data-Extraction-/blob/Isha/pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install spacy docling
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
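
The restart note matters because a model installed during the session may not be visible to the running kernel yet. A minimal sketch of a pre-flight check, using only the standard library; the helper name `package_is_installed` is hypothetical and not part of spaCy:

```python
import importlib.util

def package_is_installed(name: str) -> bool:
    """Return True if a top-level package called `name` can be found by the import system."""
    return importlib.util.find_spec(name) is not None

# spaCy models are installed as regular Python packages, so the model name
# can be checked like any other module before calling spacy.load().
if not package_is_installed("en_core_web_sm"):
    print("en_core_web_sm not found: run the install cell, then restart the runtime.")
```

If the check still fails right after installation, restarting the runtime as suggested in the output above is the likely fix.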


In [None]:
# step1_docling.py
from docling.document_converter import DocumentConverter
from pathlib import Path

def read_with_docling(path):
    conv = DocumentConverter()
    result = conv.convert(str(path))
    doc = result.document
    # prefer plain text export if available
    try:
        text = doc.export_to_text()
    except AttributeError:
        text = doc.export_to_markdown()
    return str(text)

if __name__ == "__main__":
    src = "/content/financial_data.pdf"   # your uploaded file
    text = read_with_docling(src)
    print(text[:800])  # print first 800 chars to confirm


[INFO] 2025-10-29 12:28:08,185 [RapidOCR] base.py:22: Using engine_name: torch
[INFO] 2025-10-29 12:28:08,241 [RapidOCR] download_file.py:60: File exists and is valid: /usr/local/lib/python3.12/dist-packages/rapidocr/models/ch_PP-OCRv4_det_infer.pth
[INFO] 2025-10-29 12:28:08,243 [RapidOCR] torch.py:54: Using /usr/local/lib/python3.12/dist-packages/rapidocr/models/ch_PP-OCRv4_det_infer.pth
[INFO] 2025-10-29 12:28:08,484 [RapidOCR] base.py:22: Using engine_name: torch
[INFO] 2025-10-29 12:28:08,489 [RapidOCR] download_file.py:60: File exists and is valid: /usr/local/lib/python3.12/dist-packages/rapidocr/models/ch_ptocr_mobile_v2.0_cls_infer.pth
[INFO] 2025-10-29 12:28:08,490 [RapidOCR] torch.py:54: Using /usr/local/lib/python3.12/dist-packages/rapidocr/models/ch_ptocr_mobile_v2.0_cls_infer.pth
[INFO] 2025-10-29 12:28:08,599 [RapidOCR] base.py:22: Using engine_name: torch
[INFO] 2025-10-29 12:28:08,702 [RapidOCR] downloa

On March 12, 2025, BlueRock Capital Pvt. Ltd., headquartered in Mumbai, announced a 48.7 crore investment in Aurora FinTech Solutions, a Bangalore-₹ based digital lending startup founded by Raghav Menon. The investment represents a 12.5% equity stake, making BlueRock one of Aurora's major institutional backers. The company stated that the funding round was oversubscribed by nearly 40%, signaling strong investor confidence in Aurora's AI-driven credit assessment platform.

According to Financial Insight Weekly, Aurora's annual revenue grew from 26 ₹ crore in FY 2023-24 to 41 crore in FY 2024-25, marking a 57% year-over-year ₹ increase. The firm attributed this growth to the rapid adoption of its mobilefirst micro-loan services across Tier-2 cities such as Indore, Kochi, and Nagpur. Chief Fi


In [None]:
# step2_ner.py
import spacy
from collections import defaultdict

def run_spacy_ner(text, model_name="en_core_web_sm"):
    """Return (doc, flat_entities_list, grouped_entities_dict)."""
    nlp = spacy.load(model_name)
    doc = nlp(text)

    flat = []
    grouped = defaultdict(list)
    for ent in doc.ents:
        item = {
            "text": ent.text,
            "label": ent.label_,
            "start_char": ent.start_char,
            "end_char": ent.end_char
        }
        flat.append(item)
        grouped[ent.label_].append(item)
    return doc, flat, dict(grouped)

# Quick test
if __name__ == "__main__":
    sample = "Apple Inc. hired John on January 5, 2024 and paid $5,000."
    doc, flat, grouped = run_spacy_ner(sample)
    print(flat)
    print(grouped.keys())


[{'text': 'Apple Inc.', 'label': 'ORG', 'start_char': 0, 'end_char': 10}, {'text': 'John', 'label': 'PERSON', 'start_char': 17, 'end_char': 21}, {'text': 'January 5, 2024', 'label': 'DATE', 'start_char': 25, 'end_char': 40}, {'text': '5,000', 'label': 'MONEY', 'start_char': 51, 'end_char': 56}]
dict_keys(['ORG', 'PERSON', 'DATE', 'MONEY'])
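
The flat list printed above is convenient for further processing. As a small sketch (the helper name `summarize_entities` is an assumption, not part of the notebook), the entities can be condensed into per-label counts and de-duplicated mention lists using only the standard library:

```python
from collections import Counter, defaultdict

def summarize_entities(flat):
    """Condense the flat entity list into per-label counts and unique texts."""
    counts = Counter(item["label"] for item in flat)
    unique = defaultdict(list)
    for item in flat:
        # Keep first-seen order while dropping repeated mentions
        if item["text"] not in unique[item["label"]]:
            unique[item["label"]].append(item["text"])
    return dict(counts), dict(unique)

# Sample input copied from the quick-test output above
flat = [
    {"text": "Apple Inc.", "label": "ORG", "start_char": 0, "end_char": 10},
    {"text": "John", "label": "PERSON", "start_char": 17, "end_char": 21},
    {"text": "January 5, 2024", "label": "DATE", "start_char": 25, "end_char": 40},
    {"text": "5,000", "label": "MONEY", "start_char": 51, "end_char": 56},
]
counts, unique = summarize_entities(flat)
print(counts)
print(unique["ORG"])
```

On longer documents, where the same company or person is mentioned many times, the unique lists are usually easier to scan than the raw flat output.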


### Visualization and Testing on PDF Content

This section takes the `text` extracted from the PDF with `docling`, runs spaCy named entity recognition on it, and renders the detected entities with `spacy.displacy.render()`.
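
displaCy can also render entities from plain dictionaries when `manual=True` is passed, which is useful when only a flat entity list is at hand or when only some labels should be highlighted. The following is a sketch under that assumption; `to_displacy_manual` and `render_entities_html` are hypothetical helper names, and the label set is only an example:

```python
def to_displacy_manual(text, flat, keep_labels=None):
    """Build the dict that displacy.render(..., manual=True) accepts for style="ent"."""
    ents = [
        {"start": item["start_char"], "end": item["end_char"], "label": item["label"]}
        for item in flat
        if keep_labels is None or item["label"] in keep_labels
    ]
    # Entity spans are passed in order of appearance
    ents.sort(key=lambda e: e["start"])
    return {"text": text, "ents": ents, "title": None}

def render_entities_html(text, flat, keep_labels=None):
    """Return entity markup as a string (requires spaCy to be installed)."""
    from spacy import displacy  # imported lazily so the helper above has no dependencies
    data = to_displacy_manual(text, flat, keep_labels)
    return displacy.render(data, style="ent", manual=True, jupyter=False)

# Sample sentence and entities in the same shape as the flat entity list
sample = "Apple Inc. hired John on January 5, 2024 and paid $5,000."
flat = [
    {"text": "Apple Inc.", "label": "ORG", "start_char": 0, "end_char": 10},
    {"text": "John", "label": "PERSON", "start_char": 17, "end_char": 21},
    {"text": "January 5, 2024", "label": "DATE", "start_char": 25, "end_char": 40},
    {"text": "5,000", "label": "MONEY", "start_char": 51, "end_char": 56},
]
data = to_displacy_manual(sample, flat, keep_labels={"ORG", "MONEY"})
print(data["ents"])
```

Filtering happens in plain Python before rendering, so the same dictionary can be inspected or saved without spaCy being loaded.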

In [None]:
from spacy import displacy
from IPython.display import display, HTML
import spacy

# This cell needs the `text` variable produced by the docling cell (step 1)
# and the `run_spacy_ner` function defined in the spaCy cell (step 2).
# Run both of those cells first.

# Fail early with a clear message if the extracted text is missing
try:
    text
except NameError:
    print("Error: 'text' variable not found. Please run the docling cell (step 1) first to extract text from the PDF.")
    raise

# Run spaCy NER on the text extracted from the PDF.
# run_spacy_ner loads its own pipeline, so no separate `nlp` object is needed here.
doc, flat_entities, grouped_entities = run_spacy_ner(text)

# Generate HTML visualization of entities.
# jupyter=False makes displacy return the markup as a string instead of
# displaying it directly, so it can be passed to display() below.
html = displacy.render(doc, style="ent", page=True, jupyter=False)

# Display the HTML
display(HTML(html))

# You can also print the flat and grouped entities to verify
print("\nFlat Entities:")
print(flat_entities)

print("\nGrouped Entities:")
print(grouped_entities)

<IPython.core.display.HTML object>


Flat Entities:
[{'text': 'March 12, 2025', 'label': 'DATE', 'start_char': 3, 'end_char': 17}, {'text': 'Capital Pvt. Ltd.', 'label': 'ORG', 'start_char': 28, 'end_char': 45}, {'text': 'Mumbai', 'label': 'GPE', 'start_char': 64, 'end_char': 70}, {'text': '48.7', 'label': 'CARDINAL', 'start_char': 84, 'end_char': 88}, {'text': 'Aurora FinTech Solutions', 'label': 'ORG', 'start_char': 109, 'end_char': 133}, {'text': 'Raghav Menon', 'label': 'PERSON', 'start_char': 190, 'end_char': 202}, {'text': '12.5%', 'label': 'PERCENT', 'start_char': 232, 'end_char': 237}, {'text': 'Aurora', 'label': 'PERSON', 'start_char': 275, 'end_char': 281}, {'text': 'nearly 40%', 'label': 'PERCENT', 'start_char': 377, 'end_char': 387}, {'text': 'Aurora', 'label': 'PERSON', 'start_char': 429, 'end_char': 435}, {'text': 'Financial Insight Weekly', 'label': 'ORG', 'start_char': 490, 'end_char': 514}, {'text': 'Aurora', 'label': 'PERSON', 'start_char': 516, 'end_char': 522}, {'text': 'annual', 'label': 'DATE', 'sta