# Financial Document Reader - Demo

This notebook demonstrates the entity extraction capabilities of the Financial Document Reader, alongside a general-purpose Named Entity Recognition (NER) model for comparison.

## Features
- Extract financial entities from PDF, DOCX, and TXT files
- Rule-based extraction with regex patterns
- Automatic normalization (amounts, dates, spreads, percentages, tenors)
- Confidence scoring for each extraction


In [5]:
import sys
import json
from pathlib import Path

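# Add backend to path. The backend's own dependencies (for example pydantic)
# must be installed in this kernel's environment, or the imports below fail.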
sys.path.insert(0, str(Path.cwd().parent / 'backend'))

from app.extractors.rule_based import RuleBasedExtractor
from app.extractors.document_processor import DocumentProcessor
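# Import only the normalizers used in this notebook instead of a wildcard
from app.utils.normalizers import (
    normalize_amount,
    normalize_date,
    normalize_percentage,
    normalize_spread,
    normalize_tenor,
)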


ModuleNotFoundError: No module named 'pydantic'

## Example 1: Extract from Sample Text

Let's extract entities from the provided trade confirmation text.


In [None]:
sample_text = """
11:49:05 I'll revert regarding BANK ABC to try to do another 200 mio at 2Y
FR001400QV82    AVMAFC FLOAT    06/30/28
offer 2Y EVG estr+45bps
estr average Estr average / Quarterly interest payment
"""

# Initialize extractor
extractor = RuleBasedExtractor()

# Extract entities
entities = extractor.extract(sample_text, source="sample_text")

print(f"Found {len(entities)} entities:\n")

for entity in entities:
    print(f"[{entity.entity}]")
    print(f"   Raw: {entity.raw_value}")
    print(f"   Normalized: {entity.normalized}")
    print(f"   Confidence: {entity.confidence:.2%}")
    print(f"   Position: {entity.char_start}-{entity.char_end}")
    print()
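
The fields printed above (entity type, raw value, confidence, character offsets) can be produced by a plain regex scan. The following is a minimal sketch of that idea for ISINs only, not the actual `RuleBasedExtractor` implementation, whose internals this notebook does not show. The names `ISIN_PATTERN`, `isin_checksum_ok`, and `find_isins` are hypothetical, and the confidence values are assumptions chosen for illustration.

```python
import re

# Illustrative pattern: 2-letter country code, 9 alphanumerics, 1 check digit.
ISIN_PATTERN = re.compile(r"\b[A-Z]{2}[A-Z0-9]{9}[0-9]\b")


def isin_checksum_ok(isin: str) -> bool:
    """Return True if the final digit satisfies the ISIN Luhn check."""
    # Letters expand to two digits (A=10 ... Z=35); digits are kept as-is.
    digits = "".join(str(int(ch, 36)) for ch in isin)
    total = 0
    # Walk from the right, doubling every second digit.
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d = d * 2 - 9 if d * 2 > 9 else d * 2
        total += d
    return total % 10 == 0


def find_isins(text: str) -> list[dict]:
    """Scan text and return one record per ISIN-shaped token."""
    results = []
    for match in ISIN_PATTERN.finditer(text):
        valid = isin_checksum_ok(match.group())
        results.append({
            "entity": "ISIN",
            "raw_value": match.group(),
            # Assumed scores: a passing checksum earns more trust.
            "confidence": 0.95 if valid else 0.50,
            "char_start": match.start(),
            "char_end": match.end(),
        })
    return results


line = "FR001400QV82    AVMAFC FLOAT    06/30/28"
for hit in find_isins(line):
    print(hit)
```

Tying the confidence to a checksum is one possible design: a token that merely looks like an ISIN is kept, but ranked below one that also validates.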


## Example 2: Using Pre-trained NER Model (spaCy)

This demonstrates using a general-purpose NER model for entity extraction.


In [None]:
# Install spaCy if needed: pip install spacy
# Download model: python -m spacy download en_core_web_sm

try:
    import spacy
    
    # Load pre-trained NER model
    nlp = spacy.load("en_core_web_sm")
    
    # Process sample text
    doc = nlp(sample_text)
    
    print("Entities detected by spaCy NER model:\n")
    
    for ent in doc.ents:
        print(f"Text: {ent.text}")
        print(f"  Label: {ent.label_}")
        print(f"  Description: {spacy.explain(ent.label_)}")
        print(f"  Position: {ent.start_char}-{ent.end_char}")
        print()
    
    print(f"\nTotal entities found: {len(doc.ents)}")
    
    # Note: Pre-trained models recognize general entities (PERSON, ORG, MONEY, DATE)
    # For financial-specific entities (ISIN, Tenor, Spread), fine-tuning is needed
    # See NER_FINETUNING_METHODOLOGY.md for details
    
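except ImportError:
    print("spaCy not installed. Install with: pip install spacy")
except OSError:
    # spacy.load raises OSError when the model package is not installed
    print("Model 'en_core_web_sm' not found.")
    print("Download it with: python -m spacy download en_core_web_sm")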


## Example 3: Comparison - Rule-Based vs NER Model

Let's compare both approaches:


In [None]:
print("COMPARISON: Rule-Based vs NER Model\n")
print("=" * 60)

print("\nRule-Based Approach:")
print("  Pros:")
print("    - Extracts domain-specific entities (ISIN, Tenor, Spread)")
print("    - Very fast (< 50ms)")
print("    - High precision for well-defined patterns")
print("    - No training data needed")
print("  Cons:")
print("    - Brittle to format variations")
print("    - Requires manual pattern updates")
print("    - Limited context understanding")

print("\nNER Model Approach:")
print("  Pros:")
print("    - Handles format variations better")
print("    - Understands linguistic context")
print("    - Learns from data")
print("    - Generalizes to new patterns")
print("  Cons:")
print("    - Requires training data (500-1000 examples)")
print("    - Slower inference (~100-500ms)")
print("    - May need fine-tuning for financial entities")
print("    - Higher computational requirements")

print("\nRecommended Hybrid Approach:")
print("  1. Use rule-based for high-confidence patterns (ISIN, amounts)")
print("  2. Use NER model for ambiguous cases (counterparties, context)")
print("  3. Combine with confidence scoring")
print("  4. Post-process and validate results")

print("\nFor fine-tuning NER models, see: NER_FINETUNING_METHODOLOGY.md")
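
The hybrid approach listed above can be sketched as a merge over character spans: collect candidates from both sources, drop low-confidence ones, and keep the most confident candidate wherever spans overlap. This is a minimal sketch under stated assumptions, not the project's implementation. The `Span` class, `merge_spans`, the `min_confidence` threshold, and every offset and score in the usage example are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    entity: str
    raw_value: str
    confidence: float
    char_start: int
    char_end: int
    source: str  # "rule" or "model"


def overlaps(a: Span, b: Span) -> bool:
    """Two half-open character ranges overlap if each starts before the other ends."""
    return a.char_start < b.char_end and b.char_start < a.char_end


def merge_spans(rule_spans, model_spans, min_confidence=0.5):
    """Combine both sources, keeping the most confident non-overlapping spans."""
    candidates = [
        s for s in [*rule_spans, *model_spans] if s.confidence >= min_confidence
    ]
    # Highest confidence first; on ties, prefer the rule-based candidate.
    candidates.sort(key=lambda s: (-s.confidence, s.source != "rule"))
    kept = []
    for span in candidates:
        if not any(overlaps(span, k) for k in kept):
            kept.append(span)
    # Return the survivors in document order.
    return sorted(kept, key=lambda s: s.char_start)


# Made-up spans for illustration only.
rule_spans = [Span("ISIN", "FR001400QV82", 0.95, 70, 82, "rule")]
model_spans = [
    Span("ORG", "FR001400QV82", 0.60, 70, 82, "model"),  # loses to the ISIN rule
    Span("ORG", "BANK ABC", 0.80, 31, 39, "model"),      # kept, no conflict
    Span("DATE", "2Y", 0.30, 66, 68, "model"),           # below the threshold
]
for span in merge_spans(rule_spans, model_spans):
    print(span.entity, span.raw_value, span.source)
```

A greedy pass like this is simple to reason about. It assumes the confidence scores of the two sources are comparable, which in practice may require calibration.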


## Example 4: Test Normalizers

This example runs each normalization function on representative raw values.


In [None]:
print("Amount Normalization:")
print(f"  '200 mio' -> {normalize_amount('200 mio')}")
print(f"  'EUR 1 million' -> {normalize_amount('EUR 1 million')}")
print(f"  '500k' -> {normalize_amount('500k')}")

print("\nDate Normalization:")
print(f"  '31 January 2025' -> {normalize_date('31 January 2025')}")
print(f"  '06/30/28' -> {normalize_date('06/30/28')}")

print("\nSpread Normalization:")
print(f"  'estr+45bps' -> {normalize_spread('estr+45bps')}")
print(f"  'libor+100' -> {normalize_spread('libor+100')}")

print("\nPercentage Normalization:")
print(f"  '75%' -> {normalize_percentage('75%')}")
print(f"  '0.5%' -> {normalize_percentage('0.5%')}")

print("\nTenor Normalization:")
print(f"  '2Y' -> {normalize_tenor('2Y')}")
print(f"  '6M' -> {normalize_tenor('6M')}")
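
To make the idea of normalization concrete, here is a minimal sketch of how such functions could be written with the standard library. These are not the functions in `app.utils.normalizers`: the `sketch_` names, the return shapes, the multiplier table, and the accepted date formats are all assumptions, and the real normalizers may return different structures. The sketch also assumes that a spread without a unit is in basis points and that slash-separated dates are month-first.

```python
import re
from datetime import datetime

# Assumed multiplier table; extend it for other conventions.
MULTIPLIERS = {"k": 1e3, "m": 1e6, "mio": 1e6, "mn": 1e6, "million": 1e6,
               "bn": 1e9, "billion": 1e9}
AMOUNT_RE = re.compile(
    r"(?:(?P<ccy>[A-Z]{3})\s+)?(?P<num>\d+(?:\.\d+)?)\s*(?P<unit>[A-Za-z]+)?")
SPREAD_RE = re.compile(
    r"(?P<index>[A-Za-z]+)\s*(?P<sign>[+-])\s*(?P<num>\d+(?:\.\d+)?)\s*(?:bps?)?",
    re.IGNORECASE)
TENOR_RE = re.compile(r"(?P<count>\d+)\s*(?P<unit>[DWMY])", re.IGNORECASE)
TENOR_UNITS = {"D": "days", "W": "weeks", "M": "months", "Y": "years"}
DATE_FORMATS = ("%d %B %Y", "%m/%d/%y", "%Y-%m-%d")  # month-first for slashes


def sketch_normalize_amount(raw: str):
    """'200 mio' -> value in units plus an optional ISO currency code."""
    m = AMOUNT_RE.fullmatch(raw.strip())
    if m is None:
        return None
    unit = (m.group("unit") or "").lower()
    if unit and unit not in MULTIPLIERS:
        return None  # unknown suffix: refuse rather than guess
    value = float(m.group("num")) * MULTIPLIERS.get(unit, 1)
    return {"value": value, "currency": m.group("ccy")}


def sketch_normalize_spread(raw: str):
    """'estr+45bps' -> reference index plus signed spread in basis points."""
    m = SPREAD_RE.fullmatch(raw.strip())
    if m is None:
        return None
    sign = -1.0 if m.group("sign") == "-" else 1.0
    return {"index": m.group("index").upper(),
            "spread_bps": sign * float(m.group("num"))}


def sketch_normalize_tenor(raw: str):
    """'2Y' -> count plus a spelled-out unit."""
    m = TENOR_RE.fullmatch(raw.strip())
    if m is None:
        return None
    return {"count": int(m.group("count")),
            "unit": TENOR_UNITS[m.group("unit").upper()]}


def sketch_normalize_date(raw: str):
    """Try each known format in turn and return an ISO 8601 date string."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


for raw in ("200 mio", "EUR 1 million", "500k"):
    print(raw, "->", sketch_normalize_amount(raw))
```

Returning `None` for input that does not match lets the caller decide whether to drop the entity or lower its confidence.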


## Summary

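This demo showed:
1. Rule-based entity extraction with regex, with a confidence score and character offsets for each entity
2. General-purpose entity recognition with a pre-trained spaCy model
3. A comparison of the rule-based and NER model approaches, and a recommended hybrid
4. Normalization of amounts, dates, spreads, percentages, and tenors

Multi-format support (TXT, PDF, DOCX) is listed under Features, but this notebook only extracts from an inline text sample.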

## Next Steps

- Upload your own documents through the web UI at http://localhost:3000
- Use the REST API at http://localhost:8000/docs
- Extend the extraction patterns in `backend/app/extractors/rule_based.py`
