# 8-K XBRL Extraction Pipeline - Interactive Testing

Test every component of the extraction pipeline step-by-step.

| Section | What it Tests | Dependencies |
|---------|--------------|--------------|
| 1-5 | Individual functions (parsing, validation) | None |
| 6 | Full postprocessor with mock data | None |
| 7 | XBRL catalog fetch from Neo4j | `neo4j`, `python-dotenv` |
| 8-9 | **Real LangExtract call** + postprocessing | `langextract`, `neo4j` |
| 10 | Save results (JSONL + HTML) | `langextract` |

In [1]:
# Setup: Add paths and imports
import sys
import os

# Ensure imports resolve
BASE = os.path.dirname(os.path.abspath('.'))
sys.path.insert(0, os.getcwd())
sys.path.insert(0, os.path.join(os.getcwd(), '..', 'Experiments'))

print(f"Working directory: {os.getcwd()}")

Working directory: /home/faisal/EventMarketDB/drivers/8K_XBRL_Linking/FinalScripts


---
## 1. Value Parsing

Extracts numeric values from text using priority-based matching:
- Currency + multiplier: `$2.75 billion` → 2,750,000,000
- Parentheses negative: `($450 million)` → -450,000,000
- Percentage-only text is rejected unless the unit is a ratio: `margin of 23.5%` → 0.235 with unit `pure`, but no value with unit `USD`
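
The priority rules above can be sketched roughly as follows. This is a simplified illustration, not the actual `parse_number_from_text` implementation: the names `sketch_parse_number`, `MULTIPLIERS`, and `RATIO_UNITS` are hypothetical, the multiplier list is assumed, and bare numbers without a currency sign or multiplier are deliberately ignored.

```python
import re

# Assumed multiplier words; the real postprocessor may recognise more.
MULTIPLIERS = {"thousand": 1e3, "million": 1e6, "billion": 1e9, "trillion": 1e12}
# Assumed set of ratio units for which a bare percentage is acceptable.
RATIO_UNITS = {"pure"}

# One numeric token: optional "(", optional "$", the digits, an optional
# multiplier word, an optional "%", and an optional ")".
TOKEN_RE = re.compile(
    r"(?P<open>\()?\s*(?P<cur>\$)?\s*(?P<num>\d[\d,]*(?:\.\d+)?)"
    r"(?:\s*(?P<mult>thousand|million|billion|trillion)\b)?"
    r"\s*(?P<pct>%)?\s*(?P<close>\))?",
    re.IGNORECASE,
)


def sketch_parse_number(text, unit=None):
    """Return (value, error); currency/multiplier amounts outrank percentages."""
    amounts, percents = [], []
    for m in TOKEN_RE.finditer(text):
        number = float(m.group("num").replace(",", ""))
        if m.group("pct"):
            # Percentages are stored as ratios and never negated by parentheses.
            percents.append(number / 100.0)
        elif m.group("cur") or m.group("mult"):
            value = number * MULTIPLIERS.get((m.group("mult") or "").lower(), 1.0)
            if m.group("open") and m.group("close"):
                value = -value  # accounting-style negative
            amounts.append(value)
    if amounts:
        return amounts[0], None
    if percents and unit in RATIO_UNITS:
        return percents[0], None
    return None, f"Could not parse number from: {text}"
```

Collecting all tokens first and choosing by category afterwards is what lets `$2.75 billion` win over an earlier `10%` in the same sentence.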

In [2]:
from postprocessor import parse_number_from_text

# Test cases: (text, unit, expected, description)
test_cases = [
    ("net income of $2.75 billion", None, 2_750_000_000, "Currency + multiplier"),
    ("revenue increased 10% to $2.75 billion", None, 2_750_000_000, "Priority over percentage"),
    ("($2.3 billion)", None, -2_300_000_000, "Parentheses negative"),
    ("loss of ($450 million)", None, -450_000_000, "Parentheses negative with context"),
    ("EPS of $4.80", None, 4.80, "Small currency amount"),
    ("$2,750,000,000 in assets", None, 2_750_000_000, "Large number with commas"),
    ("2.75 billion shares", None, 2_750_000_000, "Multiplier without currency"),
    ("margin of 23.5%", "pure", 0.235, "Percentage with ratio unit"),
    ("margin of 23.5%", "USD", None, "Percentage rejected for USD"),
    ("(10%) decline", None, None, "Percentage in parens - not negative"),
]

print("VALUE PARSING TESTS\n" + "="*60)
for text, unit, expected, desc in test_cases:
    value, error = parse_number_from_text(text, unit)
    
    if expected is None:
        passed = value is None
    else:
        passed = value is not None and abs(value - expected) < 0.01
    
    status = "✓" if passed else "✗"
    print(f"{status} {desc}")
    print(f"   Input: \"{text}\" (unit={unit})")
    print(f"   Expected: {expected}, Got: {value}")
    if error:
        print(f"   Error: {error}")
    print()

VALUE PARSING TESTS
✓ Currency + multiplier
   Input: "net income of $2.75 billion" (unit=None)
   Expected: 2750000000, Got: 2750000000.0

✓ Priority over percentage
   Input: "revenue increased 10% to $2.75 billion" (unit=None)
   Expected: 2750000000, Got: 2750000000.0

✓ Parentheses negative
   Input: "($2.3 billion)" (unit=None)
   Expected: -2300000000, Got: -2300000000.0

✓ Parentheses negative with context
   Input: "loss of ($450 million)" (unit=None)
   Expected: -450000000, Got: -450000000.0

✓ Small currency amount
   Input: "EPS of $4.80" (unit=None)
   Expected: 4.8, Got: 4.8

✓ Large number with commas
   Input: "$2,750,000,000 in assets" (unit=None)
   Expected: 2750000000, Got: 2750000000.0

✓ Multiplier without currency
   Input: "2.75 billion shares" (unit=None)
   Expected: 2750000000, Got: 2750000000.0

✓ Percentage with ratio unit
   Input: "margin of 23.5%" (unit=pure)
   Expected: 0.235, Got: 0.235

✓ Percentage rejected for USD
   Input: "margin of 23.5%" (unit

In [3]:
# Interactive: Try your own text
test_text = "Record revenue of $29.8 billion, up 19% year over year"
test_unit = "USD"

value, error = parse_number_from_text(test_text, test_unit)
print(f"Text: {test_text}")
print(f"Unit: {test_unit}")
print(f"Parsed value: {value:,.2f}" if value is not None else f"Error: {error}")

Text: Record revenue of $29.8 billion, up 19% year over year
Unit: USD
Parsed value: 29,800,000,000.00


---
## 2. Period Normalization

Normalizes period formats to canonical form:
- Arrow variants: `-->`, `->`, ` - `, ` to ` → `→`
- Strips whitespace: `2024-01-01 → 2024-12-31` → `2024-01-01→2024-12-31`
- Collapses same-day durations to instant: `2024-06-30→2024-06-30` → `2024-06-30`
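
A minimal sketch of these normalization rules, assuming ISO `YYYY-MM-DD` dates. The helpers `sketch_normalize_period` and `sketch_validate_period` are hypothetical stand-ins, not the real `normalize_period` / `validate_period`, and the sketch returns only the normalized string.

```python
import re
from datetime import date

ARROW = "→"
# Separator variants, longest first so "-->" is not read as "->".
# " - " and " to " must be surrounded by whitespace so that the hyphens
# inside a date are never treated as separators.
_SEPARATOR_RE = re.compile(r"\s*(?:-->|->|→|\s-\s|\sto\s)\s*")


def sketch_normalize_period(period):
    """Rewrite separator variants to a bare arrow and collapse same-day ranges."""
    normalized = _SEPARATOR_RE.sub(ARROW, period.strip())
    start, sep, end = normalized.partition(ARROW)
    if sep and start == end:
        normalized = start  # same-day duration becomes an instant
    return normalized


def sketch_validate_period(period):
    """Accept one ISO date (instant) or two ISO dates in order (duration)."""
    parts = period.split(ARROW)
    if len(parts) not in (1, 2):
        return False
    try:
        dates = [date.fromisoformat(p) for p in parts]
    except ValueError:
        return False
    return len(dates) == 1 or dates[0] <= dates[1]
```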

In [4]:
from postprocessor import normalize_period, validate_period

# Test cases: (input, expected_normalized, should_be_valid, description)
test_cases = [
    ("2024-12-31", "2024-12-31", True, "Instant - no change"),
    ("2024-01-01→2024-12-31", "2024-01-01→2024-12-31", True, "Duration - no change"),
    ("2024-06-30→2024-06-30", "2024-06-30", True, "Same start/end → instant"),
    ("2024-01-01-->2024-12-31", "2024-01-01→2024-12-31", True, "Arrow variant -->"),
    ("2024-01-01->2024-12-31", "2024-01-01→2024-12-31", True, "Arrow variant ->"),
    ("2024-01-01 - 2024-12-31", "2024-01-01→2024-12-31", True, "Arrow variant ' - '"),
    ("2024-01-01 → 2024-12-31", "2024-01-01→2024-12-31", True, "Spaces around arrow"),
    ("2024-01-01 to 2024-12-31", "2024-01-01→2024-12-31", True, "Arrow variant ' to '"),
    ("invalid", "invalid", False, "Invalid format"),
]

print("PERIOD NORMALIZATION TESTS\n" + "="*60)
for input_period, expected, should_be_valid, desc in test_cases:
    normalized, was_normalized = normalize_period(input_period)
    is_valid = validate_period(normalized)
    
    passed = (normalized == expected) and (is_valid == should_be_valid)
    status = "✓" if passed else "✗"
    
    print(f"{status} {desc}")
    print(f"   Input: \"{input_period}\"")
    print(f"   Normalized: \"{normalized}\" (changed={was_normalized})")
    print(f"   Valid: {is_valid}")
    print()

PERIOD NORMALIZATION TESTS
✓ Instant - no change
   Input: "2024-12-31"
   Normalized: "2024-12-31" (changed=False)
   Valid: True

✓ Duration - no change
   Input: "2024-01-01→2024-12-31"
   Normalized: "2024-01-01→2024-12-31" (changed=False)
   Valid: True

✓ Same start/end → instant
   Input: "2024-06-30→2024-06-30"
   Normalized: "2024-06-30" (changed=True)
   Valid: True

✓ Arrow variant -->
   Input: "2024-01-01-->2024-12-31"
   Normalized: "2024-01-01→2024-12-31" (changed=True)
   Valid: True

✓ Arrow variant ->
   Input: "2024-01-01->2024-12-31"
   Normalized: "2024-01-01→2024-12-31" (changed=True)
   Valid: True

✓ Arrow variant ' - '
   Input: "2024-01-01 - 2024-12-31"
   Normalized: "2024-01-01→2024-12-31" (changed=True)
   Valid: True

✓ Spaces around arrow
   Input: "2024-01-01 → 2024-12-31"
   Normalized: "2024-01-01→2024-12-31" (changed=False)
   Valid: True

✓ Arrow variant ' to '
   Input: "2024-01-01 to 2024-12-31"
   Normalized: "2024-01-01→2024-12-31" (changed=True)

---
## 3. Qname Validation

Validates concept qnames against the catalog. Invalid qnames become `UNMATCHED`.
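
The rule can be sketched as a set-membership check. `sketch_validate_qname` below is a hypothetical illustration, not the real `validate_qname`; treating `UNMATCHED` as a valid passthrough is inferred from the test cases in this section.

```python
UNMATCHED = "UNMATCHED"


def sketch_validate_qname(qname, valid_qnames):
    """Return (qname_or_UNMATCHED, was_valid)."""
    if qname == UNMATCHED:
        # The sentinel is allowed through unchanged.
        return UNMATCHED, True
    if qname in valid_qnames:
        return qname, True
    # Anything else, including the empty string, is not in the catalog.
    return UNMATCHED, False
```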

In [5]:
from postprocessor import validate_qname

# Mock catalog qnames
valid_qnames = {
    "us-gaap:Revenues",
    "us-gaap:NetIncomeLoss",
    "us-gaap:Assets",
    "us-gaap:EarningsPerShareDiluted",
}

# Test cases
test_cases = [
    ("us-gaap:Revenues", "us-gaap:Revenues", True, "Valid qname"),
    ("us-gaap:NetIncomeLoss", "us-gaap:NetIncomeLoss", True, "Valid qname"),
    ("us-gaap:FakeConceptXYZ", "UNMATCHED", False, "Invalid qname → UNMATCHED"),
    ("UNMATCHED", "UNMATCHED", True, "UNMATCHED passthrough"),
    ("", "UNMATCHED", False, "Empty string → UNMATCHED"),
]

print("QNAME VALIDATION TESTS\n" + "="*60)
for qname, expected_result, expected_valid, desc in test_cases:
    result, was_valid = validate_qname(qname, valid_qnames)
    
    passed = (result == expected_result) and (was_valid == expected_valid)
    status = "✓" if passed else "✗"
    
    print(f"{status} {desc}")
    print(f"   Input: \"{qname}\"")
    print(f"   Result: \"{result}\" (valid={was_valid})")
    print()

Invalid qname 'us-gaap:FakeConceptXYZ' not in catalog, setting to UNMATCHED


QNAME VALIDATION TESTS
✓ Valid qname
   Input: "us-gaap:Revenues"
   Result: "us-gaap:Revenues" (valid=True)

✓ Valid qname
   Input: "us-gaap:NetIncomeLoss"
   Result: "us-gaap:NetIncomeLoss" (valid=True)

✓ Invalid qname → UNMATCHED
   Input: "us-gaap:FakeConceptXYZ"
   Result: "UNMATCHED" (valid=False)

✓ UNMATCHED passthrough
   Input: "UNMATCHED"
   Result: "UNMATCHED" (valid=True)

✓ Empty string → UNMATCHED
   Input: ""
   Result: "UNMATCHED" (valid=False)



---
## 4. Unit Validation

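Validates units against the catalog, ignoring case; a matching unit is returned in the catalog's spelling (`usd` → `USD`). An invalid unit is returned unchanged and puts the fact in REVIEW status.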
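
One way to get a case-insensitive match that still returns the catalog's spelling is a lowercase lookup table. This is a sketch under that assumption; `sketch_validate_unit` is a hypothetical name and the real `validate_unit` may work differently.

```python
def sketch_validate_unit(unit, valid_units):
    """Return (unit_in_catalog_spelling, was_valid)."""
    # Map lowercase spelling -> catalog spelling, e.g. "usd/share" -> "USD/share".
    by_lower = {u.lower(): u for u in valid_units}
    canonical = by_lower.get(unit.lower())
    if canonical is not None:
        return canonical, True
    # Unknown units are returned unchanged so the caller can report them.
    return unit, False
```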

In [6]:
from postprocessor import validate_unit

# Mock catalog units
valid_units = {"USD", "USD/share", "shares", "pure"}

test_cases = [
    ("USD", "USD", True, "Valid unit"),
    ("USD/share", "USD/share", True, "Valid per-share unit"),
    ("usd", "USD", True, "Case-insensitive match"),
    ("EUR", "EUR", False, "Invalid unit"),
]

print("UNIT VALIDATION TESTS\n" + "="*60)
for unit, expected_result, expected_valid, desc in test_cases:
    result, was_valid = validate_unit(unit, valid_units)
    
    passed = (result == expected_result) and (was_valid == expected_valid)
    status = "✓" if passed else "✗"
    
    print(f"{status} {desc}")
    print(f"   Input: \"{unit}\"")
    print(f"   Result: \"{result}\" (valid={was_valid})")
    print()

Unit 'EUR' not in catalog units: {'shares', 'USD', 'USD/share', 'pure'}


UNIT VALIDATION TESTS
✓ Valid unit
   Input: "USD"
   Result: "USD" (valid=True)

✓ Valid per-share unit
   Input: "USD/share"
   Result: "USD/share" (valid=True)

✓ Case-insensitive match
   Input: "usd"
   Result: "USD" (valid=True)

✓ Invalid unit
   Input: "EUR"
   Result: "EUR" (valid=False)



---
## 5. Status Determination

Determines extraction status based on validation results:
- **COMMITTED**: confidence ≥ 0.90 + valid qname + valid unit + valid period + value parsed
- **CANDIDATE_ONLY**: value, period, and unit are valid, but confidence is below 0.90 or the concept is UNMATCHED (an invalid qname counts as UNMATCHED)
- **REVIEW**: Parse failure, invalid period, or invalid unit
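
The three outcomes can be read as a small decision function. The sketch below is illustrative only: `SketchStatus` and `sketch_determine_status` are hypothetical names, and checking the REVIEW conditions before the CANDIDATE_ONLY conditions is an assumption about how combined failures are resolved.

```python
from enum import Enum


class SketchStatus(Enum):
    COMMITTED = "COMMITTED"
    CANDIDATE_ONLY = "CANDIDATE_ONLY"
    REVIEW = "REVIEW"


COMMIT_THRESHOLD = 0.90
UNMATCHED = "UNMATCHED"


def sketch_determine_status(concept, confidence, value,
                            period_valid, qname_valid, unit_valid):
    """Return (status, committed) following the rules listed above."""
    if value is None or not period_valid or not unit_valid:
        # Structural problems always need a human look (assumed to take priority).
        status = SketchStatus.REVIEW
    elif concept == UNMATCHED or not qname_valid or confidence < COMMIT_THRESHOLD:
        status = SketchStatus.CANDIDATE_ONLY
    else:
        status = SketchStatus.COMMITTED
    return status, status is SketchStatus.COMMITTED
```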

In [7]:
from postprocessor import determine_status
from extraction_schema import ExtractionStatus

# Test cases: (concept, confidence, value, period_valid, qname_valid, unit_valid, expected_status)
test_cases = [
    ("us-gaap:Revenues", 0.95, 1000.0, True, True, True, ExtractionStatus.COMMITTED, "High confidence valid"),
    ("us-gaap:Revenues", 0.85, 1000.0, True, True, True, ExtractionStatus.CANDIDATE_ONLY, "Below threshold"),
    ("UNMATCHED", 0.95, 1000.0, True, True, True, ExtractionStatus.CANDIDATE_ONLY, "UNMATCHED concept"),
    ("us-gaap:Revenues", 0.95, None, True, True, True, ExtractionStatus.REVIEW, "Parse failed"),
    ("us-gaap:Revenues", 0.95, 1000.0, False, True, True, ExtractionStatus.REVIEW, "Invalid period"),
    ("us-gaap:Revenues", 0.95, 1000.0, True, True, False, ExtractionStatus.REVIEW, "Invalid unit"),
    ("us-gaap:FakeConcept", 0.95, 1000.0, True, False, True, ExtractionStatus.CANDIDATE_ONLY, "Invalid qname"),
]

print("STATUS DETERMINATION TESTS\n" + "="*60)
for concept, conf, value, period_v, qname_v, unit_v, expected, desc in test_cases:
    status, committed = determine_status(concept, conf, value, period_v, qname_v, unit_v)
    
    passed = status == expected
    status_str = "✓" if passed else "✗"
    
    print(f"{status_str} {desc}")
    print(f"   Got: {status.value}, Expected: {expected.value}")
    print(f"   Committed: {committed}")
    print()

STATUS DETERMINATION TESTS
✓ High confidence valid
   Got: COMMITTED, Expected: COMMITTED
   Committed: True

✓ Below threshold
   Got: CANDIDATE_ONLY, Expected: CANDIDATE_ONLY
   Committed: False

✓ UNMATCHED concept
   Got: CANDIDATE_ONLY, Expected: CANDIDATE_ONLY
   Committed: False

✓ Parse failed
   Got: REVIEW, Expected: REVIEW
   Committed: False

✓ Invalid period
   Got: REVIEW, Expected: REVIEW
   Committed: False

✓ Invalid unit
   Got: REVIEW, Expected: REVIEW
   Committed: False

✓ Invalid qname
   Got: CANDIDATE_ONLY, Expected: CANDIDATE_ONLY
   Committed: False



---
## 6. Full Postprocessor (Mock Data)

Tests the complete postprocessor with mock LangExtract-like input. No external dependencies.

In [8]:
from extraction_schema import RawExtraction, ExtractionStatus
from postprocessor import postprocess

# Mock catalog data
valid_qnames = {
    "us-gaap:Revenues",
    "us-gaap:NetIncomeLoss",
    "us-gaap:EarningsPerShareDiluted",
    "us-gaap:Assets",
}
valid_units = {"USD", "USD/share", "shares", "pure"}

# Mock extractions (as if from LangExtract)
raw_extractions = [
    RawExtraction(
        extraction_text="net income of $2.75 billion for fiscal year 2024",
        char_start=100, char_end=150,
        concept_top1="us-gaap:NetIncomeLoss",
        matched_period="2024-01-01→2024-12-31",
        matched_unit="USD",
        confidence=0.95,
        reasoning="Explicit net income, exact catalog match",
    ),
    RawExtraction(
        extraction_text="diluted EPS of $4.80",
        char_start=200, char_end=220,
        concept_top1="us-gaap:EarningsPerShareDiluted",
        matched_period="2024-01-01→2024-12-31",
        matched_unit="USD/share",
        confidence=0.92,
        reasoning="Diluted EPS value",
    ),
    RawExtraction(
        extraction_text="revenue grew to $8.5 billion",
        char_start=300, char_end=330,
        concept_top1="UNMATCHED",
        concept_top2="us-gaap:Revenues",
        matched_period="2024-01-01→2024-12-31",
        matched_unit="USD",
        confidence=0.60,
        reasoning="Revenue concept varies by filer",
    ),
    RawExtraction(
        extraction_text="reported loss of ($450 million)",
        char_start=400, char_end=435,
        concept_top1="us-gaap:NetIncomeLoss",
        matched_period="2024-04-01→2024-06-30",
        matched_unit="USD",
        confidence=0.88,
        reasoning="Quarterly net loss in parentheses",
    ),
    RawExtraction(
        extraction_text="some text without clear number",
        char_start=500, char_end=530,
        concept_top1="us-gaap:Assets",
        matched_period="2024-12-31",
        matched_unit="USD",
        confidence=0.70,
        reasoning="No numeric value found",
    ),
]

# Run postprocessor
processed = postprocess(raw_extractions, valid_qnames, valid_units)

print(f"POSTPROCESSOR TEST\n" + "="*60)
print(f"Processed {len(processed)} facts:\n")

for i, fact in enumerate(processed, 1):
    print(f"--- Fact {i}: {fact.status.value} ---")
    print(f"  Text: \"{fact.extraction_text[:50]}...\"")
    print(f"  Concept: {fact.concept_top1}")
    if fact.concept_top2:
        print(f"  Concept2: {fact.concept_top2}")
    print(f"  Value: {fact.value_parsed:,.2f}" if fact.value_parsed is not None else "  Value: None")
    print(f"  Period: {fact.matched_period}")
    print(f"  Unit: {fact.matched_unit} (valid={fact.unit_valid})")
    print(f"  Committed: {fact.committed}")
    if fact.parse_error:
        print(f"  Parse Error: {fact.parse_error}")
    print()

# Summary
committed = sum(1 for f in processed if f.status == ExtractionStatus.COMMITTED)
candidate = sum(1 for f in processed if f.status == ExtractionStatus.CANDIDATE_ONLY)
review = sum(1 for f in processed if f.status == ExtractionStatus.REVIEW)
print(f"Summary: {committed} COMMITTED, {candidate} CANDIDATE_ONLY, {review} REVIEW")

REVIEW needed: some text without clear number... (parse_error=Could not parse number from: some text without clear number..., period_valid=True)


POSTPROCESSOR TEST
Processed 5 facts:

--- Fact 1: COMMITTED ---
  Text: "net income of $2.75 billion for fiscal year 2024..."
  Concept: us-gaap:NetIncomeLoss
  Value: 2,750,000,000.00
  Period: 2024-01-01→2024-12-31
  Unit: USD (valid=True)
  Committed: True

--- Fact 2: COMMITTED ---
  Text: "diluted EPS of $4.80..."
  Concept: us-gaap:EarningsPerShareDiluted
  Value: 4.80
  Period: 2024-01-01→2024-12-31
  Unit: USD/share (valid=True)
  Committed: True

--- Fact 3: CANDIDATE_ONLY ---
  Text: "revenue grew to $8.5 billion..."
  Concept: UNMATCHED
  Concept2: us-gaap:Revenues
  Value: 8,500,000,000.00
  Period: 2024-01-01→2024-12-31
  Unit: USD (valid=True)
  Committed: False

--- Fact 4: CANDIDATE_ONLY ---
  Text: "reported loss of ($450 million)..."
  Concept: us-gaap:NetIncomeLoss
  Value: -450,000,000.00
  Period: 2024-04-01→2024-06-30
  Unit: USD (valid=True)
  Committed: False

--- Fact 5: REVIEW ---
  Text: "some text without clear number..."
  Concept: us-gaap:Assets
  Value: 

---
## 7. Catalog Fetch (Neo4j)

Fetches XBRL catalog from Neo4j (READ ONLY - no writes). Contains concepts, units, dimensions, and historical facts for context.

In [9]:
# Fetch catalog for a company
TICKER = "DELL"  # Change this to test different companies

try:
    from xbrl_catalog import xbrl_catalog, print_catalog_summary
    
    print(f"Fetching XBRL catalog for {TICKER}...")
    catalog = xbrl_catalog(TICKER, limit_filings=2)
    
    print_catalog_summary(catalog)
    
except ImportError as e:
    print(f"Import error: {e}")
    print("Install: pip install python-dotenv neo4j")
except Exception as e:
    print(f"Error: {e}")

Fetching XBRL catalog for DELL...

XBRL Catalog: DELL TECHNOLOGIES INC (DELL)
CIK: 0001571996
Industry: ComputerHardware
Sector: Technology

Total Filings: 2
Total Facts: 3,536
Unique Concepts: 814

Filings:
  - 10-Q (2025-05-02): 1,231 facts
  - 10-K (2025-01-31): 2,305 facts

Top Segments:
  - FinanceLeasesPortfolioSegment: 811 facts
  - LoansAndFinanceReceivables: 707 facts
  - Nondesignated: 590 facts
  - UnsecuredDebt: 576 facts
  - ForeignExchangeContract: 528 facts




In [10]:
# Inspect catalog data
if 'catalog' in dir():
    print(f"Valid Qnames ({len(catalog.concepts)} total, first 20):")
    for qname in list(catalog.concepts.keys())[:20]:
        print(f"  {qname}")
    
    print(f"\nValid Units ({len(catalog.units)} total):")
    for unit in catalog.units.keys():
        print(f"  {unit}")

Valid Qnames (814 total, first 20):
  us-gaap:Revenues
  us-gaap:StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest
  us-gaap:NotesReceivableGross
  us-gaap:DebtInstrumentCarryingAmount
  us-gaap:DerivativeFairValueOfDerivativeLiability
  us-gaap:DerivativeFairValueOfDerivativeAsset
  us-gaap:CostOfRevenue
  us-gaap:ProfitLoss
  us-gaap:DerivativeAssetsLiabilitiesAtFairValueNet
  us-gaap:OperatingIncomeLoss
  us-gaap:DebtInstrumentInterestRateStatedPercentage
  us-gaap:NotesReceivableNet
  us-gaap:LongTermDebt
  us-gaap:CommonStockSharesIssued
  us-gaap:FinancingReceivableAllowanceForCreditLosses
  us-gaap:DividendsCommonStock
  us-gaap:FinancingReceivableRevolving
  us-gaap:AdjustmentsToAdditionalPaidInCapitalSharebasedCompensationRequisiteServicePeriodRecognitionValue
  us-gaap:SeveranceCosts1
  us-gaap:OperatingExpenses

Valid Units (12 total):
  iso4217:USD
  shares
  pure
  iso4217:USDshares
  dell:vote
  iso4217:EUR
  dell:facility
  dell:segment
  dell:tranch

In [11]:
# Preview LLM context
if 'catalog' in dir():
    context = catalog.to_llm_context()
    print(f"LLM Context ({len(context):,} chars):\n")
    print(context[:3000])
    print("\n... [truncated] ...")

LLM Context (25,973 chars):

<<<BEGIN_XBRL_REFERENCE_DATA>>>
COMPANY: DELL TECHNOLOGIES INC (DELL)
CIK: 0001571996 | Industry: ComputerHardware | Sector: Technology

LEGEND:
• qname = unique concept identifier (e.g., us-gaap:Revenues)
• label = human-readable name
• balance: credit = ↑equity/liability | debit = ↑assets/expenses

────────────────────────────────────────────────────────────────────────
FILINGS (2 reports, 3,536 total facts)
────────────────────────────────────────────────────────────────────────
10-Q 2025-05-02 [1,231 facts]
10-K 2025-01-31 [2,305 facts] | Earnings Per Share, Basic=1.7, Net Income (Loss) Attributable to Parent=1,210,000,000, Assets=82,126,000,000

────────────────────────────────────────────────────────────────────────
CONCEPTS (814 total, top 100 shown)
History shows recent values for magnitude validation (10-K=annual, 10-Q=quarterly)
────────────────────────────────────────────────────────────────────────
── TOP CONCEPTS (with history for magnitude val

---
## 8. Real LangExtract Call

Calls LangExtract directly to see the **raw extraction output** before any postprocessing. This is the actual LLM call with your prompt and examples.

In [12]:
# Load sample 8-K and catalog
TICKER = "DELL"
SAMPLE_FILE = "/home/faisal/EventMarketDB/drivers/8K_XBRL_Linking/sample_data/DELL_1571996_2025-08-28_000157199625000096/exhibit_EX-99.1.txt"

# Load 8-K text
with open(SAMPLE_FILE, 'r', encoding='utf-8') as f:
    sample_8k_text = f.read()

print(f"Loaded 8-K: {len(sample_8k_text):,} characters")

# Fetch catalog
from xbrl_catalog import xbrl_catalog
catalog = xbrl_catalog(TICKER, limit_filings=2)
print(f"Catalog: {len(catalog.concepts)} concepts, {len(catalog.units)} units")

# Build full context (document + catalog)
llm_context = catalog.to_llm_context()
full_context = f"{sample_8k_text}\n\n{llm_context}"
print(f"Full context: {len(full_context):,} characters")

Loaded 8-K: 28,441 characters
Catalog: 814 concepts, 12 units
Full context: 54,416 characters


In [13]:
# Initialize LangExtract
import langextract as lx
from extraction_config import PROMPT_DESCRIPTION, EXAMPLES

print("LangExtract imported")
print(f"Model: gemini-2.0-flash")
print(f"Prompt length: {len(PROMPT_DESCRIPTION)} chars")
print(f"Examples: {len(EXAMPLES)}")

LangExtract imported
Model: gemini-2.0-flash
Prompt length: 2388 chars
Examples: 6


In [14]:
# Run extraction - THIS IS THE REAL LANGEXTRACT CALL
print("Running LangExtract extraction...")
print("(This may take 30-60 seconds depending on document size)\n")

annotated_doc = lx.extract(
    text_or_documents=full_context,
    prompt_description=PROMPT_DESCRIPTION,
    examples=EXAMPLES,
    model_id="gemini-2.0-flash"
)

# Get extractions from annotated document
raw_extractions = annotated_doc.extractions

print(f"✓ LangExtract returned {len(raw_extractions)} extractions")

Running LangExtract extraction...
(This may take 30-60 seconds depending on document size)



[94m[1mLangExtract[0m: model=[92mgemini-2.0-flash[0m, current=[92m7,256[0m chars, processed=[92m54,363[0m chars:  [01:07]

[92m✓[0m Extraction processing complete
[92m✓[0m Extracted [1m140[0m entities ([1m1[0m unique types)
  [96m•[0m Time: [1m67.57s[0m
  [96m•[0m Speed: [1m805[0m chars/sec
  [96m•[0m Chunks: [1m55[0m
✓ LangExtract returned 140 extractions





In [15]:
# Inspect RAW LangExtract output (before postprocessing)
print(f"RAW LANGEXTRACT OUTPUT ({len(raw_extractions)} extractions)\n" + "="*70)

for i, ext in enumerate(raw_extractions, 1):
    print(f"\n[{i}] Raw Extraction")
    print(f"    class: {ext.extraction_class}")
    print(f"    text: \"{ext.extraction_text[:80]}...\"" if len(ext.extraction_text) > 80 else f"    text: \"{ext.extraction_text}\"")
    
    # Handle char_interval (can be None)
    if ext.char_interval:
        print(f"    span: {ext.char_interval.start_pos} - {ext.char_interval.end_pos}")
    else:
        print(f"    span: N/A")
    
    if ext.alignment_status:
        print(f"    alignment: {ext.alignment_status.value}")
    
    attrs = ext.attributes or {}
    if attrs:
        print(f"    --- Attributes ---")
        for k, v in attrs.items():
            if v is not None:
                print(f"    {k}: {v}")

RAW LANGEXTRACT OUTPUT (140 extractions)

[1] Raw Extraction
    class: financial_fact
    text: "Record revenue of $29.8 billion"
    span: 326 - 357
    alignment: match_exact
    --- Attributes ---
    concept_top1: UNMATCHED
    concept_top2: us-gaap:Revenues
    matched_period: 2025-04-01	2025-06-30
    matched_unit: USD
    confidence: .4
    reasoning: Revenue is not defined in the catalog, so I am using the us-gaap revenue as a second best guess.

[2] Raw Extraction
    class: financial_fact
    text: "Operating income of $1.8 billion"
    span: 382 - 414
    alignment: match_exact
    --- Attributes ---
    concept_top1: us-gaap:OperatingIncomeLoss
    concept_top2: us-gaap:IncomeLossFromContinuingOperationsBeforeIncomeTaxesExtraordinaryItemsNoncontrollingInterest
    matched_period: 2025-04-01	2025-06-30
    matched_unit: USD
    confidence: 0.7
    reasoning: Operating income is a reasonable match, but the catalog does not contain enough information to be sure.

[3] Raw Extr

---
## 9. Postprocess Real Extractions

Runs the postprocessor on actual LangExtract output:
1. Drops extractions whose character span ends beyond the 8-K text, i.e. inside the catalog context appended after it (precision fix)
2. Validates qnames, units, periods
3. Parses numeric values deterministically
4. Assigns status: COMMITTED / CANDIDATE_ONLY / REVIEW
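
Step 1 relies on the catalog context being appended after the 8-K text, so any span that ends past the length of the source text must point into the catalog. A minimal sketch of that filter, using plain `(start, end)` tuples instead of LangExtract objects (`split_by_source` is a hypothetical helper):

```python
def split_by_source(spans, source_length):
    """Partition (start, end) spans into those inside the source text and the rest."""
    kept, dropped = [], []
    for start, end in spans:
        if end <= source_length:
            kept.append((start, end))
        else:
            # Span reaches into whatever was appended after the source text.
            dropped.append((start, end))
    return kept, dropped
```

Comparing against `end` rather than `start` also drops a span that straddles the boundary between the document and the appended context.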

In [None]:
# Map raw extractions to RawExtraction dataclass, then postprocess
from extraction_schema import RawExtraction, ExtractionStatus, UNMATCHED
from postprocessor import postprocess

def safe_float(value, default=0.0):
    """Safely convert to float, return default if conversion fails."""
    if value is None:
        return default
    try:
        return float(value)
    except (ValueError, TypeError):
        print(f"  Warning: Could not parse confidence '{value}', using {default}")
        return default

# Get valid qnames and units from catalog
valid_qnames = set(catalog.concepts.keys())
valid_units = {"USD", "USD/share", "shares", "pure"}  # Standard units

# Filter extractions that point into catalog context (precision fix)
source_text_length = len(sample_8k_text)

# Map to RawExtraction
mapped_extractions = []
for ext in raw_extractions:
    # Only process financial_fact extractions
    if ext.extraction_class != 'financial_fact':
        continue
    
    # Get char positions from char_interval
    char_start = ext.char_interval.start_pos if ext.char_interval else 0
    char_end = ext.char_interval.end_pos if ext.char_interval else 0
    
    # Skip extractions pointing into catalog context (beyond source text)
    if char_end > source_text_length:
        print(f"Dropping extraction at char_end={char_end} (beyond source text at {source_text_length})")
        continue
    
    attrs = ext.attributes or {}
    
    mapped_extractions.append(RawExtraction(
        extraction_text=ext.extraction_text,
        char_start=char_start,
        char_end=char_end,
        concept_top1=attrs.get('concept_top1') or UNMATCHED,
        matched_period=attrs.get('matched_period') or '',
        matched_unit=attrs.get('matched_unit') or '',
        confidence=safe_float(attrs.get('confidence'), 0.0),
        reasoning=attrs.get('reasoning') or '',
        concept_top2=attrs.get('concept_top2'),
        matched_dimension=attrs.get('matched_dimension'),
        matched_member=attrs.get('matched_member')
    ))

print(f"Mapped {len(mapped_extractions)} financial_fact extractions for postprocessing")

In [None]:
# Run postprocessor
from postprocessor import validate_period

processed_facts = postprocess(mapped_extractions, valid_qnames, valid_units)

# Display results
print(f"POSTPROCESSED FACTS ({len(processed_facts)} total)\n" + "="*70)

for i, fact in enumerate(processed_facts, 1):
    status_icon = "✓" if fact.status == ExtractionStatus.COMMITTED else "○" if fact.status == ExtractionStatus.CANDIDATE_ONLY else "✗"
    print(f"\n[{i}] {status_icon} {fact.status.value}")
    print(f"    Text: \"{fact.extraction_text[:60]}...\"")
    print(f"    Concept: {fact.concept_top1}")
    if fact.concept_top2:
        print(f"    Concept2: {fact.concept_top2}")
    print(f"    Value: {fact.value_parsed:,.2f}" if fact.value_parsed is not None else "    Value: None")
    print(f"    Period: {fact.matched_period}")
    print(f"    Unit: {fact.matched_unit} (valid={fact.unit_valid})")
    print(f"    Confidence: {fact.confidence:.2f}")
    if fact.parse_error:
        print(f"    Parse Error: {fact.parse_error}")

# Summary
committed = sum(1 for f in processed_facts if f.status == ExtractionStatus.COMMITTED)
candidate = sum(1 for f in processed_facts if f.status == ExtractionStatus.CANDIDATE_ONLY)
review = sum(1 for f in processed_facts if f.status == ExtractionStatus.REVIEW)
print(f"\n" + "="*70)
print(f"SUMMARY: {committed} COMMITTED | {candidate} CANDIDATE_ONLY | {review} REVIEW")

---
## 10. Save Extraction Results

Saves extractions in two formats:
- **JSONL**: Raw machine-readable format with all extraction data
- **HTML**: Interactive visualization with entity highlighting in source context
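
To check a saved JSONL file without LangExtract, it can be read back with the standard library, since JSONL is one JSON object per line. This is a generic sketch; `read_jsonl` is a hypothetical helper and makes no assumption about which keys LangExtract writes.

```python
import json


def read_jsonl(path):
    """Load every non-empty line of a JSONL file as a Python object."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines
                records.append(json.loads(line))
    return records
```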

In [None]:
# Output directory
import os
from datetime import datetime

OUTPUT_DIR = "/home/faisal/EventMarketDB/drivers/8K_XBRL_Linking/output"
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Generate timestamp for filenames
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
base_name = f"DELL_8K_{timestamp}"

print(f"Output directory: {OUTPUT_DIR}")
print(f"Base filename: {base_name}")

In [None]:
# Save annotated document to JSONL using LangExtract's built-in function
jsonl_path = os.path.join(OUTPUT_DIR, f"{base_name}.jsonl")

# save_annotated_documents expects an iterator of AnnotatedDocument
lx.io.save_annotated_documents(
    annotated_documents=[annotated_doc],  # List of AnnotatedDocument
    output_dir=OUTPUT_DIR,
    output_name=f"{base_name}.jsonl"
)

print(f"✓ Saved annotated document to: {jsonl_path}")

In [None]:
# Generate interactive HTML visualization
html_path = os.path.join(OUTPUT_DIR, f"{base_name}.html")

# Use LangExtract's visualize - can take AnnotatedDocument directly
html_content = lx.visualize(annotated_doc)

# Save HTML file
with open(html_path, "w", encoding="utf-8") as f:
    if hasattr(html_content, 'data'):
        f.write(html_content.data)
    else:
        f.write(str(html_content))

print(f"✓ Saved HTML visualization to: {html_path}")
print(f"\nOpen in browser to view extractions highlighted in source text.")