# 8-K → XBRL Linking Experiment (Notebook-First, Deterministic)

**Goal**: Extract structured facts from 8-K sections/exhibits and link them to existing XBRL nodes **only when provable**, maximizing coverage **without sacrificing reliability**.

**Key principles**:
- Deterministic extraction (span-verified).
- Proof-based linking only (exact concept/unit/period/member matches).
- Any fact without a provable link remains `unmapped` (still retained).
- Every step validates its output before moving on.

**High-level structure**:
1. Configure & connect to Neo4j.
2. Freeze a deterministic data snapshot (one CIK, a small set of filings).
3. Extract facts with span verification (deterministic).
4. Build company-scoped XBRL catalogs (concepts/units/periods/members).
5. Deterministic linking ladder (concept → unit → period → member/context).
6. Validate invariants & coverage.
7. Explore alternative strategies (presentation/calculation networks).


## 0) Configuration
Fill in Neo4j connection settings and choose a test CIK + sample filings.
We **only** use a small frozen subset for deterministic testing.
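
Before connecting, it can help to sanity-check the settings. The sketch below is illustrative only: `check_config` is a hypothetical helper, and the 10-digit zero-padded CIK format is an assumption based on the `TEST_CIK` value used in this notebook.

```python
import re
from typing import List, Optional

def check_config(cik: str, password: Optional[str], filings: List[str]) -> List[str]:
    """Return a list of problems; an empty list means the config looks usable."""
    problems = []
    # Assumption: CIKs are stored as 10-digit, zero-padded strings.
    if not re.fullmatch(r"\d{10}", cik):
        problems.append(f"CIK {cik!r} is not a 10-digit zero-padded string")
    if not password:
        problems.append("NEO4J_PASSWORD is not set")
    # Duplicate accession numbers would silently shrink the snapshot.
    if len(set(filings)) != len(filings):
        problems.append("TEST_FILINGS contains duplicates")
    return problems

print(check_config("0000320193", "secret", []))
print(check_config("320193", None, ["a", "a"]))
```

Returning a list of problems rather than raising on the first one lets all configuration mistakes surface in a single run.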


In [None]:
from dataclasses import dataclass
from typing import List, Dict, Any, Optional, Tuple
import hashlib
import json
import pandas as pd
import re
import os
from dotenv import load_dotenv

load_dotenv()

# === USER CONFIG ===
NEO4J_URI = "bolt://localhost:30687"  # NodePort per CLAUDE.md
NEO4J_USER = "neo4j"
NEO4J_PASSWORD = os.getenv('NEO4J_PASSWORD')
assert NEO4J_PASSWORD, "NEO4J_PASSWORD is not set (expected in the environment or .env)"

TEST_CIK = "0000320193"  # Apple
TEST_FILINGS = []  # Optionally set explicit accession numbers

## 1) Connect to Neo4j (strict validation)
We fail fast if connection does not work.


In [None]:
from neo4j import GraphDatabase

def get_driver(uri: str, user: str, password: str):
    driver = GraphDatabase.driver(uri, auth=(user, password))
    with driver.session() as session:
        result = session.run("RETURN 1 AS ok").single()
        assert result and result["ok"] == 1, "Neo4j connection validation failed"
    return driver

driver = get_driver(NEO4J_URI, NEO4J_USER, NEO4J_PASSWORD)
print("Neo4j connection OK")


## 2) Freeze a deterministic data snapshot
We pull a small, fixed subset of 8-K sections/exhibits and XBRL nodes for a single CIK.
This prevents moving-target behavior.
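
One way to confirm that a snapshot really is frozen is to fingerprint it. The following is a minimal sketch, not part of the pipeline: `snapshot_fingerprint` is a hypothetical helper that hashes a canonical JSON encoding, and the sample dictionaries are made up for illustration.

```python
import hashlib
import json

def snapshot_fingerprint(snapshot: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the snapshot."""
    # sort_keys and fixed separators remove incidental differences in dict order
    # and whitespace; list order is still significant.
    canonical = json.dumps(snapshot, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Same content, different key order: the fingerprints agree.
a = {"0001": {"sections": [{"id": "s1", "content": "x"}], "exhibits": []}}
b = {"0001": {"exhibits": [], "sections": [{"content": "x", "id": "s1"}]}}
print(snapshot_fingerprint(a) == snapshot_fingerprint(b))

# Changed content: the fingerprints differ.
c = {"0001": {"sections": [{"id": "s1", "content": "y"}], "exhibits": []}}
print(snapshot_fingerprint(a) == snapshot_fingerprint(c))
```

Because list order affects the hash, sections and exhibits should be put in a stable order (for example sorted by id) before fingerprinting. Comparing the fingerprint across runs then shows whether the underlying data moved.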


In [None]:
def run_query(driver, cypher: str, params: Optional[Dict[str, Any]] = None):
    with driver.session() as session:
        return list(session.run(cypher, params or {}))

filings_query = (
    "MATCH (r:Report {cik: $cik, formType: '8-K'}) \n"
    "RETURN r.accessionNo AS accessionNo \n"
    # The accession number breaks ties so equal timestamps still yield a stable order.
    "ORDER BY r.created DESC, r.accessionNo LIMIT 20"
)
records = run_query(driver, filings_query, {"cik": TEST_CIK})
candidate_filings = [r["accessionNo"] for r in records]

filing_ids = TEST_FILINGS or candidate_filings[:10]
assert filing_ids, "No 8-K filings found for TEST_CIK"
print(f"Using {len(filing_ids)} 8-K filings for snapshot")

content_query = (
    "MATCH (r:Report) \n"
    "WHERE r.accessionNo IN $filing_ids \n"
    "OPTIONAL MATCH (r)-[:HAS_SECTION]->(s:ExtractedSectionContent) \n"
    "OPTIONAL MATCH (r)-[:HAS_EXHIBIT]->(e:ExhibitContent) \n"
    "RETURN r.accessionNo AS filing_id, \n"
    "       collect(DISTINCT {id: s.id, section_name: s.section_name, content: s.content}) AS sections, \n"
    "       collect(DISTINCT {id: e.id, exhibit_number: e.exhibit_number, content: e.content}) AS exhibits"
)
content_records = run_query(driver, content_query, {"filing_ids": filing_ids})

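snapshot = {
    rec["filing_id"]: {
        # Drop the null placeholders produced by OPTIONAL MATCH and sort by id,
        # since collect() does not guarantee a stable order.
        "sections": sorted((s for s in rec["sections"] if s.get("id")), key=lambda s: str(s["id"])),
        "exhibits": sorted((e for e in rec["exhibits"] if e.get("id")), key=lambda e: str(e["id"])),
    }
    for rec in content_records
}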

assert any(v["sections"] or v["exhibits"] for v in snapshot.values()), "Snapshot is empty"
print(f"Snapshot created for {len(snapshot)} filings")

with open("8k_snapshot.json", "w") as f:
    # sort_keys keeps the key order in the file stable across runs.
    json.dump(snapshot, f, sort_keys=True)
print("Snapshot saved to 8k_snapshot.json")


## 3) Deterministic extraction (span-verified)
We do not proceed unless every extracted span exactly matches source text.
Replace `extract_facts_from_text` with your deterministic extraction method.
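
To make the span contract concrete, here is a toy extractor that satisfies it. This is a sketch under stated assumptions: the `Span` class, `AMOUNT_RE` pattern, and sample sentence are illustrative and are not the extractor this notebook is meant to use.

```python
import re
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Span:
    value_text: str
    start: int
    end: int

# Illustrative pattern only: dollar amounts such as "$1,234.5 million".
AMOUNT_RE = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?(?: (?:million|billion))?")

def extract_amount_spans(text: str) -> List[Span]:
    # Offsets come straight from the regex match, so they index the original text.
    return [Span(m.group(0), m.start(), m.end()) for m in AMOUNT_RE.finditer(text)]

def spans_are_exact(text: str, spans: List[Span]) -> bool:
    # The contract: slicing the source with the span reproduces the value verbatim.
    return all(text[s.start:s.end] == s.value_text for s in spans)

text = "Revenue was $94.9 billion, up from $89.5 billion."
spans = extract_amount_spans(text)
print([s.value_text for s in spans])
print(spans_are_exact(text, spans))
```

Because the offsets are taken from the match object rather than recomputed, the contract holds by construction. An extractor that rewrites or normalizes text before reporting offsets would need the explicit check.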


In [None]:
@dataclass
class ExtractedFact:
    filing_id: str
    source_id: str
    source_type: str  # 'section' or 'exhibit'
    metric: str
    value_text: str
    span_start: int
    span_end: int
    raw_context: str

def extract_facts_from_text(text: str) -> List[ExtractedFact]:
    """
    Placeholder deterministic extractor.
    Replace with LangExtract or other deterministic extraction with span outputs.
    Must return spans that exactly match text[span_start:span_end].
    """
    return []

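def validate_spans(text: str, facts: List[ExtractedFact]) -> None:
    for f in facts:
        # Reject negative or out-of-range offsets, which slicing would silently accept.
        assert 0 <= f.span_start <= f.span_end <= len(text), (
            f"Span out of bounds: [{f.span_start}, {f.span_end}) for text of length {len(text)}"
        )
        extracted = text[f.span_start:f.span_end]
        assert extracted == f.value_text, (
            f"Span mismatch: expected '{f.value_text}', got '{extracted}'"
        )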

all_facts: List[ExtractedFact] = []

for filing_id, data in snapshot.items():
    # Sections and exhibits are handled identically; only source_type differs.
    for source_type, key in (("section", "sections"), ("exhibit", "exhibits")):
        for item in data[key]:
            text = item["content"] or ""
            facts = extract_facts_from_text(text)
            validate_spans(text, facts)
            for f in facts:
                f.filing_id = filing_id
                f.source_id = item["id"]
                f.source_type = source_type
            all_facts.extend(facts)

print(f"Extracted {len(all_facts)} facts (span-verified)")


## 4) Build XBRL catalogs (company-scoped)
Concepts, units, periods, and members are all restricted to `TEST_CIK`.
We only allow exact matches to entries in these catalogs.
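
Exact matching is only a proof if the matched label points to a single concept. The sketch below inverts a catalog into a label index and keeps only unambiguous labels. It is illustrative: `build_label_index` and `unambiguous_labels` are hypothetical helpers, the normalization is a simplified stand-in for `normalize_label`, and the rows are made-up examples rather than query results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

def build_label_index(rows: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each normalized label to every qname that carries it."""
    index = defaultdict(set)
    for qname, label in rows:
        key = " ".join(label.lower().split())  # simplified normalization
        if key:
            index[key].add(qname)
    return dict(index)

def unambiguous_labels(index: Dict[str, Set[str]]) -> Dict[str, str]:
    """Keep only labels that identify exactly one concept."""
    return {label: next(iter(qnames)) for label, qnames in index.items() if len(qnames) == 1}

rows = [
    ("us-gaap:Revenues", "Revenues"),
    ("us-gaap:RevenueFromContractWithCustomerExcludingAssessedTax", " revenues "),
    ("us-gaap:NetIncomeLoss", "Net  Income"),
]
index = build_label_index(rows)
print(sorted(index["revenues"]))
print(unambiguous_labels(index))
```

A label shared by two concepts, such as "revenues" in this example, cannot prove a link on its own and would leave the fact `unmapped` at the concept rung.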


In [None]:
def normalize_label(label: str) -> str:
    label = label.lower().strip()
    label = re.sub(r"[\n\r\t]+", " ", label)
    label = re.sub(r"[^a-z0-9:% ]+", "", label)
    return re.sub(r"\s+", " ", label).strip()

concepts_query = (
    "MATCH (r:Report {cik: $cik})-[:HAS_XBRL]->(x:XBRLNode) \n"
    "MATCH (x)<-[:REPORTS]-(f:Fact)-[:HAS_CONCEPT]->(c:Concept) \n"
    "RETURN DISTINCT c.qname AS qname, c.label AS label"
)
concept_records = run_query(driver, concepts_query, {"cik": TEST_CIK})

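concept_catalog = {}
for r in concept_records:
    qname = r["qname"]
    label = r.get("label") or ""
    if not qname:
        continue
    # setdefault keeps labels already collected when a qname appears on several rows;
    # plain assignment would reset the label set each time.
    entry = concept_catalog.setdefault(qname, {"qname": qname, "labels": set()})
    if label:
        entry["labels"].add(normalize_label(label))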

assert concept_catalog, "No concepts found for company"
print(f"Loaded {len(concept_catalog)} company concepts")

units_query = (
    "MATCH (r:Report {cik: $cik})-[:HAS_XBRL]->(x:XBRLNode) \n"
    "MATCH (x)<-[:REPORTS]-(f:Fact)-[:HAS_UNIT]->(u:Unit) \n"
    "RETURN DISTINCT u.id AS id, u.string_value AS string_value"
)
unit_records = run_query(driver, units_query, {"cik": TEST_CIK})
unit_catalog = {r["string_value"]: r["id"] for r in unit_records if r.get("string_value")}
print(f"Loaded {len(unit_catalog)} units")

periods_query = (
    "MATCH (r:Report {cik: $cik})-[:HAS_XBRL]->(x:XBRLNode) \n"
    "MATCH (x)<-[:REPORTS]-(f:Fact)-[:HAS_PERIOD]->(p:Period) \n"
    "RETURN DISTINCT p.u_id AS u_id"
)
period_records = run_query(driver, periods_query, {"cik": TEST_CIK})
period_catalog = {r["u_id"] for r in period_records if r.get("u_id")}
print(f"Loaded {len(period_catalog)} periods")

members_query = (
    "MATCH (r:Report {cik: $cik})-[:HAS_XBRL]->(x:XBRLNode) \n"
    "MATCH (x)<-[:REPORTS]-(f:Fact)-[:FACT_MEMBER]->(m:Member) \n"
    "RETURN DISTINCT m.name AS name"
)
member_records = run_query(driver, members_query, {"cik": TEST_CIK})
member_catalog = {normalize_label(r["name"]) for r in member_records if r.get("name")}
print(f"Loaded {len(member_catalog)} members")


## 5) Deterministic linking ladder (concept → unit → period → member)
We only link when exact evidence exists. Otherwise, remain `unmapped`.
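
The ladder can be read as "climb until the first missing rung". A minimal sketch of that rule follows. `LADDER` and `completeness_level` are hypothetical names, and the level names are the ones used in this notebook.

```python
from typing import Optional

# Rungs in climbing order, paired with the level reached once that rung is matched.
LADDER = [
    ("concept", "concept_only"),
    ("unit", "concept_unit"),
    ("period", "concept_unit_period"),
    ("member", "full"),
]

def completeness_level(concept: Optional[str] = None, unit: Optional[str] = None,
                       period: Optional[str] = None, member: Optional[str] = None) -> str:
    matched = {"concept": concept, "unit": unit, "period": period, "member": member}
    level = "unmapped"
    for rung, name in LADDER:
        if not matched[rung]:
            break  # a missing rung caps the level, whatever was matched further up
        level = name
    return level

print(completeness_level("us-gaap:Revenues", "u1"))
print(completeness_level("us-gaap:Revenues", None, "p1"))
print(completeness_level(None, "u1"))
```

With this rule, a fact that has a concept and a period but no unit stays at `concept_only`, and a unit match without a concept stays `unmapped`.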


In [None]:
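from dataclasses import field

@dataclass
class LinkedFact:
    fact: ExtractedFact
    concept_qname: Optional[str] = None
    unit_id: Optional[str] = None
    period_u_id: Optional[str] = None
    member_name: Optional[str] = None
    completeness: str = "unmapped"
    # default_factory gives each instance its own dict instead of a None default.
    evidence: Dict[str, str] = field(default_factory=dict)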

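def match_concept(metric: str) -> Tuple[Optional[str], Optional[str]]:
    # An exact qname match is the strongest evidence, so check it first.
    if metric in concept_catalog:
        return metric, f"qname:{metric}"
    norm = normalize_label(metric)
    if not norm:
        return None, None
    # Accept a label match only when exactly one concept carries that label;
    # an ambiguous label proves nothing, so the fact stays unmapped.
    matches = sorted(q for q, data in concept_catalog.items() if norm in data["labels"])
    if len(matches) == 1:
        return matches[0], f"label:{norm}"
    return None, None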

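def match_unit(text: str) -> Tuple[Optional[str], Optional[str]]:
    candidates = []
    if re.search(r"\bUSD\b|\$", text):
        candidates.append("iso4217:USD")
    if re.search(r"\bshares\b", text, re.IGNORECASE):
        candidates.append("shares")
    # Keep only candidates that exist in the company's unit catalog.
    known = [c for c in candidates if c in unit_catalog]
    # Link only when the context points to exactly one unit; mixed signals
    # (for example dollars and shares in the same context) stay unmapped.
    if len(known) == 1:
        return unit_catalog[known[0]], f"unit:{known[0]}"
    return None, None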

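def match_member(text: str) -> Tuple[Optional[str], Optional[str]]:
    # Pad with spaces so members match whole words only, not fragments of longer words.
    norm = f" {normalize_label(text)} "
    # Iterate in sorted order: set iteration order for strings can differ between runs.
    hits = [m for m in sorted(member_catalog) if m and f" {m} " in norm]
    if hits:
        best = max(hits, key=len)  # prefer the most specific (longest) member name
        return best, f"member:{best}"
    return None, None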

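def match_period(text: str) -> Tuple[Optional[str], Optional[str]]:
    # Placeholder: no period matcher is implemented yet, so no fact can currently
    # reach the "concept_unit_period" or "full" levels.
    return None, None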

linked_facts: List[LinkedFact] = []

for fact in all_facts:
    lf = LinkedFact(fact=fact, evidence={})
    qname, ev = match_concept(fact.metric)
    if qname:
        lf.concept_qname = qname
        lf.evidence["concept"] = ev

    unit_id, unit_ev = match_unit(fact.raw_context)
    if unit_id:
        lf.unit_id = unit_id
        lf.evidence["unit"] = unit_ev

    period_id, period_ev = match_period(fact.raw_context)
    if period_id:
        lf.period_u_id = period_id
        lf.evidence["period"] = period_ev

    member_name, member_ev = match_member(fact.raw_context)
    if member_name:
        lf.member_name = member_name
        lf.evidence["member"] = member_ev

    if lf.concept_qname and lf.unit_id and lf.period_u_id and lf.member_name:
        lf.completeness = "full"
    elif lf.concept_qname and lf.unit_id and lf.period_u_id:
        lf.completeness = "concept_unit_period"
    elif lf.concept_qname and lf.unit_id:
        lf.completeness = "concept_unit"
    elif lf.concept_qname:
        lf.completeness = "concept_only"
    else:
        lf.completeness = "unmapped"

    linked_facts.append(lf)

print(f"Linked {len(linked_facts)} facts")


## 6) Validation gates (must pass before proceeding)
We do not move forward if any validation fails.
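
A gate is easier to debug when it reports every violation rather than stopping at the first. The sketch below checks one link against three invariants: no match without evidence, no evidence without a match, and every matched value present in its catalog. `gate_violations` and the sample catalogs are hypothetical.

```python
from typing import Dict, List, Optional, Set

def gate_violations(link: Dict[str, Optional[str]], evidence: Dict[str, str],
                    catalogs: Dict[str, Set[str]]) -> List[str]:
    """Return every violated invariant for one link (an empty list means it passes)."""
    problems = []
    for slot, value in link.items():
        if value is None:
            # No match claimed, so there must be no evidence either.
            if slot in evidence:
                problems.append(f"{slot}: evidence without a match")
            continue
        if slot not in evidence:
            problems.append(f"{slot}: match without evidence")
        if value not in catalogs.get(slot, set()):
            problems.append(f"{slot}: {value!r} not in catalog")
    return problems

catalogs = {"concept": {"us-gaap:Revenues"}, "unit": {"u1"}}
good = gate_violations({"concept": "us-gaap:Revenues", "unit": None},
                       {"concept": "label:revenues"}, catalogs)
bad = gate_violations({"concept": "x:Unknown", "unit": "u1"},
                      {"unit": "unit:usd"}, catalogs)
print(good)
print(bad)
```

The notebook's own gate uses assertions, which is appropriate for failing fast. A list-returning variant like this one is mainly useful when triaging many failures at once.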


In [None]:
for lf in linked_facts:
    if lf.concept_qname:
        assert "concept" in lf.evidence, "Missing concept evidence"
    if lf.unit_id:
        assert "unit" in lf.evidence, "Missing unit evidence"
    if lf.period_u_id:
        assert "period" in lf.evidence, "Missing period evidence"
    if lf.member_name:
        assert "member" in lf.evidence, "Missing member evidence"

valid_levels = {"unmapped", "concept_only", "concept_unit", "concept_unit_period", "full"}
assert all(lf.completeness in valid_levels for lf in linked_facts)

print("Validation gates passed")


## 7) Ground-truth verification using 10-Q within 30 days

Per your note: the 10-Q filed within about 30 days after an 8-K reports the same quarter's values.
We can use **10-Q XBRL facts as ground truth** to check 8-K concept matches: an exact value match on the same concept/unit/period is strong evidence that the link is correct.
This step is **validation-only**. It never creates links; it only confirms or rejects existing ones.

**Rules:**
- For each 8-K fact linked to a concept/unit/period, find the nearest 10-Q for the same CIK within 30 days.
- Compare the linked concept/unit/period values against 10-Q XBRL facts.
- If the value matches exactly (or within a strict tolerance for decimals), mark the link as `ground_truth_verified=true`.
- If no matching 10-Q fact exists or values differ, downgrade the completeness or mark for review.
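
The "exact or within a strict tolerance" rule can be sketched with `Decimal` arithmetic, which avoids float rounding surprises. This is an illustration only: `to_decimal`, `values_match`, and the `rel_tol` parameter are hypothetical, and the sketch assumes both values are already expressed in the same scale (an 8-K figure quoted in billions would need scaling first).

```python
from decimal import Decimal, InvalidOperation
from typing import Optional

def to_decimal(raw) -> Optional[Decimal]:
    """Parse a numeric value, tolerating thousands separators; None if unparseable."""
    try:
        value = Decimal(str(raw).replace(",", "").strip())
    except InvalidOperation:
        return None
    return value if value.is_finite() else None

def values_match(eightk_value, xbrl_value, rel_tol: Decimal = Decimal("0")) -> bool:
    a, b = to_decimal(eightk_value), to_decimal(xbrl_value)
    if a is None or b is None:
        return False  # an unparseable value can never verify a link
    if a == b:
        return True
    # Relative tolerance, measured against the larger magnitude.
    return abs(a - b) <= rel_tol * max(abs(a), abs(b))

print(values_match("94,930", 94930))
print(values_match("100", "101"))
print(values_match("100", "101", Decimal("0.02")))
```

The default tolerance of zero keeps the check strict. Any non-zero tolerance is a policy decision that should be recorded alongside the `ground_truth_verified` flag.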


In [None]:
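# Ground-truth validation: 8-K facts vs 10-Q XBRL facts (same company, within 30 days)
# NOTE: This is validation-only and does not create links.
# NOTE: The query assumes EightKFact nodes have already been written to the graph, and it
# uses labels/relationships (Company, FILED, HAS_CONTENT, HAS_FACT) that differ from the
# earlier cells (HAS_SECTION, HAS_XBRL, REPORTS). Confirm them against the actual schema.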

GROUND_TRUTH_WINDOW_DAYS = 30

GROUND_TRUTH_QUERY = """
MATCH (c:Company {cik: $cik})-[:FILED]->(r8:Report {formType: '8-K'})
MATCH (c)-[:FILED]->(r10q:Report {formType: '10-Q'})
WHERE date(r10q.filingDate) >= date(r8.filingDate)
  AND duration.inDays(date(r8.filingDate), date(r10q.filingDate)).days <= $window_days
WITH c, r8, r10q
MATCH (r8)-[:HAS_CONTENT]->(esc:ExtractedSectionContent)
MATCH (esc)<-[:EXTRACTED_FROM]-(f:EightKFact)
MATCH (f)-[:HAS_CONCEPT]->(concept:Concept)
OPTIONAL MATCH (f)-[:HAS_UNIT]->(unit:Unit)
OPTIONAL MATCH (f)-[:HAS_PERIOD]->(period:Period)
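WITH c, r8, r10q, f, concept, unit, period
MATCH (r10q)-[:HAS_FACT]->(xf:Fact)-[:HAS_CONCEPT]->(concept)
// Require the 10-Q fact to share the 8-K fact's unit and period whenever those are linked.
WHERE (unit IS NULL OR (xf)-[:HAS_UNIT]->(unit))
  AND (period IS NULL OR (xf)-[:HAS_PERIOD]->(period))
RETURN r8.filingDate AS filing_8k, r10q.filingDate AS filing_10q,
       f.fact_id AS eightk_fact_id, f.value AS eightk_value,
       xf.fact_id AS xbrl_fact_id, xf.value AS xbrl_value,
       concept.qname AS concept_qname, unit.id AS unit_id, period.u_id AS period_u_id
ORDER BY filing_8k, eightk_fact_id, xbrl_fact_id
LIMIT 100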
"""

ground_truth_rows = run_query(driver, GROUND_TRUTH_QUERY, {
    'cik': TEST_CIK,
    'window_days': GROUND_TRUTH_WINDOW_DAYS,
})
ground_truth_df = pd.DataFrame([dict(r) for r in ground_truth_rows])
ground_truth_df.head()


## 8) Build a company-specific ground-truth mapping cache (future 8-Ks)

We can use the **10-Q ground-truth validation** to create a durable, company-specific mapping cache
that improves reliability for **future 8-Ks**. This does **not** create links by itself; it
only stores verified mapping evidence for reuse.

**Idea:**
- When an 8-K fact is verified against a 10-Q fact (within 30 days), record the verified mapping
  from the extracted metric string → Concept/Unit/Period (and Member if applicable).
- Store a normalized metric key and the verified concept qname as a mapping entry tied to the CIK.
- For future 8-Ks, only allow a concept match if the metric string matches a verified mapping
  for that company (or passes the strict label/qname checks).
- This creates a **deterministic, company-specific translation table** backed by ground truth.

**Benefits:**
- High reliability: every mapping traces back to a fact pair verified against a 10-Q.
- Higher coverage over time as more 8-K ↔ 10-Q matches accumulate.
- No semantic guessing; all mappings are evidence-backed.
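
A translation table is only deterministic if each key maps to one concept. The sketch below builds a cache keyed by (CIK, normalized metric) and drops any key for which verified pairs disagree. `build_mapping_cache` is a hypothetical helper, the normalization is a simplified stand-in for `normalize_metric`, and the rows are illustrative, not real verification results.

```python
from typing import Dict, Iterable, Set, Tuple

Key = Tuple[str, str]

def build_mapping_cache(verified: Iterable[Tuple[str, str, str]]) -> Dict[Key, str]:
    """Build {(cik, metric_key): concept_qname}, dropping contradictory keys."""
    cache: Dict[Key, str] = {}
    conflicted: Set[Key] = set()
    for cik, metric, qname in verified:
        key = (cik, " ".join(metric.lower().split()))  # simplified normalization
        if key in conflicted:
            continue
        if cache.setdefault(key, qname) != qname:
            # Two verified pairs disagree, so this metric string is not a reliable key.
            del cache[key]
            conflicted.add(key)
    return cache

pairs = [
    ("0000320193", "Net sales", "us-gaap:RevenueFromContractWithCustomerExcludingAssessedTax"),
    ("0000320193", "net  sales", "us-gaap:RevenueFromContractWithCustomerExcludingAssessedTax"),
    ("0000320193", "Income", "us-gaap:NetIncomeLoss"),
    ("0000320193", "income", "us-gaap:OperatingIncomeLoss"),
    ("0000320193", "Income", "us-gaap:NetIncomeLoss"),
]
cache = build_mapping_cache(pairs)
print(cache)
```

Dropping a conflicted key entirely, rather than keeping the first or most frequent concept, keeps the cache consistent with the rule that ambiguous evidence never produces a link.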


In [None]:
# Build a company-specific mapping cache from ground-truth verified pairs
# This is a local, notebook-only cache; productionization would store it in the graph.

def normalize_metric(text: str) -> str:
    return normalize_label(text)

# Example: if ground_truth_df has verified matches, build a mapping dict
mapping_cache = {}  # {normalized_metric: {concept_qname, unit_id, period_u_id}}

if not ground_truth_df.empty:
    for row in ground_truth_df.itertuples(index=False):
        # The metric column must be returned by the query (e.g. from EightKFact.metric);
        # until it is, getattr falls back to None and the row is skipped.
        metric_key = getattr(row, 'metric', None)
        concept_qname = getattr(row, 'concept_qname', None)
        unit_id = getattr(row, 'unit_id', None)
        period_u_id = getattr(row, 'period_u_id', None)
        if metric_key and concept_qname:
            mapping_cache[normalize_metric(metric_key)] = {
                'concept_qname': concept_qname,
                'unit_id': unit_id,
                'period_u_id': period_u_id,
            }

len(mapping_cache)


## 9) Explore alternative strategies (Presentation / Calculation / Networks)
We can use XBRL presentation/calculation networks to validate concept proximity or detect duplicates.
These are **secondary checks** and must never override the proof-based match.


In [None]:
presentation_query = (
    "MATCH (c:Concept {qname: $qname})-[:PRESENTATION_EDGE]->(child:Concept) \n"
    "RETURN child.qname AS qname"
)

def validate_presentation_context(qname: str) -> List[str]:
    records = run_query(driver, presentation_query, {"qname": qname})
    return [r["qname"] for r in records]

for lf in linked_facts[:5]:
    if lf.concept_qname:
        neighbors = validate_presentation_context(lf.concept_qname)
        print(lf.concept_qname, "presentation children count:", len(neighbors))


## 10) Coverage summary (diagnostics only)
We calculate coverage without compromising the reliability rules.


In [None]:
from collections import Counter

levels = Counter(lf.completeness for lf in linked_facts)
total = len(linked_facts) or 1

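# Print in ladder order so the report is stable from run to run.
for level in ["unmapped", "concept_only", "concept_unit", "concept_unit_period", "full"]:
    count = levels.get(level, 0)
    print(level, count, f"({count/total:.1%})")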
