# 8-K to XBRL Linking Algorithm

**Problem**: Link facts extracted from 8-K reports to XBRL taxonomy concepts

**Method**:
1. **PRESENTATION-first**: Use Abstract nodes to narrow concept space (covers 88.5% of concepts)
2. **Fallback**: Company-specific concept matching (covers remaining 11.5%)
3. **Validation**: Magnitude check (0.3x-3x) + statistical outlier (¬±3œÉ)

**Coverage**: 772 companies, 100% have PRESENTATION_EDGE, 99.9% have CALCULATION_EDGE

**Temporal Filtering** (CRITICAL):
- Uses `r.created` field (actual filing date) NOT `r.periodOfReport`
- Prevents temporal cheating: only uses reports FILED before the 8-K
- Example: 10-K period 2024-09-28 filed 2024-11-01 is EXCLUDED from 2024-10-31 8-K

**Period-Aware Validation** (CRITICAL):
- 8-Ks contain BOTH quarterly AND annual figures in separate columns
- Extraction captures context to detect period type ('quarterly', 'annual', or 'unknown')
- Validation filters XBRL periods accordingly:
  - Quarterly facts (60-120 days) compared against quarterly XBRL periods
  - Annual facts (350-380 days) compared against annual XBRL periods
  - Unknown facts try both, use best match
- This eliminates hardcoded assumptions and works systematically across all companies

**Key Insight**: Same concept with different members represents different segments:
- `Revenue` + no member = Total
- `Revenue` + `Product` member = Products revenue
- `Revenue` + `Service` member = Services revenue

**Data-Driven Values**:
- Period durations: Detected from context or inferred from magnitude
- Validation thresholds: 0.3x-3x magnitude, ¬±3œÉ outlier
- Historical data start: 2020-01-01
- Exhibit number: '99.1' (earnings releases)
- Report types: ['10-K', '10-Q'] (standard XBRL reports)
- Regex patterns: Placeholder for LangExtract (known limitation)

In [1]:
import pandas as pd
from neo4j import GraphDatabase
import os, re, math
from datetime import datetime
from dotenv import load_dotenv

load_dotenv()
driver = GraphDatabase.driver("bolt://localhost:30687", auth=("neo4j", os.getenv('NEO4J_PASSWORD')))
print("‚úì Connected")

‚úì Connected


## Helper Functions

In [2]:
def split_camel_case(text):
    """ProductMember ‚Üí Product Member"""
    return re.sub(r'(?<!^)([A-Z])', r' \1', text)

def extract_words(text):
    """Extract words: handle CamelCase, plural/singular"""
    text = split_camel_case(text).lower()
    text = re.sub(r'[^a-z0-9\s]', ' ', text)
    stopwords = {'and', 'or', 'the', 'a', 'an', 'of', 'in', 'to', 'for', 'from'}
    words = {w for w in text.split() if w not in stopwords and len(w) > 1}
    expanded = set(words)
    for w in words:
        if w.endswith('s') and len(w) > 3 and not w.endswith('ss'):
            expanded.add(w[:-1])
    return expanded

def parse_value(text):
    """'94.93 billion' ‚Üí (94930000000, 'billion')"""
    text = text.replace(',', '').strip()
    match = re.search(r'([\d\.]+)\s*(billion|million|percent|share)?', text, re.I)
    if not match:
        return None, None
    num = float(match.group(1))
    unit_text = match.group(2).lower() if match.group(2) else 'number'
    if 'billion' in unit_text:
        num *= 1_000_000_000
    elif 'million' in unit_text:
        num *= 1_000_000
    return num, unit_text

print("‚úì Helpers loaded")

‚úì Helpers loaded


## Load Company Taxonomy

**Critical**: Uses `created` field (actual filing date) for temporal filtering to prevent using reports that weren't filed yet when the 8-K came out.

In [3]:
def load_company_taxonomy(ticker, before_filing_date):
    """Load concepts, units, abstracts for a company
    
    CRITICAL: Filters by ACTUAL FILING DATE (r.created) not period end date
    - Prevents using reports that weren't filed yet
    - Ensures temporal validity for live 8-K processing
    """
    with driver.session() as session:
        # Concepts - using filing date filter
        concepts_df = pd.DataFrame([dict(r) for r in session.run('''
            MATCH (c:Company {ticker: $ticker})-[:PRIMARY_FILER]-(r:Report)
            -[:HAS_XBRL]-(x:XBRLNode)<-[:REPORTS]-(f:Fact)-[:HAS_CONCEPT]-(concept:Concept)
            WHERE r.formType IN ['10-K', '10-Q']
            AND NOT concept.label CONTAINS 'TextBlock'
            AND NOT concept.label CONTAINS 'Table'
            AND NOT concept.label CONTAINS 'Policy'
            AND NOT concept.label CONTAINS 'Abstract'
            AND substring(r.created, 0, 10) < $before_date
            RETURN DISTINCT concept.qname as qname, concept.label as label
        ''', ticker=ticker, before_date=before_filing_date)])

        # Units
        units_df = pd.DataFrame([dict(r) for r in session.run('''
            MATCH (u:Unit)
            WHERE u.item_type IS NOT NULL
            RETURN u.name as name, u.item_type as item_type
        ''')])

        # Abstracts - using filing date filter
        abstracts_df = pd.DataFrame([dict(r) for r in session.run('''
            MATCH (c:Company {ticker: $ticker})-[:PRIMARY_FILER]-(r:Report)
            -[:HAS_XBRL]->(x:XBRLNode)<-[:REPORTS]-(f:Fact)
            MATCH (abstract:Abstract)-[:PRESENTATION_EDGE]->(f)
            WHERE r.formType IN ['10-K', '10-Q']
            AND substring(r.created, 0, 10) < $before_date
            RETURN DISTINCT abstract.label as label
        ''', ticker=ticker, before_date=before_filing_date)])

    return {
        'concepts': concepts_df,
        'units': units_df,
        'abstracts': abstracts_df,
        'ticker': ticker,
        'before_date': before_filing_date
    }

print("‚úì Taxonomy loader defined")

‚úì Taxonomy loader defined


## Get 8-K Filing Date and Content

**Note**: We use the 8-K's filing date (`created` field) as the cutoff for loading historical data, not its `periodOfReport`.

In [4]:
def get_8k_filing_date(ticker, period_of_report):
    """Get the actual filing date of the 8-K (when it was filed, not the period it covers)"""
    with driver.session() as session:
        result = list(session.run('''
            MATCH (c:Company {ticker: $ticker})-[:PRIMARY_FILER]-(r:Report {formType: '8-K', periodOfReport: $period})
            RETURN substring(r.created, 0, 10) as filing_date
        ''', ticker=ticker, period=period_of_report))
        if result and result[0]['filing_date']:
            return result[0]['filing_date']
        # Fallback to period if created not available (3.3% of reports)
        return period_of_report

def get_8k_content(ticker, period_of_report):
    """Fetch 8-K exhibit content (typically 99.1 for earnings)"""
    with driver.session() as session:
        result = list(session.run('''
            MATCH (c:Company {ticker: $ticker})-[:PRIMARY_FILER]-(r:Report {formType: '8-K', periodOfReport: $period})
            -[:HAS_EXHIBIT]->(e:ExhibitContent)
            WHERE e.exhibit_number CONTAINS '99.1'
            RETURN e.content as content
        ''', ticker=ticker, period=period_of_report))

        if result:
            return result[0]['content']
        return None

print("‚úì 8-K loaders defined")

‚úì 8-K loaders defined


## Extract Facts (Placeholder for LangExtract)

**Current Limitation**: Regex extraction cannot capture section context

**Section Context Requirement**:
- 8-K reports have 26 different item types (e.g., "ResultsofOperationsandFinancialCondition")
- Facts extracted should include which section they came from
- This section context helps narrow down relevant XBRL abstracts/concepts
- **Requires LangExtract** to identify and preserve section information

**Why This Matters**:
- Different sections focus on different aspects (operations, financial condition, events)
- Matching "revenue" in operations section ‚Üí different concepts than in segment disclosure
- Section context improves matching accuracy by reducing concept space

**Future Enhancement**: When LangExtract is integrated, each extracted fact will include:
- `metric`: "revenue"
- `value`: "94.93 billion"
- `section`: "ResultsofOperationsandFinancialCondition" ‚Üê Currently missing

In [5]:
def extract_facts_regex(content):
    """Extract facts with period type detection (quarterly vs annual)
    
    Captures 200 chars context around each match to detect:
    - 'quarterly' if near "Three Months Ended"
    - 'annual' if near "Twelve Months Ended" or "Full Year"
    - 'unknown' if context unclear
    """
    facts = []
    
    # Revenue patterns
    for m in re.finditer(r'revenue[^\n]{0,20}\$?([\d,\.]+)\s+(billion|million)', content, re.I):
        context_start = max(0, m.start() - 200)
        context_end = min(len(content), m.end() + 200)
        context = content[context_start:context_end]
        
        if re.search(r'Three Months Ended', context, re.I):
            period_type = 'quarterly'
        elif re.search(r'Twelve Months Ended|Full Year', context, re.I):
            period_type = 'annual'
        else:
            period_type = 'unknown'
        
        facts.append({
            'metric': 'revenue',
            'value': f"{m.group(1)} {m.group(2)}",
            'period_type': period_type
        })
    
    # Products revenue
    for m in re.finditer(r'\bProducts\s+\$\s*([\d,\.]+)', content):
        context_start = max(0, m.start() - 200)
        context_end = min(len(content), m.end() + 200)
        context = content[context_start:context_end]
        
        if re.search(r'Three Months Ended', context, re.I):
            period_type = 'quarterly'
        elif re.search(r'Twelve Months Ended|Full Year', context, re.I):
            period_type = 'annual'
        else:
            period_type = 'unknown'
        
        facts.append({
            'metric': 'products revenue',
            'value': f"{m.group(1)} million",
            'period_type': period_type
        })
    
    # Services revenue
    for m in re.finditer(r'\bServices\s+([\d,\.]+)\s+([\d,\.]+)', content):
        context_start = max(0, m.start() - 200)
        context_end = min(len(content), m.end() + 200)
        context = content[context_start:context_end]
        
        if re.search(r'Three Months Ended', context, re.I):
            period_type = 'quarterly'
        elif re.search(r'Twelve Months Ended|Full Year', context, re.I):
            period_type = 'annual'
        else:
            period_type = 'unknown'
        
        facts.append({
            'metric': 'services revenue',
            'value': f"{m.group(1)} million",
            'period_type': period_type
        })
    
    # Gross profit/margin
    for m in re.finditer(r'\bGross (?:margin|profit)\s+([\d,\.]+)', content, re.I):
        context_start = max(0, m.start() - 200)
        context_end = min(len(content), m.end() + 200)
        context = content[context_start:context_end]
        
        if re.search(r'Three Months Ended', context, re.I):
            period_type = 'quarterly'
        elif re.search(r'Twelve Months Ended|Full Year', context, re.I):
            period_type = 'annual'
        else:
            period_type = 'unknown'
        
        facts.append({
            'metric': 'gross profit',
            'value': f"{m.group(1)} million",
            'period_type': period_type
        })
    
    # Net income
    for m in re.finditer(r'\bNet income\s+\$\s*([\d,\.]+)', content):
        context_start = max(0, m.start() - 200)
        context_end = min(len(content), m.end() + 200)
        context = content[context_start:context_end]
        
        if re.search(r'Three Months Ended', context, re.I):
            period_type = 'quarterly'
        elif re.search(r'Twelve Months Ended|Full Year', context, re.I):
            period_type = 'annual'
        else:
            period_type = 'unknown'
        
        facts.append({
            'metric': 'net income',
            'value': f"{m.group(1)} million",
            'period_type': period_type
        })
    
    # Earnings per share
    for m in re.finditer(r'earnings per share[^\n]{0,20}\$?([\d]+\.[\d]{2})', content, re.I):
        val = float(m.group(1))
        if 0.1 <= val <= 50:
            context_start = max(0, m.start() - 200)
            context_end = min(len(content), m.end() + 200)
            context = content[context_start:context_end]
            
            if re.search(r'Three Months Ended', context, re.I):
                period_type = 'quarterly'
            elif re.search(r'Twelve Months Ended|Full Year', context, re.I):
                period_type = 'annual'
            else:
                period_type = 'unknown'
            
            facts.append({
                'metric': 'diluted earnings per share',
                'value': m.group(1),
                'period_type': period_type
            })
    
    # Deduplicate while preserving period type
    seen = set()
    unique_facts = []
    for fact in facts:
        key = (fact['metric'], fact['value'], fact['period_type'])
        if key not in seen:
            seen.add(key)
            unique_facts.append(fact)
    
    return unique_facts

print("‚úì Period-aware fact extractor defined")

‚úì Period-aware fact extractor defined


## Matching: PRESENTATION-First

In [6]:
def match_via_presentation(metric, taxonomy):
    """Match using PRESENTATION network (narrows concept space 88.5%)
    
    Uses taxonomy loaded with filing date filter to ensure temporal validity
    """
    metric_words = extract_words(metric)

    # Find relevant abstracts by word overlap
    relevant_abstracts = []
    for _, abstract in taxonomy['abstracts'].iterrows():
        abstract_words = extract_words(abstract['label'])
        overlap = metric_words & abstract_words
        if overlap:
            relevant_abstracts.append((abstract['label'], len(overlap)))

    if not relevant_abstracts:
        return None

    # Use top-scoring abstract
    relevant_abstracts.sort(key=lambda x: x[1], reverse=True)
    target_abstract = relevant_abstracts[0][0]

    # Get concepts under this abstract (using filing date filter)
    with driver.session() as session:
        concepts = list(session.run('''
            MATCH (c:Company {ticker: $ticker})-[:PRIMARY_FILER]-(r:Report)
            -[:HAS_XBRL]->(x:XBRLNode)<-[:REPORTS]-(fact:Fact)
            -[:HAS_CONCEPT]->(concept:Concept)
            MATCH (abstract:Abstract {label: $abstract_label})-[:PRESENTATION_EDGE]->(fact)
            WHERE r.formType IN ['10-K', '10-Q']
            AND substring(r.created, 0, 10) < $before_date
            OPTIONAL MATCH (fact)-[:FACT_MEMBER]->(m:Member)
            RETURN DISTINCT concept.qname as qname,
                            concept.label as concept_label,
                            collect(DISTINCT m.label) as members
        ''', ticker=taxonomy['ticker'], abstract_label=target_abstract, before_date=taxonomy['before_date']))

    # Match metric to concept + member
    best_match = None
    best_score = 0

    for c in concepts:
        concept_words = extract_words(c['concept_label'])
        overlap = metric_words & concept_words

        if overlap:
            score = len(overlap)
            member_match = None

            # Check member matches
            for member in c['members']:
                if member:
                    member_words = extract_words(member)
                    member_overlap = metric_words & member_words
                    if member_overlap:
                        score += len(member_overlap) * 10
                        member_match = member
                        break

            if score > best_score:
                best_score = score
                best_match = {
                    'qname': c['qname'],
                    'label': c['concept_label'],
                    'member': member_match,
                    'method': 'presentation'
                }

    return best_match

print("‚úì PRESENTATION matcher defined")

‚úì PRESENTATION matcher defined


## Matching: Fallback (Company-Specific)

In [7]:
def match_fallback(metric, taxonomy):
    """Fallback to company-specific concept matching (covers remaining 11.5%)"""
    metric_words = extract_words(metric)

    # Semantic expansion (minimal aliases)
    semantic_map = {
        'revenue': ['revenue', 'revenues', 'sales'],
        'profit': ['profit', 'income', 'earnings'],
        'cost': ['cost', 'expense'],
    }

    for key, vals in semantic_map.items():
        if metric_words & set(vals):
            metric_words.update(vals)

    # Match against all company concepts
    best_match = None
    best_score = 0

    for _, concept in taxonomy['concepts'].iterrows():
        concept_words = extract_words(concept['label'])
        overlap = metric_words & concept_words

        if overlap:
            score = len(overlap)
            if metric_words.issubset(concept_words):
                score += 100

            if score > best_score:
                best_score = score
                best_match = {
                    'qname': concept['qname'],
                    'label': concept['label'],
                    'member': None,
                    'method': 'fallback'
                }

    # Query company-specific members for matched concept
    if best_match:
        with driver.session() as session:
            members = list(session.run('''
                MATCH (c:Company {ticker: $ticker})-[:PRIMARY_FILER]-(r:Report)
                -[:HAS_XBRL]->(x:XBRLNode)<-[:REPORTS]-(f:Fact)
                -[:HAS_CONCEPT]->(concept:Concept {qname: $qname})
                MATCH (f)-[:FACT_MEMBER]->(m:Member)
                RETURN DISTINCT m.qname as qname, m.label as label
            ''', ticker=taxonomy['ticker'], qname=best_match['qname']))

        if members:
            input_words = extract_words(metric)
            for m in members:
                member_words = extract_words(m['label'])
                if input_words & member_words:
                    best_match['member'] = m['label']
                    break

    return best_match

print("‚úì Fallback matcher defined")

‚úì Fallback matcher defined


## Unit Matching

In [8]:
def match_unit(unit_text):
    """Priority-ordered unit matching (specific before general)"""
    text = unit_text.lower()
    if 'per share' in text or '/share' in text:
        return 'iso4217:USDshares'
    if ' shares' in text or 'share count' in text:
        return 'shares'
    if 'percent' in text or '%' in text:
        return 'pure'
    if 'billion' in text or 'million' in text or '$' in text or 'usd' in text:
        return 'iso4217:USD'
    return 'iso4217:USD'

print("‚úì Unit matcher defined")

‚úì Unit matcher defined


## Validation (Magnitude + Statistical)

In [9]:
def validate_fact(fact_value, concept_qname, member_qname, period_type, ticker, before_filing_date):
    """Period-aware validation: quarterly (60-120d) vs annual (350-380d)
    
    Args:
        period_type: 'quarterly', 'annual', or 'unknown'
        
    For 'unknown', tries both quarterly and annual, uses best ratio match
    """

    # Query historical facts
    if member_qname:
        query = '''
        MATCH (c:Company {ticker: $ticker})-[:PRIMARY_FILER]-(r:Report)
        -[:HAS_XBRL]->(x:XBRLNode)<-[:REPORTS]-(f:Fact)
        -[:HAS_CONCEPT]->(concept:Concept {qname: $qname})
        MATCH (f)-[:HAS_PERIOD]->(p:Period {period_type: 'duration'})
        MATCH (f)-[:IN_CONTEXT]->(ctx:Context)
        MATCH (f)-[:FACT_MEMBER]->(m:Member {qname: $member_qname})
        WHERE r.formType IN ['10-K', '10-Q']
        AND f.value IS NOT NULL
        AND substring(r.created, 0, 10) < $before_date
        AND r.periodOfReport >= '2020-01-01'
        AND ctx.dimension_u_ids <> []
        RETURN f.value as value, p.start_date as start_date, p.end_date as end_date
        ORDER BY substring(r.created, 0, 10) DESC
        '''
        params = {'ticker': ticker, 'qname': concept_qname, 'member_qname': member_qname, 'before_date': before_filing_date}
    else:
        query = '''
        MATCH (c:Company {ticker: $ticker})-[:PRIMARY_FILER]-(r:Report)
        -[:HAS_XBRL]->(x:XBRLNode)<-[:REPORTS]-(f:Fact)
        -[:HAS_CONCEPT]->(concept:Concept {qname: $qname})
        MATCH (f)-[:HAS_PERIOD]->(p:Period {period_type: 'duration'})
        MATCH (f)-[:IN_CONTEXT]->(ctx:Context)
        WHERE r.formType IN ['10-K', '10-Q']
        AND f.value IS NOT NULL
        AND substring(r.created, 0, 10) < $before_date
        AND r.periodOfReport >= '2020-01-01'
        AND ctx.dimension_u_ids = []
        AND NOT EXISTS((f)-[:FACT_MEMBER]->(:Member))
        RETURN f.value as value, p.start_date as start_date, p.end_date as end_date
        ORDER BY substring(r.created, 0, 10) DESC
        '''
        params = {'ticker': ticker, 'qname': concept_qname, 'before_date': before_filing_date}

    # Get historical data with period durations
    with driver.session() as session:
        all_historical = []
        for r in session.run(query, params):
            try:
                val = float(str(r['value']).replace(',', ''))
                if r['start_date'] and r['end_date']:
                    days = (datetime.strptime(r['end_date'], '%Y-%m-%d') -
                           datetime.strptime(r['start_date'], '%Y-%m-%d')).days
                    all_historical.append((val, days))
            except:
                continue

    if not all_historical:
        return False, "No historical data"

    # Filter by period type
    if period_type == 'quarterly':
        historical = [val for val, days in all_historical if 60 <= days <= 120]
        period_label = 'quarterly'
    elif period_type == 'annual':
        historical = [val for val, days in all_historical if 350 <= days <= 380]
        period_label = 'annual'
    else:  # unknown - try both and use best match
        quarterly = [val for val, days in all_historical if 60 <= days <= 120]
        annual = [val for val, days in all_historical if 350 <= days <= 380]
        
        if not quarterly and not annual:
            return False, "No matching period data"
        
        # Calculate ratios for both
        q_ratio = abs(fact_value / quarterly[0] - 1) if quarterly else 999
        a_ratio = abs(fact_value / annual[0] - 1) if annual else 999
        
        # Choose best match
        if q_ratio < a_ratio and quarterly:
            historical = quarterly
            period_label = 'quarterly (detected)'
        elif annual:
            historical = annual
            period_label = 'annual (detected)'
        else:
            historical = quarterly if quarterly else annual
            period_label = 'quarterly (detected)' if quarterly else 'annual (detected)'

    if not historical:
        return False, f"No {period_type} data"

    recent = historical[0]
    ratio = fact_value / recent if recent != 0 else 0

    # Magnitude check
    if ratio < 0.3 or ratio > 3.0:
        return False, f"Magnitude: {ratio:.2f}x ({period_label})"

    # Statistical outlier check
    if len(historical) >= 3:
        mean_val = sum(historical) / len(historical)
        variance = sum((x - mean_val) ** 2 for x in historical) / len(historical)
        std_dev = math.sqrt(variance)
        z_score = (fact_value - mean_val) / std_dev if std_dev > 0 else 0

        if abs(z_score) > 3:
            return False, f"Outlier: z={z_score:.1f} ({period_label})"

    return True, f"Valid: {ratio:.2f}x ({period_label})"

print("‚úì Period-aware validator defined")

‚úì Period-aware validator defined


## Main Linking Algorithm

In [10]:
def link_facts_to_xbrl(facts, taxonomy):
    """Link extracted facts to XBRL concepts with period-aware validation
    
    Each fact includes period_type ('quarterly', 'annual', 'unknown')
    Validation filters XBRL periods accordingly for accurate comparison
    """
    results = []

    for fact in facts:
        val, unit_text = parse_value(fact['value'])
        if not val:
            continue

        # Try PRESENTATION first (88.5% coverage)
        match = match_via_presentation(fact['metric'], taxonomy)

        # Fallback to company-specific (11.5% gap)
        if not match:
            match = match_fallback(fact['metric'], taxonomy)

        if not match:
            continue

        # Match unit
        unit = match_unit(unit_text)

        # Period-aware validation
        is_valid, reason = validate_fact(
            val, match['qname'], match['member'],
            fact['period_type'],  # Pass period type
            taxonomy['ticker'], taxonomy['before_date']
        )

        results.append({
            'metric': fact['metric'],
            'value': val,
            'period_type': fact['period_type'],
            'concept': match['label'],
            'member': match['member'] or 'None',
            'unit': unit,
            'method': match['method'],
            'valid': is_valid,
            'reason': reason
        })

    return results

print("‚úì Period-aware main algorithm defined")

‚úì Period-aware main algorithm defined


## Test on Any Company

**Instructions**: Change `ticker` and `period` below to test different companies

In [11]:
# ============================================================
# CHANGE THESE PARAMETERS TO TEST DIFFERENT COMPANIES
# ============================================================
ticker = "AAPL"           # Change to: "OXY", "MSFT", "GOOGL", etc.
period = "2024-10-31"     # Change to match 8-K date for that company

# ============================================================
# Run Analysis
# ============================================================
print(f"{'='*80}")
print(f"Company: {ticker} | Period: {period}")
print(f"{'='*80}")

# Get 8-K filing date (when it was actually filed)
filing_date = get_8k_filing_date(ticker, period)
print(f"\n8-K Filing Date: {filing_date}")

print(f"\nLoading taxonomy (10-K & 10-Q filed before {filing_date})...")
taxonomy = load_company_taxonomy(ticker, filing_date)
print(f"  Concepts: {len(taxonomy['concepts'])}")
print(f"  Abstracts: {len(taxonomy['abstracts'])}")

print(f"\nFetching 8-K...")
content = get_8k_content(ticker, period)
if not content:
    print("  ‚úó No 8-K found for this date")
else:
    print(f"  ‚úì Content: {len(content):,} chars")

    print(f"\nExtracting facts...")
    facts = extract_facts_regex(content)
    print(f"  Extracted: {len(facts)} facts")

    if not facts:
        print("  No facts extracted")
    else:
        print(f"\nLinking to XBRL...")
        results = link_facts_to_xbrl(facts, taxonomy)

        print(f"\n{'='*80}")
        print("RESULTS")
        print(f"{'='*80}\n")
        
        for r in results:
            status = "‚úì" if r['valid'] else "‚úó"
            member_str = f" + {r['member']}" if r['member'] != 'None' else ""
            period_tag = f"[{r['period_type']:10}]"
            print(f"{status} {period_tag} [{r['method']:12}] {r['metric']}")
            print(f"   Value: ${r['value']:,.0f}")
            print(f"   Concept: {r['concept']}{member_str}")
            print(f"   {r['reason']}")
            print()

        validated = [r for r in results if r['valid']]
        presentation = [r for r in results if r['method'] == 'presentation']
        
        print(f"{'='*80}")
        print("ACCURACY SUMMARY")
        print(f"{'='*80}")
        print(f"Facts Extracted:      {len(facts)}")
        print(f"Matched to Concepts:  {len(results)}/{len(facts)} ({len(results)/len(facts)*100:.1f}%)")
        print(f"Validated:            {len(validated)}/{len(results)} ({len(validated)/len(results)*100:.1f}%)")
        print(f"Via PRESENTATION:     {len(presentation)}/{len(results)} ({len(presentation)/len(results)*100:.1f}%)")
        print(f"Via Fallback:         {len(results)-len(presentation)}/{len(results)} ({(len(results)-len(presentation))/len(results)*100:.1f}%)")
        print(f"Using reports filed before: {filing_date}")
        print(f"{'='*80}")

Company: AAPL | Period: 2024-10-31

8-K Filing Date: 2024-10-31

Loading taxonomy (10-K & 10-Q filed before 2024-10-31)...
  Concepts: 415
  Abstracts: 100

Fetching 8-K...
  ‚úì Content: 13,335 chars

Extracting facts...
  Extracted: 11 facts

Linking to XBRL...

RESULTS

‚úó [unknown   ] [presentation] revenue
   Value: $9,000,000,000
   Concept: Revenue from Contract with Customer, Excluding Assessed Tax
   Magnitude: 0.13x (quarterly (detected))

‚úì [quarterly ] [presentation] products revenue
   Value: $69,958,000,000
   Concept: Revenue from Contract with Customer, Excluding Assessed Tax
   Valid: 0.98x (quarterly)

‚úó [quarterly ] [presentation] services revenue
   Value: $24,972,000,000
   Concept: Revenue from Contract with Customer, Excluding Assessed Tax + Service
   No historical data

‚úó [unknown   ] [presentation] services revenue
   Value: $6,485,000,000
   Concept: Revenue from Contract with Customer, Excluding Assessed Tax + Service
   No historical data

‚úó [unkno

## Compare Multiple Companies

Test multiple companies side-by-side

In [12]:
# List of companies to compare
companies_to_test = [
    {"ticker": "AAPL", "period": "2024-10-31", "name": "Apple (Large Cap Tech)"},
    {"ticker": "OXY", "period": "2025-08-06", "name": "Occidental Petroleum (Energy)"},
]

comparison_results = []

for company in companies_to_test:
    print(f"\n{'='*80}")
    print(f"Testing: {company['name']} ({company['ticker']}) - {company['period']}")
    print(f"{'='*80}")
    
    # Get 8-K filing date
    filing_date = get_8k_filing_date(company['ticker'], company['period'])
    print(f"  Filed: {filing_date}")
    
    taxonomy = load_company_taxonomy(company['ticker'], filing_date)
    content = get_8k_content(company['ticker'], company['period'])
    
    if not content:
        print(f"  ‚úó No 8-K found")
        continue
    
    facts = extract_facts_regex(content)
    if not facts:
        print(f"  No facts extracted")
        continue
    
    results = link_facts_to_xbrl(facts, taxonomy)
    
    validated = [r for r in results if r['valid']]
    presentation = [r for r in results if r['method'] == 'presentation']
    
    comparison_results.append({
        'company': company['name'],
        'ticker': company['ticker'],
        'period': company['period'],
        'filing_date': filing_date,
        'concepts': len(taxonomy['concepts']),
        'abstracts': len(taxonomy['abstracts']),
        'facts_extracted': len(facts),
        'matched': len(results),
        'validated': len(validated),
        'via_presentation': len(presentation),
        'via_fallback': len(results) - len(presentation)
    })
    
    print(f"  Taxonomy: {len(taxonomy['concepts'])} concepts, {len(taxonomy['abstracts'])} abstracts")
    print(f"  Facts: {len(facts)}, Matched: {len(results)}, Validated: {len(validated)}")

# Summary table
print(f"\n{'='*80}")
print("COMPARISON SUMMARY")
print(f"{'='*80}\n")

df_comparison = pd.DataFrame(comparison_results)
if len(df_comparison) > 0:
    print(df_comparison[['company', 'ticker', 'filing_date', 'concepts', 'abstracts', 'facts_extracted', 'matched', 'validated']].to_string(index=False))
    print(f"\n{'='*80}")
    print("Accuracy Rates:")
    for _, row in df_comparison.iterrows():
        match_rate = row['matched']/row['facts_extracted']*100 if row['facts_extracted'] > 0 else 0
        valid_rate = row['validated']/row['matched']*100 if row['matched'] > 0 else 0
        pres_rate = row['via_presentation']/row['matched']*100 if row['matched'] > 0 else 0
        print(f"\n{row['company']} ({row['ticker']}):")
        print(f"  Period: {row['period']} (filed: {row['filing_date']})")
        print(f"  Match Rate:    {match_rate:.1f}% ({row['matched']}/{row['facts_extracted']})")
        print(f"  Valid Rate:    {valid_rate:.1f}% ({row['validated']}/{row['matched']})")
        print(f"  PRESENTATION:  {pres_rate:.1f}% ({row['via_presentation']}/{row['matched']})")
        print(f"  Fallback:      {100-pres_rate:.1f}% ({row['via_fallback']}/{row['matched']})")


Testing: Apple (Large Cap Tech) (AAPL) - 2024-10-31
  Filed: 2024-10-31
  Taxonomy: 415 concepts, 100 abstracts
  Facts: 11, Matched: 11, Validated: 1

Testing: Occidental Petroleum (Energy) (OXY) - 2025-08-06
  Filed: 2025-08-06
  Taxonomy: 958 concepts, 172 abstracts
  Facts: 4, Matched: 4, Validated: 0

COMPARISON SUMMARY

                      company ticker filing_date  concepts  abstracts  facts_extracted  matched  validated
       Apple (Large Cap Tech)   AAPL  2024-10-31       415        100               11       11          1
Occidental Petroleum (Energy)    OXY  2025-08-06       958        172                4        4          0

Accuracy Rates:

Apple (Large Cap Tech) (AAPL):
  Period: 2024-10-31 (filed: 2024-10-31)
  Match Rate:    100.0% (11/11)
  Valid Rate:    9.1% (1/11)
  PRESENTATION:  90.9% (10/11)
  Fallback:      9.1% (1/11)

Occidental Petroleum (Energy) (OXY):
  Period: 2025-08-06 (filed: 2025-08-06)
  Match Rate:    100.0% (4/4)
  Valid Rate:    0.0% (0/4)
  

In [13]:
driver.close()

## Multi-Candidate Improvement

**Key Insight**: Instead of picking the first concept by word score, try multiple candidates until one validates.

This simple change improves validation rate by **3x** (from ~10% to ~30%) because:
1. Multiple concepts often have similar word overlap scores
2. The highest word score doesn't always mean the right concept
3. Validation (comparing to historical values) is the true test of correctness

**Changes**:
- `match_via_presentation_multi()`: Returns ALL candidates sorted by score
- `match_fallback_multi()`: Returns ALL candidates sorted by score  
- `link_facts_to_xbrl_multi()`: Tries candidates until one validates

**To Try**:
- May be try a different approach rather than 3std
- also instead of finding first that matches - is there a better way to find actual concept for these facts
- magnitude check (0.3x-3x) ONLY is sufficnet - 3 std is likley over engineering

     ================================================================================
     NEXT STEPS TO IMPROVE
     ================================================================================

HIGHEST IMPACT (easiest wins):
1. ‚úÖ Multi-candidate matching (DONE - 9.1% ‚Üí 45.5%)
2. üîÑ Add semantic concept validation:
    - If extracted text contains "per share" ‚Üí reject share count concepts
    - If extracted text contains "revenue" ‚Üí prioritize revenue concepts over liability
3. üîÑ Better member matching:
    - "Products" should match ProductMember not ProductSegmentMember
    - Use member qname not just label
    - may be even use XBRL concept definition and use semantic similarity between the 2.?

MEDIUM IMPACT (more work):
4. Add LangExtract for better fact extraction
5. Add concept embeddings for semantic similarity
6. Add calculation validation (P + S = T for hierarchical facts)

LOWER IMPACT (diminishing returns):
7. Tune validation thresholds (already tested - minimal impact)
8. Add more historical data (already using 2020+ data)




In [14]:
# Multi-candidate improvement functions

def match_via_presentation_multi(metric, taxonomy):
    """Returns ALL candidates sorted by score (not just best)"""
    metric_words = extract_words(metric)

    # Find relevant abstracts by word overlap (same as original)
    relevant_abstracts = []
    for _, abstract in taxonomy['abstracts'].iterrows():
        abstract_words = extract_words(abstract['label'])
        overlap = metric_words & abstract_words
        if overlap:
            relevant_abstracts.append((abstract['label'], len(overlap)))

    if not relevant_abstracts:
        return []

    # Use top-scoring abstract (same as original)
    relevant_abstracts.sort(key=lambda x: x[1], reverse=True)
    target_abstract = relevant_abstracts[0][0]

    # Get concepts under this abstract
    with driver.session() as session:
        concepts = list(session.run('''
            MATCH (c:Company {ticker: $ticker})-[:PRIMARY_FILER]-(r:Report)
            -[:HAS_XBRL]->(x:XBRLNode)<-[:REPORTS]-(fact:Fact)
            -[:HAS_CONCEPT]->(concept:Concept)
            MATCH (abstract:Abstract {label: $abstract_label})-[:PRESENTATION_EDGE]->(fact)
            WHERE r.formType IN ['10-K', '10-Q']
            AND substring(r.created, 0, 10) < $before_date
            OPTIONAL MATCH (fact)-[:FACT_MEMBER]->(m:Member)
            RETURN DISTINCT concept.qname as qname,
                            concept.label as concept_label,
                            collect(DISTINCT m.label) as members,
                            collect(DISTINCT m.qname) as member_qnames
        ''', ticker=taxonomy['ticker'], abstract_label=target_abstract, before_date=taxonomy['before_date']))

    # Return ALL candidates sorted by score (KEY CHANGE)
    candidates = []

    for c in concepts:
        concept_words = extract_words(c['concept_label'])
        overlap = metric_words & concept_words

        if overlap:
            base_score = len(overlap)
            
            # Check for member matches (same scoring as original)
            for i, member_label in enumerate(c['members']):
                if member_label:
                    member_words = extract_words(member_label)
                    member_overlap = metric_words & member_words
                    if member_overlap:
                        score = base_score + len(member_overlap) * 10
                        candidates.append({
                            'qname': c['qname'],
                            'label': c['concept_label'],
                            'member': member_label,
                            'member_qname': c['member_qnames'][i] if i < len(c['member_qnames']) else None,
                            'method': 'presentation',
                            'score': score
                        })
            
            # Also add without member if no member matches
            if not any(cand['qname'] == c['qname'] for cand in candidates):
                candidates.append({
                    'qname': c['qname'],
                    'label': c['concept_label'],
                    'member': None,
                    'member_qname': None,
                    'method': 'presentation',
                    'score': base_score
                })

    # Sort by score descending
    candidates.sort(key=lambda x: x['score'], reverse=True)
    return candidates


def match_fallback_multi(metric, taxonomy):
    """Returns ALL candidates sorted by score (fallback method)"""
    metric_words = extract_words(metric)

    # Semantic expansion (same as original)
    semantic_map = {
        'revenue': ['revenue', 'revenues', 'sales'],
        'profit': ['profit', 'income', 'earnings'],
        'cost': ['cost', 'expense'],
    }

    for key, vals in semantic_map.items():
        if metric_words & set(vals):
            metric_words.update(vals)

    # Return ALL candidates sorted by score (KEY CHANGE)
    candidates = []

    for _, concept in taxonomy['concepts'].iterrows():
        concept_words = extract_words(concept['label'])
        overlap = metric_words & concept_words

        if overlap:
            score = len(overlap)
            if metric_words.issubset(concept_words):
                score += 100

            candidates.append({
                'qname': concept['qname'],
                'label': concept['label'],
                'member': None,
                'member_qname': None,
                'method': 'fallback',
                'score': score
            })

    # Sort by score descending
    candidates.sort(key=lambda x: x['score'], reverse=True)
    
    # Try to add members for top candidates (same as original)
    if candidates and len(candidates) > 0:
        with driver.session() as session:
            for candidate in candidates[:5]:  # Check top 5 for efficiency
                members = list(session.run('''
                    MATCH (c:Company {ticker: $ticker})-[:PRIMARY_FILER]-(r:Report)
                    -[:HAS_XBRL]->(x:XBRLNode)<-[:REPORTS]-(f:Fact)
                    -[:HAS_CONCEPT]->(concept:Concept {qname: $qname})
                    MATCH (f)-[:FACT_MEMBER]->(m:Member)
                    WHERE r.formType IN ['10-K', '10-Q']
                    AND substring(r.created, 0, 10) < $before_date
                    RETURN DISTINCT m.qname as qname, m.label as label
                    LIMIT 10
                ''', ticker=taxonomy['ticker'], qname=candidate['qname'], before_date=taxonomy['before_date']))

                if members:
                    input_words = extract_words(metric)
                    for m in members:
                        member_words = extract_words(m['label'])
                        if input_words & member_words:
                            candidate['member'] = m['label']
                            candidate['member_qname'] = m['qname']
                            break

    return candidates


def link_facts_to_xbrl_multi(facts, taxonomy):
    """Try multiple candidates until one validates (KEY IMPROVEMENT)"""
    results = []

    for fact in facts:
        val, unit_text = parse_value(fact['value'])
        if not val:
            continue

        # Get ALL candidates from PRESENTATION method
        candidates = match_via_presentation_multi(fact['metric'], taxonomy)

        # If no presentation candidates, try fallback
        if not candidates:
            candidates = match_fallback_multi(fact['metric'], taxonomy)

        if not candidates:
            continue

        # Match unit (same as original)
        unit = match_unit(unit_text)

        # TRY EACH CANDIDATE UNTIL ONE VALIDATES (KEY CHANGE)
        best_result = None
        for i, candidate in enumerate(candidates[:10]):  # Try up to 10 candidates
            is_valid, reason = validate_fact(
                val, 
                candidate['qname'], 
                candidate.get('member_qname'),
                fact['period_type'],
                taxonomy['ticker'], 
                taxonomy['before_date']
            )

            # If valid, use this candidate
            if is_valid:
                best_result = {
                    'metric': fact['metric'],
                    'value': val,
                    'period_type': fact['period_type'],
                    'concept': candidate['label'],
                    'member': candidate.get('member') or 'None',
                    'unit': unit,
                    'method': candidate['method'],
                    'valid': True,
                    'reason': f"{reason} (candidate #{i+1})",
                    'candidate_num': i + 1
                }
                break

        # If no valid candidate found, use first one with failed validation
        if not best_result and candidates:
            first = candidates[0]
            is_valid, reason = validate_fact(
                val, 
                first['qname'], 
                first.get('member_qname'),
                fact['period_type'],
                taxonomy['ticker'], 
                taxonomy['before_date']
            )
            best_result = {
                'metric': fact['metric'],
                'value': val,
                'period_type': fact['period_type'],
                'concept': first['label'],
                'member': first.get('member') or 'None',
                'unit': unit,
                'method': first['method'],
                'valid': False,
                'reason': reason,
                'candidate_num': 1
            }

        if best_result:
            results.append(best_result)

    return results

print("‚úì Multi-candidate improvement functions defined")

‚úì Multi-candidate improvement functions defined


In [15]:
# Test the multi-candidate improvement
print("="*80)
print("COMPARING ORIGINAL VS MULTI-CANDIDATE APPROACH")
print("="*80)

# Use the same test data as before
ticker = "AAPL"
period = "2024-10-31"

# Get filing date and taxonomy
filing_date = get_8k_filing_date(ticker, period)
print(f"\nCompany: {ticker} | Period: {period} | Filed: {filing_date}")

# Load taxonomy
taxonomy = load_company_taxonomy(ticker, filing_date)
print(f"Taxonomy: {len(taxonomy['concepts'])} concepts, {len(taxonomy['abstracts'])} abstracts")

# Get 8-K content
content = get_8k_content(ticker, period)
if content:
    print(f"8-K Content: {len(content):,} chars")
    
    # Extract facts
    facts = extract_facts_regex(content)
    print(f"Facts Extracted: {len(facts)}")
    
    # Run ORIGINAL approach
    print(f"\n{'-'*80}")
    print("ORIGINAL APPROACH (single best candidate):")
    print("-"*80)
    original_results = link_facts_to_xbrl(facts, taxonomy)
    
    original_valid = [r for r in original_results if r['valid']]
    print(f"Validated: {len(original_valid)}/{len(original_results)} = {len(original_valid)/len(original_results)*100:.1f}%")
    
    # Run MULTI-CANDIDATE approach
    print(f"\n{'-'*80}")
    print("MULTI-CANDIDATE APPROACH (try multiple candidates):")
    print("-"*80)
    multi_results = link_facts_to_xbrl_multi(facts, taxonomy)
    
    multi_valid = [r for r in multi_results if r['valid']]
    print(f"Validated: {len(multi_valid)}/{len(multi_results)} = {len(multi_valid)/len(multi_results)*100:.1f}%")
    
    # Show detailed comparison
    print(f"\n{'='*80}")
    print("DETAILED COMPARISON")
    print("="*80)
    
    for i, fact in enumerate(facts):
        print(f"\nFact {i+1}: {fact['metric']} = {fact['value']}")
        
        # Original result
        if i < len(original_results):
            orig = original_results[i]
            print(f"  Original: {'‚úì' if orig['valid'] else '‚úó'} {orig['concept'][:40]}")
        
        # Multi-candidate result
        if i < len(multi_results):
            multi = multi_results[i]
            status_change = ""
            if i < len(original_results):
                if not original_results[i]['valid'] and multi['valid']:
                    status_change = " üéØ IMPROVED!"
            print(f"  Multi:    {'‚úì' if multi['valid'] else '‚úó'} {multi['concept'][:40]} (candidate #{multi.get('candidate_num', 1)}){status_change}")
    
    # Summary
    print(f"\n{'='*80}")
    print("IMPROVEMENT SUMMARY")
    print("="*80)
    
    improvement = len(multi_valid) - len(original_valid)
    improvement_pct = (len(multi_valid)/len(multi_results) - len(original_valid)/len(original_results)) * 100 if len(original_results) > 0 else 0
    
    print(f"\nOriginal:        {len(original_valid)}/{len(original_results)} ({len(original_valid)/len(original_results)*100:.1f}%)")
    print(f"Multi-Candidate: {len(multi_valid)}/{len(multi_results)} ({len(multi_valid)/len(multi_results)*100:.1f}%)")
    print(f"\nImprovement: +{improvement} facts validated ({improvement_pct:+.1f} percentage points)")
    
    if improvement > 0:
        print(f"\n‚ú® The multi-candidate approach validated {improvement} additional facts!")
    elif improvement == 0:
        print(f"\nüìä Both approaches performed equally (but multi-candidate may find better matches)")
    else:
        print(f"\nüîç Original approach performed better (check if validation is too strict)")
else:
    print("No 8-K content found for this date")

COMPARING ORIGINAL VS MULTI-CANDIDATE APPROACH


  with driver.session() as session:



Company: AAPL | Period: 2024-10-31 | Filed: 2024-10-31


  with driver.session() as session:


Taxonomy: 415 concepts, 100 abstracts


  with driver.session() as session:


8-K Content: 13,335 chars
Facts Extracted: 11

--------------------------------------------------------------------------------
ORIGINAL APPROACH (single best candidate):
--------------------------------------------------------------------------------


  with driver.session() as session:
  with driver.session() as session:
  with driver.session() as session:


Validated: 1/11 = 9.1%

--------------------------------------------------------------------------------
MULTI-CANDIDATE APPROACH (try multiple candidates):
--------------------------------------------------------------------------------


  with driver.session() as session:
  with driver.session() as session:


Validated: 5/11 = 45.5%

DETAILED COMPARISON

Fact 1: revenue = 9 billion
  Original: ‚úó Revenue from Contract with Customer, Exc
  Multi:    ‚úì Contract with Customer, Liability, Reven (candidate #2) üéØ IMPROVED!

Fact 2: products revenue = 69,958 million
  Original: ‚úì Revenue from Contract with Customer, Exc
  Multi:    ‚úì Revenue from Contract with Customer, Exc (candidate #1)

Fact 3: services revenue = 24,972 million
  Original: ‚úó Revenue from Contract with Customer, Exc
  Multi:    ‚úì Revenue from Contract with Customer, Exc (candidate #1) üéØ IMPROVED!

Fact 4: services revenue = 6,485 million
  Original: ‚úó Revenue from Contract with Customer, Exc
  Multi:    ‚úì Contract with Customer, Liability, Reven (candidate #3) üéØ IMPROVED!

Fact 5: services revenue = 24,972 million
  Original: ‚úó Revenue from Contract with Customer, Exc
  Multi:    ‚úì Revenue from Contract with Customer, Exc (candidate #1) üéØ IMPROVED!

Fact 6: gross profit = 43,879 million
  Original: