feat: Deterministic context chain — AST-based cross-file dependency extraction #114

@ajianaz

Problem

Cora v0.2.0 reviews diffs in isolation: the LLM sees only the changed lines plus the file paths injected by the anti-hallucination step. It cannot see:

  • Function definitions called from changed code (e.g., validate_token() called at line 42, but its implementation is in another file)
  • Struct/type definitions referenced in changed code (e.g., CryptoConfig used in new code, but fields unknown)
  • Test coverage for changed functions (e.g., test_login_success exists, but the LLM doesn't know the expected behavior)
  • Import chain: which modules are being pulled in

This limits review depth. The LLM can spot surface issues (style, obvious bugs) but misses logic errors that require understanding the broader call chain.

Why Not Agent Tool-Use?

Alibaba's Open Code Review uses an LLM agent with read_file and search_codebase tools. This works at scale but has major drawbacks:

| Approach | Token Cost | Predictability | Control |
|---|---|---|---|
| Full agent (OCR-style) | 20,000-40,000 tokens | Low: agent may read 10+ files unpredictably | Hard to budget |
| Deterministic chain (proposed) | ~3,000 tokens extra | High: same context every run | Hard-capped |

The agent approach is powerful but unpredictable in cost and unstable in quality (different files read on different runs). For a CLI tool used in pre-commit hooks, predictability matters more than maximum depth.

Proposed Solution: Deterministic Context Chain

Architecture

Phase 0: Deterministic Pre-Processing (zero LLM tokens)
──────────────────────────────────────────────────────
Git diff → parse changed lines → extract:

  1. Function/method calls in changed lines
     Example: validate_token(token) → resolve to src/auth/validate.rs

  2. Struct/type references in changed lines  
     Example: CryptoConfig → resolve to src/config.rs:45-60

  3. Test file mapping (naming convention)
     Example: src/api/handler.rs → tests/api_handler_test.rs

  4. Import statements (use/require/import)
     Example: use crate::engine::scanner → resolve to src/engine/scanner.rs

Result: "context map" — list of (file, line_range, symbol) tuples

Phase 1: Context Assembly (zero LLM tokens)
───────────────────────────────────────────
Read ONLY the resolved symbols from context map:

  - Function definition: fn signature + body (capped at 50 lines)
  - Struct definition: full struct block
  - Test function: test body (capped at 30 lines)
  
  HARD BUDGET: max 3,000 tokens of additional context
  Priority: function defs > type defs > tests (first-fit budget)

Phase 2: LLM Review (same as current, with context)
────────────────────────────────────────────────────
  System prompt: ~600 tokens
  File path list: ~50 tokens
  CHAINED CONTEXT: ~3,000 tokens (new!)
  Diff (bundled): ~8,000 tokens (per bundle)
  ────────────────────────────────
  Total input: ~11,650 tokens per bundle (vs ~13,000+ currently)
  
  BUT context is RICHER — LLM knows what validate_token does,
  what CryptoConfig fields exist, and what tests expect.

Phase 3: Post-Processing (unchanged)
────────────────────────────────────
  - JSON parse + repair (existing)
  - Post-parse file path filter (existing)
  - Line-level positioning validation (new, see #116)
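The two purely path-based resolution steps in Phase 0 (test file mapping and import resolution) can be sketched with the standard library alone. This is a minimal illustration, not Cora's actual code: `ContextEntry`, `test_file_for`, and `resolve_crate_import` are hypothetical names, and the naming convention is inferred only from the `src/api/handler.rs → tests/api_handler_test.rs` example above.

```rust
use std::path::PathBuf;

/// One entry of the context map: (file, line_range, symbol).
/// Hypothetical type; the field names are assumptions.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ContextEntry {
    pub file: PathBuf,
    pub line_start: usize,
    pub line_end: usize,
    pub symbol: String,
}

/// Map a Rust source file to its conventional test file.
/// Assumed convention: drop the leading `src/`, join the remaining
/// path components with `_`, and append `_test.rs` under `tests/`.
pub fn test_file_for(source: &str) -> Option<PathBuf> {
    let rel = source.strip_prefix("src/")?;
    let stem = rel.strip_suffix(".rs")?;
    let flat = stem.replace('/', "_");
    Some(PathBuf::from(format!("tests/{}_test.rs", flat)))
}

/// Resolve `use crate::a::b` to the candidate file `src/a/b.rs`.
/// Simplification: ignores `mod.rs` layouts, re-exports, and paths
/// whose last segment is an item rather than a module.
pub fn resolve_crate_import(use_stmt: &str) -> Option<PathBuf> {
    let rest = use_stmt
        .trim()
        .strip_prefix("use ")?
        .trim_end_matches(';')
        .trim();
    let module = rest.strip_prefix("crate::")?;
    if module.is_empty() {
        return None;
    }
    Some(PathBuf::from(format!("src/{}.rs", module.replace("::", "/"))))
}
```

Both functions only propose candidate paths. A real implementation would still need to check that the file exists and locate the symbol's line range before adding a `ContextEntry` to the map.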

Implementation Plan

  1. AST Parser Module (src/engine/context_chain.rs)

    • Use syn crate for Rust files
    • Use regex-based extraction for other languages (Python, Go, JS, TS, Java)
    • Extract: function calls, type references, imports
    • Resolve file paths relative to project root
    • Respect .cora.yaml ignore.files patterns
  2. Context Budget Controller

    • Token estimation: ~4 chars per token (rough but consistent)
    • Priority queue: function defs > type defs > tests
    • Hard cap: configurable via review.context_chain.max_context_tokens (default: 3000)
    • Stop reading when budget exhausted
  3. Prompt Integration

    • New section in build_review_prompt(): Relevant Context
    • Format: --- {file}:{line_start}-{line_end} ({symbol}) ---\n{content}\n
    • Placed AFTER file path list, BEFORE diff
  4. Language Support Matrix

| Language | AST Parser | Priority |
|---|---|---|
| Rust | syn (full AST) | P0 |
| Python | regex (def/class/import) | P1 |
| TypeScript/JavaScript | regex (function/import/export) | P1 |
| Go | regex (func/type/import) | P1 |
| Java | regex (class/method/import) | P2 |

Configuration

# .cora.yaml
review:
  context_chain:
    enabled: true          # default: true
    max_context_tokens: 3000  # default: 3000 (~12KB of code)
    follow_depth: 3         # max levels of dependency following
    include_tests: true     # auto-resolve test files
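On the Rust side, the configuration above might map to a struct like the following. This is a sketch only: `ContextChainConfig` and its helper methods are hypothetical, and treating `follow_depth: 3` and `include_tests: true` as defaults is an assumption, since the YAML states defaults only for `enabled` and `max_context_tokens`.

```rust
/// In-memory form of `review.context_chain` (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ContextChainConfig {
    pub enabled: bool,
    pub max_context_tokens: usize,
    pub follow_depth: u32,
    pub include_tests: bool,
}

impl Default for ContextChainConfig {
    fn default() -> Self {
        Self {
            enabled: true,            // stated default
            max_context_tokens: 3000, // stated default
            follow_depth: 3,          // assumed default (value shown in the YAML)
            include_tests: true,      // assumed default (value shown in the YAML)
        }
    }
}

impl ContextChainConfig {
    /// Character budget implied by the token cap at ~4 chars per token.
    pub fn approx_char_budget(&self) -> usize {
        self.max_context_tokens * 4
    }

    /// Budget actually available: zero when the feature is disabled,
    /// so a disabled chain adds nothing to the prompt.
    pub fn effective_budget(&self) -> usize {
        if self.enabled {
            self.max_context_tokens
        } else {
            0
        }
    }
}
```

With the defaults, `approx_char_budget()` gives 12,000 characters, which matches the "~12KB of code" comment in the YAML.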

Token Cost Analysis

| Scenario | Current (v0.2) | Proposed (v0.3) | Delta |
|---|---|---|---|
| Small PR (5 files, 500 lines) | ~8,000 tokens | ~9,500 tokens | +19% |
| Medium PR (15 files, 2000 lines) | ~13,000 tokens | ~14,500 tokens | +12% |
| Large PR (50 files, 5000 lines) | ~25,000 tokens | ~13,000 × N bundles | N/A (bundled) |
| Very large PR (100+ files) | Often fails (>50K diff) | Works (bundled + chain) | N/A |

Key insight: For small and medium PRs, input cost rises by roughly 10-20% (about 1,500 tokens in the scenarios above), while review quality is estimated to improve by ~40% because the LLM can now detect logic errors, not just surface issues. For large PRs, bundling (#115) handles the scaling problem separately.
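The deltas and the Phase 2 total can be recomputed from the token counts quoted in this issue. The helper names below are hypothetical and exist only for this check.

```rust
/// Relative cost increase in percent, from current to proposed input size.
pub fn delta_percent(current_tokens: u32, proposed_tokens: u32) -> f64 {
    (proposed_tokens as f64 - current_tokens as f64) / current_tokens as f64 * 100.0
}

/// Sum of the Phase 2 prompt components, in tokens.
pub fn phase2_input_tokens(system: u32, path_list: u32, context: u32, diff: u32) -> u32 {
    system + path_list + context + diff
}
```

For the small PR, `delta_percent(8_000, 9_500)` is 18.75. For the medium PR, `delta_percent(13_000, 14_500)` is about 11.54. `phase2_input_tokens(600, 50, 3_000, 8_000)` gives the 11,650 tokens listed under Phase 2.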

Acceptance Criteria

  • Context chain extracts function calls from changed lines across 5+ languages
  • Context budget hard cap enforced (never exceeds max_context_tokens)
  • Context injected into prompt in deterministic format
  • review.context_chain.enabled: false disables feature (backward compat)
  • Integration tests: verify context resolves correctly for cross-file references
  • --progress events include context_chain phase with token budget stats
  • Cache key includes context (same diff + different context = different cache entry)
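The last criterion (cache key includes context) could be met by hashing the diff and the assembled context together. The sketch below uses a hand-written FNV-1a hash so the key stays stable across runs and toolchain versions. Both the choice of FNV-1a and the `cache_key` function are assumptions for illustration, not Cora's existing cache scheme.

```rust
const FNV_OFFSET: u64 = 0xcbf29ce484222325;
const FNV_PRIME: u64 = 0x100000001b3;

/// Fold `bytes` into a running 64-bit FNV-1a hash.
fn fnv1a(mut hash: u64, bytes: &[u8]) -> u64 {
    for b in bytes {
        hash ^= *b as u64;
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}

/// Cache key over (diff, context). Each part is prefixed with its
/// length so that moving bytes between the two parts changes the
/// hashed byte stream.
pub fn cache_key(diff: &str, context: &str) -> u64 {
    let mut h = FNV_OFFSET;
    for part in &[diff, context] {
        h = fnv1a(h, &(part.len() as u64).to_le_bytes());
        h = fnv1a(h, part.as_bytes());
    }
    h
}
```

FNV-1a is not collision-resistant against adversarial input. If cache poisoning is a concern, a cryptographic hash would be the safer choice.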

References

  • Inspired by Alibaba Open Code Review's "Smart file bundling" and "Scenario-tuned toolset"
  • Differs from OCR's agent approach: deterministic extraction, not LLM-driven file reads
  • Token budget pattern inspired by Cursor's context window management
