Feature: compare_file_contents tool with semantic diffs #1973

@SamMorrowDrums

Problem

When AI models review diffs, line-based unified diffs can be noisy and token-inefficient. Common scenarios where this hurts:

  • JSON/YAML reformatting: A single value change plus auto-formatting creates a huge diff
  • Config file updates: Version bumps or reordering keys produce misleading diffs
  • CSV/data files: Row shifts make line-based diffs nearly unreadable

Models struggle to identify the actual change amid formatting noise, wasting context tokens and reducing comprehension accuracy.

Proposed Solution

Add a new tool compare_file_contents that:

  1. Takes two refs (base and head) plus a file path
  2. For supported formats (JSON and YAML to start; CSV and TOML planned), produces a semantic diff showing only value changes
  3. For unsupported formats, falls back to unified diff
  4. Always shows the format used and whether fallback was applied

Example: Semantic vs Line-based

Line-based diff (noisy):

-{"users":[{"id":1,"name":"Alice"},{"id":2,"name":"Bob"}]}
+{
+  "users": [
+    {"id": 1, "name": "Alice"},
+    {"id": 2, "name": "Bobby"}
+  ]
+}

Semantic diff (clear):

users[1].name: "Bob" → "Bobby"

Tool Signature

compare_file_contents(
  owner: string,
  repo: string,
  path: string,
  base: string,    // commit SHA, branch, or tag
  head: string,    // commit SHA, branch, or tag
)

Use Cases

  1. Change verification: Model edits a file, uses this tool to confirm only intended changes were made
  2. PR review: Quickly understand what actually changed in config/data files
  3. Debugging: Compare file across commits without formatting noise

Implementation Notes

  • Start behind a feature flag
  • Semantic diff enabled by default for supported formats (no opt-out needed initially)
  • Pure Go implementation using standard library JSON + yaml.v3
  • Supported formats to start: JSON, YAML
  • Future: CSV, TOML, other structured formats

Why This Helps Models

  • Fewer tokens = more room for reasoning
  • Unambiguous output = clearer before/after semantics
  • Path notation (e.g., users[1].name) is already familiar to models
  • Self-verification = models can check their own edits efficiently
