
v1.4.0

@Anuj7411 tagged this 04 Jun 10:35
Source: a Claude-in-chat audit (Anuj's session, 2026-06-04) found two
math bugs in Sipcode's flagship metrics. These would have destroyed
launch credibility if a technical reader had found them publicly first.
Both are fixed below, each with tests that guard against regression.

============================================================
BUG 1 — Idle context cost was MATHEMATICALLY IMPOSSIBLE
============================================================

Symptom: verify_sipcode_impact reported recoverableTokens of ~48.5
BILLION against totalTokens of ~5.47 billion. You cannot recover
~9x what you spent. The idle-context analyzer priced the cost of
keeping a previously read file resident in context at
file_tokens × idle_turns. A 363K-token .docx held idle for 451 turns
was reported as 164M "wasted" tokens (363K × 451), in a session that
totaled only 316M.

Root cause: src/modules/transcript/analyzers/idleContext.ts line ~99:
    const cost = first.cost * idleTurns;   // ← multiplicative inflation

Fix: count the wasted one-time read, full stop:
    const cost = first.cost;               // ← honest

The idle-turns count is still reported in the result (so the user can
SEE "this file sat for 451 turns") but no longer scales the cost.
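
A minimal sketch of the old and new cost formulas side by side, for
illustration only. `IdleFile`, `IdleResult`, `idleCostBuggy`, and
`idleCostFixed` are hypothetical names, not the actual contents of
idleContext.ts. The file path is invented; the token counts are the
.docx example from the symptom.

```typescript
// Hypothetical shape: one file that was read once and then sat idle.
interface IdleFile {
  path: string;
  firstReadTokens: number; // tokens charged by the one-time read (first.cost)
  idleTurns: number;       // turns the file stayed in context without being used
}

interface IdleResult {
  path: string;
  idleTurns: number;  // still reported so the user can see how long it sat
  costTokens: number; // wasted tokens attributed to this file
}

// Old formula: scales the one-time read by every idle turn (inflated).
function idleCostBuggy(f: IdleFile): IdleResult {
  return { path: f.path, idleTurns: f.idleTurns, costTokens: f.firstReadTokens * f.idleTurns };
}

// New formula: the wasted read is counted once; idleTurns is informational only.
function idleCostFixed(f: IdleFile): IdleResult {
  return { path: f.path, idleTurns: f.idleTurns, costTokens: f.firstReadTokens };
}

const docx: IdleFile = { path: "report.docx", firstReadTokens: 363_000, idleTurns: 451 };
const sessionTotal = 316_000_000;

console.log(idleCostBuggy(docx).costTokens);                      // 163713000
console.log(idleCostBuggy(docx).costTokens / sessionTotal > 0.5); // true: over half the session
console.log(idleCostFixed(docx).costTokens);                      // 363000
```

Keeping idleTurns in the result while dropping it from the cost
preserves the useful context ("sat for 451 turns") without turning it
into a token charge.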

Why the bug looked plausible: prompt caching DOES re-bill the cached
prefix as cache_read on every subsequent turn. Those cacheReadTokens,
however, are already counted from the session's API responses, so
multiplying by idle turns counted the same tokens twice.
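
A toy session makes the double count concrete. The `TurnUsage` shape
and every number below are hypothetical. It assumes a simplified model
in which the whole idle file shows up in cacheRead on each later turn.
Summing the API's per-turn usage already counts that residency once.
Adding file_tokens × idle_turns on top counts it a second time.

```typescript
// Hypothetical per-turn usage record; field names are illustrative.
interface TurnUsage {
  input: number;
  output: number;
  cacheRead: number;
  cacheCreation: number;
}

const fileTokens = 10_000; // a file read once, then kept in the cached prefix
const idleTurnCount = 3;   // turns the file sits in context unused

// Turn 0 writes the file into the cache; every later turn reads it back.
const turns: TurnUsage[] = [
  { input: 500, output: 200, cacheRead: 0, cacheCreation: fileTokens },
  ...Array.from({ length: idleTurnCount }, (): TurnUsage => ({
    input: 100,
    output: 50,
    cacheRead: fileTokens,
    cacheCreation: 0,
  })),
];

// Session total straight from the API usage: the residency is already in it.
const apiTotal = turns.reduce(
  (sum, t) => sum + t.input + t.output + t.cacheRead + t.cacheCreation,
  0,
);

// The part of apiTotal that is the idle file being re-read each turn.
const residencyInApiTotal = turns.reduce((sum, t) => sum + t.cacheRead, 0);

// What the old analyzer charged on top of that.
const idleChargeBuggy = fileTokens * idleTurnCount;

console.log(apiTotal); // 41150
console.log(residencyInApiTotal === idleChargeBuggy); // true: same tokens, counted twice
```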

Regression guard: tests/guards/idle-cost-invariant.test.ts asserts
idle.idleTokenCost <= totalTokens AND
(idle + duplicates) <= totalTokens on every transcript fixture in the
repo. If anyone reintroduces a multiplicative cost formula, this test
fails the build, so a mathematically impossible result of this kind is
blocked before it can ship.
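
The two assertions can be expressed as a plain check. `SessionSummary`,
`duplicateTokenCost`, and `checkIdleCostInvariant` are hypothetical
stand-ins, not the real test file's contents. The real guard runs over
every fixture, while this sketch checks two hand-made summaries.

```typescript
// Hypothetical per-fixture summary; duplicateTokenCost stands in for "duplicates".
interface SessionSummary {
  totalTokens: number;
  idleTokenCost: number;
  duplicateTokenCost: number;
}

// Returns every violated invariant; an empty array means the summary is sane.
function checkIdleCostInvariant(s: SessionSummary): string[] {
  const violations: string[] = [];
  if (s.idleTokenCost > s.totalTokens) {
    violations.push("idleTokenCost exceeds totalTokens");
  }
  if (s.idleTokenCost + s.duplicateTokenCost > s.totalTokens) {
    violations.push("idle + duplicates exceeds totalTokens");
  }
  return violations;
}

// Roughly the pre-fix aggregate from the symptom, with all of the
// "recoverable" amount attributed to idle cost for simplicity.
const preFix: SessionSummary = {
  totalTokens: 5_470_000_000,
  idleTokenCost: 48_500_000_000,
  duplicateTokenCost: 0,
};

// Illustrative post-fix numbers: one idle read counted once.
const sane: SessionSummary = {
  totalTokens: 316_000_000,
  idleTokenCost: 363_000,
  duplicateTokenCost: 1_000_000,
};

console.log(checkIdleCostInvariant(preFix)); // both invariants violated
console.log(checkIdleCostInvariant(sane));   // []
```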

============================================================
BUG 2 — Output ratio denominator was misleading
============================================================

Symptom: cache-heavy sessions reported output ratios near 0.5%, with
the implication that "99.5% of tokens are waste." But 91.5% of those
"total" tokens are cacheReadTokens — the GOOD/EFFICIENT path. Prompt
caching working as intended got conflated with waste. A skeptical
analyst would (correctly) shred the framing.

Root cause: src/modules/transcript/analyzers/tokens.ts line 84:
    const total = input + output + cacheRead + cacheCreation;
    const outputRatio = total > 0 ? output / total : 0;

Cache reads dominate the denominator, crushing the ratio.

Fix: subtract cacheRead from the denominator:
    const effectiveDenom = input + output + cacheCreation;
    const outputRatio = effectiveDenom > 0 ? output / effectiveDenom : 0;
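
The sketch below shows how much the denominator matters. It uses
made-up counts shaped like the cache-heavy sessions described in the
symptom, where cache reads are 91.5% of the old total. `TokenCounts`,
`outputRatioOld`, and `outputRatioNew` are hypothetical names. Only the
two formulas come from tokens.ts.

```typescript
// Hypothetical token breakdown for one session.
interface TokenCounts {
  input: number;
  output: number;
  cacheRead: number;
  cacheCreation: number;
}

// Old formula: cache reads sit in the denominator.
function outputRatioOld(t: TokenCounts): number {
  const total = t.input + t.output + t.cacheRead + t.cacheCreation;
  return total > 0 ? t.output / total : 0;
}

// New formula: only new-token work (input, output, cache writes) in the denominator.
function outputRatioNew(t: TokenCounts): number {
  const effectiveDenom = t.input + t.output + t.cacheCreation;
  return effectiveDenom > 0 ? t.output / effectiveDenom : 0;
}

// 1,000,000 tokens in total, 915,000 of them cache reads.
const cacheHeavy: TokenCounts = { input: 50_000, output: 5_000, cacheRead: 915_000, cacheCreation: 30_000 };
// No cache reads at all, like the synthetic benchmark fixtures.
const noCache: TokenCounts = { input: 40_000, output: 10_000, cacheRead: 0, cacheCreation: 0 };

console.log(outputRatioOld(cacheHeavy));             // 0.005
console.log(outputRatioNew(cacheHeavy).toFixed(4));  // "0.0588"
console.log(outputRatioOld(noCache) === outputRatioNew(noCache)); // true
```

With cacheRead at zero the two denominators are identical, so the
formula swap cannot change results for fixtures without cache reads.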

Now the ratio answers an honest question: "of the new-token work this
session, what fraction became code output?" Comparison-style usage
(the impact tool's before/after) stays consistent because both sides
use the same formula. Absolute values change. When the two sides have
different cacheRead shares, the size of the before/after gap can also
shift.

Test updates: tests/modules/transcript/analyzers/tokens.test.ts now
asserts the new formula explicitly. Snapshot tests under tests/modules/
why/__snapshots__/ re-baselined to the corrected ratio. Existing
synthetic-data tests in benchmark/runSuite.test.ts annotated with a
comment but unchanged (their fixtures have zero cacheRead so formula
swap is a no-op for them).

============================================================
WHAT THIS CHANGES FOR EXTERNAL CONSUMERS
============================================================

The JSON output of `sipcode why` and `sipcode stats` now reports:
  - outputRatio: HIGHER on cache-heavy sessions. Same field, more
    honest formula (cacheRead excluded from the denominator).
    Sessions with no cache reads are unchanged.
  - idleTokenCost: MUCH SMALLER on long sessions. The previous values
    were inflated by 10-100x in pathological cases.
  - recoverableTokens (computed downstream from idleTokenCost): same
    correction propagates. No more "recoverable > spent."

JSON schema unchanged. Field names unchanged. Only the math changed.
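
External consumers can use a minimal post-upgrade sanity check like
the one below. It assumes a flat JSON object carrying the field names
listed above plus totalTokens. The real nesting of `sipcode why` /
`sipcode stats` output is not shown here, so `SipcodeMetrics` and
`sanityCheck` are hypothetical, and the sample payload is invented.

```typescript
// Hypothetical flat view of the fields named in this note.
interface SipcodeMetrics {
  totalTokens: number;
  outputRatio: number;
  idleTokenCost: number;
  recoverableTokens: number;
}

// Post-v1.4.0 expectations: a ratio in [0, 1], and nothing "recoverable"
// beyond what the session actually spent.
function sanityCheck(m: SipcodeMetrics): string[] {
  const problems: string[] = [];
  if (m.outputRatio < 0 || m.outputRatio > 1) problems.push("outputRatio out of [0, 1]");
  if (m.idleTokenCost > m.totalTokens) problems.push("idleTokenCost > totalTokens");
  if (m.recoverableTokens > m.totalTokens) problems.push("recoverableTokens > totalTokens");
  return problems;
}

// Invented sample payload, parsed the way a consumer would read CLI output.
const sample: SipcodeMetrics = JSON.parse(
  '{"totalTokens": 316000000, "outputRatio": 0.0588, "idleTokenCost": 363000, "recoverableTokens": 400000}',
);

console.log(sanityCheck(sample)); // []
```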

============================================================
PRE-LAUNCH IMPACT
============================================================

Caught before any public reader audited the claims. This is the gold-
standard scenario: an external technical reviewer (Claude in another
session) audited the integrity contract and found the gap. We shipped
the fix in the same session.

This is exactly the loop the integrity contract was DESIGNED to
support — the tool refused to lie about insufficient data, the
analyst was therefore willing to inspect what the tool WAS reporting,
and that inspection caught the real bug. This is the brand pillar
working as designed.

854/854 tests pass locally. 5-gate pipeline green.

Minor version bump (v1.3.4 -> v1.4.0) rather than a patch, because
numeric values returned by the tool have changed. This is a correction,
not a regression, but it still warrants the bump.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>