Preflight Checklist
Type of Behavior Issue
Other unexpected behavior
What You Asked Claude to Do
Context
I run /home/jerem/weatherspoon, a Python codebase that operates real Polymarket trading bots on Polygon. Capital is at stake — buggy code or wrong diagnoses cost real money. I've been using Claude Code with Claude Opus 4.6 (1M context) for several weeks across many sessions.
I'm posting this not because the model is bad, but because I've identified a structural pattern that I think is worth surfacing. It's not about a specific bug — it's about how the model defaults to completion instead of verification.
The pattern: prediction over verification
Claude is a probabilistic model. Its default reflex is to complete the most plausible answer rather than to verify the actual state of the world. On most casual coding tasks this is fine. On a project where being wrong costs real money, it's a recurring source of failure.
In a single session today (2026-04-10, ~6h of work), I observed at least the following cases where Claude inferred instead of verified, and was wrong:
Case 1 — bot design with a structural bug
Claude wrote a contrarian trading bot (paper_no_contrarian.py) and on the first run it bought NO tokens on brackets that were structurally impossible to win (e.g., london:25°C in April, where YES was already at $0.002 and NO at $0.998 — no profit margin possible). The fix was a one-line precondition check (if not touched_97: return). If Claude had fetched 2-3 real bracket prices before writing the code, the bug would have been obvious. Instead it wrote plausible-looking logic without checking. In a live bot this would have burned capital instantly.
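For illustration, the missing precondition can be sketched as a single guard. This is a hypothetical reconstruction, not the actual paper_no_contrarian.py: the function name and the yes_price parameter are invented here, while touched_97 and the price levels come from the report.

```python
def should_enter_no(yes_price: float, touched_97: bool) -> bool:
    """Only take a contrarian NO entry if the bracket actually spiked
    (YES touched ~$0.97) — otherwise there is no premium left to fade."""
    if not touched_97:
        # e.g. YES at $0.002 / NO at $0.998: structurally nothing to win
        return False
    # Spike has receded below the threshold; a contrarian entry has edge
    return yes_price < 0.97
```

Fetching a handful of real bracket prices before coding would have made it obvious that most brackets sit in the no-margin state this guard rejects.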
Case 2 — wrong diagnosis on a restart loop
A paper bot (paper_dca_dip.py) was in a watchdog restart/kill loop. Claude's first diagnosis was "Python stdout buffering, missing -u flag". It applied that fix. The bot still died. The actual bug was in the bot's main loop: if now - last_status >= 300 (status logged every 5 minutes) versus the watchdog's stale_min: 3 (kill after 3 minutes). The bot was being killed before its second status log could fire. Claude could have read the bot's main loop in 30 seconds before guessing — the bug would have been visible. Instead it pattern-matched on "buffering issue" and applied the most common fix.
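The timing mismatch reduces to one comparison. The constants below come from the report; the helper function itself is illustrative, not the actual watchdog code:

```python
STATUS_INTERVAL_S = 300   # bot logs a status line every 5 minutes
STALE_KILL_S = 3 * 60     # watchdog kills after 3 minutes of silence

def bot_survives(status_interval_s: int, stale_kill_s: int) -> bool:
    """A heartbeat-style bot only survives its watchdog if it emits
    status lines faster than the staleness threshold."""
    return status_interval_s < stale_kill_s

# 300s between logs vs a 180s kill window: the bot is always killed
# before its next status line can fire — no amount of -u fixes that.
assert not bot_survives(STATUS_INTERVAL_S, STALE_KILL_S)
```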
Case 3 — almost killing user's active bots
Claude saw 6 paper bots (paper_edge_*, paper_strategies, paper_monitor) running on the system. They didn't appear in the user's MEMORY.md, so Claude inferred they were "zombie processes from old tests" and proposed killing them. The user had to interrupt: "those are strategies I'm currently testing." Claude could have asked one clarifying question ("are you still using these?") before suggesting a destructive action. It chose plausible inference over a 5-second clarification.
Case 4 — false claim about a product not in training
The user mentioned a model called "Mythos." Claude responded "I have no knowledge of Mythos, and the latest Anthropic models I know are Opus 4.6, Sonnet 4.6, Haiku 4.5." The user pushed back, saying "Mythos is the new Anthropic model." Claude could have web-fetched Anthropic's site at that point. It didn't. It re-stated its uncertainty without verifying. Eventually the user got frustrated and Claude finally fetched — confirming Mythos exists (released April 7, 2026, gated under Project Glasswing). This wasn't a hard verification — it was one WebFetch call. Claude's default was to repeat its training-based answer instead of checking.
Case 5 — analytical claim from extrapolation instead of data
The user asked Claude to analyze whether "inverting" a losing strategy would work. Claude produced a confident "+$700 theoretical EV" calculation based on extrapolated averages. The user pushed: "did you check whether brackets that drop below $0.80 actually rebound?" Claude then ran the actual numbers from the historical CSV and discovered the pattern was non-monotonic — exactly the opposite of what its initial extrapolation suggested. The right answer was accessible from the data the whole time. Claude reached for a story instead of the data.
Case 6 — accepting a misleading metric without auditing it
The user noticed the live bot's status log showed pnl=$+0.00 after a successful trade that should have shown +$4. Claude initially explained this away as "PnL is reset by the daily MM routine, that's normal." The user pushed: "the internal PnL is buggy, we have a cron job that calculates the real PnL on-chain." Claude then read the cron job and the calculation logic, and confirmed the user was right — the internal total_pnl field was a polluted historical accumulator (-$1953). The user had to do the work of explaining what should have been obvious if Claude had audited the metric instead of accepting it.
Pattern summary
In each case the truth was accessible by:
- one grep / one file read
- one WebFetch
- one CLOB / RPC call
- one clarifying question to the user
In each case Claude defaulted to inference. The user paid the cost in time, in trust, and (potentially) in capital.
Other systemic issues observed
Memory system
The auto-memory system (.claude/projects/.../memory/) has known weaknesses that compound the prediction problem:
- Memos become stale silently — there's no mechanism to detect that a memo's claim no longer matches the current code state
- Contradictory memos can coexist; nothing flags the contradiction
- The 200-line limit on MEMORY.md (the index loaded into context) means once a project accumulates many memos, the older ones are silently truncated
- When a memo says "X exists at line Y", Claude often takes that as fact instead of re-verifying — exactly the prediction-over-verification pattern, but applied to memory content
- There's no way for the user to mark a memo as "high priority, always reload" vs "background reference"
Trading domain specifically
Claude Code's training clearly does not cover the specifics of prediction markets, on-chain mechanics, market microstructure (bid/ask vs mid, tick sizes, book depth, neg-risk complement), or trading bot patterns (DCA, kill zones, contrarian entries). Claude can be coached on these, but it requires the user to explain everything multiple times across sessions. The memory system is supposed to fix this, but per the previous point, it doesn't reliably.
Suggestions for action
- Default verification mode for "high-stakes" projects: a project-level setting (e.g., a flag in CLAUDE.md) that puts Claude in a posture where:
  - It must verify any factual claim with a tool before stating it
  - It must read code in full before proposing fixes
  - It must mark any inference as "hypothesis: ..." explicitly
  - It refuses to take destructive actions (kill, rm, restart) without an explicit confirmation step
- Self-audit before code changes: when Claude proposes a fix, it should be required to justify it by quoting the relevant code lines it read. This catches cases where Claude pattern-matches on the symptom without reading the cause.
- Memory consistency check: before relying on a memo that claims X exists at file:line, automatically verify that the file/line still exists and matches. If not, flag the memo as stale.
- Make the prediction-vs-verification tradeoff visible to the user: when Claude is about to give a confident answer based purely on training (no tool calls), surface that fact. E.g., "This answer is from my training only; want me to verify?"
- Domain-knowledge gap awareness: when a project clearly involves a specialized domain (trading, medical, legal, etc.), Claude should be more aggressive about asking clarifying questions early in the session rather than guessing the conventions.
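The memory consistency check in particular is cheap to sketch. The function below is a hypothetical illustration of the idea (the name memo_is_fresh and its signature are invented here; they are not part of Claude Code):

```python
import os

def memo_is_fresh(path: str, line_no: int, expected_fragment: str) -> bool:
    """Trust a memo claiming `expected_fragment` exists at path:line_no
    only if the file still says so; otherwise flag it as stale."""
    if not os.path.isfile(path):
        return False  # file was deleted or moved since the memo was written
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()
    if line_no > len(lines):
        return False  # file shrank; the claimed line no longer exists
    return expected_fragment in lines[line_no - 1]
```

Running a check like this before quoting a memo would convert silent staleness into an explicit signal, at the cost of one file read.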
What I want from Anthropic
- Acknowledgment that the prediction-default behavior is a real problem on capital-at-risk projects, not a "user education" issue
- A path forward — either via Mythos broader access (if Project Glasswing is the only model with significantly better verification habits), or via product changes to make the existing models verify more aggressively
- A way to opt into stricter behavior without needing to fight the model in every prompt
Technical context
- Claude Code version: (paste from claude-code --version)
- Model: Claude Opus 4.6 (1M context), claude-opus-4-6[1m]
- Use case: Python trading bots, real capital, ~6 active bots running 24/7
- Project size: ~12k lines Python, 80+ scripts, 30+ memos in auto-memory
This issue was drafted during a session with Claude itself, after the model acknowledged the pattern when explicitly challenged. The model is capable of recognizing the issue when prompted — the problem is that the default behavior doesn't surface it proactively.
What Claude Actually Did
Expected Behavior
Files Affected
Permission Mode
Accept Edits was ON (auto-accepting changes)
Can You Reproduce This?
Haven't tried to reproduce
Steps to Reproduce
No response
Claude Model
Opus
Relevant Conversation
Impact
Critical - Data loss or corrupted project
Claude Code Version
4.6
Platform
Other
Additional Context
VS Code and Claude Code on my VPS
Preflight Checklist
Type of Behavior Issue
Other unexpected behavior
What You Asked Claude to Do
Context
I run
/home/jerem/weatherspoon, a Python codebase that operates real Polymarket trading bots on Polygon. Capital is at stake — buggy code or wrong diagnoses cost real money. I've been using Claude Code with Claude Opus 4.6 (1M context) for several weeks across many sessions.I'm posting this not because the model is bad, but because I've identified a structural pattern that I think is worth surfacing. It's not about a specific bug — it's about how the model defaults to completion instead of verification.
The pattern: prediction over verification
Claude is a probabilistic model. Its default reflex is to complete the most plausible answer rather than to verify the actual state of the world. On most casual coding tasks this is fine. On a project where being wrong costs real money, it's a recurring source of failure.
In a single session today (2026-04-10, ~6h of work), I observed at least the following cases where Claude inferred instead of verified, and was wrong:
Case 1 — bot design with a structural bug
Claude wrote a contrarian trading bot (
paper_no_contrarian.py) and on the first run it bought NO tokens on brackets that were structurally impossible to win (e.g.,london:25°Cin April, where YES was already at $0.002 and NO at $0.998 — no profit margin possible). The fix was a one-line precondition check (if not touched_97: return). If Claude had fetched 2-3 real bracket prices before writing the code, the bug would have been obvious. Instead it wrote plausible-looking logic without checking. In a live bot this would have grilled capital instantly.Case 2 — wrong diagnosis on a restart loop
A paper bot (
paper_dca_dip.py) was in a watchdog restart/kill loop. Claude's first diagnosis was "Python stdout buffering, missing-uflag". It applied that fix. The bot still died. The actual bug was in the bot's main loop:if now - last_status >= 300(status logged every 5 minutes) versus the watchdog'sstale_min: 3(kill after 3 minutes). The bot was being killed before its second status log could fire. Claude could have read the bot's main loop in 30 seconds before guessing — the bug would have been visible. Instead it pattern-matched on "buffering issue" and applied the most common fix.Case 3 — almost killing user's active bots
Claude saw 6 paper bots (
paper_edge_*,paper_strategies,paper_monitor) running on the system. They didn't appear in the user'sMEMORY.md, so Claude inferred they were "zombie processes from old tests" and proposed killing them. The user had to interrupt: "those are strategies I'm currently testing." Claude could have asked one clarifying question ("are you still using these?") before suggesting a destructive action. It chose plausible inference over a 5-second clarification.Case 4 — false claim about a product not in training
The user mentioned a model called "Mythos." Claude responded "I have no knowledge of Mythos, and the latest Anthropic models I know are Opus 4.6, Sonnet 4.6, Haiku 4.5." The user pushed back, saying "Mythos is the new Anthropic model." Claude could have web-fetched Anthropic's site at that point. It didn't. It re-stated its uncertainty without verifying. Eventually the user got frustrated and Claude finally fetched — confirming Mythos exists (released April 7, 2026, gated under Project Glasswing). This wasn't a hard verification — it was one WebFetch call. Claude's default was to repeat its training-based answer instead of checking.
Case 5 — analytical claim from extrapolation instead of data
The user asked Claude to analyze whether "inverting" a losing strategy would work. Claude produced a confident "+$700 theoretical EV" calculation based on extrapolated averages. The user pushed: "did you check if brackets that drop below $0.80 actually rebound?". Claude then ran the actual numbers from the historical CSV and discovered the pattern was non-monotonic — exactly the opposite of what its initial extrapolation suggested. The right answer was accessible from data the whole time. Claude went straight to a story instead of the data.
Case 6 — accepting a misleading metric without auditing it
The user noticed the live bot's status log showed
pnl=$+0.00after a successful trade that should have shown +$4. Claude initially explained this away as "PnL is reset by the daily MM routine, that's normal." The user pushed: "the internal PnL is buggy, we have a cron job that calculates the real PnL on-chain." Claude then read the cron job and the calculation logic, and confirmed the user was right — the internaltotal_pnlfield was a polluted historical accumulator (-$1953). The user had to do the work of explaining what should have been obvious if Claude had audited the metric instead of accepting it.Pattern summary
In each case the truth was accessible by:
grep/ one file readWebFetchIn each case Claude defaulted to inference. The user paid the cost in time, in trust, and (potentially) in capital.
Other systemic issues observed
Memory system
The auto-memory system (
.claude/projects/.../memory/) has known weaknesses that compound the prediction problem:MEMORY.md(the index loaded into context) means once a project accumulates many memos, the older ones are silently truncatedTrading domain specifically
Claude Code's training clearly does not cover the specifics of prediction markets, on-chain mechanics, market microstructure (bid/ask vs mid, tick sizes, book depth, neg-risk complement), or trading bot patterns (DCA, kill zones, contrarian entries). Claude can be coached on these, but it requires the user to explain everything multiple times across sessions. The memory system is supposed to fix this, but per the previous point, it doesn't reliably.
Suggestions for action
Default verification mode for "high-stakes" projects: a project-level setting (e.g., a flag in `CLAUDE.md`) that puts Claude in a posture where it does not run destructive actions (`kill`, `rm`, restart) without an explicit confirmation step.

Self-audit before code changes: when Claude proposes a fix, it should be required to justify it by quoting the relevant code lines it read. This catches cases where Claude pattern-matches on the symptom without reading the cause.
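The confirmation step could look something like a gate in front of the shell tool. This is a hypothetical sketch, not an existing Claude Code API:

```python
import shlex

# Hypothetical sketch of the proposed confirmation gate; the command list
# and the API are illustrative, not an existing Claude Code feature.
DESTRUCTIVE = {"kill", "pkill", "rm", "systemctl"}

def needs_confirmation(command: str) -> bool:
    tokens = shlex.split(command)
    return bool(tokens) and tokens[0] in DESTRUCTIVE

def run_gated(command: str, confirm) -> bool:
    """Execute only if the command is safe or the user confirms it."""
    if needs_confirmation(command) and not confirm(command):
        return False  # held for user review instead of executed
    # ... hand the command to the real executor here ...
    return True
```

With this in place, "kill those 6 paper bots" becomes a question to the user instead of an action.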
Memory consistency check: before relying on a memo that claims `X exists at file:line`, automatically verify that the file/line still exists and matches. If not, flag the memo as stale.

Make the prediction-vs-verification tradeoff visible to the user: when Claude is about to give a confident answer based purely on training (no tool calls), surface that fact. E.g., "This answer is from my training only; want me to verify?"
Domain-knowledge gap awareness: when a project clearly involves a specialized domain (trading, medical, legal, etc.), Claude should be more aggressive about asking clarifying questions early in the session rather than guessing the conventions.
What I want from Anthropic
Technical context
- Claude Code version (`claude-code --version`)
- Model: `claude-opus-4-6[1m]`

This issue was drafted during a session with Claude itself, after the model acknowledged the pattern when explicitly challenged. The model is capable of recognizing the issue when prompted — the problem is that the default behavior doesn't surface it proactively.
What Claude Actually Did
Expected Behavior
Files Affected
Permission Mode
Accept Edits was ON (auto-accepting changes)
Can You Reproduce This?
Haven't tried to reproduce
Steps to Reproduce
No response
Claude Model
Opus
Relevant Conversation
Impact
Critical - Data loss or corrupted project
Claude Code Version
4.6
Platform
Other
Additional Context
VS Code and Claude Code on my VPS