[experiments] Daily Experiment Report — 2026-05-10 #31318
Closed
This discussion has been marked as outdated by daily-experiment-report. A newer discussion is available at Discussion #31462.
🧪 Daily Experiment Report — 2026-05-10
5 experiments analysed across 5 workflows. All experiments are in EXTEND status: insufficient data to reach statistical significance. No experiments reached min_samples thresholds. No guardrail violations detected (except the 100% failure rate in `daily-issues-report`, which warrants investigation).

prompt_style · daily-astrostylelite-markdown-spellcheck
Concise prompt reduces token consumption ≥20% without degrading fix precision. H0: no difference in fix rate.
Recommendation: EXTEND. The `concise` variant has 0 runs; the experiment has just started. The `detailed` control shows an 80% success rate, slightly below the `run_success_rate >= 0.90` guardrail, which is worth monitoring.

prompt_style · daily-community-attribution
(hypothesis not specified: bare array experiment)
Recommendation: EXTEND. Both variants show 100% success, but sample sizes are far too small (3 vs 2) to draw any conclusions. The duration difference (604s vs 360s) is not statistically significant (p≈0.17).
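To illustrate why samples of 3 vs 2 runs cannot support a conclusion about success rates, here is a minimal sketch using a two-sided Fisher exact test. This is an assumption for illustration only: the report does not say which test produced its p-values, and the function name `fisher_exact_two_sided` is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are variants, columns are (successes, failures).
    Illustrative only; not necessarily the test used by the report.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x successes in row 1, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum every table that is at least as unlikely as the observed one
    # (small relative tolerance guards against float rounding)
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


# Observed in the report: 3/3 vs 2/2 successes
print(fisher_exact_two_sided(3, 0, 2, 0))  # 1.0

# Most extreme outcome possible with 3 vs 2 runs: 3/3 vs 0/2 successes
print(fisher_exact_two_sided(3, 0, 0, 2))  # 0.1
```

Under this test, even the most lopsided result possible with 3 vs 2 runs gives p = 0.1, so no outcome at these sample sizes could fall below a conventional 0.05 threshold.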
output_format · daily-issues-report
H0: no change in discussion engagement score. H1: inline format produces ≥20% higher reactions+replies.
Recommendation: EXTEND ⚠️. All 4 runs have failed (0% success rate for both variants). This is a critical signal that the `daily-issues-report` workflow itself is broken, not that one variant outperforms another. Investigation recommended before collecting more experiment data.

output_format · deep-report
H0: no change in discussion engagement or token cost. H1: executive_brief reduces token usage by ≥20% without reducing engagement; annotated_brief improves actionability.
Recommendation: EXTEND. Only 3 runs total; `annotated_brief` has no data yet. All runs succeeded so far, which is a good signal, but far from min_samples=15 per variant.

caveman · smoke-copilot
(hypothesis not specified: bare array experiment)
Recommendation: EXTEND. Most-sampled experiment (11 runs total). No statistically significant difference in success rate (p=0.888) or duration (p≈0.55). The `yes` (caveman mode) variant is slightly faster on average (1134s vs 1382s) but with high variance and wide CIs. Need ~14–15 more runs per variant to reach min_samples.

📊 Summary
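The decision rule that runs through the recommendations above (stay in EXTEND until every variant reaches min_samples, and flag any variant below the success-rate guardrail) could be sketched roughly as follows. The names `VariantStats` and `assess`, the `READY` status, and the run counts for `detailed` are hypothetical; only `min_samples=15` (stated for deep-report) and the `run_success_rate >= 0.90` guardrail come from the report, and applying that `min_samples` value to other experiments is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class VariantStats:
    name: str
    runs: int
    successes: int

    @property
    def success_rate(self) -> Optional[float]:
        # Undefined (None) until the variant has at least one run
        return self.successes / self.runs if self.runs else None


def assess(variants: List[VariantStats], min_samples: int = 15,
           guardrail: float = 0.90) -> Tuple[str, List[str]]:
    """Return a status plus human-readable notes for one experiment."""
    notes = []
    for v in variants:
        if v.runs < min_samples:
            notes.append(f"{v.name}: {v.runs} runs, {min_samples - v.runs} more needed")
        rate = v.success_rate
        if rate is not None and rate < guardrail:
            notes.append(
                f"{v.name}: success rate {rate:.0%} is below the {guardrail:.0%} guardrail"
            )
    under_sampled = any(v.runs < min_samples for v in variants)
    # "READY" is a placeholder; the report itself only mentions EXTEND
    return ("EXTEND" if under_sampled else "READY"), notes


status, notes = assess([
    VariantStats("detailed", runs=5, successes=4),  # hypothetical counts giving 80%
    VariantStats("concise", runs=0, successes=0),   # report: 0 runs so far
])
print(status)
for note in notes:
    print(" -", note)
```

A variant with zero runs has no success rate, so the sketch skips the guardrail check for it instead of treating it as a violation.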
Warning
Firewall blocked 1 domain
The following domain was blocked by the firewall during workflow execution:
`proxy.golang.org`
See Network Configuration for more information.