[experiments] Daily Experiment Report — 2026-06-25 #41408
This discussion has been marked as outdated by daily-experiment-report. A newer discussion is available at Discussion #41642.
🧪 Daily Experiment Report — 2026-06-25
35 experiments across 32 workflows · 11 reached sample targets · 0 statistically significant (p < 0.05) · Outcome data fetched for 8 experiments via Actions run history.
⚡ Quick Stats
🔬 Detailed Analysis (8 experiments with outcome data)
prompt_style · ci-coach: detailed (ctrl) vs concise
🟡 EXTEND. ⚠️ Guardrail violation: run_success_rate < 0.85 for both variants (detailed=62.5%, concise=69.2%). Need 4+7 more runs.
tone_style · typist: formal (ctrl) vs conversational
❌ ABANDON: No significant effect (p=0.26). Both variants at min_samples. conversational shows 100% vs formal 90%, which is directional but not significant.
caveman · smoke-copilot: no (ctrl) vs yes
❌ ABANDON: No significant effect (p=0.59). Both variants well past min_samples (105/104 runs). Caveman mode shows no measurable impact.
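The guardrail flagged for prompt_style · ci-coach above can be pictured as a simple threshold check. This is a minimal sketch, not the report generator's actual code: the function name `guardrail_violations` and the dictionary layout are hypothetical, and only the 0.85 threshold and the two success rates come from the report.

```python
# Threshold quoted in the report for run_success_rate.
GUARDRAIL_MIN_SUCCESS_RATE = 0.85


def guardrail_violations(success_rates, threshold=GUARDRAIL_MIN_SUCCESS_RATE):
    """Return the variants whose success rate falls below the threshold.

    `success_rates` maps variant name -> success rate in [0, 1].
    (Hypothetical helper; the real pipeline's data model is not shown in the report.)
    """
    return {name: rate for name, rate in success_rates.items() if rate < threshold}


# Success rates reported for prompt_style · ci-coach.
ci_coach = {"detailed": 0.625, "concise": 0.692}
violations = guardrail_violations(ci_coach)
print(violations)
```

Both variants come back as violations here, which is consistent with the report flagging the experiment as a whole rather than one arm.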
subagent_model · smoke-copilot: large (ctrl) vs small
❌ ABANDON: No significant effect (p=0.59). The large and small subagent models show no meaningful difference in success or duration.
sub_agent_strategy · smoke-gemini: sub_agents (ctrl) vs single_agent
❌ ABANDON. ⚠️ UNBALANCED assignment (sub_agents=98, single_agent=67, balance p=0.016). Both variants at 100% success, so no outcome test is possible. Duration p=0.35.
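The report does not say how the balance p-value for smoke-gemini was computed. One plausible reading, sketched below as an assumption, is a chi-square goodness-of-fit test of the two assignment counts against an even 50/50 split. The function name `balance_p_value` is hypothetical; only the counts 98 and 67 come from the report.

```python
from math import erfc, sqrt


def balance_p_value(n_a, n_b):
    """Chi-square goodness-of-fit p-value against a 50/50 split (1 degree of freedom).

    Assumption: the intended allocation is even between the two variants.
    """
    expected = (n_a + n_b) / 2
    chi2 = ((n_a - expected) ** 2 + (n_b - expected) ** 2) / expected
    # With 1 degree of freedom the chi-square survival function is erfc(sqrt(x / 2)).
    return erfc(sqrt(chi2 / 2))


# Assignment counts reported for sub_agent_strategy · smoke-gemini.
p = balance_p_value(98, 67)
print(f"balance p = {p:.3f}")
```

With these counts the sketch gives roughly 0.016, in line with the reported value. That suggests, but does not confirm, that a test of this kind was used.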
sub_agent_strategy · smoke-antigravity: single_agent (ctrl) vs sub_agents
❌ ABANDON: Both variants at 100% success, so no success rate test is possible. Duration not significant (p=0.40). No detectable effect.
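Both sub_agent_strategy experiments note that no success rate test is possible when every run succeeds. The sketch below illustrates why, using Fisher's exact test on a 2x2 success/failure table. Fisher's test is an assumption chosen for illustration, since the report does not name the test behind its p-values, and `fisher_exact_two_sided` is a hypothetical helper. The counts are the smoke-gemini assignment counts (98 and 67, all successful).

```python
from math import comb


def fisher_exact_two_sided(s1, f1, s2, f2):
    """Two-sided Fisher exact p-value for a 2x2 success/failure table.

    s1, f1: successes and failures in arm 1; s2, f2: the same for arm 2.
    """
    n1, n2 = s1 + f1, s2 + f2
    successes = s1 + s2
    denom = comb(n1 + n2, successes)

    def prob(k):
        # Hypergeometric probability that arm 1 holds k of the successes.
        return comb(n1, k) * comb(n2, successes - k) / denom

    observed = prob(s1)
    lo, hi = max(0, successes - n2), min(n1, successes)
    # Sum every table that is at least as unlikely as the observed one.
    tail = sum(p for p in map(prob, range(lo, hi + 1)) if p <= observed * (1 + 1e-9))
    return min(1.0, tail)


# smoke-gemini: 98 and 67 runs, zero failures in either arm.
p_value = fisher_exact_two_sided(98, 0, 67, 0)
print(p_value)
```

With zero failures only one table is compatible with the margins, so the p-value is 1.0 regardless of sample size. Collecting more all-success runs cannot change that, which is why duration is the only outcome left to compare.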
prompt_style · daily-community-attribution: concise (ctrl) vs verbose
❌ ABANDON: Close to significant, with concise=100% vs verbose=81% success (p=0.088), but this does not meet the p<0.05 threshold. No detectable effect at the current sample size.
prompt_compression · agent-performance-analyzer: verbose (ctrl) vs caveman
🟡 EXTEND: verbose variant at 11/14 runs (79%). Directional: verbose=100% vs caveman=89.5% success; caveman is 100s faster. p=0.27, so power is still insufficient.
🟢 Ready: Outcome Data Pending
caveman · smoke-copilot-aoai-apikey
subagent_model · smoke-copilot-aoai-apikey
caveman · smoke-copilot-aoai-entra
subagent_model · smoke-copilot-aoai-entra
sub_agent_decomposition · smoke-pi
🟡 Collecting Data: Close to Threshold
prompt_compression · agent-performance-analyzer · #33280
output_format · daily-compiler-quality · #32390
output_format · daily-code-metrics · #1
prompt_style · ci-coach · #32335
sub_agent_strategy · agent-persona-explorer
output_format · deep-report
prompt_style · daily-news · #31190
output_format · daily-issues-report · #30573
reasoning_depth · daily-security-red-team · #31673
tone_variant · aw-failure-investigator · #36105
Warning
Firewall blocked 2 domains
The following domains were blocked by the firewall during workflow execution:
proxy.golang.org
releaseassets.githubusercontent.com
See Network Configuration for more information.