[experiments] Daily Experiment Report — 2026-05-11 #31462

2026-05-11T09:16:33Z

github-actions[bot]
Bot May 11, 2026

🧪 Daily Experiment Report — 2026-05-11

5 active experiments analysed across 5 workflows. Four remain in EXTEND status, and none has reached its minimum sample threshold. One experiment (daily-issues-report / output_format) has a guardrail violation, with a 0% success rate across all 5 runs, and carries an ABANDON recommendation pending infrastructure investigation.

`output_format` · `deep-report`

Variants: full_briefing vs executive_brief vs annotated_brief · Window: last 30 runs · Analysed: 3 runs with assignments
min_samples: 15 per variant

H0: no change. H1: executive_brief reduces token usage ≥20% without reducing engagement; annotated_brief improves actionability.

Experiment : output_format
Workflow   : deep-report.lock.yml
Hypothesis : H0: no change. H1: executive_brief -20% tokens; annotated_brief improves actionability.
Window     : last 30 runs  |  Analysed: 3 runs with artifacts
min_samples: 15 per variant

+------------------+------+----------+----------------+--------------------+-----------+---------------+
| Variant          |  n   | Succ %   | Mean dur (s)   | 95% CI (s)         |  p-value  | min_samples   |
+------------------+------+----------+----------------+--------------------+-----------+---------------+
| full_briefing    |   1  | 100.0%   |    849         |       N/A          |  (ref)    |  1/15 ( 7%)   |
| executive_brief  |   2  | 100.0%   |    762         | [  -45 , 1568]     |   N/A     |  2/15 (13%)   |
| annotated_brief  |   0  |   N/A    |    N/A         |       N/A          |   N/A     |  0/15 ( 0%)   |
+------------------+------+----------+----------------+--------------------+-----------+---------------+
Significance: * p<0.05   ** p<0.01   *** p<0.001
p-value is two-tailed, compared against the control (first) variant.

Guardrails:
  empty_output_rate ==0  : cannot evaluate (insufficient runs)
  issue_creation_success >=0.8 : cannot evaluate (insufficient runs)

Recommendation: EXTEND
Rationale     : Only 3 total runs since experiment start (2026-05-06); annotated_brief has zero runs — far below min_samples=15 per variant.

Recommendation: EXTEND — Collect more data; annotated_brief has received no runs yet.
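The negative lower bound on the executive_brief interval ([-45, 1568] around a 762 s mean) is what a Student-t interval looks like at n=2, where the critical value for one degree of freedom is 12.706. The report does not say how its intervals are computed, so the following is only a sketch under that assumption; `implied_sd` and `T_CRIT_975` are hypothetical names, and the critical values are standard table entries.

```python
import math

# 97.5th-percentile Student-t critical values (standard table entries),
# keyed by degrees of freedom. Only a few small df are listed here.
T_CRIT_975 = {1: 12.706, 2: 4.303, 4: 2.776, 5: 2.571, 6: 2.447}


def implied_sd(ci_low: float, ci_high: float, n: int) -> float:
    """Back out the sample standard deviation implied by a 95% CI.

    Assumes the interval is mean +/- t * sd / sqrt(n), which the report
    does not confirm.
    """
    half_width = (ci_high - ci_low) / 2
    std_err = half_width / T_CRIT_975[n - 1]
    return std_err * math.sqrt(n)


# executive_brief: n=2, reported 95% CI [-45, 1568]
sd = implied_sd(-45, 1568, 2)
print(f"implied sd: {sd:.1f} s")
```

Under that assumption the two runs differ by only about 90 s in standard deviation terms, so the interval's width comes almost entirely from the tiny sample rather than from noisy runs.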

`output_format` · `daily-issues-report`

Variants: collapsible vs inline · Window: last 30 runs · Analysed: 5 runs with assignments
min_samples: 30 per variant · Tracking issue: #30573

H0: no change. H1: inline format produces ≥20% higher reactions+replies by making charts/recommendations immediately visible.

Experiment : output_format
Workflow   : daily-issues-report.lock.yml
Hypothesis : H0: no change. H1: inline format +20% engagement.
Window     : last 30 runs  |  Analysed: 5 runs with artifacts
min_samples: 30 per variant

+-------------+------+----------+----------------+--------------------+-----------+---------------+
| Variant     |  n   | Succ %   | Mean dur (s)   | 95% CI (s)         |  p-value  | min_samples   |
+-------------+------+----------+----------------+--------------------+-----------+---------------+
| collapsible |   3  |   0.0%   |    310         | [ 261 ,  360]      |  (ref)    | 3/30 (10%)    |
| inline      |   2  |   0.0%   |    308         |       N/A          |   N/A     | 2/30 ( 7%)    |
+-------------+------+----------+----------------+--------------------+-----------+---------------+
Significance: * p<0.05   ** p<0.01   *** p<0.001
p-value is two-tailed, compared against the control (first) variant.

Guardrails:
  empty_output_rate ==0 : FAIL (inferred ≈1.0 — both variants have 0% run success) ← ABANDON

Recommendation: ABANDON
Rationale     : All 5 runs have failed regardless of variant, indicating a systemic workflow failure; guardrail empty_output_rate==0 is violated.

Recommendation: ABANDON — ⚠️ Guardrail violation: the workflow has 0% success rate on both variants since the experiment started (2026-05-07). This suggests a systemic infrastructure issue unrelated to the experiment variants. Investigate and fix the underlying failure before resuming the experiment.
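The recommendations in this report are consistent with a simple triage order: a failed guardrail forces ABANDON, and otherwise any variant below min_samples means EXTEND. The sketch below illustrates that ordering only. The report does not publish its actual decision logic, so `Variant`, `recommend`, and the `"EVALUATE"` placeholder are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    name: str
    n: int  # runs analysed for this variant


def recommend(variants: list[Variant], min_samples: int,
              guardrail_failed: bool) -> str:
    """Hypothetical triage: guardrail failures win, then sample size."""
    if guardrail_failed:
        return "ABANDON"
    if any(v.n < min_samples for v in variants):
        return "EXTEND"
    # Placeholder outcome: enough data to test the hypothesis properly.
    return "EVALUATE"


# daily-issues-report: guardrail empty_output_rate==0 is violated
print(recommend([Variant("collapsible", 3), Variant("inline", 2)], 30, True))
# smoke-copilot: no guardrails declared, both variants below min_samples=20
print(recommend([Variant("no", 6), Variant("yes", 7)], 20, False))
```

Checking guardrails before sample size is what lets a broken workflow be flagged after 5 runs instead of waiting for 30 per variant.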

`caveman` · `smoke-copilot`

Variants: no (control) vs yes · Window: last 30 runs · Analysed: 13 runs with assignments
min_samples: 20 per variant

Hypothesis: (not specified)

Experiment : caveman
Workflow   : smoke-copilot.lock.yml
Hypothesis : (not specified)
Window     : last 30 runs  |  Analysed: 13 runs with artifacts
min_samples: 20 per variant

+---------+------+----------+----------------+--------------------+-----------+---------------+
| Variant |  n   | Succ %   | Mean dur (s)   | 95% CI (s)         |  p-value  | min_samples   |
+---------+------+----------+----------------+--------------------+-----------+---------------+
| no      |   6  |  66.7%   |   1292         | [  546 , 2037]     |  (ref)    |  6/20 (30%)   |
| yes     |   7  |  85.7%   |   1064         | [  556 , 1571]     |  0.4164   |  7/20 (35%)   |
+---------+------+----------+----------------+--------------------+-----------+---------------+
Significance: * p<0.05   ** p<0.01   *** p<0.001
p-value is two-tailed (success rate), compared against the control (no) variant.
Duration p-value (Welch t-test): 0.5379

Guardrails: none declared

Recommendation: EXTEND
Rationale     : Both variants below min_samples=20; preliminary p-value for success rate is 0.4164 — no significant effect detected yet.

Recommendation: EXTEND — Early data shows yes (caveman) trending 19pp higher success rate (85.7% vs 66.7%), but n=6/7 is well below min_samples=20 and p=0.42 is not significant. Collect more runs.

`prompt_style` · `daily-astrostylelite-markdown-spellcheck`

Variants: detailed (control) vs concise · Window: last 30 runs · Analysed: 6 runs with assignments
min_samples: 30 per variant

Concise prompt reduces token consumption ≥20% without degrading fix precision. H0: no difference in fix rate.

Experiment : prompt_style
Workflow   : daily-astrostylelite-markdown-spellcheck.lock.yml
Hypothesis : Concise prompt reduces token consumption ≥20% without degrading fix precision.
Window     : last 30 runs  |  Analysed: 6 runs with artifacts
min_samples: 30 per variant

+----------+------+----------+----------------+--------------------+-----------+---------------+
| Variant  |  n   | Succ %   | Mean dur (s)   | 95% CI (s)         |  p-value  | min_samples   |
+----------+------+----------+----------------+--------------------+-----------+---------------+
| detailed |   5  |  80.0%   |    398         | [  220 ,  575]     |  (ref)    |  5/30 (17%)   |
| concise  |   1  | 100.0%   |    478         |       N/A          |  0.6242   |  1/30 ( 3%)   |
+----------+------+----------+----------------+--------------------+-----------+---------------+
Significance: * p<0.05   ** p<0.01   *** p<0.001
p-value is two-tailed (success rate), compared against the control (detailed) variant.

Guardrails:
  empty_output_rate <0.10 : cannot evaluate (insufficient runs; 1 run for concise)
  run_success_rate >=0.90 : WARN detailed=0.80 below threshold (insufficient data to conclude)

Recommendation: EXTEND
Rationale     : Severely imbalanced allocation (detailed:5 vs concise:1); both far below min_samples=30 — collect more runs.

Recommendation: EXTEND — Allocation is severely imbalanced (5 detailed vs 1 concise), and both variants are far from min_samples=30. Note that detailed is currently at an 80% success rate (4 of 5 runs), 10pp below its ≥0.90 guardrail threshold; worth monitoring.
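Guardrails in this report are written as a metric name, a comparison operator, and a threshold, for example `run_success_rate >=0.90` or `empty_output_rate ==0`. A minimal sketch of evaluating such a rule against an observed value follows. The parsing rules and the name `guardrail_passes` are assumptions for illustration; the report does not show how its generator evaluates guardrails, and it additionally reports "cannot evaluate" when runs are too few, which this sketch does not model.

```python
import operator
import re

# Comparison operators that appear in the report's guardrail rules,
# plus their obvious counterparts.
OPS = {
    "==": operator.eq,
    ">=": operator.ge,
    "<=": operator.le,
    "<": operator.lt,
    ">": operator.gt,
}


def guardrail_passes(rule: str, observed: float) -> bool:
    """Return True if `observed` satisfies a rule like 'metric >=0.90'."""
    match = re.fullmatch(r"\s*(\w+)\s*(==|>=|<=|<|>)\s*([\d.]+)\s*", rule)
    if match is None:
        raise ValueError(f"unrecognised guardrail: {rule!r}")
    _metric, op, threshold = match.groups()
    return OPS[op](observed, float(threshold))


# detailed variant: 0.80 observed against a >=0.90 threshold
print(guardrail_passes("run_success_rate >=0.90", 0.80))
```

A production version would also need a minimum-run check before returning a verdict, since a single failed run out of five is weak evidence either way.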

`prompt_style` · `daily-community-attribution`

Variants: concise (control) vs verbose · Window: last 30 runs · Analysed: 6 runs with assignments
min_samples: 20 per variant

Hypothesis: (not specified)

Experiment : prompt_style
Workflow   : daily-community-attribution.lock.yml
Hypothesis : (not specified)
Window     : last 30 runs  |  Analysed: 6 runs with artifacts
min_samples: 20 per variant

+----------+------+----------+----------------+--------------------+-----------+---------------+
| Variant  |  n   | Succ %   | Mean dur (s)   | 95% CI (s)         |  p-value  | min_samples   |
+----------+------+----------+----------------+--------------------+-----------+---------------+
| concise  |   3  | 100.0%   |    604         | [  23  , 1185]     |  (ref)    |  3/20 (15%)   |
| verbose  |   3  | 100.0%   |    335         | [ 131  ,  539]     |  N/A *    |  3/20 (15%)   |
+----------+------+----------+----------------+--------------------+-----------+---------------+
Significance: * p<0.05   ** p<0.01   *** p<0.001
p-value is two-tailed, compared against the control (first) variant.
* Success rate p-value: N/A (both variants 100%). Duration p-value: 0.1754.

Guardrails: none declared

Recommendation: EXTEND
Rationale     : Only 6 total runs (3 per variant), well below min_samples=20; p≥0.05 for duration.

Recommendation: EXTEND — Both variants show 100% success rate so far, but sample sizes are tiny. Interesting early signal: verbose runs 44% faster (335s vs 604s mean), though CIs overlap and p=0.18 — not significant yet.

📊 Summary

| Experiment | Workflow | Control | Best variant | p-value | Guardrails | Recommendation |
|---|---|---|---|---|---|---|
| output_format | deep-report | full_briefing | executive_brief | N/A | ⚠️ cannot evaluate | EXTEND |
| output_format | daily-issues-report | collapsible | — | N/A | ❌ FAIL (0% success) | ABANDON |
| caveman | smoke-copilot | no | yes (+19pp) | 0.4164 | ✅ none declared | EXTEND |
| prompt_style | daily-astrostylelite | detailed | concise | 0.6242 | ⚠️ detailed<0.90 | EXTEND |
| prompt_style | daily-community-attribution | concise | verbose (-44% dur) | 0.1754 | ✅ none declared | EXTEND |

Analysis window: last 30 runs per workflow · Significance threshold: p < 0.05 (two-tailed)
Run: §25660166266

Warning

Firewall blocked 2 domains

The following domains were blocked by the firewall during workflow execution:

productionresultssa12.blob.core.windows.net
proxy.golang.org

To allow these domains, add them to the network.allowed list in your workflow frontmatter:

network:
  allowed:
    - defaults
    - "productionresultssa12.blob.core.windows.net"
    - "proxy.golang.org"

See Network Configuration for more information.

Generated by daily-experiment-report · ● 63M · ◷

expires on May 14, 2026, 9:16 AM UTC

2026-05-12T09:03:22Z

github-actions[bot]
Bot May 12, 2026
Author

This discussion has been marked as outdated by daily-experiment-report.

A newer discussion is available at Discussion #31659.
