[experiments] Daily Experiment Report — 2026-05-10 #31318

2026-05-10T08:43:44Z

github-actions[bot]
Bot May 10, 2026

🧪 Daily Experiment Report — 2026-05-10

5 experiments analysed across 5 workflows. All experiments are in EXTEND status: none has reached its min_samples threshold, so no result is statistically significant. Two issues warrant attention: the detailed control in daily-astrostylelite-markdown-spellcheck fails its run_success_rate >=0.90 guardrail (0.80), and daily-issues-report shows a 100% failure rate across all 4 runs, which should be investigated.

`prompt_style` · `daily-astrostylelite-markdown-spellcheck`

Variants: detailed (control) vs concise · Window: last 30 runs · Analysed: 5 runs with artifacts
min_samples: 30 per variant

Concise prompt reduces token consumption ≥20% without degrading fix precision. H0: no difference in fix rate.

Experiment : prompt_style
Workflow   : daily-astrostylelite-markdown-spellcheck
Hypothesis : Concise prompt reduces token consumption ≥20% without degrading fix precision.
Window     : last 30 runs  |  Analysed: 5 runs with artifacts
min_samples: 30 per variant

+------------------+------+----------+----------------+--------------------+-----------+----------------+
| Variant          |  n   | Succ %   | Mean dur (s)   | 95% CI (s)         |  p-value  | min_samples    |
+------------------+------+----------+----------------+--------------------+-----------+----------------+
| detailed (ctrl)  |   5  |  80.0%   |    397.6       | [220.1 , 575.1]    |  (ref)    |  5/30 (17%)   |
| concise          |   0  |   N/A    |     N/A        |     N/A            |   N/A     |  0/30 ( 0%)   |
+------------------+------+----------+----------------+--------------------+-----------+----------------+
Significance: * p<0.05   ** p<0.01   *** p<0.001
p-value is two-tailed, compared against the control (first) variant.

Guardrails:
  empty_output_rate <0.10  : INSUFFICIENT DATA
  run_success_rate  >=0.90 : FAIL for detailed (0.80) — below threshold

Recommendation: EXTEND
Rationale     : The concise variant has 0 runs; experiment has not started accumulating data for that variant.

Recommendation: EXTEND. The concise variant has 0 runs, so there is nothing to compare yet. The detailed control shows an 80% success rate (4 of 5 runs), which is below the run_success_rate >=0.90 guardrail and worth monitoring.
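
The guardrail lines above compare an observed rate against a threshold and fall back to INSUFFICIENT DATA when nothing has been measured. A minimal sketch of that check follows, assuming a guardrail is an operator plus a threshold; the function name `check_guardrail` and its signature are hypothetical and are not taken from the daily-experiment-report workflow.

```python
from typing import Optional


def check_guardrail(value: Optional[float], op: str, threshold: float) -> str:
    """Return PASS, FAIL, or INSUFFICIENT DATA for a single guardrail.

    `value` is the observed rate, or None when no runs produced the metric.
    """
    if value is None:
        return "INSUFFICIENT DATA"
    comparisons = {
        "<": value < threshold,
        "<=": value <= threshold,
        ">": value > threshold,
        ">=": value >= threshold,
        "==": value == threshold,
    }
    if op not in comparisons:
        raise ValueError(f"unsupported operator: {op}")
    return "PASS" if comparisons[op] else "FAIL"


# detailed control: 4 successful runs out of 5
print(check_guardrail(4 / 5, ">=", 0.90))  # FAIL
# empty_output_rate was not measured for this experiment
print(check_guardrail(None, "<", 0.10))    # INSUFFICIENT DATA
```

Treating a missing measurement as its own state, rather than as a pass, keeps an unmeasured guardrail from being reported as satisfied.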

`prompt_style` · `daily-community-attribution`

Variants: concise (control) vs verbose · Window: last 30 runs · Analysed: 5 runs with artifacts
min_samples: 20 per variant (default)

(not specified — bare array experiment)

Experiment : prompt_style
Workflow   : daily-community-attribution
Hypothesis : (not specified)
Window     : last 30 runs  |  Analysed: 5 runs with artifacts
min_samples: 20 per variant

+------------------+------+----------+----------------+--------------------+-----------+----------------+
| Variant          |  n   | Succ %   | Mean dur (s)   | 95% CI (s)         |  p-value  | min_samples    |
+------------------+------+----------+----------------+--------------------+-----------+----------------+
| concise (ctrl)   |   3  | 100.0%   |    604.0       | [  23.0, 1185.0]   |  (ref)    |  3/20 (15%)   |
| verbose          |   2  | 100.0%   |    360.0       | N/A (n=2; too wide)|  N/A      |  2/20 (10%)   |
+------------------+------+----------+----------------+--------------------+-----------+----------------+
Note: Success rate z-test N/A (no variance — all runs succeeded for both variants).
Duration Welch t-test: t=1.60, df≈5, p≈0.17 (not significant).
Significance: * p<0.05   ** p<0.01   *** p<0.001

Guardrails  : (none declared)

Recommendation: EXTEND
Rationale     : Neither variant has reached min_samples=20; only 5 total runs recorded.

Recommendation: EXTEND — Both variants show 100% success, but sample sizes are far too small (3 vs 2) to draw any conclusions. Duration difference (604s vs 360s) is not statistically significant (p≈0.17).
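
For reference, the duration comparison above is a Welch t-test. The sketch below computes the t statistic and the Welch-Satterthwaite degrees of freedom from per-variant summary statistics. The report does not publish per-variant standard deviations, so the numbers in the usage lines are hypothetical, as is the function name `welch_t`.

```python
import math


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Returns None when either group has fewer than 2 runs, because a
    sample standard deviation is undefined for a single observation.
    """
    if n1 < 2 or n2 < 2:
        return None
    v1 = sd1 ** 2 / n1  # squared standard error of group 1
    v2 = sd2 ** 2 / n2  # squared standard error of group 2
    if v1 + v2 == 0:
        return None  # no variance in either group: the test is undefined
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical summary statistics for groups of 3 and 2 runs
t, df = welch_t(600.0, 200.0, 3, 400.0, 100.0, 2)
print(f"t={t:.2f}, df={df:.2f}")
```

The Welch-Satterthwaite df always lies between min(n1, n2) - 1 and n1 + n2 - 2, so for groups of 3 and 2 runs it falls between 1 and 3. Turning t and df into a p-value needs the Student t distribution, which the Python standard library does not provide; a statistics package would be needed for that step.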

`output_format` · `daily-issues-report`

Variants: collapsible (control) vs inline · Window: last 30 runs · Analysed: 4 runs with artifacts
min_samples: 30 per variant · Tracking issue: #30573

H0: no change in discussion engagement score. H1: inline format produces ≥20% higher reactions+replies.

Experiment : output_format
Workflow   : daily-issues-report
Hypothesis : H1: inline format produces ≥20% higher reactions+replies
Window     : last 30 runs  |  Analysed: 4 runs with artifacts
min_samples: 30 per variant

+------------------+------+----------+----------------+--------------------+-----------+----------------+
| Variant          |  n   | Succ %   | Mean dur (s)   | 95% CI (s)         |  p-value  | min_samples    |
+------------------+------+----------+----------------+--------------------+-----------+----------------+
| collapsible(ctrl)|   3  |   0.0%   |    310.3       | [261.0 , 359.6]    |  (ref)    |  3/30 (10%)   |
| inline           |   1  |   0.0%   |    335.0       | N/A (n<2)          |   N/A     |  1/30 ( 3%)   |
+------------------+------+----------+----------------+--------------------+-----------+----------------+
Significance: * p<0.05   ** p<0.01   *** p<0.001
Note: All 4 runs failed (conclusion=failure). Statistical tests N/A due to zero variance in success rate.

Guardrails:
  empty_output_rate ==0 : UNKNOWN — all runs failed; likely violation but unconfirmed

⚠️  WARNING: 100% failure rate across all 4 runs (both variants). This workflow may have a bug.

Recommendation: EXTEND
Rationale     : Insufficient data (max n=3), and the 100% failure rate suggests an underlying issue
                that should be investigated before proceeding with the experiment.

Recommendation: EXTEND ⚠️ — All 4 runs have failed (0% success rate for both variants). This is a critical signal that the daily-issues-report workflow itself is broken, not that one variant outperforms another. Investigation recommended before collecting more experiment data.
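
The 95% CI column in the table above is undefined for a variant with a single run. A minimal sketch of a Student-t interval for the mean duration shows why. It assumes the report uses a t-based interval, which it does not state explicitly; the helper `mean_ci95` and the sample durations are hypothetical.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t (97.5th percentile) by df
T_CRIT_975 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
              6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228}


def mean_ci95(durations):
    """95% confidence interval for the mean, or None when n < 2."""
    n = len(durations)
    if n < 2:
        return None  # one run gives no estimate of spread
    df = n - 1
    if df not in T_CRIT_975:
        raise ValueError("critical value table covers df 1..10 only")
    mean = statistics.mean(durations)
    std_err = statistics.stdev(durations) / math.sqrt(n)
    half_width = T_CRIT_975[df] * std_err
    return mean - half_width, mean + half_width


print(mean_ci95([335.0]))                # None, reported as N/A (n<2)
print(mean_ci95([290.0, 310.0, 331.0]))  # hypothetical durations in seconds
```

With n=2 the critical value is 12.706, so the interval exists but is extremely wide; with n=3 it is still 4.303, which is why intervals at these sample sizes carry little information.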

`output_format` · `deep-report`

Variants: full_briefing (control) vs executive_brief vs annotated_brief · Window: last 30 runs · Analysed: 3 runs
min_samples: 15 per variant

H0: no change in discussion engagement or token cost. H1: executive_brief reduces token usage by ≥20% without reducing engagement; annotated_brief improves actionability.

Experiment : output_format
Workflow   : deep-report
Hypothesis : H1: executive_brief reduces token usage ≥20%; annotated_brief improves actionability
Window     : last 30 runs  |  Analysed: 3 runs with artifacts
min_samples: 15 per variant
Bonferroni-corrected alpha for 3 variants: 0.0167

+--------------------+------+----------+----------------+--------------------+-----------+----------------+
| Variant            |  n   | Succ %   | Mean dur (s)   | 95% CI (s)         |  p-value  | min_samples    |
+--------------------+------+----------+----------------+--------------------+-----------+----------------+
| full_briefing(ctrl)|   1  | 100.0%   |    849.0       | N/A (n<2)          |  (ref)    |  1/15 ( 7%)   |
| executive_brief    |   2  | 100.0%   |    761.5       | [-45.1, 1568.1]    |   N/A     |  2/15 (13%)   |
| annotated_brief    |   0  |   N/A    |     N/A        |     N/A            |   N/A     |  0/15 ( 0%)   |
+--------------------+------+----------+----------------+--------------------+-----------+----------------+
Note: Statistical tests N/A — control has n=1, annotated_brief has 0 runs.
Bonferroni alpha: 0.0167 (applies when all variants have sufficient data).

Guardrails:
  empty_output_rate ==0 : INSUFFICIENT DATA
  issue_creation_success_rate >=0.80 : INSUFFICIENT DATA

Recommendation: EXTEND
Rationale     : Experiment just started (3 total runs); annotated_brief has no runs recorded yet.

Recommendation: EXTEND — Only 3 runs total, annotated_brief has no data yet. All runs succeeded so far — good signal, but far from min_samples=15 per variant.
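
The Bonferroni-corrected alpha of 0.0167 quoted for this three-variant experiment is consistent with dividing the report's 0.05 significance threshold by the number of variants. A short sketch of that arithmetic, with `bonferroni_alpha` as a hypothetical helper name:

```python
def bonferroni_alpha(alpha: float, n_variants: int) -> float:
    """Per-test significance level after a Bonferroni correction."""
    if n_variants < 1:
        raise ValueError("n_variants must be at least 1")
    return alpha / n_variants


print(round(bonferroni_alpha(0.05, 3), 4))  # 0.0167
```

Dividing by the variant count (3) is slightly more conservative than dividing by the number of comparisons against the control (2), which would give 0.025. The reported figure matches the former.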

`caveman` · `smoke-copilot`

Variants: no (control) vs yes · Window: last 30 runs · Analysed: 11 runs with artifacts
min_samples: 20 per variant (default)

(not specified — bare array experiment)

Experiment : caveman
Workflow   : smoke-copilot
Hypothesis : (not specified)
Window     : last 30 runs  |  Analysed: 11 runs with artifacts
min_samples: 20 per variant

+------------------+------+----------+----------------+--------------------+-----------+----------------+
| Variant          |  n   | Succ %   | Mean dur (s)   | 95% CI (s)         |  p-value  | min_samples    |
+------------------+------+----------+----------------+--------------------+-----------+----------------+
| no (control)     |   5  |  80.0%   |   1382.2       | [ 446.2, 2318.2]   |  (ref)    |  5/20 (25%)   |
| yes              |   6  |  83.3%   |   1134.0       | [ 540.5, 1727.5]   |  0.888    | 6/20 (30%)    |
+------------------+------+----------+----------------+--------------------+-----------+----------------+
Success rate z-test: z=-0.14, p=0.888 (not significant).
Duration Welch t-test: t=0.61, df≈7, p≈0.55 (not significant).
Significance: * p<0.05   ** p<0.01   *** p<0.001

Guardrails  : (none declared)

Recommendation: EXTEND
Rationale     : Neither variant has reached min_samples=20 (no=5, yes=6); no significant difference detected.

Recommendation: EXTEND — Most-sampled experiment (11 runs total). No statistically significant difference in success rate (p=0.888) or duration (p≈0.55). The yes (caveman mode) variant is slightly faster on average (1134s vs 1382s) but with high variance and wide CIs. Need ~14–15 more runs per variant to reach min_samples.
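
The success-rate comparison above can be reproduced with a pooled two-proportion z-test. The sketch below assumes that is the test the report uses, and takes the counts 4/5 and 5/6 implied by the 80.0% and 83.3% success rates. The function name `two_proportion_z` is hypothetical.

```python
import math


def two_proportion_z(success1, n1, success2, n2):
    """Pooled two-proportion z-test; returns (z, two-tailed p) or None.

    None is returned when a group is empty or when the pooled variance
    is zero, for example when every run succeeded or every run failed.
    """
    if n1 == 0 or n2 == 0:
        return None
    p1 = success1 / n1
    p2 = success2 / n2
    pooled = (success1 + success2) / (n1 + n2)
    variance = pooled * (1 - pooled) * (1 / n1 + 1 / n2)
    if variance == 0:
        return None
    z = (p1 - p2) / math.sqrt(variance)
    # Two-tailed p-value under the standard normal: 2 * (1 - Phi(|z|))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


z, p = two_proportion_z(4, 5, 5, 6)  # no (control) vs yes
print(f"z={z:.2f}, p={p:.2f}")       # z=-0.14, p=0.89
print(two_proportion_z(3, 3, 2, 2))  # None: all runs succeeded
```

The zero-variance branch corresponds to the "z-test N/A" notes in the experiments where every run succeeded or every run failed. At sample sizes this small the normal approximation is rough, so the p-value is best read as indicative only.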

📊 Summary

| Experiment | Workflow | Control | Best variant | p-value | Guardrails | Recommendation |
|---|---|---|---|---|---|---|
| prompt_style | daily-astrostylelite-markdown-spellcheck | detailed | concise (0 runs) | N/A | ⚠️ success_rate below 0.90 | EXTEND |
| prompt_style | daily-community-attribution | concise | verbose | N/A | PASS (none) | EXTEND |
| output_format | daily-issues-report | collapsible | inline (1 run) | N/A | ⚠️ 100% failure rate | EXTEND |
| output_format | deep-report | full_briefing | executive_brief | N/A | INSUFFICIENT DATA | EXTEND |
| caveman | smoke-copilot | no | yes (83.3%) | 0.888 | PASS (none) | EXTEND |

Analysis window: last 30 runs per workflow · Significance threshold: p < 0.05 (two-tailed)
Run: 25624052383

Warning

Firewall blocked 1 domain

The following domain was blocked by the firewall during workflow execution:

proxy.golang.org

To allow these domains, add them to the network.allowed list in your workflow frontmatter:

network:
  allowed:
    - defaults
    - "proxy.golang.org"

See Network Configuration for more information.

Generated by daily-experiment-report · ● 45.7M · ◷

expires on May 13, 2026, 8:43 AM UTC

2026-05-11T09:16:34Z

github-actions[bot]
Bot May 11, 2026
Author

This discussion has been marked as outdated by daily-experiment-report.

A newer discussion is available at Discussion #31462.
