[experiments] Daily Experiment Report — 2026-05-09 #31182

2026-05-09T08:38:06Z

github-actions[bot]
Bot May 9, 2026

🧪 Daily Experiment Report — 2026-05-09

5 experiments analysed across 5 workflows. All experiments are in early-stage accumulation — none have reached statistical significance thresholds or minimum sample sizes. Recommendation for all: EXTEND.

`prompt_style` · `daily-astrostylelite-markdown-spellcheck`

Variants: concise vs detailed · Window: last 30 runs · Analysed: 4 runs with assignments
min_samples: 30 per variant

H0: no difference in fix rate. H1: the concise prompt reduces token consumption by ≥20% without degrading fix precision.

⚠️ Routing anomaly: Only the detailed variant has received any runs (4 of 4). The concise variant has 0 runs. Experiment assignment may be misconfigured.

Experiment : prompt_style
Workflow   : daily-astrostylelite-markdown-spellcheck
Hypothesis : Concise prompt reduces token consumption ≥20% without degrading fix
             precision. H0: no difference in fix rate.
Window     : last 30 runs  |  Analysed: 4 runs with assignments
min_samples: 30 per variant

+------------------+------+----------+----------------+--------------------+-----------+---------------+
| Variant          |  n   | Succ %   | Mean dur (s)   | 95% CI (s)         |  p-value  | min_samples   |
+------------------+------+----------+----------------+--------------------+-----------+---------------+
| concise (ctrl)   |   0  |   N/A    |     N/A        |        N/A         |  (ref)    |  0/30  ( 0%)  |
| detailed         |   4  |  75.0%   |    384         | [  127 ,  641 ]    |   N/A     |  4/30  (13%)  |
+------------------+------+----------+----------------+--------------------+-----------+---------------+
Significance: * p<0.05   ** p<0.01   *** p<0.001
p-value N/A: control has n=0 — no comparison possible.

Guardrails:
  empty_output_rate  <0.10  : UNKNOWN (cannot evaluate without output data)
  run_success_rate  >=0.90  : FAIL (detailed=75.0%)

Recommendation: EXTEND
Rationale     : Control variant has received 0 runs; routing anomaly must be
                investigated before analysis can proceed.

Recommendation: EXTEND — The concise variant has 0 runs; experiment routing appears broken. Investigate assignment logic.
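The two findings above (a variant with zero runs, and a failed `run_success_rate` guardrail) can be sketched as simple checks. This is a minimal illustration, not the report generator's actual logic: the function names `find_unrouted_variants` and `check_min_rate` are hypothetical, and the count of 3 successes is inferred from the reported 75.0% of 4 runs.

```python
from collections import Counter


def find_unrouted_variants(declared, assignments):
    """Return declared variants that received no runs in the window."""
    counts = Counter(assignments)  # missing keys count as 0
    return [variant for variant in declared if counts[variant] == 0]


def check_min_rate(successes, runs, threshold):
    """Evaluate a '>= threshold' guardrail; UNKNOWN when there are no runs."""
    if runs == 0:
        return "UNKNOWN"
    return "PASS" if successes / runs >= threshold else "FAIL"


# Figures from the report: 4 assigned runs, all on 'detailed', 3 of 4 succeeded.
assignments = ["detailed"] * 4
missing = find_unrouted_variants(["concise", "detailed"], assignments)
guardrail = check_min_rate(successes=3, runs=4, threshold=0.90)
print(missing, guardrail)
```

A non-empty `missing` list is the signal to inspect assignment logic before any comparison is attempted.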

`prompt_style` · `daily-community-attribution`

Variants: concise vs verbose · Window: last 30 runs · Analysed: 4 runs with assignments
min_samples: 20 per variant

Hypothesis: (not specified)

Experiment : prompt_style
Workflow   : daily-community-attribution
Hypothesis : (not specified)
Window     : last 30 runs  |  Analysed: 4 runs with assignments
min_samples: 20 per variant

+------------------+------+----------+----------------+--------------------+-----------+---------------+
| Variant          |  n   | Succ %   | Mean dur (s)   | 95% CI (s)         |  p-value  | min_samples   |
+------------------+------+----------+----------------+--------------------+-----------+---------------+
| concise (ctrl)   |   2  | 100.0%   |    673         |   N/A (n<2 durs)   |  (ref)    |  2/20  (10%)  |
| verbose          |   2  | 100.0%   |    360         |   N/A (n<2 durs)   |   N/A     |  2/20  (10%)  |
+------------------+------+----------+----------------+--------------------+-----------+---------------+
Significance: * p<0.05   ** p<0.01   *** p<0.001
p-value N/A: both variants < min_samples; test has insufficient power.

Guardrails: (none declared)

Recommendation: EXTEND
Rationale     : Both variants at 10% of min_samples (2/20); need 18 more runs
                each before drawing conclusions.

Recommendation: EXTEND — Only 10% of minimum sample size reached on both sides.
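The progress figures can be reproduced with a few lines of arithmetic. This is a sketch under one assumption: the report's percentages appear to be truncated rather than rounded (2/30 is shown as 6% elsewhere in this report), so floor division is used here. The helper name `sample_progress` is hypothetical.

```python
def sample_progress(n, min_samples):
    """Return (runs still needed, percent of min_samples reached, truncated)."""
    remaining = max(min_samples - n, 0)
    percent = 100 * n // min_samples  # floor division: assumed truncation
    return remaining, percent


# Both variants of this experiment: 2 runs against min_samples=20.
remaining, percent = sample_progress(2, 20)
print(remaining, percent)
```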

`output_format` · `daily-issues-report`

Variants: collapsible vs inline · Window: last 30 runs · Analysed: 3 runs with assignments
min_samples: 30 per variant · Tracking issue: #30573

H0: no change in discussion engagement score. H1: inline format produces ≥20% higher reactions+replies by making charts and recommendations immediately visible.

⚠️ Workflow health concern: All 30 fetched runs for daily-issues-report have conclusion: failure. The workflow appears to be broken independently of the experiment. Guardrail evaluation is limited.

Experiment : output_format
Workflow   : daily-issues-report
Hypothesis : H0: no change in discussion engagement. H1: inline format produces
             ≥20% higher reactions+replies.
Window     : last 30 runs  |  Analysed: 3 runs with assignments
min_samples: 30 per variant

+------------------+------+----------+----------------+--------------------+-----------+---------------+
| Variant          |  n   | Succ %   | Mean dur (s)   | 95% CI (s)         |  p-value  | min_samples   |
+------------------+------+----------+----------------+--------------------+-----------+---------------+
| collapsible(ctrl)|   2  |   0.0%   |    319         | [ 153  ,  484  ]   |  (ref)    |  2/30  ( 6%)  |
| inline           |   1  |   0.0%   |    335         |   N/A (n=1)        |   N/A     |  1/30  ( 3%)  |
+------------------+------+----------+----------------+--------------------+-----------+---------------+
Significance: * p<0.05   ** p<0.01   *** p<0.001
p-value N/A: both variants at 0% success — comparison is meaningless until
             workflow health is restored.

Guardrails:
  empty_output_rate ==0 : LIKELY FAIL — all 3 experiment runs ended in failure;
                          empty output is expected.

Recommendation: EXTEND
Rationale     : Workflow is broken (100% failure rate across all 30 sampled runs);
                fix the workflow before interpreting experiment data.

Recommendation: EXTEND — Workflow is broken (0% success rate on all 30 sampled runs). Fix required before experiment results are meaningful.
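A workflow-health gate of the kind implied here could run before any variant comparison. The sketch below is illustrative only: the `max_failure_rate` threshold of 0.5 and both function names are hypothetical, and the only input taken from the report is that all 30 fetched runs have conclusion `failure`.

```python
def failure_rate(conclusions):
    """Fraction of runs whose conclusion is 'failure'; None if no runs were fetched."""
    if not conclusions:
        return None
    return sum(c == "failure" for c in conclusions) / len(conclusions)


def experiment_data_usable(conclusions, max_failure_rate=0.5):
    """Hypothetical gate: treat experiment data as usable only at or below the threshold."""
    rate = failure_rate(conclusions)
    return rate is not None and rate <= max_failure_rate


# From the report: all 30 fetched runs concluded with 'failure'.
sampled = ["failure"] * 30
print(failure_rate(sampled), experiment_data_usable(sampled))
```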

`output_format` · `deep-report`

Variants: full_briefing vs executive_brief vs annotated_brief · Window: last 30 runs · Analysed: 3 runs with assignments
min_samples: 15 per variant

H0: no change in discussion engagement or token cost. H1: executive_brief reduces token usage by ≥20% without reducing engagement; annotated_brief improves actionability.

Experiment : output_format
Workflow   : deep-report
Hypothesis : H0: no change in engagement/token cost. H1: executive_brief reduces
             token usage ≥20%; annotated_brief improves actionability.
Window     : last 30 runs  |  Analysed: 3 runs with assignments
min_samples: 15 per variant
Bonferroni α: 0.0167 (3 variants, K=3)

+------------------+------+----------+----------------+--------------------+-----------+---------------+
| Variant          |  n   | Succ %   | Mean dur (s)   | 95% CI (s)         |  p-value  | min_samples   |
+------------------+------+----------+----------------+--------------------+-----------+---------------+
| full_brf (ctrl)  |   1  | 100.0%   |    849         |   N/A (n=1)        |  (ref)    |  1/15  ( 6%)  |
| executive_brief  |   2  | 100.0%   |    761         | [  N/A ,   N/A ]   |   N/A     |  2/15  (13%)  |
| annotated_brief  |   0  |   N/A    |     N/A        |        N/A         |   N/A     |  0/15  ( 0%)  |
+------------------+------+----------+----------------+--------------------+-----------+---------------+
Significance: * p<0.05   ** p<0.01   *** p<0.001
Bonferroni-corrected threshold for K=3: α=0.0167

Guardrails:
  empty_output_rate      ==0  : UNKNOWN (insufficient outcome data)
  issue_creation_success >=0.8: UNKNOWN (insufficient outcome data)

Recommendation: EXTEND
Rationale     : annotated_brief has 0 runs; all variants far below min_samples=15.

Recommendation: EXTEND — Three-variant experiment needs 15 runs per arm; annotated_brief has not run yet.
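The corrected threshold quoted above follows from dividing the base significance level by K. A minimal sketch, taking K=3 as the report does and assuming the base α is the 0.05 threshold the report uses elsewhere; the function names are hypothetical.

```python
def bonferroni_alpha(alpha, k):
    """Per-comparison significance threshold under a Bonferroni correction."""
    return alpha / k


def is_significant(p_value, alpha, k):
    """True when p_value clears the Bonferroni-corrected threshold."""
    return p_value < bonferroni_alpha(alpha, k)


corrected = bonferroni_alpha(0.05, 3)
print(f"{corrected:.4f}")  # 0.0167
```

Under this correction a p-value of 0.02, which would pass an uncorrected 0.05 threshold, does not count as significant.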

`caveman` · `smoke-copilot`

Variants: no vs yes · Window: last 30 runs · Analysed: 10 runs with assignments
min_samples: 20 per variant

Hypothesis: (not specified)

Experiment : caveman
Workflow   : smoke-copilot
Hypothesis : (not specified)
Window     : last 30 runs  |  Analysed: 10 runs with assignments
min_samples: 20 per variant

+------------------+------+----------+----------------+--------------------+-----------+---------------+
| Variant          |  n   | Succ %   | Mean dur (s)   | 95% CI (s)         |  p-value  | min_samples   |
+------------------+------+----------+----------------+--------------------+-----------+---------------+
| no (ctrl)        |   5  |  80.0%   |   1382         | [  445 , 2318  ]   |  (ref)    |  5/20  (25%)  |
| yes              |   5  |  80.0%   |   1212         | [  475 , 1950  ]   |  p=1.00   |  5/20  (25%)  |
+------------------+------+----------+----------------+--------------------+-----------+---------------+
Significance: * p<0.05   ** p<0.01   *** p<0.001
z=0.000, p=1.00 (two-tailed) — identical success rates observed so far.

Guardrails: (none declared)

Recommendation: EXTEND
Rationale     : Both variants at 25% of min_samples (5/20); more data needed
                before conclusions can be drawn.

Recommendation: EXTEND — Both variants at 25% progress toward min_samples=20. Identical success rates so far (80% each).
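The reported `z=0.000, p=1.00` is what a two-tailed two-proportion z-test gives for identical success rates. The sketch below assumes a pooled standard error, which the report does not state, and infers 4 successes per arm from 80.0% of 5 runs. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-tailed z-test for a difference in proportions (pooled standard error)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        # All runs succeeded or all failed: no variance, so no detectable difference.
        return 0.0, 1.0
    z = (p2 - p1) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# From the report: control 'no' and variant 'yes' each succeeded in 4 of 5 runs.
z, p = two_proportion_z_test(4, 5, 4, 5)
print(f"z={z:.3f}, p={p:.2f}")  # z=0.000, p=1.00
```

With only 5 runs per arm the normal approximation is rough, which is one more reason to wait for min_samples before reading anything into the p-value.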

📊 Summary

| Experiment | Workflow | Control | Best variant | p-value | Guardrails | Recommendation |
|---|---|---|---|---|---|---|
| prompt_style | daily-astrostylelite-markdown-spellcheck | concise (n=0) | detailed (75%) | N/A | run_success_rate FAIL | EXTEND ⚠️ routing anomaly |
| prompt_style | daily-community-attribution | concise (100%) | tied (100%) | N/A | N/A | EXTEND |
| output_format | daily-issues-report | collapsible (0%) | tied (0%) | N/A | empty_output_rate LIKELY FAIL | EXTEND ⚠️ workflow broken |
| output_format | deep-report | full_briefing (100%) | tied (100%) | N/A | UNKNOWN | EXTEND |
| caveman | smoke-copilot | no (80%) | tied (80%) | 1.00 | N/A | EXTEND |

Analysis window: last 30 runs per workflow · Significance threshold: p < 0.05 (two-tailed)
Run: §25596419525

Attention items:

🔴 daily-astrostylelite-markdown-spellcheck: concise variant has received 0 runs — investigate experiment routing
🔴 daily-issues-report: 100% failure rate across all sampled runs — workflow is broken; experiment data is invalid until fixed

Warning

Firewall blocked 1 domain

The following domain was blocked by the firewall during workflow execution:

proxy.golang.org

To allow this domain, add it to the `network.allowed` list in your workflow frontmatter:

network:
  allowed:
    - defaults
    - "proxy.golang.org"

See Network Configuration for more information.

Generated by daily-experiment-report · ● 50M · ◷

expires on May 12, 2026, 8:38 AM UTC

2026-05-10T08:43:45Z

github-actions[bot]
Bot May 10, 2026
Author

This discussion has been marked as outdated by daily-experiment-report.

A newer discussion is available at Discussion #31318.
