🧪 Experiment Campaign: daily-semgrep-scan
Workflow file: .github/workflows/daily-semgrep-scan.md
Selected dimension: output_format
Triggered by: ab-testing-advisor on 2026-05-17
Background
The daily-semgrep-scan workflow runs a daily static analysis security scan of the repository using Semgrep, focusing on SQL injection and other vulnerability patterns. It uses the Semgrep MCP server and emits findings via create-code-scanning-alert. The output_format dimension was selected because the workflow currently has a terse one-line prompt (Scan the repository for SQL injection vulnerabilities using Semgrep.) that leaves the structure of the findings unspecified. How the agent structures its findings (as structured JSON-like alerts, bullet-point prose, or a grouped findings report) directly affects both the quality of the code scanning alerts created and how actionable the output is for developers.
Hypothesis
- Null hypothesis (H0): The format in which the agent structures vulnerability findings does not affect code scanning alert creation success rate, output completeness, or run duration.
- Alternative hypothesis (H1): A structured_sections format (grouped by severity/rule) produces more complete and actionable code scanning alerts than a flat bullet_list or unstructured prose format, increasing alert creation rate by ≥15%.
Experiment Configuration
Add the following experiments: block to the workflow frontmatter:
experiments:
  semgrep_output_format:
    variants: [bullet_list, structured_sections, prose]
    description: "Tests whether the structure of Semgrep findings output (bullet list vs. grouped sections vs. prose) affects code scanning alert creation rate and output completeness."
    hypothesis: "H0: no change in alert creation rate across formats. H1: structured_sections produces ≥15% more alerts successfully created vs. baseline bullet_list."
    metric: alert_creation_rate
    secondary_metrics: [run_duration_ms, output_length_chars, findings_reported]
    guardrail_metrics:
      - name: run_success_rate
        direction: min
        threshold: 0.85
    min_samples: 30
    weight: [34, 33, 33]
    start_date: "2026-05-17"
    analysis_type: proportion_test
    tags: [security, output-quality, semgrep]
    issue: 0
Variant descriptions:
- bullet_list: Agent reports each finding as a flat bullet point with file, line, rule, and severity inline. Minimal structure, easy to scan, but may miss grouping context.
- structured_sections: Agent groups findings by severity (Critical → High → Medium → Low), then by rule ID, with a summary table at the top. Expected to produce more complete alerts.
- prose: Agent writes a narrative security report describing patterns found, with findings embedded in prose paragraphs. Highest readability, lowest structured data fidelity.
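Before committing the frontmatter, it can help to sanity-check that the block is internally consistent. The sketch below is a hypothetical helper, not part of gh-aw: `validate_experiment` is an invented name, and the rules it applies (one weight per variant, weights summing to 100, a rate threshold between 0 and 1) are assumptions inferred from the values in the block above rather than documented schema rules.

```python
# Hypothetical consistency check for the experiment block above.
# The rules here are assumptions, not the documented gh-aw schema.
experiment = {
    "variants": ["bullet_list", "structured_sections", "prose"],
    "weight": [34, 33, 33],
    "min_samples": 30,
    "guardrail_metrics": [
        {"name": "run_success_rate", "direction": "min", "threshold": 0.85},
    ],
}


def validate_experiment(cfg):
    """Return a list of problems; an empty list means the block looks consistent."""
    problems = []
    variants = cfg.get("variants", [])
    weights = cfg.get("weight", [])

    if len(set(variants)) != len(variants):
        problems.append("variant names must be unique")
    if len(weights) != len(variants):
        problems.append("weight needs one entry per variant")
    elif sum(weights) != 100:  # assumption: weights are percentages
        problems.append(f"weights sum to {sum(weights)}, expected 100")
    if cfg.get("min_samples", 0) <= 0:
        problems.append("min_samples must be positive")

    for guardrail in cfg.get("guardrail_metrics", []):
        threshold = guardrail.get("threshold")
        # assumption: guardrails on *_rate metrics are fractions in [0, 1]
        if threshold is None or not 0 <= threshold <= 1:
            problems.append(f"{guardrail.get('name')}: threshold must be in [0, 1]")
    return problems


print(validate_experiment(experiment))
```

An empty list means the values above agree with each other under these assumed rules; it says nothing about whether the gh-aw compiler accepts the block.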
Workflow Changes Required
The current prompt is a single line. Extend it with a conditional block based on the experiment variant.
Before:
Scan the repository for SQL injection vulnerabilities using Semgrep.
After:
Scan the repository for SQL injection vulnerabilities using Semgrep.
{{#if experiments.semgrep_output_format == "bullet_list" }}
Report each finding as a flat bullet point in this format:
- **[SEVERITY]** `<file>:<line>` — Rule: `<rule-id>` — <short description>
Create one code scanning alert per finding.
{{/if}}
{{#if experiments.semgrep_output_format == "structured_sections" }}
Structure your findings report with:
1. A summary table: | Severity | Count |
2. Sections grouped by severity (Critical, High, Medium, Low), then by rule ID
3. For each finding: file path, line number, rule, and recommended fix
Create one code scanning alert per finding.
{{/if}}
{{#if experiments.semgrep_output_format == "prose" }}
Write a narrative security assessment describing the vulnerability patterns found. Embed specific findings (file, line, rule) within the prose. Conclude with a prioritized remediation list.
Create one code scanning alert per finding.
{{/if}}
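To see what each variant's agent would actually receive, the conditional prompt can be mimicked in a few lines. This is only an illustration of the intended selection logic: it does not use the gh-aw template engine, `render_prompt` is a hypothetical name, the instruction strings are abbreviated from the After prompt above, and the assumption is that a block whose comparison fails contributes nothing.

```python
# Illustrative model of the conditional prompt; not the gh-aw template engine.
BASE_PROMPT = "Scan the repository for SQL injection vulnerabilities using Semgrep."
ALERT_LINE = "Create one code scanning alert per finding."

# First instruction line of each variant block, abbreviated from the "After" prompt.
VARIANT_BLOCKS = {
    "bullet_list": "Report each finding as a flat bullet point in this format: ...",
    "structured_sections": "Structure your findings report with: ...",
    "prose": "Write a narrative security assessment describing the vulnerability patterns found. ...",
}


def render_prompt(variant: str) -> str:
    """Return the base prompt plus the single block whose name equals `variant`."""
    lines = [BASE_PROMPT]
    for name, instructions in VARIANT_BLOCKS.items():
        if variant == name:  # mirrors {{#if experiments.semgrep_output_format == "<name>" }}
            lines.append(instructions)
            lines.append(ALERT_LINE)
    return "\n".join(lines)


for variant in VARIANT_BLOCKS:
    print(f"--- {variant} ---")
    print(render_prompt(variant))
```

Under these assumed semantics, a run whose variant matches none of the three names would get only the base prompt, without the alert instruction. It may be worth confirming how the real runtime behaves in that case.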
Success Metrics
| Metric | Type | Target |
| --- | --- | --- |
| alert_creation_rate | Primary | ≥15% lift for winning variant |
| findings_reported | Secondary | Count of distinct findings emitted |
| run_duration_ms | Secondary | Should not increase >20% |
| output_length_chars | Secondary | Signal for verbosity tradeoff |
| run_success_rate | Guardrail | Must stay ≥ 85% |
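The targets in the table can be expressed as a small check against aggregated per-variant numbers. This is a sketch under stated assumptions: the source does not say whether the 15% lift is relative or absolute, so a relative lift over the bullet_list baseline is assumed here; `check_targets` and the sample figures are hypothetical and are not results from any run.

```python
# Hypothetical evaluation of the success-metric targets; sample numbers are made up.
def relative_lift(baseline: float, variant: float) -> float:
    """Relative change of `variant` over `baseline` (assumed meaning of 'lift')."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (variant - baseline) / baseline


def check_targets(baseline: dict, variant: dict) -> dict:
    """Compare one variant's aggregates against the targets in the table."""
    return {
        # Primary: at least 15% lift in alert_creation_rate
        "primary_lift_met": relative_lift(
            baseline["alert_creation_rate"], variant["alert_creation_rate"]
        ) >= 0.15,
        # Secondary: run duration should not grow by more than 20%
        "duration_ok": relative_lift(
            baseline["run_duration_ms"], variant["run_duration_ms"]
        ) <= 0.20,
        # Guardrail: run_success_rate must stay at or above 85%
        "guardrail_ok": variant["run_success_rate"] >= 0.85,
    }


baseline = {"alert_creation_rate": 0.60, "run_duration_ms": 100_000, "run_success_rate": 0.95}
candidate = {"alert_creation_rate": 0.75, "run_duration_ms": 110_000, "run_success_rate": 0.90}
print(check_targets(baseline, candidate))
```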
Statistical Design
- Variants: bullet_list, structured_sections, prose
- Assignment: Round-robin via gh-aw experiments runtime (cache-based)
- Minimum runs per variant: 30 (total 90 runs)
- Expected experiment duration: ~90 days at 1 run/day (daily schedule)
- Analysis approach: Proportion test (z-test for alert_creation_rate per variant)
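The proportion test named above can be sketched with the standard library alone. This is a generic pooled two-proportion z-test, not the analysis code gh-aw runs for `proportion_test`. It assumes each run is scored as a binary success or failure for alert creation, which the source does not define, and the counts in the example are hypothetical.

```python
import math


def two_proportion_z_test(successes_a: int, n_a: int, successes_b: int, n_b: int):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; proportions are degenerate")
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 18/30 successful runs for bullet_list, 24/30 for structured_sections.
z, p = two_proportion_z_test(18, 30, 24, 30)
print(f"z = {z:.2f}, p = {p:.3f}")
```

With these made-up counts (60% vs. 80% at 30 runs per arm), the test gives z ≈ 1.69 and p ≈ 0.09, so even a sizeable difference may not reach the conventional 0.05 level at the minimum sample size. Comparing three variants pairwise would also call for some multiple-comparison adjustment, which this sketch leaves out.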
Implementation Steps
- Add the experiments: section to the frontmatter of .github/workflows/daily-semgrep-scan.md
- Add the variant conditionals to the prompt using {{#if experiments.semgrep_output_format == "<variant>" }} (value-comparison form; never use the internal __GH_AW_EXPERIMENTS__ env-var syntax)
- Run gh aw compile daily-semgrep-scan to regenerate the lock file
- […] /tmp/gh-aw/experiments/state.json
Infrastructure Note
All three advanced experiment infrastructure fields (analysis_type, tags, notify) are already fully implemented in pkg/workflow/compiler_experiments.go and actions/setup/js/pick_experiment.cjs. No infrastructure improvements are needed; the experiment can be instrumented with the full schema immediately.
References
- A/B Testing in gh-aw
- Workflow file: .github/workflows/daily-semgrep-scan.md
- Semgrep MCP: .github/workflows/shared/mcp/semgrep.md
Generated by 🧪 Daily A/B Testing Advisor · ● 4.4M