[ab-advisor] Experiment campaign for daily-semgrep-scan: A/B test output_format

### 🧪 Experiment Campaign: daily-semgrep-scan

**Workflow file**: `.github/workflows/daily-semgrep-scan.md`
**Selected dimension**: `output_format`
**Triggered by**: `ab-testing-advisor` on 2026-05-17

---

### Background

The `daily-semgrep-scan` workflow runs a daily static analysis security scan of the repository with Semgrep, focusing on SQL injection and other vulnerability patterns. It uses the Semgrep MCP server and emits findings via `create-code-scanning-alert`. The `output_format` dimension was selected because the workflow currently has a terse one-line prompt (`Scan the repository for SQL injection vulnerabilities using Semgrep.`) that leaves the output structure unspecified. How the agent structures its findings (a flat bullet list, grouped sections, or narrative prose) directly affects both the quality of the code scanning alerts it creates and how actionable the output is for developers.

### Hypothesis

- **Null hypothesis (H0)**: The format in which the agent structures vulnerability findings does not affect code scanning alert creation success rate, output completeness, or run duration.
- **Alternative hypothesis (H1)**: A `structured_sections` format (grouped by severity/rule) produces more complete and actionable code scanning alerts than a flat `bullet_list` or unstructured `prose` format, increasing alert creation rate by ≥15% over the `bullet_list` baseline.

### Experiment Configuration

Add the following `experiments:` block to the workflow frontmatter:

```yaml
experiments:
  semgrep_output_format:
    variants: [bullet_list, structured_sections, prose]
    description: "Tests whether the structure of Semgrep findings output (bullet list vs. grouped sections vs. prose) affects code scanning alert creation rate and output completeness."
    hypothesis: "H0: no change in alert creation rate across formats. H1: structured_sections yields a ≥15% higher alert creation rate than the bullet_list baseline."
    metric: alert_creation_rate
    secondary_metrics: [run_duration_ms, output_length_chars, findings_reported]
    guardrail_metrics:
      - name: run_success_rate
        direction: min
        threshold: 0.85
    min_samples: 30
    weight: [34, 33, 33]
    start_date: "2026-05-17"
    analysis_type: proportion_test
    tags: [security, output-quality, semgrep]
    issue: 0
```

**Variant descriptions**:
- `bullet_list`: Agent reports each finding as a flat bullet point with file, line, rule, and severity inline. Minimal structure, easy to scan, but may miss grouping context.
- `structured_sections`: Agent groups findings by severity (Critical → High → Medium → Low), then by rule ID, with a summary table at the top. Expected to produce more complete alerts.
- `prose`: Agent writes a narrative security report describing patterns found, with findings embedded in prose paragraphs. Highest readability, lowest structured data fidelity.

### Workflow Changes Required

The current prompt is a single line. Extend it with a conditional block based on the experiment variant.

**Before:**
```
Scan the repository for SQL injection vulnerabilities using Semgrep.
```

**After:**
```
Scan the repository for SQL injection vulnerabilities using Semgrep.

{{#if experiments.semgrep_output_format == "bullet_list" }}
Report each finding as a flat bullet point in this format:
- **[SEVERITY]** `<file>:<line>` — Rule: `<rule-id>` — <short description>

Create one code scanning alert per finding.
{{/if}}
{{#if experiments.semgrep_output_format == "structured_sections" }}
Structure your findings report with:
1. A summary table: | Severity | Count |
2. Sections grouped by severity (Critical, High, Medium, Low), then by rule ID
3. For each finding: file path, line number, rule, and recommended fix

Create one code scanning alert per finding.
{{/if}}
{{#if experiments.semgrep_output_format == "prose" }}
Write a narrative security assessment describing the vulnerability patterns found. Embed specific findings (file, line, rule) within the prose. Conclude with a prioritized remediation list.

Create one code scanning alert per finding.
{{/if}}
```

### Success Metrics

| Metric | Type | Target |
|--------|------|--------|
| alert_creation_rate | Primary | ≥15% lift for winning variant |
| findings_reported | Secondary | Count of distinct findings emitted |
| run_duration_ms | Secondary | Should not increase >20% |
| output_length_chars | Secondary | Signal for verbosity tradeoff |
| run_success_rate | Guardrail | Must stay ≥ 85% |
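
The targets in the table can be checked mechanically once per-variant totals are available. The following is a minimal sketch, not the `gh-aw` implementation: the `VariantStats` record and its field names are hypothetical, `alert_creation_rate` is assumed to mean alerts created divided by findings reported, and "≥15% lift" is assumed to be a relative lift over the baseline variant.

```python
from dataclasses import dataclass


@dataclass
class VariantStats:
    """Hypothetical per-variant totals; field names are illustrative only."""
    runs: int
    successful_runs: int
    findings_reported: int
    alerts_created: int
    mean_duration_ms: float


def alert_creation_rate(stats: VariantStats) -> float:
    # ASSUMPTION: rate = alerts created / findings reported.
    if stats.findings_reported == 0:
        return 0.0
    return stats.alerts_created / stats.findings_reported


def relative_change(baseline: float, candidate: float) -> float:
    # Relative change of candidate versus baseline, e.g. 0.15 means +15%.
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (candidate - baseline) / baseline


def evaluate(baseline: VariantStats, candidate: VariantStats) -> dict:
    """Compare a candidate variant against the baseline using the table's targets."""
    lift = relative_change(alert_creation_rate(baseline), alert_creation_rate(candidate))
    duration_change = relative_change(baseline.mean_duration_ms, candidate.mean_duration_ms)
    success_rate = candidate.successful_runs / candidate.runs
    return {
        "primary_met": lift >= 0.15,             # >=15% lift (assumed relative)
        "duration_ok": duration_change <= 0.20,  # should not increase >20%
        "guardrail_ok": success_rate >= 0.85,    # run_success_rate >= 85%
    }
```

If the 15% target is meant as an absolute difference in percentage points, the `primary_met` comparison would need to use the raw difference of the two rates instead.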

### Statistical Design

- **Variants**: `bullet_list`, `structured_sections`, `prose`
- **Assignment**: Round-robin via `gh-aw` experiments runtime (cache-based)
- **Minimum runs per variant**: 30 (total 90 runs)
- **Expected experiment duration**: ~90 days at 1 run/day (daily schedule)
- **Analysis approach**: Proportion test (two-proportion z-test on `alert_creation_rate`, comparing each variant against the `bullet_list` baseline)
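
As an illustration of the analysis approach, here is a minimal sketch of a pooled two-proportion z-test using only the Python standard library. It is an assumption about how the comparison could be run, not the analysis that `gh-aw` performs; the function name is hypothetical, and each observation is assumed to be an independent success/failure trial (for example, one alert creation attempt).

```python
import math


def two_proportion_z_test(successes_a: int, n_a: int, successes_b: int, n_b: int):
    """Pooled two-proportion z-test; returns (z, two-sided p-value).

    A is the baseline variant, B is the candidate variant.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both sample sizes must be positive")
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled proportion under H0 (no difference between variants).
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        # All trials succeeded or all failed in both groups: no evidence of a difference.
        return 0.0, 1.0
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Illustrative numbers only, not experiment results.
z, p = two_proportion_z_test(successes_a=18, n_a=30, successes_b=24, n_b=30)
```

With three variants, comparing each non-baseline variant against the baseline involves more than one test, so the significance threshold may need a multiple-comparison adjustment such as Bonferroni.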

### Implementation Steps

- [ ] Add `experiments:` section to frontmatter of `.github/workflows/daily-semgrep-scan.md`
- [ ] Add conditional blocks to workflow prompt body using `{{#if experiments.semgrep_output_format == "<variant>" }}` (value-comparison form — never use the internal `__GH_AW_EXPERIMENTS__` env-var syntax)
- [ ] Run `gh aw compile daily-semgrep-scan` to regenerate lock file
- [ ] Monitor the experiment artifact uploaded on each run (state file: `/tmp/gh-aw/experiments/state.json`)
- [ ] After ~90 runs (30 per variant), analyze the variant distribution from the workflow run artifacts
- [ ] Document findings and promote winning variant
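
The variant-distribution check in the steps above could be sketched as follows. This is a hedged example: it assumes the assigned variant name for each run has already been extracted from the run artifacts into a plain list, and it does not parse `state.json`, whose schema is not described here. The helper names are hypothetical.

```python
from collections import Counter

VARIANTS = ("bullet_list", "structured_sections", "prose")
MIN_SAMPLES = 30  # matches min_samples in the experiment configuration


def variant_distribution(assignments):
    """Count how many runs were assigned to each known variant."""
    counts = Counter(assignments)
    return {variant: counts.get(variant, 0) for variant in VARIANTS}


def variants_below_minimum(distribution, min_samples=MIN_SAMPLES):
    """Return the variants that have not yet reached the minimum sample size."""
    return [variant for variant, n in distribution.items() if n < min_samples]


# Illustrative input only: one variant name per completed run.
assignments = ["bullet_list"] * 30 + ["structured_sections"] * 30 + ["prose"] * 29
distribution = variant_distribution(assignments)
pending = variants_below_minimum(distribution)
```

Analysis would wait until `pending` is empty, that is, until every variant has at least `min_samples` runs.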

### Infrastructure Note

All three advanced experiment infrastructure fields (`analysis_type`, `tags`, `notify`) are already fully implemented in `pkg/workflow/compiler_experiments.go` and `actions/setup/js/pick_experiment.cjs`. No infrastructure improvements are needed — the experiment can be instrumented with the full schema immediately.

### References

- [A/B Testing in gh-aw](https://github.com/github/gh-aw/blob/main/.github/aw/github-agentic-workflows.md)
- Workflow file: `.github/workflows/daily-semgrep-scan.md`
- Semgrep MCP: `.github/workflows/shared/mcp/semgrep.md`

> Generated by [🧪 Daily A/B Testing Advisor](https://github.com/github/gh-aw/actions/runs/25988767872) · ● 4.4M · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Fab-testing-advisor%22&type=issues)
> - [x] expires on May 31, 2026, 10:54 AM UTC
