[ab-advisor] Experiment campaign for daily-semgrep-scan: A/B test output_format #32795

@github-actions

🧪 Experiment Campaign: daily-semgrep-scan

Workflow file: .github/workflows/daily-semgrep-scan.md
Selected dimension: output_format
Triggered by: ab-testing-advisor on 2026-05-17


Background

The daily-semgrep-scan workflow runs a daily static analysis security scan of the repository using Semgrep, focusing on SQL injection and other vulnerability patterns. It uses the Semgrep MCP server and emits findings via create-code-scanning-alert. The output_format dimension was selected for two reasons. First, the workflow currently has a terse one-line prompt (Scan the repository for SQL injection vulnerabilities using Semgrep.) that gives the agent no guidance on output structure. Second, how the agent structures its findings (as a flat bullet list, grouped sections, or narrative prose) can affect both the quality of the code scanning alerts created and how actionable the output is for developers.

Hypothesis

  • Null hypothesis (H0): The format in which the agent structures vulnerability findings does not affect code scanning alert creation success rate, output completeness, or run duration.
  • Alternative hypothesis (H1): A structured_sections format (grouped by severity/rule) produces more complete and actionable code scanning alerts than a flat bullet_list or unstructured prose format, increasing alert creation rate by ≥15%.

Experiment Configuration

Add the following experiments: block to the workflow frontmatter:

experiments:
  semgrep_output_format:
    variants: [bullet_list, structured_sections, prose]
    description: "Tests whether the structure of Semgrep findings output (bullet list vs. grouped sections vs. prose) affects code scanning alert creation rate and output completeness."
    hypothesis: "H0: no change in alert creation rate across formats. H1: structured_sections produces ≥15% more alerts successfully created vs. baseline bullet_list."
    metric: alert_creation_rate
    secondary_metrics: [run_duration_ms, output_length_chars, findings_reported]
    guardrail_metrics:
      - name: run_success_rate
        direction: min
        threshold: 0.85
    min_samples: 30
    weight: [34, 33, 33]
    start_date: "2026-05-17"
    analysis_type: proportion_test
    tags: [security, output-quality, semgrep]
    issue: 0
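
As a rough cross-check of min_samples against the weight list, the sketch below estimates how many daily runs are needed before the least-weighted variant reaches 30 samples in expectation. It assumes runs are assigned with probability proportional to weight, which is a hypothetical model and not a description of how gh-aw assigns variants (the Statistical Design section describes round-robin assignment). The function name is illustrative only.

```python
import math

# Values copied from the experiments block above.
VARIANTS = ["bullet_list", "structured_sections", "prose"]
WEIGHTS = [34, 33, 33]
MIN_SAMPLES = 30


def expected_days_to_min_samples(weights, min_samples, runs_per_day=1):
    """Days until the least-weighted variant reaches min_samples in expectation.

    Assumption: each run lands on a variant with probability
    weight / sum(weights). This is a hypothetical model, not gh-aw's code.
    """
    total = sum(weights)
    if total <= 0 or min(weights) <= 0:
        raise ValueError("weights must be positive")
    # Smallest run count with runs * (min_weight / total) >= min_samples.
    runs_needed = math.ceil(min_samples * total / min(weights))
    return math.ceil(runs_needed / runs_per_day)


if __name__ == "__main__":
    print(expected_days_to_min_samples(WEIGHTS, MIN_SAMPLES))  # prints 91
```

Under this assumption the estimate is 91 days, close to the roughly 90 days that strict round-robin assignment would give.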

Variant descriptions:

  • bullet_list: Agent reports each finding as a flat bullet point with file, line, rule, and severity inline. Minimal structure, easy to scan, but may miss grouping context.
  • structured_sections: Agent groups findings by severity (Critical → High → Medium → Low), then by rule ID, with a summary table at the top. Expected to produce more complete alerts.
  • prose: Agent writes a narrative security report describing patterns found, with findings embedded in prose paragraphs. Highest readability, lowest structured data fidelity.

Workflow Changes Required

The current prompt is a single line. Extend it with a conditional block based on the experiment variant.

Before:

Scan the repository for SQL injection vulnerabilities using Semgrep.

After:

Scan the repository for SQL injection vulnerabilities using Semgrep.

{{#if experiments.semgrep_output_format == "bullet_list" }}
Report each finding as a flat bullet point in this format:
- **[SEVERITY]** `<file>:<line>` — Rule: `<rule-id>` — <short description>

Create one code scanning alert per finding.
{{/if}}
{{#if experiments.semgrep_output_format == "structured_sections" }}
Structure your findings report with:
1. A summary table: | Severity | Count |
2. Sections grouped by severity (Critical, High, Medium, Low), then by rule ID
3. For each finding: file path, line number, rule, and recommended fix

Create one code scanning alert per finding.
{{/if}}
{{#if experiments.semgrep_output_format == "prose" }}
Write a narrative security assessment describing the vulnerability patterns found. Embed specific findings (file, line, rule) within the prose. Conclude with a prioritized remediation list.

Create one code scanning alert per finding.
{{/if}}
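
To measure output completeness for the bullet_list variant, the agent's lines could be checked against the format requested in the prompt above. The following is a minimal sketch of such a check; `FINDING_RE` and `parse_finding` are hypothetical names, and the pattern assumes the agent reproduces the template literally, including the dash separators (written here as the `\u2014` escape).

```python
import re

# Matches the bullet_list template from the prompt:
# - **[SEVERITY]** `<file>:<line>` (dash) Rule: `<rule-id>` (dash) <description>
FINDING_RE = re.compile(
    r"^- \*\*\[(?P<severity>[A-Za-z]+)\]\*\* "
    r"`(?P<file>[^`]+):(?P<line>\d+)` \u2014 "
    r"Rule: `(?P<rule>[^`]+)` \u2014 "
    r"(?P<description>.+)$"
)


def parse_finding(text):
    """Return the finding fields as a dict, or None if the line does not match."""
    match = FINDING_RE.match(text.strip())
    if match is None:
        return None
    finding = match.groupdict()
    finding["line"] = int(finding["line"])  # line numbers are numeric
    return finding


if __name__ == "__main__":
    sample = (
        "- **[HIGH]** `app/db.py:42` \u2014 Rule: `example.sqli.rule` "
        "\u2014 Query built by string concatenation"
    )
    print(parse_finding(sample))
```

The sample path and rule ID are made up for illustration. Lines that return None would count as malformed when tallying findings_reported.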

Success Metrics

| Metric | Type | Target |
| --- | --- | --- |
| alert_creation_rate | Primary | ≥15% lift for winning variant |
| findings_reported | Secondary | Count of distinct findings emitted |
| run_duration_ms | Secondary | Should not increase >20% |
| output_length_chars | Secondary | Signal for verbosity tradeoff |
| run_success_rate | Guardrail | Must stay ≥ 85% |
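
The targets above can be expressed as a small check over per-variant aggregates. This is only a sketch: the dictionary keys mirror the metric names in this issue, the input numbers are invented, and the 15% lift is interpreted as a relative lift over the baseline, which the issue does not state explicitly.

```python
def evaluate_variant(baseline, candidate):
    """Compare a candidate variant's aggregates against the stated targets.

    Assumption: "lift" and "increase" are relative to the baseline variant.
    """
    lift = (
        candidate["alert_creation_rate"] - baseline["alert_creation_rate"]
    ) / baseline["alert_creation_rate"]
    duration_increase = (
        candidate["run_duration_ms"] - baseline["run_duration_ms"]
    ) / baseline["run_duration_ms"]
    return {
        "primary_lift_ok": lift >= 0.15,           # >=15% lift
        "duration_ok": duration_increase <= 0.20,  # no more than +20%
        "guardrail_ok": candidate["run_success_rate"] >= 0.85,
    }


if __name__ == "__main__":
    # Invented numbers, for illustration only.
    baseline = {"alert_creation_rate": 0.60, "run_duration_ms": 100_000,
                "run_success_rate": 0.95}
    candidate = {"alert_creation_rate": 0.72, "run_duration_ms": 110_000,
                 "run_success_rate": 0.90}
    print(evaluate_variant(baseline, candidate))
```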

Statistical Design

  • Variants: bullet_list, structured_sections, prose
  • Assignment: Round-robin via gh-aw experiments runtime (cache-based)
  • Minimum runs per variant: 30 (total 90 runs)
  • Expected experiment duration: ~90 days at 1 run/day (daily schedule)
  • Analysis approach: Proportion test (z-test for alert_creation_rate per variant)
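
For reference, a pooled two-proportion z-test comparing one variant against the bullet_list baseline could look like the sketch below. It uses only the Python standard library. The counts are invented, and it assumes each run contributes one success-or-failure observation to alert_creation_rate, which the issue does not define precisely.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    if n_a <= 0 or n_b <= 0:
        raise ValueError("sample sizes must be positive")
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("pooled rate is 0 or 1; the test is undefined")
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Invented counts: 18/30 successes for baseline, 25/30 for the candidate.
    z, p = two_proportion_z_test(18, 30, 25, 30)
    print(f"z={z:.3f}, p={p:.4f}")
```

With three variants there are several pairwise comparisons, so a correction for multiple testing (for example Bonferroni) may be worth applying before declaring a winner. At 30 runs per variant, the normal approximation is also fairly coarse.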

Implementation Steps

  • Add experiments: section to frontmatter of .github/workflows/daily-semgrep-scan.md
  • Add conditional blocks to workflow prompt body using {{#if experiments.semgrep_output_format == "<variant>" }} (value-comparison form — never use the internal __GH_AW_EXPERIMENTS__ env-var syntax)
  • Run gh aw compile daily-semgrep-scan to regenerate lock file
  • Monitor experiment artifact uploaded per run to /tmp/gh-aw/experiments/state.json
  • After sufficient runs (~90), analyze variant distribution via workflow run artifacts
  • Document findings and promote winning variant

Infrastructure Note

All three advanced experiment infrastructure fields (analysis_type, tags, notify) are already fully implemented in pkg/workflow/compiler_experiments.go and actions/setup/js/pick_experiment.cjs. No infrastructure improvements are needed — the experiment can be instrumented with the full schema immediately.

References

  • A/B Testing in gh-aw
  • Workflow file: .github/workflows/daily-semgrep-scan.md
  • Semgrep MCP: .github/workflows/shared/mcp/semgrep.md

Generated by 🧪 Daily A/B Testing Advisor · ● 4.4M · expires on May 31, 2026, 10:54 AM UTC
