[ab-advisor] Experiment campaign for typist: A/B test tone_variant #34032

@github-actions

🧪 Experiment Campaign: typist

Workflow file: .github/workflows/typist.md
Selected dimension: tone_variant
Triggered by: ab-testing-advisor on 2026-05-22


Background

The Typist workflow analyzes Go codebases to identify duplicated type definitions and untyped usages (e.g., interface{}, any), then creates a comprehensive discussion with refactoring recommendations. The current prompt is highly detailed and structured, and written in a formal tone. This campaign uses the tone_variant dimension to test whether a more conversational tone improves discussion readability and developer engagement with the recommendations without sacrificing technical accuracy.

Hypothesis

H0 (null hypothesis): Changing the tone from formal/technical to conversational does not significantly affect discussion engagement metrics (views, reactions, comments) or the quality/completeness of the analysis.

H1 (alternative hypothesis): A conversational tone increases discussion engagement by 20%+ (measured by views + reactions + comments) while maintaining analysis quality, as developers find the recommendations more approachable and actionable.

Experiment Configuration

Add the following experiments: block to the workflow frontmatter (use the rich object form so all metadata is self-documenting):

experiments:
  tone_style:
    variants: [formal, conversational]
    description: "Test whether conversational tone improves discussion engagement"
    hypothesis: "H0: no change in engagement. H1: conversational tone increases discussion views+reactions+comments by 20%+ while maintaining analysis quality"
    metric: discussion_engagement_score
    secondary_metrics: [discussion_views, discussion_reactions, discussion_comments, output_length_tokens]
    guardrail_metrics:
      - name: analysis_completeness
        direction: min
        threshold: 0.9
      - name: technical_accuracy
        direction: min
        threshold: 0.95
    min_samples: 10
    weight: [50, 50]
    start_date: "2026-05-23"
    issue: "#aw_typist_tone"  # quoted: an unquoted # after whitespace starts a YAML comment
    analysis_type: mann_whitney
    tags: [ux, engagement, prompt_engineering]
    notify:
      issue: "#aw_typist_tone"  # quoted for the same reason as the top-level issue field

Variant descriptions:

  • formal: Current technical, structured tone with precise terminology and formal language (baseline)
  • conversational: Friendlier, more approachable tone using "you"/"we" language, analogies, and clearer explanations while maintaining technical precision

Workflow Changes Required

The experiment uses handlebars conditional blocks to swap prompt tone based on the selected variant. Always compare against a specific variant value using the correct syntax: {{#if experiments.tone_style == "formal"}}.
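To make the intended semantics of the value-comparison form concrete, here is a minimal Python sketch of how such blocks could resolve for a given variant. It is an illustration only, not the gh-aw renderer: the `render` function and its regex are hypothetical, and it handles only flat (non-nested) blocks.

```python
import re

# Hypothetical illustration: matches flat blocks of the form
# {{#if experiments.NAME == "VALUE"}} ... {{/if}} (no nesting support).
_IF_BLOCK = re.compile(
    r'\{\{#if experiments\.(\w+) == "([^"]*)"\}\}\n?(.*?)\{\{/if\}\}\n?',
    re.DOTALL,
)


def render(template: str, experiments: dict[str, str]) -> str:
    """Keep the body of each block whose comparison holds; drop the others."""

    def pick(match: re.Match) -> str:
        name, value, body = match.groups()
        return body if experiments.get(name) == value else ""

    return _IF_BLOCK.sub(pick, template)


template = (
    "## Mission\n"
    '{{#if experiments.tone_style == "formal"}}\n'
    "Analyze all Go source files.\n"
    "{{/if}}\n"
    '{{#if experiments.tone_style == "conversational"}}\n'
    "Let's hunt for type consistency issues!\n"
    "{{/if}}\n"
)

formal = render(template, {"tone_style": "formal"})
conversational = render(template, {"tone_style": "conversational"})
```

Because each variant has its own block and no `else` branch, a variant value that matches neither comparison would produce an empty section, which is one reason to give every variant an explicit block.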

Changes to apply:

1. Mission Section (lines 44-50)

Before:

## Mission

Analyze all Go source files in the repository to identify:
1. **Duplicated type definitions** - Same or similar types defined in multiple locations
2. **Untyped usages** - Use of `interface{}`, `any`, or untyped constants that should be strongly typed

Generate a single formatted discussion summarizing all refactoring opportunities.

After:

## Mission

{{#if experiments.tone_style == "formal"}}
Analyze all Go source files in the repository to identify:
1. **Duplicated type definitions** - Same or similar types defined in multiple locations
2. **Untyped usages** - Use of `interface{}`, `any`, or untyped constants that should be strongly typed

Generate a single formatted discussion summarizing all refactoring opportunities.
{{/if}}
{{#if experiments.tone_style == "conversational"}}
Let's hunt for type consistency issues in the Go codebase! Your mission is to find:
1. **Duplicated type definitions** - Where we've defined the same (or nearly the same) type in multiple places
2. **Untyped usages** - Places using `interface{}`, `any`, or untyped constants that should have specific types for safety

When you're done, create a discussion that explains what you found and how to fix it—think of it as a friendly code review that helps the team improve type safety.
{{/if}}

2. Phase Introductions

Add conversational framing before each phase:

After line 68 (Phase 0):

{{#if experiments.tone_style == "conversational"}}
### Phase 0: Setup and Activation

First things first—let's activate Serena and discover all the Go files we need to analyze.
{{/if}}
{{#if experiments.tone_style == "formal"}}
### Phase 0: Setup and Activation
{{/if}}

3. Discussion Template Intro (lines 165-169)

Before:

*Analysis of repository: ${{ github.repository }}*

## Executive Summary

[1-2 paragraphs summarizing:

After:

*Analysis of repository: ${{ github.repository }}*

## Executive Summary

{{#if experiments.tone_style == "formal"}}
[1-2 paragraphs summarizing:
- Total files analyzed
- Number of duplicated types found
- Number of untyped usages identified
- Overall impact and priority of recommendations]
{{/if}}
{{#if experiments.tone_style == "conversational"}}
[Write 1-2 friendly paragraphs that:
- Explain what you discovered in plain language
- Highlight the most interesting findings ("I found 8 places where we're defining the same Config type!")
- Give a quick sense of impact ("Fixing these will save us from runtime type assertion bugs")
- Set a positive, helpful tone—we're here to make the code better together]
{{/if}}

4. Recommendation Section Tone (around line 321)

Replace the existing "Refactoring Recommendations" header with the following, so the formal variant does not end up with a duplicated header:

{{#if experiments.tone_style == "conversational"}}
## 🎯 What Should We Do About This?

Here's my suggested action plan, prioritized by impact. Let's start with the biggest wins!
{{/if}}
{{#if experiments.tone_style == "formal"}}
## Refactoring Recommendations
{{/if}}

Success Metrics

| Metric | Type | Target |
|---|---|---|
| discussion_engagement_score | Primary | views + 5×reactions + 10×comments; higher is better |
| discussion_views | Secondary | Track visibility |
| discussion_reactions | Secondary | Track positive feedback (👍, ❤️, 🎉) |
| discussion_comments | Secondary | Track developer interaction |
| output_length_tokens | Secondary | Ensure conversational variant doesn't become verbose |
| analysis_completeness | Guardrail | ≥0.9; must cover all type clusters found |
| technical_accuracy | Guardrail | ≥0.95; recommendations must be technically sound |

Statistical Design

  • Variants: formal (baseline), conversational
  • Assignment: Round-robin via gh-aw experiments runtime (cache-based)
  • Minimum runs per variant: 10
  • Expected experiment duration: ~20 days (daily schedule, 2 variants × 10 runs) until the minimum sample size is reached
  • Analysis approach: Mann-Whitney U test (non-parametric, suitable for engagement scores which may not be normally distributed)
  • Significance threshold: p < 0.05
  • Minimum detectable effect: 20% increase in engagement score
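  • Power: 80% at α=0.05 (target; whether 10 runs per variant achieves it depends on the run-to-run variance of engagement scores)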
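As a rough sketch of the analysis step, the following pure-Python function computes the Mann-Whitney U statistic with average ranks for ties and a two-sided p-value from the normal approximation (with tie and continuity corrections). It is an illustration under assumptions, not the gh-aw analysis code: the function name is hypothetical, the sample values are placeholders rather than measurements, and with only ~10 runs per variant an exact test from a statistics library may be preferable to the approximation.

```python
import math


def mann_whitney_u(x: list[float], y: list[float]) -> tuple[float, float]:
    """Return (U statistic for x, two-sided p-value via normal approximation)."""
    n1, n2 = len(x), len(y)
    # Pool both samples, remembering which group (0 = x, 1 = y) each value came from.
    pooled = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)

    ranks = [0.0] * len(pooled)
    tie_term = 0.0  # sum of (t^3 - t) over groups of tied values
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # average of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t**3 - t
        i = j

    rank_sum_x = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2

    n = n1 + n2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:  # every value tied: no evidence of a difference
        return u_x, 1.0

    # Continuity-corrected z score, then two-sided tail probability.
    z = max(abs(u_x - mean_u) - 0.5, 0.0) / math.sqrt(var_u)
    p_value = math.erfc(z / math.sqrt(2))
    return u_x, min(p_value, 1.0)


# Placeholder scores for illustration only (not real engagement data).
formal_scores = [1.0, 2.0, 3.0]
conversational_scores = [4.0, 5.0, 6.0]
u_stat, p_value = mann_whitney_u(formal_scores, conversational_scores)
```

Note that this sketch is two-sided, while H1 is directional (an increase); a one-sided variant would halve the p-value when the observed difference is in the hypothesized direction.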

Implementation Steps

  • Add experiments: section to frontmatter with all metadata fields
  • Add conditional blocks to workflow prompt body using {{#if experiments.tone_style == "formal"}} (value-comparison form)
  • Run gh aw compile typist to regenerate lock file and verify handlebars syntax
  • Monitor the per-run experiment artifact (state file: /tmp/gh-aw/experiments/state.json)
  • After sufficient runs (~20 days), analyze variant distribution via workflow run artifacts
  • Manually collect discussion engagement metrics (views, reactions, comments) for each run's created discussion
  • Perform Mann-Whitney U test on engagement scores between variants
  • Document findings and promote winning variant (or declare no significant difference)

Measurement Notes

Primary metric calculation:

  • Use GitHub API to fetch discussion stats for each run's output discussion
  • discussion_engagement_score = views + (5 × total_reactions) + (10 × comment_count)
  • Weights reflect relative value: comments >> reactions >> views
  • Collect metrics 7 days after discussion creation to allow time for engagement
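The primary metric calculation above can be sketched in a few lines of Python. The `DiscussionStats` type, the helper names, and the example numbers are hypothetical; only the weights (1, 5, 10) and the 7-day collection delay come from this issue.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Weights from the formula above: views + 5 x reactions + 10 x comments.
VIEW_WEIGHT, REACTION_WEIGHT, COMMENT_WEIGHT = 1, 5, 10
COLLECTION_DELAY = timedelta(days=7)


@dataclass(frozen=True)
class DiscussionStats:
    """Hypothetical container for one discussion's collected counts."""
    views: int
    total_reactions: int
    comment_count: int


def engagement_score(stats: DiscussionStats) -> int:
    """Weighted sum used as discussion_engagement_score."""
    return (
        VIEW_WEIGHT * stats.views
        + REACTION_WEIGHT * stats.total_reactions
        + COMMENT_WEIGHT * stats.comment_count
    )


def ready_to_collect(created_at: datetime, now: datetime) -> bool:
    """True once the discussion has been open for at least 7 days."""
    return now - created_at >= COLLECTION_DELAY


# Made-up example values, not real measurements.
example = DiscussionStats(views=40, total_reactions=3, comment_count=2)
score = engagement_score(example)  # 40 + 5*3 + 10*2 = 75

created = datetime(2026, 5, 23, tzinfo=timezone.utc)
```

Gating collection on `ready_to_collect` keeps every discussion's exposure window the same, so scores from early and late runs stay comparable.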

Guardrail metric evaluation:

  • analysis_completeness: Manual review—did the discussion cover all clusters found in Phase 1?
  • technical_accuracy: Manual review—are the refactoring recommendations technically correct?
  • Sample 20% of runs per variant for guardrail evaluation (2 runs per variant at the minimum of 10)
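One way to pick the 20% review sample reproducibly is a seeded random draw per variant, sketched below. The function name, the seed, and the run identifiers are assumptions for illustration; the issue does not specify how the sample is chosen.

```python
import random


def sample_for_review(run_ids: list[int], percent: int = 20, seed: int = 0) -> list[int]:
    """Pick ceil(percent% of runs) for manual guardrail review, reproducibly."""
    if not run_ids:
        return []
    # Integer ceiling division avoids floating-point rounding surprises.
    k = max(1, -(-len(run_ids) * percent // 100))
    return sorted(random.Random(seed).sample(run_ids, k))


# Hypothetical run identifiers for one variant (10 runs, the stated minimum).
formal_runs = list(range(101, 111))
to_review = sample_for_review(formal_runs)
```

Fixing the seed means a second reviewer can regenerate the same sample, and drawing separately per variant keeps the review effort balanced between formal and conversational runs.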

Infrastructure Status

Experiment infrastructure is complete! All three advanced fields (analysis_type, tags, notify) are fully implemented in both pkg/workflow/compiler_experiments.go and actions/setup/js/pick_experiment.cjs. The experiments system supports:

  • Statistical test declarations (analysis_type)
  • Free-form tagging for dashboards (tags)
  • Automated significance notifications (notify)
  • Guardrail metrics with thresholds
  • Weighted randomization
  • Date-window gating
  • Per-run traceability
  • OTEL span integration

Generated by 🧪 Daily A/B Testing Advisor

  • expires on Jun 5, 2026, 2:47 PM UTC
