[ab-advisor] Experiment campaign for typist: A/B test tone_variant

### 🧪 Experiment Campaign: typist

**Workflow file**: `.github/workflows/typist.md`
**Selected dimension**: tone_variant
**Triggered by**: `ab-testing-advisor` on 2026-05-22

---

### Background

The Typist workflow analyzes Go codebases to identify duplicated type definitions and untyped usages (e.g., `interface{}`, `any`), then creates a discussion with refactoring recommendations. The current prompt is detailed and structured, with a formal tone. We selected **tone_variant** to test whether a more conversational tone improves discussion readability and developer engagement with the recommendations without sacrificing technical accuracy.

### Hypothesis

**H0 (null hypothesis)**: Changing the tone from formal/technical to conversational does not significantly affect discussion engagement metrics (views, reactions, comments) or the quality/completeness of the analysis.

**H1 (alternative hypothesis)**: A conversational tone increases discussion engagement by 20%+ (measured by views + reactions + comments) while maintaining analysis quality, as developers find the recommendations more approachable and actionable.

### Experiment Configuration

Add the following `experiments:` block to the workflow frontmatter (use the rich object form so all metadata is self-documenting):

```yaml
experiments:
  tone_style:
    variants: [formal, conversational]
    description: "Test whether conversational tone improves discussion engagement"
    hypothesis: "H0: no change in engagement. H1: conversational tone increases discussion views+reactions+comments by 20%+ while maintaining analysis quality"
    metric: discussion_engagement_score
    secondary_metrics: [discussion_views, discussion_reactions, discussion_comments, output_length_tokens]
    guardrail_metrics:
      - name: analysis_completeness
        direction: min
        threshold: 0.9
      - name: technical_accuracy
        direction: min
        threshold: 0.95
    min_samples: 10
    weight: [50, 50]
    start_date: "2026-05-23"
    issue: "#aw_typist_tone"  # quoted so YAML does not read '#...' as a comment
    analysis_type: mann_whitney
    tags: [ux, engagement, prompt_engineering]
    notify:
      issue: "#aw_typist_tone"
```

**Variant descriptions**:
- `formal`: Current technical, structured tone with precise terminology and formal language (baseline)
- `conversational`: Friendlier, more approachable tone using "you"/"we" language, analogies, and clearer explanations while maintaining technical precision

### Workflow Changes Required

The experiment uses handlebars conditional blocks to swap prompt tone based on the selected variant. **Always compare against a specific variant value** using the correct syntax: `{{#if experiments.tone_style == "formal"}}`.

**Changes to apply:**

#### 1. Mission Section (lines 44-50)

**Before:**
```markdown
## Mission

Analyze all Go source files in the repository to identify:
1. **Duplicated type definitions** - Same or similar types defined in multiple locations
2. **Untyped usages** - Use of `interface{}`, `any`, or untyped constants that should be strongly typed

Generate a single formatted discussion summarizing all refactoring opportunities.
```

**After:**
```handlebars
## Mission

{{#if experiments.tone_style == "formal"}}
Analyze all Go source files in the repository to identify:
1. **Duplicated type definitions** - Same or similar types defined in multiple locations
2. **Untyped usages** - Use of `interface{}`, `any`, or untyped constants that should be strongly typed

Generate a single formatted discussion summarizing all refactoring opportunities.
{{/if}}
{{#if experiments.tone_style == "conversational"}}
Let's hunt for type consistency issues in the Go codebase! Your mission is to find:
1. **Duplicated type definitions** - Where we've defined the same (or nearly the same) type in multiple places
2. **Untyped usages** - Places using `interface{}` or `any` that should have specific types for safety

When you're done, create a discussion that explains what you found and how to fix it—think of it as a friendly code review that helps the team improve type safety.
{{/if}}
```

#### 2. Phase Introductions

Add conversational framing before each phase:

**After line 68 (Phase 0):**
```handlebars
{{#if experiments.tone_style == "conversational"}}
### Phase 0: Setup and Activation

First things first—let's activate Serena and discover all the Go files we need to analyze.
{{/if}}
{{#if experiments.tone_style == "formal"}}
### Phase 0: Setup and Activation
{{/if}}
```

#### 3. Discussion Template Intro (lines 165-169)

**Before:**
```markdown
*Analysis of repository: ${{ github.repository }}*

## Executive Summary

[1-2 paragraphs summarizing:
```

**After:**
```handlebars
*Analysis of repository: ${{ github.repository }}*

## Executive Summary

{{#if experiments.tone_style == "formal"}}
[1-2 paragraphs summarizing:
- Total files analyzed
- Number of duplicated types found
- Number of untyped usages identified
- Overall impact and priority of recommendations]
{{/if}}
{{#if experiments.tone_style == "conversational"}}
[Write 1-2 friendly paragraphs that:
- Explain what you discovered in plain language
- Highlight the most interesting findings ("I found 8 places where we're defining the same Config type!")
- Give a quick sense of impact ("Fixing these will save us from runtime type assertion bugs")
- Set a positive, helpful tone—we're here to make the code better together]
{{/if}}
```

#### 4. Recommendation Section Tone (around line 321)

**Add before "Refactoring Recommendations" header:**
```handlebars
{{#if experiments.tone_style == "conversational"}}
## 🎯 What Should We Do About This?

Here's my suggested action plan, prioritized by impact. Let's start with the biggest wins!
{{/if}}
{{#if experiments.tone_style == "formal"}}
## Refactoring Recommendations
{{/if}}
```

### Success Metrics

| Metric | Type | Target |
|--------|------|--------|
| discussion_engagement_score | Primary | Sum of (views + 5×reactions + 10×comments) - higher is better |
| discussion_views | Secondary | Track visibility |
| discussion_reactions | Secondary | Track positive feedback (👍, ❤️, 🎉) |
| discussion_comments | Secondary | Track developer interaction |
| output_length_tokens | Secondary | Ensure conversational variant doesn't become verbose |
| analysis_completeness | Guardrail | ≥0.9 - Must cover all type clusters found |
| technical_accuracy | Guardrail | ≥0.95 - Recommendations must be technically sound |

### Statistical Design

- **Variants**: formal (baseline), conversational
- **Assignment**: Round-robin via `gh-aw` experiments runtime (cache-based)
- **Minimum runs per variant**: 10 (20 runs in total)
- **Expected experiment duration**: ~20 days on the daily schedule, until both variants reach the minimum sample size
- **Analysis approach**: Mann-Whitney U test (non-parametric, suitable for engagement scores which may not be normally distributed)
- **Significance threshold**: p < 0.05
- **Minimum detectable effect**: 20% increase in engagement score
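- **Power**: 80% target at α=0.05; whether 10 runs per variant reaches it depends on the variance of the engagement score, which is not yet known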
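
As a rough illustration of the analysis step, here is a minimal, dependency-free sketch of a two-sided Mann-Whitney U test. It uses the normal approximation with tie and continuity corrections, which is only approximate at 10 runs per variant; an exact method (for example via `scipy.stats.mannwhitneyu`) may be preferable for the real analysis. The function names and the sample scores are hypothetical and not part of `gh-aw`.

```python
import math
from collections import Counter


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test (normal approximation).

    Returns (u_a, p_value), where u_a counts pairs in which a value
    from `a` exceeds a value from `b` (ties count as one half).
    """
    n1, n2 = len(a), len(b)
    combined = list(a) + list(b)
    ranks = average_ranks(combined)
    u_a = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    n = n1 + n2
    # Tie correction shrinks the variance when scores repeat.
    tie_term = sum(t**3 - t for t in Counter(combined).values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:  # every score identical: no evidence of a difference
        return u_a, 1.0
    z = max((abs(u_a - n1 * n2 / 2) - 0.5) / math.sqrt(variance), 0.0)
    return u_a, math.erfc(z / math.sqrt(2))


# Hypothetical engagement scores, for illustration only.
formal = [42, 55, 38, 61, 47, 50, 44, 58, 40, 53]
conversational = [66, 72, 59, 80, 63, 75, 69, 90, 62, 71]
u, p = mann_whitney_u(formal, conversational)
print(f"U={u}, significant at 0.05: {p < 0.05}")
```
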

### Implementation Steps

- [ ] Add `experiments:` section to frontmatter with all metadata fields
- [ ] Add conditional blocks to workflow prompt body using `{{#if experiments.tone_style == "formal"}}` (value-comparison form)
- [ ] Run `gh aw compile typist` to regenerate lock file and verify handlebars syntax
- [ ] Monitor the experiment artifact uploaded per run (`/tmp/gh-aw/experiments/state.json`)
- [ ] After sufficient runs (~20 days), analyze variant distribution via workflow run artifacts
- [ ] Manually collect discussion engagement metrics (views, reactions, comments) for each run's created discussion
- [ ] Perform Mann-Whitney U test on engagement scores between variants
- [ ] Document findings and promote winning variant (or declare no significant difference)
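
The final step could be made explicit with a small decision helper. This is only a sketch: the p < 0.05 threshold and the 20% target come from this issue, but measuring lift as the relative difference in median scores, and the wording of the returned strings, are assumptions made for illustration.

```python
from statistics import median


def decide(formal_scores, conversational_scores, p_value, guardrails_ok,
           alpha=0.05, min_lift=0.20):
    """Return a short decision string for the experiment write-up."""
    if not guardrails_ok:
        return "keep formal: guardrail violated"
    if p_value >= alpha:
        return "no significant difference"
    baseline = median(formal_scores)
    if baseline <= 0:
        return "inconclusive: baseline median is not positive"
    # Assumption: lift is the relative change in median engagement score.
    lift = (median(conversational_scores) - baseline) / baseline
    if lift >= min_lift:
        return "promote conversational"
    if lift < 0:
        return "keep formal: conversational scored lower"
    return "significant, but below the 20% lift target"


# Hypothetical inputs, for illustration only.
print(decide([100, 110, 90], [130, 140, 125], p_value=0.01, guardrails_ok=True))
```
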

### Measurement Notes

**Primary metric calculation:**
- Use the GitHub API to fetch discussion stats for each run's output discussion
- `discussion_engagement_score = views + (5 × total_reactions) + (10 × comment_count)`
- Weights reflect relative value: comments >> reactions >> views
- Collect metrics 7 days after discussion creation to allow time for engagement
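
The score formula above can be written as a small helper. This is a minimal sketch: `DiscussionStats` is a hypothetical container for counts that have already been collected, and how those counts are fetched is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiscussionStats:
    """Raw counts for one discussion (hypothetical type, not part of gh-aw)."""
    views: int
    total_reactions: int
    comment_count: int


def discussion_engagement_score(stats: DiscussionStats) -> int:
    """views + 5 x reactions + 10 x comments, as defined in this issue."""
    return stats.views + 5 * stats.total_reactions + 10 * stats.comment_count


# Hypothetical counts, for illustration only.
example = DiscussionStats(views=40, total_reactions=3, comment_count=2)
print(discussion_engagement_score(example))  # 40 + 15 + 20 = 75
```
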

**Guardrail metric evaluation:**
- `analysis_completeness`: Manual review—did the discussion cover all clusters found in Phase 1?
- `technical_accuracy`: Manual review—are the refactoring recommendations technically correct?
- Sample 20% of runs per variant for guardrail evaluation
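
One way to keep the guardrail review reproducible is to pick the reviewed runs with a fixed seed and compare the reviewed scores against the thresholds from the config. The sketch below assumes that each reviewed run gets a score between 0 and 1 per guardrail and that a guardrail is judged on the mean of those scores; the issue does not specify the aggregation, so treat both as assumptions. Function names are hypothetical.

```python
import math
import random

# Minimum thresholds from the experiment config (direction: min).
GUARDRAILS = {"analysis_completeness": 0.9, "technical_accuracy": 0.95}


def sample_for_review(run_ids, fraction=0.2, seed=0):
    """Pick ceil(fraction * n) runs (at least one) for manual review."""
    run_ids = list(run_ids)
    if not run_ids:
        return []
    k = max(1, math.ceil(fraction * len(run_ids)))
    return random.Random(seed).sample(run_ids, k)


def guardrail_failures(scores):
    """Return names of guardrails whose mean reviewed score is below its minimum.

    `scores` maps each guardrail name to a non-empty list of reviewer scores.
    """
    failures = []
    for name, minimum in GUARDRAILS.items():
        values = scores[name]
        if sum(values) / len(values) < minimum:  # assumption: judge on the mean
            failures.append(name)
    return failures


# Hypothetical run IDs and review scores, for illustration only.
picked = sample_for_review(range(1, 11))
print(len(picked))  # 2 of 10 runs
print(guardrail_failures({
    "analysis_completeness": [1.0, 0.9],
    "technical_accuracy": [0.9, 0.95],
}))
```

With 10 runs per variant, a 20% sample is only 2 reviewed runs, so a single weak review can flip a guardrail.
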

### Infrastructure Status

✅ **Experiment infrastructure is complete!** All three advanced fields (`analysis_type`, `tags`, `notify`) are fully implemented in both `pkg/workflow/compiler_experiments.go` and `actions/setup/js/pick_experiment.cjs`. The experiments system supports:

- Statistical test declarations (`analysis_type`)
- Free-form tagging for dashboards (`tags`)
- Automated significance notifications (`notify`)
- Guardrail metrics with thresholds
- Weighted randomization
- Date-window gating
- Per-run traceability
- OTEL span integration

### References

- [A/B Testing in gh-aw](https://github.com/github/gh-aw/blob/main/.github/aw/github-agentic-workflows.md)
- Workflow file: `.github/workflows/typist.md`
- [Mann-Whitney U test](en.wikipedia.org/redacted)
- [Experiments compiler reference](https://github.com/github/gh-aw/blob/main/pkg/workflow/compiler_experiments.go)







> Generated by [🧪 Daily A/B Testing Advisor](https://github.com/github/gh-aw/actions/runs/26294227315) · ● 1M · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Fab-testing-advisor%22&type=issues)
> - [x] expires  on Jun 5, 2026, 2:47 PM UTC
