
Snapshot recording silently fails for experiments with many arms #53

Description

@jjroelofs

Bug

Experiments with many arms (e.g. 100+) never record any snapshots, causing the experiment detail report to show "No data yet. Charts appear after the experiment receives traffic." despite having thousands of impressions and conversions.

Root Cause

SnapshotStorage::shouldRecordSnapshot() uses ($total_turns % $interval) === 0 to decide whether to record. This assumes total_turns increments by 1 per request. However, ExperimentDataStorage::recordTurns() (plural) increments total_turns by the arm count per request.

Two consequences:

  1. First window is always skipped. With 208 arms, the first recordTurns call sets total_turns to 208, instantly exceeding the first_window of 19 (= 40% of floor(10000/208)). No snapshot is ever recorded in the early-traffic phase.

  2. Middle interval never aligns. The modulo check (total_turns % interval) passes only on an exact zero remainder. Because total_turns advances in steps of arm_count (e.g. 208), it only ever takes values that are multiples of arm_count, while interval grows roughly in proportion to total_turns (about total_turns / middle_budget). The check therefore passes only if total_turns happens to land exactly on a multiple of the current interval, and in practice it steps over every boundary instead. Simulation across all 82 page views (17,160 total turns) confirms zero hits.
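The stride effect described in the two points above can be illustrated with a minimal Python sketch. It is not the module's PHP code: the function name `exact_modulo_hits` and the fixed `INTERVAL = 500` are assumptions made for illustration (the real interval grows with total_turns), so the counts below show the mechanism only and do not reproduce the production simulation.

```python
def exact_modulo_hits(step: int, interval: int, requests: int) -> int:
    """Count how often (total_turns % interval) == 0 when total_turns grows by `step`."""
    total_turns = 0
    hits = 0
    for _ in range(requests):
        total_turns += step
        if total_turns % interval == 0:
            hits += 1
    return hits


ARM_COUNT = 208   # arm count from the affected experiment
REQUESTS = 82     # page views from the report
INTERVAL = 500    # hypothetical fixed interval, chosen only for illustration

# If total_turns grew by 1 per turn, every multiple of INTERVAL would be visited.
one_per_request = exact_modulo_hits(1, INTERVAL, ARM_COUNT * REQUESTS)

# Growing by ARM_COUNT per request steps over every boundary without landing on one.
arm_count_per_request = exact_modulo_hits(ARM_COUNT, INTERVAL, REQUESTS)

print(one_per_request)        # 34
print(arm_count_per_request)  # 0
```

With a step of 208 and an interval of 500, the first value that is a multiple of both is 26,000, which lies beyond the 17,056 turns covered here, so the exact-modulo check never fires.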

Impact

Any experiment whose arm count exceeds the first_window (roughly 0.4 * floor(10000 / arm_count)) will never record snapshots and will show empty charts regardless of traffic volume. In practice this affects any experiment with more than ~50 arms.
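The first_window arithmetic quoted above can be checked with a short Python sketch. The constant 10000 and the 40% factor come from this report; rounding down to an integer is an assumption inferred from the figure of 19 quoted for 208 arms, and the function names are hypothetical rather than taken from the module.

```python
SNAPSHOT_BUDGET = 10000   # constant quoted in this report
FIRST_WINDOW_PERCENT = 40  # "40% of floor(10000 / arm_count)"


def snapshots_per_arm(arm_count: int) -> int:
    """floor(10000 / arm_count), as described in the report."""
    return SNAPSHOT_BUDGET // arm_count


def first_window(arm_count: int) -> int:
    """40% of snapshots_per_arm, rounded down (rounding mode is an assumption)."""
    return snapshots_per_arm(arm_count) * FIRST_WINDOW_PERCENT // 100


def first_step_overshoots(arm_count: int) -> bool:
    """True when one recordTurns-style step of arm_count already exceeds first_window."""
    return arm_count > first_window(arm_count)


print(snapshots_per_arm(208))      # 48
print(first_window(208))           # 19
print(first_step_overshoots(208))  # True
```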

Confirmed affected on a production site: experiment ai_sorting-help_center_categories-block_1 with 208 arms, 17,160 total turns, and 128 conversions shows "No data yet".

Simulation

// With 208 arms: snapshots_per_arm=48, first_window=19
// Step size per request = 208 (each recordTurns adds arm_count)
// First call puts total_turns at 208, already past first_window of 19
// Simulation: 0 snapshot hits across all 82 page views

Fix

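shouldRecordSnapshot() needs to check whether the range [total_turns - step_size + 1, total_turns] contains an interval boundary, rather than requiring an exact modulo-zero at a single point. Here step_size is the amount by which total_turns grew in the current request (the arm count for recordTurns()). This can be expressed as: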

floor(total_turns / interval) != floor((total_turns - step_size) / interval)

The same range-crossing approach should also apply to isMilestone() and the first_window boundary check.
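A minimal Python sketch of the range-crossing check, assuming non-negative integer counters; `crosses_boundary` is a hypothetical name, and the fixed interval of 500 is again an illustration rather than the module's real, growing interval. The actual fix would live in the PHP shouldRecordSnapshot() method.

```python
def crosses_boundary(total_turns: int, step_size: int, interval: int) -> bool:
    """True if a multiple of `interval` lies in [total_turns - step_size + 1, total_turns]."""
    # Integer floor division gives the index of the interval bucket each value falls in;
    # a change in bucket index means at least one boundary was crossed by this step.
    return total_turns // interval != (total_turns - step_size) // interval


ARM_COUNT = 208   # arm count from the affected experiment
REQUESTS = 82     # page views from the report
INTERVAL = 500    # hypothetical fixed interval, chosen only for illustration

total_turns = 0
hits = 0
for _ in range(REQUESTS):
    total_turns += ARM_COUNT
    if crosses_boundary(total_turns, ARM_COUNT, INTERVAL):
        hits += 1

print(hits)  # 34
```

With step_size = 1 the check reduces to the original exact-modulo behaviour, so experiments whose total_turns grows by 1 per request would be unaffected by the change.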
