
feat(code-review): Register experiment controllers for code-review#107230

Closed
giovanni-guidini wants to merge 5 commits into master from feat/cw-696-register-noop-experiment-flag

Conversation


@giovanni-guidini (Contributor) commented Jan 29, 2026

Summary

Register organizations:code-review-experiments-enabled feature flag and code-review.experiments to enable A/B testing infrastructure for code review experiments.

Why a feature flag and an option?

We want targeted assignment to orgs: some orgs should get this, others shouldn't, and we want good segmentation. That is why we use the feature flag to evaluate whether a given org should be part of any experiment at all.

We also want variable PR assignment. Apparently the PR is not one of the identity fields for flags, so we'd have to implement a new context transformer, which I'm reluctant to do.

The alternative is to have an option that will hold the active experiments and their rollout rate, as explained in the code.

That's why both :)

Changes

  • Add feature flag organizations:code-review-experiments-enabled in src/sentry/features/temporary.py
    • Uses FLAGPOLE strategy for remote control via identity fields (org_id)
    • Not exposed via API (api_expose=False)
  • Add option code-review.experiments in src/sentry/options/defaults.py
    • Holds list of [experiment_name, rollout_rate] tuples
    • Evaluated in order (first match wins)
    • Modifiable via Options Automator
  • Add assignment logic in src/sentry/seer/code_review/assignment.py
    • get_code_review_experiment(): Main function to determine PR experiment assignment
    • _hash_assignment_key(): Deterministic hashing for per-PR variable rollout
    • Uses org_id:pr_id:experiment_name as hash key for consistent assignment
  • Add comprehensive test coverage in tests/sentry/seer/code_review/test_assignment.py

Refs: CW-696

🤖 Generated with Claude Code

Register organizations:code-review.noop-experiment feature flag to enable
A/B testing infrastructure for code review experiments. This flag uses
Flagpole for remote control and is not exposed via API.

Refs: CW-696

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
linear Bot commented Jan 29, 2026

@github-actions github-actions Bot added the Scope: Backend Automatically applied to PRs that change backend components label Jan 29, 2026
We need to run A/B tests where different PRs from the same org can get
different experiments, instead of all PRs from an org being in the same
experiment group.

Use hybrid Flagpole + Options approach:
- Flagpole controls org eligibility (segment support for priority orgs,
  early adopters, gradual rollout)
- Options control which experiments are active and their rollout rates

Experiments are evaluated in order (first match wins), allowing multiple
concurrent experiments with independent rollout percentages.

Rename flag: organizations:code-review.noop-experiment
         → organizations:code-review-experiments-enabled

Add option: bug-prediction.experiments (list of [name, rollout_rate])

Example config: [["noop-experiment", 0.5], ["cost-optimized", 0.3]]
Result: 50% of PRs get noop, 30% get cost-optimized, 20% get baseline

Refs: CW-696

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Comment thread on src/sentry/seer/code_review/assignment.py (outdated)
Mypy requires all test functions to have return type annotations.

Refs: CW-696
@giovanni-guidini force-pushed the feat/cw-696-register-noop-experiment-flag branch from ad6a58d to 25936f3 on January 30, 2026 11:05
@giovanni-guidini changed the title from "feat(code-review): Register noop-experiment flag for A/B testing" to "feat(code-review): Register experiment controllers for code-review" on Jan 30, 2026
@giovanni-guidini marked this pull request as ready for review January 30, 2026 12:12
@giovanni-guidini requested a review from a team as a code owner January 30, 2026 12:12
Comment on lines +52 to +90
    Examples:
        >>> # Option: [["noop", 1.0], ["cost", 0.5]]
        >>> # All PRs get "noop" (100% rollout takes priority)
        >>>
        >>> # Option: [["noop", 0.0], ["cost", 0.5]]
        >>> # noop disabled, 50% of PRs get "cost", 50% get "baseline"
        >>>
        >>> # Option: []
        >>> # No experiments configured → all PRs get "baseline"
    """
    # Check if org is eligible for experiments via Flagpole
    if not features.has(
        "organizations:code-review-experiments-enabled",
        organization,
        actor=user,
    ):
        return "baseline"

    # Get experiment configurations from Options
    experiments: list[tuple[str, float]] = options.get("code-review.experiments")

    # Evaluate experiments in order - first match wins
    for experiment_name, rollout_rate in experiments:
        # Skip disabled experiments
        if rollout_rate <= 0.0:
            continue

        # Fully released experiment (100%)
        if rollout_rate >= 1.0:
            return experiment_name

        # Per-PR variable assignment using hash of org + PR + experiment
        # Including experiment_name ensures independent rollout per experiment
        assignment_key = f"{organization.id}:{pr_id}:{experiment_name}"
        if _hash_assignment_key(assignment_key, rollout_rate):
            return experiment_name

    # No experiment matched
    return "baseline"
Member

I like the structure of the option, an array of tuples describing each treatment 👍

I'm worried, though, that we're including the experiment name in the hash: as we iterate through the list we're computing a percent-of-the-leftovers, which makes it really hard to tell what the population for a given experiment is going to be, and thus how confident we are in the results. Changes to other experiments would also affect the populations of everything downstream, which isn't easy to reason about.

I think what we have here is one experiment (the code-review strategy experiment) with a variable list of treatments (which strategy to use).

To fix this, I think it would be better to hash each PR once, not including the experiment_name.

We can do some math to compare.
If we include experiment_name in the hash (as the PR does now), then given the options [a, 0.5], [b, 0.5], [c, 0.5] a PR would have a 50% chance of being in group A, 25% in B, and only 12.5% in C. Wildly different populations even though they all say 0.5.

But if we hash once and use the same options config, the chances are: 50% chance of being in treatment A, 50% chance of being in C, 0% for C. Looking closely, though, the %'s add up to 1.5, so if we enforce that things should add to 1.0 then we'd actually write the options as [a, 0.333], [b, 0.333], [c, 0.333] and every treatment gets an equal-sized population, with consistent sizes as new treatments are added and removed.

We would not want to be rebalancing the %'s each time someone adds a new experiment, changing from 0.5, 0.5, 0.5 to 0.33, 0.33, 0.33; that's not the way. We should require that the sum of all treatments always adds up to 1.0. What people would need to do is borrow some sample size from the baseline/noop treatment for their experiment, giving it back when the experiment is done. The PRs would look like this:

# start new experiment
[
-   ['noop', 1.0],
+   ['noop', 0.9],
+   ['cost', 0.1],
]

#end old experiment
[
-   ['noop', 0.7],
+   ['noop', 0.8],
-   ['cost', 0.1],
    ['perf', 0.2],
]
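The sum-to-1.0 invariant could be enforced mechanically rather than by review; a sketch (the function name and tolerance are made up for illustration):

```python
def validate_treatments(treatments: list[tuple[str, float]]) -> None:
    """Raise if treatment rollout rates don't cover exactly the whole population."""
    total = sum(rate for _, rate in treatments)
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"treatment rates must sum to 1.0, got {total}")
```

A config like [("noop", 0.9), ("cost", 0.1)] passes, while [("noop", 1.0), ("cost", 0.1)] is rejected before it can skew populations.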

As for the Flagpole check at the top, it should work really well too.
We'll have, say, on the order of 1,000 orgs with Seer. Of that group maybe we turn on the experiment for those with Early Adopter enabled or something like that, so the population of orgs is on the order of 100s. But they'll produce like 10-100x as many PRs, so we'll have on the order of 100,000 PRs that are eligible. Of those some % will get the noop treatment, but even with sample sizes of 0.02 and above we could get 20,000 PRs per month into a treatment group.

Contributor Author

@giovanni-guidini commented Feb 2, 2026


UPDATE

I missed the forest for the trees a little there, but I agree with the point of ensuring things add up to 1 and the rebalancing of probabilities. And also that we should make just 1 hash and compare properly. Those are good changes for sure.

I don't love that we'd just leave it to the devs to organize this, though. Ideally the code can sort itself out. I kinda thought of the CSS flex-grow property: if all treatments have a "grow" of 1 they all get the same probability; otherwise, if one treatment is, say, 10 and another is 1, then the probabilities are 10/11 and 1/11.


(I had written this before the update)

I think what we have here is one experiment (the code-review strategy experiment) with a variable list of treatments (which strategy to use).

That's a fair assumption indeed.

But if we hash once and use the same options config the chances are: 50% chance of being in treatment A, 50% chance of being in C, 0% for C.

First, I assume you mean 50% A, 50% B, 0% C. But that is not correct. If we hash once and all treatments have the same rollout rate, we compare the same hash against the same rollout threshold every time. So it's actually either A or baseline, depending on the hash value.

But if we hash the experiment name in there, it might go to B or C.

Member


Oh, I did mean 50% A, 50% B, 0% C. Typo!

So one thing that's constant in all this is that one request can only be part of one group at a time. We'll skip over caching that and maintaining the same group through the life of the PR; we probably don't need that, and it's another topic. Let's drill back down into it.

If we treat the system as multiple A/B experiments, the outcome of the first experiment check will impact the population size of the second, because the user can only be in one thing at a time: if they skip the first experiment they become eligible for the second.
To implement this we might do something like enabled() if hash < rollout_percent else disabled(). For three treatments it's a little different though.

So for a multi-treatment experiment the total population is split across all the treatments available. I find it helpful to think of all the hash values as falling on a number line from 0.0 to 1.0, where the rollouts represent ranges on that line.

Here we have 4 treatments, each representing 25% of the total line.
When we hash keys they fall somewhere on the line, and we want to return the treatment that corresponds to that segment.

0.0                                            1.0
+-----------+-----------+------------+-----------+
|    base   |     T1    |     T2     |    T3     |
+-----------+-----------+------------+-----------+
    ^                          ^
    hash(key1)                 hash(key2)

We don't run enabled() if hash < rollout_percent else disabled() in a loop; that would be more like a casino game where people keep rolling until they hit something.
Instead of doing if hash < rollout_percent we'd want something more like if treatment.start <= hash < treatment.end, where start is the sum of the previous rollouts and end is that sum plus the rollout size of this treatment.
In the past I've seen rollouts defined in start/end terms like:

# Array<tuple<name, start, end>>
# start is inclusive, end is exclusive
Option: [
  ["noop", 0.0, 0.25],
  ["cost", 0.25, 0.50],
  ["cost", 0.50, 0.75],
  ["cost", 0.75, 1.0],
]

Contributor Author


Instead of doing if hash < rollout_percent we'd want to do more like if treatment.start < hash < treatment.end: where start is the sum of the rollouts previous, and end is the sum+the rollout size of this treatment.

Isn't that exactly what's implemented in the current version?
Except that the range is calculated by the code, so we don't have to maintain it manually; we just assign relative weights to the treatments.

cumulative = 0.0 # sum of previous treatments
for experiment_name, weight in active_experiments:
    # Calculate cumulative threshold as percentage (0-100)
    cumulative += (weight / total_weight) * 100 # end range of current

    if bucket < cumulative: # is the hash in this experiments range?
        return experiment_name

…nment

This changes the experiment assignment system from independent per-experiment
hashing to a single-hash cumulative weight-based approach (similar to CSS
flex-grow).

## Why This Change

The previous approach had several problems:

1. **Unpredictable population sizes**: With independent hashing per experiment,
   the rollout rates were "percent-of-leftovers" which made it difficult to
   predict actual population sizes. For example, with [["a", 0.5], ["b", 0.5]],
   experiment A would get 50% but B would only get 25% (50% of the remaining
   50%).

2. **Difficult to reason about**: Changes to one experiment would affect the
   population sizes of all downstream experiments in unpredictable ways.

3. **Inconsistent populations**: Three experiments all configured at 0.5 would
   result in wildly different populations (50%, 25%, 12.5%) despite having
   the same configuration value.

## Benefits of Weight-Based Approach

1. **Predictable populations**: Each experiment gets exactly its proportional
   share of traffic. Weights [["a", 1], ["b", 1]] = 50/50 split.

2. **Intuitive configuration**: Like CSS flex-grow, weights are relative:
   - [["a", 10], ["b", 1]] = 90.9% A, 9.1% B
   - [["a", 3], ["b", 2], ["c", 1]] = 50% A, 33% B, 17% C

3. **Stable assignments**: Adding or removing experiments doesn't affect
   the population sizes of other experiments since each PR is hashed once
   to a consistent bucket.

4. **Easy to analyze**: Population sizes are deterministic and can be
   calculated upfront, making it easier to determine statistical confidence.

5. **Self-balancing**: No need to manually calculate percentages that sum
   to 1.0 - the system automatically normalizes weights to 100%.

## Implementation Details

- Each PR is hashed once to a bucket (0-99) using org_id:pr_id
- Experiments are assigned based on cumulative weight ranges
- Weight 0 disables an experiment
- All positive weights are normalized to create proportional ranges
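The weight math quoted in the benefits above can be checked with a few lines (the helper is hypothetical, not part of the PR):

```python
def normalized_shares(weights: list[tuple[str, float]]) -> dict[str, float]:
    """Convert relative weights into percentage shares, dropping weight-0 entries."""
    active = [(name, weight) for name, weight in weights if weight > 0]
    total = sum(weight for _, weight in active)
    return {name: round(100 * weight / total, 1) for name, weight in active}


print(normalized_shares([("a", 10), ("b", 1)]))           # {'a': 90.9, 'b': 9.1}
print(normalized_shares([("a", 3), ("b", 2), ("c", 1)]))  # {'a': 50.0, 'b': 33.3, 'c': 16.7}
```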

Refs: CW-696
Change the experiment assignment to return None instead of the string
"baseline" when a PR is not assigned to any experiment. This aligns
with Seer's API expectations which use None to indicate control group.

Changes:
- Update return type: str -> str | None
- Replace all "baseline" returns with None
- Update all tests to check for None instead of "baseline"
- Update docstrings and comments to use "control group" terminology

Refs: CW-696
@giovanni-guidini
Contributor Author

Updated "baseline" → None to align with feedback received in this PR: https://github.com/getsentry/seer/pull/4658

Comment on lines +70 to +73
experiments: list[tuple[str, float]] = options.get("code-review.experiments")

# Filter out disabled experiments (weight 0) and calculate total weight
active_experiments = [(name, weight) for name, weight in experiments if weight > 0]
Contributor


Bug: The code does not validate the structure of the code-review.experiments option, causing a runtime crash if the data is malformed via the Options Automator.
Severity: CRITICAL

Suggested Fix

Add defensive validation before processing the experiments list. Iterate through the items, check if each is a list/tuple of length 2, and verify that the weight is a numeric type before appending it to active_experiments. This will filter out malformed entries and prevent runtime crashes.
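A sketch of that defensive parse (this follows the reviewer-bot's suggestion; it is not code that landed in the PR):

```python
from numbers import Real


def parse_experiments(raw: object) -> list[tuple[str, float]]:
    """Defensively parse the option value, silently skipping malformed entries."""
    if not isinstance(raw, (list, tuple)):
        return []
    parsed: list[tuple[str, float]] = []
    for entry in raw:
        if (
            isinstance(entry, (list, tuple))
            and len(entry) == 2
            and isinstance(entry[0], str)
            and isinstance(entry[1], Real)
            and not isinstance(entry[1], bool)  # bool is a Real subtype
        ):
            parsed.append((entry[0], float(entry[1])))
    return parsed
```

With this, malformed values like [["exp1"]] or [["exp1", "not-a-number"]] simply yield an empty list instead of crashing the review pipeline.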

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid.

Location: src/sentry/seer/code_review/assignment.py#L70-L73

Potential issue: The code processes the `code-review.experiments` option, which is
expected to be a list of `[name, weight]` pairs. However, there is no validation to
ensure the structure of the data. The Options Automator, which can modify this value at
runtime, only validates that the option is a list, not that its elements are 2-item
lists with a numeric weight. If a malformed value like `[["exp1"]]` or `[["exp1",
"not-a-number"]]` is set, the list comprehension will fail with a `ValueError` or
`TypeError` respectively. This will cause a runtime crash for any code review that
triggers this logic, effectively breaking the feature for all users.

Did we get this right? 👍 / 👎 to inform future reviews.

@giovanni-guidini
Copy link
Copy Markdown
Contributor Author

I'm going to close this for now because we'll slightly change the architecture here.

giovanni-guidini added a commit that referenced this pull request Feb 3, 2026
Ports only the feature flag from PR #107230 to control org-level eligibility for code review experiments.

Per the updated architecture decision in CW-696, the option registration and hash bucket assignment logic will be moved directly to Seer. This allows Seer to test sentry-options integration directly while keeping the monolith implementation minimal.

Refs CW-696
@github-actions github-actions Bot locked and limited conversation to collaborators Feb 18, 2026
