feat(code-review): Register experiment controllers for code-review #107230
giovanni-guidini wants to merge 5 commits into master from
Conversation
Register `organizations:code-review.noop-experiment` feature flag to enable A/B testing infrastructure for code review experiments. This flag uses Flagpole for remote control and is not exposed via API.
Refs: CW-696
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
We need to run A/B tests where different PRs from the same org can get
different experiments, instead of all PRs from an org being in the same
experiment group.
Use hybrid Flagpole + Options approach:
- Flagpole controls org eligibility (segment support for priority orgs,
early adopters, gradual rollout)
- Options control which experiments are active and their rollout rates
Experiments are evaluated in order (first match wins), allowing multiple
concurrent experiments with independent rollout percentages.
Rename flag: organizations:code-review.noop-experiment
→ organizations:code-review-experiments-enabled
Add option: bug-prediction.experiments (list of [name, rollout_rate])
Example config: [["noop-experiment", 0.5], ["cost-optimized", 0.3]]
Result: 50% of PRs get noop, 30% get cost-optimized, 20% get baseline
Refs: CW-696
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
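The first-match-wins scheme described above can be sketched roughly as follows (the function name, hashing scheme, and bucket granularity are illustrative assumptions, not the actual implementation):

```python
import hashlib


def assign_experiment(
    org_id: int, pr_id: int, experiments: list[tuple[str, float]]
) -> str:
    """First-match-wins assignment; each experiment hashes independently."""
    for name, rollout_rate in experiments:
        if rollout_rate <= 0.0:
            continue  # disabled experiment
        if rollout_rate >= 1.0:
            return name  # fully released
        # The hash includes the experiment name, so each rollout is independent
        key = f"{org_id}:{pr_id}:{name}"
        digest = hashlib.sha256(key.encode()).hexdigest()
        bucket = int(digest[:8], 16) / 0x1_0000_0000  # stable value in [0, 1)
        if bucket < rollout_rate:
            return name
    return "baseline"
```

For a given org/PR pair the result is deterministic, so repeated calls with the same config always land in the same group.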
eb2a418 to ad6a58d
Mypy requires all test functions to have return type annotations. Refs: CW-696
ad6a58d to 25936f3
```python
    Examples:
        >>> # Option: [["noop", 1.0], ["cost", 0.5]]
        >>> # All PRs get "noop" (100% rollout takes priority)
        >>>
        >>> # Option: [["noop", 0.0], ["cost", 0.5]]
        >>> # noop disabled, 50% of PRs get "cost", 50% get "baseline"
        >>>
        >>> # Option: []
        >>> # No experiments configured → all PRs get "baseline"
    """
    # Check if org is eligible for experiments via Flagpole
    if not features.has(
        "organizations:code-review-experiments-enabled",
        organization,
        actor=user,
    ):
        return "baseline"

    # Get experiment configurations from Options
    experiments: list[tuple[str, float]] = options.get("code-review.experiments")

    # Evaluate experiments in order - first match wins
    for experiment_name, rollout_rate in experiments:
        # Skip disabled experiments
        if rollout_rate <= 0.0:
            continue

        # Fully released experiment (100%)
        if rollout_rate >= 1.0:
            return experiment_name

        # Per-PR variable assignment using hash of org + PR + experiment
        # Including experiment_name ensures independent rollout per experiment
        assignment_key = f"{organization.id}:{pr_id}:{experiment_name}"
        if _hash_assignment_key(assignment_key, rollout_rate):
            return experiment_name

    # No experiment matched
    return "baseline"
```
I like the structure of the option, an array of tuples describing each treatment 👍
I'm worried though that we're including the experiment name in the hash, so as we iterate through the list we're computing a percent-of-the-leftovers which will make it really hard to tell what a given population is going to be for an experiment, and thus how confident we are in the results. Also changes to other experiments would affect populations of things downstream which isn't easy to reason about.
I think what we have here is one experiment (the code-review strategy experiment) with a variable list of treatments (which strategy to use).
To fix i think it would be better to hash each PR once, not including the experiment_name.
We can do some math to compare.
If we include experiment_name in the hash (like the PR has now), given the options [a, 0.5], [b, 0.5], [c, 0.5], a PR would have a 50% chance of being in group A, a 25% chance of being in B, and only a 12.5% chance of being in C. Wildly different populations even though they all say 0.5.
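The compounding effect is easy to verify numerically; each rate only applies to whatever the earlier experiments left over:

```python
# Configured rollout rates for experiments A, B, C under
# independent per-experiment hashing (first match wins).
rates = [0.5, 0.5, 0.5]

populations = []
remaining = 1.0  # fraction of PRs not yet captured by an earlier experiment
for rate in rates:
    populations.append(remaining * rate)
    remaining *= 1 - rate

print(populations)  # [0.5, 0.25, 0.125] -> 50%, 25%, 12.5%
```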
But if we hash once and use the same options config the chances are: 50% chance of being in treatment A, 50% chance of being in C, 0% for C. But looking closely the %'s add up to 1.5, so if we enforce that things should add to 1.0 then we'd actually write the options to be [a, 0.333], [b, 0.333], [c, 0.333] and every treatment will get equal sized populations, and consistent sizes as new treatments are added and removed.
We wouldn't want to rebalance the %'s each time someone adds a new experiment, changing from 0.5, 0.5, 0.5 to 0.33, 0.33, 0.33; that's not the way. We should require that the sum of all treatments always adds up to 1.0. What people would need to do is borrow some sample size from the baseline/noop treatment and use it in their experiment, giving it back when the experiment is done. The PRs would look like this:
# start new experiment

```diff
 [
-  ['noop', 1.0],
+  ['noop', 0.9],
+  ['cost', 0.1],
 ]
```

# end old experiment

```diff
 [
-  ['noop', 0.7],
+  ['noop', 0.8],
-  ['cost', 0.1],
   ['perf', 0.2],
 ]
```

As for the flagpole check at the top, it should work really well too.
We'll have, say, on the order of 1,000 orgs with Seer. Of that group maybe we turn on the experiment for those with Early Adopter enabled or something like that, so the population of orgs is on the order of 100s. But they'll produce 10-100x PRs, so we'll have on the order of 100,000 PRs that are eligible. Of those some % will get the noop treatment, but even with sample sizes of 0.02 and above we could get 20,000 PRs per month into a treatment group.
UPDATE
I missed the forest for the trees a little there, but I agree with the point of ensuring things add up to 1 and the rebalancing of probabilities. And also that we should make just 1 hash and compare properly. Those are good changes for sure.
I don't love that we'll just leave for the devs to organize this tho. Ideally the code can sort itself out. I kinda thought of the CSS flex-grow property? If all treatments have a "grow" of 1 they all get the same probability, otherwise if one of the treatments is say 10 and another is 1 then the probabilities are 10/11 and 1/11.
(I had written this before the update)
I think what we have here is one experiment (the code-review strategy experiment) with a variable list of treatments (which strategy to use).
That's a fair assumption indeed.
But if we hash once and use the same options config the chances are: 50% chance of being in treatment A, 50% chance of being in C, 0% for C.
First, I assume you mean 50% A, 50% B, 0% C. But that is not correct. If we hash once and all treatments have the same rollout rate, we compare the same hash against the same rollout. So it's actually either A or baseline, depending on the hash value.
But if we hash the experiment name in there it might go to B or C
Oh, I did mean 50% A, 50% B, 0% C. Typo!
So one thing that's constant in all this is that one request can only be part of one group at a time. We'll skip over caching that and maintaining the same group through the life of the PR; we probably don't need that, but it's another topic. Let's drill back down into it.
If we treat the system as if it were multiple A/B experiments, then the outcome of the first experiment check will impact the population size of the second, because the user can only be in one thing at a time: if they skip the first experiment they become eligible for the second.
To implement this we might do something like `enabled() if hash < rollout_percent else disabled()`. For three treatments it's a little different though.
So for a multi treatment experiment the total population is split across all the treatments available. I find it's helpful to think of all the hash values as falling on a number line, from 0.0 to 1.0, and the rollouts actually represent ranges on that line.
Here we have 4 treatments, each representing 25% of the total line.
When we hash keys they fall somewhere on the line, and we want to return the treatment that corresponds to that segment.
```
0.0                                              1.0
 +-----------+-----------+------------+-----------+
 |   base    |    T1     |     T2     |    T3     |
 +-----------+-----------+------------+-----------+
       ^                        ^
   hash(key1)               hash(key2)
```
We don't run `enabled() if hash < rollout_percent else disabled()` in a loop; that would be more like a casino game where people keep rolling until they hit something.
Instead of doing `if hash < rollout_percent` we'd want something more like `if treatment.start <= hash < treatment.end:`, where start is the sum of the previous rollouts and end is that sum plus the rollout size of this treatment.
In the past I've seen rollouts defined in start/end terms like:

```python
# Array<tuple<name, start, end>>
# start is inclusive, end is exclusive
Option: [
    ["noop", 0.0, 0.25],
    ["cost", 0.25, 0.50],
    ["cost", 0.50, 0.75],
    ["cost", 0.75, 1.0],
]
```
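A start/end scheme like that could be looked up with a simple range scan (a sketch; the function name and the distinct treatment names are illustrative):

```python
def pick_treatment(
    bucket: float, treatments: list[tuple[str, float, float]]
) -> str:
    """Return the treatment whose [start, end) range contains the bucket."""
    for name, start, end in treatments:
        if start <= bucket < end:
            return name
    return "baseline"  # no range matched


# Hypothetical config with distinct treatment names for clarity
option = [
    ("noop", 0.0, 0.25),
    ("cost", 0.25, 0.50),
    ("perf", 0.50, 0.75),
    ("ux", 0.75, 1.0),
]
```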
Instead of doing if hash < rollout_percent we'd want to do more like
if treatment.start < hash < treatment.end: where start is the sum of the rollouts previous, and end is the sum+the rollout size of this treatment.
Isn't that exactly what's implemented in the current version?
Except that the range is calculated by the code so we don't have to manually maintain it, just assign relative weights to the treatments.
```python
cumulative = 0.0  # sum of previous treatments
for experiment_name, weight in active_experiments:
    # Calculate cumulative threshold as percentage (0-100)
    cumulative += (weight / total_weight) * 100  # end range of current
    if bucket < cumulative:  # is the hash in this experiment's range?
        return experiment_name
```

…nment

This changes the experiment assignment system from independent per-experiment hashing to a single-hash cumulative weight-based approach (similar to CSS flex-grow).

## Why This Change

The previous approach had several problems:

1. **Unpredictable population sizes**: With independent hashing per experiment, the rollout rates were "percent-of-leftovers", which made it difficult to predict actual population sizes. For example, with [["a", 0.5], ["b", 0.5]], experiment A would get 50% but B would only get 25% (50% of the remaining 50%).
2. **Difficult to reason about**: Changes to one experiment would affect the population sizes of all downstream experiments in unpredictable ways.
3. **Inconsistent populations**: Three experiments all configured at 0.5 would result in wildly different populations (50%, 25%, 12.5%) despite having the same configuration value.

## Benefits of Weight-Based Approach

1. **Predictable populations**: Each experiment gets exactly its proportional share of traffic. Weights [["a", 1], ["b", 1]] = 50/50 split.
2. **Intuitive configuration**: Like CSS flex-grow, weights are relative:
   - [["a", 10], ["b", 1]] = 90.9% A, 9.1% B
   - [["a", 3], ["b", 2], ["c", 1]] = 50% A, 33% B, 17% C
3. **Stable assignments**: Adding or removing experiments doesn't affect the population sizes of other experiments, since each PR is hashed once to a consistent bucket.
4. **Easy to analyze**: Population sizes are deterministic and can be calculated upfront, making it easier to determine statistical confidence.
5. **Self-balancing**: No need to manually calculate percentages that sum to 1.0 - the system automatically normalizes weights to 100%.

## Implementation Details

- Each PR is hashed once to a bucket (0-99) using org_id:pr_id
- Experiments are assigned based on cumulative weight ranges
- Weight 0 disables an experiment
- All positive weights are normalized to create proportional ranges

Refs: CW-696
Change the experiment assignment to return None instead of the string "baseline" when a PR is not assigned to any experiment. This aligns with Seer's API expectations, which use None to indicate the control group.

Changes:
- Update return type: str -> str | None
- Replace all "baseline" returns with None
- Update all tests to check for None instead of "baseline"
- Update docstrings and comments to use "control group" terminology

Refs: CW-696
Updated

```python
experiments: list[tuple[str, float]] = options.get("code-review.experiments")

# Filter out disabled experiments (weight 0) and calculate total weight
active_experiments = [(name, weight) for name, weight in experiments if weight > 0]
```
Bug: The code does not validate the structure of the code-review.experiments option, causing a runtime crash if the data is malformed via the Options Automator.
Severity: CRITICAL
Suggested Fix
Add defensive validation before processing the experiments list. Iterate through the items, check if each is a list/tuple of length 2, and verify that the weight is a numeric type before appending it to active_experiments. This will filter out malformed entries and prevent runtime crashes.
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.
Location: src/sentry/seer/code_review/assignment.py#L70-L73
Potential issue: The code processes the `code-review.experiments` option, which is expected to be a list of `[name, weight]` pairs. However, there is no validation to ensure the structure of the data. The Options Automator, which can modify this value at runtime, only validates that the option is a list, not that its elements are 2-item lists with a numeric weight. If a malformed value like `[["exp1"]]` or `[["exp1", "not-a-number"]]` is set, the list comprehension will fail with a `ValueError` or `TypeError` respectively. This will cause a runtime crash for any code review that triggers this logic, effectively breaking the feature for all users.
I'm going to close this for now because we'll slightly change the architecture here.
Ports only the feature flag from PR #107230 to control org-level eligibility for code review experiments. Per the updated architecture decision in CW-696, the option registration and hash bucket assignment logic will be moved directly to Seer. This allows Seer to test sentry-options integration directly while keeping the monolith implementation minimal. Refs CW-696
Summary
Register `organizations:code-review-experiments-enabled` feature flag and `code-review.experiments` option to enable A/B testing infrastructure for code review experiments.

Why a feature flag and an option?
We want targeted assignment to orgs - some orgs should get this, others shouldn't, and we want good segmentation. That is why we use the feature flag to evaluate if a given org should be part of any experiment or not.
We also want variable PR assignment. Apparently that is not one of the identity fields for flags, so we'd have to implement a new context transformer, which I'm reluctant to do.
The alternative is to have an option that will hold the active experiments and their rollout rate, as explained in the code.
That's why both :)
Changes
Refs: CW-696
🤖 Generated with Claude Code