feat(code-review): Register experiment controllers for code-review #107230
giovanni-guidini wants to merge 5 commits into master from
Conversation
Register `organizations:code-review.noop-experiment` feature flag to enable A/B testing infrastructure for code review experiments. This flag uses Flagpole for remote control and is not exposed via API.
Refs: CW-696
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
We need to run A/B tests where different PRs from the same org can get
different experiments, instead of all PRs from an org being in the same
experiment group.
Use hybrid Flagpole + Options approach:
- Flagpole controls org eligibility (segment support for priority orgs,
early adopters, gradual rollout)
- Options control which experiments are active and their rollout rates
Experiments are evaluated in order (first match wins), allowing multiple
concurrent experiments with independent rollout percentages.
Rename flag: organizations:code-review.noop-experiment
→ organizations:code-review-experiments-enabled
Add option: bug-prediction.experiments (list of [name, rollout_rate])
Example config: [["noop-experiment", 0.5], ["cost-optimized", 0.3]]
Result: 50% of PRs get noop, 30% get cost-optimized, 20% get baseline
Refs: CW-696
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
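The first-match-wins scheme described above can be sketched roughly as follows (the function name, hashing scheme, and bucket granularity are illustrative assumptions, not the actual implementation):

```python
import hashlib


def assign_experiment(
    org_id: int, pr_id: int, experiments: list[tuple[str, float]]
) -> str:
    """First-match-wins assignment; each experiment hashes independently."""
    for name, rollout_rate in experiments:
        if rollout_rate <= 0.0:
            continue  # disabled experiment
        if rollout_rate >= 1.0:
            return name  # fully released
        # The hash includes the experiment name, so each rollout is independent
        key = f"{org_id}:{pr_id}:{name}"
        digest = hashlib.sha256(key.encode()).hexdigest()
        bucket = int(digest[:8], 16) / 0x1_0000_0000  # stable value in [0, 1)
        if bucket < rollout_rate:
            return name
    return "baseline"
```

For a given org/PR pair the result is deterministic, so repeated calls with the same config always land in the same group.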
eb2a418 to ad6a58d
Mypy requires all test functions to have return type annotations. Refs: CW-696
ad6a58d to 25936f3
```python
    Examples:
        >>> # Option: [["noop", 1.0], ["cost", 0.5]]
        >>> # All PRs get "noop" (100% rollout takes priority)
        >>>
        >>> # Option: [["noop", 0.0], ["cost", 0.5]]
        >>> # noop disabled, 50% of PRs get "cost", 50% get "baseline"
        >>>
        >>> # Option: []
        >>> # No experiments configured → all PRs get "baseline"
    """
    # Check if org is eligible for experiments via Flagpole
    if not features.has(
        "organizations:code-review-experiments-enabled",
        organization,
        actor=user,
    ):
        return "baseline"

    # Get experiment configurations from Options
    experiments: list[tuple[str, float]] = options.get("code-review.experiments")

    # Evaluate experiments in order - first match wins
    for experiment_name, rollout_rate in experiments:
        # Skip disabled experiments
        if rollout_rate <= 0.0:
            continue

        # Fully released experiment (100%)
        if rollout_rate >= 1.0:
            return experiment_name

        # Per-PR variable assignment using hash of org + PR + experiment
        # Including experiment_name ensures independent rollout per experiment
        assignment_key = f"{organization.id}:{pr_id}:{experiment_name}"
        if _hash_assignment_key(assignment_key, rollout_rate):
            return experiment_name

    # No experiment matched
    return "baseline"
```
I like the structure of the option, an array of tuples describing each treatment 👍
I'm worried though that we're including the experiment name in the hash, so as we iterate through the list we're computing a percent-of-the-leftovers which will make it really hard to tell what a given population is going to be for an experiment, and thus how confident we are in the results. Also changes to other experiments would affect populations of things downstream which isn't easy to reason about.
I think what we have here is one experiment (the code-review strategy experiment) with a variable list of treatments (which strategy to use).
To fix i think it would be better to hash each PR once, not including the experiment_name.
We can do some math to compare.
If we include experiment_name in the hash (like the PR has now), given the options [a, 0.5], [b, 0.5], [c, 0.5], a PR would have a 50% chance of being in group A, a 25% chance of being in B, and only a 12.5% chance of being in C. Wildly different populations even though they all say 0.5.
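The compounding effect is easy to verify numerically; each rate only applies to whatever the earlier experiments left over:

```python
# Configured rollout rates for experiments A, B, C under
# independent per-experiment hashing (first match wins).
rates = [0.5, 0.5, 0.5]

populations = []
remaining = 1.0  # fraction of PRs not yet captured by an earlier experiment
for rate in rates:
    populations.append(remaining * rate)
    remaining *= 1 - rate

print(populations)  # [0.5, 0.25, 0.125] -> 50%, 25%, 12.5%
```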
But if we hash once and use the same options config the chances are: 50% chance of being in treatment A, 50% chance of being in C, 0% for C. But looking closely the %'s add up to 1.5, so if we enforce that things should add to 1.0 then we'd actually write the options to be [a, 0.333], [b, 0.333], [c, 0.333] and every treatment will get equal sized populations, and consistent sizes as new treatments are added and removed.
We wouldn't want to rebalance the %'s each time someone adds a new experiment, changing from 0.5, 0.5, 0.5 to 0.33, 0.33, 0.33; that's not the way. We should require that the sum of all treatments always adds up to 1.0. What people would need to do is borrow some sample size from the baseline/noop treatment and use it in their experiment, giving it back when the experiment is done. The PRs would look like this:
# start new experiment

```diff
 [
-  ['noop', 1.0],
+  ['noop', 0.9],
+  ['cost', 0.1],
 ]
```

# end old experiment

```diff
 [
-  ['noop', 0.7],
+  ['noop', 0.8],
-  ['cost', 0.1],
   ['perf', 0.2],
 ]
```

As for the flagpole check at the top, it should work really well too.
We'll have, say, on the order of 1,000 orgs with Seer. Of that group maybe we turn on the experiment for those with Early Adopter enabled or something like that, so the population of orgs is on the order of 100s. But they'll produce 10-100x PRs, so we'll have on the order of 100,000 PRs that are eligible. Of those some % will get the noop treatment, but even with sample sizes of 0.02 and above we could get 20,000 PRs per month into a treatment group.
UPDATE
I missed the forest for the trees a little there, but I agree with the point of ensuring things add up to 1 and the rebalancing of probabilities. And also that we should make just 1 hash and compare properly. Those are good changes for sure.
I don't love that we'll just leave for the devs to organize this tho. Ideally the code can sort itself out. I kinda thought of the CSS flex-grow property? If all treatments have a "grow" of 1 they all get the same probability, otherwise if one of the treatments is say 10 and another is 1 then the probabilities are 10/11 and 1/11.
(I had written this before the update)
I think what we have here is one experiment (the code-review strategy experiment) with a variable list of treatments (which strategy to use).
That's a fair assumption indeed.
But if we hash once and use the same options config the chances are: 50% chance of being in treatment A, 50% chance of being in C, 0% for C.
First, I assume you mean 50% A, 50% B, 0% C. But that is not correct. If we hash once and all treatments have the same rollout rate, we compare the same hash against the same rollout. So it's actually either A or baseline, depending on the hash value.
But if we hash the experiment name in there it might go to B or C
Oh, I did mean 50% A, 50% B, 0% C. Typo!
So one thing that's constant in all this is that one request can only be part of one group at a time. We'll skip over caching that and maintaining the same group through the life of the PR; we probably don't need that, but it's another topic. Let's drill back down into it.
If we treat the system as if it were multiple A/B experiments, then the outcome of the first experiment check will impact the population size of the second, because the user can only be in one thing at a time: if they skip the first experiment they become eligible for the second.
To implement this we might do something like `enabled() if hash < rollout_percent else disabled()`. For three treatments it's a little different though.
So for a multi treatment experiment the total population is split across all the treatments available. I find it's helpful to think of all the hash values as falling on a number line, from 0.0 to 1.0, and the rollouts actually represent ranges on that line.
Here we have 4 treatments, each representing 25% of the total line.
When we hash keys they fall somewhere on the line, and we want to return the treatment that corresponds to that segment.
```
0.0                                              1.0
 +-----------+-----------+------------+-----------+
 |   base    |    T1     |     T2     |    T3     |
 +-----------+-----------+------------+-----------+
       ^                        ^
   hash(key1)               hash(key2)
```
We don't run `enabled() if hash < rollout_percent else disabled()` in a loop; that would be more like a casino game where people keep rolling until they hit something.
Instead of doing `if hash < rollout_percent` we'd want something more like `if treatment.start <= hash < treatment.end:`, where start is the sum of the previous rollouts and end is that sum plus the rollout size of this treatment.
In the past I've seen rollouts defined in start/end terms like:

```python
# Array<tuple<name, start, end>>
# start is inclusive, end is exclusive
Option: [
    ["noop", 0.0, 0.25],
    ["cost", 0.25, 0.50],
    ["cost", 0.50, 0.75],
    ["cost", 0.75, 1.0],
]
```
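A start/end scheme like that could be looked up with a simple range scan (a sketch; the function name and the distinct treatment names are illustrative):

```python
def pick_treatment(
    bucket: float, treatments: list[tuple[str, float, float]]
) -> str:
    """Return the treatment whose [start, end) range contains the bucket."""
    for name, start, end in treatments:
        if start <= bucket < end:
            return name
    return "baseline"  # no range matched


# Hypothetical config with distinct treatment names for clarity
option = [
    ("noop", 0.0, 0.25),
    ("cost", 0.25, 0.50),
    ("perf", 0.50, 0.75),
    ("ux", 0.75, 1.0),
]
```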
Instead of doing if hash < rollout_percent we'd want to do more like
if treatment.start < hash < treatment.end: where start is the sum of the rollouts previous, and end is the sum+the rollout size of this treatment.
Isn't that exactly what's implemented in the current version?
Except that the range is calculated by the code so we don't have to manually maintain it, just assign relative weights to the treatments.
```python
cumulative = 0.0  # sum of previous treatments
for experiment_name, weight in active_experiments:
    # Calculate cumulative threshold as percentage (0-100)
    cumulative += (weight / total_weight) * 100  # end range of current
    if bucket < cumulative:  # is the hash in this experiment's range?
        return experiment_name
```

…nment

This changes the experiment assignment system from independent per-experiment hashing to a single-hash cumulative weight-based approach (similar to CSS flex-grow).

## Why This Change

The previous approach had several problems:

1. **Unpredictable population sizes**: With independent hashing per experiment, the rollout rates were "percent-of-leftovers", which made it difficult to predict actual population sizes. For example, with [["a", 0.5], ["b", 0.5]], experiment A would get 50% but B would only get 25% (50% of the remaining 50%).
2. **Difficult to reason about**: Changes to one experiment would affect the population sizes of all downstream experiments in unpredictable ways.
3. **Inconsistent populations**: Three experiments all configured at 0.5 would result in wildly different populations (50%, 25%, 12.5%) despite having the same configuration value.

## Benefits of Weight-Based Approach

1. **Predictable populations**: Each experiment gets exactly its proportional share of traffic. Weights [["a", 1], ["b", 1]] = 50/50 split.
2. **Intuitive configuration**: Like CSS flex-grow, weights are relative:
   - [["a", 10], ["b", 1]] = 90.9% A, 9.1% B
   - [["a", 3], ["b", 2], ["c", 1]] = 50% A, 33% B, 17% C
3. **Stable assignments**: Adding or removing experiments doesn't affect the population sizes of other experiments, since each PR is hashed once to a consistent bucket.
4. **Easy to analyze**: Population sizes are deterministic and can be calculated upfront, making it easier to determine statistical confidence.
5. **Self-balancing**: No need to manually calculate percentages that sum to 1.0 - the system automatically normalizes weights to 100%.

## Implementation Details

- Each PR is hashed once to a bucket (0-99) using org_id:pr_id
- Experiments are assigned based on cumulative weight ranges
- Weight 0 disables an experiment
- All positive weights are normalized to create proportional ranges

Refs: CW-696
Change the experiment assignment to return None instead of the string "baseline" when a PR is not assigned to any experiment. This aligns with Seer's API expectations, which use None to indicate the control group.

Changes:
- Update return type: str -> str | None
- Replace all "baseline" returns with None
- Update all tests to check for None instead of "baseline"
- Update docstrings and comments to use "control group" terminology

Refs: CW-696
Updated

```python
experiments: list[tuple[str, float]] = options.get("code-review.experiments")

# Filter out disabled experiments (weight 0) and calculate total weight
active_experiments = [(name, weight) for name, weight in experiments if weight > 0]
```
Bug: The code does not validate the structure of the code-review.experiments option, causing a runtime crash if the data is malformed via the Options Automator.
Severity: CRITICAL
Suggested Fix
Add defensive validation before processing the experiments list. Iterate through the items, check if each is a list/tuple of length 2, and verify that the weight is a numeric type before appending it to active_experiments. This will filter out malformed entries and prevent runtime crashes.
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.
Location: src/sentry/seer/code_review/assignment.py#L70-L73
Potential issue: The code processes the `code-review.experiments` option, which is expected to be a list of `[name, weight]` pairs. However, there is no validation to ensure the structure of the data. The Options Automator, which can modify this value at runtime, only validates that the option is a list, not that its elements are 2-item lists with a numeric weight. If a malformed value like `[["exp1"]]` or `[["exp1", "not-a-number"]]` is set, the list comprehension will fail with a `ValueError` or `TypeError` respectively. This will cause a runtime crash for any code review that triggers this logic, effectively breaking the feature for all users.
I'm going to close this for now because we'll slightly change the architecture here.
Ports only the feature flag from PR #107230 to control org-level eligibility for code review experiments. Per the updated architecture decision in CW-696, the option registration and hash bucket assignment logic will be moved directly to Seer. This allows Seer to test sentry-options integration directly while keeping the monolith implementation minimal. Refs CW-696
Summary
Register `organizations:code-review-experiments-enabled` feature flag and `code-review.experiments` option to enable A/B testing infrastructure for code review experiments.

Why a feature flag and an option?
We want targeted assignment to orgs - some orgs should get this, others shouldn't, and we want good segmentation. That is why we use the feature flag to evaluate if a given org should be part of any experiment or not.
We also want variable PR assignment. Apparently that is not one of the identity fields for flags, so we'd have to implement a new context transformer, which I'm reluctant to do.
The alternative is to have an option that will hold the active experiments and their rollout rate, as explained in the code.
That's why both :)
Changes
Refs: CW-696
🤖 Generated with Claude Code