
Change z-score calculations to incorporate begins_distribution_change #473

Closed
austin3dickey opened this issue Nov 30, 2022 · 16 comments · Fixed by #636
Labels
method: lookback z-score ticket related to a specific analysis method

Comments

@austin3dickey
Member

austin3dickey commented Nov 30, 2022

See comments below for full discussion. Current plan: when calculating the z-score of a BenchmarkResult B on commit C (a rough sketch follows this list),

  • for the distribution mean, use the mean of the means of all BenchmarkResults that:
    • share the same case/context/hardware/repo as B
    • are on the default branch
    • are in the commit ancestry of B
    • starting with either the earliest BenchmarkResult that matches these criteria, or the latest matching BenchmarkResult where begins_distribution_change is True, whichever is later
    • ending with the latest matching BenchmarkResult before C
  • for the distribution standard deviation, calculate the sample standard deviation of the differences of each BenchmarkResult's mean from the rolling distribution mean (which is calculated the same way as above)
    • This always starts at the earliest BenchmarkResult that matches the criteria, not accounting for begins_distribution_change
  • use the mean of B for the z-score calculation
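
A minimal sketch of that plan in R (dplyr + slider, matching the snippets later in this thread), assuming a `results` data frame that already holds only the matching BenchmarkResults (same case/context/hardware/repo, default branch, in B's commit ancestry), ordered oldest-first and ending with the latest result before C. The column names (`result_mean`, `begins_distribution_change`) are illustrative, not the actual schema:

library(dplyr)
library(slider)

# `results`: the matching BenchmarkResults in commit order, oldest first, ending
# with the latest result before C; `b_mean` is the mean of the result B being scored.
z_score_for <- function(results, b_mean) {
  results <- results %>%
    # each begins_distribution_change starts a new segment
    mutate(segment = cumsum(begins_distribution_change)) %>%
    group_by(segment) %>%
    # rolling mean that only looks back within the current segment
    mutate(rolling_mean = slider::slide_dbl(result_mean, mean, .before = Inf, .after = 0)) %>%
    ungroup()

  # distribution mean: the mean of all results since the latest distribution change
  dist_mean <- tail(results$rolling_mean, 1)

  # distribution standard deviation: sample SD of the differences from the rolling
  # mean, over the full history (not truncated at the latest distribution change)
  dist_sd <- sd(results$result_mean - results$rolling_mean)

  (b_mean - dist_mean) / dist_sd
}
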
@jonkeane
Collaborator

jonkeane commented Jan 9, 2023

A few options we discussed in #383, with more summary and polish here:

We could (and IMHO should) treat the step-change markers we have from #472 and #574 as known-mean shifts in the distribution.

We currently use two measures of history: the mean of the measure (for tracking central tendency) and the standard deviation. If we centered all of our measures around the means of their segments before calculating the standard deviation for the whole set, we would have a measure of variability that is independent of the mean shifts. IIRC, @alistaire47 called this the standard deviation of the residuals (where the model is the mean of the segment). I'm happy to throw some code together that shows this off (in the style of what's in #383), but I'll need to do that later.

We also should consider adding an exponential weighting so that past history has less (or effectively no) influence on recent measures. But IMO we should put that in a separate ticket and not try to tackle it here.
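
Not something this thread settles on, but as a rough illustration of the idea, an exponentially weighted standard deviation could look something like this (the decay factor is an arbitrary example, not a recommendation):

# Exponentially decaying weights: the most recent observation gets weight 1,
# and each older observation decays by a factor `lambda` per step.
exp_weighted_sd <- function(x, lambda = 0.97) {
  w <- lambda ^ rev(seq_along(x) - 1)
  m <- sum(w * x) / sum(w)
  sqrt(sum(w * (x - m)^2) / sum(w))
}

# e.g. drop this into slider::slide_dbl() in place of plain sd in the snippets below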

I also think that, even with just this scaling, we're probably at the point where we want (or possibly even need) to pull these calculations out of being executed exclusively in the database.

Finally, we should consider what this looks like in the UI + how we message this to folks. I think improvements to the plots (some of #477) to show the area of uncertainty and whether dots fall outside of it are absolutely critical. After that, we should describe why we're using history for this. Ideally we would also have a robust and detailed document describing the approach + math for folks who want to dive that deeply.

@alistaire47
Contributor

To summarize my thoughts on how to do this (a rough sketch follows below):

  1. Calculate the means (or median or whatever) of each segment (separated by step changes)
  2. Subtract off (1) from the raw data to get residuals across segments
  3. Take a rolling standard deviation weighted for recency to calculate bounds. Weighting could be exponential or "triangle" (linear decay) or otherwise; what matters is that more recent data gets weighted more so if spread changes after a step change, bounds won't blow out, but will adjust.

An alternative way to do (3) would be to weight entire segments together instead of points. Ontologically this is a more attractive way to do it, but the math is more complicated since you care about both how recent a segment is and how many commits come after it. You could mix two segments with a conjugate prior (normal-normal might work, though sd follows a chi distribution, I think? Could maybe use a beta distribution to parameterize the mixing?), but if segments are short we'd have to chain that somehow. If we want to go this route, we should nerd-snipe a Bayesian statistician with it.
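
A minimal sketch of steps (1)-(3) with "triangle" (linear-decay) weights, assuming a data frame shaped like the simulated `data` further down this thread (a `time` measurement plus a logical `step_change` column); the window size and weighting scheme are illustrative only:

library(dplyr)
library(slider)

# linearly increasing weights: the most recent value in the window weighs the most
triangle_sd <- function(x) {
  w <- seq_along(x) / length(x)
  m <- sum(w * x) / sum(w)
  sqrt(sum(w * (x - m)^2) / sum(w))
}

bounded <- data %>%
  mutate(segment = cumsum(step_change)) %>%
  group_by(segment) %>%
  mutate(segment_mean = mean(time)) %>%                  # step (1): per-segment means
  ungroup() %>%
  mutate(
    resid = time - segment_mean,                         # step (2): residuals
    rolling_sd = slider::slide_dbl(resid, triangle_sd,   # step (3): recency-weighted rolling SD
                                   .before = 100, .after = 0)
  )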

@jonkeane
Collaborator

jonkeane commented Jan 9, 2023

Here's some more code demonstrating this:

library(dplyr)
library(slider)
library(ggplot2)

fake_data_1 <- data.frame(
  time = rnorm(1000, mean = 5.28, sd = 0.01),
  step_change = FALSE
)

fake_data_2 <- data.frame(
  time = rnorm(10, mean = 11.2, sd = 0.01),
  step_change = FALSE
)
fake_data_2[1,"step_change"] <- TRUE

fake_data_3 <- data.frame(
  time = rnorm(90, mean = 5.28, sd = 0.01),
  step_change = FALSE
)
fake_data_3[1,"step_change"] <- TRUE

fake_data_4 <- data.frame(
  time = rnorm(90, mean = 2.28, sd = 0.01),
  step_change = FALSE
)
fake_data_4[1,"step_change"] <- TRUE

fake_data_5 <- data.frame(
  time = rnorm(120, mean = 5.28, sd = 0.01),
  step_change = FALSE
)
fake_data_5[1,"step_change"] <- TRUE

data <- bind_rows(fake_data_1, fake_data_2, fake_data_3, fake_data_4, fake_data_5)

# add row number
data <- mutate(data, num = row_number())

# our current strategy
data_proc <- data %>%
  mutate(
    sd = slider::slide_dbl(time, sd, .before = 100, .after = 0),
    mean = slider::slide_dbl(time, mean, .before = 100, .after = 0),
    z_score = (time - lag(mean)) / sd
  )

ggplot(data_proc) +
  aes(x = num, y = time) +
  geom_line() +
  geom_linerange(mapping = aes(ymin = mean - sd * 5, ymax = mean + sd * 5), alpha = 0.2, color = "blue") +
  ggtitle("times plotted over iterations", subtitle = "the grey band is mean +/- sd*5 (or our z-score threshold)")


# With scaling
data_proc <- data %>%
  # label the segments as different each time there's a step change
  mutate(segment = cumsum(data$step_change)) %>% 
  # calculate means per group
  group_by(segment) %>% 
  mutate(mean = mean(time)) %>%
  ungroup() %>%
  mutate(
    # scale the mean
    scaled = time - mean,
    # and the value for .before here can be increased without the bands expanding too much
    sd = slider::slide_dbl(scaled, sd, .before = 1000, .after = 0)
  )
  
ggplot(data_proc) +
  aes(x = num, y = time) +
  geom_line() +
  geom_linerange(mapping = aes(ymin = mean - sd * 5, ymax = mean + sd * 5), alpha = 0.2, color = "blue") +
  ggtitle("times plotted over iterations, with a slightly different calculation", subtitle = "the blue band is mean +/- sd*5 (or our z-score threshold)")

[Rplot image]

@austin3dickey
Member Author

Okay I finally understand this approach. That's a lot simpler than I thought, and I agree that we can tackle "weighting by recency" later.

@jonkeane
Collaborator

jonkeane commented Jan 9, 2023

Another facet of this that I didn't realize until I wrote out that simulation (discovered only because of a bug in an earlier implementation 😬): with the possible exception of every-single-commit-is-its-own-segment (though even that may be fine, actually!), having an "extra" segment, i.e. labelling a commit that doesn't have a big difference as the start of a new segment, doesn't negatively impact the standard deviation bands.
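
Not from the thread, but a quick way to check this against the simulated `data` from the snippet above (the spurious change point at row 500 is arbitrary):

library(dplyr)

# mark an extra, spurious step change in the middle of the first (stable) segment
data_extra <- data
data_extra[500, "step_change"] <- TRUE

resid_sd <- function(d) {
  d %>%
    mutate(segment = cumsum(step_change)) %>%
    group_by(segment) %>%
    mutate(resid = time - mean(time)) %>%
    ungroup() %>%
    summarise(s = sd(resid)) %>%
    pull(s)
}

# the two residual SDs come out nearly identical
c(original = resid_sd(data), extra_segment = resid_sd(data_extra))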

@alistaire47
Contributor

That example assumes standard deviation is constant across segments, though. The reason for weighting is for when it's not: the bounds will adjust smoothly (otherwise the old width lasts until the old data ages out) and more quickly to the new spread, hopefully without too many false positives in the case where spread increases (though that's still a possibility).

...but we can hold off on weighting if we're not too concerned about spread changes, because eliminating the jumps from the computation does fix the artificially big bounds.

@jonkeane
Collaborator

That example assumes standard deviation is constant across segments, though. The reason for weighting is for when it's not: the bounds will adjust smoothly (otherwise the old width lasts until the old data ages out) and more quickly to the new spread, hopefully without too many false positives in the case where spread increases (though that's still a possibility).

Totally. Even without weighting, the shift will (very slowly!) reduce over time if we go from a high-variability regime (back) to a low-variability regime, but the time it would take to do that is quite long. And like you said: weighting will help smooth that out more quickly in a nice way, so we should do it. If we can "easily" do the weighting here, let's do it, but I don't want to get so distracted with perfecting the weighting that we delay (even by a little bit) getting out the long-awaited, more basic improvements we can get without it.

@jgehrcke
Member

jgehrcke commented Jan 16, 2023

I think we should try to

  • keep analysis as simple as it can be (and only as complex as it verifiably needs to be)
  • keep things explainable to a rather regular user

(maybe the proposals above are compatible with that perspective!)

The following paragraph is meant half serious, half funny (I have not read all of the above in full detail, and I hope my comment is not perceived as unfair!). After skimming through the paragraphs above I'd love to share an impression. In my life I've done and seen my fair share of advanced data analysis. I have the feeling (I might be wrong) that our space of business here does not really need super advanced data analysis. Before doing Bayesian Principal Component Analysis of the Fourier Transform of the exponentially weighted chi distribution (conjugated, of course 🚀!), my intuition is that we should focus on improving the raw data first (reduce noise, inspect distribution properties qualitatively, etc.), and then apply really basic means for regression detection :). I know this is provocative, and over time I'd love to learn and convince myself how delicate and intricate our regression detection algorithms really need to be.

@alistaire47
Contributor

Agreed, simple is good, provided it works. The question is how to mix data across segments when calculating the bounds so they adjust smoothly and responsively. How much it matters how well we do this depends on how big the shifts in spread between segments are: if spread mostly doesn't change much, we can do very little and be fine. If spread sometimes increases significantly, we'll see some noise alerts right after shifts unless we improve our mixing method.

Historically we've avoided these noise alerts because we include the jump in calculating the bounds, blowing them out for a while anyway. Since we're going to stop doing that, we'll soon see if we need more sophistication. And even then, just weighting for recency might be good enough!

@austin3dickey
Member Author

I'm looking deeper at the examples above. I'm not super comfortable with centering the band around the mean of the entire segment, because that's affected by future data in the same segment, which could cause weird alerts. For example, I could see a scenario where we send a regression alert, but we don't see it in the webapp later, because the band's center drifted upwards after new data came in. So I propose we center the band around the rolling mean that excludes data from before the segment started. That code might look like this:

data_proc <- data %>%
  mutate(segment = cumsum(data$step_change)) %>% 
  group_by(segment) %>% 
  mutate(
    segment_mean = mean(time),
    rolling_mean = slider::slide_dbl(time, mean, .before = 1000, .after = 0)
  ) %>%
  ungroup() %>%
  mutate(
    scaled = time - segment_mean,
    rolling_sd = slider::slide_dbl(scaled, sd, .before = 1000, .after = 0)
  )
  
ggplot(data_proc) +
  aes(x = num, y = time) +
  geom_line() +
  geom_linerange(mapping = aes(ymin = rolling_mean - rolling_sd * 5, ymax = rolling_mean + rolling_sd * 5), alpha = 0.2, color = "blue") +
  ggtitle("times plotted over iterations, with a slightly different calculation", subtitle = "the blue band is rolling_mean +/- rolling_sd*5 (or our z-score threshold)")

While we're at it, we could do the same thing with the standard deviation: use the rolling standard deviation that excludes data from before the segment started. This addresses Ed's concern about the spread looking different between two segments. That would look like this:

data_proc <- data %>%
  mutate(segment = cumsum(data$step_change)) %>% 
  group_by(segment) %>% 
  mutate(
    rolling_mean = slider::slide_dbl(time, mean, .before = 1000, .after = 0),
    rolling_sd = slider::slide_dbl(time, sd, .before = 1000, .after = 0)
  ) %>%
  ungroup()
  
ggplot(data_proc) +
  aes(x = num, y = time) +
  geom_line() +
  geom_linerange(mapping = aes(ymin = rolling_mean - rolling_sd * 5, ymax = rolling_mean + rolling_sd * 5), alpha = 0.2, color = "blue") +
  ggtitle("times plotted over iterations, with a slightly different calculation", subtitle = "the blue band is rolling_mean +/- rolling_sd*5 (or our z-score threshold)")

Of course this has the issue that the first few points per segment are missing a z-score.

Time for some plots. I used the same data-generating code you did except:

  • I changed the sd of the dummy data to 1 in all cases
  • In the big first segment, I made the data increase slightly over time, so that it shows how using the segment mean might not look great for all data points (a hypothetical sketch of this change follows the list).
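
For reference, a hypothetical reconstruction of that tweak to the first chunk of dummy data (the size of the drift is a guess, not the code actually used):

# sd bumped to 1, plus a small upward drift across the big first segment
fake_data_1 <- data.frame(
  time = rnorm(1000, mean = 5.28, sd = 1) + seq(0, 2, length.out = 1000),
  step_change = FALSE
)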

Jon's option:
[plot: band centered on the segment mean]

My option 1:
[plot: band centered on the within-segment rolling mean, rolling SD of residuals across segments]

My option 2:
[plot: band centered on the within-segment rolling mean, within-segment rolling SD]

@austin3dickey
Member Author

austin3dickey commented Jan 19, 2023

After typing that whole comment, I sorta convinced myself that maybe it's okay to use the segment mean as the center of the band, even if it uses future data. Justification:

  • We try to control for everything that might cause a shift in mean. For one segment, the mean really should stay constant. If it changes, that's cause for creating another segment. Once the new segment is created, the first segment's band should adjust such that we can see the correct alerts again, for the most part.
  • That first plot really looks nice after a "step change" compared to the other two plots. That consistency is probably worth the hand-wavy use of future data.

@jonkeane
Collaborator

I'm looking deeper at the examples above. I'm not super comfortable with centering the band around the mean of the entire segment, because that's affected by future data in the same segment, which could cause weird alerts. For example, I could see a scenario where we send a regression alert, but we don't see it in the webapp later, because the band's center drifted upwards after new data came in. So I propose we center the band around the rolling mean that excludes data from before the segment started.

Yes, absolutely. I meant to say in my previous comment that my quick-and-dirty code notably doesn't do this; it really should only ever look back, like you say here.

While we're at it, we could do the same thing with the standard deviation: use the rolling standard deviation that excludes data from before the segment started.

I'm very strongly ➖ on only using SD data from the current segment (if that's what you mean by "excludes data from before the segment started"). In the current architecture, we really should be using as much SD information as we can (especially from recent, but possibly different, segments). As we evolve our alerting mechanisms we might change this to be smarter, or include other, simpler difference calculations (that's ok!). But for now I don't think we can afford to effectively say after any deviation "we have no idea what the spread could be, so we can't possibly alert on anything" (or worse: continue to alert using samples with a small N when that might not be valid), which is effectively the current regime we are in when we blow out the SDs. One of the biggest new features of this project is (copied from https://github.com/conbench/conbench/milestone/1):

A history calculation that doesn't blow out the range of the distribution when there are large changes. (We might automatically detect these, or we might restrict this to be "only ignore changes which have been labelled regressions or improvements manually" cf #171)

Resetting the distribution at each segment is effectively similar to using the same data and waiting for it to age out after 100 commits: you're left with a period after a change where you're not confident enough to alert, even though we do actually have data that could help with that (we're just not calculating things in a way that shows it).

@alistaire47
Contributor

If the mean is drifting, something is probably wrong with the benchmark or setup.

Slightly off-topic but related in "bounds look weird": We might consider eventually calculating the upper and lower bounds separately so they're asymmetric, because it's not that unusual for benchmarks to have higher upward erraticity than downwards, and we really don't want any bounds that go negative. I may have made an issue for this at some point? I forget.
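
Not something the thread implements, but a rough sketch of one way to get asymmetric bounds: estimate separate upward and downward spreads from the signed residuals (reusing the `data_proc` frame, with its `rolling_mean` column, from the snippets above; the 5x multiplier and zero floor are illustrative):

library(dplyr)

asymmetric <- data_proc %>%
  mutate(
    resid = time - rolling_mean,
    # spread estimated separately from residuals above and below the center
    sd_up   = sqrt(cummean(pmax(resid, 0)^2)),
    sd_down = sqrt(cummean(pmin(resid, 0)^2)),
    upper   = rolling_mean + 5 * sd_up,
    lower   = pmax(rolling_mean - 5 * sd_down, 0)  # never let the lower bound go negative
  )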

But in terms of what we want bounds to look like after a shift, I think the first option is a little too perfect, and assumes more knowledge about the new segment than we actually have. So ideally they'd blow out a little (not as much or as long as the old method) to capture the new uncertainty, then zoom back in as more data is collected, kind of like these two happened to end up:
[plot image]
Bayesian methods are nice here because they let you quantify that uncertainty based on how much data you have while still pulling information from your priors, but they are maybe too fancy for where we are.

@jonkeane
Collaborator

jonkeane commented Jan 19, 2023

If the mean is drifting, something is probably wrong with the benchmark or setup.

💯 this is something we should actually be monitoring automatically, and then alerting on separately when something is up. This is a classic death-by-a-thousand-slowdowns trap where no individual change makes things perceptibly slower, but over time you've introduced enough overhead with each change that you're now 2x or 3x slower (or something is wrong with your entire benchmark itself, or with your running environment like you mention). Conbench is in some ways uniquely positioned to catch these. But that's a project for after we do this + have more facility for calculating things like the slope of the mean over time. Would you mind making an issue for this @alistaire47 so we can brainstorm there and figure out where to slot it into new project(s)?
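
Not part of this issue, but as a rough illustration of "the slope of the mean over time": a rolling least-squares slope over recent history could flag slow drift (the 100-result window is arbitrary, and `data` is the simulated frame from earlier in the thread):

library(dplyr)
library(slider)

drift <- data %>%
  mutate(
    # slope of a straight-line fit over the last 100 results (NA until the window fills)
    slope = slider::slide_dbl(
      time,
      function(y) coef(lm(y ~ seq_along(y)))[2],
      .before = 100, .after = 0, .complete = TRUE
    )
  )

# a slope that stays well above zero for many commits suggests slow drift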

@alistaire47
Contributor

#608

@austin3dickey
Member Author

Thanks for the issue Ed! I agree with you both that the middle plot is probably our best option here. It uses out-of-segment SD data but not out-of-segment (or future!) mean data. I'm working on implementing that now.

@austin3dickey austin3dickey changed the title Change Distribution queries to only use all data after the latest step change Change z-score calculations to incorporate begins_distribution_change Jan 27, 2023