
Change z-score calculations to incorporate begins_distribution_change #473

Closed
austin3dickey opened this issue Nov 30, 2022 · 16 comments · Fixed by #636
Labels
method: lookback z-score ticket related to a specific analysis method

Comments

@austin3dickey
Member

austin3dickey commented Nov 30, 2022

See comments below for full discussion. Current plan: when calculating the z-score of a BenchmarkResult B on commit C (a rough sketch follows this list),

  • for the distribution mean, use the mean of the means of all BenchmarkResults that:
    • share the same case/context/hardware/repo as B
    • are on the default branch
    • are in the commit ancestry of B
    • starting with either the earliest BenchmarkResult that matches these criteria, or the latest matching BenchmarkResult where begins_distribution_change is True, whichever is later
    • ending with the latest matching BenchmarkResult before C
  • for the distribution standard deviation, calculate the sample standard deviation of the differences of each BenchmarkResult's mean from the rolling distribution mean (which is calculated the same way as above)
    • This always starts at the earliest BenchmarkResult that matches the criteria, not accounting for begins_distribution_change
  • use the mean of B for the z-score calculation
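
A minimal sketch of that plan in R (dplyr + slider, matching the snippets later in this thread), assuming a `results` data frame that already holds only the matching BenchmarkResults (same case/context/hardware/repo, default branch, in B's commit ancestry), ordered oldest-first and ending with the latest result before C. The column names (`result_mean`, `begins_distribution_change`) are illustrative, not the actual schema:

library(dplyr)
library(slider)

# `results`: the matching BenchmarkResults in commit order, oldest first, ending
# with the latest result before C; `b_mean` is the mean of the result B being scored.
z_score_for <- function(results, b_mean) {
  results <- results %>%
    # each begins_distribution_change starts a new segment
    mutate(segment = cumsum(begins_distribution_change)) %>%
    group_by(segment) %>%
    # rolling mean that only looks back within the current segment
    mutate(rolling_mean = slider::slide_dbl(result_mean, mean, .before = Inf, .after = 0)) %>%
    ungroup()

  # distribution mean: the mean of all results since the latest distribution change
  dist_mean <- tail(results$rolling_mean, 1)

  # distribution standard deviation: sample SD of the differences from the rolling
  # mean, over the full history (not truncated at the latest distribution change)
  dist_sd <- sd(results$result_mean - results$rolling_mean)

  (b_mean - dist_mean) / dist_sd
}
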
@jonkeane
Collaborator

jonkeane commented Jan 9, 2023

A few options we discussed in #383, with more summary and polish here:

We could (and IMHO should) treat the step-change markers we have from #472 and #574 as known-mean shifts in the distribution.

We currently use two measures of history: the mean of the measure (for tracking central tendency) and the standard deviation. If we centered all of our measures around the means of their segments before calculating the standard deviation for the whole set, we would have a measure of variability that is independent of the mean shifts. IIRC, @alistaire47 called this the standard deviation of the residuals (where the model is the mean of the segment). I'm happy to throw some code together that shows this off (in the style of what's in #383), but I'll need to do that later.

We also should consider adding an exponential weighting so that past history has less (or effectively no) influence on recent measures. But IMO we should put that in a separate ticket and not try to tackle it here.
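
Not something this thread settles on, but as a rough illustration of the idea, an exponentially weighted standard deviation could look something like this (the decay factor is an arbitrary example, not a recommendation):

# Exponentially decaying weights: the most recent observation gets weight 1,
# and each older observation decays by a factor `lambda` per step.
exp_weighted_sd <- function(x, lambda = 0.97) {
  w <- lambda ^ rev(seq_along(x) - 1)
  m <- sum(w * x) / sum(w)
  sqrt(sum(w * (x - m)^2) / sum(w))
}

# e.g. drop this into slider::slide_dbl() in place of plain sd in the snippets below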

I also think that, even with just this scaling, we're probably at the point where we want (or possibly even need) to pull these calculations out of being executed exclusively in the database.

Finally, we should consider what this looks like in the UI + how we message this to folks. I think improvements to the plots (some of #477) to show the area of uncertainty and whether dots fall outside of it are absolutely critical. After that, we should describe why we're using history for this. Ideally we would also have a robust and detailed document describing the approach + math for folks who want to dive that deeply.

@alistaire47
Contributor

To summarize my thoughts on how to do this (a rough sketch follows below):

  1. Calculate the means (or median or whatever) of each segment (separated by step changes)
  2. Subtract off (1) from the raw data to get residuals across segments
  3. Take a rolling standard deviation weighted for recency to calculate bounds. Weighting could be exponential or "triangle" (linear decay) or otherwise; what matters is that more recent data gets weighted more so if spread changes after a step change, bounds won't blow out, but will adjust.

An alternative way to do (3) would be to weight entire segments together instead of points. Ontologically this is a more attractive way to do it, but the math is more complicated since you care about both how recent a segment is and how many commits come after it. You could mix two segments with a conjugate prior (normal-normal might work, though sd follows a chi distribution, I think? Could maybe use a beta distribution to parameterize the mixing?), but if segments are short we'd have to chain that somehow. If we want to go this route, we should nerd-snipe a Bayesian statistician with it.
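
A minimal sketch of steps (1)-(3) with "triangle" (linear-decay) weights, assuming a data frame shaped like the simulated `data` further down this thread (a `time` measurement plus a logical `step_change` column); the window size and weighting scheme are illustrative only:

library(dplyr)
library(slider)

# linearly increasing weights: the most recent value in the window weighs the most
triangle_sd <- function(x) {
  w <- seq_along(x) / length(x)
  m <- sum(w * x) / sum(w)
  sqrt(sum(w * (x - m)^2) / sum(w))
}

bounded <- data %>%
  mutate(segment = cumsum(step_change)) %>%
  group_by(segment) %>%
  mutate(segment_mean = mean(time)) %>%                  # step (1): per-segment means
  ungroup() %>%
  mutate(
    resid = time - segment_mean,                         # step (2): residuals
    rolling_sd = slider::slide_dbl(resid, triangle_sd,   # step (3): recency-weighted rolling SD
                                   .before = 100, .after = 0)
  )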

@jonkeane
Collaborator

jonkeane commented Jan 9, 2023

Here's some more code demonstrating this:

library(dplyr)
library(slider)
library(ggplot2)

fake_data_1 <- data.frame(
  time = rnorm(1000, mean = 5.28, sd = 0.01),
  step_change = FALSE
)

fake_data_2 <- data.frame(
  time = rnorm(10, mean = 11.2, sd = 0.01),
  step_change = FALSE
)
fake_data_2[1,"step_change"] <- TRUE

fake_data_3 <- data.frame(
  time = rnorm(90, mean = 5.28, sd = 0.01),
  step_change = FALSE
)
fake_data_3[1,"step_change"] <- TRUE

fake_data_4 <- data.frame(
  time = rnorm(90, mean = 2.28, sd = 0.01),
  step_change = FALSE
)
fake_data_4[1,"step_change"] <- TRUE

fake_data_5 <- data.frame(
  time = rnorm(120, mean = 5.28, sd = 0.01),
  step_change = FALSE
)
fake_data_5[1,"step_change"] <- TRUE

data <- bind_rows(fake_data_1, fake_data_2, fake_data_3, fake_data_4, fake_data_5)

# add row number
data <- mutate(data, num = row_number())

# our current strategy
data_proc <- data %>%
  mutate(
    sd = slider::slide_dbl(time, sd, .before = 100, .after = 0),
    mean = slider::slide_dbl(time, mean, .before = 100, .after = 0),
    z_score = (time - lag(mean)) / sd
  )

ggplot(data_proc) +
  aes(x = num, y = time) +
  geom_line() +
  geom_linerange(mapping = aes(ymin = mean - sd * 5, ymax = mean + sd * 5), alpha = 0.2, color = "blue") +
  ggtitle("times plotted over iterations", subtitle = "the grey band is mean +/- sd*5 (or our z-score threshold)")


# With scaling
data_proc <- data %>%
  # label the segments as different each time there's a step change
  mutate(segment = cumsum(data$step_change)) %>% 
  # calculate means per group
  group_by(segment) %>% 
  mutate(mean = mean(time)) %>%
  ungroup() %>%
  mutate(
    # scale the mean
    scaled = time - mean,
    # and the value for .before here can be increased without the bands expanding too much
    sd = slider::slide_dbl(scaled, sd, .before = 1000, .after = 0)
  )
  
ggplot(data_proc) +
  aes(x = num, y = time) +
  geom_line() +
  geom_linerange(mapping = aes(ymin = mean - sd * 5, ymax = mean + sd * 5), alpha = 0.2, color = "blue") +
  ggtitle("times plotted over iterations, with a slightly different calculation", subtitle = "the blue band is mean +/- sd*5 (or our z-score threshold)")

[Rplot image]

@austin3dickey
Member Author

Okay I finally understand this approach. That's a lot simpler than I thought, and I agree that we can tackle "weighting by recency" later.

@jonkeane
Collaborator

jonkeane commented Jan 9, 2023

Another facet of this that I didn't realize until I wrote out that simulation (discovered only because of a bug in an earlier implementation 😬): with the possible exception of every-single-commit-is-its-own-segment (though even that may be fine, actually!), having an "extra" segment, i.e. labelling a commit that doesn't have a big difference as the start of a new segment, doesn't negatively impact the standard deviation bands.
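
Not from the thread, but a quick way to check this against the simulated `data` from the snippet above (the spurious change point at row 500 is arbitrary):

library(dplyr)

# mark an extra, spurious step change in the middle of the first (stable) segment
data_extra <- data
data_extra[500, "step_change"] <- TRUE

resid_sd <- function(d) {
  d %>%
    mutate(segment = cumsum(step_change)) %>%
    group_by(segment) %>%
    mutate(resid = time - mean(time)) %>%
    ungroup() %>%
    summarise(s = sd(resid)) %>%
    pull(s)
}

# the two residual SDs come out nearly identical
c(original = resid_sd(data), extra_segment = resid_sd(data_extra))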

@alistaire47
Contributor

That example assumes standard deviation is constant across segments, though. The reason for weighting is for when it's not: the bounds will adjust smoothly (otherwise the old width lasts until the old data ages out) and more quickly to the new spread, hopefully without too many false positives in the case where spread increases (though that's still a possibility).

...but we can hold off on weighting if we're not too concerned about spread changes, because eliminating the jumps from the computation does fix the artificially big bounds.

@jonkeane
Collaborator

That example assumes standard deviation is constant across segments, though. The reason for weighting is for when it's not: the bounds will adjust smoothly (otherwise the old width lasts until the old data ages out) and more quickly to the new spread, hopefully without too many false positives in the case where spread increases (though that's still a possibility).

Totally. Even without weighting, the shift will (very slowly!) reduce over time if we go from a high-variability regime (back) to a low-variability regime, but the time it would take to do that is quite long. And like you said: weighting will help smooth that out more quickly in a nice way, so we should do it. If we can "easily" do the weighting here, let's do it, but I don't want to get so distracted with perfecting the weighting that we delay (even by a little bit) getting out the long-awaited, more basic improvements we can get without it.

@jgehrcke
Member

jgehrcke commented Jan 16, 2023

I think we should try to

  • keep analysis as simple as it can be (and only as complex as it verifiably needs to be)
  • keep things explainable to a rather regular user

(maybe the proposals above are compatible with that perspective!)

The following paragraph is meant half serious, half funny (I have not read all of the above in full detail, and I hope my comment is not perceived as unfair!). After skimming through the paragraphs above I'd love to share an impression. In my life I've done and seen my fair share of advanced data analysis. I have the feeling (I might be wrong) that our space of business here does not really need super advanced data analysis. Before doing Bayesian Principal Component Analysis of the Fourier Transform of the exponentially weighted chi distribution (conjugated, of course 🚀!), my intuition is that we should focus on improving the raw data first (reduce noise, inspect distribution properties qualitatively, etc.), and then apply really basic means for regression detection :). I know this is provocative, and over time I'd love to learn and convince myself how delicate and intricate our regression detection algorithms really need to be.

@alistaire47
Contributor

Agreed, simple is good, provided it works. The question is how to mix data across segments when calculating the bounds so they adjust smoothly and responsively. How much it matters how well we do this depends on how big the shifts in spread between segments are: if spread mostly doesn't change much, we can do very little and be fine. If spread sometimes increases significantly, we'll see some noise alerts right after shifts unless we improve our mixing method.

Historically we've avoided these noise alerts because we include the jump in calculating the bounds, blowing them out for a while anyway. Since we're going to stop doing that, we'll soon see if we need more sophistication. And even then, just weighting for recency might be good enough!

@austin3dickey
Member Author

I'm looking deeper at the examples above. I'm not super comfortable with centering the band around the mean of the entire segment, because that's affected by future data in the same segment, which could cause weird alerts. For example, I could see a scenario where we send a regression alert, but we don't see it in the webapp later, because the band's center drifted upwards after new data came in. So I propose we center the band around the rolling mean that excludes data from before the segment started. That code might look like this:

data_proc <- data %>%
  mutate(segment = cumsum(data$step_change)) %>% 
  group_by(segment) %>% 
  mutate(
    segment_mean = mean(time),
    rolling_mean = slider::slide_dbl(time, mean, .before = 1000, .after = 0)
  ) %>%
  ungroup() %>%
  mutate(
    scaled = time - segment_mean,
    rolling_sd = slider::slide_dbl(scaled, sd, .before = 1000, .after = 0)
  )
  
ggplot(data_proc) +
  aes(x = num, y = time) +
  geom_line() +
  geom_linerange(mapping = aes(ymin = rolling_mean - rolling_sd * 5, ymax = rolling_mean + rolling_sd * 5), alpha = 0.2, color = "blue") +
  ggtitle("times plotted over iterations, with a slightly different calculation", subtitle = "the blue band is rolling_mean +/- rolling_sd*5 (or our z-score threshold)")

While we're at it, we could do the same thing with the standard deviation: use the rolling standard deviation that excludes data from before the segment started. This addresses Ed's concern about the spread looking different between two segments. That would look like this:

data_proc <- data %>%
  mutate(segment = cumsum(data$step_change)) %>% 
  group_by(segment) %>% 
  mutate(
    rolling_mean = slider::slide_dbl(time, mean, .before = 1000, .after = 0),
    rolling_sd = slider::slide_dbl(time, sd, .before = 1000, .after = 0)
  ) %>%
  ungroup()
  
ggplot(data_proc) +
  aes(x = num, y = time) +
  geom_line() +
  geom_linerange(mapping = aes(ymin = rolling_mean - rolling_sd * 5, ymax = rolling_mean + rolling_sd * 5), alpha = 0.2, color = "blue") +
  ggtitle("times plotted over iterations, with a slightly different calculation", subtitle = "the blue band is rolling_mean +/- rolling_sd*5 (or our z-score threshold)")

Of course this has the issue that the first few points per segment are missing a z-score.

Time for some plots. I used the same data-generating code you did except:

  • I changed the sd of the dummy data to 1 in all cases
  • In the big first segment, I made the data increase slightly over time, so that it shows how using the segment mean might not look great for all data points (a hypothetical sketch of this change follows the list).
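
For reference, a hypothetical reconstruction of that tweak to the first chunk of dummy data (the size of the drift is a guess, not the code actually used):

# sd bumped to 1, plus a small upward drift across the big first segment
fake_data_1 <- data.frame(
  time = rnorm(1000, mean = 5.28, sd = 1) + seq(0, 2, length.out = 1000),
  step_change = FALSE
)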

Jon's option:
[plot: band centered on the segment mean]

My option 1:
[plot: band centered on the within-segment rolling mean, rolling SD of residuals across segments]

My option 2:
[plot: band centered on the within-segment rolling mean, within-segment rolling SD]

@austin3dickey
Member Author

austin3dickey commented Jan 19, 2023

After typing that whole comment, I sorta convinced myself that maybe it's okay to use the segment mean as the center of the band, even if it uses future data. Justification:

  • We try to control for everything that might cause a shift in mean. For one segment, the mean really should stay constant. If it changes, that's cause for creating another segment. Once the new segment is created, the first segment's band should adjust such that we can see the correct alerts again, for the most part.
  • That first plot really looks nice after a "step change" compared to the other two plots. That consistency is probably worth the hand-wavy use of future data.

@jonkeane
Collaborator

I'm looking deeper at the examples above. I'm not super comfortable with centering the band around the mean of the entire segment, because that's affected by future data in the same segment, which could cause weird alerts. For example, I could see a scenario where we send a regression alert, but we don't see it in the webapp later, because the band's center drifted upwards after new data came in. So I propose we center the band around the rolling mean that excludes data from before the segment started.

Yes, absolutely. I meant to say in my previous comment that my quick-and-dirty code notably doesn't do this; it really should only ever look back, like you say here.

While we're at it, we could do the same thing with the standard deviation: use the rolling standard deviation that excludes data from before the segment started.

I'm very strongly ➖ on only using SD data from the current segment (if that's what you mean by "excludes data from before the segment started"). In the current architecture, we really should be using as much SD information as we can (especially from recent, but possibly different, segments). As we evolve our alerting mechanisms we might change this to be smarter, or include other, simpler difference calculations (that's ok!). But for now I don't think we can afford to effectively say after any deviation "we have no idea what the spread could be, so we can't possibly alert on anything" (or worse: continue to alert using samples with a small N when that might not be valid), which is effectively the current regime we are in when we blow out the SDs. One of the biggest new features of this project is (copied from https://github.com/conbench/conbench/milestone/1):

A history calculation that doesn't blow out the range of the distribution when there are large changes. (We might automatically detect these, or we might restrict this to be "only ignore changes which have been labelled regressions or improvements manually" cf #171)

Resetting the distribution at each segment is effectively similar to using the same data and waiting for it to age out after 100 commits: you're left with a period after a change where you're not confident enough to alert, even though we do actually have data that could help with that (we're just not calculating things in a way that shows it).

@alistaire47
Contributor

If the mean is drifting, something is probably wrong with the benchmark or setup.

Slightly off-topic but related in "bounds look weird": We might consider eventually calculating the upper and lower bounds separately so they're asymmetric, because it's not that unusual for benchmarks to have higher upward erraticity than downwards, and we really don't want any bounds that go negative. I may have made an issue for this at some point? I forget.
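
Not something the thread implements, but a rough sketch of one way to get asymmetric bounds: estimate separate upward and downward spreads from the signed residuals (reusing the `data_proc` frame, with its `rolling_mean` column, from the snippets above; the 5x multiplier and zero floor are illustrative):

library(dplyr)

asymmetric <- data_proc %>%
  mutate(
    resid = time - rolling_mean,
    # spread estimated separately from residuals above and below the center
    sd_up   = sqrt(cummean(pmax(resid, 0)^2)),
    sd_down = sqrt(cummean(pmin(resid, 0)^2)),
    upper   = rolling_mean + 5 * sd_up,
    lower   = pmax(rolling_mean - 5 * sd_down, 0)  # never let the lower bound go negative
  )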

But in terms of what we want bounds to look like after a shift, I think the first option is a little too perfect, and assumes more knowledge about the new segment than we actually have. So ideally they'd blow out a little (not as much or as long as the old method) to capture the new uncertainty, then zoom back in as more data is collected, kind of like these two happened to end up:
[plot image]
Bayesian methods are nice here because they let you quantify that uncertainty based on how much data you have while still pulling information from your priors, but they are maybe too fancy for where we are.

@jonkeane
Collaborator

jonkeane commented Jan 19, 2023

If the mean is drifting, something is probably wrong with the benchmark or setup.

💯 this is something we should actually be monitoring automatically, and then alerting on separately when something is up. This is a classic death-by-a-thousand-slowdowns trap where no individual change makes things perceptibly slower, but over time you've introduced enough overhead with each change that you're now 2x or 3x slower (or something is wrong with your entire benchmark itself, or with your running environment like you mention). Conbench is in some ways uniquely positioned to catch these. But that's a project for after we do this + have more facility for calculating things like the slope of the mean over time. Would you mind making an issue for this @alistaire47 so we can brainstorm there and figure out where to slot it into new project(s)?
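
Not part of this issue, but as a rough illustration of "the slope of the mean over time": a rolling least-squares slope over recent history could flag slow drift (the 100-result window is arbitrary, and `data` is the simulated frame from earlier in the thread):

library(dplyr)
library(slider)

drift <- data %>%
  mutate(
    # slope of a straight-line fit over the last 100 results (NA until the window fills)
    slope = slider::slide_dbl(
      time,
      function(y) coef(lm(y ~ seq_along(y)))[2],
      .before = 100, .after = 0, .complete = TRUE
    )
  )

# a slope that stays well above zero for many commits suggests slow drift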

@alistaire47
Contributor

#608

@austin3dickey
Member Author

Thanks for the issue Ed! I agree with you both that the middle plot is probably our best option here. It uses out-of-segment SD data but not out-of-segment (or future!) mean data. I'm working on implementing that now.

@austin3dickey austin3dickey changed the title Change Distribution queries to only use all data after the latest step change Change z-score calculations to incorporate begins_distribution_change Jan 27, 2023