Calculate and report variance in performance results #60950
base: trunk
Conversation
Size Change: 0 B · Total Size: 1.74 MB
```js
		( sum( array ) / array.length ) ** 2
	);
}

export function median( array ) {
```
In some cases we also seem to be calculating the median, for others the average 🤔
```js
@@ -72,26 +92,46 @@ export function curateResults(
	results: WPRawPerformanceResults
): WPPerformanceResults {
	const output = {
		timeToFirstByte: median( results.timeToFirstByte ),
```
This will change the reporting from `median` to `average`. Is that intentional?
> In some cases we also seem to be calculating the median, for others the average

It seems to me like a pointless inconsistency. The metrics measured don't look fundamentally different from each other. That's why I'm trying to unify the stats to use only one method.

Are you worried that changing it would invalidate the stats history? I think in our case the medians and averages should be approximately the same; any difference is just random noise. I'll check whether one of them is systematically higher than the other.
I'm just saying this would be reverting #54157, but maybe that's ok
If we remove that, maybe we should also remove the `median` function etc.
> this would be reverting #54157
That's true, we'll have to deal with that. The reason why the median is needed to "stabilize the metric" is that there are outliers, and the median is less sensitive to them than the average.
This is a real-world example of an `inserterHover` measurement with 20 data points:

```js
[ 2.60, 3.09, 2.57, 2.62, 2.21, 2.98, 2.34, 3.34, 2.37, 3.00,
  2.14, 2.16, 2.07, 2.22, 2.12, 2.27, 2.66, 2.20, 6.89, 2.99 ]
```
Almost all measurements are nicely grouped around 2.5, in a ±0.5 range, but there is one outlier at 6.89. The presence of this outlier moves the median from 2.37 to 2.47, i.e., by 0.1. The average is more sensitive: it moves from 2.52 to 2.74, by 0.22. When there is more than one outlier, the stats are distorted even more.
We can solve this by detecting and removing the outliers, on the argument that when a certain result is too far from all the others, the measurement instrument broke down and produced a garbage result. Without outliers, neither the median nor the average is distorted, and the two are closer to each other.
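For illustration, this is roughly what such a filter could look like. It's a minimal sketch using the common 1.5×IQR fence rule; both helpers are hypothetical, not code from this PR:

```js
// Hypothetical helper: lower and upper quartiles via linear interpolation.
function quartiles( array ) {
	const sorted = [ ...array ].sort( ( a, b ) => a - b );
	const at = ( p ) => {
		const pos = ( sorted.length - 1 ) * p;
		const base = Math.floor( pos );
		const next = sorted[ base + 1 ] ?? sorted[ base ];
		return sorted[ base ] + ( pos - base ) * ( next - sorted[ base ] );
	};
	return [ at( 0.25 ), at( 0.75 ) ];
}

// Hypothetical helper: drop samples outside the 1.5×IQR fences.
function removeOutliers( array ) {
	const [ q1, q3 ] = quartiles( array );
	const iqr = q3 - q1;
	return array.filter(
		( x ) => x >= q1 - 1.5 * iqr && x <= q3 + 1.5 * iqr
	);
}
```

On the `inserterHover` samples above, the fences come out at roughly [1.05, 4.15], so only the 6.89 outlier is dropped.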
My ultimate goal here is to run a significance test for each metric and produce a result like:

- on trunk the `inserterHover` metric is 2.5±0.5ms
- on this branch the `inserterHover` metric is 1.5±0.5ms
- with 95% confidence this is a real improvement, not random noise
This enables automated reporting on PRs, where the tool can confidently say "out of 100 metrics I checked, your PR improves this one, but another two are significantly worse."
All the common significance tests accept averages and variances as inputs, not medians and quartiles. That's why I'm trying to move to averages.
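For context, here is a minimal sketch of one such test, Welch's t-test, which consumes exactly the averages and variances discussed here (illustrative code, not part of this PR):

```js
// Welch's t statistic for two samples described by their
// average, variance, and sample count.
function welchT( mean1, var1, n1, mean2, var2, n2 ) {
	const se = Math.sqrt( var1 / n1 + var2 / n2 );
	const t = ( mean1 - mean2 ) / se;
	// Welch–Satterthwaite approximation of the degrees of freedom.
	const df =
		( var1 / n1 + var2 / n2 ) ** 2 /
		( ( var1 / n1 ) ** 2 / ( n1 - 1 ) +
			( var2 / n2 ) ** 2 / ( n2 - 1 ) );
	return { t, df };
}

// With ~20 samples per run, |t| greater than ~2 roughly corresponds to
// 95% confidence that the averages really differ (normal approximation).
const { t } = welchT( 2.5, 0.25, 20, 1.5, 0.25, 20 );
console.log( Math.abs( t ) > 2 ? 'significant' : 'probably noise' );
```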
Inspired by #60509, I'm adding a calculation of the variance of the measurements. Each measurement is a set of 10+ samples, so let's extract more data from them than just the average.
Knowing the variance is fundamental for determining whether a change in a metric in a given PR is significant or not.
I'm also using averages consistently everywhere instead of medians. If the model for the measurements is "a true value ± some random Gaussian error", then the average is the best estimate of the true value, not the median.
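For reference, a minimal sketch of the two statistics, consistent with the formula fragment visible in the diff above (Var(X) = E[X²] − (E[X])²). It assumes the `sum` helper already present in that file; the actual code in this PR may differ:

```js
export function average( array ) {
	return sum( array ) / array.length;
}

export function variance( array ) {
	// E[X²] − (E[X])², the population variance of the samples.
	return (
		sum( array.map( ( x ) => x ** 2 ) ) / array.length -
		( sum( array ) / array.length ) ** 2
	);
}
```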
Let's see how it works.