
Performance: Use median instead of average to stabilize First Block metric #54157

Merged
WunderBart merged 2 commits into trunk from refactor/perf-first-block-loaded-metric on Sep 5, 2023

Conversation

@WunderBart (Member) commented Sep 4, 2023

What?

The First Block Loaded metric has been unstable since the switch to Playwright. From https://www.codevitals.run/project/gutenberg:

[Screenshot: First Block Loaded chart from codevitals.run, 2023-09-04 16:55]

After looking into the raw values of the metrics, I found some spikes distorting the otherwise stable values. To address that, let's start by calculating the median instead of the average to filter out those outliers.
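For illustration, here is a minimal sketch of the difference (not the actual code in the performance results script; the sample values and helper names are made up): a single spiked sample drags the average well away from the typical value, while the median stays put.

```ts
// Illustrative helpers only - not the actual implementation in the perf results script.
function average( values: number[] ): number {
	return values.reduce( ( sum, value ) => sum + value, 0 ) / values.length;
}

function median( values: number[] ): number {
	const sorted = [ ...values ].sort( ( a, b ) => a - b );
	const middle = Math.floor( sorted.length / 2 );
	return sorted.length % 2 === 0
		? ( sorted[ middle - 1 ] + sorted[ middle ] ) / 2
		: sorted[ middle ];
}

// Made-up First Block Loaded samples (ms) with a single CI spike:
const samples = [ 120, 118, 122, 119, 480, 121 ];
console.log( average( samples ) ); // 180 - dragged up by the spike
console.log( median( samples ) ); // 120.5 - unaffected by the spike
```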

@WunderBart added the [Type] Automated Testing (testing infrastructure changes impacting the execution of end-to-end (E2E) and/or unit tests) and [Type] Performance (related to performance efforts) labels on Sep 4, 2023
@WunderBart self-assigned this on Sep 4, 2023
@WunderBart removed the [Type] Automated Testing label on Sep 4, 2023
@youknowriad (Contributor)

I think it'd be good to figure out the reason for the outliers. What's different from the Puppeteer values?

Other than that, I'm fine with moving to "median", but I'm not sure about doing it only for a single metric while leaving the rest using "average".

@github-actions (bot) commented Sep 4, 2023

Flaky tests detected in 857438c.
Some tests passed with failed attempts. The failures may not be related to this commit but are still reported for visibility. See the documentation for more information.

🔍 Workflow run URL: https://github.com/WordPress/gutenberg/actions/runs/6075034739
📝 Reported issues:

@WunderBart (Member, Author) commented Sep 4, 2023

> I think it'd be good to figure out the reason for the outliers. What's different from the Puppeteer values?

Yep, I was planning to continue investigating. The difference is that now we load each page in a new, isolated browser context. With Puppeteer, we were using a single/default context, so there was likely some sort of caching going on.
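To illustrate the difference (a hedged sketch only - the URL, sample count, and loop structure are placeholders, not the actual perf test code): with Playwright each sample runs in a fresh, isolated context, so nothing is cached between iterations, whereas the old Puppeteer setup effectively reused one default context across samples.

```ts
import { chromium } from 'playwright';

// Placeholders for illustration; the real perf suite is structured differently.
const POST_URL = 'http://localhost:8889/?p=1';
const SAMPLE_COUNT = 16;

async function measureWithIsolatedContexts() {
	const browser = await chromium.launch();
	for ( let i = 0; i < SAMPLE_COUNT; i++ ) {
		// Playwright setup: a fresh, isolated context per sample - no shared cache
		// or storage, so cold-start costs can reappear on any iteration.
		const context = await browser.newContext();
		const page = await context.newPage();
		await page.goto( POST_URL );
		// ...collect the First Block Loaded timing here...
		await context.close();
	}
	// The old Puppeteer setup effectively reused one default context for all
	// samples, so later iterations likely benefited from caching.
	await browser.close();
}

measureWithIsolatedContexts().catch( console.error );
```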

> Other than that, I'm fine with moving to "median", but I'm not sure about doing it only for a single metric while leaving the rest using "average".

To be clear, it's not the only metric we're using the median for - we already do that for TTFB and LCP. I'm not sure why only those metrics ended up being calculated via the median (maybe it was just an oversight), but from my current observations I'd be keen to move all the loading-related metrics to the median, as they all seem prone to producing these distorting spikes.

@dmsnell (Contributor) left a comment

This is more reason why distributions are going to be more helpful for us. The numbers for many metrics on my latest PR were wildly different from each other even though, since the change didn't impact any JS, I know they should all have been identical. These aren't small differences either; some are around 30%+.

I still think the min is more appropriate for some tests, since we are theoretically looking at the impact the code might have on runtime performance, not at environmental factors, and we know the minimum is the fastest the given code can run.

This should all get better anyway if we do significantly more test rounds, as the numbers should converge. What about setting up a daily task that runs the tests something like 30 times (or as many as we can fit inside GitHub's six-hour runner limit)? That has come up before as a way to get better measurements.

It would lose the close association with individual PRs, but honestly, our numbers have consistently struggled to be reliable at a small scale, so it's arguable whether that connection even exists today. If we ran daily batches with better statistical grouping, we might have a better chance of pinpointing exactly when performance improves or degrades.

@WunderBart merged commit 8e8048e into trunk on Sep 5, 2023
49 checks passed
@WunderBart deleted the refactor/perf-first-block-loaded-metric branch on Sep 5, 2023 at 08:34
@github-actions bot added this to the Gutenberg 16.7 milestone on Sep 5, 2023
@youknowriad (Contributor)

I'm personally still concerned that there's a change that we don't understand here and that the median might be hiding real issues. If we know for sure that it's cache, maybe we can ignore the first run or things like that but at least we know what we're doing. The details are super important for performance tests.

@WunderBart (Member, Author) commented Sep 5, 2023

> I'm personally still concerned that there's a change that we don't understand here and that the median might be hiding real issues. If we know for sure that it's cache, maybe we can ignore the first run or things like that but at least we know what we're doing. The details are super important for performance tests.

I will keep investigating this and try to figure out what causes those spikes in CI (I still can't repro them locally). We're already filtering out the first sample for metrics where we identified it as an outlier, but in the FBL metric those spikes occur at (seemingly) random samples, so we can't really do that here. That said, the fact that I cannot repro locally makes me think this isn't a real issue, only something specific to the CI environment.
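As a rough illustration of why the first-sample filter doesn't help here (hypothetical values, not the actual results script): dropping the first sample only works when the outlier is always the warm-up run, while the median ignores a spike wherever it lands.

```ts
// Hypothetical samples (ms) where the spike lands mid-run, not on the first sample.
const fblSamples = [ 121, 119, 465, 120, 118, 122 ];

// Dropping the first sample only helps when the outlier is the warm-up run.
const withoutFirst = fblSamples.slice( 1 ); // [ 119, 465, 120, 118, 122 ] - spike survives

// The median ignores the spike regardless of where it occurs.
const sorted = [ ...fblSamples ].sort( ( a, b ) => a - b );
const mid = sorted.length / 2;
const median =
	sorted.length % 2 === 0
		? ( sorted[ mid - 1 ] + sorted[ mid ] ) / 2
		: sorted[ Math.floor( mid ) ];

console.log( withoutFirst, median ); // spike still present after the slice; median is 120.5
```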

@youknowriad (Contributor)

> only something specific to the CI environment.

If it were specific to the CI environment, why wasn't it happening when we were using Puppeteer?

@WunderBart (Member, Author)

> only something specific to the CI environment.

> If it were specific to the CI environment, why wasn't it happening when we were using Puppeteer?

As I mentioned above, the stability of the loading metrics under Puppeteer was likely due to reusing the same browser context for every sample, which is not what we're currently doing with Playwright.

@WunderBart (Member, Author)

@youknowriad, looks like I was wrong - there was a real issue 😄 #54188
