
Performance: Use median instead of average to stabilize First Block metric #54157

Merged
WunderBart merged 2 commits into trunk from refactor/perf-first-block-loaded-metric on Sep 5, 2023

Conversation

@WunderBart (Member) commented Sep 4, 2023

What?

The First Block Loaded metric has been unstable since the switch to Playwright. From https://www.codevitals.run/project/gutenberg:

[Screenshot: First Block Loaded chart from codevitals.run, 2023-09-04 16:55]

After looking into the raw values of the metrics, I found some spikes distorting the otherwise stable values. To address that, let's start by calculating the median instead of the average to filter out those outliers.
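For illustration, here is a minimal sketch of the difference (not the actual code in the performance results script; the sample values and helper names are made up): a single spiked sample drags the average well away from the typical value, while the median stays put.

```ts
// Illustrative helpers only - not the actual implementation in the perf results script.
function average( values: number[] ): number {
	return values.reduce( ( sum, value ) => sum + value, 0 ) / values.length;
}

function median( values: number[] ): number {
	const sorted = [ ...values ].sort( ( a, b ) => a - b );
	const middle = Math.floor( sorted.length / 2 );
	return sorted.length % 2 === 0
		? ( sorted[ middle - 1 ] + sorted[ middle ] ) / 2
		: sorted[ middle ];
}

// Made-up First Block Loaded samples (ms) with a single CI spike:
const samples = [ 120, 118, 122, 119, 480, 121 ];
console.log( average( samples ) ); // 180 - dragged up by the spike
console.log( median( samples ) ); // 120.5 - unaffected by the spike
```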

@WunderBart added the [Type] Automated Testing (testing infrastructure changes impacting the execution of end-to-end (E2E) and/or unit tests) and [Type] Performance (related to performance efforts) labels on Sep 4, 2023
@WunderBart self-assigned this on Sep 4, 2023
@WunderBart removed the [Type] Automated Testing label on Sep 4, 2023
@youknowriad (Contributor)

I think it'd be good to figure out the reason for the outliers. What's different from the Puppeteer values?

Other than that, I'm fine with moving to "median", but I'm not sure about doing it only for a single metric while leaving the rest using "average".

@github-actions (bot) commented Sep 4, 2023

Flaky tests detected in 857438c.
Some tests passed with failed attempts. The failures may not be related to this commit but are still reported for visibility. See the documentation for more information.

🔍 Workflow run URL: https://github.com/WordPress/gutenberg/actions/runs/6075034739
📝 Reported issues:

@WunderBart (Member, Author) commented Sep 4, 2023

> I think it'd be good to figure out the reason for the outliers. What's different from the Puppeteer values?

Yep, I was planning to continue investigating. The difference is that now we load each page in a new, isolated browser context. With Puppeteer, we were using a single/default context, so there was likely some sort of caching going on.
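To illustrate the difference (a hedged sketch only - the URL, sample count, and loop structure are placeholders, not the actual perf test code): with Playwright each sample runs in a fresh, isolated context, so nothing is cached between iterations, whereas the old Puppeteer setup effectively reused one default context across samples.

```ts
import { chromium } from 'playwright';

// Placeholders for illustration; the real perf suite is structured differently.
const POST_URL = 'http://localhost:8889/?p=1';
const SAMPLE_COUNT = 16;

async function measureWithIsolatedContexts() {
	const browser = await chromium.launch();
	for ( let i = 0; i < SAMPLE_COUNT; i++ ) {
		// Playwright setup: a fresh, isolated context per sample - no shared cache
		// or storage, so cold-start costs can reappear on any iteration.
		const context = await browser.newContext();
		const page = await context.newPage();
		await page.goto( POST_URL );
		// ...collect the First Block Loaded timing here...
		await context.close();
	}
	// The old Puppeteer setup effectively reused one default context for all
	// samples, so later iterations likely benefited from caching.
	await browser.close();
}

measureWithIsolatedContexts().catch( console.error );
```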

> Other than that, I'm fine with moving to "median", but I'm not sure about doing it only for a single metric while leaving the rest using "average".

To be clear, it's not the only metric we're using the median for - we already do that for TTFB and LCP. I'm not sure why only those metrics ended up being calculated via the median (maybe it was just an oversight), but from my current observations I'd be keen to move all the loading-related metrics to the median, as they all seem prone to producing these distorting spikes.

@dmsnell (Contributor) left a comment

This is more reason why distributions are going to be more helpful for us. The numbers for many metrics on my latest PR were wildly different from each other even though, since the change didn't impact any JS, I know they should all have been identical. These aren't small differences either; some are around 30%+.

I still think the min is more appropriate for some tests, since we are theoretically looking at the impact the code might have on runtime performance, not at environmental factors, and we know the minimum is the fastest the given code can run.

This should all get better anyway if we do significantly more test rounds, as the numbers should converge. What about setting up a daily task that runs the tests something like 30 times (or as many as we can fit inside GitHub's six-hour runner limit)? That has come up before as a way to get better measurements.

It would lose the close association with individual PRs, but honestly, our numbers have consistently struggled to be reliable at a small scale, so it's arguable whether that connection even exists today. If we ran daily batches with better statistical grouping, we might have a better chance of pinpointing exactly when performance improves or degrades.

@WunderBart merged commit 8e8048e into trunk on Sep 5, 2023
49 checks passed
@WunderBart deleted the refactor/perf-first-block-loaded-metric branch on Sep 5, 2023 at 08:34
@github-actions bot added this to the Gutenberg 16.7 milestone on Sep 5, 2023
@youknowriad (Contributor)

I'm personally still concerned that there's a change that we don't understand here and that the median might be hiding real issues. If we know for sure that it's cache, maybe we can ignore the first run or things like that but at least we know what we're doing. The details are super important for performance tests.

@WunderBart (Member, Author) commented Sep 5, 2023

> I'm personally still concerned that there's a change that we don't understand here and that the median might be hiding real issues. If we know for sure that it's cache, maybe we can ignore the first run or things like that but at least we know what we're doing. The details are super important for performance tests.

I will keep investigating this and try to figure out what causes those spikes in CI (I still can't repro them locally). We're already filtering out the first sample for metrics where we identified it as an outlier, but in the FBL metric those spikes occur at (seemingly) random samples, so we can't really do that here. That said, the fact that I cannot repro locally makes me think this isn't a real issue, only something specific to the CI environment.
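As a rough illustration of why the first-sample filter doesn't help here (hypothetical values, not the actual results script): dropping the first sample only works when the outlier is always the warm-up run, while the median ignores a spike wherever it lands.

```ts
// Hypothetical samples (ms) where the spike lands mid-run, not on the first sample.
const fblSamples = [ 121, 119, 465, 120, 118, 122 ];

// Dropping the first sample only helps when the outlier is the warm-up run.
const withoutFirst = fblSamples.slice( 1 ); // [ 119, 465, 120, 118, 122 ] - spike survives

// The median ignores the spike regardless of where it occurs.
const sorted = [ ...fblSamples ].sort( ( a, b ) => a - b );
const mid = sorted.length / 2;
const median =
	sorted.length % 2 === 0
		? ( sorted[ mid - 1 ] + sorted[ mid ] ) / 2
		: sorted[ Math.floor( mid ) ];

console.log( withoutFirst, median ); // spike still present after the slice; median is 120.5
```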

@youknowriad (Contributor)

> only something specific to the CI environment.

If it were specific to the CI environment, why wasn't it happening when we were using Puppeteer?

@WunderBart (Member, Author)

> only something specific to the CI environment.

> If it were specific to the CI environment, why wasn't it happening when we were using Puppeteer?

As I mentioned above, the stability of the loading metrics under Puppeteer was likely due to reusing the same browser context for every sample, which is not what we're currently doing with Playwright.

@WunderBart (Member, Author)

@youknowriad, looks like I was wrong - there was a real issue 😄 #54188
