More Linux/Android resources for staging are needed #96864

Closed
zanderso opened this issue Jan 19, 2022 · 11 comments
Assignees
Labels
P0 Critical issues such as a build break or regression team-infra Owned by Infrastructure team

Comments

@zanderso
Member

Flaky tests and benchmarks are moved to staging and need some number of consecutive non-flaky runs before they can move back to prod. However, staging appears to be under-provisioned, so reaching that number of runs is going to take a long time. For example, Linux_android opacity_peephole_fade_transition_text_perf__e2e_summary has run on only 6 of the last 70 framework commits, and several other benchmarks appear to be receiving similar treatment. By contrast, Linux_android animated_placeholder_perf__e2e_summary has run on 50 of the last 70 commits.

Marking P2 to determine whether this is really due to insufficient resources or rather due to a bug in scheduling.
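
To make the gap concrete, here is a rough sketch of the arithmetic. The run counts are the ones quoted above; `required_runs = 10` is a hypothetical streak length (the actual threshold is not stated in this issue), and the estimate assumes runs stay spread across commits at the observed rate.

```python
def coverage(runs: int, commits: int) -> float:
    """Fraction of commits on which the test actually ran."""
    return runs / commits


def commits_needed(required_runs: int, runs: int, commits: int) -> int:
    """Commits that must land before `required_runs` runs occur,
    at the observed rate of `runs` per `commits` (rounded up)."""
    return (required_runs * commits + runs - 1) // runs


required_runs = 10  # hypothetical threshold, not taken from the issue

for name, runs in [
    ("opacity_peephole_fade_transition_text_perf__e2e_summary", 6),
    ("animated_placeholder_perf__e2e_summary", 50),
]:
    print(
        f"{name}: ran on {coverage(runs, 70):.1%} of commits, "
        f"~{commits_needed(required_runs, runs, 70)} commits for "
        f"{required_runs} runs"
    )
```

Under these assumptions, the under-scheduled benchmark needs roughly eight times as many commits to land before it can accumulate the same number of runs, and that is before counting any streak reset caused by a flaky run.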

/cc @godofredoc

@zanderso zanderso added team-infra Owned by Infrastructure team P2 labels Jan 19, 2022
@zanderso zanderso added this to New in Infra Ticket Queue via automation Jan 19, 2022
@yusuf-goog yusuf-goog moved this from New to Triaged in Infra Ticket Queue Jan 19, 2022
@keyonghan
Contributor

keyonghan commented Jan 19, 2022

We have three Linux bots running tasks in staging. Four more are idle (1 motog4 (M), 3 samsung) and are never scheduled to run tests.

A couple of things to move forward:

  1. I don't think we need to run all linux/android devicelab tests in staging (https://ci.chromium.org/p/flutter/g/devicelab_staging/console). They consume most of the resources and cause builds to queue up and run in batches. The next time we need to validate new configs or hardware, we can re-enable them for validation and skip them again afterwards.
  2. We have a KR to support high-end phones in Q1 to make sure benchmarks are collected smoothly. We need to start planning how to run tests on those new testbeds.

As a short-term workaround, we can limit the number of devicelab tests running in the staging pool, leaving room to validate the tests that are actually flaky.

@keyonghan keyonghan self-assigned this Jan 19, 2022
@keyonghan keyonghan moved this from Triaged to In progress in Infra Ticket Queue Jan 19, 2022
@keyonghan
Contributor

https://flutter-review.googlesource.com/c/infra/+/25440 to skip ~35 tests migrated from mac/android.

@keyonghan
Contributor

I believe those flaky tests are now being picked up frequently enough. Here are the top ones:
[Image: Screen Shot 2022-01-20 at 9 39 42 AM]

@godofredoc
Contributor

This issue has been mitigated by removing a subset of tests from staging.

@zanderso
Member Author

Thanks! I think we can close this as fixed, and I'll re-open or file a new one as needed.

Infra Ticket Queue automation moved this from In progress to Done Jan 20, 2022
@zanderso zanderso reopened this Jan 21, 2022
Infra Ticket Queue automation moved this from Done to In progress Jan 21, 2022
@zanderso
Member Author

It still seems like there is a capacity issue.
[Image: Screen Shot 2022-01-20 at 7 33 20 PM]

Also, are there only two Windows bots on staging? They've both been offline for 9+ hours.

[Image: Screen Shot 2022-01-20 at 7 36 07 PM]

@keyonghan
Contributor

Yeah, builds queue up quickly, especially when several commits are merged around the same time.

There are currently 3 Linux staging bots running 25 staging linux/android tests, whereas 17 Linux prod bots run 83 prod linux/android tests (current 90th-percentile queue time is 22 min; the SLO is 35 min).

It makes sense to me to migrate, say, 2 bots from prod to staging to help validate the flaky tests for now. We can expand the prod capacity when new bots are available.

For the Windows bots, I opened #97017.

@keyonghan
Contributor

Instead of migrating bots between prod and staging, https://flutter-review.googlesource.com/c/infra/+/25566 runs the devicelab staging Linux tests in A02 testbeds.
Only benchmarks and bringup:true tests now run on motoG4, which cuts 10 more tests from the motoG4 load.

@keyonghan
Contributor

Linux staging builders are now being picked up and run in a timely manner. I will monitor for a while before closing.
[Image: Screen Shot 2022-01-26 at 3 18 02 PM]

@keyonghan
Contributor

Builds are running at a frequent pace. Closing.

Infra Ticket Queue automation moved this from In progress to Done Jan 29, 2022

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 12, 2022
@flutter-triage-bot flutter-triage-bot bot added P0 Critical issues such as a build break or regression and removed P2 labels Jun 28, 2023