new_gallery__crane_perf is 2.56% flaky #82745

Closed
gspencergoog opened this issue May 17, 2021 · 24 comments
Labels: c: flake, engine, P1

Comments

gspencergoog (Contributor) commented May 17, 2021

The benchmark test new_gallery__crane_perf has a flaky ratio of 2.56%, which is above our self-imposed 2% threshold.

(Creating this issue as part of the gardener rotation.)

gspencergoog added the framework and team: flakes labels May 17, 2021
zanderso (Member) commented:

All failures over the last 200 runs were caused by Gradle failing to fetch dependencies from the network. For example:

```
[new_gallery__crane_perf] [STDOUT] stderr: [ +100 ms] FAILURE: Build failed with an exception.
[new_gallery__crane_perf] [STDOUT] stderr: [        ] * What went wrong:
[new_gallery__crane_perf] [STDOUT] stderr: [        ] Execution failed for task ':package_info:extractProfileAnnotations'.
[new_gallery__crane_perf] [STDOUT] stderr: [        ] > Could not resolve all files for configuration ':package_info:lintClassPath'.
[new_gallery__crane_perf] [STDOUT] stderr: [        ]    > Could not resolve org.codehaus.groovy:groovy-all:2.4.15.
[new_gallery__crane_perf] [STDOUT] stderr: [        ]      Required by:
[new_gallery__crane_perf] [STDOUT] stderr: [        ]          project :package_info > com.android.tools.lint:lint-gradle:27.1.0
[new_gallery__crane_perf] [STDOUT] stderr: [        ]       > Could not resolve org.codehaus.groovy:groovy-all:2.4.15.
[new_gallery__crane_perf] [STDOUT] stderr: [        ]          > Could not get resource 'https://jcenter.bintray.com/org/codehaus/groovy/groovy-all/2.4.15/groovy-all-2.4.15.pom'.
[new_gallery__crane_perf] [STDOUT] stderr: [        ]             > Could not GET 'https://jcenter.bintray.com/org/codehaus/groovy/groovy-all/2.4.15/groovy-all-2.4.15.pom'.
[new_gallery__crane_perf] [STDOUT] stderr: [        ]                > Connect to jcenter.bintray.com:443 [jcenter.bintray.com/34.95.74.180] failed: connect timed out
```

We should potentially bump the priority of #19235 to address this.

blasten commented May 17, 2021

@godofredoc there was a conversation about caching these dependencies in the images. Has there been any research or work on how feasible that is, and an estimated timeline?

One concern I had is that we build many apps on these bots, and the dependency graph differs per app, so we would need some kind of tool that can capture each app's dependencies and lock their versions.

godofredoc (Contributor) commented:

Caching the dependencies in images may not be a good option. I think the previous conversation on this topic was about making the Gradle cache work.

Ideally, Gradle should be smart enough to download only the dependencies that don't already exist in a given directory, but that doesn't seem to be working. I'd recommend trying to find out what we are missing to enable caching at the Gradle level.

The only thing we currently do is set the GRADLE_USER_HOME env variable to a cached directory on the bot: https://cs.opensource.google/flutter/recipes/+/master:recipe_modules/android_sdk/api.py;l=69?q=gradle&ss=flutter%2Frecipes. This is the same setup we use to cache Xcode, the Android SDK, etc., but it seems we are still missing some pieces on the Gradle side.
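For context, Gradle picks its user home roughly as sketched below (simplified Kotlin, not Gradle's actual source): when GRADLE_USER_HOME is unset it falls back to ~/.gradle, which on a hermetic bot starts empty on every run, so nothing gets reused.

```kotlin
import java.io.File

// Simplified sketch (not Gradle's actual source) of how Gradle picks its
// user home: the GRADLE_USER_HOME environment variable wins; otherwise it
// falls back to ~/.gradle, which on a hermetic bot is empty every run.
fun gradleUserHome(): File {
    val override = System.getenv("GRADLE_USER_HOME")
    return File(override ?: "${System.getProperty("user.home")}/.gradle")
}

fun main() {
    // The dependency caches the bots want to reuse live under caches/.
    println(File(gradleUserHome(), "caches"))
}
```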

blasten commented May 17, 2021

Ideally, Gradle should be smart enough to download only the dependencies that don't already exist in a given directory, but that doesn't seem to be working.

Are all of the LUCI VMs stateful? My understanding was that only the Mac bots are.

godofredoc (Contributor) commented:

LUCI uses hermetic builds for everything. The cache folder is an exception, where we keep things to reuse between builds.

blasten commented May 17, 2021

I see. How is the cache folder restored? For example, GitHub Actions suggests restoring based on the OS and a content hash of the .gradle and .properties files.
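For reference, the GitHub Actions scheme blasten describes can be sketched as below. This is a hypothetical Kotlin helper, not the actual actions/cache implementation (which uses its own hashFiles() expression): the key combines the OS name with a content hash of the build scripts, so it changes exactly when the dependency declarations change.

```kotlin
import java.io.File
import java.security.MessageDigest

// Hypothetical sketch of a GitHub-Actions-style cache key: the OS name plus
// a SHA-256 over the build scripts and properties files, so the key changes
// whenever the dependency declarations change.
fun gradleCacheKey(projectRoot: File): String {
    val digest = MessageDigest.getInstance("SHA-256")
    projectRoot.walkTopDown()
        .filter { it.isFile }
        .filter { it.name.endsWith(".gradle") || it.name.endsWith(".properties") }
        .sortedBy { it.path }  // stable order -> stable hash
        .forEach { digest.update(it.readBytes()) }
    val hash = digest.digest().joinToString("") { "%02x".format(it) }
    return "${System.getProperty("os.name")}-gradle-$hash"
}

fun main() {
    println(gradleCacheKey(File(".")))
}
```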

godofredoc (Contributor) commented:

It is done automatically by LUCI. Behind the scenes, when a task completes, it takes every subfolder under <cache_folder> and hashes them into an internal cache folder. The next task creates a chroot and copies the content into the chroot's cache folder. From the task's point of view, the folder exists at the cache location, populated with the previous content.
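A toy model of that cycle, with hypothetical names (LUCI's real implementation is internal):

```kotlin
import java.io.File
import java.security.MessageDigest

// Toy model of the save/restore cycle described above; all names are
// hypothetical. After a task, each subfolder of the cache is hashed and
// snapshotted; before the next task, the snapshot is copied back so the
// task sees a pre-populated cache folder inside its chroot.
class BotCache(private val snapshots: File) {

    fun saveAfterTask(cacheDir: File): List<String> =
        (cacheDir.listFiles() ?: emptyArray())
            .filter { it.isDirectory }
            .map { sub ->
                val key = dirHash(sub)
                val dest = File(snapshots, key)
                if (!dest.exists()) sub.copyRecursively(dest)
                key
            }

    fun restoreBeforeTask(key: String, chrootCache: File) {
        File(snapshots, key).copyRecursively(chrootCache, overwrite = true)
    }

    private fun dirHash(dir: File): String {
        val digest = MessageDigest.getInstance("SHA-256")
        dir.walkTopDown().filter { it.isFile }.sortedBy { it.path }
            .forEach { digest.update(it.readBytes()) }
        return digest.digest().joinToString("") { "%02x".format(it) }
    }
}
```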

godofredoc (Contributor) commented:

The only difference from what GitHub Actions suggests is that instead of using $HOME/.gradle, we point GRADLE_USER_HOME at the /cache folder.

blasten commented May 18, 2021

Behind the scenes, when a task completes, it takes every subfolder under <cache_folder> and hashes them into an internal cache folder.

I'm still not sure how the cache is restored. How is the key computed? If the same cache is reused, I would need to know what forms the cache key.

There are some potential issues, but I'm not sure if any apply:

  1. If the cache key is derived from the source code checked out from GitHub, then the hit rate will be 0 (ignoring empty commits).
  2. If the cache key doesn't depend on the source code, can two or more bots write and read the same object at the same time? Follow-up question: is the cache stored in a cloud bucket? Disk multi-writer mode seems to be in preview; do we use it?

@godofredoc

godofredoc (Contributor) commented May 18, 2021

Answers to the potential issues list:

  1. Cache keys are not stored in GitHub. The cache mechanism is independent of the source code; it is based on the content of the folder at the time the task execution completes.
  2. When the task is reading/updating the cache, it is already a local folder, mounted from the state of a previous task based on a hash calculated over all the files in a given folder.

An example workflow using Gradle:

  • A previous cache from the bot is mounted to <bot_chroot>/cache/gradle. LUCI handles this automatically, calculating hashes and saving to isolated.
  • GRADLE_USER_HOME is set to <bot_chroot>/cache/gradle.
  • Code is checked out and the Gradle task is run. Since many files are already in <bot_chroot>/cache/gradle/caches, the expectation is that Gradle downloads only what is not yet available in the cache (replacing files should not happen if Gradle handles its caches properly).
  • The task runs and completes.
  • LUCI calculates hashes for the files in <bot_chroot>/cache/gradle and uploads the new versions to isolated. There are no clashes, as isolated optimizes uploads/downloads using hashes: a new file is uploaded only if a file with the same hash does not already exist on the server (see the sketch below).

For the Gradle use case, the only problem I can see is the cache growing so much that we have to manually trim it. The current max size is 10 GB.
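A minimal sketch of the dedup property mentioned in the last bullet, with hypothetical names rather than the real isolated API: because blobs are keyed by content hash, two bots uploading identical files converge on one stored copy, so there is nothing to clash over.

```kotlin
import java.io.File
import java.security.MessageDigest
import java.util.concurrent.ConcurrentHashMap

// Hypothetical in-memory stand-in for an isolated-style blob store, showing
// why concurrent uploads don't clash: a blob is keyed by its content hash,
// so two bots uploading the same dependency jar converge on one stored copy.
object BlobStore {
    private val blobs = ConcurrentHashMap<String, ByteArray>()

    fun upload(file: File): String {
        val bytes = file.readBytes()
        val key = sha256(bytes)
        blobs.putIfAbsent(key, bytes)  // a second identical upload is a no-op
        return key
    }

    private fun sha256(bytes: ByteArray): String =
        MessageDigest.getInstance("SHA-256").digest(bytes)
            .joinToString("") { "%02x".format(it) }
}
```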

blasten commented May 18, 2021

For the Gradle use case, the only problem I can see is the cache growing so much that we have to manually trim it. The current max size is 10 GB.

How do we set this limit?

A previous cache from the bot is mounted to <bot_chroot>/cache/gradle.

If two bots finish at the same time, which one wins? What if one finishes first but takes longer to upload, so it ends up overwriting the other?

How difficult would it be to move to a model where we restore the Gradle cache on a per-project basis, with the key formed by hashing the .gradle and lockfile contents? Similar to the GitHub Actions model.

godofredoc added a commit to godofredoc/infra that referenced this issue May 18, 2021
This is to remove failures downloading gradle dependencies and speeding
up the builds.

Bug: flutter/flutter#82745
godofredoc (Contributor) commented:

The 10 GB max is the default limit set by the LUCI bot.

There is no winner; the new files are uploaded independently and the top-level hashes are recalculated.

blasten commented May 18, 2021

Ah, if it uses Merkle trees or something similar, then it's a different story.

At the moment, the action item for me is to lock all the dependencies so they resolve to the same versions locally and remotely.
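For reference, Gradle ships built-in dependency locking that can pin resolved versions; below is a minimal build.gradle.kts sketch using the standard Gradle API (whether it slots cleanly into the gallery apps' multi-project builds is untested here).

```kotlin
// build.gradle.kts -- minimal dependency-locking setup (standard Gradle API).
// With locking active, every configuration resolves to the versions recorded
// in the checked-in gradle.lockfile, locally and on the bots alike.
dependencyLocking {
    lockAllConfigurations()
}
```

Running ./gradlew dependencies --write-locks then records the current resolution into gradle.lockfile files that can be committed.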

godofredoc (Contributor) commented:

Sounds good, thanks!

jmagman (Member) commented May 24, 2021

The new_gallery__crane_perf 15-day flake ratio is back under 2%, and the flaky and failed count over the last 50 runs is 0. Turning it back on.

Leaving this issue open to track the Gradle cache discussion.

Hixie (Contributor) commented May 25, 2021

What is still actionable here?

blasten removed their assignment May 25, 2021
blasten added P1 and removed P1 labels May 25, 2021
blasten commented May 25, 2021

Lowering the priority, and assigning this to Godofredo to answer the question about querying the latest executions.

@godofredoc what is the preferred way of querying the latest executions? I saw LUCI has a section on how to use BigQuery to get that data.

blasten commented May 25, 2021

Note that the documentation I found is for Chromium.

godofredoc (Contributor) commented:

@keyonghan created a dashboard to track flakiness: https://dashboards.corp.google.com/flutter_prod_flakiness_overview_dashboard

There is a single failure in the last 100 runs, and it was from before the changes landed (screenshot omitted).

godofredoc (Contributor) commented:

@blasten is there an easy way to get all the tests that use Gradle? I'd like to take a look at their flakiness levels for the last 7 days.

blasten commented May 25, 2021

@godofredoc that's a good question. Some tests have gradle_ in the name, but definitely not all of them. Should we have a naming convention? It would be fantastic to be able to filter by tag or label.

godofredoc (Contributor) commented:

That would be great. If you can help us identify them, we can potentially rename or tag them. I'm really interested in checking how many Gradle-related flakiness issues we can close now.

godofredoc (Contributor) commented:

I'm closing this bug and opening a new one for grouping tests

keyonghan added the engine label and removed the framework label May 26, 2021
keyonghan added the c: flake label Jun 18, 2021
github-actions bot commented:

This thread has been automatically locked since there has not been any recent activity after it was closed. If you are still experiencing a similar issue, please open a new bug, including the output of flutter doctor -v and a minimal reproduction of the issue.

github-actions bot locked as resolved and limited the conversation to collaborators Jul 31, 2021