new_gallery__crane_perf is 2.56% flaky #82745

Closed
gspencergoog opened this issue May 17, 2021 · 24 comments
Labels: c: flake, engine, P1

Comments

gspencergoog (Contributor) commented May 17, 2021

The benchmark test new_gallery__crane_perf has a flaky ratio of 2.56%, which is above our self-imposed 2% threshold.

(Creating this issue as part of the gardener rotation.)

gspencergoog added the framework and team: flakes labels May 17, 2021
zanderso (Member) commented:

All failures over the last 200 runs were caused by Gradle failing to fetch dependencies from the network. For example:

```
[new_gallery__crane_perf] [STDOUT] stderr: [ +100 ms] FAILURE: Build failed with an exception.
[new_gallery__crane_perf] [STDOUT] stderr: [        ] * What went wrong:
[new_gallery__crane_perf] [STDOUT] stderr: [        ] Execution failed for task ':package_info:extractProfileAnnotations'.
[new_gallery__crane_perf] [STDOUT] stderr: [        ] > Could not resolve all files for configuration ':package_info:lintClassPath'.
[new_gallery__crane_perf] [STDOUT] stderr: [        ]    > Could not resolve org.codehaus.groovy:groovy-all:2.4.15.
[new_gallery__crane_perf] [STDOUT] stderr: [        ]      Required by:
[new_gallery__crane_perf] [STDOUT] stderr: [        ]          project :package_info > com.android.tools.lint:lint-gradle:27.1.0
[new_gallery__crane_perf] [STDOUT] stderr: [        ]       > Could not resolve org.codehaus.groovy:groovy-all:2.4.15.
[new_gallery__crane_perf] [STDOUT] stderr: [        ]          > Could not get resource 'https://jcenter.bintray.com/org/codehaus/groovy/groovy-all/2.4.15/groovy-all-2.4.15.pom'.
[new_gallery__crane_perf] [STDOUT] stderr: [        ]             > Could not GET 'https://jcenter.bintray.com/org/codehaus/groovy/groovy-all/2.4.15/groovy-all-2.4.15.pom'.
[new_gallery__crane_perf] [STDOUT] stderr: [        ]                > Connect to jcenter.bintray.com:443 [jcenter.bintray.com/34.95.74.180] failed: connect timed out
```

We should potentially bump the priority of #19235 to address this.

blasten commented May 17, 2021

@godofredoc there was a conversation about caching these dependencies in the images. Has there been any research or work on how feasible that is, and an estimated timeline?

One concern I had is that we build many apps on these bots, and the dependency graph differs per app, so we would need some kind of tool that can capture each app's dependencies and lock their versions.

godofredoc (Contributor) commented:

Caching the dependencies in images may not be a good option. I think the previous conversation on this topic was about making the Gradle cache work.

Ideally, Gradle should be smart enough to download only the dependencies that don't already exist in a given directory, but that doesn't seem to be working. I'd recommend trying to find out what we are missing to enable caching at the Gradle level.

The only thing we currently do is set the GRADLE_USER_HOME env variable to a cached directory on the bot: https://cs.opensource.google/flutter/recipes/+/master:recipe_modules/android_sdk/api.py;l=69?q=gradle&ss=flutter%2Frecipes. This is the same setup we use to cache Xcode, the Android SDK, etc., but it seems we are still missing some pieces on the Gradle side.
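For context, Gradle picks its user home roughly as sketched below (simplified Kotlin, not Gradle's actual source): when GRADLE_USER_HOME is unset it falls back to ~/.gradle, which on a hermetic bot starts empty on every run, so nothing gets reused.

```kotlin
import java.io.File

// Simplified sketch (not Gradle's actual source) of how Gradle picks its
// user home: the GRADLE_USER_HOME environment variable wins; otherwise it
// falls back to ~/.gradle, which on a hermetic bot is empty every run.
fun gradleUserHome(): File {
    val override = System.getenv("GRADLE_USER_HOME")
    return File(override ?: "${System.getProperty("user.home")}/.gradle")
}

fun main() {
    // The dependency caches the bots want to reuse live under caches/.
    println(File(gradleUserHome(), "caches"))
}
```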

blasten commented May 17, 2021

Ideally, Gradle should be smart enough to download only the dependencies that don't already exist in a given directory, but that doesn't seem to be working.

Are all of the LUCI VMs stateful? My understanding was that only the Mac bots are.

godofredoc (Contributor) commented:

LUCI uses hermetic builds for everything. The cache folder is an exception, where we keep things to reuse between builds.

blasten commented May 17, 2021

I see. How is the cache folder restored? For example, GitHub Actions suggests restoring based on the OS and a content hash of the .gradle and .properties files.
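For reference, the GitHub Actions scheme blasten describes can be sketched as below. This is a hypothetical Kotlin helper, not the actual actions/cache implementation (which uses its own hashFiles() expression): the key combines the OS name with a content hash of the build scripts, so it changes exactly when the dependency declarations change.

```kotlin
import java.io.File
import java.security.MessageDigest

// Hypothetical sketch of a GitHub-Actions-style cache key: the OS name plus
// a SHA-256 over the build scripts and properties files, so the key changes
// whenever the dependency declarations change.
fun gradleCacheKey(projectRoot: File): String {
    val digest = MessageDigest.getInstance("SHA-256")
    projectRoot.walkTopDown()
        .filter { it.isFile }
        .filter { it.name.endsWith(".gradle") || it.name.endsWith(".properties") }
        .sortedBy { it.path }  // stable order -> stable hash
        .forEach { digest.update(it.readBytes()) }
    val hash = digest.digest().joinToString("") { "%02x".format(it) }
    return "${System.getProperty("os.name")}-gradle-$hash"
}

fun main() {
    println(gradleCacheKey(File(".")))
}
```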

godofredoc (Contributor) commented:

It is done automatically by LUCI. Behind the scenes, when a task completes, it takes every subfolder under <cache_folder> and hashes them into an internal cache folder. The next task creates a chroot and copies the content into the chroot's cache folder. From the task's point of view, the folder exists at the cache location, populated with the previous content.
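A toy model of that cycle, with hypothetical names (LUCI's real implementation is internal):

```kotlin
import java.io.File
import java.security.MessageDigest

// Toy model of the save/restore cycle described above; all names are
// hypothetical. After a task, each subfolder of the cache is hashed and
// snapshotted; before the next task, the snapshot is copied back so the
// task sees a pre-populated cache folder inside its chroot.
class BotCache(private val snapshots: File) {

    fun saveAfterTask(cacheDir: File): List<String> =
        (cacheDir.listFiles() ?: emptyArray())
            .filter { it.isDirectory }
            .map { sub ->
                val key = dirHash(sub)
                val dest = File(snapshots, key)
                if (!dest.exists()) sub.copyRecursively(dest)
                key
            }

    fun restoreBeforeTask(key: String, chrootCache: File) {
        File(snapshots, key).copyRecursively(chrootCache, overwrite = true)
    }

    private fun dirHash(dir: File): String {
        val digest = MessageDigest.getInstance("SHA-256")
        dir.walkTopDown().filter { it.isFile }.sortedBy { it.path }
            .forEach { digest.update(it.readBytes()) }
        return digest.digest().joinToString("") { "%02x".format(it) }
    }
}
```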

godofredoc (Contributor) commented:

The only difference from what GitHub Actions suggests is that instead of using $HOME/.gradle, we point GRADLE_USER_HOME at the /cache folder.

blasten commented May 18, 2021

Behind the scenes, when a task completes, it takes every subfolder under <cache_folder> and hashes them into an internal cache folder.

I'm still not sure how the cache is restored. How is the key computed? If the same cache is reused, I would need to know what forms the cache key.

There are some potential issues, but I'm not sure if any apply:

  1. If the cache key is derived from the source code checked out from GitHub, then the hit rate will be 0 (ignoring empty commits).
  2. If the cache key doesn't depend on the source code, can two or more bots write and read the same object at the same time? Follow-up question: is the cache stored in a cloud bucket? Disk multi-writer mode seems to be in preview; do we use it?

@godofredoc

godofredoc (Contributor) commented May 18, 2021

Answers to the potential issues list:

  1. Cache keys are not stored in GitHub. The cache mechanism is independent of the source code; it is based on the content of the folder at the time the task execution completes.
  2. When the task is reading/updating the cache, it is already a local folder, mounted from the state of a previous task based on a hash calculated over all the files in a given folder.

An example workflow using Gradle:

  • A previous cache from the bot is mounted to <bot_chroot>/cache/gradle. LUCI handles this automatically, calculating hashes and saving to isolated.
  • GRADLE_USER_HOME is set to <bot_chroot>/cache/gradle.
  • Code is checked out and the Gradle task is run. Since many files are already in <bot_chroot>/cache/gradle/caches, the expectation is that Gradle downloads only what is not yet available in the cache (replacing files should not happen if Gradle handles its caches properly).
  • The task runs and completes.
  • LUCI calculates hashes for the files in <bot_chroot>/cache/gradle and uploads the new versions to isolated. There are no clashes, as isolated optimizes uploads/downloads using hashes: a new file is uploaded only if a file with the same hash does not already exist on the server (see the sketch below).

For the Gradle use case, the only problem I can see is the cache growing so much that we have to manually trim it. The current max size is 10 GB.
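A minimal sketch of the dedup property mentioned in the last bullet, with hypothetical names rather than the real isolated API: because blobs are keyed by content hash, two bots uploading identical files converge on one stored copy, so there is nothing to clash over.

```kotlin
import java.io.File
import java.security.MessageDigest
import java.util.concurrent.ConcurrentHashMap

// Hypothetical in-memory stand-in for an isolated-style blob store, showing
// why concurrent uploads don't clash: a blob is keyed by its content hash,
// so two bots uploading the same dependency jar converge on one stored copy.
object BlobStore {
    private val blobs = ConcurrentHashMap<String, ByteArray>()

    fun upload(file: File): String {
        val bytes = file.readBytes()
        val key = sha256(bytes)
        blobs.putIfAbsent(key, bytes)  // a second identical upload is a no-op
        return key
    }

    private fun sha256(bytes: ByteArray): String =
        MessageDigest.getInstance("SHA-256").digest(bytes)
            .joinToString("") { "%02x".format(it) }
}
```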

blasten commented May 18, 2021

For the Gradle use case, the only problem I can see is the cache growing so much that we have to manually trim it. The current max size is 10 GB.

How do we set this limit?

A previous cache from the bot is mounted to <bot_chroot>/cache/gradle.

If two bots finish at the same time, which one wins? What if one finishes first but takes longer to upload, so it ends up overwriting the other?

How difficult would it be to move to a model where we restore the Gradle cache on a per-project basis, with the key formed by hashing the .gradle and lockfile contents? Similar to the GitHub Actions model.

godofredoc added a commit to godofredoc/infra that referenced this issue May 18, 2021
This is to remove failures downloading gradle dependencies and speeding
up the builds.

Bug: flutter/flutter#82745
godofredoc (Contributor) commented:

The 10 GB max is the default limit set by the LUCI bot.

There is no winner; the new files are uploaded independently and the top-level hashes are recalculated.

blasten commented May 18, 2021

Ah, if it uses Merkle trees or something similar, then it's a different story.

At the moment, the action item for me is to lock all the dependencies so they resolve to the same versions locally and remotely.
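For reference, Gradle ships built-in dependency locking that can pin resolved versions; below is a minimal build.gradle.kts sketch using the standard Gradle API (whether it slots cleanly into the gallery apps' multi-project builds is untested here).

```kotlin
// build.gradle.kts -- minimal dependency-locking setup (standard Gradle API).
// With locking active, every configuration resolves to the versions recorded
// in the checked-in gradle.lockfile, locally and on the bots alike.
dependencyLocking {
    lockAllConfigurations()
}
```

Running ./gradlew dependencies --write-locks then records the current resolution into gradle.lockfile files that can be committed.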

godofredoc (Contributor) commented:

Sounds good, thanks!

jmagman (Member) commented May 24, 2021

The new_gallery__crane_perf 15-day flake ratio is back under 2%, and the flaky and failed count over the last 50 runs is 0. Turning it back on.

Leaving this issue open to track the Gradle cache discussion.

Hixie (Contributor) commented May 25, 2021

What is still actionable here?

blasten removed their assignment May 25, 2021
blasten added P1 and removed P1 labels May 25, 2021
blasten commented May 25, 2021

Lowering the priority, and assigning this to Godofredo to answer the question about querying the latest executions.

@godofredoc what is the preferred way of querying the latest executions? I saw LUCI has a section on how to use BigQuery to get that data.

blasten commented May 25, 2021

Note that the documentation I found is for Chromium.

godofredoc (Contributor) commented:

@keyonghan created a dashboard to track flakiness: https://dashboards.corp.google.com/flutter_prod_flakiness_overview_dashboard

There is a single failure in the last 100 runs, and it was from before the changes landed (screenshot omitted).

godofredoc (Contributor) commented:

@blasten is there an easy way to get all the tests that use Gradle? I'd like to take a look at their flakiness levels for the last 7 days.

blasten commented May 25, 2021

@godofredoc that's a good question. Some tests have gradle_ in the name, but definitely not all of them. Should we have a naming convention? It would be fantastic to be able to filter by tag or label.

godofredoc (Contributor) commented:

That would be great. If you can help us identify them, we can potentially rename or tag them. I'm really interested in checking how many Gradle-related flakiness issues we can close now.

godofredoc (Contributor) commented:

I'm closing this bug and opening a new one for grouping tests

keyonghan added the engine label and removed the framework label May 26, 2021
keyonghan added the c: flake label Jun 18, 2021
github-actions bot commented:

This thread has been automatically locked since there has not been any recent activity after it was closed. If you are still experiencing a similar issue, please open a new bug, including the output of flutter doctor -v and a minimal reproduction of the issue.

github-actions bot locked as resolved and limited the conversation to collaborators Jul 31, 2021