new_gallery__crane_perf is 2.56% flaky #82745
All failures over the last 200 runs were caused by Gradle failing to fetch dependencies from the network. For example:
We should potentially bump the priority of #19235 to address this.
@godofredoc there was a conversation about caching these dependencies in the images. Has there been any research or work on how feasible that is, and an estimated timeline? One concern I have is that we build many apps on these bots, and the dependency graph differs between them, so we would need some kind of tool that can capture the dependencies from the apps and also lock their versions.
Caching the dependencies in images may not be a good option. I think the previous conversation on this topic was about making the Gradle cache work. Ideally, Gradle should be smart enough to only download dependencies that do not already exist in a given directory, but that does not seem to be working. I'd recommend trying to find out what we are missing to enable caching at the Gradle level. The only thing we currently do is set the GRADLE_USER_HOME env variable to a cached directory on the bot: https://cs.opensource.google/flutter/recipes/+/master:recipe_modules/android_sdk/api.py;l=69?q=gradle&ss=flutter%2Frecipes. This is the same setup we use to cache Xcode, the Android SDK, etc., but unfortunately it seems we are still missing some bits on the Gradle side.
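A minimal sketch of what that recipe setup amounts to; the helper name and the `/cache` path are assumptions for illustration (the real logic lives in the linked `api.py`):

```python
import os

def gradle_env(cache_root):
    """Return env vars that redirect Gradle's user home into the bot-level
    cache directory, so downloaded dependencies survive across builds.
    (Hypothetical helper; mirrors the GRADLE_USER_HOME setup in the recipe.)"""
    return {'GRADLE_USER_HOME': os.path.join(cache_root, 'gradle')}

env = gradle_env('/cache')
# The build step would then run Gradle with this env merged in.
```

With this in place, Gradle writes its dependency cache under `/cache/gradle` instead of `$HOME/.gradle`, which is the directory LUCI preserves between builds.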
Are all of the LUCI VMs stateful? My understanding was that only the Mac bots are.
LUCI uses hermetic builds for everything. The cache folder is an exception, where we keep things to reuse between builds.
I see. How is the cache folder restored? For example, GitHub Actions suggests restoring based on the OS and a content hash of the .gradle and .properties files.
It is done automatically by LUCI. Behind the scenes, when a task completes, it grabs every subfolder from <cache_folder> and hashes them into an internal cache folder. The next task creates a chroot and copies the content into the chroot's cache folder. From the task's point of view, the folder exists in the cache location, populated with the previous content.
The only difference from what GitHub Actions suggests is that instead of using $HOME/.gradle, we set GRADLE_USER_HOME to the /cache folder instead.
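That save/restore cycle can be sketched roughly as follows. All function names here are assumptions; LUCI's actual implementation is internal to the bots:

```python
import hashlib
import pathlib
import shutil

def dir_digest(path):
    """Hash a directory tree deterministically (relative names + file bytes)."""
    h = hashlib.sha256()
    root = pathlib.Path(path)
    for f in sorted(root.rglob('*')):
        h.update(str(f.relative_to(root)).encode())
        if f.is_file():
            h.update(f.read_bytes())
    return h.hexdigest()

def save_cache(cache_folder, store):
    """On task completion: hash each cache subfolder and copy it into a
    content-addressed store; unchanged content is naturally deduplicated."""
    for sub in pathlib.Path(cache_folder).iterdir():
        dest = pathlib.Path(store) / dir_digest(sub)
        if not dest.exists():
            shutil.copytree(sub, dest)

def restore_cache(store, digest, cache_folder, name):
    """Before the next task: copy stored content back into the chroot's cache."""
    shutil.copytree(pathlib.Path(store) / digest,
                    pathlib.Path(cache_folder) / name)
```

The key property is that the store is keyed by content, so two tasks uploading identical cache folders converge on a single stored copy.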
I'm still not sure how the cache is restored. How is the key computed? If the same cache is reused, I would need to know what forms the cache key. There are some potential issues, but I'm not sure if any apply:
Answers to the potential issues list: 1. Cache keys are not saved in GitHub. The cache mechanism is independent of the source code; it is based on the content of the folder when the task execution completes. An example workflow using Gradle:
For the Gradle use case, the only problem I can see is the cache growing so much that we need to manually trim it. The current max size is 10G.
How do we set this limit?
If two bots finish at the same time, which one wins? What if one finishes first, but its upload takes longer, so it ends up overriding the other? How difficult would it be to move to a model where we restore the Gradle cache on a per-project basis, with the key formed by hashing the .gradle and lockfile contents, similar to the GitHub Actions model?
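A sketch of the per-project key computation being proposed, modeled on the actions/cache pattern of hashing Gradle build and lock files; the function name and the exact file globs are assumptions:

```python
import hashlib
import pathlib
import platform

def gradle_cache_key(project_dir):
    """Key = OS + content hash of the project's Gradle build and lock files,
    so each project restores its own cache and the key changes whenever the
    dependency declarations change."""
    h = hashlib.sha256()
    root = pathlib.Path(project_dir)
    patterns = ('*.gradle', '*.gradle.kts', 'gradle.lockfile',
                'gradle-wrapper.properties')
    for pattern in patterns:
        for f in sorted(root.rglob(pattern)):
            h.update(str(f.relative_to(root)).encode())
            h.update(f.read_bytes())
    return f"{platform.system().lower()}-gradle-{h.hexdigest()[:16]}"
```

Two bots building the same project at the same commit compute the same key, so concurrent uploads of identical content collapse into one cache entry.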
This is to remove failures downloading Gradle dependencies and to speed up the builds. Bug: flutter/flutter#82745
The 10G is the default limit set by the LUCI bot. There is no winner; the new files are uploaded independently and the top-level hashes are recalculated.
Ah, if it uses Merkle trees or similar, then it's a different story. At the moment, the action item for me is to lock all the dependencies, so they resolve to the same versions locally and remotely.
Sounds good, thanks!
The new_gallery__crane_perf 15-day flake ratio is back under 2%, and the number of flaky and failed runs in the last 50 is 0. Turning it back on. Leaving this issue open to track the Gradle cache discussion.
What is still actionable here? |
Lowering priority and assigning it to Godofredo to answer the question about querying the latest executions. @godofredo what is the preferred way of querying the latest executions? I saw LUCI has a section about how to use BigQuery to get that data.
Note that the documentation I found is for Chromium. |
@keyonghan created a dashboard to track flakiness: https://dashboards.corp.google.com/flutter_prod_flakiness_overview_dashboard There is a single failure in the last 100 runs, and it was from before landing the changes:
@blasten is there an easy way to get all the tests that use Gradle? I'd like to take a look at their flakiness levels for the last 7 days.
@godofredoc that's a good question. Some tests have
That would be great. If you can help us identify them, we can potentially rename or tag them. I'm really interested in checking how many Gradle-related flakiness issues we can close now.
I'm closing this bug and opening a new one for grouping tests.
This thread has been automatically locked since there has not been any recent activity after it was closed. If you are still experiencing a similar issue, please open a new bug, including the output of
The benchmark test new_gallery__crane_perf has a flaky ratio of 2.56%, which is above our self-imposed 2% threshold.
(Creating this issue as part of the gardener rotation.)