Mac mac_ios_engine running ~10 minutes slower since 10/31 #137951
AFAICT this time is entirely accounted for by a new gap that has appeared between the last task ending and the whole job being completed. For example, note the large gap on the right-hand side between the last recorded task ending and the end of the timeline:

https://ci.chromium.org/ui/p/flutter/builders/prod/Mac%20mac_ios_engine/7242/timeline

This is also reflected in the stats on the main page for the build.

cc @godofredoc
Comments
So for a sharded build, if the subtasks have a new 5-minute gap and the coordinator adds another 5-minute gap, that adds up to a 10-minute gap total.
It seems earlier builds behaved differently. @ricardoamador, could you help reach out to the LUCI team to see if there is anything we can do regarding the gap?
Yes sir.
Wait! Is this happening on anything other than the Mac mac_ios_engine tests?
Using this build: https://luci-milo.appspot.com/ui/p/flutter/builders/prod/Mac%20mac_ios_engine/7105/timeline

I asked in the LUCI channel, and there appears to be a correlation between the total overhead time spent on the task and the amount of time the task spends uninstalling named caches. This "gap" appears at the end of the build for unknown reasons.
Yeah, you are correct. I did some spot checking, and most of the builds do have overhead, but I did not find any as high as the Mac overhead. Windows had 2 minutes in the ones I checked, and Linux was up to a minute. What is that chart from?
@vashworth When did the new Xcode experiment happen? Are there any prod bots with the new version? @ricardoamador Could you help take a look at those slow builds from LUCI to see if they happen on specific bots? For each of those bots, it would be interesting to see how many different Xcode versions have been installed there.
I haven't run any Xcode 15 experiments in probably over a month, as I've been focusing on macOS 13. And my Xcode 15 experiments would only affect macOS 13 bots, of which we currently have none in the prod pool. Also, all of my macOS 13 experiments have been in the try pool or via …
@keyonghan yeah, I will take a look to see if there are particular bots or any other trend.
So there is definitely a difference depending on the machine the test runs on; however, looking at the configs, I can see no discernible difference between the machines. I looked at several and found variations such as this: if the task runs on build836-m9, it consistently takes ~3m 30s of overhead clearing the named caches. I could not identify any difference in the machines, and the task is the same from the Flutter side, which makes this more puzzling.
Thanks @ricardoamador, this is useful info! Did you find any correlation between these bots' overhead time and the number of Xcode versions installed on them? A named cache is bot specific, so I am not surprised they differ across bots. BTW: you can find the number of installed Xcode versions by listing the contents of the osx_sdk cache directory on the bot.
Is that in the output, or do I need to log onto the bot to run this? Checking on the number of versions.
I don't think this info is already there, but we can add an info step in the osx_sdk recipe module to print it.
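A minimal sketch of what such an info step could look like, assuming the standard recipe_engine `step` and `path` modules (the function name is illustrative, not the actual osx_sdk module code):

```python
def show_installed_xcodes(api):
  # Each installed Xcode version lives in its own subdirectory of the
  # "osx_sdk" named cache, so a simple listing shows how many versions
  # have accumulated on this bot.
  sdk_dir = api.path['cache'].join('osx_sdk')
  api.step('list installed xcode versions', ['ls', '-la', sdk_dir])
```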
Oh, I see. Yeah, I can do that. Though taking another look, it seems that this might not have any effect either. I checked whether the requested dimensions vs. the actual bot dimensions made any difference; it appears that they do not, unless there is something else happening. I will go ahead and add that code to the recipe to see if there is something up there.
Looking still further, I do not see a universal increase of ~10 minutes across my samples. I see only one machine where that was the case, and that one does not have many runs over the last 3 weeks. I see some runs where the time spikes, but they are roughly similar before and after 10/31; in one case the time actually improved in my sampling. (In my sample data, any row without a date is from after 10/31/2023.) You can see that the cache-uninstall time is consistent before and after the 31st. I don't see slower run times overall; I see an issue where some machines uninstall the cache more slowly, but I won't know more until I add the code.
How would this be the case if we are cleaning up the caches? How could there be old versions?
The cache is indeed not cleaned up. In our code, we do not clean up the cache (remove the contents under osx_sdk) if the cache is not polluted.
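Sketched out, the conditional cleanup described above looks roughly like this; the pollution check is a hypothetical stub, and only `api.file.rmcontents` is a real recipe_engine call:

```python
def cache_is_polluted(api, sdk_dir):
  # Hypothetical predicate: a stand-in for however the module decides
  # the cache contents are no longer trustworthy.
  return False

def maybe_clean_cache(api, sdk_dir):
  # Only wipe osx_sdk when it looks polluted; otherwise leave it alone,
  # which is why unused Xcode versions accumulate over time.
  if cache_is_polluted(api, sdk_dir):
    api.file.rmcontents('clean osx_sdk cache', sdk_dir)
```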
Okay, I have just submitted the code to show the contents of the osx_sdk directory to see what is going on in there. LUCI suspects that it may be slow due to calculating the disk size, so this may help us determine whether we need to run a cleanup.
Okay, this is making a ton of sense. I added some code to show us the contents of the osx_sdk directory where we put Xcode and other Mac packages, and the number of files in that directory directly correlates with the amount of overhead spent on uninstalling named caches, which I assume is due to figuring out the directory size. When the named cache uninstall takes >= 6 minutes, we have the following files: [file listing elided]

When the named cache uninstall takes ~4 minutes, we have the following files: [file listing elided]

And finally, when the named cache uninstall is at ~2 minutes, we have: [file listing elided]

So the amount of overhead is directly related to the number of files we have left in the osx_sdk directory. Obviously, the fix here is to clean up and remove anything we are not using. @vashworth, are all these versions needed, and is it safe to delete anything that is not in the configuration dependencies for the task? I'm guessing this will impact run time for a test that does not have the required version here.
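For reference, here is a quick way to see which version directories dominate the file count, as a minimal local sketch (the cache path is an assumption and varies per bot):

```python
import os

def count_files(root):
    # Total number of files under a directory tree.
    return sum(len(files) for _, _, files in os.walk(root))

sdk = os.path.expanduser('~/cache/osx_sdk')  # assumed path
for entry in sorted(os.listdir(sdk)):
    path = os.path.join(sdk, entry)
    if os.path.isdir(path):
        print('%s: %d files' % (entry, count_files(path)))
```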
Yes, it's safe. I'm uncertain how to delete only certain directories, though. I know how to clear the entire cache, which is also safe since the needed Xcode/runtime versions will reinstall, but that would cause an initial slowdown. So it's preferable to just delete the old ones, but I don't know how. Also, we can probably improve this a little once the fleet is on macOS 13 (although recipes could be updated now). On macOS 13, runtimes are no longer copied into the Xcode app, so we don't need to store Xcode apps by runtime. For example, we wouldn't need a runtime-specific version like …
@vashworth Okay, good to know. The easier method is to blow away the cache, but then we would need some kind of detection method to determine when to do that. I'll look into this. One more question: do you know if this is a result of personal testing or just lack of cleanup in infra? Or both? I am curious how we would end up with such a discrepancy on the bots (one bot has so many more files than another) since they were all active at the same time.
Since these are prod bots, I think this is just lack of cleanup in infra. I think non-infra members aren't able to …
One way we can do this is to add a manual meta file saving the last time CI touched each version. If CI hasn't used a version for, say, 1 week, then clean it up. This should work for main branches, but for release branches, which run less frequently, we may occasionally hit an install/remove cycle.
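A minimal sketch of that idea (the stamp-file name and layout are assumptions, not the code that eventually landed): each version directory gets a stamp recording when CI last used it, and anything older than the timeout is removed.

```python
import os
import shutil
import time

REMOVAL_TIMEOUT_SECS = 7 * 24 * 60 * 60  # e.g. 1 week, per the suggestion above

def touch_version_stamp(sdk_dir, version):
    """Record that CI just used this Xcode/runtime version."""
    stamp = os.path.join(sdk_dir, version, '.last_used')  # assumed file name
    with open(stamp, 'w') as f:
        f.write(str(time.time()))

def cleanup_stale_versions(sdk_dir):
    """Remove version directories whose stamp is older than the timeout."""
    now = time.time()
    for entry in os.listdir(sdk_dir):
        path = os.path.join(sdk_dir, entry)
        if not os.path.isdir(path):
            continue
        stamp = os.path.join(path, '.last_used')
        # Treat a missing stamp as stale so legacy directories get cleaned.
        last_used = os.path.getmtime(stamp) if os.path.exists(stamp) else 0
        if now - last_used > REMOVAL_TIMEOUT_SECS:
            shutil.rmtree(path)
```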
@keyonghan that sounds like a good idea.
Removing from the queue as it is a P2, but continuing work on this.
Okay, this is fixed by https://flutter-review.googlesource.com/c/recipes/+/52703. Infra is going to continue to monitor the overhead times. The current removal timeout is set to 30 days.
This thread has been automatically locked since there has not been any recent activity after it was closed. If you are still experiencing a similar issue, please open a new bug, including the output of `flutter doctor -v` and a minimal reproduction of the issue.