
Mac mac_ios_engine running ~10 minutes slower since 10/31 #137951

Closed
zanderso opened this issue Nov 6, 2023 · 29 comments
Assignees: ricardoamador
Labels: P2 (Important issues not at the top of the work list), team-infra (Owned by Infrastructure team)

Comments

@zanderso
Member

zanderso commented Nov 6, 2023

AFAICT this time is entirely accounted for by a new gap that has appeared between the last task ending and the whole job being completed. For example, note the large gap on the right-hand side between the last recorded task ending and the end of the timeline:

https://ci.chromium.org/ui/p/flutter/builders/prod/Mac%20mac_ios_engine/7242/timeline

This is also reflected in the stats on the main page for the build.

cc @godofredoc

zanderso added the team-infra (Owned by Infrastructure team) label on Nov 6, 2023
@zanderso
Member Author

zanderso commented Nov 6, 2023

So for a sharded build, if the subtasks have a new 5-minute gap and the coordinator adds a new 5-minute gap, that adds up to a 10-minute gap total.

@keyonghan
Contributor

It seems earlier builds hit gap periods ranging from seconds (10/27) to 5 minutes (10/30).

@ricardoamador could you help reach out to the LUCI team to see if there's anything we can do about the gap?

@ricardoamador
Contributor

Yes sir.

@ricardoamador
Contributor

Wait! Is this happening on anything other than the Mac mac_ios_engine tests?

@ricardoamador
Contributor

ricardoamador commented Nov 7, 2023

Using this build: https://luci-milo.appspot.com/ui/p/flutter/builders/prod/Mac%20mac_ios_engine/7105/timeline

Okay, I asked in the LUCI channel, and there appears to be a correlation between the total overhead time spent on the task:

[screenshot]

and the amount of time the task spends uninstalling named caches:

[screenshot]

And this "gap" appears at the end of the build for unknown reasons:

[screenshot]

@ricardoamador
Contributor

The LUCI team is going to look into this. Though Robbi asked a question I do not have the answer to:

[screenshot]

ricardoamador added the P2 (Important issues not at the top of the work list) label on Nov 7, 2023
@zanderso
Member Author

zanderso commented Nov 7, 2023

It's not just Mac mac_ios_engine, I think. Eyeballing the chart, several builds are affected:
[screenshot]

@ricardoamador
Contributor

ricardoamador commented Nov 7, 2023

Yeah, you are correct. I did some spot-checking, and most of the builds do have overhead, but I did not find any as high as the Mac overhead. Windows had about 2 minutes in the ones I checked, and Linux was up to a minute. What is that chart from?

@keyonghan
Contributor

keyonghan commented Nov 7, 2023

@vashworth When did the new Xcode experiment happen? Are any prod bots on the new version? Per LUCI, Xcode contains lots of small files, which may contribute to the slowness.

@ricardoamador Could you help take a look at those slow builds to see if they happen on specific bots? For each of those bots, it would be interesting to see how many different Xcode versions have been installed.

@vashworth
Contributor

vashworth commented Nov 7, 2023

I haven't run any Xcode 15 experiments in probably over a month as I've been focusing on macOS 13. And my Xcode 15 experiments would only affect macOS 13 bots - which we have none of in the prod pool currently. Also, all of my macOS 13 experiments have been in the try pool or via led testing (against either try or staging bots).

@ricardoamador
Contributor

@keyonghan yeah I will take a look to see if there are particular bots or any other trend.

ricardoamador self-assigned this on Nov 7, 2023
@ricardoamador
Contributor

So there is definitely a difference depending on the machine the test runs on; however, looking at the configs, I can find no discernible difference between the machines. I looked at several and found variations such as these:

If the task runs on build836-m9 it consistently takes ~3m 30s of overhead clearing the named caches.
On build844-m9 it consistently takes ~5m 51s of overhead clearing the named caches.
On build800-m9 it consistently takes ~2m 35s of overhead clearing the named caches.
And on build808-m9 it consistently takes ~7m 10s of overhead clearing the named caches.

I could not identify any difference between the machines, and the task is the same from the Flutter side, which makes this more puzzling.

Sampling of task run times with overhead:
[screenshot]

Task overhead on specific bots is consistent:
[screenshot]

@keyonghan
Contributor

Thanks @ricardoamador, this is useful info! Do you see any correlation between these bots' overhead time and the number of Xcode versions installed on them? The named cache is bot-specific, so I am not surprised it differs across bots.

BTW: you can find the number of installed Xcode versions with ls path['cache']/osx_sdk. If we have too many old Xcode versions, we can try cleaning up before the LUCI team has a solution.

@ricardoamador
Contributor

ricardoamador commented Nov 7, 2023

Is that in the output, or do I need to log onto the bot to run this? I'm checking on the number of versions.

@keyonghan
Contributor

I don't think this info is already there, but we can add an info step in the osx_sdk recipes module to print it.
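
For illustration, here is a minimal sketch of what such an info step could look like, assuming the recipe engine's standard path and step APIs; the function and step names are hypothetical, not the actual change:

```python
# Hypothetical info step for the osx_sdk recipes module: list the contents of
# the osx_sdk named cache so the number of cached Xcode versions shows up in
# the build output.
def list_osx_sdk_cache(api):
  sdk_dir = api.path['cache'].join('osx_sdk')
  api.step(
      'list osx_sdk cache contents',
      ['ls', '-la', sdk_dir],
  )
```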

@ricardoamador
Contributor

Oh, I see. Yeah, I can do that. Though, taking another look, it seems this might not have any effect either.

I looked to see whether the requested dimensions vs. the actual bot dimensions made any difference:

[screenshot]

It appears that it does not, unless there is something else happening. I will go ahead and add that code to the recipe to see if there is something up there.

@ricardoamador
Contributor

Looking still further, I do not see a universal increase of ~10 minutes across my samples. I see only one machine where that was the case, and that one does not have a lot of runs over the last 3 weeks.

I see some runs where the time spikes, but they are roughly similar before and after 10/31. In one case the time improved in my sampling. Any row without a date means it was after 10/31/2023.

[screenshot]

And you can see that the cache uninstall time is consistent before and after the 31st. I don't see slower run times. I do see an issue where some machines uninstall the cache more slowly, but I won't know more until I add the code.

@ricardoamador
Contributor

ricardoamador commented Nov 7, 2023

@keyonghan

BTW: you can find the number of installed Xcode versions with ls path['cache']/osx_sdk. If we have too many old Xcode versions, we can try cleaning up before the LUCI team has a solution.

How would this be the case if we are cleaning up the caches? How could there be old versions?

@keyonghan
Contributor

How would this be the case if we are cleaning up the caches? How could there be old versions?

The cache is not actually cleaned up each time a task finishes. It just gets renamed/moved to a specific dir where the bot can retrieve the content the next time a task runs. For that step, the size of the directory has to be calculated, which contributes to the overhead time.

In our code, we do not clean up the cache (remove the contents under osx_sdk) if the cache is not polluted.

@ricardoamador
Contributor

Okay, I have just submitted the code to show the contents of the osx_sdk directory so we can see what is going on in there. LUCI suspects that it may be slow due to calculating the disk size, so this may help us determine whether we need to run a cleanup.

@ricardoamador
Contributor

ricardoamador commented Nov 8, 2023

Okay, this is making a ton of sense. I added some code to show us the contents of the osx_sdk directory, where we put Xcode and other Mac packages. The number of entries in that directory directly correlates with the amount of overhead spent on uninstalling named caches, which I assume is due to calculating the directory size.

When the named cache uninstall is taking >= 6 minutes, we have the following files:

XCode.app
mac_toolchain
xcode_14a5294e
xcode_14a5294e_runtime_ios-16-0_14a5294e
xcode_14c18
xcode_14e222b
xcode_14e222b_runtime_ios-16-4_14e222b_ios-16-2_14c18
xcode_14e300c
xcode_14e300c_runtime_ios-16-4_14e300c_ios-16-2_14c18
xcode_runtime_ios-16-0_14a5294e
xcode_runtime_ios-16-2_14c18
xcode_runtime_ios-16-4_14e222b
xcode_runtime_ios-16-4_14e300c

When the named cache uninstall is taking ~ 4 minutes we have the following files:

xcode_14a5294e
xcode_14c18
xcode_14e222b
xcode_14e300c
xcode_14e300c_runtime_ios-16-4_14e300c_ios-16-2_14c18
xcode_runtime_ios-16-2_14c18
xcode_runtime_ios-16-4_14e300c

And finally, when the named cache uninstall is at ~2 minutes, we have:

xcode_14e300c
xcode_14e300c_runtime_ios-16-4_14e300c_ios-16-2_14c18
xcode_runtime_ios-16-2_14c18
xcode_runtime_ios-16-4_14e300c

So the amount of overhead is directly related to the number of files we have left in the osx_sdk directory, and the obvious fix is to clean up and remove anything that we are not using.

@vashworth are all these versions needed, and is it safe to delete anything that is not in the configuration dependencies for the task? I'm guessing this will impact run time for a test that does not have the required version here.

@vashworth
Contributor

vashworth commented Nov 8, 2023

@vashworth do you have any recommendations here? Is it safe to delete anything that is not in the configuration dependencies for the task?

Yes, it's safe. I'm uncertain how to delete only certain directories, though. I know how to clear the entire cache, which is also safe since the needed Xcode/runtime versions will reinstall, but that will cause an initial slowdown. So it's preferable to just delete the old ones; I just don't know how.

Also, we can probably improve this a little bit once the fleet is on macOS 13 (although recipes could be updated now). On macOS 13, runtimes are no longer copied into the Xcode app, so we don't need to store Xcode apps by runtime. For example, we wouldn't need a runtime-specific version like xcode_14e300c_runtime_ios-16-4_14e300c_ios-16-2_14c18; we would only need xcode_14e300c, so that's at least one less version to have cached.
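
For reference, a rough sketch of the naming pattern implied by the directory listings above and how it simplifies once runtimes no longer have to live inside the Xcode app; this helper is purely illustrative and is not how the recipe actually derives cache names:

```python
# Illustrative only: mirrors the directory names seen above, not the recipe's
# real cache-key logic.
def xcode_cache_name(xcode_version, runtime_versions, runtimes_inside_xcode):
  # On macOS 13 runtimes are not copied into the Xcode app, so a single cache
  # entry per Xcode version is enough.
  if not runtimes_inside_xcode or not runtime_versions:
    return 'xcode_%s' % xcode_version
  # Pre-macOS 13: the cache name has to encode the exact runtime combination
  # baked into that copy of Xcode.
  return 'xcode_%s_runtime_%s' % (xcode_version, '_'.join(runtime_versions))

# xcode_cache_name('14e300c', ['ios-16-4_14e300c', 'ios-16-2_14c18'], True)
#   -> 'xcode_14e300c_runtime_ios-16-4_14e300c_ios-16-2_14c18'
# xcode_cache_name('14e300c', ['ios-16-4_14e300c', 'ios-16-2_14c18'], False)
#   -> 'xcode_14e300c'
```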

@ricardoamador
Contributor

@vashworth Okay, good to know. The easier method is to blow away the cache, but then we would need some kind of detection method to determine when to do that. I'll look into this.

One more question: do you know if this is a result of personal testing or just lack of cleanup in infra? Or both? I am curious how we would end up with such a discrepancy (one bot has so many more files than another) since the bots were all active at the same time.

@vashworth
Contributor


Since these are prod bots, I think this is just lack of cleanup in infra. I think non-infra members aren't able to do led tests on prod bots, so most likely it's not from personal testing. I'm not really sure why they would have a discrepancy either. It could be that the bots were moved around between pools at some point?

@keyonghan
Contributor

One thing we can do is add a manual meta file saving the last time CI touched each version. If CI hasn't used a version for, say, 1 week, then clean it up. This should work for the main branches. But for release branches, which run less frequently, we may occasionally hit an extra install/remove.
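
A minimal sketch of that idea, assuming a simple JSON meta file stored alongside the cached versions; the file name, threshold, and layout are illustrative, not the actual implementation:

```python
# Hypothetical last-used tracking for cached Xcode versions in osx_sdk.
import json
import os
import shutil
import time

META_FILE = '.last_used.json'      # assumed name for the meta file
STALE_AFTER_SECS = 7 * 24 * 3600   # e.g. clean up versions unused for 1 week


def mark_version_used(sdk_dir, version):
  """Records that CI just used the given cached version."""
  meta_path = os.path.join(sdk_dir, META_FILE)
  meta = {}
  if os.path.exists(meta_path):
    with open(meta_path) as f:
      meta = json.load(f)
  meta[version] = time.time()
  with open(meta_path, 'w') as f:
    json.dump(meta, f)


def cleanup_stale_versions(sdk_dir):
  """Removes cached versions that CI has not touched recently."""
  meta_path = os.path.join(sdk_dir, META_FILE)
  if not os.path.exists(meta_path):
    return
  with open(meta_path) as f:
    meta = json.load(f)
  now = time.time()
  for version, last_used in list(meta.items()):
    if now - last_used > STALE_AFTER_SECS:
      target = os.path.join(sdk_dir, version)
      if os.path.isdir(target):
        shutil.rmtree(target)
      del meta[version]
  with open(meta_path, 'w') as f:
    json.dump(meta, f)
```

The tradeoff mentioned above applies here: release branches that run less often than the threshold would see their versions removed and reinstalled on the next run.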

@ricardoamador
Contributor

@keyonghan that sounds like a good idea.
@vashworth Okay just checking. Thanks. Yeah also very possible that bots were in different pools at one point.

@ricardoamador
Contributor

Removing this from the queue since it is a P2, but continuing work on it.

@ricardoamador
Contributor

Okay, this is fixed by https://flutter-review.googlesource.com/c/recipes/+/52703. Infra is going to continue to monitor the overhead times. The current removal timeout is set to 30 days.


This thread has been automatically locked since there has not been any recent activity after it was closed. If you are still experiencing a similar issue, please open a new bug, including the output of flutter doctor -v and a minimal reproduction of the issue.

github-actions bot locked this as resolved and limited conversation to collaborators on Dec 30, 2023