
Mac mac_ios_engine running ~10 minutes slower since 10/31 #137951

Closed
zanderso opened this issue Nov 6, 2023 · 29 comments
Assignees: ricardoamador
Labels: P2 (Important issues not at the top of the work list), team-infra (Owned by Infrastructure team)

Comments

@zanderso
Member

zanderso commented Nov 6, 2023

AFAICT this time is entirely accounted for by a new gap that has appeared between the last task ending and the whole job being completed. For example, note the large gap on the right-hand side between the last recorded task ending and the end of the timeline:

https://ci.chromium.org/ui/p/flutter/builders/prod/Mac%20mac_ios_engine/7242/timeline

This is also reflected in the stats on the main page for the build.

cc @godofredoc

zanderso added the team-infra (Owned by Infrastructure team) label on Nov 6, 2023
@zanderso
Member Author

zanderso commented Nov 6, 2023

So for a sharded build, if the subtasks have a new 5-minute gap and the coordinator adds a new 5-minute gap, that adds up to a 10-minute gap total.

@keyonghan
Contributor

It seems earlier builds hit gap periods ranging from seconds (10/27) to 5 minutes (10/30).

@ricardoamador could you help reach out to the LUCI team to see if there's anything we can do about the gap?

@ricardoamador
Contributor

Yes sir.

@ricardoamador
Contributor

Wait! Is this happening on anything other than the Mac mac_ios_engine tests?

@ricardoamador
Contributor

ricardoamador commented Nov 7, 2023

Using this build: https://luci-milo.appspot.com/ui/p/flutter/builders/prod/Mac%20mac_ios_engine/7105/timeline

Okay, I asked in the LUCI channel, and there appears to be a correlation between the total overhead time spent on the task:

[screenshot]

and the amount of time the task spends uninstalling named caches:

[screenshot]

And this "gap" appears at the end of the build for unknown reasons:

[screenshot]

@ricardoamador
Contributor

The LUCI team is going to look into this. Though Robbi asked a question I do not have the answer to:

[screenshot]

ricardoamador added the P2 (Important issues not at the top of the work list) label on Nov 7, 2023
@zanderso
Member Author

zanderso commented Nov 7, 2023

It's not just Mac mac_ios_engine, I think. Eyeballing the chart, several builds are affected:
[screenshot]

@ricardoamador
Contributor

ricardoamador commented Nov 7, 2023

Yeah, you are correct. I did some spot-checking, and most of the builds do have overhead, but I did not find any as high as the Mac overhead. Windows had about 2 minutes in the ones I checked, and Linux was up to a minute. What is that chart from?

@keyonghan
Contributor

keyonghan commented Nov 7, 2023

@vashworth When did the new Xcode experiment happen? Are any prod bots on the new version? Per LUCI, Xcode contains lots of small files, which may contribute to the slowness.

@ricardoamador Could you help take a look at those slow builds to see if they happen on specific bots? For each of those bots, it would be interesting to see how many different Xcode versions have been installed.

@vashworth
Contributor

vashworth commented Nov 7, 2023

I haven't run any Xcode 15 experiments in probably over a month as I've been focusing on macOS 13. And my Xcode 15 experiments would only affect macOS 13 bots - which we have none of in the prod pool currently. Also, all of my macOS 13 experiments have been in the try pool or via led testing (against either try or staging bots).

@ricardoamador
Contributor

@keyonghan yeah I will take a look to see if there are particular bots or any other trend.

ricardoamador self-assigned this on Nov 7, 2023
@ricardoamador
Contributor

So there is definitely a difference depending on the machine the test runs on; however, looking at the configs, I can find no discernible difference between the machines. I looked at several and found variations such as these:

If the task runs on build836-m9 it consistently takes ~3m 30s of overhead clearing the named caches.
On build844-m9 it consistently takes ~5m 51s of overhead clearing the named caches.
On build800-m9 it consistently takes ~2m 35s of overhead clearing the named caches.
And on build808-m9 it consistently takes ~7m 10s of overhead clearing the named caches.

I could not identify any difference between the machines, and the task is the same from the Flutter side, which makes this more puzzling.

Sampling of task run times with overhead:
[screenshot]

Task overhead on specific bots is consistent:
[screenshot]

@keyonghan
Contributor

Thanks @ricardoamador, this is useful info! Do you see any correlation between these bots' overhead time and the number of Xcode versions installed on them? The named cache is bot-specific, so I am not surprised it differs across bots.

BTW: you can find the number of installed Xcode versions with ls path['cache']/osx_sdk. If we have too many old Xcode versions, we can try cleaning up before the LUCI team has a solution.

@ricardoamador
Contributor

ricardoamador commented Nov 7, 2023

Is that in the output, or do I need to log onto the bot to run this? I'm checking on the number of versions.

@keyonghan
Contributor

I don't think this info is already there, but we can add an info step in the osx_sdk recipes module to print it.
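
For illustration, here is a minimal sketch of what such an info step could look like, assuming the recipe engine's standard path and step APIs; the function and step names are hypothetical, not the actual change:

```python
# Hypothetical info step for the osx_sdk recipes module: list the contents of
# the osx_sdk named cache so the number of cached Xcode versions shows up in
# the build output.
def list_osx_sdk_cache(api):
  sdk_dir = api.path['cache'].join('osx_sdk')
  api.step(
      'list osx_sdk cache contents',
      ['ls', '-la', sdk_dir],
  )
```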

@ricardoamador
Contributor

Oh, I see. Yeah, I can do that. Though, taking another look, it seems this might not have any effect either.

I looked to see whether the requested dimensions vs. the actual bot dimensions made any difference:

[screenshot]

It appears that it does not, unless there is something else happening. I will go ahead and add that code to the recipe to see if there is something up there.

@ricardoamador
Contributor

Looking still further, I do not see a universal increase of ~10 minutes across my samples. I see only one machine where that was the case, and that one does not have a lot of runs over the last 3 weeks.

I see some runs where the time spikes, but they are roughly similar before and after 10/31. In one case the time improved in my sampling. Any row without a date means it was after 10/31/2023.

[screenshot]

And you can see that the cache uninstall time is consistent before and after the 31st. I don't see slower run times. I do see an issue where some machines uninstall the cache more slowly, but I won't know more until I add the code.

@ricardoamador
Contributor

ricardoamador commented Nov 7, 2023

@keyonghan

BTW: you can find the number of installed Xcode versions with ls path['cache']/osx_sdk. If we have too many old Xcode versions, we can try cleaning up before the LUCI team has a solution.

How would this be the case if we are cleaning up the caches? How could there be old versions?

@keyonghan
Contributor

How would this be the case if we are cleaning up the caches? How could there be old versions?

The cache is not actually cleaned up each time a task finishes. It just gets renamed/moved to a specific dir where the bot can retrieve the content the next time a task runs. For that step, the size of the directory has to be calculated, which contributes to the overhead time.

In our code, we do not clean up the cache (remove the contents under osx_sdk) if the cache is not polluted.

@ricardoamador
Contributor

Okay, I have just submitted the code to show the contents of the osx_sdk directory so we can see what is going on in there. LUCI suspects that it may be slow due to calculating the disk size, so this may help us determine whether we need to run a cleanup.

@ricardoamador
Contributor

ricardoamador commented Nov 8, 2023

Okay, this is making a ton of sense. I added some code to show us the contents of the osx_sdk directory, where we put Xcode and other Mac packages. The number of entries in that directory directly correlates with the amount of overhead spent on uninstalling named caches, which I assume is due to calculating the directory size.

When the named cache uninstall is taking >= 6 minutes, we have the following files:

XCode.app
mac_toolchain
xcode_14a5294e
xcode_14a5294e_runtime_ios-16-0_14a5294e
xcode_14c18
xcode_14e222b
xcode_14e222b_runtime_ios-16-4_14e222b_ios-16-2_14c18
xcode_14e300c
xcode_14e300c_runtime_ios-16-4_14e300c_ios-16-2_14c18
xcode_runtime_ios-16-0_14a5294e
xcode_runtime_ios-16-2_14c18
xcode_runtime_ios-16-4_14e222b
xcode_runtime_ios-16-4_14e300c

When the named cache uninstall is taking ~ 4 minutes we have the following files:

xcode_14a5294e
xcode_14c18
xcode_14e222b
xcode_14e300c
xcode_14e300c_runtime_ios-16-4_14e300c_ios-16-2_14c18
xcode_runtime_ios-16-2_14c18
xcode_runtime_ios-16-4_14e300c

And finally, when the named cache uninstall is at ~2 minutes, we have:

xcode_14e300c
xcode_14e300c_runtime_ios-16-4_14e300c_ios-16-2_14c18
xcode_runtime_ios-16-2_14c18
xcode_runtime_ios-16-4_14e300c

So the amount of overhead is directly related to the number of files we have left in the osx_sdk directory, and the obvious fix is to clean up and remove anything that we are not using.

@vashworth are all these versions needed, and is it safe to delete anything that is not in the configuration dependencies for the task? I'm guessing this will impact run time for a test that does not have the required version here.

@vashworth
Contributor

vashworth commented Nov 8, 2023

@vashworth do you have any recommendations here? Is it safe to delete anything that is not in the configuration dependencies for the task?

Yes, it's safe. I'm uncertain how to delete only certain directories, though. I know how to clear the entire cache, which is also safe since the needed Xcode/runtime versions will reinstall, but that will cause an initial slowdown. So it's preferable to just delete the old ones; I just don't know how.

Also, we can probably improve this a little bit once the fleet is on macOS 13 (although recipes could be updated now). On macOS 13, runtimes are no longer copied into the Xcode app, so we don't need to store Xcode apps by runtime. For example, we wouldn't need a runtime-specific version like xcode_14e300c_runtime_ios-16-4_14e300c_ios-16-2_14c18; we would only need xcode_14e300c, so that's at least one less version to have cached.
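
For reference, a rough sketch of the naming pattern implied by the directory listings above and how it simplifies once runtimes no longer have to live inside the Xcode app; this helper is purely illustrative and is not how the recipe actually derives cache names:

```python
# Illustrative only: mirrors the directory names seen above, not the recipe's
# real cache-key logic.
def xcode_cache_name(xcode_version, runtime_versions, runtimes_inside_xcode):
  # On macOS 13 runtimes are not copied into the Xcode app, so a single cache
  # entry per Xcode version is enough.
  if not runtimes_inside_xcode or not runtime_versions:
    return 'xcode_%s' % xcode_version
  # Pre-macOS 13: the cache name has to encode the exact runtime combination
  # baked into that copy of Xcode.
  return 'xcode_%s_runtime_%s' % (xcode_version, '_'.join(runtime_versions))

# xcode_cache_name('14e300c', ['ios-16-4_14e300c', 'ios-16-2_14c18'], True)
#   -> 'xcode_14e300c_runtime_ios-16-4_14e300c_ios-16-2_14c18'
# xcode_cache_name('14e300c', ['ios-16-4_14e300c', 'ios-16-2_14c18'], False)
#   -> 'xcode_14e300c'
```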

@ricardoamador
Contributor

@vashworth Okay, good to know. The easier method is to blow away the cache, but then we would need some kind of detection method to determine when to do that. I'll look into this.

One more question: do you know if this is a result of personal testing or just lack of cleanup in infra? Or both? I am curious how we would end up with such a discrepancy (one bot has so many more files than another) since the bots were all active at the same time.

@vashworth
Contributor


Since these are prod bots, I think this is just lack of cleanup in infra. I think non-infra members aren't able to do led tests on prod bots, so most likely it's not from personal testing. I'm not really sure why they would have a discrepancy either. It could be that the bots were moved around between pools at some point?

@keyonghan
Contributor

One thing we can do is add a manual meta file saving the last time CI touched each version. If CI hasn't used a version for, say, 1 week, then clean it up. This should work for the main branches. But for release branches, which run less frequently, we may occasionally hit an extra install/remove.
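
A minimal sketch of that idea, assuming a simple JSON meta file stored alongside the cached versions; the file name, threshold, and layout are illustrative, not the actual implementation:

```python
# Hypothetical last-used tracking for cached Xcode versions in osx_sdk.
import json
import os
import shutil
import time

META_FILE = '.last_used.json'      # assumed name for the meta file
STALE_AFTER_SECS = 7 * 24 * 3600   # e.g. clean up versions unused for 1 week


def mark_version_used(sdk_dir, version):
  """Records that CI just used the given cached version."""
  meta_path = os.path.join(sdk_dir, META_FILE)
  meta = {}
  if os.path.exists(meta_path):
    with open(meta_path) as f:
      meta = json.load(f)
  meta[version] = time.time()
  with open(meta_path, 'w') as f:
    json.dump(meta, f)


def cleanup_stale_versions(sdk_dir):
  """Removes cached versions that CI has not touched recently."""
  meta_path = os.path.join(sdk_dir, META_FILE)
  if not os.path.exists(meta_path):
    return
  with open(meta_path) as f:
    meta = json.load(f)
  now = time.time()
  for version, last_used in list(meta.items()):
    if now - last_used > STALE_AFTER_SECS:
      target = os.path.join(sdk_dir, version)
      if os.path.isdir(target):
        shutil.rmtree(target)
      del meta[version]
  with open(meta_path, 'w') as f:
    json.dump(meta, f)
```

The tradeoff mentioned above applies here: release branches that run less often than the threshold would see their versions removed and reinstalled on the next run.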

@ricardoamador
Contributor

@keyonghan that sounds like a good idea.
@vashworth Okay just checking. Thanks. Yeah also very possible that bots were in different pools at one point.

@ricardoamador
Contributor

Removing this from the queue since it is a P2, but continuing work on it.

@ricardoamador
Contributor

Okay, this is fixed by https://flutter-review.googlesource.com/c/recipes/+/52703. Infra is going to continue to monitor the overhead times. The current removal timeout is set to 30 days.


This thread has been automatically locked since there has not been any recent activity after it was closed. If you are still experiencing a similar issue, please open a new bug, including the output of flutter doctor -v and a minimal reproduction of the issue.

github-actions bot locked this as resolved and limited conversation to collaborators on Dec 30, 2023