Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GHA/windows: mitigate drastic runtests perf drop under MSYS2 #16217

Closed
wants to merge 7 commits into from

Conversation

vszakats
Copy link
Member

@vszakats vszakats commented Feb 6, 2025

Today GHA Windows runner images (all versions) deployed an upgrade
(20250127.1.0 -> 20250203.1.0) that upgraded the default MSYS2, which
now seems to feature the October 2024 issue that caused curl runtests
run times increasing ~2.5x. It also causes test987 to fail, and vcpkg
jobs hitting their time limits and fail. Reliability also got a hit.

In October this issue came with a Git for Windows upgrade, and likely
the MSYS2 runtime update within it. It affected vcpkg jobs only, and
I mitigated it by switching them to use the default MSYS2 shell and
runtime (at C:\msys64):

5f9411f #15380

After today's update this mitigation no longer works. The issue also
affects dl-mingw jobs now, though to a lesser extent than vcpkg ones.

Tried switching back to Git for Windows which received several updates
since October, but the performance issue is still present.

I managed to mitigate the slowdown in vcpkg by lowering test parallelism
to -j4 (from -j8), after which the jobs are about half the speed
than before, and fit their time limits. dl-mingw builds run slower by
1-1.5 minutes per job, they were already using -j4.

Example jobs:

Before (ALL GOOD):
https://github.com/curl/curl/actions/runs/13167230443/job/36750175428 installed MSYS2, mingw (-j8): 3m50s (OK)
https://github.com/curl/curl/actions/runs/13167230443/job/36750158662 default MSYS2, dl-mingw (-j4): 4m22s (OK)
https://github.com/curl/curl/actions/runs/13167230443/job/36750163392 default MSYS2, vcpkg (-j8): 3m27s (OK)
runner: https://github.com/actions/runner-images/blob/win22/20250127.1/images/windows/Windows2022-Readme.md
C:\msys64:
System: MSYS_NT-10.0-20348 fv-az1115-916 3.5.4-0bc1222b.x86_64 2024-12-05 09:27 UTC x86_64 Msys
msys2/msys2-runtime@0bc1222b
edit: really → msys2/msys2-runtime@8d847f4

After:
https://github.com/curl/curl/actions/runs/13186498273/job/36809747078 installed MSYS2, mingw (-j8): 3m48s (OK)
https://github.com/curl/curl/actions/runs/13186498273/job/36809728481 default MSYS2, dl-mingw (-j4): 5m56s (SLOW)
https://github.com/curl/curl/actions/runs/13186498273/job/36809736429 default MSYS2, vcpkg (-j8): 9m1s (SLOW)
runner: https://github.com/actions/runner-images/blob/win22/20250203.1/images/windows/Windows2022-Readme.md
C:\msys64:
System: MSYS_NT-10.0-20348 fv-az1115-498 3.5.7-2644508f.x86_64 2025-01-30 09:08 UTC x86_64 Msys
msys2/msys2-runtime@2644508f

windows-2025 image:
C:\msys64:
System: MSYS_NT-10.0-26100 fv-az2043-515 3.5.7-2644508f.x86_64 2025-01-30 09:08 UTC x86_64 Msys

windows-2019 image:
C:\msys64:
System: MSYS_NT-10.0-17763 fv-az1434-677 3.5.7-2644508f.x86_64 2025-01-30 09:08 UTC x86_64 Msys

This PR:
final: https://github.com/curl/curl/actions/runs/13186498273/job/36809736429 GfW, vcpkg (-j4): ~7m (SLOW)
test: https://github.com/curl/curl/actions/runs/13187992987/job/36814644852?pr=16217, GfW, vcpkg (-j8): ~11m (SLOWER)

Before and after (unused) Git for Windows (SLOW as tested in this PR):
C:\Program Files\Git
System: MINGW64_NT-10.0-20348 fv-az1760-186 3.5.4-395fda67.x86_64 2024-11-25 09:49 UTC x86_64 Msys
msys2/msys2-runtime@395fda67 (fork)

Before and after (used) MSYS2 installed via msys2/setup-msys2 (OK):
D:\a_temp\msys64
System: MINGW64_NT-10.0-20348 fv-az836-378 3.5.4-0bc1222b.x86_64 2024-12-05 09:27 UTC x86_64 Msys

Perl pipe issue report from October, still open:
msys2/msys2-runtime#230
ARM deadlock fixed by GfW 2.47.1(1), but for x86_64, on a quick glance:
msys2/msys2-runtime@290bea9
Possibly interesting:
msys2/msys2-autobuild#62

@vszakats vszakats added Windows Windows-specific CI Continuous Integration labels Feb 6, 2025
@vszakats
Copy link
Member Author

vszakats commented Feb 6, 2025

Switching back to the default shell doesn't help.

Maybe we could install yet another copy of MSYS2 for vcpkg and dl-mingw jobs? I don't know. This is extremely exhausing and time consuming, and apparently neverending.

edit:

Hard to say, but this patch seems to bring back the terrible run times
(2.5x slower) seen in the October 2024 fallout, which is slightly less
terrible than the current one (5x slower).
Turns out I picked a job
that disabled parallel tests. Meaning after today's fallout, we're back
to the October slowdown of 2-2.5x. Since it looks the same it's
probable that it's the same issue as suspected originally. The bug
commit is now in both the default MSYS2 and the one shipping in
Git for Windows. Bad.

msys2/msys2-runtime#230 (Perl pipe issue report from October, still open)

Possibly related, this fixes ARM deadlocks fixed by GfW 2.47.1(1), but for x86_64, on a quick glance:
msys2/msys2-runtime@290bea9
msys2/msys2-runtime#255

Possibly interesting:
msys2/msys2-autobuild#62

@github-actions github-actions bot added the tests label Feb 6, 2025
@vszakats

This comment was marked as outdated.

@testclutch

This comment was marked as resolved.

@vszakats vszakats force-pushed the wgha-perf branch 2 times, most recently from e3b82d6 to 9aa0480 Compare February 6, 2025 20:09
@vszakats vszakats mentioned this pull request Feb 6, 2025
Seen with MSVC 2019:
```
D:\a\curl\curl\lib\vtls\wolfssl.c(773): error C2220: the following warning is treated as an error [D:\a\curl\curl\bld\lib\libcurl_object.vcxproj]
D:\a\curl\curl\lib\vtls\wolfssl.c(773): warning C4706: assignment within conditional expression [D:\a\curl\curl\bld\lib\libcurl_object.vcxproj]
```
https://github.com/curl/curl/actions/runs/13190321645/job/36821938202?pr=16217#step:9:30
Silence checksrc where it detects it as EQUALSNULL.
@vszakats vszakats changed the title GHA/windows: switch back to default bash shell for vcpkg GHA/windows: mitigate drastic runtests perf drop under MSYS2 Feb 7, 2025
@vszakats vszakats closed this in 1bf774d Feb 7, 2025
@vszakats vszakats deleted the wgha-perf branch February 7, 2025 01:45
vszakats added a commit to vszakats/curl that referenced this pull request Feb 8, 2025
vszakats added a commit to vszakats/curl that referenced this pull request Feb 8, 2025
@vszakats
Copy link
Member Author

well that sucks. the last known good version you are using was still based on Cygwin 3.4. I tried with the very first Git for Windows that had a version based on Cygwin 3.5 (the very next one, 2.47.0-rc0), and the slowdown was already there. Good news is, my ARM64 patch was nowhere near that version. Bad news is, a lot of stuff changed (a major Cygwin release), so attempting to narrow down exactly what the cause was would be a major undertaking.

maybe if I try with msys2 (upstream of Git for Windows) I'll be more lucky.

The reason I choose Git for Windows for the workaround was that it's quick to install
and trivial to pick a specific version. For bisecting or closing up on the issue, MSYS2
would be nicer indeed.

It may be interesting that the MSYS2 installed via msys2/setup-msys2 (3.5.4-0bc1222b)
doesn't hit this issue, nor the Cygwin installed via cygwin/cygwin-install-action (3.5.7-1).
At least for now.

I tried diffing the runtime sources listed in the PR message, e.g. 0bc1222b and 2644508f
(IIRC), and they had a reasonably small amount of changes that may be related.

@jeremyd2019
Copy link

I think the 0bc1222b revision was a red herring (probably a bug in the packaging), I think I'll have to go by the date to figure out exactly what revision was used. Not horribly difficult, just an extra step.

@jeremyd2019
Copy link

The revision of the 2024-12-05 build of msys2-runtime was really 8d847f4

@jeremyd2019
Copy link

I seem to have hit a wall now. I think I upset Github actions running that workflow so many times. Now I can't run any workflows (at least not on Windows runners), I get a message about

The job was not started because recent account payments have failed or your spending limit needs to be increased. Please check the 'Billing & plans' section in your settings.

But in my account settings, all usages show 0. (I don't have any payments set up).

What I found so far is that I have isolated the slowdown to msys2-runtime (I ran the 2024-12-08 msys2 installer and it was "fast", I updated all packages except msys2-runtime and it stayed "fast", and I updated all packages including msys2-runtime and it slowed down).

BTW, updating perl resulted in a couple of test failures. Do you by chance use $^O to try to decide to do different things between msys and cygwin? Because that changed recently: https://www.msys2.org/news/#2025-02-14-moving-msys2-closer-to-cygwin

@jeremyd2019
Copy link

Whatever limit I hit seems to have reset now 😌 . Might have to give this a rest for a while so I can do other things with GHA 😁

@jeremyd2019
Copy link

Based on the timing (affected Git for Windows when it first updated to Cygwin 3.5, and upstream MSYS2 only when it updated from 3.5.4 to 3.5.7) my first suspect commit was msys2/msys2-runtime@c7fe29f. I was about to build msys2-runtime with that commit reverted, and inject the built artifacts into the workflow when all my GHA runs started failing.

@jeremyd2019
Copy link

jeremyd2019 commented Feb 20, 2025

BTW, updating perl resulted in a couple of test failures. Do you by chance use $^O to try to decide to do different things between msys and cygwin? Because that changed recently: https://www.msys2.org/news/#2025-02-14-moving-msys2-closer-to-cygwin

Yep

data/test1158:%PERL -e "print 'Test requires a system supporting double quotes in file names' if ($^O eq 'msys');"
data/test1186:%PERL -e "print 'Test requires a system supporting double quotes in file names' if ($^O eq 'msys');"

And looking for that I found another bug waiting to happen:

#######################################################################
# return Cygwin pid from virtual pid
#
sub winpid_to_pid {
    my $vpid = $_[0];
    if(($^O eq 'cygwin' || $^O eq 'msys') && $vpid > 65536) {
        my $pid = Cygwin::winpid_to_pid($vpid - 65536);
        if($pid) {
            return $pid;
        } else {
            return $vpid
        }
    }
    return $vpid;
}

Cygwin recently increased its max pid (cygwin/cygwin@363357c), so the hardcoded 65536 will no longer be correct as of 3.6.0. It should be getting the max pid by calling C api cygwin_internal(CW_MAX_CYGWIN_PID). (I don't know anything about that Cygwin perl module, but it must be calling those same sys/cygwin.h APIs).

@vszakats
Copy link
Member Author

Whatever limit I hit seems to have reset now 😌 . Might have to give this a rest for a while so I can do other things with GHA 😁

I think it may help to delete in your fork all workflows/jobs besides just one dl-mingw job for testing.

@vszakats
Copy link
Member Author

vszakats commented Feb 20, 2025

Nice, looks like we'll have to do some fixing there. /cc @mback2k @dfandrich

edit: The C code complementing the Perl PID logic: (this is for native Windows, so no way to query MAX_PID via the Cygwin API.)

curl/tests/server/util.c

Lines 281 to 296 in d1fc1c4

curl_off_t our_getpid(void)
{
curl_off_t pid;
pid = (curl_off_t)Curl_getpid();
#if defined(_WIN32)
/* store pid + 65536 to avoid conflict with Cygwin/msys PIDs, see also:
* - https://cygwin.com/git/?p=newlib-cygwin.git;a=commit; ↵
* h=b5e1003722cb14235c4f166be72c09acdffc62ea
* - https://cygwin.com/git/?p=newlib-cygwin.git;a=commit; ↵
* h=448cf5aa4b429d5a9cebf92a0da4ab4b5b6d23fe
*/
pid += 65536;
#endif
return pid;
}

PR for the tests (and some more): #16411

@jeremyd2019
Copy link

To throw another fun wrinkle into your world, I think GfW is contemplating dropping perl at some point. git-for-windows/git#5393

@vszakats
Copy link
Member Author

git-for-windows/git#5393

Splendid :) Thanks for the heads up!

For curl we are happy with MSYS2 and don't need GfW except for this workaround
(that's hopefully just temporary). Staying with the pinned version for too long is also
not ideal, even with Perl in it.

@jeremyd2019
Copy link

jeremyd2019 commented Feb 21, 2025

Version HEAD Commit run tests duration
3.5.4-0bc1222b.x86_64 2024-12-05 09:27 UTC msys2/msys2-runtime@8d847f4 2:50
3.5.7-2644508f.x86_64 2025-02-06 19:32 UTC msys2/msys2-runtime@19b50e4 8:47
3.5.7-2aa1c88e-api-350.x86_64 2025-02-20 22:36 UTC msys2/msys2-runtime@2aa1c88e 2:46

Maybe you could take your issue to cygwin@cygwin.com with this knowledge, since we pretty well know now it is a regression in an upstream Cygwin commit? I don't know that they'll jump right on it at the moment though, they're kind of focused on regressions between 3.5.7 and 3.6.0 right now.

(the commit I reverted is necessary to avoid a different bug so it's not just a matter of reverting it)

@jeremyd2019
Copy link

jeremyd2019 commented Feb 21, 2025

For curl we are happy with MSYS2 and don't need GfW except for this workaround (that's hopefully just temporary). Staying with the pinned version for too long is also not ideal, even with Perl in it.

Another option, I don't know if it's any better for you as a workaround, is to just downgrade msys2-runtime:
jeremyd2019@95a404e. This does immediately expose those tests to the double quote failures due to $^O change. It also comes with implicit GPG signature verification of the downloaded package.

@vszakats
Copy link
Member Author

Wow, nice! I hope your revert makes its way to the binaries soon.

Taking to Cygwin, I'm not sure I'd like to do that. I have little insight
into this, no way to make tests, and already spending basically all my
awake time with curl, with a long pending queue.

Hopefully someone else can take this based on your research and
the CI tests that reproduce it.

Thanks for looking into this, much appreciated, in the name of all
curl devs :)

@vszakats
Copy link
Member Author

For curl we are happy with MSYS2 and don't need GfW except for this workaround (that's hopefully just temporary). Staying with the pinned version for too long is also not ideal, even with Perl in it.

Another option, I don't know if it's any better for you as a workaround, is to just downgrade msys2-runtime: jeremyd2019@95a404e. This does immediately expose those tests to the double quote failures due to $^O change. It also comes with implicit GPG signature verification of the downloaded package.

I like it better than the GfW workaround. Shorter, safer,
and easier to drop once no longer necessary. I suppose
it's faster too, but have to try.

I'll give it a spin once merged the double quote PR.

@jeremyd2019
Copy link

jeremyd2019 commented Feb 21, 2025

I like it better than the GfW workaround. Shorter, safer, and easier to drop once no longer necessary. I suppose it's faster too, but have to try.

I'll give it a spin once merged the double quote PR.

You can squeeze more performance out of pacman if you

sed -i -e 's/^CheckSpace/#CheckSpace/' /etc/pacman.conf

before calling it. That check is surprisingly expensive under MSYS2, and all it does is make sure the operation is aborted early if there's not enough free disk space for it. Very unlikely to be the case in this CI scenario, and if it is, it's no great loss if you end up with a corrupted install vs failing early. setup-msys2 action does this same change for the same reason.

@vszakats
Copy link
Member Author

sed -i -e 's/^CheckSpace/#CheckSpace/' /etc/pacman.conf

Excellent, I'll try it!

vszakats added a commit that referenced this pull request Feb 21, 2025
MSYS/MSYS2 and Cygwin are the same platform. Adjust code where they were
treated differently.

- drop separate `MSYS` from buildinfo flags. Our code is using the
  `CYGWIN` variable and CMake (since v3.21) sets it also for `MSYS`.
- fix test1158 and test1186 to exclude them for all Win32 targets,
  instead of just MSYS test envs. To align behavior between MSYS and
  Cygwin envs. Required for recent MSYS2 releases which reports itself
  as Cygwin, and no longer MSYS, which broke the previous exclusion
  logic.
- follow Cygwin bumping its `MAX_PID` value, to avoid PID collisions.
  https://cygwin.com/git/?p=newlib-cygwin.git;a=commit;h=363357c023ce01e936bdaedf0f479292a8fa4e0f

Reported-by: Jeremy Drake
Bug: #16217 (comment)
Ref: https://www.msys2.org/news/#2025-02-14-moving-msys2-closer-to-cygwin
Closes #16411
@jeremyd2019
Copy link

jeremyd2019 commented Feb 21, 2025

I was just trying to merge cygwin's main branch with git-for-windows, since they are planning to release a new Cygwin 3.6.0 in the next few weeks and I wanted to see what breaks. It looks like the pipe code implicated in the commit I found has changed a lot between 3.5 and 3.6, hopefully the new code performs better (and works correctly in the case the commit was trying to fix in the first place)

I saw 2:50 on the run tests step with my hack-merged 3.6 runtime: https://github.com/jeremyd2019/curl/actions/runs/13464176764/job/37626204204 (jeremyd2019@0a7c629)

vszakats added a commit that referenced this pull request Feb 21, 2025
We recently switched to a known good version of Git for Windows to avoid
the MSYS2/Cygwin runtime performance regression.

MSYS2 is closer to the source of the MSYS2/Cygwin projects. Its known
good version is newer. Installing the downgrade is faster and safer. It
also allows to restore the scripts to their original iteration, making
the workaround easier to drop once the perf issue is fixed upstream.

Therefore, switch back to using MSYS2, and install the runtime downgrade
before running curl tests.

Also disable `pacman`'s `CheckSpace` for best performance.

Jeremy identified to the root cause of the perf regression in this
Cygwin commit (from 2024-09-17):
https://cygwin.com/git/?p=newlib-cygwin.git;a=commit;h=c7fe29f5cb85242ae2607945762f7e0b9af02513

Co-authored-by: Jeremy Drake
Patch: jeremyd2019@95a404e
Ref: #16217 (comment)
Ref: #16217 (comment)

Follow-up to 116950a #16265
Follow-up to 1bf774d #16217
Follow-up to 5f9411f #15380

Closes #16424
@BrianInglis
Copy link
Contributor

edit: The C code complementing the Perl PID logic: (this is for native Windows, so no way to query MAX_PID via the Cygwin API.)

Cygwin MAX_PID is 64K until 3.6 bumps it to 4M like Linux

Required for orgs running new AMD EPYC servers with hundreds of MT cores

@vszakats
Copy link
Member Author

vszakats commented Feb 22, 2025

edit: The C code complementing the Perl PID logic: (this is for native Windows, so no way to query MAX_PID via the Cygwin API.)

Cygwin MAX_PID is 64K until 3.6 bumps it to 4M like Linux

Required for orgs running new AMD EPYC servers with hundreds of MT cores

@BrianInglis: Thanks! We're hopefully covered after merging 4842f22.
If you see anything amiss, let me know!

I was just trying to merge cygwin's main branch with git-for-windows, since they are planning to release a new Cygwin 3.6.0 in the next few weeks and I wanted to see what breaks. It looks like the pipe code implicated in the commit I found has changed a lot between 3.5 and 3.6, hopefully the new code performs better (and works correctly in the case the commit was trying to fix in the first place)

I saw 2:50 on the run tests step with my hack-merged 3.6 runtime: https://github.com/jeremyd2019/curl/actions/runs/13464176764/job/37626204204 (jeremyd2019@0a7c629)

@jeremyd2019: Looking good! I'm glad for the results you're getting. We're also experiencing some flakiness and crashes/hangs in runtests.pl, that's after trying to sweep all corners on our end. I suspect it might have to do with the interactions between MSYS2 runtime and native Windows test binaries. I'll keep an eye if/how those improve after the upgrade. (We're keeping tab of our observations in #14854)

@BrianInglis
Copy link
Contributor

I saw 2:50 on the run tests step with my hack-merged 3.6 runtime: https://github.com/jeremyd2019/curl/actions/runs/13464176764/job/37626204204 (jeremyd2019@0a7c629)

@jeremyd2019: Looking good! I'm glad for the results you're getting. We're also experiencing some flakiness and crashes/hangs in runtests.pl, that's after trying to sweep all corner on our end. I suspect it might have to do with the interactions between MSYS2 runtime and native Windows test binaries. I'll keep an eye if/how those improve after the upgrade. (We're keeping tab of our observations in #14854)

I run the Cygwin curl tests in CI and local with unlimited job slots MAKEFLAGS="-j" cygtest to minimize time without overloading the system, as most tests only run for ms (and other approaches took longer from under or overloading - YMMV!)

@vszakats
Copy link
Member Author

vszakats commented Feb 23, 2025

I saw 2:50 on the run tests step with my hack-merged 3.6 runtime: https://github.com/jeremyd2019/curl/actions/runs/13464176764/job/37626204204 (jeremyd2019@0a7c629)

@jeremyd2019: Looking good! I'm glad for the results you're getting. We're also experiencing some flakiness and crashes/hangs in runtests.pl, that's after trying to sweep all corner on our end. I suspect it might have to do with the interactions between MSYS2 runtime and native Windows test binaries. I'll keep an eye if/how those improve after the upgrade. (We're keeping tab of our observations in #14854)

I run the Cygwin curl tests in CI and local with unlimited job slots MAKEFLAGS="-j" cygtest to minimize time without overloading the system, as most tests only run for ms (and other approaches took longer from under or overloading - YMMV!)

The flakiness affects native Windows builds with test run parallelism enabled.
Pure Cygwin tests run stable.

I think MAKEFLAGS="-j" cygtest parallelize building the tests, but not
running them. runtests.pl displays the test parallelism at launch in the
* Jobs: <N> line. You can set it with export TFLAGS=-j8 for example.
It can improve test run times.

You may also try configuring with --enable-test-bundles, which makes
building the test binaries significantly faster.

@BrianInglis
Copy link
Contributor

I run the Cygwin curl tests in CI and local with unlimited job slots MAKEFLAGS="-j" cygtest to minimize time without overloading the system, as most tests only run for ms (and other approaches took longer from under or overloading - YMMV!)

The flakiness affects native Windows builds with test run parallelism enabled. Pure Cygwin tests run stable.

I think MAKEFLAGS="-j" cygtest parallelize building the tests, but not running them. runtests.pl displays the test parallelism at launch in the * Jobs: <N> line. You can set it with export TFLAGS=-j8 for example. It can improve test run times.

You may also try configuring with --enable-test-bundles, which makes building the test binaries signficiantly faster.

Thanks for those tips which I will try out in some tests, but System characteristics shows no Jobs: ... entry.

@vszakats
Copy link
Member Author

vszakats commented Feb 23, 2025

I run the Cygwin curl tests in CI and local with unlimited job slots MAKEFLAGS="-j" cygtest to minimize time without overloading the system, as most tests only run for ms (and other approaches took longer from under or overloading - YMMV!)

The flakiness affects native Windows builds with test run parallelism enabled. Pure Cygwin tests run stable.
I think MAKEFLAGS="-j" cygtest parallelize building the tests, but not running them. runtests.pl displays the test parallelism at launch in the * Jobs: <N> line. You can set it with export TFLAGS=-j8 for example. It can improve test run times.
You may also try configuring with --enable-test-bundles, which makes building the test binaries signficiantly faster.

Thanks for those tips which I will try out in some tests, but System characteristics shows no Jobs: ... entry.

You're welcome! A missing entry means test parallelism is not enabled. Try the TFLAGS=-j8 option. Its best value depends on the machine of course, it may work with more, we use 8 on the GHA Windows runners.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI Continuous Integration cmdline tool performance tests Windows Windows-specific
Development

Successfully merging this pull request may close these issues.

4 participants