[7.2.1] Windows/bzlmod/rules_rust: Failed to clean up module context directory: C:/tmp/modextwd/rules_rust~~crate (Directory not empty) #22710

Closed
criemen opened this issue Jun 11, 2024 · 9 comments
Labels
team-ExternalDeps External dependency handling, remote repositories, WORKSPACE file. type: bug untriaged

Comments

@criemen
Contributor

criemen commented Jun 11, 2024

Description of the bug:

This bug continues the discussion from #22688, and the fruitful conversation we had in the associated PR: #22689 (comment)
Context: Without that PR, my team was observing bazel crashing spuriously on Windows after converting our rules_rust usage from WORKSPACE files to bzlmod.

Thanks to bazelisk (it's amazing!), we were able to pull in a pre-release build of Bazel's 7.2.1 branch.
Instead of the aforementioned crash, we're now getting the (occasional) error

ERROR: Analysis of target '@@codeql~//ruby:ruby-generic-zip' failed; build aborted: Failed to clean up module context directory: C:/tmp/modextwd/rules_rust~~crate (Directory not empty)

during our build.

Is this a bazel bug or a rules_rust bug?

Which category does this issue belong to?

External Dependency

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

I'm not sure. The relevant build target that pulls rules_rust into '@@codeql~//ruby:ruby-generic-zip' is defined here, but that repository is built in the context of an internal repository, and a few of the target's dependencies aren't open-sourced, so the repo doesn't build standalone without modifications.

If there's a need for a reproducer, then I can try hacking/reducing our internal dependencies in a branch and writing an Actions workflow that runs bazel in a loop until it errors out with this.

Which operating system are you running Bazel on?

Windows Server 2019 (GitHub actions)

What is the output of bazel info release?

development version

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

Bazelisk to pull in 7628649

What's the output of git remote get-url origin; git rev-parse HEAD ?

No response

If this is a regression, please try to identify the Bazel commit where the bug was introduced with bazelisk --bisect.

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

We're using the rules_rust crate_universe mechanism to pull external dependencies from Cargo.toml into Bazel. Perhaps less common: we have 3 dependencies that we pull in as git repositories instead of as regular downloads from crates.io, which takes longer (and might create more files on disk / more I/O).
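
For reference, a minimal MODULE.bazel sketch of this kind of crate_universe setup (repo name, version, and file paths are placeholders, not our actual configuration; the git dependencies themselves are declared in Cargo.toml):

```starlark
# Hypothetical excerpt; the rules_rust version and the paths are placeholders.
bazel_dep(name = "rules_rust", version = "0.46.0")

crate = use_extension("@rules_rust//crate_universe:extension.bzl", "crate")
crate.from_cargo(
    name = "crates",
    cargo_lockfile = "//:Cargo.lock",
    manifests = ["//:Cargo.toml"],
)
use_repo(crate, "crates")
```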

@github-actions github-actions bot added the team-ExternalDeps External dependency handling, remote repositories, WORKSPACE file. label Jun 11, 2024
@Wyverald
Member

OK so this is exactly the same bug as #22688 except that we don't crash anymore, right?

It looks like something is holding the extension working directory hostage (par for the course on Windows)... I'm not sure what Bazel could do to recover in this case; I don't think we're supposed to wait forever until the ... hostage-holder (?) lets go.

@criemen
Contributor Author

criemen commented Jun 11, 2024

OK so this is exactly the same bug as #22688 except that we don't crash anymore, right?

Yes, exactly. At least as a non-Bazel dev, I didn't know which directory we were talking about in #22688, but now the exception includes a path.

I'm not sure what Bazel could do to recover in this case; I don't think we're supposed to wait forever until the ... hostage-holder (?) lets go.

Probably not, no. Would there be a reasonable spot somewhere in your code where you could show at least one candidate (or up to X candidates) of files inside that directory that prevent deletion?

Why is this directory cleaned up at all, in the middle of a build?
I'm seeing

[...]
Tue, 11 Jun 2024 18:27:59 GMT [378 / 494] Executing genrule @@_main~external_deps~luajit//:luajit64_genrule; 19s local, disk-cache ... (16 actions running)
Tue, 11 Jun 2024 18:28:00 GMT ERROR: Analysis of target '@@codeql~//ruby:ruby-generic-zip' failed; build aborted: Failed to clean up module context directory: C:/tmp/modextwd/rules_rust~~crate (Directory not empty)
Tue, 11 Jun 2024 18:28:00 GMT INFO: Elapsed time: 63.263s, Critical Path: 27.00s
Tue, 11 Jun 2024 18:28:00 GMT INFO: 396 processes: 113 internal, 283 local.
Tue, 11 Jun 2024 18:28:00 GMT ERROR: Build did NOT complete successfully
Tue, 11 Jun 2024 18:28:00 GMT FAILED: 
Tue, 11 Jun 2024 18:28:00 GMT BAZEL FAILED

in our build log, which seems to indicate that bazel tries to delete the directory while it's still executing other build targets (I doubt that >100 actions would finish in one second, without any log output).
Is there a workaround here that'd not clean up this directory at all? Our build machines are ephemeral, so I'd expect the directory sticking around to not matter, as the module extension should only be executed once, and then cached for the duration of the build.

@Wyverald
Member

Right. We sometimes need to restart a module extension's execution (same as the repo restart mechanism described here: https://bazel.build/extending/repo#restarting_the_implementation_function). Since 7.1.0 we've introduced the flag --experimental_worker_for_repo_fetching=auto to eliminate restarts from repo fetching, but that has had a bit of a rocky landing (many concurrency issues that have since been fixed) and hasn't been extended to module extensions yet.
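
To illustrate (a minimal Starlark sketch with made-up names, not actual rules_rust or Bazel code): if an extension performs expensive work before touching a label whose package hasn't been loaded yet, the whole implementation function is re-run from the top and the earlier work is repeated.

```starlark
def _run_expensive_tool(module_ctx):
    # Placeholder for a long-running step, e.g. an external process started
    # via module_ctx.execute that writes files into the working directory.
    pass

def _impl(module_ctx):
    _run_expensive_tool(module_ctx)

    # If this label's package hasn't been loaded yet, the extension is
    # restarted: _impl runs again from the top, repeating the expensive
    # step and leaving its earlier output behind in the working directory.
    module_ctx.path(Label("@some_repo//:Cargo.toml"))  # placeholder label

my_ext = module_extension(implementation = _impl)
```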

Would there be a reasonable spot somewhere in your code where you could show at least one candidate (or up to X candidates) of files inside that directory that prevent deletion?

Probably we can just ls the directory when we fail to clean it up. In fact I could cook up a temporary Bazel commit for you to try out (just set your .bazelversion to a commit SHA) in a bit.

@criemen
Contributor Author

criemen commented Jun 11, 2024

Probably we can just ls the directory when we fail to clean it up. In fact I could cook up a temporary Bazel commit for you to try out (just set your .bazelversion to a commit SHA) in a bit.

Let me try to reproduce the problem on my local Windows machine first, then I can inspect the directory.

@criemen
Contributor Author

criemen commented Jun 11, 2024

@Wyverald OK, it'd be great to get some support here; I've debugged as far as I can get.

I was able to isolate the code in our public repository, so I can share it with you. This Actions workflow (on that branch of the repo) builds a single Bazel target and fails with the error in roughly 1 to 3 out of 20 runs. Example failure here (but I'm not sure whether you can see the logs without extra permissions).
The problem does NOT reproduce on Windows 11/Windows Server 2022, which means I can't debug it locally.

I also added an actions/upload-artifact step to those workflows to observe the directory that's not empty, and that step never has anything to upload. So if you add an ls somewhere, be careful where it is placed: ideally directly in the file system code that complains that the directory is not empty, dumping its state and then confirming it with ls. Anything later might be too late to observe anything useful.

I also observed that fetching this module extension is really slow, but I think most of that is the cargo-bazel invocation. I'm not sure, but I believe that happens at the end of the module extension, so I'd be surprised if it gets interrupted by a restart; I haven't checked the module extension's source, though. If there are any more logs that would be interesting to look at, or if I can assist further with investigating this, please let me know!
Otherwise, I think forking the repo on GitHub will let you run my reproducer Actions workflow yourself in your own repo.
FWIW, I'm using `export PAGER= && for run in {1..20}; do gh api -X POST 'repos/github/codeql/actions/workflows/criemen-debug.yml/dispatches' -f "ref=criemen/repr-bazel-bug"; done` to trigger the dispatches.

@criemen
Contributor Author

criemen commented Jun 12, 2024

I tried to upload the whole Bazel output base, but that was too many files for actions/upload-artifact. I did get a log file out, but I'm not sure it tells us much. It does mention a TreeDeleter, so maybe that's where modextwd gets cleaned up in the end?
debug-Windows.zip

Also, since the problem here seems to be restarts of the module extension: is there any way to figure out what's causing them and then preload the reasons for those restarts? Even a hacky approach where we patch rules_rust and insert labels pointing at things in our repository could be a workaround for us until we find a proper root cause and fix.

@criemen
Contributor Author

criemen commented Jun 12, 2024

From what I can see, restarts in module extensions can't be prevented at all at the moment?
However, after some targeted print-debugging in rules_rust, it turned out that it does expensive processing first, followed by later steps that cause more restarts. I believe those restarts leave the module extension working directory in a non-clean state, which then makes the cleanup fail as outlined in this issue.

bazelbuild/rules_rust#2691 reduces the number of expensive restarts on rules_rust, and I'll test that on CI soon.
That should work around the bug here, but it of course doesn't solve it.
Is there an issue for extending the virtual threads approach to module extension implementations, to do away with the restarts?

@Wyverald
Member

Is there an issue for extending the virtual threads approach to module extension implementations, to do away with the restarts?

Filed #22729

github-merge-queue bot pushed a commit to bazelbuild/rules_rust that referenced this issue Jun 13, 2024
This PR reduces the number of expensive module extension implementation
restarts in combination with multiple `from_cargo` configurations.

Since processing each configuration calls `module_ctx.path` (which can and will cause restarts), we executed `_generate_hub_and_spokes` (i.e. `cargo-bazel`) _a lot_ and then threw the result away on the next restart.
Triggering `module_ctx.path` early is therefore a significant
performance optimization.

On our repository, this gives us a 10sec speedup on module extension
processing (local M1 mac, nothing happening in parallel), see our
[MODULE.bazel](https://github.com/github/codeql/blob/main/MODULE.bazel)
if you're interested in what we're doing.

This excessive restarting also exposed an upstream bazel bug on Windows
2019, where bazel spuriously fails to clean up the working directory
(c.f. bazelbuild/bazel#22710).

Co-authored-by: Daniel Wagner-Hall <dwagnerhall@apple.com>
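
A simplified sketch of the pattern the commit message describes (not the actual rules_rust code; the tag and attribute names follow the `from_cargo` tag mentioned above, and `_generate_hub_and_spokes` is stubbed out): resolve all labels via `module_ctx.path` up front, so any restarts happen before the expensive step rather than after it.

```starlark
def _generate_hub_and_spokes(module_ctx, cfg):
    # Schematic stand-in for the expensive cargo-bazel invocation.
    pass

def _crate_impl(module_ctx):
    # Touch every label first: restarts triggered by module_ctx.path now
    # happen while the function is still cheap to re-run.
    for mod in module_ctx.modules:
        for cfg in mod.tags.from_cargo:
            module_ctx.path(cfg.cargo_lockfile)
            for manifest in cfg.manifests:
                module_ctx.path(manifest)

    # Only now do the expensive work; all labels are already resolved, so no
    # further restarts should occur and the work isn't thrown away.
    for mod in module_ctx.modules:
        for cfg in mod.tags.from_cargo:
            _generate_hub_and_spokes(module_ctx, cfg)

crate = module_extension(
    implementation = _crate_impl,
    tag_classes = {"from_cargo": tag_class(attrs = {
        "cargo_lockfile": attr.label(),
        "manifests": attr.label_list(),
    })},
)
```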
github-merge-queue bot pushed a commit to bazelbuild/rules_rust that referenced this issue Jun 14, 2024
@criemen
Contributor Author

criemen commented Jun 14, 2024

Let's close this as won't fix:

- I have a workaround for rules_rust, and other people can implement similar workarounds in their module extensions.
- The problem disappeared on Windows 11/Server 2022, which hints at an OS-level fix we're benefiting from, so investigating further would only benefit a legacy OS release.
- Once #22729 lands, there are no restarts anymore and no need to clean up the module extension working directory either.

So the impact of working on this issue would be really low.

@criemen closed this as not planned on Jun 14, 2024