[Bug]: Flaky build failure: npm package directory copy fails with "No such file or directory" #1412
Comments
I just hit this as well. Some empirical observations:
I saw this right after an upgrade to Bazel 7.0.0, but the upgrade may simply have caused a full invalidation. That said, I don't recall ever seeing this on a Bazel 7 nightly from a few weeks ago, even with fairly frequent expunges. |
@DavidZbarsky-at which Bazel version were you on prior to the upgrade to Bazel 7? And could you be more precise about how you "disable BES": which flags did you change? |
We were previously on 7.0.0-pre.20231011.2. I disabled the following flags:
|
I wasn't seeing this on Bazel 6, but it is happening pretty regularly right after upgrading to Bazel 7. Here are our options:
I tried (1) enabling runfiles and (2) disabling sandboxing to see whether either helped, but neither had any impact. |
We speculated that configuring CopyDirectory not to run remotely via the following could help:
However, testing shows this was not the case - we still see the same error when the action runs locally. |
I'm now seeing this in a non-CI scenario:
However, the directory exists locally:
If I comment out remote caching, then the issue goes away:
I'll note that the library it is complaining about probably does not exist on our remote cache server yet, because this change hasn't been pushed through yet. |
We're also seeing this. I do find it very interesting that it's an exec failure. AFAICT it's not an error printed by the tool responsible for the action, it's bazel printing that error. Whenever this happens, I cannot find the action in the profile produced by I'm only talking from intuition, I have no proof for any of my speculation. |
I was able to reproduce it 100% of the time with Bazel 7 if I do things in the right order.
I start by doing Doing a build of just the part of the repo with JS dependencies:
I'm 3 for 3 now in my test of reproducing it. Any ideas what to look for or try? If I run the exact same build command again, it works the second try. And now that I ran the build command the second time, I can't reproduce the error... Time to dig in some more. |
I've also experienced this, I'm not sure what's causing it though |
We are also seeing this on Bazel
Given the name of the target I suspect that https://github.com/aspect-build/rules_ts/blob/main/ts/private/npm_repositories.bzl#L87 is involved, even though I am not sure how or why. |
The work-around documented here has resolved this for a few users we've talked to on Slack: https://github.com/aspect-build/rules_js/blob/main/docs/faq.md#flaky-build-failure-exec-failed-due-to-ioexception The root cause is unclear, but it looks related to "build without the bytes" with remote-able copy actions from bazel-lib. Note: if you're using persistent runners, then even with this fix landed at HEAD, your runner could still get the external repository into this bad state if a build was run on the runner on a PR with a base branch without the fix. After landing, you'll need to ask all developers to rebase PRs past the fix so all builds on the persistent runner have the flags set. |
I wanted to note that the workaround did not work in our case. It was still happening with 7.0.2. |
I can also confirm that this workaround does not solve the issue |
Unfortunately, I don't have a repro of this issue in any of our builds on rules_js CI or in our internal uses. We are on 7.0.2 internally. @ewhauser I believe you're on persistent runners without RBE. Are you using bzlmod or WORKSPACE? Have you double-checked that you don't have multiple |
Huh. I would never have guessed that |
@gregmagolan, so with that fix do we still need |
It looks like the fix just made the flake less frequent but didn't fix it. Re-opening this issue. Our plan is now to switch the copy_directory rule that takes the source directory as its input to a new tar extract rule that takes the tgz file as an input. That should solve it once and for all. @thesayyn and @alexeagle are looking into expanding the tar rule in bazel-lib to support extraction. |
As an aside: Would it be worth filing an upstream Bazel issue for this? This feels like a bug in Bazel that you're working around here. |
We've narrowed the problem (for us at least) down to the I believe that we already had I don't have a minimal repro, but my Bazel 7 upgrade was failing consistently without |
Re-opening as even with the fix in #1538 there are still the corner cases that use source directories: packages with lifecycle hooks & packages with patches. |
CC @joeleba Based on #1412 (comment) and the error message it looks more likely that this has something to do with |
More detailed update on this issue now that #1538 has landed: #1538 has "mostly" fixed this issue. For most packages (those that don't have lifecycle hooks or patches), the fix means they no longer use a CopyDirectory action with a source directory input to copy into the virtual store. Instead, those packages use a tar toolchain to extract the package .tgz directly into the virtual store. As mentioned above, packages with lifecycle hooks and those with patches will require separate fixes in the future to no longer use source directory inputs. |
JFYI: Updating to the latest version of rules_js fixed this issue for us in 100% of cases.
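To make the idea in the update above concrete, here is a rough Python sketch of "extract the package .tgz directly into a destination directory". It is not the rules_js implementation, which runs a tar toolchain inside a Bazel action. The function name `extract_npm_tgz`, the `strip_components` parameter, and the assumption that the archive has a single top-level `package/` directory are illustrative only. The point is that the only input is one archive file, rather than a source directory tree.

```python
import tarfile
from pathlib import Path, PurePosixPath


def extract_npm_tgz(tgz_path: Path, dest: Path, strip_components: int = 1) -> list[str]:
    """Extract regular files from a .tgz into dest, dropping leading path components.

    Hypothetical helper: assumes an npm-style archive whose entries sit under
    one top-level directory (for example "package/").
    """
    dest.mkdir(parents=True, exist_ok=True)
    dest_root = dest.resolve()
    extracted = []
    with tarfile.open(tgz_path, "r:gz") as tar:
        for member in tar.getmembers():
            # This sketch skips directories, symlinks and other special entries.
            if not member.isfile():
                continue
            parts = PurePosixPath(member.name).parts[strip_components:]
            if not parts:
                continue
            target = dest_root.joinpath(*parts).resolve()
            # Refuse entries that would land outside the destination directory.
            if not target.is_relative_to(dest_root):
                raise ValueError(f"unsafe path in archive: {member.name}")
            target.parent.mkdir(parents=True, exist_ok=True)
            source = tar.extractfile(member)
            target.write_bytes(source.read())
            extracted.append("/".join(parts))
    return sorted(extracted)
```

Calling `extract_npm_tgz(Path("pkg.tgz"), Path("out"))` would write the archive's files under `out/` and return their relative paths. The file names here are placeholders.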
What happened?
Errors look like
https://bazelbuild.slack.com/archives/CEZUUKQ6P/p1702916591850299
They happen on about 2% of builds as observed by one client. They use --remote_download_minimal.
Version
One client observed on rules_js 1.32.2 and bazel 6.4.0
How to reproduce
Any other information?
No response