Flaky build: Windows build sometimes fails with "resource busy or locked" #18991
Happened again in https://github.visualstudio.com/Atom/_build/results?buildId=33622
Happened again in:
Happened to me today in:
I'm starting to look into this issue. Here's what I'm seeing so far.

Starting with rimraf: the failure happens down inside rimraf itself.
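To make that failure mode concrete, here's a minimal, hypothetical sketch (made-up paths and names, not the CI stack trace) of how a handle held open on Windows can surface as a "resource busy or locked" error out of rimraf:

```js
// Illustrative sketch only (not Atom's test code): on Windows, removing a
// directory while some process still holds an open handle inside it can fail
// with EBUSY ("resource busy or locked") or EPERM, which then bubbles up
// through rimraf's callback.
const fs = require('fs');
const os = require('os');
const path = require('path');
const rimraf = require('rimraf'); // assumes rimraf 2.x

const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'busy-demo-'));
const file = path.join(dir, 'held-open.txt');
fs.writeFileSync(file, 'some fixture data');

// Simulate a file that's still "in use". In the real failures the handle is
// presumably held by the renderer process, not by the test itself.
const fd = fs.openSync(file, 'r');

rimraf(dir, error => {
  if (error) {
    // On Windows this is where EBUSY/EPERM shows up.
    console.error('rimraf failed:', error.code, error.message);
  } else {
    console.log('directory removed cleanly');
  }
  fs.closeSync(fd);
});
```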
We're currently using rimraf 2.6.2, which is almost the latest version. The latest version is 2.6.3, and it doesn't appear to include any changes that would affect the problem we're experiencing (isaacs/rimraf@v2.6.2...v2.6.3).

rimraf is getting called by node-temp.
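Roughly, that call path looks like this (a sketch under assumptions, not the actual spec code or the real stack trace):

```js
// Sketch of the node-temp -> rimraf call path (illustrative, not Atom's specs).
const temp = require('temp').track();

// A spec creates a throwaway directory for fixtures...
const fixtureDir = temp.mkdirSync('some-spec-fixture-');
console.log('created', fixtureDir);

// ...and when tracked paths are cleaned up at the end of the run, node-temp
// hands each directory to rimraf. That is the frame where the
// "resource busy or locked" error is being thrown.
temp.cleanupSync();
```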
We're currently using @atom/node-temp 0.8.4, which is a fork of node-temp. The latest version of node-temp is 0.9.0. It only includes a few changes from our fork, but those changes might be helpful to us. Specifically, node-temp 0.9.0 replaces use of the deprecated …

We probably don't need to be using a forked version of node-temp anymore. We adopted the forked version of node-temp in atom/tree-view#1225. We did so in order to upgrade to a newer version of rimraf (atom/node-temp@b5266da). However, since the time that we created the fork, node-temp has incorporated that same rimraf upgrade (bruce/node-temp@b5266da). Therefore, I think we can safely move from our forked version of node-temp back to using the official version. I'm going to explore this option.

☝️ It might help with this issue, or it might not. Even if it doesn't help, it will be nice to move off of our fork of node-temp if we no longer need to be maintaining a fork. 😅
Could the …
That is indeed the case in all of the logs that I've seen so far.
Good question. Maybe the …
#19143 upgraded Atom to node-temp 0.9.0. Unfortunately, even after that upgrade, we still sometimes see these failures: https://ci.appveyor.com/project/Atom/atom/builds/23812379/job/w2yekqk4w1ge9us3#L122
In #18991 (comment), @rafeca asked:
I think this would be useful to investigate. Similar to what @rafeca was asking, I wonder if Atom has a lock on the …

When a crash happens on macOS, the crash report is included in the build artifacts. But for some reason, when a crash happens on Windows, there's no crash report in the build artifacts (example). 😕 If we could get the crash report, that might offer a clue regarding the cause of the problem.
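If we do decide we want those Windows crash reports, one hypothetical approach is a small post-test step that copies any minidumps into whatever directory the CI job publishes as artifacts. The dump directory and the ARTIFACT_DIR variable below are assumptions for illustration, not how our build is currently configured:

```js
// Hypothetical post-test step: copy Windows minidumps into the CI artifact
// directory so they get uploaded with the build. The dump location and the
// ARTIFACT_DIR variable are assumptions, not Atom's actual configuration.
const fs = require('fs');
const os = require('os');
const path = require('path');

const dumpDir = path.join(os.tmpdir(), 'Atom Crashes'); // assumed dump location
const artifactDir = process.env.ARTIFACT_DIR || path.join(__dirname, 'crash-reports');

if (fs.existsSync(dumpDir)) {
  if (!fs.existsSync(artifactDir)) fs.mkdirSync(artifactDir);
  for (const name of fs.readdirSync(dumpDir)) {
    if (name.endsWith('.dmp')) {
      fs.copyFileSync(path.join(dumpDir, name), path.join(artifactDir, name));
      console.log('collected crash dump:', name);
    }
  }
}
```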
I think the
So the real question is: Why is the render process crashing? Unfortunately, it doesn't seem like we have much to go on from these reports. We're just seeing that the render process crashed, followed by an exception caused by trying to delete a file that it still had locked. I'll do some investigation into how we can get more information about the crash.
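One cheap way to get a bit more signal in the meantime would be to have the main process log whenever a renderer dies. A sketch assuming the Electron 2/3-era 'crashed' event (not necessarily something our test harness does today):

```js
// Sketch: log renderer crashes from the main process so they at least show up
// in the CI output. Assumes the Electron 2/3-era 'crashed' event.
const { app } = require('electron');

app.on('web-contents-created', (_event, contents) => {
  contents.on('crashed', (_crashEvent, killed) => {
    console.error(`Renderer crashed (killed: ${killed}) while loading ${contents.getURL()}`);
  });
});
```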
I spent some time trying to reproduce this crash in a local VM and got lucky. The crash report indicates that my crash originated from our fork of …

When @smashwilson was working on …

There's no guarantee that this local crash was caused by the same thing as our crashes on CI, but it's a distinct possibility.
Encountered a second crash locally, also originating from …

This time, I'm going to dig into #17899, which switches from …
2 more local crashes, both involving …
I was really hoping #17899 would resolve this, but it looks like the render process crashed on Azure DevOps for that PR on Windows. So it's either a different crash, or we're still somehow using …
Been running tests locally for over an hour with no reproduction yet. Going to let them loop all night and see what I find. It's strange because last time I tried I was able to reproduce failures faster than this.
Frustrated. I ran render process tests in a Windows VM locally on c3bf951 (which is right after the railcars rolled to …
Uhmm, I recently upgraded the version of …

This is the list of changes in …
For posterity, I am reporting here the findings that were mentioned on Slack.

It looks like we haven't been able to reproduce those crashes anymore because, as @rafeca pointed out, …

In the process of investigating the original crash, we discovered another potential crash in atom/nsfw#5, but we haven't been able to reproduce that, possibly because the …

The above evidence would seem to confirm that the crash was effectively fixed as of #19273. We will conduct some additional testing on Windows to verify that, by rolling back that pull request, we can "easily" reproduce the crash again. If that's the case, I think we could go ahead and close this issue.
@as-cii The stack traces that I saw included the …
Okay, I managed to reproduce the crashes locally on cbd4641. Because I built locally, I was able to view the source associated with the crash report. Here's a screenshot:

It looks like I was wrong. When we reverted the Electron 3 upgrade in c9e6d04, it looks like we downgraded from …
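For the record, here's an illustrative quick check of which nsfw copy a build actually loads, which would make a silent downgrade like this visible (assumes it's run from the repo root):

```js
// Illustrative check of which nsfw copy is actually resolved at runtime,
// which would have made the silent downgrade visible.
console.log(require.resolve('nsfw'));               // where it resolves from
console.log(require('nsfw/package.json').version);  // version of that copy
```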
After reflecting a bit, I just want to offer a mea culpa on this entire issue here. This was a blunder on my part. Missing the fact that we weren't on the latest version of …
We saw this issue again yesterday, and that branch uses nsfw 1.0.22. I think that means that this issue is still relevant, so I'm going to re-open it. If I'm misunderstanding, please let me know.
Good news! @nathansobo and I were able to open and symbolicate the crash report from the failing build mentioned above and it's not caused by a null-pointer exception in NSFW. Instead, it appears that the process is running out of memory:

Another interesting fact is that we're observing 15 NSFW threads associated with the process in the crash report. Their stack traces all look similar to:

Could this still be somehow related to NSFW, or are we simply allocating too much memory in our tests? We will keep investigating and report our findings here.
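To help distinguish "NSFW is doing something wrong" from "our specs simply allocate too much", one diagnostic we could add (hypothetical, not currently in the suite) is periodic memory sampling during the spec run:

```js
// Hypothetical diagnostic: log memory periodically while the specs run, so an
// out-of-memory crash is preceded by a visible growth curve in the CI log.
const toMB = bytes => Math.round(bytes / 1024 / 1024);

const sampler = setInterval(() => {
  const { rss, heapUsed, external } = process.memoryUsage();
  console.log(`memory: rss=${toMB(rss)}MB heapUsed=${toMB(heapUsed)}MB external=${toMB(external)}MB`);
}, 5000);

// Call clearInterval(sampler) when the spec run finishes so the process can exit.
```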
We seem to be leaking memory in …
Okay, I have found the leak and will open a PR fixing it.
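Not the actual leak or the actual fix (see the PR for that), but for anyone following along, the general shape of this kind of leak in a spec suite is a per-spec resource that never gets torn down. A purely illustrative example with nsfw:

```js
// Purely illustrative (not the code from the actual fix): a watcher created in
// every spec but never stopped keeps its native thread and buffers alive for
// the rest of the run, which adds up over thousands of specs.
const nsfw = require('nsfw');

let watcher;

beforeEach(async () => {
  watcher = await nsfw(__dirname, () => {}); // hypothetical fixture directory
  await watcher.start();
});

afterEach(async () => {
  await watcher.stop(); // omitting this leaks one watcher (and one NSFW thread) per spec
});
```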
Well, crap. Even after #19359, it looks like we're still seeing this problem. Here's the latest instance: https://ci.appveyor.com/project/Atom/atom/builds/24670700/job/w6mc6rnungpiaaru#L1575 I think that means that we should re-open this issue. If I'm misinterpreting the problem in the build linked above, please let me know.
This issue keeps morphing because it is very general: "The render process is crashing". This time, we were able to reproduce a crash on @as-cii's Windows VM. Here's a screenshot of the stack trace. After diving into the V8 source, the C++ exception in question seems to come from a …
We don't have crash reports on AppVeyor, and we'd need those to definitively declare that the crash we observed locally was the same as what we observed there. Considering we want to move away from AppVeyor, I don't think it's worth adding crash report support there now.

At this point, the situation feels murky. We definitely observed a real crash that reaches into the internals of V8. We recently upgraded to Electron 3, which could be connected. If this is just about AppVeyor, I don't really care, but what if this points to a deeper issue that users might encounter? This is especially worrisome in light of #18915.
Here's basically the same crash that @rafeca observed on macOS.
Closing in favor of #19372 since this crash isn't Windows-specific. I think the crash we're observing now deserves its own issue as it likely blocks release.
This issue has been automatically locked since there has not been any recent activity after it was closed. If you can still reproduce this issue in Safe Mode then please open a new issue and fill out the entire issue template to ensure that we have enough information to address your issue. Thanks!
In our Windows 64-bit builds, we frequently see the build fail at this point in the test suite...
Rebuilding can sometimes resolve the failure.
Affected CI provider(s): AppVeyor, Azure DevOps
Affected platform(s): Windows 64-bit
Example commit experiencing flakiness: 8d8e39d
Example failing build at that commit: https://ci.appveyor.com/project/Atom/atom/builds/22070045/job/ooh3881g6b53avyn
Example passing build at that commit: https://ci.appveyor.com/project/Atom/atom/builds/22070428/job/t68iykoniojgy7tw
Example log from failing build: https://github.com/atom/atom/files/2967489/build.log