
Flaky build: Windows build sometimes fails with "resource busy or locked" #18991

Closed
jasonrudolph opened this issue Mar 14, 2019 · 35 comments · Fixed by #19244 or #19359

Comments

@jasonrudolph
Contributor

In our Windows 64-bit builds, we frequently see the build fail at this point in the test suite...

...
TextEditor > buffer manipulation > .outdentSelectedRows() > when one line is selected > outdents line and retains editor Renderer process crashed, exiting
EBUSY: resource busy or locked, unlink 'C:\Users\VSSADM~1\AppData\Local\Temp\atom-test-data119118-3988-dplic9.gszd4\databases\Databases.db'
Error: EBUSY: resource busy or locked, unlink 'C:\Users\VSSADM~1\AppData\Local\Temp\atom-test-data119118-3988-dplic9.gszd4\databases\Databases.db'
    at Object.fs.unlinkSync (fs.js:1061:3)
    at rimrafSync (D:\a\1\s\node_modules\temp\node_modules\rimraf\rimraf.js:211:13)
    at D:\a\1\s\node_modules\temp\node_modules\rimraf\rimraf.js:245:5
    at Array.forEach (<anonymous>:null:null)
    at rmkidsSync (D:\a\1\s\node_modules\temp\node_modules\rimraf\rimraf.js:244:26)
    at rmdirSync (D:\a\1\s\node_modules\temp\node_modules\rimraf\rimraf.js:237:7)
    at fixWinEPERMSync (D:\a\1\s\node_modules\temp\node_modules\rimraf\rimraf.js:150:5)
    at rimrafSync (D:\a\1\s\node_modules\temp\node_modules\rimraf\rimraf.js:216:26)
    at D:\a\1\s\node_modules\temp\node_modules\rimraf\rimraf.js:245:5
    at Array.forEach (<anonymous>:null:null)
    at rmkidsSync (D:\a\1\s\node_modules\temp\node_modules\rimraf\rimraf.js:244:26)
    at rmdirSync (D:\a\1\s\node_modules\temp\node_modules\rimraf\rimraf.js:237:7)
    at fixWinEPERMSync (D:\a\1\s\node_modules\temp\node_modules\rimraf\rimraf.js:150:5)
    at rimrafSync (D:\a\1\s\node_modules\temp\node_modules\rimraf\rimraf.js:216:26)
    at cleanupDirsSync (D:\a\1\s\node_modules\temp\lib\temp.js:143:5)
    at process.cleanupSync (D:\a\1\s\node_modules\temp\lib\temp.js:196:19)
    at process.emit (events.js:187:15)
    at App.<anonymous> (D:\a\1\s\out\Atom Dev x64\resources\electron.asar\browser\init.js:64:11)
    at App.emit (events.js:182:13)
    at AtomApplication.exit (D:\a\1\s\src\main-process\atom-application.js:99:9)
    at WebContents.browserWindow.webContents.on (D:\a\1\s\src\main-process\atom-window.js:180:30)
    at WebContents.emit (events.js:182:13)

Rebuilding can sometimes resolve the failure.


Affected CI provider(s): AppVeyor, Azure DevOps

Affected platform(s): Windows 64-bit

Example commit experiencing flakiness: 8d8e39d

Example failing build at that commit: https://ci.appveyor.com/project/Atom/atom/builds/22070045/job/ooh3881g6b53avyn

Example passing build at that commit: https://ci.appveyor.com/project/Atom/atom/builds/22070428/job/t68iykoniojgy7tw

Example log from failing build: https://github.com/atom/atom/files/2967489/build.log

@jasonrudolph
Contributor Author

@rafeca
Contributor

rafeca commented Mar 19, 2019

@rafeca
Contributor

rafeca commented Mar 27, 2019

@jasonrudolph
Contributor Author

I'm starting to look into this issue. Here's what I'm seeing so far.

Starting with rimraf...

The failure happens down inside rimraf:

Error: EBUSY: resource busy or locked, unlink 'C:\Users\VSSADM~1\AppData\Local\Temp\atom-test-data119118-3988-dplic9.gszd4\databases\Databases.db'
    at Object.fs.unlinkSync (fs.js:1061:3)
    at rimrafSync (D:\a\1\s\node_modules\temp\node_modules\rimraf\rimraf.js:211:13)
    ...

We're currently using rimraf 2.6.2, which is almost the latest version. The latest version is 2.6.3, and it doesn't appear to include any changes that would affect the problem we're experiencing (isaacs/rimraf@v2.6.2...v2.6.3).
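For context on what rimrafSync is doing when it hits this: it deletes files with fs.unlinkSync, and as far as I can tell the sync code path has no retry for EBUSY (the async rimraf API retries a few times via its maxBusyTries option). Below is a minimal sketch of the kind of retry-on-EBUSY wrapper that can paper over transient Windows file locks. This is a hypothetical helper for illustration only, not something rimraf or temp exposes:

const fs = require('fs');

// Hypothetical helper: retry fs.unlinkSync when Windows reports the file as
// busy or locked (EBUSY), waiting briefly between attempts. Transient locks
// (e.g. from an antivirus scanner or a process that is still shutting down)
// often clear within a few hundred milliseconds; a lock held by a crashed
// renderer will not, so this would only mask part of the problem.
function unlinkWithRetrySync (file, attempts = 5, delayMs = 100) {
  for (let i = 0; i < attempts; i++) {
    try {
      fs.unlinkSync(file);
      return;
    } catch (error) {
      if (error.code !== 'EBUSY' || i === attempts - 1) throw error;
      const end = Date.now() + delayMs;
      while (Date.now() < end) {} // crude synchronous wait, illustration only
    }
  }
}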

rimraf is called by node-temp...

rimraf is getting called by node-temp:

...
at fixWinEPERMSync (D:\a\1\s\node_modules\temp\node_modules\rimraf\rimraf.js:150:5)
at rimrafSync (D:\a\1\s\node_modules\temp\node_modules\rimraf\rimraf.js:216:26)
at cleanupDirsSync (D:\a\1\s\node_modules\temp\lib\temp.js:143:5)
at process.cleanupSync (D:\a\1\s\node_modules\temp\lib\temp.js:196:19)
...

We're currently using @atom/node-temp 0.8.4, which is a fork of node-temp.

The latest version of node-temp is 0.9.0. It only includes a few changes from our fork, but those changes might be helpful to us. Specifically, node-temp 0.9.0 replaces use of the deprecated os-tmpdir module with Node's built-in os module (atom/node-temp@279c135).
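For anyone following along, here's roughly how the test suite ends up in that cleanup path. This is a hedged sketch of assumed node-temp usage (Atom's actual spec-helper code may differ); it shows why rimraf runs at process exit at all:

// Sketch of assumed usage, not Atom's actual spec-helper code.
// track() registers an exit hook so cleanupSync() runs when the process exits.
const temp = require('temp').track();

// Each test run gets a scratch directory under the OS temp dir
// (node-temp 0.9.0 resolves this via os.tmpdir() instead of the
// deprecated os-tmpdir module).
const atomTestData = temp.mkdirSync('atom-test-data');

// ... specs run against the scratch directory ...

// On exit, temp's cleanupSync() walks the tracked directories and calls
// rimraf.sync() on each one — that's the cleanupDirsSync/rimrafSync pair
// in the stack trace above, and where the EBUSY surfaces on Windows.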

We probably don't need to be using a forked version of node-temp anymore

We adopted the forked version of node-temp in atom/tree-view#1225. We did so in order to upgrade to a newer version of rimraf (atom/node-temp@b5266da). However, since the time that we created the fork, node-temp has incorporated that same rimraf upgrade (bruce/node-temp@b5266da). Therefore, I think we can safely move from our forked version of node-temp back to using the official version.

I'm going to explore this option. ☝️ It might help with this issue, or it might not. Even if it doesn't, it will be nice to move off of our fork of node-temp if we no longer need to maintain one. 😅

This was referenced Apr 11, 2019
@rafeca
Contributor

rafeca commented Apr 11, 2019

Could the EBUSY errors be caused by a crash? In a few (if not all) of the failures I saw a Renderer process crashed, exiting message on the line just before the EBUSY error...

@jasonrudolph
Contributor Author

In a few (if not all) of the failures I saw a Renderer process crashed, exiting message on the line just before the EBUSY error...

That is indeed the case in all of the logs that I've seen so far.

Could the EBUSY errors be caused by a crash?

Good question. Maybe a crash is causing the EBUSY errors, or maybe an EBUSY error is what's crashing the renderer. I don't yet know whether the order of the lines in the log file is a reliable indicator of the order in which things actually happened. 😬

@jasonrudolph
Contributor Author

The latest version of node-temp is 0.9.0. It only includes a few changes from our fork, but those changes might be helpful to us. Specifically, node-temp 0.9.0 replaces use of the deprecated os-tmpdir module with Node's built-in os module (atom/node-temp@279c135).

#19143 upgraded atom to node-temp 0.9.0. Unfortunately, even after that upgrade, we still sometimes see these EBUSY: resource busy or locked failures:

https://ci.appveyor.com/project/Atom/atom/builds/23812379/job/w2yekqk4w1ge9us3#L122

@nathansobo nathansobo self-assigned this Apr 15, 2019
@jasonrudolph
Contributor Author

In #18991 (comment), @rafeca asked:

Could the EBUSY errors be caused by a crash? In a few (if not all) of the failures I saw a Renderer process crashed, exiting message on the line just before the EBUSY error...

I think this would be useful to investigate. Similar to what @rafeca was asking, I wonder if Atom has a lock on the Databases.db file, and then the renderer crashes with Databases.db still locked, and then the next test attempts to access that file and can't, thus triggering the EBUSY error message.

When a crash happens on macOS, the crash report is included in the build artifacts. But for some reason, when a crash happens on Windows, there's no crash report in the build artifacts (example). 😕 If we could get the crash report, that might offer a clue regarding the cause of the problem.

@nathansobo
Contributor

nathansobo commented Apr 15, 2019

I think the EBUSY is downstream of the actual issue, which is the render process crashing. My narrative for why we're seeing the error while deleting a temp directory: the renderer crashes while it still holds a handle on files like Databases.db in its temp data directory, the main process reacts to the crash by exiting, and node-temp's exit-time cleanup then tries to rimraf that directory, which Windows refuses with EBUSY because the crashed process's handles haven't been released yet.

So the real question is: Why is the render process crashing? Unfortunately, it doesn't seem like we have much to go on from these reports. We're just seeing that the render process crashed, followed by an exception from trying to delete a file that it still has locked. I'll do some investigation into how we can get more information about the crash.
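To make that narrative concrete, here's a simplified sketch of the chain the stack trace implies. The paths and details are illustrative, not Atom's actual code: the crash handler in atom-window.js exits the application, and node-temp's exit hook then runs the cleanup that throws.

const { app, BrowserWindow } = require('electron');

app.on('ready', () => {
  const window = new BrowserWindow();
  window.loadURL('file:///path/to/spec-runner.html'); // illustrative path

  // atom-window.js: the renderer running the specs crashes...
  window.webContents.on('crashed', () => {
    console.error('Renderer process crashed, exiting');
    app.exit(1); // ...and atom-application.js exits the main process...
  });
});

// ...which fires the process exit event, where node-temp's cleanupSync()
// (see temp.js in the stack trace) rimrafs the tracked temp directories.
// The crashed renderer's handle on Databases.db hasn't been released yet,
// so Windows refuses the unlink with EBUSY.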

@nathansobo
Contributor

nathansobo commented Apr 17, 2019

I spent some time trying to reproduce this crash in a local VM and got lucky. The crash report indicates that my crash originated from our fork of nsfw. I spent about an hour trying to symbolicate the nsfw stack frames but decided to shelve that effort.

When @smashwilson was working on watcher, I remember him lamenting the structure of nsfw and its use of threads. In this crash dump, I see frames from the nsfw module in 3 different threads: 19 (the thread that crashed), 22, and 23. Could there be some kind of evil race condition within nsfw on Windows that's exacerbated by the rapid setups and teardowns occurring within Atom's test suite?

There's no guarantee that this local crash was caused by the same thing as our crashes on CI, but it's a distinct possibility.

@nathansobo
Contributor

Encountered a second crash locally, also originating from nsfw. I appended it to the previous crash report as "Crash 2".

This time, nsfw stack frames were only active in the thread that crashed, rather than in multiple threads. So maybe that rules out a race between nsfw threads. Not sure at this point, and maybe it doesn't matter since we have a viable alternative to nsfw.

I'm going to dig into #17899 which switches from nsfw to @atom/watcher and see if I can get us switched to it. Hopefully doing so will solve these render process crashes.

@nathansobo
Contributor

2 more local crashes, both involving nsfw on the crashing thread and in 3-4 other threads as well. I'm not going to post the crash reports for them because I don't think they will add much.

@nathansobo
Contributor

I was really hoping #17899 would resolve this, but it looks like the render process crashed on Azure DevOps for that PR on Windows. So it's either a different crash, or we're still somehow using nsfw on that PR even though it changes the default. The mystery continues.

@rafeca
Contributor

rafeca commented Apr 29, 2019

@nathansobo
Contributor

It seems that #19244 introduced issues of its own, so I'm switching back to @atom/nsfw for now in #19345. This issue needs to stay open until we find another path-watching solution or the underlying cause of the crash, so I'm reopening it.

@nathansobo nathansobo reopened this May 15, 2019
@nathansobo
Contributor

Been running tests locally for over an hour with no reproduction yet. Going to let them loop all night and see what I find. It's strange because last time I tried I was able to reproduce failures faster than this.

@nathansobo
Contributor

nathansobo commented May 16, 2019

Frustrated. I ran render process tests in a Windows VM locally on c3bf951 (which is right after the railcars rolled to 1.39.0-dev and before I merged #19244) all night, with no reproduction. Could something have subtly shifted from when I was testing things earlier in the month? At this point, I'm ready to put this issue back in the technical backlog and wait to see it again on CI.

@rafeca
Contributor

rafeca commented May 16, 2019

Hmm, I recently upgraded the version of nsfw in #19273 (it had been downgraded at the end of February when we reverted the Electron 3 change), but I have no idea whether that could have fixed the crashes...

This is the list of changes in nsfw: atom/nsfw@v1.0.18...v1.0.22

@as-cii
Contributor

as-cii commented May 16, 2019

For posterity, I am reporting here the findings that were mentioned on Slack.

It looks like we haven't been able to reproduce those crashes anymore because, as @rafeca pointed out, nsfw has been recently upgraded to v1.0.22. This version contains atom/nsfw#2, which fixes a problem in the exact code path that was crashing before. We haven't been able to symbolicate the crash dumps reported by @nathansobo, but it seems like they were happening while attempting to call isWatching.

In the process of investigating the original crash, we discovered another potential crash in atom/nsfw#5, but we haven't been able to reproduce that, possibly because the nsfw API is never used in a way that would trigger the race condition illustrated in the pull request.

The above evidence would seem to confirm that the crash was effectively fixed as of #19273. We will conduct some additional testing on Windows to verify that, by rolling back that pull request, we can "easily" reproduce the crash again. If that's the case, I think we could go ahead and close this issue.

@nathansobo
Contributor

@as-cii The stack traces that I saw included the mInterface && guard from atom/nsfw#2, so I don't think that explains it.

@nathansobo
Contributor

Okay, I managed to reproduce the crashes locally on cbd4641. Because I built locally, I was able to view the source associated with the crash report. Here's a screenshot:

Screen Shot 2019-05-16 at 12 15 31 PM

It looks like I was wrong. When we reverted the Electron 3 upgrade in c9e6d04, we downgraded @atom/nsfw from 1.0.21 to 1.0.18, which preceded the change that Ash made in atom/nsfw#2. I missed this fact. Based on the screenshot above, the fix that Ash introduced is most likely the solution. I assume the method in question was inlined, which is why it isn't indicated directly in the crash.

@nathansobo
Contributor

After reflecting a bit, I just want to offer a mea culpa on this entire issue here.

This was a blunder on my part. Missing the fact that we weren't on the latest version of nsfw was a basic mistake, but one with costly implications. I have a bias toward building my way out of problems, but this situation called for a different response. I hope to learn from this going forward.

@jasonrudolph
Contributor Author

We saw this issue again yesterday, and that branch uses nsfw 1.0.22. I think that means that this issue is still relevant, so I'm going to re-open it. If I'm misunderstanding, please let me know.

@jasonrudolph jasonrudolph reopened this May 17, 2019
@as-cii
Contributor

as-cii commented May 17, 2019

Good news! @nathansobo and I were able to open and symbolicate the crash report from the failing build mentioned above and it's not caused by a null-pointer exception in NSFW. Instead, it appears that the process is running out of memory:

Screen Shot 2019-05-17 at 18 23 30

Another interesting fact is that we're observing 15 NSFW threads associated with the process in the crash report. Their stack traces all look similar to:

Screen Shot 2019-05-17 at 18 40 18

Could this still be somehow related to NSFW, or are we simply allocating too much memory in our tests? We will keep investigating and report our findings here.

@nathansobo
Contributor

We seem to be leaking memory in text-editor-spec.js, but it may be in a native extension rather than in JS objects. Here I have the test runner paused via the Chrome devtools about halfway through text-editor-spec.js, and you can see that the JS heap is 169MB while the heap for the process itself is around 2GB.

Screen Shot 2019-05-17 at 2 20 28 PM
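For future reference, the same JS-heap-vs-process-memory comparison can be logged from inside the test process. This is a hedged sketch, not what was used above (those numbers came from the devtools and the OS process monitor), but process.memoryUsage() exposes the same distinction programmatically:

// Periodically log JS heap usage vs. total process memory during a spec run.
// A large and growing gap between rss and heapUsed points at allocations the
// JS garbage collector can't see, e.g. memory held by a native module.
const interval = setInterval(() => {
  const { rss, heapUsed, external } = process.memoryUsage();
  const mb = bytes => Math.round(bytes / 1024 / 1024);
  console.log(`rss=${mb(rss)}MB jsHeapUsed=${mb(heapUsed)}MB external=${mb(external)}MB`);
}, 5000);

// Clear the interval (e.g. in a hypothetical afterAll hook) when the suite finishes:
// clearInterval(interval);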

@nathansobo
Contributor

Okay, I have found the leak and will open a PR fixing it.

@jasonrudolph
Contributor Author

Well, crap. Even after #19359, it looks like we're still seeing this problem. Here's the latest instance:

https://ci.appveyor.com/project/Atom/atom/builds/24670700/job/w6mc6rnungpiaaru#L1575

I think that means that we should re-open this issue. If I'm misinterpreting the problem in the build linked above, please let me know.

@jasonrudolph jasonrudolph reopened this May 20, 2019
@nathansobo
Contributor

This issue keeps morphing because it is very general: "The render process is crashing". This time, we were able to reproduce a crash on @as-cii's Windows VM. Here's a screenshot of the stack trace.

image

After diving into the V8 source, we found that the C++ exception in question seems to come from a CHECK macro inside the AllocationResult constructor. That is suspicious, considering that #19359 addressed a memory leak that caused us to crash by running out of memory. However, here are reasons to doubt this is an out-of-memory issue:

  • Antonio's VM wasn't out of memory when we observed the crash. The tests were eating about 1GB of RAM out of 4GB allocated to the VM. Also, wouldn't Windows start swapping before crashing a process?
  • Previously, we failed with an explicit ReportOOMErrorOnMainThread on our stack. Wouldn't we see that in this case too if we were running out of RAM?

We don't have crash reports on AppVeyor, and we'd need to have those to definitively declare that the crash we observed locally was the same as what we observed there. Considering we want to move away from AppVeyor, I don't think it's worth adding crash report support at this point.

At this point, the situation feels murky. We definitely observed a real crash that reaches into the internals of V8. We recently upgraded to Electron 3, which could be connected. If this is just about AppVeyor, I don't really care, but what if this points to a deeper issue that users might encounter? This is especially worrisome in light of #18915.

@nathansobo
Contributor

nathansobo commented May 20, 2019

Here's basically the same crash that @rafeca observed on macOS.

@nathansobo
Contributor

Closing in favor of #19372 since this crash isn't Windows-specific. I think the crash we're observing now deserves its own issue as it likely blocks release.

@lock

lock bot commented Nov 16, 2019

This issue has been automatically locked since there has not been any recent activity after it was closed. If you can still reproduce this issue in Safe Mode then please open a new issue and fill out the entire issue template to ensure that we have enough information to address your issue. Thanks!

@lock lock bot locked as resolved and limited conversation to collaborators Nov 16, 2019