
Flaky build: Windows build sometimes fails with "resource busy or locked" #18991

Closed
jasonrudolph opened this issue Mar 14, 2019 · 35 comments · Fixed by #19244 or #19359

Comments

@jasonrudolph
Contributor

In our Windows 64-bit builds, we frequently see the build fail at this point in the test suite...

...
TextEditor > buffer manipulation > .outdentSelectedRows() > when one line is selected > outdents line and retains editor Renderer process crashed, exiting
EBUSY: resource busy or locked, unlink 'C:\Users\VSSADM~1\AppData\Local\Temp\atom-test-data119118-3988-dplic9.gszd4\databases\Databases.db'
Error: EBUSY: resource busy or locked, unlink 'C:\Users\VSSADM~1\AppData\Local\Temp\atom-test-data119118-3988-dplic9.gszd4\databases\Databases.db'
    at Object.fs.unlinkSync (fs.js:1061:3)
    at rimrafSync (D:\a\1\s\node_modules\temp\node_modules\rimraf\rimraf.js:211:13)
    at D:\a\1\s\node_modules\temp\node_modules\rimraf\rimraf.js:245:5
    at Array.forEach (<anonymous>:null:null)
    at rmkidsSync (D:\a\1\s\node_modules\temp\node_modules\rimraf\rimraf.js:244:26)
    at rmdirSync (D:\a\1\s\node_modules\temp\node_modules\rimraf\rimraf.js:237:7)
    at fixWinEPERMSync (D:\a\1\s\node_modules\temp\node_modules\rimraf\rimraf.js:150:5)
    at rimrafSync (D:\a\1\s\node_modules\temp\node_modules\rimraf\rimraf.js:216:26)
    at D:\a\1\s\node_modules\temp\node_modules\rimraf\rimraf.js:245:5
    at Array.forEach (<anonymous>:null:null)
    at rmkidsSync (D:\a\1\s\node_modules\temp\node_modules\rimraf\rimraf.js:244:26)
    at rmdirSync (D:\a\1\s\node_modules\temp\node_modules\rimraf\rimraf.js:237:7)
    at fixWinEPERMSync (D:\a\1\s\node_modules\temp\node_modules\rimraf\rimraf.js:150:5)
    at rimrafSync (D:\a\1\s\node_modules\temp\node_modules\rimraf\rimraf.js:216:26)
    at cleanupDirsSync (D:\a\1\s\node_modules\temp\lib\temp.js:143:5)
    at process.cleanupSync (D:\a\1\s\node_modules\temp\lib\temp.js:196:19)
    at process.emit (events.js:187:15)
    at App.<anonymous> (D:\a\1\s\out\Atom Dev x64\resources\electron.asar\browser\init.js:64:11)
    at App.emit (events.js:182:13)
    at AtomApplication.exit (D:\a\1\s\src\main-process\atom-application.js:99:9)
    at WebContents.browserWindow.webContents.on (D:\a\1\s\src\main-process\atom-window.js:180:30)
    at WebContents.emit (events.js:182:13)

Rebuilding can sometimes resolve the failure.


Affected CI provider(s): AppVeyor, Azure DevOps

Affected platform(s): Windows 64-bit

Example commit experiencing flakiness: 8d8e39d

Example failing build at that commit: https://ci.appveyor.com/project/Atom/atom/builds/22070045/job/ooh3881g6b53avyn

Example passing build at that commit: https://ci.appveyor.com/project/Atom/atom/builds/22070428/job/t68iykoniojgy7tw

Example log from failing build: https://github.com/atom/atom/files/2967489/build.log

@jasonrudolph
Contributor Author

@rafeca
Contributor

rafeca commented Mar 19, 2019

@rafeca
Contributor

rafeca commented Mar 27, 2019

@jasonrudolph
Contributor Author

I'm starting to look into this issue. Here's what I'm seeing so far.

Starting with rimraf...

The failure happens down inside rimraf:

Error: EBUSY: resource busy or locked, unlink 'C:\Users\VSSADM~1\AppData\Local\Temp\atom-test-data119118-3988-dplic9.gszd4\databases\Databases.db'
    at Object.fs.unlinkSync (fs.js:1061:3)
    at rimrafSync (D:\a\1\s\node_modules\temp\node_modules\rimraf\rimraf.js:211:13)
    ...

We're currently using rimraf 2.6.2, which is almost the latest version. The latest version is 2.6.3, and it doesn't appear to include any changes that would affect the problem we're experiencing (isaacs/rimraf@v2.6.2...v2.6.3).
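For context on what rimrafSync is doing when it hits this: it deletes files with fs.unlinkSync, and as far as I can tell the sync code path has no retry for EBUSY (the async rimraf API retries a few times via its maxBusyTries option). Below is a minimal sketch of the kind of retry-on-EBUSY wrapper that can paper over transient Windows file locks. This is a hypothetical helper for illustration only, not something rimraf or temp exposes:

const fs = require('fs');

// Hypothetical helper: retry fs.unlinkSync when Windows reports the file as
// busy or locked (EBUSY), waiting briefly between attempts. Transient locks
// (e.g. from an antivirus scanner or a process that is still shutting down)
// often clear within a few hundred milliseconds; a lock held by a crashed
// renderer will not, so this would only mask part of the problem.
function unlinkWithRetrySync (file, attempts = 5, delayMs = 100) {
  for (let i = 0; i < attempts; i++) {
    try {
      fs.unlinkSync(file);
      return;
    } catch (error) {
      if (error.code !== 'EBUSY' || i === attempts - 1) throw error;
      const end = Date.now() + delayMs;
      while (Date.now() < end) {} // crude synchronous wait, illustration only
    }
  }
}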

rimraf is called by node-temp...

rimraf is getting called by node-temp:

...
at fixWinEPERMSync (D:\a\1\s\node_modules\temp\node_modules\rimraf\rimraf.js:150:5)
at rimrafSync (D:\a\1\s\node_modules\temp\node_modules\rimraf\rimraf.js:216:26)
at cleanupDirsSync (D:\a\1\s\node_modules\temp\lib\temp.js:143:5)
at process.cleanupSync (D:\a\1\s\node_modules\temp\lib\temp.js:196:19)
...

We're currently using @atom/node-temp 0.8.4, which is a fork of node-temp.

The latest version of node-temp is 0.9.0. It only includes a few changes from our fork, but those changes might be helpful to us. Specifically, node-temp 0.9.0 replaces use of the deprecated os-tmpdir module with Node's built-in os module (atom/node-temp@279c135).
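For anyone following along, here's roughly how the test suite ends up in that cleanup path. This is a hedged sketch of assumed node-temp usage (Atom's actual spec-helper code may differ); it shows why rimraf runs at process exit at all:

// Sketch of assumed usage, not Atom's actual spec-helper code.
// track() registers an exit hook so cleanupSync() runs when the process exits.
const temp = require('temp').track();

// Each test run gets a scratch directory under the OS temp dir
// (node-temp 0.9.0 resolves this via os.tmpdir() instead of the
// deprecated os-tmpdir module).
const atomTestData = temp.mkdirSync('atom-test-data');

// ... specs run against the scratch directory ...

// On exit, temp's cleanupSync() walks the tracked directories and calls
// rimraf.sync() on each one — that's the cleanupDirsSync/rimrafSync pair
// in the stack trace above, and where the EBUSY surfaces on Windows.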

We probably don't need to be using a forked version of node-temp anymore

We adopted the forked version of node-temp in atom/tree-view#1225. We did so in order to upgrade to a newer version of rimraf (atom/node-temp@b5266da). However, since the time that we created the fork, node-temp has incorporated that same rimraf upgrade (bruce/node-temp@b5266da). Therefore, I think we can safely move from our forked version of node-temp back to using the official version.

I'm going to explore this option. ☝️ It might help with this issue, or it might not. Even if it doesn't, it will be nice to move off of our fork of node-temp if we no longer need to maintain one. 😅

This was referenced Apr 11, 2019
@rafeca
Contributor

rafeca commented Apr 11, 2019

Could the EBUSY errors be caused by a crash? In a few (if not all) of the failures I saw a Renderer process crashed, exiting message on the line just before the EBUSY error...

@jasonrudolph
Contributor Author

In a few (if not all) of the failures I saw a Renderer process crashed, exiting message on the line just before the EBUSY error...

That is indeed the case in all of the logs that I've seen so far.

Could the EBUSY errors be caused by a crash?

Good question. Maybe a crash is causing the EBUSY errors, or maybe an EBUSY error is what's crashing the renderer. I don't yet know whether the order of the lines in the log file is a reliable indicator of the order in which things actually happened. 😬

@jasonrudolph
Contributor Author

The latest version of node-temp is 0.9.0. It only includes a few changes from our fork, but those changes might be helpful to us. Specifically, node-temp 0.9.0 replaces use of the deprecated os-tmpdir module with Node's built-in os module (atom/node-temp@279c135).

#19143 upgraded atom to node-temp 0.9.0. Unfortunately, even after that upgrade, we still sometimes see these EBUSY: resource busy or locked failures:

https://ci.appveyor.com/project/Atom/atom/builds/23812379/job/w2yekqk4w1ge9us3#L122

@nathansobo nathansobo self-assigned this Apr 15, 2019
@jasonrudolph
Contributor Author

In #18991 (comment), @rafeca asked:

Could the EBUSY errors be caused by a crash? In a few (if not all) of the failures I saw a Renderer process crashed, exiting message on the line just before the EBUSY error...

I think this would be useful to investigate. Similar to what @rafeca was asking, I wonder if Atom has a lock on the Databases.db file, and then the renderer crashes with Databases.db still locked, and then the next test attempts to access that file and can't, thus triggering the EBUSY error message.

When a crash happens on macOS, the crash report is included in the build artifacts. But for some reason, when a crash happens on Windows, there's no crash report in the build artifacts (example). 😕 If we could get the crash report, that might offer a clue regarding the cause of the problem.

@nathansobo
Contributor

nathansobo commented Apr 15, 2019

I think the EBUSY is downstream of the actual issue, which is the render process crashing. My narrative for why we're seeing the error while deleting a temp directory: the renderer crashes while it still holds a handle on files like Databases.db in its temp data directory, the main process reacts to the crash by exiting, and node-temp's exit-time cleanup then tries to rimraf that directory, which Windows refuses with EBUSY because the crashed process's handles haven't been released yet.

So the real question is: Why is the render process crashing? Unfortunately, it doesn't seem like we have much to go on from these reports. We're just seeing that the render process crashed, followed by an exception from trying to delete a file that it still has locked. I'll do some investigation into how we can get more information about the crash.
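To make that narrative concrete, here's a simplified sketch of the chain the stack trace implies. The paths and details are illustrative, not Atom's actual code: the crash handler in atom-window.js exits the application, and node-temp's exit hook then runs the cleanup that throws.

const { app, BrowserWindow } = require('electron');

app.on('ready', () => {
  const window = new BrowserWindow();
  window.loadURL('file:///path/to/spec-runner.html'); // illustrative path

  // atom-window.js: the renderer running the specs crashes...
  window.webContents.on('crashed', () => {
    console.error('Renderer process crashed, exiting');
    app.exit(1); // ...and atom-application.js exits the main process...
  });
});

// ...which fires the process exit event, where node-temp's cleanupSync()
// (see temp.js in the stack trace) rimrafs the tracked temp directories.
// The crashed renderer's handle on Databases.db hasn't been released yet,
// so Windows refuses the unlink with EBUSY.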

@nathansobo
Contributor

nathansobo commented Apr 17, 2019

I spent some time trying to reproduce this crash in a local VM and got lucky. The crash report indicates that my crash originated from our fork of nsfw. I spent about an hour trying to symbolicate the nsfw stack frames but decided to shelve that effort.

When @smashwilson was working on watcher, I remember him lamenting the structure of nsfw and its use of threads. In this crash dump, I see frames from the nsfw module in 3 different threads: 19 (the thread that crashed), 22, and 23. Could there be some kind of evil race condition within nsfw on Windows that's exacerbated by the rapid setups and teardowns occurring within Atom's test suite?

There's no guarantee that this local crash was caused by the same thing as our crashes on CI, but it's a distinct possibility.

@nathansobo
Contributor

Encountered a second crash locally, also originating from nsfw. I appended it to the previous crash report as "Crash 2".

This time, nsfw stack frames were only active in the thread that crashed, rather than in multiple threads. So maybe that rules out a race between nsfw threads. Not sure at this point, and maybe it doesn't matter since we have a viable alternative to nsfw.

I'm going to dig into #17899 which switches from nsfw to @atom/watcher and see if I can get us switched to it. Hopefully doing so will solve these render process crashes.

@nathansobo
Contributor

2 more local crashes, both involving nsfw on the crashing thread and in 3-4 other threads as well. I'm not going to post the crash reports for them because I don't think they will add much.

@nathansobo
Contributor

I was really hoping #17899 would resolve this, but it looks like the render process crashed on Azure DevOps for that PR on Windows. So it's either a different crash, or we're still somehow using nsfw on that PR even though it changes the default. The mystery continues.

@rafeca
Contributor

rafeca commented Apr 29, 2019

@nathansobo
Contributor

It seems that #19244 introduced issues of its own, so I'm switching back to @atom/nsfw for now in #19345. This issue needs to stay open until we find another path-watching solution or the underlying cause of the crash, so I'm reopening it.

@nathansobo nathansobo reopened this May 15, 2019
@nathansobo
Contributor

Been running tests locally for over an hour with no reproduction yet. Going to let them loop all night and see what I find. It's strange because last time I tried I was able to reproduce failures faster than this.

@nathansobo
Contributor

nathansobo commented May 16, 2019

Frustrated. I ran render process tests in a Windows VM locally on c3bf951 (which is right after the railcars rolled to 1.39.0-dev and before I merged #19244) all night, with no reproduction. Could something have subtly shifted from when I was testing things earlier in the month? At this point, I'm ready to put this issue back in the technical backlog and wait to see it again on CI.

@rafeca
Contributor

rafeca commented May 16, 2019

Hmm, I recently upgraded the version of nsfw in #19273 (it had been downgraded at the end of February when we reverted the Electron 3 change), but I have no idea whether that could have fixed the crashes...

This is the list of changes in nsfw: atom/nsfw@v1.0.18...v1.0.22

@as-cii
Contributor

as-cii commented May 16, 2019

For posterity, I am reporting here the findings that were mentioned on Slack.

It looks like we haven't been able to reproduce those crashes anymore because, as @rafeca pointed out, nsfw has been recently upgraded to v1.0.22. This version contains atom/nsfw#2, which fixes a problem in the exact code path that was crashing before. We haven't been able to symbolicate the crash dumps reported by @nathansobo, but it seems like they were happening while attempting to call isWatching.

In the process of investigating the original crash, we discovered another potential crash in atom/nsfw#5, but we haven't been able to reproduce that, possibly because the nsfw API is never used in a way that would trigger the race condition illustrated in the pull request.

The above evidence would seem to confirm that the crash was effectively fixed as of #19273. We will conduct some additional testing on Windows to verify that, by rolling back that pull request, we can "easily" reproduce the crash again. If that's the case, I think we could go ahead and close this issue.

@nathansobo
Contributor

@as-cii The stack traces that I saw included the mInterface && guard from atom/nsfw#2, so I don't think that explains it.

@nathansobo
Contributor

Okay, I managed to reproduce the crashes locally on cbd4641. Because I built locally, I was able to view the source associated with the crash report. Here's a screenshot:

Screen Shot 2019-05-16 at 12 15 31 PM

It looks like I was wrong. When we reverted the Electron 3 upgrade in c9e6d04, we downgraded @atom/nsfw from 1.0.21 to 1.0.18, which preceded the change that Ash made in atom/nsfw#2. I missed this fact. Based on the screenshot above, the fix that Ash introduced is most likely the solution. I assume the method in question was inlined, which is why it isn't indicated directly in the crash.

@nathansobo
Contributor

After reflecting a bit, I just want to offer a mea culpa on this entire issue here.

This was a blunder on my part. Missing the fact that we weren't on the latest version of nsfw was a basic mistake, but one with costly implications. I have a bias toward building my way out of problems, but this situation called for a different response. I hope to learn from this going forward.

@jasonrudolph
Contributor Author

We saw this issue again yesterday, and that branch uses nsfw 1.0.22. I think that means that this issue is still relevant, so I'm going to re-open it. If I'm misunderstanding, please let me know.

@jasonrudolph jasonrudolph reopened this May 17, 2019
@as-cii
Contributor

as-cii commented May 17, 2019

Good news! @nathansobo and I were able to open and symbolicate the crash report from the failing build mentioned above and it's not caused by a null-pointer exception in NSFW. Instead, it appears that the process is running out of memory:

Screen Shot 2019-05-17 at 18 23 30

Another interesting fact is that we're observing 15 NSFW threads associated with the process in the crash report. Their stack traces all look similar to:

Screen Shot 2019-05-17 at 18 40 18

Could this still be somehow related to NSFW, or are we simply allocating too much memory in our tests? We will keep investigating and report our findings here.

@nathansobo
Contributor

We seem to be leaking memory in text-editor-spec.js, but it may be in a native extension rather than in JS objects. Here I have the test runner paused via the Chrome devtools about halfway through text-editor-spec.js, and you can see that the JS heap is 169MB while the heap for the process itself is around 2GB.

Screen Shot 2019-05-17 at 2 20 28 PM
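For future reference, the same JS-heap-vs-process-memory comparison can be logged from inside the test process. This is a hedged sketch, not what was used above (those numbers came from the devtools and the OS process monitor), but process.memoryUsage() exposes the same distinction programmatically:

// Periodically log JS heap usage vs. total process memory during a spec run.
// A large and growing gap between rss and heapUsed points at allocations the
// JS garbage collector can't see, e.g. memory held by a native module.
const interval = setInterval(() => {
  const { rss, heapUsed, external } = process.memoryUsage();
  const mb = bytes => Math.round(bytes / 1024 / 1024);
  console.log(`rss=${mb(rss)}MB jsHeapUsed=${mb(heapUsed)}MB external=${mb(external)}MB`);
}, 5000);

// Clear the interval (e.g. in a hypothetical afterAll hook) when the suite finishes:
// clearInterval(interval);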

@nathansobo
Contributor

Okay, I have found the leak and will open a PR fixing it.

@jasonrudolph
Contributor Author

Well, crap. Even after #19359, it looks like we're still seeing this problem. Here's the latest instance:

https://ci.appveyor.com/project/Atom/atom/builds/24670700/job/w6mc6rnungpiaaru#L1575

I think that means that we should re-open this issue. If I'm misinterpreting the problem in the build linked above, please let me know.

@jasonrudolph jasonrudolph reopened this May 20, 2019
@nathansobo
Contributor

This issue keeps morphing because it is very general: "The render process is crashing". This time, we were able to reproduce a crash on @as-cii's Windows VM. Here's a screenshot of the stack trace.

image

After diving into the V8 source, we found that the C++ exception in question seems to come from a CHECK macro inside the AllocationResult constructor. That is suspicious, considering that #19359 addressed a memory leak that caused us to crash by running out of memory. However, here are reasons to doubt this is an out-of-memory issue:

  • Antonio's VM wasn't out of memory when we observed the crash. The tests were eating about 1GB of RAM out of 4GB allocated to the VM. Also, wouldn't Windows start swapping before crashing a process?
  • Previously, we failed with an explicit ReportOOMErrorOnMainThread on our stack. Wouldn't we see that in this case too if we were running out of RAM?

We don't have crash reports on AppVeyor, and we'd need to have those to definitively declare that the crash we observed locally was the same as what we observed there. Considering we want to move away from AppVeyor, I don't think it's worth adding crash report support at this point.

At this point, the situation feels murky. We definitely observed a real crash that reaches into the internals of V8. We recently upgraded to Electron 3, which could be connected. If this is just about AppVeyor, I don't really care, but what if this points to a deeper issue that users might encounter? This is especially worrisome in light of #18915.

@nathansobo
Contributor

nathansobo commented May 20, 2019

Here's basically the same crash that @rafeca observed on macOS.

@nathansobo
Contributor

Closing in favor of #19372 since this crash isn't Windows-specific. I think the crash we're observing now deserves its own issue as it likely blocks release.

@lock

lock bot commented Nov 16, 2019

This issue has been automatically locked since there has not been any recent activity after it was closed. If you can still reproduce this issue in Safe Mode then please open a new issue and fill out the entire issue template to ensure that we have enough information to address your issue. Thanks!

@lock lock bot locked as resolved and limited conversation to collaborators Nov 16, 2019