Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Default to @atom/notify for file system notifications #19244

Merged
merged 29 commits into from May 11, 2019

Conversation

@nathansobo
Copy link
Contributor

commented Apr 30, 2019

Description of the Change

The library that we are currently using for file system notifications (@atom/nsfw is causing render process crashes on Windows, which is introducing unreliability in our continuous integration pipeline and presumably causing occasional crashes for users. After obtaining crash reports on a local VM, I was unable to find the cause of the crashes after trying for about a half a day.

We've already forked nsfw for other issues in the past, and @smashwilson actually built an alternative library called @atom/watcher in hopes of addressing nsfw's shortcomings for our use case. It's been possible to use @atom/watcher as the back end for file system notifications in Atom for about 2 years now via the "Experimental file system watching library" and "Poll" options in the core settings.

Screen Shot 2019-04-30 at 3 14 51 PM

There's an open pull request that enables this option by default, and my first reaction upon encountering the inexplicable crashes from nsfw was to get that PR over the finish line. Unfortunately, @atom/watcher is also the source of crashes on Windows in CI builds for that PR. After about a day trying, I was unsuccessful in reproducing these crashes in a local VM and unable to associate symbols with the crash report downloaded from CI.

This leaves us in a difficult position. We need a stable, cross-platform provider of file system notifications to drive several critical aspects of Atom's UI and to support the watchPath API that is exposed to third party packages. Both of our current solutions are crashing on Windows in C++ code paths that are hard to debug.

This PR aims to provide an alternative. I built a new module for file system notifications called @atom/notify. It is different from nsfw and watcher in that it is not a native add-on that uses C++ to access cross platform file system notification APIs. Instead, it leans on a mature, cross-platform implementation of file system notifications provided in the Rust notify crate. The node module embeds a small Rust executable that wraps the notify API. It spawns this executable as a subprocess and communicates with it over stdio.

The notify crate takes care of the heavy lifting such as interfacing with the appropriate APIs on all three platforms and debouncing the event stream to coalesce related events. Since it is written in Rust and used widely in that ecosystem, I am confident in its stability. The subprocess eats about 680KB of memory on my machine, but I think the simplicity of talking to Rust over stdio versus building a native add-on is worth it.

Alternate Designs

  • We could fix the crash in nsfw. I tried to figure it out but got nowhere after several hours.
  • We could fix the crash in @atom/watcher, but I had similar challenges finding its source. Also, @atom/watcher is a fairly complex C++ codebase that nobody else is really using but Atom, and I'm not certain we wouldn't find additional issues even if we fixed this one. @smashwilson is the only one with significant context on it, and we can't guarantee that he'll always be available to the Atom team in the future.
  • We could go with VS Code's solution. They use chokidar on macOS and Linux and spawn a custom .Net subprocess on Windows. This solution still involves a subprocess, and reusing their code wouldn't have been a drop in replacement since their implementation exists across several files in the VS Code repository. I feel like using the same JS code and subprocess across all three platforms will lead to more consistent results and be cleaner from a code perspective.

Possible Drawbacks

  • Turning @atom/notify on by default introduces risk, but if we're hitting lots of issues on nightly and beta we can turn it off again before we hit stable, or only turn it on by default on non-stable release channels.
  • Spawning a subprocess introduces some memory overhead. It's unclear how much smaller a native Node add-on might be, but I think 600-700KB for a stable implementation of a critical facility is worth it. There are other places where we could more easily save memory, such as the GitHub package's sidecar process.
  • Rust is a new language that we haven't used yet anywhere else in Atom, and it has a fairly steep learning curve. However, we aren't introducing very much Rust (about 250 lines), and it's all pretty straightforward with a well-defined, narrow interface. I'd argue that for the use case in question, Rust is a more approachable and much safer alternative to C++, and that team member investment in learning it could pay off when attacking other problems.
  • I'm probably going to require a window reload when you change file system notification back-ends in the settings view, because the code for automatically swapping things is complex and not an easy drop-in for the new implementation.

Verification Process

  • I have some test coverage on the @atom/notify module which I am running on all three platforms. I plan to make it more extensive before merging this PR.

Release Notes

  • Atom now defaults to a new back-end for file system notifications based on the Rust notify crate. If you're experiencing any issues that seem related, you can switch back to the previous implementation by choosing Native operating system APIs as your File Watcher in core settings.

Remaining tasks

  • Fix issues building the V8 snapshot
  • Get @atom/notify tests passing again on Linux
  • Prompt to reload when switching notification back-ends in settings
  • Green builds on every platform
    • macOS
    • Linux
    • Windows
  • Decide how to implement the onDidError method on the returned watcher.
  • Surface @atom/notify errors to the user via a notification so we can get feedback
  • Test more edge cases on all platforms – They exist, but we're not going to solve them. See comment below
  • Enable "Poll" option
  • Preserve pre-existing default on the stable channel but use @atom/notify otherwise
  • Document @atom/notify (both the README and the source code)
  • Address windows failures observed in GitHub package 😞

/cc @atom/maintainers
Closes #18991

@jasonrudolph

This comment has been minimized.

Copy link
Member

commented May 1, 2019

@nathansobo: A fantastic write-up, as always. ⚑️ πŸ™‡ Not only does this write-up help the reader understand all the considerations that you're taking into account today, but it will be a great resource for "Future Us" to easily refer back to these considerations and tradeoffs in the months and years to come. πŸ’Ÿ

@nathansobo nathansobo force-pushed the ns/notify branch from 3cdcbb8 to 445b3f1 May 1, 2019

nathansobo added some commits Apr 30, 2019

Default to @atom/notify for file system notifications
There's still some work to do to make this work on Linux and support 
switching between notification back-ends.
Remove top-level construction of a notify Watcher
It was actually dead code and is screwing up V8 startup snapshots.
Shim onDidError method on returned watcher
Need to figure out what to do with this for the long-term.
Remove logic supporting dynamic switching of path watcher backends
The logic is pretty complex and I don't want to take the time to 
integrate @atom/notify with it. I left a bunch of stuff commented out in 
this commit just in case these changes break the build. I'll do another 
pass to delete commented code once we go green.

@nathansobo nathansobo force-pushed the ns/notify branch from e7b0225 to 9b39eea May 2, 2019

nathansobo added some commits May 2, 2019

Prompt to restart Atom when the file system watcher setting changes
I experimented with just reloading the Window, but ran into problems. 
I'm also worried about a situation where there are multiple windows. 
Since this setting should be changed rarely, I think prompting for a 
restart is an expedient solution.
@nathansobo

This comment has been minimized.

Copy link
Contributor Author

commented May 2, 2019

Ugh. My Windows and Linux builds just seem to be hung on Azure Pipelines. Not getting any output from the tests.

Edit: Looks like Azure is having network availability issues.

@nathansobo nathansobo force-pushed the ns/notify branch from 50eefbb to ca7730f May 3, 2019

Stop all watchers before attempting to replace directory with file
I think that the notify subprocess might be holding a lock on the 
watched directory on Windows.

@nathansobo nathansobo force-pushed the ns/notify branch from ca7730f to 3628299 May 3, 2019

@nathansobo nathansobo force-pushed the ns/notify branch 4 times, most recently from fa2664b to 932aaac May 3, 2019

@nathansobo

This comment has been minimized.

Copy link
Contributor Author

commented May 6, 2019

πŸ’š

nathansobo added some commits May 6, 2019

Throw exceptions when there are @atom/notify watcher errors
This will surface these errors to users and to our exception reporting 
infrastructure to ensure we get feedback on the new backend in the wild.
@nathansobo

This comment has been minimized.

Copy link
Contributor Author

commented May 7, 2019

Today I investigated what happens when a watched directory itself is removed or renamed. At least on macOS, @atom/nsfw throws an error and stops watching.

The Rust notify crate doesn't error, but its behavior in this scenario is platform-dependent and inconsistent. I spent a few hours today trying to iron out the inconsistencies with additional code in the subprocess that wraps the notify crate, but it's starting to feel like a quagmire.

In attempting to test rename logic, I discovered that the notify crate doesn't do a good job coalescing raw events from macOS's FSEvents API to report renames when files are created and renamed in rapid succession. There's an open issue about the problem on the repository, and they actually mention the crate's integration tests sleeping for 35 seconds for the FSEvents buffer to clear in certain scenarios 😬. When I fire up a Node process and watch an existing directory, everything works fine, but the pressure of rapid-fire file system operations in automated tests seems to trigger weirdness.

At this point, I see the following paths forward:

  1. Dive into the notify code and try to fix passcod/notify#181 so I can write reliable automated tests that rename files, then proceed to try to paper over platform inconsistencies around renaming watched directories. Is a fix even possible? Not sure. I suspect that it probably is, but it's unclear how long it would take.

  2. Ignore this issue. If you rename watched directories, you get the weird, platform-specific behavior. Also, if you create a file and then quickly rename it, you might not always get perfect rename events.

I'm leaning toward option 2. The primary use case for watchPath is watching project roots, and Atom doesn't deal well with those being renamed anyway for other reasons. The rapid create-rename scenario happens in automated tests, but I don't see it being a huge issue in practice.

Alas, the notify crate isn't as perfect as I'd hoped, but it's still the case that it's used in plenty of production Rust projects including Cargo and should work well in practice (🀞). Despite these imperfections, this path still solves the render process crashes that are slowing down our build cycle and potentially makes Electron upgrades easier.

@smashwilson
Copy link
Member

left a comment

πŸ‘ No blockers that I see.

await nextRootEvent
})

it('adopts existing child watchers and filters events appropriately to them', async function () {

This comment has been minimized.

Copy link
@smashwilson

smashwilson May 9, 2019

Member

This test is the one that was the greatest pain in the ass on CI in #17899, by the way. I was never able to figure out why.

fileSystemWatcherPollInterval: {
description: "If the 'Polling' option is selected for the file system watcher, this will be the interval between polls.",
type: 'number',
default: 1000

This comment has been minimized.

Copy link
@smashwilson

smashwilson May 9, 2019

Member

Any specific reason you tightened this from ~3s to 1s? Did you try it out and find the latency too high?

This comment has been minimized.

Copy link
@nathansobo

nathansobo May 9, 2019

Author Contributor

Yeah, I played with it and 3 seconds just felt really awful. 1 second felt sane though. Definitely want your computer to be plugged in with this option on a big folder I'd imagine. But I want the option to feel usable with this default.

@nathansobo

This comment has been minimized.

Copy link
Contributor Author

commented May 9, 2019

Okay, I'm not sure how I've gotten any green builds before (you can see from the record that they did exist!), but I'm now getting failures due to the GitHub package making assumptions about watchPath reporting paths in the shortened 8.3 format. So we're comparing paths containing VSSADM~1 with paths containing VssAdministrator on Azure and barfing.

I'm going to explore using Windows APIs to normalize to the 8.3-compatible paths in the Rust subprocess and see how we do.

@nathansobo nathansobo force-pushed the ns/notify branch 2 times, most recently from 3cedf06 to 6c439d6 May 10, 2019

@nathansobo

This comment has been minimized.

Copy link
Contributor Author

commented May 10, 2019

Trying a new approach where I normalize %TEMP% to a long path in the build configuration before running the tests.

@nathansobo nathansobo force-pushed the ns/notify branch 2 times, most recently from 02ea84c to 76b9616 May 10, 2019

Normalize %TEMP% to a long path on Windows CI
This avoids problems with assertions that involve paths inside the temp 
folder.

@nathansobo nathansobo force-pushed the ns/notify branch from 76b9616 to b1b1b15 May 10, 2019

@nathansobo

This comment has been minimized.

Copy link
Contributor Author

commented May 10, 2019

The last build was green on Linux and Windows but failed in the fuzzy finder specs on macOS for a seemingly unrelated reason. Merging in master and trying again. πŸ˜“

@nathansobo

This comment has been minimized.

Copy link
Contributor Author

commented May 11, 2019

Okay. Green build. Going to wait to merge this until after the next release cycle which should occur early next week. That will maximize our time to test the new notification backend on nightly and beta before it goes to stable. Worst case scenario, we change the default while dealing with anything unforeseen.

@nathansobo nathansobo merged commit b2ecabd into master May 11, 2019

2 checks passed

Atom Pull Requests #20190510.16 succeeded
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details

@nathansobo nathansobo deleted the ns/notify branch May 11, 2019

@smashwilson

This comment has been minimized.

Copy link
Member

commented May 11, 2019

Yessss ✨ πŸš€

When I get the chance, I'll take a look at flushing any flakiness out of the github package cache invalidation tests.

@rafeca

This comment has been minimized.

Copy link
Contributor

commented May 16, 2019

Leaving this comment for the future reader: in case we want to explore making @atom/notify the default watcher, we'll need to tackle #19333 first.

nathansobo added a commit that referenced this pull request May 17, 2019

Revert "Merge pull request #19244 from atom/ns/notify"
This reverts commit b2ecabd, reversing
changes made to c3bf951.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.