Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a directory searcher that uses `ripgrep` as the backend #19348

Merged
merged 14 commits into from May 23, 2019

Conversation

Projects
None yet
2 participants
@rafeca
Copy link
Contributor

commented May 16, 2019

Summary

This PR adds a new RipgrepDirectorySearcher (the backend that atom.workspace.scan() uses that implements the search using ripgrep.

For simplicity (and backwards compatibility), the RipgrepDirectorySearcher implements the exact same interface that the DefaultDirectorySearcher (in order to verify that, we're running the exact same tests against both of them).

馃崘'd with @nathansobo

Benefits

This should speed up considerably searches along big projects (I'll provide some benchmarks later), since ripgrep has proven to be dramatically faster than using Node's filesystem apis to crawl a big project.

Alternate designs

Instead of mimicking the data structure of scandal, we could have chosen to just return the data in a similar structure than the output of ripgrep (which aligns better with what's used on the find and replace UI). This would mean less processing of data and would allow simplifying the find and replace logic.

At the same time, this alternative had some big drawbacks:

  1. Even if it would allow simplifying the logic in find and replace, we would need to do big changes there, which would mean much more testing and risk of introducing bugs. Since for now we are exploring if ripgrep is a suitable alternative for find and replace, we prefer to avoid having to do this amount of work just to test the hypothesis.
  2. Even more complexity would have to be introduced since we want to test both the ripgrep and the standard scanner at the same time to compare them. This means that we would need to keep both types of handling logic in the find and replace, with the risk of diverging or introducing bugs in one logic only without realizing.
  3. If we change the data structure returned by scan(), other packages using that method (or other parts of Atom) won't be able to easily benefit from ripgrep improvements without being refactored.

Verification process

  • The same unit tests are now run for both ripgrep and scandal, a few new tests have been added to test the context lines logic.
  • Do some sanity testing on the find and replace package.
WIP
Co-Authored-By: Rafael Oleza <rafeca@users.noreply.github.com>

@rafeca rafeca requested a review from nathansobo May 16, 2019

@rafeca rafeca force-pushed the ns-ro/ripgrep-scan branch from 293eda6 to c46522e May 16, 2019

@rafeca rafeca changed the title Add a directorty searcher that uses `ripgrep` as the backend Add a directory searcher that uses `ripgrep` as the backend May 16, 2019

@rafeca rafeca force-pushed the ns-ro/ripgrep-scan branch from 1c47e4f to d35aef3 May 17, 2019

rafeca added some commits May 17, 2019

Delay the require of the vscode-ripgrep module
Since a new RipgrepDirectorySearcher is constructed during the snapshot
creation, we cannot require vscode-ripgrep from there (since it's not in
the snapshot).
@nathansobo
Copy link
Contributor

left a comment

Clever approach keeping the set of trailing context lines and continuing populating them in subsequent matches. Did you profile this at all to get a sense of how much overhead this state tracking introduces? I don't imagine it would be too bad though... just a bit of extra work per match, and the alternative is breaking the API which I agree is probably not worth it.


if (options.trailingContextLineCount) {
for (const trailingContextLines of pendingTrailingContexts) {
trailingContextLines.push(message.data.lines.text.trim())

This comment has been minimized.

Copy link
@nathansobo

nathansobo May 17, 2019

Contributor

What if message.data.lines.text contains newlines?

This comment has been minimized.

Copy link
@rafeca

rafeca May 17, 2019

Author Contributor

I haven't tested this, but I'm going to add a test for that.

I remember that when we were testing the current implementation we saw some issues with multi-line results right?

This comment has been minimized.

Copy link
@nathansobo

nathansobo May 17, 2019

Contributor

Yeah, although I honestly can't remember what we saw.

@rafeca

This comment has been minimized.

Copy link
Contributor Author

commented May 17, 2019

Did you profile this at all to get a sense of how much overhead this state tracking introduces?

Good point! On my local tests I couldn't see any issue even when returning a lot of results but I'm going to do some profiling

@rafeca rafeca force-pushed the ns-ro/ripgrep-scan branch from 884d1b2 to 4f9bd50 May 21, 2019

@rafeca

This comment has been minimized.

Copy link
Contributor Author

commented May 21, 2019

Did you profile this at all to get a sense of how much overhead this state tracking introduces?

I've done some profiling and the overhead added is non-significative. The whole processing of a single line from ripgrep takes around ~0.4ms (most of it comes from the JSON parsing of the line).

For comparison, etch takes ~36ms to update the UI when it receives each of these results:

Screenshot 2019-05-21 at 14 44 46

@rafeca

This comment has been minimized.

Copy link
Contributor Author

commented May 22, 2019

I've added a new commit which enables multiline support in the ripgrep scanner. This, paired with some minor changes in the find-and-replace package, will allow users to search using multiline regexps, a popular feature requested a long time ago.

I've also added a bunch of tests to verify that the behaviour is correct

@rafeca rafeca force-pushed the ns-ro/ripgrep-scan branch from eb7060a to 6748b84 May 22, 2019

@rafeca

This comment has been minimized.

Copy link
Contributor Author

commented May 22, 2019

Oh god I've had to handle yet more edge cases (created tests for each of them):

  1. Handle correctly the unicode support: ripgrep returns the start and end positions of results within a line in bytes, while our UI (and JS usually) expects these positions in terms of character position on the string. I've had to add some logic to do this conversion (which needs to iterate over the whole line if it has characters that are represented by more than 1 byte). I've made this logic as performant as possible (It's O(n) based on the line length), and in my benchmarks even when getting thousands of results from a single line of > 100k characters length, the processing only takes ~10ms.
  2. Convert includes/exludes into globs: ripgrep expects globs for include and exclude patterns, but scandal (and in general find all interfaces) handle also things like src/ so I had to add some logic to convert these kind of things into globs that would get all expected files.
  3. Unescape slashes from RegExps: ripgrep is quite picky about unnecessarily escaped sequences, so we need to unescape slashes which get automatically escaped by JS RegExp implementation.

There may be other edge cases that I'm missing, so my current recommendation is to add this as a setting on the find-and-replace package and leave it by default for a couple of releases while we get some usage.

rafeca added some commits May 22, 2019

@rafeca rafeca merged commit a2a1de8 into master May 23, 2019

2 checks passed

Atom Pull Requests #20190522.13 succeeded
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details

@rafeca rafeca deleted the ns-ro/ripgrep-scan branch May 23, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can鈥檛 perform that action at this time.