Skip to content
This repository has been archived by the owner on Dec 15, 2022. It is now read-only.

Add option to use ripgrep for crawling the list of files #369

Merged
merged 7 commits into from Mar 19, 2019
Merged

Conversation

rafeca
Copy link
Contributor

@rafeca rafeca commented Mar 12, 2019

Summary

This PR contains some minimal changes to demonstrate the usage of ripgrep on the fuzzy finder, so we can discuss the tradeoffs and steps needed to be able to ship this.

This work comes as a follow-up from #367 after the suggestions of @smashwilson .

Benefits

Using ripgrep speeds up drastically the time to crawl medium and large repositories (we're taking about up to 14X faster times):

Type Num Files Time (master) Time (This PR) Improvements
Small 2K 1.3s 0.9s 30% less time (or 1.4X faster) 🎉
Medium 30K 6.2s 1.2s 80% less time (or 5X faster) 🎉
Large 270K 57.5s 4.1s 93% less time (or 14X faster) 🎉

(the measurement has been done the same way as in #366).

Possible Drawbacks

The crawling behaviour with ripgrep is slightly different than the one currently implemented. I don't think that the changes are important enough (or even bad), but it's important to list them:

  1. ripgrep also returns symlink destination files (e.g if there's a symlink ./foo.js which points to ./bar.js, with ripgrep foo.js can also be opened. I think this is an improvement.
  2. The ordering of the returned results is slightly different: right now results are returned in a DFS ordering (then alphabetically), while ripgrep returns all results alphabetically ordered.
  3. Currently, if Atom gets opened from a non-git folder which has git repositories in some sub-folders, the .gitignore files from the sub-folders are ignored. With ripgrep they are taken into account. I think that this is an improvement.

Regarding 2., this change is quite noticeable, since the fuzzy finder currently displays the first 10 files returned by the crawler when it gets opened. This means that the first shown files will be slightly different than before (IMO we should change the default list of files that are shown when opening the fuzzy finder to show e.g recently opened files or currently opened files).

Alternate Designs

Other potential designs were presented here, and seems like ripgrep is the best tool to use at this moment.

Regarding the current implementation, I've decided to use vscode-ripgrep, which is just a package that handles the download of the ripgrep binaries for several platforms. This has saved me a ton of time, but I'm not familiar enough with the process to install bundled extensions in Atom to know if this package is suitable for the job.

Next steps

  • Write some automated tests to verify the ripgrep crawling.
  • Manually check that ripgrep works well on different scenarios.
    • Different types of file system (FAT, NTFS, ...).
    • Different OSes (Windows).
    • Check that the ripgrep binary can be used correctly from a production build (e.g does the asar packaging cause any issue?).
      • OSX
      • Windows
  • Check that the behaviour is correct.
    • Files/folders with utf8 names.
    • List of returned files is the same in a few different repos.

Applicable Issues

#271

@rafeca rafeca marked this pull request as ready for review March 12, 2019 16:02
@rafeca rafeca changed the title WIP: Add option to use ripgrep for crawling the list of files Add option to use ripgrep for crawling the list of files Mar 12, 2019
@rafeca rafeca closed this Mar 12, 2019
@rafeca rafeca reopened this Mar 12, 2019
@rafeca
Copy link
Contributor Author

rafeca commented Mar 13, 2019

Small update: I just realized that when using ripgrep we don't need to initialize the Git native code (see commit) 😃

This is quite slow (specially for large repos) so it gives us even greater performance improvements (we can reduce the times on large repos by another 50% 🚀). I'm gonna update the results table with the new data

@rafeca
Copy link
Contributor Author

rafeca commented Mar 13, 2019

I think it's good to share a couple of screencasts to compare the previous experience on very large repos (gecko-dev) with the one after this PR:

What happens now in master after opening Atom (or focusing Atom back):

master

The same flow with this PR:

ripgrep

@jasonrudolph
Copy link
Contributor

Currently, if Atom gets opened from a non-git folder which has git repositories in some sub-folders, the .gitignore files from the sub-folders are ignored. With ripgrep they are taken into account. I think that this is an improvement.

I consider this a huge improvement. 😍

I think it's good to share a couple of screencasts to compare the previous experience on very large repos (gecko-dev) with the one after this PR...

Wow! 🏎💨

Next steps...

Thanks for listing these out. This seems promising and very much worth pursuing. ⚡

Specially when using `ripgrep`, we don't want to spawn a lot of
processes in parallel if the user has many projects opened.
The modified test was relying on the order that two crawlers that
are running in parallel start returning results. This may have probably caused
flakiness and unintended failures.

In order to fix this, I'm forcing the crawling system to run
sequentially, by faking that the system has a single CPU.
Since the ripgrep binary should be unpacked from the asar archive (we
need to remember to add it to https://github.com/atom/atom/blob/master/script/lib/package-application.js#L101-L111
once we incorporate ripgrep onto Atom), we need to patch the path that
gets resolved by the electron built-in require system.
@rafeca
Copy link
Contributor Author

rafeca commented Mar 14, 2019

The testing so far seems to be going well:

  • I've modified the test suite to run all the fuzzy finder tests both with and without ripgrep, this way we have good coverage for ripgrep.
  • The functionality works fine on Windows and on different file systems.
  • I've been able to test a prod build on OSX and in fact I had to do a small fix to support asar archives.
  • I haven't been able to build Atom on Windows yet... This is the last bit of remaining work.

A couple of drawbacks that I've discovered today:

  • The ripgrep binary is quite big (~5MB for OSX, a little bit less for Windows). Once it gets gzipped its size gets reduced to ~1.8MB, which means that the install files of Atom will grow by ~1.25% (based on the current size of ~144MB).
  • Currently, vscode-ripgrep downloads the needed binary from GitHub during installation (code). This causes failures on CI systems due to rate limits. I've fixed that for now by putting setting a GitHub token of mine as an env var on the travis config for fuzzy-finder. We should either do the same in atom/atom or stop using vscode-ripgrep and build (and maintain) our own package.

@jasonrudolph any thoughts about these drawbacks?

@jasonrudolph
Copy link
Contributor

Nice work, @rafeca!

The ripgrep binary is quite big (~5MB for OSX, a little bit less for Windows). Once it gets gzipped its size gets reduced to ~1.8MB, which means that the install files of Atom will grow by ~1.25% (based on the current size of ~144MB).

This seems like a worthwhile tradeoff to me. We're trading a relatively tiny bit of additional disk space and bandwidth, and it turn we spend less time waiting for fuzzy-finder results. 👍

I've fixed that for now by putting setting a GitHub token of mine as an env var on the travis config for fuzzy-finder. We should either do the same in atom/atom or stop using vscode-ripgrep and build (and maintain) our own package.

We can generate a scopeless API token for our build account and use that token for atom/fuzzy-finder builds and atom/atom builds. Let's coordinate via Slack to make this happen.

@rafeca
Copy link
Contributor Author

rafeca commented Mar 14, 2019

Awesome! I agree it's a worthwhile tradeoff 😃

I've been able to verify that Windows builds are packaged correctly with ripgrep, which was the last item from my list of things to consider... So this PR is ready for review 😄

@rafeca
Copy link
Contributor Author

rafeca commented Mar 14, 2019

Oh there's something else, this feature is behind a config param which is off by default. Since it's a fuzzy-finder config param, the setting is quite hidden and really hard to discover (you have to go to settings -> packages -> search fuzzy finder and click on its settings.

This is how this setting looks like:

Screen Shot 2019-03-14 at 21 54 07

I have mixed feelings about this:

  • The benefit is that by having the setting we can find early issues from people that opt in before we turn it on by default.
  • The problem is that the setting is so hidden that I don't expect many people to turn it on, so we're delaying the impact of it until we make it on by default (which takes at least one month) without getting much more testing in exchange.

Couple of things that we could do:

  1. Turn it on by default (and remove the "experimental" wording from the setting). If under some situations something goes wrong we can tell people to turn it off as a workaround.
  2. Leave it as it is: one month is not that much time and even a little bit of extra testing is appreciated.
  3. Leave it as off but implement a popup that prompts users to turn it on if we detect that they open a medium or large repository. This is my favorite one, but if this takes much effort to implement then I don't think it's worth it.

@50Wliu
Copy link
Contributor

50Wliu commented Mar 15, 2019

Another time we implemented an experimental setting was when we started using fuzzaldrin-plus as an alternative to fuzzaldrin. The setting was off by default - you'll actually notice it right above your new setting in the last screenshot you posted.

Whenever people would open issues noting odd fuzzy finding behavior, we would ask them to try turning on the setting and reporting back. I think we can do the same thing here - keep the setting off by default, but advertise that the setting exists and will become the default in the near future (as we did for tree-sitter).

Looking back, what I think we could have done better is set a timeline for when to enable the new scoring method by default. As you can see, it's been three or four years, and I'm still not sure if it's enabled by default yet on all the packages that have it as an option. That was something we handled much better with tree-sitter.

I would opt towards making it off by default, advertising it when we can, and determining a target version to have it on by default. If by that target version it seems like it's still not ready for prime time, then we can re-evaluate.

@rafeca rafeca self-assigned this Mar 15, 2019
@rafeca
Copy link
Contributor Author

rafeca commented Mar 15, 2019

@50Wliu that sounds good to me.

If I understand correctly our release cadence, if this PR lands now it'll get included in v1.37.0 which will be released around May. We could aim to enable it by default by v1.38.0 or v1.39.0 depending on the feedback/issues reported by users (this can be decided later though).

@jasonrudolph
Copy link
Contributor

Oh there's something else, this feature is behind a config param which is off by default. Since it's a fuzzy-finder config param, the setting is quite hidden and really hard to discover...

Couple of things that we could do:

  1. ...
  2. ...
  3. Leave it as off but implement a popup that prompts users to turn it on if we detect that they open a medium or large repository. This is my favorite one, but if this takes much effort to implement then I don't think it's worth it.

I love this idea! It doesn't need to be part of this PR, but I think it's worth trying in a follow-up PR. I'd be happy to help. 😄

For what it's worth, I think it will be fairly straightforward to implement using the notification API.

Copy link
Contributor

@jasonrudolph jasonrudolph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fantastic work, @rafeca.

I noted one super minor phrasing suggestion below.

I can't wait to start using this. 👏⚡

package.json Outdated Show resolved Hide resolved
@rafeca
Copy link
Contributor Author

rafeca commented Mar 18, 2019

I love this idea! It doesn't need to be part of this PR, but I think it's worth trying in a follow-up PR. I'd be happy to help. 😄

For what it's worth, I think it will be fairly straightforward to implement using the notification API.

Awesome! We can implement it as a followup 😄

Co-Authored-By: rafeca <rafeca@gmail.com>
@rafeca rafeca added FY2019Q5 atom metrics More information: https://github.com/github/pe-atom-log/issues/732 FY2019Q5 atom perf More information: https://github.com/github/pe-atom-log/issues/728 and removed FY2019Q5 atom metrics More information: https://github.com/github/pe-atom-log/issues/732 labels Mar 19, 2019
@rafeca rafeca merged commit 15e6ff4 into master Mar 19, 2019
@rafeca rafeca deleted the use-ripgrep branch March 19, 2019 11:36
@black-snow
Copy link

Read the release notes, followed the path to this PR, followed your guide where to find the new setting and hit the checkbox. I think there will be more people going down this route :)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FY2019Q5 atom perf More information: https://github.com/github/pe-atom-log/issues/728
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants