add support for reading patterns from a file #7

BurntSushi · 2016-09-17T20:59:03Z

It would be cool to support grep's -f/--file option, and it should be relatively easy for us to do so. The implementation strategy I have in mind is to just join all of the patterns/literals using a | and hand it off to the regex engine.

It might be faster to hand build an Aho-Corasick automaton (using the aho-corasick crate) if you know we have a bunch of literals. In fact, this is almost assuredly more memory efficient if the number of literals being searched is very large. Unfortunately, this is a harder thing to add, since it would require plumbing an Aho-Corasick automaton through all of the searching code so that either a Regex or an AcAutomaton could be used. Doable, but not straight-forward.

The text was updated successfully, but these errors were encountered:

BurntSushi · 2016-09-22T11:39:52Z

Unfortunately, this is a harder thing to add, since it would require plumbing an Aho-Corasick automaton through all of the searching code so that either a Regex or an AcAutomaton could be used. Doable, but not straight-forward.

Thinking about this just a touch more, I think it might actually be easier than I let on. The search code doesn't actually know anything about regexes---all it cares about is using the relatively simple interface of grep::Grep. If we put that interface behind a trait and then impl it with Aho-Corasick, we should be pretty much good to go.

It might be nice if the trait and Aho-Corasick implementation were in the grep sub-crate. And maybe the Grep type should be renamed to Regex, that way, the trait could be called Grep. Hmm, yes, I quite like that!

seamusabshere · 2016-09-23T16:27:56Z

for context, here's the man page entry from grep:

 -f file, --file=file
    Read one or more newline separated patterns from file.  Empty pattern
    lines match every input line.  Newlines are not considered part of a
    pattern.  If file is empty, nothing is matched.

🌞 note that neither ag or ack have this, so it would be another differentiator!

emk · 2016-10-11T21:29:39Z

I might have some time next week to take a look at this. Would you be interested in a PR following your outline above?

BurntSushi · 2016-10-11T21:34:11Z

@emk That'd be great! Although re-reading it, I think I'd like to keep the trait private inside of ripgrep for now. I'd eventually like to move more of the searching code to the grep crate, so its API is going to need to be re-worked and thought out anyway.

emk · 2016-10-28T22:02:46Z

Happy news! My top card in Trello at work says we need support for fast -f-style searching, so I have time to start on this on Monday. I'd love to prepare you a pull request to look at. Beyond what you wrote above, do you have any further advice on how I should tackle this to make the integration process as easy as possible?

BurntSushi · 2016-10-28T22:14:46Z

My top card in Trello at work says we need support for fast -f-style searching

Do you have any requirements more specific than this? Specifically, the number of strings you need to search for?

I'd love to prepare you a pull request to look at. Beyond what you wrote above, do you have any further advice on how I should tackle this to make the integration process as easy as possible?

I think it's probably pretty important to support use of Aho-Corasick when the -F flag is given, which should bypass the regex engine entirely. Specifically, Aho-Corasick should scale quite a bit better to thousands/millions of literal strings than the regex engine will. A key problem you might face with this is the fact that Aho-Corasick doesn't actually preserve leftmost-first match semantics, and I'm not quite sure how to deal with that.

In any case, I suspect that the easiest first thing you can do here is to forget about the Aho-Corasick optimization and just join all the strings together with | and feed it to the regex engine. This may not meet your performance requirements however.

emk · 2016-10-28T22:27:41Z

Thank for the encouragement! (And for the great tool, which I already use dozens of times a day.)

As for the implementation, let's start with the simple case as a warm-up exercise and go from there. :-)

For our purposes, I could easily imagine a couple hundred strings. As far as I know, we're unlikely need to search for thousands of strings, but @seamusabshere might know otherwise. I'm still happy to try to implement the general case, because I agree that the PR just wouldn't be up to ripgrep's high standards otherwise.

In the longer run, our use case might benefit from a ripgrep API which allowed us to call the searching engine from another binary. But for our first in-house experiments, we'll have a shell script that uses curl to download a list of strings from a URL, and which then passes them to rg for a filesystem-wide search. We might also benefit from a basic implementation of #1 that can at least detect UTF-16 files automatically, even if it can't detect more exotic encodings, so that we don't miss data in files created on Windows.

BurntSushi · 2016-10-28T22:56:18Z

As for the implementation, let's start with the simple case as a warm-up exercise and go from there. :-)

To be clear, I'd be happy with a PR that did the simplest thing first! I imagine it should work fine with hundreds of strings. I don't know what it will look like at thousands or hundreds of thousands though. (I do know that my Aho-Corasick implementation can handle a million pretty easily.)

In the longer run, our use case might benefit from a ripgrep API which allowed us to call the searching engine from another binary. But for our first in-house experiments, we'll have a shell script that uses curl to download a list of strings from a URL, and which then passes them to rg for a filesystem-wide search.

Yup. #162 is the tracking issue for this. I'm almost ready to submit a PR that takes another step in that direction.

We might also benefit from a basic implementation of #1 that can at least detect UTF-16 files automatically, even if it can't detect more exotic encodings, so that we don't miss data in files created on Windows.

I honestly expect #1 to be quite hard to implement. At a high level, you just need something that takes a Read and transcodes the raw input to UTF-8, which in turn also implements Read. I think that alone is enough to work with the current implementation. The problem then comes with the output though. Do you output in the same encoding as the input? If so, that sounds pretty hairy. (Possibly even intractable.)

I am kind of hoping to delay #1 until the search code gets pushed into its own library. That way we'll have an API to look at and design from, instead of what we have today, which is a bit convoluted.

seamusabshere · 2016-10-29T23:14:08Z

examples of what i want to search for:

specific uuids
strings that look like addresses /\d+ [a-z]+ (?:st|dr)/i
/^[0-9A-F]{8}-[0-9A-F]{4}-[4][0-9A-F]{3}-[89AB][0-9A-F]{3}-[0-9A-F]{12}$/i

in case that's helpful :)

BurntSushi · 2016-10-29T23:16:06Z

@seamusabshere I think that's all part of the plan. Most of the discussion at this point is just implementation details. :-)

Note that ripgrep probably won't ever support "I want to match hundreds of thousands of regexes all at once." If you're at that point, you'll probably need something more specialized. (Hundreds of thousands of literals should be fine though, even if it's not practical in the first rev.)

BurntSushi · 2016-11-06T17:27:21Z

Note that if this supports -f- for reading from stdin, then it should satisfy #180 (which I'm now closing as a dupe).

emk · 2016-11-06T17:33:14Z

Ah, nice! OK, I have some work in progress on this issue, and I'll be sure to support that case as well. That's a great feature.

emk · 2016-11-07T23:42:29Z

The trickiest part of this so far is testing -f-, which can't be done using the existing WorkDir::stdout and WorkDir::output APIs, because we need to pipe some standard input to the process. This requires using something like spawn, as in this example. Do you have any preferences about how I should implement this?

BurntSushi · 2016-11-07T23:50:24Z

@emk If there's a way to add a convenience method or two to WorkDir to test it, then I think that would be fine. That way, we could add other tests that depend on stdin.

However, since there aren't actually any existing tests for stdin, I wouldn't be offended if you opted to skip testing -f-. :-)

emk · 2016-11-09T11:23:30Z

The first piece of this is available in #227. Thank you for getting me started in the right direction, and please feel free to suggest improvements!

This is a somewhat basic implementation of `-f-` (BurntSushi#7), with unit tests. Changes include: 1. The internals of the `pattern` function have been refactored to avoid code duplication, but there's a lot more we could do. Right now we read the entire pattern list into a `Vec`. 2. There's now a `WorkDir::pipe` command that allows sending standard input to `rg` when testing. Not implemented: aho-corasick.

BurntSushi · 2016-11-15T23:15:55Z

Closed by #227

BurntSushi added the enhancement An enhancement to the functionality of the software. label Sep 17, 2016

BurntSushi mentioned this issue Sep 23, 2016

grep's great --file option not implemented #17

Closed

BurntSushi mentioned this issue Oct 14, 2016

A readline flag #180

Closed

BurntSushi mentioned this issue Oct 21, 2016

request: option for literal searching #190

Closed

emk mentioned this issue Oct 29, 2016

add support for other text encodings #1

Closed

emk mentioned this issue Nov 9, 2016

Allow specifying patterns with -f FILE and -f- #227

Merged

BurntSushi closed this as completed Nov 15, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add support for reading patterns from a file #7

add support for reading patterns from a file #7

BurntSushi commented Sep 17, 2016

BurntSushi commented Sep 22, 2016

seamusabshere commented Sep 23, 2016

emk commented Oct 11, 2016

BurntSushi commented Oct 11, 2016

emk commented Oct 28, 2016

BurntSushi commented Oct 28, 2016

emk commented Oct 28, 2016 •

edited

Loading

BurntSushi commented Oct 28, 2016

seamusabshere commented Oct 29, 2016

BurntSushi commented Oct 29, 2016

BurntSushi commented Nov 6, 2016

emk commented Nov 6, 2016

emk commented Nov 7, 2016

BurntSushi commented Nov 7, 2016

emk commented Nov 9, 2016

BurntSushi commented Nov 15, 2016

add support for reading patterns from a file #7

add support for reading patterns from a file #7

Comments

BurntSushi commented Sep 17, 2016

BurntSushi commented Sep 22, 2016

seamusabshere commented Sep 23, 2016

emk commented Oct 11, 2016

BurntSushi commented Oct 11, 2016

emk commented Oct 28, 2016

BurntSushi commented Oct 28, 2016

emk commented Oct 28, 2016 • edited Loading

BurntSushi commented Oct 28, 2016

seamusabshere commented Oct 29, 2016

BurntSushi commented Oct 29, 2016

BurntSushi commented Nov 6, 2016

emk commented Nov 6, 2016

emk commented Nov 7, 2016

BurntSushi commented Nov 7, 2016

emk commented Nov 9, 2016

BurntSushi commented Nov 15, 2016

emk commented Oct 28, 2016 •

edited

Loading