-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add support for reading patterns from a file #7
Comments
Thinking about this just a touch more, I think it might actually be easier than I let on. The search code doesn't actually know anything about regexes---all it cares about is using the relatively simple interface of It might be nice if the trait and Aho-Corasick implementation were in the |
for context, here's the man page entry from grep:
🌞 note that neither |
I might have some time next week to take a look at this. Would you be interested in a PR following your outline above? |
@emk That'd be great! Although re-reading it, I think I'd like to keep the trait private inside of |
Happy news! My top card in Trello at work says we need support for fast |
Do you have any requirements more specific than this? Specifically, the number of strings you need to search for?
I think it's probably pretty important to support use of Aho-Corasick when the In any case, I suspect that the easiest first thing you can do here is to forget about the Aho-Corasick optimization and just join all the strings together with |
Thank for the encouragement! (And for the great tool, which I already use dozens of times a day.) As for the implementation, let's start with the simple case as a warm-up exercise and go from there. :-) For our purposes, I could easily imagine a couple hundred strings. As far as I know, we're unlikely need to search for thousands of strings, but @seamusabshere might know otherwise. I'm still happy to try to implement the general case, because I agree that the PR just wouldn't be up to ripgrep's high standards otherwise. In the longer run, our use case might benefit from a |
To be clear, I'd be happy with a PR that did the simplest thing first! I imagine it should work fine with hundreds of strings. I don't know what it will look like at thousands or hundreds of thousands though. (I do know that my Aho-Corasick implementation can handle a million pretty easily.)
Yup. #162 is the tracking issue for this. I'm almost ready to submit a PR that takes another step in that direction.
I honestly expect #1 to be quite hard to implement. At a high level, you just need something that takes a I am kind of hoping to delay #1 until the search code gets pushed into its own library. That way we'll have an API to look at and design from, instead of what we have today, which is a bit convoluted. |
examples of what i want to search for:
in case that's helpful :) |
@seamusabshere I think that's all part of the plan. Most of the discussion at this point is just implementation details. :-) Note that ripgrep probably won't ever support "I want to match hundreds of thousands of regexes all at once." If you're at that point, you'll probably need something more specialized. (Hundreds of thousands of literals should be fine though, even if it's not practical in the first rev.) |
Note that if this supports |
Ah, nice! OK, I have some work in progress on this issue, and I'll be sure to support that case as well. That's a great feature. |
The trickiest part of this so far is testing |
@emk If there's a way to add a convenience method or two to However, since there aren't actually any existing tests for stdin, I wouldn't be offended if you opted to skip testing |
The first piece of this is available in #227. Thank you for getting me started in the right direction, and please feel free to suggest improvements! |
This is a somewhat basic implementation of `-f-` (BurntSushi#7), with unit tests. Changes include: 1. The internals of the `pattern` function have been refactored to avoid code duplication, but there's a lot more we could do. Right now we read the entire pattern list into a `Vec`. 2. There's now a `WorkDir::pipe` command that allows sending standard input to `rg` when testing. Not implemented: aho-corasick.
Closed by #227 |
It would be cool to support
grep
's-f/--file
option, and it should be relatively easy for us to do so. The implementation strategy I have in mind is to just join all of the patterns/literals using a|
and hand it off to the regex engine.It might be faster to hand build an Aho-Corasick automaton (using the
aho-corasick
crate) if you know we have a bunch of literals. In fact, this is almost assuredly more memory efficient if the number of literals being searched is very large. Unfortunately, this is a harder thing to add, since it would require plumbing an Aho-Corasick automaton through all of the searching code so that either aRegex
or anAcAutomaton
could be used. Doable, but not straight-forward.The text was updated successfully, but these errors were encountered: