Add faster substring search #735

Merged
merged 1 commit into from
Nov 25, 2016

constinit
Contributor

This commit implements the fast substring search algorithm described by
Leonid Volnitsky.

The full algorithm is described at:
http://volnitsky.com/project/str_search/

Briefly, we insert each consecutive pair of characters in the needle
into a hash table. We then step through pairs of characters in the
haystack. If a pair is in the hash table (and therefore probably also
in the needle), we do a character-by-character comparison.
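
The steps above can be sketched in C roughly as follows. This is an illustration, not the code from this commit: the names `pair_search`, `pair_key`, and `naive_search` are made up for the example, and the table is a direct-indexed array with one slot per possible 2-byte pair rather than a compact hash table, so a hit here means the pair is definitely in the needle rather than "probably". The sketch samples one haystack pair every `m - 1` bytes, which is enough because any occurrence of an `m`-byte needle covers `m - 1` consecutive pair positions.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define PAIR_SLOTS 65536u /* one slot per possible 2-byte pair */

/* Table key for the two bytes starting at p. */
static size_t pair_key(const unsigned char *p) {
    return ((size_t)p[0] << 8) | p[1];
}

/* Plain scan, used for 1-byte needles and when allocation fails.
 * Requires 1 <= m <= n. */
static ptrdiff_t naive_search(const unsigned char *hay, size_t n,
                              const unsigned char *needle, size_t m) {
    for (size_t s = 0; s + m <= n; s++) {
        if (memcmp(hay + s, needle, m) == 0) return (ptrdiff_t)s;
    }
    return -1;
}

/* Hypothetical function: returns the offset of the first match, or -1.
 * Both buffers are length-delimited; no NUL terminator is assumed. */
ptrdiff_t pair_search(const char *haystack, size_t n,
                      const char *needle_str, size_t m) {
    const unsigned char *hay = (const unsigned char *)haystack;
    const unsigned char *needle = (const unsigned char *)needle_str;

    if (m == 0) return 0;
    if (m > n) return -1;
    if (m == 1) return naive_search(hay, n, needle, m);

    /* head[key] is 1 + the largest needle offset holding that pair
     * (0 = pair absent); prev[i] links to the next smaller offset. */
    size_t *head = calloc(PAIR_SLOTS, sizeof *head);
    size_t *prev = calloc(m, sizeof *prev);
    if (head == NULL || prev == NULL) {
        free(head);
        free(prev);
        return naive_search(hay, n, needle, m);
    }

    /* Insert each consecutive pair of needle bytes. */
    for (size_t i = 0; i + 1 < m; i++) {
        size_t key = pair_key(needle + i);
        prev[i] = head[key];
        head[key] = i + 1;
    }

    ptrdiff_t result = -1;
    size_t step = m - 1;
    /* p + 1 < n keeps the 2-byte read inside the haystack. */
    for (size_t p = 0; p + 1 < n && result < 0; p += step) {
        /* Candidates come largest offset first, so the first full
         * match found is the leftmost one for this pair position. */
        for (size_t e = head[pair_key(hay + p)]; e != 0; e = prev[e - 1]) {
            size_t i = e - 1;
            /* Skip alignments that would start before the haystack
             * or run past its end. */
            if (i > p || p - i > n - m) continue;
            if (memcmp(hay + (p - i), needle, m) == 0) {
                result = (ptrdiff_t)(p - i);
                break;
            }
        }
    }

    free(head);
    free(prev);
    return result;
}
```

For example, `pair_search("hello world", 11, "world", 5)` returns 6. A real implementation would presumably build the table once per needle and reuse it across files instead of allocating it on every call as this sketch does.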

Searching 2GB of text takes 30-40% less time than with Boyer-Moore:

./ag_hash --stats blahblahbla ~/Downloads/wiki/
0 matches
0 files contained matches
4 files searched
1956034523 bytes searched
0.800452 seconds

./ag_master --stats blahblahbla ~/Downloads/wiki/
0 matches
0 files contained matches
4 files searched
1956034523 bytes searched
1.310972 seconds

@ggreer
Owner

ggreer commented Sep 6, 2015

This is a pretty cool idea, but it segfaults on my system when I try to search my code directory. Have you tried it on any large corpus of data?

@constinit
Contributor Author

Thanks for reviewing. I think I solved the problem: I was overrunning the file buffer because I assumed it was null-terminated.
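
The thread does not show where the overrun happened, so the following is only an illustration of the general fix: when a buffer comes with an explicit length and no terminator, every comparison has to be bounded by that length. The helper name `matches_at` and its parameters are hypothetical and do not come from the commit.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true if needle occurs at buf[pos], without
 * reading past buf[buf_len - 1]. buf need not be NUL-terminated. */
bool matches_at(const char *buf, size_t buf_len, size_t pos,
                const char *needle, size_t needle_len) {
    /* Written with a subtraction so pos + needle_len cannot wrap. */
    if (needle_len > buf_len || pos > buf_len - needle_len) {
        return false;
    }
    return memcmp(buf + pos, needle, needle_len) == 0;
}
```

The check rejects any position where the needle would extend past the end of the buffer, instead of relying on a terminating byte to stop the comparison.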

I tried a randomized testing approach on a large corpus, and I haven't managed to crash it since (except for unrelated issue #741). I encourage you to take another look.

This commit implements the fast substring search algorithm described by
Leonid Volnitsky (see http://volnitsky.com/project/str_search/).

The idea is that we insert each consecutive pair of characters in the
needle into a hash table. We then step through pairs of characters in
the haystack. If a pair is in the hash table (and therefore probably
also in the needle), we do a character-by-character comparison.

Searching 2GB of text takes 30-40% less time than with Boyer-Moore:

> ./ag_hash --stats blahblahbla ~/Downloads/wiki/
0 matches
0 files contained matches
4 files searched
1956034523 bytes searched
0.800452 seconds

> ./ag_master --stats blahblahbla ~/Downloads/wiki/
0 matches
0 files contained matches
4 files searched
1956034523 bytes searched
1.310972 seconds

@ggreer
Owner

ggreer commented Nov 25, 2016

Thanks a lot for this PR. You've improved the speed of ag more than I've managed to in the past two years!
