Add faster substring search #735

constinit · 2015-08-23T23:10:59Z

This commit implements the fast substring search algorithm described by
Leonid Volnitsky.

The full algorithm is described at:
http://volnitsky.com/project/str_search/

Briefly, we insert each consecutive pair of characters in the needle
into a hash table. We then step through pairs of characters in the
haystack. If a pair is in the hash table (and so is probably also in the
needle), we do a character-by-character comparison.
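
To make the description above concrete, here is a minimal Python sketch of the pair-hash idea. It is not the C code in this PR: a dict stands in for the fixed-size hash table, the names `build_pair_table` and `pair_search` are made up for illustration, and the step size of `len(needle) - 1` is my reading of Volnitsky's write-up for two-byte pairs.

```python
def build_pair_table(needle):
    """Map each 2-byte pair in the needle to the offsets where it occurs.

    Offsets are appended from the end of the needle towards the start, so
    each list is in descending order.
    """
    table = {}
    for i in range(len(needle) - 2, -1, -1):
        table.setdefault(needle[i:i + 2], []).append(i)
    return table


def pair_search(haystack, needle):
    """Return the index of the first occurrence of needle, or -1."""
    m, n = len(needle), len(haystack)
    if m < 2 or n < m:
        # Too short for pair hashing; fall back to the built-in search.
        return haystack.find(needle)

    table = build_pair_table(needle)
    step = m - 1  # every occurrence covers exactly one probed pair

    # Probe one pair every `step` bytes; p + 1 must stay inside the buffer.
    for p in range(m - 2, n - 1, step):
        # Descending offsets give ascending candidate starts, so the
        # first verified candidate is the leftmost match.
        for off in table.get(haystack[p:p + 2], ()):
            start = p - off
            # Explicit bounds check before the byte-by-byte comparison.
            if start + m <= n and haystack[start:start + m] == needle:
                return start
    return -1
```

A pair that is missing from the table lets the search skip `len(needle) - 1` bytes at once, which is where the speedup for longer needles would come from.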

Searches run 30-40% faster than Boyer-Moore on about 2 GB of text (1.31 s down to 0.80 s in the run below, roughly 39% less wall time):

./ag_hash --stats blahblahbla ~/Downloads/wiki/
0 matches
0 files contained matches
4 files searched
1956034523 bytes searched
0.800452 seconds

./ag_master --stats blahblahbla ~/Downloads/wiki/
0 matches
0 files contained matches
4 files searched
1956034523 bytes searched
1.310972 seconds
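
For reference, the two runs above can be turned into throughput and speedup figures with a few lines of Python. The numbers are copied from the `--stats` output; the variable and function names are only illustrative.

```python
BYTES_SEARCHED = 1956034523  # same corpus in both runs
T_HASH = 0.800452            # seconds, ./ag_hash
T_MASTER = 1.310972          # seconds, ./ag_master


def throughput_gb_per_s(nbytes, seconds):
    """Bytes searched per second, in units of 10**9 bytes."""
    return nbytes / seconds / 1e9


# Fraction of wall time saved relative to master.
time_saved = (T_MASTER - T_HASH) / T_MASTER
# How many times faster the hash branch finished.
speedup = T_MASTER / T_HASH

print(f"hash:   {throughput_gb_per_s(BYTES_SEARCHED, T_HASH):.2f} GB/s")
print(f"master: {throughput_gb_per_s(BYTES_SEARCHED, T_MASTER):.2f} GB/s")
print(f"time saved: {time_saved:.1%}, speedup: {speedup:.2f}x")
```

This is a single run per binary with a needle that never matches, so it mostly measures the skip loop rather than match verification.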

ggreer · 2015-09-06T05:23:19Z

This is a pretty cool idea, but it segfaults on my system when I try to search my code directory. Have you tried it on any large corpus of data?

constinit · 2015-09-07T23:03:39Z

Thanks for reviewing. I think I solved the problem: I was overrunning the file buffer because I assumed it was null-terminated.
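
The actual fix is in the diff, not in this thread, so the following is only a sketch of this class of bug. In Python an `IndexError` stands in for the out-of-bounds read that would segfault in C; both function names are hypothetical.

```python
def pair_positions_sentinel(buf, step):
    """Buggy pattern: keep stepping until a NUL byte is seen.

    If buf is not NUL-terminated (for example a file buffer), the loop
    reads past the end of the data.
    """
    positions = []
    p = 0
    while buf[p] != 0 and buf[p + 1] != 0:  # IndexError when no NUL exists
        positions.append(p)
        p += step
    return positions


def pair_positions_bounded(buf, length, step):
    """Safer pattern: stop based on the known buffer length."""
    positions = []
    p = 0
    while p + 1 < length:  # both bytes of the pair must be in bounds
        positions.append(p)
        p += step
    return positions
```

With `pair_positions_bounded(b"abcdef", 6, 2)` the loop stops after position 4, whereas the sentinel version walks off the end of the same buffer.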

I tried a randomized testing approach on a large corpus, and I haven't managed to crash it since (except for the unrelated issue #741). I encourage you to take another look.
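
The thread does not say how the randomized testing was set up. One common shape for it is a differential test that compares a search function against a trusted reference on random inputs. The sketch below assumes that approach; `fuzz_search` and `naive_search` are hypothetical names, and Python's `bytes.find` serves as the reference.

```python
import random


def naive_search(haystack, needle):
    """Simple candidate implementation: try every start position."""
    m = len(needle)
    for start in range(len(haystack) - m + 1):
        if haystack[start:start + m] == needle:
            return start
    return -1


def fuzz_search(search_fn, rounds=1000, seed=0):
    """Compare search_fn with bytes.find on random inputs.

    Returns a list of (haystack, needle, got, expected) tuples for every
    disagreement; an empty list means no mismatch was found.
    """
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    mismatches = []
    for _ in range(rounds):
        # A two-letter alphabet makes matches and near-matches frequent.
        haystack = bytes(rng.choices(b"ab", k=rng.randint(0, 50)))
        needle = bytes(rng.choices(b"ab", k=rng.randint(1, 4)))
        got = search_fn(haystack, needle)
        expected = haystack.find(needle)
        if got != expected:
            mismatches.append((haystack, needle, got, expected))
    return mismatches
```

Calling `fuzz_search(naive_search)` should return an empty list; passing the function under test instead would surface wrong offsets, though a memory overrun in C would need a sanitizer or a crash to show up.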


ggreer · 2016-11-25T23:35:37Z

Thanks a lot for this PR. You've improved the speed of ag more than I've managed to in the past two years!

constinit force-pushed the hash branch 5 times, most recently from c952340 to 7cee24b (September 7, 2015 22:00)

constinit force-pushed the hash branch from 7cee24b to e96eb42 (September 20, 2015 01:12)

ggreer merged commit e96eb42 into ggreer:master Nov 25, 2016
