-
Notifications
You must be signed in to change notification settings - Fork 124
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
how about the new approach base on 'Double Array Trie' #137
Comments
Looks interesting, thanks for the pointer! |
here, the paper.. waiting good news |
In addition to the performance enhancements, this might also be a good candidate for a 'shared memory data structure' discussed here #114 |
FWIW, @mischasan has had this other approach using interleaved arrays (for a long while) https://github.com/mischasan/aho-corasick https://docs.google.com/document/d/1e9Qbn22__togYgQ7PNyCz3YzIIVPKvrf8PCrFa74IFM/edit and is in C++. |
That @mischasan project seems great! I wish I could put more effort in this project, there are some obstacles I'm unable to overcome. Like, a day has 24 hours. :) I quite often think about improving the module, about trying different data structures, extending API etc. But this would require to allocate a lot of free time if it has to be done well. |
Sympathy for your days of only 24 hours :-)
I'm living the life of a single working mum, with two covid-vulnerable
dependents.
More power to you for contributing open source.
An interleaved array, and the breadth-first insertion of tree nodes,
was a data-oriented design choice. Having one flat mem block shareable
by multiple processes paid off.
What hurdles to that does managed memory (python) present? Or not relevant?
Looked at your README: "... approximate (multi-pattern string search)"
had me curious.
spamd uses acism of exact chunks, to support regexes. What's your
"approximate" about?
The only work I planned for acism was restoring correct backlink-trimming.
Anyway, all the best.
…On 31/01/2021, Wojciech Muła ***@***.***> wrote:
That @mischasan project seems great!
I wish I could put more effort in this project, there are some obstacles I'm
unable to overcome. Like, a day has 24 hours. :) I quite often think about
improving the module, about trying different data structures, extending API
etc. But this would require to allocate a lot of free time if it has to be
done well.
--
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub:
#137 (comment)
--
Engineers think equations approximate reality.
Physicists think reality approximates the equations.
Mathematicians never make the connection.
|
You seem to refer to a prior conversation; if it's not private, could
you point me at that?
…On 31/01/2021, Mischa Sandberg ***@***.***> wrote:
Sympathy for your days of only 24 hours :-)
I'm living the life of a single working mum, with two covid-vulnerable
dependents.
More power to you for contributing open source.
An interleaved array, and the breadth-first insertion of tree nodes,
was a data-oriented design choice. Having one flat mem block shareable
by multiple processes paid off.
What hurdles to that does managed memory (python) present? Or not relevant?
Looked at your README: "... approximate (multi-pattern string search)"
had me curious.
spamd uses acism of exact chunks, to support regexes. What's your
"approximate" about?
The only work I planned for acism was restoring correct backlink-trimming.
Anyway, all the best.
On 31/01/2021, Wojciech Muła ***@***.***> wrote:
> That @mischasan project seems great!
>
> I wish I could put more effort in this project, there are some obstacles
> I'm
> unable to overcome. Like, a day has 24 hours. :) I quite often think
> about
> improving the module, about trying different data structures, extending
> API
> etc. But this would require to allocate a lot of free time if it has to
> be
> done well.
>
> --
> You are receiving this because you were mentioned.
> Reply to this email directly or view it on GitHub:
> #137 (comment)
--
Engineers think equations approximate reality.
Physicists think reality approximates the equations.
Mathematicians never make the connection.
--
Engineers think equations approximate reality.
Physicists think reality approximates the equations.
Mathematicians never make the connection.
|
Could you point me at the perf stats for pyaho?
And at anything you thought was non-obvious smart about it?
rspamd (not my project) uses million-string match sets; acism fits
them in <10MB -- necessary for a router-level process. Genome pattern
matching (U of Queensland; can't remember the name) used similar huge
sets. Did you have test case of those sizes?
…On 31/01/2021, Mischa Sandberg ***@***.***> wrote:
Sympathy for your days of only 24 hours :-)
I'm living the life of a single working mum, with two covid-vulnerable
dependents.
More power to you for contributing open source.
An interleaved array, and the breadth-first insertion of tree nodes,
was a data-oriented design choice. Having one flat mem block shareable
by multiple processes paid off.
What hurdles to that does managed memory (python) present? Or not relevant?
Looked at your README: "... approximate (multi-pattern string search)"
had me curious.
spamd uses acism of exact chunks, to support regexes. What's your
"approximate" about?
The only work I planned for acism was restoring correct backlink-trimming.
Anyway, all the best.
On 31/01/2021, Wojciech Muła ***@***.***> wrote:
> That @mischasan project seems great!
>
> I wish I could put more effort in this project, there are some obstacles
> I'm
> unable to overcome. Like, a day has 24 hours. :) I quite often think
> about
> improving the module, about trying different data structures, extending
> API
> etc. But this would require to allocate a lot of free time if it has to
> be
> done well.
>
> --
> You are receiving this because you were mentioned.
> Reply to this email directly or view it on GitHub:
> #137 (comment)
--
Engineers think equations approximate reality.
Physicists think reality approximates the equations.
Mathematicians never make the connection.
--
Engineers think equations approximate reality.
Physicists think reality approximates the equations.
Mathematicians never make the connection.
|
@mischasan I am the culprit for "approximate" word addition in the README (I did not write the feature) and this is essentially because of the wildcard feature https://pyahocorasick.readthedocs.io/en/latest/#wildcard-search which I reckon may be not entirely correct as a qualifier |
Aucune problème.
(Mieux en anglais; en dépit d'être canadien :)
The backlink pruning originally misimplemented (naïvement) in acism
would apply to any aho-corasick tree. I'll think about that.
When it comes to (simple) wildcard matching, it's hard to beat flex/lex.
…On 31/01/2021, Philippe Ombredanne ***@***.***> wrote:
> Looked at your README: "... approximate (multi-pattern string search)" had
> me curious. spamd uses acism of exact chunks, to support regexes. What's
> your "approximate" about?
@mischasan I am the culprit for "approximate" addition and this is
essentially because of the wildcard feature
https://pyahocorasick.readthedocs.io/en/latest/#wildcard-search which I
reckon may be not entirely correct as a qualifier
--
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub:
#137 (comment)
--
Engineers think equations approximate reality.
Physicists think reality approximates the equations.
Mathematicians never make the connection.
|
@mischasan I do not have hard stats, but I use pyahocorasick with a dataset of about ~250MB for about 21K patterns of very different sizes: some a few bytes and some 10's of KB. This is in https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data
... knowing that these are based on strings converted converted sequences of 16bit ints using a dictionary mapping known words to 16 bits ints, and this prior to being added to the Trie, so in my case this is likely more compact than if using the original bytes directly. 126MB is roughly the size of that automaton. |
@mischasan I guess that a long while ago I was considering building a Python wrapper for your C++ acism based on mischasan/aho-corasick#12 (comment) ... but then I found Wojciech's pyahocorasick . Now it's all coming back to me :) |
Just to clarify: the input pattern strings are UTF-16; or just the
internally-converted dictionary keys? And the dictionary mapping can
handle up to 65535 strings?
…On 31/01/2021, Philippe Ombredanne ***@***.***> wrote:
@mischasan I guess that a long while ago I was considering building a Python
wrapper for your C++ acism based on
mischasan/aho-corasick#12 (comment)
... but then I found Wojciech's pyahocorasick . Now it's all coming back to
me :)
--
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub:
#137 (comment)
--
Engineers think equations approximate reality.
Physicists think reality approximates the equations.
Mathematicians never make the connection.
|
And looking back at the acism version you talked about (licensing),
wow, five years just flew by. I've done just *way* too much proprietary work,
related to db engines.
…On 31/01/2021, Mischa Sandberg ***@***.***> wrote:
Just to clarify: the input pattern strings are UTF-16; or just the
internally-converted dictionary keys? And the dictionary mapping can
handle up to 65535 strings?
On 31/01/2021, Philippe Ombredanne ***@***.***> wrote:
> @mischasan I guess that a long while ago I was considering building a
> Python
> wrapper for your C++ acism based on
> mischasan/aho-corasick#12 (comment)
> ... but then I found Wojciech's pyahocorasick . Now it's all coming back
> to
> me :)
>
> --
> You are receiving this because you were mentioned.
> Reply to this email directly or view it on GitHub:
> #137 (comment)
--
Engineers think equations approximate reality.
Physicists think reality approximates the equations.
Mathematicians never make the connection.
--
Engineers think equations approximate reality.
Physicists think reality approximates the equations.
Mathematicians never make the connection.
|
@mischasan in my case the inputs are long open source license texts, medium notices and short mentions. They are broken into tokens (e.g. split in words where anything space or punctuation is a separator). Each word is looked up in a dictionary keyed by these words, and where the value is a 16 bits int. The sequence of all 16 bits values for a given input string (e.g. a license text) becomes the bytes fed to the automaton for indexing. At search time, the input text to search patterns for (e.g. some text where we want to find a license) is subjected to the same transformation in a sequence of 2 bytes tokens then used for an Aho Corasick search. That's a highly specialized use case of course.... the dictionary is basically mapping words known to be used in any license and relevant more or less "legally".... Think of these as the whole vocabulary of a language that I call "legalese". And yes in my case the dictionary has up to 65535 known words.... But guess what legalese vocabulary is much poorer than that ;) ... ATM is it about 20000 words long and growing fairly slowly when new licenses and notices are added. |
Aho Corasick algorithm based on Double Array Trie
https://github.com/hankcs/AhoCorasickDoubleArrayTrie
The text was updated successfully, but these errors were encountered: