how about the new approach base on 'Double Array Trie' #137

SeekPoint · 2021-01-03T14:15:21Z

Aho Corasick algorithm based on Double Array Trie
https://github.com/hankcs/AhoCorasickDoubleArrayTrie

WojciechMula · 2021-01-03T18:03:03Z

Looks interesting, thanks for the pointer!

SeekPoint · 2021-01-04T02:42:54Z

here, the paper..
https://dl.acm.org/doi/10.1109/32.31365

waiting good news

Dobatymo · 2021-01-04T03:35:20Z

In addition to the performance enhancements, this might also be a good candidate for a 'shared memory data structure' discussed here #114

pombredanne · 2021-01-31T09:19:23Z

FWIW, @mischasan has had this other approach using interleaved arrays (for a long while) https://github.com/mischasan/aho-corasick https://docs.google.com/document/d/1e9Qbn22__togYgQ7PNyCz3YzIIVPKvrf8PCrFa74IFM/edit and is in C++.
@WojciechMula take in this project is fairly tight and compact... IMHO any fundamental change to the core would need super careful testing and benchmarking as this is doubtful that they would perform any better until proven ;)

WojciechMula · 2021-01-31T17:35:36Z

That @mischasan project seems great!

I wish I could put more effort in this project, there are some obstacles I'm unable to overcome. Like, a day has 24 hours. :) I quite often think about improving the module, about trying different data structures, extending API etc. But this would require to allocate a lot of free time if it has to be done well.

mischasan · 2021-01-31T18:55:05Z

Sympathy for your days of only 24 hours :-) I'm living the life of a single working mum, with two covid-vulnerable dependents. More power to you for contributing open source. An interleaved array, and the breadth-first insertion of tree nodes, was a data-oriented design choice. Having one flat mem block shareable by multiple processes paid off. What hurdles to that does managed memory (python) present? Or not relevant? Looked at your README: "... approximate (multi-pattern string search)" had me curious. spamd uses acism of exact chunks, to support regexes. What's your "approximate" about? The only work I planned for acism was restoring correct backlink-trimming. Anyway, all the best.

…

On 31/01/2021, Wojciech Muła ***@***.***> wrote: That @mischasan project seems great! I wish I could put more effort in this project, there are some obstacles I'm unable to overcome. Like, a day has 24 hours. :) I quite often think about improving the module, about trying different data structures, extending API etc. But this would require to allocate a lot of free time if it has to be done well. -- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: #137 (comment)

-- Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection.

mischasan · 2021-01-31T18:57:26Z

You seem to refer to a prior conversation; if it's not private, could you point me at that?

…

On 31/01/2021, Mischa Sandberg ***@***.***> wrote: Sympathy for your days of only 24 hours :-) I'm living the life of a single working mum, with two covid-vulnerable dependents. More power to you for contributing open source. An interleaved array, and the breadth-first insertion of tree nodes, was a data-oriented design choice. Having one flat mem block shareable by multiple processes paid off. What hurdles to that does managed memory (python) present? Or not relevant? Looked at your README: "... approximate (multi-pattern string search)" had me curious. spamd uses acism of exact chunks, to support regexes. What's your "approximate" about? The only work I planned for acism was restoring correct backlink-trimming. Anyway, all the best. On 31/01/2021, Wojciech Muła ***@***.***> wrote: > That @mischasan project seems great! > > I wish I could put more effort in this project, there are some obstacles > I'm > unable to overcome. Like, a day has 24 hours. :) I quite often think > about > improving the module, about trying different data structures, extending > API > etc. But this would require to allocate a lot of free time if it has to > be > done well. > > -- > You are receiving this because you were mentioned. > Reply to this email directly or view it on GitHub: > #137 (comment) -- Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection.

-- Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection.

mischasan · 2021-01-31T19:08:36Z

Could you point me at the perf stats for pyaho? And at anything you thought was non-obvious smart about it? rspamd (not my project) uses million-string match sets; acism fits them in <10MB -- necessary for a router-level process. Genome pattern matching (U of Queensland; can't remember the name) used similar huge sets. Did you have test case of those sizes?

…

On 31/01/2021, Mischa Sandberg ***@***.***> wrote: Sympathy for your days of only 24 hours :-) I'm living the life of a single working mum, with two covid-vulnerable dependents. More power to you for contributing open source. An interleaved array, and the breadth-first insertion of tree nodes, was a data-oriented design choice. Having one flat mem block shareable by multiple processes paid off. What hurdles to that does managed memory (python) present? Or not relevant? Looked at your README: "... approximate (multi-pattern string search)" had me curious. spamd uses acism of exact chunks, to support regexes. What's your "approximate" about? The only work I planned for acism was restoring correct backlink-trimming. Anyway, all the best. On 31/01/2021, Wojciech Muła ***@***.***> wrote: > That @mischasan project seems great! > > I wish I could put more effort in this project, there are some obstacles > I'm > unable to overcome. Like, a day has 24 hours. :) I quite often think > about > improving the module, about trying different data structures, extending > API > etc. But this would require to allocate a lot of free time if it has to > be > done well. > > -- > You are receiving this because you were mentioned. > Reply to this email directly or view it on GitHub: > #137 (comment) -- Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection.

-- Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection.

pombredanne · 2021-01-31T19:35:58Z

Looked at your README: "... approximate (multi-pattern string search)" had me curious. spamd uses acism of exact chunks, to support regexes. What's your "approximate" about?

@mischasan I am the culprit for "approximate" word addition in the README (I did not write the feature) and this is essentially because of the wildcard feature https://pyahocorasick.readthedocs.io/en/latest/#wildcard-search which I reckon may be not entirely correct as a qualifier

mischasan · 2021-01-31T19:55:18Z

Aucune problème. (Mieux en anglais; en dépit d'être canadien :) The backlink pruning originally misimplemented (naïvement) in acism would apply to any aho-corasick tree. I'll think about that. When it comes to (simple) wildcard matching, it's hard to beat flex/lex.

…

On 31/01/2021, Philippe Ombredanne ***@***.***> wrote: > Looked at your README: "... approximate (multi-pattern string search)" had > me curious. spamd uses acism of exact chunks, to support regexes. What's > your "approximate" about? @mischasan I am the culprit for "approximate" addition and this is essentially because of the wildcard feature https://pyahocorasick.readthedocs.io/en/latest/#wildcard-search which I reckon may be not entirely correct as a qualifier -- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: #137 (comment)

-- Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection.

pombredanne · 2021-01-31T20:05:28Z

@mischasan I do not have hard stats, but I use pyahocorasick with a dataset of about ~250MB for about 21K patterns of very different sizes: some a few bytes and some 10's of KB. This is in https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data
The main automaton has these stats:

$ python
Python 3.6.10 (default, Jun 13 2020, 08:53:46) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from licensedcode.cache import *
>>> get_index().rules_automaton.get_stats()
{'nodes_count': 3156589, 'words_count': 21370, 'longest_word': 14062, 'links_count': 3156588, 'sizeof_node': 32, 'total_size': 126263552}

... knowing that these are based on strings converted converted sequences of 16bit ints using a dictionary mapping known words to 16 bits ints, and this prior to being added to the Trie, so in my case this is likely more compact than if using the original bytes directly. 126MB is roughly the size of that automaton.
I did not run explicit benchmarks on that, but it is anecdotally orders of magnitude faster than anything else I had tried before and since and that has a Python API.

pombredanne · 2021-01-31T20:08:43Z

@mischasan I guess that a long while ago I was considering building a Python wrapper for your C++ acism based on mischasan/aho-corasick#12 (comment) ... but then I found Wojciech's pyahocorasick . Now it's all coming back to me :)

mischasan · 2021-01-31T21:10:29Z

Just to clarify: the input pattern strings are UTF-16; or just the internally-converted dictionary keys? And the dictionary mapping can handle up to 65535 strings?

…

On 31/01/2021, Philippe Ombredanne ***@***.***> wrote: @mischasan I guess that a long while ago I was considering building a Python wrapper for your C++ acism based on mischasan/aho-corasick#12 (comment) ... but then I found Wojciech's pyahocorasick . Now it's all coming back to me :) -- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: #137 (comment)

-- Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection.

mischasan · 2021-01-31T21:12:45Z

And looking back at the acism version you talked about (licensing), wow, five years just flew by. I've done just *way* too much proprietary work, related to db engines.

…

On 31/01/2021, Mischa Sandberg ***@***.***> wrote: Just to clarify: the input pattern strings are UTF-16; or just the internally-converted dictionary keys? And the dictionary mapping can handle up to 65535 strings? On 31/01/2021, Philippe Ombredanne ***@***.***> wrote: > @mischasan I guess that a long while ago I was considering building a > Python > wrapper for your C++ acism based on > mischasan/aho-corasick#12 (comment) > ... but then I found Wojciech's pyahocorasick . Now it's all coming back > to > me :) > > -- > You are receiving this because you were mentioned. > Reply to this email directly or view it on GitHub: > #137 (comment) -- Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection.

-- Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection.

pombredanne · 2021-01-31T22:44:33Z

Just to clarify: the input pattern strings are UTF-16; or just the internally-converted dictionary keys? And the dictionary mapping can handle up to 65535 strings?

@mischasan in my case the inputs are long open source license texts, medium notices and short mentions. They are broken into tokens (e.g. split in words where anything space or punctuation is a separator). Each word is looked up in a dictionary keyed by these words, and where the value is a 16 bits int. The sequence of all 16 bits values for a given input string (e.g. a license text) becomes the bytes fed to the automaton for indexing.

At search time, the input text to search patterns for (e.g. some text where we want to find a license) is subjected to the same transformation in a sequence of 2 bytes tokens then used for an Aho Corasick search.

That's a highly specialized use case of course.... the dictionary is basically mapping words known to be used in any license and relevant more or less "legally".... Think of these as the whole vocabulary of a language that I call "legalese".
Once the dictionary transformation is done, words like "license" or "copyright" may be represented by xAB or x78 and are all concatenated back in a bytes string. After that step we have something plain from an ahocorasick point of view except that this is a 2bytes "alphabet" that's not ASCII and not Unicode or UTF, just an encoded bytes string.

And yes in my case the dictionary has up to 65535 known words.... But guess what legalese vocabulary is much poorer than that ;) ... ATM is it about 20000 words long and growing fairly slowly when new licenses and notices are added.

pombredanne added the question label Feb 20, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

how about the new approach base on 'Double Array Trie' #137

how about the new approach base on 'Double Array Trie' #137

SeekPoint commented Jan 3, 2021

WojciechMula commented Jan 3, 2021

SeekPoint commented Jan 4, 2021

Dobatymo commented Jan 4, 2021

pombredanne commented Jan 31, 2021

WojciechMula commented Jan 31, 2021

mischasan commented Jan 31, 2021 via email

mischasan commented Jan 31, 2021 via email

mischasan commented Jan 31, 2021 via email

pombredanne commented Jan 31, 2021 •

edited

Loading

mischasan commented Jan 31, 2021 via email

pombredanne commented Jan 31, 2021 •

edited

Loading

pombredanne commented Jan 31, 2021

mischasan commented Jan 31, 2021 via email

mischasan commented Jan 31, 2021 via email

pombredanne commented Jan 31, 2021

how about the new approach base on 'Double Array Trie' #137

how about the new approach base on 'Double Array Trie' #137

Comments

SeekPoint commented Jan 3, 2021

WojciechMula commented Jan 3, 2021

SeekPoint commented Jan 4, 2021

Dobatymo commented Jan 4, 2021

pombredanne commented Jan 31, 2021

WojciechMula commented Jan 31, 2021

mischasan commented Jan 31, 2021 via email

mischasan commented Jan 31, 2021 via email

mischasan commented Jan 31, 2021 via email

pombredanne commented Jan 31, 2021 • edited Loading

mischasan commented Jan 31, 2021 via email

pombredanne commented Jan 31, 2021 • edited Loading

pombredanne commented Jan 31, 2021

mischasan commented Jan 31, 2021 via email

mischasan commented Jan 31, 2021 via email

pombredanne commented Jan 31, 2021

pombredanne commented Jan 31, 2021 •

edited

Loading

pombredanne commented Jan 31, 2021 •

edited

Loading