-
Notifications
You must be signed in to change notification settings - Fork 6.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RegExpTreeDict use re2 engines when processing heavy regexps #45631
RegExpTreeDict use re2 engines when processing heavy regexps #45631
Conversation
6000 lines of test files, any suggestion? |
Here is the script https://gist.github.com/hanfei1991/6a6dae16f1f3fc5d778450da7f874e10 to convert https://github.com/ua-parser/uap-core/blob/master/regexes.yaml to yamls understood by RegExpTree |
test is flaky, need to figure it out |
This is fine 🔥 |
Taking a look rn. @hanfei1991 Perhaps you want to relinquish your fork of GitHub and push directly to the upstream repository? It will make it easier for everyone to checkout PRs. |
It will mess the branches of main repo. I don't like it ... |
I read https://clickhouse.com/docs/en/sql-reference/dictionaries/external-dictionaries/regexp-tree and to someone who is not familiar with regex dictionaries, the docs don't explain them good enough (IMHO). The docs needs at least an example. |
Okay, I was too rush to write a good doc. I would add more examples and explaination (maybe next pr) |
} | ||
|
||
[[maybe_unused]] | ||
bool check(const String & data) const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(cosmetic: suggest to rename to isSimpleRegex()
(+ invert the result). This will make the call prettier, i.e. if (use_vectorscan && checker.isSimpleRegex(regex))
instead of if (use_vectorscan && !checker.check(regex))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cosmetic: bool check(const String & regexp) const
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about the re2 vs. hyperscan distinction again. The thing is that hyperscan
- does not support all of re2's syntax (see here),
- has different matching semantics than re2 (see here) and
- has different performance characteristics than re2 (ideally faster, sometimes slower)
Class ComplexRegexChecker
tries to avoid 3. by running a regex over a regex. But this is unfortunately only the tip of the iceberg. My fear is that will be regexes which match differently on re2 vs. hyperscan (case 2.) vs. don't even compile in hyperscan (case 1.). Building comprehensive check that covers all cases is infeasible and achieving good enough test coverage for hyperscan is also quite difficult.
So while the current approach is clever, I am scared that it will at one point in future break.
The regex matching functions in ClickHouse use either re2 OR hyperscan (which is documented in each case) but not both at a time and in particular there are no fallbacks implemented for above reasons.
My suggestion would be the following: RegexpTreeDictionary works by default with re2 syntax. If ClickHouse is compiled with "-DENABLE_HYPERSCAN" (which works btw. only on x86 and ARM) and if setting "regexp_dict_allow_hyperscan" is true, then the RegexpTreeDictionary will evaluate regexes exclusively with hyperscan. There will be no fallback to re2 and in particular no checking if the pattern is fast or slow in HyperScan (--> ComplexRegexChecker). It will be the user's responsibility to provide a Hyperscan-compatible pattern (which can also be evaluated quickly). At the same time, docs of setting "regexp_dict_allow_hyperscan" will note that this option shall be used only at the user's risk / that the setting is experimental.
This will then be in line with the overall way how ClickHouse treats hyperscan right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your concern makes sense. However, RegExpTree Dictionary is mainly for UA Parser, which does not likely process very special regex. The difference with hyperscan and re2 seems not easy to cause inconsistent problems.
If we don't add this fallback thing, almost all of the ua parsers are not able to work under hyperscan :( Let's do it like this util we find real problems
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Make RegExpTreeDictionary a ua parser which is compatible with https://github.com/ua-parser/uap-core
Documentation entry for user-facing changes