RegExpTreeDict use re2 engines when processing heavy regexps #45631

Merged: hanfei1991 merged 11 commits into ClickHouse:master from hanfei/regexp-refine on Feb 7, 2023

Conversation

hanfei1991
Member

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Make RegExpTreeDictionary a ua parser which is compatible with https://github.com/ua-parser/uap-core

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

@robot-ch-test-poll2 robot-ch-test-poll2 added the pr-improvement Pull request with some product improvements label Jan 25, 2023
@hanfei1991
Member Author

6000 lines of test files, any suggestion?

@hanfei1991
Member Author

Here is the script https://gist.github.com/hanfei1991/6a6dae16f1f3fc5d778450da7f874e10 that converts https://github.com/ua-parser/uap-core/blob/master/regexes.yaml into YAML files that RegExpTree understands.

@hanfei1991
Member Author

The test is flaky; I need to figure out why.

@alexey-milovidov
Member

> 6000 lines of test files, any suggestion?

This is fine 🔥

@rschu1ze
Member

rschu1ze commented Feb 3, 2023

Taking a look rn.

@hanfei1991 Perhaps you want to relinquish your fork on GitHub and push directly to the upstream repository? It would make it easier for everyone to check out PRs.

@hanfei1991
Member Author

> Taking a look rn.
>
> @hanfei1991 Perhaps you want to relinquish your fork on GitHub and push directly to the upstream repository? It would make it easier for everyone to check out PRs.

It would clutter the branches of the main repo. I don't like that ...
We can use gh pr checkout 45631 to check out the PR easily.

@rschu1ze
Member

rschu1ze commented Feb 3, 2023

I read https://clickhouse.com/docs/en/sql-reference/dictionaries/external-dictionaries/regexp-tree and, for someone who is not familiar with regex dictionaries, the docs don't explain them well enough (IMHO). The docs need at least one example.

@hanfei1991
Member Author

> I read https://clickhouse.com/docs/en/sql-reference/dictionaries/external-dictionaries/regexp-tree and, for someone who is not familiar with regex dictionaries, the docs don't explain them well enough (IMHO). The docs need at least one example.

Okay, I was in too much of a rush to write a good doc. I will add more examples and explanation (maybe in the next PR).

src/Dictionaries/RegExpTreeDictionary.cpp
}

[[maybe_unused]]
bool check(const String & data) const
Member

Cosmetic: I suggest renaming this to isSimpleRegex() and inverting the result. This makes the call prettier: if (use_vectorscan && checker.isSimpleRegex(regex)) instead of if (use_vectorscan && !checker.check(regex)).
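
A minimal sketch of the suggested naming, not the code from this PR. The class name `RegexComplexityChecker` and its heuristic (flagging `.*`, `.+`, and open-ended counted repeats such as `{3,}`) are assumptions made up for illustration; only `isSimpleRegex` and `use_vectorscan` come from the comment above.

```cpp
#include <regex>
#include <string>

// Hypothetical checker. The heuristic is an assumption for illustration,
// not the rule ClickHouse actually applies.
class RegexComplexityChecker
{
public:
    // Returns true when the pattern contains none of the constructs this
    // sketch treats as expensive: ".*", ".+", or a repeat like "{3,}".
    // The test is purely textual and does not parse the pattern.
    bool isSimpleRegex(const std::string & regexp) const
    {
        return !std::regex_search(regexp, heavy_construct);
    }

private:
    // A regex run over a regex: matches ".*", ".+" or "{N,}".
    std::regex heavy_construct = std::regex(R"(\.\*|\.\+|\{\d+,\})");
};
```

With the positive name, the call site reads without a negation: `if (use_vectorscan && checker.isSimpleRegex(regex))`.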

Member

cosmetic: bool check(const String & regexp) const

Member

I thought about the re2 vs. hyperscan distinction again. The thing is that hyperscan

  1. does not support all of re2's syntax (see here),
  2. has different matching semantics than re2 (see here) and
  3. has different performance characteristics than re2 (ideally faster, sometimes slower)

Class ComplexRegexChecker tries to avoid 3. by running a regex over a regex. But this is unfortunately only the tip of the iceberg. My fear is that there will be regexes which match differently on re2 vs. hyperscan (case 2.) or don't even compile in hyperscan (case 1.). Building a comprehensive check that covers all cases is infeasible, and achieving good enough test coverage for hyperscan is also quite difficult.

So while the current approach is clever, I am scared that it will break at some point in the future.

The regex matching functions in ClickHouse use either re2 OR hyperscan (which is documented in each case) but not both at a time and in particular there are no fallbacks implemented for above reasons.

My suggestion would be the following: RegexpTreeDictionary works by default with re2 syntax. If ClickHouse is compiled with "-DENABLE_HYPERSCAN" (which works btw. only on x86 and ARM) and if setting "regexp_dict_allow_hyperscan" is true, then the RegexpTreeDictionary will evaluate regexes exclusively with hyperscan. There will be no fallback to re2 and in particular no checking if the pattern is fast or slow in HyperScan (--> ComplexRegexChecker). It will be the user's responsibility to provide a Hyperscan-compatible pattern (which can also be evaluated quickly). At the same time, docs of setting "regexp_dict_allow_hyperscan" will note that this option shall be used only at the user's risk / that the setting is experimental.

This would then be in line with how ClickHouse treats hyperscan overall right now.
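
The proposal above boils down to choosing one engine for the whole dictionary, with no per-pattern fallback. A rough sketch under stated assumptions: `RegexEngine` and `chooseEngine` are hypothetical names, `compiled_with_hyperscan` stands for the build-time option, and `allow_hyperscan` stands for the proposed setting regexp_dict_allow_hyperscan.

```cpp
// Hypothetical names throughout; this only illustrates the proposed policy.
enum class RegexEngine { Re2, Hyperscan };

// Picks exactly one engine for every pattern in the dictionary.
// Hyperscan is used only when it is both compiled in and explicitly allowed;
// otherwise everything is evaluated with re2. There is no per-pattern fallback.
constexpr RegexEngine chooseEngine(bool compiled_with_hyperscan, bool allow_hyperscan)
{
    return (compiled_with_hyperscan && allow_hyperscan) ? RegexEngine::Hyperscan
                                                        : RegexEngine::Re2;
}
```

Under this policy a pattern that hyperscan cannot compile would surface as an error to the user instead of silently moving to re2.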

Member Author

Your concern makes sense. However, RegExpTree Dictionary is mainly for the UA parser, which is unlikely to process very unusual regexes. The differences between hyperscan and re2 seem unlikely to cause inconsistencies.

If we don't add this fallback, almost none of the UA parsers will be able to work under hyperscan :( Let's do it like this until we find real problems.
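
For contrast, the fallback approach kept in this PR routes each pattern individually: simple patterns go to hyperscan and heavy ones go to re2. The sketch below only illustrates that routing. `EnginePlan`, `planEngines`, and the caller-supplied `is_simple` predicate are hypothetical names, not ClickHouse code.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical container for the routing decision.
struct EnginePlan
{
    std::vector<std::string> hyperscan_patterns; // patterns judged simple
    std::vector<std::string> re2_patterns;       // heavy patterns, or everything when hyperscan is off
};

// Routes each pattern to one engine. `is_simple` is a caller-supplied
// predicate (an assumption of this sketch) that decides whether a pattern
// is cheap enough for hyperscan.
EnginePlan planEngines(
    const std::vector<std::string> & patterns,
    bool use_vectorscan,
    const std::function<bool(const std::string &)> & is_simple)
{
    EnginePlan plan;
    for (const auto & pattern : patterns)
    {
        if (use_vectorscan && is_simple(pattern))
            plan.hyperscan_patterns.push_back(pattern);
        else
            plan.re2_patterns.push_back(pattern);
    }
    return plan;
}
```

The trade-off raised in the review still applies: a pattern routed to hyperscan may match differently than it would under re2.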

@hanfei1991 hanfei1991 merged commit 021e6e9 into ClickHouse:master Feb 7, 2023
@hanfei1991 hanfei1991 deleted the hanfei/regexp-refine branch February 7, 2023 12:21