Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

multiFuzzyMatchAny function requirement #38046

Closed
sophie-jeong opened this issue Jun 14, 2022 · 3 comments · Fixed by #38485
Closed

multiFuzzyMatchAny function requirement #38046

sophie-jeong opened this issue Jun 14, 2022 · 3 comments · Fixed by #38485
Assignees
Labels
feature st-discussion The story requires discussion /research / expert help / design & decomposition before will be taken

Comments

@sophie-jeong
Copy link

sophie-jeong commented Jun 14, 2022

In the multiFuzzyMatchAny function, Please make it possible to use column values instead of constants for the third variable
I really need this feature so much!
I need to extract names that match ( like operator ) similarly to the data in the array value category.names , and I plan to do this regularly.

Use case

SELECT
    names,
    multiFuzzyMatchAny(names, 1, a.`category.names`)
FROM
(
    SELECT category.names AS names
    FROM mall
    ARRAY JOIN category
) AS A
CROSS JOIN a
LIMIT 10
DB::Exception: Argument at index 2 for function multiFuzzyMatchAny must be constant. (ILLEGAL_COLUMN)

Describe the solution you'd like

A clear and concise description of what you want to happen.

Describe alternatives you've considered

A clear and concise description of any alternative solutions or features you've considered.

Additional context

Add any other context or screenshots about the feature request here.

@alexey-milovidov
Copy link
Member

The reason why it requires a constant argument is that compiling a regular expression (especially for fuzzy matching) can be heavy. Doing it for every record is unreasonable. The only viable case is when you have only a few different regexes.

@alexey-milovidov alexey-milovidov added the st-discussion The story requires discussion /research / expert help / design & decomposition before will be taken label Jun 15, 2022
@sophie-jeong
Copy link
Author

I'm expecting it to be heavy.
But isn't Clickhouse an analytics-specialized DBMS?
if so, I think a function that matches similar text is absolutely necessary.
How about providing the function as an on/off function by setting a set parameter?

@rschu1ze
Copy link
Member

Similar functionality (non-const pattern/regex arguments) was added very recently for functions (I)LIKE() and MATCH() with #37251. Haven't checked in detail yet but I have at least a rough idea what needs to be changed for multiFuzzyMatchAny().

multiFuzzyMatchAny() leverages Intel's hyperscan library (at least on x86). Alexey proposed to replace this lib by vectorscan which is a community fork with SIMD support on other platforms. Having vectorscan is not a strict requirement for this ticket, but it would simplify things later on (+ make string matching functions faster on ARM). Guess we should do this first.

@rschu1ze rschu1ze self-assigned this Jun 16, 2022
rschu1ze added a commit that referenced this issue Jun 17, 2022
This commit migrates ClickHouse to Vectorscan. The first 10 min of
[0] explain the reasons for it.

* Addresses (but does not resolve) #38046

* This change renames config parameters, I guess this makes it
  backward-incompatible?

* A note about where the new dependency is referenced from (.gitmodules):
  The ClickHouse GitHub organization already hosts a fork of hyperscan.
  Unfortunately, it's not possible to host within the same GitHub
  organization a fork of the parent (hyperscan) and a fork of the
  parent's fork (vectorscan). As a workaround, ClickHouse's fork of
  hyperscan now contains a new branch which was reset to vectorscan's
  master branch.

TODO:
- make fork of vectorscan in CH org and use that fork
- search for "hyperscan" in code, remove all references
- throw hyperscan out (unlink submodule)

[0] https://www.youtube.com/watch?v=KlZWmmflW6M
rschu1ze added a commit that referenced this issue Jun 17, 2022
This commit migrates ClickHouse to Vectorscan. The first 10 min of
[0] explain the reasons for it.

(*) Addresses (but does not resolve) #38046

(*) Names of hyperscan-related config parameters (e.g.
    "max_hyperscan_regexp_length") are preserved for compatibility.
    Likewise, error codes (e.g.
    "ErrorCodes::HYPERSCAN_CANNOT_SCAN_TEXT") and function/class names
    (e.g. "HyperscanDeleter") are preserved as vectorscan aims to be a
    drop-in replacement.

(*) The ClickHouse GitHub organization already hosts a fork of
    hyperscan. Unfortunately, it's not possible to host within the same
    GitHub organization a fork of the parent (hyperscan) and a fork of
    the parent's fork (vectorscan). As a workaround, ClickHouse's fork
    of hyperscan now contains a new branch which was reset to
    vectorscan's master branch.
rschu1ze added a commit that referenced this issue Jun 20, 2022
This commit migrates ClickHouse to Vectorscan. The first 10 min of
[0] explain the reasons for it.

(*) Addresses (but does not resolve) #38046

(*) Config parameter names (e.g. "max_hyperscan_regexp_length") are
    preserved for compatibility. Likewise, error codes (e.g.
    "ErrorCodes::HYPERSCAN_CANNOT_SCAN_TEXT") and function/class names (e.g.
    "HyperscanDeleter") are preserved as vectorscan aims to be a drop-in
    replacement.

(*) The ClickHouse GitHub organization already hosts a fork of
    hyperscan. Unfortunately, it's not possible to host within the same
    GitHub organization a fork of the parent (hyperscan) and a fork of
    the parent's fork (vectorscan). As a workaround, ClickHouse's fork
    of hyperscan now contains a new branch which was reset to
    vectorscan's master branch.

[0] https://www.youtube.com/watch?v=KlZWmmflW6M
rschu1ze added a commit that referenced this issue Jun 20, 2022
This commit migrates ClickHouse to Vectorscan. The first 10 min of
[0] explain the reasons for it.

(*) Addresses (but does not resolve) #38046

(*) Config parameter names (e.g. "max_hyperscan_regexp_length") are
    preserved for compatibility. Likewise, error codes (e.g.
    "ErrorCodes::HYPERSCAN_CANNOT_SCAN_TEXT") and function/class names (e.g.
    "HyperscanDeleter") are preserved as vectorscan aims to be a drop-in
    replacement.

(*) The ClickHouse GitHub organization already hosts a fork of
    hyperscan. Unfortunately, it's not possible to host within the same
    GitHub organization a fork of the parent (hyperscan) and a fork of
    the parent's fork (vectorscan). As a workaround, ClickHouse's fork
    of hyperscan now contains a new branch which was reset to
    vectorscan's master branch.

[0] https://www.youtube.com/watch?v=KlZWmmflW6M
rschu1ze added a commit that referenced this issue Jun 20, 2022
This commit migrates ClickHouse to Vectorscan. The first 10 min of
[0] explain the reasons for it.

(*) Addresses (but does not resolve) #38046

(*) Config parameter names (e.g. "max_hyperscan_regexp_length") are
    preserved for compatibility. Likewise, error codes (e.g.
    "ErrorCodes::HYPERSCAN_CANNOT_SCAN_TEXT") and function/class names (e.g.
    "HyperscanDeleter") are preserved as vectorscan aims to be a drop-in
    replacement.

(*) The ClickHouse GitHub organization already hosts a fork of
    hyperscan. Unfortunately, it's not possible to host within the same
    GitHub organization a fork of the parent (hyperscan) and a fork of
    the parent's fork (vectorscan). As a workaround, ClickHouse's fork
    of hyperscan now contains a new branch which was reset to
    vectorscan's master branch.

[0] https://www.youtube.com/watch?v=KlZWmmflW6M
rschu1ze added a commit that referenced this issue Jun 21, 2022
This commit migrates ClickHouse to Vectorscan. The first 10 min of
[0] explain the reasons for it.

(*) Addresses (but does not resolve) #38046

(*) Config parameter names (e.g. "max_hyperscan_regexp_length") are
    preserved for compatibility. Likewise, error codes (e.g.
    "ErrorCodes::HYPERSCAN_CANNOT_SCAN_TEXT") and function/class names (e.g.
    "HyperscanDeleter") are preserved as vectorscan aims to be a drop-in
    replacement.

[0] https://www.youtube.com/watch?v=KlZWmmflW6M
rschu1ze added a commit that referenced this issue Jun 22, 2022
This commit migrates ClickHouse to Vectorscan. The first 10 min of
[0] explain the reasons for it.

(*) Addresses (but does not resolve) #38046

(*) Config parameter names (e.g. "max_hyperscan_regexp_length") are
    preserved for compatibility. Likewise, error codes (e.g.
    "ErrorCodes::HYPERSCAN_CANNOT_SCAN_TEXT") and function/class names (e.g.
    "HyperscanDeleter") are preserved as vectorscan aims to be a drop-in
    replacement.

[0] https://www.youtube.com/watch?v=KlZWmmflW6M
rschu1ze added a commit that referenced this issue Jun 24, 2022
This commit migrates ClickHouse to Vectorscan. The first 10 min of
[0] explain the reasons for it.

(*) Addresses (but does not resolve) #38046

(*) Config parameter names (e.g. "max_hyperscan_regexp_length") are
    preserved for compatibility. Likewise, error codes (e.g.
    "ErrorCodes::HYPERSCAN_CANNOT_SCAN_TEXT") and function/class names (e.g.
    "HyperscanDeleter") are preserved as vectorscan aims to be a drop-in
    replacement.

[0] https://www.youtube.com/watch?v=KlZWmmflW6M
rschu1ze added a commit that referenced this issue Jun 24, 2022
This commit migrates ClickHouse to Vectorscan. The first 10 min of
[0] explain the reasons for it.

(*) Addresses (but does not resolve) #38046

(*) Config parameter names (e.g. "max_hyperscan_regexp_length") are
    preserved for compatibility. Likewise, error codes (e.g.
    "ErrorCodes::HYPERSCAN_CANNOT_SCAN_TEXT") and function/class names (e.g.
    "HyperscanDeleter") are preserved as vectorscan aims to be a drop-in
    replacement.

[0] https://www.youtube.com/watch?v=KlZWmmflW6M
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature st-discussion The story requires discussion /research / expert help / design & decomposition before will be taken
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants