-
Notifications
You must be signed in to change notification settings - Fork 6.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add soundex function, issue #39880 #48567
Conversation
Co-authored-by: flynn <fenglv15@mails.ucas.ac.cn>
Co-authored-by: flynn <fenglv15@mails.ucas.ac.cn>
Co-authored-by: flynn <fenglv15@mails.ucas.ac.cn>
Co-authored-by: flynn <fenglv15@mails.ucas.ac.cn>
Co-authored-by: flynn <fenglv15@mails.ucas.ac.cn>
Co-authored-by: flynn <fenglv15@mails.ucas.ac.cn>
LGTM. @alexey-milovidov could you take a look at @FriendLey first pull request? |
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
src/Functions/soundex.cpp
Outdated
|
||
REGISTER_FUNCTION(Soundex) | ||
{ | ||
factory.registerFunction<FunctionStringHashFixedString<SoundexImpl>>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inheriting from the base class for fixed string hashes is a bit weird, to be honest. A) Soundex isn't a hash function B) It is much more common to generate String
s (not: FixedString
) from a function. Imho, it would be better to either copy-paste the few functions from FunctionStringHashFixedString
to here and inherit directly from IFunction
or look for better base class to inherit from (tip: check what function lower()
is doing, src/Functions/lower.cpp).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From Wikipedia, A hash function is any function that can be used to map data of arbitrary size to fixed-size values. Soundex indexing names to strings with fixed 4 lengths, it's like a type of hash function, So I directly reuse FunctionStringHashFixedString.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, if the definition is taken literally, soundex is indeed a hash function. Most people would nevertheless not consider soundex a hash function - it completely lacks the pseudo-randomness of the output which is inherent to hashing.
Producing a FixedString
has a bigger problem though: FixedString is mostly used as space optimization in ClickHouse (consecutive strings need no delimiter between them). Only few functions are actually able to process them. This means that soundex in it's current form will not be composable with other functions. If we would produce data type String
this problem would not exist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your patient explanation. I approve, FixedString is really not a good output format.
If you don't mind, I did some refactorings for consistency with the overall codebase and other functions. |
I just added ClickHouse to the list of DBMS on Wikipedia which implement sounded 😉 |
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Add soundex function. Closes #39880
Documentation entry for user-facing changes