Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add soundex function, issue #39880 #48567

Merged
merged 22 commits into from
Apr 13, 2023
Merged

Conversation

FriendLey
Copy link
Contributor

@FriendLey FriendLey commented Apr 9, 2023

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Add soundex function. Closes #39880

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

@CLAassistant
Copy link

CLAassistant commented Apr 9, 2023

CLA assistant check
All committers have signed the CLA.

@robot-ch-test-poll1 robot-ch-test-poll1 added the pr-feature Pull request with new product feature label Apr 9, 2023
@ucasfl ucasfl added the can be tested Allows running workflows for external contributors label Apr 9, 2023
src/Functions/soundex.cpp Outdated Show resolved Hide resolved
src/Functions/soundex.cpp Outdated Show resolved Hide resolved
src/Functions/soundex.cpp Outdated Show resolved Hide resolved
src/Functions/soundex.cpp Outdated Show resolved Hide resolved
src/Functions/soundex.cpp Outdated Show resolved Hide resolved
src/Functions/soundex.cpp Outdated Show resolved Hide resolved
src/Functions/soundex.cpp Outdated Show resolved Hide resolved
src/Functions/soundex.cpp Outdated Show resolved Hide resolved
Co-authored-by: flynn <fenglv15@mails.ucas.ac.cn>
src/Functions/soundex.cpp Outdated Show resolved Hide resolved
src/Functions/soundex.cpp Outdated Show resolved Hide resolved
src/Functions/soundex.cpp Outdated Show resolved Hide resolved
src/Functions/soundex.cpp Outdated Show resolved Hide resolved
src/Functions/soundex.cpp Outdated Show resolved Hide resolved
FriendLey and others added 6 commits April 11, 2023 09:49
Co-authored-by: flynn <fenglv15@mails.ucas.ac.cn>
Co-authored-by: flynn <fenglv15@mails.ucas.ac.cn>
Co-authored-by: flynn <fenglv15@mails.ucas.ac.cn>
Co-authored-by: flynn <fenglv15@mails.ucas.ac.cn>
Co-authored-by: flynn <fenglv15@mails.ucas.ac.cn>
src/Functions/soundex.cpp Outdated Show resolved Hide resolved
@ucasfl
Copy link
Collaborator

ucasfl commented Apr 11, 2023

LGTM. @alexey-milovidov could you take a look at @FriendLey first pull request?

@Avogar Avogar self-assigned this Apr 11, 2023
@Avogar

This comment was marked as outdated.

@ucasfl

This comment was marked as outdated.

src/Functions/soundex.cpp Outdated Show resolved Hide resolved

REGISTER_FUNCTION(Soundex)
{
factory.registerFunction<FunctionStringHashFixedString<SoundexImpl>>(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inheriting from the base class for fixed string hashes is a bit weird, to be honest. A) Soundex isn't a hash function B) It is much more common to generate Strings (not: FixedString) from a function. Imho, it would be better to either copy-paste the few functions from FunctionStringHashFixedString to here and inherit directly from IFunction or look for better base class to inherit from (tip: check what function lower() is doing, src/Functions/lower.cpp).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From Wikipedia, A hash function is any function that can be used to map data of arbitrary size to fixed-size values. Soundex indexing names to strings with fixed 4 lengths, it's like a type of hash function, So I directly reuse FunctionStringHashFixedString.

wiki: https://en.wikipedia.org/wiki/Hash_function

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, if the definition is taken literally, soundex is indeed a hash function. Most people would nevertheless not consider soundex a hash function - it completely lacks the pseudo-randomness of the output which is inherent to hashing.

Producing a FixedString has a bigger problem though: FixedString is mostly used as space optimization in ClickHouse (consecutive strings need no delimiter between them). Only few functions are actually able to process them. This means that soundex in it's current form will not be composable with other functions. If we would produce data type String this problem would not exist.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your patient explanation. I approve, FixedString is really not a good output format.

src/Functions/soundex.cpp Outdated Show resolved Hide resolved
@rschu1ze
Copy link
Member

If you don't mind, I did some refactorings for consistency with the overall codebase and other functions.

@rschu1ze rschu1ze merged commit 658d6c8 into ClickHouse:master Apr 13, 2023
143 checks passed
@rschu1ze
Copy link
Member

I just added ClickHouse to the list of DBMS on Wikipedia which implement sounded 😉
--> https://en.wikipedia.org/wiki/Soundex

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
can be tested Allows running workflows for external contributors pr-feature Pull request with new product feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SOUNDEX function
6 participants