-
Notifications
You must be signed in to change notification settings - Fork 3.7k
Feat levenshtein final #59016
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Feat levenshtein final #59016
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
| """ | ||
|
|
||
| // 3. Column vs Column Test | ||
| qt_select_col_col "SELECT id, levenshtein(s1, s2) FROM ${tableName} ORDER BY id" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn't match with your testcase
| struct LevenshteinImpl { | ||
| static constexpr auto name = "levenshtein"; | ||
|
|
||
| // 必需的类型定义 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dont use chinese
| static thread_local std::vector<int32_t> a_code_points; | ||
| static thread_local std::vector<int32_t> b_code_points; | ||
|
|
||
| // ADMIN REQUIREMENT: Renaming or adding clear comments for these two DP arrays |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
| static thread_local std::vector<int32_t> b_code_points; | ||
|
|
||
| // ADMIN REQUIREMENT: Renaming or adding clear comments for these two DP arrays | ||
| static thread_local std::vector<int32_t> prev_row; // Previous row distances (i-1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why make it static thread_local? it's wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
为什么要这么做
static thread_local?这是错误的。
im sorry.i want to reuse the vector , reducing repeated heap allocation
| static thread_local std::vector<int32_t> curr_row; // Current row distances (i) | ||
|
|
||
| // 注意:std::string_view 使用 .data() 和 .size() 方法 | ||
| to_utf32(l.data(), l.size(), a_code_points); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see other funcitons how deal with utf8. like SubstringUtil.
| size_t n = source.size(); | ||
| size_t m = target.size(); | ||
|
|
||
| if (n == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add unlikely here
2720540 to
f5b07c9
Compare
f5b07c9 to
3f4df34
Compare
3f4df34 to
f9e8317
Compare
zclllyybb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still need deeper understand and modifications
| factory.register_function<FunctionMakeSet>(); | ||
| factory.register_function<FunctionExportSet>(); | ||
| factory.register_function<FunctionUnicodeNormalize>(); | ||
| factory.register_function<FunctionStringLevenshtein>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
duplicated
| } | ||
| }; | ||
|
|
||
| <<<<<<< HEAD |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
git conflict
| auto& res_data = col_res->get_data(); | ||
|
|
||
| for (size_t i = 0; i < input_rows_count; ++i) { | ||
| auto l_view = col_left_str->get_data_at(i).to_string_view(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why need convert to string_view? seems keep StringRef is ok
| for (size_t i = 0; i < input_rows_count; ++i) { | ||
| auto l_view = col_left_str->get_data_at(i).to_string_view(); | ||
| auto r_view = col_right_str->get_data_at(i).to_string_view(); | ||
| LevenshteinImpl::execute(l_view, r_view, res_data[i]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FunctionStringLevenshtein could totally be replaced by other existing base template. or, if you insist to write all by yourself, it's also ok. but then don't need to split to LevenshteinImpl. just make execute a member function.
| size_t m = target.size(); | ||
|
|
||
| // DP arrays: prev_dist (previous row), curr_dist (current row) | ||
| std::vector<int32_t> prev_dist(m + 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use DorisVector to track memory alloc.
| } | ||
| if (UNLIKELY(r.empty())) { | ||
| std::vector<int32_t> tmp; | ||
| to_utf32(l.data(), l.size(), tmp); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- directly convert to utf8 will lead to performance loss. check it firstly.
- your implementation seems weird.
lookSubReplaceImpland learn how to process it.
| const auto& source = l_code_points; | ||
| const auto& target = r_code_points; | ||
|
|
||
| size_t n = source.size(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make their name more meaningful
| std::vector<int32_t> curr_dist(m + 1); | ||
|
|
||
| for (size_t j = 0; j <= m; ++j) { | ||
| prev_dist[j] = static_cast<int32_t>(j); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we make j size_t and always cast, it seems weird. you should think clear for why and how we do cast here.
What changes were made in this pull request?
Implement the Levenshtein distance function (
levenshtein(string_a, string_b)) in the Vectorized Engine.This function computes the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.
Why are these changes needed?
To provide users with a common string function used for fuzzy string matching, data cleaning, and calculating string similarity.
Detailed implementation details
LevenshteinImplinfunction_string.h.std::string_viewas the input type forexecuteto align with modern Doris versions.FunctionBinaryToTypealias (FunctionStringLevenshtein) along with theLevenshteinWrapperadapter infunction_string.cpp.test_string_function_levenshtein.groovy) covering standard, boundary (empty string), NULL, and critical UTF-8 (Chinese characters) scenarios.Note for Reviewers
INT.std::vector<int32_t>for DP row storage and usesthread_localoptimization for memory reuse.