Feat levenshtein final #59016

kzh1458003655-web · 2025-12-13T14:56:27Z

What changes were made in this pull request?

Implement the Levenshtein distance function (levenshtein(string_a, string_b)) in the Vectorized Engine.

This function computes the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.

Why are these changes needed?

To provide users with a common string function used for fuzzy string matching, data cleaning, and calculating string similarity.

Implementation details

UTF-8 Support: The implementation converts the input strings (which may contain multi-byte UTF-8 characters) into Unicode Code Points (UTF-32 integers) before performing the dynamic programming (DP) calculation. This ensures that the distance is calculated based on the number of characters, not the number of bytes.
Vectorized Integration:
- Implemented the logic within LevenshteinImpl in function_string.h.
- Used std::string_view as the input type for execute to align with modern Doris versions.
- Registered the function using the FunctionBinaryToType alias (FunctionStringLevenshtein) along with the LevenshteinWrapper adapter in function_string.cpp.
Tests: Added a comprehensive regression test case (test_string_function_levenshtein.groovy) covering standard, boundary (empty string), NULL, and critical UTF-8 (Chinese characters) scenarios.
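
The code-point approach described under "UTF-8 Support" can be sketched roughly as follows. This is a minimal standalone illustration, not the code in this PR: decode_utf8 and levenshtein_utf8 are hypothetical names, the decoder assumes mostly well-formed input, and it uses plain std::vector rather than any Doris container.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical helper: decode UTF-8 bytes into Unicode code points.
// Stray or truncated bytes are passed through as one unit each.
std::vector<char32_t> decode_utf8(std::string_view s) {
    std::vector<char32_t> out;
    out.reserve(s.size());
    for (std::size_t i = 0; i < s.size();) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len = 1;
        char32_t cp = c;
        if (c >= 0xF0) { len = 4; cp = c & 0x07; }
        else if (c >= 0xE0) { len = 3; cp = c & 0x0F; }
        else if (c >= 0xC0) { len = 2; cp = c & 0x1F; }
        if (i + len > s.size()) { len = 1; cp = c; }  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Two-row dynamic programming over code points, so a multi-byte
// character counts as a single edit.
std::int32_t levenshtein_utf8(std::string_view a, std::string_view b) {
    const std::vector<char32_t> src = decode_utf8(a);
    const std::vector<char32_t> dst = decode_utf8(b);
    std::vector<std::int32_t> prev(dst.size() + 1);
    std::vector<std::int32_t> curr(dst.size() + 1);
    for (std::size_t j = 0; j <= dst.size(); ++j) {
        prev[j] = static_cast<std::int32_t>(j);  // distance from empty prefix
    }
    for (std::size_t i = 1; i <= src.size(); ++i) {
        curr[0] = static_cast<std::int32_t>(i);
        for (std::size_t j = 1; j <= dst.size(); ++j) {
            const std::int32_t cost = (src[i - 1] == dst[j - 1]) ? 0 : 1;
            curr[j] = std::min({prev[j] + 1,          // deletion
                                curr[j - 1] + 1,      // insertion
                                prev[j - 1] + cost}); // substitution
        }
        std::swap(prev, curr);
    }
    return prev[dst.size()];
}
```

With this sketch, two strings that differ in one Chinese character are at distance 1, whereas a byte-wise comparison of the same inputs would report 3.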

Note for Reviewers

The function returns the distance as INT.
The implementation leverages std::vector<int32_t> for DP row storage and uses thread_local optimization for memory reuse.

Thearas · 2025-12-13T14:56:32Z

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

What problem was fixed (it's best to include specific error reporting information). How it was fixed.
Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
What features were added. Why was this function added?
Which code was refactored and why was this part of the code refactored?
Which functions were optimized and what is the difference before and after the optimization?

zclllyybb · 2025-12-13T14:58:31Z

...-test/suites/query_p0/sql_functions/string_functions/test_string_function_levenshtein.groovy

+    """
+
+    // 3. Column vs Column Test
+    qt_select_col_col "SELECT id, levenshtein(s1, s2) FROM ${tableName} ORDER BY id"


zclllyybb · 2025-12-13T14:59:30Z

...ssion-test/data/query_p0/sql_functions/string_functions/test_string_function_levenshtein.out

This doesn't match your test case.

zclllyybb · 2025-12-13T15:01:18Z

be/src/vec/functions/function_string.h

+struct LevenshteinImpl {
+    static constexpr auto name = "levenshtein";
+
+    // 必需的类型定义


Don't use Chinese in code comments (the comment above means "required type definitions").

zclllyybb · 2025-12-13T15:02:55Z

be/src/vec/functions/function_string.h

+        static thread_local std::vector<int32_t> a_code_points;
+        static thread_local std::vector<int32_t> b_code_points;
+
+        // ADMIN REQUIREMENT: Renaming or adding clear comments for these two DP arrays


zclllyybb · 2025-12-13T15:03:35Z

be/src/vec/functions/function_string.h

+        static thread_local std::vector<int32_t> b_code_points;
+
+        // ADMIN REQUIREMENT: Renaming or adding clear comments for these two DP arrays
+        static thread_local std::vector<int32_t> prev_row; // Previous row distances (i-1)


Why make it static thread_local? It's wrong.

I'm sorry. I wanted to reuse the vectors to reduce repeated heap allocation.
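
One possible way to keep the allocation reuse without static thread_local state is to let the caller own the scratch buffers for the duration of one batch. The sketch below only illustrates that idea and is not necessarily what the reviewer has in mind: DpScratch, levenshtein_bytes, and levenshtein_batch are hypothetical names, and it works on raw bytes to stay short.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical scratch space owned by the caller for one batch of rows.
struct DpScratch {
    std::vector<std::int32_t> prev_row;  // distances for row i-1
    std::vector<std::int32_t> curr_row;  // distances for row i
};

// Byte-wise distance that reuses the capacity already held by scratch.
std::int32_t levenshtein_bytes(std::string_view a, std::string_view b, DpScratch& scratch) {
    std::vector<std::int32_t>& prev = scratch.prev_row;
    std::vector<std::int32_t>& curr = scratch.curr_row;
    prev.resize(b.size() + 1);  // no reallocation once capacity is large enough
    curr.resize(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) {
        prev[j] = static_cast<std::int32_t>(j);
    }
    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = static_cast<std::int32_t>(i);
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::int32_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            curr[j] = std::min({prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost});
        }
        std::swap(prev, curr);
    }
    return prev[b.size()];
}

// The scratch object lives exactly as long as the batch, so nothing
// outlives the call and no state is shared between threads.
std::vector<std::int32_t> levenshtein_batch(const std::vector<std::string_view>& left,
                                            const std::vector<std::string_view>& right) {
    DpScratch scratch;
    std::vector<std::int32_t> result;
    const std::size_t rows = std::min(left.size(), right.size());
    result.reserve(rows);
    for (std::size_t i = 0; i < rows; ++i) {
        result.push_back(levenshtein_bytes(left[i], right[i], scratch));
    }
    return result;
}
```

The buffers are still allocated only a handful of times per batch, but their lifetime is explicit and tied to the call.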

zclllyybb · 2025-12-13T15:04:30Z

be/src/vec/functions/function_string.h

+        static thread_local std::vector<int32_t> curr_row; // Current row distances (i)
+
+        // Note: std::string_view uses the .data() and .size() methods
+        to_utf32(l.data(), l.size(), a_code_points);


See how other functions deal with UTF-8, such as SubstringUtil.

zclllyybb · 2025-12-13T15:04:46Z

be/src/vec/functions/function_string.h

+        size_t n = source.size();
+        size_t m = target.size();
+
+        if (n == 0) {


Add UNLIKELY here.


zclllyybb

This still needs deeper understanding and further modifications.

zclllyybb · 2025-12-16T14:56:34Z

be/src/vec/functions/function_string.cpp

    factory.register_function<FunctionMakeSet>();
    factory.register_function<FunctionExportSet>();
    factory.register_function<FunctionUnicodeNormalize>();
+    factory.register_function<FunctionStringLevenshtein>();


zclllyybb · 2025-12-16T14:56:59Z

be/src/vec/functions/function_string.h

    }
 };

+<<<<<<< HEAD


Git conflict marker left in the code.

zclllyybb · 2025-12-16T14:58:38Z

be/src/vec/functions/function_string.h

+        auto& res_data = col_res->get_data();
+
+        for (size_t i = 0; i < input_rows_count; ++i) {
+            auto l_view = col_left_str->get_data_at(i).to_string_view();


Why convert to string_view? Keeping StringRef seems fine.

zclllyybb · 2025-12-16T15:00:41Z

be/src/vec/functions/function_string.h

+        for (size_t i = 0; i < input_rows_count; ++i) {
+            auto l_view = col_left_str->get_data_at(i).to_string_view();
+            auto r_view = col_right_str->get_data_at(i).to_string_view();
+            LevenshteinImpl::execute(l_view, r_view, res_data[i]);


FunctionStringLevenshtein could be replaced entirely by an existing base template. Or, if you insist on writing it all yourself, that's also OK, but then there is no need to split out LevenshteinImpl; just make execute a member function.

zclllyybb · 2025-12-16T15:01:30Z

be/src/vec/functions/function_string.h

+        size_t m = target.size();
+
+        // DP arrays: prev_dist (previous row), curr_dist (current row)
+        std::vector<int32_t> prev_dist(m + 1);


Use DorisVector so that the memory allocation is tracked.

zclllyybb · 2025-12-16T15:08:20Z

be/src/vec/functions/function_string.h

+        }
+        if (UNLIKELY(r.empty())) {
+            std::vector<int32_t> tmp;
+            to_utf32(l.data(), l.size(), tmp);


Converting the UTF-8 input directly will lead to a performance loss. Check first whether the conversion is needed.

Your implementation seems odd.
Look at SubReplaceImpl to learn how to handle this.
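
One possible reading of this comment is that the input should be inspected before any conversion is attempted. The sketch below shows two such checks under that assumption; is_ascii, utf8_char_count, and distance_to_empty are hypothetical names, not Doris functions, and the byte loops are written for clarity rather than speed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// True when every byte is below 0x80, in which case bytes and characters
// coincide and no decoding step is needed.
bool is_ascii(std::string_view s) {
    return std::all_of(s.begin(), s.end(),
                       [](char c) { return static_cast<unsigned char>(c) < 0x80; });
}

// Count characters without materializing code points: every byte that is
// not a continuation byte (10xxxxxx) starts a new character.
std::size_t utf8_char_count(std::string_view s) {
    std::size_t count = 0;
    for (const char c : s) {
        if ((static_cast<unsigned char>(c) & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// When the other argument is empty, the distance is just the character
// count, so no temporary code-point vector is required.
std::int32_t distance_to_empty(std::string_view s) {
    return static_cast<std::int32_t>(utf8_char_count(s));
}
```

A caller could use is_ascii on both arguments to run the dynamic programming directly on bytes, and fall back to code-point handling only when a non-ASCII byte is present.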

zclllyybb · 2025-12-16T15:09:17Z

be/src/vec/functions/function_string.h

+        const auto& source = l_code_points;
+        const auto& target = r_code_points;
+
+        size_t n = source.size();


Give them more meaningful names.

zclllyybb · 2025-12-16T15:10:14Z

be/src/vec/functions/function_string.h

+        std::vector<int32_t> curr_dist(m + 1);
+
+        for (size_t j = 0; j <= m; ++j) {
+            prev_dist[j] = static_cast<int32_t>(j);


We declare j as size_t and then cast it every time, which seems odd. Think through why and how the cast is done here.
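
As one illustration of the point about casts, the narrowing from size_t to the INT result type can be done once, with a range check, and the first DP row can be filled without a per-element cast. This is only a sketch of that idea; checked_length and first_dp_row are hypothetical helpers, not part of this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <numeric>
#include <optional>
#include <vector>

// Narrow a length to int32_t in one place, or report that it does not fit.
std::optional<std::int32_t> checked_length(std::size_t n) {
    if (n > static_cast<std::size_t>(std::numeric_limits<std::int32_t>::max())) {
        return std::nullopt;
    }
    return static_cast<std::int32_t>(n);
}

// Build the first DP row 0, 1, ..., target_len. std::iota writes the
// increasing values directly, so the loop body needs no cast.
std::vector<std::int32_t> first_dp_row(std::int32_t target_len) {
    std::vector<std::int32_t> row(static_cast<std::size_t>(target_len) + 1);
    std::iota(row.begin(), row.end(), 0);
    return row;
}
```

Keeping indexes as size_t for container access and converting lengths to the result type at a single checked boundary makes it clear where a value could overflow.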

kzh1458003655-web requested a review from zclllyybb as a code owner December 13, 2025 14:56

zclllyybb requested changes Dec 13, 2025


zclllyybb self-assigned this Dec 13, 2025

kzh1458003655-web force-pushed the feat-levenshtein-final branch from 2720540 to f5b07c9 Compare December 16, 2025 06:31

kzh1458003655-web added 6 commits December 16, 2025 14:37

[feature](function) Support levenshtein distance function

bde36e8

[feature](function) support levenshtein function

7f095f4

Fix test case: remove crashing line

a44d3e0

Fix final review issues and add license header

157c2d5

Format

cbe33a3

[Fix](function) support utf8 for levenshtein function and add regress…

7ad8317

…ion test

kzh1458003655-web force-pushed the feat-levenshtein-final branch from f5b07c9 to 3f4df34 Compare December 16, 2025 06:44

feat(vec): implement string function levenshtein with UTF-8 support

f9e8317

kzh1458003655-web force-pushed the feat-levenshtein-final branch from 3f4df34 to f9e8317 Compare December 16, 2025 06:59

zclllyybb requested changes Dec 16, 2025

