@kzh1458003655-web

What changes were made in this pull request?

Implement the Levenshtein distance function (levenshtein(string_a, string_b)) in the Vectorized Engine.

This function computes the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other.

Why are these changes needed?

To provide users with a common string function used for fuzzy string matching, data cleaning, and calculating string similarity.

Implementation details

  1. UTF-8 Support: The implementation converts the input strings (which may contain multi-byte UTF-8 characters) into Unicode Code Points (UTF-32 integers) before performing the dynamic programming (DP) calculation. This ensures that the distance is calculated based on the number of characters, not the number of bytes.
  2. Vectorized Integration:
    • Implemented the logic within LevenshteinImpl in function_string.h.
    • Used std::string_view as the input type for execute to align with modern Doris versions.
    • Registered the function using the FunctionBinaryToType alias (FunctionStringLevenshtein) along with the LevenshteinWrapper adapter in function_string.cpp.
  3. Tests: Added a comprehensive regression test case (test_string_function_levenshtein.groovy) covering standard, boundary (empty string), NULL, and critical UTF-8 (Chinese characters) scenarios.
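
The approach described in item 1 can be sketched as follows. This is a minimal, standalone illustration and not the code from this PR: `decode_utf8` and `levenshtein_code_points` are hypothetical names, the decoder assumes mostly well-formed UTF-8, and plain `std::vector` is used instead of any Doris container.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical helper: decode UTF-8 bytes into Unicode code points.
// Assumes well-formed input; an unexpected or truncated lead byte is
// passed through as a single code point instead of being rejected.
inline std::vector<char32_t> decode_utf8(std::string_view s) {
    std::vector<char32_t> out;
    out.reserve(s.size());
    for (std::size_t i = 0; i < s.size();) {
        const auto b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 1;
        char32_t cp = b0;
        if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        if (i + len > s.size()) { len = 1; cp = b0; }  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Two-row dynamic programming over code points, so one multi-byte
// character counts as one edit.
inline std::int32_t levenshtein_code_points(std::string_view left,
                                            std::string_view right) {
    const auto a = decode_utf8(left);
    const auto b = decode_utf8(right);
    std::vector<std::int32_t> prev(b.size() + 1);
    std::vector<std::int32_t> curr(b.size() + 1);
    std::iota(prev.begin(), prev.end(), 0);  // distance from "" to b[0..j)
    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = static_cast<std::int32_t>(i);  // distance from a[0..i) to ""
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::int32_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            curr[j] = std::min({prev[j] + 1,          // deletion
                                curr[j - 1] + 1,      // insertion
                                prev[j - 1] + cost}); // substitution or match
        }
        std::swap(prev, curr);
    }
    return prev[b.size()];
}
```

With this sketch, two strings that differ in one three-byte Chinese character are at distance 1, whereas a byte-wise comparison could report up to 3.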

Note for Reviewers

  • The function returns the distance as INT.
  • The implementation leverages std::vector<int32_t> for DP row storage and uses thread_local optimization for memory reuse.

@Thearas
Contributor

Thearas commented Dec 13, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

"""

// 3. Column vs Column Test
qt_select_col_col "SELECT id, levenshtein(s1, s2) FROM ${tableName} ORDER BY id"
Contributor

This doesn't match your test case.

struct LevenshteinImpl {
static constexpr auto name = "levenshtein";

// 必需的类型定义
Contributor

Don't use Chinese in code comments.

static thread_local std::vector<int32_t> a_code_points;
static thread_local std::vector<int32_t> b_code_points;

// ADMIN REQUIREMENT: Renaming or adding clear comments for these two DP arrays
Contributor

?

static thread_local std::vector<int32_t> b_code_points;

// ADMIN REQUIREMENT: Renaming or adding clear comments for these two DP arrays
static thread_local std::vector<int32_t> prev_row; // Previous row distances (i-1)
Contributor

Why make it static thread_local? It's wrong.

Author

"Why make it static thread_local? It's wrong."

I'm sorry. I wanted to reuse the vectors to reduce repeated heap allocation.
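
The author's goal of avoiding a fresh heap allocation per row can also be met without `static thread_local` state. The sketch below shows one option under stated assumptions: `LevenshteinScratch` and `levenshtein_batch` are hypothetical names, the distance is byte-wise for brevity, and the batch loop only stands in for the per-row loop of a vectorized function. The review does not say which alternative the reviewer had in mind.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical scratch object: created once per batch and reused for
// every row, so its lifetime is tied to the call, not to the thread.
class LevenshteinScratch {
public:
    // Byte-wise distance, kept simple on purpose (no UTF-8 handling).
    std::int32_t distance(std::string_view a, std::string_view b) {
        prev_.resize(b.size() + 1);  // keeps existing capacity when it suffices
        curr_.resize(b.size() + 1);
        std::iota(prev_.begin(), prev_.end(), 0);
        for (std::size_t i = 1; i <= a.size(); ++i) {
            curr_[0] = static_cast<std::int32_t>(i);
            for (std::size_t j = 1; j <= b.size(); ++j) {
                const std::int32_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                curr_[j] = std::min({prev_[j] + 1, curr_[j - 1] + 1,
                                     prev_[j - 1] + cost});
            }
            std::swap(prev_, curr_);
        }
        return prev_[b.size()];
    }

private:
    std::vector<std::int32_t> prev_;  // distances for row i - 1
    std::vector<std::int32_t> curr_;  // distances for row i
};

// Hypothetical batch loop: one scratch object serves all rows.
inline std::vector<std::int32_t> levenshtein_batch(
        const std::vector<std::pair<std::string_view, std::string_view>>& rows) {
    LevenshteinScratch scratch;
    std::vector<std::int32_t> result;
    result.reserve(rows.size());
    for (const auto& [left, right] : rows) {
        result.push_back(scratch.distance(left, right));
    }
    return result;
}
```

The buffers grow to the longest string seen in the batch and are released when the batch ends, so no memory stays attached to the thread after the call returns.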

static thread_local std::vector<int32_t> curr_row; // Current row distances (i)

// Note: std::string_view uses the .data() and .size() methods
to_utf32(l.data(), l.size(), a_code_points);
Contributor

See how other functions deal with UTF-8, e.g. SubstringUtil.

size_t n = source.size();
size_t m = target.size();

if (n == 0) {
Contributor

add unlikely here
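
A branch hint on the empty-input check could look like the sketch below. `SKETCH_UNLIKELY` is a local stand-in defined here for illustration and `trivial_distance` is a hypothetical helper; the project's own `UNLIKELY` macro would be used in the real code. The byte length is used as the distance, which is only correct for ASCII input.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Local stand-in for a branch-prediction hint macro (assumption: GCC or
// Clang; other compilers fall back to the plain condition).
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define SKETCH_UNLIKELY(x) (x)
#endif

// Hypothetical helper: handles the rare case where one side is empty.
// Returns true and sets `result` if the distance is trivially known.
// Uses byte length, so it is only correct for ASCII input.
inline bool trivial_distance(std::string_view a, std::string_view b,
                             std::int32_t& result) {
    if (SKETCH_UNLIKELY(a.empty())) {
        result = static_cast<std::int32_t>(b.size());
        return true;
    }
    if (SKETCH_UNLIKELY(b.empty())) {
        result = static_cast<std::int32_t>(a.size());
        return true;
    }
    return false;  // caller falls through to the full DP
}
```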

@zclllyybb self-assigned this Dec 13, 2025
Contributor

@zclllyybb left a comment

This still needs deeper understanding and further modifications.

factory.register_function<FunctionMakeSet>();
factory.register_function<FunctionExportSet>();
factory.register_function<FunctionUnicodeNormalize>();
factory.register_function<FunctionStringLevenshtein>();
Contributor

duplicated

}
};

<<<<<<< HEAD
Contributor

Unresolved git conflict marker.

auto& res_data = col_res->get_data();

for (size_t i = 0; i < input_rows_count; ++i) {
auto l_view = col_left_str->get_data_at(i).to_string_view();
Contributor

Why convert to string_view? Keeping StringRef seems fine.

for (size_t i = 0; i < input_rows_count; ++i) {
auto l_view = col_left_str->get_data_at(i).to_string_view();
auto r_view = col_right_str->get_data_at(i).to_string_view();
LevenshteinImpl::execute(l_view, r_view, res_data[i]);
Contributor

FunctionStringLevenshtein could be replaced entirely by an existing base template. Or, if you insist on writing it all yourself, that's also OK, but then there is no need to split out LevenshteinImpl; just make execute a member function.

size_t m = target.size();

// DP arrays: prev_dist (previous row), curr_dist (current row)
std::vector<int32_t> prev_dist(m + 1);
Contributor

Use DorisVector so that the memory allocation is tracked.

}
if (UNLIKELY(r.empty())) {
std::vector<int32_t> tmp;
to_utf32(l.data(), l.size(), tmp);
Contributor

  1. Directly converting the UTF-8 input will lead to performance loss. Check this first.
  2. Your implementation seems odd.
    Look at SubReplaceImpl to see how this is handled.
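
One possible reading of "check this first" is to test whether the inputs are pure ASCII before doing any conversion, and to count code points directly when only the length is needed (as in the empty-string branch quoted above). This interpretation is an assumption, and the helper names below are hypothetical; they are not taken from SubReplaceImpl or any other Doris code.

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>

// True when every byte is below 0x80, i.e. the string is pure ASCII and
// each byte is exactly one character.
inline bool is_all_ascii(std::string_view s) {
    return std::all_of(s.begin(), s.end(), [](char c) {
        return static_cast<unsigned char>(c) < 0x80;
    });
}

// Hypothetical fast-path test: when both inputs are ASCII, the DP can run
// on the raw bytes and no code point buffer is needed.
inline bool can_compare_bytewise(std::string_view left, std::string_view right) {
    return is_all_ascii(left) && is_all_ascii(right);
}

// Counts code points without materializing them: every byte that is not a
// UTF-8 continuation byte (10xxxxxx) starts a new code point.
// Assumes well-formed UTF-8.
inline std::size_t count_code_points(std::string_view s) {
    return static_cast<std::size_t>(
            std::count_if(s.begin(), s.end(), [](char c) {
                return (static_cast<unsigned char>(c) & 0xC0) != 0x80;
            }));
}
```

When one argument is empty, the distance equals `count_code_points` of the other argument, so no temporary vector is required for that case.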

const auto& source = l_code_points;
const auto& target = r_code_points;

size_t n = source.size();
Contributor

Make their names more meaningful.

std::vector<int32_t> curr_dist(m + 1);

for (size_t j = 0; j <= m; ++j) {
prev_dist[j] = static_cast<int32_t>(j);
Contributor

We declare j as size_t and then always cast it, which seems odd. Think clearly about why and how the cast is done here.
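
One way to avoid a cast on every loop iteration is to validate the length once and let `std::iota` fill the row in the element type. The sketch below is only an illustration; `initial_row` is a hypothetical helper, and throwing `std::length_error` is an assumed error policy, not the project's.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <numeric>
#include <stdexcept>
#include <vector>

// Hypothetical helper: builds the first DP row 0, 1, ..., m as int32_t.
inline std::vector<std::int32_t> initial_row(std::size_t m) {
    // Single explicit range check: the largest stored value is m, so m
    // must fit in int32_t. After this check, no narrowing can overflow.
    if (m > static_cast<std::size_t>(std::numeric_limits<std::int32_t>::max())) {
        throw std::length_error("input too long for an int32_t distance");
    }
    std::vector<std::int32_t> row(m + 1);
    // iota increments an int32_t counter, so the loop index never needs
    // to be converted from size_t.
    std::iota(row.begin(), row.end(), std::int32_t{0});
    return row;
}
```

The index type (`size_t`) stays confined to container sizes, and the value type (`int32_t`) matches the INT result the function returns.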
