Feature Request: Levenshtein distance with bounds #10345

zmbc · 2024-01-25T20:03:45Z

zmbc
Jan 25, 2024

Why do you want this feature?

String distances such as the Levenshtein distance are fairly expensive to compute: the standard dynamic-programming algorithm takes time proportional to the product of the two string lengths. For many real-world applications, it isn't necessary to compute them exactly. We may care about the precise Levenshtein distance for very similar strings, but any pair with distance > 5 (say) is different enough that the exact value doesn't matter -- the strings are just completely different.

My understanding is that computing the Levenshtein distance can be significantly faster (especially for long strings, and when the "don't care" cutoff is low) if the cutoff is used inside the algorithm itself, rather than applied to the result afterwards. See for example this StackOverflow answer.

My thinking is that the Levenshtein function could have an optional argument that specifies the cutoff, with the default remaining no cutoff.
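To make the proposal concrete, here is a minimal Python sketch of a distance function with an optional cutoff. It is an illustration only, not DuckDB's implementation: the name `bounded_levenshtein` and the convention of returning `cutoff + 1` for "more than cutoff" are assumptions made for this example. It implements only the early-exit part of the idea (stop once every cell in the current row exceeds the cutoff); a fuller version could also restrict the computation to a diagonal band.

```python
from typing import Optional


def bounded_levenshtein(a: str, b: str, cutoff: Optional[int] = None) -> int:
    """Levenshtein distance with an optional cutoff (hypothetical helper).

    With cutoff=None this is the plain two-row dynamic program.
    With a cutoff, any distance above it is reported as cutoff + 1.
    """
    # The length difference is a lower bound on the distance.
    if cutoff is not None and abs(len(a) - len(b)) > cutoff:
        return cutoff + 1
    if len(a) < len(b):
        a, b = b, a  # keep the DP rows as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        # Row minima never decrease from one row to the next, so once the
        # whole row exceeds the cutoff the final distance must exceed it too.
        if cutoff is not None and min(curr) > cutoff:
            return cutoff + 1
        prev = curr
    dist = prev[-1]
    if cutoff is not None and dist > cutoff:
        return cutoff + 1
    return dist
```

Called without a cutoff, the function behaves like an ordinary Levenshtein distance, which matches the suggestion that the default should remain "no cutoff".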

Is this something the DuckDB team would be interested in? I could probably contribute a PR.

zmbc · 2024-04-10T20:22:27Z

zmbc
Apr 10, 2024
Author

Bumping this -- is this something the DuckDB team would be interested in?

0 replies

soerenwolfers · 2024-04-12T14:45:42Z

soerenwolfers
Apr 12, 2024

In the meantime, can you just do CASE WHEN abs(len(a) - len(b)) > 5 THEN 'Infinity' ELSE levenshtein(a, b) END?

That's so trivial that it's questionable whether it's worth a new function. If, on the other hand, there were an approximate algorithm to compute Levenshtein distances with a bounded relative error in a reduced computational complexity class, that would be a much more interesting thing to provide, in my opinion.
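For reference, the length-check workaround can be sketched in Python as follows. The helper name `levenshtein_or_inf` and the default cutoff of 5 are assumptions for illustration. The check is valid because the difference in lengths is a lower bound on the Levenshtein distance, but it only skips work when the lengths differ by more than the cutoff.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Plain two-row dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def levenshtein_or_inf(a: str, b: str, cutoff: int = 5) -> float:
    """Hypothetical helper: skip the DP when the lengths alone exceed the cutoff."""
    if abs(len(a) - len(b)) > cutoff:
        return math.inf
    return levenshtein(a, b)
```

Two strings of equal length always pass this filter, however different their contents, so the full computation still runs for them.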

2 replies

zmbc Apr 12, 2024
Author

No, the available optimization goes well beyond the length check you suggest. For example, "abcdefghijkl" and "xxxdfghijxkl" have identical lengths and Levenshtein distance 5, but you need only look at the first 3 characters of the second string to tell that the Levenshtein distance has to be at least 3: "x" never occurs in the first string, so each of those characters costs at least one edit.
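One way to see this lower bound is to track the smallest cell in each row of the usual dynamic-programming table. The sketch below is illustrative only (the function name `row_minima` is made up for this example): it processes the second string one character at a time and records the row minimum, which never decreases and is therefore a lower bound on the final distance.

```python
from typing import List


def row_minima(a: str, b: str) -> List[int]:
    """Smallest DP cell after each character of b has been processed.

    Entry k is a lower bound on the Levenshtein distance between a and b
    that is already known after reading only the first k + 1 characters of b.
    """
    prev = list(range(len(a) + 1))
    minima = []
    for i, cb in enumerate(b, start=1):
        curr = [i]
        for j, ca in enumerate(a, start=1):
            curr.append(min(
                prev[j] + 1,               # cb has no counterpart in a
                curr[j - 1] + 1,           # ca has no counterpart in b
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        minima.append(min(curr))
        prev = curr
    return minima


bounds = row_minima("abcdefghijkl", "xxxdfghijxkl")
print(bounds[:3])  # lower bounds after 1, 2 and 3 characters
```

For this pair the first three entries are 1, 2 and 3, so with a cutoff of 2 the computation could stop after the third character instead of filling in all twelve rows.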

I'm surprised not to be able to immediately find a formal write-up of what seems to me like a very intuitive optimization; perhaps the bounded Levenshtein is not very relevant outside of a few applications. The one I have in mind is record linkage, where it is extremely relevant.

Would it help for me to write a Python and/or pseudocode implementation to formalize this?

soerenwolfers Apr 12, 2024

Ah, sorry, the linked Stack Overflow answer didn't make that as clear. I think adding static element-wise functions is probably the easiest thing you could change about DuckDB. Why not make a PR that follows the design of the original distance function at

duckdb/src/core_functions/scalar/string/levenshtein.cpp

Line 15 in c5e173a

static idx_t LevenshteinDistance(const string_t &txt, const string_t &tgt) {

I'm sure if you have remaining questions (like how to wire up the extra integer argument) the maintainers would be happy to help get your PR over the line.

cmdlineluser · 2024-04-12T16:51:04Z

cmdlineluser
Apr 12, 2024

In Python I had been using RapidFuzz - which is a wrapper around its C++ implementation.

https://github.com/rapidfuzz/rapidfuzz-cpp/blob/main/rapidfuzz/distance/Levenshtein.hpp

I have little knowledge of the subject so I may be wrong, but I think this performs the optimization you're talking about?

Perhaps it could serve as inspiration (or maybe a duckdb rapidfuzz extension could be feasible?)

2 replies

zmbc Apr 12, 2024
Author

Yes, I had forgotten about rapidfuzz -- it does this optimization and a bunch more. It would be really interesting to use this in DuckDB!

zmbc Apr 12, 2024
Author

On a bit more poking, DuckDB already uses (an older version of) rapidfuzz for the Jaro and Jaro-Winkler distance functions, and these implementations already support a score cutoff with accompanying optimization. The only thing to do for those is allow the user to pass it. I'll start there.
