Replies: 3 comments 4 replies
-
Bumping this -- Is this something the DuckDB team would be interested in? |
Beta Was this translation helpful? Give feedback.
-
In the meantime, can you just do That's so trivial that it's questionable whether it's worth a new function. If, on the other hand, there was an approximate algorithm to compute Levenstein distances with a bounded relative error in a reduced computational complexity class, that would be a much more interesting thing to provide, in my opinion. |
Beta Was this translation helpful? Give feedback.
-
In Python I had been using RapidFuzz - which is a wrapper around its C++ implementation. https://github.com/rapidfuzz/rapidfuzz-cpp/blob/main/rapidfuzz/distance/Levenshtein.hpp I have little knowledge on the subject so I may be wrong, but I think this performs the optimization you're talking about? Perhaps it could serve as inspiration (or maybe a duckdb rapidfuzz extension could be feasible?) |
Beta Was this translation helpful? Give feedback.
-
Why do you want this feature?
String distances such as the Levenshtein distance are fairly computationally expensive. For many real-world applications, it isn't necessary to compute them exactly. We may care what the precise Levenshtein distances are for very similar strings but any string with distance > 5 is different enough that we don't care -- it's just completely different.
My understanding is that Levenshtein distance can be significantly faster (especially for long strings, and when the "don't care" cutoff is low) if you use the cutoff in the Levenshtein algorithm itself. See for example this StackOverflow answer.
My thinking is that the Levenshtein function could have an optional argument that specifies the cutoff, with the default remaining no cutoff.
Is this something the DuckDB team would be interested in? I could probably contribute a PR.
Beta Was this translation helpful? Give feedback.
All reactions