Add fuzzy String matching logic to StringUtils #20

bripkens · 2014-04-28T15:24:14Z

Fuzzy matching is quite common nowadays, with editors such as Sublime Text, TextMate, Atom and others implementing a fuzzy matching algorithm. This PR adds a new string similarity algorithm to StringUtils which behaves very similarly to the ones used in the aforementioned editors.

One issue worth mentioning is that, to the best of my knowledge, this algorithm is not documented in the form of a paper. For this reason, different variations of it may exist.
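For readers unfamiliar with this style of matching, here is a minimal sketch of one possible variant. It assumes a scoring rule of one point per matched query character plus two bonus points when a match directly follows the previous match in the term. The class name `FuzzyScoreSketch` and method name `fuzzyScore` are hypothetical, and the code in the PR may differ.

```java
import java.util.Locale;

class FuzzyScoreSketch {

    // Hypothetical helper for illustration only, not the PR's code.
    // Assumed rule: +1 per matched query character, +2 extra if the match
    // is directly adjacent to the previous match in the term.
    public static int fuzzyScore(String term, String query, Locale locale) {
        String t = term.toLowerCase(locale);
        String q = query.toLowerCase(locale);
        int score = 0;
        int termIndex = 0;
        int previousMatch = -2; // sentinel: -2 + 1 is never a valid index

        for (int i = 0; i < q.length(); i++) {
            char c = q.charAt(i);
            // Scan forward only; earlier term characters are never revisited.
            while (termIndex < t.length()) {
                int current = termIndex++;
                if (t.charAt(current) == c) {
                    score++;
                    if (previousMatch + 1 == current) {
                        score += 2; // consecutive-match bonus
                    }
                    previousMatch = current;
                    break;
                }
            }
        }
        return score;
    }

    public static void main(String[] args) {
        System.out.println(fuzzyScore("Workshop", "ws", Locale.ENGLISH)); // 2
        System.out.println(fuzzyScore("Workshop", "wo", Locale.ENGLISH)); // 4
    }
}
```

Under these assumptions a higher value means a better match, which is the opposite of an edit distance.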

britter · 2014-04-28T15:40:50Z

Hello bripkens,

I've created https://issues.apache.org/jira/browse/LANG-999 to track this PR.

britter · 2014-05-02T08:31:45Z

The PR looks nice so far. I'm not sure about the name. I've done a little research: http://en.wikipedia.org/wiki/Approximate_string_matching. The term fuzzy string matching refers to a class of string matching algorithms. So the Levenshtein distance which we have already implemented is also a fuzzy string matching algorithm. For this reason I don't think getFuzzyScore(String, String, Locale) is a good name. Maybe we can find a different name?

bripkens · 2014-05-03T12:50:59Z

I see the problem that you are referring to. What about getSublimeTextFuzzyScore? This name might help to differentiate between the various algorithms.

thiagoh · 2014-05-03T15:32:27Z

I think a good name would be getStringMatchingScore or getApproximateStringMatchingScore. What do you think?

bripkens · 2014-05-04T05:55:17Z

getStringMatchingScore sounds good. It also avoids a problem that a name such as getSublimeTextFuzzyScore would have, i.e. the implication that it is the one matching algorithm used by Sublime Text.

bripkens · 2014-05-04T06:02:40Z

PR updated. Method name changed to getStringMatchingScore.

britter · 2014-05-04T08:44:10Z

Hi guys,

Thanks for the suggestions. Here are my comments:

getSublimeTextFuzzyScore: I don't think we want the name of a product in our code base :-)
getStringMatchingScore or getApproximateStringMatchingScore: Both are too generic. Levenshtein Distance is also a string matching score.

I'd like to have a name that better communicates what the algorithm actually does. Besides that, I think it would be better to call it getXXXDistance because the other two methods (Levenshtein and Jaro-Winkler) also end with "Distance". Just a few ideas:

getCommonSubsequenceDistance
getMatchingSubsequenceDistance

I'm not sure whether the algorithm proposed here is just about matching subsequences, but from what I've seen it looks like a good name.

WDYT?

bripkens · 2014-05-04T15:36:12Z

Regarding 'distance': I always thought that distance implied that a higher score means less similar. Since this is not actually the case, I second your argument, Benedikt.

As for the term 'subsequence': I do not believe that this should be part of the name, as the algorithm's query parameter does not necessarily need to be a subsequence of the term (take a look at the asf example in the JavaDoc). So what about getQueryDistance?

I feel that getQueryDistance actually captures the intent of the algorithm quite well. It measures the distance of a query to a given term (still quite generic though).

thiagoh · 2014-05-04T18:47:13Z

I think the suffix Score better matches the real intention of the method, don't you think? As @bripkens said, "Regarding 'distance': I always thought that distance implied that a higher score means less similar", so I think Score would be a better choice.

What do you think about these?

getQueryMatchingScore
getMatchingSubsequenceScore

britter · 2014-05-05T17:26:34Z

Hi guys,

I think we're getting closer :-) getQuery(Matching)Distance/Score doesn't feel perfect. Those names simply don't communicate what the method does, which will lead to confusion.

I've done some more research in this field and I cannot believe that the algorithm implemented here doesn't have a name (look for example at http://en.wikipedia.org/wiki/Category:String_matching_algorithms), so for now I'd like to read some more and maybe come up with the algorithm's name.

Besides that, I've come to the conclusion that simply calling everything a "distance" is wrong. From http://en.wikipedia.org/wiki/Levenshtein_distance I've learned that this algorithm is called a distance because it measures the editing distance between two strings (in other words: how many editing steps do I have to make to get from one string to the other). So now I'd say that get<Whatever>Score would be a better name :-)
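To make the "editing steps" reading of distance concrete, here is a minimal sketch of the classic dynamic-programming edit distance, counting single-character insertions, deletions and substitutions. The class name `EditDistanceSketch` and method name `editDistance` are hypothetical; this is an illustration, not the Levenshtein implementation that already exists in StringUtils.

```java
class EditDistanceSketch {

    // Hypothetical helper for illustration only.
    // Returns the minimum number of single-character insertions,
    // deletions and substitutions needed to turn a into b.
    public static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Turning the empty prefix of a into the first j chars of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i chars of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1,   // insertion
                                 prev[j] + 1),      // deletion
                        prev[j - 1] + cost);        // substitution or match
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(editDistance("kitten", "sitting")); // 3
    }
}
```

Here a larger value means the strings are less similar, and identical strings yield 0.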

thiagoh · 2014-05-05T19:21:47Z

Yes, @britter, I do think we're getting closer. I've read some blog articles, and this one caught my attention: http://www.reddit.com/r/Python/comments/1gg4hk/sublime_text_ctrlp_like_fuzzy_matching_in_few/. It suggests the algorithm is an instance of the http://en.wikipedia.org/wiki/Longest_common_subsequence_problem

britter · 2014-05-05T19:45:23Z

Another good read may be http://www.sublimetext.com/forum/viewtopic.php?f=6&t=5661&p=29353&hilit=matching+algorithm#p29353

thiagoh · 2014-05-05T20:10:44Z

I've found a lot of links that call this kind of method getFuzzyScore, but @britter is correct when he says that it sounds too generic.
Despite that, I really think it is a cool feature and it would be really nice to have it in commons-lang.

What do you think about these?

getLCSScore
getLongestCommonSubsequenceScore

bripkens · 2014-05-06T06:01:59Z

> Besides that, I've come to the conclusion that simply calling everything a "distance" is wrong. From http://en.wikipedia.org/wiki/Levenshtein_distance I've learned that this algorithm is called a distance because it measures the editing distance between two strings

While true, the more formal term would be edit distance. Furthermore, this is not true for the Jaro-Winkler distance:

> It is a variant of the Jaro distance metric (Jaro, 1989, 1995), a type of string edit distance, and mainly[...]

but

> The higher the Jaro–Winkler distance for two strings is, the more similar the strings are[...]

Calling it distance therefore seems good to me if you specify distance as a similarity metric. Edit distance may then enforce that a higher score means less similar.

Regarding getLongestCommonSubsequenceScore: Is this algorithm really an implementation of the longest common subsequence problem? We should only call it that if it matches the definition.
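One way to check the question above is to compare against a textbook longest-common-subsequence length. The following is a minimal sketch; the class name `LcsSketch` and method name `lcsLength` are hypothetical. LCS length only counts how many characters can be matched in order, so it cannot tell adjacent matches from scattered ones. If the proposed scoring rewards consecutive matches, as Sublime-style matchers are commonly described as doing, then it is not a plain LCS length.

```java
class LcsSketch {

    // Hypothetical helper for illustration only.
    // Returns the length of the longest common subsequence of a and b.
    public static int lcsLength(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    // Extend the best subsequence of the two shorter prefixes.
                    dp[i][j] = dp[i - 1][j - 1] + 1;
                } else {
                    // Drop the last character of one side or the other.
                    dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
                }
            }
        }
        return dp[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // Both queries have the same LCS length against the term,
        // although "wo" matches adjacent characters and "ws" does not.
        System.out.println(lcsLength("workshop", "wo")); // 2
        System.out.println(lcsLength("workshop", "ws")); // 2
    }
}
```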

britter · 2014-05-06T17:39:30Z

Hi,

I know I'm backtracking here, but after this discussion I'm thinking that getFuzzyDistance may be the name. Currently the problem is that we have 190+ methods in StringUtils, so it may be a bit confusing to have some getXXXDistance methods here and there. But I want to create a StringMetrics class for 4.0, and then it will be much less confusing.

If fuzzy distance is the term which is used most often for this kind of algorithm, then let's call it that.

@bripkens one last update of the PR? ;)

bripkens · 2014-05-06T18:30:34Z

PR updated. Method is now called getFuzzyDistance.

bripkens · 2014-05-08T05:51:17Z

Awesome, thank you :-)

…loses #20 from github. Thanks to Ben Ripkens. git-svn-id: https://svn.apache.org/repos/asf/commons/proper/lang/trunk@1593112 13f79535-47bb-0310-9956-ffa450edef68

Add fuzzy String matching logic to StringUtils

43c53e9

asfgit closed this in 4c31706 May 7, 2014
