Add fuzzy String matching logic to StringUtils #20

Closed
wants to merge 1 commit into from

Conversation

bripkens

Fuzzy matching is quite common nowadays: editors such as Sublime Text, TextMate, Atom and others implement a fuzzy matching algorithm. This PR adds a new string similarity algorithm to StringUtils that behaves very similarly to the ones used in those editors.

One issue needs to be mentioned: to the best of my knowledge, this algorithm is not documented in the form of a paper. For this reason, different variations of the algorithm may exist.
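
For readers without the diff at hand, here is a minimal sketch of the kind of scoring such editors appear to use. It is one plausible variation and not necessarily identical to the code in this PR: the class name `FuzzyScoreSketch`, the method name `fuzzyScore`, and the exact point values (1 per matched character, 2 extra for a match directly after the previous one) are assumptions made for illustration.

```java
import java.util.Locale;

// Hypothetical sketch; names and point values are assumptions, not the PR's code.
final class FuzzyScoreSketch {

    /**
     * Scans the term left to right, matching the query characters in order.
     * Each matched character scores 1 point; a match that directly follows
     * the previous match in the term scores 2 bonus points.
     */
    static int fuzzyScore(String term, String query, Locale locale) {
        String t = term.toLowerCase(locale);
        String q = query.toLowerCase(locale);

        int score = 0;
        int termIndex = 0;       // the term is never rescanned from the start
        int previousMatch = -2;  // sentinel: -2 + 1 can never equal a valid index

        for (int i = 0; i < q.length(); i++) {
            char wanted = q.charAt(i);
            while (termIndex < t.length()) {
                int current = termIndex++;
                if (t.charAt(current) == wanted) {
                    score++;
                    if (previousMatch + 1 == current) {
                        score += 2; // consecutive match bonus
                    }
                    previousMatch = current;
                    break;
                }
            }
        }
        return score;
    }
}
```

Under these assumed rules, `fuzzyScore("Workshop", "wo", Locale.ENGLISH)` yields 4 (two matches plus one consecutive bonus), while `fuzzyScore("Workshop", "ws", Locale.ENGLISH)` yields 2, so a higher value means a closer match.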

@britter
Member

britter commented Apr 28, 2014

Hello bripkens,

I've created https://issues.apache.org/jira/browse/LANG-999 to track this PR.

@britter
Member

britter commented May 2, 2014

The PR looks nice so far. I'm not sure about the name. I've done a little research: http://en.wikipedia.org/wiki/Approximate_string_matching. The term fuzzy string matching refers to a class of string matching algorithms, so the Levenshtein distance, which we have already implemented, is also a fuzzy string matching algorithm. For this reason I don't think getFuzzyScore(String, String, Locale) is a good name. Maybe we can find a different name?

@bripkens
Author

bripkens commented May 3, 2014

I see the problem that you are referring to. What about getSublimeTextFuzzyScore? This name might help to differentiate between the various algorithms.

@thiagoh
Contributor

thiagoh commented May 3, 2014

I think a good name would be getStringMatchingScore or getApproximateStringMatchingScore. What do you think?

@bripkens
Author

bripkens commented May 4, 2014

getStringMatchingScore sounds good. It also avoids the problem that a name such as getSublimeTextFuzzyScore would have, i.e. the implication that it is the one text matching algorithm used by Sublime Text.

@bripkens
Author

bripkens commented May 4, 2014

PR updated. Method name changed to getStringMatchingScore.

@britter
Member

britter commented May 4, 2014

Hi guys,

Thanks for the suggestions. Here are my comments:

  • getSublimeTextFuzzyScore: I don't think we want the name of a product in our code base :-)
  • getStringMatchingScore or getApproximateStringMatchingScore: Both are too generic. Levenshtein Distance is also a string matching score.

I'd like to have a name that better communicates what the algorithm actually does. Besides that, I think it would be better to call it getXXXDistance because the other two methods (Levenshtein and Jaro-Winkler) also end with "Distance". Just a few ideas:

  • getCommonSubsequenceDistance
  • getMatchingSubsequenceDistance

I'm not sure whether the algorithm proposed here is just about matching subsequences, but from what I've seen it looks like a good name.

WDYT?

@bripkens
Author

bripkens commented May 4, 2014

Regarding 'distance': I always thought that distance implied that a higher score means less similar. Since this is not actually the case, I second your argument, Benedikt.

As for the term 'subsequence': I do not believe that this should be part of the name, as the algorithm's query parameter does not necessarily need to represent a subsequence of a term (take a look at the asf example in the JavaDoc). So what about getQueryDistance?

I feel that getQueryDistance actually captures the intent of the algorithm quite well. It measures the distance of a query to a given term (still quite generic though).

@thiagoh
Contributor

thiagoh commented May 4, 2014

I think that the suffix Score better matches the real intention of the method, don't you think? Just as @bripkens said, "Regarding 'distance': I always thought that distance implicated that a higher score means less similar.", so I think Score would be a better choice.

What do you think about these?

  • getQueryMatchingScore
  • getMatchingSubsequenceScore

@britter
Member

britter commented May 5, 2014

Hi guys,

I think we're getting closer :-) getQuery(Matching)Distance/Score doesn't feel perfect. Those names simply don't communicate what the method does, which will lead to confusion.

I've done some more research in this field and I cannot believe that the algorithm implemented here doesn't have a name (look for example at http://en.wikipedia.org/wiki/Category:String_matching_algorithms), so for now I'd like to read some more and maybe come up with the algorithm's name.

Besides that, I've come to the conclusion that calling everything a "distance" is simply wrong. From http://en.wikipedia.org/wiki/Levenshtein_distance I've learned that this algorithm is called a distance because it measures the editing distance between two strings (in other words: how many editing steps do I have to make to get from one string to the other). So now I'd say that get<Whatever>Score would be a better name :-)
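
To make the "editing steps" reading concrete, here is a minimal sketch of the textbook Levenshtein computation (insert, delete and substitute each cost one step). The class name `EditDistanceSketch` is hypothetical, and this is not the StringUtils implementation, just an illustration of why a larger value means less similar strings.

```java
// Hypothetical sketch of the textbook dynamic-programming edit distance.
final class EditDistanceSketch {

    /** Returns the minimum number of single-character edits turning a into b. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Turning the empty prefix of a into the first j chars of b takes j inserts.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i chars of a
            for (int j = 1; j <= b.length(); j++) {
                int substitution = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insert, delete
                        prev[j - 1] + substitution);            // substitute or keep
            }
            int[] tmp = prev; // reuse the two rows instead of a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

For example, `levenshtein("kitten", "sitting")` is 3 and identical strings give 0, so here a higher number really does mean "further apart".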

@thiagoh
Contributor

thiagoh commented May 5, 2014

Yes, @britter, I do think we're getting closer. I've read some blog articles, and this one caught my attention: http://www.reddit.com/r/Python/comments/1gg4hk/sublime_text_ctrlp_like_fuzzy_matching_in_few/. It suggests the algorithm is an instance of the longest common subsequence problem: http://en.wikipedia.org/wiki/Longest_common_subsequence_problem

@britter
Member

britter commented May 5, 2014

@thiagoh
Contributor

thiagoh commented May 5, 2014

I've found a lot of links that call this kind of method getFuzzyScore, but @britter is correct when he says that it sounds too generic.
Despite that, I really think it is a cool feature and it would be really nice to have it in commons-lang.

What do you think about these?

  • getLCSScore
  • getLongestCommonSubsequenceScore

@bripkens
Author

bripkens commented May 6, 2014

"Besides that, I've come to the conclusion that calling everything a "distance" is simply wrong. From http://en.wikipedia.org/wiki/Levenshtein_distance I've learned that this algorithm is called a distance because it measures the editing distance between two strings"

While true, the more formal term would be edit distance. Furthermore, this is not true for the Jaro-Winkler Distance:

"It is a variant of the Jaro distance metric (Jaro, 1989, 1995), a type of string edit distance, and mainly[...]"

but

"The higher the Jaro–Winkler distance for two strings is, the more similar the strings are[...]"

Calling it distance therefore seems good to me if you specify distance as a similarity metric. Edit distance may then enforce that a higher score means less similar.

Regarding getLongestCommonSubsequenceScore: Is this algorithm really an implementation of the longest common subsequence problem? We should only call it that way if it matches the definition.
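
For reference, a minimal sketch of the textbook LCS length, so the definition can be compared against the method's output on a few inputs. The class name `LcsSketch` is hypothetical; this is the standard dynamic-programming formulation, not code from this PR.

```java
// Hypothetical sketch of the standard longest-common-subsequence length.
final class LcsSketch {

    /** Returns the length of the longest subsequence common to a and b. */
    static int lcsLength(String a, String b) {
        // dp[i][j] = LCS length of the first i chars of a and the first j chars of b
        int[][] dp = new int[a.length() + 1][b.length() + 1];

        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    dp[i][j] = dp[i - 1][j - 1] + 1; // extend the common subsequence
                } else {
                    dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
                }
            }
        }
        return dp[a.length()][b.length()];
    }
}
```

By definition the LCS length can never exceed the length of the shorter input, e.g. `lcsLength("workshop", "wo")` is 2. If the method proposed here returns a larger value for such an input, it is related to LCS but does not match the definition.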

@britter
Member

britter commented May 6, 2014

Hi,

I know I'm taking a step backwards, but after this discussion I'm thinking that getFuzzyDistance may be the name. Currently the problem is that we have 190+ methods in StringUtils, so it may be a bit confusing to have some getXXXDistance methods here and there. But I want to create a StringMetrics class for 4.0, and then it will be much less confusing.

If fuzzy distance is the term that is used most often for this kind of algorithm, then let's call it that.

@bripkens one last update of the PR? ;)

@bripkens
Author

bripkens commented May 6, 2014

PR updated. Method is now called getFuzzyDistance.

@asfgit asfgit closed this in 4c31706 May 7, 2014
@bripkens
Author

bripkens commented May 8, 2014

Awesome, thank you :-)

asfgit pushed a commit that referenced this pull request Apr 27, 2015
…loses #20 from github. Thanks to Ben Ripkens.

git-svn-id: https://svn.apache.org/repos/asf/commons/proper/lang/trunk@1593112 13f79535-47bb-0310-9956-ffa450edef68
3 participants