Add fuzzy String matching logic to StringUtils #20
Hello bripkens, I've created https://issues.apache.org/jira/browse/LANG-999 to track this PR.
The PR looks nice so far. I'm not sure about the name. I've done a little research: http://en.wikipedia.org/wiki/Approximate_string_matching. The term fuzzy string matching refers to a class of string matching algorithms. So the Levenshtein distance which we have already implemented is also a fuzzy string matching algorithm. For this reason I don't think
I see the problem that you are referring to. What about
I think that a good name would be
PR updated. Method name changed to
Hi guys, thanks for the suggestions. Here are my comments:
I'd like to have a name that better communicates what the algorithm actually does. Besides that, I think it would be better to call it
I'm not sure whether the algorithm proposed here is just about matching subsequences, but from what I've seen it looks like a good name. WDYT?
Regarding 'distance': I always thought that distance implied that a higher score means less similar. Since this is not actually the case, I second your argument, Benedikt. As for the term 'subsequence': I do not believe that this should be part of the name, as the algorithm's query parameter does not necessarily need to represent a subsequence of a term (take a look at the asf example in the JavaDoc). So what about I feel that
I think that the suffix What do you think?
Hi guys, I think we're getting closer :-) I've done some more research in this field and I cannot believe that the algorithm implemented here doesn't have a name (look for example at http://en.wikipedia.org/wiki/Category:String_matching_algorithms), so currently I'd like to read some more and maybe come up with the algorithm name. Besides that, I've come to the conclusion that calling everything a "distance" is simply wrong. From http://en.wikipedia.org/wiki/Levenshtein_distance I've learned that this algorithm is called a distance because it measures the editing distance between two strings (in other words: how many editing steps do I have to make to get from one string to the other). So now I'd say that
Yes, @britter, I do think we're getting closer. I've read some blog articles and this one caught my attention: http://www.reddit.com/r/Python/comments/1gg4hk/sublime_text_ctrlp_like_fuzzy_matching_in_few/. It suggests the algorithm is an instance of the longest common subsequence problem: http://en.wikipedia.org/wiki/Longest_common_subsequence_problem
Another good read may be http://www.sublimetext.com/forum/viewtopic.php?f=6&t=5661&p=29353&hilit=matching+algorithm#p29353
I've found a lot of links that call this kind of method of What do you think?
While true, the more formal term would be edit distance. Furthermore this is not true for the Jaro-Winkler Distance.
but
Calling it distance therefore seems good to me if you specify distance as a similarity metric. Edit distance may then enforce that a higher score means less similar. Regarding
Hi, I know I'm making a roll backwards, but after this discussion I'm thinking that If fuzzy distance is the term which is used most often for this kind of algorithm, then let's call it that. @bripkens one last update of the PR? ;)
PR updated. Method is now called
Awesome, thank you :-)
…loses #20 from github. Thanks to Ben Ripkens. git-svn-id: https://svn.apache.org/repos/asf/commons/proper/lang/trunk@1593112 13f79535-47bb-0310-9956-ffa450edef68
Fuzzy algorithms are quite common nowadays, with editors such as Sublime Text, TextMate, Atom and others implementing a fuzzy matching algorithm. This PR adds a new string similarity algorithm to StringUtils which behaves very similarly to the ones used in the aforementioned editors.
An issue that needs to be mentioned is that, to the best of my knowledge, this algorithm is not documented in the form of a paper. For this reason, different variations of this algorithm may exist.
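Since no paper pins the algorithm down, here is a minimal sketch of one possible variant, for illustration only. The class name `FuzzySketch`, the method name `fuzzyScore`, and the scoring rule (one point per matched query character, plus two bonus points when a match directly follows the previous match) are assumptions made for this example and are not confirmed anywhere in this thread. It does show the two properties discussed above: a higher score means more similar, and the query does not have to be a subsequence of the term.

```java
import java.util.Locale;

/** Illustrative sketch only; names and scoring rule are assumptions. */
final class FuzzySketch {

    /**
     * Scores how well {@code query} matches {@code term}, case-insensitively.
     * Assumed rule: +1 for each query character found in the term (scanning
     * left to right, never moving backwards), +2 extra when that match sits
     * directly after the previously matched term character.
     */
    static int fuzzyScore(CharSequence term, CharSequence query) {
        if (term == null || query == null) {
            throw new IllegalArgumentException("term and query must not be null");
        }
        String t = term.toString().toLowerCase(Locale.ROOT);
        String q = query.toString().toLowerCase(Locale.ROOT);

        int score = 0;
        int termIndex = 0;       // next term position to inspect
        int previousMatch = -2;  // sentinel: can never be adjacent to index 0

        for (int qi = 0; qi < q.length(); qi++) {
            char queryChar = q.charAt(qi);
            while (termIndex < t.length()) {
                int current = termIndex++;
                if (t.charAt(current) == queryChar) {
                    score++;                      // one point per matched character
                    if (previousMatch + 1 == current) {
                        score += 2;               // bonus for consecutive matches
                    }
                    previousMatch = current;
                    break;                        // move on to the next query character
                }
            }
        }
        return score;
    }

    public static void main(String[] args) {
        System.out.println(fuzzyScore("Workshop", "wo"));                    // 4
        System.out.println(fuzzyScore("Workshop", "ws"));                    // 2
        System.out.println(fuzzyScore("Apache Software Foundation", "asf")); // 3
    }
}
```

Under this assumed rule, "wo" scores higher than "ws" against "Workshop" because its two characters match consecutively, which is the kind of ranking the editors mentioned above appear to favour. A query character that never occurs in the rest of the term simply contributes nothing to the score.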