Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Levenshtein distance from Jellyfish library #1389

Merged
merged 1 commit into from Apr 6, 2015
Merged

Use Levenshtein distance from Jellyfish library #1389

merged 1 commit into from Apr 6, 2015

Conversation

tomjaspers
Copy link
Collaborator

I've created a small benchmark where I generated 10 misspellings (using a probabilistic approach) for all the artists / albums / songs in my library (about ~5600 items).

These are the results comparing 3 popular C packages, with our current one

python-Levenshtein
  elapsed time: 0.0654819011688
jellyfish
  elapsed time: 0.0748250484467
editdistance
  elapsed time: 0.19926404953
current one: handrolled (from Wikibooks) 
  elapsed time: 12.266146183

Despite being pretty informal benchmarks, they do show the two orders of magnitude difference.

I've decided to go with jellyfish over python-Levenshtein because:

  • python-Levenshtein is no longer actively mainted
  • python-Levenshtein crashes if two strings are different types (e.g., str and unicode)
  • the performance difference between the two is neglectable, esp. considering our current one
  • Jellyfish works on Windows (it falls back on a Python implementation if the C extensions can't be compiled, which is the same as our current one)

In #646 it was brought up to use the package optionally (and fallback on our own implementation if it wasn't installed). I don’t really see a reason for this, considering Jellyfish works everywhere and does this already.

Oh and the refactoring plans mention something about using difflib.ratio(), but it is in fact not a Levenshtein distance.

Its about two orders of magnitude faster than current 'handrolled' one

See #646
@sampsyo
Copy link
Member

sampsyo commented Mar 31, 2015

Fantastic! I'm all for it. And I didn't know that Jellyfish had a fallback to pure-Python built in—since that's the case, you're right we don't need our own fallback.

Since the next release is already huge and just around the corner, let's hold off for a couple of days and make this the first change in the next version.

@brunal brunal added this to the 1.3.12 milestone Apr 1, 2015
@sampsyo
Copy link
Member

sampsyo commented Apr 6, 2015

OK, version bumped! Merging this now. Thanks again!

sampsyo added a commit that referenced this pull request Apr 6, 2015
Use Levenshtein distance from Jellyfish library
@sampsyo sampsyo merged commit 020ed0c into beetbox:master Apr 6, 2015
tomjaspers added a commit that referenced this pull request Apr 6, 2015
@tomjaspers tomjaspers deleted the faster-levenshtein branch April 8, 2015 10:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants