Use Levenshtein distance from Jellyfish library #1389

tomjaspers · 2015-03-31T12:08:36Z

I've created a small benchmark where I generated 10 misspellings (using a probabilistic approach) for all the artists / albums / songs in my library (about ~5600 items).

These are the results comparing 3 popular C packages, with our current one

python-Levenshtein
  elapsed time: 0.0654819011688
jellyfish
  elapsed time: 0.0748250484467
editdistance
  elapsed time: 0.19926404953
current one: handrolled (from Wikibooks) 
  elapsed time: 12.266146183

Despite being pretty informal benchmarks, they do show the two orders of magnitude difference.

I've decided to go with jellyfish over python-Levenshtein because:

python-Levenshtein is no longer actively mainted
python-Levenshtein crashes if two strings are different types (e.g., str and unicode)
the performance difference between the two is neglectable, esp. considering our current one
Jellyfish works on Windows (it falls back on a Python implementation if the C extensions can't be compiled, which is the same as our current one)

In #646 it was brought up to use the package optionally (and fallback on our own implementation if it wasn't installed). I don’t really see a reason for this, considering Jellyfish works everywhere and does this already.

Oh and the refactoring plans mention something about using difflib.ratio(), but it is in fact not a Levenshtein distance.

Its about two orders of magnitude faster than current 'handrolled' one See #646

sampsyo · 2015-03-31T12:23:43Z

Fantastic! I'm all for it. And I didn't know that Jellyfish had a fallback to pure-Python built in—since that's the case, you're right we don't need our own fallback.

Since the next release is already huge and just around the corner, let's hold off for a couple of days and make this the first change in the next version.

sampsyo · 2015-04-06T04:40:45Z

OK, version bumped! Merging this now. Thanks again!

Use Levenshtein distance from Jellyfish library

Use Levenshtein distance from Jellyfish library

57196e6

Its about two orders of magnitude faster than current 'handrolled' one See #646

brunal added this to the 1.3.12 milestone Apr 1, 2015

sampsyo added a commit that referenced this pull request Apr 6, 2015

Merge pull request #1389 from tomjaspers/faster-levenshtein

020ed0c

Use Levenshtein distance from Jellyfish library

sampsyo merged commit 020ed0c into beetbox:master Apr 6, 2015

sampsyo mentioned this pull request Apr 6, 2015

Fetchart: minwidth and enforce_ratio options #1394

Merged

tomjaspers added a commit that referenced this pull request Apr 6, 2015

Changelog entries for #1389 and #1394

5afe2e1

tomjaspers deleted the faster-levenshtein branch April 8, 2015 10:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use Levenshtein distance from Jellyfish library #1389

Use Levenshtein distance from Jellyfish library #1389

tomjaspers commented Mar 31, 2015

sampsyo commented Mar 31, 2015

sampsyo commented Apr 6, 2015

Use Levenshtein distance from Jellyfish library #1389

Use Levenshtein distance from Jellyfish library #1389

Conversation

tomjaspers commented Mar 31, 2015

sampsyo commented Mar 31, 2015

sampsyo commented Apr 6, 2015