<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#performance-benchmark" data-toc-modified-id="performance-benchmark-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>performance benchmark</a></span><ul class="toc-item"><li><span><a href="#load-packages" data-toc-modified-id="load-packages-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>load packages</a></span></li><li><span><a href="#long-string-test" data-toc-modified-id="long-string-test-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>long string test</a></span></li><li><span><a href="#short-string-test" data-toc-modified-id="short-string-test-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>short string test</a></span></li></ul></li><li><span><a href="#main_function" data-toc-modified-id="main_function-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>main_function</a></span></li></ul></div>

cleaned code for string_similarity_algorithm

## performance benchmark
- damerau-levenshtein distance 
    - [x] stringdist (restricted Damerau-Levenshtein distance)
    - fastDamerauLevenshtein
    - pyxDamerauLevenshtein (restricted edit distance and no custom weights)
    - jellyfish (true Damerau-Levenshtein but no custom weights)
    - editdistance (Levenshtein distance only, no transpositions and no custom weights)
    - textdistance (true Damerau-Levenshtein but no custom weights)
    - pylev
    - abydos

- restricted Damerau-Levenshtein distance (also called optimal string alignment) vs true Damerau-Levenshtein distance
    - the restricted variant assumes no characters were added or deleted between the transposed characters, so no substring is edited more than once.
    - the restricted variant is simple and can be optimized for memory usage like Wagner-Fischer. The only difference is that it needs three rows in memory at any given time instead of two.
    - the restricted variant is not a true distance metric because it does not satisfy the triangle inequality, e.g. d('CA', 'AC') = 1 and d('AC', 'ABC') = 1, but d('CA', 'ABC') = 3. This makes it a poor choice for applications that compare more than two strings, such as clustering.
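
The difference between the two variants can be sketched in pure Python. This is a minimal illustration with unit costs, not the implementation used by any of the packages listed above; the function names `osa_distance` and `true_dl_distance` are made up for this example.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    prev2 = []                       # row i-2, only read once i > 1
    prev = list(range(len(b) + 1))   # row i-1 (row 0: build b[:j] from nothing)
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,          # deletion
                         cur[j - 1] + 1,       # insertion
                         prev[j - 1] + cost)   # substitution or match
            # Transposition of two adjacent characters only.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                cur[j] = min(cur[j], prev2[j - 2] + 1)
        prev2, prev = prev, cur      # keep three rows in total
    return prev[len(b)]


def true_dl_distance(a: str, b: str) -> int:
    """Unrestricted Damerau-Levenshtein distance (full table)."""
    max_dist = len(a) + len(b)
    last_row = {}  # last row of `a` in which each character was seen
    # Table is shifted by one so that index 0 acts as a sentinel border.
    d = [[max_dist] * (len(b) + 2) for _ in range(len(a) + 2)]
    for i in range(len(a) + 1):
        d[i + 1][1] = i
    for j in range(len(b) + 1):
        d[1][j + 1] = j
    for i in range(1, len(a) + 1):
        last_col = 0  # last column in this row where the characters matched
        for j in range(1, len(b) + 1):
            k = last_row.get(b[j - 1], 0)
            l = last_col
            if a[i - 1] == b[j - 1]:
                cost = 0
                last_col = j
            else:
                cost = 1
            d[i + 1][j + 1] = min(
                d[i][j] + cost,                           # substitution or match
                d[i + 1][j] + 1,                          # insertion
                d[i][j + 1] + 1,                          # deletion
                d[k][l] + (i - k - 1) + 1 + (j - l - 1),  # transposition with edits in between
            )
        last_row[a[i - 1]] = i
    return d[len(a) + 1][len(b) + 1]


print(osa_distance('CA', 'ABC'), true_dl_distance('CA', 'ABC'))      # 3 2
print(osa_distance('smtih', 'smith'), true_dl_distance('smtih', 'smith'))  # 1 1
```

For plain adjacent swaps such as 'smtih' / 'smith' the two variants agree. They differ only when characters are inserted or deleted between the transposed pair, as in 'CA' / 'ABC'.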

- additional reading: https://www.lemoda.net/text-fuzzy/damerau-levenshtein/

In [1]:
#! pip install StringDist
#! pip install fastDamerauLevenshtein
#! pip install pyxDamerauLevenshtein
#! pip install jellyfish
#! pip install editdistance # only Levenshtein distance
#! pip install "textdistance[extras]"
#! pip install pylev
#! pip install abydos


In [None]:
from stringdist import rdlevenshtein_norm as d1
#stringdist.rdlevenshtein_norm('test', 'testing')

from fastDamerauLevenshtein import damerauLevenshtein as d2
# damerauLevenshtein('car', 'cars', similarity=True)  expected result: 0.75

from pyxdameraulevenshtein import normalized_damerau_levenshtein_distance as d3
# normalized_damerau_levenshtein_distance('smtih', 'smith')  # expected result: 0.2
# 0.0 means that the sequences are identical

from jellyfish import damerau_levenshtein_distance as d4
#jellyfish.damerau_levenshtein_distance(u'jellyfish', u'jellyfihs') 
#expected result: 1

# editdistance only computes Levenshtein distance

# aliased because abydos exports a class with the same name
from textdistance import DamerauLevenshtein as TextdistanceDL
d5 = TextdistanceDL()
# d5.normalized_similarity('smtih', 'smith') 
# expected result: 0.8

from pylev import damerau_levenshtein as d6
# d6('smtih', 'smith')
# expected result: 1

# aliased because textdistance exports a class with the same name
from abydos.distance import DamerauLevenshtein as AbydosDL
d7 = AbydosDL()

### load packages

In [45]:
str1, str2 = 'smtih', 'smith'

In [49]:
from stringdist import rdlevenshtein_norm as d1
1-d1(str1, str2)

0.8

In [50]:
from fastDamerauLevenshtein import damerauLevenshtein as d2
d2(str1, str2, similarity=True)

0.8

In [52]:
from pyxdameraulevenshtein import normalized_damerau_levenshtein_distance as d3
1-d3(str1, str2)

0.7999999970197678

In [56]:
from jellyfish import damerau_levenshtein_distance as d4
1-d4(str1, str2)/max(len(str1), len(str2))

0.8

In [47]:
from textdistance import DamerauLevenshtein
d5 = DamerauLevenshtein()
# d5.normalized_similarity('smtih', 'smith') 
# expected result: 0.8
d5.normalized_similarity(str1, str2)

0.8

In [44]:
from pylev import damerau_levenshtein as d6

1-d6(str1, str2)/max(len(str1), len(str2))

0.8

In [37]:
from abydos.distance import DamerauLevenshtein
d7 = DamerauLevenshtein()
1 - d7.dist(str1, str2)

0.8
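
The cells above put every package on the same scale: packages that return a raw edit count are divided by the longer string's length, and distances are turned into similarities with `1 - distance`. A minimal sketch of that conversion follows. The helper name `similarity_from_distance` is hypothetical, and treating two empty strings as identical is an assumption, not something the packages above guarantee.

```python
def similarity_from_distance(distance: float, a: str, b: str) -> float:
    """Map a raw unit-cost edit distance to a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        # Assumption: two empty strings count as identical.
        return 1.0
    return 1 - distance / longest


# 'smtih' -> 'smith' is one adjacent transposition, so the raw distance is 1.
print(similarity_from_distance(1, 'smtih', 'smith'))  # 0.8
```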

### long string test

In [67]:
str1, str2 = '1 QUAL LANE', '1 QUAIL LN'
print(str1)
print(str2)

1 QUAL LANE
1 QUAIL LN


In [59]:
print(1-d1(str1, str2))
print(d2(str1, str2, similarity=True))
print(1-d3(str1, str2))
print(1-d4(str1, str2)/max(len(str1), len(str2)))
print(d5.normalized_similarity(str1, str2))
print(1-d6(str1, str2)/max(len(str1), len(str2)))
print(1 - d7.dist(str1, str2))

0.7272727272727273
0.7272727272727273
0.7272727191448212
0.7272727272727273
0.7272727272727273
0.7272727272727273
0.7272727272727273
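
The third value (pyxdameraulevenshtein) differs from the others after the eighth decimal place. That pattern is consistent with the normalized distance being computed in single precision. This is an inference from the printed numbers, not something confirmed against the library's source. The sketch below rounds the exact ratios to float32 with the standard `struct` module for comparison with the pyxdameraulevenshtein outputs shown above.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (double precision) to the nearest float32 value."""
    return struct.unpack('f', struct.pack('f', x))[0]


# distance / max length for the two test pairs: 1/5 and 3/11
print(1 - to_float32(1 / 5))
print(1 - to_float32(3 / 11))
```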


Units: 1 ns = 0.001 µs (microseconds), so 1 µs = 1000 ns.
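
`%%timeit` is an IPython magic. Outside a notebook, a roughly comparable measurement can be sketched with the standard `timeit` module. The helpers `time_call` and `format_time` are hypothetical names, and `sorted` is only a stand-in workload for one of the distance functions. The lambda wrapper adds a small per-call overhead, so absolute numbers will not match `%%timeit` exactly.

```python
import statistics
import timeit


def time_call(func, *args, repeat=7, number=10_000):
    """Return (mean, std. dev.) of the per-call time in nanoseconds."""
    runs = timeit.repeat(lambda: func(*args), repeat=repeat, number=number)
    per_call_ns = [run / number * 1e9 for run in runs]
    return statistics.mean(per_call_ns), statistics.stdev(per_call_ns)


def format_time(ns: float) -> str:
    """Format nanoseconds with the unit prefixes used in the outputs here."""
    for unit, scale in (("ns", 1), ("µs", 1e3), ("ms", 1e6)):
        if ns < scale * 1e3:
            return f"{ns / scale:.3g} {unit}"
    return f"{ns / 1e9:.3g} s"


# Stand-in workload; replace with e.g. a distance function and two strings.
mean_ns, sd_ns = time_call(sorted, [3, 1, 2])
print(f"{format_time(mean_ns)} ± {format_time(sd_ns)} per loop")
```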

In [60]:
%%timeit
1-d1(str1, str2)

564 ns ± 5.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [61]:
%%timeit
d2(str1, str2, similarity=True)

2.27 µs ± 59.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [62]:
%%timeit
1-d3(str1, str2)

2.69 µs ± 52.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [63]:
%%timeit
1-d4(str1, str2)/max(len(str1), len(str2))

1.14 µs ± 15.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [64]:
%%timeit
d5.normalized_similarity(str1, str2)

5.76 µs ± 594 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [65]:
%%timeit
1-d6(str1, str2)/max(len(str1), len(str2))

55.2 µs ± 9.43 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [66]:
%%timeit
1 - d7.dist(str1, str2)

187 µs ± 20.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


### short string test

In [68]:
str1, str2 = 'smtih', 'smith'

In [69]:
print(1-d1(str1, str2))
print(d2(str1, str2, similarity=True))
print(1-d3(str1, str2))
print(1-d4(str1, str2)/max(len(str1), len(str2)))
print(d5.normalized_similarity(str1, str2))
print(1-d6(str1, str2)/max(len(str1), len(str2)))
print(1 - d7.dist(str1, str2))

0.8
0.8
0.7999999970197678
0.8
0.8
0.8
0.8


In [70]:
%%timeit
1-d1(str1, str2)

243 ns ± 6.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [71]:
%%timeit
d2(str1, str2, similarity=True)

1.28 µs ± 43 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [72]:
%%timeit
1-d3(str1, str2)

1.15 µs ± 46.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [73]:
%%timeit
1-d4(str1, str2)/max(len(str1), len(str2))

891 ns ± 68.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [74]:
%%timeit
d5.normalized_similarity(str1, str2)

5.74 µs ± 500 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [75]:
%%timeit
1-d6(str1, str2)/max(len(str1), len(str2))

11.3 µs ± 274 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [76]:
%%timeit
1 - d7.dist(str1, str2)

33.9 µs ± 1.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
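
To compare the packages at a glance, the mean timings reported above can be put into one table and ranked. The numbers below are copied from the `%%timeit` outputs of this notebook and converted to nanoseconds; the helper name `ranking` is made up for this sketch.

```python
# Mean time per call in nanoseconds, taken from the %%timeit outputs above.
timings_ns = {
    "stringdist":             {"long": 564,     "short": 243},
    "fastDamerauLevenshtein": {"long": 2_270,   "short": 1_280},
    "pyxdameraulevenshtein":  {"long": 2_690,   "short": 1_150},
    "jellyfish":              {"long": 1_140,   "short": 891},
    "textdistance":           {"long": 5_760,   "short": 5_740},
    "pylev":                  {"long": 55_200,  "short": 11_300},
    "abydos":                 {"long": 187_000, "short": 33_900},
}


def ranking(case: str):
    """Return (package, ns, slowdown vs fastest) sorted from fastest to slowest."""
    fastest = min(t[case] for t in timings_ns.values())
    rows = sorted(timings_ns.items(), key=lambda item: item[1][case])
    return [(name, t[case], t[case] / fastest) for name, t in rows]


for case in ("long", "short"):
    print(f"{case} string test")
    for name, ns, ratio in ranking(case):
        print(f"  {name:<24}{ns:>10,} ns  {ratio:>7.1f}x")
```

On both test pairs stringdist is the fastest and abydos the slowest, which matches the package checked in the list at the top of this section.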
