# Benchmarks for String Similarity Scoring Functions

Install the most commonly used Python packages for string similarity scoring. This includes Jellyfish for Levenshtein and Damerau-Levenshtein distance, RapidFuzz for Levenshtein distance, and Biopython for Needleman-Wunsch scores, among others.

In [1]:
!pip install rapidfuzz  # https://github.com/rapidfuzz/RapidFuzz
!pip install python-Levenshtein  # https://github.com/maxbachmann/python-Levenshtein
!pip install levenshtein # https://github.com/maxbachmann/Levenshtein
!pip install jellyfish # https://github.com/jamesturk/jellyfish/
!pip install editdistance # https://github.com/roy-ht/editdistance
!pip install distance # https://github.com/doukremt/distance
!pip install polyleven # https://github.com/fujimotos/polyleven
!pip install biopython # https://github.com/biopython/biopython
!pip install stringzilla # https://github.com/ashvardanian/stringzilla



## Levenshtein Distance Between Short English Words

We benchmark on a real-world corpus of English text, split on whitespace into individual words. Let's download the dataset and load it into memory.

In [None]:
!wget --no-clobber -O ../leipzig1M.txt https://introcs.cs.princeton.edu/python/42sort/leipzig1m.txt

In [1]:
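# Read the whole corpus and split it on whitespace; the context manager closes the file.
with open("../leipzig1M.txt", "r") as corpus_file:
    words = tuple(corpus_file.read().split())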
print(f"{len(words):,} words")

21,191,455 words


In [4]:
import random

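def checksum_distances(tokens, distance_function, n: int = 1000000):
    """Sum `distance_function(a, b)` over `n` randomly chosen pairs of `tokens`."""
    distances_sum = 0
    for _ in range(n):
        # Sampling is unseeded, so each call scores a different set of pairs.
        a = random.choice(tokens)
        b = random.choice(tokens)
        distances_sum += distance_function(a, b)
    return distances_sum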
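
Because the helper above draws its pairs from the global `random` state, two libraries timed in separate cells are scored on different pairs, and their checksums cannot be compared. A minimal sketch of a seeded variant follows. The name `checksum_distances_seeded`, the `seed` parameter, and the toy `length_gap` function are illustrative assumptions, not part of the original notebook.

```python
import random


def checksum_distances_seeded(tokens, distance_function, n=1_000_000, seed=42):
    """Sum distances over `n` pairs drawn from a private, seeded generator."""
    rng = random.Random(seed)  # private generator: does not touch the global random state
    total = 0
    for _ in range(n):
        a = rng.choice(tokens)
        b = rng.choice(tokens)
        total += distance_function(a, b)
    return total


def length_gap(a, b):
    """Toy stand-in for a real distance function (hypothetical, for illustration only)."""
    return abs(len(a) - len(b))


toy_tokens = ("apple", "banana", "cherry", "date")
first = checksum_distances_seeded(toy_tokens, length_gap, n=1_000, seed=7)
second = checksum_distances_seeded(toy_tokens, length_gap, n=1_000, seed=7)
print(first == second)  # the same seed yields the same pairs, hence the same checksum
```

With a fixed seed, every library sees the same pairs, so equal checksums suggest that the implementations agree on those inputs.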

In [5]:
import stringzilla as sz

In [6]:
%%timeit
checksum_distances(words, sz.edit_distance)

1.25 s ± 45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [7]:
%%timeit
# `proteins` is defined in the Needleman-Wunsch section below; run that cell first.
checksum_distances(proteins, sz.edit_distance, 10_000)

792 ms ± 20.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [8]:
from rapidfuzz.distance import Levenshtein as rf

In [9]:
%%timeit
checksum_distances(words, rf.distance)

1.25 s ± 23.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [10]:
%%timeit
checksum_distances(proteins, rf.distance, 10_000)

47.4 ms ± 434 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
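
To put the four timings above on a common scale, the arithmetic below converts each one to microseconds per pair. The numbers are copied from the cell outputs above; the dictionary layout and variable names are illustrative. Note that each figure includes the cost of `random.choice` and the Python loop, so it is an upper bound on the library's own cost per call.

```python
# (library, dataset) -> (seconds per loop, pairs per loop), taken from the outputs above
timings = {
    ("stringzilla", "words"): (1.25, 1_000_000),
    ("stringzilla", "proteins"): (0.792, 10_000),
    ("rapidfuzz", "words"): (1.25, 1_000_000),
    ("rapidfuzz", "proteins"): (0.0474, 10_000),
}

# Convert every measurement to microseconds per scored pair.
per_pair_us = {
    key: seconds / pairs * 1e6 for key, (seconds, pairs) in timings.items()
}

for (library, dataset), microseconds in per_pair_us.items():
    print(f"{library:12s} {dataset:9s} {microseconds:7.2f} us per pair")

# Ratio of the two protein timings.
protein_ratio = (
    per_pair_us[("stringzilla", "proteins")] / per_pair_us[("rapidfuzz", "proteins")]
)
print(f"protein timing ratio: {protein_ratio:.1f}x")
```

The two word benchmarks land on the same 1.25 s, which hints that for short words the harness overhead, rather than the distance computation, may dominate the measurement.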


In [None]:
import editdistance as ed

In [None]:
%%timeit
checksum_distances(words, ed.eval)

In [None]:
import jellyfish as jf

In [None]:
%%timeit
checksum_distances(words, jf.levenshtein_distance)

In [None]:
import Levenshtein as le

In [None]:
%%timeit
checksum_distances(words, le.distance)
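
The cells above only measure speed; they do not check that the libraries return the same distances. One way to cross-check them is against a plain reference implementation. The sketch below is a textbook two-row dynamic program over Python `str` code points, not any library's algorithm, and the names `levenshtein_reference` and `disagreements` are hypothetical helpers introduced here.

```python
def levenshtein_reference(a, b):
    """Unit-cost Levenshtein distance, keeping only two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string along the row to save memory
    previous = list(range(len(b) + 1))  # distances from the empty prefix of `a`
    for i, char_a in enumerate(a, start=1):
        current = [i]  # deleting the first i characters of `a`
        for j, char_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + (char_a != char_b),  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def disagreements(pairs, candidate):
    """Return the pairs on which `candidate` differs from the reference."""
    return [(a, b) for a, b in pairs if candidate(a, b) != levenshtein_reference(a, b)]


sample_pairs = [("kitten", "sitting"), ("flaw", "lawn"), ("", "abc")]
print([levenshtein_reference(a, b) for a, b in sample_pairs])
```

In the notebook, `candidate` could be any of the functions benchmarked above, such as `rf.distance` or `le.distance`, applied to a small sample of `words`. Libraries that count bytes rather than code points may legitimately differ on non-ASCII words.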

## Needleman-Wunsch Alignment Scores Between Random Protein Sequences

For Needleman-Wunsch, let's generate some random protein sequences:

In [None]:
import random

In [None]:
proteins = [''.join(random.choice('ACGT') for _ in range(300)) for _ in range(1_000)]
print(f"{len(proteins):,} proteins")

1,000 proteins
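
The alphabet `'ACGT'` used above is the set of DNA nucleotide letters. All four are also valid one-letter amino-acid codes (alanine, cysteine, glycine, threonine), so substitution-matrix scoring still works, but the sequences exercise only a small corner of the matrix. If sequences over the full set of 20 standard amino acids are wanted, a seeded sketch could look like the following. `AMINO_ACIDS` and `random_proteins` are hypothetical names, and this variant was not used for the timings in this notebook.

```python
import random

# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_proteins(count=1_000, length=300, seed=0):
    """Generate `count` random sequences of `length` residues, reproducibly."""
    rng = random.Random(seed)  # private generator so the result depends only on `seed`
    return ["".join(rng.choices(AMINO_ACIDS, k=length)) for _ in range(count)]


sample_proteins = random_proteins(count=5, length=12, seed=1)
print(len(sample_proteins), len(sample_proteins[0]))
```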


In [None]:
from Bio import Align
from Bio.Align import substitution_matrices
aligner = Align.PairwiseAligner()
aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
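# Note: Biopython adds gap scores to the alignment score, so gap penalties
# are conventionally negative; the positive values here reward gaps.
aligner.open_gap_score = 1
aligner.extend_gap_score = 1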

In [None]:
%%timeit
checksum_distances(proteins, aligner.score, 10_000)
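
For readers unfamiliar with the score being timed, the sketch below computes a Needleman-Wunsch global alignment score with a two-row dynamic program. It is a simplification under stated assumptions: it uses flat `match`, `mismatch`, and `gap` values (all hypothetical defaults) instead of the BLOSUM62 matrix and the gap settings configured above, so its numbers will not match `aligner.score`.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of `a` and `b` under a linear gap penalty."""
    # Row 0: aligning the empty prefix of `a` against prefixes of `b` costs only gaps.
    previous = [j * gap for j in range(len(b) + 1)]
    for i, char_a in enumerate(a, start=1):
        current = [i * gap]  # aligning a[:i] against the empty prefix of `b`
        for j, char_b in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if char_a == char_b else mismatch)
            gap_in_b = previous[j] + gap  # char_a aligned to a gap
            gap_in_a = current[j - 1] + gap  # char_b aligned to a gap
            current.append(max(diagonal, gap_in_b, gap_in_a))
        previous = current
    return previous[-1]


print(needleman_wunsch_score("GATTACA", "GATTACA"))
print(needleman_wunsch_score("AC", "AG"))
```

Unlike Levenshtein distance, which is minimized, this score is maximized: identical sequences score highest, and each gap or mismatch lowers the total when the penalties are negative.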