# Benchmarks for String Similarity Scoring Functions

Install the most commonly used Python packages for string similarity scoring. These include jellyfish for Levenshtein and Damerau-Levenshtein distance, RapidFuzz for Levenshtein distance, and BioPython for Needleman-Wunsch scores, among others.

In [None]:
!pip install rapidfuzz  # https://github.com/rapidfuzz/RapidFuzz
!pip install python-Levenshtein  # https://github.com/maxbachmann/python-Levenshtein
!pip install levenshtein # https://github.com/maxbachmann/Levenshtein
!pip install jellyfish # https://github.com/jamesturk/jellyfish/
!pip install editdistance # https://github.com/roy-ht/editdistance
!pip install distance # https://github.com/doukremt/distance
!pip install polyleven # https://github.com/fujimotos/polyleven
!pip install biopython # https://github.com/biopython/biopython
!pip install stringzilla # https://github.com/ashvardanian/stringzilla

## Levenshtein Distance Between Short English Words

We benchmark on a real-world corpus of English text, split into whitespace-separated words. Let's download the dataset and load it into memory.

In [None]:
!wget --no-clobber -O ../leipzig1M.txt https://introcs.cs.princeton.edu/python/42sort/leipzig1m.txt

In [1]:
with open("../leipzig1M.txt", "r") as file:  # close the file once it is read
    words = tuple(file.read().split())
print(f"{len(words):,} words")

21,191,455 words


In [2]:
import random

def checksum_distances(tokens, distance_function, n: int = 1000000):
    """Sum `distance_function` over `n` randomly sampled pairs of `tokens`."""
    distances_sum = 0
    while n:
        a = random.choice(tokens)
        b = random.choice(tokens)
        distances_sum += distance_function(a, b)
        n -= 1
    return distances_sum
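
Because `checksum_distances` draws pairs from the global random state, the returned sums are not directly comparable between runs or libraries. Below is a minimal sketch of a seeded variant that makes them comparable; the name `checksum_distances_seeded`, the `seed` parameter, and the toy `length_difference` metric are hypothetical additions for illustration, not part of the benchmark above.

```python
import random

def checksum_distances_seeded(tokens, distance_function, n=1_000_000, seed=42):
    # A private generator keeps the sampled pairs identical across calls
    # with the same seed, regardless of the global random state.
    rng = random.Random(seed)
    distances_sum = 0
    for _ in range(n):
        a = rng.choice(tokens)
        b = rng.choice(tokens)
        distances_sum += distance_function(a, b)
    return distances_sum

def length_difference(a, b):
    # Hypothetical stand-in metric so the sketch needs no third-party package.
    return abs(len(a) - len(b))

sample = ("alpha", "be", "gamma", "delta", "pi")
first = checksum_distances_seeded(sample, length_difference, n=1000, seed=7)
second = checksum_distances_seeded(sample, length_difference, n=1000, seed=7)
print(first == second)
```

With a fixed seed, two libraries that implement the same metric should produce the same checksum, which doubles as a quick correctness check alongside the timing.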

In [4]:
# Random 10,000-character sequences over the letters A, C, G, and T
proteins = [''.join(random.choice('ACGT') for _ in range(10_000)) for _ in range(1_000)]
print(f"{len(proteins):,} proteins")

1,000 proteins


In [5]:
import stringzilla as sz

In [None]:
%%timeit
checksum_distances(words, sz.edit_distance)

In [None]:
%%timeit
checksum_distances(proteins, sz.edit_distance, 100)

In [None]:
from rapidfuzz.distance import Levenshtein as rf

In [None]:
%%timeit
checksum_distances(words, rf.distance)

In [None]:
%%timeit
checksum_distances(proteins, rf.distance, 100)

In [None]:
import editdistance as ed

In [None]:
%%timeit
checksum_distances(words, ed.eval)

In [None]:
import jellyfish as jf

In [None]:
%%timeit
checksum_distances(words, jf.levenshtein_distance)

In [None]:
import Levenshtein as le

In [None]:
%%timeit
checksum_distances(words, le.distance)
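
To cross-check the libraries above on small inputs, a pure-Python reference can help. The following is a minimal sketch of the classic two-row dynamic-programming formulation with unit costs; the name `levenshtein_reference` is hypothetical. It compares Unicode code points, so libraries that compare bytes may report different values on non-ASCII words.

```python
def levenshtein_reference(a: str, b: str) -> int:
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]

print(levenshtein_reference("kitten", "sitting"))  # 3
```

This runs in time proportional to the product of the two lengths, so it is only suitable for spot checks, not for the timed loops above.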

## Needleman-Wunsch Alignment Scores Between Random Protein Sequences

For Needleman-Wunsch, we reuse the random sequences generated above and configure BioPython's pairwise aligner with the BLOSUM62 substitution matrix:

In [6]:
from Bio import Align
from Bio.Align import substitution_matrices
aligner = Align.PairwiseAligner()
aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
aligner.open_gap_score = 1
aligner.extend_gap_score = 1

In [7]:
aligner.substitution_matrix

Array([[ 4., -1., -2., -2.,  0., -1., -1.,  0., -2., -1., -1., -1., -1.,
        -2., -1.,  1.,  0., -3., -2.,  0., -2., -1.,  0., -4.],
       [-1.,  5.,  0., -2., -3.,  1.,  0., -2.,  0., -3., -2.,  2., -1.,
        -3., -2., -1., -1., -3., -2., -3., -1.,  0., -1., -4.],
       [-2.,  0.,  6.,  1., -3.,  0.,  0.,  0.,  1., -3., -3.,  0., -2.,
        -3., -2.,  1.,  0., -4., -2., -3.,  3.,  0., -1., -4.],
       [-2., -2.,  1.,  6., -3.,  0.,  2., -1., -1., -3., -4., -1., -3.,
        -3., -1.,  0., -1., -4., -3., -3.,  4.,  1., -1., -4.],
       [ 0., -3., -3., -3.,  9., -3., -4., -3., -3., -1., -1., -3., -1.,
        -2., -3., -1., -1., -2., -2., -1., -3., -3., -2., -4.],
       [-1.,  1.,  0.,  0., -3.,  5.,  2., -2.,  0., -3., -2.,  1.,  0.,
        -3., -1.,  0., -1., -2., -1., -2.,  0.,  3., -1., -4.],
       [-1.,  0.,  0.,  2., -4.,  2.,  5., -2.,  0., -3., -3.,  1., -2.,
        -3., -1.,  0., -1., -3., -2., -2.,  1.,  4., -1., -4.],
       [ 0., -2.,  0., -1., -3., -2., -2.

Let's expand the BLOSUM matrix into a dense 256x256 table indexed by character code. This lets us pass the matrix to the Needleman-Wunsch implementation in StringZilla.

In [8]:
import numpy as np

subs_packed = np.array(aligner.substitution_matrix).astype(np.int8)
subs_reconstructed = np.zeros((256, 256), dtype=np.int8)

# Initialize all banned characters to the largest possible penalty
subs_reconstructed.fill(127)
for packed_row, packed_row_aminoacid in enumerate(aligner.substitution_matrix.alphabet):
    for packed_column, packed_column_aminoacid in enumerate(aligner.substitution_matrix.alphabet):
        reconstructed_row = ord(packed_row_aminoacid)
        reconstructed_column = ord(packed_column_aminoacid)
        subs_reconstructed[reconstructed_row, reconstructed_column] = subs_packed[packed_row, packed_column]

(subs_reconstructed < 127).sum()

576

In [9]:
proteins[1]

'TCGCGATTCGGGAGGTCGCAGGTAGTGCAGTATCTCAGACCCGTGTTTTGTGTAGAGCAATTATCGTAGGACGCAAGATACATGTGCGTCTCCCACGACCGTTCACGAACAATGATAGCTTTGTAAAGGCTCCTTGAGAAGTTTTTTGACTGCTCGACTGGTTCTAAACATGTCCCGGCCTATTGCCCCAAAACCTGTGTGGATACTCACCCACGTCACATAATTTCGCGAATTTTACTGTTAACGAAAGGTGCCAGAAGCGGGACTAGCTCTGCTAGCTGTAACGGCCTACACATTCATCTTGGGAACGTACCGCCTACCTGAACAACGCAGTGTTAAGAGTAAACCAACTCAATTGGATGATTTCTGCGCTTCCGCAACAAAGCGAGGTTCTAACGAACACTGAGATATATTCGCGACAATCCTTTTAGTTCAGGAACGCTGACGGCAGGTTGTTATGCGCACCATTGATTATGAGTTAGGTGCACTGGCACAAAGTCTCTGTCCCGCGTACACTCGCTCCCGGCTTCGCAAACCTGAGGTCATTACGTATAAAATCTACATGTGAGACTAGTTTCGCGCATATGATGAGGTAAGATATCTCTGTTTCGTGCTGCGGTGGGTTTAATCATAGTTCTTAATACCCCTCTGTTAATCACAAACCCTTATCTAGCGTGGGTGAGGCATTTTGATTCTTTTCTGGTTTAGACTAAGGTACGCGGTAGTAGAATGATAACGGGCCAATTATGACTGAGAAGCAAGAGTAGAACGCGTCGCCAAACGCGCTATGCGATTCTGCAGAGCCGGCGGTATTTGATTTAAAGGTACAGATGGGAGCATGCTATAGAGGTACTAACAATTAAGATCTGACGGACATACCTATATCAACGTGACTTGTACATATGTGTTTTTATGGAAATTTGCAAGCTGCGATGAGCCGGGCTGGAGACGCTAACCCATGACGGTTGCGATATATGGGCGTTTGAGTCTCGTGCGTGC

In [10]:
aligner.score(proteins[0], proteins[1])

47815.0

In [11]:
sz.alignment_score(proteins[0], proteins[1], substitution_matrix=subs_reconstructed, gap_score=1)

47815
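
Both libraries report the same score. For intuition about what is being computed, here is a minimal pure-Python sketch of a global alignment score, assuming score maximization and a linear gap score (one fixed value per gap character). The names `needleman_wunsch_score` and `toy_substitution` and the toy scoring values are hypothetical; this is not how either library implements it.

```python
def needleman_wunsch_score(a, b, substitution, gap_score):
    # previous[j] is the best score aligning the processed prefix of `a`
    # with the first j characters of `b`.
    previous = [j * gap_score for j in range(len(b) + 1)]
    for i, char_a in enumerate(a, start=1):
        current = [i * gap_score]
        for j, char_b in enumerate(b, start=1):
            match = previous[j - 1] + substitution(char_a, char_b)
            gap_in_b = previous[j] + gap_score      # char_a aligned to a gap
            gap_in_a = current[j - 1] + gap_score   # char_b aligned to a gap
            current.append(max(match, gap_in_b, gap_in_a))
        previous = current
    return previous[-1]

def toy_substitution(x, y):
    # Hypothetical scoring: reward matches, penalize mismatches.
    return 2 if x == y else -1

print(needleman_wunsch_score("AC", "A", toy_substitution, gap_score=-1))  # 1
```

Swapping `toy_substitution` for a lookup into a substitution table gives the same recurrence the benchmark exercises, at pure-Python speed.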

In [12]:
%%timeit
def sz_score(a, b): return sz.alignment_score(a, b, substitution_matrix=subs_reconstructed, gap_score=1)
checksum_distances(proteins, sz_score, 100)

7.74 s ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [None]:
%%timeit
checksum_distances(proteins, aligner.score, 100)