# Introduction
What is the best string similarity algorithm? It is hard to answer that question without knowing more, such as what you need it for. Even with a basic idea of the requirements, it is hard to pinpoint a good algorithm without first trying several on different datasets. It is a trial-and-error process.

# Types of algorithms

Based on the properties of the operations they perform, string similarity algorithms can be classified into a few categories.

- ***Edit distance based:*** Algorithms in this category compute the number of operations needed to transform one string into another. The more operations required, the less similar the two strings are. Note that every character position in the string is given equal importance.

- ***Token-based:*** In this category, the expected input is a set of tokens rather than complete strings. The idea is to find the tokens shared by both sets: the more common tokens, the more similar the sets. A string can be transformed into a set by splitting it on a delimiter or by sliding a fixed-size window over it, which turns a sentence into word tokens or character n-grams. Note that tokens of different lengths have equal importance.

- ***Sequence-based:*** Here, the similarity is a function of the common substrings between the two strings. The algorithms try to find the longest sequence present in both strings; the more such sequences found, the higher the similarity score. Note that character combinations of the same length have equal importance.
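
To make the three categories concrete, here is a minimal pure-Python sketch with one representative of each: Levenshtein distance (edit distance based), Jaccard similarity over whitespace-separated words (token-based), and the longest common substring (sequence-based). The function names `levenshtein`, `jaccard_tokens`, and `longest_common_substring` are illustrative and are not part of the `textdistance` API; the choice of word tokens for Jaccard is also an assumption, so its scores can differ from a character-level implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance based: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def jaccard_tokens(a: str, b: str) -> float:
    """Token-based: shared word tokens divided by all distinct word tokens."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0  # convention chosen here: two empty inputs are identical
    return len(ta & tb) / len(ta | tb)


def longest_common_substring(a: str, b: str) -> str:
    """Sequence-based: longest contiguous run of characters found in both."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i, ca in enumerate(a, start=1):
        curr = [0] * (len(b) + 1)
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr[j] = prev[j - 1] + 1  # extend the run ending at (i-1, j-1)
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


print(levenshtein("this test", "that test"))               # 2
print(jaccard_tokens("this test", "test this"))            # 1.0
print(repr(longest_common_substring("this test", "that test")))  # ' test'
```

The word-level Jaccard score ignores word order entirely, which is why swapping the two words still gives 1.0, whereas an edit distance penalizes the same swap character by character.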

In [1]:
import textdistance

In [2]:
textdistance.hamming('test', 'text')
# 1

1

In [3]:
textdistance.hamming.distance('test', 'text')
# 1

1

In [4]:
textdistance.hamming.similarity('test', 'text')
# 3

3

In [5]:
textdistance.hamming.normalized_distance('test', 'text')
# 0.25

0.25

In [6]:
textdistance.hamming.normalized_similarity('test', 'text')
# 0.75

0.75

In [7]:
textdistance.Hamming(qval=2).distance('test', 'text')
# 2

2

In [8]:
textdistance.levenshtein("this test", "that test") # 2

2

In [9]:
textdistance.levenshtein("test this", "this test") # 6

6

In [10]:
textdistance.jaro_winkler("this test", "test this") # .666666666...

0.6666666666666666

In [11]:
textdistance.jaccard("this test", "that test")

0.6363636363636364

In [12]:
textdistance.jaccard("this test", "test this")

1.0

In [13]:
textdistance.cosine("apple", "ppale") # 1.0

1.0

In [14]:
textdistance.cosine("this test", "that test")

0.7777777777777778

In [15]:
textdistance.needleman_wunsch("AAAGGT", "ATACGGA")

3.0

In [16]:
# adjust the gap cost
textdistance.needleman_wunsch.gap_cost = 3

In [17]:
textdistance.needleman_wunsch("AAAGGT", "ATACGGA")

1.0

In [18]:
textdistance.mra("tie", "tye") # 1

1

In [19]:
textdistance.jaro_winkler.external = False

In [20]:
textdistance.jaro_winkler("second test", "2nd test")

0.7418831168831169