# Text Similarity Using Levenshtein Distance

The Levenshtein Distance algorithm measures how dissimilar a source string $s$ is from a target string $t$. The distance is the minimum number of single-character insertions, deletions, and substitutions necessary to transform the source string into the target string. The algorithm has many use cases, including spell checking, plagiarism detection, and speech recognition. The greater the Levenshtein Distance (LD), the less similar the source string $s$ is to the target string $t$. For example:

+ If $s$ is "apple" and $t$ is "apple," then $\text{LD}(s, t) = 0$.
+ If $s$ is "stop" and $t$ is "stopping," then $\text{LD}(s, t) = 4$, because the four characters "ping" must be inserted.

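Mathematically, the Levenshtein Distance between the first $i$ characters of a string $a$ and the first $j$ characters of a string $b$ is given by

$$
\text{LD}_{a, b}(i, j) = 
\begin{cases}
    \max(i, j) & \text{if } \min(i, j) = 0, \\
    \min
        \begin{cases}
            \text{LD}_{a,b}(i - 1, j) + 1 \\
            \text{LD}_{a,b}(i, j - 1) + 1 \\
            \text{LD}_{a,b}(i - 1, j - 1) + 1_{(a_{i} \neq b_{j})}
        \end{cases}
    & \text{otherwise,}
\end{cases}
$$

where $1_{(a_{i} \neq b_{j})}$ equals $1$ when the characters $a_{i}$ and $b_{j}$ differ and $0$ when they match. The distance between the full strings is $\text{LD}_{a,b}(|a|, |b|)$.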
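
The recurrence can be transcribed almost directly into Python. The sketch below is only illustrative: the name `ld_recursive` is made up for this example, and `functools.lru_cache` is used to memoize subproblems so that the same $(i, j)$ pair isn't recomputed over and over.

```python
from functools import lru_cache


def ld_recursive(a, b):
    """Levenshtein distance by direct, memoized evaluation of the recurrence."""

    @lru_cache(maxsize=None)
    def ld(i, j):
        # Base case: one prefix is empty, so the distance is the length of the other.
        if min(i, j) == 0:
            return max(i, j)
        # Indicator term: 1 if the i-th and j-th characters differ, else 0.
        indicator = 0 if a[i - 1] == b[j - 1] else 1
        return min(
            ld(i - 1, j) + 1,              # deletion
            ld(i, j - 1) + 1,              # insertion
            ld(i - 1, j - 1) + indicator,  # substitution (or match)
        )

    return ld(len(a), len(b))


print(ld_recursive("kitten", "sitting"))  # 3
```

This form is handy for checking the formula on short strings. For long inputs it can run into Python's recursion limit, which is one reason an iterative, matrix-based version is usually preferred.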

## Creating Our Functions 

We'll create two functions: a minimum() helper that returns the smallest of three values, and the actual Levenshtein Distance function, LD(). Both are defined in the cell below.

In [95]:
# The minimum function #
def minimum(a, b, c):
    mi = a
    
    if (b <= mi):
        mi = b
    if (c <= mi):
        mi = c
    
    return mi

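# The LD function #
def LD(s, t):
    
    n = len(s) # length of string s
    m = len(t) # length of string t
    
    # If one string is empty, the distance is the length of the other
    if n == 0:
        return m
    if m == 0:
        return n
    
    # (n + 1) x (m + 1) matrix: d[i][j] is the distance between s[:i] and t[:j]
    d = [[0 for x in range(m + 1)] for y in range(n + 1)]
    
    # First column: deleting i characters turns s[:i] into the empty string
    for i in range(n + 1):
        d[i][0] = i
    
    # First row: inserting j characters turns the empty string into t[:j]
    for j in range(m + 1):
        d[0][j] = j
        
    for i in range(1, n + 1):
        s_i = s[i - 1]
        
        for j in range(1, m + 1):
            t_j = t[j - 1]
            
            if s_i == t_j:
                cost = 0
            else:
                cost = 1
                
            d[i][j] = minimum(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    
    return d[n][m] # bottom-right cell holds LD(s, t)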

Now that we have created our $\text{LD}()$ function, let's compute similarity between strings.

In [100]:
str1 = "apple"
str2 = "apple"
str3 = "banana"
str4 = "cananas"

In [101]:
LD(str1, str2)

0

In [102]:
LD(str3, str4)

2
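
A raw distance is hard to compare across string pairs of different lengths. One common convention (an assumption here, not something defined above) is to rescale it to the range $[0, 1]$ as $1 - \text{LD}(s, t) / \max(|s|, |t|)$. The sketch below does that with a self-contained distance helper that keeps only two rows of the matrix; the names `levenshtein` and `similarity` are hypothetical.

```python
def levenshtein(s, t):
    """Levenshtein distance keeping only two rows of the matrix in memory."""
    previous = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, s_i in enumerate(s, start=1):
        current = [i]  # turning s[:i] into "" takes i deletions
        for j, t_j in enumerate(t, start=1):
            cost = 0 if s_i == t_j else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def similarity(s, t):
    """Assumed normalization: 1.0 means identical, 0.0 means maximally different."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1 - levenshtein(s, t) / longest


print(similarity("apple", "apple"))               # 1.0
print(round(similarity("banana", "cananas"), 3))  # 0.714
```

With this scaling, "banana" and "cananas" score about 0.71: two edits out of a maximum length of seven.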

## Conclusion

This was a fairly simple example of the Levenshtein Distance algorithm for computing text similarity. The algorithm can be extended to more complicated cases, such as comparing two documents for plagiarism, creating a spell check program, or even DNA analysis.
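
As a rough illustration of the spell-check idea, the sketch below suggests the candidate word closest to a misspelling. The word list and the names `levenshtein` and `suggest` are made up for the example; a real spell checker would use a proper dictionary and probably a faster lookup strategy than a linear scan.

```python
def levenshtein(s, t):
    """Levenshtein distance keeping only two rows of the matrix in memory."""
    previous = list(range(len(t) + 1))
    for i, s_i in enumerate(s, start=1):
        current = [i]
        for j, t_j in enumerate(t, start=1):
            cost = 0 if s_i == t_j else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def suggest(word, vocabulary):
    """Return the vocabulary entry with the smallest distance to `word`."""
    return min(vocabulary, key=lambda candidate: levenshtein(word, candidate))


# Hypothetical vocabulary, for illustration only.
vocabulary = ["apple", "banana", "orange", "grape"]
print(suggest("bananna", vocabulary))  # banana
print(suggest("aple", vocabulary))     # apple
```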