<div style="text-align: right">
    <i>
        LING 5981/6080: Fundamentals of Python <br>
        Fall 2020 <br>
        Aniello De Santo
    </i>
</div>

# Notebook 13: measuring string distances

This notebook discusses algorithms that measure the distance between two strings. The two covered here are **Hamming distance** and **Levenshtein distance**; many others exist and are widely used as well. Check [this link](https://www.joyofdata.de/blog/comparison-of-string-distance-algorithms/) for a comparison of other algorithms.


## Measuring distances

Similarity of strings is important for:
  * determining how natural languages are related;
  * seeing how close two gene sequences are;
  * detecting OCR errors, spell checking, and similar tasks.

However, similarity can be measured in many different ways.

## Hamming distance

**Hamming distance** is the number of indices at which two strings have **different** symbols. In the examples below, `.` marks a position where the symbols match and `-` a position where they differ.

    String 1: linguist
    String 2: language
    Distance: .-...--- : 4
    
    String 1: physics
    String 2: psychic
    Distance: .-.---- : 5
    
    String 1: abab
    String 2: baba
    Distance: ---- : 4
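The dot-and-dash rows in the examples above (`.` for a match, `-` for a mismatch) can be reproduced with a few lines of Python. This is only a minimal sketch; the helper name `mismatch_pattern` is hypothetical and assumes both strings have the same length.

```python
def mismatch_pattern(str1, str2):
    """Return "." for every index where the symbols agree and "-" otherwise."""
    # zip pairs up the symbols that sit at the same index in both strings.
    return "".join("." if a == b else "-" for a, b in zip(str1, str2))

for s1, s2 in [("linguist", "language"), ("physics", "psychic"), ("abab", "baba")]:
    print(s1, s2, mismatch_pattern(s1, s2))
```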

**Practice.** Implement a function measuring Hamming distance between two strings. For now, assume that only strings of the same length can be compared.

In [None]:
def hamming_distance(str1, str2):
    """ Implements Hamming distance between str1 and str2.
    
    Arguments:
      - str1 (str): some string;
      - str2 (str): another string.
      
    Outputs:
      - int: value of the Hamming distance between str1 and str2.
    """
    pass

Test your implementation below.

In [None]:
test_words = [("linguist", "language"), ("physics", "psychic"), ("abab", "baba")]

for pair in test_words:
    print(pair, "===>", hamming_distance(pair[0], pair[1]))
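For comparison with your own solution, here is one possible way to write the function, as a minimal sketch. The name `hamming_sketch` is hypothetical (chosen so that it does not overwrite `hamming_distance`), and the code assumes that both strings have the same length.

```python
def hamming_sketch(str1, str2):
    """One possible Hamming distance for two strings of equal length."""
    # Count the index positions at which the paired symbols differ.
    return sum(1 for a, b in zip(str1, str2) if a != b)

for pair in [("linguist", "language"), ("physics", "psychic"), ("abab", "baba")]:
    print(pair, "===>", hamming_sketch(pair[0], pair[1]))
```

Note that `zip` silently stops at the end of the shorter string, so without an extra check this sketch would ignore the unmatched tail of a longer string.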

We know that the items within `test_words` are always pairs. In such cases, Python lets us "unpack" the values of each item directly in the `for` statement.

In [None]:
test_words = [("linguist", "language"), ("physics", "psychic")]

for v, u in test_words:
    print((v, u), "===>", hamming_distance(v, u))

#### `assert` statement

Let us enhance the solution using _assertions_. Like raised errors, they flag unsatisfactory states of objects or programs. **Assertions** make explicit which assumptions the code relies on, so that the code proceeds only if those assumptions hold.

    assert conditional_expression
    
If the conditional expression evaluates to false, an `AssertionError` is raised.

In [None]:
def add_numbers(a, b):
    assert type(a) is int and type(b) is int
    print(a + b)
    
add_numbers(5, 3)

An error message can be added to the assertion error as well:

    assert conditional_expression, message

In [None]:
def add_numbers(a, b):
    assert type(a) is int and type(b) is int, "I only work with integers!"
    print(a + b)
    
add_numbers(5, 3)
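The calls above all satisfy the assertion, so nothing unusual happens. The sketch below shows what a failing assertion looks like; it is a variant of `add_numbers` that returns the sum instead of printing it (an assumption made here so the result can be inspected), and it catches the error only to display the message.

```python
def add_numbers(a, b):
    # The message after the comma becomes the text of the AssertionError.
    assert type(a) is int and type(b) is int, "I only work with integers!"
    return a + b

print(add_numbers(5, 3))

try:
    add_numbers(5, "3")          # "3" is a str, so the assertion fails
except AssertionError as error:
    print("Assertion failed:", error)
```

Keep in mind that Python skips `assert` statements when it is run with the `-O` option, so assertions are meant for documenting and checking assumptions during development rather than for validating user input.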

**Practice.** Now, in the implementation of the Hamming distance function, add an assertion checking that both strings are of the same length.

**Question&Practice.** What are the ways to extend Hamming distance to strings of non-equal length? Let's implement one of the possible extensions!
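One of several possible extensions, sketched here under the assumption that every position missing from the shorter string counts as a mismatch. The name `hamming_extended` is hypothetical; other choices (for example, refusing to compare or aligning the strings differently) are equally legitimate.

```python
def hamming_extended(str1, str2):
    """Hamming distance in which unmatched trailing positions count as mismatches."""
    # Mismatches over the positions that both strings have.
    mismatches = sum(1 for a, b in zip(str1, str2) if a != b)
    # Every extra position of the longer string adds one more mismatch.
    return mismatches + abs(len(str1) - len(str2))

print(hamming_extended("abab", "baba"))  # equal lengths: plain Hamming distance
print(hamming_extended("cat", "cats"))   # one unmatched trailing symbol
```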

## Edit distance (Levenshtein distance)

**Edit distance** between strings $A$ and $B$ is the minimum number of _edit operations_ that one needs to perform on the string $A$ to get the string $B$.

_Edit operations:_
  * **replace** ("aaa" -> "aba");
  * **add** ("aaa" -> "aaab");
  * **remove** ("aba" -> "aa").
  
For example, the edit distance between "anna" and "nana" is $2$.

1. remove the initial "a" from "anna" $\rightarrow$ "nna";
2. insert "a" after the initial "n" in "nna" $\rightarrow$ "nana".

Alternatively,

1. replace the initial "a" by "n", "anna" $\rightarrow$ "nnna";
2. replace the second position by "a", "nnna" $\rightarrow$ "nana".


Notice that the two input strings no longer need to be of the same length.
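The three edit operations and the two derivations of "nana" from "anna" can be replayed with string slicing. This is only an illustration; the helper names `replace_at`, `add_at` and `remove_at` are hypothetical and positions are counted from 0.

```python
def replace_at(s, i, symbol):
    """Replace the symbol at index i."""
    return s[:i] + symbol + s[i + 1:]

def add_at(s, i, symbol):
    """Insert a symbol so that it ends up at index i."""
    return s[:i] + symbol + s[i:]

def remove_at(s, i):
    """Remove the symbol at index i."""
    return s[:i] + s[i + 1:]

# Derivation 1: remove the initial "a", then add "a" after the first "n".
print(add_at(remove_at("anna", 0), 1, "a"))            # nana

# Derivation 2: replace index 0 by "n", then index 1 by "a".
print(replace_at(replace_at("anna", 0, "n"), 1, "a"))  # nana
```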

### An algorithm for finding the smallest edit distance between two strings

    Given:   two strings
    Example: str1 = "aba" and str2 = "bab"
    
    1. Construct the matrix M of the following shape:
    
            ""    a    b    a
        ""   .    .    .    .
        b    .    .    .    .
        a    .    .    .    .
        b    .    .    .    .
        
        Consider only the cells denoted as . as a part of the matrix.
        Here the rows follow str2 and the columns follow str1, so the
        matrix has len(str2) + 1 rows and len(str1) + 1 columns (the
        opposite layout gives the same distance).
        
        M[n][m] stands for the value of the cell in the n-th row and
        m-th column, counting from 0.
        
    2. Every cell will stand for the edit distance between the beginnings
       of str1 and str2, up to that cell's column and row.
       
       We can pre-fill some values already:
       
            ""    a    b    a
        ""   0    1    2    3
        b    1    .    .    .
        a    2    .    .    .
        b    3    .    .    .
        
        Indeed, a distance from "" to "" is 0, from "" to "a" is 1, from
        "" to "ab" is 2, and so on.
        
        M[0][0] = 0;
        M[0][3] = 3;
        M[2][0] = 2; etc.
        
    3. For every cell in each of the remaining rows, look at the character
       labeling its row and the character labeling its column.
       Are these two characters the same?
        
        3.1 Same -> copy the value from the cell A, where X is the 
            current cell.
            
             A    ...
            ...    X
            
            M[n][m] = M[n-1][m-1]
            
        3.2 Different -> put into the current cell X the minimum value of
            the cells A, B, C, plus 1.
            
            A     B
            C     X
            
            M[n][m] = min(M[n-1][m-1], M[n-1][m], M[n][m-1]) + 1
            
    4. Finally, the bottom-right cell contains the minimal edit distance
       between the strings str1 and str2.

**Example.** Consider the following table, which calculates the edit distance between the strings "aba" and "bab".

            ""    a    b    a
        ""   0    1    2    3
        b    1    1    1    2
        a    2    1    2    1
        b    3    2    1    2
        
The value of the bottom-right cell is $2$, and therefore the edit distance between "aba" and "bab" is $2$.

**Practice 1.** Extend the table above in order to calculate the edit distance between "abab" and "baba". Notice that the result is different from the one predicted by the Hamming distance!

**Practice 2.** Calculate the edit distance between strings "linguist" and "language".

### Implementation of edit distance

Let's implement the algorithm that finds the smallest edit distance, step by step.

In [None]:
def edit_distance(str1, str2):
    """
    This function implements the algorithm calculating
    the minimal edit distance between two strings.
    
    Arguments:
      -- str1 (str): some string;
      -- str2 (str): another string.
      
    Outputs:
      -- int: the smallest edit distance between
              str1 and str2.
    """
    pass

**Part 1.** Let us start implementing the algorithm by initializing an $n$ by $m$ matrix (a list of lists), where $n = \textrm{len(str1)} + 1$ and $m = \textrm{len(str2)} + 1$. As the default value of the cells, let us choose `None` so that we can see which cells are not initialized yet.
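A minimal sketch of this initialization step is given below. The helper name `make_matrix` is hypothetical. The point worth noticing is that every row must be a separate list: the shorter expression `[[None] * m] * n` would repeat one and the same row object $n$ times, so writing into one row would change all of them.

```python
def make_matrix(n_rows, n_cols, default=None):
    """Build a list of n_rows independent rows, each with n_cols cells."""
    # The comprehension creates a fresh inner list for every row.
    return [[default] * n_cols for _ in range(n_rows)]

str1, str2 = "aba", "bab"
M = make_matrix(len(str1) + 1, len(str2) + 1)

M[0][0] = 0   # only this one cell changes
for row in M:
    print(row)
```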

**Part 2.** Pre-fill the first row and the first column of that table.
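The pre-filling step could look as follows. Again, this is only a sketch with a hypothetical helper name, `prefilled_matrix`; it follows Part 1 in letting the rows correspond to `str1` and the columns to `str2`.

```python
def prefilled_matrix(str1, str2):
    """Matrix of None values with the first row and the first column filled in."""
    n, m = len(str1) + 1, len(str2) + 1
    M = [[None] * m for _ in range(n)]
    for i in range(n):
        M[i][0] = i   # turning the first i symbols of str1 into "" takes i removals
    for j in range(m):
        M[0][j] = j   # turning "" into the first j symbols of str2 takes j additions
    return M

for row in prefilled_matrix("aba", "bab"):
    print(row)
```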

**Part 3.** Now, iterate over every non-initialized cell of the matrix and fill it according to the rules discussed above:
  * `M[n][m] = M[n-1][m-1]` if the characters at the current positions of the two strings match;
  * `M[n][m] = min(M[n-1][m-1], M[n-1][m], M[n][m-1]) + 1` otherwise.
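Putting the three parts together, one possible complete implementation is sketched below. The names `edit_distance_matrix` and `edit_distance_sketch` are hypothetical, chosen so that they do not overwrite your own `edit_distance`. The loop variables `i` and `j` are used for the current row and column, because `n` and `m` already name the size of the matrix.

```python
def edit_distance_matrix(str1, str2):
    """Return the filled matrix; rows follow str1, columns follow str2."""
    n, m = len(str1) + 1, len(str2) + 1
    M = [[None] * m for _ in range(n)]            # Part 1: empty matrix
    for i in range(n):                            # Part 2: first column
        M[i][0] = i
    for j in range(m):                            # Part 2: first row
        M[0][j] = j
    for i in range(1, n):                         # Part 3: remaining cells
        for j in range(1, m):
            if str1[i - 1] == str2[j - 1]:
                M[i][j] = M[i - 1][j - 1]
            else:
                M[i][j] = min(M[i - 1][j - 1], M[i - 1][j], M[i][j - 1]) + 1
    return M

def edit_distance_sketch(str1, str2):
    """The edit distance is the value of the bottom-right cell."""
    return edit_distance_matrix(str1, str2)[-1][-1]

for row in edit_distance_matrix("aba", "bab"):
    print(row)
print(edit_distance_sketch("anna", "nana"))   # 2
```

Cell `M[i][j]` compares `str1[i - 1]` with `str2[j - 1]`, because row and column 0 are reserved for the empty string.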

Now, let's test the algorithm.

In [None]:
test_words = [("linguist", "language"), ("physics", "psychic"), ("abab", "baba")]

for v, u in test_words:
    print((v, u), "===>", edit_distance(v, u))

**Dynamic programming** (DP) is a method of solving a problem by breaking it into smaller sub-problems, solving each sub-problem once, saving the results, and then reusing those results.
The matrix-based approach to calculating the edit distance is a classic example of DP.
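The same idea can also be written top-down: a recursive function that asks for smaller sub-problems and remembers every answer it has already computed. The sketch below uses `functools.lru_cache` from the standard library as the memory; the name `edit_distance_memo` is hypothetical. It is meant as an illustration of DP rather than a replacement for the matrix version, and since it relies on recursion it is only suitable for fairly short strings.

```python
from functools import lru_cache

def edit_distance_memo(str1, str2):
    """Edit distance computed by memoized recursion."""

    @lru_cache(maxsize=None)
    def distance(i, j):
        # Distance between the first i symbols of str1 and the first j of str2.
        if i == 0:
            return j
        if j == 0:
            return i
        if str1[i - 1] == str2[j - 1]:
            return distance(i - 1, j - 1)
        return min(distance(i - 1, j - 1), distance(i - 1, j), distance(i, j - 1)) + 1

    return distance(len(str1), len(str2))

print(edit_distance_memo("aba", "bab"))   # 2
```

Each pair `(i, j)` plays the role of one cell of the matrix: it is solved once, cached, and reused whenever another sub-problem needs it.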

### Normalizing the values

However, Levenshtein distance does not always give intuitive results.

In [None]:
test_words = [("Eugenio", "Evgenij"), ("of", "in")]

for v, u in test_words:
    print((v, u), "===>", edit_distance(v, u))

**Normalization** is an adjustment of values so that they follow a common scale.

In the example above, both pairs get an edit distance of 2, even though "Eugenio" and "Evgenij" agree on five of their seven symbols while "of" and "in" have nothing in common. To avoid this problem, it makes sense to abstract away from the lengths of the two strings.
The "traditional" way to perform this normalization is to divide the raw edit distance score by the sum of the lengths of the two strings.

    normalized_edit_distance(str1, str2) = edit_distance(str1, str2) / (len(str1) + len(str2))

**Practice.** Implement normalized edit distance.
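One possible solution is sketched below. To keep the example self-contained it includes its own compact edit distance, which keeps only the previous row of the matrix instead of the whole table; both function names are hypothetical. Returning `0.0` for two empty strings is an assumption made here to avoid dividing by zero.

```python
def edit_distance_rows(str1, str2):
    """Edit distance that stores only the previous row of the matrix."""
    previous = list(range(len(str2) + 1))       # first row: 0, 1, 2, ...
    for i, a in enumerate(str1, start=1):
        current = [i]                           # first column of row i
        for j, b in enumerate(str2, start=1):
            if a == b:
                current.append(previous[j - 1])
            else:
                current.append(min(previous[j - 1], previous[j], current[j - 1]) + 1)
        previous = current
    return previous[-1]

def normalized_edit_distance_sketch(str1, str2):
    """Raw edit distance divided by the sum of the lengths of the two strings."""
    total_length = len(str1) + len(str2)
    if total_length == 0:
        return 0.0   # assumption: two empty strings are at distance 0
    return edit_distance_rows(str1, str2) / total_length

for v, u in [("Eugenio", "Evgenij"), ("of", "in")]:
    print((v, u), "===>", normalized_edit_distance_sketch(v, u))
```

With this normalization the pair ("of", "in") gets the score 0.5, while ("Eugenio", "Evgenij") gets 2/14, so the more similar pair now has the smaller distance.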