# Text Similarity Metrics

Exercise notebook

Course: Algorytmy Tekstowe at AGH University

## Preprocessing and vectorization

1. Preprocessing: Convert the text documents to lowercase and remove all punctuation marks (using regular expressions, for example).
2. Vocabulary creation: Create a vocabulary by taking all unique words from all text documents.
3. Word frequency vectors: Create two vectors, each representing the frequency of each word in the vocabulary in each text document.
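
As a rough illustration of the three steps above, the sketch below runs them on two tiny documents using only the standard library. The helper name `bag_of_words` and the sample sentences are hypothetical and not part of the exercise; sorting the vocabulary is just one possible way to fix the order of vector entries.

```python
import re
from collections import Counter

def bag_of_words(docs):
    # Step 1: lowercase and drop every character that is neither
    # a word character nor whitespace.
    tokenized = [re.sub(r"[^\w\s]", "", d.lower()).split() for d in docs]
    # Step 2: the vocabulary is the set of all unique words, sorted
    # so that every word gets a fixed position.
    vocab = sorted({w for words in tokenized for w in words})
    # Step 3: one frequency vector per document.
    counts = [Counter(words) for words in tokenized]
    vectors = [[c[w] for w in vocab] for c in counts]
    return vocab, vectors

vocab, vectors = bag_of_words(["A cat, a dog.", "A dog!"])
print(vocab)    # ['a', 'cat', 'dog']
print(vectors)  # [[2, 1, 1], [1, 0, 1]]
```

Both vectors have the same length because they are indexed by the shared vocabulary, which is what makes them comparable entry by entry.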

In [21]:
import re
from sortedcontainers import SortedSet

def preprocess(text: str) -> str:
    # Convert the text to lowercase.
    text = text.lower()
    # Remove all punctuation marks: drop every character that is
    # neither a word character nor whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    return text

def text_to_vec(docs: list[str]) -> list[list[int]]:
    # Tokenize every preprocessed document and collect the shared vocabulary.
    # SortedSet keeps the words ordered, so each word has a stable index.
    vocabulary = SortedSet()
    tokenized = []
    for doc in docs:
        words = preprocess(doc).split()
        tokenized.append(words)
        vocabulary.update(words)

    # Build one frequency vector per document, indexed by vocabulary position.
    freq_vecs = []
    for words in tokenized:
        vec = [0] * len(vocabulary)
        for word in words:
            vec[vocabulary.index(word)] += 1
        freq_vecs.append(vec)

    return freq_vecs

In [22]:
# Tests
text_a = "The quick brown fox jumped over the lazy dog."
text_b = "The lazy dog was jumped over by the quick brown fox."
vec_a, vec_b = text_to_vec([text_a, text_b])


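# The position of each count depends on the vocabulary order,
# so compare the sorted counts rather than the raw vectors.
assert sorted(vec_a) == sorted([1, 1, 1, 2, 1, 1, 1, 1, 0, 0])
assert sorted(vec_b) == sorted([1, 1, 1, 2, 1, 1, 1, 1, 1, 1])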

## Cosine similarity

$$
\begin{equation}
    \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|}= \frac{\sum\limits_{i=1}^{n} A_i B_i}{\sqrt{\sum\limits_{i=1}^{n} A_i^2} \sqrt{\sum\limits_{i=1}^{n} B_i^2}}
    \qquad\begin{aligned}
    &\text{where:} \\
    &\mathbf{A}\text{ and }\mathbf{B} \text{ are the two vectors being compared}\\
    &n \text{ is the dimensionality of the vectors}\\
    &\theta \text{ represents the angle between two vectors } \mathbf{A} \text{ and } \mathbf{B} \text{ in a high-dimensional space}
    \end{aligned}
\end{equation}
$$

The dot product of $\mathbf{A}$ and $\mathbf{B}$ is divided by the product of their Euclidean lengths, which normalizes the result to the range [-1, 1]. A value of 1 means the vectors point in the same direction, 0 means they are orthogonal, and -1 means they point in opposite directions. Word-frequency vectors have no negative components, so for texts the result always lies in [0, 1].
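
To make the ends of that range concrete, here is a minimal pure-Python sketch of the formula applied to three pairs of two-dimensional vectors. The function name `cosine` and the sample vectors are illustrative assumptions, not part of the exercise.

```python
import math

def cosine(a, b):
    # Dot product divided by the product of the Euclidean norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine([1, 2], [2, 4]))   # same direction: approximately 1.0
print(cosine([1, 0], [0, 1]))   # orthogonal: 0.0
print(cosine([1, 0], [-1, 0]))  # opposite directions: -1.0
```

The first pair shows that scaling a vector does not change the result: only the direction matters, not the length.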


In [16]:
import numpy as np

def cosine_similarity(text_a: str, text_b: str) -> float:
    # Vectorize both texts over a shared vocabulary.
    vec_a, vec_b = text_to_vec([text_a, text_b])
    vec_a, vec_b = np.array(vec_a), np.array(vec_b)

    # Dot product divided by the product of the Euclidean norms.
    # Undefined when either text contains no words (zero norm).
    return float(vec_a.dot(vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))

In [23]:
# Tests
dist = cosine_similarity(text_a, text_b)
assert(abs(dist - 0.91986) < 0.0001)

## Dice coefficient / Sørensen-Dice Index

$$
\begin{equation}
    \text{Dice}(A, B) = \frac{2 |A \cap B|}{|A| + |B|} 
    \qquad\begin{aligned}
    &\text{where:} \\
    &A \text{ and } B \text{ represent the two sets being compared} \\
    &|A| \text{ and } |B| \text{ represent the cardinality (number of elements) of the sets} \\
    &\text{and } |A \cap B| \text{ represents the size of the intersection of the two sets}
    \end{aligned}
\end{equation}
$$
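
A small worked example may help here. The sketch below applies the formula to two hand-picked sets; the function name `dice_sets` and the sample sets are hypothetical.

```python
def dice_sets(a, b):
    # 2 * |A ∩ B| / (|A| + |B|)
    return 2 * len(a & b) / (len(a) + len(b))

A = {"red", "green", "blue"}
B = {"green", "blue", "black", "white"}

# The intersection is {"green", "blue"}, so the value is 2 * 2 / (3 + 4).
print(dice_sets(A, B))  # 0.5714285714285714
```

Identical non-empty sets give 1 and disjoint sets give 0. The formula is undefined when both sets are empty, because the denominator is then 0.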


In [34]:
def dice(text_a: str, text_b: str) -> float:
    # Compare the sets of unique words in each text.
    set_a = set(preprocess(text_a).split())
    set_b = set(preprocess(text_b).split())
    # 2 * |A ∩ B| / (|A| + |B|); undefined when both texts are empty.
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))

dice(text_a, text_b)

0.8888888888888888

In [35]:
# Tests
dist = dice(text_a, text_b)
assert(abs(dist - 0.88888) < 0.0001)

## Euclidean distance

$$
\begin{equation}
    d(x,y) = \sqrt{\sum_{i=1}^{n}(x_i-y_i)^2}
    \qquad\begin{aligned}
    &\text{where:} \\
    &d(x,y) \text{ is the Euclidean distance} \\
    &x_i, y_i \text{ are the values of the i-th dimension of vectors } x \text{ and } y \\
    &n \text{ is the number of dimensions in the vectors}
    \end{aligned}
\end{equation}
$$
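
As a quick sanity check of the formula, the sketch below computes the distance for the classic 3-4-5 right triangle and compares it with `math.dist` from the standard library (available since Python 3.8). The function name `euclid` is a hypothetical choice.

```python
import math

def euclid(x, y):
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

print(euclid([0, 0], [3, 4]))     # 5.0
print(math.dist([0, 0], [3, 4]))  # 5.0
```

Unlike cosine similarity, this is a distance: 0 means the vectors are identical, and larger values mean they are further apart.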

In [36]:
def euclidean_distance(text_a: str, text_b: str) -> float:
    # Vectorize both texts over a shared vocabulary.
    vec_a, vec_b = text_to_vec([text_a, text_b])
    vec_a, vec_b = np.array(vec_a), np.array(vec_b)

    # The Euclidean distance is the L2 norm of the difference vector.
    return float(np.linalg.norm(vec_a - vec_b))

In [37]:
# Tests

dist = euclidean_distance(text_a, text_b)
assert(abs(dist - 1.4142135) < 0.0001)

## LCS - Longest Common Subsequence

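The longest sequence of elements that occurs in both input sequences in the same relative order. The elements do not have to be adjacent, which distinguishes it from the longest common substring.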
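
Subsequences and substrings are easy to confuse: a subsequence keeps the relative order but may skip elements, whereas a substring must be contiguous. The sketch below contrasts the two on a pair of short strings. Both function names are hypothetical and independent of the notebook's own helpers.

```python
def lcs_length(s, t):
    # d[i][j] = LCS length of the prefixes s[:i] and t[:j].
    d = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                d[i][j] = d[i - 1][j - 1] + 1
            else:
                # On a mismatch, carry over the best result so far.
                d[i][j] = max(d[i - 1][j], d[i][j - 1])
    return d[len(s)][len(t)]

def longest_common_substring_length(s, t):
    # d[i][j] = length of the common run ending at s[i-1] and t[j-1].
    d = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    best = 0
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                d[i][j] = d[i - 1][j - 1] + 1
                best = max(best, d[i][j])
            # On a mismatch the run is broken, so d[i][j] stays 0.
    return best

print(lcs_length("abcde", "axcxe"))                       # 3 ("ace")
print(longest_common_substring_length("abcde", "axcxe"))  # 1
```

The only difference between the two tables is what happens on a mismatch: the subsequence version carries the best value forward, while the substring version resets to 0.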

In [38]:
# LCS matrix: a table filled in dynamically with the length of the
# longest common subsequence between prefixes of "s" and "t"
def lcs_matrix(s, t, key=lambda x:x):
    m, n = len(s), len(t)
    # initialize the matrix
    d = [[0] * (n + 1) for i in range(m + 1)]
    # fill in the values
    for j in range(1,n+1):
        for i in range(1,m+1):
            if key(s[i-1]) == key(t[j-1]):
                d[i][j] = d[i-1][j-1] + 1
            else:
                d[i][j] = max(d[i][j-1], d[i-1][j])
    return d

# length of the longest common subsequence, O(nm)
def lcs_len(s, t, key=lambda x:x):
    m, n = len(s), len(t)
    d = lcs_matrix(s, t, key)
    return d[m][n]

In [41]:
from typing import Any, Sequence

def lcs(seq_a: Sequence[Any], seq_b: Sequence[Any]) -> int:
    # Length of the longest common subsequence.
    # Works on any sequences, not only on strings.
    return lcs_len(seq_a, seq_b)

def word_lcs(text_a: str, text_b: str) -> int:
    # Split the preprocessed texts into words so that the LCS
    # is computed over whole words, not over characters.
    seq_a = preprocess(text_a).split()
    seq_b = preprocess(text_b).split()
    return lcs(seq_a, seq_b)


In [42]:
# Tests
assert lcs("banana", "ananas") == 5
assert word_lcs(text_a, text_b) == 4

## Levenshtein distance

The minimum number of single-element edits needed to turn sequence A into sequence B.

Available operations:

* Replace element
* Remove element
* Add element
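
To see the three operations at work, here is a minimal sketch that keeps only two rows of the dynamic programming table at a time. The function name `edit_distance` is a hypothetical choice, and this is one of several equivalent ways to organize the computation.

```python
def edit_distance(a, b):
    # prev[j] = distance between the already processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into an empty sequence takes i removals
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # remove x
                            curr[j - 1] + 1,      # add y
                            prev[j - 1] + cost))  # replace x with y (or keep)
        prev = curr
    return prev[-1]

# kitten -> sitten -> sittin -> sitting: two replacements and one addition.
print(edit_distance("kitten", "sitting"))            # 3
# Works on any sequence, for example lists of words.
print(edit_distance(["a", "b", "c"], ["a", "c"]))    # 1
```

Because the function only compares elements with `==`, it works on lists of words as well as on strings.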

In [45]:
# edit distance matrix: a table filled in dynamically with the
# edit distance between prefixes of "s" and "t"
def levenshtein_distance_matrix(s, t):
    m, n = len(s), len(t)
    # initialize the matrix
    d = [[0] * (n + 1) for i in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    # fill in the matrix
    for j in range(1, n + 1):
        for i in range(1, m + 1):
            if s[i - 1] == t[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                d[i][j] = min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1]) + 1
    return d

# edit distance between "s" and "t"
def levenshtein_distance(s, t):
    m, n = len(s), len(t)
    return levenshtein_distance_matrix(s,t)[m][n]

In [46]:

def levenshtein(seq_a: Sequence[Any], seq_b: Sequence[Any]) -> int:
    # Levenshtein distance between two sequences.
    # Works on any sequences, not only on strings.
    return levenshtein_distance(seq_a, seq_b)


def word_levenshtein(text_a: str, text_b: str) -> int:
    # Split the preprocessed texts into words so that the distance
    # is computed over whole words, not over characters.
    seq_a = preprocess(text_a).split()
    seq_b = preprocess(text_b).split()

    return levenshtein(seq_a, seq_b)


In [47]:
# Tests
assert levenshtein("banana", "ananas") == 2
assert word_levenshtein(text_a, text_b) == 7