# Text Similarity Metrics

Exercise notebook

Course: Algorytmy Tekstowe at AGH University

# 1. Implement at least 3 "metrics" from the following: cosine, LCS, Dice, Euclidean, Levenshtein.

## Preprocessing and vectorization

1. Preprocessing: Convert the text documents to lowercase and remove all punctuation marks (using regular expressions, for example).
2. Vocabulary creation: Create a vocabulary by taking all unique words from all text documents.
3. Word frequency vectors: Create two vectors, each representing the frequency of each word in the vocabulary in each text document.

In [1]:
import re
from collections import Counter

def preprocess(text: str) -> str:
    # Convert the text to lowercase
    text = text.lower()
    
    # Remove all punctuation marks
    text = re.sub(r'[^\w\s]', '', text)
    
    return text

def text_to_vec(docs: list[str]) -> list[list[int]]:
    # Create vocabulary
    vocab = set()
    for doc in docs:
        doc = preprocess(doc)
        words = doc.split()
        vocab.update(words)
    
    # Create word frequency vectors
    freq_vecs = []
    for doc in docs:
        doc = preprocess(doc)
        words = doc.split()
        word_counts = Counter(words)
        freq_vec = [word_counts[word] for word in vocab]
        freq_vecs.append(freq_vec)
    
    return freq_vecs

In [2]:
# Tests
text_a = "The quick brown fox jumped over the lazy dog."
text_b = "The lazy dog was jumped over by the quick brown fox."
vec_a, vec_b = text_to_vec([text_a, text_b])


# The vocabulary is a set, so component order is arbitrary; compare sorted counts
assert sorted(vec_a) == sorted([1, 1, 1, 2, 1, 1, 1, 1, 0, 0])
assert sorted(vec_b) == sorted([1, 1, 1, 2, 1, 1, 1, 1, 1, 1])
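
Because `vocab` is a `set`, the order of the vector components can differ between runs, which is why the tests above compare the vectors without regard to order. A minimal sketch, assuming a hypothetical variant `text_to_vec_sorted` (not part of the notebook) that sorts the vocabulary so the vectors are reproducible:

```python
import re
from collections import Counter

def preprocess(text: str) -> str:
    # Lowercase and strip punctuation, mirroring the notebook's preprocess()
    return re.sub(r'[^\w\s]', '', text.lower())

def text_to_vec_sorted(docs: list[str]) -> tuple[list[str], list[list[int]]]:
    # Hypothetical variant: a sorted vocabulary fixes the component order
    counts = [Counter(preprocess(doc).split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[word] for word in vocab] for c in counts]

vocab, (vec_a, vec_b) = text_to_vec_sorted(["The cat sat.", "The cat and the dog."])
print(vocab)  # ['and', 'cat', 'dog', 'sat', 'the']
print(vec_a)  # [0, 1, 0, 1, 1]
print(vec_b)  # [1, 1, 1, 0, 2]
```

Returning the vocabulary alongside the vectors also makes it possible to see which word each component counts.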

## Cosine similarity

$$
\begin{equation}
    \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|}= \frac{\sum\limits_{i=1}^{n} A_i B_i}{\sqrt{\sum\limits_{i=1}^{n} A_i^2} \sqrt{\sum\limits_{i=1}^{n} B_i^2}}
    \qquad\begin{aligned}
    &\text{where:} \\
    &\mathbf{A}\text{ and }\mathbf{B} \text{ are the two vectors being compared}\\
    &n \text{ is the dimensionality of the vectors}\\
    &\theta \text{ represents the angle between two vectors } \mathbf{A} \text{ and } \mathbf{B} \text{ in a high-dimensional space}
    \end{aligned}
\end{equation}
$$

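The dot product of $\mathbf{A}$ and $\mathbf{B}$ is divided by the product of their Euclidean lengths, which normalizes the result to the range [-1, 1]. A value of 1 means the vectors point in the same direction (the same relative word frequencies, even if the documents differ in length), and -1 means they point in opposite directions. Word-frequency vectors have no negative components, so here the result lies in [0, 1], and 0 means the two documents share no words.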


In [3]:
import math

def cosine_similarity(text_a: str, text_b: str) -> float:
    freq_vecs = text_to_vec([text_a, text_b])
    dot_product = sum(a * b for a, b in zip(freq_vecs[0], freq_vecs[1]))
    
    norm_a = math.sqrt(sum(a ** 2 for a in freq_vecs[0]))
    norm_b = math.sqrt(sum(b ** 2 for b in freq_vecs[1]))
    
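    # Guard against empty texts, whose zero vectors have no direction
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot_product / (norm_a * norm_b)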

In [4]:
# Tests
dist = cosine_similarity(text_a, text_b)
assert(abs(dist - 0.91986) < 0.0001)
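
To illustrate the two ends of the range for word counts, here is a small sketch. It assumes a hypothetical helper `cosine` that works on pre-split word lists rather than the notebook's `cosine_similarity`, so that the block runs on its own:

```python
import math
from collections import Counter

def cosine(words_a: list[str], words_b: list[str]) -> float:
    # Hypothetical helper: cosine similarity of two bags of words
    a, b = Counter(words_a), Counter(words_b)
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0

disjoint = cosine(["red", "fox"], ["blue", "dog"])
same_mix = cosine(["red", "fox"], ["red", "fox", "red", "fox"])
print(disjoint)  # 0.0: no shared words
print(same_mix)  # approximately 1.0: same proportions, different length
```

The second pair shows that a score of 1 does not require identical documents, only identical relative frequencies.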

## Dice coefficient / Sørensen-Dice Index

$$
\begin{equation}
    \text{Dice}(A, B) = \frac{2 |A \cap B|}{|A| + |B|} 
    \qquad\begin{aligned}
    &\text{where:} \\
    &A \text{ and } B \text{ represent the two sets being compared} \\
    &|A| \text{ and } |B| \text{ represent the cardinality (number of elements) of the sets} \\
    &\text{and } |A \cap B| \text{ represents the size of the intersection of the two sets}
    \end{aligned}
\end{equation}
$$


In [5]:
def dice(text_a: str, text_b: str) -> float:
    text_a = preprocess(text_a)
    text_b = preprocess(text_b)
    
    set_a = set(text_a.split())
    set_b = set(text_b.split())
    
    intersection_ = len(set_a.intersection(set_b))
    # Denominator is |A| + |B| (the sum of the set sizes), not the size of the union
    total_ = len(set_a) + len(set_b)

    return 2 * intersection_ / total_

dice(text_a, text_b)

0.8888888888888888

In [6]:
# Tests
dist = dice(text_a, text_b)
assert(abs(dist - 0.88888) < 0.0001)
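
The expected value in the test can be checked by hand. After preprocessing, `text_a` contains 8 distinct words and `text_b` contains 10, and all 8 words of `text_a` also occur in `text_b`:

$$
\begin{equation}
    \text{Dice}(A, B) = \frac{2 \cdot 8}{8 + 10} = \frac{16}{18} \approx 0.8889
\end{equation}
$$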

## Euclidean distance

$$
\begin{equation}
    d(x,y) = \sqrt{\sum_{i=1}^{n}(x_i-y_i)^2}
    \qquad\begin{aligned}
    &\text{where:} \\
    &d(x,y) \text{ is the Euclidean distance} \\
    &x_i, y_i \text{ are the values of the i-th dimension of vectors } x \text{ and } y \\
    &n \text{ is the number of dimensions in the vectors}
    \end{aligned}
\end{equation}
$$

In [7]:
def euclidean_distance(text_a: str, text_b: str) -> float:
    x, y = text_to_vec([text_a, text_b])

    # Sum the squared component-wise differences, then take the square root
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

In [8]:
# Tests

dist = euclidean_distance(text_a, text_b)
assert(abs(dist - 1.4142135) < 0.0001)
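
The expected value follows from the two frequency vectors: they agree on every word except "was" and "by", which occur once in `text_b` and not at all in `text_a`. Only those two components contribute to the sum:

$$
\begin{equation}
    d(x, y) = \sqrt{(0 - 1)^2 + (0 - 1)^2} = \sqrt{2} \approx 1.4142
\end{equation}
$$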

## LCS - Longest Common Subsequence

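The longest sequence of elements that appears in both sequences in the same relative order, though not necessarily contiguously. This differs from the longest common substring, which must be contiguous. The function returns the length of that subsequence, so larger values mean more similar inputs.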

In [9]:
from typing import Any, Sequence

def lcs(seq_a: Sequence[Any], seq_b: Sequence[Any]) -> int:
    n = len(seq_a)
    m = len(seq_b)
    dp = [[0] * (m+1) for _ in range(n+1)]
    for i in range(n):
        for j in range(m):
            if seq_a[i] == seq_b[j]:
                dp[i+1][j+1] = dp[i][j] + 1
            else:
                dp[i+1][j+1] = max(dp[i+1][j], dp[i][j+1])
    return dp[-1][-1]

def word_lcs(text_a: str, text_b: str) -> int:
    # Split the texts into words
    seq_a = text_a.split()
    seq_b = text_b.split()

    return lcs(seq_a, seq_b)


In [10]:
# Tests
assert lcs("banana", "ananas") == 5
assert word_lcs(text_a, text_b) == 4
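
For contrast, the contiguous variant (longest common substring) needs only a small change to the same table: a mismatch resets the run to 0 instead of carrying the best value forward. A minimal sketch, assuming a hypothetical function name `longest_common_substring` that is not part of the notebook:

```python
from typing import Any, Sequence

def longest_common_substring(seq_a: Sequence[Any], seq_b: Sequence[Any]) -> int:
    # dp[i+1][j+1] = length of the common run ending at seq_a[i] and seq_b[j]
    n, m = len(seq_a), len(seq_b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(n):
        for j in range(m):
            if seq_a[i] == seq_b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
                best = max(best, dp[i + 1][j + 1])
            # On a mismatch the run is broken, so the cell stays 0
    return best

print(longest_common_substring("banana", "ananas"))  # 5 ("anana")
print(longest_common_substring("abcde", "axcxe"))    # 1, while the subsequence "ace" has length 3
```

The two notions agree on "banana" and "ananas" but diverge on the second pair, where the common elements are not adjacent.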

# 2. Implement at least 1 method of evaluating clustering quality (e.g. the Davies-Bouldin index).

# Davies-Bouldin

#### Cluster centroid ####

The mean position of all points belonging to the cluster.
For a cluster of $n$ points in $d$ dimensions, where $ \mathbf{x}_i $ denotes the $i$-th point in the cluster:

$  c = \frac{1}{n}\sum_{i=1}^{n} \mathbf{x}_i $

#### Distance between centroids ####

The Euclidean distance between the centroids of two different clusters.
For clusters $ C_i $ and $ C_j $ with centroids $ c_i $ and $ c_j $:

$  \Delta_{ij} = \| c_{i} - c_{j} \|_{2} $

#### Intra-cluster distance ####

The mean distance of the points in a cluster from its centroid. For cluster $ C_i $ with $ n_i $ points and centroid $ c_i $:

$ s_{i} = \frac{1}{n_{i}}\sum_{x \in C_{i}} \| x - c_{i} \|_{2} $

#### Davies-Bouldin index ####

$ R_{i} = \max_{j \neq i} \frac{s_{i} + s_{j}}{ \Delta_{ij}} $


That is, $ j $ ranges over the clusters other than $ i $, and the one for which $ \frac{s_i+s_j}{\Delta_{ij}} $ is largest is taken.
The final Davies-Bouldin index is the mean of $ R_i $ over all clusters. Lower values indicate compact, well-separated clusters.



In [11]:
import numpy as np

def get_cluster_matrices(data_points, labels):
    k = max(labels) + 1
    cluster_matrices = [[] for _ in range(k)]

    for i, label in enumerate(labels):
        cluster_matrices[label].append(data_points[i])

    return [np.vstack(cluster_matrices[i]) for i in range(k)]

def get_centroids(cluster_matrices):
    return [np.mean(matrix, axis=0) for matrix in cluster_matrices]


def get_avg_distances(cluster_matrices, centroids):
    avg_distances = []
    for matrix, centroid in zip(cluster_matrices, centroids):
        distance = np.linalg.norm(matrix - centroid, axis=1)
        avg_distance = np.mean(distance)
        avg_distances.append(avg_distance)
    return avg_distances

def davies_bouldin(data_points, labels):
    cluster_matrices = get_cluster_matrices(data_points, labels)
    centroids = get_centroids(cluster_matrices)
    avg_distance = get_avg_distances(cluster_matrices, centroids)

    k = len(centroids)
    R = np.zeros((k, k))
    
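    # Fill the symmetric matrix of (s_i + s_j) / Delta_ij; the diagonal stays 0
    for i in range(k):
        for j in range(i + 1, k):
            dist = avg_distance[i] + avg_distance[j]
            delta = np.linalg.norm(centroids[i] - centroids[j])
            R[i, j] = dist / delta
            R[j, i] = R[i, j]

    # R_i is the row maximum; the index is the mean of R_i over all clusters
    return np.mean(np.max(R, axis=1))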
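
The definitions can be checked by hand on a tiny example. The sketch below is a dependency-free restatement under a hypothetical name, `davies_bouldin_toy`, not the notebook's NumPy implementation, and it assumes at least two clusters with distinct centroids:

```python
import math

def davies_bouldin_toy(points: list[tuple[float, ...]], labels: list[int]) -> float:
    # Group points by cluster label
    clusters: dict[int, list[tuple[float, ...]]] = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    # Centroid c_i and mean distance to the centroid s_i for each cluster
    centroids, scatter = {}, {}
    for label, members in clusters.items():
        centroid = tuple(sum(coords) / len(members) for coords in zip(*members))
        centroids[label] = centroid
        scatter[label] = sum(math.dist(p, centroid) for p in members) / len(members)

    # R_i = max over j != i of (s_i + s_j) / Delta_ij, then average over clusters
    r_values = []
    for i in clusters:
        r_values.append(max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters if j != i
        ))
    return sum(r_values) / len(r_values)

points = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]
labels = [0, 0, 1, 1]
score = davies_bouldin_toy(points, labels)
print(score)  # 0.5
```

Here the centroids are (0, 1) and (4, 1), each cluster has $ s_i = 1 $, and $ \Delta_{01} = 4 $, so $ R_0 = R_1 = (1 + 1) / 4 = 0.5 $.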

# 3. Create a stoplist of the most frequent words and apply it as pre-processing for the names. The clustering algorithms should run in two variants: with and without pre-processing.

In [12]:
def stoplist(text, frequency=200):
    # Count every word in a single pass with Counter (imported in the first cell)
    counted = Counter(text.split())
    return {word for word, count in counted.items() if count >= frequency}
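
A usage sketch on made-up input, with the threshold lowered to 2 so that a short string produces a non-empty stoplist. The function is restated here only so the block runs on its own; the sample strings are illustrative and are not taken from `lines.txt`:

```python
from collections import Counter

def stoplist(text: str, frequency: int = 200) -> set[str]:
    # Words occurring at least `frequency` times form the stoplist
    counts = Counter(text.split())
    return {word for word, count in counts.items() if count >= frequency}

sample = "LTD STREET LTD ROAD LTD STREET"
stop_words = stoplist(sample, frequency=2)
print(stop_words)  # {'LTD', 'STREET'} (set order may vary)

# Applying the stoplist to one line, as the preprocessing variant does
line = "FMG SHIPPING LTD BUMAZHNAYA STREET"
filtered = " ".join(w for w in line.split() if w not in stop_words)
print(filtered)  # FMG SHIPPING BUMAZHNAYA
```

Since the text is split on whitespace only, tokens such as `LTD` and `LTD.` are counted separately.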

# 4. Cluster the contents of the attached file (lines.txt) using the metrics implemented in point 1. Each line is a company's postal address; different spellings of the same address should end up in the same cluster.

In [50]:
from sklearn.cluster import KMeans

def get_res(metric, preproc):
    # Load the data from lines.txt
    with open('lines.txt', 'r') as f:
        lines = [line.strip() for line in f]
    
    # Optional preprocessing: drop stoplist words, with the stoplist built from the whole file
    if preproc:
        stop_words = stoplist(' '.join(lines))
        lines = [' '.join(word for word in line.split() if word not in stop_words) for line in lines]

    # Build the feature matrix: row i holds the metric values of line i against every other line
    n = len(lines)
    dist_matrix = np.zeros((n, n))
    for i in range(n):
        for j in range(i+1, n):
            dist_matrix[i, j] = metric(lines[i], lines[j])
            dist_matrix[j, i] = dist_matrix[i, j]

    # Cluster the rows with k-means
    k = 10
    kmeans = KMeans(n_clusters=k, init='k-means++', random_state=42)
    kmeans.fit(dist_matrix)

    # Print the first 5 lines of each cluster, truncated to 100 characters
    for i in range(k):
        cluster_indices = np.where(kmeans.labels_ == i)[0]
        cluster_lines = [lines[idx] for idx in cluster_indices]
        print(f'Cluster {i+1}:')
        for line in cluster_lines[:5]:
            print(line[:100])
        print()
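
The matrix-building step can be factored into a helper so that it can be tried without `lines.txt`. This is a sketch under assumptions: `pairwise_matrix` and the stand-in metric `word_overlap` are hypothetical names, and the demo strings are made up. Unlike the loop in `get_res`, the helper also fills the diagonal with each line's score against itself, since for similarity measures such as cosine or Dice a line is maximally similar to itself rather than scoring 0.

```python
from typing import Callable

def pairwise_matrix(lines: list[str], metric: Callable[[str, str], float]) -> list[list[float]]:
    # Hypothetical helper: symmetric matrix of metric values, diagonal included
    n = len(lines)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            matrix[i][j] = matrix[j][i] = metric(lines[i], lines[j])
    return matrix

def word_overlap(a: str, b: str) -> float:
    # Stand-in metric for the demo: Dice coefficient over whitespace-split words
    set_a, set_b = set(a.split()), set(b.split())
    total = len(set_a) + len(set_b)
    return 2 * len(set_a & set_b) / total if total else 0.0

demo = ["FMG SHIPPING LTD", "FMG SHIPPING AND FORWARDING LTD", "MEAT TRADE COMPANY"]
matrix = pairwise_matrix(demo, word_overlap)
for row in matrix:
    print([round(value, 2) for value in row])
# [1.0, 0.75, 0.0]
# [0.75, 1.0, 0.0]
# [0.0, 0.0, 1.0]
```

The nested list can be converted with `np.array(matrix)` before being passed to `KMeans.fit`.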

## Cosine

### with preprocessing

In [51]:
get_res(cosine_similarity, True)

Cluster 1:
"MEAT TRADE COMPANY "ST.PETERSBURG"LTD. 191015 SAINT-PETERSBURG, SHPALERNAYA STREET, 51 RUSSIAN FEDE
"MEAT TRADE COMPANY "ST.PETERSBURG"LTD. 191015 SAINT-PETERSBURG, SHPALERNAYA STREET, 51 RUSSIAN FEDE
"MEAT TRADE COMPANY "ST.PETERSBURG"LTD 191015 SAINT-PETERSBURG, SHPALERNAYA STREET, 51 RUSSIAN FEDER
"MEAT TRADE COMPANY "ST.PETERSBURG"LTD 191015 SAINT-PETERSBURG, SHPALERNAYA STREET, 51 RUSSIAN FEDER
"MEAT TRADE COMPANY "ST.PETERSBURG"LTD 191015 SAINT-PETERSBURG, SHPALERNAYA STREET, 51 RUSSIAN FEDER

Cluster 2:
"FMG SHIPPING AND FORWARDING, LTD."190020, SAINT PETERSBURG,LIFLYANDSKAYA STR., 6,LITERA "A",BUILDIN
"FMG SHIPPING AND FORWARDING, LTD."190020 ST.PETERSBURG, RUSSIA BUMAZHNAYA STR., 18, OFF. 310
1/FMG SHIPPING AND FORWARDING LTD 190020, SAINT PETERSBURG, LIFLYANDSKAYA STR., 6, LITERA "A",
1/FMG SHIPPING AND FORWARDING LTD190020, SAINT PETERSBURG, LIFLYANDSKAYA STR., 6, LITERA "A ",+++
1/FMG SHIPPING AND FORWARDING LTD 190020, SAINT PETERSBURG, LIFLYANDSKAYA STR., 6, L

### without preprocessing

In [52]:
get_res(cosine_similarity, False)

Cluster 1:
"MEAT TRADE COMPANY "ST.PETERSBURG"LTD.  191015 SAINT-PETERSBURG, SHPALERNAYA STREET, 51 RUSSIAN FED
"MEAT TRADE COMPANY "ST.PETERSBURG"LTD. 191015 SAINT-PETERSBURG,  SHPALERNAYA STREET, 51 RUSSIAN FED
"MEAT TRADE COMPANY "ST.PETERSBURG"LTD 191015 SAINT-PETERSBURG, SHPALERNAYA STREET, 51 RUSSIAN FEDER
"MEAT TRADE COMPANY "ST.PETERSBURG"LTD 191015 SAINT-PETERSBURG, SHPALERNAYA STREET, 51 RUSSIAN FEDER
"MEAT TRADE COMPANY "ST.PETERSBURG"LTD 191015 SAINT-PETERSBURG, SHPALERNAYA STREET, 51 RUSSIAN FEDER

Cluster 2:
"FMG SHIPPING AND FORWARDING, LTD."190020, SAINT PETERSBURG,LIFLYANDSKAYA STR., 6,LITERA "A",BUILDIN
"FMG SHIPPING AND FORWARDING, LTD."190020 ST.PETERSBURG, RUSSIA BUMAZHNAYA STR., 18, OFF. 310
1/FMG SHIPPING AND FORWARDING LTD 190020, SAINT PETERSBURG, LIFLYANDSKAYA STR., 6, LITERA "A",
1/FMG SHIPPING AND FORWARDING  LTD190020, SAINT PETERSBURG, LIFLYANDSKAYA STR., 6, LITERA "A ",+++
1/FMG SHIPPING AND FORWARDING LTD 190020, SAINT PETERSBURG, LIFLYANDSKAYA STR., 6, 

## Dice

### with preprocessing

In [53]:
get_res(dice, True)

Cluster 1:
"TC ""UNOTRANS"" LTD" "190020 ST.PETERSBURG," "BUMAZHNAYA STR.9, K.1,OFFICE 305" "TEL/FAX: +7 812 44
"TC"UNOTRANS"LTD 190020 ST.PETERSBURG,BUMAZHNAYA STR.9,K.1,OFF.305 TEL/FAX:+7 812 445 28 43
"TC"UNOTRANS"LTD 190020 ST.PETERSBURG, BUMAZHNAYA STR.9,K.1,OFF.305 TEL/FAX:+7 812 445 28 43
"TC "UNOTRANS" LTD 190020 ST.PETERSBURG, BUMAZHNAYA STR.9, K.1, OFF. 305 TEL/FAX: +7 812 445 28 43
"TC "UNOTRANS" LTD 190020 ST.PETERSBURG, BUMAZHNAYA STR.9, K.1, OFFICE305 TEL/FAX: +7 812 445 28 43

Cluster 2:
"MEAT TRADE COMPANY "ST.PETERSBURG"LTD. 191015 SAINT-PETERSBURG, SHPALERNAYA STREET, 51 RUSSIAN FEDE
"MEAT TRADE COMPANY "ST.PETERSBURG"LTD. 191015 SAINT-PETERSBURG, SHPALERNAYA STREET, 51 RUSSIAN FEDE
"MEAT TRADE COMPANY "ST.PETERSBURG"LTD 191015 SAINT-PETERSBURG, SHPALERNAYA STREET, 51 RUSSIAN FEDER
"MEAT TRADE COMPANY "ST.PETERSBURG"LTD 191015 SAINT-PETERSBURG, SHPALERNAYA STREET, 51 RUSSIAN FEDER
"MEAT TRADE COMPANY "ST.PETERSBURG"LTD 191015 SAINT-PETERSBURG, SHPALERNAYA STREET, 51 R

### without preprocessing

In [54]:
get_res(dice, False)

Cluster 1:
"TC ""UNOTRANS"" LTD" "190020 ST.PETERSBURG," "BUMAZHNAYA STR.9, K.1,OFFICE 305" "TEL/FAX: +7 812 44
"TC"UNOTRANS"LTD 190020 ST.PETERSBURG,BUMAZHNAYA STR.9,K.1,OFF.305 TEL/FAX:+7 812 445 28 43
"TC"UNOTRANS"LTD 190020 ST.PETERSBURG, BUMAZHNAYA STR.9,K.1,OFF.305 TEL/FAX:+7 812 445 28 43
"TC "UNOTRANS" LTD 190020 ST.PETERSBURG, BUMAZHNAYA STR.9, K.1, OFF. 305 TEL/FAX: +7 812 445 28 43
"TC "UNOTRANS" LTD 190020 ST.PETERSBURG, BUMAZHNAYA STR.9, K.1, OFFICE305 TEL/FAX: +7 812 445 28 43

Cluster 2:
"MEAT TRADE COMPANY "ST.PETERSBURG"LTD.  191015 SAINT-PETERSBURG, SHPALERNAYA STREET, 51 RUSSIAN FED
"MEAT TRADE COMPANY "ST.PETERSBURG"LTD. 191015 SAINT-PETERSBURG,  SHPALERNAYA STREET, 51 RUSSIAN FED
"MEAT TRADE COMPANY "ST.PETERSBURG"LTD 191015 SAINT-PETERSBURG, SHPALERNAYA STREET, 51 RUSSIAN FEDER
"MEAT TRADE COMPANY "ST.PETERSBURG"LTD 191015 SAINT-PETERSBURG, SHPALERNAYA STREET, 51 RUSSIAN FEDER
"MEAT TRADE COMPANY "ST.PETERSBURG"LTD 191015 SAINT-PETERSBURG, SHPALERNAYA STREET, 51 R

## LCS

### with preprocessing

In [55]:
get_res(word_lcs, True)

Cluster 1:
A. HARTRODT HONG KONG LIMITED FLAT1207-1216, 12/F., BLOCK B SOUTHMARK, 11 YIP HING STREET WONG CHUK 
A. HARTRODT HONG KONG LIMITED FLAT1207-1216, 12/F., BLOCK B SOUTHMARK, 11 YIP HING STREET WONG CHUK 
A. HARTRODT SHENZHEN LOGISTICSCOMPANY LIMITED RM 1804-05, GOLDEN BUSINESS CENTRE, 2028 SHENNAN ROAD 
A.HARTRODT SHENZHEN LOGISTICS COMPANY LIMITED RM 1804-05 GOLDEN BUSINESS CENTRE 2028 SHENNAN ROAD EA
A.HARTRODT SHENZHEN LOGISTICS COMPANY LIMITED RM 1804-05 GOLDEN BUSINESS CENTRE 2028 SHENNAN ROAD EA

Cluster 2:
"FMG SHIPPING AND FORWARDING, LTD."190020, SAINT PETERSBURG,LIFLYANDSKAYA STR., 6,LITERA "A",BUILDIN
1/FMG SHIPPING AND FORWARDING LTD 190020, SAINT PETERSBURG, LIFLYANDSKAYA STR., 6, LITERA "A",
1/FMG SHIPPING AND FORWARDING LTD190020, SAINT PETERSBURG, LIFLYANDSKAYA STR., 6, LITERA "A ",+++
1/FMG SHIPPING AND FORWARDING LTD 190020, SAINT PETERSBURG, LIFLYANDSKAYA STR., 6, LITERA "A", BUILD
1/FMG SHIPPING AND FORWARDING LTD190020, SAINT PETERSBURG, LIFLYANDSKAYA STR.

### without preprocessing

In [57]:
get_res(word_lcs, False)

Cluster 1:
A. HARTRODT HONG KONG LIMITED FLAT1207-1216, 12/F., BLOCK B SOUTHMARK, 11 YIP HING STREET WONG CHUK 
A. HARTRODT HONG KONG LIMITED FLAT1207-1216, 12/F., BLOCK B  SOUTHMARK, 11 YIP HING STREET WONG CHUK
A. HARTRODT SHENZHEN LOGISTICSCOMPANY LIMITED RM 1804-05, GOLDEN BUSINESS CENTRE, 2028 SHENNAN ROAD 
A.HARTRODT SHENZHEN LOGISTICS COMPANY LIMITED RM 1804-05 GOLDEN BUSINESS CENTRE 2028 SHENNAN ROAD EA
A.HARTRODT SHENZHEN LOGISTICS COMPANY LIMITED RM 1804-05 GOLDEN BUSINESS CENTRE 2028 SHENNAN ROAD EA

Cluster 2:
"FMG SHIPPING AND FORWARDING, LTD."190020, SAINT PETERSBURG,LIFLYANDSKAYA STR., 6,LITERA "A",BUILDIN
1/FMG SHIPPING AND FORWARDING LTD 190020, SAINT PETERSBURG, LIFLYANDSKAYA STR., 6, LITERA "A",
1/FMG SHIPPING AND FORWARDING  LTD190020, SAINT PETERSBURG, LIFLYANDSKAYA STR., 6, LITERA "A ",+++
1/FMG SHIPPING AND FORWARDING LTD 190020, SAINT PETERSBURG, LIFLYANDSKAYA STR., 6, LITERA "A", BUILD
1/FMG SHIPPING AND FORWARDING  LTD190020, SAINT PETERSBURG, LIFLYANDSKAYA  S

# 5. Compare the quality of the results using the methods implemented in point 2.

# 6. Do you have any ideas for improving the clustering quality in this task?

<ul>
     <li> Choosing the best metric: this depends on the specific situation and the kind of data at hand. No single metric is best in every case.</li>
    <li> Implementing somewhat more elaborate preprocessing. For example, one could decide which parts of the data matter most, bring them into a standardized form, and take only those into account.</li>
    <li> Tuning the parameters, e.g. the number of clusters and the word-frequency threshold used for the stoplist. </li>
</ul>
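
As one possible reading of the second idea, the sketch below normalizes an address line before any metric is applied. The function name `normalize_address` and the abbreviation map are assumptions made for illustration; in particular, `ST` is expanded to `SAINT` here although it can also stand for `STREET`, so a real map would have to be derived from the data.

```python
import re

# Hypothetical abbreviation map; a real one would be derived from the data
ABBREVIATIONS = {"STR": "STREET", "ST": "SAINT", "OFF": "OFFICE"}

def normalize_address(line: str) -> str:
    # Uppercase, turn punctuation into spaces, and split on whitespace
    tokens = re.sub(r"[^\w\s]", " ", line.upper()).split()
    # Expand known abbreviations token by token
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)

first = normalize_address('"FMG SHIPPING AND FORWARDING, LTD."190020 ST.PETERSBURG')
second = normalize_address("1/FMG SHIPPING AND FORWARDING LTD 190020, SAINT PETERSBURG")
print(first)   # FMG SHIPPING AND FORWARDING LTD 190020 SAINT PETERSBURG
print(second)  # 1 FMG SHIPPING AND FORWARDING LTD 190020 SAINT PETERSBURG
```

After normalization the two spellings differ only in the leading `1`, so word-based metrics such as Dice or cosine would score them as nearly identical.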