### Description

The notebook serves as an educational guide for understanding and applying different similarity metrics to textual data, with a specific emphasis on comparing requirement texts.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
from scipy.spatial.distance import jaccard, dice
import numpy as np
from Levenshtein import distance as levenshtein_distance

In [3]:
REQ1 = "The train must automatically signal its arrival when it is 500 meters from the station."
REQ2 = "A train should signal its approach automatically as it reaches within 500 meters of a station."

##### [Note] Similarity metrics can be implemented directly in Python or imported from readily available packages such as scipy or sklearn. Both approaches are shown below!

#### [Note] For vector-based similarity metrics, we use a simple bag-of-words model for vectorization. More sophisticated representations, such as TF-IDF or word embeddings, could yield better results.
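
As a rough illustration of the TF-IDF alternative mentioned in the note, the sketch below swaps `CountVectorizer` for sklearn's `TfidfVectorizer` and compares the two requirements with cosine similarity. It fits the IDF weights on these two sentences only, which is an assumption made to keep the example self-contained; a larger requirements corpus would give more meaningful weights. The alias `sk_cosine_similarity` is introduced here purely for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity as sk_cosine_similarity

REQ1 = "The train must automatically signal its arrival when it is 500 meters from the station."
REQ2 = "A train should signal its approach automatically as it reaches within 500 meters of a station."

# Fit TF-IDF weights on the two requirements only (assumption: no larger corpus is available)
tfidf_matrix = TfidfVectorizer().fit_transform([REQ1, REQ2])

# cosine_similarity returns a 1x1 matrix here; [0][0] extracts the scalar
tfidf_cos = sk_cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])[0][0]
print(f"TF-IDF Cosine Similarity: {tfidf_cos:.4f}")
```

Terms shared by both requirements receive a lower IDF weight than terms unique to one of them, so the score is not directly comparable to a bag-of-words score.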

##### Dice and the Jaccard Similarity Index (JSI) can be calculated on either a string-level or a vector-level representation.

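[Note] The difference in results arises from what is being compared. At string level, the requirements are compared directly as sets of tokens (here whitespace-separated words; n-grams are another option), so letter case and attached punctuation matter. At binary-vector level, the calculation is based on the presence or absence of each vocabulary term, which is useful for high-dimensional sparse data. Note that CountVectorizer lowercases the text and, with its default settings, drops punctuation and single-character tokens, so the two token sets are not identical. Both methods are shown below.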
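
To make the difference concrete, the following sketch lists the token sets each representation actually compares. It assumes the default `CountVectorizer` settings and uses `build_analyzer()` to expose its tokenization; the variable names are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

REQ1 = "The train must automatically signal its arrival when it is 500 meters from the station."
REQ2 = "A train should signal its approach automatically as it reaches within 500 meters of a station."

# String level: whitespace splitting keeps case and attached punctuation ("The" != "the", "station.")
split_1, split_2 = set(REQ1.split()), set(REQ2.split())

# Vector level: the default analyzer lowercases, strips punctuation and drops one-character tokens
analyze = CountVectorizer().build_analyzer()
vocab_1, vocab_2 = set(analyze(REQ1)), set(analyze(REQ2))

print(len(split_1), len(split_2), len(split_1 & split_2))  # 15 16 8
print(len(vocab_1), len(vocab_2), len(vocab_1 & vocab_2))  # 14 14 8
print(sorted(split_2 - vocab_2))  # tokens that only the string-level view sees
```

Both views share 8 tokens, but the set sizes differ, which is why the string-level and vector-level scores do not match.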

### 1. DICE

In [4]:
# At string level

def dice_coefficient(req_a, req_b):
    # Dice = 2 * |A & B| / (|A| + |B|): like Jaccard, but the overlap is
    # normalized by the sum of the two set sizes instead of the union
    req_a_set = set(req_a.split())
    req_b_set = set(req_b.split())
    intersection = req_a_set & req_b_set
    return 2 * len(intersection) / (len(req_a_set) + len(req_b_set))

print(f"Dice Coefficient: {dice_coefficient(REQ1, REQ2)}")

Dice Coefficient: 0.5161290322580645


In [5]:
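# At binary-vector level

# fit_transform returns the document-term count matrix (not the vectorizer itself)
count_matrix = CountVectorizer().fit_transform([REQ1, REQ2])
vectors = count_matrix.toarray()

# scipy's dice works on boolean vectors and returns a dissimilarity, hence 1 - dice(...)
binary_vectors = vectors > 0
dice_sim = 1 - dice(binary_vectors[0], binary_vectors[1])
print(f"Dice Coefficient: {dice_sim}")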

Dice Coefficient: 0.5714285714285714


### 2. JSI

In [6]:
# at string-level
def jaccard_similarity(req_a, req_b):
    # Represent requirements as sets of unique words/terms
    req_a_set = set(req_a.split())
    req_b_set = set(req_b.split())
    intersection = req_a_set & req_b_set
    union = req_a_set | req_b_set
    return len(intersection) / len(union)

print(f"Jaccard Similarity Index: {jaccard_similarity(REQ1, REQ2)}")

Jaccard Similarity Index: 0.34782608695652173


In [7]:
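# at binary-vector level
# fit_transform returns the document-term count matrix (not the vectorizer itself)
count_matrix = CountVectorizer().fit_transform([REQ1, REQ2])
vectors = count_matrix.toarray()

# scipy's jaccard works on boolean vectors and returns a dissimilarity, hence 1 - jaccard(...)
binary_vectors = vectors > 0
jaccard_sim = 1 - jaccard(binary_vectors[0], binary_vectors[1])
print(f"Jaccard Similarity Index: {jaccard_sim}")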

Jaccard Similarity Index: 0.4
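
Dice and JSI are closely related: since |A| + |B| = |A ∪ B| + |A ∩ B|, the Dice coefficient can be written as D = 2J / (1 + J). The sketch below checks this identity on the string-level token sets; the helper name `dice_from_jaccard` is introduced here for illustration and is not part of any library.

```python
import math

REQ1 = "The train must automatically signal its arrival when it is 500 meters from the station."
REQ2 = "A train should signal its approach automatically as it reaches within 500 meters of a station."

def dice_from_jaccard(jaccard_index):
    # D = 2J / (1 + J), which follows from |A| + |B| = |A u B| + |A n B|
    return 2 * jaccard_index / (1 + jaccard_index)

set_a, set_b = set(REQ1.split()), set(REQ2.split())
jaccard_index = len(set_a & set_b) / len(set_a | set_b)
dice_direct = 2 * len(set_a & set_b) / (len(set_a) + len(set_b))

print(math.isclose(dice_from_jaccard(jaccard_index), dice_direct))  # True
print(math.isclose(dice_from_jaccard(0.4), 0.5714285714285714))     # True (vector-level values)
```

Because the mapping is monotonically increasing, both metrics rank requirement pairs in the same order; they differ only in scale.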


### 3. Edit Distance (Levenshtein)

In [8]:
# string-level

In [9]:
# Named differently from the imported levenshtein_distance so the import is not shadowed
def levenshtein_distance_dp(req_a, req_b):
    # dp[i, j] = edit distance between req_a[:i] and req_b[:j] (character level)
    dp = np.zeros((len(req_a) + 1, len(req_b) + 1))

    # Initialize base cases
    for i in range(len(req_a) + 1):
        dp[i, 0] = i  # Deleting i characters turns req_a[:i] into the empty string
    for j in range(len(req_b) + 1):
        dp[0, j] = j  # Inserting j characters turns the empty string into req_b[:j]

    # Fill the DP table
    for i in range(1, len(req_a) + 1):
        for j in range(1, len(req_b) + 1):
            if req_a[i - 1] == req_b[j - 1]:
                cost = 0  # No cost if characters are the same
            else:
                cost = 1  # Substitution cost
            dp[i, j] = min(
                dp[i - 1, j] + 1,  # Deletion from req_a
                dp[i, j - 1] + 1,  # Insertion into req_a
                dp[i - 1, j - 1] + cost  # Match (cost 0) or substitution (cost 1)
            )

    return dp[len(req_a), len(req_b)]

levenshtein_distance_dp(REQ1, REQ2)

49.0

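##### OR: You can also calculate it with the Levenshtein package by calling its distance function

In [10]:
# Import under a distinct name so a user-defined levenshtein_distance cannot shadow the library function
from Levenshtein import distance as lev_distance

edit_dist = lev_distance(REQ1, REQ2)
print(f"Edit Distance: {edit_dist}")

Edit Distance: 49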
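
A raw edit distance grows with text length, so it is hard to compare against the similarity scores above. One common convention (an assumption here, not something the notebook prescribes) is to normalize by the longer string's length, giving a similarity in [0, 1]. The sketch uses a dependency-free two-row variant of the same dynamic program; `levenshtein` and `normalized_edit_similarity` are hypothetical helper names.

```python
def levenshtein(a, b):
    # Two-row dynamic program: previous[j] holds the distance between a[:i-1] and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]

def normalized_edit_similarity(a, b):
    # 1.0 means identical strings, 0.0 means every position differs
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a, b) / longest

REQ1 = "The train must automatically signal its arrival when it is 500 meters from the station."
REQ2 = "A train should signal its approach automatically as it reaches within 500 meters of a station."
print(f"Normalized Edit Similarity: {normalized_edit_similarity(REQ1, REQ2):.4f}")
```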


### 4. Euclidean Distance

In [11]:
# vector-level

In [12]:
def euclidean_distance(req_a, req_b):
    # Vectorize the two requirements passed in and take the L2 norm of their difference
    count_matrix = CountVectorizer().fit_transform([req_a, req_b])
    vectors = count_matrix.toarray()

    return np.linalg.norm(vectors[0] - vectors[1])

print(f"Euclidean Distance: {euclidean_distance(REQ1, REQ2)}")

Euclidean Distance: 3.872983346207417


##### OR: You can also calculate it with the sklearn package by calling euclidean_distances

In [13]:
vectorizer = CountVectorizer().fit_transform([REQ1, REQ2])
vectors = vectorizer.toarray()
euc_dist = euclidean_distances([vectors[0]], [vectors[1]])[0][0]

print(f"Euclidean Distance: {euc_dist}")

Euclidean Distance: 3.872983346207417
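
When more than two requirements are involved, `euclidean_distances` can return the full pairwise matrix in one call. The sketch below adds a third requirement that is purely hypothetical and included only to show the shape of the result.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances

requirements = [
    "The train must automatically signal its arrival when it is 500 meters from the station.",
    "A train should signal its approach automatically as it reaches within 500 meters of a station.",
    "The door must remain locked while the train is moving.",  # hypothetical third requirement
]

# One row per requirement, one column per vocabulary term
count_matrix = CountVectorizer().fit_transform(requirements)

# With a single argument, the result is a symmetric n x n matrix with zeros on the diagonal
dist_matrix = euclidean_distances(count_matrix)
print(dist_matrix.round(3))
```

Entry `[0][1]` matches the pairwise value computed above, because the extra vocabulary terms introduced by the third requirement are zero for both of the original requirements.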


### 5. Cosine Similarity

In [17]:
# Named differently from sklearn's cosine_similarity so the import above is not shadowed
def cosine_similarity_manual(req_a, req_b):
    # Vectorize the two requirements passed in (bag-of-words counts)
    count_matrix = CountVectorizer().fit_transform([req_a, req_b])
    vectors = count_matrix.toarray()

    req_a_vec = vectors[0]
    req_b_vec = vectors[1]
    dot_product = np.dot(req_a_vec, req_b_vec)

    mag_a = np.linalg.norm(req_a_vec)
    mag_b = np.linalg.norm(req_b_vec)

    return dot_product / (mag_a * mag_b)

print(f"Cosine Similarity: {cosine_similarity_manual(REQ1, REQ2)}")

Cosine Similarity: 0.5185629788417315


##### OR: You can also calculate it with the sklearn package by calling cosine_similarity


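In [16]:
# Import under a distinct name so a user-defined cosine_similarity cannot shadow the sklearn function
from sklearn.metrics.pairwise import cosine_similarity as sk_cosine_similarity

count_matrix = CountVectorizer().fit_transform([REQ1, REQ2])
vectors = count_matrix.toarray()
# [Read More at:] https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html
# The result is a 1x1 matrix; [0][0] extracts the scalar
cos_sim = sk_cosine_similarity([vectors[0]], [vectors[1]])[0][0]
print(f"Cosine Similarity: {cos_sim}")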

Cosine Similarity: 0.5185629788417315
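
Cosine similarity and Euclidean distance are linked once the vectors are scaled to unit length: for unit vectors a and b, ||a - b||² = 2 (1 - cos(a, b)). The sketch below checks this on the same bag-of-words vectors; it is an illustration of the identity rather than a recommended workflow.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

REQ1 = "The train must automatically signal its arrival when it is 500 meters from the station."
REQ2 = "A train should signal its approach automatically as it reaches within 500 meters of a station."

vectors = CountVectorizer().fit_transform([REQ1, REQ2]).toarray().astype(float)

# Scale each count vector to unit length (L2 normalization)
unit_a = vectors[0] / np.linalg.norm(vectors[0])
unit_b = vectors[1] / np.linalg.norm(vectors[1])

cos_sim = float(np.dot(unit_a, unit_b))
squared_dist = float(np.sum((unit_a - unit_b) ** 2))

print(f"Cosine Similarity: {cos_sim:.4f}")
print(f"Squared Euclidean distance (unit vectors): {squared_dist:.4f}")
print(np.isclose(squared_dist, 2 * (1 - cos_sim)))  # True
```

On length-normalized vectors the two metrics therefore rank pairs identically; on raw counts, Euclidean distance is additionally sensitive to how long each requirement is.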
