**What is TEXT SIMILARITY?**
Text similarity is a natural language processing (NLP) technique that measures how alike two pieces of text are. It has many real-world use cases.

**Different Text Similarity Algorithms:**
1. Cosine Similarity
2. Levenshtein Distance
3. Jaccard Index
4. Euclidean Distance
5. Hamming Distance
6. Word Embeddings
7. Pre-trained language models

In this exercise, we will explore all of these algorithms except *Word Embeddings* and *Pre-trained language models*.

**Applications of Text Similarity Detection:**
1. Plagiarism Detection
2. Document Classification
3. Information Retrieval
4. Language Translation
5. Sentiment Analysis
6. Summarization

In [1]:
# read the two documents to compare; `with` closes each file after reading
with open('txt1.txt', 'r', encoding='utf8') as f:
    text1 = f.read()
with open('txt2.txt', 'r', encoding='utf8') as g:
    text2 = g.read()

In [2]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [3]:
# tokenize the documents
tokens1 = nltk.word_tokenize(text1.lower())
tokens2 = nltk.word_tokenize(text2.lower())
print(tokens1)
print(tokens2)

['the', 'history', 'of', 'automobiles', 'is', 'an', 'expansive', 'narrative', 'that', 'unfolds', 'across', 'centuries', ',', 'tracing', 'the', 'evolution', 'of', 'transportation', 'from', 'its', 'humble', 'beginnings', 'to', 'the', 'complex', ',', 'interconnected', 'network', 'of', 'roads', 'and', 'vehicles', 'that', 'define', 'modern', 'society', '.', 'the', 'concept', 'of', 'self-propelled', 'vehicles', 'has', 'roots', 'dating', 'back', 'to', 'antiquity', ',', 'with', 'early', 'experiments', 'involving', 'steam-powered', 'machines', 'in', 'the', '17th', 'century', '.', 'however', ',', 'it', 'was', "n't", 'until', 'the', 'late', '19th', 'and', 'early', '20th', 'centuries', 'that', 'the', 'automobile', 'as', 'we', 'know', 'it', 'began', 'to', 'take', 'shape', '.', 'karl', 'benz', "'s", 'creation', 'of', 'the', 'first', 'practical', 'automobile', 'in', '1885', ',', 'featuring', 'a', 'gasoline-powered', 'internal', 'combustion', 'engine', ',', 'marked', 'a', 'significant', 'milestone', '

In [4]:
# lemmatize the tokens
lemmatizer = WordNetLemmatizer()
tokens1 = [lemmatizer.lemmatize(token) for token in tokens1]
tokens2 = [lemmatizer.lemmatize(token) for token in tokens2]
print(tokens1)
print(tokens2)

['the', 'history', 'of', 'automobile', 'is', 'an', 'expansive', 'narrative', 'that', 'unfolds', 'across', 'century', ',', 'tracing', 'the', 'evolution', 'of', 'transportation', 'from', 'it', 'humble', 'beginning', 'to', 'the', 'complex', ',', 'interconnected', 'network', 'of', 'road', 'and', 'vehicle', 'that', 'define', 'modern', 'society', '.', 'the', 'concept', 'of', 'self-propelled', 'vehicle', 'ha', 'root', 'dating', 'back', 'to', 'antiquity', ',', 'with', 'early', 'experiment', 'involving', 'steam-powered', 'machine', 'in', 'the', '17th', 'century', '.', 'however', ',', 'it', 'wa', "n't", 'until', 'the', 'late', '19th', 'and', 'early', '20th', 'century', 'that', 'the', 'automobile', 'a', 'we', 'know', 'it', 'began', 'to', 'take', 'shape', '.', 'karl', 'benz', "'s", 'creation', 'of', 'the', 'first', 'practical', 'automobile', 'in', '1885', ',', 'featuring', 'a', 'gasoline-powered', 'internal', 'combustion', 'engine', ',', 'marked', 'a', 'significant', 'milestone', 'in', 'automotive

In [5]:
# remove stopwords
stop_words = set(stopwords.words('english'))
tokens1 = [token for token in tokens1 if token not in stop_words]
tokens2 = [token for token in tokens2 if token not in stop_words]
print(tokens1)
print(tokens2)

['history', 'automobile', 'expansive', 'narrative', 'unfolds', 'across', 'century', ',', 'tracing', 'evolution', 'transportation', 'humble', 'beginning', 'complex', ',', 'interconnected', 'network', 'road', 'vehicle', 'define', 'modern', 'society', '.', 'concept', 'self-propelled', 'vehicle', 'ha', 'root', 'dating', 'back', 'antiquity', ',', 'early', 'experiment', 'involving', 'steam-powered', 'machine', '17th', 'century', '.', 'however', ',', 'wa', "n't", 'late', '19th', 'early', '20th', 'century', 'automobile', 'know', 'began', 'take', 'shape', '.', 'karl', 'benz', "'s", 'creation', 'first', 'practical', 'automobile', '1885', ',', 'featuring', 'gasoline-powered', 'internal', 'combustion', 'engine', ',', 'marked', 'significant', 'milestone', 'automotive', 'history', '.', 'invention', 'paved', 'way', 'mass', 'production', 'car', ',', 'feat', 'achieved', 'company', 'like', 'ford', 'introduction', 'assembly', 'line', 'manufacturing', 'technique', '1913.', 'widespread', 'adoption', 'autom
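
The output above still contains 'ha' and 'wa'. These look like 'has' and 'was' after being lemmatized as nouns (the default when `lemmatize` is called without a part-of-speech argument), so they no longer match the stopword list by the time it is applied. The sketch below illustrates the ordering effect only. `TOY_STOPWORDS` and `toy_lemmatize` are hypothetical stand-ins, not NLTK's stopword list or lemmatizer.

```python
# Toy illustration of why the order of the two preprocessing steps matters.
# TOY_STOPWORDS and toy_lemmatize are hypothetical stand-ins, not NLTK's.
TOY_STOPWORDS = {"the", "has", "was", "as", "it"}

def toy_lemmatize(token):
    """Crude noun-style lemmatizer: strip one trailing 's' from tokens longer than 2 characters."""
    if len(token) > 2 and token.endswith("s"):
        return token[:-1]
    return token

tokens = ["the", "concept", "has", "roots", "as", "it", "was"]

# Order used in the cells above: lemmatize first, then remove stopwords.
lemmas_first = [toy_lemmatize(t) for t in tokens]
lemmas_first = [t for t in lemmas_first if t not in TOY_STOPWORDS]

# Alternative order: remove stopwords first, then lemmatize.
stop_first = [t for t in tokens if t not in TOY_STOPWORDS]
stop_first = [toy_lemmatize(t) for t in stop_first]

print(lemmas_first)  # mangled stopwords such as 'ha' and 'wa' survive
print(stop_first)    # stopwords are removed before they can be mangled
```

Swapping the order of cells 4 and 5 would probably have the same effect on the real data, though the token counts reported further down would then change.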

In [6]:
print(len(tokens1))
print(len(tokens2))

149
160


In [7]:
# remove punctuation tokens
import string
punctuation = set(string.punctuation)  # set membership: exact single-character matches
tokens1 = [token for token in tokens1 if token not in punctuation]
tokens2 = [token for token in tokens2 if token not in punctuation]
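
One subtlety in the punctuation filter: `string.punctuation` is a plain `str`, so `token in string.punctuation` is a substring test, whereas membership in `set(string.punctuation)` is an exact single-character match. For single-character tokens such as ',' and '.' the two agree, but they differ on edge cases. The sketch below shows the difference and one stricter variant, a hypothetical `strip_punctuation` helper that drops any token made up entirely of punctuation characters. This helper is not what the notebook uses.

```python
import string

# string.punctuation is a str, so `in` performs a substring test on it.
print(",-" in string.punctuation)        # substring test against the str
print(",-" in set(string.punctuation))   # exact match against single characters

def strip_punctuation(tokens):
    """Drop tokens that consist only of punctuation characters (hypothetical helper)."""
    punct = set(string.punctuation)
    return [t for t in tokens if not all(ch in punct for ch in t)]

sample = ["history", ",", "'s", "n't", "``", "1913.", "."]
cleaned = strip_punctuation(sample)
print(cleaned)
```

Tokens such as "'s" and "1913." are kept because they contain non-punctuation characters. Whether they should count as words is a separate preprocessing decision.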

In [8]:
print(len(tokens1))
print(len(tokens2))

127
127


In [9]:
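# vectorize the documents with TF-IDF
# note: the vectorizer is fitted on the raw texts (text1, text2), not on the
# cleaned token lists, so the preprocessing above does not affect the cosine
# and Euclidean scores below; to use the cleaned tokens instead, pass
# [' '.join(tokens1), ' '.join(tokens2)] to a single fit_transform call
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([text1, text2])
vector1 = vectors[0]
vector2 = vectors[1]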

In [10]:
# cosine similarity
cosine_sim = cosine_similarity(vector1, vector2)[0][0]
print("Cosine Similarity: ", cosine_sim)

Cosine Similarity:  0.6508009939293828
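
To make the score easier to interpret, here is a minimal from-scratch sketch of cosine similarity. It assumes raw term counts rather than the TF-IDF weights used in the cell above, so its numbers will not match `TfidfVectorizer`. The function name `cosine_from_counts` and the two example sentences are made up for illustration.

```python
import math
from collections import Counter

def cosine_from_counts(tokens_a, tokens_b):
    """Cosine similarity between the raw term-count vectors of two token lists."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Only terms present in both documents contribute to the dot product.
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty input
    return dot / (norm_a * norm_b)

doc_a = "the car has an engine".split()
doc_b = "the car has a motor".split()
score = cosine_from_counts(doc_a, doc_b)
print(round(score, 3))
```

The two sentences share three of their five distinct terms, so the score is 3 / (√5 · √5) = 0.6. A score of 1 means the count vectors point in the same direction, and 0 means the documents share no terms.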


In [11]:
# levenshtein distance (computed over the token lists, so each edit is a whole token)
import Levenshtein
lev_distance = Levenshtein.distance(tokens1,tokens2)
print("Levenshtein Distance:", lev_distance)

Levenshtein Distance: 126
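
Because the inputs are token lists rather than strings, the distance above counts whole-token insertions, deletions and substitutions. With 127 tokens in each list, a distance of 126 corresponds to a normalized distance of 126/127 ≈ 0.992. The sketch below is a plain dynamic-programming version of the same idea. `token_levenshtein` is an illustrative name, not the `Levenshtein` package's implementation.

```python
def token_levenshtein(seq_a, seq_b):
    """Edit distance between two sequences, where each edit affects one whole element."""
    previous = list(range(len(seq_b) + 1))  # distances from the empty prefix of seq_a
    for i, item_a in enumerate(seq_a, start=1):
        current = [i]
        for j, item_b in enumerate(seq_b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # delete item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]

a = ["karl", "benz", "built", "the", "first", "car"]
b = ["benz", "built", "the", "first", "practical", "car"]
dist = token_levenshtein(a, b)
normalized = dist / max(len(a), len(b))
print(dist, round(normalized, 3))
```

Here one deletion ('karl') and one insertion ('practical') are enough, so the distance is 2. Dividing by the longer length is one common way to make distances comparable across document sizes.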


In [12]:
# jaccard index
import textdistance
jaccard_index = textdistance.jaccard.normalized_similarity(tokens1, tokens2)
print("Jaccard Index:", jaccard_index)

Jaccard Index: 0.09956709956709953
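
The Jaccard index can be defined over sets (each distinct token counts once) or over multisets (repeated tokens count). The value above equals 23/231, which is consistent with a multiset comparison of two 127-token lists sharing 23 token occurrences (127 + 127 - 23 = 231). That reading is an inference from the numbers, not something stated in the notebook. The sketch below shows both definitions side by side. The function names are illustrative.

```python
from collections import Counter

def jaccard_set(tokens_a, tokens_b):
    """Jaccard index over distinct tokens."""
    a, b = set(tokens_a), set(tokens_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def jaccard_multiset(tokens_a, tokens_b):
    """Jaccard index over token occurrences (min counts / max counts)."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    union = sum((a | b).values())
    return sum((a & b).values()) / union if union else 1.0

doc_a = ["car", "car", "engine", "road"]
doc_b = ["car", "engine", "engine", "wheel"]
set_score = jaccard_set(doc_a, doc_b)
multiset_score = jaccard_multiset(doc_a, doc_b)
print(set_score, round(multiset_score, 3))
```

On this toy input the set version gives 2/4 = 0.5 and the multiset version gives 2/6 ≈ 0.333, so the choice of definition can change the score noticeably when tokens repeat.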


In [17]:
# euclidean distance
import numpy
euclidean_distance = numpy.linalg.norm(vector1.toarray() - vector2.toarray())
print("Euclidean Distance: ", euclidean_distance)

Euclidean Distance:  0.8357021072973521
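
`TfidfVectorizer` L2-normalizes each document vector by default (`norm='l2'`), and for unit-length vectors the Euclidean distance is fully determined by the cosine similarity: ‖a − b‖² = 2 − 2·cos(a, b). The short check below plugs in the two values printed in this notebook. It is only a sanity check on the reported numbers and does not recompute anything from the documents.

```python
import math

cosine_sim = 0.6508009939293828          # value printed by the cosine cell
euclidean_reported = 0.8357021072973521  # value printed by the Euclidean cell

# For unit-length vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
euclidean_from_cosine = math.sqrt(2 - 2 * cosine_sim)
agrees = math.isclose(euclidean_from_cosine, euclidean_reported, rel_tol=1e-6)
print(euclidean_from_cosine, agrees)
```

As a consequence, on L2-normalized TF-IDF vectors the two measures rank document pairs identically, so the Euclidean distance adds no information beyond the cosine similarity in this setup.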


In [18]:
# hamming similarity (normalized): fraction of positions where the tokens match
hamming_similarity = textdistance.hamming.normalized_similarity(tokens1, tokens2)
print("Hamming Similarity (normalized): ", hamming_similarity)

Hamming Similarity (normalized):  0.007874015748031482
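
The value printed above comes from `normalized_similarity`, so it is a similarity (higher means more alike), not a distance. It equals 1/127, which is consistent with a single matching position across the two 127-token lists. Hamming comparison is strictly position by position, so it is very sensitive to shifts. The sketch below illustrates this with a hypothetical `positional_similarity` function, written from the definition rather than taken from `textdistance`.

```python
def positional_similarity(seq_a, seq_b):
    """Fraction of aligned positions holding equal elements.

    Positions beyond the end of the shorter sequence count as mismatches.
    """
    longest = max(len(seq_a), len(seq_b))
    if longest == 0:
        return 1.0
    matches = sum(1 for x, y in zip(seq_a, seq_b) if x == y)
    return matches / longest

a = ["history", "automobile", "expansive", "narrative"]
shifted = ["the"] + a[:-1]  # same words, moved one position to the right

same_score = positional_similarity(a, a)
shifted_score = positional_similarity(a, shifted)
print(same_score, shifted_score)
```

Inserting a single token at the start drops the score from 1.0 to 0.0 even though three of the four words are shared, which is why Hamming-style measures tend to suit fixed-length, aligned data better than free text.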


1. I have tried the same measures on several other document pairs of varying sizes (20 - 250).
2. The similarity scores seem to vary depending on whether punctuation is included, but I don't think that will be the case in every situation.