# Finding Text Similarity
-  https://www.scaler.com/topics/nlp/text-similarity-nlp/

## Which pairs are similar?
-  Global warming is here. Ocean temperature is rising
-  I'm reading a book. The book is about NLP.
-  Text similarity in NLP is easy. I like data science.
-  This place is great. This is great news.
-  It might not rain today. It might not work today

In [None]:
# The first 2 pairs are similar
# The next 3 pairs are not similar

## The Similarity Problem
-  We want to measure how similar two texts are. To do this, we must first decide on two aspects:
    -  The method that will be used to calculate the similarities.
    -  The algorithm that will be used for the transformation of our texts to embeddings.

## Pre-processing for Text Similarity
-  Before we get to the two aspects discussed above, we must first pre-process our text. The pipeline that we are going to use is: Normalization, Tokenization, Removal of Stop Words, and Stemming/Lemmatization (the code below uses lemmatization).
-  For this pipeline, we will use NLTK, one of the most popular NLP libraries in Python. Here's the code.

In [1]:
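# libraries
import re  # the standard-library module covers every pattern used below
import string
import nltk
from nltk.stem import WordNetLemmatizer

# One-time downloads needed by the cells below (uncomment on first run):
# nltk.download('stopwords')
# nltk.download('wordnet')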


In [4]:
# this function pre-processes a single document (one string)
def preProcessText(doc):
    doc = doc.lower()  # normalize case first so the [a-z] patterns below match
    doc = re.sub(r"\\n", " ", doc)  # remove literal "\n" sequences
    doc = re.sub(r"\W", " ", doc)  # replace non-word characters with a space
    doc = re.sub(r"\d", " ", doc)  # replace digits with a space
    doc = re.sub(r"\s+[a-z]\s+", " ", doc)  # remove a single character between spaces
    doc = re.sub(r"^[a-z]\s+", "", doc)  # remove a single character at the start of a document
    doc = re.sub(r"\s+", " ", doc)  # collapse repeated whitespace into a single space
    return doc.strip()  # remove spaces at the start and end of the document

In [5]:
def tokenize(text):
    # split on runs of non-word characters and drop empty strings
    tokens = [tok for tok in re.split(r'\W+', text) if tok]
    return tokens

In [6]:
# removing stop words
stops = nltk.corpus.stopwords.words('english')

In [8]:
def removeStopWords(text):
    text = [word for word in text if word not in stops]
    return text

In [9]:
# lemmatization of text
wnl = WordNetLemmatizer()

In [10]:
def lemmatize(text):
    lemm_text = [wnl.lemmatize(word) for word in text]
    return lemm_text

In [11]:
def process(text):
    text = list(map(preProcessText, text))
    text = list(map(tokenize, text))
    text = list(map(removeStopWords, text))
    text = list(map(lemmatize, text))
    return text
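
To see what such a pipeline produces, here is a minimal, self-contained sketch of the first three steps on one of the example sentences. It is not the NLTK pipeline above: `simple_pipeline` and `MINI_STOPS` are hypothetical names, and the tiny hand-written stop-word set only stands in for NLTK's English list so that the snippet runs without any downloads. Lemmatization is left out.

```python
import re

# Hypothetical mini stop-word list, standing in for NLTK's English list.
MINI_STOPS = {"is", "the", "a", "i", "m", "about"}

def simple_pipeline(doc):
    doc = doc.lower()                      # normalization: lowercase
    doc = re.sub(r"[^a-z\s]", " ", doc)    # keep only letters and whitespace
    tokens = doc.split()                   # tokenization on whitespace
    return [t for t in tokens if t not in MINI_STOPS]  # stop-word removal

tokens = simple_pipeline("I'm reading a book. The book is about NLP.")
print(tokens)  # ['reading', 'book', 'book', 'nlp']
```

Only the content words survive, which is what the later embedding step should focus on.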

Once all this preprocessing is done, the next task at hand is to decide on the algorithm that will convert our processed text into embeddings. A few of the algorithms that can be used are:

## TF-IDF
-  TF-IDF is one of the methods that can be used to convert text to embeddings. It is short for Term Frequency - Inverse Document Frequency. Before moving on to the concept of TF-IDF, let's discuss a very basic document-term matrix.
-  A document-term matrix is a matrix where each row is a phrase or sentence (in NLP terms, a document) and each column is a word. Each value is the number of times that word (column) appears in that document (row). TF-IDF then reweights these counts so that words occurring in many documents count for less than words that are specific to a few documents.
-  Hence, the document-term matrix for our 5th pair of sentences would look like this:
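
As a rough sketch, the document-term matrix for the 5th pair can be built in plain Python. The TF-IDF weighting shown afterwards is the plain textbook form (term frequency times `log(N / document frequency)`), which is an assumption made here for illustration; libraries such as scikit-learn use smoothed variants, so their numbers will differ.

```python
import math

docs = ["It might not rain today", "It might not work today"]
tokenized = [d.lower().split() for d in docs]

# Columns of the matrix: the sorted vocabulary across both documents.
vocab = sorted({w for doc in tokenized for w in doc})

# Document-term matrix: one row per document, one count per vocabulary word.
dtm = [[doc.count(w) for w in vocab] for doc in tokenized]

# Textbook TF-IDF: tf = count / document length, idf = log(N / document frequency).
n_docs = len(tokenized)
df = [sum(1 for row in dtm if row[j] > 0) for j in range(len(vocab))]
tfidf = [
    [(row[j] / sum(row)) * math.log(n_docs / df[j]) for j in range(len(vocab))]
    for row in dtm
]

print(vocab)  # ['it', 'might', 'not', 'rain', 'today', 'work']
for row in dtm:
    print(row)
# [1, 1, 1, 1, 1, 0]
# [1, 1, 1, 0, 1, 1]
```

With this weighting, the words shared by both sentences get a weight of zero, and only "rain" and "work" carry any weight, which is exactly where the two sentences differ.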

## Word2Vec
-  This technique uses a shallow neural network to learn a dense vector (embedding) for every word; a document can then be represented by combining the vectors of its words, for example by averaging them. Word2Vec embeddings usually capture the contexts in which words appear, so words used in similar contexts end up close together. There are two training architectures for generating the embeddings: Skip-Gram and Continuous Bag of Words (CBOW). Both learn the underlying word representations from unlabeled text.
-  https://tedboy.github.io/nlps/generated/generated/gensim.models.Word2Vec.html
-  https://www.analyticsvidhya.com/blog/2023/07/step-by-step-guide-to-word2vec-with-gensim/

### Continuous Bag of Words (CBOW)
-  In this model, the distributed representations of the context (in simpler terms, the surrounding words) are combined to predict the word that lies in the middle.
-  For example, if we had the sentence "I like to drink coffee the whole day", here's what the inputs and outputs of the CBOW model would look like:
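
The training pairs can be sketched in a few lines of Python. The helper `cbow_pairs` is a hypothetical name, and the window size of 2 words on each side is an assumption chosen for illustration; this only builds the (context, target) pairs and does not train a model.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs for a CBOW-style model."""
    pairs = []
    for i, target in enumerate(tokens):
        # up to `window` words on each side of the target, excluding the target itself
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs

sentence = "I like to drink coffee the whole day".lower().split()
pairs = cbow_pairs(sentence)
for context, target in pairs[:4]:
    print(context, "->", target)
# ['like', 'to'] -> i
# ['i', 'to', 'drink'] -> like
# ['i', 'like', 'drink', 'coffee'] -> to
# ['like', 'to', 'coffee', 'the'] -> drink
```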

### Skip-Gram
-  The Skip-Gram model is the mirror image of the continuous bag of words (CBOW) model. Here, instead of using the surrounding words to predict the middle word, we pass a target word as input and predict its neighboring words.
-  For the same sentence as above, the inputs and outputs of the skip-gram model would look like this:
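
A similar sketch for skip-gram: each target word is paired with every word inside its window. As before, `skipgram_pairs` is a hypothetical helper and the window size of 2 is an assumption; no model is trained here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target_word, context_word) pairs for a skip-gram-style model."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

sentence = "I like to drink coffee the whole day".lower().split()
sg_pairs = skipgram_pairs(sentence)
print([p for p in sg_pairs if p[0] == "drink"])
# [('drink', 'like'), ('drink', 'to'), ('drink', 'coffee'), ('drink', 'the')]
```

In gensim's `Word2Vec`, the `sg` parameter selects between the two architectures (`sg=0` for CBOW, `sg=1` for skip-gram).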

## Methods to Find Text Similarity in NLP
-  After discussing some algorithms that can be used to convert text into embeddings, we can move on to the other aspect: the statistical method that will be used to compute similarities.

### Cosine Similarity
-  Cosine similarity is one of the most widely used methods for comparing vectors. In simple terms, it is the cosine of the angle between two vectors, which equals their dot product divided by the product of their lengths.
-  For two vectors A and B, we use the formula:
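
The standard definition of cosine similarity, followed by the arithmetic for the vectors `vec1 = [12, 41, 60, 11, 21]` and `vec2 = [40, 11, 4, 11, 14]` used in the cells that follow:

$$
\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
= \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}
$$

$$
\cos(\theta) = \frac{480 + 451 + 240 + 121 + 294}{\sqrt{5987} \, \sqrt{2054}}
= \frac{1586}{\sqrt{12297298}} \approx 0.4523
$$

For vectors with non-negative entries, such as word counts, the value lies between 0 (no overlap) and 1 (same direction).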

In [13]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def get_cosine_similarity(vec_1, vec_2):  
    return cosine_similarity(vec_1.reshape(1, -1), vec_2.reshape(1, -1))[0][0]

In [14]:
# finding cosine similarity between two vectors
vec1 = np.array([[12, 41, 60, 11, 21]])
vec2 = np.array([[40, 11, 4, 11, 14]])
print(get_cosine_similarity(vec1, vec2))

0.452270576122071


In [15]:
# creating a vector similar to vec1
vec3 = np.array([[12, 45, 60, 11, 25]])
print(get_cosine_similarity(vec1, vec3))

0.9983311419668022


- The first output (about 0.45) signifies that vec1 and vec2 are neither very similar nor very different. For the second output, since vec3 has almost the same values as vec1, the cosine similarity is high (0.9983), signifying that the two vectors are extremely similar.

### Euclidean Distance
-  Another similarity metric is Euclidean distance. It measures the straight-line distance between two points using the Pythagorean theorem; the smaller the distance, the more similar the vectors. To calculate the Euclidean distance between two document vectors, apply the following formula:
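
The standard formula, followed by the arithmetic for the vectors `[1, 3, 2]` and `[5, 0, -3]` used in the cells that follow:

$$
d(A, B) = \sqrt{\sum_{i=1}^{n} (A_i - B_i)^2}
$$

$$
d = \sqrt{(1 - 5)^2 + (3 - 0)^2 + (2 - (-3))^2} = \sqrt{16 + 9 + 25} = \sqrt{50} \approx 7.0711
$$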

In [16]:
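from math import sqrt
from typing import Sequence

def euclideanDistance(vec1: Sequence[float], vec2: Sequence[float]) -> float:
    # square root of the sum of squared coordinate differences
    return sqrt(sum((x - y) ** 2 for x, y in zip(vec1, vec2)))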

In [17]:
vec_1 = [1, 3, 2]
vec_2 = [5, 0, -3]
print(euclideanDistance(vec_1, vec_2))

7.0710678118654755


### Word Mover’s Distance
-  Neither of the metrics discussed above takes semantics into account on its own. The Word Mover's Distance (WMD) tries to measure the semantic distance between two documents. It builds on word embeddings from techniques such as GloVe and word2vec, which preserve semantic relationships between words and can be trained on very large data sets.
-  The word mover's distance between two text documents is the minimum cumulative distance that the words of document A need to travel in embedding space to exactly match the point cloud of document B.
-  https://radimrehurek.com/gensim/models/word2vec.html
-  https://www.kaggle.com/code/pierremegret/gensim-word2vec-tutorial
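
To make the idea of "words travelling" concrete, here is a small sketch of a relaxed version of the distance. It is not the full WMD, which solves an optimal-transport problem: each word of document A simply moves to its nearest word in document B, which gives a lower bound on the true WMD when all words carry equal weight. The 2-D vectors in `toy_vectors` are invented purely for illustration, and `relaxed_wmd` is a hypothetical helper, not a gensim function.

```python
import math

# Hypothetical 2-D "embeddings", invented purely for illustration.
toy_vectors = {
    "president": (1.0, 0.9),
    "minister": (0.9, 1.0),
    "speaks": (0.2, 0.1),
    "greets": (0.1, 0.2),
    "banana": (-1.0, -1.0),
}

def relaxed_wmd(doc_a, doc_b, vectors):
    """One-sided relaxed WMD: every word in doc_a (equal weight) moves
    entirely to its nearest word in doc_b."""
    total = 0.0
    for w in doc_a:
        total += min(math.dist(vectors[w], vectors[v]) for v in doc_b)
    return total / len(doc_a)

near = relaxed_wmd(["president", "speaks"], ["minister", "greets"], toy_vectors)
far = relaxed_wmd(["president", "speaks"], ["banana"], toy_vectors)
print(near < far)  # True
```

The two documents share no words, yet the first pair comes out much closer because their words sit near each other in the embedding space.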

A couple of intriguing properties of word mover's distance are:

-  It does not have any hyper-parameters and is straightforward to understand and use.
-  It is interpretable, since the distance between two documents can be broken down and explained as sparse distances between a few individual words.
-  The knowledge encoded in word2vec and GloVe embeddings is naturally incorporated, which tends to improve accuracy.

Let's now look at an example with a code implementation of the word mover's distance.

In [18]:
sent_a = "Modi had a chat with Bear Grylls in Jim Corbett"
sent_b = "The Prime Minister met the TV host in a national park"

sent_a = sent_a.lower().split()
sent_b = sent_b.lower().split()

In [21]:
from nltk.corpus import stopwords

stop = stopwords.words('english')
sent_a = [word for word in sent_a if word not in stop]
sent_b = [word for word in sent_b if word not in stop]

In [25]:
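from gensim.test.utils import common_texts
from gensim.models import Word2Vec

# common_texts is a tiny toy corpus, so none of the words in our sentences are in
# its vocabulary; use a model trained on a large corpus for meaningful distances.
model = Word2Vec(sentences=common_texts, vector_size=100, window=5, min_count=1, workers=4)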

In [26]:
# In gensim 4.x, wmdistance lives on the KeyedVectors object (model.wv),
# not on the Word2Vec model itself. It also needs an extra optimal-transport
# package installed (POT or pyemd, depending on the gensim version).
dist1 = model.wv.wmdistance(sent_a, sent_b)
print(dist1)

# let's write a sentence that is not so similar to sent_a
sent_c = "Leos are born in August"
sent_c = sent_c.lower().split()
sent_c = [word for word in sent_c if word not in stop]

In [None]:
dist2 = model.wv.wmdistance(sent_a, sent_c)
print(dist2)