# **Text Similarity - Natural Language Processing**

## **Introduction**

Natural Language Processing (NLP) is an area of artificial intelligence that seeks to enable computers to understand, interpret, and produce human language efficiently.

## **Objective**

The objective of this project is to compute a similarity score between sentences. The score used here is the containment measure: the number of word occurrences that a comparative sentence shares with the original sentence, divided by the total number of words in the original sentence.
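The containment measure, as it is computed in the code cells of this notebook, can be written as the formula below. The symbols are introduced here only for illustration: $O$ is the original sentence, $C$ is a comparative sentence, $V$ is their joint vocabulary, and $\mathrm{count}_X(w)$ is the number of times word $w$ occurs in sentence $X$.

$$
\mathrm{containment}(C, O) = \frac{\sum_{w \in V} \min\bigl(\mathrm{count}_C(w),\ \mathrm{count}_O(w)\bigr)}{\sum_{w \in V} \mathrm{count}_O(w)}
$$

The score is 0 when the two sentences share no words and 1 when every word occurrence of the original sentence is also present in the comparative sentence.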

## **Data**

Given the following original sentence:

> Olhando para a escala na parede, qual valor indicaria melhor a sua dor hoje?

We will compare it with the following three sentences and compute a similarity score between the original and each comparative sentence.

|Original Sentence|Comparative Sentence|Similarity Score|
|--|--|--|
|Olhando para a escala na parede, qual valor indicaria melhor a sua dor hoje?|De acordo com a escala de dor ali na parede, qual valor você acha que mais representa a sua dor?|???|
||De 0 a 10, qual o nível de intensidade da sua dor atualmente?|???|
||Qual a intensidade da sua dor?|???|


## **Getting started**


#### **Importing Libraries**
---

To start, we import the libraries that the project needs.

In [1]:
import numpy as np
import sklearn
import pandas as pd

#### **Declaring Variables**
---

We will declare some variables to store the original text and the comparative texts.

In [2]:
default_text = "Olhando para a escala na parede, qual valor indicaria melhor a sua dor hoje?"
text_1 = "De acordo com a escala de dor ali na parede, qual valor você acha que mais representa a sua dor?"
text_2 = "De 0 a 10, qual o nível de intensidade da sua dor atualmente?"
text_3 = "Qual a intensidade da sua dor?"

#### **Creating Vocabulary**
---

Now we will use `CountVectorizer` to turn the words of the sentences being compared into a numeric format for the later comparisons. We use unigrams (single words) and a custom regular expression as the token pattern, so that single-character words such as "a" and "o" are kept; the default token pattern only keeps tokens of two or more characters.

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

# Unigrams only: ngram_range is (min_n, max_n), so single words are (1, 1).
words = CountVectorizer(analyzer='word', ngram_range=(1, 1), token_pattern=r'\b\w+\b')

texts = [text_1, text_2, text_3]

for text in texts:
    vocabulary = words.fit([text, default_text]).vocabulary_
    print(vocabulary)

{'de': 5, 'acordo': 2, 'com': 4, 'a': 0, 'escala': 7, 'dor': 6, 'ali': 3, 'na': 12, 'parede': 15, 'qual': 16, 'valor': 20, 'você': 21, 'acha': 1, 'que': 17, 'mais': 10, 'representa': 18, 'sua': 19, 'olhando': 13, 'para': 14, 'indicaria': 9, 'melhor': 11, 'hoje': 8}
{'de': 5, '0': 0, 'a': 2, '10': 1, 'qual': 18, 'o': 14, 'nível': 13, 'intensidade': 10, 'da': 4, 'sua': 19, 'dor': 6, 'atualmente': 3, 'olhando': 15, 'para': 16, 'escala': 7, 'na': 12, 'parede': 17, 'valor': 20, 'indicaria': 9, 'melhor': 11, 'hoje': 8}
{'qual': 12, 'a': 0, 'intensidade': 6, 'da': 1, 'sua': 13, 'dor': 2, 'olhando': 9, 'para': 10, 'escala': 3, 'na': 8, 'parede': 11, 'valor': 14, 'indicaria': 5, 'melhor': 7, 'hoje': 4}
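The effect of the custom `token_pattern` can be illustrated with Python's `re` module alone. This is a minimal sketch, not the vectorizer's actual code path: it assumes that tokens are taken from the lowercased text with the given pattern, and it compares the pattern used above with scikit-learn's default pattern, `(?u)\b\w\w+\b`, which requires at least two characters. The helper name `tokenize` is hypothetical.

```python
import re

CUSTOM_PATTERN = r"\b\w+\b"     # pattern used in this notebook: one or more word characters
DEFAULT_PATTERN = r"\b\w\w+\b"  # scikit-learn's default: two or more word characters


def tokenize(text, pattern):
    """Lowercase the text and return every match of the token pattern."""
    return re.findall(pattern, text.lower())


text_2 = "De 0 a 10, qual o nível de intensidade da sua dor atualmente?"

custom_tokens = tokenize(text_2, CUSTOM_PATTERN)
default_tokens = tokenize(text_2, DEFAULT_PATTERN)

# Tokens that only the custom pattern keeps (single-character words and digits).
only_custom = sorted(set(custom_tokens) - set(default_tokens))
print(only_custom)
```

With the default pattern, the article "a" would be dropped, and it accounts for part of the overlap between the original sentence and each comparative sentence.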


#### **Transforming Vocabularies into Arrays**
---

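Next, we transform each pair of sentences into count arrays. In each printed array, the first row is the comparative sentence, the second row is the original sentence, and each column holds the count of one word of their joint vocabulary.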

In [4]:
for text in texts:
    array = words.fit_transform([text, default_text]).toarray()
    print(array)

[[2 1 1 1 1 2 2 1 0 0 1 0 1 0 0 1 1 1 1 1 1 1]
 [2 0 0 0 0 0 1 1 1 1 0 1 1 1 1 1 1 0 0 1 1 0]]
[[1 1 1 1 1 2 1 0 0 0 1 0 0 1 1 0 0 0 1 1 0]
 [0 0 2 0 0 0 1 1 1 1 0 1 1 0 0 1 1 1 1 1 1]]
[[1 1 1 0 0 0 1 0 0 0 0 0 1 1 0]
 [2 0 1 1 1 1 0 1 1 1 1 1 1 1 1]]
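How such a count array comes about can be sketched without scikit-learn. The example below is an approximation under stated assumptions, not the library's implementation: it assumes lowercasing, the same `\b\w+\b` token pattern, and an alphabetically sorted joint vocabulary, which is consistent with the vocabulary indices printed earlier. The names `tokenize`, `row_3` and `row_default` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (single characters included)."""
    return re.findall(r"\b\w+\b", text.lower())


default_text = "Olhando para a escala na parede, qual valor indicaria melhor a sua dor hoje?"
text_3 = "Qual a intensidade da sua dor?"

counts_3 = Counter(tokenize(text_3))
counts_default = Counter(tokenize(default_text))

# Joint vocabulary in alphabetical order; each word becomes one column.
vocabulary = sorted(set(counts_3) | set(counts_default))

# One row per sentence, one count per vocabulary word.
row_3 = [counts_3[word] for word in vocabulary]
row_default = [counts_default[word] for word in vocabulary]

print(vocabulary)
print(row_3)
print(row_default)
```

The two rows reproduce the third array shown above. The value 2 in the first column of the second row reflects that "a" occurs twice in the original sentence.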


#### **Creating a similarity measure** 
---

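Now let's calculate the intersection between each comparative text and the original text: for every word of the joint vocabulary we take the smaller of its two counts, and then we sum these minima.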

In [5]:
for text in texts:
    array = words.fit_transform([text, default_text]).toarray()
    intersections = np.sum(np.amin(array, axis=0))
    print(intersections)

9
4
4
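To see where the first value, 9, comes from, the column-wise minimum can be inspected directly. This sketch copies the first count array printed in the previous section by hand instead of recomputing it; the variable names `shared_counts` and `shared_columns` are hypothetical.

```python
import numpy as np

# Count rows copied from the first array printed in the previous section:
# first row = comparative sentence 1, second row = original sentence.
array = np.array([
    [2, 1, 1, 1, 1, 2, 2, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1],
    [2, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0],
])

# Column-wise minimum: how many occurrences of each word the two sentences share.
shared_counts = np.amin(array, axis=0)

# Columns in which both sentences contain the word.
shared_columns = np.nonzero(shared_counts)[0]

print(shared_counts)
print(shared_columns)
print(shared_counts.sum())
```

According to the first vocabulary printed earlier, the shared columns correspond to the words a, dor, escala, na, parede, qual, sua and valor. The word "a" contributes 2 because it appears twice in both sentences, which brings the total to 9.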


#### **Similarity Score**
---

Dividing each intersection by the number of words in the original sentence (14) gives the similarity score.

In [6]:
divider = words.fit_transform([default_text]).toarray()
count_default = np.sum(divider)

for text in texts:
    array = words.fit_transform([text, default_text]).toarray()
    intersections = np.sum(np.amin(array, axis=0))
    score = intersections/count_default
    print(score)

0.6428571428571429
0.2857142857142857
0.2857142857142857
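As a cross-check, the same scores can be reproduced with the standard library only. This is a sketch under the assumption that lowercasing plus the `\b\w+\b` pattern tokenizes these sentences the same way as the vectorizer configured above; the function name `containment` is hypothetical and not part of any library.

```python
import re
from collections import Counter


def containment(comparative, original):
    """Share of the original's word occurrences that also occur in the comparative text."""
    comparative_counts = Counter(re.findall(r"\b\w+\b", comparative.lower()))
    original_counts = Counter(re.findall(r"\b\w+\b", original.lower()))
    # Counter & Counter keeps the minimum count of every shared word.
    intersection = sum((comparative_counts & original_counts).values())
    return intersection / sum(original_counts.values())


default_text = "Olhando para a escala na parede, qual valor indicaria melhor a sua dor hoje?"
texts = [
    "De acordo com a escala de dor ali na parede, qual valor você acha que mais representa a sua dor?",
    "De 0 a 10, qual o nível de intensidade da sua dor atualmente?",
    "Qual a intensidade da sua dor?",
]

scores = [containment(text, default_text) for text in texts]
for score in scores:
    print(round(score, 6))
```

Because the denominator is the word count of the original sentence, the measure is not symmetric: swapping the two arguments generally gives a different score.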


#### **Creating the dataframe**
---

To make the results easier to read, we collect them in a DataFrame.

In [7]:
# The word count of the original sentence is the same for every comparison,
# so it is computed once, outside the loop.
count_default = np.sum(words.fit_transform([default_text]).toarray())

result_list = []

for text in texts:
    array = words.fit_transform([text, default_text]).toarray()
    intersections = np.sum(np.amin(array, axis=0))
    score = intersections / count_default
    result_list.append({'Original sentence': default_text, 'Comparative sentences': text, 'Similarity score': score})

pd.set_option('display.width', 1000)
pd.set_option('display.max_rows', None)

# Build the DataFrame directly from the list of result dictionaries.
df_output = pd.DataFrame(result_list, columns=['Original sentence', 'Comparative sentences', 'Similarity score'])

df_output.head()

||Original sentence|Comparative sentences|Similarity score|
|--|--|--|--|
|0|Olhando para a escala na parede, qual valor in...|De acordo com a escala de dor ali na parede, q...|0.642857|
|1|Olhando para a escala na parede, qual valor in...|De 0 a 10, qual o nível de intensidade da sua ...|0.285714|
|2|Olhando para a escala na parede, qual valor in...|Qual a intensidade da sua dor?|0.285714|


#### **Conclusion**
---

Using the containment measure to calculate the similarity between texts, we obtained the following results (the sentences are shown here in English translation):

**Original Phrase**<br />
• Looking at the scale on the wall, what value would best indicate your pain today?<br />

**Comparative Phrases**<br />
• According to the pain scale on the wall, what value do you think best represents your pain?<br />
• On a scale of 0 to 10, what is the intensity level of your current pain?<br />
• What is the intensity of your pain?<br />

**Similarity Scores**<br />
• The first sentence obtained a score of 64.29%.<br />
• The second sentence obtained a score of 28.57%.<br />
• The third sentence obtained a score of 28.57%.<br />

Other models and similarity measures could also be applied to this problem, which leaves room for further improvements and new results.