# **Semantic Similarity between 2 sentences**
## **Problem Statement:**
#### Given two paragraphs, quantify the degree of similarity between the two texts based on semantic similarity. Semantic Textual Similarity (STS) assesses the degree to which two sentences are semantically equivalent. The task involves producing a real-valued similarity score for each pair.
## **Data Description:**
#### The data contains pairs of paragraphs, randomly sampled from a raw dataset. The paragraphs in each pair may or may not be semantically similar. The task is to predict a value between 0 and 1 indicating the degree of similarity between the two paragraphs in each pair.

#### 1: Highly similar
#### 0: Highly dissimilar

## **Approach**
This is a Natural Language Processing (NLP) problem, and text embedding plays a major role before building any deep learning model in NLP. A text embedding converts text (sentences, in our case) into numerical vectors.

After converting the sentences into vectors, we can measure how close the vectors are using Euclidean distance, cosine similarity, or another metric; that closeness tells us how similar the sentences are. In our case, we use cosine similarity.
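
To make the two measures concrete, here is a minimal sketch on toy vectors. The vectors `u`, `v`, `w` and the helper names `cosine_similarity` and `euclidean_distance` are made up for illustration and are not part of this notebook's pipeline.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    """Straight-line distance between two 1-D vectors."""
    return float(np.linalg.norm(a - b))

# Toy 3-dimensional vectors standing in for sentence embeddings (illustrative only).
u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 6.0])   # same direction as u, twice the length
w = np.array([-3.0, 0.0, 1.0])  # orthogonal to u (dot product is 0)

cos_uv = cosine_similarity(u, v)
cos_uw = cosine_similarity(u, w)
dist_uv = euclidean_distance(u, v)

print(cos_uv, cos_uw, dist_uv)
```

Here `u` and `v` point in the same direction, so their cosine similarity is 1 (up to floating-point rounding) even though their Euclidean distance is not 0. Cosine similarity ignores vector length and looks only at direction.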

But how do we convert text into vectors? We do not convert based on keywords alone; the encoding should also capture context and meaning.

We use the Universal Sentence Encoder (USE). It encodes text into high-dimensional vectors (512 dimensions for the version used here) that can be used for our semantic similarity task. The pre-trained USE model is publicly available on TensorFlow Hub.

# Importing required libraries:
### First, let's import the required libraries and load the Universal Sentence Encoder's TF Hub module.

In [1]:
import tensorflow as tf       # used to convert the embeddings to NumPy arrays
import pandas as pd           # to work with tables
import tensorflow_hub as hub  # to load USE4 from TensorFlow Hub
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"  # the model is loaded from this URL
model = hub.load(module_url)

def embed(texts):
    """Return the USE4 embeddings for a list of strings."""
    return model(texts)

# **Reading Data**

In [2]:
Data = pd.read_csv("Text_Similarity_Dataset.csv")

In [3]:
Data.head()

| | Unique_ID | text1 | text2 |
|---|---|---|---|
| 0 | 0 | savvy searchers fail to spot ads internet sear... | newcastle 2-1 bolton kieron dyer smashed home ... |
| 1 | 1 | millions to miss out on the net by 2025 40% o... | nasdaq planning $100m share sale the owner of ... |
| 2 | 2 | young debut cut short by ginepri fifteen-year-... | ruddock backs yapp s credentials wales coach m... |
| 3 | 3 | diageo to buy us wine firm diageo the world s... | mci shares climb on takeover bid shares in us ... |
| 4 | 4 | be careful how you code a new european directi... | media gadgets get moving pocket-sized devices ... |


In [4]:
Data.shape 

(696, 3)

In [5]:
Data['text1'][0]

'savvy searchers fail to spot ads internet search engine users are an odd mix of naive and sophisticated  suggests a report into search habits.  the report by the us pew research center reveals that 87% of searchers usually find what they were looking for when using a search engine. it also shows that few can spot the difference between paid-for results and organic ones. the report reveals that 84% of net users say they regularly use google  ask jeeves  msn and yahoo when online.  almost 50% of those questioned said they would trust search engines much less  if they knew information about who paid for results was being hidden. according to figures gathered by the pew researchers the average users spends about 43 minutes per month carrying out 34 separate searches and looks at 1.9 webpages for each hunt. a significant chunk of net users  36%  carry out a search at least weekly and 29% of those asked only look every few weeks. for 44% of those questioned  the information they are looking

In [6]:
type(Data['text1'][0])  # the text entries are plain Python strings

str

# Encoding text to vectors:
We have used USE version 4.
It was trained on a variety of text sources, including Wikipedia.
Each of our texts is a sequence of words. When we pass a text to the model (USE4), it returns a dense numeric vector.
Here, we pass a sentence pair and get a vector pair.

In [7]:
message = [Data['text1'][0], Data['text2'][0]]
message_embeddings = embed(message)
message_embeddings

<tf.Tensor: shape=(2, 512), dtype=float32, numpy=
array([[ 0.05397232, -0.04840369, -0.05309717, ...,  0.04776645,
        -0.06002418, -0.0236285 ],
       [-0.04064719, -0.05544911, -0.05753231, ...,  0.05157081,
        -0.05860626, -0.05815786]], dtype=float32)>

In [8]:
type(message_embeddings)

tensorflow.python.framework.ops.EagerTensor

### Here we can see that the returned vectors are of type tensorflow.python.framework.ops.EagerTensor. We convert them into a NumPy array before computing the cosine similarity.
---



In [9]:
type(message_embeddings[0])

tensorflow.python.framework.ops.EagerTensor

In [10]:
type(tf.make_ndarray(tf.make_tensor_proto(message_embeddings)))

numpy.ndarray

In [11]:
a_np = tf.make_ndarray(tf.make_tensor_proto(message_embeddings))

# Finding Cosine similarity
We run a for loop over all the sentence pairs in our data and compute the vector representation of each sentence. For each vector pair, we compute the cosine of the angle between the two vectors using the usual cosine formula, i.e.

### cos_sim = dot(a, b) / (norm(a) * norm(b))

This gives a value ranging from -1 to 1. But we need values ranging from 0 to 1, so we add 1 to each cosine similarity value and then normalize by dividing by the largest resulting value in the data.
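
As a rough sketch of what this rescaling does, the snippet below applies it to a few hypothetical cosine values (`raw` is made up, not taken from the dataset). It also shows, purely for comparison, a fixed rescaling `(s + 1) / 2`; that alternative is an assumption of this example and is not the method used in this notebook.

```python
import numpy as np

# Hypothetical raw cosine similarities in [-1, 1] (illustrative values only).
raw = np.array([-1.0, 0.0, 0.5, 0.8])

# Approach described above: shift by 1, then divide by the largest shifted value.
shifted = raw + 1.0
scaled_by_max = shifted / shifted.max()

# Alternative for comparison: a fixed rescaling that does not depend on the other pairs.
scaled_fixed = (raw + 1.0) / 2.0

print(scaled_by_max)
print(scaled_fixed)
```

With division by the maximum, the most similar pair in the data always receives a score of 1, and every score depends on the other pairs. The fixed rescaling keeps each score independent of the rest. Which one is preferable depends on how the scores will be evaluated.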


In [12]:
from numpy import dot           # dot product of two vectors
from numpy.linalg import norm   # norm (length) of a vector

ans = []  # cosine similarity value for each vector pair
for i in range(len(Data)):
    messages = [Data['text1'][i], Data['text2'][i]]                # the i-th text pair
    message_embeddings = embed(messages)                           # convert the text pair to a vector pair
    a = tf.make_ndarray(tf.make_tensor_proto(message_embeddings))  # vectors as a NumPy array
    cos_sim = dot(a[0], a[1]) / (norm(a[0]) * norm(a[1]))          # cosine between the two vectors
    ans.append(cos_sim)                                            # store the value in the ans list
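
The same computation can also be written without a Python loop. The following is only a sketch: `rowwise_cosine` is a hypothetical helper, and the random arrays `emb1` and `emb2` stand in for real embeddings, which would presumably come from embedding the `text1` and `text2` columns as two batches.

```python
import numpy as np

def rowwise_cosine(a, b):
    """Cosine similarity between matching rows of two (n, d) arrays."""
    dots = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return dots / norms

# Random stand-ins for embeddings; 512 matches the vector size shown earlier.
rng = np.random.default_rng(0)
emb1 = rng.normal(size=(4, 512))
emb2 = rng.normal(size=(4, 512))

scores = rowwise_cosine(emb1, emb2)

# Per-pair formula, as in the loop above, for comparison.
loop_scores = [
    np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    for x, y in zip(emb1, emb2)
]

print(np.allclose(scores, loop_scores))
```

Both versions give the same values; the vectorized form mainly saves per-pair Python overhead once the embeddings are available as arrays.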

In [13]:
len(ans) 

696

In [14]:
Ans = pd.DataFrame(ans, columns=['Similarity_Score'])  # convert the ans list into a DataFrame so that we can add it to "Data"

In [15]:
Ans.head()

| | Similarity_Score |
|---|---|
| 0 | 0.170659 |
| 1 | 0.188169 |
| 2 | 0.463088 |
| 3 | 0.421391 |
| 4 | 0.39246 |


In [16]:
Data = Data.join(Ans)  # join the Similarity_Score DataFrame (Ans) to our main Data

In [17]:
Data['Similarity_Score'] = Data['Similarity_Score'] + 1  # add 1 to each score so the values range from 0 to 2 (initially they were in [-1, 1])

In [18]:
Data.head(2)

| | Unique_ID | text1 | text2 | Similarity_Score |
|---|---|---|---|---|
| 0 | 0 | savvy searchers fail to spot ads internet sear... | newcastle 2-1 bolton kieron dyer smashed home ... | 1.170659 |
| 1 | 1 | millions to miss out on the net by 2025 40% o... | nasdaq planning $100m share sale the owner of ... | 1.188169 |


In [19]:
Data['Similarity_Score'] = Data['Similarity_Score']/Data['Similarity_Score'].abs().max()  # divide by the largest value so that the scores lie between 0 and 1

In [20]:
Data.head()

| | Unique_ID | text1 | text2 | Similarity_Score |
|---|---|---|---|---|
| 0 | 0 | savvy searchers fail to spot ads internet sear... | newcastle 2-1 bolton kieron dyer smashed home ... | 0.651335 |
| 1 | 1 | millions to miss out on the net by 2025 40% o... | nasdaq planning $100m share sale the owner of ... | 0.661077 |
| 2 | 2 | young debut cut short by ginepri fifteen-year-... | ruddock backs yapp s credentials wales coach m... | 0.814037 |
| 3 | 3 | diageo to buy us wine firm diageo the world s... | mci shares climb on takeover bid shares in us ... | 0.790838 |
| 4 | 4 | be careful how you code a new european directi... | media gadgets get moving pocket-sized devices ... | 0.774741 |


In [21]:
Submission = Data[['Unique_ID', 'Similarity_Score']]

In [22]:
Submission.head()

| | Unique_ID | Similarity_Score |
|---|---|---|
| 0 | 0 | 0.651335 |
| 1 | 1 | 0.661077 |
| 2 | 2 | 0.814037 |
| 3 | 3 | 0.790838 |
| 4 | 4 | 0.774741 |


In [23]:
Submission.set_index("Unique_ID", inplace = True)

In [24]:
from google.colab import files
Submission.to_csv('Submission.csv') 
files.download('Submission.csv')
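
The `google.colab` module is only available inside Colab, so elsewhere the CSV can be written with pandas alone. The sketch below uses made-up scores and writes to an in-memory buffer rather than a file so that it runs anywhere; passing `index=False` is shown as an alternative to setting `Unique_ID` as the index first, not as what this notebook does.

```python
import io
import pandas as pd

# Made-up scores for illustration; the real values come from the steps above.
submission = pd.DataFrame(
    {"Unique_ID": [0, 1, 2], "Similarity_Score": [0.65, 0.66, 0.81]}
)

buffer = io.StringIO()
submission.to_csv(buffer, index=False)  # index=False keeps the row index out of the output
csv_text = buffer.getvalue()

print(csv_text)
```

To write an actual file, a path such as `'Submission.csv'` can be passed in place of the buffer.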
