Semantic Similarity Model: computes a similarity score for each pair of texts.

Model used - sentence-transformers/bert-base-nli-mean-tokens

In [2]:
from transformers import AutoTokenizer, AutoModel
import torch
import pandas as pd


Load the data from the .csv file and convert the two columns to lists: sentence1 holds the values of the 'text1' column and sentence2 holds the values of the 'text2' column.

In [3]:
df = pd.read_csv('Precily_Text_Similarity_sample.csv', encoding='utf-8')
sentence1 = df.text1.tolist()
sentence2 = df.text2.tolist()

model_name = 'sentence-transformers/bert-base-nli-mean-tokens'

Interleave sentence1 and sentence2 into a single list, sentences, so that sentence2[0] comes directly after sentence1[0], sentence2[1] after sentence1[1], and so on.

In [4]:
# Interleave the two lists: [sentence1[0], sentence2[0], sentence1[1], sentence2[1], ...]
sentences = [item for pair in zip(sentence1, sentence2) for item in pair]
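To see the ordering this produces, here is a small sketch with made-up values (the lists `first` and `second` are illustrative stand-ins, not data from the .csv file). Each pair ends up at adjacent positions, so pair i occupies positions 2*i and 2*i + 1 of the flattened list.

```python
# Hypothetical toy lists standing in for sentence1 and sentence2.
first = ["a1", "b1", "c1"]
second = ["a2", "b2", "c2"]

# zip pairs the lists element-wise; the nested comprehension flattens the pairs.
interleaved = [item for pair in zip(first, second) for item in pair]
print(interleaved)  # ['a1', 'a2', 'b1', 'b2', 'c1', 'c2']

# Pair i can be recovered from positions 2*i and 2*i + 1.
pairs = [
    (interleaved[2 * i], interleaved[2 * i + 1])
    for i in range(len(interleaved) // 2)
]
print(pairs)
```

Note that zip stops at the shorter list, so any unpaired trailing item would be dropped silently.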


In [4]:
sentences

['broadband challenges tv viewing the number of europeans with broadband has exploded over the past 12 months  with the web eating into tv viewing habits  research suggests.  just over 54 million people are hooked up to the net via broadband  up from 34 million a year ago  according to market analysts nielsen/netratings. the total number of people online in europe has broken the 100 million mark. the popularity of the net has meant that many are turning away from tv  say analysts jupiter research. it found that a quarter of web users said they spent less time watching tv in favour of the net  the report by nielsen/netratings found that the number of people with fast internet access had risen by 60% over the past year.  the biggest jump was in italy  where it rose by 120%. britain was close behind  with broadband users almost doubling in a year. the growth has been fuelled by lower prices and a wider choice of always-on  fast-net subscription plans.  twelve months ago high speed interne

Initialize the tokenizer and the model from the same pretrained Hugging Face checkpoint, 'sentence-transformers/bert-base-nli-mean-tokens'

In [5]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

Create a dictionary to collect the input_ids and attention_mask tensors for all the sentences

In [6]:
tokens = {'input_ids': [], 'attention_mask': []}

Tokenize each sentence, truncating or padding it to 128 tokens, and collect the results

In [7]:
for sentence in sentences:
    new_tokens = tokenizer.encode_plus(sentence, max_length=128,
                                       truncation=True, padding='max_length', return_tensors='pt')
    tokens['input_ids'].append(new_tokens['input_ids'][0])
    tokens['attention_mask'].append(new_tokens['attention_mask'][0])
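As a rough illustration of what truncation and padding to a fixed length do, here is a dependency-free sketch. The helper `pad_or_truncate`, the token ids, and the padding id of 0 are all assumptions made for the example; the real tokenizer also takes care of special tokens, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Hypothetical helper mimicking fixed-length truncation and padding."""
    kept = token_ids[:max_length]            # drop anything past max_length
    n_pad = max_length - len(kept)           # how many padding slots are needed
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

# A short sequence gets padded, a long one gets cut (ids are made up).
short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=5)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)

print(short_ids, short_mask)  # [101, 7592, 102, 0, 0] [1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [101, 1, 2, 3, 4] [1, 1, 1, 1, 1]
```

Because every sequence comes out with the same length, the per-sentence results can later be combined into one rectangular tensor.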

Stack the per-sentence tensors into a single PyTorch tensor for each key

In [8]:
tokens['input_ids'] = torch.stack(tokens['input_ids'])
tokens['attention_mask'] = torch.stack(tokens['attention_mask'])

In [9]:
tokens['input_ids'].shape

torch.Size([400, 128])

Passing the tokens to the model

In [10]:
outputs = model(**tokens)

In [11]:
outputs

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[-0.3117,  0.3173, -0.7577,  ...,  0.1216,  1.4572,  0.1638],
         [ 0.3270,  0.3581, -0.1369,  ..., -0.1159,  1.1774, -0.0059],
         [-0.0047,  0.5353, -0.5966,  ...,  0.0892,  0.9726, -0.2937],
         ...,
         [-0.5084, -0.1953, -0.6451,  ...,  0.1628,  1.3361,  0.0447],
         [-0.8518, -0.1075, -0.2885,  ...,  0.0305,  0.9707,  0.0444],
         [-0.2298,  0.4919, -0.4359,  ..., -0.0204,  1.4100,  0.0787]],

        [[-0.8053,  0.4447, -0.0258,  ..., -0.1588,  0.6907, -0.4419],
         [-0.1226,  0.4971, -0.1549,  ..., -0.2928,  0.5732, -0.7079],
         [-0.6810,  0.7066,  0.1803,  ..., -0.3262,  0.5653, -0.6268],
         ...,
         [-0.1318,  0.1959,  0.1768,  ..., -0.1902,  0.3838, -0.0228],
         [ 0.0735,  0.4843,  0.1758,  ..., -0.1621,  0.3166, -0.3355],
         [-0.4902,  0.8651,  0.4686,  ..., -0.2613,  0.4778, -0.5513]],

        [[-0.4956,  0.3217, -0.4395,  ..., -0.2438, -

In [12]:
outputs.keys()

odict_keys(['last_hidden_state', 'pooler_output'])

Take last_hidden_state as the token-level embeddings; mean pooling over them will produce one vector per sentence

In [11]:
embeddings = outputs.last_hidden_state
embeddings.shape

torch.Size([400, 128, 768])

Multiply each token embedding by its attention_mask value (1 for a real token, 0 for a padding position) so that padding positions contribute nothing to the pooled vector

In [12]:
attention = tokens['attention_mask']
attention.shape

torch.Size([400, 128])

Expand attention to the same shape as embeddings

In [13]:
mask = attention.unsqueeze(-1).expand(embeddings.shape).float()

In [14]:
mask_embeddings = embeddings * mask

Sum the masked embeddings along the token axis, giving one 768-dimensional vector per sentence

In [15]:
summed = torch.sum(mask_embeddings, 1)
summed.shape

torch.Size([400, 768])

Count the real (non-padding) tokens in each sentence; clamping avoids division by zero

In [16]:
counts = torch.clamp(mask.sum(1), min=1e-9)
counts.shape

torch.Size([400, 768])

Divide the sums by the counts to get the mean-pooled sentence embeddings

In [17]:
mean_pooled = summed/counts
mean_pooled.shape

torch.Size([400, 768])
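The mask, sum, count, and divide steps can be checked by hand on a tiny case. The sketch below uses plain Python lists and made-up numbers (one sentence, three token positions, two embedding dimensions) rather than the notebook's tensors, so the names here are illustrative only.

```python
# One sentence, three token positions, two embedding dimensions (made-up numbers).
token_embeddings = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]  # the third position is padding

dim = len(token_embeddings[0])

# Sum each dimension over the positions, weighting by the mask so padding adds 0.
summed_toy = [
    sum(vec[d] * m for vec, m in zip(token_embeddings, attention_mask))
    for d in range(dim)
]

# Number of real tokens, clamped away from zero as in the notebook.
count_toy = max(sum(attention_mask), 1e-9)

mean_pooled_toy = [s / count_toy for s in summed_toy]
print(mean_pooled_toy)  # [2.0, 3.0]
```

The padding row [9.0, 9.0] has no effect on the result: only the two real tokens are averaged.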

In [18]:
mean_pooled

tensor([[-0.2648,  0.2231, -0.4899,  ...,  0.0097,  1.2247,  0.0544],
        [-0.4501,  0.5434,  0.0246,  ..., -0.2090,  0.4181, -0.4787],
        [-0.2469,  0.1297, -0.1575,  ..., -0.1625, -0.3272, -0.3834],
        ...,
        [-0.2629,  0.8346, -0.0420,  ..., -0.3503, -0.6176, -0.2161],
        [-0.7407,  0.6367, -0.7799,  ..., -0.0980,  0.6192,  0.1845],
        [-0.6438,  0.8662, -0.2314,  ...,  0.2144,  0.1373,  0.3366]],
       grad_fn=<DivBackward0>)

In [19]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np


Use cosine similarity to measure how closely two sentence vectors match. Here each text1 embedding is compared with its corresponding text2 embedding.

In [21]:
# Detach from the autograd graph and convert to NumPy for scikit-learn
mean_pooled = mean_pooled.detach().numpy()

similarity_score = []
# The embeddings are interleaved, so rows i and i+1 form one (text1, text2) pair
for i in range(0, len(mean_pooled) - 1, 2):
    similarity = cosine_similarity([mean_pooled[i]], [mean_pooled[i + 1]])[0, 0]
    similarity = "%.2f" % similarity  # two decimals; note this stores a string
    similarity_score.append(similarity)
    print(f"Cosine Similarity between text1 and text2 : {similarity}")


Cosine Similarity between text1 and text2 : 0.62
Cosine Similarity between text1 and text2 : 0.65
Cosine Similarity between text1 and text2 : 0.57
Cosine Similarity between text1 and text2 : 0.55
Cosine Similarity between text1 and text2 : 0.65
Cosine Similarity between text1 and text2 : 0.53
Cosine Similarity between text1 and text2 : 0.62
Cosine Similarity between text1 and text2 : 0.72
Cosine Similarity between text1 and text2 : 0.64
Cosine Similarity between text1 and text2 : 0.61
Cosine Similarity between text1 and text2 : 0.63
Cosine Similarity between text1 and text2 : 0.64
Cosine Similarity between text1 and text2 : 0.63
Cosine Similarity between text1 and text2 : 0.76
Cosine Similarity between text1 and text2 : 0.57
Cosine Similarity between text1 and text2 : 0.63
Cosine Similarity between text1 and text2 : 0.75
Cosine Similarity between text1 and text2 : 0.65
Cosine Similarity between text1 and text2 : 0.45
Cosine Similarity between text1 and text2 : 0.69
Cosine Similarity be
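For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below spells that out in plain Python; the function name `cosine` and the input vectors are made up for illustration, and it assumes neither vector is all zeros.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors (plain-Python sketch)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal
print(cosine([3.0, 4.0], [4.0, 3.0]))  # 24 / 25
```

Vectors pointing the same way score 1, orthogonal vectors score 0, and the value depends only on direction, not on vector length.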

In [48]:
similarity_score



['0.62',
 '0.65',
 '0.57',
 '0.55',
 '0.65',
 '0.53',
 '0.62',
 '0.72',
 '0.64',
 '0.61',
 '0.63',
 '0.64',
 '0.63',
 '0.76',
 '0.57',
 '0.63',
 '0.75',
 '0.65',
 '0.45',
 '0.69',
 '0.68',
 '0.69',
 '0.62',
 '0.63',
 '0.73',
 '0.62',
 '0.66',
 '0.56',
 '0.50',
 '0.64']
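The scores in this list are strings because of the "%.2f" formatting. If they need to be sorted, averaged, or compared against a threshold later, keeping them as floats may be more convenient. A small sketch with made-up raw values (not taken from the notebook):

```python
raw_scores = [0.6234, 0.6512, 0.5749]  # made-up values for illustration

as_strings = ["%.2f" % s for s in raw_scores]  # string formatting, as in the loop
as_floats = [round(s, 2) for s in raw_scores]  # numeric alternative

print(as_strings)  # ['0.62', '0.65', '0.57']
print(as_floats)   # [0.62, 0.65, 0.57]

# Floats support arithmetic directly; strings would need float() first.
average = sum(as_floats) / len(as_floats)
print(average)
```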

In [22]:
existing_data = pd.read_csv('Precily_Text_Similarity_sample.csv')
existing_data['Similarity Score'] = similarity_score
existing_data.to_csv('Results.csv', index=False)


In [23]:
import pickle

# Save the full list of scores (not only the last value); the file is closed automatically
with open("Cosine_Similarity.pkl", "wb") as pickle_out:
    pickle.dump(similarity_score, pickle_out)
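A quick way to confirm what a pickle file contains is to load it back and compare. The sketch below does a round trip in a temporary directory; the list `scores` and the file name `scores_demo.pkl` are stand-ins chosen for the example.

```python
import os
import pickle
import tempfile

scores = ["0.62", "0.65", "0.57"]  # stand-in for the list of similarity scores

# Write to a temporary path so the sketch does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "scores_demo.pkl")

with open(path, "wb") as f:
    pickle.dump(scores, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == scores)  # True
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.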