Semantic Similarity Model - Calculates a similarity score for a pair of sentences.

Model used -> sentence-transformers/bert-base-nli-mean-tokens

In [1]:
from transformers import AutoTokenizer, AutoModel
import torch
import pandas as pd



Loading the data from the .csv file and converting it to lists: sentence1 holds the 'text1' column values and sentence2 holds the 'text2' column values

In [49]:
df = pd.read_csv('Precily_Text_Similarity_sample.csv', encoding='utf-8')
sentence1 = df.text1.tolist()
sentence2 = df.text2.tolist()

model_name = 'sentence-transformers/bert-base-nli-mean-tokens'

Interleaving sentence1 and sentence2 into a single sentences list so that sentence2[0] comes right after sentence1[0], sentence2[1] right after sentence1[1], and so on

In [47]:
# zip pairs the i-th items of both lists; the nested loop flattens each pair in order
sentences = [item for sublist in zip(sentence1, sentence2) for item in sublist]
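
To make the resulting order concrete, here is a minimal sketch of the same comprehension on toy lists. The names `first` and `second` and their values are placeholders for illustration, not rows from the CSV.

```python
# Hypothetical stand-ins for the 'text1' and 'text2' columns.
first = ["a1", "b1", "c1"]
second = ["a2", "b2", "c2"]

# zip pairs the i-th items; the nested comprehension flattens each pair in order.
interleaved = [item for pair in zip(first, second) for item in pair]
print(interleaved)  # ['a1', 'a2', 'b1', 'b2', 'c1', 'c2']
```

Even positions hold the text1 items and odd positions hold the text2 items. Note that zip stops at the shorter list, so unequal lengths would silently drop the extra items.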


In [50]:
sentences

['Beneath the starry night, the silhouettes of ancient ruins whispered tales of civilizations long forgotten.',
 'My things are not good.']

Initializing the tokenizer and the model from the same pretrained Hugging Face checkpoint, 'sentence-transformers/bert-base-nli-mean-tokens'

In [51]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

Preparing a dictionary to collect the token ids and attention masks of all the sentences

In [52]:
tokens = {'input_ids': [], 'attention_mask': []}

Running the tokenizer on each input sentence, padding or truncating every sentence to 128 tokens

In [53]:
for sentence in sentences:
    new_tokens = tokenizer.encode_plus(sentence, max_length=128,
                                       truncation=True, padding='max_length', return_tensors='pt')
    tokens['input_ids'].append(new_tokens['input_ids'][0])
    tokens['attention_mask'].append(new_tokens['attention_mask'][0])
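
As a rough illustration of what padding to a fixed length and the attention mask mean, here is a simplified pure-Python sketch. The helper `pad_ids`, the sample ids, and the pad id of 0 are assumptions made for this example; this is not how the tokenizer is implemented, and the real tokenizer also handles special tokens when truncating.

```python
def pad_ids(ids, max_length, pad_id=0):
    """Cut or pad a list of token ids to max_length and build the matching mask."""
    ids = ids[:max_length]                      # naive truncation (illustrative only)
    n_pad = max_length - len(ids)
    attention = [1] * len(ids) + [0] * n_pad    # 1 = real token, 0 = padding
    return ids + [pad_id] * n_pad, attention

padded, attention = pad_ids([101, 7592, 102], max_length=5)
print(padded)     # [101, 7592, 102, 0, 0]
print(attention)  # [1, 1, 1, 0, 0]
```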

Stacking the per-sentence tensors so that each key holds a single PyTorch tensor

In [54]:
tokens['input_ids'] = torch.stack(tokens['input_ids'])
tokens['attention_mask'] = torch.stack(tokens['attention_mask'])

In [55]:
tokens['input_ids'].shape

torch.Size([2, 128])
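
A small sketch of what torch.stack does to a list of equal-length 1-D tensors. The tensor values here are made up for illustration.

```python
import torch

# Two hypothetical 1-D token-id tensors of equal length, standing in for
# the per-sentence rows collected in the loop.
rows = [torch.tensor([101, 7592, 102, 0]), torch.tensor([101, 2088, 999, 102])]

# torch.stack adds a new leading dimension: 2 tensors of shape (4,) -> (2, 4).
batch = torch.stack(rows)
print(batch.shape)  # torch.Size([2, 4])
```

Stacking only works when every tensor has the same shape, which is why each sentence is padded to the same length first.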

Passing the tokens to the model

In [56]:
outputs = model(**tokens)

In [57]:
outputs

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 4.1199e-01,  7.7777e-01,  6.0520e-01,  ..., -2.0810e-01,
           8.5562e-02,  2.4778e-01],
         [ 7.9102e-01,  8.8493e-01,  3.5878e-01,  ...,  2.1151e-01,
          -1.1117e-01, -2.0535e-01],
         [ 6.9259e-01,  5.4534e-01,  9.6198e-01,  ...,  7.8976e-02,
          -1.1591e-01, -1.5669e-01],
         ...,
         [ 2.7982e-01,  4.8222e-01,  5.7033e-01,  ...,  2.0525e-01,
          -3.1677e-01, -1.2900e-01],
         [ 2.7551e-01,  4.4563e-01,  4.4077e-01,  ...,  1.9826e-01,
          -4.0210e-01, -2.3859e-02],
         [ 3.7187e-01,  2.8694e-01,  3.8014e-01,  ...,  2.5909e-01,
          -4.3329e-01,  5.5404e-02]],

        [[ 6.3068e-01,  5.6814e-02,  1.8481e+00,  ..., -4.0533e-01,
          -5.2693e-01, -5.2335e-02],
         [ 7.0392e-01,  4.9431e-02,  1.9409e+00,  ..., -4.8916e-01,
          -5.4688e-01,  1.1321e-01],
         [ 4.5755e-01,  1.6717e-01,  1.3666e+00,  ..., -5.0511e-01,
          -5.

In [58]:
outputs.keys()

odict_keys(['last_hidden_state', 'pooler_output'])

Taking last_hidden_state as the token embeddings; mean pooling over the tokens will then turn them into one vector per sentence

In [59]:
embeddings = outputs.last_hidden_state
embeddings.shape

torch.Size([2, 128, 768])

Multiplying each token embedding by its attention_mask value (1 for a real token, 0 for padding) zeroes out the padding embeddings so that they do not affect the mean

In [60]:
attention = tokens['attention_mask']
attention.shape

torch.Size([2, 128])

Expanding attention to the same shape as embeddings

In [61]:
mask = attention.unsqueeze(-1).expand(embeddings.shape).float()

In [62]:
mask_embeddings = embeddings * mask
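
The shape change and the effect of the multiplication are easier to see on a tiny tensor. The sketch below uses made-up values (1 sentence, 3 token positions, hidden size 2) rather than the notebook's real tensors.

```python
import torch

# Hypothetical tiny batch: 1 sentence, 3 token positions, hidden size 2.
emb = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]])
att = torch.tensor([[1, 1, 0]])  # the last position is padding

# (1, 3) -> (1, 3, 1) -> (1, 3, 2): one mask value per hidden dimension.
m = att.unsqueeze(-1).expand(emb.shape).float()
masked = emb * m
print(masked)  # the padding row becomes [0., 0.]
```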

Summing mask_embeddings over the token dimension to get a single vector per sentence

In [63]:
summed = torch.sum(mask_embeddings, 1)
summed.shape

torch.Size([2, 768])

Counting the real (non-padding) tokens of each sentence; clamping to a tiny minimum avoids division by zero

In [64]:
counts = torch.clamp(mask.sum(1), min=1e-9)
counts.shape

torch.Size([2, 768])

Getting the mean pooling of the masked embeddings

In [65]:
mean_pooled = summed/counts
mean_pooled.shape

torch.Size([2, 768])

In [66]:
mean_pooled

tensor([[ 0.6129,  0.7202,  0.7687,  ...,  0.0462, -0.0825, -0.1133],
        [ 0.5945,  0.0229,  1.7749,  ..., -0.4434, -0.6163, -0.2169]],
       grad_fn=<DivBackward0>)
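
The pooling steps above can be gathered into one helper. This is a sketch: the function name `masked_mean_pool` and the toy tensors are assumptions for illustration, while the body follows the same sum-and-divide steps used in the cells above.

```python
import torch

def masked_mean_pool(hidden, attention_mask):
    """Average token vectors over real (non-padding) positions only."""
    mask = attention_mask.unsqueeze(-1).expand(hidden.shape).float()
    summed = torch.sum(hidden * mask, dim=1)         # sum over the token axis
    counts = torch.clamp(mask.sum(dim=1), min=1e-9)  # number of real tokens
    return summed / counts

# Toy input: the third position is padding and must not influence the mean.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
attention_mask = torch.tensor([[1, 1, 0]])
pooled = masked_mean_pool(hidden, attention_mask)
print(pooled)  # tensor([[2., 3.]])
```

A plain mean over all three positions would have been pulled toward 100 by the padding row, which is why the mask is applied before averaging.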

In [67]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np


Using cosine similarity to measure how closely two sentence vectors match. Here each text1 sentence is compared to its paired text2 sentence

In [68]:
# Convert to a NumPy array (detached from the autograd graph) for scikit-learn
mean_pooled = mean_pooled.detach().numpy()

# Rows alternate text1, text2, so compare each even row with the row after it
for i in range(0, len(mean_pooled)-1, 2):
    similarity = cosine_similarity([mean_pooled[i]], [mean_pooled[i+1]])[0, 0]
    print(f"Cosine Similarity between text1 and text2 : {similarity}")

Cosine Similarity between text1 and text2 : 0.3213573694229126
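
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below spells that out in plain Python on toy vectors; the function name `cosine` is chosen for this example and is not the scikit-learn implementation.

```python
import math

def cosine(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine([1.0, 2.0], [2.0, 4.0])  # parallel vectors, close to 1
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])      # perpendicular vectors, 0
print(same_direction, orthogonal)
```

Scores near 1 mean the two sentence vectors point in nearly the same direction, and scores near 0 mean they are unrelated in this embedding space.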


In [23]:
import pickle

# Save the similarity score (the last one computed in the loop) to disk
with open("Cosine_Similarity.pkl", "wb") as pickle_out:
    pickle.dump(similarity, pickle_out)
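
To check that a pickled score can be read back, a round trip along these lines should work. The temporary path and the variable names are assumptions for this sketch, and the score is simply the value printed earlier, used as sample data.

```python
import os
import pickle
import tempfile

score = 0.3213573694229126  # sample value taken from the output above

# Write to a throwaway location so the sketch does not overwrite real files.
path = os.path.join(tempfile.mkdtemp(), "score.pkl")
with open(path, "wb") as f:
    pickle.dump(score, f)

# Reading it back returns an equal Python object.
with open(path, "rb") as f:
    restored = pickle.load(f)
print(restored == score)  # True
```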