Using SBERT to find similar sentences

References:

https://medium.com/mlearning-ai/nice-classification-recommendation-using-sentence-bert-b1af32d0131e
https://www.sbert.net/docs/usage/semantic_textual_similarity.html
https://www.sbert.net/examples/applications/paraphrase-mining/README.html


In [5]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer

In [27]:

#Cosine Similarity function
def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# #Load the Nice classification data
# ncl_all = pd.read_csv('consolidated_nice_classifications.csv', sep=',')

# load the dataframe 
title_number = "12"
df = pd.read_parquet(f"../dataframe/{title_number}.parquet")
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 117306 entries, 0 to 117305
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   p_id          117306 non-null  object
 1   text          117306 non-null  object
 2   child_ids     117306 non-null  object
 3   cfr_links     117306 non-null  object
 4   other_links   117306 non-null  object
 5   link_targets  117306 non-null  object
dtypes: object(6)
memory usage: 5.4+ MB


In [38]:
# Rather than use the whole title, let's just use a sample
# sample_size = 1000
# df_sample = df[:sample_size]
# df_sample.info()

In [33]:
# Load the pretrained SBERT model (nothing is trained here)
# Encoding the full title below takes about 26 minutes on my non-GPU notebook:
# 12th Gen Intel i7-1260P (16) @ 4.700GHz, 32GB RAM
sbert_model = SentenceTransformer('all-MiniLM-L6-v2')
batch_size = 128

# Create the sentence embeddings for every paragraph in the title
# (pass a plain list of strings rather than a pandas Series)
sentence_embeddings = sbert_model.encode(df['text'].tolist(), batch_size=batch_size, show_progress_bar=True)
print("Shape of embeddings = ", sentence_embeddings.shape)
# sentence_embeddings[0]

Batches: 100%|██████████| 917/917 [26:03<00:00,  1.71s/it]  


Shape of embeddings =  (117306, 384)
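
Since encoding the whole title takes about 26 minutes, it may be worth caching the embedding matrix on disk and reloading it in later sessions. The following is a minimal sketch, not part of the original notebook: the helper `load_or_compute`, the stand-in `fake_encode`, and the file name are hypothetical, and only NumPy's `np.save` / `np.load` are assumed.

```python
import os
import tempfile

import numpy as np


def load_or_compute(path, compute_fn):
    """Return the array stored at `path` if it exists; otherwise compute and save it."""
    if os.path.exists(path):
        return np.load(path)
    embeddings = np.asarray(compute_fn())
    np.save(path, embeddings)
    return embeddings


# Stand-in for the slow sbert_model.encode(...) call (hypothetical, random vectors).
calls = []

def fake_encode():
    calls.append(1)
    rng = np.random.default_rng(0)
    return rng.normal(size=(5, 384)).astype(np.float32)


# Hypothetical cache location; a temporary directory is used for this demo.
cache_path = os.path.join(tempfile.mkdtemp(), "title_12_embeddings.npy")
first = load_or_compute(cache_path, fake_encode)   # computes and writes the file
second = load_or_compute(cache_path, fake_encode)  # reads the file back
```

In the notebook, `compute_fn` would presumably be something like `lambda: sbert_model.encode(df['text'].tolist(), batch_size=batch_size, show_progress_bar=True)`. The cache has to be deleted by hand whenever the dataframe or the model changes, because this sketch does not detect stale files.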


In [36]:

# Create the sentence embedding for the example query
# #query = "A handbag is a medium-to-large bag typically used by women to hold personal items. It is often fashionably designed. Versions of the term are 'purse', 'pocketbook', 'pouch', or 'clutch', terms which suggest rather smaller versions."
# #query = 'A protection, safety, and private security agency. We specialize in the areas of close protection, property and home security and event security.'
# query = 'A Home appliance is any consumer-electronic machine use to complete some household task, such as cooking or cleaning. Home appliances can be classified into: Major appliances (or white goods) and Small appliances'
# #query = 'ovens for laboratory use'
# #query = 'computer game software for use on mobile and cellular telephones'
query = 'Transfer of fund via electronic methods such as ATM, POS, internet banking, mobile banking, etc.'
query_vec = sbert_model.encode([query])[0]
print("Shape of query embeddings = ", query_vec.shape)

Shape of query embeddings =  (384,)


In [37]:

# Calculate the similarity of the query to every paragraph in the title
cfr_sim = []
for cfr in sentence_embeddings:
    cfr_sim.append(cosine(query_vec, cfr))

df['similarity'] = cfr_sim

# Display the top 20 matches (the CSS property is 'min-width', not 'width-min')
df.sort_values(by=['similarity'], ascending=False).head(20).style.set_properties(subset=['text'], **{'min-width': '50px'})

Unnamed: 0,p_id,text,child_ids,cfr_links,other_links,link_targets,similarity
18333,205.3(b),(b)Electronic fund transfer—,['205.3(b)(1)' '205.3(b)(1)(i)' '205.3(b)(1)(ii)' '205.3(b)(1)(iii)'  '205.3(b)(1)(iv)' '205.3(b)(1)(v)' '205.3(b)(2)' '205.3(b)(2)(i)'  '205.3(b)(2)(ii)' '205.3(b)(2)(iii)' '205.3(b)(2)(iv)' '205.3(b)(3)'  '205.3(b)(3)(i)' '205.3(b)(3)(ii)' '205.3(b)(3)(iii)'],['/on/2023-09-28/title-12/section-205.3#p-205.3(b)(2)(iii)'  '/on/2023-09-28/title-12/section-205.3#p-205.3(b)(2)(iii)'  '/on/2023-09-28/title-12/section-205.3#p-205.3(b)(2)'  '/on/2023-09-28/title-12/section-205.3#p-205.3(b)(3)(ii)'  '/on/2023-09-28/title-12/section-205.3#p-205.3(b)(3)(i)'  '/on/2023-09-28/title-12/section-205.3#p-205.3(b)(3)(ii)'],[],['205.3(b)(2)(iii)' '205.3(b)(2)(iii)' '205.3(b)(2)' '205.3(b)(3)(ii)'  '205.3(b)(3)(i)' '205.3(b)(3)(ii)'],0.738359
84092,1005.3(b),(b)Electronic fund transfer—,['1005.3(b)(1)' '1005.3(b)(1)(i)' '1005.3(b)(1)(ii)' '1005.3(b)(1)(iii)'  '1005.3(b)(1)(iv)' '1005.3(b)(1)(v)' '1005.3(b)(2)' '1005.3(b)(2)(i)'  '1005.3(b)(2)(ii)' '1005.3(b)(2)(iii)' '1005.3(b)(3)' '1005.3(b)(3)(i)'  '1005.3(b)(3)(ii)'],['/on/2023-09-28/title-12/section-1005.3#p-1005.3(b)(2)'  '/on/2023-09-28/title-12/section-1005.3#p-1005.3(b)(3)(ii)'  '/on/2023-09-28/title-12/section-1005.3#p-1005.3(b)(3)(i)'],[],['1005.3(b)(2)' '1005.3(b)(3)(ii)' '1005.3(b)(3)(i)'],0.738359
18340,205.3(b)(2),(2)Electronic fund transfer using information from a check.,['205.3(b)(2)(i)' '205.3(b)(2)(ii)' '205.3(b)(2)(iii)' '205.3(b)(2)(iv)'],['/on/2023-09-28/title-12/section-205.3#p-205.3(b)(2)(iii)'  '/on/2023-09-28/title-12/section-205.3#p-205.3(b)(2)(iii)'  '/on/2023-09-28/title-12/section-205.3#p-205.3(b)(2)'],[],['205.3(b)(2)(iii)' '205.3(b)(2)(iii)' '205.3(b)(2)'],0.731448
84099,1005.3(b)(2),(2)Electronic fund transfer using information from a check.,['1005.3(b)(2)(i)' '1005.3(b)(2)(ii)' '1005.3(b)(2)(iii)'],['/on/2023-09-28/title-12/section-1005.3#p-1005.3(b)(2)'],[],['1005.3(b)(2)'],0.731448
85036,Supplement-I-to-Part-1005 1.,1.Fund transfers covered.The term “electronic fund transfer” includes:,['Supplement-I-to-Part-1005(1.)(i.)' 'Supplement-I-to-Part-1005(1.)(ii.)'  'Supplement-I-to-Part-1005(1.)(iii.)'  'Supplement-I-to-Part-1005(1.)(iv.)' 'Supplement-I-to-Part-1005(1.)(v.)'  'Supplement-I-to-Part-1005(1.)(vi.)'],[],[],[],0.730888
18839,Supplement-I-to-Part-205 1.,1.Fund transfers covered.The term electronic fund transfer includes:,['Supplement-I-to-Part-205(1.)(i.)' 'Supplement-I-to-Part-205(1.)(ii.)'  'Supplement-I-to-Part-205(1.)(iii.)' 'Supplement-I-to-Part-205(1.)(iv.)'  'Supplement-I-to-Part-205(1.)(v.)' 'Supplement-I-to-Part-205(1.)(vi.)'],[],[],[],0.728903
18334,205.3(b)(1),"(1)Definition.The term electronic fund transfer means any transfer of funds that is initiated through an electronic terminal, telephone, computer, or magnetic tape for the purpose of ordering, instructing, or authorizing a financial institution to debit or credit a consumer's account. The term includes, but is not limited to—",['205.3(b)(1)(i)' '205.3(b)(1)(ii)' '205.3(b)(1)(iii)' '205.3(b)(1)(iv)'  '205.3(b)(1)(v)'],[],[],[],0.728607
34845,233.2(y)(2),"(2)An electronic fund transfer, or funds transmitted by or through a money transmitting business, or the proceeds of an electronic fund transfer or money transmitting service, from or on behalf of such other person; or",[],[],[],[],0.723829
84093,1005.3(b)(1),"(1)Definition.The term “electronic fund transfer” means any transfer of funds that is initiated through an electronic terminal, telephone, computer, or magnetic tape for the purpose of ordering, instructing, or authorizing a financial institution to debit or credit a consumer's account. The term includes, but is not limited to:",['1005.3(b)(1)(i)' '1005.3(b)(1)(ii)' '1005.3(b)(1)(iii)'  '1005.3(b)(1)(iv)' '1005.3(b)(1)(v)'],[],[],[],0.723091
84174,1005.8(a)(1)(iii),(iii)Fewer types of available electronic fund transfers; or,[],[],[],[],0.704384
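
The Python loop over all 117306 embeddings can probably be replaced by a single matrix-vector product. The sketch below is one possible way to do it, assuming only NumPy: the function name `top_k_cosine` and the toy arrays are hypothetical and merely stand in for `sentence_embeddings` and `query_vec`.

```python
import numpy as np


def top_k_cosine(query_vec, embeddings, k=20):
    """Return (indices, scores) of the k rows of `embeddings` most similar to `query_vec`."""
    # Normalise once so that cosine similarity becomes a plain dot product.
    q = query_vec / np.linalg.norm(query_vec)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = e @ q
    k = min(k, len(scores))
    # argpartition selects the k largest scores without sorting everything;
    # only those k are then sorted in descending order.
    top = np.argpartition(-scores, k - 1)[:k]
    top = top[np.argsort(-scores[top])]
    return top, scores[top]


# Toy data standing in for sentence_embeddings and query_vec.
toy_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
toy_query = np.array([2.0, 0.0])
idx, sims = top_k_cosine(toy_query, toy_embeddings, k=2)
```

With the notebook's variables this would be called as `idx, sims = top_k_cosine(query_vec, sentence_embeddings)`, followed by `df.iloc[idx].assign(similarity=sims)` to view the matching rows. Note that Parts 205 and 1005 contain paragraphs with identical text, so those rows receive identical scores in the table above and will do so here as well.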
