# Semantic similarity using word embeddings

In [1]:
#pulling in the data
import pandas as pd
df = pd.read_json("hf://datasets/toughdata/quora-question-answer-dataset/Quora-QuAD.jsonl", lines=True)
df.head()


Unnamed: 0,question,answer
0,Why whenever I get in the shower my girlfriend...,Isn’t it awful? You would swear that there was...
1,"What is a proxy, and how can I use one?",A proxy server is a system or router that prov...
2,"What song has the lyrics ""someone left the cak...",MacArthur's Park\n
3,I am the owner of an adult website called http...,Don't let apps that are liers put adds on your...
4,Does the Bible mention anything about a place ...,St. John in the book of Revelation mentions an...


# Preprocessing

1. Remove all characters that are not alphanumeric or whitespace.
2. Remove stopwords: commonly used words such as 'a', 'to', and 'in' that do not contribute to the semantic similarity between two sentences.

This is applied to both the dataset questions and the user's query.

Note: Stopword removal is optional. Some of the later models, such as BERT, work well without it, and we will try that out.

In [2]:
import re
import gensim
from gensim.parsing.preprocessing import remove_stopwords

def clean_sentence(sentence, stopwords=False):
    # Lowercase, trim, and drop everything except letters, digits, and whitespace.
    sentence = sentence.lower().strip()
    sentence = re.sub(r'[^a-z0-9\s]', '', sentence)

    # Optionally remove gensim's built-in English stopwords.
    if stopwords:
        sentence = remove_stopwords(sentence)

    return sentence

def get_cleaned_sentences(df, stopwords=False):
    # Clean every entry of the "question" column, preserving row order.
    cleaned_sentences = []
    for index, row in df.iterrows():
        cleaned_sentences.append(clean_sentence(row["question"], stopwords))
    return cleaned_sentences

cleaned_sentences = get_cleaned_sentences(df, stopwords=True)
print(cleaned_sentences)

print("\n")

cleaned_sentences_with_stopwords = get_cleaned_sentences(df, stopwords=False)
print(cleaned_sentences_with_stopwords)






# Bag of words Model

The initial model we'll employ for assessing semantic similarity utilizes the Bag of Words (BOW) approach. In BOW, each sentence is transformed into a vector, with its length corresponding to the total number of words in the vocabulary. Each element within the vector represents the frequency of a specific word in the sentence. Below, we demonstrate this with an example that prints the dictionary and the questions in the BOW sparse format.

It's important to note that a vector representation of a sentence is also known as an "embedding," as it embeds the sentence into an M-dimensional space if the vector has a length of M.   
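As a rough illustration of the idea, the sketch below builds a BOW representation by hand for two toy sentences. The helper names (`build_vocabulary`, `to_sparse_bow`, `to_dense_bow`) and the toy sentences are made up for this example; gensim's `Dictionary` may assign word ids in a different order.

```python
from collections import Counter

def build_vocabulary(tokenized_sentences):
    """Map each distinct word to an integer id, in order of first appearance."""
    vocab = {}
    for tokens in tokenized_sentences:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def to_sparse_bow(tokens, vocab):
    """Return sorted (word_id, count) pairs, skipping out-of-vocabulary words."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())

def to_dense_bow(tokens, vocab):
    """Return a list of length len(vocab) holding the count of each word."""
    dense = [0] * len(vocab)
    for word_id, count in to_sparse_bow(tokens, vocab):
        dense[word_id] = count
    return dense

# Hypothetical toy data, not taken from the Quora dataset
toy_sentences = ["what is a proxy", "how to use a proxy for a proxy server"]
toy_tokens = [s.split() for s in toy_sentences]
toy_vocab = build_vocabulary(toy_tokens)
toy_sparse = [to_sparse_bow(t, toy_vocab) for t in toy_tokens]
toy_dense = [to_dense_bow(t, toy_vocab) for t in toy_tokens]

print(toy_vocab)
print(toy_sparse[1])  # sparse form: only the words that occur
print(toy_dense[1])   # dense form: one slot per vocabulary word
```

The sparse form stores only non-zero counts, which matters once the vocabulary grows to thousands of words.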

In [None]:
import numpy
from gensim import corpora

sentences = cleaned_sentences_with_stopwords
# sentences = cleaned_sentences

# Tokenize each sentence by splitting on whitespace
sentence_words = [document.split() for document in sentences]

# Map each distinct word to an integer id
dictionary = corpora.Dictionary(sentence_words)
for key, value in dictionary.items():
    print(key, ' : ', value)

# Convert each sentence to sparse BOW form: a list of (word_id, count) pairs
bow_corpus = [dictionary.doc2bow(text) for text in sentence_words]
for sent, embedding in zip(sentences, bow_corpus):
    print(sent)
    print(embedding)

question_orig = "How to use proxy?"
question = clean_sentence(question_orig, stopwords=False)
question_embedding = dictionary.doc2bow(question.split())

print("\n\n", question, "\n", question_embedding)

Streaming output truncated to the last 5000 lines.
whats going on with all the fires in russia is this sabotage ukrainian special forces something else
[(3, 1), (7, 1), (16, 1), (41, 1), (93, 1), (201, 1), (339, 1), (664, 1), (899, 1), (1349, 1), (1564, 1), (1825, 1), (2085, 1), (3645, 1), (3646, 1), (3647, 1), (3648, 1)]
how does one get officially married if they dont believe in god would you still have to have a priest
[(0, 1), (3, 1), (8, 1), (12, 1), (15, 1), (17, 1), (52, 1), (76, 2), (77, 1), (89, 1), (94, 1), (125, 1), (284, 1), (304, 1), (377, 1), (636, 1), (2241, 1), (2317, 1), (2358, 1)]
is it possible to use an extensive vocabulary without being a jerk
[(8, 1), (12, 1), (16, 1), (19, 1), (31, 1), (112, 1), (371, 1), (397, 1), (844, 1), (1933, 1), (1934, 1), (1935, 1)]
what is the difference between granite and marble
[(7, 1), (13, 1), (16, 1), (20, 1), (50, 1), (763, 1), (795, 1), (796, 1)]
what newsletters would you recommend to a fullstack developer
[(8, 1),


After obtaining a vector representation for each sentence using BOW, we can measure how close two vectors are by computing their cosine similarity, the cosine of the angle between them. Other similarity measures are available, but we use cosine similarity for simplicity.

To find the closest matching answer, we compute the cosine similarity between the query vector and each of the question vectors.

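In the example below, the BOW representation does not perform well and retrieves the wrong answer. BOW can only match exact words, so a query and a stored question that share meaning but few words score poorly. Note also that the code passes gensim's sparse `(word_id, count)` lists directly to `cosine_similarity`, which reads each pair as a two-element row rather than as a vocabulary-length vector, so the scores are not true BOW cosine similarities.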
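For reference, cosine similarity between two sparse BOW vectors can be sketched in plain Python as below. The query ids are the ones printed for "how to use proxy" in this notebook; `toy_corpus` and the helper names `sparse_cosine` and `most_similar` are hypothetical and only serve the illustration.

```python
import math

def sparse_cosine(bow_a, bow_b):
    """Cosine similarity between two sparse vectors given as (id, count) pairs."""
    a, b = dict(bow_a), dict(bow_b)
    dot = sum(count * b.get(word_id, 0) for word_id, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)

def most_similar(query_bow, corpus_bows):
    """Return (index, score) of the corpus vector closest to the query."""
    scores = [sparse_cosine(query_bow, doc) for doc in corpus_bows]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

toy_query = [(8, 1), (15, 1), (18, 1), (19, 1)]
toy_corpus = [                      # hypothetical documents, made up for this sketch
    [(3, 1), (7, 1), (16, 1)],      # shares no word with the query
    [(8, 1), (15, 1), (19, 2)],     # shares three words
    [(18, 1), (40, 1)],             # shares one word
]
best_index, best_score = most_similar(toy_query, toy_corpus)
print(best_index, round(best_score, 3))
```

Only word ids that appear in both vectors contribute to the dot product, which is why BOW scores depend entirely on exact word overlap.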

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

def retrieveAndPrintFAQAnswer(question_embedding, sentence_embeddings, FAQdf, sentences):
    # Linear scan for the stored question most similar to the query.
    # cosine_similarity expects 2-D arrays of shape (n_samples, n_features),
    # e.g. dense (1, dim) phrase embeddings. A gensim BOW list of (id, count)
    # pairs is read as rows of length 2, so only the first pairs get compared.
    max_sim = -1
    index_sim = -1
    for index, faq_embedding in enumerate(sentence_embeddings):
        sim = cosine_similarity(faq_embedding, question_embedding)[0][0]
        if sim > max_sim:
            max_sim = sim
            index_sim = index

    print("\n")
    print("Question: ", question)  # uses the global `question` set earlier
    print("\n")
    print("Retrieved: ", FAQdf.iloc[index_sim, 0])
    print(FAQdf.iloc[index_sim, 1])

retrieveAndPrintFAQAnswer(question_embedding, bow_corpus, df, sentences)



Question:  how to use proxy


Retrieved:  Did a landlord ever offer to lower your rent so that you wouldn't leave?
it’s never happened that a landlord has offered to lower my rent so that I wouldn’t leave



# Word2Vec Embeddings

Word2Vec embeddings are commonly trained using the skip-gram model. This training method involves taking a word as input and reconstructing its context. Consequently, these embeddings consider the semantic similarity of words based on contextual information. The resulting embeddings ensure that words with similar meanings are closer to each other in terms of cosine similarity.
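To make "taking a word as input and reconstructing its context" concrete, here is a minimal sketch of how skip-gram training pairs can be generated from a sentence. The function name `skipgram_pairs` and the `window` parameter are assumptions for this illustration; the neural network that is trained on these pairs is not shown.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

demo_tokens = "how to use a proxy".split()
demo_pairs = skipgram_pairs(demo_tokens, window=1)
for center, context in demo_pairs:
    print(center, "->", context)
```

Words that tend to appear with the same context words receive similar training signals, which is what pulls their embeddings together.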


**Skip-gram model**:

The most popular Word2Vec variant is the skip-gram model. The most commonly used pre-trained model is trained on part of the Google News dataset (about 100 billion words) and provides 300-dimensional embeddings for 3 million words and phrases.

# GloVe Embeddings

GloVe is an alternative approach for creating word embeddings, utilizing matrix factorization techniques on the word-word co-occurrence matrix.

Although both techniques are widely used, GloVe performs better on certain datasets, while the Word2Vec skip-gram model excels on others. In this study, we will experiment with both the Word2Vec and GloVe models.
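The word-word co-occurrence matrix that GloVe starts from can be sketched as follows. This only counts co-occurrences within a fixed window on toy data; the function name `cooccurrence_counts`, the `window` parameter, and the sentences are assumptions for this illustration, and the factorization step that GloVe performs on such counts is not shown.

```python
from collections import Counter

def cooccurrence_counts(tokenized_sentences, window=2):
    """Count how often each (word, neighbor) pair occurs within `window` positions."""
    counts = Counter()
    for tokens in tokenized_sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

# Hypothetical toy corpus
demo_corpus = [["use", "a", "proxy"], ["a", "proxy", "server"]]
demo_counts = cooccurrence_counts(demo_corpus, window=1)
print(demo_counts[("a", "proxy")])
print(demo_counts[("use", "proxy")])
```

Each entry of the resulting table is one cell of the co-occurrence matrix; pairs that never appear together simply have a count of zero.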


In [None]:
import gensim.downloader as api

# Load each model from a local cache if present; otherwise download and cache it.
try:
    glove_model = gensim.models.KeyedVectors.load("./glovemodel.mod")
    print("Loaded glove model")
except Exception:
    glove_model = api.load('glove-twitter-25')
    glove_model.save("./glovemodel.mod")
    print("Saved glove model")

try:
    v2w_model = gensim.models.KeyedVectors.load("./w2vecmodel.mod")
    print("Loaded w2v model")
except Exception:
    v2w_model = api.load('word2vec-google-news-300')
    v2w_model.save("./w2vecmodel.mod")
    print("Saved w2v model")

w2vec_embedding_size = len(v2w_model['computer'])
glove_embedding_size = len(glove_model['computer'])


Saved glove model
Saved w2v model


**Finding Phrase Embeddings from Word Embeddings**

The simplest way to turn word embeddings into a phrase embedding, applicable to both Word2Vec and GloVe, is to sum the embeddings of the individual words in the phrase. Words that are missing from the model's vocabulary contribute a zero vector.

The implementation of this method is shown below.


In [None]:
def getWordVec(word, model):
    # Return the embedding for `word`, or a zero vector if it is out of vocabulary.
    try:
        return model[word]
    except KeyError:
        return numpy.zeros(model.vector_size)


def getPhraseEmbedding(phrase, embeddingmodel):
    # Sum the word vectors of the phrase and return shape (1, dim),
    # the 2-D layout that cosine_similarity expects.
    vec = numpy.zeros(embeddingmodel.vector_size)
    for word in phrase.split():
        vec = vec + numpy.array(getWordVec(word, embeddingmodel))
    return vec.reshape(1, -1)
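Summing rather than averaging makes no difference for retrieval here, because cosine similarity ignores vector length. The sketch below checks this on made-up 3-dimensional vectors; `toy_vectors`, `phrase_vector`, and `cosine` are hypothetical names used only for this illustration, not part of gensim.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def phrase_vector(words, vectors, average=False):
    """Sum (or average) the word vectors; unknown words contribute zeros."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for word in words:
        for k, value in enumerate(vectors.get(word, [0.0] * dim)):
            total[k] += value
    if average and words:
        total = [value / len(words) for value in total]
    return total

# Made-up embeddings for illustration only
toy_vectors = {
    "use": [1.0, 0.0, 2.0],
    "proxy": [0.0, 3.0, 1.0],
    "server": [0.5, 2.5, 1.0],
}
query_words = ["how", "use", "proxy"]   # "how" is out of vocabulary here
candidate_words = ["proxy", "server"]

sim_sum = cosine(phrase_vector(query_words, toy_vectors),
                 phrase_vector(candidate_words, toy_vectors))
sim_mean = cosine(phrase_vector(query_words, toy_vectors, average=True),
                  phrase_vector(candidate_words, toy_vectors, average=True))
print(sim_sum, sim_mean)  # the two values agree up to floating-point error
```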


In [None]:
question_orig="How to use proxy?"
question=clean_sentence(question_orig,stopwords=False);
question_embedding = dictionary.doc2bow(question.split())

print("\n\n",question,"\n",question_embedding)



 how to use proxy 
 [(8, 1), (15, 1), (18, 1), (19, 1)]


In [None]:
# With Word2Vec

sent_embeddings = []
for sent in cleaned_sentences:
    sent_embeddings.append(getPhraseEmbedding(sent, v2w_model))

question_embedding = getPhraseEmbedding(question, v2w_model)

retrieveAndPrintFAQAnswer(question_embedding, sent_embeddings, df, cleaned_sentences)



Question:  how to use proxy


Retrieved:  What is a proxy, and how can I use one?
A proxy server is a system or router that provides a gateway between users and the internet. Therefore, it helps prevent cyber attackers from entering a private network. It is a server, referred to as an “intermediary” because it goes between end-users and the web pages they visit online.
 When a computer connects to the internet, it uses an IP address. This is similar to your home’s street address, telling incoming data where to go and marking outgoing data with a return address for other devices to authenticate. A proxy server is essentially a computer on the internet that has an IP address of its own.
 How a Proxy Works
Because a proxy server has its own IP address, it acts as a go-between for a computer and the internet. Your computer knows this address, and when you send a request on the internet, it is routed to the proxy, which then gets the response from the web server and forwards the data from

In [None]:
# With GloVe

sent_embeddings = []
for sent in cleaned_sentences:
    sent_embeddings.append(getPhraseEmbedding(sent, glove_model))

question_embedding = getPhraseEmbedding(question, glove_model)

retrieveAndPrintFAQAnswer(question_embedding, sent_embeddings, df, cleaned_sentences)




Question:  how to use proxy


Retrieved:  When should you not use serverless?
Serverless functions solve, overall, one conceptual problem: how do I make this computation be always-on and always available to be reached? This is the fundamental feature of serverless functions.
 If you don’t need that, serverless isn’t that useful and you’re probably better off with something else.



# BERT Embeddings

Instead of considering words individually, BERT, a transformer-based model, uses the context of words to generate embeddings. In 2018, BERT set several records in NLP tasks, marking a significant advancement in the field. BERT employs deep learning techniques to understand context in a bi-directional manner, utilizing information from the entire sentence through self-attention mechanisms.

For example, consider the search query, “2019 Japan tourist in Canada needs a visa.” The word "in" and its relationship to other words are crucial for understanding the query's meaning. It would be irrelevant to return information about Canadian citizens traveling to Japan since the query is about Japanese citizens traveling to Canada. BERT can handle such distinctions effectively.

Another notable example is BERT's ability to understand the impact of the word "no" in a query. For instance, with the query "Parking on a hill with no curb," it is not helpful to show results for parking on a hill with a curb, despite the semantic similarity. BERT can discern this subtle difference.

Unlike earlier models, such as bag-of-words or Word2Vec, BERT might not require the removal of stop words. This will be demonstrated in the exercise below.

In [3]:
!pip install transformers torch


In [4]:
from transformers import BertTokenizer, BertModel
import torch

# Initialize the tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

def get_embeddings(texts):
    # Tokenize the input texts
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

    # Get model outputs
    with torch.no_grad():
        outputs = model(**inputs)

    # Mean-pool the last hidden state over the token axis to get one vector per text
    # (padding positions are included in this average)
    return outputs.last_hidden_state.mean(dim=1)

# Example usage
sentences = ['better president', 'Hilary Clinton']
encoded_sentences = get_embeddings(sentences)
print("Encoded sentences:", encoded_sentences)


Encoded sentences: tensor([[ 0.0480,  0.0031, -0.3345,  ..., -0.1005,  0.3328, -0.1023],
        [ 0.0812, -0.0784, -0.4180,  ..., -0.0970, -0.0877, -0.1008]])
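Because `get_embeddings` pads the batch and then averages over every token position, padding positions also enter the mean. A common alternative is to average only the positions where the attention mask is 1. The sketch below shows that idea with plain Python lists standing in for tensors; `masked_mean_pool` and the toy values are hypothetical and are not part of the `transformers` API.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average only the token vectors whose mask entry is 1."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    kept = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            kept += 1
            for k in range(dim):
                total[k] += vector[k]
    return [value / kept for value in total] if kept else total

# Toy "hidden states" for one sentence: three positions, two dimensions.
# The last row plays the role of a padding position.
toy_hidden = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
toy_mask = [1, 1, 0]

plain_mean = [sum(column) / len(toy_hidden) for column in zip(*toy_hidden)]
masked_mean = masked_mean_pool(toy_hidden, toy_mask)
print(plain_mean)   # pulled toward the padding row
print(masked_mean)  # average of the two real tokens only
```

With real model outputs, the same weighting would be applied to `last_hidden_state` using the `attention_mask` returned by the tokenizer.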


BERT needs an older version of NumPy, so this will continue in another Jupyter notebook.