<div style="color:#722F37;">
In this notebook, <strong>word embeddings</strong> and <strong>sentence embeddings</strong> have been introduced. Word embeddings are an improvement over methods like <strong>Bag of Words (BoW)</strong> and <strong>TF-IDF</strong>, as they map words to vector spaces based on usage, though they still lack context sensitivity. 

To address this, <strong>contextual embeddings</strong> were developed with transformer models like <strong>BERT</strong> and <strong>RoBERTa</strong>. These models generate different embeddings for the same word based on its context, allowing for more accurate representations. This advancement has proven essential for tasks like <strong>question answering</strong> and <strong>word sense disambiguation</strong>, greatly enhancing natural language understanding.
</div>
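
<div style="color:#722F37;">
To make the limitation of <strong>Bag of Words</strong> concrete, here is a minimal sketch using only the Python standard library. The two example sentences and the helper name <code>bow_vector</code> are assumptions chosen for illustration. The sentences mean roughly the same thing but share no words, so their count vectors have no overlap.
</div>

```python
from collections import Counter

def bow_vector(sentence, vocab):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocab]

# Two made-up sentences with similar meaning but no shared words
docs = ["king rules", "monarch governs"]

# Vocabulary: every distinct word, in sorted order
vocab = sorted({word for doc in docs for word in doc.lower().split()})

v1 = bow_vector(docs[0], vocab)
v2 = bow_vector(docs[1], vocab)

# Dot product of the count vectors: zero means no shared words
overlap = sum(a * b for a, b in zip(v1, v2))

print(vocab)
print(v1, v2, overlap)
```

<div style="color:#722F37;">
Because BoW only counts surface forms, it has no way to express that "king" and "monarch" are related. Embeddings learned from usage are meant to close that gap.
</div>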

# Word embedding (Semantic embedding)

In [1]:
# Importing necessary libraries
import nltk  # Natural Language Toolkit (NLP library for working with text data)
from gensim.models import Word2Vec  # Gensim library for training Word2Vec model
from nltk.corpus import stopwords  # NLTK's stopwords list to filter out common words (like 'the', 'is')
import re  # Regular expression module for cleaning and processing text data


In [2]:
#This is a simple paragraph for the purpose of word embedding
paragraph='''King sits on throne. Queen sits on throne. King is man. Quenn is woman. king is powerful. queen is powerful. man have moustache '''

In [3]:
# Download stopwords from the NLTK corpus
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Ved\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [4]:
# Preprocessing the data
text = re.sub(r'\[[0-9]*\]',' ', paragraph)  # Remove references or citations in square brackets (e.g., [1], [2])
text = re.sub(r'\s+', ' ', text)             # Replace multiple spaces with a single space
text = text.lower()                          # Convert all text to lowercase to maintain consistency
text = re.sub(r'\d', ' ', text)              # Remove digits from the text (e.g., years or numbers)
text = re.sub(r'\s+', ' ', text)             # Again, replace multiple spaces with a single space after removing digits

# Preparing the dataset
sentences = nltk.sent_tokenize(text)         # Tokenize the text into sentences

sentences = [nltk.word_tokenize(sentence) for sentence in sentences]  # Tokenize each sentence into words

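# Remove stopwords from each sentence
stop_words = set(stopwords.words('english'))  # Build the stopword set once instead of reloading the list for every word
for i in range(len(sentences)):
    sentences[i] = [word for word in sentences[i] if word not in stop_words]  # Filter out common stopwords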


In [5]:
sentences

[['king', 'sits', 'throne', '.'],
 ['queen', 'sits', 'throne', '.'],
 ['king', 'man', '.'],
 ['quenn', 'woman', '.'],
 ['king', 'powerful', '.'],
 ['queen', 'powerful', '.'],
 ['man', 'moustache']]

In [6]:
# Training the Word2Vec model
model = Word2Vec(sentences, min_count=1, vector_size=10)

In [7]:
# Finding Word Vectors
vector = model.wv['king']
print(vector)

[ 0.07380505 -0.01533471 -0.04536613  0.06554051 -0.0486016  -0.01816018
  0.0287658   0.00991874 -0.08285215 -0.09448818]


In [8]:
# Most similar words
similar = model.wv.most_similar('king')
similar

[('.', 0.5436006188392639),
 ('powerful', 0.3293722867965698),
 ('moustache', 0.23243051767349243),
 ('woman', 0.035253241658210754),
 ('sits', -0.1799871027469635),
 ('man', -0.21132999658584595),
 ('throne', -0.38205233216285706),
 ('quenn', -0.5145737528800964),
 ('queen', -0.5381841063499451)]

<div style="color:#722F37;">
    Now, let's check another example, Word Sense Disambiguation, where the same word "bank" means two different things.
</div>

In [9]:
paragraph='''The bank of river is awesome. The bank of india is awesome. The bank of america is awesome.'''

In [10]:
# Preprocessing the data
text = re.sub(r'\[[0-9]*\]',' ',paragraph)
text = re.sub(r'\s+',' ',text)
text = text.lower()
text = re.sub(r'\d',' ',text)
text = re.sub(r'\s+',' ',text)

# Preparing the dataset
sentences = nltk.sent_tokenize(text)

sentences = [nltk.word_tokenize(sentence) for sentence in sentences]

stop_words = set(stopwords.words('english'))  # Build the stopword set once
for i in range(len(sentences)):
    sentences[i] = [word for word in sentences[i] if word not in stop_words]

In [11]:
sentences

[['bank', 'river', 'awesome', '.'],
 ['bank', 'india', 'awesome', '.'],
 ['bank', 'america', 'awesome', '.']]

In [12]:
set([item for sublist in sentences for item in sublist])

{'.', 'america', 'awesome', 'bank', 'india', 'river'}

In [13]:
len(set([item for sublist in sentences for item in sublist]))

6

In [14]:
# Training the Word2Vec model
model = Word2Vec(sentences, min_count=1, vector_size=10)

In [15]:
model.wv.index_to_key

['.', 'awesome', 'bank', 'america', 'india', 'river']

In [16]:
model.wv.key_to_index

{'.': 0, 'awesome': 1, 'bank': 2, 'america': 3, 'india': 4, 'river': 5}

In [17]:
len(model.wv.index_to_key)

6

In [18]:
# Finding Word Vectors
vector = model.wv['bank']
print(vector)

[ 0.07311766  0.05070262  0.06757693  0.00762866  0.06350891 -0.03405366
 -0.00946401  0.05768573 -0.07521638 -0.03936104]


<div style="color:#722F37;">
In the <strong>Word2Vec</strong> example above, each word gets a single vector regardless of context. For example, <strong>bank</strong> (financial institution) and <strong>bank</strong> (riverside) share the same vector representation, so the model cannot tell the two senses apart.
</div>
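
<div style="color:#722F37;">
The point can be sketched as a plain lookup table. This is only an illustration: <code>static_vectors</code> and <code>embed</code> are hypothetical names, and the numbers are made up rather than taken from the trained model above.
</div>

```python
# Hypothetical lookup table standing in for a trained static model such as Word2Vec.
# The numbers are invented for illustration only.
static_vectors = {
    "bank": [0.07, 0.05, 0.06],
    "river": [-0.03, 0.08, 0.01],
    "india": [0.04, -0.02, 0.09],
}

def embed(words):
    """Look up each word independently; neighbouring words play no role."""
    return [static_vectors[word] for word in words]

river_context = embed(["bank", "river"])
india_context = embed(["bank", "india"])

# The vector for "bank" is the same object in the table, whatever follows it
same_vector = river_context[0] == india_context[0]
print(same_vector)
```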


# Sentence Embedding (Contextual embedding)

#### BERT

<div style="color:#722F37;">
    BERT, or Bidirectional Encoder Representations from Transformers, is a state-of-the-art language representation model developed by Google. It utilizes a bidirectional approach, allowing it to consider the context of a word based on both its left and right surroundings. This enables more nuanced understanding and improved performance in various natural language tasks.
    </div>

In [19]:
from transformers import BertModel, BertTokenizer
import torch
import torch.nn.functional as F

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

def get_sentence_embedding(sentence):
    # Tokenize the sentence
    inputs = tokenizer(sentence, return_tensors='pt')
    
    # Pass through the model to get the embeddings
    with torch.no_grad():
        outputs = model(**inputs)
    
    # Extract the embeddings for the [CLS] token, which is used to represent the sentence
    sentence_embedding = outputs.last_hidden_state[:, 0, :].squeeze()
    
    # Extract the token-level embeddings for every position (this includes [CLS] and [SEP])
    word_embeddings = outputs.last_hidden_state.squeeze()  # Shape: (seq_length, hidden_size)
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])  # Get tokenized words
    
    return sentence_embedding, word_embeddings, tokens



In [20]:
# Function to calculate cosine similarity between two embeddings
def cosine_similarity(embedding1, embedding2):
    return F.cosine_similarity(embedding1.unsqueeze(0), embedding2.unsqueeze(0)).item()
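
<div style="color:#722F37;">
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below spells that out in plain Python; the function name <code>cosine_plain</code> is an assumption used only here and is not part of the notebook's pipeline.
</div>

```python
import math

def cosine_plain(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_plain([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine_plain([1.0, 0.0], [0.0, 1.0])       # perpendicular vectors
opposite = cosine_plain([1.0, 0.0], [-1.0, 0.0])        # opposite vectors

print(same_direction, orthogonal, opposite)
```

<div style="color:#722F37;">
The score depends only on direction, not on vector length, which is why it is a common choice for comparing embeddings.
</div>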

In [21]:
sentence1 = "The bank of river is awesome."
sentence2 = "The bank of india is awesome."
sentence3 = "The bank of america is awesome"

sentence_embedding1, word_embeddings1, tokens1  = get_sentence_embedding(sentence1)
sentence_embedding2, word_embeddings2, tokens2 = get_sentence_embedding(sentence2)
sentence_embedding3, word_embeddings3, tokens3 = get_sentence_embedding(sentence3)

#similarity between sentence 1 and sentence 2
print(cosine_similarity(sentence_embedding1, sentence_embedding2))

#similarity between sentence 1 and sentence 3
print(cosine_similarity(sentence_embedding1, sentence_embedding3))

0.9108641743659973
0.8279022574424744


In [22]:
print(sentence_embedding1.shape)
print(word_embeddings1.shape)
print(len(tokens1))
print(tokens1)

torch.Size([768])
torch.Size([9, 768])
9
['[CLS]', 'the', 'bank', 'of', 'river', 'is', 'awesome', '.', '[SEP]']


In [23]:
print(sentence_embedding1)

tensor([ 1.5993e-01, -2.8346e-01, -1.2725e-01,  6.7331e-02, -3.4368e-01,
        -6.3199e-01,  2.2716e-01,  7.7732e-01,  2.0396e-01, -7.7004e-01,
         2.6004e-01, -1.9773e-01,  2.0294e-01,  4.7289e-01,  1.6413e-01,
         9.9682e-02,  2.1389e-03,  4.4647e-01,  6.2649e-01, -2.8840e-01,
         1.9626e-01, -4.6361e-01, -3.3375e-02,  4.2548e-02,  2.0130e-01,
        -2.1451e-01,  6.9227e-02,  2.7582e-01, -1.6637e-02,  2.0860e-01,
         1.1602e-01,  4.4463e-01, -3.7375e-01, -4.0462e-01,  2.5003e-01,
        -3.0714e-01,  2.8115e-01,  4.1827e-02, -1.5454e-01, -1.8375e-01,
        -1.2003e-01,  4.3141e-02,  1.7394e-01,  5.4538e-02,  1.9144e-01,
        -5.0717e-01, -3.3121e+00,  3.5465e-01, -1.5636e-01, -2.8077e-01,
         3.7732e-01, -3.0141e-01, -1.6014e-01,  3.7137e-01,  1.6896e-01,
         5.8219e-01, -6.8167e-01,  6.3972e-01,  3.3573e-01,  9.4391e-02,
         2.4286e-01, -1.0007e-01,  7.1360e-02,  1.2184e-01,  7.9204e-03,
         8.5833e-02,  4.3037e-02,  4.2669e-02,  1.2

<div style="color:#722F37;">
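It can be seen above that sentence 1 (river) and sentence 2 (India) are very similar at the sentence level, with a cosine similarity of about 0.91, even though "bank" means something different in each. Sentence 1 and sentence 3 (America) score about 0.83. Now, let's look at the embedding of "bank" in sentence 1 and sentence 2.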
    </div>

In [24]:
print(tokens1)
print(tokens1[2])
print("---------")
print(tokens2)
print(tokens2[2])

print("---------")
print(cosine_similarity(word_embeddings1[2], word_embeddings2[2]))

['[CLS]', 'the', 'bank', 'of', 'river', 'is', 'awesome', '.', '[SEP]']
bank
---------
['[CLS]', 'the', 'bank', 'of', 'india', 'is', 'awesome', '.', '[SEP]']
bank
---------
0.556409478187561


<div style="color:#722F37;">
    So, although sentence 1 and sentence 2 are similar at the sentence level, that similarity does not come from the word "bank": the two "bank" embeddings have a cosine similarity of only about 0.56.
    </div>

In [25]:
sentence5 = "the bank of india and bank of baroda situated at the bank of ganges is awesome"
sentence_embedding5, word_embeddings5, tokens5 = get_sentence_embedding(sentence5)

print(tokens5)
print(tokens5[2])
print(tokens5[6])
print(tokens5[13])
print(cosine_similarity(word_embeddings5[2], word_embeddings5[6]))
print(cosine_similarity(word_embeddings5[2], word_embeddings5[13]))
print(cosine_similarity(word_embeddings5[6], word_embeddings5[13]))

['[CLS]', 'the', 'bank', 'of', 'india', 'and', 'bank', 'of', 'bar', '##oda', 'situated', 'at', 'the', 'bank', 'of', 'gang', '##es', 'is', 'awesome', '[SEP]']
bank
bank
bank
0.9170936942100525
0.6640843749046326
0.6949294209480286


<div style="color:#722F37;"> The sentence above contains the word "bank" three times: the first two refer to a financial institution and the third to a riverside. Hence, the similarity score between the first two is high (about 0.92), while the scores between the first and third (about 0.66) and the second and third (about 0.69) are lower. </div>
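
<div style="color:#722F37;"> Hard-coded indices such as 2, 6 and 13 are easy to get wrong once the tokenizer splits words into sub-word pieces. As a small sketch, the positions can be looked up from the token list instead. The helper name <code>find_token_positions</code> is hypothetical, and the token list is copied from the output above. </div>

```python
from itertools import combinations

# Token list copied from the BERT output above
tokens = ['[CLS]', 'the', 'bank', 'of', 'india', 'and', 'bank', 'of', 'bar', '##oda',
          'situated', 'at', 'the', 'bank', 'of', 'gang', '##es', 'is', 'awesome', '[SEP]']

def find_token_positions(tokens, target):
    """Return every index at which `target` appears in the token list."""
    return [i for i, tok in enumerate(tokens) if tok == target]

bank_positions = find_token_positions(tokens, 'bank')

# All index pairs that could be passed to a similarity function
bank_pairs = list(combinations(bank_positions, 2))

print(bank_positions)
print(bank_pairs)
```

<div style="color:#722F37;"> Each pair could then be used to index the token embeddings, for example <code>word_embeddings5[i]</code> and <code>word_embeddings5[j]</code>. </div>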

#### RoBERTa 

<div style="color:#722F37;"> RoBERTa, short for Robustly Optimized BERT Pretraining Approach, is a variant of BERT developed by Facebook AI. It revises BERT's training methodology by using larger batches and longer sequences and by removing the next sentence prediction objective. These changes improve performance across various natural language processing tasks. </div>


In [26]:
from transformers import RobertaModel, RobertaTokenizer
import torch
import torch.nn.functional as F

# Load pre-trained RoBERTa model and tokenizer
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')

def get_embeddings(sentence):
    # Tokenize the sentence
    inputs = tokenizer(sentence, return_tensors='pt')
    
    # Pass through the model to get the embeddings
    with torch.no_grad():
        outputs = model(**inputs)
    
    # Extract the sentence embedding from the <s> token (RoBERTa's equivalent of [CLS])
    sentence_embedding = outputs.last_hidden_state[:, 0, :].squeeze()

    # Extract the token-level embeddings for every position (this includes <s> and </s>)
    word_embeddings = outputs.last_hidden_state.squeeze()  # Shape: (seq_length, hidden_size)
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])  # Get tokenized words
    
    return sentence_embedding, word_embeddings, tokens

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [27]:
sentence1 = "The bank of river is awesome."
sentence2 = "The bank of india is awesome."
sentence3 = "The bank of america awesome"

sentence_embedding1, word_embeddings1, tokens1  = get_embeddings(sentence1)
sentence_embedding2, word_embeddings2, tokens2 = get_embeddings(sentence2)
sentence_embedding3, word_embeddings3, tokens3 = get_embeddings(sentence3)

#similarity between sentence 1 and sentence 2
print(cosine_similarity(sentence_embedding1, sentence_embedding2))

#similarity between sentence 1 and sentence 3
print(cosine_similarity(sentence_embedding1, sentence_embedding3))

0.9983242750167847
0.9942834377288818


In [28]:
print(sentence_embedding1.shape)
print(word_embeddings1.shape)
print(len(tokens1))
print(tokens1)

torch.Size([768])
torch.Size([9, 768])
9
['<s>', 'The', 'Ġbank', 'Ġof', 'Ġriver', 'Ġis', 'Ġawesome', '.', '</s>']


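<div style="color:#722F37;"> The Ġ character is not a token of its own. RoBERTa's byte-level BPE tokenizer uses it to encode the space that precedes a token, so 'Ġbank' means "bank" preceded by a space, while a piece without Ġ (such as 'ia') continues the previous token. This keeps word boundaries recoverable from the token list. </div>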
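
<div style="color:#722F37;"> As a rough sketch of how the Ġ marker can be used, the helpers below rebuild the text and the word groups from a token list. The names <code>detokenize</code> and <code>group_into_words</code> are hypothetical, and the sketch assumes plain ASCII text; in practice the tokenizer's own decoding should be preferred. </div>

```python
SPACE_MARKER = "Ġ"  # marks a token that was preceded by a space

def detokenize(tokens, special=("<s>", "</s>")):
    """Join the pieces and turn each space marker back into a space."""
    pieces = [tok for tok in tokens if tok not in special]
    return "".join(pieces).replace(SPACE_MARKER, " ")

def group_into_words(tokens, special=("<s>", "</s>")):
    """Merge sub-word pieces: a piece without the marker continues the previous word."""
    words = []
    for tok in tokens:
        if tok in special:
            continue
        if tok.startswith(SPACE_MARKER) or not words:
            words.append(tok.replace(SPACE_MARKER, "", 1))
        else:
            words[-1] += tok
    return words

# Token list in the format RoBERTa produces for "The bank of india is awesome."
tokens = ['<s>', 'The', 'Ġbank', 'Ġof', 'Ġind', 'ia', 'Ġis', 'Ġawesome', '.', '</s>']

text = detokenize(tokens)
words = group_into_words(tokens)
print(text)
print(words)
```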

In [29]:
print(sentence_embedding1)

tensor([-1.1294e-01,  1.2454e-01, -1.5777e-02, -1.2765e-01,  9.3037e-02,
        -2.5061e-02, -4.4133e-02,  1.7352e-02,  5.5110e-02, -9.8595e-02,
        -3.2031e-02,  2.1407e-02,  7.1863e-02, -3.3101e-02,  3.7476e-02,
        -2.0152e-03, -1.2897e-02,  5.1805e-03, -8.5620e-03, -6.4059e-02,
        -8.9887e-02,  2.0873e-02,  5.9324e-02,  1.2222e-01, -3.1684e-02,
         7.9940e-02,  1.3939e-01,  6.8650e-02, -7.5467e-02,  3.2261e-02,
        -4.9740e-02, -1.0487e-01,  5.5901e-02,  1.6350e-02,  4.5014e-03,
         1.1428e-01,  1.4187e-02, -2.6811e-02, -3.9403e-02,  5.5864e-02,
        -2.4190e-02,  1.6140e-01,  4.4250e-02, -2.8860e-02,  7.3279e-02,
         2.6238e-02,  2.4181e-03, -1.8431e-02, -3.3616e-02, -2.6945e-02,
        -9.5245e-03,  7.1390e-02, -5.8574e-02,  2.2873e-02, -1.3999e-01,
         4.9996e-02,  3.6759e-02,  8.3483e-02,  4.5792e-02, -1.2199e-01,
        -2.5181e-02, -1.8193e-01, -8.8339e-02, -7.0444e-02,  6.5770e-02,
        -7.0507e-02, -1.3377e-02,  2.4619e-02,  2.0

In [30]:
print(tokens1)
print(tokens1[2])
print("---------")
print(tokens2)
print(tokens2[2])

print("---------")
print(cosine_similarity(word_embeddings1[2], word_embeddings2[2]))

['<s>', 'The', 'Ġbank', 'Ġof', 'Ġriver', 'Ġis', 'Ġawesome', '.', '</s>']
Ġbank
---------
['<s>', 'The', 'Ġbank', 'Ġof', 'Ġind', 'ia', 'Ġis', 'Ġawesome', '.', '</s>']
Ġbank
---------
0.9483951926231384


In [31]:
sentence5 = "the bank of india and bank of baroda situated at the bank of ganges is awesome"
sentence_embedding5, word_embeddings5, tokens5 = get_embeddings(sentence5)  # Use the RoBERTa helper defined above

print(tokens5)
print(tokens5[2])   # first 'Ġbank'
print(tokens5[7])   # second 'Ġbank'
print(tokens5[14])  # third 'Ġbank'
print(cosine_similarity(word_embeddings5[2], word_embeddings5[7]))
print(cosine_similarity(word_embeddings5[2], word_embeddings5[14]))
print(cosine_similarity(word_embeddings5[7], word_embeddings5[14]))

['<s>', 'the', 'Ġbank', 'Ġof', 'Ġind', 'ia', 'Ġand', 'Ġbank', 'Ġof', 'Ġbar', 'oda', 'Ġsituated', 'Ġat', 'Ġthe', 'Ġbank', 'Ġof', 'Ġg', 'anges', 'Ġis', 'Ġawesome', '</s>']
Ġbank
Ġbank
Ġbank
0.9874536395072937
0.9697230458259583
0.9586929082870483


####  GPT-2 

<div style="color:#722F37;"> GPT-2, or Generative Pre-trained Transformer 2, is a language model developed by OpenAI. It employs an autoregressive approach, predicting the next word in a sentence based on preceding context. With its extensive training on diverse internet text, GPT-2 excels in generating coherent and contextually relevant text across various applications. </div>

In [32]:
from transformers import GPT2Model, GPT2Tokenizer
import torch

# Load pre-trained GPT-2 model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')

def get_embeddings(sentence):
    # Tokenize the sentence
    inputs = tokenizer(sentence, return_tensors='pt')
    
    # Pass through the model to get the embeddings
    with torch.no_grad():
        outputs = model(**inputs)
    
    # GPT-2 doesn't have a [CLS] token, so we average the word embeddings to get the sentence embedding
    word_embeddings = outputs.last_hidden_state.squeeze()  # Shape: (seq_length, hidden_size)
    
    # Sentence embedding can be derived by averaging word embeddings
    sentence_embedding = word_embeddings.mean(dim=0)
    
    # Get the tokens (GPT-2 uses byte-pair encoding, so tokens may differ)
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
    
    return sentence_embedding, word_embeddings, tokens

In [33]:
sentence1 = "The bank of river is awesome."
sentence2 = "The bank of india is awesome."
sentence3 = "The bank of america awesome"

sentence_embedding1, word_embeddings1, tokens1  = get_embeddings(sentence1)
sentence_embedding2, word_embeddings2, tokens2 = get_embeddings(sentence2)
sentence_embedding3, word_embeddings3, tokens3 = get_embeddings(sentence3)

#similarity between sentence 1 and sentence 2
print(cosine_similarity(sentence_embedding1, sentence_embedding2))

#similarity between sentence 1 and sentence 3
print(cosine_similarity(sentence_embedding1, sentence_embedding3))

0.9995173811912537
0.9984961152076721


In [34]:
print(sentence_embedding1.shape)
print(word_embeddings1.shape)
print(len(tokens1))
print(tokens1)

torch.Size([768])
torch.Size([7, 768])
7
['The', 'Ġbank', 'Ġof', 'Ġriver', 'Ġis', 'Ġawesome', '.']


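<div style="color:#722F37;"> GPT-2 uses the same byte-level BPE convention: a leading Ġ encodes the space before a token, so word boundaries stay recoverable from the token list. Unlike BERT and RoBERTa, GPT-2 adds no special start or end tokens here, so the sequence has 7 tokens instead of 9.</div>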

In [35]:
print(sentence_embedding1)

tensor([ 2.2720e-02,  1.2177e-02, -6.6039e-01,  1.3097e-01,  2.0369e-01,
        -2.0988e-02,  4.3167e+00,  2.6947e-01, -2.6785e-02, -1.7381e-01,
        -5.9662e-02, -4.0658e-02,  3.5592e-02,  4.6588e-02,  4.3792e-02,
        -2.7996e-01, -1.7147e-01, -6.0552e-02, -3.0818e-01, -6.7432e-01,
         1.2491e-01,  9.0983e-02, -2.2381e-01,  3.9666e-02,  8.6331e-02,
         2.8157e-01, -5.2173e-02, -1.2894e-01, -2.4060e-01, -3.8236e-02,
        -6.3042e-02, -1.0419e-01,  7.8191e-02, -3.7931e-02, -1.3611e-01,
         4.3478e-01,  6.2503e+01,  2.4032e-01, -2.8039e-01,  2.9788e-01,
         2.5196e-01,  2.3417e-01,  1.1968e-01, -5.2832e-02, -7.9778e-02,
        -5.0708e-02, -9.2995e-02, -3.8076e-01, -1.6937e-01,  1.4483e-01,
        -4.3260e-02,  1.4739e-01,  3.1843e-01,  7.7430e-03, -2.4752e-01,
         6.5987e-01,  6.1598e-01, -1.3300e-01,  1.0418e-01, -2.5831e-01,
        -5.8405e-02, -1.7147e-01,  3.9393e-02,  2.6453e-01, -1.0963e+00,
         8.3859e-02, -1.1631e-01, -1.0500e-01, -5.1

In [36]:
print(tokens1)
print(tokens1[1])
print("---------")
print(tokens2)
print(tokens2[1])

print("---------")
# Index 2 is 'Ġof' in both token lists; 'Ġbank' (printed above) is at index 1
print(cosine_similarity(word_embeddings1[2], word_embeddings2[2]))

['The', 'Ġbank', 'Ġof', 'Ġriver', 'Ġis', 'Ġawesome', '.']
Ġbank
---------
['The', 'Ġbank', 'Ġof', 'Ġind', 'ia', 'Ġis', 'Ġawesome', '.']
Ġbank
---------
1.0000001192092896


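<div style="color:#722F37;"> Both sentences begin with the same tokens ('The', 'Ġbank', 'Ġof'), and GPT-2's causal attention lets each position see only the tokens before it. The hidden states at these shared positions are therefore identical in the two sentences, so their cosine similarity is exactly 1 in theory. The reported value is slightly greater than 1 (1.0000001192092896) because of floating-point rounding in the calculation, which is common in numerical computations. </div>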
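
<div style="color:#722F37;"> If such a score is passed on to a function that requires values in [-1, 1], it may help to clamp it first. The sketch below is one simple way to do that; <code>clamp_cosine</code> is a hypothetical helper name, and the raw value is the one printed above. </div>

```python
import math

def clamp_cosine(value):
    """Force a cosine score back into the valid range [-1, 1]."""
    return max(-1.0, min(1.0, value))

raw = 1.0000001192092896  # value printed above, slightly outside the valid range

# math.acos rejects inputs outside [-1, 1]
try:
    math.acos(raw)
    acos_failed = False
except ValueError:
    acos_failed = True

# After clamping, the angle between the vectors is well defined
angle = math.acos(clamp_cosine(raw))
print(acos_failed, angle)
```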

In [37]:
sentence5 = "the bank of india and bank of baroda situated at the bank of ganges is awesome"
sentence_embedding5, word_embeddings5, tokens5 = get_embeddings(sentence5)  # Use the GPT-2 helper defined above

print(tokens5)
print(tokens5[1])
print(tokens5[6])
print(tokens5[13])
# 'Ġbank' sits at indices 1, 6 and 13 (printed above); indices 2, 7 and 14 compared below
# are the 'Ġof' tokens that follow each 'Ġbank'
print(cosine_similarity(word_embeddings5[2], word_embeddings5[7]))
print(cosine_similarity(word_embeddings5[2], word_embeddings5[14]))
print(cosine_similarity(word_embeddings5[7], word_embeddings5[14]))

['the', 'Ġbank', 'Ġof', 'Ġind', 'ia', 'Ġand', 'Ġbank', 'Ġof', 'Ġbar', 'oda', 'Ġsituated', 'Ġat', 'Ġthe', 'Ġbank', 'Ġof', 'Ġg', 'anges', 'Ġis', 'Ġawesome']
Ġbank
Ġbank
Ġbank
0.9868789911270142
0.9937019348144531
0.996721088886261


<div style="color:#722F37;">
In the experiments above, <strong>BERT</strong> separates the two senses of "bank" more clearly than <strong>RoBERTa</strong> and <strong>GPT-2</strong>, which matters for tasks involving <strong>word sense disambiguation</strong> and <strong>nuanced language understanding</strong>. A few factors may explain this. First, BERT is a <strong>bidirectional transformer</strong> that looks at both the left and right context of each word, so the embedding of a <strong>polysemous word</strong> (e.g., "bank" meaning a financial institution vs. a riverbank) reflects all surrounding words. In contrast, <strong>GPT-2</strong> is an <strong>autoregressive model</strong> whose representation of a token depends only on the preceding tokens. In "The bank of river" and "The bank of india", the disambiguating word comes after "bank", so GPT-2 cannot use it when encoding "bank".

<strong>RoBERTa</strong> is bidirectional like BERT and is trained with larger batches and longer sequences, but it drops BERT's <strong>next sentence prediction (NSP)</strong> objective. NSP may help BERT model relationships between sentences, which could give it an edge in some nuanced language tasks. Cosine similarities on a handful of sentences are not enough to confirm that, so these results should be read as an illustration rather than a benchmark.
</div>
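
<div style="color:#722F37;">
The difference between autoregressive and bidirectional context can be sketched with a toy example. This is not how attention works: plain averaging stands in for the model, the 2-d vectors are made up, and the function names <code>causal_states</code> and <code>bidirectional_states</code> are hypothetical. The only thing it mirrors is which positions each token is allowed to see.
</div>

```python
# Hypothetical 2-d input vectors, hand-picked for illustration only
static = {
    "the": [0.1, 0.1],
    "bank": [0.5, 0.5],
    "of": [0.2, 0.0],
    "river": [0.0, 1.0],
    "india": [1.0, 0.0],
}

def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def causal_states(words):
    """Toy autoregressive view: position i sees only positions 0..i."""
    vecs = [static[w] for w in words]
    return [mean(vecs[: i + 1]) for i in range(len(vecs))]

def bidirectional_states(words):
    """Toy bidirectional view: every position is mixed with the whole sentence."""
    vecs = [static[w] for w in words]
    full = mean(vecs)
    return [[0.5 * a + 0.5 * b for a, b in zip(v, full)] for v in vecs]

s1 = "the bank of river".split()
s2 = "the bank of india".split()

# "bank" is at index 1 in both sentences
causal_same = causal_states(s1)[1] == causal_states(s2)[1]
bidirectional_same = bidirectional_states(s1)[1] == bidirectional_states(s2)[1]

print(causal_same, bidirectional_same)
```

<div style="color:#722F37;">
In the causal view, "bank" gets the same state in both sentences because "river" and "india" come later. In the bidirectional view, the two states differ.
</div>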
