
In [1]:
import warnings
warnings.filterwarnings('ignore')

# Word Embeddings
## Word2Vec

In [2]:
# First, you'll need to install gensim
# !pip install gensim

# Import the necessary modules

from gensim.test.utils import common_texts

from gensim.models import Word2Vec

In [None]:
print(common_texts) #Sample Data

[['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ['graph', 'trees'], ['graph', 'minors', 'trees'], ['graph', 'minors', 'survey']]


Word2Vec accepts several parameters that affect both training speed and quality.

One of them prunes the internal dictionary. Words that appear only once or twice in a billion-word corpus are probably typos or noise. In addition, there is too little data to learn meaningful vectors for those words, so it is best to ignore them:

`model = Word2Vec(sentences, min_count=10)  # default value is 5`

A reasonable value for `min_count` is between 0 and 100, depending on the size of your dataset.
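To make the effect of `min_count` concrete, here is a minimal pure-Python sketch of vocabulary pruning. The toy corpus and the helper `prune_vocab` are hypothetical, written only for this illustration; they are not part of Gensim, which performs this step internally when it builds the vocabulary.

```python
from collections import Counter

# Toy tokenized corpus (hypothetical data, for illustration only).
corpus = [
    ["graph", "trees", "graph"],
    ["graph", "minors", "trees"],
    ["survey", "graph"],
]

def prune_vocab(sentences, min_count):
    """Return the set of words that occur at least `min_count` times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}

# "minors" and "survey" occur only once, so min_count=2 drops them.
vocab = prune_vocab(corpus, min_count=2)
print(sorted(vocab))  # ['graph', 'trees']
```

With `min_count=1` every word is kept, which is why the tiny examples in this notebook use that setting.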

Another parameter is `vector_size`, the size of the neural-network layers (and therefore of each word vector), which corresponds to the “degrees of freedom” the training algorithm has:

`model = Word2Vec(sentences, vector_size=200)  # default value is 100`

Bigger size values require more training data, but can lead to better (more accurate) models. Reasonable values are in the tens to hundreds.

Other hyper-parameters:

*   window: Maximum distance between the target word and the context words considered around it.

*   sample: The threshold for configuring which higher-frequency words are randomly downsampled; the useful range is (0, 1e-5).

*   workers: Use this many worker threads to train the model (faster training on multicore machines).

*   sg: Training algorithm: skip-gram if sg=1, otherwise CBOW.

*   epochs: Number of iterations (epochs) over the corpus. This parameter was called `iter` before Gensim 4.0.
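The role of `window` can be illustrated with a simplified sketch that lists the (target, context) pairs a skip-gram model would learn from. The helper `context_pairs` is hypothetical and written for this example; Gensim's actual training loop differs in its details, so treat this as an illustration of the idea only.

```python
def context_pairs(tokens, window):
    """Yield (target, context) pairs for words at most `window` positions apart."""
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                yield target, tokens[j]

sentence = ["survey", "user", "computer", "system"]
pairs = list(context_pairs(sentence, window=1))
for target, context in pairs:
    print(target, "->", context)
```

A larger `window` produces more pairs per target word, so the model sees broader (more topical) context at the cost of more computation.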


In [3]:
model = Word2Vec(sentences=common_texts, vector_size=10, window=5, min_count=1, workers=4)
#Here, vector_size = 10 denotes the length of embedding
model.save("word2vec.model")

If you save the full model, you can load it again and continue training it later:

In [4]:
# load the saved model
model = Word2Vec.load("word2vec.model")

The trained word vectors are stored in a KeyedVectors instance, as model.wv:

In [5]:
# Get the embeddings for the word 'human'
embedding = model.wv['human']

print(embedding)
print(len(embedding))

[-0.00410223 -0.08368949 -0.05600012  0.07104538  0.0335254   0.0722567
  0.06800248  0.07530741 -0.03789154 -0.00561806]
10


In [6]:
# Get the most similar words (having the most similar embeddings)
similar_words = model.wv.most_similar('human', topn=3)  # topn=3 returns the three most similar words
print(similar_words)

[('graph', 0.3586882948875427), ('system', 0.22743132710456848), ('response', 0.11532396823167801)]
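Conceptually, this kind of lookup ranks every other word by the cosine similarity of its vector to the query vector. The NumPy sketch below shows the idea on a made-up vocabulary with hypothetical 3-dimensional vectors; the function `most_similar` here is a stand-in written for this example, not Gensim's implementation.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, invented for this example.
vocab = ["human", "user", "graph", "trees"]
vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.2, 1.0, 0.0],
    [-0.5, 0.5, 0.7],
])

def most_similar(word, topn=3):
    """Rank every other word by cosine similarity to `word`."""
    # Normalize rows so that a dot product equals the cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[vocab.index(word)]
    order = [i for i in np.argsort(-sims) if vocab[i] != word]
    return [(vocab[i], float(sims[i])) for i in order[:topn]]

print(most_similar("human", topn=2))
```

Here "user" ranks first because its vector points in almost the same direction as the vector for "human".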


In [7]:
# Store just the words + their trained embeddings.
word_vectors = model.wv
word_vectors.save("word2vec.wordvectors")

In [8]:
# Load back with memory-mapping = read-only, shared across processes.
from gensim.models import KeyedVectors
wv = KeyedVectors.load("word2vec.wordvectors", mmap='r')
wv['computer']  # Get numpy vector embedding for 'computer'

array([ 0.01632109,  0.00189991,  0.03474986,  0.00217862,  0.09622561,
        0.05062568, -0.08920852, -0.07044294,  0.00901806,  0.06395016],
      dtype=float32)

### Refer to the link below for more details:
https://radimrehurek.com/gensim/models/word2vec.html

## Pre-trained Models
Gensim comes with several pre-trained models, available through the gensim-data repository.

In [9]:
import gensim.downloader
# Show all available models in gensim-data
print(list(gensim.downloader.info()['models'].keys()))

['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']


In [10]:
# Download the "glove-twitter-25" embeddings
# Pre-trained glove vectors based on 2B tweets, 27B tokens, 1.2M vocab, uncased.
glove_vectors = gensim.downloader.load('glove-twitter-25')
glove_vectors



<gensim.models.keyedvectors.KeyedVectors at 0x7ee5798188b0>

In [11]:
# Use the downloaded vectors as usual:
glove_vectors.most_similar('twitter')

[('facebook', 0.948005199432373),
 ('tweet', 0.9403423070907593),
 ('fb', 0.9342358708381653),
 ('instagram', 0.9104824066162109),
 ('chat', 0.8964964747428894),
 ('hashtag', 0.8885937333106995),
 ('tweets', 0.8878158330917358),
 ('tl', 0.8778461217880249),
 ('link', 0.8778210878372192),
 ('internet', 0.8753897547721863)]

# Document/Sentence Embeddings
Paragraph, Sentence, and Document embeddings

## Doc2vec

In [12]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Define your sentences (example)
sentences = ["this is the first sentence", "this is the second sentence", "yet another sentence", "one more sentence", "and the final sentence"]

# Tag the sentences for training
tagged_data = [TaggedDocument(words=sentence.split(), tags=[str(i)]) for i, sentence in enumerate(sentences)]

# Train the model
model = Doc2Vec(tagged_data, vector_size=10, window=2, min_count=1, workers=4)

# Get the embeddings for the sentences
# infer_vector expects a list of tokens, e.g. from sentence.split() or nltk.word_tokenize()
sentence_vectors = [model.infer_vector(sentence.split()) for sentence in sentences]

print("Sentence Embeddings:")
print(sentence_vectors) #Embeddings of the sentences

import numpy as np
print("\nShape:")
print(np.array(sentence_vectors).shape)

Sentence Embeddings:
[array([-0.02651018,  0.00345225,  0.02950032, -0.00521099,  0.02622059,
       -0.04415315,  0.03949709,  0.0347003 , -0.0355242 ,  0.0355443 ],
      dtype=float32), array([-0.00355846, -0.01089069,  0.01064902, -0.03510013,  0.0489391 ,
        0.00149197, -0.00687361,  0.02762178,  0.02065086, -0.01745699],
      dtype=float32), array([-0.03286843,  0.02593765,  0.01603379, -0.04145756, -0.03280213,
        0.01090155,  0.03253587, -0.03788246,  0.02422524,  0.03906832],
      dtype=float32), array([-0.00464409, -0.00556819,  0.02142883, -0.00206426, -0.02979414,
        0.03101977,  0.04085322, -0.01811217,  0.04930789, -0.01411656],
      dtype=float32), array([-0.03791236, -0.02003873, -0.01327085,  0.00074965, -0.04746118,
       -0.04426737, -0.01107387,  0.00713651, -0.0143    , -0.0122711 ],
      dtype=float32)]

Shape:
(5, 10)


In [13]:
print(sentence_vectors[0]) #the first embedding

[-0.02651018  0.00345225  0.02950032 -0.00521099  0.02622059 -0.04415315
  0.03949709  0.0347003  -0.0355242   0.0355443 ]


In [16]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(sentence_vectors[1].reshape(1,-1),sentence_vectors[2].reshape(1,-1))[0][0]
#Cosine similarity between embeddings

-0.2199869
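Cosine similarity is the dot product of two vectors divided by the product of their lengths. As a sanity check, the sketch below recomputes the value with plain NumPy, using the second and third sentence vectors copied from the printed output above (so they are rounded to eight decimals).

```python
import numpy as np

# sentence_vectors[1] and sentence_vectors[2], copied from the output above.
u = np.array([-0.00355846, -0.01089069, 0.01064902, -0.03510013, 0.0489391,
              0.00149197, -0.00687361, 0.02762178, 0.02065086, -0.01745699])
v = np.array([-0.03286843, 0.02593765, 0.01603379, -0.04145756, -0.03280213,
              0.01090155, 0.03253587, -0.03788246, 0.02422524, 0.03906832])

# cos(u, v) = (u . v) / (|u| * |v|)
cos = float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
print(round(cos, 4))  # approximately -0.22, matching the sklearn result
```

The result lies in [-1, 1]: values near 1 mean the vectors point the same way, values near 0 mean they are unrelated, and negative values mean they point in opposing directions.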

In [17]:
# Find the similarity between all the sentences
similarity = cosine_similarity(sentence_vectors)
similarity

array([[ 1.        ,  0.15361078,  0.08465073, -0.32718945,  0.14108971],
       [ 0.15361078,  0.99999994, -0.2199869 , -0.0894393 , -0.33247685],
       [ 0.08465073, -0.2199869 ,  0.99999994,  0.54031366,  0.01249529],
       [-0.32718945, -0.0894393 ,  0.54031366,  1.        , -0.15579385],
       [ 0.14108971, -0.33247685,  0.01249529, -0.15579385,  0.99999994]],
      dtype=float32)

In [18]:
#Find the most similar sentence to the first sentence (at index = 0)
idx = 0  # Index of the sentence whose nearest neighbour we want
best_sim = float("-inf")  # Highest cosine similarity seen so far (avoids shadowing the built-in max)
best_idx = -1  # Index of the most similar sentence seen so far
print("Input Sentence -->", sentences[idx])
for i in range(len(sentence_vectors)):
    if i == idx:
        continue  # Skip comparing the sentence with itself
    sim = cosine_similarity(sentence_vectors[i].reshape(1,-1),
                            sentence_vectors[idx].reshape(1,-1))[0][0]
    if sim > best_sim:
        best_sim = sim
        best_idx = i

print("Most Similar Sentence -->", sentences[best_idx])
print("Cosine Similarity:", best_sim)

Input Sentence --> this is the first sentence
Most Similar Sentence --> this is the second sentence
Cosine Similarity: 0.1536108
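The same search can be done for every sentence at once from the full similarity matrix, without an explicit loop. This is one possible vectorized sketch; the matrix is copied from the `cosine_similarity` output shown earlier, and the names `masked` and `nearest` are chosen for this example.

```python
import numpy as np

# Similarity matrix copied from the cosine_similarity output above.
similarity = np.array([
    [ 1.        ,  0.15361078,  0.08465073, -0.32718945,  0.14108971],
    [ 0.15361078,  0.99999994, -0.2199869 , -0.0894393 , -0.33247685],
    [ 0.08465073, -0.2199869 ,  0.99999994,  0.54031366,  0.01249529],
    [-0.32718945, -0.0894393 ,  0.54031366,  1.        , -0.15579385],
    [ 0.14108971, -0.33247685,  0.01249529, -0.15579385,  0.99999994],
], dtype=np.float32)

masked = similarity.copy()
np.fill_diagonal(masked, -np.inf)  # a sentence must not match itself
nearest = masked.argmax(axis=1)    # most similar sentence index for each row

print(nearest.tolist())  # [1, 0, 3, 2, 0]
```

Row 0 maps to index 1, which agrees with the loop above: the sentence most similar to the first one is the second one.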


#### More about Doc2vec here:
https://radimrehurek.com/gensim/models/doc2vec.html