AIM: WRITE A PROGRAM TO GENERATE WORD EMBEDDINGS FOR A TEXT USING WORD2VEC

THEORY: Word embedding is a language modeling technique that maps words to vectors of real numbers. It represents words or phrases as points in a multi-dimensional vector space. Word embeddings can be generated using various methods, such as neural networks, co-occurrence matrices, and probabilistic models.
Word2Vec is a group of models for generating word embeddings. These models are shallow two-layer neural networks with an input layer, one hidden layer, and an output layer. Word2Vec uses two architectures:
1. CBOW (Continuous Bag of Words): The CBOW model predicts the current word from the context words within a specified window. The input layer takes the context words and the output layer produces the current word. The number of units in the hidden layer equals the number of dimensions in which we want to represent each word.
2. Skip Gram: The Skip Gram model predicts the surrounding context words within a specified window from the current word. The input layer takes the current word and the output layer produces the context words. As in CBOW, the number of units in the hidden layer equals the number of dimensions in which we want to represent each word.
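To make the difference between the two architectures concrete, the sketch below builds the training examples each one would see for a single sentence. It is only an illustration: the helper names (`context_window`, `cbow_examples`, `skip_gram_examples`) and the toy sentence are assumptions made for this example, and real Word2Vec implementations add further steps such as subsampling and randomly shrinking the window.

```python
def context_window(tokens, index, window):
    """Return the tokens within `window` positions of tokens[index], excluding it."""
    start = max(0, index - window)
    left = tokens[start:index]
    right = tokens[index + 1:index + 1 + window]
    return left + right


def cbow_examples(tokens, window):
    """CBOW: each example maps the list of context words to the current word."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]


def skip_gram_examples(tokens, window):
    """Skip Gram: each example maps the current word to one context word."""
    return [(tokens[i], context)
            for i in range(len(tokens))
            for context in context_window(tokens, i, window)]


# Hypothetical toy sentence, already tokenized and lowercased.
sentence = ["natural", "language", "processing", "is", "fun"]

# CBOW example for the word 'language' with a window of 1:
print(cbow_examples(sentence, window=1)[1])
# (['natural', 'processing'], 'language')

# First three Skip Gram examples with a window of 1:
print(skip_gram_examples(sentence, window=1)[:3])
# [('natural', 'language'), ('language', 'natural'), ('language', 'processing')]
```

CBOW produces one example per word position, with the whole context as input, whereas Skip Gram produces one example per (current word, context word) pair.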

EXECUTION:

In [None]:
import warnings

import nltk
from gensim.models import Word2Vec
from nltk.tokenize import sent_tokenize, word_tokenize

# Download the Punkt tokenizer data used by sent_tokenize and word_tokenize
# (newer NLTK releases may require 'punkt_tab' instead).
nltk.download('punkt')
warnings.filterwarnings(action='ignore')

# Read the corpus and replace newlines with spaces
with open("Sample_text.txt", "r") as sample:
    text = sample.read().replace("\n", " ")

# Build the training data: a list of sentences, each a list of lowercase tokens
data = []
for sentence in sent_tokenize(text):
    data.append([word.lower() for word in word_tokenize(sentence)])

# CBOW model (sg=0 is the default).
# gensim 4.x names the embedding size vector_size; gensim 3.x called it size.
model1 = Word2Vec(data, min_count=1, vector_size=100, window=5)

# Similarity queries go through the model's KeyedVectors (model.wv)
print("Cosine similarity between 'language' and 'processing' - CBOW : ",
      model1.wv.similarity('language', 'processing'))

# Skip Gram model (sg=1)
model2 = Word2Vec(data, min_count=1, vector_size=100, window=5, sg=1)

print("Cosine similarity between 'language' and 'processing' - Skip Gram : ",
      model2.wv.similarity('language', 'processing'))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Cosine similarity between 'language' and 'processing' - CBOW :  -0.14855781
Cosine similarity between 'language' and 'processing' - Skip Gram :  -0.14400575


OUTPUT:
Cosine similarity between 'language' and 'processing' - CBOW : 0.083356075
Cosine similarity between 'language' and 'processing' - Skip Gram : 0.09677577
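
The values reported above are cosine similarities between the vectors of the two words: the dot product of the vectors divided by the product of their lengths, giving a number between -1 and 1. The sketch below shows the computation on small hand-made vectors. The function name `cosine_similarity` and the toy vectors are assumptions for illustration only; they are not taken from the trained models.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 2-dimensional vectors (real embeddings here have 100 dimensions).
print(cosine_similarity([3.0, 4.0], [3.0, 4.0]))    # same direction: 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # orthogonal: 0.0
print(cosine_similarity([3.0, 4.0], [-3.0, -4.0]))  # opposite direction: -1.0
```

Values near 1 indicate that the two words point in similar directions in the embedding space, values near 0 indicate little relation, and negative values indicate opposing directions.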

CONCLUSION:
Hence, we generated word embeddings for the text using Word2Vec and compared the CBOW and Skip Gram models using the cosine similarity between two words.