## Word2Vec using the gensim library

**Author: Abhishek Dey**


* **Word2Vec** is a popular technique in Natural Language Processing (NLP) that converts words into numerical vectors, so that words with similar meanings have similar vector representations.

* There are two main Word2Vec architectures:

    * **CBOW (Continuous Bag of Words):**

        Predicts the current word based on its context (surrounding words).

        Example: Given "The cat ___ on the mat", predict "sat".

    * **Skip-Gram:**

        Predicts surrounding context words given a current word.

        Example: Given "sat", predict "The", "cat", "on", "the", "mat".
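
The difference between the two architectures can be made concrete by listing the training examples each one would draw from a sentence. The following is a minimal sketch in plain Python, not gensim's actual implementation: the helper names (`context_window`, `cbow_examples`, `skip_gram_examples`) are hypothetical, and a fixed window of 2 is assumed, so "mat" falls outside the context of "sat".

```python
def context_window(tokens, index, window):
    """Return the tokens within `window` positions of tokens[index], excluding it."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right


def cbow_examples(tokens, window):
    """(context words, target word) pairs: the context predicts the target."""
    return [(context_window(tokens, i, window), target)
            for i, target in enumerate(tokens)]


def skip_gram_examples(tokens, window):
    """(target word, context word) pairs: the target predicts each context word."""
    return [(target, context)
            for i, target in enumerate(tokens)
            for context in context_window(tokens, i, window)]


sentence = ["the", "cat", "sat", "on", "the", "mat"]
cbow_pairs = cbow_examples(sentence, window=2)
skip_gram_pairs = skip_gram_examples(sentence, window=2)

# CBOW: one example per position, with the whole context as input.
print(cbow_pairs[2])

# Skip-gram: one example per (target, context word) combination.
print([pair for pair in skip_gram_pairs if pair[0] == "sat"])
```

CBOW produces one example per word, while skip-gram produces one example per word and context-word combination, which is why skip-gram sees more training pairs from the same text.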



### Import libraries

In [55]:
import gensim
from gensim.models import Word2Vec
# word_tokenize relies on NLTK's Punkt tokenizer data (installed via nltk.download)
from nltk import word_tokenize
from textacy import preprocessing as tp

### Data

In [56]:
data = ['He loves music', 'She like songs', 'She enjoys music', 'He likes music', 'She loves songs']

### Text pre-processing

In [57]:
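def text_preprocessing(text):
    """Strip punctuation, normalise whitespace and lowercase the text."""

    text = tp.remove.punctuation(text)     # replace punctuation marks with spaces
    text = tp.normalize.whitespace(text)   # collapse repeated spaces and trim the ends
    text = text.lower()                    # treat 'He' and 'he' as the same word

    return text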
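
For readers without textacy installed, roughly the same clean-up can be approximated with the standard library. This is a sketch under stated assumptions rather than a drop-in replacement: `simple_preprocessing` is a hypothetical name, only ASCII punctuation is handled, and all whitespace (including newlines) is collapsed to single spaces, which may differ from textacy's behaviour on multi-line text.

```python
import string

# Map every ASCII punctuation character to a space (assumption: ASCII-only input).
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def simple_preprocessing(text):
    """Hypothetical stand-in for text_preprocessing that avoids textacy."""
    text = text.translate(_PUNCT_TO_SPACE)  # punctuation -> spaces
    text = " ".join(text.split())           # collapse whitespace runs and trim
    return text.lower()                     # lowercase last, as in the original


print(simple_preprocessing("He   loves, music!"))
```

On the five sentences used in this notebook, which contain no punctuation, the result should simply be the lowercased input.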

### Filtered data

In [58]:
filtered_data = [text_preprocessing(text) for text in data]

In [59]:
filtered_data

['he loves music',
 'she like songs',
 'she enjoys music',
 'he likes music',
 'she loves songs']

### Tokenised data

In [60]:
tokenised_data = [word_tokenize(text) for text in filtered_data]

tokenised_data

[['he', 'loves', 'music'],
 ['she', 'like', 'songs'],
 ['she', 'enjoys', 'music'],
 ['he', 'likes', 'music'],
 ['she', 'loves', 'songs']]

### Word2Vec - Skip-gram

In [61]:
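# vector_size: embedding dimensions; window: max distance between target and context word;
# min_count=1: keep every word, even if it occurs once; sg=1: use the skip-gram architecture
model_skip_gram = Word2Vec(tokenised_data, vector_size=50, window=2, min_count=1, sg=1)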

### Word2Vec - CBOW

In [62]:
# Same settings as above, but sg=0 selects the CBOW architecture
model_cbow = Word2Vec(tokenised_data, vector_size=50, window=2, min_count=1, sg=0)

In [63]:
# Look up the learned 50-dimensional vector for the word 'he'
vector = model_cbow.wv['he']

vector

array([ 0.00018913,  0.00615464, -0.01362529, -0.00275093,  0.01533716,
        0.01469282, -0.00734659,  0.0052854 , -0.01663426,  0.01241097,
       -0.00927464, -0.00632821,  0.01862271,  0.00174677,  0.01498141,
       -0.01214813,  0.01032101,  0.01984565, -0.01691478, -0.01027138,
       -0.01412967, -0.0097253 , -0.00755713, -0.0170724 ,  0.01591121,
       -0.00968788,  0.01684723,  0.01052514, -0.01310005,  0.00791574,
        0.0109403 , -0.01485307, -0.01481144, -0.00495046, -0.01725145,
       -0.00316314, -0.00080687,  0.00659937,  0.00288376, -0.00176284,
       -0.01118812,  0.00346073, -0.00179474,  0.01358738,  0.00794718,
        0.00905894,  0.00286861, -0.00539971, -0.00873363, -0.00206415],
      dtype=float32)

### Most similar word

In [64]:
similar_word = model_cbow.wv.most_similar("likes")[0]

similar_word

('enjoys', 0.18339458107948303)

In [65]:
similar_word = model_cbow.wv.most_similar("enjoys")[0]

similar_word

('likes', 0.18339456617832184)

In [66]:
similar_word = model_cbow.wv.most_similar("loves")[0]

similar_word

('like', 0.0449172779917717)
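
The scores returned by `most_similar` are cosine similarities between word vectors. Cosine similarity is symmetric, which is why "likes" and "enjoys" report essentially the same score for each other above. The low values (about 0.18 and 0.04) are likely a consequence of the five-sentence corpus: with so little data the vectors probably stay close to their random initial values. The sketch below re-creates the ranking in plain Python; the function names and the 3-dimensional `toy_vectors` are made up for illustration and are not taken from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Made-up 3-dimensional vectors, for illustration only (not from the model).
toy_vectors = {
    "likes": [1.0, 0.0, 0.0],
    "enjoys": [1.0, 1.0, 0.0],
    "songs": [0.0, 0.0, 1.0],
}

ranking = most_similar("likes", toy_vectors)
print(ranking[0])  # the closest word and its score
```

Taking `[0]` of the ranking mirrors the `most_similar(...)[0]` calls in the cells above: it keeps only the closest word and its score.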