
# Word2Vec Model
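- Google's pretrained Word2Vec model
- Contains vector representations of about 3 million words and phrases, trained on roughly 100 billion words of Google News text

- Words that appear in similar contexts have similar vectors
- The similarity between two words can be measured with cosine similarity (cosine distance is 1 - cosine similarity)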
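
To make the similarity measure concrete, here is a minimal, dependency-free sketch of cosine similarity. The three short vectors are hypothetical values made up for illustration; they are not taken from the Google News model, whose vectors have 300 dimensions.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|), always in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional "embeddings" (illustrative values only).
fruit_a = [0.9, 0.1, 0.0]
fruit_b = [0.8, 0.2, 0.1]
company = [0.0, 0.2, 0.9]

sim_close = cosine_sim(fruit_a, fruit_b)  # vectors pointing in nearly the same direction
sim_far = cosine_sim(fruit_a, company)    # vectors pointing in different directions

print(sim_close > sim_far)
```

Only the direction of the vectors matters here, not their length, which is why cosine similarity is a common choice for comparing embeddings.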


### Applications
- Text Similarity
- Language Translation
- Finding Odd Words
- Word Analogies


### Word Embeddings
- Word embeddings are numerical representations of words in the form of vectors.

- The Word2Vec model represents each word as a 300-dimensional vector.

- In this tutorial we will see how to use a pre-trained word2vec model.
- The compressed model file is around 1.5 GB.
- We will work with Gensim, a popular NLP package.


Gensim's Word2Vec module provides optimized implementations of two architectures:

1) **CBOW** (Continuous Bag of Words) model

2) **Skip-gram** model


Paper 1: [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781.pdf)

Paper 2: [Distributed Representations of Words and Phrases and their Compositionality](https://arxiv.org/abs/1310.4546)

### Word2Vec using Gensim
Link: https://radimrehurek.com/gensim/models/word2vec.html

In [1]:
!wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"

--2020-07-17 06:48:27--  https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.128.229
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.128.229|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1647046227 (1.5G) [application/x-gzip]
Saving to: ‘GoogleNews-vectors-negative300.bin.gz’


2020-07-17 06:48:47 (76.0 MB/s) - ‘GoogleNews-vectors-negative300.bin.gz’ saved [1647046227/1647046227]



# CODE ##

##### Load Word2Vec Model


**KeyedVectors** - This object holds the mapping between words and their embeddings. Once the vectors have been trained (or loaded from disk, as here), it can be used directly to query them in various ways.

In [3]:
# Libraries
import numpy as np
import gensim
from gensim.models import KeyedVectors , Word2Vec
from sklearn.metrics.pairwise import cosine_similarity

In [4]:
word_vector = KeyedVectors.load_word2vec_format('/content/GoogleNews-vectors-negative300.bin.gz', binary=True)


In [5]:
type(word_vector)

gensim.models.keyedvectors.Word2VecKeyedVectors

In [10]:
Apple = word_vector['Apple']
Apple

array([-1.74804688e-01,  3.00292969e-02, -2.16796875e-01,  1.56250000e-01,
       -3.57421875e-01, -6.05468750e-02,  1.36718750e-01,  9.57031250e-02,
        3.17382812e-03, -4.29687500e-02, -3.30078125e-01,  2.57812500e-01,
        2.51953125e-01, -2.77343750e-01, -6.98242188e-02, -2.95410156e-02,
        3.22265625e-01, -7.76367188e-02, -3.06396484e-02, -1.67968750e-01,
       -5.76171875e-02,  3.05175781e-02,  5.52368164e-03, -1.26953125e-01,
       -1.44042969e-02,  1.75781250e-01,  9.47265625e-02,  3.16406250e-01,
       -7.81250000e-03, -3.40270996e-03,  3.63769531e-02,  1.11816406e-01,
       -1.24023438e-01,  1.29882812e-01, -3.22265625e-02, -1.60156250e-01,
        7.56835938e-02,  6.73828125e-02,  4.08203125e-01,  2.23632812e-01,
        1.60156250e-01,  3.63769531e-02, -1.64062500e-01, -3.51562500e-01,
        4.49218750e-02,  6.34765625e-02, -1.15234375e-01,  3.12500000e-01,
       -2.80761719e-02, -9.22851562e-02,  5.98144531e-02,  1.57470703e-02,
       -1.15234375e-01,  

In [11]:
Apple.shape

(300,)

In [16]:
mango = word_vector['mango']
apple = word_vector['apple']
banana = word_vector['banana']

In [13]:
cosine_similarity([Apple], [mango])

array([[0.11593594]], dtype=float32)

In [15]:
cosine_similarity([apple], [mango])

array([[0.57518554]], dtype=float32)

In [17]:
cosine_similarity([banana], [mango])

array([[0.63652116]], dtype=float32)

In [18]:
Google = word_vector['Google']
cosine_similarity([Apple], [Google])

array([[0.56835705]], dtype=float32)

## 1. Find the Odd One Out

In [55]:
def odd_one_out(words):
    """Accepts a list of words and returns the odd word"""
    # Look up the embedding of every word in the list
    all_word_vec = [word_vector[w] for w in words]

    # The mean vector represents the common "theme" of the list
    avg_vec = np.mean(all_word_vec, axis=0)

    odd_word = None
    min_sim = 2  # cosine similarity never exceeds 1, so the first word always beats this
    for word in words:
        temp_sim = cosine_similarity([avg_vec], [word_vector[word]])
        print(f"avg_vec and {word} ---> {temp_sim}")

        # The word least similar to the mean vector is the odd one out
        if temp_sim[0, 0] < min_sim:
            min_sim = temp_sim[0, 0]
            odd_word = word

    return odd_word

In [19]:
input_1 = ["apple", "mango", "juice", "party", "orange"]
input_2 = ["music", "dance", "sleep", "dancer", "food"]
input_3 = ["match", "player", "football", "cricket", "dancer"]
input_4 = ["india", "paris", "russia", "france", "germany"]

In [57]:
odd_word = odd_one_out(input_1)
print(f"odd word in the given list is --> {odd_word}")

avg_vec and apple ---> [[0.7806554]]
avg_vec and mango ---> [[0.7606032]]
avg_vec and juice ---> [[0.7106042]]
avg_vec and party ---> [[0.357093]]
avg_vec and orange ---> [[0.649024]]
odd word in the given list is --> party


In [58]:
odd_word = odd_one_out(input_2)
print(f"odd word in the given list is --> {odd_word}")

avg_vec and music ---> [[0.66403615]]
avg_vec and dance ---> [[0.80607384]]
avg_vec and sleep ---> [[0.5149707]]
avg_vec and dancer ---> [[0.7154054]]
avg_vec and food ---> [[0.51771235]]
odd word in the given list is --> sleep


In [59]:
odd_word = odd_one_out(input_3)
print(f"odd word in the given list is --> {odd_word}")

avg_vec and match ---> [[0.5837205]]
avg_vec and player ---> [[0.6805351]]
avg_vec and football ---> [[0.72256005]]
avg_vec and cricket ---> [[0.69646657]]
avg_vec and dancer ---> [[0.52681357]]
odd word in the given list is --> dancer


In [60]:
odd_word = odd_one_out(input_4)
print(f"odd word in the given list is --> {odd_word}")

avg_vec and india ---> [[0.80707854]]
avg_vec and paris ---> [[0.74804693]]
avg_vec and russia ---> [[0.79275256]]
avg_vec and france ---> [[0.8136487]]
avg_vec and germany ---> [[0.841681]]
odd word in the given list is --> paris


## 2. Word Analogies Task

In the word analogy task, we complete the sentence "a is to b as c is to __". An example is "man is to woman as king is to queen". In detail, we are looking for a word d such that the associated word vectors `e_a, e_b, e_c, e_d` are related in the following manner: `e_b - e_a ≈ e_d - e_c`. We measure the similarity between `e_b - e_a` and `e_d - e_c` using cosine similarity and pick the word d that maximizes it.
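
The offset rule can be tried on a toy vocabulary before running it against the full model. The sketch below is only an illustration under stated assumptions: the 2-dimensional vectors in `toy` are hypothetical values chosen by hand (axis 0 loosely standing for "royalty", axis 1 for "gender"), and `complete_analogy` is a helper name introduced here, not part of Gensim.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def sub(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]

# Hypothetical hand-made embeddings (not real word2vec vectors).
toy = {
    "man":   [0.0, 0.0],
    "woman": [0.0, 1.0],
    "king":  [1.0, 0.0],
    "queen": [1.0, 1.0],
    "apple": [0.5, -1.0],
}

def complete_analogy(a, b, c, vectors):
    """Return the word d whose offset e_d - e_c is most similar to e_b - e_a."""
    target = sub(vectors[b], vectors[a])
    best_word, best_sim = None, -2.0  # below the minimum cosine similarity of -1
    for w, vec in vectors.items():
        if w in (a, b, c):
            continue  # the answer must be a new word
        sim = cosine_sim(target, sub(vec, vectors[c]))
        if sim > best_sim:
            best_word, best_sim = w, sim
    return best_word

answer = complete_analogy("man", "woman", "king", toy)
print(answer)
```

In this toy space the offset from "king" to "queen" is exactly the offset from "man" to "woman", so the match is perfect; with real embeddings the relation only holds approximately.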

![Word2Vec](http://jalammar.github.io/images/word2vec/word2vec.png)

- **man -> woman :: prince -> princess**
- **italy -> italian :: spain -> spanish**
- **india -> delhi :: japan -> tokyo**
- **man -> woman :: boy -> girl**
- **small -> smaller :: large -> larger**

## Try it out
**man -> coder :: woman -> ______?**

In [None]:
# gensim 3.x API; gensim 4+ exposes the vocabulary as word_vector.key_to_index
word_vector.vocab.keys()

In [123]:
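def predict_word(a, b, c, model):
    """Accepts a triad of words, a,b,c and returns d such that a is to b : c is to d"""
    wv_a, wv_b, wv_c = model[a], model[b], model[c]
    max_sim = -10  # lower than any cosine similarity, so the first candidate wins
    pred = None
    # gensim 3.x vocabulary; in gensim 4+ use model.key_to_index instead
    words = model.vocab.keys()

    # Brute-force scan of the whole vocabulary (slow for 3 million entries)
    for w in words:
        if w in [a, b, c]:
            continue

        wv_w = model[w]

        # Compare the offset b - a with the candidate offset w - c
        temp_sim = cosine_similarity([wv_b - wv_a], [wv_w - wv_c])[0, 0]

        if temp_sim > max_sim:
            max_sim = temp_sim
            pred = w

    return pred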

In [124]:
a, b, c = "man", "woman", "prince"
predict_word(a, b, c, word_vector)

  


'princess'

## 3. Training Your Own Word2Vec Model

A Word2Vec model can learn embeddings from any text corpus, using either of two architectures:
- Continuous Bag of Words model **(CBOW)**
- Skip-gram model

The algorithm slides a window around each target word (Y) to collect its context words (X), and the model is trained on the resulting (X, Y) pairs in a supervised manner. The algorithm was developed by Tomas Mikolov and colleagues at Google.
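
As a rough illustration of how the (X, Y) pairs arise, the sketch below builds them for one tokenized sentence. It shows only the pair construction; a real word2vec implementation adds further steps (such as negative sampling) that are not shown. The function name `context_target_pairs` and the sample sentence are made up for this example.

```python
def context_target_pairs(tokens, window=2):
    """Yield (context_words, target_word) for every position in tokens."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]      # up to `window` words before the target
        right = tokens[i + 1:i + 1 + window]     # up to `window` words after the target
        yield left + right, target

sentence = ["the", "wedding", "was", "lavish"]

# CBOW view: predict the target word from all of its context words together.
cbow_pairs = list(context_target_pairs(sentence, window=1))

# Skip-gram view: predict each context word separately from the target word.
skipgram_pairs = [(target, ctx) for context, target in cbow_pairs for ctx in context]

for context, target in cbow_pairs:
    print(context, "->", target)
```

The two architectures use the same windows and differ only in the direction of prediction: CBOW goes from context to target, skip-gram from target to context.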

#### Data Preparation



- Each sentence must be tokenized into a list of words.

- The sentences can be loaded into memory all at once, or we can build a data pipeline that feeds them to the model iteratively.
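
The second option can be sketched as a small restartable iterable. Gensim's `Word2Vec` iterates over the corpus more than once (to build the vocabulary and then for each training epoch), so a one-shot generator is not enough; an object with `__iter__` that re-opens the file each time works. The class name `SentenceStream`, the one-sentence-per-line file layout, and the whitespace tokenization are assumptions made for this example.

```python
import os
import tempfile

class SentenceStream:
    """Restartable iterable: yields one tokenized sentence per non-empty line of a file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated repeatedly.
        with open(self.path, "r", encoding="utf-8") as f:
            for line in f:
                tokens = line.lower().split()
                if tokens:  # skip blank lines
                    yield tokens

# Demo on a throwaway file with made-up content.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("Deepika and Ranveer got married\n\nThe wedding was lavish\n")
    demo_path = tmp.name

stream = SentenceStream(demo_path)
first_pass = list(stream)
second_pass = list(stream)  # a second pass works because __iter__ re-opens the file
os.remove(demo_path)

print(first_pass)
```

An object like this could be passed to `Word2Vec` in place of an in-memory list of sentences, so only one line is held in memory at a time.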


In [93]:
#libs
import nltk
from nltk.corpus import stopwords
import re

nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [66]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [85]:
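## Read the file
def readFile(file):
    # Build the stopword set once instead of reloading the list for every token
    stop_words = set(stopwords.words('english'))

    # The context manager closes the file even if reading fails
    with open(file, 'r') as f:
        text = f.read()

    sent = nltk.sent_tokenize(text)

    data = []
    for s in sent:
        words = nltk.word_tokenize(s)
        # Lowercase, then drop stopwords and tokens of length 2 or less
        words = [w.lower() for w in words if w.lower() not in stop_words and len(w) > 2]
        data.append(words)
    return data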

In [98]:
a = "This year was the year of big fat 2019 lavish 2020 and 99542654862656 extravagant weddings."
stop_words = set(stopwords.words('english'))

for i in nltk.sent_tokenize(a):
    i = re.sub('[^a-zA-Z]', " ", i)   # replace digits and punctuation with spaces
    i = re.sub('[ \t\n]+', ' ', i)    # collapse runs of whitespace
    print(i)
    w = nltk.word_tokenize(i)
    # Lowercase, then drop stopwords and tokens of length 2 or less
    w = [j.lower() for j in w if j.lower() not in stop_words and len(j) > 2]
    print(w)

This year was the year of big fat lavish and extravagant weddings 
['year', 'year', 'big', 'fat', 'lavish', 'extravagant', 'weddings']


In [86]:
file = '/content/bollywood_news.txt'
text = readFile(file)

In [88]:
print(text)

[['deepika', 'padukone', 'ranveer', 'singh', 'wedding', 'one', 'biggest', 'bollywood', 'events', 'happened', '2018'], ['deepveer', 'celebrations', 'hooked', 'phones', 'waiting', 'come', 'also', 'gave', 'enough', 'reason', 'believe', 'stylish', 'two', 'couple'], ['airport', 'looks', 'reception', 'parties', 'everything', 'entire', 'timeline', 'ranveer', 'wedding', 'style', 'file'], ['ambanis', 'deepika', 'ranveer', 'priyanka', 'nick'], ['man', 'proves', 'wedding', 'year', 'year', 'year', 'big', 'fat', 'lavish', 'extravagant', 'weddings'], ['isha', 'ambani', 'anand', 'piramal', 'deepika', 'padukone', 'ranveer', 'singh', 'priyanka', 'chopra', 'nick', 'jonas', 'kapil', 'sharma', 'ginni', 'chatrath', '2018', 'saw', 'many', 'grand', 'weddings'], ['nothing', 'beats', 'man', 'wedding', 'year', 'award', 'social', 'media'], ['wedding', 'season', 'year', 'kicked', 'deepika', 'padukone', 'ranveer', 'singh', 'flew', 'lake', 'como', 'tie', 'knot', 'two', 'days', 'november'], ['several', 'lavish', 'we

In [89]:
# lib
from gensim.models import Word2Vec

In [90]:
# gensim 3.x signature; in gensim 4+ the `size` parameter is named `vector_size`
model = Word2Vec(text, size=100, window=10, min_count=2)

In [122]:
word = model.wv.vocab
word.keys()

dict_keys(['deepika', 'padukone', 'ranveer', 'singh', 'wedding', 'one', 'biggest', 'bollywood', 'events', 'happened', '2018', 'celebrations', 'waiting', 'come', 'also', 'gave', 'two', 'couple', 'airport', 'looks', 'reception', 'parties', 'everything', 'style', 'priyanka', 'nick', 'man', 'year', 'big', 'lavish', 'weddings', 'isha', 'ambani', 'anand', 'piramal', 'chopra', 'jonas', 'kapil', 'sharma', 'ginni', 'chatrath', 'saw', 'many', 'grand', 'social', 'media', 'flew', 'lake', 'como', 'tie', 'knot', 'days', 'november', 'several', 'bengaluru', 'mumbai', 'even', 'tied', 'american', 'singer', 'jodhpur', 'december', 'yet', 'another', 'week', 'hosted', 'delhi', 'host', 'party', 'los', 'angeles', 'time', 'could', 'pre-wedding', 'festivities', 'took', 'udaipur', 'pop', 'soon', 'event', 'followed', 'three', 'stars', 'star', 'married', 'player', 'cricketer', 'much', 'like', 'couples', 'cake', 'appearances', 'see', 'white', 'italy', 'ceremonies', 'sindhi', 'seen', 'wearing', 'custom-made', 'sabya

In [99]:
model.wv['nick']


array([ 2.8621913e-03, -2.0515082e-04,  6.3858822e-04, -8.6055323e-04,
       -2.4759721e-03,  1.1712040e-03,  2.2154090e-03,  2.0525013e-05,
       -4.0697204e-03, -5.0858120e-03,  2.5217290e-04,  7.8432204e-04,
        3.5915752e-03, -2.1355663e-04, -3.8461166e-03,  3.4058844e-03,
       -5.5755512e-04,  4.9911360e-03, -1.3874776e-03, -1.0732628e-03,
        1.7077090e-04,  1.8757195e-03, -2.3901912e-03, -3.0386650e-03,
        2.8603077e-03,  1.3546457e-03,  9.4087806e-04,  1.1219215e-03,
       -1.7312845e-03,  4.0384373e-03, -4.1738208e-03, -4.1057235e-03,
       -2.8084610e-03, -1.1593305e-03, -1.2666425e-04, -2.4157108e-03,
        1.4175175e-03, -4.9279369e-03,  5.0097327e-03, -4.4084559e-03,
       -4.7290404e-03, -4.9083070e-03,  4.0698764e-03,  4.7234125e-03,
        4.6677832e-03, -4.5823785e-03, -2.4699089e-03,  2.5960170e-03,
        1.7973717e-04,  1.5634418e-03,  3.8455368e-03, -2.2028040e-03,
       -7.2895311e-04, -4.1146525e-03, -2.1693269e-03, -2.3004945e-04,
      

In [100]:
model.wv['nick'].shape


(100,)

In [116]:
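def predict_actor(a, b, c, model):
    """Accepts a triad of words, a,b,c and returns d such that a is to b : c is to d"""
    # Candidate answers are restricted to this fixed list of names
    actors = ["ranveer", "deepika", "padukone", "singh", "nick", "jonas", "chopra", "priyanka", "virat", "anushka"]

    # Word vectors of a trained Word2Vec model are accessed through model.wv
    wv_a, wv_b, wv_c = model.wv[a], model.wv[b], model.wv[c]
    max_sim = -10  # lower than any cosine similarity, so the first candidate wins
    pred = None

    for w in actors:
        if w == a or w == b or w == c:
            continue

        wv_w = model.wv[w]

        # Compare the offset b - a with the candidate offset w - c
        temp_sim = cosine_similarity([wv_b - wv_a], [wv_w - wv_c])[0, 0]

        if temp_sim > max_sim:
            max_sim = temp_sim
            pred = w

    return pred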

## 4. Test your Model

In [117]:
a, b, c = "nick", "priyanka", "virat"
print(predict_actor(a, b, c, model))

chopra


In [118]:
a, b, c = "ranveer", "deepika", "priyanka"
print(predict_actor(a, b, c, model))

anushka


In [119]:
a,b,c = "ranveer", "singh", "deepika"
predict_actor(a, b, c, model)


'priyanka'

In [120]:
triad = ("deepika","padukone","priyanka")
predict_actor(*triad, model)


'jonas'

In [121]:
triad = ("priyanka","jonas","nick")
predict_actor(*triad, model)


'virat'