
## Word2Vec Model
- Google's pretrained Word2Vec model (trained on the Google News corpus)
- Contains 300-dimensional vectors for 3 million words and phrases, trained on roughly 100 billion words of text

- Words which are similar in context have similar vectors
- The similarity between two words can be measured with cosine similarity (cosine distance is 1 minus cosine similarity)
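
As a minimal sketch of what cosine similarity computes, the snippet below uses only the standard library and made-up 3-dimensional vectors (the helper name `cosine_similarity_1d` and the toy values are assumptions for illustration, not real embeddings). Later cells in this notebook use `sklearn.metrics.pairwise.cosine_similarity` instead.

```python
import math

def cosine_similarity_1d(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Toy vectors, invented for illustration only.
fruit_a = [0.9, 0.8, 0.1]
fruit_b = [0.8, 0.9, 0.2]
country = [0.1, 0.2, 0.9]

sim_fruits = cosine_similarity_1d(fruit_a, fruit_b)
sim_mixed = cosine_similarity_1d(fruit_a, country)

# The two "fruit" vectors point in nearly the same direction,
# so their similarity is higher than the fruit/country pair.
print(sim_fruits > sim_mixed)
```

A value near 1 means the vectors point in the same direction, a value near 0 means they are roughly orthogonal, and negative values mean they point in opposing directions.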


### Applications
- Text Similarity
- Language Translation
- Finding Odd Words
- Word Analogies


Gensim's Word2Vec model provides optimized implementations of two training architectures:

1) **CBOW** (Continuous Bag of Words) model

2) **Skip-gram** model
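
To make the difference between the two architectures concrete, here is a rough sketch of how training examples could be drawn from one sentence. The function names and the fixed `window` size are assumptions for illustration. Real implementations add details not shown here, such as subsampling of frequent words and negative sampling. In Gensim, the architecture is selected with the `sg` parameter of `Word2Vec` (`sg=0` for CBOW, `sg=1` for skip-gram).

```python
def context_windows(tokens, window=2):
    """Yield (center_word, context_words) for every position in the sentence."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right

def cbow_examples(tokens, window=2):
    # CBOW: the whole context is the input, the center word is the target.
    return [(context, center) for center, context in context_windows(tokens, window)]

def skipgram_examples(tokens, window=2):
    # Skip-gram: the center word is the input, each context word is a separate target.
    return [
        (center, ctx)
        for center, context in context_windows(tokens, window)
        for ctx in context
    ]

sentence = ["the", "cat", "sat", "down"]
print(cbow_examples(sentence, window=1))
print(skipgram_examples(sentence, window=1))
```

CBOW produces one example per position (many context words predict one word), while skip-gram produces one example per (center, context) pair.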


Paper 1: [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781.pdf)

Paper 2: [Distributed Representations of Words and Phrases and their Compositionality](https://arxiv.org/abs/1310.4546)

### Word2Vec using Gensim
Link: https://radimrehurek.com/gensim/models/word2vec.html

### Code

##### Load Word2Vec Model


**KeyedVectors**: this object holds the mapping between words and their embeddings. After training (or after loading pretrained vectors, as done here), it can be used directly to query those embeddings in various ways.

In [2]:
#!pip install gensim

In [18]:
import gensim
from gensim.models import KeyedVectors
from sklearn.metrics.pairwise import cosine_similarity

In [7]:
word_vectors = KeyedVectors.load_word2vec_format("./GoogleNews-vectors-negative300.bin/GoogleNews-vectors-negative300.bin", binary=True)

In [26]:
v_apple = word_vectors['apple']
v_mango = word_vectors['mango']
v_india = word_vectors['india']

In [10]:
v_apple.shape

(300,)

In [15]:
v_mango.shape

(300,)

In [25]:
cosine_similarity([v_mango], [v_apple])

array([[0.57518554]], dtype=float32)

In [28]:
cosine_similarity([v_mango], [v_india])

array([[0.18220624]], dtype=float32)

### 1. Odd One Out

In [32]:
import numpy as np

In [50]:
def odd_one_out(words):
    """Return the word whose vector is least similar to the group's average vector."""
    all_word_vectors = [word_vectors[w] for w in words]
    avg_vector = np.mean(all_word_vectors, axis=0)

    odd_word = None
    min_sim = float("inf")

    for w in words:
        # cosine_similarity returns a 1x1 array; [0][0] extracts the scalar
        sim = cosine_similarity([avg_vector], [word_vectors[w]])[0][0]
        if sim < min_sim:
            min_sim = sim
            odd_word = w

    return odd_word

In [51]:
odd_one_out(['india', 'mango', 'banana', 'orange', 'grapes'])

'india'

In [52]:
odd_one_out(["football", "cricket", "baseball", "hockey", "dance"])

'dance'

In [56]:
cosine_similarity([word_vectors['big']], [word_vectors['small']])

array([[0.49586788]], dtype=float32)

In [53]:
odd_one_out(["laptop", "mobile", "tablet", "television"])

'television'

In [58]:
odd_one_out(["india","paris","russia","france","germany"])

'paris'

In [59]:
odd_one_out(["india", "nepal", "australia", "china", "pakistan"])

'china'

### 2. Word Analogies Task

In the word analogy task, we complete the sentence "a is to b as c is to __". An example is "man is to woman as king is to queen". More precisely, we look for a word d such that the associated word vectors `e_a, e_b, e_c, e_d` satisfy `e_b − e_a ≈ e_d − e_c`. We measure the similarity between `e_b − e_a` and `e_d − e_c` using cosine similarity and pick the d that maximizes it.

![Word2Vec](./images/word2vec.png)

`man -> woman :: prince -> princess`  
`italy -> italian :: spain -> spanish`  
`india -> delhi :: japan -> tokyo`  
`man -> woman :: boy -> girl`  
`small -> smaller :: large -> larger`  
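
The idea can be sketched on hand-made 2-dimensional vectors before applying it to the real model. Everything in this snippet is hypothetical: the `toy_vectors` values are invented so that one axis loosely stands for "royalty" and the other for "gender", and `complete_analogy` is an illustrative helper, not a library function.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def subtract(u, v):
    return [x - y for x, y in zip(u, v)]

# Hypothetical embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
toy_vectors = {
    "man":   [0.0, 0.0],
    "woman": [0.0, 1.0],
    "king":  [1.0, 0.0],
    "queen": [1.0, 1.0],
    "apple": [0.5, -1.0],
}

def complete_analogy(a, b, c, vectors):
    """Return the word d that maximizes cosine(e_b - e_a, e_d - e_c), skipping a, b and c."""
    target = subtract(vectors[b], vectors[a])
    best_word, best_sim = None, -2.0  # cosine similarity is never below -1
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue
        sim = cosine(target, subtract(vec, vectors[c]))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

print(complete_analogy("man", "woman", "king", toy_vectors))
```

With these toy values, `queen − king` points in exactly the same direction as `woman − man`, so it wins over the unrelated `apple`.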

#### Try it out 


`man -> coder :: woman -> ______?`


In [80]:
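def predict_word(a, b, c):
    """Complete the analogy a : b :: c : ? using a small candidate list."""
    a, b, c = a.lower(), b.lower(), c.lower()

    # We want the d that maximizes cosine_similarity(e_b - e_a, e_d - e_c)
    max_sim = -100.0
    d = None

    # Looping over the full vocabulary is slow, so a small
    # hand-picked candidate list is used here instead of:
    # words = word_vectors.vocab.keys()
    words = ['queen', 'cow', 'india', 'princess', 'boy', "spanish", "tokyo", "hindi", "girl", "woman", "man"]

    wa, wb, wc = word_vectors[a], word_vectors[b], word_vectors[c]

    for w in words:
        # Skip the input words themselves
        if w in [a, b, c]:
            continue

        wd = word_vectors[w]

        # cosine_similarity returns a 1x1 array; [0][0] extracts the scalar
        sim = cosine_similarity([wb - wa], [wd - wc])[0][0]

        if sim > max_sim:
            max_sim = sim
            d = w

    return d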

In [77]:
# Note: `.vocab` exists in gensim < 4.0; gensim 4.x replaced it with `key_to_index`
words = word_vectors.vocab.keys()
len(words)

3000000

In [81]:
predict_word("italy", "italian", "spain")

'spanish'

In [82]:
predict_word("man", "woman", "boy")

'girl'

### Using the Most Similar Method

In [84]:
word_vectors.most_similar_to_given("woman", ['man', 'bridge', "beer", "pencil"])

'man'

In [85]:
word_vectors.most_similar(positive=['woman', 'king'], negative=['man'], topn = 1)

[('queen', 0.7118192911148071)]
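
Note that `most_similar` does not compare difference vectors the way `predict_word` does. According to the Gensim documentation, it computes the cosine similarity between a simple mean of the given vectors (positive words added, negative words subtracted) and every vector in the model. The snippet below is a rough sketch of that idea on invented 3-dimensional vectors. `additive_analogy` and `toy_vectors` are hypothetical names, and the details (such as normalizing each vector first) are assumptions rather than a copy of Gensim's code.

```python
import math

def unit(v):
    """Scale a non-zero vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def additive_analogy(positive, negative, vectors, topn=1):
    """Rank words by cosine similarity to the mean of +positive and -negative unit vectors."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit([q / (len(positive) + len(negative)) for q in query])

    # Score every word except the inputs themselves.
    scores = [
        (word, sum(q * x for q, x in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in positive and word not in negative
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical embeddings: axes loosely mean (royalty, male, female).
toy_vectors = {
    "man":   [0.1, 1.0, 0.0],
    "woman": [0.1, 0.0, 1.0],
    "king":  [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "apple": [-1.0, 0.2, 0.2],
}

result = additive_analogy(["woman", "king"], ["man"], toy_vectors, topn=1)
print(result[0][0])
```

The two formulations often agree on clean examples, but they can rank candidates differently, so results from `predict_word` and `most_similar` are not guaranteed to match.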