
## Word2Vec Model
- Google's pretrained Word2Vec model, trained on the Google News corpus
- Contains vector representations of about 3 million words and phrases

- Words that appear in similar contexts have similar vectors
- The similarity between two words can be measured with cosine similarity (cosine distance is 1 minus that value)
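
To make the cosine measure concrete, here is a minimal from-scratch sketch. The names `cosine_sim` and `cosine_dist` and the 3-dimensional toy vectors are assumptions made up for illustration; they are not real Word2Vec embeddings.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_dist(a, b):
    """Cosine distance, defined here as 1 minus cosine similarity."""
    return 1.0 - cosine_sim(a, b)

# Hypothetical toy vectors (NOT real embeddings)
v_apple = [1.0, 2.0, 0.0]
v_banana = [2.0, 4.0, 0.0]  # points in the same direction as v_apple
v_india = [0.0, 0.0, 3.0]   # orthogonal to v_apple

print(cosine_sim(v_apple, v_banana))  # close to 1: same direction
print(cosine_sim(v_apple, v_india))   # 0: no shared direction
```

Because the measure depends only on the angle between the vectors, `v_banana` scores as maximally similar to `v_apple` even though it is twice as long.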


### Applications
- Text Similarity
- Language Translation
- Finding Odd Words
- Word Analogies
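
As a rough illustration of the word-analogy application, the sketch below answers "man is to king as woman is to ?" with vector arithmetic. The 2-dimensional vectors are hand-made assumptions (first axis loosely "royalty", second axis loosely "gender"), and the helper names `cosine_sim` and `analogy` are hypothetical; real embeddings are learned and have 300 dimensions.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical toy embeddings (NOT taken from the pretrained model)
toy_vectors = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

def analogy(a, b, c, vectors):
    """Return the word d that best completes a : b :: c : d, using b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three query words so the answer is a new word
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_sim(vectors[w], target))

print(analogy("man", "king", "woman", toy_vectors))
```

With a loaded `KeyedVectors` object, Gensim's `most_similar(positive=[...], negative=[...])` method performs a comparable lookup over the full vocabulary.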


### Word Embeddings
- Word embeddings are numerical representations of words in the form of vectors.

- The Word2Vec model represents each word as a 300-dimensional vector

- In this tutorial we will see how to use a pre-trained Word2Vec model.
- The model file is around 3.5 GB
- We will use Gensim, a popular NLP package.


Gensim's Word2Vec module provides optimized implementations of two training architectures:

1) **CBOW** (Continuous Bag of Words) model

2) **Skip-gram** model
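
The two architectures differ in what they predict: CBOW predicts a word from its surrounding context, while skip-gram predicts the surrounding context words from a word. The sketch below only shows how the training examples differ; it is not Gensim's implementation, and the function names, the `window` parameter, and the sample sentence are assumptions for illustration. Real training also involves steps such as negative sampling that are omitted here.

```python
def cbow_pairs(tokens, window=2):
    """CBOW examples: (list of context words, target word)."""
    pairs = []
    for i, target in enumerate(tokens):
        # Up to `window` words on each side of position i
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs

def skipgram_pairs(tokens, window=2):
    """Skip-gram examples: (center word, one context word)."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox".split()

for context, target in cbow_pairs(sentence, window=1):
    print("CBOW:      %s -> %s" % (context, target))

for center, context_word in skipgram_pairs(sentence, window=1):
    print("Skip-gram: %s -> %s" % (center, context_word))
```

CBOW produces one example per position, while skip-gram produces one example per (center, context) pair, so the same sentence yields more skip-gram examples.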


Paper 1: [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781.pdf)


Paper 2: [Distributed Representations of Words and Phrases and their Compositionality](https://arxiv.org/abs/1310.4546)

### Word2Vec using Gensim
Documentation: https://radimrehurek.com/gensim/models/word2vec.html

### Code

##### Load Word2Vec Model

In [7]:
import gensim
from gensim.models import KeyedVectors
from sklearn.metrics.pairwise import cosine_similarity

In [3]:
word_vectors = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

In [14]:
v_a = word_vectors['apple']
v_b = word_vectors['banana']

In [15]:
cosine_similarity([v_a], [v_b])

array([[0.5318406]], dtype=float32)

### Find Odd One Out

In [19]:
import numpy as np

In [34]:
def odd_one_out(words):
    """Return the word whose vector is least similar to the average vector of all words."""

    # Look up the embedding of every word in the list
    all_word_vectors = [word_vectors[w] for w in words]
    avg_vector = np.mean(all_word_vectors, axis=0)

    # Track the word with the lowest cosine similarity to the average
    odd_word = None
    min_sim = float("inf")

    for w in words:
        # cosine_similarity returns a 1x1 array, so extract the scalar
        sim = cosine_similarity([word_vectors[w]], [avg_vector])[0][0]
        if sim < min_sim:
            min_sim = sim
            odd_word = w

        print("Similarity between %s and avg vector is %.3f" % (w, sim))

    return odd_word

In [43]:
input_1 = ["apple", "mango", "banana", "india"]
input_2 = ['france', 'russia', 'germany', 'paris']
input_3 = ['cat', 'dog','horse','human']
input_4 = ['france', 'germany', 'india', 'italy']

In [44]:
odd_one_out(input_4)

Similarity between france and avg vector is 0.847
Similarity between germany and avg vector is 0.850
Similarity between india and avg vector is 0.805
Similarity between italy and avg vector is 0.811


'india'