
## Word2Vec Model
- Google's pretrained Word2Vec model
- Contains vector representations for about 3 million words and phrases, trained on roughly 100 billion words of Google News text

- Words that appear in similar contexts have similar vectors
- The distance/similarity between two words can be measured using cosine distance/similarity
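
To make the similarity measure concrete, here is a minimal pure-Python sketch of cosine similarity and cosine distance. The helper names `cosine_sim` and `cosine_distance` and the 3-dimensional toy vectors are invented for illustration; the tutorial itself uses scikit-learn's `cosine_similarity` on real 300-dimensional vectors.

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Cosine distance is simply 1 minus cosine similarity."""
    return 1.0 - cosine_sim(u, v)

# Toy vectors (hypothetical values, not taken from any real model)
v_cat = [1.0, 2.0, 0.0]
v_kitten = [2.0, 4.0, 0.0]   # same direction as v_cat, different length
v_car = [0.0, 0.0, 3.0]      # orthogonal to v_cat

print(cosine_sim(v_cat, v_kitten))   # close to 1: same direction
print(cosine_sim(v_cat, v_car))      # 0: nothing in common
```

Only the direction of the vectors matters, not their length, which is why `v_cat` and `v_kitten` come out as (almost exactly) identical.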

### Applications
- Text Similarity
- Language Translation
- Finding Odd Words
- Word Analogies

### Word Embeddings
- Word embeddings are numerical representations of words in the form of vectors.
- The Word2Vec model represents each word as a 300-dimensional vector.
- In this tutorial we will see how to use a pre-trained Word2Vec model.
- The model download is around 1.5 GB (compressed).
- We will work with Gensim, a popular NLP package.






### Load Word2Vec Model
**KeyedVectors** - This object essentially contains the mapping between words and their embeddings. Once a model has been trained, it can be used directly to query those embeddings in various ways.

In [118]:
import gensim
import numpy as np
from gensim.models import word2vec
from gensim.models import KeyedVectors
from sklearn.metrics.pairwise import cosine_similarity

In [119]:
word_vectors = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin',binary=True)

In [120]:
v_apple = word_vectors["apple"] 
v_india = word_vectors["india"]
print(v_apple.shape)
print(v_india.shape)

(300,)
(300,)


In [121]:
cosine_similarity([v_india],[v_apple])

array([[0.17158596]], dtype=float32)

### 1. Odd One Out

In [122]:
# Accepts a list of words and returns the word least similar to the group average
def odd_one_out(words):
    all_word_vectors = [word_vectors[w] for w in words]
    avg_vector = np.mean(all_word_vectors, axis=0)
    odd_word = None  # renamed so it no longer shadows the function name
    min_similarity = 1.0
    for w in words:
        # cosine_similarity returns a 1x1 array; extract the scalar
        sim = cosine_similarity([word_vectors[w]], [avg_vector])[0][0]
        if sim < min_similarity:
            min_similarity = sim
            odd_word = w
        print("Similarity b/w %s and average vector is %.2f" % (w, sim))
    return odd_word

In [123]:
input_1 = ["apple","mango","juice","party","orange"] 
odd_one_out(input_1) 

Similarity b/w apple and average vector is 0.78
Similarity b/w mango and average vector is 0.76
Similarity b/w juice and average vector is 0.71
Similarity b/w party and average vector is 0.36
Similarity b/w orange and average vector is 0.65


'party'

In [124]:
input_2 = ["music","dance","sleep","dancer","food"]    
odd_one_out(input_2) 

Similarity b/w music and average vector is 0.66
Similarity b/w dance and average vector is 0.81
Similarity b/w sleep and average vector is 0.51
Similarity b/w dancer and average vector is 0.72
Similarity b/w food and average vector is 0.52


'sleep'

In [125]:
input_3  = ["match","player","football","cricket","dancer"]
odd_one_out(input_3)

Similarity b/w match and average vector is 0.58
Similarity b/w player and average vector is 0.68
Similarity b/w football and average vector is 0.72
Similarity b/w cricket and average vector is 0.70
Similarity b/w dancer and average vector is 0.53


'dancer'

In [126]:
input_4 = ["india","paris","russia","france","germany"]
odd_one_out(input_4)

Similarity b/w india and average vector is 0.81
Similarity b/w paris and average vector is 0.75
Similarity b/w russia and average vector is 0.79
Similarity b/w france and average vector is 0.81
Similarity b/w germany and average vector is 0.84


'paris'
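
The same idea can be tried without loading the pretrained model. The sketch below is a pure-Python version that takes the vectors as an explicit dictionary; `odd_one_out_toy`, `cosine_sim` and the 2-dimensional `toy_vectors` are hypothetical names and values chosen only for illustration. Gensim's `KeyedVectors` also offers a built-in `doesnt_match` method based on a similar mean-vector idea.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def odd_one_out_toy(words, vectors):
    """Return the word whose vector is least similar to the mean vector."""
    dim = len(vectors[words[0]])
    # Component-wise average of all the word vectors
    mean = [sum(vectors[w][i] for w in words) / len(words) for i in range(dim)]
    return min(words, key=lambda w: cosine_sim(vectors[w], mean))

# Hypothetical 2-D embeddings: three "fruit-like" vectors and one outlier
toy_vectors = {
    "apple":  [0.90, 0.10],
    "mango":  [0.80, 0.20],
    "orange": [0.85, 0.15],
    "party":  [0.10, 0.90],
}

print(odd_one_out_toy(["apple", "mango", "party", "orange"], toy_vectors))
```

Note that the outlier itself contributes to the mean, so with very short lists it can drag the average towards itself; this may be part of why the country example above separates less cleanly than the fruit example.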

### 2. Word Analogies

In the word analogy task, we complete the sentence "a is to b as c is to __". An example is "man is to woman as king is to queen". More precisely, we look for a word d such that the associated word vectors `e_a, e_b, e_c, e_d` satisfy `e_b − e_a ≈ e_d − e_c`. We measure the similarity between `e_b − e_a` and `e_d − e_c` using cosine similarity.

![Word2Vec](./img/word2vec.png)
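
As a small worked illustration of the offset relation, the sketch below uses made-up 2-dimensional embeddings in which one axis loosely stands for "royalty" and the other for "gender". The vectors and the helper names (`sub`, `cosine_sim`, `complete_analogy`) are assumptions for this example only; real Word2Vec dimensions are not interpretable in this way.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def sub(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]

# Hypothetical embeddings: axis 0 ~ "royalty", axis 1 ~ "gender"
toy = {
    "man":   [0.1, 0.1],
    "woman": [0.1, 0.9],
    "king":  [0.9, 0.1],
    "queen": [0.9, 0.9],
    "apple": [0.5, -0.5],
}

def complete_analogy(a, b, c, vectors):
    """Return the word d that maximizes cos(e_b - e_a, e_d - e_c), skipping a, b and c."""
    offset = sub(vectors[b], vectors[a])
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_sim(offset, sub(vectors[w], vectors[c])))

print(complete_analogy("man", "woman", "king", toy))
```

Here `woman − man` and `queen − king` are the same offset, so their cosine similarity is 1 and "queen" is selected.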

##### Examples
`man -> woman :: prince -> princess`  
`italy -> italian :: spain -> spanish`  
`india -> delhi :: japan -> tokyo`  
`man -> woman :: boy -> girl`  
`small -> smaller :: large -> larger`  

In [127]:
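# Accepts a triad of words a, b, c and returns d such that a : b :: c : d
def predict_word(a, b, c, word_vectors):
    a, b, c = a.lower(), b.lower(), c.lower()
    max_similarity = -100  # similarity b/w (b - a) and (d - c) should be max
    d = None
    # `.vocab` is the gensim 3.x attribute; gensim 4.x renamed it to `.key_to_index`
    words = word_vectors.vocab.keys()
    wa, wb, wc = word_vectors[a], word_vectors[b], word_vectors[c]
    # Note: this brute-force loop scans the entire vocabulary, so it is very slow
    for w in words:
        if w in [a, b, c]:
            continue
        wv = word_vectors[w]
        # cosine_similarity returns a 1x1 array; extract the scalar
        sim = cosine_similarity([wb - wa], [wv - wc])[0][0]
        if sim > max_similarity:
            max_similarity = sim
            d = w
    return d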

In [None]:
triad = ("man","woman","prince")
predict_word(*triad,word_vectors)

#### Using the `most_similar` Method

In [11]:
word_vectors.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)

[('queen', 0.7118192911148071)]
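
`most_similar` optimizes a slightly different objective from `predict_word`: instead of comparing offsets, it (roughly) averages the unit-normalized `positive` vectors with the negated `negative` vectors and ranks the remaining vocabulary by cosine similarity to that combined vector. The pure-Python sketch below mimics this on toy data. It is an approximation of the idea, not gensim's actual code, and `most_similar_toy`, `normalize` and the 3-dimensional vectors are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def most_similar_toy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to the mean of +positive and -negative unit vectors."""
    unit = {w: normalize(v) for w, v in vectors.items()}
    dim = len(next(iter(unit.values())))
    signed = [(unit[w], 1.0) for w in positive] + [(unit[w], -1.0) for w in negative]
    query = normalize([sum(s * v[i] for v, s in signed) / len(signed) for i in range(dim)])
    exclude = set(positive) | set(negative)  # input words are never returned
    scores = [(w, sum(a * b for a, b in zip(unit[w], query)))
              for w in unit if w not in exclude]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical embeddings: axes ~ "person", "royalty", "female"
toy = {
    "man":   [1.0, 0.0, 0.0],
    "woman": [1.0, 0.0, 1.0],
    "king":  [1.0, 1.0, 0.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.1, -1.0, 0.2],
}

print(most_similar_toy(toy, positive=["woman", "king"], negative=["man"]))
```

Because everything reduces to dot products with one query vector, this formulation can be computed for the whole vocabulary at once, which is a large part of why the built-in method is so much faster than the explicit loop in `predict_word`.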

### 3. Training Our Own Word2Vec Model

The Word2Vec model can learn embeddings from any text corpus! It comes in two variants:
- Continuous Bag of Words (CBOW) Model
- Skip-Gram Model

The algorithm slides a window over the text: the words surrounding a target word (Y) provide its context words (X), and the model is trained on (X, Y) pairs in a supervised manner.
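
The windowing step can be sketched in a few lines of plain Python. The functions `cbow_pairs` and `skipgram_pairs` below are illustrative names, not part of gensim, and the sketch only shows how training pairs are formed; the real implementation adds refinements such as subsampling and negative sampling. CBOW predicts the target from its context, while skip-gram predicts each context word from the target.

```python
def cbow_pairs(tokens, window=2):
    """(context words, target) pairs: predict the target from its neighbours."""
    pairs = []
    for i, target in enumerate(tokens):
        # Up to `window` words on each side of position i
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        pairs.append((context, target))
    return pairs

def skipgram_pairs(tokens, window=2):
    """(target, context word) pairs: predict each neighbour from the target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = ["deepika", "ranveer", "wedding", "style"]

for pair in cbow_pairs(sentence, window=1):
    print(pair)
for pair in skipgram_pairs(sentence, window=1):
    print(pair)
```

A larger `window` produces more pairs per word and tends to capture broader, more topical relationships.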

#### Data Preparation



- Each sentence must be tokenized into a list of words.
- The sentences can either be loaded into memory all at once, or fed to the model iteratively through a data pipeline.
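
For the iterative option, one common pattern is a small class whose `__iter__` reopens the file on every pass, because training makes several passes over the corpus and a one-shot generator would be exhausted after the first. The sketch below assumes one sentence per line and plain whitespace tokenization; the class name `SentenceStream` and the demo file contents are hypothetical.

```python
import os
import tempfile

class SentenceStream:
    """Re-iterable corpus that yields one tokenized sentence (list of words) per line."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Reopening the file here makes every new pass start from the beginning
        with open(self.path, "r", encoding="utf-8") as f:
            for line in f:
                tokens = line.lower().split()
                if tokens:  # skip blank lines
                    yield tokens

# Demo on a small temporary file (made-up content)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("Deepika and Ranveer got married\n\nPriyanka married Nick\n")
    demo_path = tmp.name

stream = SentenceStream(demo_path)
first_pass = list(stream)
second_pass = list(stream)  # works again because __iter__ reopens the file
os.remove(demo_path)

print(first_pass)
```

An object like `stream` could then be passed to `Word2Vec` in place of an in-memory list, so only one line needs to be held in memory at a time.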


In [128]:
import nltk
from nltk.corpus import stopwords
from gensim.models import Word2Vec

In [129]:
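stopw = set(stopwords.words('english'))

def readFile(file):
    # Read the whole file; the context manager closes it afterwards
    with open(file, 'r', encoding='utf-8') as f:
        text = f.read()
    sentences = nltk.sent_tokenize(text)
    data = []
    for sent in sentences:
        words = nltk.word_tokenize(sent)
        # Drop short tokens and stopwords, then lowercase.
        # The stopword check runs before lowercasing, so capitalized stopwords
        # such as "The" or "From" are kept (see the output below);
        # use `w.lower() not in stopw` to remove them as well.
        words = [w.lower() for w in words if len(w) > 2 and w not in stopw]
        data.append(words)
    return data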

In [130]:
text = readFile('news.txt')
print(text)

[['deepika', 'padukone', 'ranveer', 'singh', 'wedding', 'one', 'biggest', 'bollywood', 'events', 'happened', '2018'], ['the', 'deepika', 'ranveer', 'celebrations', 'hooked', 'phones', 'waiting', 'come', 'also', 'gave', 'enough', 'reason', 'believe', 'stylish', 'two', 'couple'], ['from', 'airport', 'looks', 'reception', 'parties', 'everything', 'entire', 'timeline', 'deepika', 'ranveer', 'wedding', 'style', 'file'], ['not', 'ambanis', 'deepika', 'ranveer', 'priyanka', 'nick'], ['man', 'proves', 'wedding', 'the', 'year', 'this', 'year', 'year', 'big', 'fat', 'lavish', 'extravagant', 'weddings'], ['from', 'isha', 'ambani', 'anand', 'piramal', 'deepika', 'padukone', 'ranveer', 'singh', 'priyanka', 'chopra', 'nick', 'jonas', 'kapil', 'sharma', 'ginni', 'chatrath', '2018', 'saw', 'many', 'grand', 'weddings'], ['but', 'nothing', 'beats', 'man', 'wedding', 'the', 'year', 'award', 'social', 'media'], ['priyanka', 'also', 'shared', 'video', 'featuring', 'nick', 'jonaswas', 'also', 'celebrating',

#### Creating Model

In [131]:
# `size` is the gensim 3.x parameter name; gensim 4.x renamed it to `vector_size`
model = Word2Vec(text, size=300, window=10, min_count=1)
print(model)

Word2Vec(vocab=116, size=300, alpha=0.025)


In [132]:
# `.vocab` is the gensim 3.x attribute; gensim 4.x renamed it to `.key_to_index`
words = list(model.wv.vocab)
print(words)

['deepika', 'padukone', 'ranveer', 'singh', 'wedding', 'one', 'biggest', 'bollywood', 'events', 'happened', '2018', 'the', 'celebrations', 'hooked', 'phones', 'waiting', 'come', 'also', 'gave', 'enough', 'reason', 'believe', 'stylish', 'two', 'couple', 'from', 'airport', 'looks', 'reception', 'parties', 'everything', 'entire', 'timeline', 'style', 'file', 'not', 'ambanis', 'priyanka', 'nick', 'man', 'proves', 'year', 'this', 'big', 'fat', 'lavish', 'extravagant', 'weddings', 'isha', 'ambani', 'anand', 'piramal', 'chopra', 'jonas', 'kapil', 'sharma', 'ginni', 'chatrath', 'saw', 'many', 'grand', 'but', 'nothing', 'beats', 'award', 'social', 'media', 'shared', 'video', 'featuring', 'jonaswas', 'celebrating', 'family', 'first', 'celebrated', 'christmas', 'london', 'pictures', 'new', 'outstanding', 'glimpses', 'celebration', 'verbier', 'switzerland', 'married', 'december', 'three', 'receptions', 'delhi', 'mumbai', 'jaggo', 'night', 'made', 'even', 'special', 'industry', 'friends', 'long', '

#### Creating Analogies

In [133]:
def predict_actor(a, b, c, word_vectors):
    # Accepts a triad of words a, b, c and returns d such that a : b :: c : d
    a, b, c = a.lower(), b.lower(), c.lower()
    max_similarity = -100
    d = None
    # Restrict the candidates to a fixed list of names instead of the full vocabulary
    words = ["ranveer", "deepika", "padukone", "singh", "nick", "jonas", "chopra", "priyanka", "virat", "anushka", "ginni"]
    wa, wb, wc = word_vectors[a], word_vectors[b], word_vectors[c]
    for w in words:
        if w in [a, b, c]:
            continue
        wv = word_vectors[w]
        # cosine_similarity returns a 1x1 array; extract the scalar
        sim = cosine_similarity([wb - wa], [wv - wc])[0][0]
        if sim > max_similarity:
            max_similarity = sim
            d = w
    return d

#### Testing the Model

In [134]:
triad = ("nick","priyanka","virat")
predict_actor(*triad,model.wv)

'chopra'

In [135]:
triad = ("ranveer","deepika","priyanka")
predict_actor(*triad,model.wv)

'nick'

In [136]:
triad = ("ranveer","singh","deepika")
predict_actor(*triad,model.wv)

'nick'

In [137]:
triad = ("deepika","padukone","priyanka")
predict_actor(*triad,model.wv)

'virat'

In [138]:
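# binary=True writes the binary word2vec format, matching the .bin extension
# (the default, binary=False, would write a plain-text file)
model.wv.save_word2vec_format("myModel.bin", binary=True)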