<a href="https://colab.research.google.com/github/allokkk/Odd-Word-Out/blob/master/Word_2_Vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Word2Vec Model
Google's pretrained Word2Vec model
Contains 300-dimensional vectors for about 3 million words and phrases, trained on roughly 100 billion words of Google News text

Words that appear in similar contexts have similar vectors

The similarity between two words can be measured using the cosine similarity of their vectors
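
Cosine similarity is the dot product of two vectors divided by the product of their lengths, and cosine distance is one minus that value. A minimal pure-Python sketch follows; the helper name `cosine_sim` and the 3-dimensional vectors are made up for illustration and are not taken from the pretrained model.

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional "embeddings", invented for this example
v_cat = [1.0, 2.0, 0.0]
v_kitten = [2.0, 4.0, 0.0]   # points in the same direction as v_cat
v_car = [0.0, 0.0, 3.0]      # orthogonal to v_cat

sim_same = cosine_sim(v_cat, v_kitten)   # close to 1: same direction
sim_orth = cosine_sim(v_cat, v_car)      # 0: nothing in common
dist_orth = 1.0 - sim_orth               # cosine distance
print(sim_same, sim_orth, dist_orth)
```

Only the direction of the vectors matters here, not their length, which is why `v_cat` and `v_kitten` count as maximally similar.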

# Applications
#### Text Similarity
#### Language Translation
#### Finding Odd Words
#### Word Analogies

# Word Embeddings
Word embeddings are numerical representations of words in the form of vectors.

The Word2Vec model represents each word as a 300-dimensional vector.

In this tutorial we are going to see how to use the pre-trained word2vec model.

The compressed model file is around 1.5 GB.
We will work with Gensim, a popular NLP package.
Gensim's Word2Vec module provides optimized implementations of:

1) CBOW Model

2) SkipGram Model
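
The two architectures differ in what is predicted from what: CBOW predicts the centre word from its surrounding context, while skip-gram predicts each context word from the centre word. The sketch below only shows how training examples could be formed from one sentence with a fixed window. The function names are hypothetical, and real implementations add details such as random window shrinking and subsampling that are left out here.

```python
def skipgram_pairs(tokens, window):
    """(center, context) pairs: predict each context word from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window):
    """(context_words, center) examples: predict the center word from its context."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples

sentence = ["the", "quick", "brown", "fox"]
sg = skipgram_pairs(sentence, window=1)
cb = cbow_examples(sentence, window=1)
print(sg)
print(cb)
```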

Paper 1: Efficient Estimation of Word Representations in Vector Space

Paper 2: Distributed Representations of Words and Phrases and their Compositionality

# Word2Vec using Gensim
Link: https://radimrehurek.com/gensim/models/word2vec.html

# CODE




## Load Word2Vec Model

KeyedVectors: this object essentially contains the mapping between words and embeddings. After training, it can be used directly to query those embeddings in various ways.

In [0]:
import gensim
from gensim.models import word2vec
from gensim.models import KeyedVectors
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

In [0]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [0]:

import numpy as np

word_vectors = KeyedVectors.load_word2vec_format('/content/gdrive/My Drive/GoogleNews-vectors-negative300.bin.gz',binary=True)




In [0]:
# Keys are case-sensitive: "Apple" and "apple" are separate entries
v_apple=word_vectors["Apple"]
v_mango=word_vectors["Mango"]

In [0]:
print(v_apple.shape)
print(v_mango.shape)



cosine_similarity([v_mango],[v_apple])



(300,)
(300,)


array([[0.2475584]], dtype=float32)

In [0]:
v_apple = word_vectors["apple"]
v_india = word_vectors["india"]

cosine_similarity([v_india],[v_apple])

array([[0.17158596]], dtype=float32)

# Find the Odd One Out

In [0]:

def odd_one_out(words):
    """Accepts a list of words and returns the odd word"""

    # Look up the embedding of every word in the list
    all_word_vectors = [word_vectors[w] for w in words]
    # Average the embeddings to get a "centre" vector for the group
    avg_vector = np.mean(all_word_vectors,axis=0)
    print(avg_vector.shape)

    # Iterate over every word and keep the one least similar to the average
    odd_word = None
    min_similarity = float("inf")  # Any real similarity is smaller than this

    for w in words:
        # cosine_similarity returns a 1x1 array; extract the scalar
        sim = cosine_similarity([word_vectors[w]],[avg_vector])[0][0]
        if sim < min_similarity:
            min_similarity = sim
            odd_word = w

        print("Similarity btw %s and avg vector is %.2f"%(w,sim))

    return odd_word
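
The idea behind `odd_one_out` is to average all the word vectors and pick the word whose vector is least similar to that average. The same logic can be checked without the 1.5 GB model, using a few 2-dimensional vectors. This is only a sketch: the vectors and the names `toy_vectors` and `toy_odd_one_out` are invented for illustration.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_vector(vectors):
    """Component-wise average of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def toy_odd_one_out(words, vectors):
    """Return the word whose vector is least similar to the group average."""
    avg = mean_vector([vectors[w] for w in words])
    return min(words, key=lambda w: cosine_sim(vectors[w], avg))

# Made-up 2-d vectors: three "fruit-like" words and one outlier
toy_vectors = {
    "apple": [0.9, 0.1],
    "mango": [0.8, 0.2],
    "orange": [0.85, 0.15],
    "party": [0.1, 0.9],
}

odd = toy_odd_one_out(["apple", "mango", "party", "orange"], toy_vectors)
print(odd)
```

Because the outlier also pulls the average towards itself, this approach works best when most words in the list clearly belong together.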

In [0]:
input_1 = ["apple","mango","juice","party","orange"] 
input_2 = ["music","dance","sleep","dancer","food"]        
input_3  = ["match","player","football","cricket","dancer"]
input_4 = ["india","paris","russia","france","germany"]

In [0]:

odd_one_out(input_1)

(300,)
Similarity btw apple and avg vector is 0.78
Similarity btw mango and avg vector is 0.76
Similarity btw juice and avg vector is 0.71
Similarity btw party and avg vector is 0.36
Similarity btw orange and avg vector is 0.65


'party'

In [0]:
odd_one_out(input_2)

(300,)
Similarity btw music and avg vector is 0.66
Similarity btw dance and avg vector is 0.81
Similarity btw sleep and avg vector is 0.51
Similarity btw dancer and avg vector is 0.72
Similarity btw food and avg vector is 0.52


'sleep'

In [0]:
odd_one_out(input_3)

(300,)
Similarity btw match and avg vector is 0.58
Similarity btw player and avg vector is 0.68
Similarity btw football and avg vector is 0.72
Similarity btw cricket and avg vector is 0.70
Similarity btw dancer and avg vector is 0.53


'dancer'

In [0]:
odd_one_out(input_4)

(300,)
Similarity btw india and avg vector is 0.81
Similarity btw paris and avg vector is 0.75
Similarity btw russia and avg vector is 0.79
Similarity btw france and avg vector is 0.81
Similarity btw germany and avg vector is 0.84


'paris'

### 2. Word Analogies Task

In the word analogy task, we complete the sentence "a is to b as c is to __". An example is 'man is to woman as king is to queen'. In detail, we try to find a word d such that the associated word vectors `ea, eb, ec, ed` are related as follows: `eb − ea ≈ ed − ec`. We measure the similarity between `eb − ea` and `ed − ec` using cosine similarity.

`man -> woman :: prince -> princess`  
`italy -> italian :: spain -> spanish`  
`india -> delhi :: japan -> tokyo`  
`man -> woman :: boy -> girl`  
`small -> smaller :: large -> larger`  
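
The relation `eb − ea ≈ ed − ec` can be tried out on a tiny vocabulary before running it on the full model. In the sketch below the 2-dimensional vectors are invented so that the first axis loosely encodes gender and the second royalty; `toy_analogy` and `toy_vectors` are hypothetical names, not part of Gensim.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def sub(u, v):
    """Component-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]

def toy_analogy(a, b, c, vectors):
    """Find d maximizing the cosine similarity between (e_b - e_a) and (e_d - e_c)."""
    target = sub(vectors[b], vectors[a])
    best_word, best_sim = None, -math.inf
    for w, vec in vectors.items():
        if w in (a, b, c):
            continue  # the answer must be a new word
        sim = cosine_sim(target, sub(vec, vectors[c]))
        if sim > best_sim:
            best_word, best_sim = w, sim
    return best_word

# Made-up vectors: axis 0 ~ gender, axis 1 ~ royalty
toy_vectors = {
    "man": [1.0, 0.0],
    "woman": [-1.0, 0.0],
    "king": [1.0, 1.0],
    "queen": [-1.0, 1.0],
    "prince": [1.0, 0.5],
    "princess": [-1.0, 0.5],
}

answer_king = toy_analogy("man", "woman", "king", toy_vectors)
answer_prince = toy_analogy("man", "woman", "prince", toy_vectors)
print(answer_king, answer_prince)
```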

#### Try it out 


`man -> coder :: woman -> ______?`

In [0]:
word_vectors["man"].shape 

(300,)

In [0]:
def predict_word(a,b,c,word_vectors):
    """Accepts a triad of words, a,b,c and returns d such that a is to b : c is to d"""
    a,b,c = a.lower(),b.lower(),c.lower()

    # The cosine similarity between (b - a) and (d - c) should be maximal
    max_similarity = -100

    d = None

    # gensim < 4.0 API; in gensim >= 4.0 use word_vectors.key_to_index instead
    words = word_vectors.vocab.keys()

    wa,wb,wc = word_vectors[a],word_vectors[b],word_vectors[c]

    # Scan the whole vocabulary for the best d (slow: one pass over every word)
    for w in words:
        if w in [a,b,c]:
            continue

        wv = word_vectors[w]
        # cosine_similarity returns a 1x1 array; extract the scalar
        sim = cosine_similarity([wb-wa],[wv-wc])[0][0]

        if sim > max_similarity:
            max_similarity = sim
            d = w

    return d

In [0]:
triad_2 = ("man","woman","prince")
predict_word(*triad_2,word_vectors)



'princess'

## Using the Most Similar Method

In [0]:
word_vectors.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)



[('queen', 0.7118192911148071)]
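
Unlike the loop in `predict_word`, `most_similar` scores candidates against a single combined query vector. The sketch below assumes the method adds the unit-normalized `positive` vectors, subtracts the `negative` ones, and ranks the remaining words by cosine similarity to the result. It is a simplified pure-Python illustration with invented 4-dimensional vectors, not Gensim's actual code.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def toy_most_similar(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to (sum of positives - sum of negatives)."""
    normed = {w: unit(v) for w, v in vectors.items()}
    dim = len(next(iter(normed.values())))
    query = [0.0] * dim
    for w in positive:
        query = [q + x for q, x in zip(query, normed[w])]
    for w in negative:
        query = [q - x for q, x in zip(query, normed[w])]
    query = unit(query)
    inputs = set(positive) | set(negative)  # input words are never returned
    scores = [(w, sum(q * x for q, x in zip(query, v)))
              for w, v in normed.items() if w not in inputs]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]

# Made-up vectors: axes ~ (male, female, royal, young)
toy_vectors = {
    "man": [1.0, 0.0, 0.0, 0.0],
    "woman": [0.0, 1.0, 0.0, 0.0],
    "king": [1.0, 0.0, 1.0, 0.0],
    "queen": [0.0, 1.0, 1.0, 0.0],
    "boy": [1.0, 0.0, 0.0, 1.0],
    "girl": [0.0, 1.0, 0.0, 1.0],
}

result = toy_most_similar(toy_vectors, positive=["woman", "king"], negative=["man"], topn=1)
print(result)
```

A single query vector means one similarity computation per vocabulary word, which is far cheaper than recomputing a difference vector for every candidate.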

## Data Preparation

Each sentence must be tokenized into a list of words.

The sentences can be loaded into memory all at once, or we can build a data pipeline that feeds them to the model iteratively.
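
For the iterative option, one possible shape of such a pipeline is an object that re-opens the file each time it is iterated, so the corpus can be passed over several times without holding it in memory. The sketch below assumes one sentence per line and uses a simple regular-expression tokenizer instead of NLTK; the class name `SentenceStream` and the demo file contents are made up.

```python
import os
import re
import tempfile

class SentenceStream:
    """Re-iterable that yields one tokenized sentence (list of words) per line."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every pass so the stream can be consumed repeatedly
        with open(self.path, "r", encoding="utf-8") as f:
            for line in f:
                tokens = re.findall(r"[a-z0-9]+", line.lower())
                if tokens:  # skip empty lines
                    yield tokens

# Demo with a small temporary file (contents invented for illustration)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("Word embeddings are vectors.\n\nSimilar words get similar vectors.\n")
    demo_path = tmp.name

stream = SentenceStream(demo_path)
first_pass = list(stream)
second_pass = list(stream)  # a second pass works because the file is re-opened
os.remove(demo_path)
print(first_pass)
```

A plain generator would be exhausted after one pass, which is why the sketch wraps the file in a class with `__iter__`.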

In [0]:
import nltk
from nltk.corpus import stopwords

In [0]:
!wget https://www.dropbox.com/s/s1mcc8win027qww/bollywood.rar?dl=0

--2020-01-05 18:35:53--  https://www.dropbox.com/s/s1mcc8win027qww/bollywood.rar?dl=0
Resolving www.dropbox.com (www.dropbox.com)... 162.125.8.1, 2620:100:6018:1::a27d:301
Connecting to www.dropbox.com (www.dropbox.com)|162.125.8.1|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/raw/s1mcc8win027qww/bollywood.rar [following]
--2020-01-05 18:35:53--  https://www.dropbox.com/s/raw/s1mcc8win027qww/bollywood.rar
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc98a9e20ad599485ad87076d0f3.dl.dropboxusercontent.com/cd/0/inline/AvmykWhN-jjMjppuI_NQbaLsreImtmj0lOgHi-rIaEgscjTT2BtBAv4aLY5_BD8Qpr9XpehQDfYmt_YeTOjE-VUXMpvklPBORWBd-AwJy5GF21GplaxF7_ezyOIkEmtoO1E/file# [following]
--2020-01-05 18:35:53--  https://uc98a9e20ad599485ad87076d0f3.dl.dropboxusercontent.com/cd/0/inline/AvmykWhN-jjMjppuI_NQbaLsreImtmj0lOgHi-rIaEgscjTT2BtBAv4aLY5_BD8Qpr9XpehQDfYmt_YeTOjE-VUXMpvklPBORWBd-AwJy5

In [0]:
!unzip bollywood.rar?dl=0

Archive:  bollywood.rar?dl=0
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of bollywood.rar?dl=0 or
        bollywood.rar?dl=0.zip, and cannot find bollywood.rar?dl=0.ZIP, period.

No zipfiles found.


In [0]:
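def readFile(file):
    # Read the whole file; the context manager closes it afterwards
    with open(file,'r',encoding='utf-8') as f:
        text = f.read()
    sentences = nltk.sent_tokenize(text)

    data = []
    for sent in sentences:
        words =  nltk.word_tokenize(sent)
        # Keep tokens longer than 2 characters that are not stopwords (stopw is defined in a later cell).
        # The stopword check runs before lowercasing, so capitalized stopwords such as "The" are kept.
        words = [w.lower() for w in words if len(w)>2 and w not in stopw]
        data.append(words)

    return data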
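
The filter in `readFile` compares each token with the stopword set before lowercasing it, so capitalized stopwords such as "The" or "From" pass through and show up in lowercase in the tokenized data. A possible variant that lowercases first is sketched below. It uses a tiny hypothetical stopword set and a regular-expression tokenizer in place of NLTK so that it runs on its own.

```python
import re

# Tiny hypothetical stopword list; the notebook uses NLTK's English list instead
TOY_STOPWORDS = {"the", "and", "was", "one", "of"}

def clean_sentence(sentence, stopwords=TOY_STOPWORDS, min_len=3):
    """Tokenize, lowercase, then drop short tokens and stopwords."""
    cleaned = []
    for word in re.findall(r"[A-Za-z0-9]+", sentence):
        word = word.lower()  # lowercase before the stopword check
        if len(word) >= min_len and word not in stopwords:
            cleaned.append(word)
    return cleaned

cleaned = clean_sentence("The wedding was one of the biggest events")
print(cleaned)
```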


In [0]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [0]:
import nltk
nltk.download("stopwords")
stopw  = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [0]:
!ls

'bollywood.rar?dl=0'	'bollywood.txt?dl=0'   sample_data
'bollywood.rar?dl=0.1'	 gdrive


In [0]:
path="/content/bollywood.txt?dl=0"

In [0]:
text = readFile(path)

In [0]:
print(text)

[['deepika', 'padukone', 'ranveer', 'singh', 'wedding', 'one', 'biggest', 'bollywood', 'events', 'happened', '2018'], ['the', 'deepika', 'ranveer', 'celebrations', 'hooked', 'phones', 'waiting', 'come', 'also', 'gave', 'enough', 'reason', 'believe', 'stylish', 'two', 'couple'], ['from', 'airport', 'looks', 'reception', 'parties', 'everything', 'entire', 'timeline', 'deepika', 'ranveer', 'wedding', 'style', 'file'], ['not', 'ambanis', 'deepika', 'ranveer', 'priyanka', 'nick'], ['man', 'proves', 'wedding', 'the', 'year', 'this', 'year', 'year', 'big', 'fat', 'lavish', 'extravagant', 'weddings'], ['from', 'isha', 'ambani', 'anand', 'piramal', 'deepika', 'padukone', 'ranveer', 'singh', 'priyanka', 'chopra', 'nick', 'jonas', 'kapil', 'sharma', 'ginni', 'chatrath', '2018', 'saw', 'many', 'grand', 'weddings'], ['but', 'nothing', 'beats', 'man', 'wedding', 'the', 'year', 'award', 'social', 'media'], ['priyanka', 'also', 'shared', 'video', 'featuring', 'nick', 'jonaswas', 'also', 'celebrating',

In [0]:

from gensim.models import Word2Vec

In [0]:
# gensim < 4.0 API; in gensim >= 4.0 the size parameter is named vector_size
model = Word2Vec(text,size=300,window=10,min_count=1)

In [0]:
print(model)

Word2Vec(vocab=116, size=300, alpha=0.025)


In [0]:
!wget https://www.dropbox.com/s/irxlasnb14jodj5/bollywood.bin?dl=0

--2020-01-05 18:59:26--  https://www.dropbox.com/s/irxlasnb14jodj5/bollywood.bin?dl=0
Resolving www.dropbox.com (www.dropbox.com)... 162.125.8.1, 2620:100:601b:1::a27d:801
Connecting to www.dropbox.com (www.dropbox.com)|162.125.8.1|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/raw/irxlasnb14jodj5/bollywood.bin [following]
--2020-01-05 18:59:26--  https://www.dropbox.com/s/raw/irxlasnb14jodj5/bollywood.bin
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc5e048f535733c7dca7d7df55a1.dl.dropboxusercontent.com/cd/0/inline/AvlsAHx10ZyKWYfwnJKtr2WR8KsRMW8sW11WFaTNtvraETipwj8IbwYG5hx-vqPeb3Oygi6UIufHJwbumOEs9Qbo5Omd9IFdOEM7qWWCYatDKW0j346MwL4pJpqIQq9GdUc/file# [following]
--2020-01-05 18:59:26--  https://uc5e048f535733c7dca7d7df55a1.dl.dropboxusercontent.com/cd/0/inline/AvlsAHx10ZyKWYfwnJKtr2WR8KsRMW8sW11WFaTNtvraETipwj8IbwYG5hx-vqPeb3Oygi6UIufHJwbumOEs9Qbo5Omd9IFdOEM7qWWCYa

In [0]:
!unzip bollywood.bin?dl=0

Archive:  bollywood.bin?dl=0
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of bollywood.bin?dl=0 or
        bollywood.bin?dl=0.zip, and cannot find bollywood.bin?dl=0.ZIP, period.

No zipfiles found.


In [0]:

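!ls
#path2='/content/bollywood.bin?dl=0'

# The file is in word2vec text format, hence binary=False
word_vectors_kv = KeyedVectors.load_word2vec_format('/content/bollywood.bin?dl=0',binary=False)

# load_word2vec_format already returns a KeyedVectors object, so no .wv is needed.
# Note: this replaces the Google News vectors loaded earlier.
word_vectors = word_vectors_kv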




'bollywood.bin?dl=0'  'bollywood.rar?dl=0.1'   gdrive
'bollywood.rar?dl=0'  'bollywood.txt?dl=0'     sample_data




In [0]:
actors = ["ranveer","deepika","padukone","singh","nick","jonas","chopra","priyanka","virat","anushka","ginni"]


def predict_actor(a,b,c,word_vectors):
    """Accepts a triad of names, a,b,c and returns d from the actors list such that a is to b : c is to d"""
    a,b,c = a.lower(),b.lower(),c.lower()
    max_similarity = -100

    d = None
    # Only the names in the actors list are considered as candidates
    words = actors

    wa,wb,wc = word_vectors[a],word_vectors[b],word_vectors[c]

    # Find d such that the cosine similarity between (b - a) and (d - c) is maximal

    for w in words:
        if w in [a,b,c]:
            continue

        wv = word_vectors[w]
        # cosine_similarity returns a 1x1 array; extract the scalar
        sim = cosine_similarity([wb-wa],[wv-wc])[0][0]

        if sim > max_similarity:
            max_similarity = sim
            d = w
    return d

### 4. Test your Model

In [0]:
triad = ("nick","priyanka","virat")
predict_actor(*triad,word_vectors)

'anushka'

In [0]:
triad = ("ranveer","deepika","priyanka")
predict_actor(*triad,word_vectors)

'nick'

In [0]:
model.wv.save_word2vec_format("/content/bollywood.bin?dl=0")
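
By default `save_word2vec_format` writes the plain-text word2vec format, which is why the file is loaded with `binary=False`. The text format starts with a header line holding the vocabulary size and the vector dimension, followed by one line per word with its values separated by spaces. The sketch below writes and reads that layout by hand for a two-word toy vocabulary; the helper names and the vectors are invented, and it is not a replacement for Gensim's own reader.

```python
import io

def write_word2vec_text(vectors, stream):
    """Write a {word: vector} dict in word2vec text layout."""
    dim = len(next(iter(vectors.values())))
    stream.write(f"{len(vectors)} {dim}\n")  # header: vocab size and dimension
    for word, vec in vectors.items():
        stream.write(word + " " + " ".join(repr(x) for x in vec) + "\n")

def read_word2vec_text(stream):
    """Read the layout written above back into a {word: vector} dict."""
    vocab_size, dim = map(int, stream.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        parts = stream.readline().rstrip("\n").split(" ")
        values = [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError("unexpected vector length for " + parts[0])
        vectors[parts[0]] = values
    return vectors

# Made-up 3-dimensional vectors for two words from the corpus
toy = {"deepika": [0.1, -0.2, 0.3], "ranveer": [0.4, 0.5, -0.6]}

buffer = io.StringIO()
write_word2vec_text(toy, buffer)
header = buffer.getvalue().splitlines()[0]
buffer.seek(0)
restored = read_word2vec_text(buffer)
print(header, restored)
```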

