# Word2Vec

In this tutorial we will use Gensim to study word vectors.


Let’s download a pre-trained model and play around with it. We will fetch the Word2Vec model trained on part of the Google News dataset, covering approximately 3 million words and phrases. 

Training such a model can take hours, so we will load a pre-trained one through Gensim's downloader instead.

<font color='red'> Run the cells in the sections below to understand how to use the gensim model </font>


In [4]:
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')



### Word2Vec Word Representations

In [5]:
# Let us look at the vector representation of the word 'king'. 
vec_king = wv['king']
print(vec_king.shape)

(300,)


The basic idea is to create a small neural network with a bottleneck layer that represents each word as a 300-dimensional vector, and then train the network to predict nearby "context" words given some central word.

Over time, it learns the representations that allow it to make the best predictions, essentially capturing the typical context of each word as its meaning.
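
To make the training setup concrete, here is a minimal sketch of how (center, context) training pairs could be generated from a tokenized sentence. The function name `skipgram_pairs` and the fixed `window` parameter are illustrative assumptions; real Word2Vec training adds further details (such as subsampling of frequent words and negative sampling) that are omitted here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for each token and its neighbours within `window`."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                # left edge of the context window
        hi = min(len(tokens), i + window + 1)  # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:                         # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the king rules the land".split()
pairs = skipgram_pairs(tokens, window=1)
print(pairs)
```

Each pair is one training example: the network sees the center word and is asked to predict the context word.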

### Pairwise similarity between two words

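Gensim scores a word pair by the cosine similarity between the two word vectors. Values close to 1 mean the vectors point in nearly the same direction; values close to 0 mean the words are largely unrelated.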

In [6]:
pairs = [
    ('car', 'minivan'),   # a minivan is a kind of car
    ('car', 'bicycle'),   # still a wheeled vehicle
    ('car', 'airplane'),  # ok, no wheels, but still a vehicle
    ('car', 'cereal'),    # ... and so on
    ('car', 'communism'),
]
for w1, w2 in pairs:
    print('%r\t%r\t%.2f' % (w1, w2, wv.similarity(w1, w2)))

'car'	'minivan'	0.69
'car'	'bicycle'	0.54
'car'	'airplane'	0.42
'car'	'cereal'	0.14
'car'	'communism'	0.06
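
The scores above can be reproduced by hand. The following is a minimal sketch of cosine similarity using NumPy; the helper name `cosine_similarity` and the toy 2-d vectors are assumptions made for illustration, so that the example runs without downloading the model.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors (made up for illustration)
a = [1.0, 0.0]
b = [1.0, 1.0]
c = [0.0, 2.0]

sim_ab = cosine_similarity(a, b)  # 45 degrees apart
sim_ac = cosine_similarity(a, c)  # orthogonal
print(round(sim_ab, 3), round(sim_ac, 3))
```

With the model loaded, `cosine_similarity(wv['car'], wv['minivan'])` should agree with `wv.similarity('car', 'minivan')` up to floating-point rounding.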


In [7]:
# Words most similar to the combination of 'car' and 'minivan'
print(wv.most_similar(positive=['car', 'minivan'], topn=5))

[('SUV', 0.8532192707061768), ('vehicle', 0.8175783753395081), ('pickup_truck', 0.7763688564300537), ('Jeep', 0.7567334175109863), ('Ford_Explorer', 0.7565720081329346)]


In [14]:
# Word vectors support analogies. Starting from 'king', subtracting 'man' and adding 'woman'
# yields a vector whose nearest neighbour is 'queen'.

wv.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)

[('queen', 0.7118193507194519)]
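
The arithmetic behind this query can be sketched with toy vectors. Everything below is hypothetical: the 2-d embeddings are hand-made (axis 0 loosely standing for "royalty", axis 1 for "gender"), and the `analogy` helper is a simplification. Gensim additionally normalizes the input vectors before combining them, which this sketch skips.

```python
import numpy as np

# Hypothetical 2-d embeddings, chosen by hand for illustration
toy = {
    'king':  np.array([1.0,  1.0]),
    'queen': np.array([1.0, -1.0]),
    'man':   np.array([0.0,  1.0]),
    'woman': np.array([0.0, -1.0]),
    'apple': np.array([-1.0, 0.0]),
}

def analogy(vectors, positive, negative):
    """Return the word (and score) closest by cosine to sum(positive) - sum(negative)."""
    target = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    exclude = set(positive) | set(negative)  # query words are not valid answers
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in exclude:
            continue
        score = float(np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec)))
        if score > best_score:
            best_word, best_score = word, score
    return best_word, best_score

word, score = analogy(toy, positive=['woman', 'king'], negative=['man'])
print(word, round(score, 2))
```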

In [15]:
# We can also use it to find the word that does not belong with the others in a list

wv.doesnt_match("breakfast cereal dinner lunch".split())

'cereal'
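
One way to think about this query: normalize each word vector, average them into a centroid, and report the word least aligned with that centroid. The sketch below follows that idea with made-up 2-d vectors and a hypothetical helper `odd_one_out`; it illustrates the approach and is not Gensim's exact implementation.

```python
import numpy as np

# Made-up 2-d vectors: three meal words point one way, 'cereal' another
toy = {
    'breakfast': np.array([1.0,  0.2]),
    'lunch':     np.array([1.0,  0.0]),
    'dinner':    np.array([1.0, -0.1]),
    'cereal':    np.array([0.1,  1.0]),
}

def odd_one_out(vectors, words):
    """Return the word whose unit vector is least similar to the group centroid."""
    unit = np.array([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    centroid = unit.mean(axis=0)
    centroid = centroid / np.linalg.norm(centroid)
    scores = unit @ centroid  # cosine similarity of each word to the centroid
    return words[int(np.argmin(scores))]

words = "breakfast cereal dinner lunch".split()
result = odd_one_out(toy, words)
print(result)
```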

In [19]:
# We can use it to build representations of sentences. Indexing with a list of words returns
# one vector per word, stacked into a (5, 300) array. A simple sentence representation is the
# average of these word vectors (the mean over axis 0).

sentence = ['this', 'is', 'an', 'odd', 'sentence']
vec_sentence = wv[sentence]
print(vec_sentence.shape)
print(vec_sentence)

(5, 300)
[[ 0.109375    0.140625   -0.03173828 ...  0.00765991  0.12011719
  -0.1796875 ]
 [ 0.00704956 -0.07324219  0.171875   ...  0.01123047  0.1640625
   0.10693359]
 [ 0.12597656  0.19042969  0.06982422 ...  0.0612793   0.17285156
  -0.07861328]
 [ 0.20800781 -0.03564453 -0.14941406 ...  0.0534668  -0.20019531
   0.03515625]
 [ 0.11767578 -0.234375    0.4765625  ... -0.24511719  0.1484375
   0.07714844]]
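
The array above holds one row per word, so collapsing it into a single sentence vector takes one more step: a mean over the word axis. Here is a minimal sketch with a small made-up array standing in for `wv[sentence]` (3 words, 4 dimensions), so that it runs without the model.

```python
import numpy as np

# Stand-in for wv[sentence]: one row per word (values are made up)
word_vectors = np.array([
    [1.0, 0.0, 2.0, -1.0],
    [3.0, 2.0, 0.0,  1.0],
    [2.0, 1.0, 1.0,  0.0],
])

# Average over the word axis to get one vector for the whole sentence
sentence_vector = word_vectors.mean(axis=0)
print(sentence_vector.shape)
print(sentence_vector)
```

With the real model, `wv[sentence].mean(axis=0)` would give a single vector of shape `(300,)`.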
