# Exploring word2vec

We will be using a Python package called `gensim` to play around with word2vec.

We use word2vec to produce word embeddings. This is how we convert ordinary text into vectors that can be fed into neural networks. These vectors have interesting properties that make them superior to bag-of-words or one-hot vectors for vectorizing text.
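
To make the contrast with one-hot vectors concrete, here is a minimal sketch in plain Python. The three-dimensional "dense" vectors are made-up numbers chosen for illustration, not values from word2vec. The point is only that distinct one-hot vectors are always orthogonal, while dense vectors can place related words close together.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vocab = ["king", "queen", "banana"]

# One-hot: every word gets its own axis, so two different words never overlap.
one_hot = {
    word: [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i, word in enumerate(vocab)
}

# Hypothetical dense embeddings (invented numbers, NOT taken from word2vec).
dense = {
    "king":   [0.9, 0.8, 0.1],
    "queen":  [0.8, 0.9, 0.1],
    "banana": [0.1, 0.0, 0.9],
}

print(cosine(one_hot["king"], one_hot["queen"]))  # 0.0: no notion of similarity
print(cosine(dense["king"], dense["queen"]))      # close to 1
print(cosine(dense["king"], dense["banana"]))     # much smaller
```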

In [0]:
import gensim
import gensim.downloader as api

The following command loads the word2vec model trained on part of the [Google News dataset](https://code.google.com/archive/p/word2vec/). The dataset has about 100 billion words. The model contains a 300-dimensional vector for each of about 3 million words and phrases.

Loading the model takes a very long time, so feel free to try this out on your own time.
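
A quick back-of-the-envelope calculation suggests why loading is slow. The sketch below assumes exactly 3 million vectors of 300 dimensions, each stored as a 32-bit float. The storage type is an assumption here, and the size of the downloaded file will differ from this in-memory estimate.

```python
# Rough size of the raw embedding matrix.
# Assumptions: exactly 3 million entries, 300 dimensions each,
# and 4 bytes (32-bit float) per number.
num_vectors = 3_000_000
dimensions = 300
bytes_per_float = 4

total_bytes = num_vectors * dimensions * bytes_per_float
print(f"{total_bytes / 1e9:.1f} GB")  # 3.6 GB
```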

In [0]:
model = api.load("word2vec-google-news-300")

## Most similar vectors

`gensim` lets us quickly find the word vectors most similar to a word vector of our choosing through [`model.most_similar()`](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.most_similar).

For example, the vectors most similar to the word vector for "king" belong to words in the same general area of meaning. In this example, the most similar vector is the plural form of the word ("kings"). This makes sense, since a word's plural form appears in basically the same "contexts" in sentences as its singular form.

Looking further down the list of similar vectors, we can see that the theme of royalty is well represented.



In [0]:
model.most_similar("king")
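
The idea that "king" and "kings" share contexts can be illustrated on a tiny, made-up corpus. The sketch below is not how word2vec is trained. It only collects the neighbouring words of each token (a hypothetical window of one word on each side) and measures how much two context sets overlap using the Jaccard index.

```python
from collections import defaultdict

def context_sets(sentences, window=1):
    """Map each word to the set of words seen within `window` positions of it."""
    contexts = defaultdict(set)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    contexts[word].add(tokens[j])
    return contexts

def jaccard(a, b):
    """Share of context words that two sets have in common."""
    return len(a & b) / len(a | b)

# Invented sentences, purely for illustration.
sentences = [
    "the king ruled the land",
    "the kings ruled the land",
    "the king wore a crown",
    "the kings fought a war",
    "she ate a banana today",
]

contexts = context_sets(sentences)
print(jaccard(contexts["king"], contexts["kings"]))   # 0.5
print(jaccard(contexts["king"], contexts["banana"]))  # 0.0
```

Singular and plural overlap heavily because they appear next to the same words, while an unrelated word shares none of those neighbours.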

As another example, let's look for the word vectors most similar to the word "Obama". Here the most similar word vectors are themed around the person himself or around other people connected to him.

In [0]:
model.most_similar("Obama")

[('Barack_Obama', 0.8036513328552246),
 ('President_Barack_Obama', 0.7878767848014832),
 ('McCain', 0.7555227875709534),
 ('Clinton', 0.7526832222938538),
 ('Illinois_senator', 0.74974524974823),
 ('Biden', 0.7485178709030151),
 ('Bush', 0.7348896861076355),
 ('Barack', 0.7290467023849487),
 ('White_House', 0.7151209115982056),
 ('elect_Barack_Obama', 0.6941337585449219)]
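
The scores in this output are cosine similarities between word vectors. As a rough sketch of the ranking idea, the plain-Python function below normalizes every vector to unit length, takes dot products with the query, and sorts the results. The function name `toy_most_similar` and the three-dimensional vectors are hypothetical and chosen for illustration. The real model works on 300-dimensional vectors and is far more optimized.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def toy_most_similar(word, vectors, topn=3):
    """Rank all other words by cosine similarity to `word` (highest first)."""
    query = unit(vectors[word])
    scores = [
        (other, dot(query, unit(vec)))  # dot product of unit vectors = cosine
        for other, vec in vectors.items()
        if other != word                # the query word itself is left out
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]

# Invented vectors, NOT taken from the Google News model.
toy_vectors = {
    "king":   [0.9, 0.8, 0.1],
    "kings":  [0.9, 0.8, 0.2],
    "queen":  [0.7, 0.9, 0.1],
    "banana": [0.1, 0.0, 0.9],
}

for word, score in toy_most_similar("king", toy_vectors):
    print(word, round(score, 3))  # "kings" ranks first, "banana" last
```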

In [0]:
# Try a word of your own choosing. What do the most similar word vectors have in common with the word you chose?
# model.most_similar()



## Word-vector "algebra"

Another interesting property of word vectors is that we can do "algebra" on them and derive semantic meaning from the result.

For example:
$king + woman = queen$

Using `gensim`, we can express this operation by passing the words to add as positive word vectors and the words to subtract as negative word vectors.



In [0]:
model.most_similar(positive=["king", "woman"])

[('man', 0.6628609895706177),
 ('queen', 0.6438563466072083),
 ('girl', 0.6136074662208557),
 ('princess', 0.6087510585784912),
 ('monarch', 0.5900576114654541),
 ('prince', 0.5896844863891602),
 ('teenage_girl', 0.571865975856781),
 ('boy', 0.5665285587310791),
 ('crown_prince', 0.5520807504653931),
 ('lady', 0.5445604920387268)]
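
To show what the `positive` and `negative` arguments do conceptually, here is a simplified sketch: add the unit-length vectors of the positive words, subtract those of the negative words, and rank the remaining words by cosine similarity to the result, skipping the input words themselves. The helper `toy_analogy` and the hand-made vectors (with axes loosely meaning royalty, maleness, femaleness) are assumptions for illustration. `gensim`'s actual implementation may differ in its details.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def toy_analogy(vectors, positive, negative=(), topn=3):
    """Combine word vectors and rank the other words by cosine similarity."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:   # positive words pull the query towards them
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:   # negative words push the query away
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)     # only the direction matters for cosine similarity
    inputs = set(positive) | set(negative)
    scores = [
        (word, dot(query, unit(vec)))
        for word, vec in vectors.items()
        if word not in inputs  # input words are excluded from the results
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]

# Invented vectors: [royalty, maleness, femaleness]. NOT from a trained model.
toy_vectors = {
    "king":   [1.0, 1.0, 0.0],
    "queen":  [1.0, 0.0, 1.0],
    "man":    [0.0, 1.0, 0.0],
    "woman":  [0.0, 0.0, 1.0],
    "prince": [1.0, 0.9, 0.0],
}

print(toy_analogy(toy_vectors, positive=["king", "woman"]))
print(toy_analogy(toy_vectors, positive=["king", "woman"], negative=["man"]))
```

In this toy setup "queen" ranks first in both calls, and subtracting "man" widens its lead over "prince". Real embeddings are noisier, as the output above shows.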

Another example,

$China + capital = Beijing$

We can see that we are able to derive the country's capital through word-vector "algebra" alone.

In [0]:
model.most_similar(positive=["China", "capital"])

[('Beijing', 0.6323763132095337),
 ('Chinese', 0.6081547737121582),
 ('Foods_Limited_HKSE', 0.579767107963562),
 ('Daniel_Schearf_reports', 0.5684748291969299),
 ('Shanghai', 0.5660480260848999),
 ('Guandong_Province', 0.5635813474655151),
 ('Changsha_Hunan', 0.5617741346359253),
 ('Communications_BoCom', 0.5575785040855408),
 ('Guangzhou', 0.5527973175048828),
 ('Chengdu', 0.5491997003555298)]

How about $king - man$?

In [0]:
model.most_similar(positive=["king"], negative=["man"])

[('kings', 0.4295138418674469),
 ('queen', 0.39028695225715637),
 ('Pansy_Ho_Chiu', 0.3827225863933563),
 ('monarch', 0.3633837103843689),
 ('kingdom', 0.36145076155662537),
 ('royal_palace', 0.3535977602005005),
 ('Savory_aromas_wafted', 0.35272473096847534),
 ('princes', 0.3526379466056824),
 ('monarchy', 0.3432357907295227),
 ('Rama_VII', 0.3380342423915863)]

In [0]:
# Try some word-vector "algebra" of your own. What other interesting things can you find out about word vectors?
# model.most_similar(positive=[], negative=[])