## Word2Vec Google News 300

**Word2Vec**

This is the algorithm used to create the vectors. Word2Vec maps words to a continuous vector space based on their context, allowing for semantic relationships to be captured mathematically.
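To make "based on their context" concrete, here is a minimal sketch of how a sliding window turns a sentence into (center word, context word) pairs, the kind of raw signal Word2Vec learns from. The function name `context_pairs`, the example sentence, and the window size are all invented for illustration; the real training code and its settings for this model are not shown in this notebook.

```python
def context_pairs(tokens, window=2):
    """Return (center, context) pairs for tokens within `window` positions of each other."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

# Hypothetical sentence, used only to show the windowing.
sentence = "the cat sat on the mat".split()
pairs = context_pairs(sentence, window=1)
print(pairs[:4])
```

Words that keep appearing in similar contexts end up with similar vectors, which is what the similarity queries later in this notebook rely on.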

**Google News**

This is the dataset on which the model was trained.

_Scale_: Approximately 100 billion words sourced from a Google News dataset.

_Vocabulary_: Around 3 million unique words and phrases.

**300**

This is the vector dimension.

It means that every word or phrase in the roughly 3-million-entry vocabulary is represented by a list of 300 floating-point numbers.
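A quick back-of-the-envelope calculation shows what these numbers imply for memory. This is a rough sketch: the vocabulary size is the approximate figure quoted above, and 4 bytes per number assumes 32-bit floats, which is an assumption rather than something stated here.

```python
VOCAB_SIZE = 3_000_000   # approximate vocabulary size quoted above
DIMENSIONS = 300         # numbers per vector
BYTES_PER_FLOAT = 4      # assumption: each number is stored as a 32-bit float

total_bytes = VOCAB_SIZE * DIMENSIONS * BYTES_PER_FLOAT
print(f"Approximate size of the raw vectors: {total_bytes / 1e9:.1f} GB")
```

Under these assumptions the vectors alone take a few gigabytes, so expect the download and the first load to be slow.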

In [3]:
import gensim.downloader as api # pip install gensim

In [5]:
model = api.load("word2vec-google-news-300")



## Example Vector

In [7]:
example_word = "home"

example_vec = model[example_word]

print(f"Vector for the word: {example_word}:\n\n{example_vec}")

Vector for the word: home:

[-0.01184082  0.07958984  0.0168457  -0.08984375  0.08642578  0.02416992
  0.0255127  -0.18945312  0.14160156  0.08447266  0.16992188 -0.25
 -0.0534668  -0.02832031  0.04541016 -0.14746094  0.01226807  0.05639648
  0.01953125  0.21582031  0.15722656 -0.15917969  0.08837891 -0.10595703
 -0.00408936 -0.02331543 -0.04931641  0.08154297  0.03808594  0.00177002
  0.05517578  0.03735352 -0.14648438 -0.03808594 -0.05517578  0.01135254
 -0.07861328 -0.14941406  0.09033203  0.03710938  0.08837891 -0.01483154
  0.18261719  0.09667969 -0.05517578  0.16308594  0.03955078  0.1640625
  0.08398438  0.0279541  -0.03442383  0.296875    0.140625    0.09863281
 -0.18457031 -0.26367188 -0.12109375  0.18261719  0.02282715 -0.04248047
 -0.02062988  0.08251953 -0.00140381  0.02246094 -0.07617188  0.02709961
 -0.04711914  0.05639648 -0.00250244  0.11328125  0.12890625 -0.09667969
  0.06738281 -0.08154297 -0.10546875  0.06982422 -0.02294922  0.00622559
  0.01843262 -0.02612305  0.05

In [8]:
print(f"Size of a single vector: {example_vec.shape}")

Size of a single vector: (300,)


## Find Most Similar Words

In [10]:
num_most_similar = 5

print(f"{num_most_similar} most similar words to the word: {example_word}:\n")
print(model.most_similar(example_word, topn=num_most_similar))

5 most similar words to the word: home:

[('house', 0.5617802143096924), ('Superfast_WiFi', 0.5286994576454163), ('homes', 0.5108882188796997), ('Home', 0.4817159175872803), ('residence', 0.4699970781803131)]
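The scores returned above are cosine similarities. The sketch below shows the ranking idea on a tiny hand-made vocabulary: compare the query vector with every other vector and keep the best matches. The vectors and the helper `toy_most_similar` are hypothetical and only illustrate the principle; the library's own implementation is vectorized and may differ in detail.

```python
import numpy as np

# Hypothetical 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "home":      np.array([0.9, 0.1, 0.0]),
    "house":     np.array([0.8, 0.2, 0.1]),
    "residence": np.array([0.7, 0.1, 0.3]),
    "keyboard":  np.array([0.0, 0.9, 0.4]),
}

def toy_most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word] / np.linalg.norm(vectors[word])
    scores = []
    for other, vec in vectors.items():
        if other == word:
            continue  # skip the query word itself
        cosine = float(np.dot(query, vec / np.linalg.norm(vec)))
        scores.append((other, cosine))
    # Highest similarity first
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

result = toy_most_similar("home", toy_vectors)
print(result)
```

Entries such as `Superfast_WiFi` in the real output are a reminder that the ranking reflects co-occurrence in the training text, not a curated notion of meaning.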


## Vector Arithmetic

In [12]:
print("King - Man + Woman = ?\n")

print(model.most_similar(positive=["King", "Woman"], negative=["Man"], topn=1))

King - Man + Woman = ?

[('Queen', 0.4929387867450714)]
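The query above can be read as: add the positive vectors, subtract the negative ones, then look for the nearest remaining word. Here is a minimal sketch of that idea using invented 2-dimensional vectors, where the first axis loosely stands for "royalty" and the second for "gender". The vectors and `toy_analogy` are hypothetical, and the sketch works on raw toy vectors, which may differ from what the library does internally.

```python
import numpy as np

# Hypothetical vectors: axis 0 ~ royalty, axis 1 ~ gender (+1 male, -1 female).
toy = {
    "king":     np.array([1.0,  1.0]),
    "queen":    np.array([1.0, -1.0]),
    "man":      np.array([0.0,  1.0]),
    "woman":    np.array([0.0, -1.0]),
    "princess": np.array([0.7, -1.0]),
}

def toy_analogy(positive, negative, vectors):
    """Return the word closest (by cosine) to sum(positive) - sum(negative)."""
    target = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    target = target / np.linalg.norm(target)
    excluded = set(positive) | set(negative)  # do not return the input words
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in excluded:
            continue
        score = float(np.dot(target, vec / np.linalg.norm(vec)))
        if score > best_score:
            best_word, best_score = word, score
    return best_word, best_score

answer, score = toy_analogy(["king", "woman"], ["man"], toy)
print(answer, score)
```

In the toy data the arithmetic lands exactly on "queen". With real embeddings the match is only approximate, which is why the score above is about 0.49 rather than 1.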


## Quantify Similarity and Difference

In [13]:
# Similarity between pairs
def find_similarity_of_pair(word1, word2):
  sim = model.similarity(word1, word2)
  print(f"Similarity score between `{word1}` and `{word2}`: {sim}")

In [22]:
find_similarity_of_pair("mouse", "keyboard")
find_similarity_of_pair("mobile", "phone")
find_similarity_of_pair("mom", "dad")
find_similarity_of_pair("ball", "bat")
print("====================================")
find_similarity_of_pair("mom", "keyboard")
find_similarity_of_pair("ball", "mobile")
find_similarity_of_pair("dad", "phone")
find_similarity_of_pair("mouse", "bat")

Similarity score between `mouse` and `keyboard`: 0.4737585484981537
Similarity score between `mobile` and `phone`: 0.5593067407608032
Similarity score between `mom` and `dad`: 0.7470093369483948
Similarity score between `ball` and `bat`: 0.48378387093544006
====================================
Similarity score between `mom` and `keyboard`: 0.17724628746509552
Similarity score between `ball` and `mobile`: 0.08789485692977905
Similarity score between `dad` and `phone`: 0.15379786491394043
Similarity score between `mouse` and `bat`: 0.16606763005256653
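`model.similarity` reports cosine similarity: the dot product of the two vectors divided by the product of their lengths. The sketch below writes the formula out with NumPy on small made-up vectors; the helper name `cosine_similarity` and the example vectors are illustrative only.

```python
import numpy as np

def cosine_similarity(vec1, vec2):
    """Cosine of the angle between two vectors, in the range [-1, 1]."""
    return float(np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2)))

# Made-up 2-dimensional vectors to show the extremes.
a = np.array([1.0, 0.0])
b = np.array([2.0, 0.0])    # same direction as a, different length
c = np.array([0.0, 3.0])    # perpendicular to a
d = np.array([-1.0, 0.0])   # opposite direction to a

same_direction = cosine_similarity(a, b)
perpendicular = cosine_similarity(a, c)
opposite = cosine_similarity(a, d)
print(same_direction, perpendicular, opposite)
```

Because only direction matters, the score ignores vector length. With the model loaded, `cosine_similarity(model["mom"], model["dad"])` should match the score printed above up to floating-point rounding.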


In [16]:
import numpy as np

In [17]:
# Norm of word diff
def find_norm_diff(word1, word2):
  vec1 = model[word1]
  vec2 = model[word2]
  diff_vec = vec1 - vec2
  diff_vec_magnitude = np.linalg.norm(diff_vec)
  print(f"Magnitude of Difference between `{word1}` and `{word2}`: {diff_vec_magnitude}")

In [23]:
find_norm_diff("mouse", "keyboard")
find_norm_diff("mouse", "bat")
print("========================")
find_norm_diff("mom", "dad")
find_norm_diff("mom", "keyboard")

Magnitude of Difference between `mouse` and `keyboard`: 3.4226744174957275
Magnitude of Difference between `mouse` and `bat`: 4.223516464233398
========================
Magnitude of Difference between `mom` and `dad`: 2.0500104427337646
Magnitude of Difference between `mom` and `keyboard`: 4.124857425689697
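The magnitudes above are computed on the raw vectors, so they depend on vector length as well as direction. Two unit-length vectors can never be more than 2 apart, so values above 3 show that these vectors are not normalized. The sketch below, using made-up 2-dimensional vectors and a hypothetical helper `unit`, shows how the distance relates to cosine similarity once vectors are scaled to unit length: squared distance equals 2 - 2 × cosine.

```python
import numpy as np

def unit(vec):
    """Scale a vector to length 1."""
    return vec / np.linalg.norm(vec)

# Made-up vectors for illustration.
a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

cosine = float(np.dot(unit(a), unit(b)))
unit_distance = float(np.linalg.norm(unit(a) - unit(b)))

# For unit vectors: ||a - b||^2 == 2 - 2 * cos(a, b)
print(unit_distance ** 2, 2 - 2 * cosine)

# Stretching one vector changes the raw distance but not the cosine.
raw_distance = float(np.linalg.norm(a - b))
stretched_distance = float(np.linalg.norm(10 * a - b))
stretched_cosine = float(np.dot(unit(10 * a), unit(b)))
print(raw_distance, stretched_distance, stretched_cosine)
```

If the goal is to compare direction only, normalizing before taking the difference keeps the two measures consistent with each other.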
