### 1. Introduction
In this section, we will delve into the practical use of Word2Vec with the Gensim library. We'll explore various operations that can be performed on words and the word embeddings learned by the model. Let's get started!


### 2. Setting Up The Environment 
First, we need to set up our environment. This involves importing necessary libraries and ensuring Gensim is installed.


In [12]:
!pip install --quiet gensim
import gensim
from gensim.models import Word2Vec
from gensim.downloader import load

import warnings
warnings.filterwarnings('ignore')

import numpy as np
np.set_printoptions(threshold=10)

### 3. Loading Word2Vec Vectors into Model

We will use pre-trained vectors trained on a part of the Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases.

In [2]:
wv = load('word2vec-google-news-300')

# Initialize an empty Word2Vec model with the same vector size as the pretrained vectors
w2v_model = Word2Vec(vector_size=wv.vector_size, min_count=1)

# Build the vocabulary from the pretrained model
w2v_model.build_vocab_from_freq(word_freq={word: wv.get_vecattr(word, 'count') for word in wv.index_to_key})

# Inject the pretrained vectors
w2v_model.wv.vectors[:] = wv.vectors[:]

# Check that the vectors were injected correctly (element-wise comparison;
# comparing .all() results would only compare two booleans)
assert np.array_equal(w2v_model.wv['king'], wv['king'])

### 4. Exploring the Word Vector Operations
Let's explore the word vectors and see how they perform on various tasks.

#### 4.1. Vector Arithmetic

Imagine you're navigating a map where cities and countries are represented as points. The distance and direction between "Germany" and "Berlin" on this map capture the relationship of a country to its capital.

When you move from "Germany" to "Berlin" on this map, you're essentially following a "capital city direction". Now, if you start at "France" and move in the same "capital city direction", you should end up at "Paris".

In the vector space of word embeddings:

- Moving from "Germany" to "Berlin" is like subtracting the vector of "Germany" from the vector of "Berlin".
- To find the equivalent capital for "France", you add this "capital city direction" to the vector of "France".
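
The idea can be sketched with a handful of made-up 2-D vectors. The `toy` dictionary and the `analogy` helper below are assumptions for illustration only, not part of Gensim, and the sketch ranks candidates by Euclidean distance for simplicity, whereas Gensim's `most_similar` ranks by cosine similarity.

```python
import numpy as np

# Toy 2-D "embeddings", invented for illustration (not taken from the model).
# Axis 0 loosely encodes "which country", axis 1 encodes "country vs. capital".
toy = {
    "Germany": np.array([1.0, 0.0]),
    "Berlin": np.array([1.0, 1.0]),
    "France": np.array([3.0, 0.0]),
    "Paris": np.array([3.0, 1.0]),
    "Rome": np.array([5.0, 1.0]),
}


def analogy(a, b, c, vectors):
    """Return the word closest to vectors[b] - vectors[a] + vectors[c].

    The three input words are excluded from the candidates.
    """
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return min(candidates, key=lambda w: np.linalg.norm(candidates[w] - target))


# The "capital city direction" is the offset from a country to its capital.
capital_direction = toy["Berlin"] - toy["Germany"]
print(capital_direction)

# Germany is to Berlin as France is to ...?
answer = analogy("Germany", "Berlin", "France", toy)
print(answer)
```

With these toy values, adding the offset to "France" lands exactly on "Paris"; with real embeddings the result is only approximately on the target, which is why the nearest word is looked up.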

In [3]:
result = w2v_model.wv.most_similar(positive=['France', 'Berlin'], negative=['Germany'], topn=10)
print(f"France - Germany + Berlin = {result[0][0]}  ---  with similarity score of {result[0][1]}")

France - Germany + Berlin = Paris  ---  with similarity score of 0.7672389149665833


#### 4.2. Cosine Similarity
We can compute the cosine similarity between two word vectors to measure their semantic similarity.

In [4]:
similarity = w2v_model.wv.similarity('boy', 'girl')
print(f"Cosine similarity between 'boy' and 'girl': {similarity}")

similarity = w2v_model.wv.similarity('coffee', 'giraffe')
print(f"Cosine similarity between 'coffee' and 'giraffe': {similarity}")

Cosine similarity between 'boy' and 'girl': 0.8543272018432617
Cosine similarity between 'coffee' and 'giraffe': 0.13183525204658508


1. **Cosine similarity between 'boy' and 'girl'**: <br>
This measures how similar the word vectors for 'boy' and 'girl' are. Both words refer to young people and appear in similar contexts, so their vectors lie close together in the vector space. The cosine similarity is correspondingly high (about 0.85), close to the maximum of 1.

2. **Cosine similarity between 'coffee' and 'giraffe'**: <br>
Here, we're comparing the word vectors for 'coffee' (a beverage) and 'giraffe' (an animal). These words are unrelated and appear in very different contexts, so their vectors point in very different directions. The cosine similarity is about 0.13, close to 0, indicating that the words are not similar.
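
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The following is a minimal NumPy sketch of that formula; the `cosine_similarity` helper and the example vectors are illustrative assumptions, not Gensim code.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Vectors pointing the same way score about 1, regardless of their length.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors score 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
# Vectors pointing in opposite directions score -1.
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])

print(same_direction, orthogonal, opposite)
```

Because the score depends only on direction, a frequent word with a long vector and a rare word with a short one can still be rated as highly similar.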

#### 4.3. Word Analogies
Word2Vec can solve analogies. For example, "man" is to "king" as "woman" is to "____?".

Imagine a theater where actors play different roles. In one play, a man plays the role of a king. In another play, if we want a woman to play a similar authoritative role, what would that role be? The answer could be a queen.

In [5]:
# This is the same vector arithmetic as above, applied to another example
result = w2v_model.wv.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(f"man:king as woman:{result[0][0]}")

man:king as woman:queen


#### 4.4. Find Synonyms
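Nearest neighbors in the vector space are words that appear in similar contexts. These are often synonyms or closely related words, although antonyms can also rank highly because they tend to share contexts.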

In [6]:
synonyms = w2v_model.wv.most_similar('car', topn=5)
print(f"Synonyms for 'car': {[word[0] for word in synonyms]}")

Synonyms for 'car': ['vehicle', 'cars', 'SUV', 'minivan', 'truck']
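
Conceptually, a nearest-neighbor lookup of this kind ranks every word in the vocabulary by its cosine similarity to the query word. Here is a rough sketch with a tiny invented vocabulary; the `words` list, the `vectors` matrix and the `nearest_neighbours` helper are assumptions for illustration and do not reproduce Gensim's internals.

```python
import numpy as np

# Hypothetical toy vocabulary with made-up 3-D vectors (not from the model).
words = ["car", "vehicle", "truck", "banana"]
vectors = np.array([
    [0.9, 0.1, 0.0],  # car
    [0.8, 0.2, 0.1],  # vehicle
    [0.7, 0.0, 0.3],  # truck
    [0.0, 0.1, 0.9],  # banana
])


def nearest_neighbours(query, words, vectors, topn=2):
    """Return the topn words most similar to `query` by cosine similarity."""
    # Normalising each row to unit length turns dot products into cosines.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = unit @ unit[words.index(query)]
    order = np.argsort(-scores)  # indices sorted from highest to lowest score
    ranked = [(words[i], float(scores[i])) for i in order if words[i] != query]
    return ranked[:topn]


neighbours = nearest_neighbours("car", words, vectors)
print([word for word, _ in neighbours])
```
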


#### 4.5. Out-of-Vocabulary Words
Word2Vec cannot produce embeddings for words that are not in its vocabulary. FastText, an extension of Word2Vec, handles this by building a word's vector from the embeddings of its character n-grams (subwords), so a vector can be composed even for an unseen word.

For more information: https://github.com/facebookresearch/fastText
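
To make the subword idea concrete, here is a small sketch of how a word can be split into character n-grams, using `<` and `>` as word-boundary markers and n-gram lengths from 3 to 6. The `char_ngrams` helper is a hypothetical illustration rather than the FastText implementation, which additionally hashes the n-grams into a fixed number of buckets.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, padded with '<' and '>' markers."""
    padded = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the padded word.
        for start in range(len(padded) - n + 1):
            ngrams.append(padded[start:start + n])
    return ngrams


trigrams = char_ngrams("where", min_n=3, max_n=3)
print(trigrams)  # ['<wh', 'whe', 'her', 'ere', 're>']
```

An unseen word such as "madeupword" still shares n-grams like `mad` or `wor` with known words, which is what lets a trained FastText model compose a vector for it.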

In [7]:
from gensim.models import FastText

fasttext_model = FastText(vector_size=wv.vector_size, min_count=1)

# Build the vocabulary from the pretrained model
fasttext_model.build_vocab_from_freq(word_freq={word: wv.get_vecattr(word, 'count') for word in wv.index_to_key})

# Inject the pretrained vectors
fasttext_model.wv.vectors[:] = wv.vectors[:]

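# Check that the vectors were injected correctly (element-wise comparison)
assert np.array_equal(fasttext_model.wv['king'], wv['king'])

# Note: only the full-word vectors are copied here. The character n-gram vectors
# keep their random initial values because this model was never trained, so the
# vectors returned for out-of-vocabulary words below show the lookup mechanism
# but carry no learned meaning.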

In [8]:
print('madeupword' in fasttext_model.wv.key_to_index)

False


In [13]:
fasttext_model.wv['madeupword']

array([ 6.9911410e-05,  3.8663266e-04,  4.9756694e-04, ...,
        3.2902186e-04, -4.7884602e-04, -7.4080104e-05], dtype=float32)

#### 4.6. Sentence/Document Embeddings
To represent entire sentences or documents, we can average the word embeddings of all words in the sentence/document.

In [14]:
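def sentence_vector(sentence, model):
    """Average the vectors of the in-vocabulary words in `sentence`."""
    # Note: split() keeps punctuation attached (e.g. "house."), so such tokens
    # are usually not found in the vocabulary and are skipped.
    words = [word for word in sentence.split() if word in model.wv.key_to_index]
    if len(words) == 0:
        return None  # no known words, so no vector can be computed
    return sum(model.wv[word] for word in words) / len(words)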

sentence = "A quick cat is running outside the house."
vector = sentence_vector(sentence, w2v_model)
print(f"Vector for the sentence: {vector}")

Vector for the sentence: [ 0.04085432  0.07670375  0.00636509 ...  0.11017717 -0.04701451
 -0.0227356 ]


#### 4.7. Semantic Clustering
Grouping semantically related words based on their embeddings can be achieved using clustering techniques like K-means.

In [11]:
from sklearn.cluster import KMeans

# Extract the words and their vectors (row i of word_vectors belongs to words[i])
words = w2v_model.wv.index_to_key
word_vectors = w2v_model.wv.vectors

# Use KMeans to cluster these word vectors into 10 clusters
kmeans = KMeans(n_clusters=10, random_state=0).fit(word_vectors)

# Display words in each cluster
for i in range(10):
    cluster_words = [words[j] for j, cluster_id in enumerate(kmeans.labels_) if cluster_id == i]
    print(f"Cluster {i + 1}: {', '.join(cluster_words[:10])} ...")  # Displaying only the first 10 words for brevity

Cluster 1: </s>, in, for, that, is, on, ##, The, with, said ...
Cluster 2: pm, •, ##:##, *, WASHINGTON, NEW_YORK, Former, #####, Call, vs. ...
Cluster 3: %, system, available, products, data, technology, rate, study, product, provides ...
Cluster 4: game, team, season, #-#, play, points, win, games, ##-##, players ...
Cluster 5: He, her, she, school, She, manager, Smith, district, St., wife ...
Cluster 6: India, Mr, Pakistan, Government, Indian, minister, Rs, Secretary, Minister, army ...
Cluster 7: company, market, quarter, billion, companies, sales, financial, industry, growth, Inc. ...
Cluster 8: government, AP, police, tax, economic, Department, prices, federal, political, military ...
Cluster 9: By, ###-####, looking_statements, ###-###-####, Writer, AP_Photo, differ_materially, Tickets, please_visit, Contact ...
Cluster 10: I, you, like, your, my, And, me, 've, You, really ...


### 5. References and Further Reading
- [Gensim Documentation](https://radimrehurek.com/gensim/models/word2vec.html)
- [Gensim Source Code](https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/word2vec.py)
- [Original Word2Vec Paper](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)
- [FastText GitHub Repo](https://github.com/facebookresearch/fastText)