## Gensim Word2Vec Example
Gensim is a library that provides a family of algorithms built on highly optimized C routines, data streaming, and Pythonic interfaces.

The word2vec algorithms include the skip-gram and CBOW models, trained with either hierarchical softmax or negative sampling.
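
To make the difference between the two models concrete, here is a minimal pure-Python sketch of the training examples each one would see for a single sentence. The helper names `skipgram_pairs` and `cbow_examples` are hypothetical and not part of Gensim, and the sketch uses a fixed window; real implementations add details such as subsampling and varying the window size.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: one (center, context) pair per context word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW: one (context words, target) example per position."""
    examples = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:
            examples.append((context, target))
    return examples


sentence = ["i", "love", "canned", "tuna"]  # toy sentence, not from the dataset
print(skipgram_pairs(sentence, window=1))
print(cbow_examples(sentence, window=1))
```

Skip-gram produces many small prediction tasks per word, while CBOW produces one task per position with the whole context as input.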

In [None]:
!pip install gensim
!pip install nltk
!pip install kagglehub

In [None]:
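import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
import gensim
from gensim.models import Word2Vec
nltk.download('punkt')      # tokenizer data needed by sent_tokenize/word_tokenize
nltk.download('punkt_tab')  # same data under the name newer NLTK releases look for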

Go to the Kaggle website, download "Reviews.csv" from the "Amazon Fine Food Reviews" dataset, and place it in the notebook's working directory: https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews

In [None]:
import pandas as pd
rev = pd.read_csv("Reviews.csv")
print(rev.head())

# Build the corpus as a list of tokenized sentences

In [None]:
# join the first 1,000 reviews into a single string, one review per line
corpus_text = '\n'.join(rev['Text'][:1000])
data = []
# iterate through each sentence in the corpus
for i in sent_tokenize(corpus_text):
    temp = []
    # tokenize the sentence into lowercase words
    for j in word_tokenize(i):
        temp.append(j.lower())
    data.append(temp)

# Create the Word2Vec model with Gensim


| Parameter | Type / Default | Meaning |
|-----------|----------------|---------|
| sentences | list of list of str | Your training data — a list where each item is a tokenized sentence (e.g. [['I', 'love', 'NLP'], ['Word2Vec', 'is', 'cool']]). |
| vector_size | int, default = 100 | Dimensionality of the word vectors (i.e., number of features in the embedding). Larger size captures more semantic nuance but requires more data and memory. |
| window | int, default = 5 | Maximum distance between the target word and its surrounding context words. A larger window means a broader context. |
| min_count | int, default = 5 | Ignores all words that appear fewer than this number of times. Helps filter out noise and rare words. |
| sg | int, default = 0 | Selects the training algorithm. sg = 0: CBOW (Continuous Bag of Words), which predicts the current word from its context. sg = 1: skip-gram, which predicts context words from the current word. Skip-gram tends to work better with smaller datasets and rare words. |
| epochs | int, default = 5 | Number of iterations (epochs) over the training corpus. Increasing can improve accuracy, but training takes longer. |
| workers | int, default = 3 | Number of worker threads used for training (parallelization). More workers usually mean faster training on multi-core machines. |
| hs | int, default = 0 | If 1, hierarchical softmax is used for training; if 0, and negative > 0, then negative sampling is used instead. |
| negative | int, default = 5 | Number of negative samples ("noise words") drawn per training example, usually between 5 and 20. Setting this to 0 disables negative sampling. |
| seed | int, optional | Random seed for reproducibility. A fully deterministic run also requires a single worker thread (workers = 1). |
| alpha | float, default = 0.025 | The initial learning rate. It decreases linearly during training. |
| min_alpha | float, default = 0.0001 | The minimum learning rate during training decay. |
| max_vocab_size | int, optional | Limits RAM during vocabulary building: if there are more unique words than this, the least frequent ones are pruned. None means no limit. |
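
Two of these parameters are easy to illustrate without Gensim. The sketch below shows what `min_count` filtering and the linear decay from `alpha` to `min_alpha` amount to, under simplifying assumptions. The functions `prune_vocab` and `linear_alpha` are hypothetical helpers written for this example, not Gensim internals.

```python
from collections import Counter


def prune_vocab(sentences, min_count=5):
    """Keep only the words that occur at least `min_count` times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}


def linear_alpha(progress, alpha=0.025, min_alpha=0.0001):
    """Learning rate after a fraction `progress` (0.0 to 1.0) of training."""
    return alpha - (alpha - min_alpha) * progress


# toy corpus, not taken from the dataset
toy = [["tuna", "is", "tasty"], ["tuna", "salad"], ["tasty", "salad", "tuna"]]
print(prune_vocab(toy, min_count=2))  # "is" occurs once and is dropped

for progress in (0.0, 0.5, 1.0):
    print(progress, linear_alpha(progress))
```

With `min_count = 1`, as used in the next cell, no word is dropped, which suits a small corpus but keeps noisy one-off tokens.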

In [None]:
# model1 uses CBOW (sg=0), model2 uses skip-gram (sg=1)
model1 = gensim.models.Word2Vec(data, min_count=1, vector_size=100, window=5, sg=0)
model2 = gensim.models.Word2Vec(data, min_count=1, vector_size=100, window=5, sg=1)

In [None]:
vector = model1.wv['tuna']  # get numpy vector of a word

sims = model1.wv.most_similar('tuna', topn=10)  # get other similar words
sims

In [None]:
vector = model2.wv['tuna']  # get numpy vector of a word

sims = model2.wv.most_similar('tuna', topn=10)  # get other similar words
sims
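
The scores returned by `most_similar` are cosine similarities between word vectors. As a rough illustration of the ranking, here is a pure-Python sketch on made-up 3-dimensional vectors. The `toy_vectors` values and the helper `rank_by_cosine` are assumptions for this example only and do not come from the trained models.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_cosine(vectors, query, topn=3):
    """Rank every other word by cosine similarity to `query`."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# made-up vectors for illustration only
toy_vectors = {
    "tuna": [1.0, 0.9, 0.0],
    "salmon": [0.9, 1.0, 0.1],
    "coffee": [0.0, 0.1, 1.0],
}
print(rank_by_cosine(toy_vectors, "tuna", topn=2))
```

Because cosine similarity depends only on direction, two words can be close even when their vectors differ in length.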

The trained word vectors are stored in a KeyedVectors instance, available as model.wv, as seen above.
The reason for separating the trained vectors into KeyedVectors is that if you no longer need the full model state (that is, you will not continue training), it can be discarded, keeping just the vectors and their keys.

This results in a much smaller and faster object that can be mmapped for lightning fast loading and sharing the vectors in RAM between processes:

In [None]:
from gensim.models import KeyedVectors

# Store just the words + their trained embeddings.

word_vectors = model1.wv

word_vectors.save("word2vec.wordvectors")

# Load back with memory-mapping = read-only, shared across processes.

wv = KeyedVectors.load("word2vec.wordvectors", mmap='r')

vector = wv['tuna']  # Get numpy vector of a word that is in the vocabulary
vector

The following code finds the words most similar to "tuna" in the `wv` vectors and plots them in 2-D using PCA, which is deterministic and fast. It also highlights "tuna" and labels every point.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_similar_words(wv, query="tuna", topn=15):
    # Graceful fallback for casing
    if query not in wv.key_to_index:
        if query.lower() in wv.key_to_index:
            query = query.lower()
        elif query.title() in wv.key_to_index:
            query = query.title()
        else:
            raise KeyError(f"'{query}' not in vocabulary.")

    # Get similar words
    sims = wv.most_similar(query, topn=topn)  # [(word, score), ...]
    words = [w for w, _ in sims] + [query]

    # Collect vectors
    X = np.vstack([wv[w] for w in words])

    # 2D projection
    pca = PCA(n_components=2, random_state=42)
    X2 = pca.fit_transform(X)

    # Split for styling
    query_xy = X2[-1]
    others_xy = X2[:-1]
    other_labels = words[:-1]

    plt.figure(figsize=(8, 6))
    # plot similar words
    plt.scatter(others_xy[:, 0], others_xy[:, 1], s=60, alpha=0.8)
    # plot the query word
    plt.scatter(query_xy[0], query_xy[1], s=150, marker='*')  # highlighted

    # annotate others
    for (x, y), label in zip(others_xy, other_labels):
        plt.text(x + 0.02, y + 0.02, label, fontsize=10)

    # annotate query last so it's on top
    plt.text(query_xy[0] + 0.02, query_xy[1] + 0.02, query, fontsize=12, weight="bold")

    plt.title(f"Top-{topn} words similar to '{query}' (PCA projection)")
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.tight_layout()
    plt.show()


plot_similar_words(wv, query="tuna", topn=15)


In [None]:
import matplotlib.pyplot as plt

def bar_similarities(wv, query="tuna", topn=10):
    sims = wv.most_similar(query, topn=topn)
    labels, scores = zip(*sims)
    plt.figure(figsize=(8,4))
    plt.barh(range(len(scores)), scores)
    plt.yticks(range(len(scores)), labels)
    plt.gca().invert_yaxis()
    plt.title(f"Cosine similarity to '{query}'")
    plt.tight_layout()
    plt.show()

bar_similarities(wv, "cherry")


# Homework
1. Train Word2Vec models with different parameters (e.g., vector_size, window, sg) and compare the results.
2. Use the trained Word2Vec embeddings in a simple text classification task (e.g., sentiment analysis) and evaluate the performance.
3. Visualize the embeddings using PCA to see how similar words cluster together.
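
As a possible starting point for task 2, one common and simple way to turn word embeddings into features for a classifier is to average the vectors of the words in a document. The sketch below uses a plain dictionary of made-up vectors. The helper `document_vector` is hypothetical, and in the notebook `model1.wv` could likely be passed in place of the dictionary, since KeyedVectors supports `in` checks and `[]` lookups.

```python
def document_vector(tokens, vectors, size):
    """Average the vectors of the known tokens; zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * size
    # zip(*known) walks the vectors dimension by dimension
    return [sum(column) / len(known) for column in zip(*known)]


# made-up 2-dimensional vectors for illustration only
toy_vectors = {"good": [1.0, 0.0], "tuna": [0.0, 1.0]}

print(document_vector(["good", "tuna", "unknownword"], toy_vectors, size=2))
print(document_vector(["unknownword"], toy_vectors, size=2))
```

The resulting fixed-length vectors can then be fed to any standard classifier. Averaging discards word order, so treat it as a baseline rather than a strong model.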