# Day-52: Word Embeddings

Yesterday, we took a crucial step by converting words into count-based vectors (BoW and TF-IDF). While effective, these methods treat words as isolated, independent features, failing to capture the rich semantic relationships between them.

Today, we delve into the core of modern NLP: Word Embeddings. This technique transforms words into dense numerical vectors that capture meaning and context.

## Topic Covered:

## Word Embeddings: Meaning in Motion 

Word Embeddings are dense, low-dimensional vector representations of words. The core idea is that words that appear in similar contexts should have similar vector representations.
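To make the contrast with count-style representations concrete, here is a minimal sketch comparing one-hot vectors with dense vectors. The dense values are made up for illustration and do not come from a trained model; the helper name `cosine` is likewise just an assumption for this example.

```python
import numpy as np

vocab = ["king", "queen", "apple"]

# One-hot vectors: one dimension per vocabulary word, with a single 1 in each.
one_hot = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}

# Made-up dense vectors (illustrative values, not from a trained model).
dense = {
    "king": np.array([0.5, 0.4, 0.3]),
    "queen": np.array([0.4, 0.6, 0.2]),
    "apple": np.array([0.1, 0.1, 0.9]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# With one-hot vectors every pair of distinct words has similarity 0.
print(cosine(one_hot["king"], one_hot["queen"]))

# With dense vectors, related words can sit closer together than unrelated ones.
print(round(cosine(dense["king"], dense["queen"]), 4))
print(round(cosine(dense["king"], dense["apple"]), 4))
```

In this toy setup the one-hot similarity is always zero, while the dense vectors rank "queen" closer to "king" than "apple" is.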

- `How it Works`: These vectors (typically 50 to 300 dimensions) are learned by training a model, often a shallow neural network, on a large text corpus (such as Wikipedia or a web crawl).

- `Analogy`: Geographical Map. Imagine every word is a city. BoW just lists cities alphabetically. Word Embeddings place them on a map: words with similar meanings (e.g., "king," "queen") are mapped close together, and the direction between related words (e.g., the vector from "man" to "woman") is consistent.

- `Key Feature`: Vector Arithmetic. The most famous result of embeddings is their ability to perform algebraic operations that capture semantic relationships:

$$ \text{king} - \text{man} + \text{woman} \approx \text{queen} $$

## Word2Vec: Predicting Context

Word2Vec is one of the earliest and most influential methods for generating word embeddings. It introduced two main architectures:

1. Continuous Bag of Words (CBOW): Predicts the current word based on its surrounding context words. (Input: context, Output: target word).

    - `Example`: Predict "fox" from context "The quick [ ] ran."

2. Skip-gram: Predicts the context words given the current word. (Input: target word, Output: context).

    - `Example`: Predict "The," "quick," "ran" from target "fox."

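The two architectures trade off differently. Skip-gram is generally reported to represent rare words better and to work well even with modest amounts of training data, because every (target, context) pair becomes its own training example. CBOW averages the context, which makes it faster to train and slightly more accurate for frequent words.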
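The difference between the two objectives is easiest to see in the training examples each one generates. The sketch below builds both kinds of pairs from a tokenized sentence; the function names and the window size of 2 are assumptions chosen for illustration, and a real Word2Vec implementation adds details such as subsampling and negative sampling that are left out here.

```python
def context_window(tokens, index, window):
    """Return the tokens within `window` positions of tokens[index], excluding it."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right

def cbow_pairs(tokens, window=2):
    """(context words, target) pairs: the model sees the context and predicts the target."""
    return [(context_window(tokens, i, window), tokens[i]) for i in range(len(tokens))]

def skipgram_pairs(tokens, window=2):
    """(target, context word) pairs: the model sees the target and predicts each context word."""
    return [
        (tokens[i], context_word)
        for i in range(len(tokens))
        for context_word in context_window(tokens, i, window)
    ]

tokens = ["the", "quick", "fox", "ran"]

# CBOW example for the target "fox": (['the', 'quick', 'ran'], 'fox')
print(cbow_pairs(tokens)[2])

# Skip-gram examples for the target "fox":
# [('fox', 'the'), ('fox', 'quick'), ('fox', 'ran')]
print([pair for pair in skipgram_pairs(tokens) if pair[0] == "fox"])
```

Note that one position in the sentence yields a single CBOW example but several Skip-gram examples, one per context word.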

## GloVe: Global Matrix Factorization

GloVe (Global Vectors for Word Representation) is another popular embedding method that combines the principles of local context windows (like Word2Vec) with global matrix factorization techniques.

- `How it Works`: Instead of predicting words one context window at a time, GloVe first builds a word-word co-occurrence matrix over the whole corpus. It then fits vectors so that the dot product of two word vectors approximates the logarithm of how often those two words co-occur.

- `Advantage`: Because the co-occurrence counts are collected once and summarize the entire corpus, GloVe can converge quickly and is often reported to perform well even on smaller training corpora.
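The word-word co-occurrence matrix that GloVe starts from can be sketched in a few lines. This is only the counting step, not GloVe training itself: the function name, the toy corpus, and the window size are assumptions for illustration. The 1/distance weighting follows the scheme described in the GloVe paper, where nearby word pairs contribute more than distant ones.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Count how often each pair of words appears within `window` tokens of each other.

    Each co-occurrence is weighted by 1 / distance, so adjacent words count more.
    """
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Look back at up to `window` preceding tokens.
            for j in range(max(0, i - window), i):
                weight = 1.0 / (i - j)
                counts[(word, tokens[j])] += weight
                counts[(tokens[j], word)] += weight  # keep the matrix symmetric
    return dict(counts)

# Toy corpus (made up for illustration).
corpus = [
    ["ice", "is", "cold"],
    ["steam", "is", "hot"],
    ["ice", "is", "solid"],
]

X = cooccurrence_counts(corpus)
print(X[("ice", "is")])    # 2.0 (adjacent in two sentences)
print(X[("ice", "cold")])  # 0.5 (two tokens apart in one sentence)
print(("ice", "hot") in X) # False (never seen together)
```

A GloVe-style model would then fit word vectors to the logarithms of these counts; that optimization step is omitted here.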

## Code Example: Loading and Using Pre-trained GloVe Embeddings

Since training your own Word2Vec or GloVe model on a large corpus is computationally expensive, in practice we usually start from pre-trained models (trained on billions of words) and load them via libraries like Gensim or through a machine learning framework.
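Pre-trained GloVe vectors are commonly distributed as plain-text files in which each line holds a word followed by its vector components, separated by spaces. As a minimal sketch of what loading involves, the hypothetical helper below parses lines in that layout into a dictionary. The sample lines are made up, and the file path in the comment is a placeholder.

```python
import numpy as np

def parse_glove_lines(lines):
    """Parse GloVe-style text lines ("word v1 v2 ... vn") into a {word: vector} dict."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, values = parts[0], parts[1:]
        vectors[word] = np.array([float(v) for v in values], dtype=np.float32)
    return vectors

# Hypothetical sample in the same layout as a GloVe .txt file (values are made up).
sample = [
    "king 0.5 0.4 0.3",
    "queen 0.4 0.6 0.2",
]
glove = parse_glove_lines(sample)
print(sorted(glove))          # ['king', 'queen']
print(glove["queen"].shape)   # (3,)

# With a real file on disk (placeholder path):
# with open("path/to/glove.txt", encoding="utf-8") as f:
#     glove = parse_glove_lines(f)
```

Gensim's `KeyedVectors.load_word2vec_format` can read files of this kind as well. In Gensim 4.x this appears to require `binary=False, no_header=True` for raw GloVe files, but check the documentation for your installed version.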

To keep the example self-contained, the code below does not load a real GloVe file. Instead it simulates a handful of 3-dimensional vectors to demonstrate vector arithmetic and vector similarity.

In [1]:
! pip install gensim

In [None]:
import numpy as np
# Gensim is a widely used library for loading and working with Word2Vec and GloVe vectors.
# KeyedVectors is imported only to show where real vectors would come from;
# the simulated example below does not use it.
from gensim.models import KeyedVectors

# --- NOTE: Loading real embeddings requires the vector file to be present on disk ---
# For demonstration, we simulate the vectors with a small dictionary instead:

# Simulated GloVe vector dictionary (Words -> 3-dimensional vector)
word_vectors_dict = {
    'king': np.array([0.5, 0.4, 0.3]),
    'queen': np.array([0.4, 0.6, 0.2]),
    'man': np.array([0.7, 0.2, 0.6]),
    'woman': np.array([0.6, 0.5, 0.4]),
    'apple': np.array([0.1, 0.1, 0.9]), # Unrelated word
    'fruit': np.array([0.2, 0.2, 0.8]), # Related to apple
    'phone': np.array([0.9, 0.1, 0.1])  # Unrelated word
}

# 1. Perform Vector Arithmetic (Illustrating the core feature)
vec_king = word_vectors_dict['king']
vec_man = word_vectors_dict['man']
vec_woman = word_vectors_dict['woman']

# Operation: king - man + woman
result_vector = vec_king - vec_man + vec_woman

print("Vector Arithmetic: (king - man + woman) result:")
print(np.round(result_vector, 2))
# The resulting vector [0.4 0.7 0.1] should be numerically close to the 'queen' vector [0.4 0.6 0.2]

# 2. Check Similarity (Cosine Similarity)
def cosine_similarity(v1, v2):
    """Calculates the cosine similarity between two vectors."""
    dot_product = np.dot(v1, v2)
    norm_v1 = np.linalg.norm(v1)
    norm_v2 = np.linalg.norm(v2)
    return dot_product / (norm_v1 * norm_v2)

# Compare similar words
sim_king_queen = cosine_similarity(word_vectors_dict['king'], word_vectors_dict['queen'])
# Compare dissimilar words
sim_king_apple = cosine_similarity(word_vectors_dict['king'], word_vectors_dict['apple'])
# Compare related words
sim_apple_fruit = cosine_similarity(word_vectors_dict['apple'], word_vectors_dict['fruit'])
# compare unrelated words
sim_apple_phone = cosine_similarity(word_vectors_dict['apple'], word_vectors_dict['phone'])

# Display results
print(f"\nSimilarity (King vs. Queen): {sim_king_queen:.4f}")
print(f"Similarity (King vs. Apple): {sim_king_apple:.4f}")
print(f"Similarity (Apple vs. Fruit): {sim_apple_fruit:.4f}")
print(f"Similarity (Apple vs. Phone): {sim_apple_phone:.4f}")
print("\nNote: Similarity scores range from -1 (opposite direction) to 1 (same direction).")
# Expected: King/Queen and Apple/Fruit score much closer to 1.0 than King/Apple or Apple/Phone.

Vector Arithmetic: (king - man + woman) result:
[0.4 0.7 0.1]

Similarity (King vs. Queen): 0.9449
Similarity (King vs. Apple): 0.5588
Similarity (Apple vs. Fruit): 0.9831
Similarity (Apple vs. Phone): 0.2289

Note: Similarity scores range from -1 (opposite direction) to 1 (same direction).
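
The arithmetic result can also be turned into an actual answer by searching for the nearest word. The sketch below reuses the same made-up 3-dimensional vectors as the simulated dictionary and excludes the three input words from the search, which is the usual convention for analogy queries. The function name `analogy` is an assumption for this example.

```python
import numpy as np

# Same made-up 3-dimensional vectors as the simulated dictionary above.
vectors = {
    "king": np.array([0.5, 0.4, 0.3]),
    "queen": np.array([0.4, 0.6, 0.2]),
    "man": np.array([0.7, 0.2, 0.6]),
    "woman": np.array([0.6, 0.5, 0.4]),
    "apple": np.array([0.1, 0.1, 0.9]),
    "fruit": np.array([0.2, 0.2, 0.8]),
    "phone": np.array([0.9, 0.1, 0.1]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = {
        word: cosine(target, vec)
        for word, vec in vectors.items()
        if word not in (a, b, c)
    }
    return max(candidates, key=candidates.get)

print(analogy("king", "man", "woman", vectors))  # queen
```

With real pre-trained vectors, Gensim's `KeyedVectors.most_similar(positive=[...], negative=[...])` performs this kind of query over the full vocabulary.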


## Summary of Day 52

Today, you moved beyond simple word counting to embrace Word Embeddings. You learned that Word2Vec and GloVe generate dense vectors that encode the semantic meaning of words, enabling powerful features like vector arithmetic. Learning meaning from the contexts in which words appear, rather than just counting them, is what separates embedding-based NLP from the count-based approaches covered yesterday.

## What's Next (Day 53)

We've focused on words, but what about the structure of a sentence? Tomorrow, on Day 53, we'll dive into two foundational structural analysis tools: Named Entity Recognition (NER) and Part-of-Speech (POS) Tagging. You'll learn how to use libraries like spaCy to automatically tag grammar and identify key entities (people, places, organizations) in text.