# Introduction to Word Embeddings

Open in [Google Colab](https://colab.research.google.com/github/febse/ta2025/blob/main/03-01-Word-Embeddings-Overview.ipynb)


In [2]:
# Install gensim if needed (Colab)
try:
    import gensim
except ImportError:
    %pip install gensim

import gensim.downloader as api

# Download the Google News Word2Vec model (300d, English)
model = api.load("word2vec-google-news-300")


In [1]:
# Look up the 300-dimensional vector for "king" (run the cell above first so that `model` is defined)
model["king"]

Until now, we have treated words as discrete symbols. Each word was represented as a one-hot vector, i.e., a vector of length V (the size of the vocabulary) with a 1 in the position corresponding to the word and 0s elsewhere. This representation has two main limitations:

- The length of the vector is equal to the size of the vocabulary, which can be very large (tens or hundreds of thousands of words).
- All one-hot vectors are orthogonal to each other, so the representation cannot capture semantic or syntactic similarities between words.
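The orthogonality problem is easy to verify by hand. The following is a minimal sketch using a hypothetical three-word vocabulary (the words and helper names are chosen only for illustration and are not part of the original notebook). It builds one-hot vectors and shows that the cosine similarity between any two different words is zero, no matter how related the words are.

```python
import math

# Hypothetical toy vocabulary, chosen only for illustration
vocab = ["king", "queen", "apple"]
V = len(vocab)

def one_hot(index, size):
    """Return a vector of length `size` with a 1 at `index` and 0s elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vectors = {word: one_hot(i, V) for i, word in enumerate(vocab)}

# "king" is as dissimilar to "queen" as it is to "apple"
sim_king_queen = cosine(vectors["king"], vectors["queen"])
sim_king_apple = cosine(vectors["king"], vectors["apple"])
sim_king_king = cosine(vectors["king"], vectors["king"])
print(sim_king_queen, sim_king_apple, sim_king_king)
```

Every pair of distinct words gets a similarity of exactly zero, so a one-hot encoding cannot express that "king" is closer in meaning to "queen" than to "apple".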

As a solution to the first problem, we can use dense vector representations of words, i.e., vectors of much smaller size (e.g., 100 or 300 dimensions) where each dimension can take any real value. Such dense vector representations are called **word embeddings**. We have already seen one method that in effect compresses sparse one-hot vectors into dense vectors: applying SVD to the term-document matrix in Latent Semantic Analysis (LSA).

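However, a dense representation alone does not solve the second problem: word embeddings are unlikely to capture semantic relationships between words if each word is learned in isolation, without regard to the words that surround it.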

The word2vec model [@MIKOLOV2013EfficientEstimationWord] turned out to deliver impressive results by learning word embeddings based on the context in which words appear. There are two variants of the word2vec model: Continuous Bag of Words (CBOW) and Skip-gram.

The skip-gram model tries to predict the context words given a target word. The architecture of the skip-gram model is shown in the figure below.

For example, in the sentence "The cat sits on the mat", if the target word is "sits" and the context window covers three words on each side, the context words are "The", "cat", "on", "the", and "mat". The model takes the one-hot vector of the target word as input, passes it through a hidden layer (which is essentially a weight matrix that transforms the one-hot vector into a dense vector), and then tries to predict the one-hot vectors of the context words.
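To make the notion of target and context words concrete, here is a minimal sketch of how (target, context) training pairs could be generated from a tokenized sentence. The function name `skipgram_pairs` and the `window` parameter are assumptions made for this illustration and are not taken from the word2vec implementation. Tokens are lowercased for simplicity.

```python
def skipgram_pairs(tokens, window):
    """Return all (target, context) pairs within `window` positions of each target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)                 # first context position
        hi = min(len(tokens), i + window + 1)   # one past the last context position
        for j in range(lo, hi):
            if j != i:                          # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the cat sits on the mat".split()

# A window of 2 excludes "mat" from the context of "sits"
pairs_w2 = skipgram_pairs(tokens, window=2)
sits_w2 = [context for target, context in pairs_w2 if target == "sits"]

# A window of 3 covers the whole sentence around "sits"
pairs_w3 = skipgram_pairs(tokens, window=3)
sits_w3 = [context for target, context in pairs_w3 if target == "sits"]

print(sits_w2)
print(sits_w3)
```

The window size is a modelling choice: a larger window includes more distant words as context, while a smaller one restricts the context to the immediate neighbours of the target.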

![Skip-gram Model](./figures/skip-gram-architecture.webp)
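The remark that the hidden layer is essentially a weight matrix can be checked with a small numerical sketch: multiplying a one-hot vector by a V x d matrix simply selects one row of that matrix, and that row is the word's dense vector. The vocabulary, the embedding size `d = 3`, and the randomly initialized matrix `W` below are all hypothetical and serve only as an illustration. Plain Python lists are used to keep the sketch dependency-free; in practice one would typically use an array library.

```python
import random

# Hypothetical toy vocabulary and embedding size, chosen only for illustration
vocab = ["the", "cat", "sits", "on", "mat"]
V, d = len(vocab), 3

# Randomly initialized input weight matrix W with V rows and d columns
rng = random.Random(0)
W = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(V)]

def one_hot(index, size):
    """Return a vector of length `size` with a 1 at `index` and 0s elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vec_mat(x, M):
    """Multiply a row vector x (length V) by a matrix M (V x d)."""
    return [sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(M[0]))]

idx = vocab.index("sits")
hidden = vec_mat(one_hot(idx, V), W)

# The matrix product and a direct row lookup give the same dense vector
same_as_row = all(abs(h - w) < 1e-12 for h, w in zip(hidden, W[idx]))
print(same_as_row)
```

Because the product reduces to a row lookup, implementations usually skip the multiplication and index into the matrix directly.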

To make sense of this architecture, let's consider some simpler models that will help us understand how the skip-gram model works.