# Lab 1: Building Embeddings

In this lab, we will look at the process of constructing word embeddings from an unlabelled corpus of texts. We'll use the `NLTK` (Natural Language Toolkit) library to preprocess the text data, build simple embeddings from word co-occurrence counts, and then, as an optional extension, train embeddings with the `Word2Vec` approach.

Before we begin, we need to install the required packages. You can do this by running the following command:

In [None]:
!pip install -q -U nltk seaborn

## Step 1: Loading and pre-processing the data

Building word embeddings begins with a corpus of real text data, ideally a large one. The dataset should be representative of the kind of text the embeddings will be used on. If you're building embeddings for a specific domain, it's a good idea to use text data from that domain so that there are plenty of examples of niche or domain-specific words.

We will use the Reuters corpus from `nltk`, which is a collection of news articles. Let's go ahead and load the data to see some examples:

In [None]:
import nltk

nltk.download('reuters')
nltk.download('punkt')
nltk.download('punkt_tab')

In [None]:
from nltk.corpus import reuters

# Display the first few documents in the corpus
for file_id in reuters.fileids()[:5]:
    print(file_id, reuters.raw(file_id)[:200], '...')

In [None]:
# Display the number of documents in the corpus
len(reuters.fileids())

The key principle that word embeddings are built on is that the meaning of a word can be inferred from the context in which it appears. This is the basis of the `Word2Vec` algorithm, which learns to predict a word given its context (or vice versa) by training a neural network on a large corpus of text.

To prepare the data for training the `Word2Vec` model, we need to perform the following steps:

1. Tokenize the text: Split the text into individual words (or tokens).
2. Remove punctuation and special characters.
3. Convert all words to lowercase.

This way, we can ensure that the model learns to treat words like "hello", "Hello", and "HELLO" as the same word.

`nltk` provides some useful tools for text processing, which we'll use for this lab. In practice, `nltk` isn't the most efficient library for processing large datasets, but it's great for educational purposes. If you're curious about more efficient text processing libraries, you can look into `spaCy` and `gensim`.

In [None]:
# Step 1: Tokenize the text
from nltk.tokenize import word_tokenize

def tokenize(text):
    return word_tokenize(text)

print('Before:\n', reuters.raw('test/14826')[:200])
print('After:\n', tokenize(reuters.raw('test/14826'))[:20])

Next we will remove punctuation and special characters from the tokens. There are different ways we can approach this depending on the requirements of the task. For example, our tokenizer (which is very basic) has created 'U.S.-JAPAN' as a single token, because there's no space. If we remove punctuation, we'll end up with 'USJAPAN', which might not be what we want.

For this exercise, we will simply remove any token that contains a non-alphanumeric character. This will remove tokens like 'U.S.', but it also affects contractions: `word_tokenize` splits "can't" into 'ca' and 'n't' and "it's" into 'it' and ''s', so the 'n't' and ''s' fragments are dropped. This is far from best practice, but it's simple and will work for our purposes.
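
To make the effect of this filter concrete, here is a minimal sketch that applies `str.isalnum()` to a hand-written token list. The tokens are illustrative assumptions, not actual output from the tokenizer.

```python
# Hand-written tokens for illustration only (not real tokenizer output)
example_tokens = ["Exports", "rose", "3.5", "pct", ",", "U.S.", "officials", "said", "."]

# Keep only tokens made up entirely of letters and digits
kept = [token for token in example_tokens if token.isalnum()]
dropped = [token for token in example_tokens if not token.isalnum()]

print("Kept:   ", kept)
print("Dropped:", dropped)
```

Note that decimal numbers such as '3.5' are dropped along with the punctuation, because the '.' is not alphanumeric.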

In [None]:
# Step 2: Remove punctuation and special characters

def remove_punctuation(tokens):
    return [word for word in tokens if word.isalnum()]

tokens = tokenize(reuters.raw('test/14826'))
print('Before:\n', tokens[:20])
print('After:\n', remove_punctuation(tokens)[:20])

Finally, we will convert all tokens to lowercase.

In [None]:
# Step 3: Convert all words to lowercase

def to_lowercase(tokens):
    return [word.lower() for word in tokens]

tokens = remove_punctuation(tokenize(reuters.raw('test/14826')))
print('Before:\n', tokens[:20])
print('After:\n', to_lowercase(tokens)[:20])

## Exercise: Combining the preprocessing steps

Now that we have defined the three preprocessing steps, let's combine them into a single function that takes a text string as input and returns a list of preprocessed tokens.

Complete the function `preprocess_text` below to combine the three steps:

1. Tokenize the text.
2. Remove punctuation and special characters.
3. Convert all words to lowercase.

The function should return a list of preprocessed tokens, where each token is a string.

In [None]:
def preprocess_text(text):
    # Complete the function
    raise NotImplementedError("Combine tokenize, remove_punctuation and to_lowercase here")

# Test the function on a sample text
sample_text = reuters.raw('test/14826')
preprocess_text(sample_text)[:20]
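
If you get stuck, the sketch below shows one possible way to chain the three steps. It is not the only valid solution. The names `regex_tokenize` and `preprocess_text_sketch` are hypothetical, and the regex tokenizer is only a rough stand-in for `word_tokenize` so that the example runs without any downloads.

```python
import re

def regex_tokenize(text):
    # Hypothetical stand-in for nltk's word_tokenize: runs of word characters
    # become tokens, and every other non-space character is its own token
    return re.findall(r"\w+|[^\w\s]", text)

def preprocess_text_sketch(text, tokenizer=regex_tokenize):
    tokens = tokenizer(text)                                 # Step 1: tokenize
    tokens = [token for token in tokens if token.isalnum()]  # Step 2: drop non-alphanumeric tokens
    return [token.lower() for token in tokens]               # Step 3: lowercase

print(preprocess_text_sketch("Hello, World! HELLO again."))
```

In the notebook itself, the same structure should work with `tokenizer=word_tokenize`, or by calling the `tokenize`, `remove_punctuation` and `to_lowercase` functions defined earlier.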

There are many other preprocessing steps that can be applied to text data, such as:

- Removing stopwords (common words like "the", "a", "is", etc.).
- Stemming or lemmatization to reduce words to their base form (e.g. "running" -> "run", "raised" -> "raise").
- Removing rare words or words that appear too frequently.
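
As a rough illustration of two of these steps, the sketch below removes stopwords and rare words from a few toy documents. The stopword list, the helper names and the `min_count` threshold are all assumptions made for this example. In practice you would more likely use a curated stopword list, such as the one that ships with `nltk`.

```python
from collections import Counter

# Tiny hand-picked stopword list, for illustration only
STOPWORDS = {"the", "a", "is", "of", "in"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    return [token for token in tokens if token not in stopwords]

def remove_rare_words(documents, min_count=2):
    # Count every token across all documents, then drop tokens seen fewer than min_count times
    counts = Counter(token for doc in documents for token in doc)
    return [[token for token in doc if counts[token] >= min_count] for doc in documents]

docs = [
    ["the", "price", "of", "oil", "rose"],
    ["oil", "is", "a", "commodity"],
    ["the", "price", "fell"],
]
docs = [remove_stopwords(doc) for doc in docs]
print(docs)
print(remove_rare_words(docs, min_count=2))
```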

## Building a vocabulary

Now that we have a function to preprocess text data, we can use it to build a vocabulary of words from the Reuters corpus. The vocabulary is simply the set of unique words that appear in the corpus. It is rarely worth including very rare words in the vocabulary, as we won't have enough examples to learn good embeddings for them. In our case, we also want to speed things up down the line, so we will limit ourselves to the 1000 most common words in the corpus (this is a really small vocabulary!).

In [None]:
VOCAB_SIZE = 1000

# Build the vocabulary
from collections import Counter
from tqdm import tqdm

def build_vocabulary(corpus):
    word_counts = Counter()
    for file_id in tqdm(corpus.fileids(), total=len(corpus.fileids())):
        text = corpus.raw(file_id)
        tokens = preprocess_text(text)
        word_counts.update(tokens)
    return [word for word, _ in word_counts.most_common(VOCAB_SIZE)]

vocabulary = build_vocabulary(reuters)

print(f"First ten words in the vocabulary: {vocabulary[:10]}")
print(f"Last ten words in the vocabulary: {vocabulary[-10:]}")

## Step 2: Building a Co-occurrence Matrix

The next step in building word embeddings is to construct a co-occurrence matrix. This matrix will have a row and column for each word in the vocabulary, and each cell will contain the number of times the corresponding words appear together in a window of a fixed size.

The window size is a hyperparameter that determines how many words before and after the target word are considered to be its context. A larger window tends to capture broader, topic-level similarity, while a smaller window keeps the context more specific to the words immediately around the target. A window size of 2-10 is common for small to medium-sized datasets.
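
To see what the window size does in practice, here is a small sketch that extracts the context of one word from a toy sentence for a few window sizes. The sentence and the `context_window` helper are made up for illustration.

```python
def context_window(tokens, position, window_size):
    # Words up to window_size positions before and after the target, excluding the target itself
    start = max(position - window_size, 0)
    return tokens[start:position] + tokens[position + 1:position + window_size + 1]

sentence = ["oil", "prices", "rose", "sharply", "on", "monday"]
target_position = 2  # the word "rose"

for window_size in (1, 2, 5):
    print(window_size, context_window(sentence, target_position, window_size))
```

With a window size of 5, every other word in this short sentence counts as context for "rose".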

To build the co-occurrence matrix, we need to perform the following steps:

1. Create a mapping from words to indices in the vocabulary: each word will have a unique index that corresponds to its row and column in the matrix.
2. Iterate over the corpus and count co-occurrences of words within the window.
3. Construct a sparse matrix from the co-occurrence counts.
4. Normalize the matrix to account for the frequency of individual words.

Let's begin by creating our mapping from words to indices. This is fairly straightforward, as we can use the index of the word in the vocabulary list as its index in the matrix. So for example, the word at index 0 in the vocabulary ("the") will have index 0 in the matrix, and so on.

In [None]:
# Step 1: Create a mapping from words to indices in the vocabulary

word_to_index = {word: i for i, word in enumerate(vocabulary)}

Next, we will iterate over the corpus and count co-occurrences of words within a window of size `window_size`. To do this, we will use a `defaultdict` of `Counter` objects, which allows us to efficiently count the co-occurrences of words in the context of each target word.

For each word in the corpus, we will consider the words within a window of size `window_size` before and after the target word. We will increment the co-occurrence counts for the target word and the context words in the `co_occurrence_counts` dictionary.

In [None]:
from collections import defaultdict

def build_co_occurrence_matrix(texts, vocabulary, word_to_index, window_size=2):
    co_occurrence_counts = defaultdict(Counter)
    for text in tqdm(texts, total=len(texts)):
        for i, target_word in enumerate(text):
            # Look the word up in the dict (constant time) rather than scanning the vocabulary list
            if target_word in word_to_index:
                target_index = word_to_index[target_word]
                context = text[max(i - window_size, 0):i] + text[i + 1:i + window_size + 1]
                for context_word in context:
                    if context_word in word_to_index:
                        context_index = word_to_index[context_word]
                        co_occurrence_counts[target_index][context_index] += 1
                        
    return co_occurrence_counts

# To ease our understanding, let's begin by building a co-occurrence matrix for just one document

sample_text = preprocess_text(reuters.raw('test/14826'))

co_occurrence_counts = build_co_occurrence_matrix([sample_text], vocabulary, word_to_index, window_size=2)

# Display the co-occurrence counts for the word "the"
for index, count in co_occurrence_counts[word_to_index['the']].items():
    print(vocabulary[index], count)

Now that we have the co-occurrence counts for a single document, we can extend this to the entire corpus. We will build the co-occurrence matrix for the entire Reuters corpus, using a window size of 2. This will take a moment to run.

In [None]:
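# Preprocess every document in the corpus, then count co-occurrences across all of them
texts = [preprocess_text(reuters.raw(file_id)) for file_id in reuters.fileids()]
co_occurrence_counts = build_co_occurrence_matrix(texts, vocabulary, word_to_index, window_size=2)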

In [None]:
# Display the co-occurrence counts for the word "the"
for index, count in co_occurrence_counts[word_to_index['the']].items():
    print(vocabulary[index], count)

## Step 3: Constructing a Sparse Matrix

The co-occurrence matrix is typically very sparse, as most pairs of words never appear together in the corpus. To save memory and computation time, we can store the matrix in a sparse format. We will use the `csr_matrix` class from the `scipy` library, which is an efficient way to store sparse matrices.

Note that this is a very small example, so the matrix will not be very sparse. For larger datasets, the sparsity will be much higher, and so the savings from this step will be more significant.
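
One way to check how sparse the matrix actually is would be to compute its density, meaning the fraction of cells that hold a non-zero count. The sketch below does this for a nested counts dictionary shaped like `co_occurrence_counts`. The `matrix_density` helper and the toy counts are assumptions made for this example.

```python
from collections import Counter, defaultdict

def matrix_density(co_occurrence_counts, vocabulary_size):
    # Number of stored (non-zero) cells divided by the total number of cells
    non_zero_cells = sum(len(counter) for counter in co_occurrence_counts.values())
    return non_zero_cells / (vocabulary_size * vocabulary_size)

# Toy counts for a hypothetical 4-word vocabulary
toy_counts = defaultdict(Counter)
toy_counts[0][1] += 3
toy_counts[1][0] += 3
toy_counts[2][3] += 1
toy_counts[3][2] += 1

print(matrix_density(toy_counts, vocabulary_size=4))  # 4 non-zero cells out of 16
```

In the notebook, the same function could be called as `matrix_density(co_occurrence_counts, VOCAB_SIZE)`.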

In [None]:
from scipy.sparse import csr_matrix

def build_sparse_matrix(co_occurrence_counts, vocabulary_size):
    rows, cols, data = [], [], []
    for i, counter in co_occurrence_counts.items():
        for j, count in counter.items():
            rows.append(i)
            cols.append(j)
            data.append(count)
    return csr_matrix((data, (rows, cols)), shape=(vocabulary_size, vocabulary_size))

co_occurrence_matrix = build_sparse_matrix(co_occurrence_counts, VOCAB_SIZE)

In [None]:
# Display the shape of the co-occurrence matrix
co_occurrence_matrix.shape

## Step 4: Normalizing the Matrix

The final step in building the co-occurrence matrix is to normalize it to account for the frequency of individual words. Here we do this by dividing each cell in the matrix by the sum of the row in which it appears, so that every non-empty row sums to 1. This ensures that words that appear more frequently in the corpus do not dominate the embeddings.
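
As a small worked example of row normalization, the sketch below applies it to a toy 3x3 count matrix stored as plain Python lists. The counts and the `normalize_rows` helper are made up for illustration.

```python
def normalize_rows(rows):
    normalized = []
    for row in rows:
        total = sum(row)
        # Leave all-zero rows untouched to avoid dividing by zero
        normalized.append([value / total for value in row] if total else list(row))
    return normalized

toy_rows = [
    [0, 8, 2],  # a frequent word: 10 co-occurrences in total
    [1, 0, 1],  # a rarer word: 2 co-occurrences in total
    [0, 0, 0],  # a word that never co-occurs with anything
]

for row in normalize_rows(toy_rows):
    print(row)
```

After normalization, the frequent and the rare word are on the same scale, even though their raw counts differ by a factor of five.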

In [None]:
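def normalize_matrix(matrix):
    # Sum of each row, as a column vector of shape (n, 1)
    row_sums = matrix.sum(axis=1)
    # Avoid division by zero for words with no recorded co-occurrences
    row_sums[row_sums == 0] = 1
    # Scale each row by the reciprocal of its sum, and keep the result in CSR format
    return csr_matrix(matrix.multiply(1.0 / row_sums))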

co_occurrence_matrix = normalize_matrix(co_occurrence_matrix)

Now we have a normalized co-occurrence matrix where the value in cell (i, j) is the fraction of word i's context words that are word j. We can read this as an estimate of how likely we are to see word j near word i. What's interesting to note is that this is already sufficient to get some useful information about the relationships between words.

Remember, the key principle behind word embeddings is that similar words should have similar neighbours. With that in mind, we can estimate the similarity of two words by comparing their co-occurrence vectors. We can use the cosine similarity metric for this, which measures the cosine of the angle between two vectors. In general it ranges from -1 (opposite directions) to 1 (same direction). Our co-occurrence vectors contain no negative values, so here the similarity will be close to 1 for words with similar contexts and close to 0 for words that share no context.
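
As a quick illustration of the metric itself, the sketch below computes cosine similarity for a few toy vectors using only the standard library. The vectors and the `cosine` helper are made up for this example.

```python
import math

def cosine(u, v):
    # Dot product divided by the product of the two vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1, 2, 0], [2, 4, 0]))    # same direction: approximately 1
print(cosine([1, 0, 0], [0, 1, 0]))    # no shared non-zero components: 0
print(cosine([1, 2, 0], [-1, -2, 0]))  # opposite direction: approximately -1
```

Note that scaling a vector does not change the result. This is why cosine similarity is a good fit for comparing words whose raw counts differ a lot.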

In [None]:
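import numpy as np

def cosine_similarity(vec1, vec2):
    # Cosine similarity is the dot product divided by the product of the vector norms
    dot = vec1.dot(vec2.T).toarray()[0, 0]
    norm1 = np.sqrt(vec1.multiply(vec1).sum())
    norm2 = np.sqrt(vec2.multiply(vec2).sum())
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm1 * norm2)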

def estimate_similarity(word1, word2, matrix, word_to_index):
    index1, index2 = word_to_index[word1], word_to_index[word2]
    vec1, vec2 = matrix.getrow(index1), matrix.getrow(index2)
    return cosine_similarity(vec1, vec2)

estimate_similarity('economy', 'economic', co_occurrence_matrix, word_to_index)

In [None]:
estimate_similarity('economy', 'canada', co_occurrence_matrix, word_to_index)

...okay, maybe these aren't the most amazing results, but we'll get there! The embeddings we've built so far are very basic, and there are many ways to improve them. Some common techniques include:

- Using larger vocabularies and more data.
- Tuning hyperparameters like the window size and the number of dimensions in the embeddings.
- Applying more advanced text processing techniques.
- Using more sophisticated algorithms like GloVe or FastText.

The last step we will apply to our own word embeddings is dimensionality reduction. Since our current matrix is of size (VOCAB_SIZE, VOCAB_SIZE), you can imagine that this quickly becomes intractable for large vocabularies. We can use techniques like PCA or t-SNE to reduce the dimensionality of the embeddings while preserving the relationships between words.

In [None]:
from sklearn.decomposition import PCA

def reduce_dimensions(matrix, n_components=2):
    pca = PCA(n_components=n_components)
    return pca.fit_transform(matrix.toarray())

embeddings_2d = reduce_dimensions(co_occurrence_matrix, n_components=2)

Congratulations! We now have a pair of values that abstractly represents the meaning of each word in the vocabulary. We can plot these values on a 2D graph to visualize the relationships between words. Let's do that now.

In [None]:
import matplotlib.pyplot as plt

def plot_embeddings(embeddings, vocabulary):
    plt.figure(figsize=(10, 10))
    for i, word in enumerate(vocabulary):
        x, y = embeddings[i]
        plt.plot(x, y, 'bo')
        plt.text(x, y, word, fontsize=6)
    plt.show()
    
plot_embeddings(embeddings_2d, vocabulary)

## Extension: Building Word Embeddings with Word2Vec in Gensim

If you have time, you can try building word embeddings using the `Word2Vec` algorithm from the `gensim` library. This is a more advanced and efficient approach to building embeddings, and it can capture more complex relationships between words. The Word2Vec process looks something like this:

1. Tokenize and preprocess the text data.
2. Create a training corpus of (context, target) word pairs by sliding a window over the text.
3. Build a neural network that learns to predict the target word from its context (the CBOW variant) or the context words from the target word (the skip-gram variant).
4. Train the neural network on the training corpus.
5. Extract the weights of the hidden layer as the word embeddings.
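
To make step 2 more concrete, here is a minimal sketch of how (context, target) pairs could be generated from a tokenized sentence. This only illustrates the idea and is not how `gensim` builds its training data internally. The `make_training_pairs` helper is hypothetical.

```python
def make_training_pairs(tokens, window_size=2):
    pairs = []
    for i, target in enumerate(tokens):
        start = max(i - window_size, 0)
        context = tokens[start:i] + tokens[i + 1:i + window_size + 1]
        # One (context, target) pair per context word
        pairs.extend((context_word, target) for context_word in context)
    return pairs

print(make_training_pairs(["oil", "prices", "rose"], window_size=1))
```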

Here's a simple example of how to build word embeddings using Word2Vec in `gensim`:

In [None]:
!pip install -q -U gensim

In [None]:
from gensim.models import Word2Vec

# Tokenize and preprocess the text data
texts = [preprocess_text(reuters.raw(file_id)) for file_id in reuters.fileids()]

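# Train a Word2Vec model
# sg=1 selects the skip-gram variant (sg=0 would select CBOW);
# min_count=1 keeps every word, including those that appear only once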
model = Word2Vec(texts, vector_size=100, window=5, min_count=1, sg=1)

# Get the word embeddings
word_vectors = model.wv

# Display the most similar words to "economy"
print(f'Most similar to "economy": {word_vectors.most_similar("economy")}')

# Display the similarity between "economy" and "economic"
print(f'Similarity between "economy" and "economic": {word_vectors.similarity("economy", "economic")}')

# Save the word vectors to a file
word_vectors.save('word_vectors.kv')