### Web Intelligence - Exercise 10

In this exercise, we will explore several prominent techniques for learning word representations. First, we will implement a word embedding model based on matrix decomposition, **Latent Semantic Analysis (LSA)**. LSA uses matrix factorization to capture semantic relationships in text, which makes it a good introduction to the underlying principles of word representation.

Next, we will implement a neural-based approach, the **SkipGram** model of the **Word2Vec** framework. The SkipGram architecture is designed to predict surrounding context words given a target word, effectively learning word representations that capture meaningful relationships between words in a corpus.

We will work with the [CMU Book Summary](https://www.cs.cmu.edu/~dbamman/booksummaries.html) dataset, which consists of plot summaries for 16,559 books extracted from Wikipedia. Throughout the exercise, we will analyze and compare these methods, discussing their advantages, limitations, and practical applications in natural language processing tasks.

**Question 1.** In this exercise, we will implement Latent Semantic Analysis (LSA).

In [None]:
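import os.path

import numpy as np   # used below for the SVD-based word embeddings
import pandas as pd  # used below for the embedding DataFrames
from tqdm.notebook import tqdm
# For loading the dataset
from datasets import load_dataset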

# Preprocessing packages
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

Load the dataset and preprocess the book summaries.
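
As a rough illustration of the cleaning steps requested below, here is a minimal, dependency-free sketch. It is not the intended NLTK solution: `TINY_STOP_WORDS`, `clean_text`, and the `lemmatize` argument are hypothetical names, whitespace splitting stands in for `word_tokenize`, and the default `lemmatize` is the identity function where the exercise expects something like `WordNetLemmatizer().lemmatize`.

```python
import re

# Hypothetical stand-in for set(stopwords.words("english")); far from complete.
TINY_STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "was"}

def clean_text(text, stop_words=TINY_STOP_WORDS, lemmatize=lambda w: w):
    """Lowercase, strip non-letters, tokenize, drop stopwords, lemmatize."""
    text = text.lower()                                  # lowercase
    text = re.sub(r"[^a-z\s]", " ", text)                # keep ASCII letters and whitespace only
    words = text.split()                                 # whitespace tokenization (assumption)
    words = [w for w in words if w not in stop_words]    # remove stopwords
    words = [lemmatize(w) for w in words]                # lemmatize (identity by default)
    return " ".join(words)

example = clean_text("The King's army marched to the castle in 1066!")
print(example)
```

Note that replacing the apostrophe with a space leaves a stray token `s` behind; whether to drop such one-letter tokens is a design choice the exercise leaves open.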

In [None]:
# Load the CMU Book Summary Corpus dataset and get all the summaries.
ld = load_dataset("textminr/cmu-book-summaries")['train']
summaries = [data['summary'] for data in ld]

In [None]:
# Complete the following preprocessing steps:
stop_words = set(stopwords.words("english"))

def preprocess_texts(texts, stop_words):
    
    # Initialize the lemmatizer
    lemmatizer = WordNetLemmatizer()
    
    cleaned_texts = []
    for text in tqdm(texts, desc="Pre-processing text"):
        
        # 1. Lowercase the text
        text = 
        
        # 2. Remove punctuation and special characters
        text = 
        
        # 3. Tokenize the text
        words = 
        
        # 4. Remove stopwords
        words = 
        
        # 5. Lemmatize the tokens
        words = 
        
        # Join tokens back into a single string
        cleaned_texts.append(" ".join(words))
        
    return cleaned_texts

# Run the preprocessing function on the summaries
corpus = preprocess_texts(summaries, stop_words=stop_words)

In [None]:

# You can compare the original and preprocessed summaries to see the difference.
print(f"Original: {summaries[0][:100]}")
print(f"Preprocessed: {corpus[0][:100]}")

Construct the Document-Term Matrix

In [None]:
# Note: you may restrict the vocabulary to the top 'N' words ordered by term frequency across the corpus.

vocab_size = 


Construct the Term Frequency-Inverse Document Frequency (TF-IDF) Weighting
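
A minimal sketch of this step, assuming the counts are re-weighted with scikit-learn's `TfidfTransformer` (one option among several; `TfidfVectorizer` would combine counting and weighting). The `counts` array is a made-up 3 x 3 example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Toy document-term counts: 3 documents x 3 terms (assumption).
counts = np.array([
    [1, 1, 0],
    [0, 2, 1],
    [3, 0, 0],
])

transformer = TfidfTransformer()                 # defaults: smooth_idf=True, norm="l2"
toy_tfidf = transformer.fit_transform(counts)    # sparse matrix, same shape as counts

print(toy_tfidf.toarray().round(3))
```

With the default settings each document row is scaled to unit Euclidean length, and terms that occur in many documents receive a smaller weight than rare ones.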

In [None]:
# Fit and transform the document-term matrix to the TF-IDF matrix

tfidf_matrix = 


Perform Singular Value Decomposition (SVD) and Dimensionality Reduction
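
The following sketch shows the truncation on a small dense matrix with `np.linalg.svd`. It assumes a documents x terms layout; `X`, `k_toy`, and the `_k` names are illustrative. For the full TF-IDF matrix, a sparse truncated solver such as `scipy.sparse.linalg.svds` is likely more practical, but the order in which it returns singular values should be checked.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 5))          # toy stand-in for the TF-IDF matrix (documents x terms)
k_toy = 2                       # number of latent dimensions to keep

# Thin SVD: X = U_full @ diag(S_full) @ Vh_full, singular values sorted in descending order.
U_full, S_full, Vh_full = np.linalg.svd(X, full_matrices=False)

# Keep the k largest singular values and the matching singular vectors.
U_k, S_k, Vh_k = U_full[:, :k_toy], S_full[:k_toy], Vh_full[:k_toy, :]

word_vecs = (np.diag(S_k) @ Vh_k).T      # one row per term, k columns
same_vecs = X.T @ U_k                    # the same vectors, up to floating-point error
print(word_vecs.shape, np.allclose(word_vecs, same_vecs))
```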

In [None]:
k = 

U, S, Vh = 

# Print the singular values
print("\nTop-k singular values:\n", S)

# Define the word embeddings.
# With X = U @ diag(S) @ Vh (documents x terms), each column of diag(S) @ Vh is a word embedding,
# i.e. each row of (diag(S) @ Vh).T = Vh.T @ diag(S).
# Since diag(S) @ Vh = U.T @ X, the rows of X.T @ U give the same word embeddings.
word_embeddings = np.dot(np.diag(S), Vh).T


Generate Word Embeddings and Analyze Similarity Between Words
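
As a hedged sketch of the similarity analysis, the two helpers below compute cosine similarity on a DataFrame whose index holds the words. The names `cosine_sim`, `nearest_words`, and `toy_df` are hypothetical, and the 2-d vectors are invented; the exercise's own functions may be structured differently (for example, around `sklearn.metrics.pairwise.cosine_similarity`).

```python
import numpy as np
import pandas as pd

# Toy embeddings indexed by word (assumption).
toy_df = pd.DataFrame(
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]],
    index=["king", "queen", "dog", "cat"],
)

def cosine_sim(word1, word2, df):
    """Cosine of the angle between the rows of df labelled word1 and word2."""
    a, b = df.loc[word1].to_numpy(), df.loc[word2].to_numpy()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest_words(word, df, top_n=5):
    """Words most similar to `word`, excluding the word itself."""
    vectors = df.to_numpy()
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)  # assumes no all-zero rows
    sims = pd.Series(unit @ unit[df.index.get_loc(word)], index=df.index)
    return sims.drop(word).nlargest(top_n).index.tolist()

print(round(cosine_sim("king", "queen", toy_df), 3))
print(nearest_words("king", toy_df, top_n=2))
```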

In [None]:
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
import seaborn as sns

words_to_explore = [
    'king', 'queen', 'woman', 'man', 'boy', 'daughter',
    #'dog', 'cat', 'horse', 'elephant', 'fish', 'bird', 'lizard', 'snake',
    #'doctor', 'nurse', 'scientist', 'teacher', 'engineer', 'artist', 'musician', 'writer',
    #'denmark', 'sweden', 'france', 'germany', 'spain', 'italy', 'portugal', 'turkey',
]




In [None]:
# Analyzing similarity between words
from sklearn.metrics.pairwise import cosine_similarity

def cosine_similarity_terms(word1, word2, word_embeddings_df):
    
    return 

# Find top 5 similar words for a given word
def top_similar_words(word, word_embeddings_df, top_n=5):

    return 

# 'terms' must hold the vocabulary words in the column order of the document-term matrix
word_embeddings_df = pd.DataFrame(word_embeddings, index=terms)

word1, word2 = "king", "queen"
# Test cosine similarity function
print(f"Cosine similarity between '{word1}' and '{word2}': {cosine_similarity_terms(word1, word2, word_embeddings_df)}")
word = "king"
print(f"Top 5 similar words to '{word}': {top_similar_words(word, word_embeddings_df, top_n=5)}")

Compare Book Summaries

In [None]:
# Define the book summary vectors
doc_embedding = U

# Compute cosine similarity between documents
doc_similarities = cosine_similarity(doc_embedding)
doc_similarities_df = pd.DataFrame(
    doc_similarities, columns=[f"Summary {i+1}" for i in range(len(corpus))], index=[f"Summary {i+1}" for i in range(len(corpus))])

#print("\nSummary Similarities (Cosine Similarity):\n", doc_similarities_df)

**Question 2.** In this exercise, you will explore word embeddings pre-trained on the [Google News dataset](https://code.google.com/archive/p/word2vec/) and familiarize yourself with the **Gensim** package for handling word embeddings. You will also investigate analogy relationships between words based on their embeddings.

In [None]:
import os
import numpy as np
import gensim
from gensim.test.utils import datapath
from sklearn.decomposition import PCA


Download the pre-trained word embeddings from the [Google News dataset](https://code.google.com/archive/p/word2vec/) page and use the *Gensim* package to load and work with them.

In [None]:
# Define the path of the pretrained model
googlenews_vectors = 
gensim_model = gensim.models.KeyedVectors.load_word2vec_format(googlenews_vectors, binary = True)

Use the following example list of words: 'king', 'queen', 'woman', 'man', 'fish', 'bird', 'snake', 'elephant' (or construct your own list of 6 to 10 meaningful and diverse words).

Reduce the dimensionality of the word embeddings to $2D$ space using PCA and visualize the words in the new latent space.
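
The projection itself can be sketched as follows, using random vectors in place of real embeddings so that the snippet runs without the Google News file. `toy_words`, `toy_vectors`, and `coords` are illustrative names, and the 50 dimensions are arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
toy_words = ["king", "queen", "woman", "man", "fish", "bird"]
toy_vectors = rng.normal(size=(len(toy_words), 50))   # random stand-ins for real embeddings

# Project the 50-dimensional vectors onto their first two principal components.
coords = PCA(n_components=2).fit_transform(toy_vectors)

for word, (x, y) in zip(toy_words, coords):
    print(f"{word:>6}: ({x:+.2f}, {y:+.2f})")
```

With a loaded model, the rows could instead come from something like `np.array([model[w] for w in words])`, and the resulting coordinates can be drawn with `plt.scatter` and labelled with `plt.annotate`.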

In [None]:

def display_pca_scatterplot(model, words=None, sample=0, save=False, file_path='scatterplot.png'):
    


In [None]:
words_to_explore= [
    'king', 'queen', 'woman', 'man', 'fish', 'bird', 'snake', 'elephant',
]
display_pca_scatterplot(model=gensim_model, words=words_to_explore)


Investigate the relationships between the words based on their embeddings.
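
The idea behind analogy queries such as "king - man + woman" can be illustrated with plain vector arithmetic. This is a simplified sketch on hand-made 3-d vectors, not Gensim's implementation, which may differ in details such as vector normalization; `toy_vecs` and `analogy` are hypothetical names.

```python
import numpy as np

# Hand-made 3-d vectors (royalty, gender, animal); purely illustrative.
toy_vecs = {
    "king":  np.array([1.0,  1.0, 0.0]),
    "queen": np.array([1.0, -1.0, 0.0]),
    "man":   np.array([0.0,  1.0, 0.0]),
    "woman": np.array([0.0, -1.0, 0.0]),
    "fish":  np.array([0.0,  0.0, 1.0]),
}

def analogy(positive, negative, vecs):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    target = sum(vecs[w] for w in positive) - sum(vecs[w] for w in negative)
    scores = {}
    for word, vec in vecs.items():
        if word in positive or word in negative:
            continue  # skip the query words themselves
        scores[word] = float(target @ vec / (np.linalg.norm(target) * np.linalg.norm(vec)))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranking = analogy(positive=["king", "woman"], negative=["man"], vecs=toy_vecs)
print(ranking)
```

Excluding the query words matters: without that step, the nearest neighbour of the target vector is often one of the inputs.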

In [None]:
gensim_model.most_similar(positive=["king", "woman"],negative=["man"])


Assess the overall accuracy of the embeddings and analyze strengths and weaknesses across analogy types (syntactic vs. semantic).

In [None]:
accuracy = gensim_model.evaluate_word_analogies(datapath('questions-words.txt'))

# The first entry stores the overall evaluation score on the entire evaluation set
print(f"Overall evaluation score: {accuracy[0]}")
for current_dict in accuracy[1]:
    correct_count, incorrect_count = len(current_dict['correct']), len(current_dict['incorrect'])
    section_accuracy = correct_count / float(correct_count + incorrect_count)
    print(
        f"Section: {current_dict['section']} - Correct: {correct_count} - Incorrect: {incorrect_count} - Accuracy: {section_accuracy}"
    )

**Question 3.** In this exercise, we will implement the **SkipGram** model of the **Word2Vec** framework.

In [None]:
# Import the necessary packages
import torch
from torch.utils.data import DataLoader
from tqdm.notebook import tqdm

Prepare the training data
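
The data preparation can be sketched on a toy corpus as below: count the words, keep those that occur more than a threshold, and slide a window over each sentence. All `toy_` names and `make_pairs` are hypothetical, and discarding a pair when either word is out of vocabulary is one reading of the task; filtering the sentence before windowing would be another.

```python
from collections import Counter

# Toy tokenized corpus (assumption); each inner list is one document.
toy_sentences = [
    ["king", "queen", "castle", "king"],
    ["queen", "castle", "dragon"],
]
toy_min_count = 1   # keep words that appear more than once
toy_window = 1      # one word on each side of the center word

word_counts = Counter(w for sent in toy_sentences for w in sent)
toy_vocab = sorted(w for w, c in word_counts.items() if c > toy_min_count)
toy_word2id = {w: i for i, w in enumerate(toy_vocab)}

def make_pairs(sentences, window_size, word2id):
    """Return [center_id, context_id] pairs, skipping out-of-vocabulary words."""
    pairs = []
    for sent in sentences:
        for i, center in enumerate(sent):
            if center not in word2id:
                continue
            lo, hi = max(0, i - window_size), min(len(sent), i + window_size + 1)
            for j in range(lo, hi):
                if j != i and sent[j] in word2id:
                    pairs.append([word2id[center], word2id[sent[j]]])
    return pairs

toy_pairs = make_pairs(toy_sentences, toy_window, toy_word2id)
print(toy_vocab)
print(toy_pairs)
```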

In [None]:
corpus =  # Define the corpus (preprocessed summaries)


In [None]:
min_count = 5  # Discard words that appear 'min_count' times or fewer
window_size =  # You can choose small values to reduce computational cost, but it might affect the performance of the model

# Define the vocabulary, i.e. set of all distinct words that appear more than 'min_count' times in the corpus.
vocab = 
vocab_size = len(vocab) 
# Construct the dictionaries that map words to unique ids and vice versa.
word2Id = {word: idx for idx, word in enumerate(vocab)}
id2Word = {idx: word for word, idx in word2Id.items()}



In [None]:
# Generate skip-gram training (center, context) word pairs.

# You can implement the following function to generate the center-context word pairs.
# If you wish, you can also modify the argument list.
def generate_center_context_pairs(sentences, window_size, word2Id, vocab):
    '''
    Given a list of sentences and a window size, generate the (center, context) word pairs.
    Note that a pair must be discarded if its center or context word is not in 'vocab',
    i.e. if it does not appear more than 'min_count' times in the corpus.
    :param sentences: a list of lists of words
    :param window_size: window size
    :param word2Id: a dictionary that maps words to ids
    :param vocab: a set of unique words
    :return: a list of centerId-contextId word pairs
    '''
    data = []

    return data

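# 'sentences' is a list of token lists; this assumes 'corpus' holds space-joined token strings
sentences = [doc.split() for doc in corpus]
pairs = generate_center_context_pairs(sentences=sentences, window_size=window_size, word2Id=word2Id, vocab=vocab)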
# Convert the list to a torch tensor
pairs = torch.as_tensor(pairs, dtype=torch.long)

In [None]:
# You can examine the training data
print(f"Vocab size: {vocab_size}")
print(f"Number of center-context pairs: {pairs.shape[0]}")

Implement the SkipGram Model with Softmax Function
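
A minimal sketch of a softmax skip-gram in PyTorch, assuming two embedding tables (one for center words, one for context words) whose dot products serve as logits. `TinySkipGram` is a hypothetical name and omits the `get_center_embs` and `get_context_embs` helpers that the exercise's `SkipGramModel` expects.

```python
import torch

class TinySkipGram(torch.nn.Module):
    """Softmax skip-gram: score every vocabulary word as a context of the center word."""

    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.center = torch.nn.Embedding(vocab_size, embedding_dim)   # center-word vectors
        self.context = torch.nn.Embedding(vocab_size, embedding_dim)  # context-word vectors

    def forward(self, center_word):
        v = self.center(center_word)          # (batch, dim)
        return v @ self.context.weight.T      # (batch, vocab) unnormalized logits

torch.manual_seed(0)
toy_model = TinySkipGram(vocab_size=10, embedding_dim=4)
centers = torch.tensor([1, 3, 5])
contexts = torch.tensor([2, 4, 6])

logits = toy_model(centers)
loss = torch.nn.CrossEntropyLoss()(logits, contexts)   # applies log-softmax internally
loss.backward()
print(logits.shape, float(loss))
```

Because the logits cover the whole vocabulary, each step costs time proportional to the vocabulary size, which is one reason to keep `min_count` and the window small here.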

In [None]:
# Define the Skip-gram Model with Softmax
class SkipGramModel(torch.nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(SkipGramModel, self).__init__()
        # Complete the implementation
        
    def forward(self, center_word):
    
    def get_center_embs(self, center_words = None):
    
    def get_context_embs(self, context_words):
        

Set the parameters required for training the model, such as the learning rate, embedding dimension, number of epochs, and batch size.
Note that the chosen values affect both how well the optimization converges and how long training takes.

In [None]:
lr = 
epochs_num = 
embedding_dim = 100
batch_size = 

# Initialize model, loss, and optimizer
model = SkipGramModel(vocab_size, embedding_dim)
loss_function = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

# Set the platform that we will use for training: CUDA GPU, Apple Silicon (MPS), or CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
model.to(device)

Train the model

In [None]:
# Build a data loader for pairs vector in pytorch
data_loader = DataLoader(pairs, batch_size=batch_size, shuffle=True)

# Training loop
for epoch in tqdm(range(epochs_num)):
    total_loss = 0
    for batch_pairs in tqdm(data_loader, desc=f"Epoch {epoch+1} - Current batch progress"):
        # Transfer the batch data to the memory of the 'device' (i.e. gpu if used)
        batch_pairs = batch_pairs.to(device)
        # Define the center and context words
        center_word, context_word = batch_pairs[:, 0], batch_pairs[:, 1]
        
        # Forward pass
        optimizer.zero_grad()
        output = model(center_word)
        
        # Calculate loss
        loss = loss_function(output, context_word)
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
    
    # CrossEntropyLoss returns the batch mean, so average over the number of batches
    print(f"Epoch {epoch+1}, Loss: {total_loss/len(data_loader):.4f}")
        
# Transfer the model back to the cpu.
model.to('cpu')

Generate Word Embeddings and Visualize

In [None]:
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
import seaborn as sns

words_to_explore= [
    'king', 'queen', 'woman', 'man', 'boy', 'daughter', 
    'dog', 'cat', 'horse', 'elephant', 'fish', 'bird', 'lizard', 'snake',
    #'doctor', 'nurse', 'scientist', 'teacher', 'engineer', 'artist', 'musician', 'writer',
    #'denmark', 'sweden', 'france', 'germany', 'spain', 'italy', 'portugal', 'turkey',
]
selected_ids = torch.tensor([word2Id[word] for word in words_to_explore])
selected_word_embeddings = model.get_center_embs(selected_ids).detach()


# Visualize the word embeddings using PCA for dimensionality reduction to 2D



Evaluate Word Similarity

In [None]:
names = [id2Word[i] for i in range(len(id2Word))]
word_embs_df = pd.DataFrame(model.get_center_embs().detach(), index=names)

word1, word2 = "king", "queen"
# Test cosine similarity function
print(f"Cosine similarity between '{word1}' and '{word2}': {cosine_similarity_terms(word1, word2, word_embs_df)}")
word = "king"
print(f"Top 5 similar words to '{word}': {top_similar_words(word, word_embs_df, top_n=5)}")

