# Task 1: Distributional semantics

Resources that helped me choose the hyperparameters:

1. Hugo Caselles-Dupré, Florian Lesaint, Jimena Royo-Letelier. Word2Vec applied to Recommendation: Hyperparameters Matter. https://arxiv.org/pdf/1804.04212.pdf

2. Kaggle notebook: https://www.kaggle.com/code/pierremegret/gensim-word2vec-tutorial/notebook

3. Xin Rong. word2vec Parameter Learning Explained. https://arxiv.org/abs/1411.2738

4. Contextual and Non-Contextual Word Embeddings: an in-depth Linguistic Investigation. https://aclanthology.org/2020.repl4nlp-1.15.pdf

5. Does BERT Make Any Sense? Interpretable Word Sense Disambiguation with Contextualized Embeddings. https://arxiv.org/pdf/1909.10430.pdf

6. Mourad Mars. From Word Embeddings to Pre-Trained Language Models: A State-of-the-Art Walkthrough.

Please note that, as part of my final year project, I had access to Colab Pro, so the run times reported here may differ greatly from those on other hardware.

In [1]:
!pip install transformers



## Word2Vec

### Setup and Imports

In [2]:
import pandas as pd
import numpy as np
from time import time

### Data Loading

In [4]:
# Change to ./data/
base = './data/'
result_base = './data/'  # directory where the results are saved

In [5]:
reviews_df = pd.read_csv(base+'Training-dataset.csv', usecols=[0, 1, 2])

word_pairs = pd.read_csv(base+'Task-1-validation-dataset.csv', header=None, index_col=0, usecols=[0, 1, 2])
word_pairs.columns = ['word1', 'word2']

In [6]:
reviews_df.shape

(8257, 3)

In [7]:
reviews_df.head()

Unnamed: 0,ID,title,plot_synopsis
0,8f5203de-b2f8-4c0c-b0c1-835ba92422e9,Si wang ta,"After a recent amount of challenges, Billy Lo ..."
1,6416fe15-6f8a-41d4-8a78-3e8f120781c7,Shattered Vengeance,"In the crime-ridden city of Tremont, renowned ..."
2,4979fe9a-0518-41cc-b85f-f364c91053ca,L'esorciccio,Lankester Merrin is a veteran Catholic priest ...
3,b672850b-a1d9-44ed-9cff-025ee8b61e6f,Serendipity Through Seasons,"""Serendipity Through Seasons"" is a heartwarmin..."
4,b4d8e8cc-a53e-48f8-be6a-6432b928a56d,The Liability,"Young and naive 19-year-old slacker, Adam (Jac..."


In [8]:
# check there's no missing data
reviews_df.isnull().sum()

ID               0
title            0
plot_synopsis    0
dtype: int64

### Data Preprocessing


In [9]:
import spacy

In [10]:
spacy.prefer_gpu()
nlp = spacy.load("en_core_web_sm", disable=['ner', 'parser'])  # disable NER and the dependency parser for speed

def clean_text(doc):
    return ' '.join(token.lemma_.lower() for token in doc if token.is_alpha and not token.is_stop)

In [11]:
%%time
# Clean data
txt = [clean_text(doc) for doc in nlp.pipe(reviews_df['plot_synopsis'])]

CPU times: user 3min 40s, sys: 777 ms, total: 3min 41s
Wall time: 3min 43s



This code uses Gensim to detect bigrams in the cleaned text. Each synopsis is split into a list of tokens, word pairs that occur together often enough (`min_count=30`) are merged into single tokens, and the resulting vocabulary is counted so that the 10 most frequent words or bigrams can be listed.

In [12]:
from gensim.models.phrases import Phrases, Phraser

sent = [row.split() for row in txt]
phrases = Phrases(sent, min_count=30, progress_per=1000)
bigram = Phraser(phrases)
sentences = bigram[sent]

word_freq = {}
for tokens in sentences:  # do not reuse the name `sent`, which holds the token lists
    for token in tokens:
        word_freq[token] = word_freq.get(token, 0) + 1
len(word_freq)

78224

In [13]:
# Sorting and displaying the top 10 most frequent words/tokens
sorted(word_freq, key=word_freq.get, reverse=True)[:10]

['tell', 'find', 'leave', 'kill', 'man', 'go', 'take', 'come', 'say', 'try']
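
To make the bigram step more concrete, here is a minimal, dependency-free sketch of the kind of scoring that phrase detection relies on. It assumes the score from Mikolov et al. (2013), `(count(a, b) - min_count) * vocab_size / (count(a) * count(b))`, which Gensim's default scorer is understood to follow. The helper names `score_bigrams` and `merge_bigrams`, the toy sentences, and the threshold of 1.5 are all hypothetical and chosen only for illustration; this is not Gensim's actual implementation.

```python
from collections import Counter


def score_bigrams(sentences, min_count=1):
    """Score adjacent word pairs with the Mikolov et al. (2013) phrase formula."""
    unigrams = Counter(word for sentence in sentences for word in sentence)
    bigrams = Counter(pair for sentence in sentences
                      for pair in zip(sentence, sentence[1:]))
    vocab_size = len(unigrams)
    scores = {}
    for (a, b), pair_count in bigrams.items():
        if pair_count < min_count:
            continue  # pairs seen too rarely are never merged
        scores[(a, b)] = ((pair_count - min_count) * vocab_size
                          / (unigrams[a] * unigrams[b]))
    return scores


def merge_bigrams(sentence, scores, threshold):
    """Greedily join adjacent pairs whose score exceeds the threshold."""
    merged, i = [], 0
    while i < len(sentence):
        pair = tuple(sentence[i:i + 2])
        if len(pair) == 2 and scores.get(pair, 0.0) > threshold:
            merged.append("_".join(pair))
            i += 2
        else:
            merged.append(sentence[i])
            i += 1
    return merged


# Toy corpus (illustrative only, not taken from the dataset)
toy = [
    ["new", "york", "police", "arrive"],
    ["police", "leave", "new", "york"],
    ["man", "leave", "new", "york"],
    ["man", "find", "police"],
]
scores = score_bigrams(toy, min_count=1)
merged = [merge_bigrams(sentence, scores, threshold=1.5) for sentence in toy]
print(merged)
```

On this toy corpus only the pair `new york` scores above the threshold, so it becomes the single token `new_york` while all other words stay separate.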

### Model Building

In [14]:
import multiprocessing

from gensim.models import Word2Vec

In [15]:
cores = multiprocessing.cpu_count() # Count the number of cores in a computer
cores

8

I experimented with a range of values for each hyperparameter and found that the following gave the best results.

In [16]:
model = Word2Vec(min_count=5,
                 window=5,
                 vector_size=200,
                 sample=1e-5,
                 alpha=0.05,
                 min_alpha=0.0001,
                 negative=10,
                 workers=cores-1)
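
As a rough illustration of what `sample=1e-5` does, the sketch below computes the probability that a single occurrence of a word is kept during subsampling. It assumes the rule used in the original word2vec C code, `sqrt(t / f) + t / f` capped at 1, where `f` is the word's share of all tokens and `t` is the `sample` threshold; Gensim is understood to follow the same rule, but this was not checked against its source here. The function name `keep_probability` is hypothetical.

```python
import math


def keep_probability(word_fraction, sample=1e-5):
    """Probability that one occurrence of a word survives subsampling.

    word_fraction is the word's share of all tokens in the corpus
    (assumed formula: sqrt(t / f) + t / f, capped at 1).
    """
    if sample <= 0:
        return 1.0  # subsampling disabled
    ratio = sample / word_fraction
    return min(1.0, math.sqrt(ratio) + ratio)


# Words at or below the threshold are always kept; frequent words are thinned out
for fraction in (1e-2, 1e-3, 1e-4, 1e-5):
    print(f"{fraction:.0e} -> {keep_probability(fraction):.3f}")
```

With a threshold this low, a word that makes up 0.1% of the corpus keeps only about one occurrence in nine, which shifts training towards rarer, more informative words.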

In [17]:
%%time
model.build_vocab(sentences, progress_per=10000)

CPU times: user 3.53 s, sys: 14.9 ms, total: 3.55 s
Wall time: 3.54 s


### Model Training

In [18]:
%%time
model.train(sentences, total_examples=model.corpus_count, epochs=30, report_delay=1)

CPU times: user 4min 53s, sys: 1.55 s, total: 4min 55s
Wall time: 2min


(41044748, 102652230)
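
The `alpha` and `min_alpha` settings control how the learning rate changes during these 30 epochs. The sketch below assumes a simple linear decay from `alpha` to `min_alpha` over the whole run, which is how Gensim is understood to schedule it. The function name `learning_rate` is hypothetical, and the real schedule is updated per training job rather than once per epoch.

```python
def learning_rate(progress, alpha=0.05, min_alpha=0.0001):
    """Linearly interpolate from alpha to min_alpha; progress is in [0, 1]."""
    progress = min(max(progress, 0.0), 1.0)
    return max(min_alpha, alpha - (alpha - min_alpha) * progress)


epochs = 30
# Learning rate at the start of each epoch, plus the final value
schedule = [learning_rate(epoch / epochs) for epoch in range(epochs + 1)]
print(schedule[0], schedule[epochs // 2], schedule[-1])
```

A higher starting `alpha` lets early updates move the vectors quickly, while the decay towards `min_alpha` keeps late updates small.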

### Model Evaluation

`cosine_similarity` computes the cosine similarity between two vectors and returns 0 when either vector has zero magnitude. `calculate_similarity` splits each phrase into words and averages their Word2Vec vectors, skipping any word that is not in the model's vocabulary. If none of the words in a phrase is in the vocabulary, a zero vector is used instead, so the function works for phrases of any length.

In [19]:
# Function to calculate cosine similarity between two vectors
def cosine_similarity(vector1, vector2):
    dot_product = np.dot(vector1, vector2)
    norm_vector1 = np.linalg.norm(vector1)
    norm_vector2 = np.linalg.norm(vector2)
    if norm_vector1 * norm_vector2 == 0:
        return 0  # Return 0 if either vector has zero magnitude to avoid division by zero
    return dot_product / (norm_vector1 * norm_vector2)


In [20]:
def calculate_similarity(model, phrase1, phrase2):
    # Function to average word vectors for a phrase
    def phrase_vector(phrase):
        words = phrase.split()  # Split the phrase into words
        vectors = [model.wv[word] for word in words if word in model.wv.key_to_index]
        if len(vectors) > 0:
            return sum(vectors) / len(vectors)
        else:
            return np.zeros(model.vector_size)  # Return a zero vector if no words are in the model

    # Compute the average vectors for each phrase
    vector1 = phrase_vector(phrase1)
    vector2 = phrase_vector(phrase2)

    # Compute cosine similarity
    return cosine_similarity(vector1, vector2)
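
The behaviour of the two functions above can be checked on toy data without a trained model. The sketch below re-implements the same averaging and cosine logic in plain Python; the two-dimensional `toy_vectors`, the word "sergeant", and the helper names are made up for illustration and do not come from the trained Word2Vec model.

```python
import math

# Hypothetical 2-D embeddings standing in for model.wv
toy_vectors = {
    "police": [1.0, 0.0],
    "officer": [0.0, 1.0],
    "cop": [1.0, 1.0],
}


def phrase_vector(phrase, vectors, dim=2):
    """Average the vectors of known words; zero vector if none is known."""
    found = [vectors[word] for word in phrase.split() if word in vectors]
    if not found:
        return [0.0] * dim
    return [sum(column) / len(found) for column in zip(*found)]


def cosine(u, v):
    """Cosine similarity that returns 0.0 for zero-magnitude vectors."""
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v))
    if norm == 0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / norm


# Both words known: the average [0.5, 0.5] points the same way as "cop"
sim_known = cosine(phrase_vector("police officer", toy_vectors),
                   phrase_vector("cop", toy_vectors))
# "sergeant" is out of vocabulary and is skipped, leaving only "police"
sim_partial = cosine(phrase_vector("police sergeant", toy_vectors),
                     phrase_vector("police", toy_vectors))
# No known word at all: zero vector, so the similarity falls back to 0.0
sim_unknown = cosine(phrase_vector("sergeant", toy_vectors),
                     phrase_vector("cop", toy_vectors))
print(sim_known, sim_partial, sim_unknown)
```

One consequence worth keeping in mind is that a pair containing a fully out-of-vocabulary phrase always scores 0, regardless of the other phrase.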

In [21]:
word_pairs['similarity'] = word_pairs.apply(lambda row: calculate_similarity(model, row['word1'], row['word2']), axis=1)

In [22]:
# Drop the word columns
word_pairs.drop(['word1', 'word2'], axis=1, inplace=True)

# Save to CSV without a header and with the index
word_pairs.to_csv(result_base+'10931277-Task1-method-b-validation.csv', index=True, header=False)

### Test Dataset Prediction

In [23]:
%%time
# Load the new test dataset
new_test_word_pairs = pd.read_csv(base + 'Task-1-test-dataset1.csv', header=None, index_col=0, usecols=[0, 1, 2])
new_test_word_pairs.columns = ['word1', 'word2']

# Calculate similarities for the new test dataset
new_test_word_pairs['similarity'] = new_test_word_pairs.apply(lambda row: calculate_similarity(model, row['word1'], row['word2']), axis=1)

# Optionally, drop the word columns if you only need the similarity scores
new_test_word_pairs.drop(['word1', 'word2'], axis=1, inplace=True)

# Save the results to CSV
new_test_word_pairs.to_csv(result_base + '10931277-Task1-method-b.csv', index=True, header=False)

CPU times: user 11.9 ms, sys: 1 µs, total: 11.9 ms
Wall time: 472 ms


## RoBERTa

Using a pretrained language model, in this case RoBERTa, does not yield useful results for isolated words, because it is designed to produce context-dependent representations. RoBERTa's strength lies in capturing the nuances of language within sentences or phrases, which makes it less effective for single words presented without context.

The cosine similarity was very high for most word pairs, with most values above 0.99. Even so, the evaluation script reported an accuracy of 0.57, which is reasonable given how little the scores vary.

Again, since we were not allowed to use the training data to fine-tune the model or to provide context, the results were of poor quality.
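
One possible reading of the near-constant similarities, not verified in this notebook, is that all embeddings share a large common component that dominates the cosine. The toy sketch below shows the effect: adding the same large offset to three made-up vectors pushes every cosine above 0.99 while the relative ordering of the pairs is preserved. If the evaluation script depends on ordering rather than on absolute values (an assumption, since the script is not shown here), that would be consistent with a usable accuracy despite the compressed range. All vectors and names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity that returns 0.0 for zero-magnitude vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v))
    return dot / norm if norm else 0.0


def shift(v, offset):
    """Add the same offset vector to an embedding."""
    return [x + o for x, o in zip(v, offset)]


# Three toy embeddings: b is close to a, c points the opposite way
a, b, c = [1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [-1.0, 0.0, 0.0]
offset = [0.0, 0.0, 50.0]  # large component shared by every vector

raw_ab, raw_ac = cosine(a, b), cosine(a, c)
shifted_ab = cosine(shift(a, offset), shift(b, offset))
shifted_ac = cosine(shift(a, offset), shift(c, offset))
print(raw_ab, raw_ac, shifted_ab, shifted_ac)
```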

### Setup and loading

In [24]:
from transformers import RobertaModel, RobertaTokenizer
import torch

In [25]:
# Change to ./data/
base = './data/'
result_base = './data/'  # directory where the results are saved

In [26]:
word_pairs = pd.read_csv(base+'Task-1-validation-dataset.csv', header=None, index_col=0, usecols=[0, 1, 2])
word_pairs.columns = ['word1', 'word2']

### Tokenizing

In [27]:
model_name = "roberta-base"
tokenizer = RobertaTokenizer.from_pretrained(model_name)
model = RobertaModel.from_pretrained(model_name, output_hidden_states=True)

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Model Evaluation

In [28]:
def cosine_similarity(a, b):
    return torch.nn.functional.cosine_similarity(a, b, dim=0).item()

I experimented with several layers, including the mean over all layers, and found that the second-to-last layer gave the highest accuracy according to the provided evaluation script.

In [29]:
# Function to get the embedding from a specific layer
def get_embedding_from_specific_layer(word, layer_index=-2):
    inputs = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
        # Get the embeddings from the specified layer
        layer_embeddings = outputs.hidden_states[layer_index]
        # Use the first token's embedding from that layer (the <s> start token)
        token_embedding = layer_embeddings[:, 0, :]
    return token_embedding.squeeze()
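
The two strategies compared above, a single layer versus the mean over all layers, can be sketched on plain nested lists without loading the model. The sketch assumes `hidden_states` is a sequence with one entry per layer (for `roberta-base`, the embedding output plus 12 transformer layers, so 13 entries), each holding one vector per token; the batch dimension is dropped for brevity. The helper names and the toy numbers are hypothetical.

```python
def first_token_from_layer(hidden_states, layer_index=-2):
    """Return the first token's vector from one chosen layer."""
    return list(hidden_states[layer_index][0])


def first_token_mean_over_layers(hidden_states):
    """Average the first token's vector across every layer."""
    n_layers = len(hidden_states)
    dim = len(hidden_states[0][0])
    return [sum(layer[0][d] for layer in hidden_states) / n_layers
            for d in range(dim)]


# Toy stand-in: 3 layers, 2 tokens per layer, 2 dimensions per token
toy_hidden_states = [
    [[1.0, 2.0], [0.0, 0.0]],
    [[3.0, 4.0], [0.0, 0.0]],
    [[8.0, 9.0], [0.0, 0.0]],
]
second_to_last = first_token_from_layer(toy_hidden_states)
layer_mean = first_token_mean_over_layers(toy_hidden_states)
print(second_to_last, layer_mean)
```

With real model outputs the same indexing applies to tensors, so switching strategy only changes how `hidden_states` is reduced before the cosine similarity is computed.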

In [30]:
similarities = []
for _, row in word_pairs.iterrows():
    embedding1 = get_embedding_from_specific_layer(row['word1'])
    embedding2 = get_embedding_from_specific_layer(row['word2'])
    similarity = cosine_similarity(embedding1, embedding2)
    similarities.append(similarity)

In [31]:
word_pairs['similarity'] = similarities

# Drop the word columns
word_pairs.drop(['word1', 'word2'], axis=1, inplace=True)

# Save to CSV without a header and with the index
word_pairs.to_csv(result_base+'10931277-Task1-method-c-validation.csv', index=True, header=False)

### Test Dataset Prediction

In [34]:
%%time
# Get the word embeddings for the test dataset

test_word_pairs = pd.read_csv(base + 'Task-1-test-dataset1.csv', header=None, index_col=0, usecols=[0, 1, 2])
test_word_pairs.columns = ['word1', 'word2']

test_similarities = []
for _, row in test_word_pairs.iterrows():
    embedding1 = get_embedding_from_specific_layer(row['word1'])
    embedding2 = get_embedding_from_specific_layer(row['word2'])
    similarity = cosine_similarity(embedding1, embedding2)
    test_similarities.append(similarity)

test_word_pairs['similarity'] = test_similarities
test_word_pairs.drop(['word1', 'word2'], axis=1, inplace=True)
test_word_pairs.to_csv(result_base + '10931277-Task1-method-c.csv', index=True, header=False)

CPU times: user 21.7 s, sys: 53.6 ms, total: 21.7 s
Wall time: 5.46 s
