# Natural Language Processing CW - Task 1: Distributional Semantics


Before starting, we will load the datasets, import relevant libraries, etc. These will be used across the whole notebook.

Importing the necessary libraries:

In [45]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from gensim.models import Word2Vec
import numpy as np
from gensim.models import Phrases
import string
import nltk
import time
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

Loading the training dataset:

In [46]:
train_data = pd.read_csv('./data/Training-dataset.csv')

Load the validation test dataset:


In [47]:
# Load test data
val_data = pd.read_csv('./data/Task-1-validation-dataset.csv', names=['id', 'term1', 'term2', 'similarity'])

Before starting, it is good to visualize the dataframe to understand the kind of data we are working with.

In [48]:
train_data.head()

Unnamed: 0,ID,title,plot_synopsis,comedy,cult,flashback,historical,murder,revenge,romantic,scifi,violence
0,8f5203de-b2f8-4c0c-b0c1-835ba92422e9,Si wang ta,"After a recent amount of challenges, Billy Lo ...",0,0,0,0,1,1,0,0,1
1,6416fe15-6f8a-41d4-8a78-3e8f120781c7,Shattered Vengeance,"In the crime-ridden city of Tremont, renowned ...",0,0,0,0,1,1,1,0,1
2,4979fe9a-0518-41cc-b85f-f364c91053ca,L'esorciccio,Lankester Merrin is a veteran Catholic priest ...,0,1,0,0,0,0,0,0,0
3,b672850b-a1d9-44ed-9cff-025ee8b61e6f,Serendipity Through Seasons,"""Serendipity Through Seasons"" is a heartwarmin...",0,0,0,0,0,0,1,0,0
4,b4d8e8cc-a53e-48f8-be6a-6432b928a56d,The Liability,"Young and naive 19-year-old slacker, Adam (Jac...",0,0,1,0,0,0,0,0,0


In [49]:
train_data['plot_synopsis']

0       After a recent amount of challenges, Billy Lo ...
1       In the crime-ridden city of Tremont, renowned ...
2       Lankester Merrin is a veteran Catholic priest ...
3       "Serendipity Through Seasons" is a heartwarmin...
4       Young and naive 19-year-old slacker, Adam (Jac...
                              ...                        
8252    After serving an eight month sentence for brea...
8253    The Mystery Inc. crew head to Chicago for a ta...
8254    Through its run, Another Life revolved around ...
8255    At the North Bend Psychiatric Hospital in 1966...
8256    The film is a depiction of various scenes, usu...
Name: plot_synopsis, Length: 8257, dtype: object

# (a) Bag-of-Words with tf*idf (sparse representation):


TF-IDF (Term Frequency-Inverse Document Frequency) is a numercial statistic that evaluates the importance of a word in a document against all documents.
To implement TF-IDF for this task, we will use TfidfVectorizer (from scikit-learn) to create a matrix that contains documents (rows) and terms (columns) in the vocabulary, and stores the TF-IDF scores.
Uisng the matrix, we will then calculate the cosine similarity between 2 terms.

Before we can start working, we have to preprocess our data. This is important to reduce noise and improve the model performance. The function preprocess_text does the following:


*   Lowercase the text.
*   Tokenize the text (breaks down text into discrete units/words).
*   Removing stopwords.
*   Remove punctuations.
*   Lemmatization to reduce words to their root form.


In [50]:
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()

    # Tokenization
    tokens = word_tokenize(text)

    # Remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]

    # Remove punctuation
    tokens = [token for token in tokens if token not in string.punctuation]

    # Lemmatization and stemming
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]

    # Join tokens back into a string
    return ' '.join(tokens)

Using the training data we loaded, we have to extract the plot synopsis which we will be using as our corpus. We process the synposes, and create a TF-IDF representation for it.

In [51]:
#start the clock to calculate the length of the process of training
start = time.time()

In [52]:
# Preprocess the plot_synopsis column using the preprocess_text function
corpus = train_data['plot_synopsis'].apply(preprocess_text)

# Create TF-IDF matrix representation from our corpus
# The N-gram range is set to (1,3), meaning the model will consider unigrams, bigrams, and trigarms
tfidf_vect = TfidfVectorizer(ngram_range=(1, 3))
# Transform the corpus into a tfidf matrix
matrix = tfidf_vect.fit_transform(corpus)

In [53]:
#Check the length of our vocabulary
len(tfidf_vect.vocabulary_)

6157454

Now that we have created the TF-IDF representation for our corpus, we can see how the model performs on our validation dataset. But before doing that, we need to think of Out-of-Vocabulary words. Since our model is fitted using our corpus, the validation dataset may contain unseen words. To handle that, we will check for synonyms using wordnet.

The function synonym takes in a term as an input and uses WordNet to find relevant synonyms. It returns the list of synonyms for a given term. It is used to handle OOV. If the word is not in our vocabulary, then we check if the synonyms are.

In [54]:
def synonym(term):
  #initialize lemmatizer and an empty set to store synonyms
  lem = WordNetLemmatizer()
  syn = set()
  #iterating over the set of synonyms for a given term from wordnet
  for s in wordnet.synsets(term):
    #iterate over the individual synonyms in the set
    for l in s.lemmas():
      #lemmatize the synonym and add it to the set
      lem_syn = lem.lemmatize(l.name())
      syn.add(lem_syn)
  return list(syn)

The check_vocab is a function that checks if a term is present in our vocabulary. If it is, then we retreive the index, if not then we return None. This helps identify OOV words.

In [55]:
def check_vocab(term, tfidf_vect, matrix):
  #check if the term is present in our vocabulary
    if term in tfidf_vect.vocabulary_:
      #retreive index of the term
        indx = tfidf_vect.vocabulary_[term]
        #extract the column and convert it to an array and faltten it
        return matrix[:, indx].toarray().flatten()
    #if the term is OOV return None
    return None

The last function we have is cos_sim, which takes in the 2 terms we want to calculate the similarity for, along with our tfidf matrix and vectorizer. The first thing is to vectorize the terms using check_vocab function. If the vecttors are returned, it calculates the cosine similarity. Otherwise, it looks at synonyms using the synoyms function. If valid vectors are found, it calculates the similarity, otherwise it returns a similarity of 1 as a default.

In [56]:
def cos_sim(term1, term2, tfidf_vect, matrix):
    # Vectorize terms using check_vocab function
    v1 = check_vocab(term1, tfidf_vect, matrix)
    v2 = check_vocab(term2, tfidf_vect, matrix)

    # check for valid vectors and calculate their cosine similarity
    if v1 is not None and v2 is not None:
        return cosine_similarity([v1], [v2])[0][0]

    # Handle synonyms
    syn1 = synonym(term1)
    syn2 = synonym(term2)

    #iterate over the first term's synonyms
    for s1 in syn1:
      #vectorize the synonym
        v1_syn = check_vocab(s1, tfidf_vect, matrix)
        #check if the vector is valid
        if v1_syn is not None:
            #iterate over the second term's synonyms
            for s2 in syn2:
                #vectorize the synonym
                v2_syn = check_vocab(s2, tfidf_vect, matrix)
                #check if the vector is valid
                if v2_syn is not None:
                    #calculate the synonyms cosine similarity
                    return cosine_similarity([v1_syn], [v1_syn])[0][0]

    return 0.5  # Default similarity if no valid vectors are found

Calculate the cosine similarity between term1 and term2 in the test data:

In [57]:
#Empty list to store the cosine similarity values
similarity_values = []
#interating over the rows in our validation data
for i, row in val_data.iterrows():
    term1 = row['term1']
    term2 = row['term2']
    #use the cos_sim function to calculate the cosine similarity between the terms
    similarity = cos_sim(term1, term2, tfidf_vect, matrix)
    #append the calculated similarity to the list
    similarity_values.append(similarity)
#add a new column to our test datafarme with the cosine similarity calculated
val_data['cosine_similarity'] = similarity_values


In [58]:
#stop the clock
end = time.time()
#calculate the elapsed time
elapsed_time = end - start
print(f'Time taken to preprocess, train, and validate the model: {elapsed_time} seconds')

Time taken to preprocess, train, and validate the model: 142.7440857887268 seconds


Saving the results in a csv file:

In [59]:
# Save results to a CSV file
result_df = pd.DataFrame({'id': val_data['id'], 'cosine_similarity': val_data['cosine_similarity']})
result_df.to_csv('10693727-Task1-method-a-validation.csv', index=False, header=False)


**Results:**

I have played around with the preprocessing function, and found that removing punctuation and lemmatization improves the accuracy of the model to 61%. In addition to checking for synonyms.

### Testing:

Now that we have our model trained using our training dataset, and tested using the validation dataset, we can test it on unseen data. We will load the test dataset and run the model just like we did for the validation dataset, and save the results in a csv file.

In [60]:
#Load the test dataset
test_data = pd.read_csv('./data/Task-1-test-dataset.csv', names=['id', 'term1', 'term2'])
#start clock
start = time.time()
#Empty list to store the cosine similarity values
test_similarity_values = []
#interating over the rows in our validation data
for i, row in test_data.iterrows():
    term1 = row['term1']
    term2 = row['term2']
    #use the cos_sim function to calculate the cosine similarity between the terms
    test_similarity = cos_sim(term1, term2, tfidf_vect, matrix)
    #append the calculated similarity to the list
    test_similarity_values.append(test_similarity)
#add a new column to our test datafarme with the cosine similarity calculated
test_data['cosine_similarity'] = test_similarity_values
#end time
end = time.time()
#print the time elapsed
elapsed_time = end - start
print(f'Time taken to test the model: {elapsed_time} seconds')

Time taken to test the model: 8.274048328399658 seconds


Save the results:

In [61]:
# Save results to a CSV file
test_result = pd.DataFrame({'id': test_data['id'], 'cosine_similarity': test_data['cosine_similarity']})
test_result.to_csv('10693727-Task1-method-a.csv', index=False, header=False)

# (b) word2vec (dense static representation)

Word2Vec is an NLP technique used for learning distributed representations of terms in a continious vector space. In this implementation, Word2Vec is used to generate vectors to represent terms based on semantic terms. Which are then used to calculate the cosine similarity between 2 terms.

We will be using the preprocess_text function from our previous method to preprocess the data.

In [62]:
#start time
start = time.time()
# Preprocess training data
corpus = [preprocess_text(doc) for doc in train_data['plot_synopsis']]

We train a bigram detector using Phrases (gensim) to identify multi term words. We then use that to train the Word2Vec model. This will help capture individual and multi term words that are frequently occuring.

In [63]:
# Train a bigram detector.
bigram = Phrases(corpus, min_count=1, threshold=1)
#create a phraser object
bigram_text = Phrases(bigram[corpus])

# Train Word2Vec model
model = Word2Vec(sentences=list(bigram_text[corpus]), vector_size=100, window=5, min_count=5, workers=3, sg=1)



In [64]:
vector_dimension = model.vector_size

print(f"Word2Vec Vector Dimension: {vector_dimension}")

Word2Vec Vector Dimension: 100


The get_vector function uses the Word2Vec model we trained above. It initialzes an empty array to accumulate the word vectors. It iterates through the document and checks if the word is in the vocablary, if it is not then it checks for synonyms.

In [65]:
def get_vector(text):
    #initialize an array to accumlate vector
    vector = np.zeros((1, model.vector_size))
    #iterate over the words in the document
    for word in text:
        #check if the word is in the vocabulary
        if word in model.wv:
            # if the word is in the vocabulary, add it to the vector sum
            vector += model.wv[word]
        else:
            # If the word is not in the vocabulary, try finding a synonym
            synonyms = []
            for syn in wordnet.synsets(word):
                for lemma in syn.lemmas():
                    synonyms.append(lemma.name())
            # iterate over the synonyms and add the vector of the first found synonym
            for synonym in synonyms:
                if synonym in model.wv:
                    vector += model.wv[synonym]

    return vector + 1  # Add 1 to the entire vector

Using the validation dataset that we loaded previously, we will assess the performance of the model. We create an empty column in the dataframe where we will store the calculated similarities. Then loop over each pair of terms in the validation set and preprocess it using the preprocess_text function. We then get the vectors for the terms using get_vector function. Using the vectors, we calculate the cosine similarity.

In [66]:
# Add a column for calculated similarities to the validation DataFrame
val_data['calculated_similarity'] = np.nan

# Test the model using the test data
for i, pair in val_data.iterrows():
    term1 = preprocess_text(pair["term1"])
    term2 = preprocess_text(pair["term2"])

    # Get Word2Vec vectors for the terms
    vector_term1 = get_vector(term1)
    vector_term2 = get_vector(term2)

    # Calculate cosine similarity using Word2Vec vectors
    similarity_word2vec = cosine_similarity(vector_term1, vector_term2)[0][0]

    # Update the 'Similarity_Word2Vec' column
    val_data.loc[i, 'calculated_similarity'] = similarity_word2vec

#stop the clock
end = time.time()
#calculate the elapsed time
elapsed_time = end - start
print(f'Time taken to preprocess, train, and validate the model: {elapsed_time} seconds')

Time taken to preprocess, train, and validate the model: 227.48058700561523 seconds


Save the results:

In [67]:
# Save the updated test DataFrame to a new CSV file
val_data[['id', 'calculated_similarity']].to_csv('10693727-Task1-method-b-validation.csv', header=False, index=False)

**Results:**

For this method, when using CBOW, the accuracy ranged between 48% to 51%. When implementing the Bigram detector from gensim, and using Skip-gram, the accuracy increased to 54%.

### Testing:

Now that we have our model trained using our training dataset, and tested using the validation dataset, we can test it on unseen data. We will load the test dataset and run the model just like we did for the validation dataset, and save the results in a csv file.

In [68]:
#start time
start = time.time()
# Add a column for calculated similarities to the test DataFrame
test_data['calculated_similarity'] = np.nan

# Test the model using the test data
for i, pair in test_data.iterrows():
    term1 = preprocess_text(pair["term1"])
    term2 = preprocess_text(pair["term2"])

    # Get Word2Vec vectors for the terms
    vector_term1 = get_vector(term1)
    vector_term2 = get_vector(term2)

    # Calculate cosine similarity using Word2Vec vectors
    similarity_word2vec = cosine_similarity(vector_term1, vector_term2)[0][0]

    # Update the 'Similarity_Word2Vec' column
    test_data.loc[i, 'calculated_similarity'] = similarity_word2vec
#end time
end = time.time()
#print the time elapsed
elapsed_time = end - start
print(f'Time taken to test the model: {elapsed_time} seconds')

Time taken to test the model: 0.1435244083404541 seconds


Save the results:

In [69]:
# Save the updated test DataFrame to a new CSV file
test_data[['id', 'calculated_similarity']].to_csv('10693727-Task1-method-b.csv', header=False, index=False)