Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel $\rightarrow$ Restart) and then **run all cells** (in the menubar, select Cell $\rightarrow$ Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [1]:
NAME = "ANTONIS PRODROMOU"
ID = "238"

---

# Introduction & Learning Goals

Welcome to your second assignment! The goal is to get hands-on experience with two fundamental ways of modeling language: (i) n-gram language models (LMs), which capture the probability of a sequence of words, and (ii) word embeddings, which use dense vectors (fasttext) to represent word meaning.

You will first build two separate n-gram LMs, one for positive reviews and one for negative reviews, and use perplexity to see how well each of them models new sentences. Then, you will use word embeddings to build a sentiment classifier.

This assignment will test your understanding of the concepts from the Speech and Language Processing book, Chapters 3 (N-gram Language Models) and 5 (Vector Semantics and Embeddings).

In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.util import pad_sequence, ngrams
from nltk.lm.preprocessing import padded_everygram_pipeline, flatten
from nltk.lm import Laplace, MLE
import re
import requests, zipfile, io
import nltk
from datasets import load_dataset

# Download NLTK resources
nltk.download('punkt_tab')
nltk.download('stopwords')

# Load the IMDB dataset from Hugging Face
print("Loading IMDB dataset from Hugging Face...")
dataset = load_dataset('imdb')

# The dataset is a DatasetDict. We'll convert the 'train' and 'test' splits to pandas
X_train_df = dataset['train'].to_pandas()
X_test_df = dataset['test'].to_pandas()

print(f"Loaded {len(X_train_df)} training examples and {len(X_test_df)} test examples.")

[nltk_data] Downloading package punkt_tab to /Users/anton/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/anton/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Loading IMDB dataset from Hugging Face...
Loaded 25000 training examples and 25000 test examples.


# Part 1 - N-gram Language Models

In this section, you will use nltk to build, train, and evaluate n-gram language models. You will create two separate models: one trained only on positive reviews and one trained only on negative reviews.

You will then use perplexity to measure how "surprising" a new sentence is to each model. A lower perplexity score means the model finds the sentence more probable (i.e., it "fits" the model better).
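
To make the idea of "surprise" concrete, here is a minimal sketch of perplexity computed directly from per-token probabilities. The function name `perplexity` and the probability values are made up for illustration; they are not outputs of the models trained below. The sketch uses base-2 logarithms, so perplexity is 2 raised to the average negative log-probability.

```python
import math

def perplexity(probs):
    """Perplexity of a sequence, given the model probability of each token in context."""
    # Average negative log2-probability: the cross-entropy in bits per token.
    entropy = -sum(math.log2(p) for p in probs) / len(probs)
    # Perplexity is 2 raised to that cross-entropy.
    return 2 ** entropy

# Illustrative probabilities only: a sentence the model "expects" and one it does not.
likely = [0.5, 0.25, 0.5, 0.25]
unlikely = [0.125, 0.0625, 0.125, 0.0625]

print(perplexity(likely))
print(perplexity(unlikely))
```

The sequence with higher per-token probabilities gets the lower perplexity, which is the comparison this part of the assignment relies on.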

We will use trigrams (n=3) and Laplace (Add-1) smoothing.
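
As a rough illustration of add-one smoothing for trigrams, the sketch below estimates P(word | two previous words) from raw counts on a toy corpus. The helper `laplace_prob`, the toy sentence, and the use of the observed word types as the vocabulary are assumptions made for this example. NLTK's vocabulary typically also counts the padding symbols and an unknown-word token, so its denominators will differ.

```python
from collections import Counter

def laplace_prob(word, context, trigram_counts, context_counts, vocab_size):
    """Add-one estimate of P(word | context) for a trigram model."""
    numerator = trigram_counts[context + (word,)] + 1
    denominator = context_counts[context] + vocab_size
    return numerator / denominator

# Toy corpus (no padding, for brevity).
tokens = "the film was great and the film was long".split()
trigrams = list(zip(tokens, tokens[1:], tokens[2:]))
trigram_counts = Counter(trigrams)
# How often each two-word history is followed by some word.
context_counts = Counter((w1, w2) for w1, w2, _ in trigrams)
vocab_size = len(set(tokens))

context = ("film", "was")
seen = laplace_prob("great", context, trigram_counts, context_counts, vocab_size)
unseen = laplace_prob("the", context, trigram_counts, context_counts, vocab_size)
print(seen, unseen)
```

An unseen continuation such as "film was the" still receives a non-zero probability, and the probabilities over the whole vocabulary for a given context still sum to one.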

## Task 1.1 Train an N-gram LM

Create a function `train_lm` that takes a list of tokenized sentences (a list of lists of tokens) and returns a trained trigram nltk language model with Laplace smoothing (`nltk.lm.Laplace`). Use the `padded_everygram_pipeline` function to preprocess the sentences.
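
To see what kind of input the model is trained on, the following pure-Python sketch mimics the two steps the pipeline performs on each sentence: padding both ends and extracting all n-grams up to order N. The helpers `pad_sentence` and `all_ngrams` are hypothetical stand-ins written for this example, not the nltk implementations, and the order in which nltk yields the n-grams may differ.

```python
def pad_sentence(tokens, n):
    """Add n-1 start symbols and n-1 end symbols around a tokenized sentence."""
    return ["<s>"] * (n - 1) + list(tokens) + ["</s>"] * (n - 1)

def all_ngrams(tokens, max_len):
    """Return every n-gram of length 1..max_len as a tuple."""
    grams = []
    for size in range(1, max_len + 1):
        for start in range(len(tokens) - size + 1):
            grams.append(tuple(tokens[start:start + size]))
    return grams

padded = pad_sentence(["great", "movie"], 3)
grams = all_ngrams(padded, 3)

print(padded)
print(len(grams))
print([g for g in grams if len(g) == 3])
```

The padding lets the first real word be predicted from a full two-token history, and the last trigrams model how a sentence ends.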

In [3]:
# A trigram model conditions each word on the two previous words.
# Laplace smoothing adds one to every n-gram count before normalizing the counts into probabilities.

def train_lm(sentences):
    N = 3
    # padded_everygram_pipeline returns two lazy iterators: the padded
    # everygrams per sentence and the flat stream of padded words
    train, vocab = padded_everygram_pipeline(N, sentences)
    # Laplace(N) creates an order-N language model with add-one smoothing
    lm = Laplace(N)
    # fit() builds the vocabulary from vocab and counts the n-grams in train
    lm.fit(train, vocab)
    return lm

In [4]:
"""Testing that the function returns the correct results"""
# Convert the train split of the DatasetDict to pandas
X_train_df = dataset['train'].to_pandas()

def preprocess(text):
    text = text.lower()
    return [word_tokenize(sent) for sent in sent_tokenize(text)]

print("Preparing data for training N-gram LMs...")
# Get all positive and negative reviews and create a list of sentences, each
# with a list of tokens suitable for nltk LM training
pos_sents = [sent for review in X_train_df[X_train_df['label'] == 1]['text'] for sent in preprocess(review)]
neg_sents = [sent for review in X_train_df[X_train_df['label'] == 0]['text'] for sent in preprocess(review)]
print(f"Total positive sentences for LM: {len(pos_sents)}")
print(f"Total negative sentences for LM: {len(neg_sents)}")
print(f"Example positive sentence: {pos_sents[0]}")
print(f"Example negative sentence: {neg_sents[0]}")

print("Training positive language model")
pos_lm = train_lm(pos_sents)
print("Training negative language model")
neg_lm = train_lm(neg_sents)

assert pos_lm.counts[['great']]['movie'] == 294
assert neg_lm.counts[['bad']]['movie'] == 320


Preparing data for training N-gram LMs...
Total positive sentences for LM: 130829
Total negative sentences for LM: 137359
Example positive sentence: ['zentropa', 'has', 'much', 'in', 'common', 'with', 'the', 'third', 'man', ',', 'another', 'noir-like', 'film', 'set', 'among', 'the', 'rubble', 'of', 'postwar', 'europe', '.']
Example negative sentence: ['i', 'rented', 'i', 'am', 'curious-yellow', 'from', 'my', 'video', 'store', 'because', 'of', 'all', 'the', 'controversy', 'that', 'surrounded', 'it', 'when', 'it', 'was', 'first', 'released', 'in', '1967.', 'i', 'also', 'heard', 'that', 'at', 'first', 'it', 'was', 'seized', 'by', 'u.s.', 'customs', 'if', 'it', 'ever', 'tried', 'to', 'enter', 'this', 'country', ',', 'therefore', 'being', 'a', 'fan', 'of', 'films', 'considered', '``', 'controversial', "''", 'i', 'really', 'had', 'to', 'see', 'this', 'for', 'myself.', '<', 'br', '/', '>', '<', 'br', '/', '>', 'the', 'plot', 'is', 'centered', 'around', 'a', 'young', 'swedish', 'drama', 'stude

## Task 1.2 Calculate Perplexity

Create a function `calculate_perplexity` that takes a trained model and an untokenized sentence and computes the sentence's perplexity under that model. You can use the `perplexity` method of nltk models. Remember to apply the same preprocessing as during training.

In [5]:
def calculate_perplexity(lm, sentence):
    N = 3
    # preprocess returns a list of lists
    tokenized_sents = preprocess(sentence)

    # padded_everygram_pipeline returns two iterators: the padded n-grams of
    # every order from 1 to N for each sentence, and the padded words for
    # vocabulary training (unused here, since the model is already trained)
    test_data, _ = padded_everygram_pipeline(N, tokenized_sents)

    # flatten the per-sentence iterators into a single list of n-grams
    test_data = list(flatten(test_data))

    # Calculate perplexity using the model's perplexity method
    return lm.perplexity(test_data)

In [6]:
"""Testing that the function returns the correct results"""
test_pos_sent = "This was a truly great and wonderful film."
per_pos_pos = calculate_perplexity(pos_lm, test_pos_sent)
per_neg_pos = calculate_perplexity(neg_lm, test_pos_sent)
assert per_pos_pos < per_neg_pos
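
The comparison in the test above suggests a simple decision rule: assign a sentence to whichever model finds it less surprising. The sketch below shows that rule in isolation. The function `classify_by_perplexity` and the stand-in scorers with fixed placeholder numbers are hypothetical; in the notebook the scorers would wrap `calculate_perplexity` with `pos_lm` and `neg_lm`.

```python
def classify_by_perplexity(sentence, perplexity_fns):
    """Return the label whose model gives `sentence` the lowest perplexity, plus all scores."""
    scores = {label: fn(sentence) for label, fn in perplexity_fns.items()}
    return min(scores, key=scores.get), scores

# Stand-in scorers returning placeholder values (not measured results).
# In the notebook: lambda s: calculate_perplexity(pos_lm, s), and likewise for neg_lm.
fake_scorers = {
    "positive": lambda s: 850.0,
    "negative": lambda s: 1200.0,
}

label, scores = classify_by_perplexity(
    "This was a truly great and wonderful film.", fake_scorers
)
print(label, scores)
```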


# Part 2 - Embeddings

In this section, we'll switch gears. Instead of n-grams, we'll represent text using pre-trained fasttext embeddings. We will build a simple "sentence embedding" by averaging the word vectors of all words in a text, and use it as the input to a sentiment classifier.

In [7]:
import fasttext
from huggingface_hub import hf_hub_download

# model_path = hf_hub_download(repo_id="facebook/fasttext-en-vectors", filename="model.bin")
# model = fasttext.load_model(model_path)
model = fasttext.load_model("data/model.bin")

## Task 2.1 Implement Cosine Similarity

To understand embeddings, let's look at their properties. A key operation is cosine similarity, which measures the similarity between two vectors. Implement the function below. Hint: the formula is $\text{similarity} = \frac{A \cdot B}{\|A\| \, \|B\|}$

In [8]:
def cosine_similarity(vec_a, vec_b):
    dot_product = vec_a.dot(vec_b)
    norm_product = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
    # A zero vector has no direction, so define its similarity as 0
    if norm_product == 0:
        return 0.0
    return dot_product / norm_product

In [9]:
"""Testing that the function returns the correct results"""
vec_good = model['good']
vec_nice = model['nice']
vec_king = model['king']
assert cosine_similarity(vec_good, vec_nice) > cosine_similarity(vec_good, vec_king)
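
As a sanity check on the formula, here is a small worked example on hand-picked 2-d vectors, written in plain Python so the arithmetic is easy to follow. The helper `cosine` and the vectors are made up for illustration and are unrelated to the fasttext vectors above.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Same direction, different length: similarity close to 1.
same_direction = cosine([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors: similarity 0.
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])
# Opposite direction: similarity close to -1.
opposite = cosine([1.0, 2.0], [-1.0, -2.0])

print(same_direction, orthogonal, opposite)
```

Because the norms are divided out, only the direction of the vectors matters, not their length.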


## Task 2.2 Document Vector Averaging

To create a sentiment classifier, let's first implement a method that converts a document into the average of the embeddings of the words inside the document:

In [12]:
def average_document_vector(doc_text, embeddings_dict):
    # Whitespace tokenization; embeddings_dict must support embeddings_dict[word]
    vectors = [embeddings_dict[word] for word in doc_text.split()]
    # An empty document has no word vectors to average
    if not vectors:
        return None
    return np.mean(vectors, axis=0)
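
To illustrate what averaging does, the sketch below applies the same idea to a toy dictionary of 2-d vectors. The helper `average_vectors` and the `toy_embeddings` values are invented for this example. Unlike this toy dictionary, which simply skips unknown words, a fasttext model can usually produce a vector for an unseen word from its subword n-grams.

```python
def average_vectors(words, embeddings):
    """Element-wise mean of the vectors for `words`; words without a vector are skipped."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Made-up 2-d "embeddings" for illustration only.
toy_embeddings = {
    "great": [1.0, 0.0],
    "film": [0.0, 1.0],
    "boring": [-1.0, 0.0],
}

doc_vec = average_vectors("great film".split(), toy_embeddings)
mixed_vec = average_vectors("great boring film".split(), toy_embeddings)
print(doc_vec)
print(mixed_vec)
```

In the second document the opposing first components of "great" and "boring" cancel out, which shows one limitation of averaging: word order and local contrast are lost.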


In [13]:
"""Testing that the function returns the correct results"""
from datasets import concatenate_datasets

# 1. Build a balanced subsample of the training set
train_dataset = load_dataset("imdb", split="train")
positive_samples = train_dataset.filter(lambda x: x["label"] == 1)
negative_samples = train_dataset.filter(lambda x: x["label"] == 0)
n = 500
positive_subsample = positive_samples.select(range(n))
negative_subsample = negative_samples.select(range(n))
subsampled_dataset = concatenate_datasets([positive_subsample, negative_subsample])

texts = subsampled_dataset['text']
labels = subsampled_dataset['label']

# 2. Convert text documents to averaged vector features
print(f"Converting {len(texts)} documents to averaged fasttext vectors...")
X = np.array([average_document_vector(text, model) for text in texts])
y = np.array(labels)

# Check the shape of the features
print(f"Feature matrix shape (X): {X.shape}")
print(f"Label vector shape (y): {y.shape}")

# 3. Split data into training and testing sets
#X_train, X_test, y_train, y_test = train_test_split(
#    X, y, test_size=0.2, random_state=42, stratify=y
#)
#print(f"Train/Test split: {len(X_train)} training samples, {len(X_test)} testing samples.")

# 4. Train the Logistic Regression Model
print("Training Logistic Regression model...")
classifier = LogisticRegression(max_iter=1000, random_state=42)
classifier.fit(X, y)
print("Training complete.")

assert classifier.predict([average_document_vector("This was a truly great and wonderful film.", model)])[0] == 1

Converting 1000 documents to averaged fasttext vectors...
Feature matrix shape (X): (1000, 300)
Label vector shape (y): (1000,)
Training Logistic Regression model...
Training complete.
