# Introduction to Text Analytics Assignment

The first code cell below loads a subset of 5000 movie reviews from the IMDB dataset. Each review is labeled as either positive or negative. The task here is to compare the performance of different text classification methods on this dataset.

Train a classifier of your choice (e.g., logistic regression, SVM, decision tree) using only the review text to create features. Evaluate the classifier using accuracy, precision, recall, and F1-score when predicting the sentiment labels.

- Use TF-IDF vectorization as one of the feature extraction methods with varying n-gram ranges (e.g., unigrams, bigrams).
- Compare these results with a lower-dimensional representation of the text obtained by applying Singular Value Decomposition (SVD) to the TF-IDF matrix.
- Experiment with word embeddings (e.g., Word2Vec, GloVe): represent each review by the average of its word vectors and train a classifier on these document-level vectors.
- Finally, use the document embeddings of two different embedding models available via the OpenAI API (e.g., text-embedding-ada-002 and text-embedding-3-small) to represent the reviews and train classifiers on these embeddings.

## Working on a Local Machine

You can edit the notebook on Google Colab or locally on your computer. The project dependencies are managed by `uv`. For local development, download and install `uv` from [here](https://docs.astral.sh/uv/getting-started/installation/), then run the following command in your terminal to set up the environment:

```bash
uv sync
```

This will create a virtual environment under `.venv` in the project directory and install all required dependencies. You can connect this environment to the Jupyter notebook by selecting the appropriate kernel (in VSCode, hit Ctrl+Shift+P and search for "Python: Select Interpreter").

## Working on Google Colab

You can download the notebook from the GitHub repository and upload it to Google Colab. When you work on it you can save intermediate results to your Google Drive (find the command in the File menu). When you are done, download the completed notebook and upload it to your GitHub repository.

## Using OpenAI API

To use the OpenAI API, get the API key from [here](https://firebasestorage.googleapis.com/v0/b/uni-sofia.appspot.com/o/lit%2Foc.txt?alt=media&token=768020c6-62d2-4c1b-9c53-966c322922e0) and edit the first code cell to set it in the `OpenAI` client, replacing the `WRITE_THE_API_KEY_HERE` placeholder.
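
A possible alternative to pasting the key into the notebook is to read it from an environment variable, so that it does not end up in the file uploaded to GitHub. The sketch below assumes the key has been exported as `OPENAI_API_KEY`; the helper name `resolve_api_key` is hypothetical and not part of the assignment.

```python
import os

# Placeholder string used in the first code cell of this notebook.
PLACEHOLDER = "WRITE_THE_API_KEY_HERE"


def resolve_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in `env_var`, or raise if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key or key == PLACEHOLDER:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key.")
    return key


# Usage in the first code cell (requires the `openai` package):
# openai_client = OpenAI(api_key=resolve_api_key())
```

On Colab or a local machine, the variable would need to be set before the first code cell runs.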

## How to submit

When you are done with the assignment, please upload the completed notebook to your GitHub repository.


In [None]:
%pip install gensim tqdm openai
# Standard library
import os
import re

# Third-party packages
import gensim.downloader as api
import numpy as np
import pandas as pd
from openai import OpenAI
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from tqdm.auto import tqdm

openai_client = OpenAI(api_key="WRITE_THE_API_KEY_HERE")

df = pd.read_csv("https://github.com/febse/data/raw/refs/heads/main/ta/IMDB-Dataset-5000.csv.zip")
df.head()

Note: you may need to restart the kernel to use updated packages.


Unnamed: 0,review,sentiment
0,I really liked this Summerslam due to the look...,positive
1,Not many television shows appeal to quite as m...,positive
2,The film quickly gets to a major chase scene w...,negative
3,Jane Austen would definitely approve of this o...,positive
4,Expectations were somewhat high for me when I ...,negative


In [2]:
if 'word2vec_model' not in globals():
    # Download the pretrained Word2Vec vectors (GoogleNews-vectors-negative300)
    word2vec_model = api.load('word2vec-google-news-300')

if 'glove_model' not in globals():
    # Download the pretrained GloVe vectors (glove-wiki-gigaword-300)
    glove_model = api.load('glove-wiki-gigaword-300')

def get_openai_embedding(text, model="text-embedding-3-small"):
    """
    Get embedding for a text using OpenAI's API (v1.0+ compatible).
    
    Parameters:
    text (str): The text to embed
    model (str): The embedding model to use (default: "text-embedding-3-small")
    
    Returns:
    list: The embedding vector
    """
    response = openai_client.embeddings.create(
        model=model,
        input=text
    )
    return response.data[0].embedding

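# Example call (the model must be an embedding model, not a chat model such as gpt-4.1):
# get_openai_embedding("This is a sample text.", model="text-embedding-3-small")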

In [3]:
# Sanity check: one embedding call
test_vec = get_openai_embedding("This is a sample text.", model="text-embedding-3-small")
print(type(test_vec), len(test_vec))
print(test_vec[:10])


<class 'list'> 1536
[0.03873104974627495, 0.025061266496777534, 0.025133363902568817, 0.007267478853464127, -0.031694281846284866, -0.04994949698448181, -0.012177352793514729, -0.018384991213679314, 0.0011130128987133503, 0.004736838862299919]


In [4]:
# Ensure numeric labels
df["y"] = df["sentiment"].map({"negative": 0, "positive": 1}).astype(int)

X = df["review"].astype(str).tolist()
y = df["y"].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

len(X_train), len(X_test)


(4000, 1000)

In [5]:
def eval_binary(y_true, y_pred, name="model"):
    acc = accuracy_score(y_true, y_pred)
    prec = precision_score(y_true, y_pred, zero_division=0)
    rec = recall_score(y_true, y_pred, zero_division=0)
    f1 = f1_score(y_true, y_pred, zero_division=0)
    print(f"{name} | acc={acc:.4f} prec={prec:.4f} rec={rec:.4f} f1={f1:.4f}")
    return {"model": name, "accuracy": acc, "precision": prec, "recall": rec, "f1": f1}

results = []


In [6]:
def get_openai_embeddings(texts, model="text-embedding-3-small", batch_size=64):
    vectors = []
    for i in tqdm(range(0, len(texts), batch_size), desc=f"Embedding ({model})"):
        batch = texts[i:i+batch_size]
        resp = openai_client.embeddings.create(model=model, input=batch)
        vectors.extend([d.embedding for d in resp.data])
    return np.array(vectors, dtype=np.float32)


In [7]:
os.makedirs("cache", exist_ok=True)

model_1 = "text-embedding-3-small"
Xtr_path_1 = f"cache/Xtr_{model_1}.npy"
Xte_path_1 = f"cache/Xte_{model_1}.npy"

if os.path.exists(Xtr_path_1) and os.path.exists(Xte_path_1):
    Xtr_1 = np.load(Xtr_path_1)
    Xte_1 = np.load(Xte_path_1)
else:
    Xtr_1 = get_openai_embeddings(X_train, model=model_1, batch_size=64)
    Xte_1 = get_openai_embeddings(X_test,  model=model_1, batch_size=64)
    np.save(Xtr_path_1, Xtr_1)
    np.save(Xte_path_1, Xte_1)

Xtr_1.shape, Xte_1.shape


((4000, 1536), (1000, 1536))

In [8]:
clf_1 = LogisticRegression(max_iter=3000, random_state=42)  # same settings as the other runs
clf_1.fit(Xtr_1, y_train)
pred_1 = clf_1.predict(Xte_1)

results.append(eval_binary(y_test, pred_1, name=f"OpenAI {model_1} + LR"))


OpenAI text-embedding-3-small + LR | acc=0.9390 prec=0.9301 rec=0.9504 f1=0.9401


In [9]:
model_2 = "text-embedding-ada-002"   # assignment example
# model_2 = "text-embedding-3-large" # alternative if allowed

Xtr_path_2 = f"cache/Xtr_{model_2}.npy"
Xte_path_2 = f"cache/Xte_{model_2}.npy"

if os.path.exists(Xtr_path_2) and os.path.exists(Xte_path_2):
    Xtr_2 = np.load(Xtr_path_2)
    Xte_2 = np.load(Xte_path_2)
else:
    Xtr_2 = get_openai_embeddings(X_train, model=model_2, batch_size=64)
    Xte_2 = get_openai_embeddings(X_test,  model=model_2, batch_size=64)
    np.save(Xtr_path_2, Xtr_2)
    np.save(Xte_path_2, Xte_2)

Xtr_2.shape, Xte_2.shape


((4000, 1536), (1000, 1536))

In [10]:
clf_2 = LogisticRegression(max_iter=3000, random_state=42)
clf_2.fit(Xtr_2, y_train)
pred_2 = clf_2.predict(Xte_2)

results.append(eval_binary(y_test, pred_2, name=f"OpenAI {model_2} + LR"))


OpenAI text-embedding-ada-002 + LR | acc=0.9250 prec=0.9101 rec=0.9444 f1=0.9270


In [11]:
res_df = pd.DataFrame(results).sort_values("f1", ascending=False)
res_df


Unnamed: 0,model,accuracy,precision,recall,f1
0,OpenAI text-embedding-3-small + LR,0.939,0.930097,0.950397,0.940137
1,OpenAI text-embedding-ada-002 + LR,0.925,0.910134,0.944444,0.926972


In [12]:
tfidf_runs = [
    ("TFIDF (1,1) + LR", (1,1)),
    ("TFIDF (1,2) + LR", (1,2)),
]

for name, ngr in tfidf_runs:
    pipe = Pipeline([
        ("tfidf", TfidfVectorizer(
            lowercase=True,
            stop_words="english",
            ngram_range=ngr,
            min_df=2,
            max_df=0.95
        )),
        ("clf", LogisticRegression(max_iter=3000, random_state=42))
    ])
    pipe.fit(X_train, y_train)
    pred = pipe.predict(X_test)
    results.append(eval_binary(y_test, pred, name=name))


TFIDF (1,1) + LR | acc=0.8540 prec=0.8315 rec=0.8909 f1=0.8602
TFIDF (1,2) + LR | acc=0.8560 prec=0.8371 rec=0.8869 f1=0.8613


In [13]:
svd_runs = [
    ("TFIDF (1,2) + SVD100 + LR", 100),
    ("TFIDF (1,2) + SVD200 + LR", 200),
    ("TFIDF (1,2) + SVD300 + LR", 300),
]

for name, k in svd_runs:
    pipe = Pipeline([
        ("tfidf", TfidfVectorizer(
            lowercase=True,
            stop_words="english",
            ngram_range=(1,2),
            min_df=2,
            max_df=0.95
        )),
        ("svd", TruncatedSVD(n_components=k, random_state=42)),
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression(max_iter=3000, random_state=42))
    ])
    pipe.fit(X_train, y_train)
    pred = pipe.predict(X_test)
    results.append(eval_binary(y_test, pred, name=name))


TFIDF (1,2) + SVD100 + LR | acc=0.8310 prec=0.8454 rec=0.8135 f1=0.8291
TFIDF (1,2) + SVD200 + LR | acc=0.8500 prec=0.8569 rec=0.8433 f1=0.8500
TFIDF (1,2) + SVD300 + LR | acc=0.8600 prec=0.8655 rec=0.8552 f1=0.8603


In [14]:

_token_re = re.compile(r"[a-z']+")

def tokenize_simple(text: str):
    return _token_re.findall(text.lower())

def avg_doc_vector(texts, kv):
    X = np.zeros((len(texts), kv.vector_size), dtype=np.float32)
    for i, t in enumerate(tqdm(texts, desc="Doc vectors")):
        toks = tokenize_simple(t)
        vecs = [kv[w] for w in toks if w in kv]
        if vecs:
            X[i] = np.mean(vecs, axis=0)
    return X


In [15]:
os.makedirs("cache", exist_ok=True)

w2v_tr_path = "cache/Xtr_word2vec.npy"
w2v_te_path = "cache/Xte_word2vec.npy"

if os.path.exists(w2v_tr_path) and os.path.exists(w2v_te_path):
    Xtr_w2v = np.load(w2v_tr_path)
    Xte_w2v = np.load(w2v_te_path)
else:
    word2vec_model = api.load("word2vec-google-news-300")
    Xtr_w2v = avg_doc_vector(X_train, word2vec_model)
    Xte_w2v = avg_doc_vector(X_test, word2vec_model)
    np.save(w2v_tr_path, Xtr_w2v)
    np.save(w2v_te_path, Xte_w2v)

clf = LogisticRegression(max_iter=3000, random_state=42)
clf.fit(Xtr_w2v, y_train)
pred = clf.predict(Xte_w2v)
results.append(eval_binary(y_test, pred, name="Word2Vec avg + LR"))


Word2Vec avg + LR | acc=0.8410 prec=0.8457 rec=0.8373 f1=0.8415


In [16]:
gl_tr_path = "cache/Xtr_glove.npy"
gl_te_path = "cache/Xte_glove.npy"

if os.path.exists(gl_tr_path) and os.path.exists(gl_te_path):
    Xtr_gl = np.load(gl_tr_path)
    Xte_gl = np.load(gl_te_path)
else:
    glove_model = api.load("glove-wiki-gigaword-300")
    Xtr_gl = avg_doc_vector(X_train, glove_model)
    Xte_gl = avg_doc_vector(X_test, glove_model)
    np.save(gl_tr_path, Xtr_gl)
    np.save(gl_te_path, Xte_gl)

clf = LogisticRegression(max_iter=3000, random_state=42)
clf.fit(Xtr_gl, y_train)
pred = clf.predict(Xte_gl)
results.append(eval_binary(y_test, pred, name="GloVe avg + LR"))


GloVe avg + LR | acc=0.8230 prec=0.8371 rec=0.8056 f1=0.8210


In [17]:
res_df = pd.DataFrame(results).sort_values("f1", ascending=False)
res_df


Unnamed: 0,model,accuracy,precision,recall,f1
0,OpenAI text-embedding-3-small + LR,0.939,0.930097,0.950397,0.940137
1,OpenAI text-embedding-ada-002 + LR,0.925,0.910134,0.944444,0.926972
3,"TFIDF (1,2) + LR",0.856,0.837079,0.886905,0.861272
6,"TFIDF (1,2) + SVD300 + LR",0.86,0.865462,0.855159,0.860279
2,"TFIDF (1,1) + LR",0.854,0.831481,0.890873,0.860153
5,"TFIDF (1,2) + SVD200 + LR",0.85,0.856855,0.843254,0.85
7,Word2Vec avg + LR,0.841,0.845691,0.837302,0.841476
4,"TFIDF (1,2) + SVD100 + LR",0.831,0.845361,0.813492,0.82912
8,GloVe avg + LR,0.823,0.837113,0.805556,0.821031


## Results and comparison

We compared four representation families on the IMDB-5000 sentiment dataset using the same 80/20 stratified train/test split and a Logistic Regression classifier on top of each representation. Performance is reported with accuracy, precision, recall, and F1.
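
To make the four reported metrics concrete, the following sketch computes them from confusion-matrix counts. The counts are an assumption: they were chosen so that the resulting values agree with the reported `text-embedding-3-small` row, and they are not printed anywhere in the notebook. The function name `binary_metrics` is hypothetical.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against empty denominators, mirroring zero_division=0 in eval_binary.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Assumed counts for a 1000-review test set (inferred, not taken from notebook output).
m = binary_metrics(tp=479, fp=36, fn=25, tn=460)
print({name: round(value, 4) for name, value in m.items()})
```

Precision penalizes false positives, recall penalizes false negatives, and F1 is their harmonic mean, which is why the tables are sorted by F1.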

### TF–IDF baselines (sparse bag-of-ngrams)
TF–IDF with unigrams only, `ngram_range=(1,1)`, and with unigrams plus bigrams, `ngram_range=(1,2)`, produced nearly identical performance (F1 ≈ 0.860 and 0.861, respectively). Adding bigrams gave only a marginal improvement in this run.
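
As a rough illustration of what the n-gram range changes, the sketch below lists the features generated from one token sequence for the ranges (1,1) and (1,2). It is a simplified stand-in, not the `TfidfVectorizer` implementation: it ignores the stop-word filtering, `min_df`, and `max_df` settings used in the notebook, and `word_ngrams` is a hypothetical helper.

```python
def word_ngrams(tokens, min_n=1, max_n=2):
    """Return all contiguous n-grams with min_n <= n <= max_n, joined by spaces."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


tokens = ["great", "acting", "terrible", "script"]
unigrams = word_ngrams(tokens, 1, 1)   # 4 features
uni_bi = word_ngrams(tokens, 1, 2)     # 4 unigrams + 3 bigrams
print(unigrams)
print(uni_bi)
```

The bigram features add local context such as "terrible script", at the cost of a much larger and sparser vocabulary.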

### TF–IDF + SVD (LSA)
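Applying TruncatedSVD to the TF–IDF (1,2) matrix (LSA), followed by standardization, did not improve F1 relative to the best TF–IDF baseline. SVD100 degraded performance (F1 ≈ 0.829) and SVD200 remained below the baseline (F1 ≈ 0.850), while SVD300 essentially matched it (F1 ≈ 0.860 vs. 0.861) and reached slightly higher accuracy (0.860 vs. 0.856). Performance rose steadily with the number of components, but with differences this small on a 1000-review test set, SVD300 and the sparse baselines are best treated as tied.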

### Averaged pretrained word embeddings (dense, static)
Document vectors formed by averaging pretrained word vectors performed worse than the TF–IDF baselines. Word2Vec averaging reached F1 ≈ 0.841, and GloVe averaging reached F1 ≈ 0.821. This is consistent with the limitations of simple averaging: every in-vocabulary token contributes equally, so sentiment-bearing words are diluted by neutral ones, and word order is discarded.
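
The loss of word order can be seen in a small sketch. The 2-dimensional vectors below are made-up toy values, not taken from Word2Vec or GloVe, and `average_vector` is a hypothetical pure-Python analogue of the notebook's `avg_doc_vector`.

```python
# Toy word vectors (hypothetical values chosen for illustration only).
toy_vectors = {
    "good": [1.0, 0.0],
    "bad": [0.0, 1.0],
    "not": [0.5, 0.5],
}


def average_vector(tokens, vectors):
    """Average the vectors of in-vocabulary tokens; return zeros if none match."""
    dim = len(next(iter(vectors.values())))
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


# Two token sequences with opposite meanings but the same bag of words.
a = average_vector("not good , bad".split(), toy_vectors)
b = average_vector("not bad , good".split(), toy_vectors)
print(a == b)
```

Because both sequences contain the same words, their averaged vectors are identical, so a classifier trained on them cannot tell the two apart.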

### OpenAI document embeddings (dense, contextual)
OpenAI embeddings substantially outperformed all other approaches. `text-embedding-3-small` achieved the best overall result (accuracy ≈ 0.939, F1 ≈ 0.940). The older `text-embedding-ada-002` was slightly weaker (accuracy ≈ 0.925, F1 ≈ 0.927). These embeddings likely capture higher-level semantic and sentiment cues beyond surface n-gram statistics.

### Overall ranking (by F1 in this run)
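1. OpenAI `text-embedding-3-small` + LR (F1 ≈ 0.940)
2. OpenAI `text-embedding-ada-002` + LR (F1 ≈ 0.927)
3. TF–IDF (1,2) + LR, TF–IDF (1,2) + SVD300 + LR, and TF–IDF (1,1) + LR, effectively tied (F1 ≈ 0.860–0.861)
4. TF–IDF (1,2) + SVD200 + LR (F1 ≈ 0.850)
5. Word2Vec average + LR (F1 ≈ 0.841)
6. TF–IDF (1,2) + SVD100 + LR (F1 ≈ 0.829)
7. GloVe average + LR (F1 ≈ 0.821)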

## Implementation notes

OpenAI embeddings were computed in batches and cached to disk to avoid repeated API calls. Word2Vec and GloVe document embeddings were constructed by averaging in-vocabulary word vectors for each review and then training a classifier on these dense document vectors.
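
The batch-and-cache pattern can be sketched without any API access. In the example below, `fake_embed` is a hypothetical stand-in for the embeddings call, and JSON replaces the `.npy` files used in the notebook so that the sketch needs only the standard library.

```python
import json
import os
import tempfile


def iter_batches(items, batch_size=64):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def cached_embeddings(texts, embed_batch, cache_path, batch_size=64):
    """Load vectors from `cache_path` if it exists; otherwise embed in batches and save."""
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            return json.load(f)
    vectors = []
    for batch in iter_batches(texts, batch_size):
        vectors.extend(embed_batch(batch))
    with open(cache_path, "w") as f:
        json.dump(vectors, f)
    return vectors


calls = []  # records the size of every batch sent to the stand-in embedder


def fake_embed(batch):
    """Hypothetical embedder: maps each text to a 1-dimensional vector."""
    calls.append(len(batch))
    return [[float(len(text))] for text in batch]


texts = [f"review {i}" for i in range(150)]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "emb.json")
    first = cached_embeddings(texts, fake_embed, path)   # computes and writes the cache
    second = cached_embeddings(texts, fake_embed, path)  # served from the cache

print(calls)
print(first == second)
```

The second call does not reach the embedder at all, which is the behavior that keeps repeated notebook runs from issuing new API requests.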
