# Introduction to Text Analytics Assignment

The first code cell below loads a subset of 5000 movie reviews from the IMDB dataset. Each review is labeled as either positive or negative. The task here is to compare the performance of different text classification methods on this dataset.

Train a classifier of your choice (e.g., logistic regression, SVM, decision tree) using only the review text to create features. Evaluate the classifier using accuracy, precision, recall, and F1-score when predicting the sentiment labels.

- Use TF-IDF vectorization as one of the feature extraction methods with varying n-gram ranges (e.g., unigrams, bigrams).
- Compare the results with those obtained from a lower-dimensional representation of the text data, computed with Singular Value Decomposition (SVD) on the TF-IDF matrix.
- Experiment with word embeddings (e.g., Word2Vec, GloVe) to represent the text data and train a classifier on these embeddings, using the average of the word vectors in each review as a document-level representation.
- Finally, use the document embeddings of two different models available via the OpenAI API (e.g., text-embedding-ada-002 and text-embedding-3-small) to represent the reviews and train classifiers on these embeddings.
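
The four evaluation metrics named above can be computed directly from the confusion counts. The following is a minimal sketch, assuming binary labels with 1 = positive; the function name `binary_metrics` and the toy labels are illustrative and not part of the assignment code.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for 0/1 labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    # Guard against empty denominators, mirroring zero_division=0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels, not taken from the IMDB data
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Precision answers "how many predicted positives were right", recall answers "how many actual positives were found", and F1 balances the two.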

## Working on a Local Machine

You can edit the notebook on Google Colab or locally on your computer. The project dependencies are managed by `uv`. For local development, download and install `uv` from [here](https://docs.astral.sh/uv/getting-started/installation/), then run the following command in your terminal to set up the environment:

```bash
uv sync
```

This will create a virtual environment under `.venv` in the project directory and install all required dependencies. You can connect this environment to the Jupyter notebook by selecting the appropriate kernel (in VSCode, hit Ctrl+Shift+P and search for "Python: Select Interpreter").

## Working on Google Colab

You can download the notebook from the GitHub repository and upload it to Google Colab. When you work on it you can save intermediate results to your Google Drive (find the command in the File menu). When you are done, download the completed notebook and upload it to your GitHub repository.

## Using OpenAI API

To use the OpenAI API, get the API key from [here](https://firebasestorage.googleapis.com/v0/b/uni-sofia.appspot.com/o/lit%2Foc.txt?alt=media&token=768020c6-62d2-4c1b-9c53-966c322922e0) and edit the first code cell to set the API key in the `OpenAI` client as shown below:
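
One way to set the key without pasting it into the notebook is to read it from an environment variable. This is a minimal sketch, assuming the key is stored in a variable named `OPENAI_API_KEY`; the helper names `load_api_key` and `make_openai_client` are illustrative.

```python
import os


def load_api_key(var_name="OPENAI_API_KEY"):
    """Read the API key from the environment and fail early if it is missing."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key


def make_openai_client():
    """Create the client lazily so the import only runs when it is needed."""
    from openai import OpenAI  # requires the `openai` package (v1.0+)
    return OpenAI(api_key=load_api_key())


# Usage (needs the package installed and the variable set):
# openai_client = make_openai_client()
```

Keeping the key out of the notebook source also avoids committing it to GitHub when the completed notebook is uploaded.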

## How to submit

When you are done with the assignment, please upload the completed notebook to your GitHub repository.


In [None]:
%pip install gensim

import gensim.downloader as api
import pandas as pd
import numpy as np
from openai import OpenAI

import re
import os

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, precision_score, recall_score, f1_score

df = pd.read_csv("https://github.com/febse/data/raw/refs/heads/main/ta/IMDB-Dataset-5000.csv.zip")
df.head()


import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")
sw = set(stopwords.words("english"))




In [None]:
if 'word2vec_model' not in globals():
    # Download the pretrained Word2Vec vectors (GoogleNews-vectors-negative300)
    word2vec_model = api.load('word2vec-google-news-300')

if 'glove_model' not in globals():
    # Download the pretrained GloVe vectors (glove-wiki-gigaword-300)
    glove_model = api.load('glove-wiki-gigaword-300')

def get_openai_embedding(text, model="text-embedding-3-small"):
    """
    Get embedding for a text using OpenAI's API (v1.0+ compatible).

    Parameters:
    text (str): The text to embed
    model (str): The embedding model to use (default: "text-embedding-3-small")

    Returns:
    list: The embedding vector
    """
    response = openai_client.embeddings.create(
        model=model,
        input=text
    )
    return response.data[0].embedding

# get_openai_embedding("This is a sample text.", model="text-embedding-3-small")  # pass an embedding model, not a chat model



In [None]:
#read data
df = pd.read_csv("https://github.com/febse/data/raw/refs/heads/main/ta/IMDB-Dataset-5000.csv.zip") #Downloads the specific subset of 5000 movie reviews
df = df.dropna(subset=["review", "sentiment"]).copy()

#clean
x = df["review"].astype(str).str.lower()
x = x.str.replace(r"<br\s*/?>", " ", regex=True)
x = x.str.replace(r"[^a-z\s]", " ", regex=True)
x = x.str.replace(r"\s+", " ", regex=True).str.strip()

y = df["sentiment"].map({"positive": 1, "negative": 0}).astype(int) #1 for positive and 0 for negative

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0, stratify=y) #splitting the data
#We perform learning on 75% of the data


In [None]:
vec1 = TfidfVectorizer(
    strip_accents="unicode", #removes accents
    lowercase=True,
    stop_words=list(sw), #removes "noise" words like "the", "is", "and"; note that the NLTK list also contains "not"
    ngram_range=(1, 1), #unigrams only: every word is a separate feature, so a phrase like "not good"
                         #is never seen as a unit
    min_df=5, #ignore words that appear in fewer than 5 documents; drops typos and very rare words and saves memory
    max_df=0.95 #ignore words that appear in more than 95% of documents
)

Xtr1 = vec1.fit_transform(X_train) #training matrix
Xte1 = vec1.transform(X_test)  #test matrix

lr1 = LogisticRegression(max_iter=5000, random_state=0)
lr1.fit(Xtr1, y_train)

yp1 = lr1.predict(Xte1)
pp1 = lr1.predict_proba(Xte1)[:, 1]

acc1 = lr1.score(Xte1, y_test) #accuracy
auc1 = roc_auc_score(y_test, pp1) #area under the curve
pre1 = precision_score(y_test, yp1, zero_division=0) #precision
rec1 = recall_score(y_test, yp1, zero_division=0) #recall
f11  = f1_score(y_test, yp1, zero_division=0) #harmonic mean of precision and recall

print("TFIDF(1,1)")
print(f"acc={acc1:.4f} pre={pre1:.4f} rec={rec1:.4f} f1={f11:.4f} auc={auc1:.4f}")


TFIDF(1,1)
acc=0.8520 pre=0.8306 rec=0.8873 f1=0.8580 auc=0.9326
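
To make the TF-IDF weights less of a black box, the following sketch recomputes them by hand on a toy corpus. It assumes the smoothed IDF that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The corpus and the function name `tfidf_rows` are made up for illustration.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = {t: count * idf[t] for t, count in tf.items()}
        # Fall back to 1.0 so an empty document does not divide by zero
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({t: w / norm for t, w in raw.items()})
    return rows


docs = ["great movie great cast", "boring movie", "great fun"]
rows = tfidf_rows(docs)
for row in rows:
    print({t: round(w, 3) for t, w in row.items()})
```

In the second toy document, "boring" occurs in one document and "movie" in two, so "boring" receives the larger weight: rarer terms count for more.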


In [None]:
#Here we are looking at Bigrams as opposed to unigrams in the code above.
#we want to examine the pairs of adjacent words to capture context
vec2 = TfidfVectorizer(
    strip_accents="unicode",
    lowercase=True,
    stop_words=list(sw),
    ngram_range=(2, 2),
    min_df=5,
    max_df=0.95
)

Xtr2 = vec2.fit_transform(X_train)
Xte2 = vec2.transform(X_test)

lr2 = LogisticRegression(max_iter=5000, random_state=0)
lr2.fit(Xtr2, y_train)

yp2 = lr2.predict(Xte2)
pp2 = lr2.predict_proba(Xte2)[:, 1]

acc2 = lr2.score(Xte2, y_test)
auc2 = roc_auc_score(y_test, pp2)
pre2 = precision_score(y_test, yp2, zero_division=0)
rec2 = recall_score(y_test, yp2, zero_division=0)
f12  = f1_score(y_test, yp2, zero_division=0)

print("TFIDF(2,2)")
print(f"acc={acc2:.4f} pre={pre2:.4f} rec={rec2:.4f} f1={f12:.4f} auc={auc2:.4f}")


TFIDF(2,2)
acc=0.7800 pre=0.7569 rec=0.8302 f1=0.7918 auc=0.8687
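
The difference between the two feature sets is easiest to see on a single sentence. Below is a small sketch of n-gram extraction; `ngrams` is a hypothetical helper, not the tokenizer that `TfidfVectorizer` uses internally.

```python
def ngrams(tokens, n):
    """Return all runs of n adjacent tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the plot was not good".split()
print(ngrams(tokens, 1))  # ['the', 'plot', 'was', 'not', 'good']
print(ngrams(tokens, 2))  # ['the plot', 'plot was', 'was not', 'not good']
```

Bigrams keep pairs such as "not good" together, but each pair is much rarer than its component words. With 3,750 training reviews and `min_df=5`, many informative pairs are probably dropped, which is one plausible reason for the lower bigram scores above. The stop-word list is also applied before the pairs are formed, so pairs that contain a stop word such as "not" likely never reach the model.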


In [None]:
#The goal of Latent Semantic Analysis (LSA) is to reduce the number of features (tokens)
#in a document-term matrix while preserving the similarity structure among documents.
#This is achieved by transforming the original high-dimensional space into a
#lower-dimensional space using Singular Value Decomposition (SVD)

tf_lsa = TfidfVectorizer(
    strip_accents="unicode",
    lowercase=True,
    stop_words=list(sw),
    ngram_range=(1, 1),
    min_df=5,
    max_df=0.95
)

Xtr_t = tf_lsa.fit_transform(X_train)
Xte_t = tf_lsa.transform(X_test)

svd = TruncatedSVD(n_components=300, n_iter=100, random_state=42)
Xtr_s = svd.fit_transform(Xtr_t)
Xte_s = svd.transform(Xte_t)

lr_s = LogisticRegression(max_iter=5000, random_state=0)
lr_s.fit(Xtr_s, y_train)

yp_s = lr_s.predict(Xte_s)
pp_s = lr_s.predict_proba(Xte_s)[:, 1]

acc_s = lr_s.score(Xte_s, y_test)
auc_s = roc_auc_score(y_test, pp_s)
pre_s = precision_score(y_test, yp_s, zero_division=0)
rec_s = recall_score(y_test, yp_s, zero_division=0)
f1_s  = f1_score(y_test, yp_s, zero_division=0)

print("LSA(300) from TFIDF(1,1)")
print(f"acc={acc_s:.4f} pre={pre_s:.4f} rec={rec_s:.4f} f1={f1_s:.4f} auc={auc_s:.4f}")



LSA(300) from TFIDF(1,1)
acc=0.8464 pre=0.8269 rec=0.8794 f1=0.8523 auc=0.9268


In [None]:
#Embeddings

#This code block switches from "counting words" (TF-IDF) to semantic vectors (Word2Vec).

#Instead of building a large table with one column for every unique word,
#it compresses every review into a 300-dimensional vector: the average of the vectors of its words.
#Unlike TF-IDF, Word2Vec places synonyms close together, so they get similar (not identical) vectors.

#Word2Vec
Xtr_txt = X_train.tolist()
Xte_txt = X_test.tolist()

Xtr_w2v = np.zeros((len(Xtr_txt), 300))
Xte_w2v = np.zeros((len(Xte_txt), 300))

for i in range(len(Xtr_txt)):
    ws = Xtr_txt[i].split()
    ws = [w for w in ws if (w not in sw) and (len(w) > 1) and (w in word2vec_model.key_to_index)]
    if len(ws) > 0:
        Xtr_w2v[i] = np.mean([word2vec_model[w] for w in ws], axis=0)

for i in range(len(Xte_txt)):
    ws = Xte_txt[i].split()
    ws = [w for w in ws if (w not in sw) and (len(w) > 1) and (w in word2vec_model.key_to_index)]
    if len(ws) > 0:
        Xte_w2v[i] = np.mean([word2vec_model[w] for w in ws], axis=0)

lr_w2v = LogisticRegression(max_iter=5000, random_state=0)
lr_w2v.fit(Xtr_w2v, y_train) #fit the logistic regression on the 300-dimensional averaged vectors

yp_w2v = lr_w2v.predict(Xte_w2v)
pp_w2v = lr_w2v.predict_proba(Xte_w2v)[:, 1]

#Compute the same evaluation metrics as before
acc_w2v = lr_w2v.score(Xte_w2v, y_test)
auc_w2v = roc_auc_score(y_test, pp_w2v)
pre_w2v = precision_score(y_test, yp_w2v, zero_division=0)
rec_w2v = recall_score(y_test, yp_w2v, zero_division=0)
f1_w2v  = f1_score(y_test, yp_w2v, zero_division=0)

print("W2V(avg)")
print(f"acc={acc_w2v:.4f} pre={pre_w2v:.4f} rec={rec_w2v:.4f} f1={f1_w2v:.4f} auc={auc_w2v:.4f}")


W2V(avg)
acc=0.8272 pre=0.8165 rec=0.8476 f1=0.8318 auc=0.9136


In [None]:
#GloVe
#This method uses a different source of word vectors.
#GloVe is trained on global word co-occurrence counts; the glove-wiki-gigaword-300 vectors come from Wikipedia and Gigaword news text
Xtr_g = np.zeros((len(Xtr_txt), 300))
Xte_g = np.zeros((len(Xte_txt), 300))

for i in range(len(Xtr_txt)):
    ws = Xtr_txt[i].split()
    ws = [w for w in ws if (w not in sw) and (len(w) > 1) and (w in glove_model.key_to_index)]
    if len(ws) > 0:
        Xtr_g[i] = np.mean([glove_model[w] for w in ws], axis=0)

for i in range(len(Xte_txt)):
    ws = Xte_txt[i].split()
    ws = [w for w in ws if (w not in sw) and (len(w) > 1) and (w in glove_model.key_to_index)]
    if len(ws) > 0:
        Xte_g[i] = np.mean([glove_model[w] for w in ws], axis=0)

lr_g = LogisticRegression(max_iter=5000, random_state=0)
lr_g.fit(Xtr_g, y_train)

yp_g = lr_g.predict(Xte_g)
pp_g = lr_g.predict_proba(Xte_g)[:, 1]

acc_g = lr_g.score(Xte_g, y_test)
auc_g = roc_auc_score(y_test, pp_g)
pre_g = precision_score(y_test, yp_g, zero_division=0)
rec_g = recall_score(y_test, yp_g, zero_division=0)
f1_g  = f1_score(y_test, yp_g, zero_division=0)

print("GloVe(avg)")
print(f"acc={acc_g:.4f} pre={pre_g:.4f} rec={rec_g:.4f} f1={f1_g:.4f} auc={auc_g:.4f}")


GloVe(avg)
acc=0.8216 pre=0.8097 rec=0.8444 f1=0.8267 auc=0.9125


In [None]:
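#Here we move from the static word embeddings (Word2Vec, GloVe) to contextual document embeddings produced by OpenAI embedding models via the API

from openai import OpenAI

try:
    #On Google Colab the key is read from the notebook secrets
    from google.colab import userdata
    key = userdata.get("OPENAI_API_KEY")
except ImportError:
    #On a local machine, fall back to an environment variable of the same name (assumption)
    key = os.environ.get("OPENAI_API_KEY")
openai_client = OpenAI(api_key=key)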


xtr = X_train.tolist()
xte = X_test.tolist()

m1 = "text-embedding-ada-002" #the model embeds each review as a whole, so no word-vector averaging is needed
bs = 100
ftr1 = "ada_tr.npy"
fte1 = "ada_te.npy"

if os.path.exists(ftr1):
    Etr1 = np.load(ftr1)
else:
    Etr1 = []
    for i in range(0, len(xtr), bs):
        r = openai_client.embeddings.create(model=m1, input=xtr[i:i+bs])
        Etr1 += [d.embedding for d in r.data]
    Etr1 = np.array(Etr1, dtype=np.float32)
    np.save(ftr1, Etr1)

if os.path.exists(fte1):
    Ete1 = np.load(fte1)
else:
    Ete1 = []
    for i in range(0, len(xte), bs):
        r = openai_client.embeddings.create(model=m1, input=xte[i:i+bs])
        Ete1 += [d.embedding for d in r.data]
    Ete1 = np.array(Ete1, dtype=np.float32)
    np.save(fte1, Ete1)

lr_oai1 = LogisticRegression(max_iter=5000, random_state=0)
lr_oai1.fit(Etr1, y_train)

yp_oai1 = lr_oai1.predict(Ete1)
pp_oai1 = lr_oai1.predict_proba(Ete1)[:, 1]

acc_oai1 = lr_oai1.score(Ete1, y_test)
auc_oai1 = roc_auc_score(y_test, pp_oai1)
pre_oai1 = precision_score(y_test, yp_oai1, zero_division=0)
rec_oai1 = recall_score(y_test, yp_oai1, zero_division=0)
f1_oai1  = f1_score(y_test, yp_oai1, zero_division=0)

print("OAI ada-002")
print(f"acc={acc_oai1:.4f} pre={pre_oai1:.4f} rec={rec_oai1:.4f} f1={f1_oai1:.4f} auc={auc_oai1:.4f}")


OAI ada-002
acc=0.9224 pre=0.9069 rec=0.9429 f1=0.9245 auc=0.9760


In [None]:
#text-embedding-3-small is a newer OpenAI embedding model than ada-002; the pipeline is otherwise unchanged

m2 = "text-embedding-3-small"
ftr2 = "3s_tr.npy"
fte2 = "3s_te.npy"

if os.path.exists(ftr2):
    Etr2 = np.load(ftr2)
else:
    Etr2 = []
    for i in range(0, len(xtr), bs):
        r = openai_client.embeddings.create(model=m2, input=xtr[i:i+bs])
        Etr2 += [d.embedding for d in r.data]
    Etr2 = np.array(Etr2, dtype=np.float32)
    np.save(ftr2, Etr2)

if os.path.exists(fte2):
    Ete2 = np.load(fte2)
else:
    Ete2 = []
    for i in range(0, len(xte), bs):
        r = openai_client.embeddings.create(model=m2, input=xte[i:i+bs])
        Ete2 += [d.embedding for d in r.data]
    Ete2 = np.array(Ete2, dtype=np.float32)
    np.save(fte2, Ete2)

lr_oai2 = LogisticRegression(max_iter=5000, random_state=0)
lr_oai2.fit(Etr2, y_train)

yp_oai2 = lr_oai2.predict(Ete2)
pp_oai2 = lr_oai2.predict_proba(Ete2)[:, 1]

acc_oai2 = lr_oai2.score(Ete2, y_test)
auc_oai2 = roc_auc_score(y_test, pp_oai2)
pre_oai2 = precision_score(y_test, yp_oai2, zero_division=0)
rec_oai2 = recall_score(y_test, yp_oai2, zero_division=0)
f1_oai2  = f1_score(y_test, yp_oai2, zero_division=0)

print("OAI 3-small")
print(f"acc={acc_oai2:.4f} pre={pre_oai2:.4f} rec={rec_oai2:.4f} f1={f1_oai2:.4f} auc={auc_oai2:.4f}")


OAI 3-small
acc=0.9416 pre=0.9331 rec=0.9524 f1=0.9427 auc=0.9836
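
Both OpenAI cells repeat the same pattern: slice the texts into batches of `bs`, request embeddings for each batch, and collect the vectors in order. Here is a sketch of that pattern as one reusable function, assuming an `embed_batch` callable that maps a list of texts to a list of vectors. The `fake_embed` function stands in for the API call so the example runs offline; it is not an OpenAI function.

```python
def embed_in_batches(texts, embed_batch, batch_size=100):
    """Call embed_batch on consecutive slices of texts and concatenate the results."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors.extend(embed_batch(batch))
    return vectors


def fake_embed(batch):
    """Offline stand-in for the API: a 2-number 'embedding' per text."""
    return [[float(len(text)), float(text.count(" "))] for text in batch]


texts = [f"review number {i}" for i in range(250)]
vectors = embed_in_batches(texts, fake_embed, batch_size=100)
print(len(vectors), vectors[0])
```

With the real client, `embed_batch` could be a small function that returns `[d.embedding for d in r.data]` for the response `r` of `openai_client.embeddings.create(...)`, mirroring the loops above.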


In [None]:
#Summary of the performance of the different text classification methods
#TF-IDF -
#Word2Vec -
#GloVe -
#Two OpenAI embedding models -

In [None]:
cmp_all = pd.DataFrame([
    {"m": "TFIDF(1,1)",       "p": Xtr1.shape[1],      "acc": acc1,     "pre": pre1,     "rec": rec1,     "f1": f11,     "auc": auc1},
    {"m": "TFIDF(2,2)",       "p": Xtr2.shape[1],      "acc": acc2,     "pre": pre2,     "rec": rec2,     "f1": f12,     "auc": auc2},
    {"m": "LSA(300) TFIDF(1,1)", "p": Xtr_s.shape[1],  "acc": acc_s,    "pre": pre_s,    "rec": rec_s,    "f1": f1_s,    "auc": auc_s},
    {"m": "W2V(avg)",         "p": Xtr_w2v.shape[1],   "acc": acc_w2v,  "pre": pre_w2v,  "rec": rec_w2v,  "f1": f1_w2v,  "auc": auc_w2v},
    {"m": "GloVe(avg)",       "p": Xtr_g.shape[1],     "acc": acc_g,    "pre": pre_g,    "rec": rec_g,    "f1": f1_g,    "auc": auc_g},
    {"m": "OAI ada-002",      "p": Etr1.shape[1],      "acc": acc_oai1, "pre": pre_oai1, "rec": rec_oai1, "f1": f1_oai1, "auc": auc_oai1},
    {"m": "OAI 3-small",      "p": Etr2.shape[1],      "acc": acc_oai2, "pre": pre_oai2, "rec": rec_oai2, "f1": f1_oai2, "auc": auc_oai2},
])

print("\n" + "="*70)
print("Comparison:")
print("="*70)
print(cmp_all.sort_values("f1", ascending=False).to_string(index=False))




Comparison:
                  m    p    acc      pre      rec       f1      auc
        OAI 3-small 1536 0.9416 0.933126 0.952381 0.942655 0.983554
        OAI ada-002 1536 0.9224 0.906870 0.942857 0.924514 0.976014
         TFIDF(1,1) 9100 0.8520 0.830609 0.887302 0.858020 0.932616
LSA(300) TFIDF(1,1)  300 0.8464 0.826866 0.879365 0.852308 0.926805
           W2V(avg)  300 0.8272 0.816514 0.847619 0.831776 0.913635
         GloVe(avg)  300 0.8216 0.809741 0.844444 0.826729 0.912514
         TFIDF(2,2) 5365 0.7800 0.756874 0.830159 0.791824 0.868697


The best-performing features are the OpenAI embeddings, most likely because they encode each review as a whole and so capture context. The newer model also outperforms the older one: the F1 score is 0.94 for text-embedding-3-small and 0.92 for text-embedding-ada-002. Interestingly, TF-IDF with unigrams (F1 = 0.86) outperforms the averaged Word2Vec (0.83) and GloVe (0.83) vectors, even though those embeddings are more complex. A possible explanation is that, for a basic sentiment analysis task, specific keywords are strong predictors on their own, whereas averaging all word vectors in a review blends positive, negative, and neutral words into a single vector and weakens the signal. The worst performer is TF-IDF with bigrams only (F1 = 0.79).
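
The dilution effect of averaging can be illustrated with made-up two-dimensional vectors, where the first coordinate stands for sentiment and the second for topic. These numbers are hypothetical and are not taken from Word2Vec or GloVe.

```python
# Hypothetical 2-D "embeddings": [sentiment, topic]
toy_vectors = {
    "terrible": [-1.0, 0.0],
    "movie":    [0.0, 1.0],
    "actor":    [0.0, 0.8],
    "scene":    [0.0, 0.9],
    "plot":     [0.0, 0.7],
}


def mean_vector(words, vectors):
    """Average the vectors of the known words, coordinate by coordinate."""
    known = [vectors[w] for w in words if w in vectors]
    dim = len(next(iter(vectors.values())))
    if not known:
        return [0.0] * dim  # same zero-vector fallback as in the cells above
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


short_review = ["terrible", "movie"]
long_review = ["terrible", "movie", "actor", "scene", "plot"]

print(mean_vector(short_review, toy_vectors))  # sentiment coordinate: -0.5
print(mean_vector(long_review, toy_vectors))   # sentiment coordinate: -0.2
```

The single strongly negative word carries less weight in the average as more neutral words are added, whereas a TF-IDF model keeps a separate feature for "terrible".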

In [None]:
#Trying a different classifier - Decision Tree

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate_decision_tree(name, X_train, X_test, y_train, y_test):
    #max_depth=50 caps the tree depth to limit overfitting; this is a loose cap, so the tree can still fit the training data closely
    dt = DecisionTreeClassifier(random_state=0, max_depth=50)
    dt.fit(X_train, y_train)

    yp = dt.predict(X_test)

    print(f"--- {name} (Decision Tree) ---")
    print(f"Accuracy:  {accuracy_score(y_test, yp):.4f}")
    print(f"Precision: {precision_score(y_test, yp, zero_division=0):.4f}")
    print(f"Recall:    {recall_score(y_test, yp, zero_division=0):.4f}")
    print(f"F1 Score:  {f1_score(y_test, yp, zero_division=0):.4f}\n")

#Here we apply the function to our already existing variables

#TF-IDF Unigrams
evaluate_decision_tree("TF-IDF (1,1)", Xtr1, Xte1, y_train, y_test)

#LSA
evaluate_decision_tree("LSA (SVD)", Xtr_s, Xte_s, y_train, y_test)

#Word2Vec
evaluate_decision_tree("Word2Vec", Xtr_w2v, Xte_w2v, y_train, y_test)

#OpenAI Embeddings
if 'Etr2' in globals():
    evaluate_decision_tree("OpenAI 3-small", Etr2, Ete2, y_train, y_test)

--- TF-IDF (1,1) (Decision Tree) ---
Accuracy:  0.7056
Precision: 0.6997
Recall:    0.7286
F1 Score:  0.7138

--- LSA (SVD) (Decision Tree) ---
Accuracy:  0.7280
Precision: 0.7231
Recall:    0.7460
F1 Score:  0.7344

--- Word2Vec (Decision Tree) ---
Accuracy:  0.6464
Precision: 0.6506
Recall:    0.6444
F1 Score:  0.6475

--- OpenAI 3-small (Decision Tree) ---
Accuracy:  0.7888
Precision: 0.7859
Recall:    0.7984
F1 Score:  0.7921



The decision tree classifier scores lower than logistic regression on every measure for each of the four feature sets tested. Logistic regression weighs all features simultaneously to separate positive from negative reviews. A decision tree, by contrast, splits on one feature at a time, which makes it harder to separate the classes in high-dimensional feature spaces. Nevertheless, the best results are again obtained with the OpenAI embeddings.
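
The contrast between axis-aligned splits and a linear boundary can be seen on a toy problem. In the sketch below the true label depends on the sum of two features; a linear rule separates the classes exactly, while the best single threshold on one feature does not. The data and helper names are invented for illustration, and a single split is a simplification of a full decision tree.

```python
# Integer grid of points; the label depends on the sum of the two features
points = [(x, y) for x in range(11) for y in range(11)]
labels = [1 if x + y > 10 else 0 for x, y in points]


def accuracy(preds, truth):
    """Share of predictions that match the true labels."""
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)


# Linear rule: combines both features in one decision
linear_preds = [1 if x + y > 10 else 0 for x, y in points]
linear_acc = accuracy(linear_preds, labels)

# Best single axis-aligned split: threshold on one feature at a time
best_stump_acc = 0.0
for feature in (0, 1):
    for threshold in range(11):
        preds = [1 if p[feature] > threshold else 0 for p in points]
        best_stump_acc = max(best_stump_acc, accuracy(preds, labels))

print(f"linear rule: {linear_acc:.3f}, best single split: {best_stump_acc:.3f}")
```

A tree needs many such splits to approximate a diagonal boundary. In a dense 300- or 1,536-dimensional embedding space the signal is likely spread over many coordinates at once, which may explain part of the gap observed above.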