# Introduction to Text Analytics Assignment

The first code cell below loads a subset of 5000 movie reviews from the IMDB dataset. Each review is labeled as either positive or negative. The task here is to compare the performance of different text classification methods on this dataset.

Train a classifier of your choice (e.g., logistic regression, SVM, decision tree) using only the review text to create features. Evaluate the classifier using accuracy, precision, recall, and F1-score when predicting the sentiment labels.

- Use TF-IDF vectorization as one of the feature extraction methods with varying n-gram ranges (e.g., unigrams, bigrams).
- Compare the results with those obtained from a lower-dimensional representation of the text, computed by applying Singular Value Decomposition (SVD) to the TF-IDF matrix.
- Experiment with word embeddings (e.g., Word2Vec, GloVe) to represent the text data and train a classifier on these embeddings (by using the average of word vectors for each review to create a document-level representation).
- Finally, use the document embeddings of two different models available via the OpenAI API (e.g., `text-embedding-ada-002` and `text-embedding-3-small`) to represent the reviews, and train classifiers on these embeddings.
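
The four evaluation metrics named above can be computed in one pass with scikit-learn. The sketch below is only an illustration: the labels are made up (they are not results from this dataset), the helper name `summarize_metrics` is hypothetical, and it assumes the positive class is encoded as 1.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def summarize_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for the positive class (label 1)."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", pos_label=1, zero_division=0
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels for illustration only: 3 TP, 1 FN, 1 FP, 3 TN
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = summarize_metrics(y_true, y_pred)
print(metrics)
```

Returning a dictionary makes it easy to collect one row per feature representation and compare them side by side in a DataFrame.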

## Working on a Local Machine

You can edit the notebook on Google Colab or locally on your computer. The project dependencies are managed by `uv`. For local development, download and install `uv` from [here](https://docs.astral.sh/uv/getting-started/installation/), then run the following command in your terminal to set up the environment:

```bash
uv sync
```

This will create a virtual environment under `.venv` in the project directory and install all required dependencies. You can connect this environment to the Jupyter notebook by selecting the appropriate kernel (in VSCode, hit Ctrl+Shift+P and search for "Python: Select Interpreter").

## Working on Google Colab

You can download the notebook from the GitHub repository and upload it to Google Colab. When you work on it you can save intermediate results to your Google Drive (find the command in the File menu). When you are done, download the completed notebook and upload it to your GitHub repository.

## Using OpenAI API

To use the OpenAI API, get the API key from [here](https://firebasestorage.googleapis.com/v0/b/uni-sofia.appspot.com/o/lit%2Foc.txt?alt=media&token=768020c6-62d2-4c1b-9c53-966c322922e0) and edit the first code cell, replacing `WRITE_THE_API_KEY_HERE` with the key when creating the `OpenAI` client.
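
Because the completed notebook ends up in a GitHub repository, it may be preferable not to paste the key into the cell itself. A minimal sketch, assuming the key has been stored in an environment variable named `OPENAI_API_KEY` (the variable name and the helper `read_api_key` are assumptions, not part of the assignment):

```python
import os


def read_api_key(env_var="OPENAI_API_KEY"):
    """Read the API key from an environment variable and fail early if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running the notebook.")
    return key


# Hypothetical usage in the first code cell:
# openai_client = OpenAI(api_key=read_api_key())
```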

## How to submit

When you are done with the assignment, please upload the completed notebook to your GitHub repository.


In [5]:
%pip install gensim

import gensim.downloader as api
import pandas as pd
from openai import OpenAI

openai_client = OpenAI(api_key="WRITE_THE_API_KEY_HERE")

df = pd.read_csv("https://github.com/febse/data/raw/refs/heads/main/ta/IMDB-Dataset-5000.csv.zip")
df.head()

Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
Installing collected packages: gensim
Successfully installed gensim-4.4.0


|   | review | sentiment |
|---|--------|-----------|
| 0 | I really liked this Summerslam due to the look... | positive |
| 1 | Not many television shows appeal to quite as m... | positive |
| 2 | The film quickly gets to a major chase scene w... | negative |
| 3 | Jane Austen would definitely approve of this o... | positive |
| 4 | Expectations were somewhat high for me when I ... | negative |


In [None]:
if 'word2vec_model' not in globals():
    # Download the pretrained Word2Vec vectors (GoogleNews-vectors-negative300)
    word2vec_model = api.load('word2vec-google-news-300')

if 'glove_model' not in globals():
    # Download the pretrained GloVe vectors (glove-wiki-gigaword-300)
    glove_model = api.load('glove-wiki-gigaword-300')

def get_openai_embedding(text, model="text-embedding-3-small"):
    """
    Get embedding for a text using OpenAI's API (v1.0+ compatible).

    Parameters:
    text (str): The text to embed
    model (str): The embedding model to use (default: "text-embedding-3-small")

    Returns:
    list: The embedding vector
    """
    response = openai_client.embeddings.create(
        model=model,
        input=text
    )
    return response.data[0].embedding

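# Example call (kept commented out). Note that "gpt-4.1" is a chat model, not an
# embedding model, so the request is rejected with the error shown below:
# get_openai_embedding("This is a sample text.", model="gpt-4.1")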

PermissionDeniedError: Error code: 403 - {'error': {'message': 'You are not allowed to generate embeddings from this model', 'type': 'invalid_request_error', 'param': None, 'code': None}}
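
`get_openai_embedding` sends one request per text, which means thousands of requests for the full dataset. The embeddings endpoint also accepts a list of strings, so the requests can be batched. The sketch below is one possible way to do this; the helper name `embed_in_batches` and the batch size of 100 are assumptions, and the fake client exists only so the example runs without network access or an API key.

```python
from types import SimpleNamespace


def embed_in_batches(client, texts, model="text-embedding-3-small", batch_size=100):
    """Embed a sequence of texts, sending `batch_size` texts per API request."""
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        response = client.embeddings.create(model=model, input=batch)
        # Sort by `index` so that each vector lines up with its input text
        ordered = sorted(response.data, key=lambda item: item.index)
        embeddings.extend(item.embedding for item in ordered)
    return embeddings


class _FakeEmbeddings:
    """Offline stand-in that mimics the shape of the real response object."""

    def create(self, model, input):
        data = [
            SimpleNamespace(index=i, embedding=[float(len(text))])
            for i, text in enumerate(input)
        ]
        return SimpleNamespace(data=data)


fake_client = SimpleNamespace(embeddings=_FakeEmbeddings())
vectors = embed_in_batches(fake_client, ["a", "bb", "ccc"], batch_size=2)
print(vectors)
```

With the real client, the call would look like `embed_in_batches(openai_client, X_train.tolist())`, and the result can be turned into a feature matrix with `np.array(...)`.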

In [6]:
import re
import string
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.decomposition import PCA, TruncatedSVD
from gensim.models import Word2Vec
import numpy as np

In [7]:
df.shape

(5000, 2)

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     5000 non-null   object
 1   sentiment  5000 non-null   object
dtypes: object(2)
memory usage: 78.3+ KB


In [9]:
df['sentiment'].unique()

array(['positive', 'negative'], dtype=object)

In [10]:
df['sentiment'].value_counts()

| sentiment | count |
|-----------|-------|
| positive | 2519 |
| negative | 2481 |


In [11]:
df['review'].iloc[0]

"I really liked this Summerslam due to the look of the arena, the curtains and just the look overall was interesting to me for some reason. Anyways, this could have been one of the best Summerslam's ever if the WWF didn't have Lex Luger in the main event against Yokozuna, now for it's time it was ok to have a huge fat man vs a strong man but I'm glad times have changed. It was a terrible main event just like every match Luger is in is terrible. Other matches on the card were Razor Ramon vs Ted Dibiase, Steiner Brothers vs Heavenly Bodies, Shawn Michaels vs Curt Hening, this was the event where Shawn named his big monster of a body guard Diesel, IRS vs 1-2-3 Kid, Bret Hart first takes on Doink then takes on Jerry Lawler and stuff with the Harts and Lawler was always very interesting, then Ludvig Borga destroyed Marty Jannetty, Undertaker took on Giant Gonzalez in another terrible match, The Smoking Gunns and Tatanka took on Bam Bam Bigelow and the Headshrinkers, and Yokozuna defended th

In [12]:
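def clean_text(text):
    """Lowercase a review and strip HTML tags, punctuation, and extra whitespace."""
    text = text.lower()
    # Remove HTML tags such as <br /> (non-greedy match)
    text = re.sub(r"<.*?>", "", text)
    # Delete all ASCII punctuation characters
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse runs of whitespace into single spaces
    text = re.sub(r"\s+", " ", text).strip()
    return text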
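
One detail of `clean_text` worth noting: tags are replaced with an empty string, so words separated only by `<br />` are joined (for example, `"Great movie!<br /><br />Loved it."` becomes `"great movieloved it"`). The variant below, under the hypothetical name `clean_text_spaced`, replaces each tag with a space instead. It is a sketch of an alternative, not the cleaning used for the results in this notebook.

```python
import re
import string

# Translation table that deletes all ASCII punctuation characters
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text_spaced(text):
    """Like clean_text, but HTML tags become spaces so adjacent words stay separate."""
    text = text.lower()
    text = re.sub(r"<.*?>", " ", text)  # tag -> space instead of empty string
    text = text.translate(_PUNCT_TABLE)
    return re.sub(r"\s+", " ", text).strip()


sample = "Great movie!<br /><br />Loved it."
cleaned = clean_text_spaced(sample)
print(cleaned)
```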

In [13]:
df['review'] = df['review'].apply(clean_text)

In [14]:
# Encode the labels: positive -> 1, negative -> 0
df['sentiment'] = df['sentiment'].apply(lambda x: 1 if x == "positive" else 0)

In [15]:
df.head()

|   | review | sentiment |
|---|--------|-----------|
| 0 | i really liked this summerslam due to the look... | 1 |
| 1 | not many television shows appeal to quite as m... | 1 |
| 2 | the film quickly gets to a major chase scene w... | 0 |
| 3 | jane austen would definitely approve of this o... | 1 |
| 4 | expectations were somewhat high for me when i ... | 0 |


In [16]:
X = df['review']
y = df['sentiment']

In [18]:
# Stratified 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,
    random_state=42
)

In [19]:
X_train.shape,X_test.shape

((4000,), (1000,))

In [None]:
y_train.shape,y_test.shape

((4000,), (1000,))

In [20]:
# Unigram TF-IDF features
tfidf_uni = TfidfVectorizer(ngram_range=(1, 1))
# Fit the vocabulary on the training set only, then reuse it for the test set
X_train_vec = tfidf_uni.fit_transform(X_train)
X_test_vec = tfidf_uni.transform(X_test)
# Train the classifier
model = LogisticRegression()
model.fit(X_train_vec, y_train)
# Evaluate on the held-out test set
prediction = model.predict(X_test_vec)
print("Accuracy:", accuracy_score(y_test, prediction))

Accuracy: 0.859


In [21]:
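# Unigram + bigram TF-IDF features (ngram_range=(1, 2) includes both)
tfidf_bi = TfidfVectorizer(ngram_range=(1, 2))
# Fit the vocabulary on the training set only, then reuse it for the test set
X_train_vec = tfidf_bi.fit_transform(X_train)
X_test_vec = tfidf_bi.transform(X_test)
# Train the classifier
model = LogisticRegression()
model.fit(X_train_vec, y_train)
# Evaluate on the held-out test set
prediction = model.predict(X_test_vec)
print("Accuracy:", accuracy_score(y_test, prediction))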

Accuracy: 0.86
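
The two cells above report accuracy only, while the assignment also asks for precision, recall, and F1-score. One way to get all of them for several n-gram ranges is to wrap the vectorizer and the classifier in a pipeline and call `classification_report`. The sketch below uses a tiny made-up corpus so that it runs on its own; the variable names are illustrative, and the numbers it produces say nothing about the IMDB data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline

# Toy data for illustration only (1 = positive, 0 = negative)
train_texts = [
    "a wonderful and moving film", "great acting and a great story",
    "i loved every minute", "truly excellent movie",
    "a dull and boring film", "terrible acting and a weak story",
    "i hated every minute", "truly awful movie",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_texts = ["a great and moving story", "boring and awful acting"]
test_labels = [1, 0]

reports = {}
for ngram_range in [(1, 1), (1, 2)]:
    # The pipeline fits TF-IDF on the training texts only and reuses it at predict time
    pipeline = make_pipeline(
        TfidfVectorizer(ngram_range=ngram_range),
        LogisticRegression(max_iter=1000),
    )
    pipeline.fit(train_texts, train_labels)
    predicted = pipeline.predict(test_texts)
    reports[ngram_range] = classification_report(
        test_labels, predicted, output_dict=True, zero_division=0
    )

for ngram_range, report in reports.items():
    print(ngram_range, report["macro avg"])
```

In the notebook, `train_texts` and `test_texts` would be replaced by `X_train` and `X_test`, and the labels by `y_train` and `y_test`.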


In [22]:
# TF-IDF (unigrams) followed by truncated SVD
tfidf = TfidfVectorizer(ngram_range=(1, 1))
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)
# Apply SVD: reduce the TF-IDF matrix to 100 components
svd = TruncatedSVD(n_components=100, random_state=42)
X_train_svd = svd.fit_transform(X_train_tfidf)
X_test_svd = svd.transform(X_test_tfidf)
# Train the classifier
model_svd = LogisticRegression()
model_svd.fit(X_train_svd, y_train)
# Evaluate on the held-out test set
prediction = model_svd.predict(X_test_svd)
print("Accuracy:", accuracy_score(y_test, prediction))

Accuracy: 0.828


In [23]:
# TF-IDF (unigrams + bigrams) followed by truncated SVD
tfidf_bi = TfidfVectorizer(ngram_range=(1, 2))
X_train_bi = tfidf_bi.fit_transform(X_train)
X_test_bi = tfidf_bi.transform(X_test)
# Apply SVD: reduce the TF-IDF matrix to 100 components
svd_bi = TruncatedSVD(n_components=100, random_state=42)
X_train_svd_bi = svd_bi.fit_transform(X_train_bi)
X_test_svd_bi = svd_bi.transform(X_test_bi)
# Train the classifier
model_svd_bi = LogisticRegression()
model_svd_bi.fit(X_train_svd_bi, y_train)
# Evaluate on the held-out test set
prediction = model_svd_bi.predict(X_test_svd_bi)
print("Accuracy:", accuracy_score(y_test, prediction))

Accuracy: 0.829
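
Both SVD runs use 100 components. To judge how much of the TF-IDF matrix's variance a given number of components retains, `TruncatedSVD` exposes `explained_variance_ratio_` after fitting. The sketch below shows the idea on a small made-up corpus with 3 components; the corpus and the component count are placeholders, not a recommendation for the IMDB data.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus for illustration only
docs = [
    "the plot was slow but the acting was strong",
    "strong acting saved a slow plot",
    "a fast and funny comedy",
    "funny jokes and a fast pace",
    "the soundtrack was loud and distracting",
    "a loud soundtrack ruined the mood",
]

tfidf_matrix = TfidfVectorizer().fit_transform(docs)

# n_components must not exceed the number of TF-IDF features
svd = TruncatedSVD(n_components=3, random_state=42)
reduced = svd.fit_transform(tfidf_matrix)

# Fraction of the total variance kept by the retained components
retained = svd.explained_variance_ratio_.sum()
print(reduced.shape, round(float(retained), 3))
```

Repeating this for a few values of `n_components` on `X_train_tfidf` would show how the retained variance and the test accuracy move together.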


In [None]:
# Word2Vec trained on the training reviews
# Tokenize by whitespace (the text has already been cleaned)
token_train = [text.split() for text in X_train]
token_test = [text.split() for text in X_test]
# Train Word2Vec on the training tokens only
w2v = Word2Vec(sentences=token_train, vector_size=100, window=5, min_count=2, workers=4)

def sentence_vector(tokens, model):
    # Average the vectors of in-vocabulary tokens; fall back to zeros if none are known
    vectors = [model.wv[word] for word in tokens if word in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

# Convert each review to a document vector
X_train_w2v = np.vstack([sentence_vector(t, w2v) for t in token_train])
X_test_w2v = np.vstack([sentence_vector(t, w2v) for t in token_test])
# Train the classifier
model_w2v = LogisticRegression(max_iter=1000)
model_w2v.fit(X_train_w2v, y_train)
# Evaluate on the held-out test set
y_pred = model_w2v.predict(X_test_w2v)
print("Accuracy:", accuracy_score(y_test, y_pred))



Accuracy: 0.663
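
The cell above trains Word2Vec from scratch on the 4000 training reviews, whereas the assignment also mentions pretrained vectors such as the `word2vec_model` and `glove_model` objects loaded earlier. The averaging step itself is the same in both cases. The sketch below, with the hypothetical helper `average_vector`, works with any mapping that supports `token in vectors` and `vectors[token]`; a plain dictionary of toy vectors stands in for the pretrained model here so that the example runs without a download.

```python
import numpy as np


def average_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    found = [vectors[token] for token in tokens if token in vectors]
    if not found:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(found, axis=0)


# Toy 3-dimensional vectors standing in for a pretrained model
toy_vectors = {
    "good": np.array([1.0, 0.0, 0.0]),
    "movie": np.array([0.0, 1.0, 0.0]),
}

doc_vec = average_vector(["good", "movie", "unknownword"], toy_vectors, dim=3)
empty_vec = average_vector(["unknownword"], toy_vectors, dim=3)
print(doc_vec, empty_vec)
```

Assuming the pretrained gensim objects behave like such a mapping, a call along the lines of `average_vector(tokens, glove_model, glove_model.vector_size)` would produce one 300-dimensional vector per review.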
