# Introduction to Text Analytics Assignment

The first code cell below loads a subset of 5000 movie reviews from the IMDB dataset. Each review is labeled as either positive or negative. The task here is to compare the performance of different text classification methods on this dataset.

Train a classifier of your choice (e.g., logistic regression, SVM, decision tree) using only the review text to create features. Evaluate the classifier using accuracy, precision, recall, and F1-score when predicting the sentiment labels.
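The four metrics named above can be computed with scikit-learn. The sketch below is a minimal illustration, not a prescribed solution: the helper name `evaluate` and the toy labels are assumptions made for the example, and `positive` is treated as the class of interest.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def evaluate(y_true, y_pred, positive_label="positive"):
    """Return accuracy, precision, recall and F1 for the positive class."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true,
        y_pred,
        average="binary",
        pos_label=positive_label,
        zero_division=0,
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels, invented for illustration only.
y_true = ["positive", "negative", "positive", "negative"]
y_pred = ["positive", "positive", "positive", "negative"]
scores = evaluate(y_true, y_pred)
print(scores)
```

In the assignment, `y_true` would be the sentiment labels of the reviews you evaluate on and `y_pred` the classifier's predictions for the same reviews.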

- Use TF-IDF vectorization as one of the feature extraction methods with varying n-gram ranges (e.g., unigrams, bigrams).
- Compare the results with those obtained from a lower-dimensional representation of the text, computed by applying Singular Value Decomposition (SVD) to the TF-IDF matrix.
- Experiment with word embeddings (e.g., Word2Vec, GloVe): represent each review by the average of its word vectors to obtain a document-level representation, and train a classifier on these vectors.
- Finally, use the document embeddings of two different LLMs available via the OpenAI API (e.g., text-embedding-ada-002 and text-embedding-3-small) to represent the reviews and train classifiers on these embeddings.
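
The TF-IDF and SVD variants from the list above could be set up along the following lines. This is a sketch under assumptions: the toy corpus is invented, `build_pipeline` is a hypothetical helper, logistic regression is just one of the classifiers the assignment allows, and `n_components=2` only suits this tiny example.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus, invented for illustration; the assignment uses df["review"].
texts = [
    "a great film with a wonderful cast",
    "wonderful story and great acting",
    "I loved this great movie",
    "a terrible film with a boring plot",
    "boring story and terrible acting",
    "I hated this awful movie",
]
labels = ["positive"] * 3 + ["negative"] * 3


def build_pipeline(ngram_range=(1, 1), n_components=None):
    """TF-IDF features, optionally reduced with truncated SVD, then a classifier."""
    steps = [TfidfVectorizer(ngram_range=ngram_range)]
    if n_components is not None:
        steps.append(TruncatedSVD(n_components=n_components, random_state=0))
    steps.append(LogisticRegression(max_iter=1000))
    return make_pipeline(*steps)


configs = {
    "unigrams": {"ngram_range": (1, 1)},
    "unigrams+bigrams": {"ngram_range": (1, 2)},
    "unigrams+SVD": {"ngram_range": (1, 1), "n_components": 2},
}

results = {}
for name, kwargs in configs.items():
    model = build_pipeline(**kwargs).fit(texts, labels)
    # Number of features the classifier sees in this configuration.
    results[name] = model[-1].coef_.shape[1]

print(results)
```

The loop only reports the feature dimension of each configuration. For the actual comparison, fit each pipeline on a training split and compute the metrics on reviews the model has not seen.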

## Working on a Local Machine

You can edit the notebook on Google Colab or locally on your computer. The project dependencies are managed by `uv`. For local development, install `uv` by following the [installation guide](https://docs.astral.sh/uv/getting-started/installation/), then run the following command in your terminal to set up the environment:

```bash
uv sync
```

This will create a virtual environment under `.venv` in the project directory and install all required dependencies. You can connect this environment to the Jupyter notebook by selecting the appropriate kernel (in VSCode, hit Ctrl+Shift+P and search for "Python: Select Interpreter").
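
To confirm that the notebook kernel is really running from the project environment, a quick check like the one below can be run in a cell. It is a minimal sketch that relies only on the standard library; the expectation that the environment directory is named `.venv` comes from the setup described above.

```python
import sys
from pathlib import Path

# In a virtual environment, sys.prefix differs from sys.base_prefix.
in_virtualenv = sys.prefix != sys.base_prefix
env_name = Path(sys.prefix).name  # expected to be ".venv" for this project

print(f"interpreter: {sys.executable}")
print(f"inside a virtual environment: {in_virtualenv} ({env_name})")
```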

## Working on Google Colab

You can download the notebook from the GitHub repository and upload it to Google Colab. While you work on it, you can save intermediate results to your Google Drive (the command is in the File menu). When you are done, download the completed notebook and upload it to your GitHub repository.

## Using OpenAI API

To use the OpenAI API, get the API key from [here](https://firebasestorage.googleapis.com/v0/b/uni-sofia.appspot.com/o/lit%2Foc.txt?alt=media&token=768020c6-62d2-4c1b-9c53-966c322922e0) and edit the first code cell: replace `WRITE_THE_API_KEY_HERE` in the `OpenAI(api_key=...)` call with the key.
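
If you would rather not keep the key inside a notebook that ends up on GitHub, one possible alternative is to read it from an environment variable. The helper `resolve_api_key` below is hypothetical and not part of the assignment code; `OPENAI_API_KEY` is the variable name the `openai` package itself looks for by default.

```python
import os

# Placeholder string used in the first code cell of the notebook.
PLACEHOLDER = "WRITE_THE_API_KEY_HERE"


def resolve_api_key(explicit_key=PLACEHOLDER, env_var="OPENAI_API_KEY"):
    """Use a key pasted into the notebook, else fall back to an environment variable."""
    if explicit_key and explicit_key != PLACEHOLDER:
        return explicit_key.strip()
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise ValueError(f"No API key found: edit the cell or set {env_var}.")
    return key


# Possible usage in the first code cell (not run here):
# openai_client = OpenAI(api_key=resolve_api_key())
```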

## How to submit

When you are done with the assignment, please upload the completed notebook to your GitHub repository.


In [None]:
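# Installs gensim on Colab. In a uv-managed environment without pip, this magic
# fails with "No module named pip"; manage packages with uv there instead.
%pip install gensim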

import gensim.downloader as api
import pandas as pd
from openai import OpenAI

openai_client = OpenAI(api_key="WRITE_THE_API_KEY_HERE")

df = pd.read_csv("https://github.com/febse/data/raw/refs/heads/main/ta/IMDB-Dataset-5000.csv.zip")
df.head()

/home/amarov/stats/ta2025-hw/.venv/bin/python: No module named pip
Note: you may need to restart the kernel to use updated packages.


| Unnamed: 0 | review | sentiment |
|---|---|---|
| 0 | I really liked this Summerslam due to the look... | positive |
| 1 | Not many television shows appeal to quite as m... | positive |
| 2 | The film quickly gets to a major chase scene w... | negative |
| 3 | Jane Austen would definitely approve of this o... | positive |
| 4 | Expectations were somewhat high for me when I ... | negative |
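
With the `review` and `sentiment` columns shown above, a held-out test set can be created before any features are built. The sketch below uses an invented toy frame with the same column names; the 80/20 split and `random_state=42` are arbitrary choices for the example, not requirements of the assignment.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with the same columns as the dataset; the rows are invented.
toy = pd.DataFrame(
    {
        "review": [f"review text {i}" for i in range(10)],
        "sentiment": ["positive", "negative"] * 5,
    }
)

# Stratifying keeps the share of positive and negative reviews equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    toy["review"],
    toy["sentiment"],
    test_size=0.2,
    stratify=toy["sentiment"],
    random_state=42,
)

print(len(X_train), len(X_test))
```

Reusing the same split for every feature representation makes the comparison between methods fair.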


In [None]:
if 'word2vec_model' not in globals():
    # Download the pretrained Word2Vec vectors (GoogleNews-vectors-negative300)
    word2vec_model = api.load('word2vec-google-news-300')

if 'glove_model' not in globals():
    # Download the pretrained GloVe vectors (glove-wiki-gigaword-300)
    glove_model = api.load('glove-wiki-gigaword-300')

def get_openai_embedding(text, model="text-embedding-3-small"):
    """
    Get embedding for a text using OpenAI's API (v1.0+ compatible).
    
    Parameters:
    text (str): The text to embed
    model (str): The embedding model to use (default: "text-embedding-3-small")
    
    Returns:
    list: The embedding vector
    """
    response = openai_client.embeddings.create(
        model=model,
        input=text
    )
    return response.data[0].embedding

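# Example call (uncomment to test). The model must be an embedding model:
# as the output below shows, the chat model "gpt-4.1" is rejected with a 403 error.
# get_openai_embedding("This is a sample text.", model="text-embedding-3-small")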

PermissionDeniedError: Error code: 403 - {'error': {'message': 'You are not allowed to generate embeddings from this model', 'type': 'invalid_request_error', 'param': None, 'code': None}}
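
The pretrained `word2vec_model` and `glove_model` loaded above return one vector per word, so each review still has to be reduced to a single vector. The sketch below shows the averaging approach on an invented three-dimensional lookup. The helper `average_word_vectors` and the whitespace tokenization with lowercasing are assumptions for illustration. Lowercasing fits the lowercase GloVe vocabulary, while the Google News Word2Vec vocabulary is case-sensitive.

```python
import numpy as np


def average_word_vectors(text, vectors, dim):
    """Average the vectors of the tokens in `text` that `vectors` contains."""
    tokens = text.lower().split()
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        # No known token: return a zero vector so every review still gets a row.
        return np.zeros(dim, dtype=np.float32)
    return np.mean(known, axis=0)


# Toy lookup standing in for a pretrained model (values invented).
toy_vectors = {
    "good": np.array([1.0, 0.0, 0.0], dtype=np.float32),
    "movie": np.array([0.0, 1.0, 0.0], dtype=np.float32),
}

doc_vector = average_word_vectors("Good movie overall", toy_vectors, dim=3)
empty_vector = average_word_vectors("unknown words only", toy_vectors, dim=3)
print(doc_vector, empty_vector)
```

The gensim models support the same `token in vectors` and `vectors[token]` lookups, so a call such as `average_word_vectors(review, glove_model, glove_model.vector_size)` should work for a single review. Stacking the results with `np.vstack` gives a feature matrix for the classifier.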