# IDS Week 14_2: Text Classification with Sentence Embeddings & Logistic Regression

In this notebook, we’ll walk step-by-step through building a movie-review classifier on the Rotten Tomatoes dataset using pretrained sentence embeddings and scikit-learn’s Logistic Regression. We’ll cover:

1. **Environment Setup** – install and import libraries.  
2. **Load & Inspect Data** – load the Rotten Tomatoes reviews and take a first look.  
3. **Generate Embeddings** – encode each review into a fixed-size vector via sentence-transformers.  
4. **Train/Test Split** – partition our data for training and evaluation.  
5. **Train Classifier** – fit a Logistic Regression model on embedding features.  
6. **Evaluate Performance** – compute accuracy, confusion matrix, and classification report.  
7. **Predict New Texts** – classify user-supplied reviews and report predicted probabilities.  

## **Step 1: Environment Setup**

First, install the required libraries (`sentence-transformers` for embeddings, `datasets` for loading our data). Imports appear in the cells where they are first used.

In [1]:
# Install sentence-transformers for easy embeddings
!pip install sentence-transformers
# Install the Hugging Face datasets library to load Rotten Tomatoes reviews
!pip install datasets


## **Step 2: Load & Inspect Data**
We’ll load the [Cornell Movie Review “Rotten Tomatoes” dataset](https://huggingface.co/datasets/cornell-movie-review-data/rotten_tomatoes) (only the training split, which gives us 8,530 reviews to work with) and convert it to a pandas DataFrame for easy inspection.
- `label = 1` means "fresh"
- `label = 0` means "rotten"

In [4]:
from datasets import load_dataset
import numpy as np
import pandas as pd

# Load the Rotten Tomatoes dataset
dataset = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")
df = dataset.to_pandas()
df.head()

|   | text | label |
|---|------|-------|
| 0 | the rock is destined to be the 21st century's ... | 1 |
| 1 | the gorgeously elaborate continuation of " the... | 1 |
| 2 | effective but too-tepid biopic | 1 |
| 3 | if you sometimes like to go to the movies to h... | 1 |
| 4 | emerges as something rare , an issue movie tha... | 1 |


In [6]:
# Check class distribution to ensure balance
df['label'].value_counts()

| label | count |
|-------|-------|
| 1 | 4265 |
| 0 | 4265 |


## **Step 3: Generate Sentence Embeddings**
We use the [`SentenceTransformer` wrapper](https://huggingface.co/sentence-transformers), which exposes a simple `.encode()` method. It returns an (n_samples × embedding_dim) NumPy array.

In [9]:
# sentence-transformers for pretrained encoder
from sentence_transformers import SentenceTransformer

# Instantiate the pretrained embedding model
model_name = "sentence-transformers/all-MiniLM-L6-v2"
embedder = SentenceTransformer(model_name)

# Encode all review texts to embeddings
# convert_to_numpy=True returns a NumPy array
# show_progress_bar displays encoding progress
X_embs = embedder.encode(
    df["text"].to_list(),
    convert_to_numpy = True,
    show_progress_bar = True
)

# Inspect shape: (n_reviews, embedding_dimension)
print("Embeddings Shape:", X_embs.shape)

Embeddings Shape: (8530, 384)


In [12]:
X_embs

array([[ 1.1072483e-02,  4.2864863e-02,  7.8563429e-02, ...,
        -1.0394953e-01,  5.1843394e-02,  2.2958547e-03],
       [-4.4598565e-02, -6.0998961e-02, -9.7962180e-03, ...,
         2.0241024e-02, -3.2264404e-02,  4.2230777e-02],
       [-7.8210659e-02,  4.5944989e-02, -3.3882767e-02, ...,
         2.6836723e-02,  5.1949125e-02, -1.3693033e-02],
       ...,
       [-2.8554400e-02, -9.4656483e-04,  1.4808460e-02, ...,
        -6.0052190e-02,  8.4265284e-02,  2.6126364e-02],
       [ 3.1843908e-02, -1.4560679e-02,  3.0452888e-02, ...,
        -2.1129327e-02, -6.3253634e-02, -5.1687439e-03],
       [-7.3710355e-05, -6.2387969e-02,  1.7913787e-02, ...,
        -5.7953559e-02,  2.9048488e-02,  1.3152838e-02]], dtype=float32)

## **Step 4: Split into Training & Testing Sets**
We’ll hold out 20% of our embeddings/labels for testing, using a fixed random_state for reproducibility.

In [13]:
from sklearn.model_selection import train_test_split
# Define features (embeddings) and target (labels)
y = df["label"].to_numpy()

# Split embeddings and labels into train/test
X_train, X_test, y_train, y_test = train_test_split(
    X_embs,
    y,
    test_size = 0.2,
    random_state = 42
)

# Confirm sizes
print("Train:", X_train.shape, y_train.shape)
print("Test: ", X_test.shape, y_test.shape)
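
Since Step 2 showed the two classes are evenly balanced, one optional variation is to pass `stratify` so that both partitions keep that balance. The sketch below is only an illustration on synthetic stand-in data: `X_demo`, `y_demo`, and the `Xd_*`/`yd_*` names are placeholders, not the real embeddings.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): 1,000 random 384-dim vectors with balanced labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 384)).astype(np.float32)
y_demo = np.array([0, 1] * 500)

# stratify=y_demo keeps the 50/50 class ratio in both partitions.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Train:", Xd_train.shape, "Test:", Xd_test.shape)
print("Test class counts:", np.bincount(yd_test))
```

In the notebook itself, the equivalent change would be adding `stratify=y` to the existing `train_test_split` call.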


## **Step 5: Train Logistic Regression**
Fit a logistic regression classifier on the training set. Calling `LogisticRegression()` with no arguments uses scikit-learn’s defaults: an L2 penalty with `C=1.0` and the `lbfgs` solver.

In [14]:
from sklearn.linear_model import LogisticRegression

# Initialize the logistic regression model
clf = LogisticRegression()

# Train on embedding features
clf.fit(X_train, y_train)
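
For a binary problem, a fitted `LogisticRegression` stores one weight per input feature in `coef_` plus a single `intercept_`, and its positive-class probability is the sigmoid of the linear score, $P(y=1 \mid x) = \sigma(w^\top x + b)$. The sketch below checks this on synthetic data from `make_classification`; `X_demo`, `y_demo`, and `demo_clf` are placeholders standing in for the real embeddings and classifier.

```python
import numpy as np
from scipy.special import expit  # numerically stable sigmoid
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the embedding matrix (assumption: 384 features, 2 classes).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=384, n_informative=20, random_state=0
)
demo_clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# One weight per feature, plus a single bias term.
print("coef_ shape:", demo_clf.coef_.shape)            # (1, 384)
print("intercept_ shape:", demo_clf.intercept_.shape)  # (1,)

# Recompute the positive-class probability by hand: sigmoid(w·x + b).
scores = X_demo @ demo_clf.coef_.ravel() + demo_clf.intercept_[0]
manual_proba = expit(scores)
sk_proba = demo_clf.predict_proba(X_demo)[:, 1]
print("Max difference:", np.abs(manual_proba - sk_proba).max())
```

On the real model, `clf.coef_` should likewise have one column per embedding dimension.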

## **Step 6: Evaluate Performance**
We’ll predict on the test set, compute overall accuracy, plot the confusion matrix, and print a detailed classification report (precision, recall, F1-score).
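
One way these steps might look is sketched below. To stay self-contained it trains on synthetic data from `make_classification`; the `X_demo`, `yd_*`, and `demo_clf` names are placeholders, and in the notebook you would use `clf`, `X_test`, and `y_test` instead. The `target_names` order assumes labels 0 and 1 map to "rotten" and "fresh", as in Step 2.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for (X_embs, y); swap in the real arrays in the notebook.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=50, n_informative=10, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
demo_clf = LogisticRegression(max_iter=1000).fit(Xd_train, yd_train)

# Predict hard labels for the held-out set.
yd_pred = demo_clf.predict(Xd_test)

# Overall accuracy: fraction of test examples predicted correctly.
acc = accuracy_score(yd_test, yd_pred)
print(f"Accuracy: {acc:.3f}")

# Rows are true classes, columns are predicted classes (order: 0, 1).
cm = confusion_matrix(yd_test, yd_pred, labels=[0, 1])
print("Confusion matrix:")
print(cm)

# Per-class precision, recall, and F1-score.
print(classification_report(yd_test, yd_pred, target_names=["rotten", "fresh"]))
```

To draw the matrix rather than print it, `ConfusionMatrixDisplay.from_predictions(y_test, y_pred)` from `sklearn.metrics` is one option (it requires matplotlib).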

## **Step 7: Predict New Texts**
Now that we have a trained model, let’s see how it does on user-supplied reviews. We’ll:

1. Define a few custom review strings.  
2. Encode them into embeddings.  
3. Predict labels and class probabilities with our logistic regression classifier.  
4. Print out each review with its predicted sentiment and confidence.

In [None]:
# 7.1 Define some new reviews to classify as a list


# 7.2 Generate embeddings for these new texts
# (using the same embedder we initialized earlier)


# 7.3 Predict labels and probabilities


# 7.4 Display results
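
One possible way to fill in the four sub-steps is sketched below. The helper name `classify_reviews`, its `class_names` parameter, and the example reviews are all made up for illustration. It assumes `embedder` exposes `.encode(...)` as in Step 3 and that the classifier's labels are 0 ("rotten") and 1 ("fresh").

```python
import numpy as np


def classify_reviews(reviews, embedder, clf, class_names=("rotten", "fresh")):
    """Return (text, predicted sentiment, confidence) for each review.

    `embedder` is any object with an `.encode(list_of_str, convert_to_numpy=True)`
    method; `clf` is a fitted classifier with `predict_proba` and 0/1 labels.
    """
    # 7.2 Generate embeddings with the same encoder used for training.
    embs = embedder.encode(list(reviews), convert_to_numpy=True)

    # 7.3 Predict class probabilities; columns follow the order of clf.classes_.
    probs = clf.predict_proba(embs)
    best = np.argmax(probs, axis=1)
    labels = clf.classes_[best]

    # 7.4 Pair each review with its predicted sentiment and confidence.
    return [
        (text, class_names[int(label)], float(row[idx]))
        for text, label, row, idx in zip(reviews, labels, probs, best)
    ]


# 7.1 Example reviews (hypothetical strings, not from the dataset).
new_reviews = [
    "A heartfelt, beautifully acted film that stays with you.",
    "Two hours of my life I will never get back.",
]

# In the notebook, with `embedder` and `clf` already defined:
# for text, sentiment, confidence in classify_reviews(new_reviews, embedder, clf):
#     print(f"{sentiment:>6} ({confidence:.2f})  {text}")
```

The reported confidence is the probability of the predicted class, so for a binary classifier it always falls between 0.5 and 1.0.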
