# Sentiment analysis with ML baselines (POLITISKY24)

This notebook does two things:

1) **Create sentiment labels** for a large set of posts using a ready-made Transformer model (DistilBERT fine-tuned on SST-2).
2) **Train a few classic ML baselines** on those labels (TF‑IDF + Logistic Regression / Naive Bayes / Linear SVM), so we have a solid reference point before trying anything more advanced.


In [5]:
%pip install --upgrade "torch>=2.6.0" --no-cache-dir

ERROR: Could not find a version that satisfies the requirement torch>=2.6.0 (from versions: 2.0.0, 2.0.1, 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1, 2.2.2)
ERROR: No matching distribution found for torch>=2.6.0
Note: you may need to restart the kernel to use updated packages.


In [6]:
%pip install --upgrade "transformers" "safetensors" "torch"


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.1.2[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49m/usr/local/opt/python@3.11/bin/python3.11 -m pip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [25]:
import numpy as np
import pandas as pd
from tqdm import tqdm
from datasets import Dataset
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline, AutoModel
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import make_scorer, f1_score

In [8]:
PARQUET_PATH = "user_post_history_dataset.parquet"  
OUTPUT_PARQUET = "user_post_history_with_sentiment.parquet"
OUTPUT_CSV = "user_post_history_with_sentiment.csv"

In [None]:
df = pd.read_parquet(PARQUET_PATH)

## What’s in the dataset?

The dataset contains one row per post, with both text and conversation/reshare metadata. The columns we see here include:

- `PostId`, `UserId`, `PostTime`: identifiers and timestamps
- `Content`: the main post text we’ll use for sentiment
- `Hashtags`, `Mentions`, `Languages`: extracted metadata (note: `Languages` is stored as a list per post, e.g. `['en']`)
- Conversation structure: `IsReply`, `ParentId`, `ParentUserId`
- Sharing structure: `IsRepost`, `SourceUserId`
- Quote structure: `IsQuote`, `QuoteId`, `QuoteUserId`, `QuoteContent`

This schema matters later because some analyses (or filters) may treat replies/quotes differently from original posts.


In [8]:
df.columns

Index(['PostId', 'UserId', 'PostTime', 'Content', 'Hashtags', 'Languages',
       'Mentions', 'IsRepost', 'SourceUserId', 'IsQuote', 'QuoteId',
       'QuoteUserId', 'QuoteContent', 'IsReply', 'ParentId', 'ParentUserId'],
      dtype='object')
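
As a minimal sketch of the kind of filter this schema allows, the snippet below keeps only original posts (no replies, quotes, or reposts). It runs on a tiny hand-made frame; the name `originals` and the choice to exclude all three post types are assumptions for illustration, not a step this notebook performs.

```python
import pandas as pd

# Toy frame with the same flag columns as the dataset (values are made up).
toy = pd.DataFrame({
    "PostId": [1, 2, 3, 4],
    "IsReply": [False, True, False, False],
    "IsQuote": [False, False, True, False],
    "IsRepost": [False, False, False, False],
})

# Keep rows where none of the three flags is set.
is_original = ~(toy["IsReply"] | toy["IsQuote"] | toy["IsRepost"])
originals = toy.loc[is_original]

print(originals["PostId"].tolist())
```

The same boolean expression should work on the real `df`, provided the three flag columns are stored as booleans there.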

## Quick sample of rows

Here we print the first few rows to confirm that:

- `Content` looks like real post text (not empty / not only metadata),
- `Languages` and other list fields have the expected format,
- timestamps (`PostTime`) are stored as ISO 8601 strings, which we can convert to datetimes later if needed.


In [9]:
df.head()

,PostId,UserId,PostTime,Content,Hashtags,Languages,Mentions,IsRepost,SourceUserId,IsQuote,QuoteId,QuoteUserId,QuoteContent,IsReply,ParentId,ParentUserId
0,7840946,167406,2024-11-19T23:12:12.092Z,it’s a tragedy of the commons. so. anyone and ...,[],[en],[],False,167406,False,-1,-1,,True,15792425,345622
1,17079336,377198,2024-11-19T23:12:11Z,"Notícia da @anonymous\n\n""Xi Jinping arrives i...",[],[],[],False,377198,False,-1,-1,,False,-1,-1
2,9179962,196837,2024-11-19T23:12:11.298Z,Have you run it by the Godless Democratic Part...,[],[en],[],False,196837,False,-1,-1,,True,15458353,336741
3,2108430,45089,2024-11-19T23:12:11.084Z,This! This is the energy we need to be bringin...,[],[en],[],False,45089,True,2109954,45144,WATCH: “You’re a baby shit.”\n\nRep. Andy Ogle...,False,-1,-1
4,4231604,90172,2024-11-19T23:12:10Z,"WATCH: Trending Stories: Gas Prices, Sheetz Ga...",[],[],[],False,90172,False,-1,-1,,False,-1,-1
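
The sample rows show `PostTime` both with and without fractional seconds. A minimal sketch of converting such mixed strings to datetimes follows; it assumes pandas 2.0 or newer, where `pd.to_datetime` accepts `format="ISO8601"`. The variable names are illustrative only.

```python
import pandas as pd

# Two timestamps copied from the sample rows: one with milliseconds, one without.
post_time = pd.Series(["2024-11-19T23:12:12.092Z", "2024-11-19T23:12:11Z"])

# format="ISO8601" (pandas >= 2.0) accepts ISO 8601 strings of differing precision;
# utc=True returns timezone-aware UTC timestamps.
parsed = pd.to_datetime(post_time, format="ISO8601", utc=True)

print(parsed.dt.tz)
print((parsed[0] - parsed[1]).total_seconds())
```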


# Optional: filter to English posts

This cell builds a boolean mask from the `Languages` list and selects only posts tagged as English (exactly `['en']` in this dataset).  
It’s useful if we want to keep sentiment labeling consistent, because the sentiment model used below is trained for English.


In [39]:
# Keep posts whose `Languages` list is exactly ['en'].
is_english = df["Languages"].apply(lambda langs: len(langs) == 1 and langs[0] == "en")
df_only_english = df.loc[is_english]

# Create the sentiment labeler (Transformer)

We use a pretrained HuggingFace model:

- `distilbert-base-uncased-finetuned-sst-2-english`

For each text it outputs:
- a **label** (`POSITIVE` / `NEGATIVE`)
- a **score** (model confidence for that label)

This gives us pseudo‑labels quickly, without manual annotation, but we should remember they are *model-generated* labels, not ground truth.


In [3]:
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

device = 0 if torch.cuda.is_available() else -1
sentiment_pipeline = pipeline(
    "sentiment-analysis",
    model=model,
    tokenizer=tokenizer,
    device=device
)

Device set to use cpu


In [1]:
TEXT_COLUMN = 'Content'

# Run sentiment labeling in batches

We take the `Content` column as plain strings and run the sentiment pipeline in batches.  
Batching is important here: it reduces overhead and is the only practical way to label tens of thousands of posts in a reasonable time.


In [None]:
texts = df[TEXT_COLUMN].astype(str).tolist()
labels = []
scores = []
batch_size = 32
print("Running sentiment model...")
for i in tqdm(range(0, len(texts), batch_size)):
    batch_texts = texts[i:i + batch_size]
    outputs = sentiment_pipeline(
        batch_texts,
        batch_size=batch_size,
        truncation=True,
        max_length=128,
    )
    for out in outputs:
        labels.append(out["label"])
        scores.append(float(out["score"]))

df["sent_label"] = labels  
df["sent_score"] = scores     
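
Because `sent_score` is the model's confidence in its own label, one optional way to reduce pseudo-label noise is to keep only high-confidence rows. The sketch below is not part of this notebook's pipeline: the threshold `MIN_CONFIDENCE = 0.9` is a hypothetical value and the toy rows are made up.

```python
import pandas as pd

# Toy pseudo-labels shaped like the columns created above (values are made up).
toy = pd.DataFrame({
    "Content": ["great news", "hmm", "awful take", "ok I guess"],
    "sent_label": ["POSITIVE", "NEGATIVE", "NEGATIVE", "POSITIVE"],
    "sent_score": [0.998, 0.61, 0.97, 0.55],
})

MIN_CONFIDENCE = 0.9  # hypothetical threshold, not used elsewhere in this notebook

# Keep only rows where the model was confident in its label.
confident = toy[toy["sent_score"] >= MIN_CONFIDENCE]
print(len(confident), "of", len(toy), "rows kept")
```

Filtering this way trades sample size for cleaner labels, and it can shift the class balance, so the class counts are worth rechecking afterwards.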

# Label a manageable sample

Instead of labeling the entire dataset immediately, we sample **50,000 posts**.  
That keeps runtime and storage reasonable while still giving enough data to train and compare baseline classifiers.


In [None]:
NUM_SAMPLES = 50_000
# Keep only the text column. It is named `Content` (TEXT_COLUMN); there is no `text` column.
df_sample = df.sample(n=NUM_SAMPLES, random_state=42)[[TEXT_COLUMN]].reset_index(drop=True)
dataset = Dataset.from_pandas(df_sample)

## Batch prediction function

With `batched=True`, `Dataset.map` passes each batch as a dict of column lists and expects a dict back, including any new columns.  
This function attaches two new fields:

- `sent_label`: `POSITIVE` / `NEGATIVE`
- `sent_score`: confidence score (float)


In [None]:
def predict_batch(batch):
    outputs = sentiment_pipeline(
        batch[TEXT_COLUMN],
        truncation=True,
        max_length=64,
    )
    batch["sent_label"] = [o["label"] for o in outputs]
    batch["sent_score"] = [float(o["score"]) for o in outputs]
    return batch

In [None]:
# Run only if the labeled CSV (OUTPUT_CSV) does not exist yet
# dataset = dataset.map(
#     predict_batch,
#     batched=True,
#     batch_size=64,
#     num_proc=4,     
# )
# df_labeled = dataset.to_pandas()
# df_labeled.to_csv(OUTPUT_CSV, index=False)

# Load labeled dataset

Here we load the saved labeled file back into a DataFrame so we can train classifiers.  
From this point on, everything is standard supervised learning using the Transformer’s `POSITIVE/NEGATIVE` as labels.


In [9]:
df_labeled = pd.read_csv(OUTPUT_CSV)

# Prepare features and train/test split

We map sentiment strings to integers (`NEGATIVE → 0`, `POSITIVE → 1`), drop any unexpected labels, and then split into train/test:

- **80/20 split**
- **stratified**, so the class proportions stay comparable in both splits

This is important because the labeled sample is imbalanced: about 66% of posts are labeled negative and 34% positive (33,068 vs 16,932 out of 50,000).


In [10]:
label_map = {"NEGATIVE": 0, "POSITIVE": 1}
# .copy() avoids pandas' SettingWithCopyWarning when the new column is added below.
df_labeled = df_labeled[df_labeled["sent_label"].isin(label_map)].copy()
df_labeled["label"] = df_labeled["sent_label"].map(label_map)
X_text = df_labeled[TEXT_COLUMN].astype(str).values
y = df_labeled["label"].values
X_train_text, X_test_text, y_train, y_test = train_test_split(
    X_text, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
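
To illustrate what `stratify=y` does, here is a small self-contained sketch on synthetic labels with roughly the same imbalance as this notebook's class counts (about 66% negative). The arrays `X_toy` and `y_toy` are placeholders, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 661 negatives (0) and 339 positives (1).
y_toy = np.array([0] * 661 + [1] * 339)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder features

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratify=..., the positive share is nearly identical in both splits.
print(round(float(y_tr.mean()), 3), round(float(y_te.mean()), 3))
```

Without `stratify`, the two shares can drift apart by chance, which makes test metrics harder to compare with cross-validation results on the training split.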

# TF‑IDF text features

Before training classic ML models, we convert raw text into a sparse numeric representation using **TF‑IDF**.

In this configuration we use:
- up to **50,000** features,
- **unigrams + bigrams** (`ngram_range=(1,2)`),
- a minimum document frequency of **5** (`min_df=5`) to drop very rare tokens.

This gives a strong and interpretable baseline feature space for linear models.


In [11]:
tfidf = TfidfVectorizer(
    max_features=50_000,
    ngram_range=(1, 2),
    min_df=5
)
X_train_tfidf = tfidf.fit_transform(X_train_text)
X_test_tfidf = tfidf.transform(X_test_text)
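
To make the effect of `ngram_range=(1, 2)` concrete, the sketch below fits the same vectorizer class on a three-document toy corpus. It uses `min_df=1` only because the corpus is tiny; the documents are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["not good at all", "really good", "not bad"]

# min_df=1 here only because the toy corpus is tiny; the notebook uses min_df=5.
vec = TfidfVectorizer(ngram_range=(1, 2), min_df=1)
X = vec.fit_transform(docs)

# The vocabulary holds both single words and two-word phrases.
print(sorted(vec.vocabulary_))
print(X.shape)
```

Bigrams such as `not good` let a linear model assign a weight to the negated phrase separately from `good`, which matters for sentiment.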

## Class balance check (counts)

These next two cells simply count how many labeled examples we have in each sentiment class in the labeled sample.  
It’s a quick way to confirm whether the dataset is skewed, which directly affects how we interpret accuracy vs macro metrics.


In [13]:
len(df_labeled.loc[df_labeled["sent_label"] == "NEGATIVE"])

33068

In [14]:
len(df_labeled.loc[df_labeled["sent_label"] == "POSITIVE"])

16932

# Baseline 1: Logistic Regression on TF‑IDF

With the TF‑IDF features in place, the next cell trains **Logistic Regression** as a strong baseline for text classification.

On the held‑out test split (10,000 posts), the model reaches **accuracy = 0.787** and **macro F1 = 0.730**.  
What stands out is the class asymmetry: the model does very well on class **0 (negative)** (recall **0.944**, F1 **0.855**), but it misses many class **1 (positive)** posts (recall **0.482**, F1 **0.605**). That gap is why we report **macro** metrics (they don’t let the majority class hide the problem).


In [15]:
clf_tfidf = LogisticRegression(
    max_iter=1000,
    n_jobs=-1,
)

clf_tfidf.fit(X_train_tfidf, y_train)
y_pred_tfidf = clf_tfidf.predict(X_test_tfidf)

print("=== TF-IDF + Logistic Regression ===")
print(classification_report(y_test, y_pred_tfidf, digits=3))

=== TF-IDF + Logistic Regression ===
              precision    recall  f1-score   support

           0      0.781     0.944     0.855      6614
           1      0.815     0.482     0.605      3386

    accuracy                          0.787     10000
   macro avg      0.798     0.713     0.730     10000
weighted avg      0.792     0.787     0.770     10000
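
The difference between the macro and weighted averages in the report can be reproduced by hand. The sketch below uses only the per-class F1 scores and supports printed above.

```python
# Per-class F1 and support copied from the classification report above.
f1 = {0: 0.855, 1: 0.605}
support = {0: 6614, 1: 3386}

# Macro average: plain mean over classes, so each class counts equally.
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: each class is weighted by its number of test examples.
total = sum(support.values())
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total

print(f"macro F1    = {macro_f1:.3f}")
print(f"weighted F1 = {weighted_f1:.3f}")
```

The weighted figure is higher because the well-handled negative class makes up about two thirds of the test set; the macro figure exposes the weaker positive class.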



# Cross‑validation for Logistic Regression (macro F1)

The next cell performs a small grid search over:
- regularization type (`l1` vs `l2`)
- regularization strength (`C`)

We use **5‑fold stratified CV** and score with **macro F1**, which penalizes models that do well only on the majority class.


In [16]:
X_cv = X_train_tfidf  
y_cv = y_train
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scorer = make_scorer(f1_score, average="macro")
C_values = [0.01, 0.1, 1.0, 10.0]
penalties = ["l1", "l2"]
results = []
for penalty in penalties:
    for C in C_values:
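        clf = LogisticRegression(
            penalty=penalty,
            C=C,
            solver="liblinear",  # supports l1 and l2
            max_iter=1000,
            # n_jobs is omitted: it has no effect with liblinear (and triggers a warning);
            # the folds are parallelized by cross_val_score below.
        )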

        scores = cross_val_score(
            clf,
            X_cv, y_cv,
            cv=cv,
            scoring=scorer,
            n_jobs=-1,
        )

        results.append({
            "penalty": penalty,
            "C": C,
            "mean_f1_macro": scores.mean(),
            "std_f1_macro": scores.std(),
        })
        print(f"penalty={penalty}, C={C} -> F1_macro={scores.mean():.4f} ± {scores.std():.4f}")

res_df = pd.DataFrame(results).sort_values("mean_f1_macro", ascending=False)
print("\nSorted CV results:")
print(res_df.head())



penalty=l1, C=0.01 -> F1_macro=0.4890 ± 0.0036
penalty=l1, C=0.1 -> F1_macro=0.5830 ± 0.0017
penalty=l1, C=1.0 -> F1_macro=0.7187 ± 0.0026
penalty=l1, C=10.0 -> F1_macro=0.7320 ± 0.0046
penalty=l2, C=0.01 -> F1_macro=0.4890 ± 0.0036
penalty=l2, C=0.1 -> F1_macro=0.5787 ± 0.0033
penalty=l2, C=1.0 -> F1_macro=0.7199 ± 0.0036
penalty=l2, C=10.0 -> F1_macro=0.7443 ± 0.0067

Sorted CV results:
  penalty     C  mean_f1_macro  std_f1_macro
7      l2  10.0       0.744256      0.006697
3      l1  10.0       0.732019      0.004579
6      l2   1.0       0.719947      0.003589
2      l1   1.0       0.718741      0.002623
1      l1   0.1       0.583039      0.001682


The cell above tunes Logistic Regression with **5‑fold stratified cross‑validation** on the training set, scoring with **macro F1** (so both classes matter equally).

The full table below lists the mean and standard deviation across folds for every setting. The best setting is **L2 penalty with C = 10**, reaching **mean macro F1 ≈ 0.744 ± 0.007**, which is clearly better than stronger regularization (smaller C) and slightly better than L1 at the same C (0.732).


In [17]:
res_df

  penalty      C  mean_f1_macro  std_f1_macro
7      l2   10.0       0.744256      0.006697
3      l1   10.0       0.732019      0.004579
6      l2    1.0       0.719947      0.003589
2      l1    1.0       0.718741      0.002623
1      l1    0.1       0.583039      0.001682
5      l2    0.1       0.578729      0.003267
0      l1   0.01       0.488955      0.003594
4      l2   0.01       0.488955      0.003594
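
The grid above runs on a TF‑IDF matrix that was fitted once on the whole training split, so each validation fold has already influenced the vocabulary and IDF weights. A possible variant, sketched below under the assumption that refitting per fold is acceptable cost-wise, wraps the vectorizer and classifier in a `Pipeline` and searches with `GridSearchCV`. The toy corpus, the step names `tfidf` / `clf`, and the reduced grid over `C` are illustrative choices, not the notebook's setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Tiny made-up corpus so the sketch runs quickly; swap in X_train_text / y_train.
texts = ["good", "great", "love it", "nice work", "so good", "really great",
         "bad", "awful", "hate it", "poor work", "so bad", "really awful"]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),  # refit on each training fold
    ("clf", LogisticRegression(solver="liblinear", max_iter=1000)),
])

search = GridSearchCV(
    pipe,
    param_grid={"clf__C": [0.1, 1.0, 10.0]},
    scoring="f1_macro",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
)
search.fit(texts, labels)

print(search.best_params_)
```

On the real data the scores may differ slightly from the table above, since the vectorizer no longer sees the validation fold during fitting.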


# DistilBERT embeddings

TF‑IDF is sparse and purely lexical. As an alternative, we can represent each post using **dense Transformer embeddings**.

The next two cells:
1) define a helper that encodes texts with DistilBERT and extracts the `[CLS]` vector (one embedding per post),
2) compute embeddings for train and test splits and print their shapes.

This sets up a feature matrix that can later be fed into standard classifiers (e.g., Logistic Regression) or used for clustering.


In [None]:
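def encode_texts(texts, batch_size=32, max_length=128):
    """Return one [CLS] embedding per text as a (n_texts, hidden_size) array.

    Uses the global `tokenizer`, `encoder`, and `device` defined in the next cell.
    """
    all_embeddings = []

    # Inference only: disable gradient tracking to save memory and time.
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            batch = list(texts[i:i+batch_size])

            inputs = tokenizer(
                batch,
                padding=True,
                truncation=True,
                max_length=max_length,
                return_tensors="pt"
            )
            # Move input tensors to the same device as the encoder.
            inputs = {k: v.to(device) for k, v in inputs.items()}
            outputs = encoder(**inputs)
            # Position 0 of each sequence is the [CLS] token.
            cls_embeddings = outputs.last_hidden_state[:, 0, :]
            all_embeddings.append(cls_embeddings.cpu().numpy())

    return np.vstack(all_embeddings)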

In [None]:
model_name = "distilbert-base-uncased"  # base encoder, with safetensors on HF
tokenizer = AutoTokenizer.from_pretrained(model_name)
encoder = AutoModel.from_pretrained(model_name, use_safetensors=True)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
encoder.to(device)
encoder.eval()
print("Encoding train texts...")
X_train_emb = encode_texts(X_train_text)
print("Encoding test texts...")
X_test_emb = encode_texts(X_test_text)
print("Train embeddings shape:", X_train_emb.shape)
print("Test  embeddings shape:", X_test_emb.shape)
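
The `[CLS]` extraction in `encode_texts` is a single indexing step on a 3‑D tensor. The sketch below shows it on a small NumPy array standing in for `outputs.last_hidden_state`; the shape (2 texts, 3 tokens, hidden size 4) and the values are arbitrary, whereas the real DistilBERT base hidden size is 768.

```python
import numpy as np

# Stand-in for `outputs.last_hidden_state`: (batch, tokens, hidden) = (2, 3, 4).
last_hidden_state = np.arange(24, dtype=np.float32).reshape(2, 3, 4)

# Position 0 of every sequence is the [CLS] token, so this keeps one vector per text.
cls_embeddings = last_hidden_state[:, 0, :]

print(cls_embeddings.shape)
print(cls_embeddings)
```

Stacking these per-batch arrays with `np.vstack`, as the helper does, yields a matrix with one row per post.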

# Baseline 2: Multinomial Naive Bayes on TF‑IDF

Now we try **Multinomial Naive Bayes** on the same TF‑IDF features. It is a common lightweight baseline, useful for checking whether a simple probabilistic model collapses toward the majority class.

The next cell trains it and prints a full classification report on the same test split. The collapse shows up clearly: overall **accuracy = 0.760**, but macro F1 drops to **0.663**. Negative recall is **0.980**, while positive recall is only **0.330**, meaning about two thirds of positive posts are pulled into the negative bucket.


In [23]:
nb_clf = MultinomialNB()
nb_clf.fit(X_train_tfidf, y_train)
y_pred_nb = nb_clf.predict(X_test_tfidf)

print("=== Multinomial Naive Bayes on TF-IDF ===")
print(classification_report(y_test, y_pred_nb, digits=3))

=== Multinomial Naive Bayes on TF-IDF ===
              precision    recall  f1-score   support

           0      0.741     0.980     0.844      6614
           1      0.893     0.330     0.482      3386

    accuracy                          0.760     10000
   macro avg      0.817     0.655     0.663     10000
weighted avg      0.792     0.760     0.721     10000



# Baseline 3: Linear SVM on TF‑IDF

Finally, we test **LinearSVC** (a linear Support Vector Machine). Linear SVMs are usually very competitive on high-dimensional sparse TF‑IDF text, so this is an important reference baseline.

The next cell trains it on the TF‑IDF vectors and evaluates it on the same 10,000‑post test set. It performs best among the TF‑IDF baselines on macro F1 (**0.753**), while accuracy (**0.788**) is essentially tied with Logistic Regression (0.787). The gain comes from better class balance: positive recall rises to **0.605** (vs **0.482** for Logistic Regression), at the cost of lower negative recall (**0.882** vs **0.944**).


In [26]:
svm_clf = LinearSVC()
svm_clf.fit(X_train_tfidf, y_train)
y_pred_svm = svm_clf.predict(X_test_tfidf)

print("=== LinearSVC on TF-IDF ===")
print(classification_report(y_test, y_pred_svm, digits=3))

=== LinearSVC on TF-IDF ===
              precision    recall  f1-score   support

           0      0.814     0.882     0.846      6614
           1      0.724     0.605     0.659      3386

    accuracy                          0.788     10000
   macro avg      0.769     0.744     0.753     10000
weighted avg      0.783     0.788     0.783     10000
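
For a side-by-side view, the test-set metrics from the three classification reports can be collected into one small table. The sketch below copies the numbers printed above; the column names are illustrative.

```python
import pandas as pd

# Test-set metrics copied from the three classification reports above.
summary = pd.DataFrame(
    [
        {"model": "LogisticRegression", "accuracy": 0.787, "macro_f1": 0.730, "recall_pos": 0.482},
        {"model": "MultinomialNB", "accuracy": 0.760, "macro_f1": 0.663, "recall_pos": 0.330},
        {"model": "LinearSVC", "accuracy": 0.788, "macro_f1": 0.753, "recall_pos": 0.605},
    ]
).sort_values("macro_f1", ascending=False).reset_index(drop=True)

print(summary.to_string(index=False))
```

Sorting by macro F1 rather than accuracy keeps the ranking sensitive to the minority (positive) class.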

