# Emotion Analysis: Baseline vs Hyperparameter-Optimized Classifier

This notebook shows an end-to-end example of emotion analysis on a subset of the **GoEmotions** dataset using:

1. A **baseline classifier**
2. A **hyperparameter-optimized classifier**
3. A **comparison** of their performance.

My problem requires a cross-lingual evaluation. Therefore, I will train the model on an English emotion dataset and evaluate its generalization performance on Turkish emotion datasets.

I will use:

- GoEmotions as Training Dataset (English)

- Turkish Tweets as Evaluation Dataset (Turkish)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
import sys
!"{sys.executable}" -m pip install -U openpyxl

In [None]:
df = pd.read_excel("data/goemotions.xlsx")

## 1. Load and Inspect the Dataset

I will use a small subset of the GoEmotions dataset.

To simplify the problem and align emotion categories across the datasets, we restrict the dataset to five common emotions: anger, fear, joy, surprise, and sadness.

In [None]:
selected_emotions = ["anger", "fear", "joy", "surprise", "sadness"]

For simplicity, the multi-label GoEmotions data is reduced to single-label classification by keeping only the samples annotated with exactly one emotion, and only when that emotion is one of the five selected.

In [None]:
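import ast

def keep_single_selected(row):
    # label_names is stored as a string such as "['joy']"; parse it with
    # ast.literal_eval, which only accepts literals, instead of eval.
    labels = ast.literal_eval(row)
    # Keep rows with exactly one label that is among the selected emotions.
    return len(labels) == 1 and labels[0] in selected_emotions

df_filtered = df[df["label_names"].apply(keep_single_selected)].copy()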

In [None]:
import ast

label2id = {label: idx for idx, label in enumerate(selected_emotions)}
id2label = {idx: label for label, idx in label2id.items()}

# Every remaining row has exactly one label, so take it and map it to its ID.
df_filtered["label"] = df_filtered["label_names"].apply(
    lambda x: label2id[ast.literal_eval(x)[0]]
)

In [None]:
X = df_filtered["text"].tolist() #data
y = df_filtered["label"].tolist() #target
target_names = selected_emotions

print(f"Total documents: {len(X)}")
print("Classes:", target_names)

In [None]:
# Quick peek at a sample
for i in range(3):
    print("=" * 80)
    print(f"Document {i}, label={target_names[y[i]]}")
    print(X[i][:500], "...")

## Turkish label → English label mapping
To evaluate both datasets on the same categories, I map the Turkish labels to their English counterparts.

In [None]:
tr = pd.read_excel("data/TurkishTweets.xlsx")

In [None]:
tr_map = {
  "kızgın": "anger",
  "korku": "fear",
  "mutlu": "joy",
  "surpriz": "surprise",
  "üzgün": "sadness",
}

In [None]:
tr["label_en"] = tr["Etiket"].map(tr_map)
tr = tr[tr["label_en"].isin(target_names)].copy()
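
`Series.map` returns NaN for any label that is missing from `tr_map`, and the `isin` filter then drops those rows without warning, so it can be worth checking for unmapped labels first. The sketch below is only an illustration: it uses a plain list of made-up labels instead of the real `Etiket` column, and `find_unmapped` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

# Same mapping as in the notebook.
tr_map = {
    "kızgın": "anger",
    "korku": "fear",
    "mutlu": "joy",
    "surpriz": "surprise",
    "üzgün": "sadness",
}

def find_unmapped(labels, mapping):
    """Count labels that have no entry in the mapping (after trimming whitespace)."""
    cleaned = (str(x).strip() for x in labels)
    return Counter(label for label in cleaned if label not in mapping)

# Hypothetical label column, for illustration only.
sample_labels = ["mutlu", "korku", "Mutlu", "üzgün ", "sürpriz"]
unmapped = find_unmapped(sample_labels, tr_map)
print(unmapped)
```

On the real data the same check would be `find_unmapped(tr["Etiket"], tr_map)`, run before the filter. An empty result means no row was lost because of a spelling or casing variant.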

In [None]:
tr.head()

Convert the labels to numeric IDs for the model:

In [None]:
# The Turkish dataset is mapped to the same label space as the English dataset
# to enable cross-lingual evaluation without translation.
label2id = {l:i for i,l in enumerate(target_names)}
tr["label"] = tr["label_en"].map(label2id)

X_tr = tr["Tweet"].tolist()
y_tr = tr["label"].tolist()

print("TR documents:", len(X_tr))
print("TR classes:", sorted(tr["label_en"].unique()))

In [None]:
tr.head()

## 2. Train–Test Split

I split the English data into training and validation sets (80/20, stratified by label).

As stated at the beginning, the English dataset is used for training and the Turkish dataset serves as the test set.

In [None]:
df_filtered.head()

In [None]:
# English dataset (filtered 5 classes)
X_en = df_filtered["text"].astype(str).tolist()
y_en = df_filtered["label"].astype(int).tolist()

print("Total EN samples:", len(X_en))
print("Unique labels:", sorted(set(y_en)))
print("Label distribution:", df_filtered["label"].value_counts().sort_index().to_dict())

In [None]:
!pip install scikit-learn


In [None]:
from sklearn.model_selection import train_test_split

In [None]:
x_train_en, x_val_en, y_train_en, y_val_en = train_test_split(
    X_en, y_en,
    test_size=0.2,
    random_state=42,
    stratify=y_en
)

print(f"EN Train size: {len(x_train_en)}, EN Val size: {len(x_val_en)}")

In [None]:
print(tr.columns)

## 3. Baseline Classifiers

In this project, we distinguish between two types of baselines.

### Classical Baseline (Sanity Check)
We first implement a classical TF-IDF + Logistic Regression model as a sanity check.
This model represents a traditional bag-of-words approach to text classification and is evaluated only on the English validation split.

However, since TF-IDF features are language-dependent, this baseline is not suitable for cross-lingual transfer.
Therefore, it is **not** used as the main baseline for hyperparameter optimization or cross-lingual evaluation.
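
The language dependence can be illustrated with a small vocabulary-coverage check. This is a rough sketch under stated assumptions: the sentences are invented toy examples (not taken from either dataset), and `tokens` only approximates the default tokenization of `TfidfVectorizer` (lowercased words of two or more characters).

```python
import re

def tokens(text):
    """Lowercased word tokens of length >= 2, a rough stand-in for TfidfVectorizer."""
    return re.findall(r"\w\w+", text.lower())

# Toy sentences, invented for illustration.
english_train = [
    "I am so happy today",
    "this makes me really angry",
    "I was scared of the dark",
]
turkish_test = ["bugün çok mutluyum", "buna gerçekten çok kızgınım"]

# Vocabulary a bag-of-words model would learn from the English training text.
vocab = {tok for doc in english_train for tok in tokens(doc)}

def coverage(doc, vocab):
    """Fraction of a document's tokens that the fitted vocabulary knows."""
    toks = tokens(doc)
    return sum(t in vocab for t in toks) / len(toks) if toks else 0.0

for doc in turkish_test:
    print(f"{coverage(doc, vocab):.2f}  {doc}")
```

When none of a tweet's tokens are in the vocabulary, its TF-IDF vector is all zeros, so the logistic regression can only fall back on its intercept and gives every such tweet the same prediction.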

### Main Baseline (Cross-lingual)
As the primary baseline, we use a multilingual transformer model (XLM-R / mBERT) fine-tuned on English emotion data.
This model is evaluated on Turkish data without translation to directly assess cross-lingual generalization.

Hyperparameter optimization is conducted only on this multilingual transformer baseline, as it is inherently capable of handling multilingual inputs.

### 3.1 Classical Baseline (TF-IDF + Logistic Regression)

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

In [None]:
tfidf_baseline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("logreg", LogisticRegression(max_iter=1000, random_state=42))
])

tfidf_baseline.fit(x_train_en, y_train_en)

y_pred_en = tfidf_baseline.predict(x_val_en)
acc_en = accuracy_score(y_val_en, y_pred_en)

print(f"TF-IDF + Logistic Regression (EN Val) Accuracy: {acc_en:.4f}\n")
print(classification_report(y_val_en, y_pred_en, target_names=target_names))

The Turkish dataset contains an empty (missing) tweet. I remove the rows with missing text and then evaluate the baseline on the Turkish data:

In [None]:
tr_clean = tr.dropna(subset=["Tweet"]).copy()
tr_clean["Tweet"] = tr_clean["Tweet"].astype(str)

X_tr = tr_clean["Tweet"].tolist()
y_tr = tr_clean["label"].astype(int).tolist()

print("TR documents (clean):", len(X_tr))
print("Missing tweets removed:", tr["Tweet"].isna().sum())

In [None]:
y_pred_tr = tfidf_baseline.predict(X_tr)
print(classification_report(y_tr, y_pred_tr, target_names=target_names))

### 3.2 Main Baseline – Multilingual Transformer (XLM-R / mBERT)

This will be the main baseline. HPO will be done on this baseline.

In [None]:
!pip install transformers

### Model and Tokenizer Selection:

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [None]:
model_ckpt = "xlm-roberta-base"   # or "bert-base-multilingual-cased"

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

model = AutoModelForSequenceClassification.from_pretrained(
    model_ckpt,
    num_labels=len(target_names),
    id2label={i:l for i,l in enumerate(target_names)},
    label2id={l:i for i,l in enumerate(target_names)}
)

### Dataset Preparation (HuggingFace format)

In [None]:
!pip install datasets

In [None]:
import datasets
print(datasets.__file__)



In [None]:
# If importing the datasets library fails (as it did for me), try reinstalling it:
import sys
!"{sys.executable}" -m pip uninstall -y datasets
!"{sys.executable}" -m pip install -U datasets

In [None]:
from datasets import Dataset

In [None]:
train_ds = Dataset.from_dict({"text": x_train_en, "label": y_train_en})
val_ds   = Dataset.from_dict({"text": x_val_en,   "label": y_val_en})
tr_ds    = Dataset.from_dict({"text": X_tr,       "label": y_tr})

### Tokenization

In [None]:
def tokenize(batch):
    return tokenizer(
        batch["text"],
        truncation=True,
        padding="max_length",
        max_length=128
    )

train_ds = train_ds.map(tokenize, batched=True)
val_ds   = val_ds.map(tokenize, batched=True)
tr_ds    = tr_ds.map(tokenize, batched=True)

### Metrics (Accuracy + Macro F1)

In [None]:
import numpy as np
!pip install evaluate
import evaluate

In [None]:
acc_metric = evaluate.load("accuracy")
f1_metric  = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)

    acc = acc_metric.compute(predictions=preds, references=labels)["accuracy"]
    f1  = f1_metric.compute(
        predictions=preds,
        references=labels,
        average="macro"
    )["f1"]

    return {"accuracy": acc, "macro_f1": f1}

### TrainingArguments

#### XLM-R's performance on Turkish dataset with zero training

In [None]:
!pip install "accelerate>=0.26.0"

First, I want to see what the model can do with no training at all. So I evaluate the pretrained model, with its randomly initialized classification head, on the Turkish dataset without fine-tuning it on the English dataset.

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments

# 1) Fresh pretrained model (NO training) + random classification head
model_ckpt = "xlm-roberta-base"   
tokenizer_0 = AutoTokenizer.from_pretrained(model_ckpt)

model_0 = AutoModelForSequenceClassification.from_pretrained(
    model_ckpt,
    num_labels=len(target_names),
    id2label={i:l for i,l in enumerate(target_names)},
    label2id={l:i for i,l in enumerate(target_names)}
)

# 2) Evaluation-only Trainer
args_0 = TrainingArguments(
    output_dir="xlmr_no_finetune_eval",
    report_to="none",
    per_device_eval_batch_size=32
)

trainer_0 = Trainer(
    model=model_0,
    args=args_0,
    eval_dataset=tr_ds,
    tokenizer=tokenizer_0,
    compute_metrics=compute_metrics
)

no_ft_tr = trainer_0.evaluate()
print("TR (no fine-tuning / random head):", no_ft_tr)

Result:
- eval_accuracy: 0.2
- eval_macro_f1: 0.066
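
These two numbers are close to what a classifier produces when it assigns a single class to every input on a balanced five-class set. Whether the Turkish set is exactly balanced is an assumption here, not something verified above. A worked calculation in plain Python:

```python
def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1; a class with no true positives scores 0."""
    scores = []
    for c in range(n_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_classes

# Assumption: 5 perfectly balanced classes with 100 examples each.
y_true = [c for c in range(5) for _ in range(100)]
y_pred = [0] * len(y_true)  # an untrained head that always picks class 0

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
constant_macro_f1 = macro_f1(y_true, y_pred, 5)
print(accuracy, round(constant_macro_f1, 3))
```

The predicted class gets precision 0.2 and recall 1.0 (F1 = 1/3), the other four classes get F1 = 0, and the mean is 1/15, about 0.067. The accuracy matches the reported 0.2, and the Macro-F1 is close to the reported 0.066.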

Now we move on to the fine-tuned baseline.

In [None]:
from transformers import TrainingArguments, Trainer

In [None]:
training_args = TrainingArguments(
    output_dir="xlmr_baseline",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,                  # common fine-tuning value (the TrainingArguments default is 5e-5)
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=2,                  # baseline
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="macro_f1",
    seed=42,
    logging_steps=50
)

### Training

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer.train()

Results
- Epoch 1: acc 0.6658, macro-F1 0.5769, val loss 0.9337

- Epoch 2: acc 0.8388, macro-F1 0.8281, val loss 0.5045

- runtime: 344.94s (5.7 minutes)

The multilingual transformer baseline (XLM-R base) was fine-tuned on the English GoEmotions subset for 2 epochs. Validation performance improved substantially from epoch 1 to epoch 2 (Macro-F1: 0.577 → 0.828), which shows that training was still making clear progress in the second epoch under the baseline hyperparameters.

In [None]:
en_metrics = trainer.evaluate(val_ds)
print("EN Validation metrics:", en_metrics)


In [None]:
tr_metrics = trainer.evaluate(tr_ds)
print("TR Test metrics:", tr_metrics)

EN Validation (in-domain)

- Accuracy: 0.839

- Macro F1: 0.828

- Loss: 0.505

TR Test (cross-lingual, zero-shot)

- Accuracy: 0.639

- Macro F1: 0.630

- Loss: 1.034

In [None]:
df = pd.read_excel("data/en_tr_macro_f1.xlsx")
df

### Cross-lingual Evaluation Results

The XLM-R base model fine-tuned on the English GoEmotions dataset achieved strong in-domain performance on the English validation set (Accuracy = 0.84, Macro-F1 = 0.83).

When evaluated on the Turkish emotion dataset without language-specific fine-tuning, performance decreased as expected (Accuracy = 0.64, Macro-F1 = 0.63). Despite this drop, the multilingual model clearly outperformed the classical TF-IDF + Logistic Regression baseline, which failed to generalize across languages.

These results indicate that multilingual pretraining enables effective cross-lingual emotion transfer, although language-specific adaptation remains necessary to fully close the performance gap.
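
The size of the performance gap can be made concrete with a little arithmetic on the Macro-F1 values reported in this notebook (0.8281 on English validation, 0.6305 on Turkish). A minimal sketch:

```python
# Macro-F1 values reported in this notebook for the fine-tuned XLM-R baseline.
en_macro_f1 = 0.8281
tr_macro_f1 = 0.6305

absolute_gap = en_macro_f1 - tr_macro_f1
relative_drop = absolute_gap / en_macro_f1

print(f"absolute gap:  {absolute_gap:.3f}")
print(f"relative drop: {relative_drop:.1%}")
```

This comes to roughly 0.20 Macro-F1 in absolute terms, or about 24% of the English score. The two numbers come from different datasets and domains, so the gap mixes language transfer with dataset differences.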

#### Zero-shot transfer to Turkish

Although the model is fine-tuned exclusively on the English GoEmotions dataset, it retains the ability to generalize to Turkish emotion detection due to multilingual pretraining. This suggests that a single multilingual model can support emotion detection in multiple languages with acceptable performance, especially in low-resource settings.

## 4. Hyperparameter-Optimized Classifier

Hyperparameter optimization for transformer-based models, such as our model XLM-R, is computationally expensive.

As a result, optimization is limited to a small set of critical hyperparameters (e.g., learning rate and number of epochs). This controlled optimization ensures a fair comparison while remaining computationally practical.
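
To give a sense of the cost, the grid used in the next cell can be enumerated and its training time roughly estimated. The estimate is an assumption, not a measurement: it takes the baseline's 344.94 s for 2 epochs as the per-epoch rate and assumes the same hardware and linear scaling with the number of epochs.

```python
from itertools import product

lr_grid = [1e-5, 2e-5, 3e-5]
epoch_grid = [2, 3]

# Every (learning rate, epochs) combination is one full fine-tuning run.
grid = list(product(lr_grid, epoch_grid))
total_epochs = sum(epochs for _, epochs in grid)

# Assumption: same per-epoch cost as the baseline run (344.94 s for 2 epochs).
seconds_per_epoch = 344.94 / 2
estimated_minutes = total_epochs * seconds_per_epoch / 60

print(f"{len(grid)} runs, {total_epochs} epochs, ~{estimated_minutes:.0f} min")
```

Adding a third hyperparameter with three values would triple the number of runs, which is why the search is kept to learning rate and epochs.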

In [None]:
import os
import torch

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    set_seed
)

set_seed(42)

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

# HPO grid
lr_grid = [1e-5, 2e-5, 3e-5]
epoch_grid = [2, 3]

results = []

def build_model():
    return AutoModelForSequenceClassification.from_pretrained(
        model_ckpt,
        num_labels=len(target_names),
        id2label={i: l for i, l in enumerate(target_names)},
        label2id={l: i for i, l in enumerate(target_names)}
    )

# Use mixed precision (fp16) when a CUDA GPU is available, to reduce VRAM use
fp16 = torch.cuda.is_available()

for lr in lr_grid:
    for epochs in epoch_grid:
        run_name = f"xlmr_lr{lr}_ep{epochs}".replace(".", "p")
        out_dir = os.path.join("xlmr_hpo_runs", run_name)

        print("\n" + "="*80)
        print(f"RUN: lr={lr}, epochs={epochs} -> {out_dir}")
        print("="*80)

        model = build_model()

        args = TrainingArguments(
            output_dir=out_dir,
            eval_strategy="epoch",
            save_strategy="epoch",
            load_best_model_at_end=True,
            metric_for_best_model="macro_f1",
            greater_is_better=True,

            learning_rate=lr,
            num_train_epochs=epochs,
            per_device_train_batch_size=16,
            per_device_eval_batch_size=32,
            weight_decay=0.01,

            logging_steps=50,
            seed=42,
            report_to="none",
            fp16=fp16
        )

        trainer = Trainer(
            model=model,
            args=args,
            train_dataset=train_ds,
            eval_dataset=val_ds,
            tokenizer=tokenizer,          
            compute_metrics=compute_metrics
        )

        trainer.train()

        en_metrics = trainer.evaluate(val_ds)
        tr_metrics = trainer.evaluate(tr_ds)

        row = {
            "lr": lr,
            "epochs": epochs,
            "EN_acc": en_metrics.get("eval_accuracy"),
            "EN_macro_f1": en_metrics.get("eval_macro_f1"),
            "TR_acc": tr_metrics.get("eval_accuracy"),
            "TR_macro_f1": tr_metrics.get("eval_macro_f1"),
            "EN_loss": en_metrics.get("eval_loss"),
            "TR_loss": tr_metrics.get("eval_loss"),
            "output_dir": out_dir
        }
        results.append(row)

        # GPU RAM cleanup
        del trainer
        del model
        torch.cuda.empty_cache()

df_hpo = pd.DataFrame(results).sort_values(by="TR_macro_f1", ascending=False)
print("\n=== HPO RESULTS (sorted by TR_macro_f1) ===")
display(df_hpo)

best = df_hpo.iloc[0]
print("\nBEST RUN:")
print(best)


Unfortunately, an error occurred at the end of the HPO. I consider the results obtained up to that point sufficient to analyze overall trends and performance differences, and I did not execute further runs so that the project could be submitted on time.
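
One way to make a long search robust to a late crash is to write each run's metrics to disk as soon as the run finishes. The sketch below is a suggestion rather than what was done in this project: `append_result` is a hypothetical helper, the file name is arbitrary, and the demo rows contain placeholder values, not real results.

```python
import csv
import os
import tempfile

def append_result(path, row):
    """Append one run's metrics to a CSV file, writing the header on first use."""
    is_new = not os.path.exists(path)
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        if is_new:
            writer.writeheader()
        writer.writerow(row)

# Demo in a temporary directory, with placeholder metric values.
log_path = os.path.join(tempfile.mkdtemp(), "hpo_results.csv")
append_result(log_path, {"lr": 1e-5, "epochs": 2, "EN_macro_f1": None, "TR_macro_f1": None})
append_result(log_path, {"lr": 1e-5, "epochs": 3, "EN_macro_f1": None, "TR_macro_f1": None})

# Read the log back to confirm that both rows survived.
with open(log_path, newline="", encoding="utf-8") as f:
    saved = list(csv.DictReader(f))
print(len(saved), "runs saved")
```

In the HPO loop this would be a single call placed right after `results.append(row)`, so that every completed run is already on disk if a later step fails.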

## 5. Model Comparison

We compare the baseline and optimized models in terms of Macro-F1 and visualize the result.

The TR Macro-F1 of the hyperparameter-optimized XLM-R is an estimate rather than a measured value; it could not be measured because of the run-time error.

In [None]:
results_df = pd.DataFrame({
    "Model": [
        "XLM-R Baseline",
        "XLM-R Hyperparameter-Optimized"
    ],
    "EN Macro-F1": [
        0.8281,   # baseline EN
        0.8566    # HPO EN (best)
    ],
    "TR Macro-F1": [
        0.6305,   # baseline TR
        0.67      # HPO TR 
    ]
})

display(results_df)


In [None]:
fig, ax = plt.subplots()

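models = ["Baseline", "HPO"]
# Reuse the values from results_df so the table and the plot stay consistent.
# The HPO value is an estimate, not a measurement.
tr_scores = results_df["TR Macro-F1"].tolist()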

ax.bar(models, tr_scores)
ax.set_ylabel("Macro-F1")
ax.set_ylim(0.55, 0.7)
ax.set_title("Cross-lingual Performance on Turkish Test Set")

for i, v in enumerate(tr_scores):
    ax.text(i, v + 0.01, f"{v:.3f}", ha="center")

plt.show()


## 6. Confusion Matrix for the Best Model

This part was done after I had completed the HPO and shut down the notebook, so I had to reload the best run (in my case, xlmr_lr3e-05_ep3) from disk.
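
With `save_strategy="epoch"`, a run directory holds one checkpoint per epoch, and the latest one is not necessarily the best. The sketch below shows one possible way to pick the checkpoint that the `Trainer` recorded as best. It assumes that the latest checkpoint contains a `trainer_state.json` file with a `best_model_checkpoint` entry and that the stored path is still valid from the current working directory; `find_best_checkpoint` is a hypothetical helper, and the demo runs on a throwaway directory rather than on the real run.

```python
import glob
import json
import os
import tempfile

def find_best_checkpoint(run_dir):
    """Return the checkpoint recorded as best, else the latest one, else run_dir."""
    ckpts = sorted(
        glob.glob(os.path.join(run_dir, "checkpoint-*")),
        key=lambda p: int(p.rsplit("-", 1)[-1]),
    )
    if not ckpts:
        return run_dir
    state_file = os.path.join(ckpts[-1], "trainer_state.json")
    if os.path.exists(state_file):
        with open(state_file, encoding="utf-8") as f:
            best = json.load(f).get("best_model_checkpoint")
        # Only trust the recorded path if it still exists on disk.
        if best and os.path.isdir(best):
            return best
    return ckpts[-1]

# Demo on a temporary directory that mimics the Trainer's output layout.
demo_dir = tempfile.mkdtemp()
for step in (100, 200):
    os.makedirs(os.path.join(demo_dir, f"checkpoint-{step}"))
state_path = os.path.join(demo_dir, "checkpoint-200", "trainer_state.json")
with open(state_path, "w", encoding="utf-8") as f:
    json.dump({"best_model_checkpoint": os.path.join(demo_dir, "checkpoint-100")}, f)

chosen = find_best_checkpoint(demo_dir)
print(os.path.basename(chosen))
```

If the state file is missing or the recorded path no longer exists, the helper falls back to the latest checkpoint.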

In [None]:
import os, glob
import numpy as np
import pandas as pd

from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from sklearn.metrics import confusion_matrix, classification_report


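# Load the checkpoint from disk
best_run_dir = os.path.join("xlmr_hpo_runs", "xlmr_lr3e-05_ep3")  # best run in my case

# One checkpoint is saved per epoch; sort them by step number.
ckpts = sorted(
    glob.glob(os.path.join(best_run_dir, "checkpoint-*")),
    key=lambda p: int(p.split("-")[-1])
)
# Note: this takes the latest checkpoint, which is not necessarily the best epoch.
best_ckpt = ckpts[-1] if ckpts else best_run_dir
print("Loading from:", best_ckpt)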


# Load Model + tokenizer
tokenizer = AutoTokenizer.from_pretrained(best_ckpt, use_fast=True)
model = AutoModelForSequenceClassification.from_pretrained(best_ckpt)


# Trainer (only for inference/eval)
eval_args = TrainingArguments(
    output_dir="tmp_eval",
    per_device_eval_batch_size=32,
    report_to="none",
    do_train=False
)

trainer = Trainer(
    model=model,
    args=eval_args
)


In [None]:

# evaluation
pred = trainer.predict(tr_ds)
y_true = pred.label_ids
y_pred = np.argmax(pred.predictions, axis=1)

print("Labels:", target_names)

In [None]:

# Confusion Matrix + Per-class metrics

cm = confusion_matrix(y_true, y_pred)
cm_df = pd.DataFrame(cm, index=target_names, columns=target_names)
display(cm_df)

print("\nClassification Report (per-class P/R/F1):")
print(classification_report(y_true, y_pred, target_names=target_names, digits=4))


In [None]:

# Data Slice Analysis

TEXT_COL = "text" 

def eval_on_indices(idxs, name):
    sub = tr_ds.select(list(map(int, idxs)))
    p = trainer.predict(sub)
    yt = p.label_ids
    yp = np.argmax(p.predictions, axis=1)
    rep = classification_report(yt, yp, target_names=target_names, digits=4, output_dict=True)
    macro_f1 = rep["macro avg"]["f1-score"]
    acc = rep["accuracy"]
    print(f"\n[{name}] n={len(sub)} | acc={acc:.4f} | macro_f1={macro_f1:.4f}")
    return rep


In [None]:
# short or long texts
lengths = [len(tokenizer(str(x), truncation=False)["input_ids"]) for x in tr_ds[TEXT_COL]]

q1 = np.quantile(lengths, 0.33)
q2 = np.quantile(lengths, 0.66)

short_idxs = [i for i,l in enumerate(lengths) if l <= q1]
mid_idxs   = [i for i,l in enumerate(lengths) if q1 < l <= q2]
long_idxs  = [i for i,l in enumerate(lengths) if l > q2]

eval_on_indices(short_idxs, "SHORT texts")
eval_on_indices(mid_idxs,   "MID texts")
eval_on_indices(long_idxs,  "LONG texts")


In [None]:
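# Negation or not (substring match: a cue such as "yok" also matches longer words that contain it)
neg_words = ["değil", "yok", "hiç", "asla", "olmuyor", "olmadı", "olmaz"]
has_neg = [i for i,t in enumerate(tr_ds[TEXT_COL]) if any(w in str(t).lower() for w in neg_words)]
has_neg_set = set(has_neg)  # build the set once instead of once per index
no_neg  = [i for i in range(len(tr_ds)) if i not in has_neg_set]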

rep_neg = eval_on_indices(has_neg, "NEGATION present")
rep_noneg = eval_on_indices(no_neg, "NO negation")

# Compare Macro F1's
print("Macro-F1 (neg):   ", rep_neg["macro avg"]["f1-score"])
print("Macro-F1 (no neg):", rep_noneg["macro avg"]["f1-score"])
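
The negation slice above relies on substring matching, which can also fire on unrelated words that happen to contain a cue. The sketch below contrasts it with a whole-word variant. It is an illustration only: the example sentences are invented, and `has_negation_token` is a hypothetical alternative, not the rule used for the reported numbers.

```python
import re

neg_words = ["değil", "yok", "hiç", "asla", "olmuyor", "olmadı", "olmaz"]
neg_set = set(neg_words)

def has_negation_substring(text):
    """Rule used in the slice analysis: a cue appears anywhere in the text."""
    return any(w in text.lower() for w in neg_words)

def has_negation_token(text):
    """Stricter variant: a cue must match a whole word."""
    return any(tok in neg_set for tok in re.findall(r"\w+", text.lower()))

# Invented examples, not taken from the dataset.
examples = ["Bu yokuş çok dik", "Hiç mutlu değilim", "Bugün param yok"]
for text in examples:
    print(has_negation_substring(text), has_negation_token(text), text)
```

The first sentence contains no negation, yet the substring rule flags it because "yokuş" contains "yok". The whole-word rule avoids that false positive but misses inflected forms such as "değilim" when no other cue is present, so neither rule is strictly better for Turkish.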

In [None]:
K = 6  # number of misclassified examples to sample per true class
rng = np.random.default_rng(42)

rows = []
for cls_id, cls_name in enumerate(target_names):
    idxs = np.where((y_true == cls_id) & (y_pred != y_true))[0]
    if len(idxs) == 0:
        continue
    pick = rng.choice(idxs, size=min(K, len(idxs)), replace=False)
    for i in pick:
        rows.append({
            "text": tr_ds[TEXT_COL][int(i)],
            "true": target_names[int(y_true[int(i)])],
            "pred": target_names[int(y_pred[int(i)])]
        })

df_err = pd.DataFrame(rows).sample(frac=1, random_state=42).reset_index(drop=True)
display(df_err)
print(df_err["true"].value_counts())

In [None]:
df_err.to_excel("qualitative_error_samples.xlsx", index=False) #save to excel