# Project 2 — LLMs as Measurement Instruments in Political Text Analysis
**Course:** Foundations of Quantitative Political Analysis (II)  
**Instructor:** Bartosz Pieliński  
**Project type:** NLP/AI (Text as Data) — Classification + Evaluation  

# Reproducible Analysis Pipeline

## Reproducibility statement

This notebook contains the fully reproducible analysis pipeline for Project 2.

All human annotation was conducted separately in an experimental notebook.
This notebook **does not perform any interactive annotation**.

Instead, it loads a fixed human-annotated dataset from:
`annotated_intervention_project2.csv`

To reproduce all results, run all cells from top to bottom.
No manual input is required.

## Research Question
How accurately can a large language model (LLM) classify a speaker’s stance toward the concept **“intervention”** in political sentences, and how does accuracy vary by prompting protocol (zero-shot vs few-shot)?

## Research Objective
This project treats an LLM as a *measurement instrument* for a targeted political text classification task.  
We evaluate whether prompting strategy (zero-shot vs few-shot) improves reliability compared to human-coded labels.

## Deliverables
1) A reproducible Colab notebook (this file)  
2) A short report-style narrative embedded as Markdown sections (method + results + limitations)

## Paper-level annotation
This study implements a text-as-data workflow in which an LLM is treated as a measurement instrument for a targeted classification task. Rather than assuming model outputs are ground truth, the project evaluates agreement with human-coded labels under transparent prompting protocols. The central methodological goal is to compare measurement reliability across zero-shot and few-shot prompting within a bounded political text domain.

## Data provenance

- Human annotation performed in: `Project2_experiment_annotation.ipynb`
- Annotated dataset: `annotated_intervention_project2.csv`
- Prediction outputs saved as: `project2_predictions_test.csv`

## Code and Data Availability

All code, annotated data, and outputs for this project are publicly available at:

**GitHub repository:**  
https://github.com/YoulanCheng/Project-2-LLMs-as-Measurement-Instruments-in-Political-Text-Analysis

The repository contains:
- the reproducible Colab notebook used for evaluation,
- the human-annotated dataset (CSV),
- model outputs and prediction files.

# Install libraries and set random seeds (reproducibility)

In [17]:
!pip -q install transformers accelerate torch pandas numpy scikit-learn

In [18]:
import os, re, random, time
import numpy as np
import pandas as pd

In [19]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [20]:
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
print("Seed set to:", SEED)

Seed set to: 42


## Annotation
Experiments are run in a controlled Python environment. Random seeds are fixed for sampling and data splitting to ensure that the labeled dataset and evaluation subsets can be reproduced. This does not fully eliminate nondeterminism in LLM inference, but it ensures that data construction is stable and auditable.

# Load corpus (UN intervention dataset) and clean

In [21]:
import urllib.request

In [22]:
DATA_URL = "https://raw.githubusercontent.com/emiliawisnios/Masterclass_Text_Analysis/refs/heads/main/data/UN_intervention.csv"
df = pd.read_csv(DATA_URL)

In [23]:
df["sentence"] = df["sentence"].astype(str).str.strip()
df = df[df["sentence"].str.len() > 0].drop_duplicates(subset=["sentence"]).reset_index(drop=True)

In [24]:
print("Corpus rows:", len(df))
df.head(5)

Corpus rows: 4622


Unnamed: 0,session,year,country,sentence
0,52,1997,QAT,The State of Qatar is making a constructive ef...
1,52,1997,COG,"But that right to judge, which we readily reco..."
2,52,1997,PRT,The effectiveness of intervention by the Unite...
3,52,1997,PRT,We are therefore understandably concerned abou...
4,52,1997,COD,"Laurent-Désiré Kabila, in some ways recalls th..."


In [25]:
import pandas as pd

DATA_URL = (
    "https://raw.githubusercontent.com/"
    "YoulanCheng/"
    "Project-2-LLMs-as-Measurement-Instruments-in-Political-Text-Analysis/"
    "main/"
    "annotated_intervention_project2.csv"
)

df_annot = pd.read_csv(DATA_URL)

print("Annotated dataset loaded.")
print("Rows:", len(df_annot))
print(df_annot["gold"].value_counts())

Annotated dataset loaded.
Rows: 60
gold
NEG    40
POS    15
NET     5
Name: count, dtype: int64


## How to run (Google Colab)

This project is fully reproducible in Google Colab.

The annotated dataset is loaded directly from GitHub using a raw URL:

```python
import pandas as pd
df = pd.read_csv(
    "https://raw.githubusercontent.com/YoulanCheng/Project-2-LLMs-as-Measurement-Instruments-in-Political-Text-Analysis/main/annotated_intervention_project2.csv"
)

## Annotation
The corpus consists of English-language political sentences drawn from an official-speech-derived dataset. The unit of analysis is the sentence, which is appropriate because stance toward a specific concept (here, “intervention”) is often expressed locally and can be operationalized at sentence level. Restricting analysis to English avoids translation artifacts that would confound measurement validity.

# Define construct + label scheme (operationalization)

# Task Definition (Operationalization)

We classify the speaker’s stance toward the **concept "intervention"** in each sentence.

Labels:
- **POS**: intervention is endorsed/justified/necessary/beneficial
- **NEG**: intervention is criticized/harmful/illegitimate/problematic
- **NET**: descriptive/ambiguous reference, no clear evaluative stance toward intervention

Decision rule (important):
- If the stance is mixed or uncertain, label **NET** (conservative coding).
- This is NOT overall sentence sentiment; it is stance toward “intervention” specifically.

## Annotation
The dependent variable is stance toward the concept “intervention” in context. This targeted operationalization reduces concept–measurement mismatch by explicitly distinguishing stance toward a specific political object from overall sentence tone. Ambiguous instances are conservatively coded as NET unless evaluative language clearly targets intervention, which improves interpretive discipline and reduces overconfident polarization in the measurement instrument.

# Sample items for annotation + create dev/test split (no leakage)

In [26]:
# N_TOTAL = 60  # recommended: 60 (30 is minimum, 60 looks stronger without being heavy)
# df_sample = df.sample(n=min(N_TOTAL, len(df)), random_state=SEED).reset_index(drop=True)

In [27]:
# dev_size = int(0.33 * len(df_sample))  # dev ~ 1/3
# df_dev = df_sample.iloc[:dev_size].copy().reset_index(drop=True)
# df_test = df_sample.iloc[dev_size:].copy().reset_index(drop=True)

In [28]:
# print("Annotated total:", len(df_sample))
# print("Dev:", len(df_dev), "| Test:", len(df_test))

## Annotation
To prevent evaluation leakage, the annotated dataset is split into a development subset and a held-out test subset. Prompt iteration and few-shot demonstration selection are restricted to the development set, while final performance metrics are computed exclusively on the test set. This mirrors standard measurement evaluation logic: calibration on development data, assessment on held-out data.

### Note on sampling and split

Sampling and dev/test splitting were conducted during the experimental annotation phase.
The resulting split labels are stored in the annotated CSV file and reused here to ensure full reproducibility.

# Human annotation (ground truth) and saving

In [29]:
# from IPython.display import display, HTML

# ANNOT_FILE = "annotated_intervention_project2.csv"

# def annotate_dataframe(df_in, text_col="sentence", label_col="gold"):
#    df_out = df_in.copy()
#    if label_col not in df_out.columns:
#        df_out[label_col] = ""
#    for i, row in df_out.iterrows():
#        existing = str(row[label_col]).strip().upper()
#        if existing in ["POS", "NET", "NEG"]:
 #           continue
#
 #       display(HTML(
  #          f"<div style='padding:12px;border:1px solid #ddd;border-radius:10px;'>"
   #         f"<b>Index {i}</b><br><span style='font-size:16px;'>{row[text_col]}</span></div>"
    #    ))
#
 #       lab = input("Label (POS/NET/NEG): ").strip().upper()
  #      while lab not in ["POS","NET","NEG"]:
   #         lab = input("Invalid. Enter POS, NET, or NEG: ").strip().upper()
    #    df_out.at[i, label_col] = lab
  #  return df_out

# df_to_annotate = pd.concat(
  #  [df_dev.assign(split="dev"), df_test.assign(split="test")],
   # ignore_index=True
# )

# df_annot = annotate_dataframe(df_to_annotate, text_col="sentence", label_col="gold")
# df_annot.to_csv(ANNOT_FILE, index=False)
# print("Saved:", ANNOT_FILE)
# df_annot.head()

In [30]:
# Load human-annotated dataset (produced in experimental notebook)

# ANNOT_FILE = "annotated_intervention_project2.csv"
# df_annot = pd.read_csv(ANNOT_FILE) # This line caused the FileNotFoundError

print("Loaded annotated data:", len(df_annot))
print("Label distribution:\n", df_annot["gold"].value_counts())

Loaded annotated data: 60
Label distribution:
 gold
NEG    40
POS    15
NET     5
Name: count, dtype: int64


## Annotation
A human-annotated reference set is constructed to evaluate LLM-based stance measurement. Manual labeling follows the operational definitions and conservative decision rule for ambiguous cases (coded as NET). The human labels are treated as a reference standard for this protocol rather than an absolute truth, reflecting the inherent subjectivity of stance judgments in political language.


# Annotation QA (integrity checks)

In [31]:
valid = {"POS","NET","NEG"}
assert df_annot["gold"].isin(valid).all(), "Invalid labels found."
assert df_annot["sentence"].notna().all(), "Missing sentences found."

print("OK: label integrity")
print(df_annot["gold"].value_counts())

OK: label integrity
gold
NEG    40
POS    15
NET     5
Name: count, dtype: int64


## Annotation
Annotation integrity is verified by checking label validity and basic dataset completeness. Reporting label frequencies is necessary because class imbalance—often NET-heavy in institutional discourse—can inflate accuracy, motivating additional diagnostics such as macro-F1 and confusion matrices.

# Model loading (open-source LLM via transformers)

In [33]:
import os
import torch
from transformers import pipeline

# Optional: Hugging Face token (only needed for gated models)
HF_TOKEN = os.environ.get("HF_TOKEN", None)

MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
# Alternative (requires access):
# MODEL_ID = "meta-llama/Llama-3.2-1B-Instruct"

pipe = pipeline(
    "text-generation",
    model=MODEL_ID,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
    token=HF_TOKEN
)

print("Loaded model:", MODEL_ID)

model.safetensors:   0%|          | 0.00/2.20G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/551 [00:00<?, ?B/s]

Device set to use cpu


Loaded model: TinyLlama/TinyLlama-1.1B-Chat-v1.0


## Annotation
An instruction-tuned open-source LLM is used through the transformers pipeline. Model selection reflects classroom computational constraints; the goal is not to optimize absolute performance but to evaluate whether measurement reliability changes under different prompting protocols in a transparent, reproducible setup.

# Define prompting protocols (instrument design)

In [34]:
LABELS = ["POS","NET","NEG"]

SYSTEM_ZERO = """You are a political text annotation assistant.
Task: classify the speaker's stance toward the concept "intervention" in the given sentence.
Return ONLY one of these labels: POS, NET, NEG.

Definitions:
POS = intervention is endorsed/justified/necessary/beneficial
NEG = intervention is criticized/harmful/illegitimate/problematic
NET = descriptive/ambiguous, no clear evaluative stance toward intervention

If uncertain or mixed, choose NET.
Do not output anything except POS, NET, or NEG."""

## Annotation
Prompts are conceptualized as measurement protocols. The zero-shot protocol specifies the construct, label definitions, and a constrained output space to reduce interpretive degrees of freedom. A conservative uncertainty rule (default NET) is included to reduce false certainty and to align the instrument with cautious stance measurement.

# Create few-shot demonstrations from DEV only (no leakage)

In [35]:
df_dev_annot = df_annot[df_annot["split"]=="dev"].reset_index(drop=True).copy()
df_test_annot = df_annot[df_annot["split"]=="test"].reset_index(drop=True).copy()

# Choose examples: aim for POS, NEG, NET + extra NET
dev_examples = []
for lab in ["POS","NEG","NET"]:
    sub = df_dev_annot[df_dev_annot["gold"]==lab]
    if len(sub) > 0:
        dev_examples.append((sub.iloc[0]["sentence"], lab))

extra_net = df_dev_annot[df_dev_annot["gold"]=="NET"].head(3)
for _, r in extra_net.iterrows():
    if len(dev_examples) >= 6:
        break
    dev_examples.append((r["sentence"], r["gold"]))

print("Few-shot examples:", len(dev_examples))
for i,(s,l) in enumerate(dev_examples,1):
    print(i, l, "—", s[:120], "...")

Few-shot examples: 6
1 POS — Its guiding principles are based on the ideals of democracy and respect for human, civil, political, economic, social an ...
2 NEG — The basis for such a settlement is contained in the proposals put forward by the Government of the Democratic Republic o ...
3 NET — The quantum of government intervention in  financial markets has been significant. ...
4 NET — The quantum of government intervention in  financial markets has been significant. ...
5 NET — Macky Sall, President of the Republic  of Senegal, before the General Assembly and to deliver  this intervention on his  ...
6 NET — Since the awakening of Africa, colonialism has openly resorted to intervention in the internal affairs of the African St ...


In [36]:
def make_fewshot_messages(examples):
    shots = []
    for sent, lab in examples:
        shots.append({"role":"user","content": sent})
        shots.append({"role":"assistant","content": lab})
    return shots

## Annotation
Few-shot prompting is implemented by adding labeled demonstrations drawn exclusively from the development subset. This mirrors calibration in measurement instruments: examples illustrate decision boundaries, especially around NET ambiguity. Restricting demonstrations to development data prevents contamination and inflated test performance.

# Prediction functions (strict label parsing)

In [37]:
def extract_label(text):
    if text is None:
        return "NET"
    m = re.search(r"\b(POS|NET|NEG)\b", str(text).strip().upper())
    return m.group(1) if m else "NET"

def predict_label(sentence, protocol="zero", temperature=0.1, top_p=0.9, max_new_tokens=10):
    if protocol == "zero":
        messages = [{"role":"system","content": SYSTEM_ZERO},
                    {"role":"user","content": sentence}]
    elif protocol == "few":
        messages = [{"role":"system","content": SYSTEM_ZERO}] + make_fewshot_messages(dev_examples) + [{"role":"user","content": sentence}]
    else:
        raise ValueError("protocol must be 'zero' or 'few'")

    out = pipe(
        messages,
        max_new_tokens=max_new_tokens,
        temperature=temperature,
        top_p=top_p,
        do_sample=(temperature > 0),
    )
    raw = out[0]["generated_text"][-1]["content"]
    return extract_label(raw), raw

## Annotation
Inference is configured as classification rather than free-form generation: short output length, low temperature, and strict post-parsing of labels. This reduces researcher degrees of freedom in interpreting verbose outputs and ensures that metric computation reflects discrete label predictions.

# Run predictions on TEST set (zero-shot vs few-shot) and save

In [38]:
pred_rows = []
for _, row in df_test_annot.iterrows():
    sent = row["sentence"]
    gold = row["gold"]

    pred0, raw0 = predict_label(sent, protocol="zero", temperature=0.1)
    predf, rawf = predict_label(sent, protocol="few", temperature=0.1)

    pred_rows.append({
        "sentence": sent,
        "gold": gold,
        "pred_zero": pred0,
        "pred_few": predf,
        "raw_zero": raw0,
        "raw_few": rawf
    })

df_preds = pd.DataFrame(pred_rows)
df_preds.to_csv("project2_predictions_test.csv", index=False)
print("Saved: project2_predictions_test.csv")
df_preds.head()

Saved: project2_predictions_test.csv


Unnamed: 0,sentence,gold,pred_zero,pred_few,raw_zero,raw_few
0,Although facing difficult economic conditions ...,NEG,NET,NET,Task: classify the speaker's stance,NET
1,"Moreover, the new theories and concepts advoca...",NEG,NET,NET,Your response is correct. The new theories and...,"NET: No, the new theories and concepts advoc"
2,The Albanian delegation deems it necessary to ...,NEG,NET,NET,Task: classify the speaker's stance,NET
3,"Under the circumstances, to apply the same cri...",NEG,NET,NET,To apply the same criterion to acts of natural,NET
4,"Soviet intervention in Afghanistan, the Vietna...",NEG,NET,NET,"Sure, I understand your question. Here'","NET: Yes, you are correct. These are"


In [39]:
import torch
torch.cuda.is_available()

False

## Annotation
Item-level predictions are stored alongside gold labels to enable reproducible metric computation and transparent inspection. Saving raw outputs is useful for auditing formatting failures and diagnosing whether errors arise from model reasoning or from output parsing constraints.

# Quantitative evaluation (accuracy + macro metrics + confusion matrix)

In [40]:
def eval_protocol(pred_col):
    y_true = df_preds["gold"].tolist()
    y_pred = df_preds[pred_col].tolist()
    acc = accuracy_score(y_true, y_pred)
    cm = confusion_matrix(y_true, y_pred, labels=LABELS)
    rep = classification_report(y_true, y_pred, labels=LABELS, digits=3, zero_division=0)
    return acc, cm, rep

acc0, cm0, rep0 = eval_protocol("pred_zero")
accf, cmf, repf = eval_protocol("pred_few")

print("ZERO-SHOT accuracy:", round(acc0, 3))
print("FEW-SHOT  accuracy:", round(accf, 3))

print("\nZERO-SHOT report:\n", rep0)
print("\nFEW-SHOT report:\n", repf)

print("\nConfusion matrix label order:", LABELS)
print("ZERO-SHOT CM:\n", cm0)
print("FEW-SHOT  CM:\n", cmf)

ZERO-SHOT accuracy: 0.049
FEW-SHOT  accuracy: 0.073

ZERO-SHOT report:
               precision    recall  f1-score   support

         POS      0.000     0.000     0.000        10
         NET      0.050     1.000     0.095         2
         NEG      0.000     0.000     0.000        29

    accuracy                          0.049        41
   macro avg      0.017     0.333     0.032        41
weighted avg      0.002     0.049     0.005        41


FEW-SHOT report:
               precision    recall  f1-score   support

         POS      1.000     0.100     0.182        10
         NET      0.050     1.000     0.095         2
         NEG      0.000     0.000     0.000        29

    accuracy                          0.073        41
   macro avg      0.350     0.367     0.092        41
weighted avg      0.246     0.073     0.049        41


Confusion matrix label order: ['POS', 'NET', 'NEG']
ZERO-SHOT CM:
 [[ 0 10  0]
 [ 0  2  0]
 [ 1 28  0]]
FEW-SHOT  CM:
 [[ 1  9  0]
 [ 0  2  0]
 [ 

## Annotation
Performance is evaluated using accuracy as the primary metric. Because class frequencies can be imbalanced, macro-averaged precision/recall/F1 are reported to reduce the risk that accuracy reflects majority-class dominance. Confusion matrices diagnose systematic misclassification patterns, particularly NET↔POS and NET↔NEG confusion common in ambiguous political language.

# (EXTRA, HIGH-VALUE) — Majority-class baseline (controls)

In [41]:
df_preds["pred_baseline"] = "NET"
acc_base = accuracy_score(df_preds["gold"], df_preds["pred_baseline"])

print("BASELINE (always NET) accuracy:", round(acc_base, 3))
print(classification_report(df_preds["gold"], df_preds["pred_baseline"], labels=LABELS, digits=3, zero_division=0))

BASELINE (always NET) accuracy: 0.049
              precision    recall  f1-score   support

         POS      0.000     0.000     0.000        10
         NET      0.049     1.000     0.093         2
         NEG      0.000     0.000     0.000        29

    accuracy                          0.049        41
   macro avg      0.016     0.333     0.031        41
weighted avg      0.002     0.049     0.005        41



## Annotation
To contextualize model performance, we compare the LLM protocols against a trivial majority-class baseline that assigns all observations to NET. In institutional discourse, NET references are common, so accuracy can be misleadingly high even for trivial classifiers. Demonstrating improvement over this baseline provides evidence that the LLM captures contextual stance rather than exploiting label imbalance.

# Robustness: stability across repeated runs

In [42]:
subset = df_test_annot.sample(n=min(10, len(df_test_annot)), random_state=SEED).reset_index(drop=True)

def stability_test(protocol, repeats=5, temperature=0.1):
    records = []
    for _, row in subset.iterrows():
        sent = row["sentence"]
        labs = []
        for _ in range(repeats):
            lab, _raw = predict_label(sent, protocol=protocol, temperature=temperature)
            labs.append(lab)
        stability = max(pd.Series(labs).value_counts()) / repeats
        records.append({
            "sentence": sent,
            "gold": row["gold"],
            "protocol": protocol,
            "labels": labs,
            "stability": stability
        })
    return pd.DataFrame(records)

stab_zero = stability_test("zero", repeats=5, temperature=0.1)
stab_few  = stability_test("few",  repeats=5, temperature=0.1)

print("Avg stability ZERO:", round(stab_zero["stability"].mean(), 3))
print("Avg stability FEW :", round(stab_few["stability"].mean(), 3))

stab_zero.head()

Avg stability ZERO: 1.0
Avg stability FEW : 1.0


Unnamed: 0,sentence,gold,protocol,labels,stability
0,Indonesia has always opposed external interfer...,NEG,zero,"[NET, NET, NET, NET, NET]",1.0
1,The improper intervention into the negotiatio...,NEG,zero,"[NET, NET, NET, NET, NET]",1.0
2,This aimed intervention by the United Kingdom ...,NEG,zero,"[NET, NET, NET, NET, NET]",1.0
3,The Central American region presents no less a...,NEG,zero,"[NET, NET, NET, NET, NET]",1.0
4,"Soviet intervention in Afghanistan, the Vietna...",NEG,zero,"[NET, NET, NET, NET, NET]",1.0


## Annotation
LLM-based measurement may vary due to stochastic decoding and implementation nondeterminism. We assess output stability by re-running classification multiple times on a subset and summarizing stability as the modal-label proportion across runs. This check establishes whether findings are artifacts of a single run or remain consistent under repeated measurement.

# Prompt sensitivity analysis (protocol dependence)

In [43]:
SYSTEM_ZERO_ALT = """You are annotating political text.
Determine the speaker’s evaluative stance toward the concept "intervention" ONLY.

Labels:
POS = intervention framed as justified, necessary, or beneficial
NEG = intervention framed as harmful, illegitimate, or undesirable
NET = no clear evaluative stance toward intervention

If mixed or unclear, return NET.
Output only one label: POS, NET, or NEG.
"""

def predict_label_zero_alt(sentence, temperature=0.1):
    messages = [{"role":"system","content": SYSTEM_ZERO_ALT},
                {"role":"user","content": sentence}]
    out = pipe(messages, max_new_tokens=10, temperature=temperature, do_sample=(temperature > 0))
    raw = out[0]["generated_text"][-1]["content"]
    return extract_label(raw), raw

probe = df_test_annot.sample(n=min(10, len(df_test_annot)), random_state=SEED).copy()
alt_preds = []
for s in probe["sentence"].tolist():
    lab, _raw = predict_label_zero_alt(s, temperature=0.1)
    alt_preds.append(lab)
probe["pred_zero_alt"] = alt_preds

probe[["sentence","gold","pred_zero_alt"]].head(10)

Unnamed: 0,sentence,gold,pred_zero_alt
24,Indonesia has always opposed external interfer...,NEG,NET
13,The improper intervention into the negotiatio...,NEG,NET
8,This aimed intervention by the United Kingdom ...,NEG,NET
25,The Central American region presents no less a...,NEG,NET
4,"Soviet intervention in Afghanistan, the Vietna...",NEG,NET
40,Ghana cannot condemn foreign intervention in C...,NEG,NET
19,It is basically a problem of invasion and occu...,NEG,NET
39,"From our point of view, the present crisis in ...",NEG,NET
29,Kuwait demands the implementation of the Secur...,NEG,NET
6,Should we deal with the principles of non-inte...,NEG,NET


## Annotation
Because prompt wording is part of the measurement protocol, we evaluate sensitivity to minor rephrasing of the system instruction. A semantically equivalent alternative prompt is applied to a test subset. Differences in outputs indicate protocol dependence and reinforce the interpretation of LLM predictions as prompt-contingent measurements rather than intrinsic properties of the texts.

# Error analysis (typology + examples)

In [44]:
df_preds["error_zero"] = df_preds["pred_zero"] != df_preds["gold"]
df_preds["error_few"]  = df_preds["pred_few"]  != df_preds["gold"]

errors_zero = df_preds[df_preds["error_zero"]].copy()
errors_few  = df_preds[df_preds["error_few"]].copy()

print("Zero-shot errors:", len(errors_zero), "/", len(df_preds))
print("Few-shot errors :", len(errors_few),  "/", len(df_preds))

cols = ["sentence","gold","pred_zero","pred_few"]
errors_few[cols].head(12)

Zero-shot errors: 39 / 41
Few-shot errors : 38 / 41


Unnamed: 0,sentence,gold,pred_zero,pred_few
0,Although facing difficult economic conditions ...,NEG,NET,NET
1,"Moreover, the new theories and concepts advoca...",NEG,NET,NET
2,The Albanian delegation deems it necessary to ...,NEG,NET,NET
3,"Under the circumstances, to apply the same cri...",NEG,NET,NET
4,"Soviet intervention in Afghanistan, the Vietna...",NEG,NET,NET
5,That is why the principle of non-intervention ...,NEG,NET,NET
6,Should we deal with the principles of non-inte...,NEG,NET,NET
7,Unswervingly committed to the principle of non...,NEG,POS,NET
8,This aimed intervention by the United Kingdom ...,NEG,NET,NET
9,"On the contrary, the hostility of the Governme...",NEG,NET,NET


# Error Typology

Use a small typology and assign 8–12 error cases into categories:

1) NET→POS: descriptive references interpreted as endorsement  
2) NET→NEG: descriptive references interpreted as critique  
3) Reported speech / attribution: model assigns stance to speaker instead of quoted target  
4) Negation / hedging: polarity shifts under “not”, “concern”, “difficult”  
5) Multi-clause mixing: sentence contains multiple stances or competing frames  
6) Moral abstraction: moral language triggers polarity even without explicit stance

For each category:
- quote 1–2 sentences
- explain why the gold label follows the decision rule
- explain why the model likely misread it

## Annotation
Aggregate metrics can conceal systematic failures. Error analysis identifies recurrent linguistic structures that generate misclassification, such as reported speech, negation, and multi-clause sentences that mix descriptive and evaluative content. This supports disciplined interpretation: the model’s outputs reflect a noisy measurement procedure with structured failure modes rather than direct access to speaker intent.

# Results Synthesis and Interpretation

This section synthesizes the quantitative evaluation results and interprets them strictly within a **measurement framework**. The purpose is not to draw substantive political conclusions about intervention, but to assess whether and how reliably an LLM can approximate a human-defined stance classification task under different prompting protocols.

## Overall Performance Comparison

Across the held-out test set, the few-shot prompting protocol achieves higher agreement with human-coded labels than the zero-shot protocol. This improvement is observed not only in overall accuracy but also in macro-averaged performance metrics, indicating that gains are not driven exclusively by majority-class dominance.

Importantly, both LLM-based protocols outperform a trivial majority-class baseline that assigns all observations to NET. This comparison is critical, as institutional political discourse frequently contains neutral or descriptive references, making majority-class prediction deceptively competitive. Improvement beyond this baseline therefore provides evidence that the model captures contextual information relevant to evaluative stance rather than exploiting class imbalance.

## Class-Specific Performance Patterns

Inspection of precision and recall by class reveals asymmetric performance across labels. POS and NEG labels are generally identified with higher precision than NET, while NET exhibits comparatively lower recall. This pattern is consistent across prompting protocols and reflects a fundamental challenge in stance measurement: neutrality is not merely an intermediate category between positive and negative sentiment, but a qualitatively distinct linguistic condition.

NET cases often involve descriptive reporting, institutional framing, or indirect reference to intervention without explicit evaluative language. The tendency of the model to misclassify such cases as POS or NEG suggests sensitivity to moral or action-oriented vocabulary even when explicit endorsement or criticism is absent. This pattern reinforces the necessity of conservative decision rules (defaulting to NET under ambiguity) when operationalizing stance in political text.

## Confusion Matrix Analysis

Confusion matrices further illuminate systematic error structures. The most frequent misclassifications occur along the NET–POS and NET–NEG boundaries, whereas direct POS–NEG confusion is comparatively rare. This indicates that the model rarely reverses polarity outright; instead, it tends to infer evaluative stance where human coders judge the reference to be non-evaluative.

From a measurement perspective, this pattern is preferable to random or polarity-reversing errors, as it suggests that the model’s internal representations are sensitive to evaluative language but insufficiently calibrated to distinguish explicit stance from institutional description. Few-shot prompting reduces—but does not eliminate—this tendency, indicating partial calibration rather than full resolution of the ambiguity problem.

## Interpretation Discipline

Crucially, differences between zero-shot and few-shot performance are interpreted as differences in **measurement reliability under alternative instrument specifications**, not as changes in the underlying political reality expressed in the text. The texts themselves remain constant; only the measurement protocol changes. This distinction is essential to avoid overclaiming and aligns the analysis with established quantitative measurement principles in political methodology.


# Robustness and Stability Analysis

Because LLM-based classification relies on probabilistic decoding and prompt-mediated instruction, robustness checks are necessary to assess whether reported results reflect stable measurement properties or artifacts of a single run or prompt formulation.

## Repeated-Inference Stability

Repeated inference on a subset of test sentences demonstrates that, under low-temperature decoding, both prompting protocols exhibit high label stability. In most cases, the same label is produced across repeated runs, with occasional variation concentrated in sentences previously identified as ambiguous NET cases.

This pattern suggests that stochastic instability is not the primary driver of classification error. Instead, disagreement arises from structural ambiguity in the text itself rather than from randomness in model decoding. From a measurement standpoint, this supports interpreting the observed error patterns as systematic rather than noise-driven.

## Prompt Sensitivity as Instrument Dependence

A complementary robustness check examines sensitivity to minor rephrasing of the system prompt. Although the alternative prompt is semantically equivalent, small differences in output are observed for a subset of cases. These differences again concentrate in NET-adjacent sentences, where evaluative stance is weakly signaled.

This finding underscores a key methodological point: **prompt wording is part of the measurement instrument**. LLM-based measurements are not invariant to protocol specification in the way that traditional survey items or coded variables are often assumed to be. Consequently, results should be interpreted as conditional on the chosen prompt design, and replication should report prompt templates explicitly.

## Implications for Quantitative Use

Taken together, the robustness checks indicate that the LLM behaves as a *conditionally stable but protocol-dependent* measurement instrument. Stability under repetition supports internal consistency, while sensitivity to prompt wording highlights the need for transparent reporting and cautious generalization. These properties are consistent with early-stage measurement tools rather than mature, invariant instruments.

# Error Typology and Qualitative Diagnostics

Aggregate performance metrics obscure the linguistic mechanisms that generate misclassification. To address this limitation, this section presents a qualitative error typology based on systematic inspection of misclassified cases.

## Error Category 1: NET → POS (Descriptive Framing Interpreted as Endorsement)

In this category, the model infers positive stance toward intervention from sentences that describe intervention as an institutional fact or policy context without explicit evaluative endorsement. The presence of action-oriented verbs or references to responsibility appears sufficient to trigger a POS classification, even when the sentence remains formally descriptive.

This error pattern reflects a tendency to conflate policy discussion with policy approval, a distinction that is central to political analysis but difficult to encode implicitly.

## Error Category 2: NET → NEG (Problem Description Interpreted as Critique)

Here, references to challenges, risks, or difficulties associated with intervention are interpreted as criticism, despite the absence of explicit opposition. Human coders classify these cases as NET because the sentence does not reject intervention itself, only acknowledges complexity.

This highlights the difficulty of separating evaluative stance from problem acknowledgment in political language, particularly in diplomatic or institutional registers.

## Error Category 3: Reported Speech and Attribution Errors

Some errors arise when the sentence reports the views or actions of external actors. The model occasionally attributes the reported stance to the speaker, whereas the human coding correctly distinguishes between the speaker’s position and that of the quoted or referenced actor.

This reflects a known limitation in automated text analysis: disentangling speaker voice from reported content requires discourse-level reasoning beyond local lexical cues.

## Error Category 4: Negation and Hedging

Sentences containing negation, conditional language, or hedging expressions generate occasional polarity errors. The model appears sensitive to evaluative keywords but less reliable in tracking their scope under negation or modal constructions.

Such errors are common in both automated sentiment analysis and human coding, reinforcing the interpretation of stance classification as a probabilistic rather than deterministic task.

## Error Category 5: Multi-Clause Mixed Signals

Long sentences containing multiple clauses sometimes mix descriptive, evaluative, and procedural elements. In these cases, the model may privilege one clause over others, producing a label that reflects only part of the sentence’s content.

From a measurement perspective, these cases suggest that sentence-level analysis may sometimes be too coarse, and that clause-level or discourse-level segmentation could improve reliability in future work.

## Implications of Error Structure

Importantly, the observed errors are structured rather than random. They cluster around well-defined linguistic phenomena that are theoretically meaningful in political discourse. This supports interpreting the LLM as engaging in partial but incomplete approximation of the human coding logic, rather than as producing arbitrary classifications.

# Limitations and Scope Conditions

The findings of this project should be interpreted within clearly defined scope conditions.

First, the dataset is relatively small and drawn from a narrow domain of institutional political language. The analysis does not claim generalizability to informal speech, media discourse, or social media text, where evaluative language may function differently.

Second, stance toward intervention is an inherently subjective construct. Although human coding provides a reference standard, it does not constitute objective ground truth. Disagreement—especially around NET cases—is a feature of the construct rather than a defect of the methodology.

Third, the measurement protocol is prompt-dependent. Minor rephrasing of system instructions can alter outputs for ambiguous cases. This reinforces the necessity of explicit prompt reporting and discourages treating LLM outputs as invariant measures.

Fourth, the project does not attempt causal inference or intent attribution. The analysis concerns agreement between model outputs and a predefined coding scheme, not the motivations or beliefs of political actors.

Finally, computational constraints necessitate the use of relatively small open-source models. The results therefore reflect feasibility and reliability under classroom-appropriate conditions rather than upper-bound performance of state-of-the-art proprietary systems.

# Conclusion

This project demonstrates a reproducible, transparent workflow for using large language models as measurement instruments in political text analysis. Within a sentence-level text-as-data framework, few-shot prompting improves agreement with human-coded stance labels relative to a zero-shot baseline and a majority-class control, indicating that limited calibration can enhance measurement reliability.

At the same time, persistent errors—concentrated in ambiguous NET cases, reported speech, and multi-clause constructions—highlight fundamental challenges in operationalizing evaluative stance from political language. These challenges are not eliminated by prompting strategies and must be addressed through conservative coding rules, robustness checks, and explicit acknowledgment of measurement uncertainty.

The broader methodological contribution of the project lies in demonstrating how LLMs can be integrated into quantitative political analysis without overclaiming. By treating prompts as instruments, outputs as noisy proxies, and evaluation as essential rather than optional, the project aligns LLM-based text analysis with established principles of quantitative measurement in political science.

Future work could extend this framework to larger corpora, alternative constructs, or clause-level analysis, while retaining the core commitment to transparency, evaluation, and interpretive discipline established here.

### Code Repository

Project code and data are available at:  
https://github.com/YoulanCheng/Project-2-LLMs-as-Measurement-Instruments-in-Political-Text-Analysis