# Advanced Preprocessing & Sentiment-Augmented Features

This notebook expands the preprocessing pipeline for the CLARITY dataset.
Goals:
- Keep linguistically informative tokens (no stop-word removal) while normalizing casing/spacing.
- Engineer structural features (length, overlap, punctuation) useful for ambiguity/evasion cues.
- Enrich each sample with sentiment signals from a free Hugging Face transformer.
- Prototype a classifier that fuses TF–IDF text with the engineered numeric features.

In [11]:
!pip install --upgrade transformers
!pip install --upgrade Pillow

Defaulting to user installation because normal site-packages is not writeable

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.2[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Defaulting to user installation because normal site-packages is not writeable
Collecting Pillow
  Downloading pillow-12.0.0-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (8.8 kB)
Downloading pillow-12.0.0-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (7.0 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.0/7.0 MB[0m [31m814.2 kB/s[0m  [33m0:00:08[0m eta [36m0:00:01[0m
[?25hInstalling collected packages: Pillow
Successfully installed Pillow-12.0.0

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.2[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mno

In [14]:
import AutoTokenizer

ModuleNotFoundError: No module named 'AutoTokenizer'

In [3]:
import pandas as pd
import numpy as np
from pathlib import Path
import re

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

pd.options.display.max_columns = 50

  from .autonotebook import tqdm as notebook_tqdm
2025-11-13 23:35:54.337741: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-11-13 23:35:54.720343: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-11-13 23:35:56.427940: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.


## Load QEvasion data

In [4]:
DATA_PATH = Path('../data/raw/QEvasion.csv')
if not DATA_PATH.exists():
    raise FileNotFoundError('Expected QEvasion CSV under data/raw/. Run preprocessing first.')

df = pd.read_csv(DATA_PATH)
print(f"Loaded {len(df):,} rows")
df.head(2)

Loaded 3,448 rows


Unnamed: 0,title,date,president,url,question_order,interview_question,interview_answer,gpt3.5_summary,gpt3.5_prediction,question,annotator_id,annotator1,annotator2,annotator3,inaudible,multiple_questions,affirmative_questions,index,clarity_label,evasion_label
0,"The President's News Conference in Hanoi, Vietnam","September 10, 2023",Joseph R. Biden,https://www.presidency.ucsb.edu/documents/the-...,1,Q. Of the Biden administration. And accused th...,"Well, look, first of all, theI am sincere abou...",The question consists of 2 parts: \n1. How wou...,Question part: 1. How would you respond to the...,How would you respond to the accusation that t...,85,,,,False,False,False,0,Clear Reply,Explicit
1,"The President's News Conference in Hanoi, Vietnam","September 10, 2023",Joseph R. Biden,https://www.presidency.ucsb.edu/documents/the-...,1,Q. Of the Biden administration. And accused th...,"Well, look, first of all, theI am sincere abou...",The question consists of 2 parts: \n1. How wou...,Question part: 1. How would you respond to the...,Do you think President Xi is being sincere abo...,85,,,,False,False,False,1,Ambivalent,General


## Normalize dates and keep key columns

In [5]:
DROP_COLS = [
    'annotator_id', 'annotator1', 'annotator2', 'annotator3',
    'inaudible', 'multiple_questions', 'affirmative_questions',
    'index', 'question_order', 'url'
]

df['interview_year'] = pd.to_datetime(df['date'], errors='coerce').dt.year
df = df.drop(columns=DROP_COLS + ['date'], errors='ignore')
print('Columns after drop:', len(df.columns))
df[['title', 'interview_year', 'clarity_label']].head(3)

Columns after drop: 10


Unnamed: 0,title,interview_year,clarity_label
0,"The President's News Conference in Hanoi, Vietnam",2023,Clear Reply
1,"The President's News Conference in Hanoi, Vietnam",2023,Ambivalent
2,"The President's News Conference in Hanoi, Vietnam",2023,Ambivalent


## Text normalization (stop-words retained)

In [6]:
def normalize_text(series: pd.Series) -> pd.Series:
    return (series.fillna('')
                  .str.replace(r"\s+", ' ', regex=True)
                  .str.strip()
                  .str.lower())

df['normalized_question'] = normalize_text(df['interview_question'])
df['normalized_answer'] = normalize_text(df['interview_answer'])
df[['normalized_question', 'normalized_answer']].head(3)

Unnamed: 0,normalized_question,normalized_answer
0,q. of the biden administration. and accused th...,"well, look, first of all, thei am sincere abou..."
1,q. of the biden administration. and accused th...,"well, look, first of all, thei am sincere abou..."
2,q. no worries. do you believe the country's sl...,"look, i think china has a difficult economic p..."


## Structural feature engineering

In [7]:
def sentence_stats(text: str):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip()) if text else []
    sentences = [s for s in sentences if s]
    if not sentences:
        return 0, 0
    lengths = [len(s.split()) for s in sentences]
    return len(sentences), np.mean(lengths)

length_features = df[['normalized_question', 'normalized_answer']].copy()
length_features['answer_word_count'] = length_features['normalized_answer'].str.split().str.len()
length_features['question_word_count'] = length_features['normalized_question'].str.split().str.len()
length_features['answer_char_count'] = df['normalized_answer'].str.len()
length_features['answer_sentence_count'], length_features['answer_avg_sentence_len'] = zip(*length_features['normalized_answer'].map(sentence_stats))

def lexical_overlap(row):
    q_tokens = row['normalized_question'].split()
    a_tokens = row['normalized_answer'].split()
    if not q_tokens or not a_tokens:
        return 0.0
    intersection = len(set(q_tokens) & set(a_tokens))
    return intersection / len(set(q_tokens))

length_features['qa_overlap_ratio'] = df.apply(lexical_overlap, axis=1)
length_features['question_mark_flag'] = df['normalized_question'].str.contains('\?', regex=False).astype(int)
length_features['answer_hedge_freq'] = df['normalized_answer'].str.count(r"(maybe|perhaps|sort of|kind of|i think)")

feature_cols = [
    'answer_word_count', 'question_word_count', 'answer_char_count',
    'answer_sentence_count', 'answer_avg_sentence_len', 'qa_overlap_ratio',
    'question_mark_flag', 'answer_hedge_freq'
]

for col in feature_cols:
    df[col] = length_features[col]

df[feature_cols].describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
answer_word_count,3448.0,293.572216,301.541101,1.0,56.0,207.0,440.0,2117.0
question_word_count,3448.0,61.50609,59.859859,3.0,22.0,50.0,82.0,780.0
answer_char_count,3448.0,1663.320476,1733.344026,3.0,309.0,1155.5,2495.0,12102.0
answer_sentence_count,3448.0,14.233759,13.007882,1.0,4.0,11.0,20.0,87.0
answer_avg_sentence_len,3448.0,23.943976,65.22607,1.0,10.4,16.630682,24.291667,1302.0
qa_overlap_ratio,3448.0,0.306205,0.166873,0.0,0.190357,0.333333,0.426829,0.8125
question_mark_flag,3448.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
answer_hedge_freq,3448.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Sentiment extraction via Hugging Face model

In [8]:
!pip install torch

Defaulting to user installation because normal site-packages is not writeable


In [9]:
import torch
print(torch.__version__)


2.9.1+cu128


In [10]:
MODEL_NAME = "cardiffnlp/twitter-roberta-base-sentiment-latest"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
sentiment_pipeline = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    truncation=True,
    max_length=512,
    padding=True,
    return_all_scores=True
)

label_order = [model.config.id2label[idx] for idx in range(model.config.num_labels)]

def batch_sentiment(texts, batch_size=32):
    scores = sentiment_pipeline(texts, batch_size=batch_size)
    rows = []
    for entry in scores:
        rows.append({item['label']: item['score'] for item in entry})
    return pd.DataFrame(rows)[label_order]

question_sentiment = batch_sentiment(df['normalized_question'].tolist())
question_sentiment.columns = [f'question_sent_{col.lower()}' for col in question_sentiment.columns]
answer_sentiment = batch_sentiment(df['normalized_answer'].tolist())
answer_sentiment.columns = [f'answer_sent_{col.lower()}' for col in answer_sentiment.columns]

df = pd.concat([df, question_sentiment, answer_sentiment], axis=1)
df.filter(like='sent_').head(3)

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0


Unnamed: 0,question_sent_negative,question_sent_neutral,question_sent_positive,answer_sent_negative,answer_sent_neutral,answer_sent_positive
0,0.284938,0.694826,0.020236,0.064227,0.749542,0.186231
1,0.284938,0.694826,0.020236,0.064227,0.749542,0.186231
2,0.370508,0.612024,0.017468,0.641184,0.339137,0.019679


In [22]:
pip install --upgrade pip

Defaulting to user installation because normal site-packages is not writeable
Collecting pip
  Downloading pip-25.3-py3-none-any.whl.metadata (4.7 kB)
Downloading pip-25.3-py3-none-any.whl (1.8 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.8/1.8 MB[0m [31m1.5 MB/s[0m  [33m0:00:01[0m eta [36m0:00:01[0m
[?25hInstalling collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 25.2
    Uninstalling pip-25.2:
      Successfully uninstalled pip-25.2
Successfully installed pip-25.3
Note: you may need to restart the kernel to use updated packages.


## Sentiment-augmented classifier prototype

In [11]:
model_df = df.dropna(subset=['clarity_label']).copy()
model_df['text_concat'] = model_df['normalized_question'] + ' ' + model_df['normalized_answer']

numeric_feats = feature_cols + list(question_sentiment.columns) + list(answer_sentiment.columns)

X_train, X_test, y_train, y_test = train_test_split(
    model_df[['text_concat'] + numeric_feats],
    model_df['clarity_label'],
    test_size=0.2,
    stratify=model_df['clarity_label'],
    random_state=7
)

preprocessor = ColumnTransformer([
    ('text', TfidfVectorizer(max_features=30000, ngram_range=(1,2)), 'text_concat'),
    ('numeric', StandardScaler(), numeric_feats)
])

pipeline_model = Pipeline([
    ('features', preprocessor),
    ('clf', LogisticRegression(max_iter=1000, class_weight='balanced'))
])

pipeline_model.fit(X_train, y_train)
preds = pipeline_model.predict(X_test)
print('Validation accuracy:', (preds == y_test).mean().round(3))
classification_report(y_test, preds)

Validation accuracy: 0.561


'                 precision    recall  f1-score   support\n\n     Ambivalent       0.69      0.61      0.65       408\nClear Non-Reply       0.39      0.66      0.49        71\n    Clear Reply       0.44      0.43      0.44       211\n\n       accuracy                           0.56       690\n      macro avg       0.51      0.57      0.52       690\n   weighted avg       0.58      0.56      0.57       690\n'