## Sentiment Analysis of HeritageRoots VR Think-Aloud Transcripts with NLTK (VADER)

1. Load a JSON dataset
2. Extract Participant lines (ignoring Interviewer lines)
3. Use NLTK VADER to score sentiment per utterance
4. Aggregate sentiment per participant and save CSVs

In [29]:
import json
import os
from pathlib import Path
from typing import List, Dict

# Data wrangling
import pandas as pd

# NLTK VADER sentiment
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

In [30]:
# ---------- Config ----------
DATA_PATH = Path("data/heritageroots_ux_transcripts.json")  
OUT_DIR = Path("data/sentiment_outputs")
OUT_DIR.mkdir(parents=True, exist_ok=True)  # also create "data/" if it is missing

In [31]:
# ---------- Ensure VADER is available ----------
# If the lexicon is missing, NLTK raises LookupError and we download it once.
# (In some teaching environments, internet is blocked; if so, pre-install before class.)
try:
    _ = SentimentIntensityAnalyzer()
except LookupError:
    nltk.download("vader_lexicon")

# Create analyzer (now that we (should) have the lexicon)
sia = SentimentIntensityAnalyzer()

In [32]:
# ---------- Load JSON ----------
if not DATA_PATH.exists():
    raise FileNotFoundError(f"Could not find {DATA_PATH.resolve()}. "
                            "Place the JSON in this folder or update DATA_PATH.")

with open(DATA_PATH, "r", encoding="utf-8") as f:
    data = json.load(f)

participants = data.get("participants", [])
print(f"Loaded {len(participants)} participants.")

Loaded 20 participants.


In [33]:
participants

[{'id': 'P01',
  'bio': '24-year-old Computer Science student named Alex, female, highly familiar with VR and technology.',
  'transcript': [{'time': '[00:00:14]',
    'speaker': 'Interviewer',
    'text': 'You are now in the HeritageRoots virtual lobby. What do you notice first?'},
   {'time': '[00:00:48]',
    'speaker': 'Participant',
    'text': 'The lighting and spatial sound immediately make it feel immersive. I noticed the floating panels, but wasn’t sure which gesture activates them.'},
   {'time': '[00:00:59]',
    'speaker': 'Interviewer',
    'text': 'Try navigating to a new scene using the teleport tool.'},
   {'time': '[00:01:20]',
    'speaker': 'Participant',
    'text': 'Teleportation is smooth, but I’d like a quick preview of where I’ll land. The fade helps, but maybe a clearer arrow would help directionally.'},
   {'time': '[00:01:39]',
    'speaker': 'Interviewer',
    'text': 'How do the menus and icons feel to interact with?'},
   {'time': '[00:01:54]',
    'speake

In [34]:
# ---------- Flatten to a row-per-utterance dataframe ----------
rows: List[Dict] = []
for p in participants:
    pid = p.get("id")
    bio = p.get("bio")
    for turn in p.get("transcript", []):
        if turn.get("speaker") != "Participant":
            continue  # analyze only participant utterances
        rows.append({
            "participant_id": pid,
            "bio": bio,
            "time": turn.get("time"),
            "text": turn.get("text", "").strip()
        })

df = pd.DataFrame(rows)
print(f"Utterances (Participant): {len(df)}")

Utterances (Participant): 120
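
With 120 participant utterances across 20 participants, an even split would be six per participant. A minimal sketch of a sanity check for that, using a small hypothetical `sample_rows` list in the same shape as `rows` above (the helper name and the sample values are illustrative, not taken from the dataset):

```python
from collections import Counter

# Hypothetical rows in the same shape as the flattened `rows` list.
sample_rows = [
    {"participant_id": "P01", "text": "Teleportation is smooth."},
    {"participant_id": "P01", "text": "The icons are intuitive."},
    {"participant_id": "P02", "text": "No lag."},
]

def utterances_per_participant(rows):
    """Count how many utterances each participant contributed."""
    return Counter(r["participant_id"] for r in rows)

counts = utterances_per_participant(sample_rows)

# Flag participants who have fewer utterances than the most talkative one.
most = max(counts.values())
uneven = {pid: n for pid, n in counts.items() if n != most}

print(counts)
print("Participants below the maximum count:", uneven)
```

Running the same helper on the real `rows` list would show whether any participant is missing turns before their scores are averaged.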


In [35]:
df.head()

| | participant_id | bio | time | text |
|---|---|---|---|---|
| 0 | P01 | 24-year-old Computer Science student named Ale... | [00:00:48] | The lighting and spatial sound immediately mak... |
| 1 | P01 | 24-year-old Computer Science student named Ale... | [00:01:20] | Teleportation is smooth, but I’d like a quick ... |
| 2 | P01 | 24-year-old Computer Science student named Ale... | [00:01:54] | The icons are intuitive but slightly small. I ... |
| 3 | P01 | 24-year-old Computer Science student named Ale... | [00:02:16] | Picking up artifacts feels satisfying. However... |
| 4 | P01 | 24-year-old Computer Science student named Ale... | [00:03:08] | No lag, though the ambient sound loop is a lit... |


In [None]:
# Optional: drop empty lines if any
df = df[df["text"].str.len() > 0].copy()

In [None]:
# ---------- Score sentiment per utterance ----------
def label_from_compound(score: float, pos=0.05, neg=-0.05) -> str:
    """Map VADER compound score to label: positive / neutral / negative."""
    if score >= pos:
        return "positive"
    elif score <= neg:
        return "negative"
    else:
        return "neutral"

scores = df["text"].apply(sia.polarity_scores).apply(pd.Series)
df = pd.concat([df, scores], axis=1)
df["label"] = df["compound"].apply(label_from_compound)
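
The ±0.05 cut-offs decide which utterances count as neutral, so it helps to see how they behave on concrete scores. The sketch below repeats the labelling function so it runs on its own and applies it to four compound scores that appear in this notebook's output. The wider ±0.5 band is a hypothetical alternative for comparison, not a recommended setting.

```python
def label_from_compound(score: float, pos: float = 0.05, neg: float = -0.05) -> str:
    """Map a VADER compound score to positive / neutral / negative."""
    if score >= pos:
        return "positive"
    elif score <= neg:
        return "negative"
    return "neutral"

# Compound scores seen in the scored rows of this notebook.
examples = [0.8807, 0.4588, 0.0, -0.5574]

# Default thresholds used in the notebook.
default_labels = [label_from_compound(s) for s in examples]

# Hypothetical wider neutral band: mildly positive scores become neutral.
wide_labels = [label_from_compound(s, pos=0.5, neg=-0.5) for s in examples]

for s, d, w in zip(examples, default_labels, wide_labels):
    print(f"{s:+.4f}  default={d:<8}  wide={w}")
```

With the default thresholds only the exact 0.0 score is neutral. Under the wider band the 0.4588 utterance also becomes neutral, which shows how sensitive the participant-level labels can be to this choice.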

In [43]:
df.sample(5)

| | participant_id | bio | time | text | neg | neu | pos | compound | label |
|---|---|---|---|---|---|---|---|---|---|
| 79 | P14 | 19-year-old Music Technology student named Eme... | [00:01:00] | Teleportation is smooth, but I’d like a quick ... | 0.0 | 0.651 | 0.349 | 0.8807 | positive |
| 57 | P10 | 27-year-old Psychology student named Skyler, f... | [00:02:43] | Picking up artifacts feels satisfying. However... | 0.0 | 0.833 | 0.167 | 0.4588 | positive |
| 37 | P07 | 27-year-old History student named Morgan, nonb... | [00:00:46] | Teleportation is smooth, but I’d like a quick ... | 0.0 | 0.651 | 0.349 | 0.8807 | positive |
| 2 | P01 | 24-year-old Computer Science student named Ale... | [00:01:54] | The icons are intuitive but slightly small. I ... | 0.0 | 1.0 | 0.0 | 0.0 | neutral |
| 29 | P05 | 28-year-old Information Science student named ... | [00:03:06] | Pretty intuitive overall. I’d just tweak inter... | 0.0 | 0.824 | 0.176 | 0.4939 | positive |


In [None]:
# Top clearly positive/negative quotes (by compound)
TOP_N = 5  # number of quotes to show in each direction
top_pos = df.sort_values("compound", ascending=False).head(TOP_N)
top_neg = df.sort_values("compound", ascending=True).head(TOP_N)

print(f"\nTop {TOP_N} positive utterances:")
for _, r in top_pos.iterrows():
    print(f"- {r['participant_id']} {r['time']} | {r['compound']:.3f} | {r['text']}")

print(f"\nTop {TOP_N} negative utterances:")
for _, r in top_neg.iterrows():
    print(f"- {r['participant_id']} {r['time']} | {r['compound']:.3f} | {r['text']}")



Top 5 positive utterances:
- P12 [00:01:40] | 0.881 | Teleportation is smooth, but I’d like a quick preview of where I’ll land. The fade helps, but maybe a clearer arrow would help directionally.
- P03 [00:01:21] | 0.881 | Teleportation is smooth, but I’d like a quick preview of where I’ll land. The fade helps, but maybe a clearer arrow would help directionally.
- P17 [00:00:44] | 0.881 | Teleportation is smooth, but I’d like a quick preview of where I’ll land. The fade helps, but maybe a clearer arrow would help directionally.
- P16 [00:01:25] | 0.881 | Teleportation is smooth, but I’d like a quick preview of where I’ll land. The fade helps, but maybe a clearer arrow would help directionally.
- P07 [00:00:46] | 0.881 | Teleportation is smooth, but I’d like a quick preview of where I’ll land. The fade helps, but maybe a clearer arrow would help directionally.

Top 5 negative utterances:
- P07 [00:02:52] | -0.557 | No lag, though the ambient sound loop is a little short—I noticed repet
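
The top-ranked rows above are the same sentence repeated by different participants, so the ranking surfaces one quote several times. A minimal sketch of one way to handle this, assuming a dataframe with the same `participant_id`, `text`, and `compound` columns as `df`: collapse identical texts before ranking and keep a count of how many participants said each one. The `unique_quotes` helper and the toy frame are hypothetical. Its scores are copied from the output above but the texts are shortened, so the pairing is illustrative only.

```python
import pandas as pd

# Toy frame mimicking the scored utterances (shortened, illustrative texts).
toy = pd.DataFrame({
    "participant_id": ["P12", "P03", "P07", "P05", "P07"],
    "text": [
        "Teleportation is smooth.",
        "Teleportation is smooth.",
        "Teleportation is smooth.",
        "Pretty intuitive overall.",
        "No lag, though the loop is short.",
    ],
    "compound": [0.8807, 0.8807, 0.8807, 0.4939, -0.5574],
})

def unique_quotes(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per distinct text, ranked by compound score."""
    return (
        frame.groupby("text", as_index=False)
             .agg(
                 compound=("compound", "first"),  # identical text gets an identical score
                 n_participants=("participant_id", "nunique"),
             )
             .sort_values("compound", ascending=False)
             .reset_index(drop=True)
    )

quotes = unique_quotes(toy)
print(quotes)
```

Ranking `unique_quotes(df)` instead of `df` would list each distinct quote once, and `n_participants` would show how widely it was shared.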

In [None]:

# ---------- Aggregate per participant ----------
agg = (
    df.groupby("participant_id")
      .agg(
          n_utterances=("text", "count"),
          compound_mean=("compound", "mean"),
          compound_median=("compound", "median"),
          pos_mean=("pos", "mean"),
          neg_mean=("neg", "mean"),
          neu_mean=("neu", "mean"),
      )
      .reset_index()
      .sort_values("compound_mean", ascending=False)
)

# Add a participant-level label from the mean compound
agg["participant_label"] = agg["compound_mean"].apply(label_from_compound)

# Join back one bio per participant
bios = df.groupby("participant_id")["bio"].first().reset_index()
agg = agg.merge(bios, on="participant_id", how="left")

print("\nParticipant-level sentiment summary:")
print(agg.head(10))

# ---------- Save outputs ----------
utterance_csv = OUT_DIR / "utterance_sentiment.csv"
participant_csv = OUT_DIR / "participant_sentiment_summary.csv"

df.to_csv(utterance_csv, index=False)
agg.to_csv(participant_csv, index=False)

print(f"\nSaved utterance-level sentiment to: {utterance_csv.resolve()}")
print(f"Saved participant-level summary to: {participant_csv.resolve()}")


Sample scored rows:
   participant_id        time  \
44            P08  [00:01:59]   
47            P08  [00:03:54]   
4             P01  [00:03:08]   
55            P10  [00:01:08]   
26            P05  [00:01:54]   

                                                 text    neg    neu    pos  \
44  The icons are intuitive but slightly small. I ...  0.000  1.000  0.000   
47  Pretty intuitive overall. I’d just tweak inter...  0.000  0.824  0.176   
4   No lag, though the ambient sound loop is a lit...  0.261  0.739  0.000   
55  Teleportation is smooth, but I’d like a quick ...  0.000  0.651  0.349   
26  The icons are intuitive but slightly small. I ...  0.000  1.000  0.000   

    compound     label  
44    0.0000   neutral  
47    0.4939  positive  
4    -0.5574  negative  
55    0.8807  positive  
26    0.0000   neutral  
