# Data Preprocessing for Public Unrest Classification

This notebook prepares the **GoEmotions dataset** for the final project in  
**CCS 248 – Artificial Neural Networks**.

The goal of this step is to:
- Clean raw social-media-style text
- Convert emotion labels into **public unrest classes**
- Produce train / validation / test CSV files for model training

In [58]:
import json
import re
from pathlib import Path
import pandas as pd

In [59]:
# Base directory: PublicUnrest/
BASE_DIR = Path("..")

GOEMOTION_DIR = BASE_DIR.parent / "Goemotion" / "data"
DATA_DIR = BASE_DIR / "data"
PROCESSED_DIR = DATA_DIR / "processed"

EMOTIONS_FILE = GOEMOTION_DIR / "emotions.txt"
WEIGHTS_FILE = DATA_DIR / "emotion_weights.json"

TRAIN_TSV = GOEMOTION_DIR / "train.tsv"
DEV_TSV   = GOEMOTION_DIR / "dev.tsv"
TEST_TSV  = GOEMOTION_DIR / "test.tsv"

PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

## Load Emotion Labels and Weights

GoEmotions uses numeric emotion IDs.
These are mapped to emotion names using `emotions.txt`,
and then converted into unrest scores using predefined weights.

In [63]:
def load_emotions():
    # Read with an explicit encoding, matching load_weights() below
    return [e.strip() for e in EMOTIONS_FILE.read_text(encoding="utf-8").splitlines() if e.strip()]

def load_weights():
    with open(WEIGHTS_FILE, "r", encoding="utf-8") as f:
        return json.load(f)

emotions = load_emotions()
emotion_weights = load_weights()

print("Loaded emotions:", len(emotions))
print("Loaded emotion weights:", len(emotion_weights))

Loaded emotions: 28
Loaded emotion weights: 28
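
Matching counts do not guarantee matching names, and an emotion without a weight would silently fall back to `0.0` through `emotion_weights.get(emo, 0.0)`. The sketch below shows one way to check for that. It assumes the weights file is a flat JSON object mapping emotion name to a number; the helper name `find_missing_weights` and the toy values are hypothetical and not taken from the project files.

```python
def find_missing_weights(emotion_names, weights):
    """Return the emotion names that have no entry in the weights mapping."""
    return [name for name in emotion_names if name not in weights]


# Hypothetical toy data, for illustration only
toy_emotions = ["admiration", "anger", "neutral"]
toy_weights = {"admiration": 0.0, "anger": 0.9}

missing = find_missing_weights(toy_emotions, toy_weights)
print(missing)  # ['neutral']
```

In the notebook, the equivalent call would be `find_missing_weights(emotions, emotion_weights)`, where an empty list means every emotion has a weight.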


## Convert Emotion Labels to Unrest Classes

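Each text sample may carry several emotion labels.
We average the unrest weights of those emotions, scale the mean to a percentage, and bin it into three classes:

| Unrest Class | Meaning | Unrest score (%) |
|-------------|---------|------------------|
| 0 | Low unrest | below 33 |
| 1 | Medium unrest | 33 to below 66 |
| 2 | High unrest | 66 and above |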

In [None]:
def labels_to_unrest_percent(label_str):
    if not isinstance(label_str, str):
        return 0.0
    
    ids = [int(x) for x in label_str.split(",") if x.isdigit()]
    if not ids:
        return 0.0
    
    weights = []
    for i in ids:
        if i < len(emotions):
            emo = emotions[i]
            weights.append(emotion_weights.get(emo, 0.0))
    
    if not weights:
        return 0.0
    
    return (sum(weights) / len(weights)) * 100
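
As a worked example of the averaging step, the sketch below restates the function with the lookup tables passed in as parameters so it can run on its own. The three emotion names and their weights are made-up toy values, not the project's real `emotion_weights.json`.

```python
def unrest_percent(label_str, emotion_names, weights):
    """Mean weight of the labelled emotions, scaled to 0-100."""
    ids = [int(x) for x in label_str.split(",") if x.isdigit()]
    values = [
        weights.get(emotion_names[i], 0.0)
        for i in ids
        if i < len(emotion_names)
    ]
    if not values:
        return 0.0
    return sum(values) / len(values) * 100


# Hypothetical toy tables: ID 0 = joy, 1 = anger, 2 = fear
toy_emotions = ["joy", "anger", "fear"]
toy_weights = {"joy": 0.0, "anger": 1.0, "fear": 0.5}

print(unrest_percent("1,2", toy_emotions, toy_weights))  # 75.0
print(unrest_percent("0,1", toy_emotions, toy_weights))  # 50.0
print(unrest_percent("0", toy_emotions, toy_weights))    # 0.0
print(unrest_percent("", toy_emotions, toy_weights))     # 0.0
```

Because the score is a mean, a low-weight emotion pulls down a high-weight one: in this toy setup, "anger" alone would score 100 but "joy" plus "anger" scores 50.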

In [67]:
def unrest_percent_to_class(p):
    if p < 33:
        return 0
    elif p < 66:
        return 1
    else:
        return 2
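
The same thresholds can also be applied to a whole column at once. The sketch below is one possible vectorized equivalent using `pandas.cut` with left-closed bins; it is an alternative for illustration, not what the notebook runs, and the sample scores are chosen only to exercise the boundaries.

```python
import numpy as np
import pandas as pd

# Boundary values chosen to show which side of each threshold they fall on
scores = pd.Series([0.0, 32.9, 33.0, 65.9, 66.0, 100.0])

# right=False makes each bin [low, high), matching `p < 33` and `p < 66`
classes = pd.cut(
    scores,
    bins=[-np.inf, 33, 66, np.inf],
    right=False,
    labels=[0, 1, 2],
).astype(int)

print(classes.tolist())  # [0, 0, 1, 1, 2, 2]
```

Note that the bands are not equal in width: the top class covers 66 to 100, slightly more than the other two.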


## Text Cleaning

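The text is normalized to reduce noise before training, in this order:
- Lowercasing
- URL removal
- Mention removal
- Punctuation removal (every character that is not a word character or whitespace)
- Whitespace collapsing and trimming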

In [69]:
URL_RE = re.compile(r"http\S+|www\S+")
MENTION_RE = re.compile(r"@\w+")
PUNCT_RE = re.compile(r"[^\w\s]")
SPACE_RE = re.compile(r"\s+")

def clean_text(text):
    if not isinstance(text, str):
        return ""
    text = text.lower()
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    text = PUNCT_RE.sub("", text)
    text = SPACE_RE.sub(" ", text).strip()
    return text
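
To see the steps in action, the sketch below repeats the patterns from the cell above and runs them on two made-up strings. The second string shows an edge case: `www\S+` also matches runs of "w" inside ordinary words. `STRICT_URL_RE` is a hypothetical tighter pattern offered as an assumption, not part of the notebook.

```python
import re

# Patterns copied from the cleaning cell
URL_RE = re.compile(r"http\S+|www\S+")
MENTION_RE = re.compile(r"@\w+")
PUNCT_RE = re.compile(r"[^\w\s]")
SPACE_RE = re.compile(r"\s+")


def clean_text(text):
    if not isinstance(text, str):
        return ""
    text = text.lower()
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    text = PUNCT_RE.sub("", text)
    return SPACE_RE.sub(" ", text).strip()


sample = "Check THIS out @user_1: https://example.com/a?b=1 !!! So   cool..."
print(clean_text(sample))  # check this out so cool

# Edge case: "wwww" inside a word is treated as the start of a URL
print(clean_text("awwww so cute"))  # a so cute

# Hypothetical stricter pattern: require a scheme, or "www." at a word boundary
STRICT_URL_RE = re.compile(r"https?://\S+|\bwww\.\S+")
print(STRICT_URL_RE.sub("", "awwww so cute"))  # awwww so cute
```

Punctuation removal also joins contractions ("didn't" becomes "didnt"), which is consistent with the cleaned text shown later in the notebook.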

## Load GoEmotions Train / Dev / Test Splits

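Each TSV file has no header row and three tab-separated columns, in this order:
- Text
- Emotion label IDs (comma-separated when a sample has several)
- Sample ID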

In [70]:
def load_split(path, name):
    df = pd.read_csv(
        path,
        sep="\t",
        header=None,
        names=["text", "labels", "id"]
    )
    df["split"] = name
    return df

df_train = load_split(TRAIN_TSV, "train")
df_dev   = load_split(DEV_TSV, "dev")
df_test  = load_split(TEST_TSV, "test")

df = pd.concat([df_train, df_dev, df_test], ignore_index=True)
print("Total samples:", len(df))

Total samples: 54263
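
One detail worth knowing about this loading step: `labels_to_unrest_percent` returns `0.0` for any label that is not a string, and pandas infers an integer column when every row happens to hold a single ID. The real splits mix single and multiple labels, so the column stays textual there. The sketch below reproduces the effect on a made-up in-memory TSV and shows `dtype={"labels": str}` as one possible safeguard; the rows and IDs are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical TSV where every row has exactly one label ID
toy_tsv = "I love this\t0\tid001\nThis is awful\t2\tid002\n"
columns = ["text", "labels", "id"]

inferred = pd.read_csv(io.StringIO(toy_tsv), sep="\t", header=None, names=columns)
print(pd.api.types.is_integer_dtype(inferred["labels"]))  # True

# Forcing the column to text keeps the IDs as strings such as "0" and "2"
as_text = pd.read_csv(
    io.StringIO(toy_tsv),
    sep="\t",
    header=None,
    names=columns,
    dtype={"labels": str},
)
print(as_text["labels"].tolist())  # ['0', '2']
```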


## Generate Final Features and Labels

We compute:
- `text_clean`
- `unrest_percent`
- `unrest_class`

In [71]:
df["unrest_percent"] = df["labels"].apply(labels_to_unrest_percent)
df["unrest_class"]   = df["unrest_percent"].apply(unrest_percent_to_class)
df["text_clean"]     = df["text"].apply(clean_text)

df[["text_clean", "unrest_class", "split"]].head()

| | text_clean | unrest_class | split |
|---|------------|--------------|-------|
| 0 | my favourite food is anything i didnt have to ... | 0 | train |
| 1 | now if he does off himself everyone will think... | 0 | train |
| 2 | why the fuck is bayless isoing | 2 | train |
| 3 | to make her feel threatened | 2 | train |
| 4 | dirty southern wankers | 2 | train |
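
Before saving, it can help to look at how the three classes are distributed within each split. The sketch below uses `pandas.crosstab` on a small made-up frame; the column names match the notebook, but the rows are toy data and say nothing about the real class balance.

```python
import pandas as pd

# Hypothetical rows, for illustration only
toy = pd.DataFrame(
    {
        "split": ["train", "train", "train", "train", "dev", "dev"],
        "unrest_class": [0, 0, 2, 1, 0, 2],
    }
)

# normalize="index" turns counts into per-split shares (each row sums to 1)
share = pd.crosstab(toy["split"], toy["unrest_class"], normalize="index")
print(share)
```

On the real data, the equivalent call would be `pd.crosstab(df["split"], df["unrest_class"], normalize="index")`.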


## Save Processed Datasets

These files will be used by the training notebook.

In [72]:
for split, d in df.groupby("split"):
    out_path = PROCESSED_DIR / f"goemotions_unrest_{split}.csv"
    d.to_csv(out_path, index=False)
    print("Saved:", out_path)

Saved: c:\Users\ryanc\Downloads\ANN-FinalProj-main\ANN-FinalProj-main\PublicUnrest\data\processed\goemotions_unrest_dev.csv
Saved: c:\Users\ryanc\Downloads\ANN-FinalProj-main\ANN-FinalProj-main\PublicUnrest\data\processed\goemotions_unrest_test.csv
Saved: c:\Users\ryanc\Downloads\ANN-FinalProj-main\ANN-FinalProj-main\PublicUnrest\data\processed\goemotions_unrest_train.csv
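
When the training notebook reads these files back, rows whose cleaned text is empty (for example, a comment that consisted only of punctuation) come back as `NaN` under the default `read_csv` settings. The sketch below demonstrates this on a made-up frame written to a temporary directory; `keep_default_na=False` is one possible way to keep empty strings, shown here as an assumption about how the files might be loaded.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical two-row frame; the second row has an empty cleaned text
toy = pd.DataFrame({"text_clean": ["hello world", ""], "unrest_class": [0, 2]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_split.csv"
    toy.to_csv(path, index=False)

    default_read = pd.read_csv(path)                      # empty field becomes NaN
    safe_read = pd.read_csv(path, keep_default_na=False)  # empty field stays ""

print(default_read["text_clean"].isna().sum())  # 1
print(safe_read["text_clean"].tolist())         # ['hello world', '']
```

Calling `.fillna("")` on the text column after a default read would be another way to get the same result.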
