# Model Training – Public Unrest Classification

This notebook trains an **Artificial Neural Network (ANN)** to classify
levels of public unrest from social-media-style text.

The model predicts **three classes**:
- **0 – Low Unrest**
- **1 – Medium Unrest**
- **2 – High Unrest**

In [47]:
from pathlib import Path

import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score

from tensorflow.keras import models, layers, callbacks

## Load Preprocessed Dataset

The datasets were generated in **01_data_preprocessing.ipynb** and are stored in:

`PublicUnrest/data/processed/`

Each file contains:
- `text_clean` – cleaned text
- `unrest_class` – class label (0, 1, or 2)

In [48]:
BASE_DIR = Path.cwd().parent

# Processed data directory
DATA_DIR = BASE_DIR / "PublicUnrest" / "data" / "processed"

# Load classification datasets
df_train = pd.read_csv(DATA_DIR / "goemotions_unrest_train_cls.csv")
df_dev   = pd.read_csv(DATA_DIR / "goemotions_unrest_dev_cls.csv")
df_test  = pd.read_csv(DATA_DIR / "goemotions_unrest_test_cls.csv")

print("Train:", len(df_train))
print("Dev:", len(df_dev))
print("Test:", len(df_test))

Train: 43410
Dev: 5426
Test: 5427


## Separate Input Text and Labels

- **Input**: cleaned text (`text_clean`)
- **Target**: unrest class (`unrest_class`)

In [49]:
X_train = df_train["text_clean"].fillna("")
y_train = df_train["unrest_class"].astype(int)

X_dev = df_dev["text_clean"].fillna("")
y_dev = df_dev["unrest_class"].astype(int)

X_test = df_test["text_clean"].fillna("")
y_test = df_test["unrest_class"].astype(int)

## Text Vectorization (TF-IDF)

TF-IDF converts text into numeric feature vectors.
No external embeddings are used, in compliance with course requirements.

In [50]:
vectorizer = TfidfVectorizer(
    max_features=5000,
    ngram_range=(1, 2),
    min_df=2
)

X_train_vec = vectorizer.fit_transform(X_train)
X_dev_vec   = vectorizer.transform(X_dev)
X_test_vec  = vectorizer.transform(X_test)

print("Feature dimension:", X_train_vec.shape[1])

Feature dimension: 5000
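
To make the weighting concrete, here is a minimal hand-rolled sketch of TF-IDF over unigrams and bigrams. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` and L2 row normalisation, which are scikit-learn's defaults. It splits on whitespace and ignores `min_df` and `max_features`, so its numbers will not match `TfidfVectorizer` exactly. The two sentences and the helper names `ngrams` and `tfidf` are made up for illustration.

```python
import math
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """Return all word n-grams with n_min <= n <= n_max, joined by spaces."""
    out = []
    for n in range(n_min, n_max + 1):
        out.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out

def tfidf(docs):
    """Toy TF-IDF: raw term counts times smoothed IDF, then L2-normalised per document."""
    term_lists = [ngrams(d.lower().split()) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(t for terms in term_lists for t in set(terms))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    rows = []
    for terms in term_lists:
        tf = Counter(terms)
        weights = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

docs = ["crowd gathers downtown", "crowd disperses quietly"]  # hypothetical inputs
rows = tfidf(docs)
for term, weight in sorted(rows[0].items()):
    print(f"{term!r}: {weight:.3f}")
```

In this toy corpus, "crowd" appears in both documents and therefore receives a lower weight than terms unique to one document, such as "gathers" or the bigram "crowd gathers".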


## Neural Network Architecture

The model is a **feedforward ANN (MLP)**:

- Input Layer: TF-IDF features
- Hidden Layer 1: 128 neurons (ReLU)
- Dropout: 0.3
- Hidden Layer 2: 64 neurons (ReLU)
- Output Layer: 3 neurons (Softmax)

Loss Function:
- Sparse Categorical Cross-Entropy

Optimizer:
- Adam
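
As a rough illustration of the loss named above, the sketch below computes softmax probabilities and the sparse categorical cross-entropy for a single example in plain Python. The logits are invented values and the helper names are hypothetical. Keras applies the same idea batch-wise with additional numerical safeguards.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(probs, label):
    """Loss for one example: negative log-probability of the integer label."""
    return -math.log(probs[label])

logits = [2.0, 0.5, -1.0]          # made-up scores for classes 0, 1, 2
probs = softmax(logits)
loss_if_low = sparse_categorical_crossentropy(probs, 0)   # true class = 0
loss_if_high = sparse_categorical_crossentropy(probs, 2)  # true class = 2
print(probs, loss_if_low, loss_if_high)
```

"Sparse" means the labels are plain integers (0, 1, 2) rather than one-hot vectors, which matches the `unrest_class` column used here.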

In [51]:
model = models.Sequential([
    layers.Input(shape=(X_train_vec.shape[1],)),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(64, activation="relu"),
    layers.Dense(3, activation="softmax")
])

model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"]
)

model.summary()
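
The output of `model.summary()` is not reproduced here. As a cross-check, the trainable-parameter count of the dense layers can be worked out by hand, assuming the 5000-dimensional TF-IDF input reported earlier. The helper name `dense_params` is hypothetical.

```python
def dense_params(n_in, n_out):
    """A Dense layer has one weight per input-output pair plus one bias per output."""
    return n_in * n_out + n_out

# Input -> 128 -> 64 -> 3; Dropout adds no trainable parameters
layer_sizes = [5000, 128, 64, 3]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(counts)

print(counts)  # [640128, 8256, 195]
print(total)   # 648579
```

Almost all parameters sit in the first layer, so the TF-IDF vocabulary size (`max_features`) dominates model size.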

## Model Training

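Early stopping limits overfitting: training halts once validation loss has not improved for 3 consecutive epochs, and the weights from the best epoch are restored.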

In [52]:
early_stop = callbacks.EarlyStopping(
    monitor="val_loss",
    patience=3,
    restore_best_weights=True
)

history = model.fit(
    X_train_vec.toarray(),
    y_train,
    validation_data=(X_dev_vec.toarray(), y_dev),
    epochs=20,
    batch_size=64,
    callbacks=[early_stop],
    verbose=1
)

Epoch 1/20
679/679 ━━━━━━━━━━━━━━━━━━━━ 10s 12ms/step - accuracy: 0.6882 - loss: 0.7878 - val_accuracy: 0.7357 - val_loss: 0.6534
Epoch 2/20
679/679 ━━━━━━━━━━━━━━━━━━━━ 9s 14ms/step - accuracy: 0.7751 - loss: 0.5650 - val_accuracy: 0.7355 - val_loss: 0.6558
Epoch 3/20
679/679 ━━━━━━━━━━━━━━━━━━━━ 9s 14ms/step - accuracy: 0.8136 - loss: 0.4787 - val_accuracy: 0.7342 - val_loss: 0.6790
Epoch 4/20
679/679 ━━━━━━━━━━━━━━━━━━━━ 7s 10ms/step - accuracy: 0.8584 - loss: 0.3887 - val_accuracy: 0.7285 - val_loss: 0.7376
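
The log above stops after epoch 4 even though 20 epochs were requested. The following sketch replays the patience rule on the logged `val_loss` values to show why. It is a simplified stand-in for the callback's logic (it assumes a minimum improvement threshold of zero), and the function name is hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (epoch at which training stops, epoch with the best val_loss)."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# val_loss values copied from the training log above
val_losses = [0.6534, 0.6558, 0.6790, 0.7376]
stopped_at, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stopped_at, best_epoch)  # 4 1
```

Validation loss was lowest after epoch 1 and rose for three epochs in a row, so with `restore_best_weights=True` the saved model corresponds to the epoch-1 weights, while training accuracy kept climbing.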


## Model Evaluation

Performance is measured using **classification accuracy** on the test set.

In [54]:
y_pred = model.predict(X_test_vec.toarray())
y_pred_classes = np.argmax(y_pred, axis=1)

test_accuracy = accuracy_score(y_test, y_pred_classes)
print(f"Test Accuracy: {test_accuracy:.4f}")

[1m170/170[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step
Test Accuracy: 0.7455
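
A single accuracy figure can hide differences between classes. The sketch below computes per-class recall alongside accuracy in plain Python. The label lists are invented for illustration and are not results from this model, and the helper name is hypothetical.

```python
from collections import Counter

def per_class_recall(y_true, y_pred, n_classes=3):
    """Fraction of examples of each true class that were predicted correctly."""
    support = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {
        c: (correct[c] / support[c]) if support[c] else float("nan")
        for c in range(n_classes)
    }

# Hypothetical labels, NOT taken from the test set
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 1, 0, 0, 2]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = per_class_recall(y_true, y_pred)
print(accuracy)  # 0.75
print(recall)    # {0: 1.0, 1: 0.5, 2: 0.5}
```

In this toy case, 75% accuracy coexists with only half of the medium and high examples being caught, which is the kind of gap worth checking on the real predictions in `y_pred_classes`.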


## Save Trained Model

The trained model and TF-IDF vectorizer are saved for later evaluation and inference.

In [55]:
import joblib

# Output directory for trained artifacts
MODELS_DIR = BASE_DIR / "PublicUnrest" / "models"
MODELS_DIR.mkdir(parents=True, exist_ok=True)

# Save the Keras model and the fitted vectorizer side by side
model.save(MODELS_DIR / "unrest_classifier.keras")
joblib.dump(vectorizer, MODELS_DIR / "tfidf_vectorizer.joblib")

print("Model and vectorizer saved.")

Model and vectorizer saved.
