<a href="https://colab.research.google.com/github/Ynaos/YAM-Final_Project/blob/main/YAM_Final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Setup and Data Preparation

**Explanation:**
- This Google Colab code installs several important Python libraries used for natural language processing, machine learning, and model optimization. The first line installs or upgrades Hugging Face tools such as transformers for working with pretrained NLP models, datasets for efficient data handling, and accelerate for speeding up model training across different hardware. It also adds Ray Tune and Optuna, two powerful frameworks for hyperparameter tuning. The second line installs TensorFlow for deep learning, openpyxl for working with Excel files, and scikit-learn for traditional machine learning methods, along with transformers again to ensure compatibility. Finally, the last line installs VADER Sentiment, a lightweight sentiment analysis tool commonly used for short texts like social media posts. Together, these libraries enable you to build, train, evaluate, and optimize NLP models within your Colab environment.

In [1]:
!pip install transformers datasets accelerate ray[tune] optuna -U
!pip install transformers tensorflow openpyxl scikit-learn -q
!pip install vaderSentiment -q

Collecting datasets
  Downloading datasets-4.4.1-py3-none-any.whl.metadata (19 kB)
Collecting optuna
  Downloading optuna-4.6.0-py3-none-any.whl.metadata (17 kB)
Collecting ray[tune]
  Downloading ray-2.51.1-cp312-cp312-manylinux2014_x86_64.whl.metadata (21 kB)
Collecting pyarrow>=21.0.0 (from datasets)
  Downloading pyarrow-22.0.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (3.2 kB)
Collecting click!=8.3.0,>=7.0 (from ray[tune])
  Downloading click-8.3.1-py3-none-any.whl.metadata (2.6 kB)
Collecting tensorboardX>=1.9 (from ray[tune])
  Downloading tensorboardx-2.6.4-py3-none-any.whl.metadata (6.2 kB)
Collecting colorlog (from optuna)
  Downloading colorlog-6.10.1-py3-none-any.whl.metadata (11 kB)
Downloading datasets-4.4.1-py3-none-any.whl (511 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m511.6/511.6 kB[0m [31m10.1 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading optuna-4.6.0-py3-none-any.whl (404 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [

**Explanation:**
- This section of the code imports all the necessary libraries and sets up the basic configuration for data processing, sentiment analysis, machine learning, and model training. It begins by importing standard Python modules like random, os, pandas, and numpy for data handling, randomness control, file access, and numerical operations. The train_test_split function from scikit-learn is used to divide the dataset into training and testing sets. It also imports VADER SentimentIntensityAnalyzer for rule-based sentiment scoring. The LabelEncoder is included to convert categorical text labels into numeric form.

- From Hugging Face Transformers, it imports AutoTokenizer and AutoModelForSequenceClassification to load pretrained models for text classification, as well as TrainingArguments, Trainer, and pipeline for fine-tuning and running NLP models. The torch library provides PyTorch support for deep learning computations. The datasets module is used to structure data into a format suitable for model training. Finally, scikit-learn metrics like accuracy, precision, recall, F1-score, classification reports, and confusion matrices are imported to evaluate model performance.

In [12]:
# 0) Imports and basic config
import random
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from sklearn.preprocessing import LabelEncoder
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer, pipeline)
import torch
from datasets import Dataset, DatasetDict
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score,classification_report, confusion_matrix

# **1) Load CSV and prepare HF Datasets**

**Explanation:**

This part of the code handles uploading your dataset, preparing it for training, and converting it into a format compatible with Hugging Face models. First, it imports the Google Colab file uploader and waits for you to upload a CSV file — in this case, data_mmda_traffic_spatial.csv. Once uploaded, the file is read into a pandas DataFrame and its size is printed. The code assumes your text column is named "Tweet", then removes any rows where this text is missing. It also checks if the dataset already contains a sentiment label column (e.g., label, sentiment, etc.).

If no label column exists, the code automatically generates sentiment labels using VADER, a rule-based sentiment analyzer. It assigns each tweet a compound sentiment score and converts it into a numeric label:

- 0 = Negative

- 1 = Neutral

- 2 = Positive

These become training labels called label_id. If a label column does exist, the code instead encodes it numerically using LabelEncoder.

Next, the dataset is split into training (85%) and validation (15%) sets, with balanced sentiment distribution using stratification. Finally, both splits are converted into Hugging Face Dataset objects, renamed to use "text" and "label" columns, and stored in a DatasetDict. This prepares your uploaded dataset — data_mmda_traffic_spatial.csv — for model fine-tuning in the next steps.-

In [14]:
# 1) Load CSV and prepare HF Dataset
from google.colab import files
# --- UPLOAD CSV ---
uploaded = files.upload()

# Load the first uploaded CSV file
for file_name in uploaded.keys():
    df = pd.read_csv(file_name)
    print(f"✅ Loaded: {file_name} (shape={df.shape})")
    break

# --- DEFINE TEXT COLUMN & CANDIDATES ---
TEXT_COL = "Tweet"   # adjust if your text column name is different
LOC_COL_CANDIDATES = ["Location", "location", "place", "place_name", "area", "location_name"]

# Basic cleaning: drop rows without text
df = df[df[TEXT_COL].notna()].reset_index(drop=True)

# --- Detect existing label column (if any) ---
label_col = None
for potential in ["label", "Label", "sentiment", "Sentiment", "sent"]:
    if potential in df.columns:
        label_col = potential
        break

# --- If no gold labels are present, create VADER pseudo-labels ---
if label_col is None:
    # Install / import and initialize VADER (nltk VADER)
    import nltk
    nltk.download("vader_lexicon", quiet=True)
    from nltk.sentiment.vader import SentimentIntensityAnalyzer
    sia = SentimentIntensityAnalyzer()

    # Map compound score to numerical labels: NEGATIVE=0, NEUTRAL=1, POSITIVE=2
    def vader_label_from_text(text):
        c = sia.polarity_scores(str(text))["compound"]
        if c >= 0.05:
            return 2
        elif c <= -0.05:
            return 0
        else:
            return 1

    # Create VADER columns and numeric labels
    df["vader_compound"] = df[TEXT_COL].astype(str).apply(lambda t: sia.polarity_scores(t)["compound"])
    df["vader_label_id"] = df[TEXT_COL].astype(str).apply(vader_label_from_text)

    # Use VADER labels as the training label (safe default)
    df["label_id"] = df["vader_label_id"]
    label_col = "vader_label_id"
    print("No gold label found — using VADER pseudo-labels (label_id) distribution:")
    print(df["label_id"].value_counts())
else:
    # If a label column exists, encode it (textual or numeric)
    from sklearn.preprocessing import LabelEncoder
    le = LabelEncoder()
    df["label_id"] = le.fit_transform(df[label_col].astype(str))
    print(f"Using gold label column '{label_col}' -> encoded to 'label_id'. Distribution:")
    print(df["label_id"].value_counts())

# --- Train/validation split (stratify on label_id) ---
from sklearn.model_selection import train_test_split
train_df, val_df = train_test_split(df, test_size=0.15, stratify=df["label_id"], random_state=42)

# --- Convert to HuggingFace Dataset objects expected downstream ---
from datasets import Dataset, DatasetDict
hf_train = Dataset.from_pandas(train_df[[TEXT_COL, "label_id"]].rename(columns={TEXT_COL: "text", "label_id":"label"}))
hf_val   = Dataset.from_pandas(val_df[[TEXT_COL, "label_id"]].rename(columns={TEXT_COL: "text", "label_id":"label"}))

dataset_dict = DatasetDict({"train": hf_train, "validation": hf_val})

print("Prepared HuggingFace DatasetDict with train/validation splits.")
print("Train size:", len(dataset_dict["train"]), "Validation size:", len(dataset_dict["validation"]))


Saving data_mmda_traffic_spatial.csv to data_mmda_traffic_spatial.csv
✅ Loaded: data_mmda_traffic_spatial.csv (shape=(17312, 13))
No gold label found — using VADER pseudo-labels (label_id) distribution:
label_id
1    11084
0     4841
2     1387
Name: count, dtype: int64
Prepared HuggingFace DatasetDict with train/validation splits.
Train size: 14715 Validation size: 2597


# **2) Tokenizer & model factory**

**Explanation:**

This section of the code initializes the tokenizer and prepares your text data so it can be fed into a transformer model. It starts by selecting a pretrained model — in this case, distilbert-base-uncased, a lightweight and efficient version of BERT. Using this model name, it loads the corresponding AutoTokenizer, which is responsible for converting raw text into token IDs that the model can understand.

Next, it defines a function called tokenize_fn(), which takes a batch of text and tokenizes it. **The tokenizer applies:**

- truncation — cutting text that’s too long

- padding — adding extra tokens so all sequences have equal length

- max_length=128 — sets the maximum token size per input

The code then applies this tokenizer function to the entire Hugging Face dataset using .map(), which efficiently processes it in batches. After tokenization, the original "text" column is removed since the model only needs tokenized input. The resulting dataset is put into PyTorch format, preparing it for training.

Finally, it calculates how many different sentiment classes (labels) are present by checking the unique values in the label_id column. This allows the model to dynamically adapt whether your dataset has 2, 3, or more sentiment types.

In [15]:
# 2) Tokenizer and model factory function
MODEL_NAME = "distilbert-base-uncased"  # change if you want another model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)

def tokenize_fn(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

tokenized = dataset_dict.map(tokenize_fn, batched=True)
tokenized = tokenized.remove_columns(["text"])
tokenized.set_format("torch")
num_labels = len(np.unique(df["label_id"]))


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Map:   0%|          | 0/14715 [00:00<?, ? examples/s]

Map:   0%|          | 0/2597 [00:00<?, ? examples/s]

# **3) Metrics function**

**Explanation:**

This part of the code defines a function that calculates key evaluation metrics during model training and validation. The function compute_metrics() is used by the Hugging Face Trainer to measure how well the model is performing on the validation set.

**It receives two inputs:**

- logits — the raw predictions generated by the model

- labels — the true sentiment labels from your dataset

First, it converts the model’s logits into predicted class labels by taking the index of the highest score (argmax). Then, it computes four important metrics using scikit-learn:

- Accuracy — the percentage of correct predictions

- Precision (weighted) — how many predicted labels are correct, considering class imbalance

- Recall (weighted) — how many of the true labels were correctly found

- F1-score (weighted) — the harmonic balance between precision and recall

Weighted averaging ensures that all sentiment classes in your dataset — especially from data_mmda_traffic_spatial.csv — are fairly represented, even if some appear less frequently. The function returns these metrics in a dictionary so the Trainer can log them during training.

In [17]:
# 3) Compute metrics
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    acc = accuracy_score(labels, preds)
    prec = precision_score(labels, preds, average="weighted", zero_division=0)
    rec = recall_score(labels, preds, average="weighted", zero_division=0)
    f1 = f1_score(labels, preds, average="weighted", zero_division=0)
    return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1}


# **4) Single-trial train function (returns validation metric)**

**Explanation:**

This part of the code defines a function that trains a sentiment classification model using a single set of hyperparameters and returns both the results and the saved model path. The function run_training_trial() accepts a dictionary of hyperparameters — including learning rate, batch size, number of epochs, and weight decay — and uses them to customize the training run.
First, it loads a pretrained transformer model (based on distilbert-base-uncased) and adapts it for sentiment classification by setting num_labels according to the unique sentiment classes in your dataset data_mmda_traffic_spatial.csv. It then creates an output directory for the trial so that logs and checkpoints are stored separately.
Next, **it sets up Hugging Face TrainingArguments, including:**


- Training duration (num_train_epochs)


- Batch size per device


- Learning rate and weight decay


- When to evaluate and save models (every epoch)


- Automatic loading of the best-performing model


- Accuracy as the metric to optimize


- Mixed precision training (fp16) if GPU is available


- The Trainer object is then created, combining:


**The model**


- Training/evaluation settings


- Tokenized training and validation datasets


- The previously defined compute_metrics function


- The tokenizer for proper text processing


The .train() command starts the actual fine-tuning process, training the model on your labeled tweets. When training is done, .evaluate() computes validation metrics. The best version of the model is saved into a best_model folder, and finally, the function returns both the evaluation results and the path where the fine-tuned model was stored.

In [18]:
# 4) Single training-run function (returns validation accuracy and path to saved model)
def run_training_trial(hparams, trial_name="trial"):
    # hparams: dict with keys: learning_rate, per_device_train_batch_size, num_train_epochs, weight_decay
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=num_labels)

    out_dir = os.path.join("./results", trial_name)
    os.makedirs(out_dir, exist_ok=True)

    training_args = TrainingArguments(
        output_dir=out_dir,
        num_train_epochs=hparams["num_train_epochs"],
        per_device_train_batch_size=hparams["per_device_train_batch_size"],
        per_device_eval_batch_size=max(8, hparams["per_device_train_batch_size"]),
        learning_rate=hparams["learning_rate"],
        weight_decay=hparams["weight_decay"],
        eval_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="accuracy",
        greater_is_better=True,
        logging_dir=os.path.join(out_dir, "logs"),
        logging_steps=50,
        save_total_limit=1,
        seed=42,
        fp16=torch.cuda.is_available(),
        report_to=[],
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized["train"],
        eval_dataset=tokenized["validation"],
        compute_metrics=compute_metrics,
        tokenizer=tokenizer,
    )

    # Train (this will run for the specified num_train_epochs)
    trainer.train()
    eval_res = trainer.evaluate()
    # Save
    best_model_dir = os.path.join(out_dir, "best_model")
    trainer.save_model(best_model_dir)
    return eval_res, best_model_dir

# **5) Random Search loop (simple)**

**Explanation:**

This part of the code performs random hyperparameter search to identify the best training configuration for your sentiment model based on accuracy. The function random_search() runs multiple training trials (default: 5) where each trial randomly selects a combination of hyperparameters from a predefined search space.

**The search space includes:**

- learning_rate: how fast the model updates (5e-6 to 5e-5)

- batch size: number of samples processed per step (8, 16, or 32)

- num_train_epochs: how many times the model sees the full dataset (1–3 for short mode)

- weight_decay: regularization strength to prevent overfitting (0.0 to 0.1)

**For every trial:**

- A random set of values is selected from these ranges.

- A trial name like "rs_trial_0" or "rs_trial_1" is assigned.

- The run_training_trial() function is called to train the model using those settings on your dataset data_mmda_traffic_spatial.csv.

- The resulting validation accuracy is retrieved.

If a given trial achieves a higher accuracy than previous ones, its hyperparameters and model path are stored as the current best result.

**After all trials finish, the code prints:**

- The best accuracy achieved

- The optimal hyperparameters that produced it

**Finally, the random search is executed by calling:**

N_TRIALS = 5
best_hparams, best_model_dir, best_metric = random_search(n_trials=N_TRIALS, short_mode=True)


**This runs 5 randomized training experiments and returns:**

best_hparams — the most effective parameter set

best_model_dir — where the best model is saved

best_metric — the highest validation accuracy achieved

In [19]:
# 5) Random search over hyperparameters
def random_search(n_trials=5, short_mode=True):
    best_metric = -999
    best_hparams = None
    best_model_dir = None

    # simple search space
    search_space = {
        "learning_rate": [5e-6, 1e-5, 2e-5, 3e-5, 5e-5],
        "per_device_train_batch_size": [8, 16, 32],
        "num_train_epochs": [1, 2, 3] if short_mode else [2,3,4,5],
        "weight_decay": [0.0, 0.01, 0.1],
    }

    for i in range(n_trials):
        hparams = {
            "learning_rate": random.choice(search_space["learning_rate"]),
            "per_device_train_batch_size": random.choice(search_space["per_device_train_batch_size"]),
            "num_train_epochs": random.choice(search_space["num_train_epochs"]),
            "weight_decay": random.choice(search_space["weight_decay"]),
        }
        trial_name = f"rs_trial_{i}"
        print(f"=== Trial {i+1}/{n_trials}: {hparams} ===")
        eval_res, model_dir = run_training_trial(hparams, trial_name=trial_name)
        acc = eval_res.get("eval_accuracy", eval_res.get("accuracy", None))
        if acc is None:
            acc = eval_res.get("eval_accuracy", -999)
        print(f" -> Eval accuracy: {acc}")
        if acc is not None and acc > best_metric:
            best_metric = acc
            best_hparams = hparams
            best_model_dir = model_dir

    print("=== Random search complete ===")
    print("Best metric:", best_metric)
    print("Best hyperparameters:", best_hparams)
    return best_hparams, best_model_dir, best_metric

# Run the random search (light by default)
N_TRIALS = 5
best_hparams, best_model_dir, best_metric = random_search(n_trials=N_TRIALS, short_mode=True)

=== Trial 1/5: {'learning_rate': 5e-06, 'per_device_train_batch_size': 8, 'num_train_epochs': 3, 'weight_decay': 0.1} ===


model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  trainer = Trainer(


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,0.2392,0.19773,0.953023,0.955502,0.953023,0.951888
2,0.2394,0.178817,0.952253,0.954067,0.952253,0.951218
3,0.1563,0.203272,0.953408,0.955842,0.953408,0.952294


 -> Eval accuracy: 0.9534077782056218
=== Trial 2/5: {'learning_rate': 5e-06, 'per_device_train_batch_size': 8, 'num_train_epochs': 3, 'weight_decay': 0.01} ===


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  trainer = Trainer(


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,0.2436,0.19747,0.952253,0.954763,0.952253,0.951056
2,0.248,0.157419,0.956488,0.957816,0.956488,0.955677
3,0.1475,0.182899,0.958799,0.960657,0.958799,0.957952


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


 -> Eval accuracy: 0.9587986137851366
=== Trial 3/5: {'learning_rate': 5e-06, 'per_device_train_batch_size': 8, 'num_train_epochs': 3, 'weight_decay': 0.01} ===


  trainer = Trainer(


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,0.2436,0.19747,0.952253,0.954763,0.952253,0.951056
2,0.248,0.157419,0.956488,0.957816,0.956488,0.955677
3,0.1475,0.182899,0.958799,0.960657,0.958799,0.957952


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


 -> Eval accuracy: 0.9587986137851366
=== Trial 4/5: {'learning_rate': 5e-06, 'per_device_train_batch_size': 8, 'num_train_epochs': 3, 'weight_decay': 0.01} ===


  trainer = Trainer(


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,0.2436,0.19747,0.952253,0.954763,0.952253,0.951056
2,0.248,0.157419,0.956488,0.957816,0.956488,0.955677
3,0.1475,0.182899,0.958799,0.960657,0.958799,0.957952


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


 -> Eval accuracy: 0.9587986137851366
=== Trial 5/5: {'learning_rate': 5e-06, 'per_device_train_batch_size': 8, 'num_train_epochs': 3, 'weight_decay': 0.01} ===


  trainer = Trainer(


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,0.2436,0.19747,0.952253,0.954763,0.952253,0.951056
2,0.248,0.157419,0.956488,0.957816,0.956488,0.955677
3,0.1475,0.182899,0.958799,0.960657,0.958799,0.957952


 -> Eval accuracy: 0.9587986137851366
=== Random search complete ===
Best metric: 0.9587986137851366
Best hyperparameters: {'learning_rate': 5e-06, 'per_device_train_batch_size': 8, 'num_train_epochs': 3, 'weight_decay': 0.01}


## **6) Inference: sentiment analysis + location-based congestion detection**

**Explanation:**

This final section of the code loads your fine-tuned sentiment model and adds an intelligent inference layer specialized for detecting potential traffic congestion from social media text — such as the tweets in your dataset data_mmda_traffic_spatial.csv.
First, it loads the best-performing model (found during random search) into a Hugging Face sentiment-analysis pipeline, which handles tokenization, model execution, and prediction. Then it defines a set of CONGESTION_KEYWORDS — terms common in MMDA traffic reports that may indicate stalled vehicles, accidents, lane closures, or heavy traffic.

**Several helper functions then support richer inference:**
**1. extract_location(text)**
A simple rule-based method that:

- Looks for phrases following "at" or "on"

- Detects ALL-CAPS text segments (common in MMDA tweets, e.g., "EDSA", "C5 NB")

**2. has_congestion_keywords(text)**
- Checks whether any congestion-related keywords are present, returning both a boolean and the keyword found.

**3. infer_sentence(text)**

This is the main function that takes a tweet and returns a structured result with:


- Sentiment label (e.g., NEGATIVE / LABEL_0)


- Confidence score


- Extracted location (if any)


- Congestion likelihood ("Likely", "Possible", or "Unlikely")


- Reason for the decision


It uses a rule-plus-model hybrid logic:
Likely Congestion if:


- The model is confident (≥ 0.8) the sentiment is negative, and


- A congestion keyword appears in the text


**Or the tweet is negative with location info (weaker but relevant)
Possible Congestion if:**


- A congestion keyword is present but sentiment confidence is low


**Unlikely Congestion if:**


- No strong negative signal


- No congestion markers found


**For each, it prints:**


- The original text


- Sentiment label + score from your fine-tuned model


- Detected location (e.g., C5 Market-Market NB)


- Congestion likelihood


- Explanation (e.g., keyword found, high negative confidence)



**In summary, this section turns your fine-tuned sentiment model into an applied traffic incident detector, leveraging:**


- Learned sentiment patterns from your dataset


- Traffic-specific keywords


- Simple location extraction heuristics


It converts raw tweets into structured, explainable traffic insights — highly relevant for analyzing incidents in Metro Manila road networks using your uploaded dataset.

In [21]:
# Simple single-sentence inference + congestion heuristic
from transformers import AutoModelForSequenceClassification, pipeline
import re

# Load model into pipeline (safe to re-run; will reuse if already loaded)
try:
    model_for_pipeline  # if pipeline already created
except NameError:
    model_for_pipeline = AutoModelForSequenceClassification.from_pretrained(best_model_dir)
    sentiment_pipe = pipeline("sentiment-analysis", model=model_for_pipeline, tokenizer=tokenizer,
                              device=0 if torch.cuda.is_available() else -1)

# congestion keywords often present in reports that cause traffic impact
CONGESTION_KEYWORDS = [
    "stalled", "stuck", "breakdown", "mechanical problem", "collision", "accident",
    "crash", "overturned", "overturn", "towed", "blocking", "block", "blocked",
    "lane occupied", "lane closed", "lanes closed", "heavy traffic", "traffic jam",
    "congestion", "pileup", "road closed", "one lane", "two lanes", "car stopped"
]

# helper: extract a location phrase (simple heuristics)
def extract_location(text):
    # look for "at <PLACE>" or "on <PLACE>"
    m = re.search(r'\b(?:at|on)\s+([A-Za-z0-9\.\-\/\s,&]+?)(?:[.,;]|$)', text, flags=re.I)
    if m:
        return m.group(1).strip()
    # look for ALL CAPS tokens (common in MMDA feed)
    caps = re.findall(r'\b[A-Z0-9]{3,}(?:\s[A-Z0-9]{3,})*\b', text)
    if caps:
        return caps[0].strip()
    return None

# helper: check presence of congestion-triggering keywords
def has_congestion_keywords(text):
    t = text.lower()
    for kw in CONGESTION_KEYWORDS:
        if kw in t:
            return True, kw
    return False, None

# main inference function
def infer_sentence(text, conf_threshold=0.8):
    """
    Returns a small dict:
      {
        'text': str,
        'sentiment_label': str,
        'sentiment_score': float,
        'location': str or None,
        'congestion': 'Likely'|'Unlikely',
        'reason': str
      }
    """
    # run HF model pipeline
    out = sentiment_pipe(text[:1000])  # pipeline accepts a single string
    # pipeline returns a list when passed list, but when passed single string returns a dict or list depending on version; normalize:
    if isinstance(out, list):
        out = out[0]
    label_raw = str(out.get("label", "")).upper()
    score = float(out.get("score", 0.0))

    # normalize negative label detection (common formats)
    negative_labels = {"NEGATIVE", "LABEL_0", "0", "NEG"}
    is_negative = label_raw in negative_labels

    # extract location
    loc = extract_location(text)

    # check keywords
    kw_found, kw = has_congestion_keywords(text)

    # simple heuristic:
    # - If model says negative with high confidence AND a congestion keyword is present -> Likely congestion
    # - If model says negative with high confidence and a location is present -> Likely (but weaker)
    # - Otherwise -> Unlikely
    if is_negative and score >= conf_threshold and kw_found:
        return {
            "text": text,
            "sentiment_label": label_raw,
            "sentiment_score": score,
            "location": loc,
            "congestion": "Likely congestion",
            "reason": f"Negative ({label_raw}, score={score:.2f}) + keyword '{kw}' found."
        }
    if is_negative and score >= conf_threshold and loc:
        return {
            "text": text,
            "sentiment_label": label_raw,
            "sentiment_score": score,
            "location": loc,
            "congestion": "Likely congestion",
            "reason": f"Negative ({label_raw}, score={score:.2f}) and location '{loc}' detected."
        }
    # fallback: if keyword present even with lower score, flag as possible
    if kw_found:
        return {
            "text": text,
            "sentiment_label": label_raw,
            "sentiment_score": score,
            "location": loc,
            "congestion": "Possible congestion (needs verification)",
            "reason": f"Keyword '{kw}' found but model confidence is low ({score:.2f})."
        }

    # otherwise unlikely
    return {
        "text": text,
        "sentiment_label": label_raw,
        "sentiment_score": score,
        "location": loc,
        "congestion": "Unlikely congestion",
        "reason": "No strong negative signal or congestion keywords detected."
    }

# Example usage:
examples = [
    "MMDA ALERT: Stalled SUV due to mechanical problem at C5 Market-Market NB as of 7:11 PM. 1 lane occupied.",
    "Traffic flowing smoothly on EDSA near Ortigas.",
    "Minor fender-bender on Shaw Boulevard, some slowdown reported."
]

for ex in examples:
    r = infer_sentence(ex)
    print("----")
    print("Text:", r["text"])
    print("Sentiment:", r["sentiment_label"], f"({r['sentiment_score']:.2f})")
    print("Location:", r["location"])
    print("Congestion:", r["congestion"])
    print("Reason:", r["reason"])
    print()


----
Text: MMDA ALERT: Stalled SUV due to mechanical problem at C5 Market-Market NB as of 7:11 PM. 1 lane occupied.
Sentiment: LABEL_0 (1.00)
Location: MMDA ALERT
Congestion: Likely congestion
Reason: Negative (LABEL_0, score=1.00) + keyword 'stalled' found.

----
Text: Traffic flowing smoothly on EDSA near Ortigas.
Sentiment: LABEL_2 (0.98)
Location: EDSA near Ortigas
Congestion: Unlikely congestion
Reason: No strong negative signal or congestion keywords detected.

----
Text: Minor fender-bender on Shaw Boulevard, some slowdown reported.
Sentiment: LABEL_2 (0.94)
Location: Shaw Boulevard
Congestion: Unlikely congestion
Reason: No strong negative signal or congestion keywords detected.

