<a href="https://colab.research.google.com/github/bobjack1313/dimabsa-csce5290/blob/main/csce5290_dimabsa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **DIMABSA Project for CSCE 5290**

Domains: English — Restaurant, Laptop
Tasks: DimASR (Regression), DimASTE (Triplet Extraction)


## **Current Task List**
Week 1:

**Dataset Familiarization**
*   Download English dataset splits (Restaurant + Laptop).
*   Open several JSONL samples for DimASR and DimASTE.

Document:
*   The input/output structure.
*   How aspects and opinions are represented.
*   The ranges of valence and arousal values.

Write a short markdown summary of the findings.


**Evaluation Understanding**
*   Review the RMSE formula for DimASR and continuous F1 for DimASTE.
*   Run a verification calculation to understand how each metric is computed.


**Planning Validation**
*   Confirm which model baseline to use first (RoBERTa-base or BERT-base-uncased).

## **GitHub Repo Setup**

In [3]:
FORK_URL="https://github.com/bobjack1313/DimABSA2026.git"
COURSE_URL="https://github.com/bobjack1313/dimabsa-csce5290.git"

# Tools
!pip install -q --upgrade --force-reinstall \
  torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 \
  --index-url https://download.pytorch.org/whl/cu121

# Dependencies are locked
!pip install -q \
  transformers==4.57.0 \
  datasets==3.0.1 \
  evaluate==0.4.3 \
  accelerate==0.33.0 \
  torchmetrics==1.4.0.post0 \
  scikit-learn==1.5.2 \
  pandas==2.2.2 \
  numpy==1.26.4 \
  matplotlib==3.8.4 \
  seaborn==0.13.2 \
  "tqdm>=4.66"

# NOTE: the dependency-resolver ERRORS that pip prints during install are harmless.

import torch, transformers, platform
print("PyTorch:", torch.__version__, "CUDA available:",
      torch.cuda.is_available())

if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

print("Transformers:", transformers.__version__)
print("Python:", platform.python_version())

# Install or re-clone forks (read-only DimABSA fork, active dev in course repo)
!rm -rf /content/work && mkdir -p /content/work
%cd /content/work
!git clone "$FORK_URL" dimabsa2026
!git clone "$COURSE_URL" dimabsa-course

# Setup Project skeleton in course repo
!mkdir -p dimabsa-course/{data,src,outputs,experiments,notebooks}
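# Ignore patterns for generated data and secrets (assumed to belong in .gitignore)
!printf "%s\n" "data/" "outputs/" ".env" "__pycache__/" > dimabsa-course/.gitignore
# Minimum requirements note for the README
!printf "%s\n" ">= Python 3.10, PyTorch 2.2" > dimabsa-course/README.md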

# Verify setup in a fresh interpreter
!python -c "import torch, transformers; print('PyTorch:', torch.__version__, 'CUDA:', torch.cuda.is_available()); print('Transformers:', transformers.__version__)"

# (Optional) Mount Drive to persist outputs across sessions
# from google.colab import drive
# drive.mount('/content/drive')
# !ln -s /content/drive/MyDrive/dimabsa_outputs /content/work/dimabsa-course/outputs

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
ipython 7.34.0 requires jedi>=0.16, which is not installed.
accelerate 0.33.0 requires numpy<2.0.0,>=1.17, but you have numpy 2.1.2 which is incompatible.
pytensor 2.31.7 requires filelock>=3.15, but you have filelock 3.13.1 which is incompatible.
typeguard 4.4.4 requires typing_extensions>=4.14.0, but you have typing-extensions 4.12.2 which is incompatible.
umap-learn 0.5.9.post2 requires scikit-learn>=1.6, but you have scikit-learn 1.5.2 which is incompatible.
numba 0.60.0 requires numpy<2.1,>=1.22, but you have numpy 2.1.2 which is incompatible.
gcsfs 2025.3.0 requires fsspec==2025.3.0, but you have fsspec 2024.6.1 which is incompatible.
[0m[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following 

## **JSON Import Test**

In [9]:
# Mocking JSON

import json, os

base = "/content/work/dimabsa-course/data/DimASR/eng_restaurant"
os.makedirs(base, exist_ok=True)

sample_data = [
    {
        "ID": "R001",
        "Text": "average to good thai food, but terrible delivery.",
        "Aspect": ["thai food", "delivery"]
    },
    {
        "ID": "R002",
        "Text": "the pasta was delicious but the service was slow.",
        "Aspect": ["pasta", "service"]
    }
]

path = f"{base}/train.jsonl"
with open(path, 'w', encoding='utf-8') as f:
    for item in sample_data:
        f.write(json.dumps(item) + "\n")

print(f"Mock file created at {path}")
!head -n 5 {path}


Mock file created at /content/work/dimabsa-course/data/DimASR/eng_restaurant/train.jsonl
{"ID": "R001", "Text": "average to good thai food, but terrible delivery.", "Aspect": ["thai food", "delivery"]}
{"ID": "R002", "Text": "the pasta was delicious but the service was slow.", "Aspect": ["pasta", "service"]}


In [11]:
import json

# Will not work yet.
#path = "/content/work/dimabsa2026/data/DimASR/eng_restaurant/train.jsonl"

path = "/content/work/dimabsa-course/data/DimASR/eng_restaurant/train.jsonl"

with open(path, 'r', encoding='utf-8') as f:
    for i, line in enumerate(f):
        print(json.loads(line))
        if i == 2: break


{'ID': 'R001', 'Text': 'average to good thai food, but terrible delivery.', 'Aspect': ['thai food', 'delivery']}
{'ID': 'R002', 'Text': 'the pasta was delicious but the service was slow.', 'Aspect': ['pasta', 'service']}


Both Restaurant and Laptop data verified for DimASR and DimASTE. JSONL structure valid and loads successfully.
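
The read loop above can be wrapped in a small helper that also checks each record's fields. This is a sketch under the assumption that every record carries the `ID`, `Text`, and `Aspect` keys seen in the mock file; `load_jsonl` and `REQUIRED_KEYS` are hypothetical names, and the real dataset files may carry additional label fields.

```python
import json
import os
import tempfile

REQUIRED_KEYS = {"ID", "Text", "Aspect"}  # fields seen in the mock records above

def load_jsonl(path, required_keys=REQUIRED_KEYS):
    """Read a JSONL file into a list of dicts, checking each record's keys."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = set(required_keys) - set(record)
            if missing:
                raise ValueError(f"{path}:{line_no} is missing {sorted(missing)}")
            records.append(record)
    return records

# Usage with a throwaway file so the sketch runs outside Colab as well.
demo = [{"ID": "R001",
         "Text": "average to good thai food, but terrible delivery.",
         "Aspect": ["thai food", "delivery"]}]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "train.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        for item in demo:
            f.write(json.dumps(item) + "\n")
    loaded = load_jsonl(demo_path)

print(len(loaded), loaded[0]["Aspect"])
```

Raising on the first malformed record keeps schema problems visible early, before they surface as key errors inside a training loop.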

## **Evaluation Metric**

In [12]:
import math

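# Example: one sample with predicted and gold valence (V) and arousal (A)
V_pred, A_pred = 6.75, 6.38
V_true, A_true = 6.0, 6.0
# 128 = 8**2 + 8**2 is the largest possible squared VA error on the [1, 9] scale,
# so dividing by it normalizes the single-sample RMSE to [0, 1].
RMSE_VA = math.sqrt(((V_pred - V_true)**2 + (A_pred - A_true)**2) / 128)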
print(RMSE_VA)

0.07431457629563663
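
The single-sample calculation above extends to a batch of predictions. The sketch below assumes the squared errors are averaged over all N aspect instances and that the result is scaled by sqrt(128), as the constant in the cell above suggests; `rmse_va` is a hypothetical helper, not the official scoring script.

```python
import math

MAX_SQ_DIST = 128  # 8**2 + 8**2: squared diagonal of the [1, 9] x [1, 9] VA square

def rmse_va(preds, golds, normalize=True):
    """RMSE over paired (valence, arousal) tuples, optionally scaled to [0, 1]."""
    if len(preds) != len(golds) or not preds:
        raise ValueError("preds and golds must be non-empty and the same length")
    sq_err = sum((vp - vg) ** 2 + (ap - ag) ** 2
                 for (vp, ap), (vg, ag) in zip(preds, golds))
    rmse = math.sqrt(sq_err / len(preds))
    return rmse / math.sqrt(MAX_SQ_DIST) if normalize else rmse

# Same numbers as the single-sample example.
print(rmse_va([(6.75, 6.38)], [(6.0, 6.0)]))
```

With one sample this reproduces the value printed above up to floating-point rounding; passing `normalize=False` returns the error in raw rating units.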


For DimASTE (continuous F1 concept):
*   cTP = 1 - dist(VA_pred, VA_true) when the predicted Aspect and Opinion both match the gold triplet.
*   dist = Euclidean distance between the predicted and gold VA pairs, normalized to [0, 1] over the [1, 9] rating range.
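
The cTP definition above can be turned into a small scoring sketch. It assumes that precision divides the summed cTP by the number of predicted triplets, recall divides it by the number of gold triplets, and the distance is normalized by sqrt(128); `continuous_f1` is a hypothetical helper that scores a single sentence, and the official evaluation script should be treated as the reference.

```python
import math

MAX_DIST = math.sqrt(128)  # diagonal of the [1, 9] x [1, 9] VA square

def continuous_f1(preds, golds):
    """Score one sentence. preds/golds: lists of (aspect, opinion, valence, arousal)."""
    gold_va = {(a, o): (v, ar) for a, o, v, ar in golds}
    matched = set()
    ctp = 0.0
    for a, o, v, ar in preds:
        key = (a, o)
        if key in gold_va and key not in matched:
            matched.add(key)  # a gold triplet is credited at most once
            gold_v, gold_a = gold_va[key]
            ctp += 1.0 - math.hypot(v - gold_v, ar - gold_a) / MAX_DIST
    precision = ctp / len(preds) if preds else 0.0
    recall = ctp / len(golds) if golds else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return f1, precision, recall

# Hypothetical triplets: the opinion spans and the delivery/service VA values are made up.
gold = [("thai food", "average to good", 6.0, 6.0), ("delivery", "terrible", 2.0, 6.5)]
pred = [("thai food", "average to good", 6.75, 6.38), ("service", "slow", 3.0, 5.0)]
f1, precision, recall = continuous_f1(pred, gold)
print(f"cF1={f1:.4f} cP={precision:.4f} cR={recall:.4f}")
```

A prediction whose aspect and opinion do not match any gold triplet adds nothing to cTP here but still counts in the precision denominator, so spurious triplets lower the score.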


# **Colab to GitHub Commit Cell**
Note:
Keep at the bottom of the notebook. Keep the commit commands commented out until needed.
They must remain commented out when not in use.

In [20]:
# Quick Status check - Must run before commit
%cd /content/work/dimabsa-course
!git status

/content/work/dimabsa-course
On branch main
Your branch is up to date with 'origin/main'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   README.md

no changes added to commit (use "git add" and/or "git commit -a")


In [25]:
from getpass import getpass

# Configure git (run once per session)
USER_EMAIL = getpass("Enter GitHub Email: ")
USER_NAME = getpass("Enter GitHub Username: ")
!git config --global user.email "$USER_EMAIL"
!git config --global user.name "$USER_NAME"

# --- Stage, commit, and push changes to GitHub ---
# Make sure you're inside your course repo before running:
%cd /content/work/dimabsa-course

# Do not change the echo
echo_message = "Nothing to commit"

# Change this to reflect the work done
# *****  MUST CHANGE *****
commit_message = "Environment setup, json loaded, project documentation"

# Uncomment when ready to commit --* re-comment when done *--
!git add .
!git commit -m "{commit_message}" || echo "{echo_message}"

# GitHub > Settings > Developer settings > Personal access tokens: create a token
# (60-day expiry, "repo" scope only). Copy the token for each run, then remove it.
# Enter the correct user and token at the prompts each time
PUSH_USER = getpass("Enter GitHub Username: ")
TOKEN = getpass("Enter GitHub Token: ")

# Base repo info (constant) - NO CHANGING
REPO_OWNER = "bobjack1313"
REPO_NAME = "dimabsa-csce5290"

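# Set the authenticated remote, push, then restore the token-free URL
!git remote set-url origin "https://{PUSH_USER}:{TOKEN}@github.com/{REPO_OWNER}/{REPO_NAME}.git"
!git push origin main
!git remote set-url origin "https://github.com/{REPO_OWNER}/{REPO_NAME}.git"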

Enter GitHub Email: ··········
Enter GitHub Username: ··········
/content/work/dimabsa-course
On branch main
Your branch is ahead of 'origin/main' by 1 commit.
  (use "git push" to publish your local commits)

nothing to commit, working tree clean
{}
Enter GitHub Username: ··········
Enter GitHub Token: ··········
fatal: Invalid old URL pattern: push to publish your local commits)  nothing to commit, working tree clean {} Enter GitHub Username: ·········· Enter GitHub Token: ·········· remote: Repository not found. fatal: repository 'https://github.com/bobjack1313/dimabsa-course.git/' not found@github.com/bobjack1313/dimabsa-csce5290.git
remote: Repository not found.
fatal: repository 'https://github.com/bobjack1313/dimabsa-course.git/' not found
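
The `Invalid old URL pattern` failure above appears consistent with extra text (including spaces and quotes) being pasted into one of the prompts, which splits the remote URL into several shell arguments. A minimal sketch of validating the inputs before they reach `git remote set-url` follows; `build_remote_url` is a hypothetical helper and the example credentials are placeholders.

```python
from urllib.parse import quote

def build_remote_url(user, token, owner, repo):
    """Return an HTTPS remote URL with credentials, rejecting obviously bad input."""
    user, token = user.strip(), token.strip()
    for label, value in (("username", user), ("token", token)):
        if not value or any(ch.isspace() for ch in value):
            raise ValueError(f"{label} is empty or contains whitespace")
    # Percent-encode so characters such as '@' or ':' cannot break the URL.
    credentials = f"{quote(user, safe='')}:{quote(token, safe='')}"
    return f"https://{credentials}@github.com/{owner}/{repo}.git"

# Placeholder credentials for illustration only.
print(build_remote_url("example-user", "example-token", "bobjack1313", "dimabsa-csce5290"))
```

In the commit cell, the returned string could be stored in a variable and passed to `!git remote set-url origin "{url}"`, so a bad paste fails in Python with a readable message instead of inside git.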


In [22]:
!git push origin main

fatal: could not read Username for 'https://github.com': No such device or address
