# **DIMABSA Project for CSCE 5290**

Domains: English — Restaurant, Laptop
Tasks: DimASR (Regression), DimASTE (Triplet Extraction)


## **Current Task List**
Week 1:

**Dataset Familiarization**
*   Download English dataset splits (Restaurant + Laptop).
*   Open several JSONL samples for DimASR and DimASTE.

Document:
*   The input/output structure.
*   How aspects and opinions are represented.
*   The ranges of valence and arousal values.

Write a short markdown summary of the findings.
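
A short script can collect the valence and arousal ranges for that summary. The sketch below assumes the DimASR gold format found in the sample files, where each record carries an `Aspect_VA` list and `VA` is a `valence#arousal` string. The helper names `parse_va` and `va_ranges` are illustrative and are not part of the DimABSA repository.

```python
import json

def parse_va(va: str) -> tuple[float, float]:
    """Split a 'valence#arousal' string such as '7.50#7.62' into two floats."""
    valence, arousal = va.split("#")
    return float(valence), float(arousal)

def va_ranges(lines):
    """Return ((v_min, v_max), (a_min, a_max)) over an iterable of JSONL lines."""
    valences, arousals = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        for item in record.get("Aspect_VA", []):
            v, a = parse_va(item["VA"])
            valences.append(v)
            arousals.append(a)
    return (min(valences), max(valences)), (min(arousals), max(arousals))

# Two records in the assumed format (copied from the sample gold file)
sample = [
    json.dumps({"ID": "rest16_quad_dev_4", "Text": "we love th pink pony .",
                "Aspect_VA": [{"Aspect": "pink pony", "VA": "7.17#7.00"}]}),
    json.dumps({"ID": "rest16_quad_dev_5",
                "Text": "this place has got to be the best japanese restaurant in the new york area .",
                "Aspect_VA": [{"Aspect": "place", "VA": "7.88#8.12"}]}),
]
print(va_ranges(sample))
```

Passing an open file handle instead of `sample` would scan a full split, since the function only needs an iterable of lines.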


**Evaluation Understanding**
*   Review the RMSE formula for DimASR and continuous F1 for DimASTE.
*   Run a verification calculation to understand how each metric is computed.


**Planning Validation**
*   Confirm which model baseline to use first (RoBERTa-base or BERT-base-uncased).

## **GitHub Repo Setup**

In [41]:
FORK_URL="https://github.com/bobjack1313/DimABSA2026.git"
COURSE_URL="https://github.com/bobjack1313/dimabsa-csce5290.git"

# Tools
!pip install -q --upgrade --force-reinstall \
  torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 \
  --index-url https://download.pytorch.org/whl/cu121

# Dependencies are locked
!pip install -q \
  transformers==4.57.0 \
  datasets==3.0.1 \
  evaluate==0.4.3 \
  accelerate==0.33.0 \
  torchmetrics==1.4.0.post0 \
  scikit-learn==1.5.2 \
  pandas==2.2.2 \
  numpy==1.26.4 \
  matplotlib==3.8.4 \
  seaborn==0.13.2 \
  "tqdm>=4.66"

# NOTE: the dependency-resolver ERRORS that pip prints during install are harmless

import torch, transformers, platform
print("PyTorch:", torch.__version__, "CUDA available:",
      torch.cuda.is_available())

if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

print("Transformers:", transformers.__version__)
print("Python:", platform.python_version())

# Install or re-clone forks (read-only DimABSA fork, active dev in course repo)
!rm -rf /content/work && mkdir -p /content/work
%cd /content/work
!git clone "$FORK_URL" dimabsa2026
!git clone "$COURSE_URL" dimabsa-csce5290

# Verify setup (run in the kernel directly: the `!` shell escape is single-line,
# so a `!python - <<'PY'` heredoc body is never passed to the shell)
import torch, transformers
print("PyTorch:", torch.__version__, "CUDA:", torch.cuda.is_available())
print("Transformers:", transformers.__version__)

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
ipython 7.34.0 requires jedi>=0.16, which is not installed.
accelerate 0.33.0 requires numpy<2.0.0,>=1.17, but you have numpy 2.1.2 which is incompatible.
pytensor 2.31.7 requires filelock>=3.15, but you have filelock 3.13.1 which is incompatible.
typeguard 4.4.4 requires typing_extensions>=4.14.0, but you have typing-extensions 4.12.2 which is incompatible.
umap-learn 0.5.9.post2 requires scikit-learn>=1.6, but you have scikit-learn 1.5.2 which is incompatible.
numba 0.60.0 requires numpy<2.1,>=1.22, but you have numpy 2.1.2 which is incompatible.
gcsfs 2025.3.0 requires fsspec==2025.3.0, but you have fsspec 2024.6.1 which is incompatible.

In [54]:
# Use this cell to manipulate project directory and files as needed
# %cd /content/work/dimabsa-csce5290
# !ls

/content/work


In [55]:
# Connect drive to collect outputs
from google.colab import drive
drive.mount('/content/drive')

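# Create a symlink to save results
# (-sfn replaces an existing link on re-run instead of failing or nesting a new link inside it)
!mkdir -p /content/drive/MyDrive/dimabsa_outputs
!ln -sfn /content/drive/MyDrive/dimabsa_outputs /content/work/dimabsa-csce5290/outputs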

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
ln: failed to create symbolic link '/content/work/dimabsa-csce5290/outputs/dimabsa_outputs': File exists


## **JSON Import Test**

In [9]:
# Mocking JSON

# import json, os

# base = "/content/work/dimabsa-csce5290/data/DimASR/eng_restaurant"
# os.makedirs(base, exist_ok=True)

# sample_data = [
#     {
#         "ID": "R001",
#         "Text": "average to good thai food, but terrible delivery.",
#         "Aspect": ["thai food", "delivery"]
#     },
#     {
#         "ID": "R002",
#         "Text": "the pasta was delicious but the service was slow.",
#         "Aspect": ["pasta", "service"]
#     }
# ]

# path = f"{base}/train.jsonl"
# with open(path, 'w', encoding='utf-8') as f:
#     for item in sample_data:
#         f.write(json.dumps(item) + "\n")

# print(f"Mock file created at {path}")
# !head -n 5 {path}

Mock file created at /content/work/dimabsa-course/data/DimASR/eng_restaurant/train.jsonl
{"ID": "R001", "Text": "average to good thai food, but terrible delivery.", "Aspect": ["thai food", "delivery"]}
{"ID": "R002", "Text": "the pasta was delicious but the service was slow.", "Aspect": ["pasta", "service"]}


In [59]:
import json

# If needed to mock
#path = "/content/work/dimabsa-csce5290/evaluation_script/sample_data/subtask_1/eng/gold_eng_restaurant.jsonl"

path = (
    "/content/work/dimabsa2026/"
    "evaluation_script/sample data/subtask_1/"
    "eng/gold_eng_restaurant.jsonl"
)

with open(path, 'r', encoding='utf-8') as f:
    for i, line in enumerate(f):
        print(json.loads(line))
        # Stop after the first three records
        if i == 2:
            break


{'ID': 'rest16_quad_dev_3', 'Text': 'the spicy tuna roll was unusually good and the rock shrimp tempura was awesome , great appetizer to share !', 'Aspect_VA': [{'Aspect': 'spicy tuna roll', 'VA': '7.50#7.62'}, {'Aspect': 'rock shrimp tempura', 'VA': '8.25#8.38'}]}
{'ID': 'rest16_quad_dev_4', 'Text': 'we love th pink pony .', 'Aspect_VA': [{'Aspect': 'pink pony', 'VA': '7.17#7.00'}]}
{'ID': 'rest16_quad_dev_5', 'Text': 'this place has got to be the best japanese restaurant in the new york area .', 'Aspect_VA': [{'Aspect': 'place', 'VA': '7.88#8.12'}]}


Both Restaurant and Laptop data were verified for DimASR and DimASTE: the JSONL structure is valid and the files load successfully.

## **Evaluation Metric**

In [60]:
import math

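# Single-sample example (N = 1)
V_pred, A_pred = 6.75, 6.38
V_true, A_true = 6.0, 6.0

# 128 = 8**2 + 8**2 is the largest possible squared VA distance on the
# [1, 9] scale, so dividing by it gives a normalized RMSE in [0, 1].
MAX_SQ_DIST = 128
RMSE_VA = math.sqrt(((V_pred - V_true)**2 + (A_pred - A_true)**2) / MAX_SQ_DIST)
print(RMSE_VA)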

0.07431457629563663
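
The single-sample check extends to a batch by averaging the squared errors before taking the root. The sketch below assumes the metric is the root of the mean squared VA distance, optionally divided by 128 as in the cell above. `rmse_va` is a hypothetical helper and is not the official evaluation script.

```python
import math

MAX_SQ_DIST = 8**2 + 8**2  # 128: largest squared VA distance on the [1, 9] scale

def rmse_va(preds, golds, normalize=True):
    """RMSE over paired (valence, arousal) tuples; normalized to [0, 1] by default."""
    if len(preds) != len(golds) or not preds:
        raise ValueError("preds and golds must be non-empty and the same length")
    # Sum of squared valence and arousal errors across all samples
    sq_err = sum((pv - gv) ** 2 + (pa - ga) ** 2
                 for (pv, pa), (gv, ga) in zip(preds, golds))
    mean_sq = sq_err / len(preds)
    if normalize:
        mean_sq /= MAX_SQ_DIST
    return math.sqrt(mean_sq)

# Same inputs as the single-sample cell; prints roughly 0.0743
print(rmse_va([(6.75, 6.38)], [(6.0, 6.0)]))
```

Setting `normalize=False` returns the error in raw scale units, which can be easier to read while debugging a model.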


For DimASTE (continuous F1 concept):
*   cTP = 1 - dist(VA_pred, VA_true) when both the Aspect and the Opinion match.
*   dist is the Euclidean distance between the two VA pairs, normalized over the [1, 9] range.
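
A small sketch can make the continuous-F1 idea concrete. It assumes that a predicted triplet earns credit only when its (aspect, opinion) pair exactly matches a gold pair, that the distance is the Euclidean VA distance divided by √128 (the maximum on the [1, 9] scale), and that continuous precision and recall divide the summed cTP by the number of predicted and gold triplets respectively. The function names and the dictionary layout are illustrative, and the official scorer may differ in its matching details.

```python
import math

MAX_DIST = math.sqrt(8**2 + 8**2)  # largest VA distance on the [1, 9] scale

def ctp(va_pred, va_true):
    """Continuous true-positive credit: 1 minus the normalized VA distance."""
    return 1.0 - math.dist(va_pred, va_true) / MAX_DIST

def continuous_f1(pred, gold):
    """pred and gold map (aspect, opinion) -> (valence, arousal)."""
    # Only pairs present in both sets earn (partial) credit
    credit = sum(ctp(va, gold[key]) for key, va in pred.items() if key in gold)
    precision = credit / len(pred) if pred else 0.0
    recall = credit / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy triplets (illustrative values, not taken from the dataset)
gold = {("pasta", "delicious"): (7.5, 7.0), ("service", "slow"): (3.0, 5.5)}
pred = {("pasta", "delicious"): (7.5, 7.0), ("delivery", "terrible"): (2.0, 7.0)}
print(continuous_f1(pred, gold))
```

In this toy case one of two predictions matches a gold pair with identical VA values, so precision and recall are both 0.5.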


# **Colab to GitHub Commit Cell**
Note:
Keep these cells at the bottom of the notebook.
They must remain commented out whenever they are not in use.

## **Copy Current Notebook**

In [79]:
import os, shutil

# Source notebook (your working one in Drive)
src_path = "/content/drive/MyDrive/Colab Notebooks/csce5290_dimabsa.ipynb"

# Destination path inside your project repo
dst_dir = "/content/work/dimabsa-csce5290/notebooks"
os.makedirs(dst_dir, exist_ok=True)

dst_path = os.path.join(dst_dir, os.path.basename(src_path))

# Copy notebook into repo
if os.path.exists(src_path):
    shutil.copy(src_path, dst_path)
    print(f"Copied notebook to {dst_path}")
else:
    print(f"Notebook not found at {src_path}")


Copied notebook to /content/work/dimabsa-csce5290/notebooks/csce5290_dimabsa.ipynb


## **Stage Changes**

In [84]:
# Quick Status check - Must run before commit
%cd /content/work/dimabsa-csce5290

#!git add/rm <FILE>
!git status

/content/work/dimabsa-csce5290
On branch main
Your branch is ahead of 'origin/main' by 1 commit.
  (use "git push" to publish your local commits)

nothing to commit, working tree clean


## **Commit**

In [83]:
from getpass import getpass

# Configure git (run once per session)
USER_EMAIL = getpass("Enter Email for GitHub Account: ")
USER_NAME = getpass("Enter First and Last Name for GitHub Account: ")
!git config --global user.email "$USER_EMAIL"
!git config --global user.name "$USER_NAME"

# --- Stage, commit, and push changes to GitHub ---
# Make sure you're inside your course repo before running:
%cd /content/work/dimabsa-csce5290

# Do not change the echo
echo_message = "Nothing to commit"

# Change this to reflect the work done
# *****  MUST CHANGE *****
commit_message = "Fixed commit messages"

# Uncomment when ready to commit --* re-comment when done *--
!git add .
!git commit -m "$commit_message" || echo "$echo_message"


Enter Email for GitHub Account: ··········
Enter Name for GitHub Account: ··········
/content/work/dimabsa-csce5290
On branch main
Your branch is ahead of 'origin/main' by 1 commit.
  (use "git push" to publish your local commits)

nothing to commit, working tree clean
Nothing to commit


## **Push**

In [78]:
from getpass import getpass

# GitHub > Settings > Developer settings > Personal access tokens: create a token
# (60-day expiry, "repo" scope only). Enter the user and token at the prompts on
# every run; never store them in the notebook.
PUSH_USER = getpass("Enter GitHub Username: ")
TOKEN = getpass("Enter GitHub Token: ")

# Base repo info (constant) - DO NOT CHANGE
REPO_OWNER = "bobjack1313"
REPO_NAME = "dimabsa-csce5290"

# Set the authenticated remote and push
!git remote set-url origin \
"https://{PUSH_USER}:{TOKEN}@github.com/{REPO_OWNER}/{REPO_NAME}.git"
!git push origin main

# Remove the token from the stored remote URL once the push is done
!git remote set-url origin "https://github.com/{REPO_OWNER}/{REPO_NAME}.git"

Enter GitHub Username: ··········
Enter GitHub Token: ··········
Enumerating objects: 7, done.
Counting objects: 100% (7/7), done.
Delta compression using up to 2 threads
Compressing objects: 100% (4/4), done.
Writing objects: 100% (6/6), 6.73 KiB | 6.73 MiB/s, done.
Total 6 (delta 0), reused 0 (delta 0), pack-reused 0
To https://github.com/bobjack1313/dimabsa-csce5290.git
   88f035b..4080b63  main -> main
