# Green Patent Detection: Advanced Architectures (Agents vs. QLoRA)

## 🔄 Parts A & B – Setup (Reused from Assignment 2)

The baseline model, uncertainty scores, and 100 high-risk examples were already computed 
in Assignment 2. We reuse those artifacts directly here.

**Dataset splits (from Assignment 2 Part A):**
- `train_silver`: 70% of the 50k sample (35,000 rows), used to train the baseline
- `pool_unlabeled`: 20% of the 50k sample (10,000 rows), used for uncertainty sampling
- `eval_silver`: 10% of the 50k sample (5,000 rows), used for evaluation

**Reused artifacts:**
- `baseline_clf.pkl`: trained Logistic Regression on frozen PatentSBERTa embeddings
- `embeddings/X_train.npy`: frozen embeddings for train_silver
- `embeddings/X_pool.npy`: frozen embeddings for pool_unlabeled
- `embeddings/X_eval.npy`: frozen embeddings for eval_silver
- `parquet/pool_unlabeled.parquet`: the unlabeled pool
- `parquet/train_silver.parquet`: the training split
- `parquet/eval_silver.parquet`: the evaluation split
- `csv/hitl_green_100.csv`: the 100 most uncertain claims (u = 0.987–1.000)

These feed directly into Part C where we replace the simple LLM labeling from 
Assignment 2 with a Multi-Agent System (MAS).

In [1]:
# Load dependencies
import pandas as pd
import numpy as np
import pickle
from pathlib import Path

SEED = 42

# Load baseline model (logistic regression); the context manager closes the file handle
with open("baseline_clf.pkl", "rb") as f:
    clf = pickle.load(f)

# Load frozen embeddings
X_train     = np.load("embeddings/X_train.npy")
X_pool      = np.load("embeddings/X_pool.npy")
X_eval      = np.load("embeddings/X_eval.npy")

# Load the three splits from their parquet files
train_silver = pd.read_parquet("parquet/train_silver.parquet")
pool        = pd.read_parquet("parquet/pool_unlabeled.parquet")
eval_silver  = pd.read_parquet("parquet/eval_silver.parquet")

# Load the 100 high-risk claims
hitl_100    = pd.read_csv("csv/hitl_green_100.csv")

print(f"Pool size:        {len(pool)}")
print(f"Train size:       {len(train_silver)}")
print(f"Eval size:        {len(eval_silver)}")
print(f"High-risk claims: {len(hitl_100)}")
print(f"Uncertainty range: {hitl_100['u'].min():.3f} – {hitl_100['u'].max():.3f}")

Pool size:        10000
Train size:       35000
Eval size:        5000
High-risk claims: 100
Uncertainty range: 0.987 – 1.000
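
The uncertainty score `u` was computed in Assignment 2 and is not redefined in this notebook. As a rough sketch only, assuming `u = 1 - 2 * |p_green - 0.5|` (an assumption; the actual Assignment 2 formula may differ, for example binary entropy), selecting the most uncertain pool rows could look like the following. Both function names are hypothetical.

```python
import numpy as np


def uncertainty_from_proba(p_green):
    """Map P(green) to [0, 1]: 1.0 at p = 0.5, 0.0 at p = 0 or p = 1.

    Assumed definition, not confirmed by the Assignment 2 code.
    """
    p = np.asarray(p_green, dtype=float)
    return 1.0 - 2.0 * np.abs(p - 0.5)


def top_k_uncertain(p_green, k):
    """Return (row positions, uncertainty scores) of the k most uncertain rows."""
    u = uncertainty_from_proba(p_green)
    order = np.argsort(-u, kind="stable")[:k]  # descending by uncertainty
    return order, u[order]


# Toy probabilities standing in for the baseline's P(green) on the pool
p_demo = np.array([0.02, 0.49, 0.95, 0.505, 0.70])
idx_demo, u_demo = top_k_uncertain(p_demo, k=2)
print(idx_demo, u_demo)
```

With the real artifacts, `p_green` would presumably come from `clf.predict_proba(X_pool)[:, 1]` and `k` would be 100.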


## 🔀 Part C: Choose Your Advanced Path

The Multi-Agent System (MAS) was implemented as a three-agent debate pipeline using 
vLLM for inference on HPC, with LangGraph orchestrating the state flow between agents.

**Files:**
- `mas_label.py`: the MAS inference script
- `slurm_mas.sh`: the SLURM job file

**Agents:**
- **Advocate** (Mistral-7B-Instruct-v0.2): argues FOR green classification, 
looking for environmental benefits and sustainability aspects
- **Skeptic** (Qwen2.5-7B-Instruct): argues AGAINST green classification, 
looking for greenwashing or weak green signals
- **Judge** (Meta-Llama-3-8B-Instruct): weighs both arguments and produces 
the final JSON label, confidence score, and rationale
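
The state flow between the three agents can be sketched in plain Python. This is not the code in `mas_label.py`: it uses stub functions instead of vLLM calls and a simple function chain instead of a LangGraph graph, and it assumes the Skeptic sees the Advocate's argument before responding. The names `run_debate` and the `stub_*` agents are hypothetical; the state keys `advocate_arg` and `skeptic_arg` mirror the columns in the output CSV.

```python
from typing import Callable, Dict

State = Dict[str, str]
Agent = Callable[[State], str]  # an agent reads the shared state and returns text


def run_debate(claim: str, advocate: Agent, skeptic: Agent, judge: Agent) -> State:
    """Pass one claim through Advocate -> Skeptic -> Judge, accumulating state."""
    state: State = {"claim": claim}
    state["advocate_arg"] = advocate(state)
    state["skeptic_arg"] = skeptic(state)
    state["judge_raw"] = judge(state)  # raw text; JSON parsing happens later
    return state


# Stub agents (hypothetical): a real agent would prompt its LLM here.
def stub_advocate(state: State) -> str:
    return "FOR: " + state["claim"]


def stub_skeptic(state: State) -> str:
    return "AGAINST: " + state["advocate_arg"]


def stub_judge(state: State) -> str:
    return '{"label": 0, "confidence": "medium", "rationale": "stub"}'


result = run_debate("A wind turbine blade ...", stub_advocate, stub_skeptic, stub_judge)
print(result["skeptic_arg"])
```

Keeping the agents as interchangeable callables makes it possible to test the flow without a GPU before swapping in the real models.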

**Results:**
- 98 out of 100 claims parsed successfully (2 failed)
- Label distribution: 51 not green, 47 green
- Confidence: 76 medium, 18 high, 4 low
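
Two of the 100 Judge outputs could not be parsed. A tolerant parser along the following lines is one way to separate usable from unusable outputs. This is a sketch under assumptions: the JSON keys `label`, `confidence`, and `rationale` and the function name `parse_judge_output` are hypothetical, since the Judge's actual output schema is not shown here.

```python
import json
import re
from typing import Optional


def parse_judge_output(raw: str) -> Optional[dict]:
    """Extract the first-to-last brace span from the raw text and validate it.

    Returns None when no valid JSON object with the expected fields is found.
    """
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return None
    try:
        obj = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    label = obj.get("label")
    confidence = str(obj.get("confidence", "")).lower()
    if label not in (0, 1) or confidence not in {"low", "medium", "high"}:
        return None
    return {
        "mas_green_suggested": int(label),
        "mas_confidence": confidence,
        "mas_rationale": str(obj.get("rationale", "")),
    }


ok = parse_judge_output('Sure! {"label": 1, "confidence": "High", "rationale": "Reduces emissions."}')
bad = parse_judge_output("You are")  # text with no JSON object
print(ok, bad)
```

Returning `None` instead of raising keeps the batch job running and leaves the failed rows visible as missing values for later handling.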

**Comparison with Assignment 2 (simple Mistral):**

| Labeling method | Not Green | Green | Low Confidence |
|---|---|---|---|
| Assignment 2 (Mistral) | 95 | 5 | 72% |
| Assignment 3 (MAS) | 51 | 47 | 4% |

The debate structure produced markedly more balanced and more confident labels than 
the single-LLM approach: green labels rose from 5 to 47, and the low-confidence share 
fell from 72% to 4%. The Advocate agent pushes back against the Skeptic's tendency to 
default to not green, which plausibly explains the more nuanced labeling.
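
The numbers in the comparison table can be derived from the suggestion and confidence columns. The helper below is a minimal sketch with a hypothetical name (`summarize_labels`) and toy data; it assumes the low-confidence share is taken over all rows, including failed parses.

```python
import pandas as pd


def summarize_labels(df: pd.DataFrame, label_col: str, conf_col: str) -> dict:
    """Count labels among parsed rows and compute the low-confidence share."""
    labels = df[label_col].dropna().astype(int)  # failed parses are excluded from counts
    return {
        "not_green": int((labels == 0).sum()),
        "green": int((labels == 1).sum()),
        "low_conf_pct": float((df[conf_col] == "low").mean() * 100),
    }


# Toy rows; the last one mimics a failed parse
demo = pd.DataFrame({
    "mas_green_suggested": [1, 0, 1, None],
    "mas_confidence": ["high", "low", "medium", None],
})
summary = summarize_labels(demo, "mas_green_suggested", "mas_confidence")
print(summary)
```

The same function would apply to the Assignment 2 columns `llm_green_suggested` and `llm_confidence`.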

**Output file:** `csv/mas_labeled.csv`

## 🏁 Part D: Human Review & Final Integration

Before starting the human review, we clean up the MAS output: the two failed parses are 
filled with defaults, and the Assignment 2 LLM columns, which are no longer needed, are 
dropped. We then proceed with the same HITL widget as in Assignment 2.

### Human Review

In [12]:
df_mas = pd.read_csv("csv/mas_labeled.csv")

print(f"Columns before cleanup: {df_mas.columns.tolist()}")
print(f"Shape: {df_mas.shape}")
print(f"Failed parses before cleanup: {df_mas['mas_green_suggested'].isna().sum()}")

# Fill the 2 failed parses with defaults (label 1 = green, confidence low)
df_mas["mas_green_suggested"] = df_mas["mas_green_suggested"].fillna(1)
df_mas["mas_confidence"]      = df_mas["mas_confidence"].fillna("low")
df_mas["mas_rationale"]       = df_mas["mas_rationale"].fillna("Failed to parse; defaulted to green.")

# Drop Assignment 2 LLM columns (not needed for Assignment 3).
# errors="ignore" keeps the cell re-runnable, since it overwrites its own input file.
df_mas = df_mas.drop(columns=["llm_green_suggested", "llm_confidence", "llm_rationale"], errors="ignore")

# Save cleaned version
df_mas.to_csv("csv/mas_labeled.csv", index=False)

print(f"Columns after cleanup: {df_mas.columns.tolist()}")
print(f"Shape: {df_mas.shape}")
print(f"Failed parses after cleanup: {df_mas['mas_green_suggested'].isna().sum()}")

Columns before cleanup: ['doc_id', 'text', 'p_green', 'u', 'llm_green_suggested', 'llm_confidence', 'llm_rationale', 'is_green_human', 'notes', 'mas_green_suggested', 'mas_confidence', 'mas_rationale', 'advocate_arg', 'skeptic_arg']
Shape: (100, 14)
Failed parses before cleanup: 2


Columns after cleanup: ['doc_id', 'text', 'p_green', 'u', 'is_green_human', 'notes', 'mas_green_suggested', 'mas_confidence', 'mas_rationale', 'advocate_arg', 'skeptic_arg']
Shape: (100, 11)
Failed parses after cleanup: 0


In [13]:
import ipywidgets as widgets
from IPython.display import display, clear_output

# Load the cleaned MAS labeled file
df_review = pd.read_csv("csv/mas_labeled.csv")

# Tracks which row we're currently reviewing
state = {"idx": 0}

In [14]:
# Display elements
progress     = widgets.Label()
claim_text   = widgets.Textarea(layout=widgets.Layout(width="100%", height="150px"), disabled=True)
advocate_box = widgets.Textarea(layout=widgets.Layout(width="100%", height="80px"), disabled=True)
skeptic_box  = widgets.Textarea(layout=widgets.Layout(width="100%", height="80px"), disabled=True)
mas_label    = widgets.Label()
mas_conf     = widgets.Label()
mas_rat      = widgets.Textarea(layout=widgets.Layout(width="100%", height="80px"), disabled=True)
notes_box    = widgets.Textarea(placeholder="Optional: add a note (especially if you disagree)",
                                layout=widgets.Layout(width="100%", height="60px"))

# Buttons
btn_green    = widgets.Button(description="1 - Green",     button_style="success")
btn_notgreen = widgets.Button(description="0 - Not Green", button_style="danger")
btn_prev     = widgets.Button(description="← Previous")
out          = widgets.Output()

In [15]:
OUTPUT_PATH = "csv/mas_human_labeled.csv"

def show_row(idx):
    row = df_review.iloc[idx]
    progress.value    = f"Claim {idx + 1} / {len(df_review)}"
    claim_text.value  = str(row["text"])
    advocate_box.value = str(row["advocate_arg"])
    skeptic_box.value  = str(row["skeptic_arg"])
    mas_label.value   = f"MAS suggested: {int(row['mas_green_suggested']) if pd.notna(row['mas_green_suggested']) else 'N/A'}"
    mas_conf.value    = f"MAS confidence: {row['mas_confidence']}"
    mas_rat.value     = str(row["mas_rationale"])
    notes_box.value   = str(row["notes"]) if pd.notna(row["notes"]) else ""

def save_and_advance(label):
    idx = state["idx"]
    df_review.at[idx, "is_green_human"] = label
    df_review.at[idx, "notes"]          = notes_box.value
    df_review.to_csv(OUTPUT_PATH, index=False)
    with out:
        clear_output()
        print(f"Saved: claim {idx + 1} → {label}")
    state["idx"] = min(idx + 1, len(df_review) - 1)
    show_row(state["idx"])

In [16]:
# Connect buttons to logic
btn_green.on_click(lambda _: save_and_advance(1))
btn_notgreen.on_click(lambda _: save_and_advance(0))
def go_previous(_):
    # Step back one row, clamped at the first claim
    state["idx"] = max(state["idx"] - 1, 0)
    show_row(state["idx"])

btn_prev.on_click(go_previous)

# Start at first unlabeled row so you can safely resume after interruptions
first_unlabeled = df_review["is_green_human"].isna().idxmax()
state["idx"] = first_unlabeled if pd.isna(df_review.at[first_unlabeled, "is_green_human"]) else 0
show_row(state["idx"])

# Render the widget
display(widgets.VBox([
    progress,
    widgets.Label("Claim text:"),    claim_text,
    widgets.Label("Advocate:"),      advocate_box,
    widgets.Label("Skeptic:"),       skeptic_box,
    widgets.Label("MAS output:"),    mas_label, mas_conf,
    widgets.Label("MAS rationale:"), mas_rat,
    widgets.Label("Your notes:"),    notes_box,
    widgets.HBox([btn_notgreen, btn_green, btn_prev]),
    out
]))

VBox(children=(Label(value='Claim 1 / 100'), Label(value='Claim text:'), Textarea(value='1. A processor compri…

In [21]:
df_human = pd.read_csv("csv/mas_human_labeled.csv")

# Find rows where human label differs from MAS suggestion.
# NaN never compares equal, so rows with a missing MAS suggestion also count as overrides.
overrides = df_human[df_human["is_green_human"] != df_human["mas_green_suggested"]]

print(f"Total overrides: {len(overrides)} / {len(df_human)}")
print()

# Print 3 examples for the README
for i, (_, row) in enumerate(overrides.head(3).iterrows()):
    print(f"--- Example {i+1} ---")
    print(f"Claim:          {row['text'][:200]}...")
    print(f"MAS suggested:  {row['mas_green_suggested']} ({row['mas_confidence']} confidence)")
    print(f"MAS rationale:  {row['mas_rationale']}")
    print(f"Human label:    {row['is_green_human']}")
    print(f"Notes:          {row['notes']}")
    print()

Total overrides: 9 / 100

--- Example 1 ---
Claim:          1. A radio communications system comprising: a first radio base station to convert multimedia broadcast multicast service (MBMS) data to unicast data and transmit the unicast data to an intermediate s...
MAS suggested:  0.0 (medium confidence)
MAS rationale:  While the system may improve data transmission efficiency, the lack of explicit environmental metrics, energy efficiency analysis, and consideration of broader environmental impacts makes it difficult to classify this patent as genuinely green or sustainable technology.
Human label:    1.0
Notes:          The advocate seems pretty convincing with the argument of energy savings.

--- Example 2 ---
Claim:          1. A method comprising: operating an aerial vehicle in a hover-flight orientation, wherein the aerial vehicle is connected to a tether that defines a tether sphere having a radius based on a length of...
MAS suggested:  nan (nan confidence)
MAS rationale:  You are

In [27]:
# Load the human-labeled files for both assignments
df_a2 = pd.read_csv("../Assignment2/csv/hitl_human_labeled.csv")
df_a3 = pd.read_csv("csv/mas_human_labeled.csv")

# Fix the 2 failed parses in the MAS file
df_a3["mas_green_suggested"] = df_a3["mas_green_suggested"].fillna(1)

# Agreement = rows where AI suggestion matches human label
a2_agreement = (df_a2["llm_green_suggested"] == df_a2["is_green_human"]).mean() * 100
a3_agreement = (df_a3["mas_green_suggested"] == df_a3["is_green_human"]).mean() * 100

print(f"Assignment 2 (simple Mistral) agreement with human: {a2_agreement:.1f}%")
print(f"Assignment 3 (MAS) agreement with human:            {a3_agreement:.1f}%")

Assignment 2 (simple Mistral) agreement with human: 94.0%
Assignment 3 (MAS) agreement with human:            92.0%
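
The override count printed earlier (9) and the MAS agreement rate (92%, which corresponds to 8 disagreements) differ by one. The most likely reason is that the override cell compares raw values, where a missing suggestion never equals any label, while the agreement cell first fills missing suggestions with 1. A toy illustration with made-up rows:

```python
import numpy as np
import pandas as pd

# Made-up rows: two parsed suggestions and two failed parses (NaN)
demo = pd.DataFrame({
    "mas_green_suggested": [1.0, 0.0, np.nan, np.nan],
    "is_green_human":      [1.0, 1.0, 1.0,    0.0],
})

# Raw comparison: NaN != x is True for every x, so both failed parses count as overrides
overrides_raw = int((demo["is_green_human"] != demo["mas_green_suggested"]).sum())

# After filling with 1, a failed parse agrees whenever the human chose green
filled = demo["mas_green_suggested"].fillna(1)
overrides_filled = int((demo["is_green_human"] != filled).sum())
agreement_filled = float((filled == demo["is_green_human"]).mean() * 100)

print(overrides_raw, overrides_filled, agreement_filled)
```

Deciding once whether failed parses count as disagreements, and applying that rule in both cells, would make the two numbers directly comparable.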


### Final Integration / Fine-Tuning

In [28]:
# Load the MAS human reviewed labels and train_silver
df_gold  = pd.read_csv("csv/mas_human_labeled.csv")
df_train = pd.read_parquet("parquet/train_silver.parquet")

# Copy the human label into is_green_gold for consistency
df_gold["is_green_gold"] = df_gold["is_green_human"]

# Give train_silver the same column name
df_train["is_green_gold"] = df_train["is_green_silver"]

# Concatenate train_silver + gold_100 into one training set.
# The 100 gold examples come from the pool split, which is disjoint from train_silver,
# so no doc_id collides and the examples are simply appended as new rows.
df_combined = pd.concat(
    [df_train, df_gold[["doc_id", "text", "is_green_gold"]]],
    ignore_index=True
)

print(f"train_silver rows: {len(df_train)}")
print(f"gold_100 rows:     {len(df_gold)}")
print(f"combined rows:     {len(df_combined)}")
print(df_combined["is_green_gold"].value_counts())

df_combined.to_parquet("parquet/train_gold.parquet", index=False)

train_silver rows: 35000
gold_100 rows:     100
combined rows:     35100
is_green_gold
1.0    17553
0.0    17547
Name: count, dtype: int64
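
The concatenation relies on the pool and training splits sharing no `doc_id`. A small guard can make that assumption explicit before appending. The helper name `append_gold` and the toy frames below are hypothetical; the last line only restates the arithmetic behind the gold set's share of the combined data.

```python
import pandas as pd


def append_gold(train: pd.DataFrame, gold: pd.DataFrame) -> pd.DataFrame:
    """Append gold rows to the training set, refusing to proceed on doc_id overlap."""
    overlap = set(train["doc_id"]) & set(gold["doc_id"])
    if overlap:
        raise ValueError(f"{len(overlap)} doc_id(s) appear in both sets")
    cols = ["doc_id", "text", "is_green_gold"]
    return pd.concat([train, gold[cols]], ignore_index=True)


# Toy frames standing in for train_silver and the 100 gold rows
train_demo = pd.DataFrame({"doc_id": ["a", "b", "c"], "text": ["t1", "t2", "t3"], "is_green_gold": [1, 0, 1]})
gold_demo = pd.DataFrame({"doc_id": ["d"], "text": ["t4"], "is_green_gold": [0]})
combined_demo = append_gold(train_demo, gold_demo)

# Share of gold rows in the real combined set: 100 of 35,100 rows
gold_share_pct = 100 / 35_100 * 100
print(len(combined_demo), round(gold_share_pct, 2))
```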


Now that `train_gold.parquet` has been created by combining `train_silver` with the 100 MAS 
gold labels from the three-agent debate pipeline, the fine-tuning step needs to be run 
on HPC. The same `finetune.py` and `slurm_finetune.sh` from Assignment 2 are reused here, 
since the fine-tuning process is identical. The only difference is that the gold labels now 
come from the MAS instead of the simple Mistral prompt.

Follow these steps before continuing:

1. Make sure `train_gold.parquet` and `mas_human_labeled.csv` are available in your HPC project folder
2. Submit the SLURM job: `sbatch slurm_finetune.sh`
3. Once complete, copy the saved model folder back to your local project: `models/patentsberta-finetuned`
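
Step 1 can be checked with a short pre-flight script before submitting the job. This is a sketch only: the helper `missing_files` is hypothetical, and the relative paths assume the HPC project folder mirrors the local `parquet/` and `csv/` layout, which may not match your setup.

```python
from pathlib import Path
from typing import Iterable, List


def missing_files(root: Path, relative_paths: Iterable[str]) -> List[str]:
    """Return the relative paths that do not exist as files under root."""
    return [p for p in relative_paths if not (root / p).is_file()]


# Assumed layout; adjust the paths to your HPC project folder
REQUIRED = [
    "parquet/train_gold.parquet",
    "csv/mas_human_labeled.csv",
    "finetune.py",
    "slurm_finetune.sh",
]

missing = missing_files(Path("."), REQUIRED)
print("Missing:", missing if missing else "none")
```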

Then continue with Part E below.

In [29]:
# Fine-tuning results from HPC (finetune.py output), pasted here for reference
print("""
--- eval_silver ---
              precision    recall  f1-score   support
   not green       0.81      0.80      0.80      2500
       green       0.80      0.81      0.81      2500
    accuracy                           0.81      5000
   macro avg       0.81      0.81      0.81      5000
weighted avg       0.81      0.81      0.81      5000

--- gold_100 ---
              precision    recall  f1-score   support
   not green       0.49      0.51      0.50        47
       green       0.55      0.53      0.54        53
    accuracy                           0.52       100
   macro avg       0.52      0.52      0.52       100
weighted avg       0.52      0.52      0.52       100
""")


--- eval_silver ---
              precision    recall  f1-score   support
   not green       0.81      0.80      0.80      2500
       green       0.80      0.81      0.81      2500
    accuracy                           0.81      5000
   macro avg       0.81      0.81      0.81      5000
weighted avg       0.81      0.81      0.81      5000

--- gold_100 ---
              precision    recall  f1-score   support
   not green       0.49      0.51      0.50        47
       green       0.55      0.53      0.54        53
    accuracy                           0.52       100
   macro avg       0.52      0.52      0.52       100
weighted avg       0.52      0.52      0.52       100
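
As a sanity check on the pasted reports, accuracy can be recomputed from the per-class recall and support, since each class contributes `recall * support` correct predictions. The function name is hypothetical, and the inputs are the rounded values from the reports above, so small rounding differences are expected.

```python
def accuracy_from_recall(recalls, supports):
    """Accuracy = total correct predictions / total examples."""
    correct = sum(r * n for r, n in zip(recalls, supports))
    return correct / sum(supports)


# eval_silver: recall 0.80 (not green) and 0.81 (green), 2500 examples each
eval_acc = accuracy_from_recall([0.80, 0.81], [2500, 2500])

# gold_100: recall 0.51 (not green, n=47) and 0.53 (green, n=53)
gold_acc = accuracy_from_recall([0.51, 0.53], [47, 53])

print(f"eval_silver: {eval_acc:.4f}, gold_100: {gold_acc:.4f}")
```

Both values are consistent with the reported accuracies of 0.81 and 0.52 once the rounding of the recall figures is taken into account.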



## 📊 Part E: Comparative Analysis (Required)

### Model Comparison (eval_silver)

| Model Version | Training Data Source | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|
| 1. Baseline | Frozen Embeddings (No Fine-tuning) | 0.77 | 0.77 | 0.77 | 0.77 |
| 2. Assignment 2 Model | Fine-tuned on Silver + Gold (Simple LLM) | 0.81 | 0.81 | 0.81 | 0.81 |
| 3. Assignment 3 Model | Fine-tuned on Silver + Gold (MAS) | 0.81 | 0.81 | 0.81 | 0.81 |

### gold_100 Performance

| Model Version | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|
| Assignment 2 Model | 0.57 | 0.67 | 0.52 | 0.62 |
| Assignment 3 Model | 0.52 | 0.52 | 0.52 | 0.52 |

### Reflection

Both the Assignment 2 and Assignment 3 models achieved identical performance on 
eval_silver (0.81 F1), suggesting that the quality of the gold labeling method has 
minimal impact on overall model performance when the gold set represents only 0.28% 
of the total training data. However, the MAS produced notably more balanced and 
confident labels (47 green vs 5 green in Assignment 2), indicating that the debate 
structure leads to more nuanced annotation. The effort of implementing a multi-agent 
system did not translate into a better downstream model on eval_silver, but may be 
more impactful in scenarios where the gold labeled set is larger or the silver labels 
are noisier.

In [None]:
from huggingface_hub import HfApi
from dotenv import load_dotenv
import os

# Load HuggingFace token from .env file
load_dotenv()
TOKEN    = os.getenv("HF_TOKEN")
USERNAME = "alexchrander"

# Set to True only when you want to upload; prevents accidental re-uploads
UPLOAD_TO_HF = False

api = HfApi()

if UPLOAD_TO_HF:
    api.create_repo(
        repo_id=f"{USERNAME}/patent-sberta-green-finetuned-mas",
        token=TOKEN,
        exist_ok=True
    )
    api.upload_folder(
        folder_path="models/patentsberta-finetuned",
        repo_id=f"{USERNAME}/patent-sberta-green-finetuned-mas",
        token=TOKEN
    )
    print("Model uploaded successfully")
else:
    print("Skipping upload: set UPLOAD_TO_HF = True to upload")

Model uploaded successfully


In [31]:
if UPLOAD_TO_HF:
    api.create_repo(
        repo_id=f"{USERNAME}/patents-green-mas-dataset",
        repo_type="dataset",
        token=TOKEN,
        exist_ok=True
    )
    api.upload_file(
        path_or_fileobj="csv/mas_human_labeled.csv",
        path_in_repo="mas_human_labeled.csv",
        repo_id=f"{USERNAME}/patents-green-mas-dataset",
        repo_type="dataset",
        token=TOKEN
    )
    print("Dataset uploaded successfully")
else:
    print("Skipping upload: set UPLOAD_TO_HF = True to upload")

Dataset uploaded successfully
