<a href="https://colab.research.google.com/github/Wavy-Hec/LLMModel/blob/main/LLMTune.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [5]:
!pip install "transformers>=4.40.0" "datasets>=2.19.0" peft accelerate sentencepiece




In [18]:
"""
Module 2 Project – All Parts in One File

Covers:

Mini Toy Project
----------------
- Find a good classifier (<3B params) on a dataset.
- This file evaluates a few candidate models on GLUE SST-2 (sentiment).
- You can modify the list of models to "do your own research".
- Outputs simple stats (accuracy).

Mini Real Project
-----------------
Model:   protectai/deberta-v3-base-prompt-injection

Part #1 Direct Classify:
    Dataset: xTRam1/safe-guard-prompt-injection
    Task:    binary (0 = normal, 1 = prompt injection)

Part #2 Harder Classify:
    Dataset: reshabhs/SPML_Chatbot_Prompt_Injection
    Task:    binary (0 = normal, 1 = prompt injection)

Part #3 NVIDIA does the job
---------------------------
Base model: meta-llama/Meta-Llama-3.1-8B-Instruct
Adapter:    nvidia/llama-3.1-nemoguard-8b-content-safety (PEFT LoRA)
Dataset:    nvidia/Aegis-AI-Content-Safety-Dataset-2.0

We:
    - Make the Nemoguard example snippet actually run.
    - Run it over part of the Aegis dataset.
    - Parse its JSON output to "safe"/"unsafe".
    - Report simple accuracy vs dataset labels.
"""

import json
from typing import List, Tuple

import torch
from torch.utils.data import DataLoader
from datasets import load_dataset, Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    AutoModelForCausalLM,
)
from peft import PeftModel

# -------------------------
# Global settings
# -------------------------

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using device:", device)


# =========================================================
# Utility: generic classifier evaluation (NO training)
# =========================================================

def evaluate_classifier(
    model_name: str,
    texts: List[str],
    labels: List[int],
    batch_size: int = 16,
    max_length: int = 256,
) -> float:
    """
    Generic evaluation loop:
    - Loads tokenizer + sequence classifier from Hugging Face.
    - Runs inference on (texts, labels).
    - Returns accuracy (0.0 - 1.0).
    """
    print(f"\nLoading model: {model_name}")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name).to(device)
    model.eval()

    ds = Dataset.from_dict({"text": texts, "label": labels})

    def tokenize_fn(example):
        enc = tokenizer(
            example["text"],
            truncation=True,
            padding="max_length",
            max_length=max_length,
        )
        enc["label"] = example["label"]
        return enc

    tokenized = ds.map(tokenize_fn)
    tokenized.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])

    loader = DataLoader(tokenized, batch_size=batch_size, shuffle=False)

    correct = 0
    total = 0

    with torch.no_grad():
        for batch in loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            labels_t = batch["label"]

            outputs = model(
                input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
            )
            logits = outputs.logits
            preds = logits.argmax(dim=-1)

            correct += (preds == labels_t).sum().item()
            total += labels_t.size(0)

    acc = correct / max(total, 1)
    return acc


# =========================================================
# Mini Toy Project
# =========================================================

def run_toy_project():
    """
    Mini Toy Project example:
    - Dataset: GLUE SST-2 (movie sentiment).
    - Models: a small list of <3B parameter classifiers.
      You can add/remove models here to "do your own research".
    - Prints validation accuracy for each.
    """

    print("\n=== Mini Toy Project: Compare small classifiers on SST-2 ===")

    # Candidate models (<3B params). You can add your own here.
    candidate_models = [
        "distilbert-base-uncased-finetuned-sst-2-english",  # ~66M
        "textattack/bert-base-uncased-SST-2",               # ~110M
        # add more if you want
    ]

    glue = load_dataset("glue", "sst2")
    valid = glue["validation"]
    texts = list(valid["sentence"])
    labels = list(valid["label"])

    results = []

    for m in candidate_models:
        acc = evaluate_classifier(m, texts, labels)
        print(f"Model: {m} | SST-2 validation accuracy: {acc:.4f}")
        results.append((m, acc))

    # Pick the best one (highest accuracy)
    best_model, best_acc = max(results, key=lambda x: x[1])
    print("\n[Mini Toy] Best model on SST-2 among candidates:")
    print(f"  {best_model}  (accuracy = {best_acc:.4f})")

    # This is what you can describe in your writeup as "the good classifier".


# =========================================================
# Mini Real Project – Part #1 Direct Classify
# =========================================================

def run_part1_direct_classify():
    """
    Part #1 Direct Classify:
    Dataset: xTRam1/safe-guard-prompt-injection
      - 'text':  prompt text
      - 'label': 0 = normal, 1 = prompt injection

    Model: protectai/deberta-v3-base-prompt-injection
    """
    print("\n=== Mini Real Project – Part #1: Safe-Guard Prompt Injection ===")

    dataset = load_dataset("xTRam1/safe-guard-prompt-injection")
    test_split = dataset["test"]

    texts = list(test_split["text"])
    labels = list(test_split["label"])

    model_name = "protectai/deberta-v3-base-prompt-injection"

    acc = evaluate_classifier(model_name, texts, labels)
    print(f"Model: {model_name}")
    print("Label mapping: 0 = normal, 1 = prompt injection")
    print(f"Test accuracy on Safe-Guard dataset: {acc:.4f}")


# =========================================================
# Mini Real Project – Part #2 Harder Classify
# =========================================================

def run_part2_harder_classify():
    """
    Part #2 Harder Classify:
    Dataset: reshabhs/SPML_Chatbot_Prompt_Injection
      - 'System Prompt'
      - 'User Prompt'
      - 'Prompt injection' (0/1)

    We build:
        text = "SYSTEM: {System Prompt}\\nUSER: {User Prompt}"
        label = 'Prompt injection'

    Model: protectai/deberta-v3-base-prompt-injection
    """
    print("\n=== Mini Real Project – Part #2: SPML Chatbot Prompt Injection ===")

    dataset = load_dataset("reshabhs/SPML_Chatbot_Prompt_Injection")
    train_split = dataset["train"]  # only 'train' split is provided

    system_prompts = list(train_split["System Prompt"])
    user_prompts = list(train_split["User Prompt"])
    labels = list(train_split["Prompt injection"])  # 0 or 1

    texts = [
        f"SYSTEM: {s}\nUSER: {u}"
        for s, u in zip(system_prompts, user_prompts)
    ]

    model_name = "protectai/deberta-v3-base-prompt-injection"

    acc = evaluate_classifier(model_name, texts, labels)
    print(f"Model: {model_name}")
    print("Label mapping: 0 = normal, 1 = prompt injection")
    print(f"Accuracy on SPML dataset: {acc:.4f}")


# =========================================================
# Part #3 – NVIDIA Nemoguard on Aegis 2.0
# =========================================================

BASE_LLAMA = "meta-llama/Meta-Llama-3.1-8B-Instruct"
NEMO_GUARD = "nvidia/llama-3.1-nemoguard-8b-content-safety"

SAFETY_SYSTEM_PROMPT = (
    "You are a content safety classifier. "
    "Given a single user message, decide if it is 'safe' or 'unsafe'. "
    "If unsafe, list one or more categories such as Violence, Sexual, "
    "Criminal, Self-Harm, Hate, PII/Privacy, Harassment, Threat, Profanity, "
    "Illegal Activity, or Other.\n"
    "Respond ONLY with a JSON object like:\n"
    '{"User Safety": "safe" or "unsafe", '
    '"Safety Categories": "comma-separated list or empty if safe"}'
)


def load_nemoguard_model():
    """
    Make the provided code snippet actually work:

        from peft import PeftModel
        from transformers import AutoModelForCausalLM, AutoTokenizer

        base_model = AutoModelForCausalLM.from_pretrained(
            "meta-llama/Meta-Llama-3.1-8B-Instruct",
            device_map="auto"
        )
        tokenizer = AutoTokenizer.from_pretrained("meta-llama-3.1-8B-Instruct")

        safety_model = PeftModel.from_pretrained(
            base_model,
            "nvidia/llama-3.1-nemoguard-8b-content-safety"
        )

    This function does exactly that and returns (tokenizer, safety_model).
    """

    print("\nLoading Nemoguard base + adapter (this is heavy, needs a GPU)...")

    tokenizer = AutoTokenizer.from_pretrained(BASE_LLAMA)

    base_model = AutoModelForCausalLM.from_pretrained(
        BASE_LLAMA,
        device_map="auto",
        torch_dtype=torch.float16,
    )

    safety_model = PeftModel.from_pretrained(
        base_model,
        NEMO_GUARD,
    ).to(device)

    safety_model.eval()
    print("Nemoguard model loaded.")
    return tokenizer, safety_model


def build_safety_prompt(user_text: str) -> str:
    """
    Build a classification-style prompt for Nemoguard.
    """
    return (
        SAFETY_SYSTEM_PROMPT
        + "\n\nUser message:\n"
        + user_text
        + "\n\nOutput JSON:\n"
    )


def nemoguard_classify_text(
    tokenizer,
    safety_model,
    text: str,
    max_new_tokens: int = 64,
) -> str:
    """
    Run Nemoguard on one text.

    Returns:
        "safe" or "unsafe" (best-effort),
        falling back to "safe" if parsing fails.
    """
    prompt = build_safety_prompt(text)

    enc = tokenizer(
        prompt,
        return_tensors="pt",
        truncation=True,
        max_length=1024,
    ).to(device)

    with torch.no_grad():
        outputs = safety_model.generate(
            **enc,
            max_new_tokens=max_new_tokens,
            do_sample=False,
        )

    generated = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Try to extract JSON object from the output
    try:
        start = generated.rfind("{")
        end = generated.rfind("}")
        json_str = generated[start : end + 1]
        data = json.loads(json_str)

        if "User Safety" in data:
            label = str(data["User Safety"]).strip().lower()
            if label in ["safe", "unsafe"]:
                return label
    except Exception:
        pass

    # Fallback
    return "safe"


def run_part3_nemoguard(num_examples: int = 200):
    """
    Part #3 – NVIDIA does the job:

    Dataset: nvidia/Aegis-AI-Content-Safety-Dataset-2.0
      - we'll use fields: 'prompt' and 'prompt_label' (safe/unsafe).

    Model: Nemoguard (Llama 3.1 8B + content-safety adapter).

    We:
        - Load first `num_examples` prompts.
        - Run Nemoguard classification (safe/unsafe) on each.
        - Compare to 'prompt_label'.
        - Print simple accuracy.

    WARNING:
        This is an 8B model with an adapter – you need a decent GPU VRAM.
        Start with small num_examples (e.g., 50 or 100) to test.
    """
    print("\n=== Part #3: Nemoguard on Aegis 2.0 – Prompt Safety ===")

    tokenizer, safety_model = load_nemoguard_model()

    ds = load_dataset("nvidia/Aegis-AI-Content-Safety-Dataset-2.0", split="test")
    prompts = list(ds["prompt"])
    labels = list(ds["prompt_label"])  # "safe" or "unsafe"

    prompts = prompts[:num_examples]
    labels = labels[:num_examples]

    correct = 0
    total = len(prompts)

    for text, true_label in zip(prompts, labels):
        pred_label = nemoguard_classify_text(tokenizer, safety_model, text)
        if pred_label.lower() == true_label.lower():
            correct += 1

    acc = correct / max(total, 1)
    print(f"Evaluated {total} examples.")
    print("Ground truth: prompt_label ∈ {safe, unsafe}")
    print(f"Nemoguard accuracy (safe/unsafe): {acc:.4f}")


# =========================================================
# Main – run all parts (you can comment out what you don’t need)
# =========================================================

if __name__ == "__main__":
    # Mini Toy Project
    run_toy_project()

    # Mini Real Project: Part 1 & Part 2
    run_part1_direct_classify()
    run_part2_harder_classify()

    # Part 3 – Only run this if you have a *strong* GPU.
    # You can reduce num_examples to something small (e.g., 50) to test.
    # run_part3_nemoguard(num_examples=100)


Using device: cuda

=== Mini Toy Project: Compare small classifiers on SST-2 ===

Loading model: distilbert-base-uncased-finetuned-sst-2-english


Map:   0%|          | 0/872 [00:00<?, ? examples/s]

Model: distilbert-base-uncased-finetuned-sst-2-english | SST-2 validation accuracy: 0.9106

Loading model: textattack/bert-base-uncased-SST-2


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/477 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]

Map:   0%|          | 0/872 [00:00<?, ? examples/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

Model: textattack/bert-base-uncased-SST-2 | SST-2 validation accuracy: 0.9243

[Mini Toy] Best model on SST-2 among candidates:
  textattack/bert-base-uncased-SST-2  (accuracy = 0.9243)

=== Mini Real Project – Part #1: Safe-Guard Prompt Injection ===

Loading model: protectai/deberta-v3-base-prompt-injection


Map:   0%|          | 0/2060 [00:00<?, ? examples/s]

Model: protectai/deberta-v3-base-prompt-injection
Label mapping: 0 = normal, 1 = prompt injection
Test accuracy on Safe-Guard dataset: 0.7990

=== Mini Real Project – Part #2: SPML Chatbot Prompt Injection ===

Loading model: protectai/deberta-v3-base-prompt-injection


Map:   0%|          | 0/16012 [00:00<?, ? examples/s]

Model: protectai/deberta-v3-base-prompt-injection
Label mapping: 0 = normal, 1 = prompt injection
Accuracy on SPML dataset: 0.4306
