# Executive Summary

- **Project Goal**  
  Build a local, quantized Transformer that ingests raw health-inspection violations and returns a structured JSON object with two fields:  
  1. **summary**: a 1–2 sentence consumer-friendly verdict ending in one of four canned food-safety statements  
  2. **keywords**: a list of 3–7 lowercase issue tags  

- **Data Preparation (Notebook 1, not this one)**  
  - Sampled 3,000 violation texts from Chicago’s API  
  - Used OpenAI ChatCompletion to generate training JSONL (`summary` + `keywords`)  
  - Split into train/val/test and augmented with synthetic “not safe” examples  

- **Fine-Tuning (Notebook 2, this notebook)**  
  - Loaded `google/flan-t5-small` with 4-bit quantization and applied LoRA adapters  
  - Trained for 3 epochs on the JSONL data  
  - Saved the quantized + LoRA model for offline inference  

- **Evaluation Attempts**  
  - **Plain-text metrics**: swapped to a “Summarize this…” prompt and computed ROUGE and BERTScore. ROUGE-1 rose slightly (0.129 → 0.155), but BERTScore F1 fell (0.465 → 0.435) and the JSON structure was lost  
  - **JSON-parsing metrics**: tried to parse model outputs back into JSON and score summaries/keywords; scores stayed at zero because the inference prompt didn’t match training  
  - **Debugging**: re-loaded models correctly, batched inference with progress bars, handled malformed outputs—nothing overcame the prompt-mismatch issue  

- **Key Insight**  
  The fine-tuned model only emits valid JSON when given the exact same instruction it saw during training. Any deviation (e.g. plain summarization prompt) collapses to empty or gibberish outputs, making robust metric computation impossible without re-training.

- **Next Steps: Web Integration**  
  - Deploy the quantized + LoRA model in the backend  
  - At inference, prepend the original JSON-generation prompt to each violation text  
  - Parse `summary` + `keywords` from the returned JSON and serve to the frontend  
  - Defer deeper offline metric refinement until after integration, since end-to-end functionality is now validated and cost-free (no API calls)
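
The parsing step in the integration plan could look roughly like the sketch below. It assumes the backend receives one decoded string per violation. The function names `parse_model_output` and `is_valid` are illustrative and do not appear elsewhere in this notebook. The four canned food-safety statements are not listed here, so the check covers only the fields and the 3–7 keyword rule.

```python
import json


def parse_model_output(raw: str) -> dict:
    """Parse one generated string into {"summary": str, "keywords": list[str]}.

    Returns an empty result instead of raising, so a single malformed
    generation cannot break a request handler.
    """
    empty = {"summary": "", "keywords": []}
    try:
        obj = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return empty
    if not isinstance(obj, dict):
        return empty

    summary = obj.get("summary", "")
    keywords = obj.get("keywords", [])
    if not isinstance(summary, str) or not isinstance(keywords, list):
        return empty

    # Normalize tags: trimmed, lowercase, de-duplicated, original order kept.
    tags = list(dict.fromkeys(
        k.strip().lower() for k in keywords if isinstance(k, str) and k.strip()
    ))
    return {"summary": summary.strip(), "keywords": tags}


def is_valid(result: dict, min_tags: int = 3, max_tags: int = 7) -> bool:
    """Check the non-empty summary and the 3-7 keyword rule from the project goal."""
    return bool(result["summary"]) and min_tags <= len(result["keywords"]) <= max_tags
```

A backend could serve `result` only when `is_valid(result)` is true and fall back to a neutral message otherwise.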


# Refined Code (1st Run)

In [None]:
# Block A: Imports & Environment Setup

# Install necessary libraries
!pip install transformers datasets peft bitsandbytes accelerate evaluate -q

# Disable Weights & Biases logging
import os
os.environ["WANDB_DISABLED"] = "true"

# Core imports
import torch
import pandas as pd
from datasets import load_dataset, DatasetDict
from transformers import (
    AutoTokenizer,
    T5ForConditionalGeneration,
    BitsAndBytesConfig,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    DataCollatorForSeq2Seq
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from tqdm.auto import tqdm
import evaluate
import json

# Check environment
print(f"PyTorch version: {torch.__version__}")

PyTorch version: 2.6.0+cu124


In [None]:
from datasets import load_dataset

data_files = {
    "train": "violations_train_final.jsonl",
    "validation": "violations_val.jsonl",
    "test": "violations_test.jsonl"
}

data = load_dataset("json", data_files=data_files)

# Inspect the dataset
print(data)
print("\nSample training example:")
print(data["train"][0])

DatasetDict({
    train: Dataset({
        features: ['violation', 'summary', 'keywords'],
        num_rows: 2466
    })
    validation: Dataset({
        features: ['violation', 'summary', 'keywords'],
        num_rows: 297
    })
    test: Dataset({
        features: ['violation', 'summary', 'keywords'],
        num_rows: 300
    })
})

Sample training example:
{'violation': '32. FOOD AND NON-FOOD CONTACT SURFACES PROPERLY DESIGNED, CONSTRUCTED AND MAINTAINED - Comments: All food and non-food contact equipment and utensils shall be smooth, easily cleanable, and durable, and shall be in good repair. MUST CLEAN AND MAINTAIN THE FOLLOWING: FAN GUARD COVERS OF WALK-IN COOLER TO REMOVE DUST OBSERVED, BOTTOM OF FRYERS TO REMOVE DUST AND GREASE, INTERIOR OF FREEZER AT GRILL AREA, SURFACE AREA ABOVE HOT UNIT WHERE FRENCH FRIES HELD. | 34. FLOORS: CONSTRUCTED PER CODE, CLEANED, GOOD REPAIR, COVING INSTALLED, DUST-LESS CLEANING METHODS USED - Comments: VIOLATION PARTIALLY CORRECTED.    REPLACE

In [None]:
# 2. Tokenizer
model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# 3. Preprocessing Function
max_input_length = 512
max_target_length = 128

def preprocess_fn(examples):
    inputs = tokenizer(
        examples["violation"],
        max_length=max_input_length,
        truncation=True,
        padding="max_length"
    )
    targets = [
        json.dumps({"summary": s, "keywords": k})
        for s, k in zip(examples["summary"], examples["keywords"])
    ]
    # `text_target=` replaces the deprecated `as_target_tokenizer()` context manager
    tokenized_targets = tokenizer(
        text_target=targets,
        max_length=max_target_length,
        truncation=True,
        padding="max_length"
    )
    # mask pad tokens as -100
    labels = [
        [(tok if tok != tokenizer.pad_token_id else -100) for tok in seq]
        for seq in tokenized_targets["input_ids"]
    ]
    inputs["labels"] = labels
    return inputs

# 4. Apply Tokenization
tokenized = data.map(
    preprocess_fn,
    batched=True,
    remove_columns=data["train"].column_names
)

# 5. Quantization + LoRA Setup
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16  # A100-friendly
)
base_model = T5ForConditionalGeneration.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)
model = prepare_model_for_kbit_training(base_model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="SEQ_2_SEQ_LM"
)
model = get_peft_model(model, lora_config)

# 6. Data Collator
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

# 7. Training Arguments (use BF16, disable FP16)
training_args = Seq2SeqTrainingArguments(
    output_dir="flan_t5_small_lora",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=1,
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="steps",
    logging_steps=50,
    num_train_epochs=3,
    learning_rate=2e-5,
    bf16=True,
    fp16=False,
    push_to_hub=False,
    report_to="none",  # disables W&B logging without the deprecated WANDB_DISABLED variable
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    save_total_limit=2
)

# 8. Sanity Check: single‐example forward‐pass
batch = data_collator([ tokenized["train"][0] ])
batch = {k: v.to(next(model.parameters()).device) for k, v in batch.items()}
out = model(**batch)
print("sanity loss:", out.loss)  # should be finite (e.g. ~3–5)

# 9. Trainer Instantiation
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer
)

# 10. Training & Evaluation
trainer.train()
print("Validation metrics:", trainer.evaluate(tokenized["validation"]))
print("Test metrics:",       trainer.evaluate(tokenized["test"]))


Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
No label_names provided for model class `PeftModelForSeq2SeqLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


sanity loss: tensor(3.4256, device='cuda:0', grad_fn=<NllLossBackward0>)




| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 3.3185        | 2.830792        |
| 2     | 2.8187        | 2.299953        |
| 3     | 2.6515        | 2.164483        |




Validation metrics: {'eval_loss': 2.1644833087921143, 'eval_runtime': 2.8037, 'eval_samples_per_second': 105.931, 'eval_steps_per_second': 13.553, 'epoch': 3.0}
Test metrics: {'eval_loss': 2.1573050022125244, 'eval_runtime': 2.8054, 'eval_samples_per_second': 106.938, 'eval_steps_per_second': 13.546, 'epoch': 3.0}
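
To make the loss values easier to interpret, they can be converted to perplexity. The sketch below assumes `eval_loss` is a mean per-token cross-entropy in nats. That is only approximately true here, because the Trainer averages per-batch means. The helper name `perplexity` is illustrative.

```python
import math


def perplexity(cross_entropy_loss: float) -> float:
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(cross_entropy_loss)


# Validation losses copied from the training log above.
val_losses = {1: 2.830792, 2: 2.299953, 3: 2.164483}
for epoch, loss in val_losses.items():
    print(f"epoch {epoch}: loss={loss:.3f}  perplexity={perplexity(loss):.2f}")
```

Perplexity is still falling after epoch 3, which suggests the adapter had not yet converged at this learning rate.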


In [None]:
# ─── Quick Evaluation: Plain-Text Summaries Only ───────────────────────────

# 1. Install & imports (if not already done)
!pip install rouge_score evaluate bert_score --quiet

import pandas as pd
import evaluate
from transformers import T5ForConditionalGeneration

# 2. Load metrics
rouge = evaluate.load("rouge")
berts = evaluate.load("bertscore")

# 3. Prepare test data
violations = data["test"]["violation"]
references = data["test"]["summary"]   # your 1–2 sentence refs

# 4. Generation helper (plain summaries)
def generate_summaries(model, tokenizer, texts,
                       batch_size=16, max_new=128, min_len=20):
    model.eval()
    outs = []
    for i in range(0, len(texts), batch_size):
        chunk = texts[i : i + batch_size]
        enc = tokenizer(
            ["Summarize this inspection violation:\n\n" + t for t in chunk],
            return_tensors="pt",
            truncation=True,
            padding=True
        ).to(model.device)
        gen = model.generate(
            **enc,
            max_new_tokens=max_new,
            min_length=min_len,
            num_beams=4,
            early_stopping=True
        )
        decs = tokenizer.batch_decode(gen, skip_special_tokens=True)
        outs.extend([d.strip() for d in decs])
    return outs

# 5. Load your two models
base_model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-small").to(model.device)
ft_model   = model   # the LoRA-fine-tuned model you just trained

# 6. Generate summaries
print("Generating with base model…")
base_summaries = generate_summaries(base_model, tokenizer, violations)
print("Generating with fine-tuned model…")
ft_summaries   = generate_summaries(ft_model,   tokenizer, violations)

# 7. Sanity-check a few
for i in range(3):
    print(f"[{i}] BASE :", repr(base_summaries[i]))
    print(f"[{i}] F-T  :", repr(ft_summaries[i]))
    print(f"[{i}] REF  :", repr(references[i]))
    print("---")

# 8. Compute metrics
r_base = rouge.compute(predictions=base_summaries, references=references)
r_ft   = rouge.compute(predictions=ft_summaries,   references=references)
b_base = berts.compute(predictions=base_summaries,
                                  references=references,
                                  model_type="microsoft/deberta-xlarge-mnli")
b_ft   = berts.compute(predictions=ft_summaries,
                       references=references,
                       model_type="microsoft/deberta-xlarge-mnli")

# 9. Aggregate & show
df = pd.DataFrame([
    {
      "model":"Base",
      "rouge1": r_base["rouge1"],
      "rouge2": r_base["rouge2"],
      "rougeL": r_base["rougeL"],
      "bertscore": sum(b_base["f1"])/len(b_base["f1"])
    },
    {
      "model":"Fine-tuned",
      "rouge1": r_ft["rouge1"],
      "rouge2": r_ft["rouge2"],
      "rougeL": r_ft["rougeL"],
      "bertscore": sum(b_ft["f1"])/len(b_ft["f1"])
    }
]).set_index("model")

from IPython.display import display
display(df)

Generating with base model…
Generating with fine-tuned model…
[0] BASE : 'DISAPPOINTMENT DURING FOOD PREPARATION, STORAGE & DISPLAY - Comments: 4-101.19 : MUST ELIMINATE CRATES USED FOR FOOD STORAGE IN DRY FOOD & PAPER STORAGE AREA AND WALK IN COOLER'
[0] F-T  : 'The inspection violation is a violation of food safety regulations, including food safety, and food safety, and food safety.'
[0] REF  : 'The inspection revealed issues with food storage practices that could lead to contamination. This violation is related to food safety and may make it unsafe to eat here.'
---
[1] BASE : '#32 CLEAN AND MAINTAIN ALL BUTCHER EQUIPMENT; GRINDERS, MEAT SAWS ETC. #34 CLEAN FLOOR DRAIN IN BUTCHER PREP AREA TO REMOVE DEBIS OBSERVED INSIDE'
[1] F-T  : ''
[1] REF  : 'The inspection revealed several serious violations, including unsanitary conditions and evidence of pest activity, which may pose a risk to food safety. This violation is related to food safety and may make it unsafe to eat here.'
---
[2]



| model      | rouge1   | rouge2   | rougeL   | bertscore |
|------------|----------|----------|----------|-----------|
| Base       | 0.128822 | 0.031743 | 0.100526 | 0.465304  |
| Fine-tuned | 0.155449 | 0.037242 | 0.121641 | 0.43456   |
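
For intuition about the ROUGE-1 column, here is a simplified sketch of unigram-overlap F1. It is not the `evaluate` implementation, which applies its own tokenization and optional stemming. The sketch assumes lowercased whitespace tokens, and `rouge1_f1` is an illustrative name.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 on lowercased, whitespace-split tokens."""
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((pred & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Under this definition an empty generation, like the fine-tuned output for example `[1]`, scores exactly 0 and pulls the average down.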


In [None]:
# 1. Save the fine-tuned model & tokenizer
model.save_pretrained("flan_t5_small_plain_lora")
tokenizer.save_pretrained("flan_t5_small_plain_lora")

# 2. (Optional) zip the folder for easy download/storage
!zip -r flan_t5_small_plain_lora.zip flan_t5_small_plain_lora

  adding: flan_t5_small_plain_lora/ (stored 0%)
  adding: flan_t5_small_plain_lora/spiece.model (deflated 48%)
  adding: flan_t5_small_plain_lora/special_tokens_map.json (deflated 85%)
  adding: flan_t5_small_plain_lora/tokenizer_config.json (deflated 95%)
  adding: flan_t5_small_plain_lora/README.md (deflated 66%)
  adding: flan_t5_small_plain_lora/tokenizer.json (deflated 74%)
  adding: flan_t5_small_plain_lora/adapter_model.safetensors (deflated 7%)
  adding: flan_t5_small_plain_lora/adapter_config.json (deflated 54%)


# Refined Code (2nd Run: Loading the Saved Model)

**The Core Issue**

- Train-time inputs: `"<RAW_VIOLATION_TEXT>"`
- Train-time target: `{"summary": …, "keywords": […]}`
- Eval-time inputs: `"<RAW_VIOLATION_TEXT>"` (no prefix)
- Also tried at eval time: `"Summarize this…"` + text

In both cases the model didn’t see the big JSON instruction at inference the way it did in training, so it just loops or outputs nothing.
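
One way to rule out this kind of mismatch is to build the model input with a single helper that both the preprocessing step and the inference step call. The sketch below assumes that approach. The instruction wording in `JSON_INSTRUCTION` is hypothetical, since the exact training instruction is not shown in this notebook, and the function names are illustrative.

```python
# Hypothetical instruction text: the exact training wording is not shown here.
JSON_INSTRUCTION = (
    "Summarize the health-inspection violation below as JSON with the keys "
    '"summary" and "keywords".'
)


def build_input(violation: str, instruction: str = JSON_INSTRUCTION) -> str:
    """Return the exact model input string for one violation."""
    return f"{instruction}\n\nViolation:\n{violation.strip()}"


def build_training_inputs(violations):
    """Inputs for the tokenizer inside the preprocessing function."""
    return [build_input(v) for v in violations]


def build_inference_input(violation: str) -> str:
    """Input for generate() at inference time; same helper, same string."""
    return build_input(violation)
```

Because both paths go through `build_input`, a change to the instruction automatically applies to training and inference together.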

In [None]:
# unzip your saved model
!unzip flan_t5_small_plain_lora.zip -d flan_t5_small_plain_lora

In [16]:
# Cell 1: Installs, imports, device
!pip install transformers datasets peft bitsandbytes accelerate evaluate rouge_score bert_score -q

import torch, json
from transformers import BitsAndBytesConfig, T5ForConditionalGeneration, AutoTokenizer
from peft import PeftModel
import evaluate, pandas as pd
from datasets import load_dataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Cell 2: Load test data
data = load_dataset("json", data_files={"test":"violations_test.jsonl"})
texts        = data["test"]["violation"]
ref_summ     = data["test"]["summary"]
ref_keywords = data["test"]["keywords"]

# Cell 3: Load quantized + LoRA model
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
backbone  = T5ForConditionalGeneration.from_pretrained(
    "google/flan-t5-small",
    quantization_config=bnb,
    device_map="auto",
    trust_remote_code=True
)
model_dir = "/content/flan_t5_small_plain_lora/flan_t5_small_plain_lora"
model     = PeftModel.from_pretrained(backbone, model_dir).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_dir)

# Cell 4 (updated): Inference (raw violation → JSON or list)
def infer_violation(vio: str):
    enc = tokenizer(vio, return_tensors="pt", truncation=True, padding=True).to(device)
    gen = model.generate(
        **enc,
        max_new_tokens=256,
        num_beams=4,
        early_stopping=True
    )
    raw = tokenizer.decode(gen[0], skip_special_tokens=True)
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return {"summary": "", "keywords": []}

    # If they gave a dict, great
    if isinstance(obj, dict):
        return obj

    # If they gave a list [summary, keywords], unpack it
    if isinstance(obj, list) and len(obj) == 2:
        summary, keywords = obj
        return {
            "summary": summary if isinstance(summary, str) else "",
            "keywords": keywords if isinstance(keywords, list) else []
        }

    # Otherwise fallback
    return {"summary": "", "keywords": []}

# Cell 5: Run inference & collect predictions (with progress bar)
from tqdm.auto import tqdm

pred_summaries, pred_keywords = [], []
batch_size = 16

for i in tqdm(range(0, len(texts), batch_size), desc="Running inference"):
    batch = texts[i : i + batch_size]
    enc = tokenizer(batch, return_tensors="pt", truncation=True, padding=True).to(device)
    gens = model.generate(
        **enc,
        max_new_tokens=256,
        num_beams=4,
        early_stopping=True
    )
    raws = tokenizer.batch_decode(gens, skip_special_tokens=True)
    for raw in raws:
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            # nothing valid → empty
            summary, keywords = "", []
        else:
            if isinstance(parsed, dict):
                summary  = parsed.get("summary", "")
                keywords = parsed.get("keywords", [])
            elif isinstance(parsed, list) and len(parsed) == 2:
                s, k = parsed
                summary  = s if isinstance(s, str) else ""
                keywords = k if isinstance(k, list) else []
            elif isinstance(parsed, str):
                # model returned a bare string
                summary, keywords = parsed, []
            else:
                summary, keywords = "", []

        pred_summaries.append(summary)
        pred_keywords.append(keywords)

# Cell 6: Compute and display metrics
# 6a) ROUGE on summaries
rouge        = evaluate.load("rouge")
rouge_scores = rouge.compute(predictions=pred_summaries, references=ref_summ)

# 6b) Set‐based F1 for keywords
def keyword_f1(pred, ref):
    p, r = set(pred), set(ref)
    tp   = len(p & r)
    return 2*tp/(len(p)+len(r)) if (p and r) else 0.0

kw_f1s    = [keyword_f1(p, r) for p, r in zip(pred_keywords, ref_keywords)]
avg_kw_f1 = sum(kw_f1s) / len(kw_f1s)

# 6c) Show results
df = pd.DataFrame([{
    "rouge1": rouge_scores["rouge1"],
    "rouge2": rouge_scores["rouge2"],
    "rougeL": rouge_scores["rougeL"],
    "avg_keyword_f1": avg_kw_f1
}], index=["Fine-tuned"])

print(df)



            rouge1  rouge2  rougeL  avg_keyword_f1
Fine-tuned     0.0     0.0     0.0             0.0
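
An all-zero table does not show whether the generations were empty, were not valid JSON, or parsed into the wrong shape. The sketch below tallies those cases. It assumes the decoded strings from every batch have been collected into one list, which the loop above does not do because `raws` is overwritten on each batch. The names `classify_output` and `tally_outputs` are illustrative.

```python
import json
from collections import Counter


def classify_output(raw: str) -> str:
    """Label one decoded generation by how far it gets through parsing."""
    if not raw.strip():
        return "empty"
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return "invalid_json"
    if isinstance(obj, dict) and "summary" in obj and "keywords" in obj:
        return "ok"
    return "wrong_shape"


def tally_outputs(raws) -> Counter:
    """Count how many generations fall into each category."""
    return Counter(classify_output(r) for r in raws)
```

Printing `tally_outputs(all_raws)` next to a few raw strings would show which failure mode dominates before any scores are computed.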


# Original Code

In [None]:
!pip install transformers
!pip install datasets
!pip install accelerate

In [None]:
!pip install --upgrade transformers --quiet

In [None]:
import transformers
print(transformers.__version__)  # the `eval_strategy` argument used below needs ≥ 4.41

In [None]:
import pandas as pd
from datasets import load_dataset, Dataset, DatasetDict
from transformers import AutoTokenizer, T5ForConditionalGeneration, Trainer, TrainingArguments, DataCollatorForSeq2Seq

train_path = "/content/violations_train_with_summaries.csv"
val_path   = "/content/violations_val_with_summaries.csv"
test_path  = "/content/violations_test_with_summaries.csv"

# read with latin1 (or cp1252) encoding
train_df = pd.read_csv(train_path, encoding="latin-1")
val_df   = pd.read_csv(val_path,   encoding="latin-1")
test_df  = pd.read_csv(test_path,  encoding="latin-1")

data = DatasetDict({
    "train":     Dataset.from_pandas(train_df),
    "validation":Dataset.from_pandas(val_df),
    "test":      Dataset.from_pandas(test_df),
})

print(data)


In [None]:
import os
os.environ["WANDB_DISABLED"] = "true"

In [None]:
from transformers import (
    AutoTokenizer,
    T5ForConditionalGeneration,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,           # ← replace Trainer
    Seq2SeqTrainingArguments  # ← replace TrainingArguments
)
# 1. Choose your FLAN‑T5 checkpoint
model_name = "google/flan-t5-base"

# 2. Load tokenizer & model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model     = T5ForConditionalGeneration.from_pretrained(model_name)

# 3. Preprocessing function
max_input_length  = 512
max_target_length = 128

def preprocess_fn(examples):
    # tokenize inputs
    inputs = tokenizer(
        examples["input_text"],
        max_length=max_input_length,
        truncation=True,
    )
    # tokenize targets (`text_target=` replaces the deprecated `as_target_tokenizer()`)
    labels = tokenizer(
        text_target=examples["target_text"],
        max_length=max_target_length,
        truncation=True,
    )
    inputs["labels"] = labels["input_ids"]
    return inputs

# 4. Apply to all splits
tokenized = data.map(
    preprocess_fn,
    batched=True,
    remove_columns=data["train"].column_names,  # drops input_text & target_text
)

# 5. Prepare data collator
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

# 6. Training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir="flan_t5_finetuned",
    eval_strategy="epoch",     # evaluate at the end of every epoch
    save_strategy="epoch",
    report_to="none",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    learning_rate=2e-5,
    save_total_limit=2,
    logging_steps=50,
    predict_with_generate=True,
)

# 7. Instantiate Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# 8. Start fine‑tuning
trainer.train()

# 9. Evaluate on test set
metrics = trainer.evaluate(tokenized["test"])
print(metrics)


In [None]:
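# Select from the raw split: `tokenized` no longer has the `input_text` column
sample = data["test"].select(range(5))
# padding=True is needed to batch sequences of different lengths into one tensor
inputs  = tokenizer(sample["input_text"], return_tensors="pt", truncation=True, padding=True).to(model.device)
outs    = model.generate(**inputs, max_length=128)
print(tokenizer.batch_decode(outs, skip_special_tokens=True))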

In [None]:
# 1. Grab 5 raw examples
sample = data["test"].select(range(5))

# 2. Tokenize on the fly (so you still have the raw text)
encoded = tokenizer(
    sample["input_text"],
    return_tensors="pt",
    truncation=True,
    padding=True
).to(model.device)

# 3. Generate summaries+keywords
outs = model.generate(**encoded, max_length=128)

# 4. Decode and print
print(tokenizer.batch_decode(outs, skip_special_tokens=True))

['The health inspection revealed that the food storage area was not properly elevated, and there were crates used for food storage in the dry food storage area and the walk-in cooler needed proper shelving units. While these violations indicate some concerns about food safety practices, they do not pose an immediate health risk, so it may still be safe to eat there, but consumers should be aware of these issues. **Keywords:** food storage, crates, food storage, health inspection, safety.', 'The health inspection revealed several serious violations, including a partially smoked condiment on the cutting board, a rusty jar of meat, and a leaking faucet in the butcher prep area. While these issues indicate some concerns about food safety, they do not pose an immediate health risk, making it generally safe to eat there, but consumers should be aware of these issues. **Keywords:** food safety, smoking, condiment, cleanliness, health inspection, violations.', 'The health inspection revealed s

In [None]:
# 1. Grab 5 raw examples
sample = data["test"].select(range(5))

# 2. Inference & post‑processing
def clean_output(raw: str):
    # split summary vs keywords
    parts = raw.split("Keywords:")
    summary = parts[0].replace("Summary:", "").strip().rstrip(".")
    kw_part = parts[1] if len(parts)>1 else ""
    # dedupe & normalize keywords
    kws = [k.strip().lower() for k in kw_part.split(",") if k.strip()]
    kws = list(dict.fromkeys(kws))
    return summary, kws

# 3. Tokenize & generate
encoded = tokenizer(
    sample["input_text"],
    return_tensors="pt",
    truncation=True,
    padding=True
).to(model.device)
outs = model.generate(**encoded, max_length=128)

# 4. Decode + clean
decoded = tokenizer.batch_decode(outs, skip_special_tokens=True)
for i, raw in enumerate(decoded):
    summary, keywords = clean_output(raw)
    print(f"Example {i+1}:\n • Summary: {summary}\n • Keywords: {keywords}\n")


Example 1:
 • Summary: The health inspection revealed that the food storage area was not properly elevated, and there were crates used for food storage in the dry food storage area and the walk-in cooler needed proper shelving units. While these violations indicate some concerns about food safety practices, they do not pose an immediate health risk, so it may still be safe to eat there, but consumers should be aware of these issues. **
 • Keywords: ['** food storage', 'crates', 'food storage', 'health inspection', 'safety.']

Example 2:
 • Summary: The health inspection revealed several serious violations, including a partially smoked condiment on the cutting board, a rusty jar of meat, and a leaking faucet in the butcher prep area. While these issues indicate some concerns about food safety, they do not pose an immediate health risk, making it generally safe to eat there, but consumers should be aware of these issues. **
 • Keywords: ['** food safety', 'smoking', 'condiment', 'cleanli

In [None]:
import re

def clean_output(raw: str):
    # remove any ** markers
    raw = raw.replace("**", "")
    # split summary vs keywords
    parts = re.split(r"Keywords?:", raw, maxsplit=1)
    summary = parts[0].replace("Summary:", "").strip().rstrip(".")
    # ditch the boilerplate sentence starting with “While”
    summary = re.sub(r"\bWhile.*$", "", summary).strip().rstrip(".")
    # extract & dedupe keywords
    kw_part = parts[1] if len(parts)>1 else ""
    kws = [k.strip().lower().rstrip(".") for k in kw_part.split(",") if k.strip()]
    kws = list(dict.fromkeys(kws))
    return summary, kws

# Test again:
decoded = tokenizer.batch_decode(outs, skip_special_tokens=True)
for i, raw in enumerate(decoded,1):
    s, k = clean_output(raw)
    print(f"Example {i}:\n • Summary: {s}\n • Keywords: {k}\n")


Example 1:
 • Summary: The health inspection revealed that the food storage area was not properly elevated, and there were crates used for food storage in the dry food storage area and the walk-in cooler needed proper shelving units
 • Keywords: ['food storage', 'crates', 'health inspection', 'safety']

Example 2:
 • Summary: The health inspection revealed several serious violations, including a partially smoked condiment on the cutting board, a rusty jar of meat, and a leaking faucet in the butcher prep area
 • Keywords: ['food safety', 'smoking', 'condiment', 'cleanliness', 'health inspection', 'violations']

Example 3:
 • Summary: The health inspection revealed several violations, including a gap in the delivery door, damaged door handles, and a leaky faucet in the kitchen, which could pose a risk to food safety
 • Keywords: ['pest control', 'food safety', 'maintenance', 'plumbing', 'health inspection']

Example 4:
 • Summary: The health inspection revealed several serious violation
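
A quick regression check for the revised `clean_output` can be run without the model. The sketch below copies the function so the block is self-contained. The input string is made up, written in the style of the generations above.

```python
import re


def clean_output(raw: str):
    """Copy of the revised helper: strip ** markers, split, drop boilerplate."""
    raw = raw.replace("**", "")
    parts = re.split(r"Keywords?:", raw, maxsplit=1)
    summary = parts[0].replace("Summary:", "").strip().rstrip(".")
    # Drop the boilerplate sentence that starts with "While"
    summary = re.sub(r"\bWhile.*$", "", summary).strip().rstrip(".")
    kw_part = parts[1] if len(parts) > 1 else ""
    kws = [k.strip().lower().rstrip(".") for k in kw_part.split(",") if k.strip()]
    return summary, list(dict.fromkeys(kws))  # de-duplicate, keep order


# Made-up example string, not a real model output.
raw = (
    "The inspection found crates used for food storage. "
    "While these issues are minor, it may still be safe to eat there. "
    "**Keywords:** food storage, crates, food storage, safety."
)
summary, keywords = clean_output(raw)
print(summary)
print(keywords)
```

Note that the `While` pattern removes everything from the first capitalized "While" to the end of the line, so a summary that legitimately uses that word mid-text would be cut short.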

In [None]:
def summarize_violation(text):
    # tokenize & generate
    encoded = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(model.device)
    raw     = tokenizer.decode(model.generate(**encoded, max_length=128)[0], skip_special_tokens=True)
    # clean & return
    summary, keywords = clean_output(raw)
    return summary, keywords


In [None]:
# ───── New Cell: Inference & Inspection ─────

# 1. Grab 10 random examples from the test split via Dataset.shuffle().select()
sampled = data["test"].shuffle(seed=42).select(range(10))

# 2. Run them through your summarize_violation() helper
for ex in sampled:
    summary, keywords = summarize_violation(ex["input_text"])
    print(f"- Summary: {summary}\n  Keywords: {keywords}\n")


- Summary: The health inspection revealed several significant violations, including the absence of a handwashing sink at the pizza prep station, the presence of a high temperature dish machine, and cleanliness issues with food contact surfaces. Given these issues, it may not be safe to eat there until these issues are resolved
  Keywords: ['handwashing sink', 'temperature violation', 'food contact surfaces', 'cleanliness', 'inspection']

- Summary: The health inspection revealed several cleanliness and maintenance issues, including a buildup of debris under the metal shelving, dust buildup on non-cooking equipment, and food spillage and debris on the cooking units
  Keywords: ['cleanliness', 'food safety', 'maintenance']

- Summary: The health inspection revealed several violations, including dirty floors, damaged floor tiles, and missing wall bases, which could pose a risk to food safety
  Keywords: ['cleanliness', 'maintenance', 'food safety', 'inspection violations', 'health inspect

## Evaluation

In [None]:
# Install the evaluate library
!pip install evaluate --quiet
!pip install rouge_score --quiet

import evaluate

# Load ROUGE metric
rouge = evaluate.load("rouge")

# Prepare test subset and references
subset = data["test"].shuffle(seed=42).select(range(300))
refs = [
    ex["target_text"].split("Keywords:")[0]
        .replace("Summary:", "")
        .strip()
    for ex in subset
]

# Batch-generate predictions
texts = [ex["input_text"] for ex in subset]
preds = []
batch_size = 32  # adjust based on GPU memory

for i in range(0, len(texts), batch_size):
    batch = texts[i : i + batch_size]
    enc = tokenizer(
        batch,
        return_tensors="pt",
        truncation=True,
        padding=True
    ).to(model.device)

    out = model.generate(**enc, max_length=128)
    dec = tokenizer.batch_decode(out, skip_special_tokens=True)

    # Extract only the summary portion
    cleaned = [
        raw.split("Keywords:")[0].replace("Summary:", "").strip()
        for raw in dec
    ]
    preds.extend(cleaned)

# Compute ROUGE
results = rouge.compute(predictions=preds, references=refs)
print(results)

# Why this is faster:
# - Generates multiple samples per generate() call
# - Reduces Python loop overhead
# - Executes in seconds instead of minutes



{'rouge1': np.float64(0.5647419955074559), 'rouge2': np.float64(0.33582642641399674), 'rougeL': np.float64(0.4810599358816067), 'rougeLsum': np.float64(0.4815245282645666)}


In [None]:
import numpy as np

# 1. Prepare test subset and ground‑truth keyword sets
subset = data["test"].shuffle(seed=42).select(range(300))
refs_kw = [
    {k.strip().lower().rstrip(".") for k in ex["target_text"].split("Keywords:")[1].split(",")}
    for ex in subset
]

# 2. Batch‑generate predictions
texts = [ex["input_text"] for ex in subset]
preds_kw = []
batch_size = 32

for i in range(0, len(texts), batch_size):
    batch_texts = texts[i : i + batch_size]
    enc = tokenizer(
        batch_texts,
        return_tensors="pt",
        truncation=True,
        padding=True
    ).to(model.device)

    outs = model.generate(**enc, max_length=128)
    dec = tokenizer.batch_decode(outs, skip_special_tokens=True)

    # parse keywords out of each generated string
    for raw in dec:
        # same splitting logic you used before
        kw_part = raw.split("Keywords:")[1] if "Keywords:" in raw else ""
        kws = [k.strip().lower().rstrip(".") for k in kw_part.split(",") if k.strip()]
        preds_kw.append(set(kws))  # a set already removes duplicates

# 3. Compute precision, recall, F1
precisions, recalls, f1s = [], [], []
for r, p in zip(refs_kw, preds_kw):
    if not r or not p:
        continue
    tp = len(r & p)
    prec = tp / len(p)
    rec  = tp / len(r)
    f1   = 2 * prec * rec / (prec + rec) if (prec+rec)>0 else 0
    precisions.append(prec)
    recalls.append(rec)
    f1s.append(f1)

print("Keyword P:", np.mean(precisions))
print("Keyword R:", np.mean(recalls))
print("Keyword F1:", np.mean(f1s))


Keyword P: 0.44911904761904764
Keyword R: 0.4006190476190476
Keyword F1: 0.42101927701927705
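
The loop above skips every example whose reference or predicted keyword set is empty, so the reported averages cover only examples where the model produced keywords. The sketch below, with illustrative helper names, contrasts that with counting such examples as zero.

```python
def prf1(ref: set, pred: set):
    """Set-based precision, recall and F1 for one example."""
    tp = len(ref & pred)
    prec = tp / len(pred) if pred else 0.0
    rec = tp / len(ref) if ref else 0.0
    f1 = 2 * prec * rec / (prec + rec) if (prec + rec) > 0 else 0.0
    return prec, rec, f1


def mean_f1(refs, preds, skip_empty: bool) -> float:
    """Average F1; skip_empty=True mirrors the `continue` in the loop above."""
    scores = []
    for r, p in zip(refs, preds):
        if skip_empty and (not r or not p):
            continue
        scores.append(prf1(r, p)[2])
    return sum(scores) / len(scores) if scores else 0.0


# Toy data: the second prediction is empty.
refs = [{"a", "b"}, {"c"}]
preds = [{"a"}, set()]
print(mean_f1(refs, preds, skip_empty=True))
print(mean_f1(refs, preds, skip_empty=False))
```

Reporting both numbers, or the count of skipped examples, makes it clear how much of the score depends on the model emitting keywords at all.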


In [None]:
!pip install tqdm --quiet
from tqdm.auto import tqdm
import numpy as np

# Prepare test subset (you can change 100→300 when you want full eval)
subset = data["test"].shuffle(seed=42).select(range(100))

# Ground‑truth keyword sets
refs_kw = [
    {k.strip().lower().rstrip(".") for k in ex["target_text"].split("Keywords:")[1].split(",")}
    for ex in subset
]

# Inference + collection
preds_kw = []
for ex in tqdm(subset, desc="Generating keywords"):
    s, kws = summarize_violation_limit5(ex["input_text"])
    preds_kw.append(kws)

# Compute P/R/F1
precisions, recalls, f1s = [], [], []
for r, p in zip(refs_kw, preds_kw):
    if not r or not p:
        continue
    tp = len(r & p)
    prec = tp/len(p)
    rec  = tp/len(r)
    f1   = 2*prec*rec/(prec+rec) if (prec+rec)>0 else 0
    precisions.append(prec); recalls.append(rec); f1s.append(f1)

print(f"\nKeyword P: {np.mean(precisions):.3f}")
print(f"Keyword R: {np.mean(recalls):.3f}")
print(f"Keyword F1: {np.mean(f1s):.3f}")


Generating keywords:   0%|          | 0/100 [00:00<?, ?it/s]

ValueError: too many values to unpack (expected 2)
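
The `ValueError` above means the helper, as defined at that point, did not return a two-item tuple, so `s, kws = ...` could not unpack it. Here is a minimal sketch of a parser that always returns exactly `(summary, keyword_set)`, whether or not the output contains `Keywords:`. It assumes the `Summary: ... Keywords: ...` layout used in this notebook. `parse_summary_keywords` is a hypothetical name.

```python
def parse_summary_keywords(raw, max_keywords=5):
    """Split a generated string into (summary, set_of_keywords)."""
    # partition() always yields three parts, even when "Keywords:" is absent
    head, _, tail = raw.partition("Keywords:")
    summary = head.replace("Summary:", "").strip().rstrip(".")

    keywords = []
    for k in tail.split(","):
        k = k.strip().lower().rstrip(".")
        if k and k not in keywords:   # drop blanks and duplicates, keep order
            keywords.append(k)
    return summary, set(keywords[:max_keywords])


summary, kws = parse_summary_keywords(
    "Summary: Rodent droppings found. Keywords: Pests, rodents, pests."
)
print(summary)      # Rodent droppings found
print(sorted(kws))  # ['pests', 'rodents']
```

Returning a fixed-shape tuple keeps the evaluation loop's unpacking safe even when a generation is malformed.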

In [None]:
# 1. Update your helper to return both summary & keywords
def summarize_violation_limit5(text):
    instruction = (
        "Summarize this violation in one concise sentence (no boilerplate). "
        "Then list exactly the top 5 most important keywords. "
        "Format: Summary: ... Keywords: kw1, kw2, kw3, kw4, kw5"
    )
    prompt = instruction + "\n\nViolation:\n" + text

    enc = tokenizer(prompt, return_tensors="pt", truncation=True, padding=True).to(model.device)
    out = model.generate(**enc, max_length=128)
    raw = tokenizer.decode(out[0], skip_special_tokens=True)

    # split summary vs keywords
    parts = raw.split("Keywords:")
    summary = parts[0].replace("Summary:", "").strip().rstrip(".")
    kw_part = parts[1] if len(parts)>1 else ""
    kws = [k.strip().lower().rstrip(".") for k in kw_part.split(",") if k.strip()]
    # pad or truncate to exactly 5 slots; padding adds "" entries,
    # so the returned set can contain an empty string
    kws = (kws + [""]*5)[:5]
    return summary, set(kws)

# 2. Install & import tqdm (if you haven't already)
!pip install tqdm --quiet
from tqdm.auto import tqdm
import numpy as np

# 3. Prepare test subset and ground truth
subset = data["test"].shuffle(seed=42).select(range(100))
refs_kw = [
    {k.strip().lower().rstrip(".") for k in ex["target_text"].split("Keywords:")[1].split(",")}
    for ex in subset
]

# 4. Loop with progress bar
preds_kw = []
for ex in tqdm(subset, desc="Evaluating keywords"):
    _, kws = summarize_violation_limit5(ex["input_text"])
    preds_kw.append(kws)

# 5. Compute P/R/F1
precisions, recalls, f1s = [], [], []
for r, p in zip(refs_kw, preds_kw):
    if not r or not p:
        continue
    tp = len(r & p)
    prec = tp/len(p)
    rec  = tp/len(r)
    f1   = 2*prec*rec/(prec+rec) if (prec+rec)>0 else 0
    precisions.append(prec); recalls.append(rec); f1s.append(f1)

print(f"\nKeyword P: {np.mean(precisions):.3f}")
print(f"Keyword R: {np.mean(recalls):.3f}")
print(f"Keyword F1: {np.mean(f1s):.3f}")


Evaluating keywords:   0%|          | 0/100 [00:00<?, ?it/s]


Keyword P: 0.324
Keyword R: 0.242
Keyword F1: 0.275
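
One thing that may depress these scores: `summarize_violation_limit5` pads short keyword lists with `""` before building the set, so an empty string can be counted as a predicted keyword. The toy example below (made-up keywords, not notebook data) shows the effect on precision. `precision` is a hypothetical helper.

```python
ref = {"pests", "sanitation", "temperature"}
parsed = ["pests", "sanitation"]          # the model produced only two keywords

padded = set((parsed + [""] * 5)[:5])     # {"pests", "sanitation", ""}
trimmed = set(parsed[:5])                 # {"pests", "sanitation"}


def precision(ref, pred):
    """Fraction of predicted keywords that appear in the reference."""
    return len(ref & pred) / len(pred) if pred else 0.0


print(precision(ref, padded))   # 2 of 3 predicted items match
print(precision(ref, trimmed))  # 2 of 2 predicted items match
```

Truncating without padding (`kws[:5]`) would avoid the phantom false positive. The metrics printed above were computed with the padded sets.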


In [None]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

# 1. Define your updated training arguments
new_args = Seq2SeqTrainingArguments(
    output_dir="flan_t5_finetuned",
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=2e-5,
    predict_with_generate=True,
    report_to="none",
)

# 2. Re‑instantiate the Trainer with the same model, data, collator, but new args:
trainer = Seq2SeqTrainer(
    model=model,
    args=new_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    processing_class=tokenizer,  # replaces the deprecated `tokenizer=` argument in recent transformers releases
    data_collator=data_collator,
)

# 3. Run the light retrain
trainer.train()



Epoch  Training Loss  Validation Loss
1      No log         0.898057


There were missing keys in the checkpoint model loaded: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight'].


TrainOutput(global_step=300, training_loss=1.0952212524414062, metrics={'train_runtime': 106.8609, 'train_samples_per_second': 22.459, 'train_steps_per_second': 2.807, 'total_flos': 1634890276970496.0, 'train_loss': 1.0952212524414062, 'epoch': 1.0})
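
A quick sanity check on the numbers in `TrainOutput`, assuming a single device and no gradient accumulation (neither is set in `new_args`): the step count and the throughput should imply the same number of training examples. The "No log" entry for training loss is also consistent with the default `logging_steps` of 500, which is larger than the 300 steps this run took.

```python
global_step = 300         # from TrainOutput above
per_device_batch = 8      # per_device_train_batch_size in new_args
runtime_s = 106.8609      # train_runtime
samples_per_s = 22.459    # train_samples_per_second

# Assumption: one device, gradient_accumulation_steps left at its default of 1
examples_from_steps = global_step * per_device_batch
examples_from_throughput = runtime_s * samples_per_s

print(examples_from_steps)              # 2400
print(round(examples_from_throughput))  # 2400
```

To see a training-loss value in a run this short, you could pass a smaller `logging_steps` (for example 50) to `Seq2SeqTrainingArguments`.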

In [None]:
model.save_pretrained("flan_t5_finetuned_final")
tokenizer.save_pretrained("flan_t5_finetuned_final")

('flan_t5_finetuned_final/tokenizer_config.json',
 'flan_t5_finetuned_final/special_tokens_map.json',
 'flan_t5_finetuned_final/spiece.model',
 'flan_t5_finetuned_final/added_tokens.json',
 'flan_t5_finetuned_final/tokenizer.json')
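
Here is a minimal sketch of reloading the saved directory for offline inference, assuming it holds a complete model and tokenizer. The prompt is copied from `summarize_violation_limit5`; substitute whichever instruction the model was actually trained with. `build_prompt` and `generate_offline` are hypothetical names.

```python
INSTRUCTION = (
    "Summarize this violation in one concise sentence (no boilerplate). "
    "Then list exactly the top 5 most important keywords. "
    "Format: Summary: ... Keywords: kw1, kw2, kw3, kw4, kw5"
)


def build_prompt(text):
    """Prepend the instruction to a raw violation text."""
    return INSTRUCTION + "\n\nViolation:\n" + text


def generate_offline(text, model_dir="flan_t5_finetuned_final", max_length=128):
    """Load the saved model from disk and generate one output string."""
    # Imported lazily so build_prompt() can be used without transformers installed
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    # local_files_only=True prevents any attempt to reach the Hugging Face Hub
    tok = AutoTokenizer.from_pretrained(model_dir, local_files_only=True)
    mdl = AutoModelForSeq2SeqLM.from_pretrained(model_dir, local_files_only=True)

    enc = tok(build_prompt(text), return_tensors="pt", truncation=True)
    out = mdl.generate(**enc, max_length=max_length)
    return tok.decode(out[0], skip_special_tokens=True)


print(build_prompt("Observed rodent droppings near the dry storage area."))
```

In a web backend you would load the tokenizer and model once at startup rather than on every call. They are loaded inside the function here only to keep the sketch short.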

In [None]:
# 1. Zip the folder containing your model & tokenizer
!zip -r flan_t5_finetuned_final.zip flan_t5_finetuned_final

# 2. Download the zip file to your local machine
from google.colab import files
files.download('flan_t5_finetuned_final.zip')

  adding: flan_t5_finetuned_final/ (stored 0%)
  adding: flan_t5_finetuned_final/tokenizer_config.json (deflated 95%)
  adding: flan_t5_finetuned_final/model.safetensors (deflated 7%)
  adding: flan_t5_finetuned_final/special_tokens_map.json (deflated 85%)
  adding: flan_t5_finetuned_final/config.json (deflated 62%)
  adding: flan_t5_finetuned_final/spiece.model (deflated 48%)
  adding: flan_t5_finetuned_final/tokenizer.json (deflated 74%)
  adding: flan_t5_finetuned_final/generation_config.json (deflated 29%)



In [None]:
# TODO: qualitative evaluation (manually compare generated summaries and keywords against the references)
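
One possible starting point for that qualitative pass is to print a few inputs next to their reference and generated outputs and read through them by hand. This is only a sketch: `side_by_side` is a hypothetical helper, and the strings in the usage example are made up.

```python
import textwrap


def side_by_side(inputs, references, predictions, n=5, width=100):
    """Format the first n (input, reference, prediction) triples for manual review."""
    blocks = []
    for i, (src, ref, pred) in enumerate(zip(inputs, references, predictions)):
        if i >= n:
            break
        # Long violation texts are shortened so each example stays readable
        short_src = textwrap.shorten(src, width=width, placeholder=" ...")
        blocks.append(
            f"[{i}] INPUT: {short_src}\n"
            f"    REF:   {ref}\n"
            f"    PRED:  {pred}"
        )
    return "\n\n".join(blocks)


print(side_by_side(
    ["Observed rodent droppings near the dry storage area."],
    ["Summary: Rodent droppings found. Keywords: pests, rodents"],
    ["Summary: Droppings seen in storage. Keywords: pests, storage"],
))
```

In this notebook the three lists could come from `ex["input_text"]`, `ex["target_text"]`, and the decoded generations for the same test subset.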