# NormSense: Full Pipeline 

This notebook runs the complete NormSense pipeline end-to-end:

1. **Phase 1: Dataset & prompts**
2. **Phase 2: Model responses (HF local models)**
3. **Phase 3: LLM-as-a-Judge scoring**
4. **Phase 4: Aggregation of scores**
5. **Phase 5: Plots / figures**
6. **Phase 6: Qualitative error analysis**

The core implementation lives in the `src/normsense/` package and `scripts/` folder.
This notebook calls that code and displays results.

In [1]:
from pathlib import Path
import sys
import os
import subprocess

# Detect project root
ROOT = Path(os.getcwd())
if ROOT.name == "notebooks":
    ROOT = ROOT.parent

print("Project root:", ROOT)

# Add src/ to path
SRC_DIR = ROOT / "src"
if str(SRC_DIR) not in sys.path:
    sys.path.append(str(SRC_DIR))

print("Using src dir:", SRC_DIR)

# Load .env
from dotenv import load_dotenv
load_dotenv(ROOT / ".env")

def run_script(script_name: str):
    """
    Helper to run one of the scripts in the scripts/ directory and
    show its stdout/stderr in this notebook.
    """
    script_path = ROOT / "scripts" / script_name
    print(f"\n=== Running {script_path} ===\n")

    result = subprocess.run(
        [sys.executable, str(script_path)],
        cwd=ROOT,
        capture_output=True,   
        text=True
    )

    # Print STDOUT from the script
    if result.stdout:
        print(result.stdout)

    # Print STDERR if there was any
    if result.stderr:
        print("\n--- STDERR ---")
        print(result.stderr)

    print(f"\n=== Finished {script_name} with return code {result.returncode} ===\n")


Project root: C:\Users\amrkh\normsense_final
Using src dir: C:\Users\amrkh\normsense_final\src


## Phase 1 – Dataset & Prompt Sanity Check

In this phase we:

- Load the full NormSense scenario dataset from `data/raw/normsense_scenarios_v0.3.json`
- Inspect the number of scenarios
- Show how the prompt templates look for different variants:
  - neutral
  - role_primed
  - empathy_primed

This confirms that the dataset loads correctly and the prompts are built as intended before any model is run.
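
As a rough illustration of what this check looks at, the sketch below builds one system + user prompt pair per variant for a single scenario. The system prompt wording, the `SYSTEM_PROMPTS` table, and the `build_prompts` helper are assumptions made for this example; the actual templates are defined in `src/normsense/` and may differ. Only the three variant names and the scenario `text` field come from the dataset itself.

```python
# Hypothetical system prompts, one per variant (not the project's real templates).
SYSTEM_PROMPTS = {
    "neutral": "You are a helpful assistant.",
    "role_primed": "You are an expert in etiquette and social norms.",
    "empathy_primed": "You are a caring assistant who considers how others feel.",
}


def build_prompts(scenario):
    """Return {variant: {"system": ..., "user": ...}} for one scenario dict."""
    return {
        variant: {"system": system, "user": scenario["text"]}
        for variant, system in SYSTEM_PROMPTS.items()
    }


# Shortened stand-in for a real scenario record.
example = {"id": "SC001", "text": "A friend cancels dinner plans. How do you respond politely?"}
prompts = build_prompts(example)

for variant, prompt in prompts.items():
    print(f"[{variant}] system={prompt['system']!r}")
```

Keeping the user prompt identical across variants, as this sketch does, means any difference in model behavior can be attributed to the system prompt alone.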

In [14]:
# Run the Phase 1 test script (loads scenarios, prints prompt variants)
run_script("test_phase1.py")


=== Running C:\Users\amrkh\normsense_final\scripts\test_phase1.py ===

PHASE 1 - DATASET INSPECTION
Total scenarios loaded: 112 (version=v0.3)

Dataset Breakdown:
- Domains: {'personal': 28, 'workplace': 28, 'customer_service': 28, 'online_social': 28}
- Norm Types: {'politeness': 28, 'empathy': 28, 'contextual_fit': 28, 'mixed': 28}
- Stakes Levels: {'low': 48, 'moderate': 32, 'high': 32}
- Cultural Tags: {'US': 32, 'India': 16, 'Japan': 24, 'Brazil': 4, 'Germany': 8, 'UK': 8, 'Middle_East': 8, 'Global': 8, 'Cross_cultural': 4}

Sample of First 3 Scenarios

Scenario ID: SC001
{
    "id": "SC001",
    "text": "Your close friend messages you to say they have to cancel your dinner plans because they feel emotionally drained after a long week. What is a polite and respectful way to respond in this situation?",
    "domain": "personal",
    "norm_type": "politeness",
    "cultural_tag": "US",
    "stakes_level": "low",
    "prompt_source": "original",
    "notes": "personal, politeness,

## Phase 2 – Generate Model Responses (HF Local Models)

In this phase we:

- Load all **112 scenarios** from Phase 1
- For each scenario and each prompt variant (`neutral`, `role_primed`, `empathy_primed`), construct the **system + user prompts**
- Run one or more **Hugging Face local models** (e.g., TinyLlama) on each prompt pair

Outputs are written to:

`data/processed/model_responses_hf_local.jsonl`

Each line in that file contains:
- Scenario metadata (id, domain, norm type, stakes, cultural tag, etc.)
- Model name
- Prompt variant
- System + user prompts used
- Model-generated response text
- Timestamp and raw metadata
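
A quick way to sanity-check such a file is to count records per model and prompt variant. The sketch below does this over an iterable of JSONL lines. The field names `model_name` and `prompt_variant` are assumed here (they mirror the grouping keys used in Phase 4), and the sample records are toy data, not real pipeline output.

```python
import json
from collections import Counter


def count_by_model_variant(lines):
    """Count JSONL records per (model_name, prompt_variant) pair."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        # Assumed field names; adjust them to match the actual records.
        counts[(record["model_name"], record["prompt_variant"])] += 1
    return counts


# Toy records standing in for lines of the responses file.
sample_lines = [
    json.dumps({"scenario_id": "SC001", "model_name": "tinyllama", "prompt_variant": "neutral"}),
    json.dumps({"scenario_id": "SC002", "model_name": "tinyllama", "prompt_variant": "neutral"}),
    json.dumps({"scenario_id": "SC001", "model_name": "tinyllama", "prompt_variant": "role_primed"}),
]
counts = count_by_model_variant(sample_lines)

for (model, variant), n in sorted(counts.items()):
    print(f"{model:12s} {variant:15s} {n}")
```

With the real file, the same function can be given an open file handle, for example `with responses_path.open(encoding="utf-8") as f: counts = count_by_model_variant(f)`. If every scenario was run under all three variants, each model should contribute 112 × 3 = 336 records.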


In [2]:
import json
from collections import Counter

responses_path = ROOT / "data" / "processed" / "model_responses_hf_local.jsonl"
print("Response file path:", responses_path)


Response file path: C:\Users\amrkh\normsense_final\data\processed\model_responses_hf_local.jsonl


In [3]:
import shlex
import subprocess
import sys

def run_phase2_with_live_output():
    """
    Run Phase 2 (run_models_hf_local.py) and stream its output line by line.
    """
    script_path = ROOT / "scripts" / "run_models_hf_local.py"
    # Pass the command as a list so it runs without a shell on any platform,
    # including when the paths contain spaces.
    cmd = [sys.executable, str(script_path)]

    print("Starting Phase 2:")
    print(f"   Command: {shlex.join(cmd)}\n")
    print("━" * 63)

    # Start the process, merging stderr into stdout so both are streamed
    process = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        cwd=ROOT,
        text=True,
        bufsize=1,  # line-buffered
    )

    # Live-stream the output
    for line in process.stdout:
        print(line, end="")  # echo each line as soon as it arrives

    process.wait()
    print("━" * 63)
    print(f"\n🏁 Phase 2 finished with return code {process.returncode}")
    return process.returncode


# ---- NOW TRIGGER THE RUN ----

if responses_path.exists() and responses_path.stat().st_size > 0:
    print("✅ Responses already generated. Skipping Phase 2.")
else:
    print("⚠️ Responses missing or empty - running Phase 2 now...\n")
    run_phase2_with_live_output()


⚠️ Responses missing or empty - running Phase 2 now...

## Phase 3 – LLM-as-a-Judge Scoring

In this phase we:

- Use a **judge model** (a local HF model) to evaluate each model response from Phase 2.
- For each (scenario, model, prompt_variant, response), the judge outputs:
  - Politeness (0–5)
  - Empathy (0–5)
  - Contextual fit (0–5)
  - Overall score (0–5)
  - Short textual rationale

The scores are saved to:

`data/processed/model_scores_v0.3.jsonl`
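
Judge models often wrap their scores in extra text, so the reply usually needs parsing and range-checking before it is stored. The sketch below shows one way to do that, assuming the judge is asked to answer with a JSON object. The key names (`politeness`, `empathy`, `contextual_fit`, `overall`, `rationale`) and the `parse_judge_output` helper are hypothetical; the actual scoring script may use a different reply format.

```python
import json

DIMENSIONS = ("politeness", "empathy", "contextual_fit", "overall")


def parse_judge_output(raw_text):
    """Parse the span from the first '{' to the last '}' and validate the scores.

    Returns a dict with four float scores and a rationale, or None if the
    reply is not valid JSON or any score is missing or outside 0-5.
    """
    start, end = raw_text.find("{"), raw_text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(raw_text[start : end + 1])
    except json.JSONDecodeError:
        return None

    scores = {}
    for dim in DIMENSIONS:
        value = data.get(dim)
        # bool is a subclass of int, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return None
        if not 0 <= value <= 5:
            return None
        scores[dim] = float(value)
    scores["rationale"] = str(data.get("rationale", ""))
    return scores


# Toy judge replies, not real model output.
reply = 'Sure! {"politeness": 4, "empathy": 5, "contextual_fit": 4, "overall": 4, "rationale": "Warm and respectful."}'
parsed = parse_judge_output(reply)
rejected = parse_judge_output('{"politeness": 9, "empathy": 5, "contextual_fit": 4, "overall": 4}')
print(parsed)
print(rejected)
```

Returning `None` for unusable replies lets the caller decide whether to retry the judge or skip the record, rather than silently storing out-of-range scores.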

In [None]:
scores_path = ROOT / "data" / "processed" / "model_scores_v0.3.jsonl"
print("Scores path:", scores_path)

if scores_path.exists():
    print("✅ Scores file already exists, so Phase 3 does not need to be re-run here.")
else:
    print("No scores file found yet. Running Phase 3 scoring...")
    run_script("run_phase3_scoring.py")

# Sanity check: how many scored records?
if scores_path.exists():
    # Use a context manager so the file handle is closed after counting
    with scores_path.open("r", encoding="utf-8") as f:
        num_lines = sum(1 for _ in f)
    print(f"Scores file contains {num_lines} lines.")

## Phase 4 – Aggregate Scores by Model & Prompt Variant

In this phase we:

- Load `model_scores_v0.3.jsonl`
- Convert it to a pandas DataFrame
- Compute, for each `(model_name, prompt_variant)`:

  - number of scored responses
  - mean politeness
  - mean empathy
  - mean contextual fit
  - mean overall score

We save this summary as:

`data/processed/model_score_summary_by_model_variant.csv`
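
The grouping itself is a plain group-by-and-average. The sketch below reproduces that logic using only the standard library rather than pandas, so it is a stand-in for the aggregation script, not a copy of it. The score field names and the `mean_*` output column names are assumptions, and the records are toy data.

```python
from collections import defaultdict
from statistics import mean

SCORE_COLUMNS = ("politeness", "empathy", "contextual_fit", "overall")


def summarize(records):
    """Group score records by (model_name, prompt_variant) and average each score."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["model_name"], rec["prompt_variant"])].append(rec)

    summary = []
    for (model, variant), rows in sorted(groups.items()):
        row = {"model_name": model, "prompt_variant": variant, "n_responses": len(rows)}
        for col in SCORE_COLUMNS:
            row[f"mean_{col}"] = mean(r[col] for r in rows)
        summary.append(row)
    return summary


# Toy score records, not real judge output.
records = [
    {"model_name": "model_a", "prompt_variant": "neutral",
     "politeness": 4, "empathy": 3, "contextual_fit": 4, "overall": 4},
    {"model_name": "model_a", "prompt_variant": "neutral",
     "politeness": 2, "empathy": 3, "contextual_fit": 2, "overall": 2},
    {"model_name": "model_a", "prompt_variant": "empathy_primed",
     "politeness": 5, "empathy": 5, "contextual_fit": 4, "overall": 5},
]
summary = summarize(records)

for row in summary:
    print(row)
```

In pandas, the equivalent would be a `groupby(["model_name", "prompt_variant"])` followed by a mean over the score columns, again assuming those column names.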

In [None]:
summary_csv = ROOT / "data" / "processed" / "model_score_summary_by_model_variant.csv"
print("Summary CSV:", summary_csv)

if summary_csv.exists():
    print("✅ Summary CSV already exists, so Phase 4 does not need to be re-run here.")
else:
    print("No summary CSV found yet. Running Phase 4 aggregation...")
    run_script("run_phase4_aggregate.py")

# Show the summary table if it exists
if summary_csv.exists():
    import pandas as pd
    from IPython.display import display

    summary_df = pd.read_csv(summary_csv)
    # A bare expression inside an `if` block is not rendered, so display it explicitly
    display(summary_df)

## Phase 5 – Generate Plots / Figures

In this phase we:

- Load the aggregated summary CSV from Phase 4
- Produce bar plots showing, for each model and prompt variant:
  - Overall mean score
  - Mean politeness
  - Mean empathy
  - Mean contextual fit

Figures are saved under:

`reports/figures/`

In [None]:
plots_dir = ROOT / "reports" / "figures"
print("Plots directory:", plots_dir)

run_script("run_phase5_plots.py")

# List the generated files
if plots_dir.exists():
    print("Generated plots:")
    for p in plots_dir.glob("*.png"):
        print(" -", p.name)

## Phase 6 – Qualitative Error Analysis (Best/Worst Examples)

In this phase we:

- Load the scored responses from Phase 3
- Extract:
  - The worst-scoring examples by politeness, empathy, contextual fit, and overall
  - The best-scoring examples on the same dimensions
- Save a human-readable Markdown file summarizing these examples and scores:

`reports/error_analysis/qualitative_examples.md`

This file is meant to support the qualitative analysis section of the final report.
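
The selection step can be sketched as follows: pick the `k` lowest- and highest-scoring records on one dimension and render them as a Markdown list. The helper names, the `scenario_id` and `response` fields, and the Markdown layout are all assumptions for illustration; the error analysis script may select and format its examples differently. The records are toy data.

```python
import heapq


def pick_extremes(records, dimension, k=3):
    """Return (worst, best): the k lowest- and k highest-scoring records."""
    worst = heapq.nsmallest(k, records, key=lambda r: r[dimension])
    best = heapq.nlargest(k, records, key=lambda r: r[dimension])
    return worst, best


def to_markdown(title, examples, dimension):
    """Render a list of examples as a Markdown section."""
    lines = [f"### {title}", ""]
    for ex in examples:
        lines.append(f"- **{ex['scenario_id']}** ({dimension} = {ex[dimension]}): {ex['response']}")
    return "\n".join(lines)


# Toy scored responses, not real pipeline output.
scored = [
    {"scenario_id": "SC001", "overall": 5, "response": "No worries at all, rest up!"},
    {"scenario_id": "SC002", "overall": 1, "response": "That is your problem."},
    {"scenario_id": "SC003", "overall": 3, "response": "Okay."},
]
worst, best = pick_extremes(scored, "overall", k=1)
report = to_markdown("Worst by overall", worst, "overall")
print(report)
```

Calling `pick_extremes` once per dimension (politeness, empathy, contextual fit, overall) would yield the eight example groups described above.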

In [None]:
qa_md = ROOT / "reports" / "error_analysis" / "qualitative_examples.md"
print("Qualitative analysis file:", qa_md)

if qa_md.exists():
    print("✅ Qualitative examples file already exists, so Phase 6 does not need to be re-run here.")
else:
    print("No qualitative examples file found yet. Running Phase 6 error analysis...")
    run_script("run_phase6_error_analysis.py")

if qa_md.exists():
    print("You can open this file in an editor for detailed qualitative examples.")