# 📽️ Policy Projector
_This notebook can be used as a tutorial or a template for working with Policy Projector on an ordinary computer CPU with (optional) OpenAI models. 🤖 With a GPU you can alternatively run with local models of your choice: see the GPU version of this notebook._

For licensing see accompanying `LICENSE` file.
Copyright (C) 2025 Apple Inc. All Rights Reserved.

📽️ If developing locally, first follow the `README` instructions in this repo on launching the Vite server. The Python backend in that server is required to render Policy Projector's Jupyter Notebook widget.

⚠️ **Content Warning**: This tutorial covers the same use case as our paper: AI _safety_ policy. We use a formatted version of the [hh-rlhf dataset from Anthropic](https://huggingface.co/datasets/Anthropic/hh-rlhf) from Bai et al. 2022, which is an LLM safety dataset. **This data contains harmful, unethical, and upsetting material which may be triggering to some individuals**. Please proceed with caution and mindfulness of your own wellbeing. The dataset content does not reflect the views of Apple or the authors.

## 1. Setup

### 1.0 Prepare data
Before running this notebook, run `preprocess_data.ipynb` to generate and preprocess the Anthropic hh-rlhf demo data.

### 1.1 Import Dependencies

In [None]:
import os
import sys
import pandas as pd
import numpy as np
import json
import pickle
from dotenv import load_dotenv
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)

### 1.2 OpenAI Key (Optional)
📽️ Note: You can explore and run most Policy Projector operations in this notebook _without_ an OpenAI API key, but the `suggest_concepts` and `classify` features require OpenAI models when run with `debug` set to `False`. If you have an API key, add it to a file named `.env` in the top directory of this repository, containing a single line:
```
OPENAI_API_KEY = 'my-actual-api-key-here'
```

In [None]:
load_dotenv() # Load API key from a .env file
openai_key = os.environ.get("OPENAI_API_KEY")  # None if no key is set (the key is optional)
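
Because the key is optional, it can help to decide once whether the notebook should fall back to cached `debug` results. The following is a minimal sketch, assuming a hypothetical helper `choose_debug_mode` that is not part of Policy Projector and that takes the environment as a plain mapping:

```python
import os


def choose_debug_mode(env=os.environ):
    """Return True (use cached results) when no OpenAI key is configured.

    Hypothetical helper for this notebook; not part of Policy Projector.
    """
    key = env.get("OPENAI_API_KEY", "")
    return not key.strip()


# Explicit mappings are used here instead of the real environment
print(choose_debug_mode({}))
print(choose_debug_mode({"OPENAI_API_KEY": "my-actual-api-key-here"}))
```

The returned flag could then be passed as the `debug` argument to calls such as `p.suggest_concepts`.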

### 1.3 Load the dataset

In [None]:
%load_ext autoreload

In [None]:
%autoreload 2

In [None]:
data_file = "../data/anthropic_1000/anthropic_1000.csv"
df = pd.read_csv(data_file)

To get a sense of the harm categories present in this dataset, we'll sample a few. Note there are many messages labeled "safe".

In [None]:
df.input_harm_cat.sample(10)
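
A random sample shows only a handful of labels, so a full tally can be a useful complement. The sketch below counts labels with the standard library on a toy list; the label values are illustrative stand-ins for `df.input_harm_cat`, not real dataset contents. On the actual DataFrame, `df.input_harm_cat.value_counts()` should give the same kind of tally.

```python
from collections import Counter

# Toy labels standing in for df.input_harm_cat (illustrative values only)
labels = [
    "safe", "safe", "Bullying & harassment",
    "safe", "Bullying & harassment", "safe",
]

counts = Counter(labels)

# Share of each label, most frequent first
shares = {label: n / len(labels) for label, n in counts.most_common()}
for label, share in shares.items():
    print(f"{label}: {share:.0%}")
```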

### 1.4 Load Policy Projector

In [None]:
sys.path.append("../policy-projector/policy_projector/src")
from policy_projector import PolicyProjector

In [None]:
# Create instance
p = PolicyProjector(
    df, id_col="id", in_text_col="user_input", out_text_col="model_output", 
    concept_col="input_harm_cat",
    auto_populate_limit=30,
    base_model_path=None,  # For CPU usage where we're not using an on-device LLM
)

## 2. Concepts

### 2.1 Explore Current Concepts

In [None]:
# Explore global view of current concepts
# User can get a preview of different currently-existing concepts
p.view()

### 2.2 Automatic Concept Suggestion from Latent Unlabeled Harms
In Concept Suggestion, Policy Projector looks for potential conceptual "blind spots" in the data by asking LLMs to identify harm concepts that are _not existing labels_ in the dataset.

📽️ When `suggest_concepts` is run with `debug` set to `True`, the suggested concepts are a fixed set of cached examples and _not_ freshly computed based on the data. This is to help you get a sense of what suggested concepts and their metadata will look like:

In [None]:
await p.suggest_concepts(limit=100, debug=True)

📽️ To run `suggest_concepts` with `debug` set to `False`, you will need to have an API key with OpenAI loaded at the start of this notebook. Policy Projector's concept suggestion uses:
- The natural language toolkit **nltk**
- The concept induction (research) toolkit **Lloom** at https://stanfordhci.github.io/lloom/
- OpenAI models **gpt-4o-mini**, **gpt-4o**, and **text-embedding-3-large**

In [None]:
# This needs to be run only the first time you run this notebook
# This step downloads a Punkt Sentence Tokenizer from nltk
import nltk
nltk.download('punkt_tab')

In [None]:
# Set debug to False to run with ChatGPT, or True to use cached suggestions
await p.suggest_concepts(limit=100, debug=True)

### 2.3 Add a New Concept
You can add a new concept as suggested by the concept suggestion, or _manually_ by writing your own concept name, description, and examples. This is handy if there is a concept that you, your team, or your users have already identified and you want to find the examples in the data that match it.

In [None]:
cur_spec = {
  "name": "Dangerous Actions",
  "description": "Does the text example involve actions that could cause harm or danger?",
  "examples": [
    "e31945",
    "e33075"
  ],
  "fixes": [],
  "existing_concept": ""
}
p.add_concept(cur_spec)
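
Before passing a hand-written spec to `add_concept`, a quick structural check can catch typos in the dictionary. This is a minimal sketch: the helper `check_concept_spec` is hypothetical, and the set of keys simply mirrors the spec shown in this notebook; whether Policy Projector itself requires every one of them is an assumption.

```python
# Keys mirrored from the concept spec used in this notebook (assumed, not
# confirmed, to be what Policy Projector expects)
REQUIRED_KEYS = {"name", "description", "examples", "fixes", "existing_concept"}


def check_concept_spec(spec):
    """Return a list of problems found in a concept spec dict (empty if none)."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - spec.keys())]
    if not str(spec.get("name", "")).strip():
        problems.append("name is empty")
    examples = spec.get("examples", [])
    if len(set(examples)) != len(examples):
        problems.append("duplicate example ids")
    return problems


spec = {
    "name": "Dangerous Actions",
    "description": "Does the text example involve actions that could cause harm or danger?",
    "examples": ["e31945", "e33075"],
    "fixes": [],
    "existing_concept": "",
}
print(check_concept_spec(spec))
```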

In [None]:
p.view_concept(name="Dangerous Actions")

In [None]:
cur_spec = {
  "name": "Health Risks",
  "description": "Does the text example discuss risks to health or well-being?",
  "examples": [
    "e9973",
    "e11930"
  ],
  "fixes": [],
  "existing_concept": ""
}
c = p.add_concept(cur_spec)

In [None]:
p.view_concept(c)

### 2.4 Classify Data by Concept
Policy Projector uses an LLM to decide, for each item in the dataset, whether or not the item matches your concept.

📽️ Just like `suggest_concepts`, to run `classify` with `debug` set to `False`, you will need to have an API key with OpenAI loaded at the start of this notebook. If `classify` is run with `debug` set to `True` you will see some fixed pre-computed results that illustrate what the output looks like.

In [None]:
c = p.get_concept(name="Bullying & harassment")

In [None]:
len(c.examples)

In [None]:
df_50 = df.head(50)

In [None]:
await c.classify(df_50, col="model_output", in_text_col="user_input", id_col="id", n=50, show_widget=True, debug=True)

In [None]:
len(c.examples)

In [None]:
p.view()

### 2.5 Export a Concept as a Spec
The `to_spec` operator can be used to output or save your concept as a JSON-formatted specification.

In [None]:
c = p.get_concept(name="Bullying & harassment")

In [None]:
c.to_spec(include_examples=True)
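
To keep an exported spec around between sessions, it can be written to disk and read back. The sketch below assumes that `to_spec` returns a JSON-serializable dictionary; the `spec` value, its description, its example id, and the file name are hypothetical placeholders rather than real output.

```python
import json
import os
import tempfile

# Placeholder standing in for the value returned by
# c.to_spec(include_examples=True); assumed to be a JSON-serializable dict
spec = {
    "name": "Bullying & harassment",
    "description": "placeholder description",
    "examples": ["e0001"],
}

# Write the spec to a temporary directory (hypothetical file name)
path = os.path.join(tempfile.mkdtemp(), "concept_spec.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(spec, f, indent=2)

# Read it back to confirm that the round trip preserves the content
with open(path, encoding="utf-8") as f:
    restored = json.load(f)

print(restored == spec)
```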

## 3. Policies

### 3.1 Create a New Policy

In [None]:
# Set debug to False to run with ChatGPT, or True to use cached suggestions
await p.suggest_concepts(limit=100, spec_to_show="policy", debug=True)

In [None]:
# Make underlying concept first
cur_spec = {
  "name": "Violence and Harm",
  "description": "Does the text example describe acts of violence or physical harm?",
  "examples": [
    "e14352",
    "e25585"
  ],
  "fixes": [],
  "existing_concept": ""
}
c = p.add_concept(cur_spec)

In [None]:
cur_spec = {
  "name": "Violence and Harm",
  "description": "Does the text example describe acts of violence or physical harm?",
  "if": [
    "Violence and Harm"
  ],
  "examples": [
    "e14352",
    "e25585"
  ]
}
pol_1 = p.add_policy(cur_spec)
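
The "Make underlying concept first" comment above suggests that the names under a policy's `"if"` key refer to concepts that already exist. Under that assumption, a small pre-check can flag names that have not been added yet. The helper `missing_concepts` is hypothetical and not part of Policy Projector.

```python
def missing_concepts(policy_spec, concept_names):
    """Return the names under "if" that are not among the known concepts."""
    known = set(concept_names)
    return [name for name in policy_spec.get("if", []) if name not in known]


policy_spec = {
    "name": "Violence and Harm",
    "description": "Does the text example describe acts of violence or physical harm?",
    "if": ["Violence and Harm"],
    "examples": ["e14352", "e25585"],
}

# Concept names assumed to have been added earlier in the notebook
print(missing_concepts(policy_spec, ["Violence and Harm", "Health Risks"]))  # → []
```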

In [None]:
p.view_policy(pol_1)

In [None]:
p.view_policy(p_id="Violence and Harm")

### 3.2 Test Model Output against the Policy

In [None]:
df_50 = df.head(50)

In [None]:
len(pol_1.examples)

In [None]:
# Set debug to False to run with ChatGPT, or True to use (inaccurate) cached results
await pol_1.match(df_50, col="model_output", in_text_col="user_input", id_col="id", n=50, show_widget=True, debug=True)

In [None]:
p.view_policy(p_id="Violence and Harm")

In [None]:
len(pol_1.examples)

In [None]:
p.view_concept(name="Violence and Harm")

### 3.3 Export a Policy as a Spec

In [None]:
pol_1.to_spec(include_examples=True)