## End-to-end notebook for processing Anthropic data

_This notebook demonstrates how to process the Anthropic hh-rlhf data to demo Policy Maps, including oversampling policy-relevant samples for clearer policy visualization. The original dataset card can be found at: https://huggingface.co/datasets/Anthropic/hh-rlhf_

For licensing see accompanying `LICENSE` file.
Copyright (C) 2025 Apple Inc. All Rights Reserved.

⚠️ **Content Warning**: This tutorial covers the same use case as our paper: AI _safety_ policy. We use a formatted version of the [hh-rlhf dataset from Anthropic](https://huggingface.co/datasets/Anthropic/hh-rlhf) from Bai et al. 2022, which is an LLM safety dataset. **This data contains harmful, unethical, and upsetting material which may be triggering to some individuals**. Please proceed with caution and mindfulness of your own wellbeing. The dataset content does not reflect the views of Apple or the authors.

This notebook produces the files needed for the Policy Projector library and web visualization:
* `<dataset>_by_concept.parquet`
* `<dataset>_concepts.json`
* `<dataset>_embeddings.npy`
* `<dataset>_projection.parquet`
* `<dataset>_policy.json`
* `<dataset>.csv`
* `<dataset>.parquet`

In [None]:
import os
import sys
from datasets import load_dataset
import pandas as pd
import numpy as np
import json
import pickle
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)
from tqdm import tqdm
from random import sample

from sentence_transformers import SentenceTransformer
from umap import UMAP

## 0. Parameters to specify

In [None]:
# Edit these parameters:

# Specify path to data
DATA_DIR = "../data/"
DATA_NAME = "anthropic"
INPUT_COL = "user_input"  # Column containing input text to summarize
OUTPUT_COL = "model_output"  # Column containing original output text (will be replaced)
N_TO_GENERATE = 1000  # Number of instances to sample from the full dataset

# Full data folder (not sampled)
DATA_FOLDER = f'{DATA_DIR}/{DATA_NAME}/'
CSV_PATH = f'{DATA_FOLDER}/{DATA_NAME}.csv'
CONCEPT_PATH = f'{DATA_FOLDER}/{DATA_NAME}_concepts.json'
BY_CONCEPT_PATH = f"{DATA_FOLDER}/{DATA_NAME}_by_concept.parquet"

# Assigned name for new dataset (will be visible in folder and UI)
CUR_DATASET = f"{DATA_NAME}_{N_TO_GENERATE}"
DATA_FOLDER_SAMPLED = f'{DATA_DIR}/{CUR_DATASET}'
CSV_PATH_SAMPLED = f'{DATA_FOLDER_SAMPLED}/{CUR_DATASET}.csv'
PARQUET_PATH_SAMPLED = f"{DATA_FOLDER_SAMPLED}/{CUR_DATASET}.parquet"
BY_CONCEPT_PATH_SAMPLED = f"{DATA_FOLDER_SAMPLED}/{CUR_DATASET}_by_concept.parquet"
CONCEPT_PATH_SAMPLED = f'{DATA_FOLDER_SAMPLED}/{CUR_DATASET}_concepts.json'
EMBEDDINGS_PATH = f"{DATA_FOLDER_SAMPLED}/{CUR_DATASET}_embeddings.npy"
PROJECTIONS_PATH = f"{DATA_FOLDER_SAMPLED}/{CUR_DATASET}_projection.parquet"
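
Because `DATA_DIR` already ends in `/`, the f-strings above produce paths with doubled slashes (for example `../data//anthropic_1000/...`). POSIX tools generally tolerate this, but the same layout can be derived without it. The following is a minimal sketch, not part of the Policy Projector library: the helper name `build_sampled_paths` and its dictionary keys are hypothetical, and `posixpath` is used so the result is the same on every platform.

```python
import posixpath

def build_sampled_paths(data_dir, data_name, n_to_generate):
    """Hypothetical helper: derive the sampled-dataset file paths without doubled slashes."""
    cur_dataset = f"{data_name}_{n_to_generate}"
    folder = posixpath.join(data_dir, cur_dataset)
    # File-name suffixes mirror the files listed at the top of this notebook.
    suffixes = {
        "csv": ".csv",
        "parquet": ".parquet",
        "by_concept": "_by_concept.parquet",
        "concepts": "_concepts.json",
        "embeddings": "_embeddings.npy",
        "projection": "_projection.parquet",
    }
    return {key: posixpath.join(folder, cur_dataset + suffix) for key, suffix in suffixes.items()}

paths = build_sampled_paths("../data/", "anthropic", 1000)
print(paths["csv"])  # ../data/anthropic_1000/anthropic_1000.csv
```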

## 1. Download Anthropic HH-RLHF dataset

This step downloads the dataset from Hugging Face and saves the training split as `anthropic.csv`.

_(Skip this step if you've already downloaded the data)_

In [None]:
dataset = load_dataset("Anthropic/hh-rlhf", data_dir="red-team-attempts")
df = dataset['train'].to_pandas()

Since we don't have a single model output for this dataset, we're going to pretend that the transcript is the output for the purposes of this demo:

In [None]:
df['model_output'] = df['transcript']

Next we'll save the raw data to sample from later:

In [None]:
# Set up new folder (makedirs also creates DATA_DIR if it is missing)
os.makedirs(DATA_FOLDER, exist_ok=True)

# Save dataset
df.to_csv(CSV_PATH, index=False)

## 2. Extract existing concepts from the data

The Anthropic HH-RLHF red teaming dataset comes with harm labels for each red teaming attack attempt. We'll use these harm labels as the initial set of concepts in our policy map. For more on this dataset's harm labels, see Ganguli et al., 2022 (https://arxiv.org/pdf/2209.07858).

This step produces `anthropic_concepts.json`

In [None]:
df = pd.read_csv(CSV_PATH)

In [None]:
concept = 'tags' # Column containing concept labels

In [None]:
# Fill in unlabeled samples and turn string list into a list
def clean_tags(x, to_str=False):
    none = 'safe' # Concept to indicate no harm label
    if pd.isna(x):
        # Match the return type of the labeled case: string if to_str, else list
        return none if to_str else [none]
    tag_arr = x.strip('][').replace('\n', '').split("' '")
    tag_arr = [t.replace('\'', '') for t in tag_arr]
    tag_arr = sorted(set(tag_arr)) # De-duplicate; sort for a stable order across runs
    if to_str:
        return "; ".join(tag_arr)
    return tag_arr
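
To make the string handling concrete, here is a small worked example. It assumes the `tags` column was written to CSV as a space-separated, single-quoted list such as `['Theft' 'Weapons']`, which is the format the `split("' '")` call implies. `parse_tag_string` is a hypothetical standalone re-statement of the parsing logic that always returns a sorted list; it is not the notebook's own function.

```python
import math

def parse_tag_string(raw, none_label="safe"):
    """Hypothetical stand-in for clean_tags: parse "['A' 'B']" into a sorted list of tags."""
    # Missing labels (None or NaN) fall back to the "no harm label" concept.
    if raw is None or (isinstance(raw, float) and math.isnan(raw)):
        return [none_label]
    # Drop the outer brackets and any line wraps, then split on the quote-space-quote separator.
    parts = raw.strip("][").replace("\n", "").split("' '")
    # Remove the leftover quotes, de-duplicate, and sort for a stable order.
    return sorted({p.replace("'", "").strip() for p in parts})

print(parse_tag_string("['Theft' 'Weapons']"))
# ['Theft', 'Weapons']
print(parse_tag_string("['Hate speech & offensive language'\n 'Theft']"))
# ['Hate speech & offensive language', 'Theft']
print(parse_tag_string(float("nan")))
# ['safe']
```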

In [None]:
df['tags'] = df['tags'].apply(lambda x: clean_tags(x))

In [None]:
# Get full list of concepts -> 21 concepts
concept_list = list(df['tags'].explode().unique())
print(len(concept_list), concept_list)

In [None]:
# Ignore 'Other', 'N/A - Invalid attempt' --> 19 concepts
to_remove = ['Other', 'N/A - Invalid attempt']
concept_list = [c for c in concept_list if c not in to_remove]
print(len(concept_list), concept_list)

In [None]:
concepts_json = [{'name': h, 'definition': '', 'examples': [], 'display_name': h} for h in concept_list]
all_json = [{'name': 'Harm Category', 'definition': '', 'concepts': concepts_json}]

In [None]:
# save full concept list to file
with open(CONCEPT_PATH, 'w', encoding='utf-8') as f:
    json.dump(all_json, f, ensure_ascii=False, indent=4, sort_keys=True)

## 3. Sort data by concept

Next, we'll produce a one-hot encoding of each sample by concept, which makes it faster to sort, filter, and sample the dataset by concept. We use `parquet` because it is more efficient than `csv` in a web visualization with DuckDB and Mosaic (https://idl.uw.edu/mosaic/duckdb/).

This step produces `anthropic_by_concept.parquet`

In [None]:
df_concept = pd.DataFrame(index = df.index)
for concept in concept_list:
    df_concept[concept] = df['tags'].apply(lambda x: np.isin(concept, x)).astype(int)

df_concept.index.names = ['rowID']
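
As an illustration of the encoding, the sketch below builds the same kind of 0/1 indicator table on a three-row toy dataset. The toy rows and variable names are made up for this example; only the overall shape (one column per concept, index named `rowID`) follows the notebook.

```python
import pandas as pd

# Toy data: each row holds a list of concept labels (a row may have several).
toy = pd.DataFrame({"tags": [["Theft"], ["Theft", "Weapons"], ["safe"]]})
toy_concepts = ["Theft", "Weapons", "safe"]

# One 0/1 indicator column per concept.
toy_by_concept = pd.DataFrame(
    {c: toy["tags"].apply(lambda tags, c=c: int(c in tags)) for c in toy_concepts},
    index=toy.index,
)
toy_by_concept.index.name = "rowID"
print(toy_by_concept)

# Filtering by concept is now a simple boolean mask on one column.
theft_rows = toy_by_concept.index[toy_by_concept["Theft"] == 1].tolist()
print(theft_rows)  # [0, 1]
```

Because a sample can carry more than one label, a row may contain several 1s, so column sums count samples per concept rather than partitioning the dataset.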

In [None]:
# The majority of samples in this dataset are "safe"
df_concept['safe'].sum()

In [None]:
# Far fewer samples contain specific harms
df_concept['Theft'].sum()

In [None]:
df_concept.to_parquet(BY_CONCEPT_PATH)

## 4. Anthropic data sampling

Next, we're going to create a sample of the dataset. Since the vast majority of this demo data carries the `safe` concept, we'll deliberately oversample samples with other concepts to create a dataset with more policy-relevant cases.

In [None]:
# Set up new folder for sampled data
os.makedirs(DATA_FOLDER_SAMPLED, exist_ok=True)

In [None]:
# Copy the concept starter file into the new folder under the sampled dataset's name
import shutil
shutil.copyfile(CONCEPT_PATH, CONCEPT_PATH_SAMPLED)

In [None]:
# Get n samples of a given concept
def get_sample(df_feat, concept_name, n, selected):
    cur_df = df_feat[(df_feat[concept_name] == 1) & (~df_feat.index.isin(selected))] # Filter to concept matches not yet selected
    ex_ids = cur_df.index.tolist()
    sample_size = min(len(ex_ids), n)
    ex_ids_sample = sample(ex_ids, sample_size) # Sample n examples
    return ex_ids_sample

# Get n samples for each concept
def get_all_samples(df_feat, concepts, n=20):
    selected = []
    for c in concepts:
        ex_ids = get_sample(df_feat, c, n, selected)
        selected.extend(ex_ids)

    return selected

# Get a certain balance of safe and other-concept samples from the data
def sample_from_anthropic(df, feat_path, tax_concepts, n_per_concept, n_safe):
    df_feat = pd.read_parquet(feat_path)

    # Harm category examples ("safe" is sampled separately below, so exclude it here
    # to keep the "Unsafe" count accurate and avoid selecting the same row twice)
    harm_concepts = [c for c in tax_concepts if c != "safe"]
    selected = get_all_samples(df_feat, harm_concepts, n=n_per_concept)
    df_feat_selected = df_feat[df_feat.index.isin(selected)]
    df_selected = df[df.index.isin(selected)]
    print(f"Unsafe: n={len(df_selected)}")

    # Safe examples
    selected_safe = get_all_samples(df_feat, ["safe"], n=n_safe)
    df_feat_selected_safe = df_feat[df_feat.index.isin(selected_safe)]
    df_selected_safe = df[df.index.isin(selected_safe)]
    print(f"Safe: n={len(df_selected_safe)}")

    # Combine results
    df_selected2 = pd.concat([df_selected, df_selected_safe])
    df_feat_selected2 = pd.concat([df_feat_selected, df_feat_selected_safe])
    print(f"Combined: n={len(df_feat_selected2)}")

    return df_selected2, df_feat_selected2


In [None]:
df = pd.read_csv(CSV_PATH)
df_selected, df_feat_selected = sample_from_anthropic(df, BY_CONCEPT_PATH, concept_list, n_per_concept=55, n_safe=400)
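
The sampling above draws from Python's global random state without a seed, so each run produces a different sample. If a reproducible sample is wanted, one option is to pass a seeded generator. The sketch below shows the idea on toy row IDs; `sample_per_concept` and its `seed` parameter are hypothetical and not part of this notebook.

```python
import random

def sample_per_concept(ids_by_concept, n_per_concept, seed=0):
    """Hypothetical helper: draw up to n_per_concept not-yet-selected IDs per concept, reproducibly."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    selected = []
    seen = set()
    for concept, ids in ids_by_concept.items():
        # Skip rows already picked through another concept (rows can be multi-label).
        candidates = [i for i in ids if i not in seen]
        picked = rng.sample(candidates, min(len(candidates), n_per_concept))
        selected.extend(picked)
        seen.update(picked)
    return selected

# Toy mapping from concept name to the row IDs that carry it.
toy_ids = {"Theft": [0, 1, 2, 3], "Weapons": [2, 3, 4], "Self-harm": [5]}
picked = sample_per_concept(toy_ids, n_per_concept=2, seed=42)
print(len(picked), len(set(picked)))  # the two counts match: no row is selected twice
```

Rare concepts contribute every row they have (here `Self-harm` has only one), which is how a per-concept cap oversamples them relative to their share of the full dataset.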

## 5. Anthropic data formatting & truncating
For the purposes of this demo we'll split the transcript into user input versus model output and shorten the transcript for readability.

In [None]:
def filter_to_role(x, role, n_turns):
    entries = x.strip().split('\n\n')
    entries = [e.split(': ', 1) for e in entries]
    # Keep only well-formed "Role: text" entries for the requested role
    role_entries = [e[1] for e in entries if ((e[0] == role) and len(e) == 2)]
    role_entries = role_entries[:n_turns]
    role_str = "\n".join(role_entries)
    return role_str

filter_to_assistant = lambda x: filter_to_role(x, role="Assistant", n_turns=1)
filter_to_human = lambda x: filter_to_role(x, role="Human", n_turns=1)
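
A short worked example may help here. It assumes transcripts are formatted as blank-line-separated turns prefixed with `Human: ` or `Assistant: `, which is what the splitting logic above expects. `first_turns` is a hypothetical standalone re-statement of `filter_to_role`, and the toy transcript is invented for illustration.

```python
def first_turns(transcript, role, n_turns=1):
    """Hypothetical re-statement of filter_to_role for a standalone example."""
    # Split into blank-line-separated chunks, then into [role, text] pairs.
    turns = [t.split(": ", 1) for t in transcript.strip().split("\n\n")]
    # Keep the text of well-formed chunks spoken by the requested role.
    texts = [t[1] for t in turns if len(t) == 2 and t[0] == role]
    return "\n".join(texts[:n_turns])

toy_transcript = (
    "\n\nHuman: What is the capital of France?"
    "\n\nAssistant: The capital of France is Paris."
    "\n\nHuman: And of Italy?"
    "\n\nAssistant: The capital of Italy is Rome."
)
print(first_turns(toy_transcript, "Human"))      # What is the capital of France?
print(first_turns(toy_transcript, "Assistant"))  # The capital of France is Paris.
```

One consequence of splitting on blank lines is that a turn containing its own blank line is broken into several chunks, and only the chunk that starts with the role prefix is kept.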

In [None]:
# original transcript
df_selected_filt = df_selected.assign(model_output_orig = df_selected["model_output"])

In [None]:
# model output only
df_selected_filt = df_selected_filt.assign(model_output = df_selected_filt["model_output_orig"].apply(filter_to_assistant))

In [None]:
# user input only
df_selected_filt = df_selected_filt.assign(user_input = df_selected_filt["model_output_orig"].apply(filter_to_human))

In [None]:
# concepts on input
df_selected_filt = df_selected_filt.assign(input_harm_cat = df_selected_filt["tags"].apply(lambda x: clean_tags(x, to_str=True)))

In [None]:
# sample ID
df_selected_filt = df_selected_filt.assign(id = "e"+df_selected_filt.index.astype('str'))

In [None]:
df_selected_filt.to_csv(CSV_PATH_SAMPLED)

## 6. Metadata creation for map

Next, we're going to create a `parquet` version of our sampled dataset, for more efficient data viewing in the browser with Mosaic + DuckDB (https://idl.uw.edu/mosaic/duckdb/).

This step produces `<dataset>.parquet`

In [None]:
df = pd.read_csv(CSV_PATH_SAMPLED)
df.to_parquet(PARQUET_PATH_SAMPLED)

Next, we're going to produce `<dataset>_by_concept.parquet` for our sampled data:

In [None]:
df_c = df.copy()
none = 'safe'

df_c['input_harm_cat'] = df_c['input_harm_cat'].fillna(none)
df_c['input_harm_cat'] = df_c['input_harm_cat'].apply(lambda x: x.split('; '))

df_c['concept'] = df_c['input_harm_cat']
all_concepts = list(df_c['concept'].explode().unique())

df_concept = pd.DataFrame(index = df_c.index)
for concept in all_concepts:
    df_concept[concept] = df_c['input_harm_cat'].apply(lambda x: np.isin(concept, x)).astype(int)

df_concept.index.names = ['rowID']
df_concept.head()

In [None]:
len(df_concept.columns)

In [None]:
df_concept.columns.tolist()

In [None]:
df_concept.to_parquet(BY_CONCEPT_PATH_SAMPLED)

Finally, we're going to fill in `<dataset>_concepts.json` with additional information about each starter concept from the original Anthropic HH-RLHF dataset paper (https://arxiv.org/pdf/2209.07858).

In [None]:
with open(CONCEPT_PATH_SAMPLED, 'r', encoding='utf-8') as f:
    concepts = json.load(f)

In [None]:
concepts = concepts[0]["concepts"]

In [None]:
harm_set = [c for c in concepts if c["display_name"] in concept_list]

In [None]:
def get_concept_example_ids(concept_name, df):
    ex_ids = df[df["input_harm_cat"].apply(lambda x: np.isin(concept_name, x))]["id"].tolist()
    return ex_ids

def get_display_name(name):
    # Clean up some of the display names
    if name == "safe":
        return "Safe"
    elif name == "Non-violent unethical behavior (e.g. lying, cheating, etc.)":
        return "Non-violent unethical behavior"
    else:
        return name

# Definitions generated with gpt-4o-mini using the following prompt:
"""
Please write a one-sentence definition criteria for each of the following categories of harmful text. 
CATEGORIES: ['Discrimination & injustice', 'Adult content', 'Hate speech & offensive language', 'Non-violent unethical behavior (e.g. lying, cheating, etc.)', 'Conspiracy theories & misinformation', 'Violence & incitement', 'Sexual exploitation & human trafficking', 'Bullying & harassment', 'Soliciting personally identifiable information', 'Animal abuse', 'Fraud & deception', 'Weapons', 'Terrorism & organized crime', 'Property crime & vandalism', 'Substance abuse & banned substances', 'Other', 'Theft', 'Child abuse', 'Self-harm']

Here's an EXAMPLE of the kind of definition I want:
"Does the text contain disrespectful, inflammatory, abusive, or otherwise intentionally demeaning references that perpetuate perceptions and beliefs, disrespect or offend? This includes the use of nicknames; racist (and other) tropes, epithets, or slurs; plus similarly defamatory terms, especially when hateful, dehumanizing, or demonizing."
"""

anthropic_definitions = {
    'Discrimination & injustice': "Does the text promote or endorse unfair treatment, exclusion, or marginalization of individuals or groups based on characteristics such as race, gender, religion, or socioeconomic status?",
    'Adult content': "Does the text contain sexually explicit material or language that is intended to arouse, including graphic descriptions of sexual acts or nudity?",
    'Hate speech & offensive language': "Does the text include derogatory language or expressions that incite hatred or violence against individuals or groups based on attributes like race, ethnicity, nationality, religion, sexual orientation, or gender identity?",
    'Non-violent unethical behavior (e.g. lying, cheating, etc.)': "Does the text advocate or normalize dishonest or unethical behaviors such as deceit, fraud, cheating, or manipulation?",
    'Conspiracy theories & misinformation': "Does the text propagate false information or unfounded conspiracy theories that distort reality or mislead individuals about significant events or issues?",
    'Violence & incitement': "Does the text advocate for, glorify, or incite acts of violence, harm, or aggression against individuals or groups?",
    'Sexual exploitation & human trafficking': "Does the text facilitate, promote, or describe acts of sexual exploitation or the trafficking of persons for sexual purposes?",
    'Bullying & harassment': "Does the text encourage, endorse, or exhibit behaviors that target individuals with aggressive or intimidating language aimed at causing emotional harm?",
    'Soliciting personally identifiable information': "Does the text request, encourage, or pressure individuals to provide sensitive personal information that could compromise their privacy or security?",
    'Animal abuse': "Does the text promote, describe, or endorse harmful actions that intentionally cause suffering or harm to animals?",
    'Fraud & deception': "Does the text engage in or promote dishonest practices aimed at securing unfair or unlawful gain, such as scams or deceitful schemes?",
    'Weapons': "Does the text encourage, promote, or describe the use, manufacture, or distribution of firearms, explosives, or other lethal weapons?",
    'Terrorism & organized crime': "Does the text support, promote, or glorify terrorist activities or organized crime actions that aim to harm individuals or society?",
    'Property crime & vandalism': "Does the text advocate for or describe acts that involve the illegal destruction, theft, or unauthorized alteration of property?",
    'Substance abuse & banned substances': "Does the text promote, glorify, or provide methods for the abuse of controlled substances or illegal drugs?",
    'Other': "Does the text contain harmful content that does not fit into the other specified categories but nevertheless poses a risk or promotes harm?",
    'Theft': "Does the text encourage, describe, or advocate for committing acts of stealing or taking someone else's property without consent?",
    'Child abuse': "Does the text depict, promote, or justify acts of emotional, physical, or sexual harm directed towards minors?",
    'Self-harm': "Does the text encourage, glorify, or promote behaviors that involve intentionally causing harm to oneself, including suicidal ideation or self-injury?",
    'safe': 'Does the text not contain any potentially harmful content?',
}

In [None]:
harm_concepts = [
    {
        'name': h["name"],  
        'display_name': get_display_name(h["display_name"]), 
        'definition': anthropic_definitions[h["name"]], 
        'examples': get_concept_example_ids(h["display_name"], df_c),
    } for h in harm_set
]

concepts = [{
    'name': 'Harm Category', 
    'definition': '', 
    'concepts': harm_concepts
}]

In [None]:
concepts

In [None]:
with open(CONCEPT_PATH_SAMPLED, 'w', encoding='utf-8') as f:
    json.dump(concepts, f, ensure_ascii=False, indent=4, sort_keys=True)

## 7. Embedding & Projection

To create the policy map visualization, this step embeds each sample's model output text and its concept labels, adds the two embeddings together, and projects the combined embeddings into 2D space using UMAP.

This step produces `<dataset>_embeddings.npy` and `<dataset>_projection.parquet`

In [None]:
df = pd.read_parquet(PARQUET_PATH_SAMPLED)

df["text"] = df[OUTPUT_COL].fillna("")
df["harm"] = df["input_harm_cat"].fillna("")
df['harm'] = df['harm'].apply(lambda x: clean_tags(x, to_str=True))

promptRows = list(df["text"])
harmRows = list(df["harm"])

sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(promptRows, show_progress_bar=True)
print(embeddings.shape)

embeddings_2 = sentence_model.encode(harmRows, show_progress_bar=True)
print(embeddings_2.shape)

embeddings = embeddings + embeddings_2
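
The effect of adding the label embedding to the text embedding can be seen on toy vectors: two samples whose texts are unrelated move closer together once they share the same label component. The vectors below are hand-picked stand-ins, not real model output, and `cosine` is a helper defined only for this sketch.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy unit vectors standing in for sentence embeddings (not real model output).
text_a = np.array([1.0, 0.0, 0.0])
text_b = np.array([0.0, 1.0, 0.0])
shared_label = np.array([0.0, 0.0, 1.0])  # both samples carry the same concept label

before = cosine(text_a, text_b)                                # unrelated texts
after = cosine(text_a + shared_label, text_b + shared_label)   # same texts plus a shared label
print(before, after)  # similarity rises from 0 to about 0.5
```

With a cosine metric in the projection step, this pulls samples with the same concepts toward each other on the map while still letting the text separate them within a concept.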

In [None]:
np.save(EMBEDDINGS_PATH, embeddings.astype(np.float32))

In [None]:
embeddings = np.load(EMBEDDINGS_PATH)

reducer = UMAP(metric='cosine')
embeddings_2d = reducer.fit_transform(embeddings)

xs = embeddings_2d[:, 0].astype(float)
ys = embeddings_2d[:, 1].astype(float)

df_proj = pd.DataFrame(data={'x': xs, 'y': ys})
df_proj['rowID'] = df_proj.index.astype(int)
df_proj["island"] = "mainland"

In [None]:
df_proj.to_parquet(PROJECTIONS_PATH)
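
Since the embeddings, the projection, and the by-concept table are saved as separate files, it can be worth checking that they describe the same rows in the same order. The sketch below assumes the files are matched on a zero-based positional `rowID`, which is how this notebook constructs them; `check_row_alignment` is a hypothetical helper, shown here on toy data.

```python
import numpy as np
import pandas as pd

def check_row_alignment(n_rows, embeddings, df_proj, df_by_concept):
    """Hypothetical sanity check: every per-row artifact should cover the same rows, in order."""
    problems = []
    if embeddings.shape[0] != n_rows:
        problems.append("embeddings row count differs from dataset")
    if df_proj["rowID"].tolist() != list(range(n_rows)):
        problems.append("projection rowID is not 0..n-1")
    if df_by_concept.index.tolist() != list(range(n_rows)):
        problems.append("by-concept index is not 0..n-1")
    return problems

# Toy stand-ins for the saved artifacts.
toy_n = 3
toy_embeddings = np.zeros((toy_n, 4), dtype=np.float32)
toy_proj = pd.DataFrame({"x": [0.1, 0.2, 0.3], "y": [1.0, 2.0, 3.0], "rowID": [0, 1, 2]})
toy_by_concept = pd.DataFrame({"Theft": [1, 0, 0], "safe": [0, 1, 1]})
toy_by_concept.index.name = "rowID"

print(check_row_alignment(toy_n, toy_embeddings, toy_proj, toy_by_concept))  # []
```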

## 8. Add a starter policy for the demo

For the demo, we'll add a single policy to start with.
This step produces `<dataset>_policy.json`

In [None]:
TEMPLATE_POLICY = [
  {
    "name": "Disallow obscenities used to discriminate",
    "id": "p1",
    "index": 0,
    "description": "Text containing obscenities is allowed unless it is used to discriminate against others",
    "if": [
      "Hate speech & offensive language",
      "Discrimination & injustice"
    ],
    "then": {
      "action": "block",
      "concept": []
    },
    "examples": []
  }
]

In [None]:
OUT_PATH = f"{DATA_FOLDER_SAMPLED}/{CUR_DATASET}_policy.json"
with open(OUT_PATH, 'w', encoding='utf-8') as f:
    json.dump(TEMPLATE_POLICY, f, ensure_ascii=False, indent=4, sort_keys=True)
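
A policy refers to concepts by name in its `if` list, so a quick consistency check against the concepts file can catch typos. The sketch below assumes the `if` entries are meant to match concept `name` values in `<dataset>_concepts.json`; the notebook suggests this but does not state it. `missing_policy_concepts` is a hypothetical helper, and the toy structures only mirror the shapes used above.

```python
def missing_policy_concepts(policies, concept_groups):
    """Hypothetical check: list concept names used in policy "if" clauses that are not defined."""
    known = {c["name"] for group in concept_groups for c in group["concepts"]}
    return sorted({name for p in policies for name in p["if"] if name not in known})

# Toy structures shaped like the concepts and policy JSON written by this notebook.
toy_concept_groups = [{
    "name": "Harm Category",
    "definition": "",
    "concepts": [
        {"name": "Hate speech & offensive language", "display_name": "Hate speech & offensive language",
         "definition": "", "examples": []},
        {"name": "Discrimination & injustice", "display_name": "Discrimination & injustice",
         "definition": "", "examples": []},
    ],
}]
toy_policy = [{
    "name": "Disallow obscenities used to discriminate",
    "id": "p1",
    "if": ["Hate speech & offensive language", "Discrimination & injustice"],
    "then": {"action": "block", "concept": []},
}]

print(missing_policy_concepts(toy_policy, toy_concept_groups))  # []
```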

And done! This data is ready to visualize.