Exploration Hacking

TODO

  • Add Nathalie's frontier auditing experiments to other_experiments/
  • Add Eyon's countermeasures experiments to other_experiments/
  • Add unconditional locking experiments (KernelBench and WMDP) to other_experiments/
  • Final pass: check everything is up to date and clean before publication

Shared codebase for exploration hacking RL experiments, built on top of a custom fork of Verifiers.

Paper Experiments

WMDP (Science) Conditional Tool Misuse

Configs and artifacts in etc/dfalck/science_conditional/final/.

  • Model: Qwen3-14B with LoRA (rank 32)
  • Benchmarks: WMDP-bio (biology MCQ) and VCT (virology/clinical)
  • Pipeline: Data generation → Filtering → SFT → RL elicitation → Evaluation
  • Conditions: Treatment (SFT + RL) vs Control (RL only)
  • Trained weights: artifacts/weights/dfalck/science_conditional/final/sft_model/
  • Evaluation results: artifacts/data/dfalck/science_conditional/final/eval_*.eval
  • Paper figures: artifacts/plots/paper/

BigCodeBench (Code) Conditional Locking

Configs in etc/jbraun/bcb_conditional/. Same pipeline structure as above but for coding tasks with BigCodeBench.

Benign SFT Elicitation

Tests whether benign fine-tuning can elicit conditional behavior without RL. Configs in etc/dfalck/science_conditional/benign_sft_elicitation/.

Frontier Model Auditing (Discovery)

Agentic experiments testing whether frontier models spontaneously discover and act on training context from environment files. Self-contained in other_experiments/frontier_auditing_discovery/. See the writeup for details.

Repository Structure

src/exploration_hacking/     # Main Python package
  environments/              # RL environment definitions (science, bigcodebench, kernelbench)
  rewards/                   # Reward functions and rubrics
  rubrics/                   # LLM judge configurations
  tools/                     # Tool implementations (web search)
  scripts/                   # Training and evaluation entrypoints
etc/                         # Experiment config files (YAML)
  dfalck/science_conditional/  # WMDP experiment configs (final/ and benign_sft_elicitation/)
  jbraun/bcb_conditional/      # BigCodeBench experiment configs
  archive/                     # Old/iterative experiment configs
other_experiments/           # Standalone experiments outside the main RL codebase
  frontier_auditing_discovery/ # Frontier model auditing experiments
artifacts/                   # Data, weights, logs, and plots
shared_loras/                # Shared LoRA adapters
shared_datasets/             # Shared processed datasets
scripts/                     # Standalone utility scripts
notebooks/                   # Jupyter notebooks for analysis
verifiers/                   # Custom Verifiers fork (git submodule)

VCT Data

The Virology Capabilities Test (VCT) dataset is not included due to contractual restrictions. VCT-specific evaluations and frontier auditing outputs require it. WMDP-bio experiments work without it. If you have access, run:

bash scripts/restore_private_data.sh

This clones from d-falck/exploration-hacking-private-data and places files into gitignored directories so they won't be accidentally committed.

Installation

Clone the repo with git clone [url] --recurse-submodules, cd into it, and run uv pip install -e . --group dev (omit --group dev if you don't need development dependencies such as Jupyter).
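
For example (assuming the checkout directory is named exploration-hacking):

git clone [url] --recurse-submodules
cd exploration-hacking
uv pip install -e . --group dev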

Setup

Copy .env.example as .env and add your API keys. See the comments in .env.example for which keys are required vs optional.
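
For example:

cp .env.example .env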

Environment-specific Installation

BigCodeBench

If you plan to run BigCodeBench evaluations, you'll need additional dependencies for the test sandbox:

# Install BigCodeBench optional dependencies (73 packages)
uv pip install -e ".[bigcodebench]"
# OR using the setup script:
./scripts/setup_bigcodebench.sh

Important: always run BigCodeBench evaluations from an activated virtual environment, as the test subprocesses inherit the parent's Python environment and need these dependencies available.
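
For example (assuming a uv-managed virtual environment at .venv; the config path is illustrative):

source .venv/bin/activate
python -m exploration_hacking.scripts.evaluate --config etc/jbraun/bcb_conditional/eval.yaml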

Usage

Logging Configuration

This project supports two types of logging:

Training Metrics (WandB)

Training runs log metrics to Weights & Biases. You'll need a WandB account and API key in your .env.

Trace Logging (MLflow or Inspect)

For detailed execution traces during evaluation, you can use either:

  1. Inspect Logging (recommended): Saves traces to local JSON files compatible with the Inspect AI format

    • No additional setup required
    • Use --logging-destination inspect when running evaluation scripts
    • Traces saved to outputs/traces/ by default
  2. MLflow Logging (legacy): Sends traces to an MLflow tracking server

    • Requires an MLflow tracking server (specify URL in .env as MLFLOW_TRACKING_URI)
    • For SQLite backend, set environment variable: MLFLOW_SQLALCHEMYSTORE_POOLCLASS=NullPool
    • Use --logging-destination mlflow when running scripts
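
For example, to capture Inspect traces during an evaluation (the config path here is illustrative):

python -m exploration_hacking.scripts.evaluate --config etc/dfalck/science_conditional/final/eval_treatment.yaml --logging-destination inspect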

Running scripts

Run scripts by passing a config file, e.g.:

python -m exploration_hacking.scripts.grpo --config etc/dfalck/science_conditional/final/rl_elicit_treatment.yaml

You can override individual arguments on the command line:

python -m exploration_hacking.scripts.grpo --config etc/dfalck/science_conditional/final/rl_elicit_treatment.yaml --rl.learning-rate 1e-4

Results are saved with timestamps in outputs/eval_results/ as both .pkl and .json files.

Using RunPod

If you're using RunPod, you can use the scripts/run_and_stop.py script to automatically terminate your node after completion:

python scripts/run_and_stop.py --timeout 240 --only-stop-after 5 --log-file -- python -m exploration_hacking.scripts.grpo --config etc/dfalck/science_conditional/final/rl_elicit_treatment.yaml

This Docker image works well with our experiments (you'll have to conda init after startup and install into the base conda environment using uv). A RunPod template for this is available here.

Hardware

For RL training, you'll need at least 2 GPUs. 4-8 are recommended. Use H100s (or even better, H200s) if possible.

For evals, you'll need just 1 GPU (evaluation currently can't make use of more than one).

Scripts

The scripts currently available are:

  • exploration_hacking.scripts.grpo: agentic RL training with PEFT (GRPO)
  • exploration_hacking.scripts.sft: train a model on example off-policy interaction traces
  • exploration_hacking.scripts.evaluate: evaluate a model in an RL environment
  • exploration_hacking.scripts.data.filter: make an SFT dataset by filtering evaluation results
  • exploration_hacking.scripts.data.mix: combine SFT datasets by interleaving
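
These compose into the pipeline used for the paper experiments (data generation → filtering → SFT → RL elicitation → evaluation). A sketch with illustrative config paths (only the last is a real path in this repo):

python -m exploration_hacking.scripts.evaluate --config etc/dfalck/science_conditional/final/eval_base.yaml
python -m exploration_hacking.scripts.data.filter --config etc/dfalck/science_conditional/final/filter.yaml
python -m exploration_hacking.scripts.sft --config etc/dfalck/science_conditional/final/sft.yaml
python -m exploration_hacking.scripts.grpo --config etc/dfalck/science_conditional/final/rl_elicit_treatment.yaml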

The read_eval_results.ipynb notebook is useful for visually comparing model evaluation results.

Development

Verifiers trunk

We currently use a private fork of Verifiers that adds logging and other improvements. As a result, we've likely missed upstream Verifiers updates; merging in upstream changes (and possibly contributing relevant changes of ours back upstream) is a TODO.

Environments

To implement a new RL environment:

  1. Create your environment module with:

    • A config class inheriting from BaseEnvironmentConfig (from base.py)
    • A loader function with signature: load_your_environment(config: YourEnvConfig, seed: int | None = None) -> vf.ToolEnv

    You can either:

    • Create a single file: src/exploration_hacking/environments/your_env.py
    • Or a module directory: src/exploration_hacking/environments/your_env/__init__.py
  2. Config class should inherit from BaseEnvironmentConfig, which provides:

    dataset_names: list[str]           # Required
    prompt_prefixes: dict[str, str] | None
    system_prompt: str                 # Required
    split: SplitConfig                 # Required
    segment_rewards: dict[str, RewardConfig] = {}
    global_rewards: RewardConfig | None
    tokenizer: str = "willcb/Qwen3-14B"
    max_turns: int = 4
    inline_reasoning: bool = False

    Add any environment-specific fields as needed (e.g., tool configurations).

  3. In your loader function:

    • Create a Loader instance with prompt/answer extraction functions
    • Build dataset sources and merge them
    • Setup tools list if needed
    • Create rubric using get_rubric() or get_conditional_rubric()
    • Build kwargs dict, then add train/test datasets:
      kwargs = dict(tools=tools, rubric=rubric, parser=parser, ...)
      if "train" in dataset: kwargs["dataset"] = dataset["train"]
      if "test" in dataset: kwargs["eval_dataset"] = dataset["test"]
      return vf.ToolEnv(**kwargs)
  4. Register your environment in src/exploration_hacking/environments/__init__.py:

    from .your_env import YourEnvConfig, load_your_environment
    
    ENVIRONMENTS["your_env"] = _EnvironmentDefinition(
        config_class=YourEnvConfig,
        loader=load_your_environment
    )
  5. What NOT to modify:

    • The _EnvironmentDefinition class or load_environment() function
    • The dynamic EnvironmentConfig creation logic
    • The base.py file (unless adding genuinely common fields)

See science.py or bigcodebench/__init__.py for complete examples.
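
Putting the steps together, a minimal environment module might look like the sketch below. This is illustrative only: build_dataset is a hypothetical helper standing in for the Loader/merge logic, and the exact import locations of get_rubric and BaseEnvironmentConfig are assumptions.

# src/exploration_hacking/environments/your_env.py -- a sketch, not a
# drop-in implementation.
import verifiers as vf

from .base import BaseEnvironmentConfig


class YourEnvConfig(BaseEnvironmentConfig):
    # Add environment-specific fields as needed, e.g. tool configuration.
    web_search: bool = False  # illustrative field


def load_your_environment(
    config: YourEnvConfig, seed: int | None = None
) -> vf.ToolEnv:
    # Build and merge dataset sources. build_dataset is a hypothetical
    # stand-in for creating a Loader with prompt/answer extraction
    # functions and merging its sources into train/test splits.
    dataset = build_dataset(config, seed=seed)

    tools = []  # populate if your environment exposes tools
    rubric = get_rubric(config)  # or get_conditional_rubric(config)

    # Attach whichever splits exist, then construct the ToolEnv.
    kwargs = dict(tools=tools, rubric=rubric)
    if "train" in dataset:
        kwargs["dataset"] = dataset["train"]
    if "test" in dataset:
        kwargs["eval_dataset"] = dataset["test"]
    return vf.ToolEnv(**kwargs)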

Style

Use Black for formatting.

About

Code for the paper "Exploration Hacking: Can LLMs Learn to Resist RL Training?"
