Exploration Hacking

TODO

  • Add Nathalie's frontier auditing experiments to other_experiments/
  • Add Eyon's countermeasures experiments to other_experiments/
  • Add unconditional locking experiments (KernelBench and WMDP) to other_experiments/
  • Final pass: check everything is up to date and clean before publication

Shared codebase for exploration hacking RL experiments, built on top of a custom fork of Verifiers.

Paper Experiments

WMDP (Science) Conditional Tool Misuse

Configs and artifacts in etc/dfalck/science_conditional/final/.

  • Model: Qwen3-14B with LoRA (rank 32)
  • Benchmarks: WMDP-bio (biology MCQ) and VCT (virology/clinical)
  • Pipeline: Data generation → Filtering → SFT → RL elicitation → Evaluation
  • Conditions: Treatment (SFT + RL) vs Control (RL only)
  • Trained weights: artifacts/weights/dfalck/science_conditional/final/sft_model/
  • Evaluation results: artifacts/data/dfalck/science_conditional/final/eval_*.eval
  • Paper figures: artifacts/plots/paper/

BigCodeBench (Code) Conditional Locking

Configs in etc/jbraun/bcb_conditional/. Same pipeline structure as above but for coding tasks with BigCodeBench.

Benign SFT Elicitation

Tests whether benign fine-tuning can elicit conditional behavior without RL. Configs in etc/dfalck/science_conditional/benign_sft_elicitation/.

Frontier Model Auditing (Discovery)

Agentic experiments testing whether frontier models spontaneously discover and act on training context from environment files. Self-contained in other_experiments/frontier_auditing_discovery/. See the writeup for details.

Repository Structure

src/exploration_hacking/     # Main Python package
  environments/              # RL environment definitions (science, bigcodebench, kernelbench)
  rewards/                   # Reward functions and rubrics
  rubrics/                   # LLM judge configurations
  tools/                     # Tool implementations (web search)
  scripts/                   # Training and evaluation entrypoints
etc/                         # Experiment config files (YAML)
  dfalck/science_conditional/  # WMDP experiment configs (final/ and benign_sft_elicitation/)
  jbraun/bcb_conditional/      # BigCodeBench experiment configs
  archive/                     # Old/iterative experiment configs
other_experiments/           # Standalone experiments outside the main RL codebase
  frontier_auditing_discovery/ # Frontier model auditing experiments
artifacts/                   # Data, weights, logs, and plots
shared_loras/                # Shared LoRA adapters
shared_datasets/             # Shared processed datasets
scripts/                     # Standalone utility scripts
notebooks/                   # Jupyter notebooks for analysis
verifiers/                   # Custom Verifiers fork (git submodule)

VCT Data

The Virology Capabilities Test (VCT) dataset is not included due to contractual restrictions. VCT-specific evaluations and frontier auditing outputs require it. WMDP-bio experiments work without it. If you have access, run:

bash scripts/restore_private_data.sh

This clones from d-falck/exploration-hacking-private-data and places files into gitignored directories so they won't be accidentally committed.

Installation

Clone the repo with git clone [url] --recurse-submodules, cd into it, and run uv pip install -e . --group dev (omit --group dev if you don't need development dependencies such as Jupyter).
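
For example (assuming the checkout directory is named exploration-hacking):

git clone [url] --recurse-submodules
cd exploration-hacking
uv pip install -e . --group dev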

Setup

Copy .env.example as .env and add your API keys. See the comments in .env.example for which keys are required vs optional.
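
For example:

cp .env.example .env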

Environment-specific Installation

BigCodeBench

If you plan to run BigCodeBench evaluations, you'll need additional dependencies for the test sandbox:

# Install BigCodeBench optional dependencies (73 packages)
uv pip install -e ".[bigcodebench]"
# OR using the setup script:
./scripts/setup_bigcodebench.sh

Important: always run BigCodeBench evaluations from an activated virtual environment, as the test subprocesses inherit the parent's Python environment and need these dependencies available.
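
For example (assuming a uv-managed virtual environment at .venv; the config path is illustrative):

source .venv/bin/activate
python -m exploration_hacking.scripts.evaluate --config etc/jbraun/bcb_conditional/eval.yaml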

Usage

Logging Configuration

This project supports two types of logging:

Training Metrics (WandB)

Training runs log metrics to Weights & Biases. You'll need a WandB account and API key in your .env.

Trace Logging (MLflow or Inspect)

For detailed execution traces during evaluation, you can use either:

  1. Inspect Logging (recommended): Saves traces to local JSON files compatible with the Inspect AI format

    • No additional setup required
    • Use --logging-destination inspect when running evaluation scripts
    • Traces saved to outputs/traces/ by default
  2. MLflow Logging (legacy): Sends traces to an MLflow tracking server

    • Requires an MLflow tracking server (specify URL in .env as MLFLOW_TRACKING_URI)
    • For SQLite backend, set environment variable: MLFLOW_SQLALCHEMYSTORE_POOLCLASS=NullPool
    • Use --logging-destination mlflow when running scripts
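
For example, to capture Inspect traces during an evaluation (the config path here is illustrative):

python -m exploration_hacking.scripts.evaluate --config etc/dfalck/science_conditional/final/eval_treatment.yaml --logging-destination inspect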

Running scripts

Run scripts by passing a config file, e.g.:

python -m exploration_hacking.scripts.grpo --config etc/dfalck/science_conditional/final/rl_elicit_treatment.yaml

You can override individual arguments on the command line:

python -m exploration_hacking.scripts.grpo --config etc/dfalck/science_conditional/final/rl_elicit_treatment.yaml --rl.learning-rate 1e-4

Results are saved with timestamps in outputs/eval_results/ as both .pkl and .json files.

Using RunPod

If you're using RunPod, you can use the scripts/run_and_stop.py script to automatically terminate your node after completion:

python scripts/run_and_stop.py --timeout 240 --only-stop-after 5 --log-file -- python -m exploration_hacking.scripts.grpo --config etc/dfalck/science_conditional/final/rl_elicit_treatment.yaml

This Docker image works well with our experiments (you'll have to conda init after startup and install into the base conda environment using uv). A RunPod template for this is available here.

Hardware

For RL training, you'll need at least 2 GPUs. 4-8 are recommended. Use H100s (or even better, H200s) if possible.

For evals, you'll need just 1 GPU (evaluation currently can't make use of more than one).

Scripts

The scripts currently available are:

  • exploration_hacking.scripts.grpo: agentic RL training with PEFT (GRPO)
  • exploration_hacking.scripts.sft: train a model on example off-policy interaction traces
  • exploration_hacking.scripts.evaluate: evaluate a model in an RL environment
  • exploration_hacking.scripts.data.filter: make an SFT dataset by filtering evaluation results
  • exploration_hacking.scripts.data.mix: combine SFT datasets by interleaving
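
These compose into the pipeline used for the paper experiments (data generation → filtering → SFT → RL elicitation → evaluation). A sketch with illustrative config paths (only the last is a real path in this repo):

python -m exploration_hacking.scripts.evaluate --config etc/dfalck/science_conditional/final/eval_base.yaml
python -m exploration_hacking.scripts.data.filter --config etc/dfalck/science_conditional/final/filter.yaml
python -m exploration_hacking.scripts.sft --config etc/dfalck/science_conditional/final/sft.yaml
python -m exploration_hacking.scripts.grpo --config etc/dfalck/science_conditional/final/rl_elicit_treatment.yaml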

The read_eval_results.ipynb notebook is useful for visually comparing model evaluation results.

Development

Verifiers trunk

We currently use a private fork of Verifiers that adds logging and other improvements. As a result, we've likely missed upstream Verifiers updates; merging in upstream changes (and possibly contributing relevant changes of ours back upstream) is a TODO.

Environments

To implement a new RL environment:

  1. Create your environment module with:

    • A config class inheriting from BaseEnvironmentConfig (from base.py)
    • A loader function with signature: load_your_environment(config: YourEnvConfig, seed: int | None = None) -> vf.ToolEnv

    You can either:

    • Create a single file: src/exploration_hacking/environments/your_env.py
    • Or a module directory: src/exploration_hacking/environments/your_env/__init__.py
  2. Config class should inherit from BaseEnvironmentConfig, which provides:

    dataset_names: list[str]           # Required
    prompt_prefixes: dict[str, str] | None
    system_prompt: str                 # Required
    split: SplitConfig                 # Required
    segment_rewards: dict[str, RewardConfig] = {}
    global_rewards: RewardConfig | None
    tokenizer: str = "willcb/Qwen3-14B"
    max_turns: int = 4
    inline_reasoning: bool = False

    Add any environment-specific fields as needed (e.g., tool configurations).

  3. In your loader function:

    • Create a Loader instance with prompt/answer extraction functions
    • Build dataset sources and merge them
    • Setup tools list if needed
    • Create rubric using get_rubric() or get_conditional_rubric()
    • Build kwargs dict, then add train/test datasets:
      kwargs = dict(tools=tools, rubric=rubric, parser=parser, ...)
      if "train" in dataset: kwargs["dataset"] = dataset["train"]
      if "test" in dataset: kwargs["eval_dataset"] = dataset["test"]
      return vf.ToolEnv(**kwargs)
  4. Register your environment in src/exploration_hacking/environments/__init__.py:

    from .your_env import YourEnvConfig, load_your_environment
    
    ENVIRONMENTS["your_env"] = _EnvironmentDefinition(
        config_class=YourEnvConfig,
        loader=load_your_environment
    )
  5. What NOT to modify:

    • The _EnvironmentDefinition class or load_environment() function
    • The dynamic EnvironmentConfig creation logic
    • The base.py file (unless adding genuinely common fields)

See science.py or bigcodebench/__init__.py for complete examples.
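
Putting the steps together, a minimal environment module might look like the sketch below. This is illustrative only: build_dataset is a hypothetical helper standing in for the Loader/merge logic, and the exact import locations of get_rubric and BaseEnvironmentConfig are assumptions.

# src/exploration_hacking/environments/your_env.py -- a sketch, not a
# drop-in implementation.
import verifiers as vf

from .base import BaseEnvironmentConfig


class YourEnvConfig(BaseEnvironmentConfig):
    # Add environment-specific fields as needed, e.g. tool configuration.
    web_search: bool = False  # illustrative field


def load_your_environment(
    config: YourEnvConfig, seed: int | None = None
) -> vf.ToolEnv:
    # Build and merge dataset sources. build_dataset is a hypothetical
    # stand-in for creating a Loader with prompt/answer extraction
    # functions and merging its sources into train/test splits.
    dataset = build_dataset(config, seed=seed)

    tools = []  # populate if your environment exposes tools
    rubric = get_rubric(config)  # or get_conditional_rubric(config)

    # Attach whichever splits exist, then construct the ToolEnv.
    kwargs = dict(tools=tools, rubric=rubric)
    if "train" in dataset:
        kwargs["dataset"] = dataset["train"]
    if "test" in dataset:
        kwargs["eval_dataset"] = dataset["test"]
    return vf.ToolEnv(**kwargs)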

Style

Use Black for formatting.

About

Code for the paper "Exploration Hacking: Can LLMs Learn to Resist RL Training?"
