- Add Nathalie's frontier auditing experiments to `other_experiments/`
- Add Eyon's countermeasures experiments to `other_experiments/`
- Add unconditional locking experiments (KernelBench and WMDP) to `other_experiments/`
- Final pass: check everything is up to date and clean before publication
Shared codebase for exploration hacking RL experiments, built on top of a custom fork of Verifiers.
Configs and artifacts in `etc/dfalck/science_conditional/final/`.
- Model: Qwen3-14B with LoRA (rank 32)
- Benchmarks: WMDP-bio (biology MCQ) and VCT (virology/clinical)
- Pipeline: Data generation → Filtering → SFT → RL elicitation → Evaluation
- Conditions: Treatment (SFT + RL) vs Control (RL only)
- Trained weights: `artifacts/weights/dfalck/science_conditional/final/sft_model/`
- Evaluation results: `artifacts/data/dfalck/science_conditional/final/eval_*.eval`
- Paper figures: `artifacts/plots/paper/`
Configs in `etc/jbraun/bcb_conditional/`. Same pipeline structure as above, but for coding tasks with BigCodeBench.
Tests whether benign fine-tuning can elicit conditional behavior without RL. Configs in `etc/dfalck/science_conditional/benign_sft_elicitation/`.
Agentic experiments testing whether frontier models spontaneously discover and act on training context from environment files. Self-contained in `other_experiments/frontier_auditing_discovery/`. See the writeup for details.
```
src/exploration_hacking/          # Main Python package
    environments/                 # RL environment definitions (science, bigcodebench, kernelbench)
    rewards/                      # Reward functions and rubrics
    rubrics/                      # LLM judge configurations
    tools/                        # Tool implementations (web search)
    scripts/                      # Training and evaluation entrypoints
etc/                              # Experiment config files (YAML)
    dfalck/science_conditional/   # WMDP experiment configs (final/ and benign_sft_elicitation/)
    jbraun/bcb_conditional/       # BigCodeBench experiment configs
    archive/                      # Old/iterative experiment configs
other_experiments/                # Standalone experiments outside the main RL codebase
    frontier_auditing_discovery/  # Frontier model auditing experiments
artifacts/                        # Data, weights, logs, and plots
    shared_loras/                 # Shared LoRA adapters
    shared_datasets/              # Shared processed datasets
scripts/                          # Standalone utility scripts
notebooks/                        # Jupyter notebooks for analysis
verifiers/                        # Custom Verifiers fork (git submodule)
```
The Virology Capabilities Test (VCT) dataset is not included due to contractual restrictions. VCT-specific evaluations and frontier auditing outputs require it. WMDP-bio experiments work without it. If you have access, run:
```bash
bash scripts/restore_private_data.sh
```

This clones from `d-falck/exploration-hacking-private-data` and places files into gitignored directories so they won't be accidentally committed.
Clone the repo using `git clone [url] --recurse-submodules`, `cd` into it, and run `uv pip install -e . --group dev` (omit the `--group dev` if you don't need development dependencies like Jupyter).
Copy `.env.example` to `.env` and add your API keys. See the comments in `.env.example` for which keys are required vs optional.
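Concretely, the setup boils down to this (the repo URL and directory name are left as placeholders):

```bash
git clone [url] --recurse-submodules
cd <repo-dir>
uv pip install -e . --group dev  # drop --group dev to skip dev dependencies like Jupyter
cp .env.example .env             # then fill in your API keys
```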
If you plan to run BigCodeBench evaluations, you'll need additional dependencies for the test sandbox:
```bash
# Install BigCodeBench optional dependencies (73 packages)
uv pip install -e ".[bigcodebench]"

# OR use the setup script:
./scripts/setup_bigcodebench.sh
```

**Important:** Always run BigCodeBench evaluations from an activated virtual environment, as the test subprocesses inherit the parent's Python environment to access these dependencies.
This project supports two types of logging:
Training runs log metrics to Weights & Biases. You'll need a WandB account and an API key in your `.env`.
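For example, a minimal `.env` entry (wandb reads the standard `WANDB_API_KEY` variable; check `.env.example` for the keys this repo actually expects):

```bash
# .env
WANDB_API_KEY=<your-wandb-api-key>
```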
For detailed execution traces during evaluation, you can use either:

- **Inspect Logging** (recommended): saves traces to local JSON files compatible with the Inspect AI format
  - No additional setup required
  - Use `--logging-destination inspect` when running evaluation scripts
  - Traces are saved to `outputs/traces/` by default
- **MLflow Logging** (legacy): sends traces to an MLflow tracking server
  - Requires an MLflow tracking server (specify the URL in `.env` as `MLFLOW_TRACKING_URI`)
  - For a SQLite backend, set the environment variable `MLFLOW_SQLALCHEMYSTORE_POOLCLASS=NullPool`
  - Use `--logging-destination mlflow` when running scripts
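A typical invocation with Inspect logging might look like this (the config filename is illustrative, not a real file in this repo):

```bash
python -m exploration_hacking.scripts.evaluate \
    --config etc/dfalck/science_conditional/final/your_eval_config.yaml \
    --logging-destination inspect
```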
Use config files, e.g.:

```bash
python -m exploration_hacking.scripts.grpo --config etc/dfalck/science_conditional/final/rl_elicit_treatment.yaml
```

You can override individual arguments if you wish:

```bash
python -m exploration_hacking.scripts.grpo --config etc/dfalck/science_conditional/final/rl_elicit_treatment.yaml --rl.learning-rate 1e-4
```
Results are saved with timestamps in `outputs/eval_results/` as both `.pkl` and `.json` files.
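A quick way to inspect a saved result (the filenames and result schema here are illustrative):

```python
import json
import pickle

# Filenames are timestamped; pick one from outputs/eval_results/.
with open("outputs/eval_results/<timestamp>.json") as f:
    results = json.load(f)

with open("outputs/eval_results/<timestamp>.pkl", "rb") as f:
    results_full = pickle.load(f)  # the .pkl may carry richer Python objects
```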
If you're using RunPod, you can use `scripts/run_and_stop.py` to automatically terminate your node after completion:

```bash
python scripts/run_and_stop.py --timeout 240 --only-stop-after 5 --log-file -- python -m exploration_hacking.scripts.grpo --config etc/dfalck/science_conditional/final/rl_elicit_treatment.yaml
```
This Docker image works well with our experiments (you'll have to run `conda init` after startup and install into the base conda environment using `uv`). A RunPod template for this is available here.
For RL training, you'll need at least 2 GPUs. 4-8 are recommended. Use H100s (or even better, H200s) if possible.
For evals, you just need 1 GPU (evaluation currently can't make use of more than one).
The scripts currently available are:
- `exploration_hacking.scripts.grpo`: PEFT agent RL
- `exploration_hacking.scripts.sft`: train a model on example off-policy interaction traces
- `exploration_hacking.scripts.evaluate`: evaluate a model in an RL environment
- `exploration_hacking.scripts.data.filter`: make an SFT dataset by filtering evaluation results
- `exploration_hacking.scripts.data.mix`: combine SFT datasets by interleaving
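These chain together into the pipeline from the experiment overview (data generation → filtering → SFT → RL elicitation → evaluation). A sketch of one end-to-end run; all config filenames except `rl_elicit_treatment.yaml` are illustrative:

```bash
# 1. Data generation: roll out a model in the environment
python -m exploration_hacking.scripts.evaluate --config etc/.../generate.yaml

# 2. Filtering: turn evaluation results into an SFT dataset
python -m exploration_hacking.scripts.data.filter --config etc/.../filter.yaml

# 3. SFT on the filtered traces
python -m exploration_hacking.scripts.sft --config etc/.../sft.yaml

# 4. RL elicitation
python -m exploration_hacking.scripts.grpo --config etc/dfalck/science_conditional/final/rl_elicit_treatment.yaml

# 5. Evaluation of the trained model
python -m exploration_hacking.scripts.evaluate --config etc/.../eval.yaml
```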
The `read_eval_results.ipynb` notebook is pretty useful for comparing model evals visually.
Currently we're using a private fork of Verifiers with a bunch of logging and other improvements. We've probably missed a number of upstream Verifiers updates this way: merging in upstream changes, and possibly contributing relevant changes of ours back upstream, is a TODO.
To implement a new RL environment:
- Create your environment module with:
  - A config class inheriting from `BaseEnvironmentConfig` (from `base.py`)
  - A loader function with the signature `load_your_environment(config: YourEnvConfig, seed: int | None = None) -> vf.ToolEnv`

  You can either:
  - Create a single file: `src/exploration_hacking/environments/your_env.py`
  - Or a module directory: `src/exploration_hacking/environments/your_env/__init__.py`
- The config class should inherit from `BaseEnvironmentConfig`, which provides:

  ```python
  dataset_names: list[str]  # Required
  prompt_prefixes: dict[str, str] | None
  system_prompt: str  # Required
  split: SplitConfig  # Required
  segment_rewards: dict[str, RewardConfig] = {}
  global_rewards: RewardConfig | None
  tokenizer: str = "willcb/Qwen3-14B"
  max_turns: int = 4
  inline_reasoning: bool = False
  ```

  Add any environment-specific fields as needed (e.g., tool configurations).
- In your loader function:
  - Create a `Loader` instance with prompt/answer extraction functions
  - Build dataset sources and merge them
  - Set up the tools list if needed
  - Create a rubric using `get_rubric()` or `get_conditional_rubric()`
  - Build a kwargs dict, then add train/test datasets:

    ```python
    kwargs = dict(tools=tools, rubric=rubric, parser=parser, ...)
    if "train" in dataset:
        kwargs["dataset"] = dataset["train"]
    if "test" in dataset:
        kwargs["eval_dataset"] = dataset["test"]
    return vf.ToolEnv(**kwargs)
    ```
- Register your environment in `src/exploration_hacking/environments/__init__.py`:

  ```python
  from .your_env import YourEnvConfig, load_your_environment

  ENVIRONMENTS["your_env"] = _EnvironmentDefinition(
      config_class=YourEnvConfig, loader=load_your_environment
  )
  ```
- What NOT to modify:
  - The `_EnvironmentDefinition` class or `load_environment()` function
  - The dynamic `EnvironmentConfig` creation logic
  - The `base.py` file (unless adding genuinely common fields)
See `science.py` or `bigcodebench/__init__.py` for complete examples.
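Putting the steps together, a minimal skeleton (all names here are the hypothetical placeholders used above; real environments are more involved):

```python
# src/exploration_hacking/environments/your_env.py (hypothetical skeleton)
import verifiers as vf

from .base import BaseEnvironmentConfig


class YourEnvConfig(BaseEnvironmentConfig):
    """Add environment-specific fields here (e.g., tool configurations)."""


def load_your_environment(config: YourEnvConfig, seed: int | None = None) -> vf.ToolEnv:
    dataset = ...  # build dataset sources and merge them (step 3)
    tools = []     # optional tool list
    rubric = ...   # get_rubric() or get_conditional_rubric()
    parser = ...
    kwargs = dict(tools=tools, rubric=rubric, parser=parser)
    if "train" in dataset:
        kwargs["dataset"] = dataset["train"]
    if "test" in dataset:
        kwargs["eval_dataset"] = dataset["test"]
    return vf.ToolEnv(**kwargs)
```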
Use Black for formatting.