dr-darryl-wright/offense-defense

AI Offense-Defense Balance: A Systems Dynamics Model

Code for the paper "Modeling Offense-Defense Balance in AI Safety: Temporal Dynamics of the Jailbreak-Safeguard Arms Race", presented at TAIS 2026.

Overview

Current AI risk frameworks assess capability at a point in time but cannot forecast how the balance between offensive capability (jailbreaking) and defensive safeguards evolves after deployment. This repository provides a systems dynamics model of that co-evolution, implemented as a coupled ODE system with Monte Carlo uncertainty quantification.

The model tracks four state variables over a 24-month post-release window:

Variable  Range    Description
O(t)      [0, 10]  Offensive capability
S(t)      [0, 1]   Safeguard effectiveness
D(t)      [0, 10]  Defensive capability
A(t)      [0, 1]   Defensive adoption rate
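
As a rough illustration of what "a coupled ODE system with Monte Carlo uncertainty quantification" looks like in practice, the sketch below integrates four state variables over 24 months with scipy and repeats the run for randomly sampled parameters. The right-hand side, the parameter names (r_off, k_erode, r_def, r_adopt), their sampling ranges, and the initial conditions are all placeholders invented for this example; they are not the equations or values used in od_model.py or the paper.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Placeholder dynamics for illustration only; NOT the equations from the paper.
def toy_rhs(t, y, r_off, k_erode, r_def, r_adopt):
    O, S, D, A = y
    dO = r_off * O * (1.0 - O / 10.0)   # logistic growth, capped at 10
    dS = -k_erode * (O / 10.0) * S      # safeguards erode faster as offense grows
    dD = r_def * D * (1.0 - D / 10.0)   # logistic growth, capped at 10
    dA = r_adopt * A * (1.0 - A)        # logistic adoption, capped at 1
    return [dO, dS, dD, dA]

def simulate(params, y0=(1.0, 0.9, 1.0, 0.1), months=24):
    """Integrate one trajectory; returns an array of shape (4, months + 1)."""
    t_eval = np.linspace(0.0, months, months + 1)  # one output point per month
    sol = solve_ivp(toy_rhs, (0.0, months), y0, args=params,
                    t_eval=t_eval, rtol=1e-6, atol=1e-9)
    return sol.y

def monte_carlo(n_samples=50, seed=0):
    """Sample hypothetical rate parameters and stack the resulting trajectories."""
    rng = np.random.default_rng(seed)
    runs = []
    for _ in range(n_samples):
        params = (
            rng.uniform(0.05, 0.20),  # r_off (assumed range)
            rng.uniform(0.01, 0.10),  # k_erode (assumed range)
            rng.uniform(0.05, 0.20),  # r_def (assumed range)
            rng.uniform(0.05, 0.20),  # r_adopt (assumed range)
        )
        runs.append(simulate(params))
    return np.stack(runs)  # shape (n_samples, 4, months + 1)

ensemble = monte_carlo()
```

The stacked array keeps one axis per sample, state variable, and month, which makes it straightforward to take percentiles across the sample axis afterwards.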

The O-D ratio, (O × (1−S)) / (D × A), serves as the leading risk indicator. Values above 2.0 (the default for --threshold-od) signal that offensive capability is outpacing the defensive ecosystem by more than a factor of two.
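
A minimal sketch of the ratio as a plain Python function may help when post-processing results. The function names and the eps guard against a zero denominator are assumptions made for this example and are not taken from od_model.py.

```python
def od_ratio(O, S, D, A, eps=1e-9):
    """O-D ratio: (O * (1 - S)) / (D * A).

    eps is an assumed guard so that D * A == 0 does not raise ZeroDivisionError.
    """
    return (O * (1.0 - S)) / max(D * A, eps)

def exceeds_threshold(ratio, threshold=2.0):
    """True when the ratio is above the risk threshold (2.0 by default)."""
    return ratio > threshold

# Example with made-up state values: offense 6, safeguards 50% effective,
# defense 4, adoption 25%.
ratio = od_ratio(O=6.0, S=0.5, D=4.0, A=0.25)
high_risk = exceeds_threshold(ratio)
```

With these made-up inputs the numerator is 3.0 and the denominator is 1.0, so the ratio is 3.0 and the state is flagged as above the threshold.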

Three model archetypes are provided, representing different safety investment profiles:

Key           Description
high_safety   High safety investment
industry_std  Industry standard
open_weights  Open weights model

Repository Structure

od_model.py                       # Core model: ODEs, Monte Carlo, archetypes
policy_experiments.py             # Monte Carlo governance policy analysis
investment_curve_analysis.py      # Investment level vs. sustained safety
sensitivity_analysis.py           # OAT parameter sensitivity
fixed_param_sensitivity.py        # Sensitivity of architectural constants
historical_validation.py          # Reference mode validation against observed data
high_safety_reference_mode.json   # Historical validation data (high safety archetype)

Requirements

numpy
scipy
matplotlib

Install with:

pip install numpy scipy matplotlib

Usage

Core model

Run the Monte Carlo ensemble for a single archetype:

python od_model.py --model industry_std --n-samples 1000 --output-dir results

Options:

  • --model — industry_std, open_weights, or high_safety
  • --n-samples — number of Monte Carlo samples (default: 1000)
  • --duration — simulation duration in months (default: 24)
  • --threshold-od — O-D ratio threshold for risk classification (default: 2.0)
  • --output-dir — directory for results and plots
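
To run the ensemble for all three archetypes in one go, a small driver along these lines could be used. It only assembles the flags documented above; the helper names are hypothetical, and it assumes od_model.py sits in the current working directory.

```python
import subprocess
import sys

ARCHETYPES = ("high_safety", "industry_std", "open_weights")

def build_command(model, n_samples=1000, output_dir="results"):
    """Assemble the od_model.py command line for one archetype."""
    return [
        sys.executable, "od_model.py",  # assumes the script is in the working directory
        "--model", model,
        "--n-samples", str(n_samples),
        "--output-dir", output_dir,
    ]

def run_all(n_samples=1000, output_dir="results"):
    """Run the Monte Carlo ensemble for every archetype, stopping on the first failure."""
    for model in ARCHETYPES:
        subprocess.run(build_command(model, n_samples, output_dir), check=True)
```

Calling run_all() launches the three runs sequentially. Using sys.executable keeps the runs on the same interpreter (and virtual environment) as the driver.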

Governance policy experiments

Tests four interventions — staged rollout, restricted access, defensive adoption acceleration, and a combined strategy — each with Monte Carlo uncertainty quantification:

python policy_experiments.py --models industry_std high_safety open_weights --n-samples 1000 --output-dir policy_experiments/

Options:

  • --models — one or more of industry_std, open_weights, high_safety
  • --n-samples — samples per policy (default: 1000)
  • --quick — faster run with reduced resolution
  • --output-dir — directory for results and plots

Investment curve analysis

Sweeps investment multipliers (0.5×–5×) and reports success rates and O-D trajectories with 95% confidence intervals:

python investment_curve_experiments.py --models industry_std high_safety open_weights --n-samples 1000 --output-dir investment_curve_experiments

Options:

  • --models — one or more archetypes
  • --n-samples — samples per investment level (default: 1000)
  • --quick — faster run with reduced resolution
  • --output-dir — directory for results and plots
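
The kind of summary this analysis reports can be sketched with numpy as follows. This is an assumption-laden illustration, not the script's actual method: it treats "success" as a run whose O-D ratio never exceeds the threshold, and it uses the 2.5th and 97.5th percentiles across samples as the 95% band. The function name and the returned keys are hypothetical.

```python
import numpy as np

def summarize_ensemble(od, threshold=2.0):
    """Summarize O-D trajectories given as an array of shape (n_samples, n_times)."""
    od = np.asarray(od, dtype=float)
    # Pointwise percentile band across samples (assumed definition of the 95% interval).
    lower, median, upper = np.percentile(od, [2.5, 50.0, 97.5], axis=0)
    # Assumed definition of success: the ratio stays at or below the threshold throughout.
    success_rate = float(np.mean(od.max(axis=1) <= threshold))
    return {"lower": lower, "median": median, "upper": upper,
            "success_rate": success_rate}

# Synthetic stand-in data: 200 random-walk trajectories over 25 monthly points.
rng = np.random.default_rng(0)
fake_od = 1.0 + np.abs(np.cumsum(rng.normal(0.0, 0.1, size=(200, 25)), axis=1))
summary = summarize_ensemble(fake_od)
```

Repeating this summary once per investment multiplier would give one success rate and one trajectory band per investment level.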

Sensitivity analysis

One-at-a-time (OAT) analysis across 15 parameters, including all initial conditions. Produces a tornado diagram and a parameter ranking:

python sensitivity_analysis.py --baseline median
python sensitivity_analysis.py --baseline high_safety

Options:

  • --baseline — default, median, industry_std, open_weights, or high_safety
  • --variation — fractional variation per parameter (default: 0.20 = ±20%)
  • --output-dir — directory for results
  • --quick — sequential rather than parallel execution
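
The OAT procedure itself is simple to sketch: perturb one parameter at a time by the fractional variation, hold the others at baseline, and rank parameters by the largest change in the output. The example below uses a toy output function (the O-D ratio formula applied to four made-up values) in place of the full model; the function and parameter names are illustrative and do not mirror sensitivity_analysis.py.

```python
def oat_sensitivity(model, baseline, variation=0.20):
    """One-at-a-time sensitivity: vary each parameter by +/- variation around baseline.

    Returns (name, low_delta, high_delta) tuples sorted by largest absolute swing,
    which is the ordering a tornado diagram would use.
    """
    base_out = model(baseline)
    rows = []
    for name, value in baseline.items():
        low = model({**baseline, name: value * (1.0 - variation)})
        high = model({**baseline, name: value * (1.0 + variation)})
        rows.append((name, low - base_out, high - base_out))
    rows.sort(key=lambda r: max(abs(r[1]), abs(r[2])), reverse=True)
    return rows

def toy_output(p):
    """Toy stand-in for the model output: the O-D ratio at a single state."""
    return (p["O"] * (1.0 - p["S"])) / (p["D"] * p["A"])

baseline = {"O": 6.0, "S": 0.5, "D": 4.0, "A": 0.25}  # made-up values
ranking = oat_sensitivity(toy_output, baseline)
```

For this toy output, the denominator terms D and A rank above O and S, because shrinking a denominator by 20% moves the ratio more than changing a numerator term by the same fraction.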

Architectural parameter sensitivity

Extends the OAT analysis to parameters held constant across archetypes (community dynamics, safeguard bounds, learning rates):

python sensitivity_analysis_fixed_params.py --baseline median --focus-on all --output-dir s_min_test/fixed_param_sensitivity
python sensitivity_analysis_fixed_params.py --baseline median --focus-on median --output-dir s_min_test/fixed_param_sensitivity

Options:

  • --focus-on — all, safeguard, defensive, scaffolding, or community
  • --output-dir — directory for results

Historical validation

Compares Monte Carlo predictions against observed safeguard effectiveness data using three parameter modes:

python historical_validation.py --n-samples 1000 --evidence-based
python historical_validation.py --n-samples 1000 --calibrated
python historical_validation.py --n-samples 1000

Options:

  • --evidence-based — parameters derived from published evidence
  • --calibrated — parameters fitted to observed data
  • (neither flag) — uncalibrated baseline
  • --reference-file — path to reference data JSON (default: high_safety_reference_data.json)
  • --output-dir — directory for plots, reports, and JSON results

Outputs

Each script writes to its --output-dir. Key outputs:

Script                             Outputs
od_model.py                        mc_{model}_results.json, ensemble plot
policy_experiments_streamlined.py  comparison plots, LaTeX table, results JSON
investment_curve_analysis.py       trajectory plots, investment table, results JSON
sensitivity_analysis.py            tornado diagram, proactive research analysis, results JSON
fixed_param_sensitivity.py         safeguard impact plot, parameter ranges plot, results JSON
historical_validation.py           validation plot, diagnostic panel, report, results JSON

Paper Results Replication

1. 24-month O-D evolution (Section 5.1)

$ python od_model.py --monte-carlo --model industry_std
$ python od_model.py --monte-carlo --model open_weights
$ python od_model.py --monte-carlo --model high_safety

2. Policy Functions (Section 5.2; Appendix C)

Investment experiments

$ python investment_curve_experiments.py --models high_safety --n-samples 1000 --output-dir results/investment_experiments

Policy experiments

$ python policy_experiments.py --models industry_std open_weights high_safety --n-samples 1000 --output-dir ./results/policy_experiments/

3. Robustness Analysis (Appendix B)

Variable Parameters (Appendix B.1)

$ python sensitivity_analysis.py --baseline median --output-dir ./results/sensitivity_analysis/

Fixed Parameters (Appendix B.2)

$ python sensitivity_analysis_fixed_params.py --baseline median --output-dir ./results/sensitivity_analysis_fixed_params/

4. Historical Validation (Appendix D)

$ python historical_validation.py --n-samples 1000 --output-dir ./results/historical_validation/ --evidence-based
