Code for the paper "Modeling Offense-Defense Balance in AI Safety: Temporal Dynamics of the Jailbreak-Safeguard Arms Race", presented at TAIS 2026.
Current AI risk frameworks assess capability at a point in time but cannot forecast how the balance between offensive capability (jailbreaking) and defensive safeguards evolves after deployment. This repository provides a systems dynamics model of that co-evolution, implemented as a coupled ODE system with Monte Carlo uncertainty quantification.
The model tracks four state variables over a 24-month post-release window:
| Variable | Range | Description |
|---|---|---|
| O(t) | [0, 10] | Offensive capability |
| S(t) | [0, 1] | Safeguard effectiveness |
| D(t) | [0, 10] | Defensive capability |
| A(t) | [0, 1] | Defensive adoption rate |
The O-D ratio, (O × (1−S)) / (D × A), serves as the leading risk indicator. Values above 2.0 signal that offensive capability not blocked by safeguards is more than twice the defensive capability actually adopted.
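As a minimal sketch of the indicator defined above, the ratio can be computed directly from the four state variables. The function names and the small `eps` guard against division by zero are assumptions made for illustration and are not taken from `od_model.py`; the example input values are arbitrary.

```python
def od_ratio(O, S, D, A, eps=1e-9):
    """O-D ratio: offense that gets past safeguards over adopted defense.

    O and D lie in [0, 10]; S and A lie in [0, 1].
    eps is an illustrative guard against division by zero (an assumption).
    """
    return (O * (1.0 - S)) / max(D * A, eps)


def is_high_risk(ratio, threshold=2.0):
    """Classify a ratio against the risk threshold (2.0 in the text above)."""
    return ratio > threshold


# Arbitrary example values, not model output.
ratio = od_ratio(O=6.0, S=0.5, D=4.0, A=0.5)
print(ratio, is_high_risk(ratio))  # 1.5 False
```

Because D × A sits in the denominator, low adoption raises the ratio even when defensive capability itself is strong.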
Three model archetypes are provided, representing different safety investment profiles:
| Key | Description |
|---|---|
| `high_safety` | High safety investment |
| `industry_std` | Industry standard |
| `open_weights` | Open weights model |
Repository layout:

    od_model.py                       # Core model: ODEs, Monte Carlo, archetypes
    policy_experiments.py             # Monte Carlo governance policy analysis
    investment_curve_analysis.py      # Investment level vs. sustained safety
    sensitivity_analysis.py           # OAT parameter sensitivity
    fixed_param_sensitivity.py        # Sensitivity of architectural constants
    historical_validation.py          # Reference mode validation against observed data
    high_safety_reference_mode.json   # Historical validation data (high safety archetype)

Dependencies:

- numpy
- scipy
- matplotlib

Install with:

    pip install numpy scipy matplotlib

Run the Monte Carlo ensemble for a single archetype:

    python od_model.py --model industry_std --n-samples 1000 --output-dir results

Options:

- `--model`: `industry_std`, `open_weights`, or `high_safety`
- `--n-samples`: number of Monte Carlo samples (default: 1000)
- `--duration`: simulation duration in months (default: 24)
- `--threshold-od`: O-D ratio threshold for risk classification (default: 2.0)
- `--output-dir`: directory for results and plots
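
The idea behind the ensemble run can be sketched in a few lines: draw many samples, compute the O-D ratio for each, and summarize the spread against the threshold. This is a simplified illustration, not the repository's implementation. The real model samples ODE parameters and integrates trajectories, whereas this sketch samples the state variables directly; the sampling ranges, function names, and nearest-rank percentile are all assumptions.

```python
import random


def od_ratio(O, S, D, A, eps=1e-9):
    """O-D ratio (O * (1 - S)) / (D * A), with an illustrative zero guard."""
    return (O * (1.0 - S)) / max(D * A, eps)


def monte_carlo_od(n_samples=1000, threshold_od=2.0, seed=0):
    """Sample states, compute ratios, and summarize them.

    The uniform ranges below are hypothetical; they were chosen only to
    stay inside the documented ranges of O, S, D, and A.
    """
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_samples):
        O = rng.uniform(2.0, 8.0)
        S = rng.uniform(0.2, 0.9)
        D = rng.uniform(2.0, 8.0)
        A = rng.uniform(0.2, 0.9)
        ratios.append(od_ratio(O, S, D, A))
    ratios.sort()

    def percentile(p):
        # Nearest-rank percentile on the sorted sample.
        idx = int(round(p * (n_samples - 1)))
        return ratios[idx]

    return {
        "median": percentile(0.5),
        "lo95": percentile(0.025),
        "hi95": percentile(0.975),
        "frac_above_threshold": sum(r > threshold_od for r in ratios) / n_samples,
    }


summary = monte_carlo_od(n_samples=1000, threshold_od=2.0, seed=0)
print(summary)
```

Fixing the seed makes a run reproducible, which helps when comparing archetypes or thresholds on the same draws.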
Tests four interventions (staged rollout, restricted access, defensive adoption acceleration, and a combined strategy), each with Monte Carlo uncertainty quantification:

    python policy_experiments.py --models industry_std high_safety open_weights --n-samples 1000 --output-dir policy_experiments/

Options:

- `--models`: one or more of `industry_std`, `open_weights`, `high_safety`
- `--n-samples`: samples per policy (default: 1000)
- `--quick`: faster run with reduced resolution
- `--output-dir`: directory for results and plots
Sweeps investment multipliers (0.5×–5×) and reports success rates and O-D trajectories with 95% confidence intervals:

    python investment_curve_experiments.py --models industry_std high_safety open_weights --n-samples 1000 --output-dir investment_curve_experiments

Options:

- `--models`: one or more archetypes
- `--n-samples`: samples per investment level (default: 1000)
- `--quick`: faster run with reduced resolution
- `--output-dir`: directory for results and plots
One-at-a-time (OAT) analysis across 15 parameters, including all initial conditions. Produces a tornado diagram and a parameter ranking:

    python sensitivity_analysis.py --baseline median
    python sensitivity_analysis.py --baseline high_safety

Options:

- `--baseline`: `default`, `median`, `industry_std`, `open_weights`, or `high_safety`
- `--variation`: fractional variation per parameter (default: 0.20 = ±20%)
- `--output-dir`: directory for results
- `--quick`: sequential rather than parallel execution
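
The OAT procedure itself is simple enough to sketch: perturb one parameter at a time by ±`variation`, hold the others at baseline, and rank parameters by how far the output swings. The example below applies this to the O-D ratio formula rather than to the full 15-parameter model. The baseline values and function names are hypothetical, and `sensitivity_analysis.py` may differ in its details.

```python
def od_ratio(p):
    """O-D ratio from a dict with keys O, S, D, A."""
    return (p["O"] * (1.0 - p["S"])) / (p["D"] * p["A"])


def oat_sensitivity(model, baseline, variation=0.20):
    """One-at-a-time sensitivity.

    Returns the baseline output and a list of
    (name, output_low, output_high, swing) tuples sorted by swing,
    largest first, which is the order used in a tornado diagram.
    """
    base_output = model(baseline)
    rows = []
    for name, value in baseline.items():
        low = dict(baseline)
        high = dict(baseline)
        low[name] = value * (1.0 - variation)
        high[name] = value * (1.0 + variation)
        y_low, y_high = model(low), model(high)
        rows.append((name, y_low, y_high, abs(y_high - y_low)))
    rows.sort(key=lambda row: row[3], reverse=True)
    return base_output, rows


# Hypothetical baseline, not one of the repository's archetypes.
baseline = {"O": 5.0, "S": 0.5, "D": 5.0, "A": 0.5}
base_output, ranking = oat_sensitivity(od_ratio, baseline, variation=0.20)
print(f"baseline ratio: {base_output:.3f}")
for name, y_low, y_high, swing in ranking:
    print(f"{name}: {y_low:.3f} to {y_high:.3f} (swing {swing:.3f})")
```

OAT is cheap and easy to read, but it varies one parameter at a time and so does not capture interactions between parameters.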
Extends the OAT analysis to parameters held constant across archetypes (community dynamics, safeguard bounds, learning rates):

    python sensitivity_analysis_fixed_params.py --baseline median --focus-on all --output-dir s_min_test/fixed_param_sensitivity
    python sensitivity_analysis_fixed_params.py --baseline median --focus-on median --output-dir s_min_test/fixed_param_sensitivity

Options:

- `--focus-on`: `all`, `safeguard`, `defensive`, `scaffolding`, or `community`
- `--output-dir`: directory for results
Compares Monte Carlo predictions against observed safeguard effectiveness data using three parameter modes:

    python historical_validation.py --n-samples 1000 --evidence-based
    python historical_validation.py --n-samples 1000 --calibrated
    python historical_validation.py --n-samples 1000

Options:

- `--evidence-based`: parameters derived from published evidence
- `--calibrated`: parameters fitted to observed data
- (neither flag): uncalibrated baseline
- `--reference-file`: path to reference data JSON (default: `high_safety_reference_data.json`)
- `--output-dir`: directory for plots, reports, and JSON results
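
Two common ways to score agreement between a Monte Carlo prediction band and observed data are the RMSE of the median prediction and the share of observations that fall inside the band. The sketch below shows both. Whether `historical_validation.py` reports these particular metrics is not stated here, so treat this as an assumption. All numbers are made-up placeholders, not values from the reference file.

```python
import math


def rmse(predicted, observed):
    """Root mean squared error between two equal-length sequences."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def coverage(lower, upper, observed):
    """Fraction of observations lying inside [lower, upper] at each time point."""
    inside = sum(lo <= o <= hi for lo, o, hi in zip(lower, observed, upper))
    return inside / len(observed)


# Placeholder safeguard-effectiveness values S(t), for illustration only.
observed = [0.90, 0.80, 0.70, 0.65]
median = [0.88, 0.82, 0.74, 0.60]
lower = [0.80, 0.72, 0.66, 0.52]
upper = [0.95, 0.90, 0.80, 0.62]

print(f"RMSE: {rmse(median, observed):.3f}")                # RMSE: 0.035
print(f"coverage: {coverage(lower, upper, observed):.2f}")  # coverage: 0.75
```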
Each script writes to its --output-dir. Key outputs:
| Script | Outputs |
|---|---|
| `od_model.py` | `mc_{model}_results.json`, ensemble plot |
| `policy_experiments_streamlined.py` | comparison plots, LaTeX table, results JSON |
| `investment_curve_analysis.py` | trajectory plots, investment table, results JSON |
| `sensitivity_analysis.py` | tornado diagram, proactive research analysis, results JSON |
| `fixed_param_sensitivity.py` | safeguard impact plot, parameter ranges plot, results JSON |
| `historical_validation.py` | validation plot, diagnostic panel, report, results JSON |
Example invocations for each script:

    $ python od_model.py --monte-carlo --model industry_std
    $ python od_model.py --monte-carlo --model open_weights
    $ python od_model.py --monte-carlo --model high_safety
    $ python investment_curve_experiments.py --models high_safety --n-samples 1000 --output-dir results/investment_experiments
    $ python policy_experiments.py --models industry_std open_weights high_safety --n-samples 1000 --output-dir ./results/policy_experiments/
    $ python sensitivity_analysis.py --baseline median --output-dir ./results/sensitivity_analysis/
    $ python sensitivity_analysis_fixed_params.py --baseline median --output-dir ./results/sensitivity_analysis_fixed_params/
    $ python historical_validation.py --n-samples 1000 --output-dir ./results/historical_validation/ --evidence-based