AI Offense-Defense Balance: A Systems Dynamics Model

Code for the paper "Modeling Offense-Defense Balance in AI Safety: Temporal Dynamics of the Jailbreak-Safeguard Arms Race", presented at TAIS 2026.

Overview

Current AI risk frameworks assess capability at a point in time but cannot forecast how the balance between offensive capability (jailbreaking) and defensive safeguards evolves after deployment. This repository provides a systems dynamics model of that co-evolution, implemented as a coupled ODE system with Monte Carlo uncertainty quantification.

The model tracks four state variables over a 24-month post-release window:

Variable	Range	Description
O(t)	[0, 10]	Offensive capability
S(t)	[0, 1]	Safeguard effectiveness
D(t)	[0, 10]	Defensive capability
A(t)	[0, 1]	Defensive adoption rate

The O-D ratio, (O × (1−S)) / (D × A), serves as the leading risk indicator. It compares the offensive capability that gets past safeguards with the defensive capability that is actually adopted, so values above 2.0 signal that effective offense is more than twice the deployed defense.
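As a minimal sketch of the formula above (not the code in `od_model.py`), the ratio and the threshold check can be written as follows. The function names and the `eps` guard against a zero denominator are assumptions made for this example.

```python
def od_ratio(O, S, D, A, eps=1e-9):
    """O-D ratio (O * (1 - S)) / (D * A), as defined in the text.

    eps is an assumption of this sketch: it avoids division by zero when
    D or A is 0. The repository may handle that case differently.
    """
    return (O * (1.0 - S)) / max(D * A, eps)


def exceeds_threshold(ratio, threshold=2.0):
    """True when the ratio is above the risk threshold (2.0 by default)."""
    return ratio > threshold


# Illustrative state values only, chosen inside the documented ranges.
example = od_ratio(O=6.0, S=0.5, D=4.0, A=0.5)
print(example, exceeds_threshold(example))
```

The numerator discounts offense by safeguard effectiveness and the denominator discounts defense by adoption, so the ratio rises when safeguards erode or adoption lags even if O and D stay fixed.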

Three model archetypes are provided, representing different safety investment profiles:

Key	Description
`high_safety`	High safety investment
`industry_std`	Industry standard
`open_weights`	Open weights model
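To make the overall structure concrete, here is a toy sketch of a coupled system over O, S, D, and A, integrated for 24 months under sampled parameters. Everything in it is an assumption for illustration: the right-hand sides, parameter names, sampling ranges, and initial conditions are placeholders, not the equations or values used in the paper or in `od_model.py`. It uses forward Euler and the standard library only, so it runs without SciPy.

```python
import random


def derivatives(state, p):
    """Placeholder right-hand side; NOT the paper's equations."""
    O, S, D, A = state
    dO = p["g_o"] * O * (1 - O / 10)      # offense: logistic growth toward 10
    dS = -p["erosion"] * S * O / 10       # safeguards erode under offensive pressure
    dD = p["g_d"] * D * (1 - D / 10)      # defense: logistic growth toward 10
    dA = p["adopt"] * A * (1 - A)         # adoption: logistic growth toward 1
    return dO, dS, dD, dA


def clamp(x, lo, hi):
    return max(lo, min(hi, x))


def simulate(p, months=24, steps_per_month=20):
    """Forward-Euler integration over the post-release window."""
    dt = 1.0 / steps_per_month
    O, S, D, A = 1.0, 0.9, 1.0, 0.1       # hypothetical initial conditions
    for _ in range(months * steps_per_month):
        dO, dS, dD, dA = derivatives((O, S, D, A), p)
        # Keep each variable inside its documented range.
        O = clamp(O + dt * dO, 0.0, 10.0)
        S = clamp(S + dt * dS, 0.0, 1.0)
        D = clamp(D + dt * dD, 0.0, 10.0)
        A = clamp(A + dt * dA, 0.0, 1.0)
    return O, S, D, A


def monte_carlo(n_samples=200, seed=0):
    """Sample parameters, integrate, and collect the final O-D ratio."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_samples):
        p = {  # hypothetical parameter ranges
            "g_o": rng.uniform(0.10, 0.30),
            "erosion": rng.uniform(0.01, 0.10),
            "g_d": rng.uniform(0.05, 0.25),
            "adopt": rng.uniform(0.05, 0.30),
        }
        O, S, D, A = simulate(p)
        ratios.append(O * (1 - S) / max(D * A, 1e-9))
    return sorted(ratios)


ratios = monte_carlo()
n = len(ratios)
print("median:", ratios[n // 2])
print("95% band:", ratios[int(0.025 * (n - 1))], ratios[int(0.975 * (n - 1))])
```

In this sketch an archetype would correspond to a different set of sampling ranges and initial conditions. The repository's own archetype definitions live in `od_model.py`.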

Repository Structure

od_model.py                            # Core model: ODEs, Monte Carlo, archetypes
policy_experiments.py                  # Monte Carlo governance policy analysis
investment_curve_experiments.py        # Investment level vs. sustained safety
sensitivity_analysis.py                # OAT parameter sensitivity
sensitivity_analysis_fixed_params.py   # Sensitivity of architectural constants
historical_validation.py               # Reference mode validation against observed data
claude_35_sonnet_reference_data.json   # Historical validation data (high safety archetype)

Requirements

numpy
scipy
matplotlib

Install with:

pip install numpy scipy matplotlib

Usage

Core model

Run the Monte Carlo ensemble for a single archetype:

python od_model.py --model industry_std --n-samples 1000 --output-dir results

Options:

--model — industry_std, open_weights, or high_safety
--n-samples — number of Monte Carlo samples (default: 1000)
--duration — simulation duration in months (default: 24)
--threshold-od — O-D ratio threshold for risk classification (default: 2.0)
--output-dir — directory for results and plots

Governance policy experiments

Tests four interventions — staged rollout, restricted access, defensive adoption acceleration, and a combined strategy — each with Monte Carlo uncertainty quantification:

python policy_experiments.py --models industry_std high_safety open_weights --n-samples 1000 --output-dir policy_experiments/

Options:

--models — one or more of industry_std, open_weights, high_safety
--n-samples — samples per policy (default: 1000)
--quick — faster run with reduced resolution
--output-dir — directory for results and plots

Investment curve analysis

Sweeps investment multipliers (0.5×–5×) and reports success rates and O-D trajectories with 95% confidence intervals:

python investment_curve_experiments.py --models industry_std high_safety open_weights --n-samples 1000 --output-dir investment_curve_experiments

Options:

--models — one or more archetypes
--n-samples — samples per investment level (default: 1000)
--quick — faster run with reduced resolution
--output-dir — directory for results and plots
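The sweep in this section can be pictured with the following sketch. It is not the repository's implementation. `run_once` is a hypothetical stand-in for a single Monte Carlo run, with an invented success probability so the example executes. The Wilson score interval is also an assumption, since the README does not say which interval method the script uses.

```python
import math
import random


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    if n == 0:
        return (0.0, 1.0)
    p_hat = successes / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def run_once(multiplier, rng):
    """Hypothetical stand-in for one Monte Carlo run.

    Returns True when the run counts as a success. The probability below is
    invented for this sketch; the real script integrates the model instead.
    """
    p_success = min(0.95, 0.2 * multiplier)
    return rng.random() < p_success


def sweep(multipliers, n_samples=1000, seed=0):
    """Success rate and 95% interval for each investment multiplier."""
    rng = random.Random(seed)
    rows = []
    for m in multipliers:
        k = sum(run_once(m, rng) for _ in range(n_samples))
        lo, hi = wilson_interval(k, n_samples)
        rows.append((m, k / n_samples, lo, hi))
    return rows


for m, rate, lo, hi in sweep([0.5, 1.0, 2.0, 5.0]):
    print(f"{m:>4.1f}x  success={rate:.3f}  95% CI=[{lo:.3f}, {hi:.3f}]")
```

The Wilson interval was chosen here because it stays inside [0, 1] for rates near 0 or 1, which can occur at the ends of a 0.5×–5× sweep.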

Sensitivity analysis

One-at-a-time (OAT) analysis across 15 parameters, including all initial conditions. Produces a tornado diagram and a parameter ranking:

python sensitivity_analysis.py --baseline median
python sensitivity_analysis.py --baseline high_safety

Options:

--baseline — default, median, industry_std, open_weights, or high_safety
--variation — fractional variation per parameter (default: 0.20 = ±20%)
--output-dir — directory for results
--quick — sequential rather than parallel execution
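The one-at-a-time procedure can be sketched as below. This is an illustration under assumptions, not the code in `sensitivity_analysis.py`: `toy_model` is a hypothetical scalar output standing in for the model's risk metric, and the baseline values are made up. Each parameter is moved by ±20% while the others stay at baseline, and parameters are ranked by the resulting swing, which is the order of the bars in a tornado diagram.

```python
def oat_sensitivity(model, baseline, variation=0.20):
    """Vary each parameter by +/- variation with all others held at baseline."""
    base_out = model(baseline)
    rows = []
    for name, value in baseline.items():
        low = model({**baseline, name: value * (1 - variation)})
        high = model({**baseline, name: value * (1 + variation)})
        # Store the change in output relative to the baseline run.
        rows.append((name, low - base_out, high - base_out))
    # Largest total swing first, as in a tornado diagram.
    rows.sort(key=lambda r: abs(r[2] - r[1]), reverse=True)
    return rows


def toy_model(p):
    """Hypothetical scalar output standing in for the final O-D ratio."""
    return (p["O"] * (1 - p["S"])) / (p["D"] * p["A"])


baseline = {"O": 6.0, "S": 0.5, "D": 4.0, "A": 0.5}  # illustrative values only
for name, d_low, d_high in oat_sensitivity(toy_model, baseline):
    print(f"{name}: {d_low:+.3f} / {d_high:+.3f}")
```

OAT does not capture interactions between parameters. It ranks parameters only by their individual effect around the chosen baseline.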

Architectural parameter sensitivity

Extends the OAT analysis to parameters held constant across archetypes (community dynamics, safeguard bounds, learning rates):

python sensitivity_analysis_fixed_params.py --baseline median --focus-on all --output-dir s_min_test/fixed_param_sensitivity
python sensitivity_analysis_fixed_params.py --baseline median --focus-on safeguard --output-dir s_min_test/fixed_param_sensitivity

Options:

--focus-on — all, safeguard, defensive, scaffolding, or community
--output-dir — directory for results

Historical validation

Compares Monte Carlo predictions against observed safeguard effectiveness data using three parameter modes:

python historical_validation.py --n-samples 1000 --evidence-based
python historical_validation.py --n-samples 1000 --calibrated
python historical_validation.py --n-samples 1000

Options:

--evidence-based — parameters derived from published evidence
--calibrated — parameters fitted to observed data
(neither flag) — uncalibrated baseline
--reference-file — path to reference data JSON (default: high_safety_reference_data.json)
--output-dir — directory for plots, reports, and JSON results

Outputs

Each script writes to its --output-dir. Key outputs:

Script	Outputs
`od_model.py`	`mc_{model}_results.json`, ensemble plot
`policy_experiments.py`	comparison plots, LaTeX table, results JSON
`investment_curve_experiments.py`	trajectory plots, investment table, results JSON
`sensitivity_analysis.py`	tornado diagram, proactive research analysis, results JSON
`sensitivity_analysis_fixed_params.py`	safeguard impact plot, parameter ranges plot, results JSON
`historical_validation.py`	validation plot, diagnostic panel, report, results JSON

Paper Results Replication

1. 24-month OD evolution (Section 5.1)

$ python od_model.py --monte-carlo --model industry_std
$ python od_model.py --monte-carlo --model open_weights
$ python od_model.py --monte-carlo --model high_safety

2. Policy Functions (Section 5.2; Appendix C)

Investment experiments

$ python investment_curve_experiments.py --models high_safety --n-samples 1000 --output-dir results/investment_experiments

Policy experiments

$ python policy_experiments.py --models industry_std open_weights high_safety --n-samples 1000 --output-dir ./results/policy_experiments/

3. Robustness Analysis (Appendix B)

Variable Parameters (Appendix B.1)

$ python sensitivity_analysis.py --baseline median --output-dir ./results/sensitivity_analysis/

Fixed Parameters (Appendix B.2)

$ python sensitivity_analysis_fixed_params.py --baseline median --output-dir ./results/sensitivity_analysis_fixed_params/

4. Historical Validation (Appendix D)

$ python historical_validation.py --n-samples 1000 --output-dir ./results/historical_validation/ --evidence-based
