Interactive Python simulations to understand how distributed systems fail and how to prevent those failures.
Watch retry storms create self-sustaining system collapse. You'll see how a small spike in traffic can trigger a cascade that persists even after the spike ends.
You'll learn: Why backoff matters, how coordination overhead kills systems, when circuit breakers save you.
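To make the backoff point concrete, here is a minimal sketch (not the simulator's own code) comparing when failed clients retry with and without jittered exponential backoff. The function names, the 1-second base, and the 30-second cap are illustrative assumptions.

```python
import random
from collections import Counter


def backoff_ceiling(attempt, base=1.0, cap=30.0):
    """Upper bound (seconds) on the wait before retry number `attempt` (0-based)."""
    return min(cap, base * (2 ** attempt))


def retry_times(num_clients, attempt, jitter, seed=0):
    """When each failed client fires its next retry, in seconds after the failure."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    ceiling = backoff_ceiling(attempt)
    if jitter:
        # "Full jitter": each client waits a random time in [0, ceiling].
        return [rng.uniform(0.0, ceiling) for _ in range(num_clients)]
    # No jitter: every client waits the same time, so all retries land together.
    return [ceiling] * num_clients


def peak_per_second(times):
    """Largest number of retries that land in any one-second bucket."""
    return max(Counter(int(t) for t in times).values())


if __name__ == "__main__":
    print("peak without jitter:", peak_per_second(retry_times(100, 3, jitter=False)))
    print("peak with jitter:   ", peak_per_second(retry_times(100, 3, jitter=True)))
```

Without jitter, all 100 retries hit the server in the same second, which is the synchronized burst that keeps an overloaded system overloaded. With jitter, the same retries are spread over the backoff window.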
Compare quorum-based consistency (R+W>N) against eventual consistency. See the real latency costs of coordination.
You'll learn: When to choose consistency over latency, how network partitions affect availability, why most systems use eventual consistency.
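The R+W>N rule and its latency cost can be sketched in a few lines. This is an illustration under simple assumptions (a request completes once the k fastest replicas reply), not the simulator's model; the function names and the sample latencies are hypothetical.

```python
def quorums_overlap(n, r, w):
    """True when every read quorum intersects every write quorum (R + W > N)."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return r + w > n


def quorum_latency(replica_latencies_ms, k):
    """Latency of waiting for the k fastest replicas: the k-th smallest value."""
    if not 1 <= k <= len(replica_latencies_ms):
        raise ValueError("k must be between 1 and the number of replicas")
    return sorted(replica_latencies_ms)[k - 1]


if __name__ == "__main__":
    latencies = [5, 40, 120]  # hypothetical per-replica round trips in ms
    for r, w in [(1, 1), (2, 2), (3, 1)]:
        print(f"R={r} W={w} overlap={quorums_overlap(3, r, w)} "
              f"read_latency={quorum_latency(latencies, r)}ms")
```

With these sample numbers, reading from one replica takes 5 ms, while a read quorum of two waits for the second-fastest replica at 40 ms. That gap is the coordination cost the simulator lets you explore.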
Watch CRDTs converge without any coordination. See how mathematical properties guarantee eventual consistency.
You'll learn: G-Counters, PN-Counters, OR-Sets, LWW-Maps, when CRDTs work (and when they don't).
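As a taste of why CRDTs converge, here is a minimal G-Counter sketch. It is a textbook-style illustration, not the code from crdt_simulator.py; the class and method names are assumptions.

```python
class GCounter:
    """Grow-only counter: one slot per replica, merged by element-wise max."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica id -> increments observed from that replica

    def increment(self, amount=1):
        if amount < 0:
            raise ValueError("a G-Counter can only grow")
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # max is commutative, associative, and idempotent, so replicas can
        # merge in any order, any number of times, and reach the same state.
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)


if __name__ == "__main__":
    a, b = GCounter("a"), GCounter("b")
    a.increment(2)
    b.increment(3)
    a.merge(b)
    b.merge(a)
    print(a.value(), b.value())
```

Both replicas end at the same total without exchanging locks or agreeing on an order. A PN-Counter follows the same idea with two G-Counters, one for increments and one for decrements.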
See how two working systems fail when connected. Based on research showing that 20% of cloud incidents are cross-system interaction (CSI) failures.
You'll learn: Why schema mismatches crash systems, how config incoherence causes failures, why testing in isolation isn't enough.
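The schema-mismatch case is easy to reproduce in miniature. In this hypothetical sketch, the producer and the consumer each work against their own assumptions and fail only when connected. The service names, field names, and the contract_check helper are invented for illustration and are not taken from the simulator.

```python
import json


def producer_emit(user_id):
    """Upstream service: serializes user_id as a string (hypothetical newer schema)."""
    return json.dumps({"user_id": str(user_id), "amount_cents": 1250})


def consumer_parse(payload):
    """Downstream service: still assumes user_id is an integer (hypothetical older schema)."""
    record = json.loads(payload)
    if not isinstance(record["user_id"], int):
        raise TypeError(
            f"user_id must be int, got {type(record['user_id']).__name__}"
        )
    return record["user_id"], record["amount_cents"] / 100


def contract_check(payload, expected_types):
    """Return the fields whose types differ from what the consumer expects."""
    record = json.loads(payload)
    return [
        name
        for name, expected in expected_types.items()
        if not isinstance(record.get(name), expected)
    ]


if __name__ == "__main__":
    payload = producer_emit(7)
    print("mismatched fields:",
          contract_check(payload, {"user_id": int, "amount_cents": int}))
```

A unit test of either side alone passes. Only a check that runs the producer's real output against the consumer's expectations, such as contract_check above, surfaces the mismatch before production does.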
git clone https://github.com/bhatti/simulators
cd simulators
pip install -r requirements.txt
Option 1: Interactive Menu
python run_all_simulators.py
Option 2: Run Individual Simulators
streamlit run metastable_simulator.py
streamlit run cap_consistency_simulator.py
streamlit run crdt_simulator.py
streamlit run csi_failure_simulator.py
The simulators will open in your browser with interactive controls, real-time charts, and educational content.
Each simulator has:
- Sliders: Adjust system parameters
- Run Button: Start the simulation
- Charts: Watch failures happen in real-time
- Expandable Sections: Learn the theory behind what you're seeing
- Experiment Ideas: Guided explorations to build intuition
- Run: streamlit run metastable_simulator.py
- Set load pattern to "spike"
- Disable "Enable Exponential Backoff"
- Click "Run Simulation"
- Watch P99 latency shoot up and stay high even after the spike ends
- Enable backoff and run again - see the difference!
- Why small traffic spikes cause permanent degradation
- How retry storms amplify problems
- Why your retry logic might be making things worse
- When to use circuit breakers vs backoff
- Real latency costs of strong consistency (3x slower!)
- When eventual consistency is good enough
- How network partitions force you to choose availability or consistency
- Why most systems pick latency over consistency in normal operation
- How to get consistency without coordination
- Why order doesn't matter for some operations
- When you can avoid distributed locks
- Where CRDTs break down (spoiler: constraints and uniqueness)
- Why 20% of cloud incidents involve working systems failing together
- How schema mismatches cause crashes
- Why configuration is a cross-system problem
- How to test systems together, not just alone
Built with:
- SimPy: Discrete event simulation
- Streamlit: Interactive web UI
- Plotly: Real-time charts
- NumPy/Pandas: Statistical analysis
Research Papers:
- AWS HotOS 2025: Metastability prediction
- Abadi 2012: PACELC framework
- Shapiro et al. 2011: CRDTs
- EuroSys 2023: Cross-system interaction failures
MIT License - Use freely for education and research.