What is DevOps GameDay?
devops [dev-ops] -noun
- A cultural and professional movement of developers, system administrators, and network engineers who choose to lock arms, work together to solve problems, and genuinely care about the success of their organization.
gameday [game-day] -noun
- A tournament of trial by fire.
When you put the two together, you get a group of devops gurus integrating the tools and systems used by some of the world's largest web sites, putting them to the test through a series of this-actually-happened failure scenarios, and experimenting with how we can all maximize our uptime.
New Ventures at DevOps GameDay
We've created a tool (or system) that uses Amazon EC2 to eliminate application downtime when a server drops off the network.
How it works:
Initially, we have a web application at Reserve.Dyntini.com running in Amazon EC2, with load shared between two WordPress servers.
One of the WordPress nodes drops offline, and our system automatically detects the failure.
The system then automatically goes into an "available but degraded" state and finally recovers into a fully operational state.
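The detection step could be sketched roughly as follows. This is a minimal illustration, not the project's actual Ruby code: the `NodeHealth` class, the node names, and the threshold of three consecutive failed checks are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class NodeHealth:
    """Tracks consecutive failed health checks for one node (hypothetical helper)."""
    name: str
    failure_threshold: int = 3  # assumed value, not taken from the real setup
    consecutive_failures: int = 0

    def record(self, check_passed: bool) -> None:
        # A passing check resets the counter; a failing one increments it.
        if check_passed:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1

    @property
    def is_down(self) -> bool:
        # The node counts as failed once the threshold is reached.
        return self.consecutive_failures >= self.failure_threshold


def healthy_nodes(nodes):
    """Return the names of nodes that should keep receiving traffic."""
    return [n.name for n in nodes if not n.is_down]


# Hypothetical node names; wordpress-2 fails three checks in a row.
wp1 = NodeHealth("wordpress-1")
wp2 = NodeHealth("wordpress-2")
wp1.record(True)
for passed in (True, False, False, False):
    wp2.record(passed)

print(healthy_nodes([wp1, wp2]))  # ['wordpress-1']
```

Requiring several consecutive failures before declaring a node down is one common way to avoid reacting to a single dropped check. A real monitor would feed `record` from actual HTTP or ping probes.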
For more information, check out our video.
We have a WordPress web application running at Reserve.Dyntini.com in Amazon EC2. We also have the source code in Ruby available for download!
Where and When?
- Where: Velocity 2011 in Santa Clara, CA - Ballroom AB
- When: 9:30pm Wednesday, 06/15/2011
Engineers from the following organizations:
- Dyn Inc. (the managed DNS and failover folks behind the Dynect Platform)
- OpsCode (the creators of the automation and configuration management tool Chef)
- Zenoss (the open source server and network monitoring experts)
What will we be doing?
We've created a simple web application running WordPress at http://reserve.dyntini.com in Amazon EC2. The system in nominal state looks like the diagram below.
Initially, the load is shared among two WordPress servers. During the GameDay event, we'll simulate failures to the system (e.g., bring down a WordPress node), and we'll experiment with ways to have our system automatically detect the failure, automatically go into an "available but degraded" state, and then finally automatically recover into a fully operational state.
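As a rough sketch of the states described above, the overall system state can be derived from how many WordPress nodes are currently healthy. The names `SystemState`, `classify`, and `timeline` are hypothetical, and the `DOWN` state is an assumption added for completeness. Only the two-node setup and the "available but degraded" and "fully operational" states come from the scenario itself.

```python
from enum import Enum


class SystemState(Enum):
    NOMINAL = "fully operational"
    DEGRADED = "available but degraded"
    DOWN = "unavailable"  # assumed extra state, not part of the described scenario


def classify(healthy: int, expected: int = 2) -> SystemState:
    """Map the number of healthy WordPress nodes to an overall state."""
    if healthy >= expected:
        return SystemState.NOMINAL
    if healthy > 0:
        return SystemState.DEGRADED
    return SystemState.DOWN


def timeline(healthy_counts, expected: int = 2):
    """Return the sequence of states, collapsing consecutive repeats."""
    states = []
    for count in healthy_counts:
        state = classify(count, expected)
        if not states or states[-1] is not state:
            states.append(state)
    return states


# Two nodes up, one fails, then a replacement comes online.
for state in timeline([2, 2, 1, 1, 2]):
    print(state.value)
```

Run against the observation sequence `[2, 2, 1, 1, 2]`, this prints the three phases in order: fully operational, available but degraded, fully operational.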
For more information, check out our video walkthrough of the scenario.
Once we get this scenario working, we'll continue to add complexity and seek real-time suggestions from the audience on what other failures can occur and what can be done to automate their mitigation.
Want more info?
Ask questions via Twitter to @davenielsen
Want to join the effort and contribute?
Join the DevOps GameDay Google Group