
AWS_see_spots_run (AWS_SSR)

What is AWS_SSR?

AWS_SSR is a Chef cookbook designed to manage AWS autoscaling groups (ASGs) that utilize spot instances via their launch configurations. It aims to keep your spot instances up and running within your SSR configuration parameters by adjusting the availability zones (AZs) and bid prices on your behalf as the market changes. Do less work managing your infrastructure; pay less at the month's end.

AWS_SSR is primarily tag-driven: it keeps its state in ASG tags and limits its calls to other AWS APIs beyond retrieving those tags. Tags are persistent and, once in place, can be modified on a per-ASG basis if necessary. When an ASG's SSR_config tag is adjusted manually, those configuration changes will take hold as long as the new values are valid.
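If you want to see what SSR has recorded for a group, the tags are ordinary ASG tags and can be read with any AWS client. A minimal sketch with boto 2.x (the region name is just an example; only the SSR_config key is shown here, and the AZ-status tag can be inspected the same way):

```python
import boto.ec2.autoscale

# Example region; boto picks up credentials as usual (instance profile,
# environment variables, or a ~/.boto config file).
conn = boto.ec2.autoscale.connect_to_region('us-east-1')

# get_all_tags() returns Tag objects with .resource_id (the ASG name),
# .key, and .value.
for tag in conn.get_all_tags():
    if tag.key == 'SSR_config':
        print('%s -> %s' % (tag.resource_id, tag.value))
```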

Why build AWS_SSR?

The struggle is real when trying to manage spot instances out of the box without automation. The goals of this project are several:

  1. Eliminate the need for manual tuning of ASGs using spot instances in reaction to market prices, instance availability, or misconfiguration of resources.
  2. Increase availability when operating under spot instances by introducing demand fallback, price adjustment, and AZ management.
  3. Decrease overall operational cost by making spot instances a more reliable solution, essentially never paying above the on-demand price for the groups managed by SSR.

Spot instances are potentially a huge boon to AWS users as you can cut EC2 costs by 80%-85% in a best-case scenario, but they carry with them some serious pitfalls:

  1. Some instance types cannot be acquired in every AZ of a given region. If an ASG is created with an AZ where its instance type is unavailable, the group is likely to stop scaling when that bogus AZ is tried; new instances cannot be provisioned because the one zone blocks further scaling.
  2. Even in zones where an instance type can be acquired, there is no guarantee that capacity is sufficient on the spot market to fulfill the request. This is invisible to users until a spot request is placed and begins to fail. Spot requests will hang indefinitely in this scenario.
  3. Spot prices change constantly, but ASGs don't respond to the change beyond losing instances and trying to spin them up again (hopefully elsewhere). A proactive response to an outbid AZ would be better (see the sketch after this list).
  4. If configured improperly, ELB settings can prevent instances from launching healthy into their ASGs. Part of that configuration can be handled via AWS_SSR.
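As an illustration of the kind of proactive check SSR automates (a sketch, not the cookbook's actual code), the latest spot price per AZ can be compared against a ceiling with boto 2.x; the instance type, region, and ceiling price below are placeholders:

```python
import boto.ec2

CEILING = 0.070               # example ceiling, e.g. the on-demand price
INSTANCE_TYPE = 'm3.medium'   # placeholder instance type

conn = boto.ec2.connect_to_region('us-east-1')  # example region
history = conn.get_spot_price_history(instance_type=INSTANCE_TYPE,
                                      product_description='Linux/UNIX')

# Keep only the most recent price point per availability zone.
latest = {}
for point in history:
    az = point.availability_zone
    if az not in latest or point.timestamp > latest[az].timestamp:
        latest[az] = point

for az, point in sorted(latest.items()):
    status = 'ok' if point.price < CEILING else 'outbid'
    print('%s  %.4f  %s' % (az, point.price, status))
```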

Quick start

  1. Find a solo host in your Chef-controlled infrastructure (e.g. a standalone cron host) and add recipe[AWS_see_spots_run::cron_jobs] to its run list. Ideally this host has an instance profile with all the necessary AWS permissions; if not, you can arrange for credentials to be picked up either via environment variables on the cron jobs or via a flat file on the system, as outlined in the boto config docs. (A quick credentials sanity check is sketched after this list.)
  2. Write a small wrapper cookbook to change any recipe resources or override any attributes (e.g. adding the --verbose flag and redirecting stderr/stdout to a log file for insight into the actions taken).
  3. Suggested: recreate any launch configurations that specify a bid price, setting the bid to the current on-demand price for that instance type, and apply them to the relevant autoscaling group(s). We do all of this via CloudFormation, which I would also recommend.
  4. Run chef-client to push the cron jobs, scripts, and packages. This is a good time to try each of the scripts by hand with the --dry-run or --verbose flags and check the output to see what the code is up to.
  5. Tell your CFO you just saved him a chunk of change and enjoy a tasty beverage.
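Before leaning on the cron jobs, it's worth confirming that boto actually finds credentials and can see your groups. A minimal sketch with boto 2.x, using an example region (a sanity check, not part of the cookbook itself):

```python
import boto.ec2.autoscale

# boto resolves credentials from the instance profile, from the
# AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY environment variables,
# or from a ~/.boto (or /etc/boto.cfg) config file.
conn = boto.ec2.autoscale.connect_to_region('us-east-1')  # example region

# If credentials or permissions are missing, this call will raise an error.
for group in conn.get_all_groups():
    print(group.name, group.launch_config_name, group.availability_zones)
```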

Requirements

AWS_SSR comes mostly free, but you'll need a few AWS resources available for it to work properly.

  • Two unused tags (of the default limit of ten) on each ASG to be SSR-managed. These are required for tracking SSR configuration options and the health status of AZs, and as much data as possible is crammed into them. Unless you're a heavy user of tags on ASGs, this isn't likely to be a constraint. I'm unsure whether this is a hard AWS limit or whether it can be increased on request.
  • An IAM role to run the script under. Details for this are covered in the AWS permissions doc.
  • A Linux host with python2.7 or python3.x and a chef client to put the bits in place.
  • Hosts running in EC2-Classic under ASGs whose launch configurations specify a spot_price (a quick check is sketched below).
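If you're not sure which launch configurations already set a bid, listing them with boto 2.x is a quick check (a sketch with an example region; spot_price is unset on launch configurations that launch on-demand instances):

```python
import boto.ec2.autoscale

conn = boto.ec2.autoscale.connect_to_region('us-east-1')  # example region

# LaunchConfiguration objects expose .name and .spot_price, among others.
for lc in conn.get_all_launch_configurations():
    kind = 'spot @ %s' % lc.spot_price if lc.spot_price else 'on-demand'
    print('%-40s %s' % (lc.name, kind))
```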

TODO

  • Testing could be introduced throughout (RuboCop, Foodcritic, ChefSpec, unittest, etc.). Is there a mock-AWS test suite anywhere out there?
  • Test on different versions of Python. All Chef clients should be able to handle the simple resources used.

Potential future features

  • instance type adjuster - the ability to bump instances up to a larger type in the same family when that option is available and within the originally specified price point.
  • basic on-demand ASG/ELB management - extend some of the ELB and ASG management of AZs to all ASGs (on-demand included). Some of the principles are not spot-specific and could benefit on-demand groups just the same.
  • Introduce VPC compatibility.

Gotchas/caveats/known issues

  • The AWS autoscaling GUI doesn't play nice once you start using a user-created launch configuration on an ASG. For whatever reason, it doesn't show up in the list of valid launch configs and, worse still, if you edit the ASG from there, the launch configuration will default to whatever shows up first alphabetically. If you don't cancel the change at this point, you'll end up with the wrong launch configuration attached to your ASG and you're going to have a bad time. Get good at the CLI if you need to alter values like desired capacity manually. I filed a support ticket asking AWS to fix this; it has since been resolved.
  • The chunk of code that determines the on-demand price for a given instance type in a given region relies on a set of deprecated not-quite-JSON files (e.g. [linux current gen](http://a0.awsstatic.com/pricing/1/ec2/linux-od.min.js)), since AWS doesn't provide an API for these rates. I nabbed that set of URLs from this excellent repo, which runs this excellent service. Thus far, that service has remained current and correct in its price values. We'll see how long AWS keeps the document up to date; if it becomes out of date or is dropped entirely, a static dictionary of prices (though tedious) could be added and maintained within this repo.
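For reference, pulling prices out of that file by hand looks roughly like the sketch below. This is not the cookbook's actual code; the callback wrapper, unquoted keys, and field layout are assumptions based on the file's format at the time of writing and may change without notice.

```python
import json
import re
import urllib2  # on Python 3: from urllib.request import urlopen

PRICING_URL = 'http://a0.awsstatic.com/pricing/1/ec2/linux-od.min.js'

def fetch_linux_on_demand_prices():
    raw = urllib2.urlopen(PRICING_URL).read().decode('utf-8')
    # The file is JSONP-ish: a callback(...) wrapper around a blob with
    # unquoted keys. Grab the outermost object and quote the bare keys so
    # the standard json module will accept it.
    blob = raw[raw.index('{'):raw.rindex('}') + 1]
    blob = re.sub(r'(?<=[{,])\s*(\w+)\s*:', r'"\1":', blob)
    data = json.loads(blob)

    prices = {}
    for region in data['config']['regions']:
        for itype in region['instanceTypes']:
            for size in itype['sizes']:
                usd = size['valueColumns'][0]['prices']['USD']
                if usd not in ('N/A', ''):
                    # key: (pricing-file region name, instance size)
                    prices[(region['region'], size['size'])] = float(usd)
    return prices

if __name__ == '__main__':
    for key, price in sorted(fetch_linux_on_demand_prices().items()):
        print('%-30s $%.3f/hr' % ('%s %s' % key, price))
```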