This repository has been archived by the owner on Dec 21, 2018. It is now read-only.

Machine Image Management

Julian C. Dunn edited this page Oct 5, 2014 · 1 revision

Friday, Issaquah, 14:00

Convener

Don

Participants

~1 dozen, including John Cook, John Kaiser, Galen Emery, Charles Johnson, Karl, Blake, and Claude.

Summary of Discussions

Where do images come from? How do we uniquely identify them?

Workflow 1) Use a default image, patch and do everything with automation after boot.

Workflow 2) Use a "baked" image, with patches and other items included in the image: a known snapshot in time.

Example: AMI in AWS: start from a base CentOS 6.5 image, apply a chef-client run on top of it, then repackage into another AMI.

Challenge: How do I find out which instances are spun up from that AMI?

Q: "Do you have cruft problems?" A: "Not from the perspective of the workflow. Some other teams do in fact have old images out there, found a lot weren't actively being used. Eventually as more teams adopt we'll have a bigger problem." Q: "Any thoughts about what to do? We have lots of one-offs." A: "We're not using it for active testing & dev. We just re-provision over virtualbox. Just for baking pre-prod or prod images."

Discussion of differential changes - if you're willing to mutate prod, you can have chef-client updating existing machines and the build creating new images based on the new code as well.

John K: "Maybe there's a place between immutable and just-in-time compiles. You'll find out that some stuff won't work anymore on your existing AMI's. A possible workflow is something is always building the AMI. Always remaking it from scratch every time, always up to the latest. Machine-image resource helps with that - We should back up."

"There is a resource that's just like a machine - you put recipes on it - called machine-image. Give it a name, say what recipes you want to run, what it will do is if the image doesn't exist or if cookbooks have been modified, then it will go provision an instance with those recipes, then save it to an AMI and destroy the instance. Use that for the workflow. New machines get that, and stragglers can age out."

"If you're continuously building, you get patches without having to destroy your infrastructure constantly. Most everything is still up to date most of the time, but you do age stuff out."

Don: "I'd look at something less commit-based and more time-based. Say build once a week - completely arbitrary - and run chef-client to pick up changes that made it through external workflow this is connected to, bake a new AMI at the end of the week. Anything based on the old AMI gets expired & re-launched."

John K: "The interval is - if you wait a week, the oldest machine in your infra is going to be 5 weeks old, rather than 1 month. If you do it on every commit, everyone has the latest AMI.

Don: "It's really motivated by the fact that as these changes pile up, your converge time will increase."

Q: Would you have to expire everything on the old template?

A: "I guess not, if they're going to be similar anyway."

Q: "So update the spin-up of new machines to use the new one instead of the old one."

Galen: "You'll have similar converge times, running boxes don't care about the size of the commit, they'll just update to the new desired state. It's the same as the new AMI, so what's the difference?"

A: "Call it DR testing."

Galen: "Yeah, Chaos Monkey!"

(description of Chaos Monkey happens.)

John K: "If you're gonna Chaos Monkey and kill boxes anyway, why not make your reaper be your chaos monkey?"

A: "Reaper monkey!"

Q: "My only thing about controlling the timeframe would be the branch - once you merge, that's the kickoff. Develop in a dev branch, bake an image on merge."

(Nods & murmurs of agreement. This is a good idea.)

Galen: "Do you guys do the AMI thing to reduce initial convergence time?"

Don: "More to have security that there won't be a one-off issue during converge that disables things. No unidentified blip and there's a converge problem we spin our gears with before we get back in the running."

John K: "Here's a machine_image example from Chef Metal." (whiteboard)

machine_image 'base' do
  recipe 'security_packages'
end

machine_image 'apache' do
  recipe 'apache2'
  from_image 'base'
end

machine 'myapp' do
  recipe 'myapp'
  from_image 'apache'
end

John K: "The actual expiration of a machine needs to be managed outside. It should warn. We don't destroy unless you say "destroy.""

Don: "Tell me more."

John K: "You can put the context of what provider to use outside the recipe, or put it in there, but by default it'll figure out from the driver what you want to do."

Galen: "This feels very similar to Docker."

John K: "Yeah! Main reason you do it this way is to have it updated with different frequencies. There's a docker driver, machine_image works fine with it because it's the same model. They feel like the same thing."

Don: "Sounds like the takeaway is that a lot of people don't feel there's a need to have an always available pre-baked image at all times, okay to do dynamic provisioning on top of something that already exists."

John K: "Feels necessary to me! Need a mechanism that can update quickly between when you're destroying machines. You just don't want it doing a ton of work."

Galen: "If you built it fully, when I make a commit to master, we build it, we don't touch the image ever again, next time we have a commit we go through the whole process... there's a slow build, but if you don't need to be there immediately. I'm encapsulating all this in an object rather than being the state of an object."

Q: "The main benefit seems to be the converge time is shorter. What's the acceptable length of a converge time?"

Galen: "How's a silver image model? Get things close? Pre-bake slow things, but do last-mile config at the last second. Then we still run Chef in the infrastructure with smaller changes."

John K: "It's all about how fast you want to be able to apply changes. Other benefit of images isn't just speed, if you start everyone from the same place it makes things really feel good to not update the image a lot. BECAUSE it decreases the amount of weird shit that can happen while you're doing initial setup. Say a race condition where MySQL is broken, you find that out when you build the first time. Now everyone has working mysql. Stuff does happen at compile time, image can help."

Charles: Compile-time argument - what do you pre-compile and ship as a binary (in the image), and what do you just-in-time compile (Chef)?

John: "Also there's a risk mitigation argument. For a fraught compile, build the image 10 times, find the one that worked."

Q: "For a larger organization, handing off the base image to one team and then have a java, tomcat, whatever on other teams, having a canonical image is really helpful. Allowing disparate teams to consume the same image is great."

John K: "I'd love to talk about that, I've been thinking about that. In large orgs, the people building the base packages - in my imagination there's a team that's really good, monitoring security patches, they know how to make the base OS. Others might be good at it, but these are the core people. Having the OS - actual help doing these kinds of expiration policies, a base image/policy. Those guys in central org ought to be able to update their cookbooks to say this is the right secure thing, use these cookbooks instead. they have a package in their Chef Org, what everybody else does is - if they build their own images, they - instead of copying it down, they could just reference it remotely. The security guys give everybody keys to talk to it, write your recipe to say "I wanna do from_image, but add an endpoint."

(Charles paints a bikeshed, says we should put it on Supermarket).

John K: "I like the idea of the chef-client being able to say "Get that node over there, not the node on my Chef Server."

Galen: "We want people in central IT to enforce a base image, and you could put it on supermarket, or on Chef Server, and someone says "here are the base things we have, pick the one you guys need to use, we'll take care of the security on that for you." Or enforce it, "You must use this."

John K: "Depends on organizational trust. In the org where they're enforcing what you want is a slightly different form where nobody gets the AWS keys except for the security org, and they use a proxy driver. "You're allowed to have a machine here with the image you're supposed to have, and here you go." Then they can define policies about flavors of machines, keep track of them. Central asset tracking for AWS stuff.

"The other thing that would be interesting about that is you could build in that. Say 'Hey could you make one like that but with Apache too?"

Galen: "I'm a Windows guy, so this is domain trust. One or two-way trust between organizations or Chef Servers. Let people play in each other's land."

John K: "Even without the primitives you can machine_image things today because you can say download this org from this remote org. There are Cheffish primitives with chef node name, dump it into your own, base yourself off that. For that one there's another interesting thing about a proxy driver: If these guys have a smaller set of base images, 5 or 6, they could even provide extra benefits that their scale allows that nobody else does. They could keep hot spares or just frozen spares up, and instant machines configured correctly. They protect the keys and prove a service better than what people could do for themselves."

Galen: "Similar to what we're doing on learnchef, with the small pool."

John K: "Exactly! Those are just things around that - feels important to me for the security base. That's an important job."

Q: "We've got lots of pets with not everything being off-the-shelf software. We have silos like middleware guys, mq guys, and then converges take forever. But really all we care about is I need my catalina_base, my java_app thing. It'd be great because bam, there we'd go."

John K: It's interesting, because the way the base_image thing works, we do the recipes and stuff the result in the node. The next guy who does the run on that image trusts that. What they could do instead is say "I built that with 20 minutes of security test recipes, but what I'm gonna stuff in the node is security::trust," and they have a way of delivering just the small stuff. The half of it that is still unsolved is that you need a policy for expiring machines based on images that are dead or that security can't patch. "It's Windows 95!" They have to have an age policy, or a point past which you may not have an AMI behind this, and you might want to start expiring AMIs before this. If you could find out which machines are based on which AMIs, you could go check on them.

Charles: That's totally a thing.

John K: Sweet! So they could find out if anyone has cheated the system and also they could go find old insecure AMIs sorted by date.

Galen time checks.

Don: "We've been thinking about what to do after a machine is spun up from an image. It sounds like what people are saying so far is that your image and your instance would have the same run_list. Can you think of times when you would want to bake in things that aren't converged?"

John K: "Security patching seems to match that."

Don: "What I mean is that the security package would be part of the run_list."

John: "What they could do even now is say "this is the run_list to infect the next machine with." Could be patches, 0days, or whatever. Don't have to do the 20 minutes of tests they did on their image."

Galen: "Works in the central org owns the images & aws stuff model. My only concern is that if I say I want to change my firewall. If you hand off the image and you're not doing the run_list on it anymore, how do you know they don't revert what you said to do?"

John: "You could have a run_list with periodic surprised. I don't know the implementation, but - every 20 days I'll run the whole 20 minutes."

Don: "I would argue for abstracting that out of Chef completely."

Charles: "Put it into specs / serverspec."

John: "However you do it, do it periodically."

Q: "Just not during business hours!"

Galen: "If that happens, then you should fix it. Chaos Monkey runs during business hours, it runs all the time. The whole point is resilience. The 4am break worry is really stressful."

John: "The kind of thing you want to test for, there's a few categories: External stuff and also stuff you can only test from the machine, or that are less severe. A port scan from outside is more likely to cause problems than someone asking Windows for a list of open ports."

Galen: "External testing: Is our stuff actually secure? Inside stuff: Is it doing what we told it to do."

John: "Do we THINK it's secure."

Galen: "Is it conforming to our policy vs. is it actually secure. Those things can break networks, wreak havoc, very scary. But the internal check can be less scary."

(yep. Yep! Yep.)

Don: "There are things Whyrun can't do for you. Can't run the bash exploit task."

John: "It'll help you a lot but there are other tests you need. Like if you don't have a specific policy on every port in the firewall, having a small suite of tools you run every so often on the box, one of which polls open ports - There's - an actual pokey test. Inexpensive you probably want."

Don: "Yep, difference between active & passive testing for security."

John: "Doesn't have to be done in Chef, it's just that now you have a delivery mechanism for this stuff."

Galen loves Chaos Monkey. Boy howdy.

A: "That's a hard sell."

Galen: "Yeah, I'm really lazy, so if I think about like - if I could build my infrastructure to heal itself? Fuck yes."

Galen: "Start with the passive, then do the active."

John Cook: "We've been pushing our security guys to use ohai. Stuff you can go back, search, and get that data out."

Q: "This is the data ohai has and these are the attributes that it has? That's a nightmare."

John Cook: Commiserate. "What do we need to disable to return it back? We whitelisted ohai and it became a lot more readable. There's a bunch of stuff that goes up."

John K: "A JSON viewing tool?"

John Cook: "I loaded it into an IDE and collapsed it down."

(Everyone agrees: this sucks.)
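The ohai whitelisting John Cook describes was later built into the Chef client as a client.rb setting; around the time of this session it required the whitelist-node-attrs cookbook instead. A hedged sketch using the later built-in setting name (which may differ by client version, and whose attribute list here is purely illustrative):

```ruby
# client.rb sketch: keep only selected automatic (ohai) attributes on the
# node object, so searches and node views stay readable.
automatic_attribute_whitelist [
  "fqdn",
  "platform",
  "platform_version",
  "network/interfaces",   # nested attribute paths use slash-separated keys
]
```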

Q: When we have different teams, what's the best practice to have an image to deploy or patch. We have a SQL server done by my team, and OS patching done by an OS team. What's the best practice?

John Cook: "This idea of images that inherit is kind of the best way. Then you can consume images. Come up with a naming convention after that."

Q: Then do you have an OS patch cookbook with 0days & periodics? and a SQL server cookbook that does the same thing for SQL?

John Cook: "We have whole teams that do this - they do this right now. One team locks things down but it is cumbersome. Every machine is running it all the time. In vmware we kickstart the machine, hand it off, and have a whole giant infrastructure of pets, we never touch 'em. Nightmare hand patching. And then in openstack we have all ephemerals, and we just kill 'em off. I really like the idea that as a middleware publisher, I can say "this is what we blessed," everyone else's code drops to something tiny. They give us the WAR and they're done. We've got like 500 devs so where we can centralize and say this is the blessed thing, it'd be really nice."

Q: So image could be on HyperV or ESX farm?

A: Yep!

John K: "That stuff works with Vagrant which supports vmware so you could create & publish images if you wanted. Use the same recipe to point it. If you have multiple clouds, you could say I got this recipe that tells me how to build the base image and give me this name. It'll point them to the AMI. Now what you could do is have a recipe called aws that sets up credentials and environment. Another called VMWare. Another called Openstack. You could run that recipe, build the box with the same recipe shared among those other drivers."

What will we do now? What needs to happen next?

  • Galen is going to talk to people within Chef about security orgs, and an overarching way to talk about that problem set. John K would like to be involved.