New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Audit mode proposal #69

Merged
merged 3 commits into from Dec 11, 2014

Conversation

Projects
None yet
@mcquin
Copy link

mcquin commented Nov 13, 2014

No description provided.

@sethvargo

This comment has been minimized.

Copy link

sethvargo commented Nov 13, 2014

I'm a little confused as to how this is different from ChefSpec?

@mcquin

This comment has been minimized.

Copy link

mcquin commented Nov 13, 2014

@sethvargo I'm not super familiar with ChefSpec, but from the README I can say that audit mode is an additional phase to Chef Client and can run with or without converging the node. It looks like ChefSpec is intended for local testing using Chef Solo, whereas audit mode is intended to be used in testing/preprod and prod.

@lamont-granquist

This comment has been minimized.

Copy link
Contributor

lamont-granquist commented Nov 13, 2014

The closest project to this out there is minitest-chef-handler -- it runs, "in production" on the node actually being converged, it isn't development testing like chefspec or test-kitchen. Used either to have a set of audit cookbooks to inspect a system (similar to why-run but able to report things like "port 80 is/is-not open" or "host www.google.com correctly resolves" in policy language).

# Audit Mode for Chef Client

Add an auditing feature to core Chef which will allow users to develop a test
drivin infrastructure, as well as evaluate and monitor its state, without the

This comment has been minimized.

@tyler-ball

As a maintainer of a Chef-managed infrastructure,
I want the ability to write custom tests inside my cookbook recipes
so that I can have a test-driven infrastructure.

This comment has been minimized.

@tyler-ball

tyler-ball Nov 14, 2014

The combination of these last 2 lines seem like they duplicate each other - it might just be because they both say 'test'. How do you feel about something like:

As a maintainer of a Chef-managed infrastructure,
I want the ability to write infrastructure audits
so that I can validate state in a soon-to-be-managed infrastructure
and validate resources after they are idempotently converged

Nevermind, don't use what I wrote, that looks like word vomit 😄 But I think we can make this motivation both more concise and better illustrate why we want audits.

* `controls(description_str, &block)`: Used to denote a group of audits, or controls,
to be evaluated on the node.
* `control(description, &block)`: Used to denote a subset of audits.
Only available within the context of `controls`. `control` groups can be nested.

This comment has been minimized.

@tyler-ball

tyler-ball Nov 14, 2014

I think we could call out here that control is an alias for describe or context.

@tyler-ball

This comment has been minimized.

Copy link

tyler-ball commented Nov 14, 2014

I'm a little ways through this, but I wanted to clarify something up front: this RFC is not for the same auditors that we have talked about before, correct? It seems like we're free to use the words spec, test and make comparisons to regular RSpec/Serverspec code, right?

### Chef client runner
An audit phase will be performed after client converges. By default, audit mode
will be enabled. Only audits included in recipes in the expanded run list will
be evaluated on the node.

This comment has been minimized.

@tyler-ball

tyler-ball Nov 14, 2014

As soon as we complete our include_recipe card I think we should update this to explain how we're going to deal with controls blocks that are included multiple times.

This comment has been minimized.

@sersut

sersut Nov 14, 2014

Member

I don't think that we will need to call this out explicitly. We will include a recipe only once with the existing logic in include_recipe.

This comment has been minimized.

@tyler-ball
be evaluated on the node.

Converging Chef and evaluating audits can occur independently of each other
using the `--[no-]audit-mode` CLI option. If neither form of the flag is passed,

This comment has been minimized.

@tyler-ball

tyler-ball Nov 14, 2014

There is a mix of subjects here - sometimes, the client is running the phases, and sometimes the phases are just being ran. There is some literature term for this possessive vs non-possessive thing-a-ma-jig but I don't know what its called.

If neither form of the flag is passed, chef client will run both phases.  If passed
`--no-audit-mode` chef client will skip audits after converging the resources.  If passed
`--audit-mode` chef client will skip converging resources but still perform the audits.

Written this way, it is the chef client acting on the subject the whole time.

This comment has been minimized.

@sersut

sersut Nov 14, 2014

Member

We should also include that this functionality can be controlled via Chef::Config

audits.

Errors in the converge phase do not affect running the audit phase, and vice versa.
These phases are run independently and errors are collected and provided to the

This comment has been minimized.

@tyler-ball

tyler-ball Nov 14, 2014

These phases are run independently; errors are collected and provided to the error handlers once each phase completes.


The following events will be added to `Chef::EventDispatch::Base`

* `converge_failed(exception)`: called if the converge phase fails

This comment has been minimized.

@tyler-ball

tyler-ball Nov 14, 2014

Is it worth mentioning that previously there was only converge_complete called in either a success or failure scenario?

* `converge_failed(exception)`: called if the converge phase fails
* `audit_phase_start(run_status)`: called before audit phase starts
* `audit_phase_complete`: called when the audit phase successfully finishes
* `audit_phase_failed(exception)`: called if there is an uncaught exception

This comment has been minimized.

@tyler-ball

tyler-ball Nov 14, 2014

I'm bad about this personally, but I'm trying to get in the habit of writing error instead of exception since that seems more ruby-ish. @danielsdeleo @lamont-granquist Is that correct?

This comment has been minimized.

@danielsdeleo

danielsdeleo Nov 14, 2014

Member

I use them sort of interchangeably. I think the term "error" is good for an "end-user-y" context, like a CLI, so you print "the thing you tried to do failed with error 'foo'", but in a more developer-y context "exception" makes sense because you're raising and rescuing exception types. But I can't confidently say that I apply this logic very consistently.

This comment has been minimized.

@lamont-granquist

lamont-granquist Nov 14, 2014

Contributor

Yeah, error/warning/fatal generally refer to end-user-ish things. Exception is an object that gets raised. I wouldn't be surprised if I'm inconsistent as well.

during the audit phase.
* `control_group_started(name)`: signifies the start of a `controls` group with
a defined `name`
* `control_example_success(control_group_name, example_data)`: an example in a

This comment has been minimized.

@tyler-ball

tyler-ball Nov 14, 2014

I think we should provide some documentation about what will be contained in example_data for these last 2 event handlers.

Unless an RFC is not the place for this? Maybe just an example of what is in example_data?

This comment has been minimized.

@sersut

sersut Nov 14, 2014

Member

I agree. We should document how the example_data and error looks here.

@sersut

This comment has been minimized.

Copy link
Member

sersut commented Nov 14, 2014

Other than the minor tweaks this looks sweet to me @mcquin. 👍

Our goal has been to simplify the directory structure of cookbooks. Writing
audits in recipes means users don't need to add an additional directory to their
cookbooks, and more tightly couples testing and development which helps users
develop good test driven infrastructure practices.

This comment has been minimized.

@danielsdeleo

danielsdeleo Nov 14, 2014

Member

I'm okay with this as a pragmatic way to get things done in the short term, but I think our long-term goal should be to allow tests/controls to be written in other files, as soon as the server APIs support it. I mean, if writing tests inline was always better, why don't we do that in Chef itself?

@coderanger

This comment has been minimized.

Copy link
Contributor

coderanger commented Nov 14, 2014

I feel like I'm missing some context on this whole idea. Why do we want to put test code in recipes? At a minimum this seems like it should be a thing that isn't in recipe code, but I'm guessing there has been some internal discussion of this feature previously.

@danielsdeleo

This comment has been minimized.

Copy link
Member

danielsdeleo commented Nov 14, 2014

@coderanger my interpretation: right now the "segments" stuff in the Server APIs kinda screws us from including files in different directories, and we want to ship this without having fixing that first.

@coderanger

This comment has been minimized.

Copy link
Contributor

coderanger commented Nov 14, 2014

Given the impact adding stuff to the recipe DSL will have on compat (read: we'll likely never be able to change this) I don't think I like that as enough reason to not do the work to create a new segment.

@danielsdeleo

This comment has been minimized.

Copy link
Member

danielsdeleo commented Nov 14, 2014

My opinion is that we should support both, as both styles are common enough in other software development contexts.

@coderanger

This comment has been minimized.

Copy link
Contributor

coderanger commented Nov 14, 2014

In the abstract I'm not against having test-ish code in recipes, but I think we would need to figure out a better way to adapt RSpec to work more like the existing resource format. Having two super-different DSLs in one file seems like a fast way to confuse people, especially if they don't already know RSpec (most ops folks).

@sethvargo

This comment has been minimized.

Copy link

sethvargo commented Nov 14, 2014

both styles are common enough in other software development contexts.

Can you provide an example? I'm having a hard time finding another paradigm where the tests are inside the code instead of a separate file...

@danielsdeleo

This comment has been minimized.

Copy link
Member

danielsdeleo commented Nov 14, 2014

http://doc.rust-lang.org/guide.html#writing-tests is one, I've also seen it in ruby where you wrap the test suite in something like if $0 == __FILE__ as an introduction to unit testing (so that the directory structure of the test framework can be taught later).

@danielsdeleo

This comment has been minimized.

Copy link
Member

danielsdeleo commented Nov 14, 2014

@coderanger

This comment has been minimized.

Copy link
Contributor

coderanger commented Nov 14, 2014

I don't think you are correct that Rust does this, from my reading of the linked doc it seems to put them under tests/. The inline format does happen in Python, but it is generally discouraged and used only for "toy"-ish programs. Similarly I've never seen a gem that used that trick for its real tests.

@sethvargo

This comment has been minimized.

Copy link

sethvargo commented Nov 14, 2014

I agree that it could be useful for "learning", but after hearing more, my vote is 👎

@coderanger

This comment has been minimized.

Copy link
Contributor

coderanger commented Nov 14, 2014

As an example, I would want a minimum to be enclosing the test code in a resource to itself so it is clearer when it will execute and that the DSL is only available inside that block. I would also want to see that implemented outside of core before we talk about adding it since that part wouldn't need any core mods that I can think of. Adding the audit events system seems like the most interesting part of this since it will be useful in a broad range of final forms and must be in core to make any sense.

@ranjib

This comment has been minimized.

Copy link
Member

ranjib commented Nov 14, 2014

@mcquin @tyler-ball can you provide the recipe example as well?

Can you elaborate on why modeling audits on top of a testing framework? How this will interact with lwrp's or other dynamically defined resources , in why_run mode, with nesting etc.
It will be nice to test this as a module or add on gem(similar to metal/zero) first, as noah mentioned,

Currently we have several other situation where a multi-phase chef run might be useful, and if we at all take this route, it will better have the multi phase/stage support to core chef before implementing features using it, and then provide a middle ware & DI mechanism on top of it, and then building audit or other functionality, but also allow multiple implementations to pick and choose from.

@danielsdeleo

This comment has been minimized.

Copy link
Member

danielsdeleo commented Nov 14, 2014

I think adding more phases to chef runs is an interesting idea, but blocking this on it is scope creep IMO.

Also, the controls/tests aren't a resource, so making them look like one isn't going to make anything less confusing. We already have compile/converge, where resources represent actions that get added to a queue; this is the same concept, but the queue is different.

@sersut

This comment has been minimized.

Copy link
Member

sersut commented Nov 14, 2014

I think there are a few things going on in parallel in our discussion:

  • I think a built-in way of running custom written rules is useful for smoke testing your deployments and opens the door for some auditing scenarios. Having arbitrary stages is definitely a cool idea but has more ambitious goals than the proposal here and this proposal doesn't in any way block to have more stages in the future.
  • Distributing these custom written rules via recipes enables us to leverage many things like having access to run information, versioning, being run on demand via run_list etc. Having a separate segment for them is definitely an option but I think having segments impose unnecessary complexity for the cookbook structure and we should have less not more.
  • why-run mode is a great point @ranjib. I think we shouldn't run these rules in why-run mode. I don't see how dynamically defined resources or LWRPs are relevant. controls is a keyword for recipe DSL. The idea is to collect all the controls during compile phase and run them after the converge phase. We don't need the ability to write rules from within resources 😄.
  • We should note that the tests mentioned here are not tests for your resources or cookbooks. These tests should exist to test the infrastructure. IMO if I write a package "mysql" resource I wouldn't need to write an audit rule to check if mysql is installed. But I should definitely write an audit rule that ensures I can connect to your DB.
@sersut

This comment has been minimized.

Copy link
Member

sersut commented Nov 20, 2014

I won't be able to write an ohai plugin that can check if the database is connectable since ohai runs before converge and since we don't have a built-in way of distributing ohai plugin (we really should but that's a different topic).

I agree with @tyler-ball that even if the template validation works, it doesn't mean that my apache / nginx is running because maybe it doesn't have access to log location, or the service template is not correct, etc...

I agree with you @coderanger. One shouldn't need to write a rule for all Chef resources they add. But all of our resources doesn't cover all the things one might want to check / set in an idempotent way.

@lamont-granquist

This comment has been minimized.

Copy link
Contributor

lamont-granquist commented Nov 20, 2014

@coderanger this is more useful to assert things like "nothing is listening on port 111". And that doesn't map well to a service resource because it may be named "portmap" or "rpcbind" based on which distro I'm using and when they decided to rename that service, and it assumes I get the service provider correct and there's other things that can go wrong with that (it could be running, but the init script could be removed which would defeat any service resource :stop action). As a SOX/PCI-DSS auditor or sufferer what I want to be able to say is "nothing is listening on port 111" and "nothing is listening on port 23". Serverspec does that well. The Chef DSL does not do that well, and neither does minitest-chef-handler (because that rides on the back of Chef's resources and doesn't have its own matchers like serverspec does).

@lamont-granquist

This comment has been minimized.

Copy link
Contributor

lamont-granquist commented Nov 20, 2014

in chef, the cross platform code to shut down port 111:

portmap_service =
  case node['platform_family']
  when "debian"
    case node['platform']
    when "ubuntu"
      node['platform_version'].to_f >= 13.10 ? "rpcbind" : "portmap"
    when "debian", "raspbian"
      node['platform_version'].to_f >= 7.0 ? "rpcbind" : "portmap"
    else
      "rpcbind"
    end
  when "rhel"
    node['platform_version'].to_f >= 6.0 ? "rpcbind" : "portmap"
  else
    "rpcbind"
  end

service portmap_service do
  action :stop
end

in serverspec the cross-platform code to assert that the previous code did the right thing:

describe port(111) do
  expect(it).to_not be_listening
end
@coderanger

This comment has been minimized.

Copy link
Contributor

coderanger commented Nov 20, 2014

I understand the point, and agree it is a useful thing to have. Trying to make sure this is done in a way that allows enough flexibility in development to find the best syntax and whatnot since RSpec is a complicated beast. Serverspec is definitely a compelling argument to stick to that DSL, but I would much rather see the DSL done in such a way that you can depend on audit-rspec in your metadata and then it JFWs (famous last words).

end
end
end
end

This comment has been minimized.

@randomcamel

randomcamel Nov 20, 2014

Chef's whole premise is that Chef will converge on a system state; these tests look like precisely the kind of thing Chef is supposed to save users from writing. If the root user wasn't created, or MySQL didn't start, why didn't the converge fail?

Are there examples that would better illustrate audit mode's purpose?

This comment has been minimized.

@mcquin

mcquin Nov 20, 2014

Lamont has a good comment here that I will incorporate into the next revision.

This comment has been minimized.

@coderanger

coderanger Nov 20, 2014

Contributor

@randomcamel Think of it as end-to-end testing. Chef installed a package, wrote an auth config, and started the service. The test is "can log in and run a query". This catches stuff like a new version of the software changing things or whatnot in an unexpected fashion.

This comment has been minimized.

@tyler-ball

tyler-ball Nov 21, 2014

@randomcamel The other use case is specifically when chef ISN'T configured yet. I agree with you - once you have a mysql::install recipe converging successfully you should no longer need an audit that checks package("mysql").

But the code snippet above is useful for an organization trying to convert to Chef for the first time - they can ensure that all the systems are in a known state before adding the mysql recipe. What if one of the nodes (this is a bad example) has oracle installed instead of mysql? Audits would highlight that.

This comment has been minimized.

@coderanger

coderanger Nov 21, 2014

Contributor

@tyler-ball I don't think I would call that out as the right use case, that is far far better served by plain serverspec.

@adamhjk

This comment has been minimized.

Copy link
Contributor

adamhjk commented Nov 25, 2014

A few things that this RFC makes me think:

  1. We need a way to add segments elegantly, because I (like @sethvargo and others) really don't like the inline style.

  2. We need a way to mark an RFC, and its implementation, as stable/unstable/experimental. For example, I think we should move ahead with this RFC, but I think we need to mark it as experimental - because we need more experience to really understand what a "final" style really is.

@coderanger and others, what do you think about adding a marker like that to the RFC metadata, and making it clear a feature is experimental (its syntax subject to change, and maybe even deep implementation)?

@coderanger

This comment has been minimized.

Copy link
Contributor

coderanger commented Nov 25, 2014

@adamhjk I think that is a very solid indication something shouldn't be an RFC yet. Go implement something externally from Chef, learn the lessons of doing it wrong a few times, and then write an RFC for a design you think is workable. We don't need an RFC to agree something is a good idea or worth of experimentation, we need them to have a way to discuss detailed plans.

@adamhjk

This comment has been minimized.

Copy link
Contributor

adamhjk commented Nov 25, 2014

I'm not sure I agree. One of the things I would like to see us do is move faster in the core. I'll noodle more.

@coderanger

This comment has been minimized.

Copy link
Contributor

coderanger commented Nov 25, 2014

I guess I agree for cases where not being in core would substantially complicate things, like I've been saying about adding an audit events framework as detailed here. I also think it is telling that that piece seems much less controversial :) Bundling a DSL extension as a cookbook is, comparatively speaking, very easy to do and would detract very little IMO.

In the more general case, I worry that moving fast in core can easily put a lot of momentum behind design choices that are being made "because it seemed like a good idea". For example, rspec vs. minitest in this case. Adding something experimental to core could work, with sufficient documentation and output to indicate the immaturity, but then the experimental feature is tied to Chef's release cycle. Imagine if something like chef-metal had to wait for Chef releases to get feedback on new design changes.

@adamhjk

This comment has been minimized.

Copy link
Contributor

adamhjk commented Nov 25, 2014

@coderanger yeah, I also think we need to to release every piece of software we have at least daily, and ship from those to the "stable" release channel.

@coderanger

This comment has been minimized.

Copy link
Contributor

coderanger commented Nov 25, 2014

@adamhjk I think people would be much more willing to run an experimental plugin for Chef than to deploy a big ball labeled "unstable" or "daily". The former has much more scope to what is unstable and what is not. Chef-metal is a great example of this at work.

@coderanger

This comment has been minimized.

Copy link
Contributor

coderanger commented Nov 25, 2014

(though I'm not against daily releases so people can try out bug fixes sooner)

@sersut

This comment has been minimized.

Copy link
Member

sersut commented Nov 25, 2014

IMO the fact that something is implemented inside the core or as a plugin doesn't matter too much as long as it is shipped daily. I think the choice should be on a case by case basis. For example Chef Metal is easy to implement outside Chef because it doesn't change any of the core logic but Policyfile related features in Chef Client is tough to implement outside because it changes the very core of it.

As long as we keep the daily releases healthy and have a way to turn on - off the feature completely I don't see a reason why we should take the extra complexity to implement it outside core.

@tyler-ball

This comment has been minimized.

Copy link

tyler-ball commented Nov 25, 2014

@coderanger I'm not sure that consumers would be more likely to run experimental features as a plugin vs a feature-built omnibus install or gem install. We have a system for installing knife plugins, but not core chef feature plugins.

Lets say I am an infrastructure maintainer and I have a way to install Chef. If they release a new plugin, I need to maintain one workflow for installing Chef and another workflow for installing/uninstalling/updating plugins in the chef install.

If they just release an omnibus named chef-12-experimental-audit then I can use my normal deploy process to deploy that omnibus rather than my regular chef-12 deploy.

I've personally found plugin maintenance to be a lot of work (curse you, OSGI). And because Chef typically runs on a scheduled interval it is easy to just re-install the entire package without disrupting normal converge workflow. I would love for someone to show me how to make plugin management easy :) Ironically, Chef is the best tool I have found for doing this.

I think the separation of these experimental features seems more a developer concern. Personally, Git branches work for me.

@tyler-ball

This comment has been minimized.

Copy link

tyler-ball commented Nov 25, 2014

I also agree - daily releases FTW

@coderanger

This comment has been minimized.

Copy link
Contributor

coderanger commented Nov 25, 2014

@tyler-ball In this context, plugins would just be cookbooks that register extensions of some kind. In this case it would just be adding new DSL methods. No process is needed per se, just the existing cookbook dependencies system and the tooling around that.

@coderanger

This comment has been minimized.

Copy link
Contributor

coderanger commented on new/audit-mode.md in cb1125b Nov 25, 2014

Is there a reason the copyright section was removed?

This comment has been minimized.

Copy link

mcquin replied Nov 25, 2014

@adamhjk

This comment has been minimized.

Copy link
Contributor

adamhjk commented Nov 25, 2014

I think we can boil some of this down to two remaining concerns:

  1. We want to ensure that, to some extent, the specifics of the DSL and location of controls are pluggable. We all recognize this means there will be changes to chef core. Please make sure that either the first implementaiton, or shortly thereafter, lets us engage in plugabillity for syntax.

  2. We need to be specific about marking the feature as actively unstable within the current release cycle. People who use it should know that it's under active development, at the very least when they hit the documentation. I want us to be able to break semver for this feature.

More things outstanding? @coderanger, @jonlives?

@adamhjk

This comment has been minimized.

Copy link
Contributor

adamhjk commented Nov 26, 2014

Exactly, @sersut. We need to move to making it safe to innovate in core, and let people opt-in or out to those experiments.

@coderanger

This comment has been minimized.

Copy link
Contributor

coderanger commented Nov 26, 2014

I think that just about covers it. I'm cool with core moving fast and breaking things as long as non-core can come along for the ride in some way. If the internal APIs are based on some kind of registration system (a la dialects, which could provide this kind of plugability) then we can have code in core that happens to register itself as an audit provider but the same code can be called just as well from a gem or cookbook. The existing dialects code also has a concept of "quality" levels so you can also flat out override built-in providers if you don't like them or want to disable them. As long as those capabilities exist, I'm happy :)

@danielsdeleo

This comment has been minimized.

Copy link
Member

danielsdeleo commented Nov 26, 2014

Based on my experiences working on similar extension points, and a desire to eventually break Chef (the git repo) into smaller components, I think a totally sane way of doing this is to develop it in core, with an unstable marker of some sort (e.g., policyfile mode currently emits a warning saying the feature is experimental), and then spin some parts of it out to a gem that chef would depend on. Developing it in core lets you adjust the interfaces between classes and such without having to modify two projects until you reach a place where the interfaces are sane and stable, which is a good indication that it's probably usable by other people/projects.

@zuazo

This comment has been minimized.

Copy link

zuazo commented Nov 26, 2014

If this can be implemented independently from Chef internals, I think an external gem makes more sense. Bearing in mind that there are independent projects that do something similar, it may be possible, but cannot tell. Either way, if chef-client needs to expose a new API (or experimental code) for this to work properly, I totally agree with @danielsdeleo.

@sersut

This comment has been minimized.

Copy link
Member

sersut commented Dec 10, 2014

Here is the plan of record I'd like to propose to move this RC forward:

  1. Merge audit-mode branch into master by disabling audit mode by default and adding a warn message saying that "This feature is under experimental and under heavy development. Expect APIs for it to change and use it at your own risk."
  2. Once we have an idea of how the feature looks, update the RFC with some requirements to choose a different audit dsl / language.
  3. Merge the RFC.
  4. Implement the pluggability in master.

We can chat more about this tomorrow in our IRC meeting and get to a consensus. Any thoughts?

/cc: @adamhjk

@adamhjk

This comment has been minimized.

Copy link
Contributor

adamhjk commented Dec 11, 2014

Per the rfc meeting, we are going to merge this RFC, and put its state to "accepted" but not "final". We need to add a section letting potential readers know it is in flux, but it is our intent to accept this feature. Paging @jonlives! 👍

adamleff added a commit that referenced this pull request Dec 11, 2014

@adamleff adamleff merged commit bfec256 into master Dec 11, 2014

@adamleff adamleff deleted the mcquin/audit-mode branch Dec 11, 2014

@tyler-ball tyler-ball referenced this pull request Dec 18, 2014

Merged

Audit mode #2674

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment