Delayed notifications running after failed chef run can cause broken configuration #2084

micmac · 2014-09-18T09:56:21Z

If I generate a config file calling the template's :create action delayed after collecting all the needed parts during the chef run, calling the template after a failed chef run can cause some config parts missing, turning a mostly harmless chef breakage into a potential service outage.

I've created a proof-of-concept fix, creating an additional option :only_on_success for notifications that causes them to not run if there was an exception earlier: prezi#1

lamont-granquist · 2014-09-18T15:50:32Z

Gathering information throughout recipes to feed into a single template resource is generally better done via definitions:

https://docs.getchef.com/essentials_cookbook_definitions.html

See the "Many Recipes, One Definition" section. Since you're not throwing a delayed notification there it will not run if the chef-client run fails before it gets to it.

micmac · 2014-09-19T11:06:42Z

Unfortunately - as I understand based on behaviour - the solution with definitions only works because definitions are processed during recipe load/compile time and compile-time exceptions seem to be fatal, causing delayed notifications to be skipped. I feel that if this behavior changes or someone decides to write a resource that tries to call these definitions based on a lazy attribute, things will break again.

What I'm trying to do: we collect all the monitoring items to be registered for that machine and send them to the monitoring service, then at the end, fetch the list of registered services and delete the ones no longer defined in Chef. This means I have to call the cleanup once, at the end of the chef run, if the chef run was successful (otherwise I may end up deleting a lot of important monitoring items).

Did I misunderstand something?

lamont-granquist · 2014-09-19T18:35:07Z

Well the point is that you avoid calling a delayed notification. It moves the collection of monitoring things to compile stage, and once you're done with that you are done collecting things. This uses compile mode more correctly as a "prep" stage. Then you have resources that run normally in converge mode which take the collected stuff and use them to do the thing. If you use any notifications for idempotency then you would use :immediate ones instead of :delayed. Either way you've fully collected all your information by the time that compile mode is done so you'll never push partial data. Also this way you're closer to atomic -- all of your recipes run their compile time code and assemble the information, any failures happen in compile mode and there's no possibility that you've even scheduled a delayed notification yet since no resource have run. Then you assemble your resource collection and the resources associated with using and pushing that data should run 'together' as much as possible and either failures beforehand will cause the whole thing to abort, or failures afterwards don't cause any corruption.

If you are assembling things in converge mode with ruby_blocks and LWRPs modifying the data which later needs to be presented, then that's an indication that code needs to get refactored.

I'm not sure what you mean by someone tries to call these definitions based on a lazy attribute since definitions are compile-time and that means that lazy doesn't make any sense and isn't supported. You can compile your data (probably in the run_state) and then inject that prepared data into a resource and use lazy on the resource so that the data is fully compiled before its handed off to the resource, in fact you should do that if you expect later recipes to be calling your definition, but that won't affect any of this analysis -- since you use lazy to delay consuming the data you are assured that all your compile-time code in the definitions are evaluated before the lazy evaluation, so you will never have partial data.

micmac · 2014-09-22T10:22:49Z

When using notifications I can only use an immediate one if I'm reloading some service. I cannot just rewrite a configfile immediately when the collected content changes.

There are cases when I cannot collect all the information in compile time - think for example of creating a user with a resource then using the user information later.

What you're suggesting boils down to rewriting most of my LWRPs as resources. Before I start doing that, do you know about any guidelines about how these use cases should be implemented in a "proper" way? Like, when should one use an LWRP and when a definition, or what should happen during "prep" stage?

Also, with a definition, I have to explicitly implement only_if/not_if logic.

I could not find a use case for lazy, however I have plenty of resources creating other resources creating even further resources that in turn need some data collected for monitoring. Maybe it's not how I should use chef but again, all these need is a method of doing things at the end of a successful converge.

In the end I still feel that avoiding a delayed notification just because I cannot call it in a failsafe way is a workaround and having the option to run something at the end of a successful chef run is good.

smurawski · 2015-04-29T17:11:15Z

@micmac You are welcome to submit a PR addressing this behavior, but this feels more like a workflow solution than a change in code. This is not something we intend to pick up as a change. Thanks!

lamont-granquist · 2015-04-29T17:11:43Z

So when you're creating a user with a resource, you should really be asserting the state that you desire, and you can therefore assert the state in the other resources. Creating a user, then reading the pid (using a converge-time ruby block) then using a lazy attribute on a template is code smell. In that case you should just select a pid at compile_time somewhere and assert it in both resources. Unless you've got a more concrete example.

Also since you're using a definition to create a template resource you can hang a not_if/only_if off of the template resource that you are constructing. In chef 12 if your definition returns the resource then it will correctly get returned in the dsl and you can use the absence of an argument to the definition to create a no-op that will return your resource so you can easily chain notifications/guards/etc off of the resource.

And again, I think the direction you're going is fighting the tool. If you've got resources creating other resources that is going to end badly. You want to be using definitions to iteratively create resources.

mcquin added the Chef Core label Sep 18, 2014

smurawski closed this as completed Apr 29, 2015

martinb3 mentioned this issue Nov 5, 2015

Add firewall rules to resource in compile phase? sous-chefs/firewall#106

Closed

martinb3 mentioned this issue Feb 24, 2016

Make sure the firewall service can't restart without parsing all firewall rules sous-chefs/firewall#122

Closed

chef locked and limited conversation to collaborators Nov 16, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Delayed notifications running after failed chef run can cause broken configuration #2084

Delayed notifications running after failed chef run can cause broken configuration #2084

micmac commented Sep 18, 2014

lamont-granquist commented Sep 18, 2014

micmac commented Sep 19, 2014

lamont-granquist commented Sep 19, 2014

micmac commented Sep 22, 2014

smurawski commented Apr 29, 2015

lamont-granquist commented Apr 29, 2015

Delayed notifications running after failed chef run can cause broken configuration #2084

Delayed notifications running after failed chef run can cause broken configuration #2084

Comments

micmac commented Sep 18, 2014

lamont-granquist commented Sep 18, 2014

micmac commented Sep 19, 2014

lamont-granquist commented Sep 19, 2014

micmac commented Sep 22, 2014

smurawski commented Apr 29, 2015

lamont-granquist commented Apr 29, 2015