CloudFormation signaling #1581

Closed
gabegorelick opened this issue May 18, 2021 · 28 comments
Labels
area/core Issues core to the OS (variant independent) type/enhancement New feature or request

Comments

@gabegorelick

What I'd like:
It would be nice if there was an easy way to call CloudFormation's SignalResource when booting a Bottlerocket instance. This is typically considered a best practice when creating an ASG in CloudFormation so that it can roll back to an earlier LaunchTemplate or LaunchConfig if the instances don't come online.

See, for example, the ECS CloudFormation reference architecture, which uses the cfn-signal CLI: https://github.com/aws-samples/ecs-refarch-cloudformation/blob/a257e226b33bd9d2a721e5afd9d7e8b66dbacfdc/infrastructure/ecs-cluster.yaml#L87
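
For reference, templates like that one typically invoke cfn-signal from the instance user data (inside a Fn::Sub block) once setup succeeds, along the lines of (paths and resource names are illustrative):

/opt/aws/bin/cfn-signal -e $? --stack ${AWS::StackName} --resource ECSAutoScalingGroup --region ${AWS::Region}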

In Bottlerocket's case, a typical boot issue I've encountered is passing malformed user data. In such a case, Bottlerocket's early-boot-config.service will fail. But if you don't signal CloudFormation, CloudFormation will still consider the deploy a success, potentially leaving you with no working instances.

Any alternatives you've considered:

Running cfn-signal in a bootstrap container would probably work. But it's not clear to me that bootstrap containers run late enough in the boot sequence to verify that all services are up.

@samuelkarp samuelkarp added status/needs-triage Pending triage or re-evaluation type/enhancement New feature or request labels May 18, 2021
@jpculp jpculp added this to the oncall milestone Jul 8, 2021
@jpculp jpculp added area/core Issues core to the OS (variant independent) priority/p1 status/research This issue is being researched and removed status/needs-triage Pending triage or re-evaluation labels Jul 8, 2021
@mello7tre
Contributor

The problem is even bigger if you use an ASG CreationPolicy:
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-attribute-creationpolicy.html

With no way to signal CloudFormation (you would normally do this with cfn-signal), you need to either remove the policy or set MinSuccessfulInstancesPercent to zero.
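
For context, a hypothetical ASG resource using both attributes might look like this (values are illustrative):

AutoScalingGroup:
  Type: AWS::AutoScaling::AutoScalingGroup
  CreationPolicy:
    ResourceSignal:
      Count: 2          # signals CloudFormation waits for during creation
      Timeout: PT20M
    AutoScalingCreationPolicy:
      MinSuccessfulInstancesPercent: 100
  UpdatePolicy:
    AutoScalingRollingUpdate:
      MinSuccessfulInstancesPercent: 100
      PauseTime: PT20M
      WaitOnResourceSignals: true
  Properties:
    # MinSize, MaxSize, DesiredCapacity, LaunchTemplate, etc.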

@jhaynes jhaynes added status/notstarted and removed status/research This issue is being researched labels Jul 19, 2021
@jhaynes jhaynes modified the milestones: oncall, backlog Jul 19, 2021
@Vaishvenk Vaishvenk added this to Feature Backlog in Bottlerocket Roadmap Jul 28, 2021
@mello7tre
Contributor

Just one implementation idea:

  • Add new settings:
[settings.aws.cloudformation]
"signal" = true/false
"stack-name" = ""
"logical-resource-id" = ""
  • Create a service, cf-signal.service:
[Unit]
Description=Send signal to CloudFormation Stack
Wants=network-online.target
After=multi-user.target

[Service]
Type=simple
RemainAfterExit=true
EnvironmentFile=/etc/cf-signal.env
ExecStart=/bin/sh -c "STATUS=$(/usr/bin/systemctl --wait is-system-running) /usr/bin/cf-signal"

[Install]
WantedBy=multi-user.target
  • environment file, cf-signal.env, should look like:
SIGNAL={{settings.aws.cloudformation.signal}}
STACK_NAME={{settings.aws.cloudformation.stack-name}}
LOGICAL_RESOURCE_ID={{settings.aws.cloudformation.logical-resource-id}}
  • change release.spec to add installation section for cf-signal.env

Note

  • The systemctl --wait option ensures that execution is delayed until the boot process is complete.
  • cf-signal needs to use the STATUS variable to know whether boot was successful (see man systemctl for details):
    • running = success
    • any other state = failure

@webern
Contributor

webern commented Aug 2, 2021

Thank you @gabegorelick for bringing this use case to our attention and @mello7tre for providing a design! We are taking a look at this (both the use case and proposal).

@mello7tre
Contributor

mello7tre commented Aug 3, 2021

Just one note:
regarding @gabegorelick's specific issue with malformed user data, the proposed solution cannot work there, and there is probably no solution at all. It's a chicken-and-egg problem.

To signal CloudFormation we need to read user-data to know StackName and LogicalResourceId.

We could acquire that information by looking at the instance tags:

  • aws:cloudformation:logical-id
  • aws:cloudformation:stack-name

but to do this, instances need to have the IAM permission
ec2:DescribeTags

and we cannot presume they have it.

Details

A CloudFormation AutoScalingGroup resource (ASG) can use two policies:

  • UpdatePolicy
  • CreationPolicy

The first is used during a rolling update, where ASG instances are replaced with updated ones.
The MinSuccessfulInstancesPercent property specifies the percentage of instances that must signal success for the update to be considered successful.
If an instance does not signal success within the configured time period, it is counted as a failure signal.

The second Policy is used in two different cases:

  • Creation of a new resource during a replacement update.
  • First-time creation of a resource.

The first case uses the MinSuccessfulInstancesPercent property, just like the UpdatePolicy.

The second case uses the Count property to specify the number of success signals that must be received for the resource creation to be considered successful.
But according to the AWS documentation:

If the resource receives a failure signal or doesn't receive the specified number of signals before the timeout period expires, the resource creation fails and CloudFormation rolls the stack back.

Just one failure signal is sufficient for the creation to be considered failed, and the default Count value is 1.

Recap

Rolling/Replacement Update of an AutoScalingGroup using an Update/Creation Policy.

If an instance does not signal success within the timeout, CloudFormation considers the instance a failure (the only problem is that it has to wait longer to find out).

Creation of an AutoScalingGroup using a Creation Policy

  • If the Count property is 0 and we have malformed user data.
  • If the Count property is lower than the ASG DesiredCapacity and a transient problem on a single instance prevents proper creation of the file with the information needed by the signal program, or prevents multi-user.target from being activated.

In both cases we can hit the problem described by @gabegorelick:

CloudFormation will still consider the deploy a success

as we have no way to signal failure.

The only solution is to always set the CreationPolicy Count property equal to the ASG DesiredCapacity.
This way the CreationPolicy uses the same logic as the other policies (assuming MinSuccessfulInstancesPercent equal to 100).

@gabegorelick
Author

To signal CloudFormation we need to read user-data to know StackName and LogicalResourceId

Sure, but in practice what tends to happen is that if you messed up your instance configuration such that it can't call the CFN API (bad IAM permissions, not passing the correct parameters to SignalResource, etc.), you'll just time out and CloudFormation will consider it failed. That seems like acceptable behavior for the malformed user data case.

We could acquire that information by looking at the instance tags

AFAIK, no Amazon Linux instances do this. They all expect you to pass in the stack and resource names. See https://github.com/aws-samples/ecs-refarch-cloudformation/blob/a257e226b33bd9d2a721e5afd9d7e8b66dbacfdc/infrastructure/ecs-cluster.yaml#L87, for example. I would expect Bottlerocket to behave similarly, and not do any fancy introspection to determine this info.

The only solution is to always set the CreationPolicy Count property equal to the ASG DesiredCapacity.

I'm not sure I understand your point, but IIRC Count is the number of signals each instance must send before it's marked as successfully created. If user data is malformed, you'll get 0 signals and timeout. Whether CFN considers the ASG creation to be a failure at that point depends on MinSuccessfulInstancesPercent.

In any event, this seems like a core CFN question and not specific to Bottlerocket.

@mello7tre
Contributor

AFAIK, no Amazon Linux instances do this. They all expect you to pass in the stack and resource names. See https://github.com/aws-samples/ecs-refarch-cloudformation/blob/a257e226b33bd9d2a721e5afd9d7e8b66dbacfdc/infrastructure/ecs-cluster.yaml#L87, for example. I would expect Bottlerocket to behave similarly, and not do any fancy introspection to determine this info.

Totally agree with you; maybe I explained it badly, but my final words regarding the needed permissions:

and we cannot presume they have it.

were just saying: we cannot use this solution because it needs extra permissions.

The only solution is to always set the CreationPolicy Count property equal to the ASG DesiredCapacity.

I'm not sure I understand your point, but IIRC Count is the number of signals each instance must send before it's marked as successfully created. If user data is malformed, you'll get 0 signals and timeout. Whether CFN considers the ASG creation to be a failure at that point depends on MinSuccessfulInstancesPercent.

Reading the AWS documentation, it seems that MinSuccessfulInstancesPercent is used only for an Auto Scaling replacement update when a WillReplace policy is used, and not for the first creation of an ASG.
But it's not clear whether Count is used too (the only way to know is by experiment).
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-attribute-creationpolicy.html
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-attribute-updatepolicy.html

In any event, this seems like a core CFN question and not specific to Bottlerocket.

Partially.
If a solution is adopted, I think it should be clear which failure events it can cover.
That should also help in choosing when and how to trigger the cf-signal service.
I put it in multi-user.target, but it would probably be better to insert it earlier in the boot chain, right after the user data is processed and the environment file is created, so it can cover more failure events.

@samuelkarp samuelkarp modified the milestones: backlog, next Aug 5, 2021
@mello7tre
Contributor

mello7tre commented Aug 8, 2021

I made some tests; my assumptions were wrong, but the AWS documentation is misleading and partially wrong too.

The whole CreationPolicy is used both for new ASG creation and for an ASG update via a replacement update (which, after all, is just the creation of a new ASG followed by deletion of the old one).

But:
Count represents the number of signals that need to be received, whether success or failure (not only success).
Creation does not automatically fail on a single FAILURE signal!
The lack of a success signal within Timeout is considered a FAILURE.

In detail:

When creating an ASG, CloudFormation waits until it receives Count signals (success or failure) or until Timeout expires.
Once that happens, it processes the received signals, taking MinSuccessfulInstancesPercent into account, and decides whether the creation was successful.

Example (Timeout = 20m):

  • Count = 2 and MinSuccessfulInstancesPercent = 100
    • we send 2 SUCCESS signals: no wait, and creation succeeds.
    • we send only 1 SUCCESS signal: we have to wait 20 minutes, and creation fails.
    • we send 1 SUCCESS and 1 FAILURE: no wait, and creation fails.
  • Count = 2 and MinSuccessfulInstancesPercent = 50
    • we send 2 SUCCESS signals: no wait, and creation succeeds.
    • we send only 1 SUCCESS signal: we have to wait 20 minutes, but creation succeeds.
    • we send 1 SUCCESS and 1 FAILURE: no wait, and creation succeeds.
Recap:

The only difference between signaling a FAILURE and not signaling at all is the time you have to wait.

So the Bottlerocket signaling implementation should focus on signaling success when all goes well, so that ASG creation completes.
If we can cover all failure events too, so much the better; but if not (e.g. malformed user data), the only difference is that ASG creation will fail after the Timeout expires.

@samuelkarp
Contributor

Thank you @gabegorelick for opening this issue and @mello7tre for providing a design! We’re really glad to see so much excitement around this enhancement.

@mello7tre’s design looks fairly straightforward to me. I can understand concern around failing quickly rather than waiting for a timeout, especially in the case of configuration applied via bootstrap containers rather than settings. One possible way for a bootstrap container to indicate that it has failed to complete configuration of the host (for example, formatting and mounting block devices) might be to add an additional settings.aws.cloudformation.success or settings.aws.cloudformation.signal-value setting. If the bootstrap container failed, it could flip this setting to false (or failure) to indicate that the FAILURE signal should be sent rather than the SUCCESS signal.
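
To make that concrete, the settings might look something like this (a sketch only; the setting names are not final):

[settings.aws.cloudformation]
signal = true
signal-value = "SUCCESS"
stack-name = "my-stack"
logical-resource-id = "AutoScalingGroup"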

Would either of you be interested in contributing to Bottlerocket and implementing this feature? We’d be happy to assist if you run into any roadblocks with it.

@mello7tre
Contributor

Thanks @samuelkarp for the offer.
But at the moment I have some personal family problems that take up all my spare time.

Before I can start applying my implementation ideas I need to build a vanilla Bottlerocket; this has been on my todo list for some time, but I haven't had the time to do it.

Second, as I said, I think the best approach would be to write a little Rust program for signaling; we only need a few lines of code, but the problem is that the Rust SDK does not currently support getting credentials from the instance role.

I will continue to follow this issue, and if I manage to find some time I will start building Bottlerocket and doing some experimenting.
But at the moment I am not the right person for quick development.

@samuelkarp
Contributor

Thanks for letting us know! We'll update this issue when we're able to start work on it, but in the meantime if anyone is interested in contributing here please let us know.

@mello7tre
Contributor

Just an update:
I had some time to build Bottlerocket and begin experimenting.
Looking at metricdog, I saw that you already execute systemctl commands, so the best approach is to call systemctl --wait is-system-running directly inside the cfsignal program.
I also removed the systemd unit's environment file and instead use a cfsignal.toml configured by reading the user data.

But I am not a Rust programmer; I am a cloud architect and DevOps engineer.
At the moment I have a very basic running program "inspired" by the metricdog code, but when I open a PR a Rust expert should give it a look and make the necessary changes (and have pity on the code I wrote).

I had to use rusoto in place of the official alpha aws-rust-sdk because of a couple of problems.

I am still doing some tests to check when signals are sent and to find out at which point in the boot process we are able to send a FAILURE signal.
I will update you on this.
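
For anyone following along, here is a minimal sketch of the signaling logic, assuming rusoto's CloudFormation client (function and parameter names are illustrative, not the actual cfsignal code):

use rusoto_cloudformation::{CloudFormation, CloudFormationClient, SignalResourceInput};
use rusoto_core::Region;
use std::process::Command;

// Wait for boot to finish, then signal SUCCESS or FAILURE to the stack.
// (Called from an async runtime such as tokio.)
async fn send_signal(
    stack_name: &str,
    logical_resource_id: &str,
    instance_id: &str,
) -> Result<(), Box<dyn std::error::Error>> {
    // `systemctl --wait is-system-running` blocks until the boot process is
    // complete and prints the final state; anything other than "running"
    // counts as a failure.
    let output = Command::new("systemctl")
        .args(["--wait", "is-system-running"])
        .output()?;
    let state = String::from_utf8_lossy(&output.stdout).trim().to_string();
    let status = if state == "running" { "SUCCESS" } else { "FAILURE" };

    let client = CloudFormationClient::new(Region::default());
    client
        .signal_resource(SignalResourceInput {
            stack_name: stack_name.to_string(),
            logical_resource_id: logical_resource_id.to_string(),
            unique_id: instance_id.to_string(),
            status: status.to_string(),
        })
        .await?;
    Ok(())
}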

@mello7tre
Contributor

cfsignal needs its configured TOML file, so it depends on settings-applier.service.
It cannot send a signal for a failure that happens before settings-applier.service and network-online.target have started.

It is able to send a failure signal for any other service, starting from (and including) activate-multi-user.service.

I am ready to open a PR.

@rcoh

rcoh commented Sep 23, 2021

Quick update from the SDK: this will go out in v0.0.19-alpha either late this week or early next.

@jhuntwork

Really nice work on this, good to see some forward momentum! Just curious, what's left before this is usable?

@gabegorelick
Author

It is able to send a failure signal for any other service, starting from (and including) activate-multi-user.service.

Does this include sending a failure signal if we couldn't successfully join a Kubernetes or ECS cluster? That's the main thing I'm looking for in this feature.

@mello7tre
Contributor

It should.
Both the ECS and Kubernetes services are WantedBy multi-user.target and depend on configured.target.
The cfsignal service is WantedBy preconfigured.target and depends on network-online.target and settings-applier.service.
If you look at
https://github.com/bottlerocket-os/bottlerocket/tree/develop/sources/api
you can see that configured.target depends on settings-applier and represents the point at which the system is fully configured.
So in the worst case, cfsignal and configured.target should be started at the same time.
But cfsignal should always be started before the services wanted by multi-user.target.

In the past I did some tests making activate-multi-user.service fail, and cfsignal properly signaled to the ASG that the instance had failed.
So cfsignal signaling should work for every service started by systemd and wanted by multi-user.target.
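
For reference, the ordering described above corresponds roughly to a unit like this (a sketch, not the exact cfsignal.service from the PR):

[Unit]
Description=Send boot success/failure signal to CloudFormation
Wants=network-online.target
After=network-online.target settings-applier.service

[Service]
Type=simple
ExecStart=/usr/bin/cfsignal

[Install]
WantedBy=preconfigured.target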

@gabegorelick
Author

Both the ECS and Kubernetes services are WantedBy multi-user.target and depend on configured.target.

Do those services reliably fail when they can't join a cluster, or do they retry indefinitely?

@mello7tre
Contributor

I don't know about Kubernetes.
For ECS, I did a test using a nonexistent cluster name in the ECS configuration and the ECS service failed at the systemd level, so in that case cfsignal should work.
I don't think there is any other configuration that could make joining a cluster fail, apart from a nonexistent cluster...
(I tried putting nonexistent options in /etc/ecs/ecs.config and the ECS service/agent simply ignores them.)

@kdaula kdaula removed this from the next milestone Feb 4, 2022
@gabegorelick
Author

Another use case: waiting for instances to register with a load balancer and become healthy. Similar to what's mentioned in https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-attribute-creationpolicy.html:

To have instances wait for an Elastic Load Balancing health check before they signal success, add a health-check verification by using the cfn-init helper script. For an example, see the verify_instance_health command in the Auto Scaling rolling updates sample template.

@mello7tre
Contributor

Yes, but usually the instances of an ECS cluster do not register directly with a load balancer; it is the ECS services running on them that do this.

@gabegorelick
Author

Yes, but usually the instances of an ECS cluster do not register directly with a load balancer; it is the ECS services running on them that do this.

True, that is the standard setup. In theory you could use instance targets without ECS managing them, although I don't know if anyone ever does that (I certainly never have).

But for Kubernetes, it's definitely reasonable to have instance targets that directly register with the LB, e.g. to expose a NodePort service.

@mello7tre
Contributor

Maybe in the future, if the PR is ever merged, a setting could be added to manually specify a target-group-arn to query with elbv2 describe-target-health (I currently do this for other EC2 stacks using cfn-init).
(If I have understood your suggestion correctly...)

@gabegorelick
Author

Maybe in the future, if the PR is ever merged, a setting could be added to manually specify a target-group-arn to query with elbv2 describe-target-health (I currently do this for other EC2 stacks using cfn-init)

For now, I've resorted to running a custom host container to accomplish this. So far it seems to be working.

@gabegorelick
Author

For now, I've resorted to running a custom host container to accomplish this. So far it seems to be working.

One small hiccup: I think Bottlerocket restarts enabled host containers indefinitely, which is not what I want.

@etungsten
Contributor

For now, I've resorted to running a custom host container to accomplish this. So far it seems to be working.

One small hiccup: I think Bottlerocket restarts enabled host containers indefinitely, which is not what I want.

One thing I believe you can do is to set settings.host-containers.<your-custom-container>.enabled to false from within your custom host container once it's done with its work.
All custom host containers mount in the Bottlerocket API socket and the apiclient, see:

// Mount in the API socket for the Bottlerocket API server, and the API
// client used to interact with it
{
    Options:     []string{"bind", "rw"},
    Destination: "/run/api.sock",
    Source:      "/run/api.sock",
},
// Mount in the apiclient to make API calls to the Bottlerocket API server
{
    Options:     []string{"bind", "ro"},
    Destination: "/usr/local/bin/apiclient",
    Source:      "/usr/bin/apiclient",
},

So within your custom host container, you can run something like apiclient set settings.host-containers.<custom-host-container>.enabled=false to prevent the host-container from restarting again. Lemme know if that works for you!
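
For example, a minimal host container entrypoint along these lines (stack, resource, and container names are illustrative) could send the signal and then disable itself:

#!/usr/bin/env sh
# INSTANCE_ID is assumed to come from instance metadata; names below are illustrative.
aws cloudformation signal-resource \
  --stack-name my-stack \
  --logical-resource-id AutoScalingGroup \
  --unique-id "$INSTANCE_ID" \
  --status SUCCESS

# Disable this host container so Bottlerocket does not restart it.
apiclient set settings.host-containers.cfn-signal.enabled=false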

@gabegorelick
Author

One thing I believe you can do is to set settings.host-containers.<your-custom-container>.enabled to false from within your custom host container once it's done with its work.

Would that disable the host container for all future instances, or just the enclosing instance? I need future instance rollouts to still run that host container.

@etungsten
Contributor

etungsten commented Feb 9, 2022

One thing I believe you can do is to set settings.host-containers.<your-custom-container>.enabled to false from within your custom host container once it's done with its work.

Would that disable the host container for all future instances, or just the enclosing instance? I need future instance rollouts to still run that host container.

It would just be for that enclosing instance. Each instance has its own set of settings that it configures and uses.

etungsten pushed a commit to etungsten/bottlerocket that referenced this issue Feb 11, 2022
Created a new Rust program, cfsignal, to send a signal to a CloudFormation
stack.
The program is a sort of cfn-signal
(https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-signal.html),
but since cfn-signal needs Python it cannot be used by Bottlerocket.

cfsignal reads its configuration from a cfsignal.toml file configured by
reading the user data, so it depends on settings-applier.service.
It cannot send a signal for a failure that happens before
settings-applier.service and network-online.target are started.

It is able to send a failure signal for any other service, starting from
(and including) activate-multi-user.service.

It uses the systemctl action is-system-running with the --wait option.
This way we can know whether any service is in a failed state once the
systemd boot process has finished.

Requested changes:

* removed author
* signal parameter renamed to should_signal (more specific than should_send)
* added README.md
* removed commented out lines
* use imdsclient in place of ec2_instance_metadata
* refactor service_check.rs and renamed to system_check.rs

use a weak dependency (WantedBy) for cfsignal.service

use tokio LTS, only with needed features

restart command

some code refactor

* use signal_resource directly as a function
* code simplification in system_check.rs
* use standard boilerplate for main function

semaphore file and migration

* Use semaphore file to only run on first boot
* Add migration file for downgrading
* client.signal_resource collapsed
* Fix to packages/os/os.spec: the toml file was not copied (introduced during rebase)

Readme changes
etungsten added a commit that referenced this issue Mar 5, 2022