Add support for hybrid environments #12

Merged: 3 commits, Apr 5, 2021

Conversation

Member

@jpculp jpculp commented Mar 23, 2021

Issue number:

N/A

Description of changes:

This change allows users running on-premises or in a hybrid environment to register with SSM as a managed instance. SSM's local state directory (/var/lib/amazon/ssm) has moved to a persistent directory in /.bottlerocket/host-containers/current/. Activation parameters are parsed from a base64-encoded JSON block in user-data.

Example JSON:

{
  "ssm":{
    "activation-id":"foo",
    "activation-code":"bar",
    "region":"us-west-2"
  }
}

Also bumped the SSM Agent version to 3.0.882.0.

And bumped container version to v0.5.0.

Testing done:

  • Created an ssm hybrid activation.

  • Launched aws-ecs-1 ami with added [settings.host-containers.control] user-data.

  • Instance connected to ecs cluster.

  • Test task deployed successfully.

  • Checked AWS Systems Manager to see if my activation was used.

  • Fetched instance-id from AWS Systems Manager and verified it began with mi- prefix.

  • Connected to control container via ssm session.

  • Verified that host-containers.control.user-data contained a base64-encoded block.

  • Disabled the control container and enabled a current container.

  • Verified that /.bottlerocket/host-containers/current/ contained an ssm directory.

  • Enabled and launched admin container.

  • Connected to admin container via ssh.

  • Ran sudo sheltie to verify root shell was still available.

  • Checked for failed systemd units.

  • Launched additional instance with no user data and verified everything still works.

  • Launched additional instances with purposefully malformed user data detailed below.

Some possible error messages:

Bad Region: us-weast-2

[   25.190387] host-ctr[447]: 2021-03-31 04:38:44 ERROR [processRegistration @ agent_parser.go.128] Registration failed due to error registering the instance with AWS SSM. RequestError: send request failed
[   25.192567] host-ctr[447]: caused by: Post "https://ssm.us-weast-2.amazonaws.com/": dial tcp: lookup ssm.us-weast-2.amazonaws.com on 172.31.0.2:53: no such host
[   25.212203] host-ctr[447]: Failed to register with AWS Systems Manager (SSM)

Bad Activation ID: character missing when copy/pasting

[   41.240253] host-ctr[445]: 2021-03-31 04:39:04 ERROR [processRegistration @ agent_parser.go.128] Registration failed due to error registering the instance with AWS SSM. InvalidParameter: 1 validation error(s) found.
[   41.244048] host-ctr[445]: - minimum field size of 36, RegisterManagedInstanceInput.ActivationId.
[   41.245795] host-ctr[445]: Failed to register with AWS Systems Manager (SSM)

Empty Element: "activation-code":""

[   13.964770] host-ctr[441]: Failed to fetch value for .["ssm"]["activation-code"] from /.bottlerocket/host-containers/current/user-data

Missing Element: no region specified

[   13.195573] host-ctr[445]: Failed to fetch value for .["ssm"]["region"] from /.bottlerocket/host-containers/current/user-data
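
The empty-element and missing-element cases above both surface as a `Failed to fetch value` line. A minimal sketch of a helper that would behave this way, assuming `jq` is installed. The function body is an illustration rather than the code merged in this PR; only the name `fetch_from_json`, its argument order (key, then file), and the message text come from the discussion:

```bash
#!/usr/bin/env bash

# Hypothetical helper: print the value at jq path $1 in JSON file $2.
# Returns 1 with a console-friendly message if the value is missing,
# null, or an empty string.
fetch_from_json() {
  local key="${1:?key required}"
  local file="${2:?file required}"
  local value

  # --exit-status makes jq return non-zero for null/false results;
  # the -z check catches values that are present but empty ("").
  if ! value="$(jq --exit-status --raw-output "${key}" "${file}" 2>/dev/null)" \
    || [[ -z "${value}" ]]; then
    echo "Failed to fetch value for ${key} from ${file}" >&2
    return 1
  fi

  echo "${value}"
}
```

Returning 1 rather than exiting leaves the decision to the caller: under `set -e` an unguarded call still stops the script, while a call inside `if !` can log extra context first.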

SSM Agent changes since v0.4.2:

3.0.882.0

  • Added jitter to first control channel call
  • Added dedicated folder for plugins
  • Added option to overwrite corrupt shared credentials

3.0.854.0

  • Added $HOME env variable for root user when runAsElevated is true in session
  • Added CREAD flag in serial port control flags on linux
  • Added PlatformName and PlatformVersion as env variables for aws:runShellScript
  • Added support for macOS updater
  • Added v2.2 document support in updater
  • Added defer recover statements
  • Fixed inventory error log when dpkg is not available
  • Fixed ssm-cli logging to stdout
  • Removed consideration of unimportant error codes in service side
  • Updated ec2 credential caching time to ~1 hour
  • Updated service query logic for Windows
  • Updated golang sys package dependency

3.0.755.0

  • Fix fallback logic for MGS endpoint generation
  • Fix regional endpoint generation

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

@jpculp
Member Author

jpculp commented Mar 24, 2021

  • Replaced all instances of /.bottlerocket/host-containers/control with /.bottlerocket/host-containers/current ("current" directory introduced in bottlerocket #1416).
  • Removed failed_ssm_params and replaced error handling with exit 1.
  • Removed explicit main function.
  • Shuffled some constants around for easier reading.
  • Small tweaks in README and comments.

@jpculp
Member Author

jpculp commented Mar 25, 2021

  • Moved some logic back out of the Dockerfile to a function in start_control_ssm.sh.
  • Replaced get_user_data_ssm with a more generic fetch_from_json which takes the user-data file as the 2nd argument.
  • Replaced "current" with environment variable HOST_CONTAINER_NAME, which will get set by host-ctr.
  • Added error handling logic if jq returns an empty value.
  • Small tweaks to enhance readability of conditionals leading to enable_hybrid_ssm.
  • Small tweaks to README.

@jpculp jpculp requested a review from tjkirch March 25, 2021 22:21

if [[ -s "${USER_DATA}" ]] \
&& [[ ! -s "${SSM_AGENT_LOCAL_STATE_DIR}/registration" ]] \
&& jq --exit-status '.ssm' "${USER_DATA}" &>/dev/null ; then
Contributor

@tjkirch tjkirch Mar 25, 2021

This is basically fetch_from_json; we'd probably want to fail here, too, if .ssm was empty inside.

I think it might be easier to use set -e for the whole script. Then you could use fetch_from_json here too. I think then you could remove the set -e in prepare_persistent_state_dir, make fetch_from_json return 1 instead of exit 1, and change the main call to if ! amazon-ssm-agent -bla instead of checking $? afterward, and it'd be fine? If there are any lines we expect could fail, we could use if ! bla; log explanation; exit 1; fi for extra clarity, but I wouldn't expect the simple mkdir/chmod/rmdir/ln to fail.
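
The structure suggested here could be sketched as follows. This is an illustration under assumptions, not the merged script: `try` is a hypothetical helper name, the variable names in the usage comment are made up, and the usage line assumes the agent's documented `-register -code -id -region` flags.

```bash
#!/usr/bin/env bash
# With errexit on, any unguarded command that fails aborts the script,
# so simple steps (mkdir, chmod, ln) need no explicit status checks.
set -e

# Hypothetical helper: run a command; if it fails, log an explanation
# to stderr and return non-zero instead of inspecting $? afterward.
try() {
  local explanation="${1}"
  shift
  if ! "$@"; then
    echo "${explanation}" >&2
    return 1
  fi
}

# Usage (not executed here; variable names are assumptions):
# try "Failed to register with AWS Systems Manager (SSM)" \
#   amazon-ssm-agent -register -code "${activation_code}" \
#   -id "${activation_id}" -region "${region}"
```

Because `set -e` is active, an unguarded `try` that returns 1 still ends the script, so the explanation is the last thing written to the console.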

@jpculp
Member Author

jpculp commented Mar 30, 2021

  • Removed reliance on a HOST_CONTAINER_NAME environment variable and instead hard-coded to /.bottlerocket/host-containers/current which was added in bottlerocket #1416
  • Fixed quoting issues when fetching data from jq
  • Added set -e to the top of the script
  • Replaced $? return code check with an if ! for amazon-ssm-agent -register
  • Replaced single quotes with double quotes around key variable
  • Split prepare_persistent_state_dir function back into two pieces (Dockerfile and script)
  • Bumped SSM version from 3.0.755.0 to 3.0.882.0
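
For readers unfamiliar with the persistent state directory, the runtime half of that preparation might look roughly like the sketch below. The function name, argument order, and the `750` mode are assumptions for illustration; only the two paths in the usage comment come from the PR description and testing notes.

```bash
#!/usr/bin/env bash
set -e

# Hypothetical sketch: make ${state_dir} a symlink to a directory under
# persistent storage so agent registration survives container restarts.
link_state_dir() {
  local persistent_dir="${1:?persistent dir required}"
  local state_dir="${2:?state dir required}"

  mkdir -p "${persistent_dir}"
  chmod 750 "${persistent_dir}"

  # Nothing to do if the link is already in place.
  if [[ -L "${state_dir}" ]]; then
    return 0
  fi

  # rmdir only removes an empty directory, so existing state is never
  # silently discarded; a non-empty directory makes the function fail.
  if [[ -d "${state_dir}" ]]; then
    rmdir "${state_dir}" || return 1
  fi

  mkdir -p "$(dirname "${state_dir}")"
  ln -s "${persistent_dir}" "${state_dir}"
}

# Usage (paths from the PR description; not executed here):
# link_state_dir /.bottlerocket/host-containers/current/ssm /var/lib/amazon/ssm
```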

@jpculp jpculp requested review from tjkirch and webern March 30, 2021 21:59
Contributor

@tjkirch tjkirch left a comment

The new changes look good to me. The main thing I'm hoping to see is more testing of failure cases - #12 (comment). I want to make sure it's going to be possible to troubleshoot issues with activations using console output. (Also a couple nit comments open.)

@jpculp
Member Author

jpculp commented Mar 31, 2021

  • Replaced '."ssm"."activation-code"' format with .["ssm"]["activation-code"] for readability, maintainability, and consistency with the admin container
  • Rather than print the last line of the error file, read the whole file line-by-line.

For some context on the last piece, @tjkirch and I were concerned that if we printed the whole log file it might print errors from previous boots. In my testing, I found this not to be the case as the log file was fresh each time.

@jpculp
Member Author

jpculp commented Mar 31, 2021

Updated the PR description to include the output of several failure cases. (Click each case to reveal EC2 console output)

@jpculp jpculp requested a review from tjkirch March 31, 2021 05:04
@jpculp jpculp requested a review from webern March 31, 2021 05:04
Contributor

@zmrow zmrow left a comment

Looking good! 🧩

@jpculp
Member Author

jpculp commented Mar 31, 2021

Updated the PR description to include the SSM Agent changes since 3.0.732.0 (the version we included in v0.4.2)

Comment on lines +25 to +27
"activation-id": "foo",
"activation-code": "bar",
"region":"us-west-2"
Member

These names aren't consistent with the user data used in the admin container (foo-bar vs foo_bar). Was this intentional? As much as we can, I'd prefer that the "convention" stays consistent for users. This aligns with Bottlerocket's user-data.toml naming (which uses skewers), so one could argue that this one is the "right way" and the admin container is in the wrong... but we already have the admin change out, so I'm more inclined to align with that user data. To me, this user data and the admin's roughly feel like they'll be treated as the same namespace, with similar conventions, by end users.

Member Author

Based on the other settings I see when I run apiclient -u /settings, I would say that the admin container is the weird one and that foo-bar is the "correct" convention, but I do not have strong feelings. I think it might be worth getting some additional opinions. I think the admin container is the way it is because the authorized keys are typically stored in a file named authorized_keys, but that may have been a poor decision on my part. To make this conundrum even more fun, when you generate an activation token with aws ssm create-activation, it returns JSON with a FooBar convention.

Member Author

After some offline discussion, we're going to keep the admin container convention and use foo_bar (snake_case) for the settings inside host-container user data in general. I'll add this change.

Contributor

@jhaynes jhaynes Apr 1, 2021

I'm in favor of sticking with the skewer (foo-bar) implementation here and adding a backlog item to fix the admin container to accept the skewer also.

@jpculp
Member Author

jpculp commented Apr 1, 2021

  • Switched from kebab-case to snake_case for user-data settings
  • Moved constants to the top
  • Set the scope of enable_hybrid_env_ssm vars to local
  • Replaced line-by-line error log print with a single cat
  • Split up the value jq logic

@jpculp jpculp requested review from jahkeup, zmrow and webern April 1, 2021 01:15
Member

@jahkeup jahkeup left a comment

Please set the permission bits on the script before adding - and then.. ship it!

🚢 :shipit: 👍🏽

This change allows users running on-premises or in a hybrid environment
to register with SSM as a managed-instance. SSM's local state directory
(/var/lib/amazon/ssm) has moved to a persistent directory in
/.bottlerocket/host-containers/current/. Activation parameters are
parsed from a base64-encoded JSON block in the user-data setting.

Example JSON:
{
  "ssm":{
    "activation-id":"foo",
    "activation-code":"bar",
    "region":"us-west-2"
  }
}
@jpculp
Member Author

jpculp commented Apr 1, 2021

  • Switched #!/bin/bash to #!/usr/bin/env bash
  • Reverted back to kebab-case as it is our API standard (after further offline discussion)
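
Since `aws ssm create-activation` returns `ActivationId` and `ActivationCode` while the user data settled on kebab-case keys, a small conversion step can avoid copy/paste mistakes. A minimal sketch, assuming `jq` is available; the helper name `activation_to_user_data` and the `iam_role` variable in the usage comment are hypothetical:

```bash
#!/usr/bin/env bash

# Hypothetical helper: read `aws ssm create-activation` JSON on stdin
# and print the kebab-case block expected in the control container's
# user data. The region is not part of that output, so it is passed in.
activation_to_user_data() {
  local region="${1:?region required}"
  jq --compact-output --arg region "${region}" '{
    ssm: {
      "activation-id": .ActivationId,
      "activation-code": .ActivationCode,
      region: $region
    }
  }'
}

# Usage (not executed here; iam_role is an assumed variable):
# aws ssm create-activation --iam-role "${iam_role}" --region us-west-2 \
#   | activation_to_user_data us-west-2
```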

@jpculp jpculp requested review from jahkeup and jhaynes April 1, 2021 21:46
Member

@jahkeup jahkeup left a comment

Have a suggested edit for the README - script looks great!

Comment on lines +32 to +34
Once you've created your JSON, you'll need to base64-encode it and put it in the control host container's `user-data` setting in your [instance user data](https://github.com/bottlerocket-os/bottlerocket#using-user-data).

For example:
Member

I took a stab at rewording this because it took a pass or two to fully parse the sentence. Now that it's split up, I think this conveys the same meaning and reads a bit easier (IMO). What do you think about this instead?

Suggested change
- Once you've created your JSON, you'll need to base64-encode it and put it in the control host container's `user-data` setting in your [instance user data](https://github.com/bottlerocket-os/bottlerocket#using-user-data).
- For example:
+ Once you've created your JSON, you'll need to provide it via the control container's user data.
+ The host-container `user-data` must be base64-encoded and provided - in its encoded form - using your [instance's user data](https://github.com/bottlerocket-os/bottlerocket#using-user-data):

Member Author

I think I prefer the original, but I appreciate that you're looking at the documentation 😃


```
[settings.host-containers.control]
# ex: echo '{"ssm":{"activation-id":"foo","activation-code":"bar","region":"us-west-2"}}' | base64
```

On my system, I need base64 -w0 to avoid newlines in the output.

Suggested change
- # ex: echo '{"ssm":{"activation-id":"foo","activation-code":"bar","region":"us-west-2"}}' | base64
+ # ex: echo '{"ssm":{"activation-id":"foo","activation-code":"bar","region":"us-west-2"}}' | base64 -w0

Member Author

@jpculp jpculp Apr 5, 2021

I think we avoided it since not all systems necessarily have -w. (macOS, for example, does not.)
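
One way to sidestep the `-w` difference is to delete newlines with `tr` after encoding, which should behave the same with GNU and BSD `base64`. A minimal sketch; the helper names `encode_one_line` and `render_control_user_data` are hypothetical and not part of this PR:

```bash
#!/usr/bin/env bash

# Encode stdin as base64 on a single line. Deleting newlines with tr
# avoids depending on the -w flag, which not every base64 accepts.
encode_one_line() {
  base64 | tr -d '\n'
}

# Hypothetical helper: print a user-data TOML fragment for the control
# container from a JSON string.
render_control_user_data() {
  local json="${1:?json required}"
  printf '[settings.host-containers.control]\nuser-data = "%s"\n' \
    "$(printf '%s' "${json}" | encode_one_line)"
}

render_control_user_data \
  '{"ssm":{"activation-id":"foo","activation-code":"bar","region":"us-west-2"}}'
```

Unlike the `echo` example in the README, `printf '%s'` does not append a trailing newline to the JSON before encoding, so the two produce different (but equally valid) encoded strings.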

@jpculp
Member Author

jpculp commented Apr 5, 2021

Ran the tests again both with and without a hybrid activation in user-data using an AMI built off the latest develop branch. Still all green!

Contributor

@zmrow zmrow left a comment

🧇
