pubsys: added SSM parameter validation #2969

Merged
merged 1 commit into bottlerocket-os:develop from the canary-ssm-validation branch
Apr 6, 2023

mjsterckx
Contributor

@mjsterckx mjsterckx commented Mar 30, 2023

Inspired by #2617
Related to #2621

Description of changes:

Added a validate_ssm module to the pubsys crate. This module adds a validate-ssm subcommand with the following signature:

pubsys-validate-ssm 0.1.0
Validates SSM parameters and AMIs

USAGE:
    pubsys --infra-config-path <infra-config-path> validate-ssm [OPTIONS] --validation-config-path <validation-config-path>

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
        --validation-config-path <validation-config-path>    File holding the validation configuration
        --write-results-path <write-results-path>            
        --write-results-filter <write-results-filter>...     
        --log-level <log-level>
            How much detail to log; from least to most: ERROR, WARN, INFO, DEBUG, TRACE [default: INFO]
  • validation-config-path is the path to the file containing the validation configuration. This file should look like this:
{
    "validation_regions": [ "us-west-2", "us-east-1" ],
    "expected_parameter_lists": [ "./canary/ssm/ami_lists/1.11.0.json", "./canary/ssm/ami_lists/1.11.1.json", "./canary/ssm/ami_lists/1.12.0.json", "./canary/ssm/ami_lists/latest.json" ]
}

Each expected_parameter_list should have the following structure:

{
  "us-west-2": {
    "ami-12345678": {
      "/aws/service/bottlerocket/aws-ecs-1-nvidia/arm64/1.12.0-6ef1139f/image_id": "ami-12345678",
      "/aws/service/bottlerocket/aws-ecs-1-nvidia/arm64/1.12.0-6ef1139f/image_version": "1.12.0-abcdefgh",
      "/aws/service/bottlerocket/aws-ecs-1-nvidia/arm64/1.12.0/image_id": "ami-12345678",
      "/aws/service/bottlerocket/aws-ecs-1-nvidia/arm64/1.12.0/image_version": "1.12.0-abcdefgh"
    },
    "ami-87654321": {
      "/aws/service/bottlerocket/aws-ecs-1-nvidia/x86_64/1.12.0-6ef1139f/image_id": "ami-87654321",
      "/aws/service/bottlerocket/aws-ecs-1-nvidia/x86_64/1.12.0-6ef1139f/image_version": "1.12.0-hgfedcba",
      "/aws/service/bottlerocket/aws-ecs-1-nvidia/x86_64/1.12.0/image_id": "ami-87654321",
      "/aws/service/bottlerocket/aws-ecs-1-nvidia/x86_64/1.12.0/image_version": "1.12.0-hgfedcba"
    },
    ...
  },
  "us-east-1": {
    ...
  }
}
  • write-results-path is the path to the file where the validation results will be written. The file will look like this:
[
  {
    "name": "/aws/service/bottlerocket/aws-k8s-1.22-nvidia/arm64/latest/image_id",
    "expected_value": "ami-12345678",
    "actual_value": "ami-23456789",
    "region": "eu-west-2",
    "ami_id": "ami-12345678",
    "status": "Incorrect"
  },
  {
    "name": "/aws/service/bottlerocket/aws-k8s-1.25-nvidia/arm64/latest/image_id",
    "expected_value": "ami-34567890",
    "actual_value": "ami-45678901",
    "region": "eu-central-1",
    "ami_id": "ami-34567890",
    "status": "Incorrect"
  },
  ...
]
  • write-results-filter is a list of statuses that limits which validation results are written to the file above. For example, if the list contains Correct and Incorrect, only validation results with those statuses are written to the file; Missing and Unexpected results are left out.
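
The filtering behaviour described for write-results-filter can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `Status`, `ValidationResult` (reduced here to three fields), and `filter_results` are hypothetical names, and treating an empty filter as "write everything" is an assumption.

```rust
use std::collections::HashSet;

/// Hypothetical status type mirroring the four statuses named in the description.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Status {
    Correct,
    Incorrect,
    Missing,
    Unexpected,
}

/// Hypothetical, reduced result record (the real results file has more fields).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ValidationResult {
    pub name: String,
    pub region: String,
    pub status: Status,
}

/// Keep only the results whose status appears in `filter`.
/// Assumption: an empty filter means "no filtering".
pub fn filter_results<'a>(
    results: &'a [ValidationResult],
    filter: &[Status],
) -> Vec<&'a ValidationResult> {
    if filter.is_empty() {
        return results.iter().collect();
    }
    let wanted: HashSet<Status> = filter.iter().copied().collect();
    results
        .iter()
        .filter(|result| wanted.contains(&result.status))
        .collect()
}
```

Calling `filter_results(&results, &[Status::Correct, Status::Incorrect])` would return references to only the Correct and Incorrect entries, leaving the input untouched.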

The command outputs a tabled summary of the validation results. This table will look like this:

+----------------+---------+-----------+---------+------------+------------+
| String         | correct | incorrect | missing | unexpected | accessible |
+----------------+---------+-----------+---------+------------+------------+
| ap-southeast-2 | 2212    | 40        | 0       | 200        | true       |
+----------------+---------+-----------+---------+------------+------------+
| eu-west-1      | 2212    | 40        | 0       | 200        | true       |
+----------------+---------+-----------+---------+------------+------------+
| eu-north-1     | 2212    | 40        | 0       | 200        | true       |
+----------------+---------+-----------+---------+------------+------------+
| us-east-1      | 2232    | 40        | 0       | 200        | true       |
+----------------+---------+-----------+---------+------------+------------+
| us-west-2      | 2232    | 40        | 0       | 200        | true       |
+----------------+---------+-----------+---------+------------+------------+
| ca-central-1   | 2212    | 40        | 0       | 200        | true       |
+----------------+---------+-----------+---------+------------+------------+
| us-east-2      | 2212    | 40        | 0       | 200        | true       |
+----------------+---------+-----------+---------+------------+------------+
| eu-west-3      | 2212    | 40        | 0       | 200        | true       |
+----------------+---------+-----------+---------+------------+------------+
| ap-south-1     | 2222    | 40        | 0       | 200        | true       |
+----------------+---------+-----------+---------+------------+------------+
| ap-northeast-3 | 1820    | 40        | 0       | 200        | true       |
+----------------+---------+-----------+---------+------------+------------+
| ap-southeast-1 | 2212    | 40        | 0       | 200        | true       |
+----------------+---------+-----------+---------+------------+------------+
| eu-west-2      | 2212    | 40        | 0       | 200        | true       |
+----------------+---------+-----------+---------+------------+------------+
| eu-central-1   | 2232    | 40        | 0       | 200        | true       |
+----------------+---------+-----------+---------+------------+------------+
| ap-northeast-2 | 2212    | 40        | 0       | 200        | true       |
+----------------+---------+-----------+---------+------------+------------+
| sa-east-1      | 2212    | 40        | 0       | 200        | true       |
+----------------+---------+-----------+---------+------------+------------+
| us-west-1      | 2212    | 40        | 0       | 200        | true       |
+----------------+---------+-----------+---------+------------+------------+
| ap-northeast-1 | 2232    | 40        | 0       | 200        | true       |
+----------------+---------+-----------+---------+------------+------------+

The meaning of the different columns is this:

  • correct: the expected value of the parameter is equal to the retrieved value
  • incorrect: the expected value of the parameter is different from the retrieved value
  • missing: the parameter was expected in that region but not retrieved
  • unexpected: the retrieved parameter was not expected in that region
  • accessible: SSM parameters were successfully retrieved from that region. If an invalid region was given, this would say false and all other columns in that row would show -1
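
The four status columns above amount to a comparison between the expected and the retrieved parameters for a region. A minimal sketch of that comparison, assuming both sides are plain name-to-value maps (the `Status`, `classify`, and `summarize` names are hypothetical and not taken from the PR):

```rust
use std::collections::HashMap;

/// Hypothetical status type matching the column names in the summary table.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Status {
    Correct,
    Incorrect,
    Missing,
    Unexpected,
}

/// Compare expected and retrieved parameters (name -> value) for one region.
pub fn classify(
    expected: &HashMap<String, String>,
    actual: &HashMap<String, String>,
) -> HashMap<String, Status> {
    let mut statuses = HashMap::new();
    // Every expected parameter is either correct, incorrect, or missing.
    for (name, want) in expected {
        let status = match actual.get(name) {
            Some(got) if got == want => Status::Correct,
            Some(_) => Status::Incorrect,
            None => Status::Missing,
        };
        statuses.insert(name.clone(), status);
    }
    // Anything retrieved that was never expected is unexpected.
    for name in actual.keys() {
        if !expected.contains_key(name) {
            statuses.insert(name.clone(), Status::Unexpected);
        }
    }
    statuses
}

/// Count parameters per status, i.e. the numbers in one row of the table.
pub fn summarize(statuses: &HashMap<String, Status>) -> HashMap<Status, usize> {
    let mut counts = HashMap::new();
    for status in statuses.values() {
        *counts.entry(*status).or_insert(0) += 1;
    }
    counts
}
```

The accessible column is not modelled here; it would depend on whether the retrieval call for the region succeeded at all.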

Testing done:

  • Unit tests
  • Manually retrieved public SSM parameters and used them as input for the subcommand. Before the 1.13.0 release, all parameters showed as correct. After the 1.13.0 and 1.13.1 releases, each monitored region has 40 incorrect parameters (the latest parameters, because I didn't update my local copies) and 200 unexpected parameters (the 1.13.0 and 1.13.1 parameters).

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

@mjsterckx
Contributor Author

^ clippy fixes

@mjsterckx mjsterckx requested a review from cbgbt March 30, 2023 19:07
@mjsterckx mjsterckx marked this pull request as ready for review March 30, 2023 21:13
tools/pubsys/src/aws/validate_ssm/mod.rs (outdated)
#[structopt(long, parse(from_os_str))]
validation_config_path: PathBuf,

// Optional path where the validation results should be written
Member

Actually important so that structopt will add this information to the help messages.

Suggested change
// Optional path where the validation results should be written
/// Optional path where the validation results should be written

tools/pubsys/src/aws/validate_ssm/mod.rs (outdated)
validation_regions: Vec<String>,
}

/// Structure of an SSM parameter value
Member

When we have to do something strange, it's a good idea to explain in detail.

Suggested change
/// Structure of an SSM parameter value
/// A structure that allows us to store a parameter value along with the AMI ID it refers to. In
/// some cases, the AMI ID *is* the parameter value and both fields will hold the AMI ID. In other
/// cases the parameter value is not the AMI ID, but we need to remember which AMI ID it refers to.
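
To make the suggested doc comment concrete, here is a rough sketch of how an expected parameter list (region, then AMI ID, then parameter name and value) could be flattened into a per-region map while remembering the AMI ID. The `SsmValue` fields, the `ExpectedList` alias, and `flatten` are illustrative assumptions; the PR's actual types (for example its `SsmKey`) are not reproduced here.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the value struct discussed above: the parameter
/// value plus the AMI ID it refers to.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SsmValue {
    pub value: String,
    pub ami_id: String,
}

/// Shape of an expected parameter list: region -> AMI ID -> parameter name -> value.
pub type ExpectedList = HashMap<String, HashMap<String, HashMap<String, String>>>;

/// Flatten to region -> parameter name -> SsmValue, keeping only `regions`.
pub fn flatten(
    list: &ExpectedList,
    regions: &[String],
) -> HashMap<String, HashMap<String, SsmValue>> {
    let mut out: HashMap<String, HashMap<String, SsmValue>> = HashMap::new();
    for (region, amis) in list {
        if !regions.contains(region) {
            continue; // region is not being validated
        }
        let params = out.entry(region.clone()).or_default();
        for (ami_id, parameters) in amis {
            for (name, value) in parameters {
                // For image_id parameters, value and ami_id are the same string;
                // for image_version they differ, so the AMI ID must be carried along.
                params.insert(
                    name.clone(),
                    SsmValue {
                        value: value.clone(),
                        ami_id: ami_id.clone(),
                    },
                );
            }
        }
    }
    out
}
```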

/// Structure of the validation configuration file
#[derive(Debug, Deserialize)]
pub(crate) struct ValidationConfig {
// Vec of paths to JSON files containing expected parameters
Contributor

It's not as important as the structopt comments @webern pointed out, but it's good form to use docstring comments on attributes as well, since they would be added to any generated rustdocs.

Comment on lines 181 to 187
for parameter in page
.context(error::GetParametersByPathSnafu {
path: ssm_prefix,
region: region.to_string(),
})?
.parameters()
.unwrap_or_default()
Member

NIT: I think this can be easier to read.

Suggested change
for parameter in page
.context(error::GetParametersByPathSnafu {
path: ssm_prefix,
region: region.to_string(),
})?
.parameters()
.unwrap_or_default()
let retrieved_parameters = page.context(error::GetParametersByPathSnafu {
path: ssm_prefix,
region: region.to_string(),
})?
.parameters()
.unwrap_or_default();
for parameter in retrieved_parameters {}

validation_regions: &[String],
) -> Result<HashMap<Region, HashMap<SsmKey, SsmValue>>> {
let mut parameter_map: HashMap<Region, HashMap<SsmKey, SsmValue>> = HashMap::new();
for list in parameter_lists {
Member

NIT: I would prefer to give it a more expressive variable name.

Suggested change
for list in parameter_lists {
for parameter_list in parameter_lists {

);
}
}
info!("SSM parameters in {} retrieved", region.to_string());
Member

Suggested change
info!("SSM parameters in {} retrieved", region.to_string());
info!("SSM parameters in {} have been retrieved", region.to_string());

@@ -41,6 +41,7 @@ serde_json = "1"
simplelog = "0.12"
Member

@gthao313 gthao313 Apr 5, 2023

The code and tests look good to me! awesome!!

I think it might be better to change the commit message to

pubsys: added SSM parameter validation

"a short paragraph to describe what changes you have and those
changes will help to build canary something like that"

@mjsterckx mjsterckx changed the title canary: added SSM parameter validation pubsys: added SSM parameter validation Apr 6, 2023
@mjsterckx
Contributor Author

^ Addressed all comments and suggested changes

Member

@webern webern left a comment

I updated the description to tag a couple of related issues.

I left one new comment, to use path.display().

#[snafu(display("Error reading config: {}", source))]
Config { source: pubsys_config::Error },

#[snafu(display("Error reading validation config at path {:?}: {}", path, source))]
Member

Paths are weird because they aren't guaranteed to be UTF-8 encoded. Here's how you do it:

Suggested change
#[snafu(display("Error reading validation config at path {:?}: {}", path, source))]
#[snafu(display("Error reading validation config at path {}: {}", path.display(), source))]
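
For context, the practical difference between the two format specifiers can be seen with plain `format!`, without snafu. In this small sketch the helper names `debug_message` and `display_message` are made up for illustration; `Path::display()` is the standard library method the suggestion relies on.

```rust
use std::path::Path;

/// Format a path with `{:?}`: the Debug form is quoted and escaped.
pub fn debug_message(path: &Path) -> String {
    format!("Error reading validation config at path {:?}", path)
}

/// Format a path with `Path::display()`: no quotes, and any non-UTF-8
/// bytes are rendered lossily instead of failing.
pub fn display_message(path: &Path) -> String {
    format!("Error reading validation config at path {}", path.display())
}
```

With `Path::new("config/validation.json")`, the Debug variant puts the path in double quotes, while the `display()` variant prints it bare, which reads more naturally in an error message.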

#[snafu(display("Infra.toml is missing {}", missing))]
MissingConfig { missing: String },

#[snafu(display("Failed to validate SSM parameters: {}", missing))]
Member

The SDK will retry retryable failures twice by default. I don't think any additional retry policies are necessary on pubsys' end.

Ok, I wonder if we want to increase that number of retries. Any guesses @cbgbt or @etungsten as to how many times we should retry on SSM calls?

@cbgbt
Contributor

cbgbt commented Apr 6, 2023

@webern @mjsterckx

Any guesses as to how many times we should retry on SSM calls?

I usually set up retries using random jitter and exponential backoff (to avoid a thundering herd) for some reasonable amount of time based on the use-case, rather than some static retry count. Maybe 30 seconds is within reason here?

We don't have great discipline with this in pubsys today. I think the functions to set parameters do their own retries, but most GETs rely on the SDK retry logic.
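
As a rough illustration of the time-bounded approach described in this comment (exponential backoff with random jitter, retrying until a budget such as 30 seconds is used up), here is a std-only sketch. All names (`backoff_ceiling`, `jittered`, `retry_with_budget`) are hypothetical and this is not pubsys code. The jitter source and the sleep function are passed in as closures, which is an assumption made to keep the example free of external crates; a real caller would likely supply a proper RNG and a real (possibly async) sleep.

```rust
use std::time::{Duration, Instant};

/// Upper bound on the delay before retry `attempt` (0-based): base * 2^attempt, capped at `cap`.
pub fn backoff_ceiling(attempt: u32, base: Duration, cap: Duration) -> Duration {
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).unwrap_or(cap).min(cap)
}

/// "Full jitter": scale the ceiling by a random factor; `unit` is clamped to [0.0, 1.0].
pub fn jittered(ceiling: Duration, unit: f64) -> Duration {
    ceiling.mul_f64(unit.clamp(0.0, 1.0))
}

/// Retry `op` until it succeeds or `budget` has elapsed since the first call.
/// `random_unit` supplies jitter in [0, 1] and `sleep` performs the wait.
pub fn retry_with_budget<T, E>(
    budget: Duration,
    base: Duration,
    cap: Duration,
    mut op: impl FnMut() -> Result<T, E>,
    mut random_unit: impl FnMut() -> f64,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let start = Instant::now();
    let mut attempt: u32 = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of time: surface the most recent error to the caller.
            Err(err) if start.elapsed() >= budget => return Err(err),
            Err(_) => {
                sleep(jittered(backoff_ceiling(attempt, base, cap), random_unit()));
                attempt = attempt.saturating_add(1);
            }
        }
    }
}
```

A caller might pass `Duration::from_secs(30)` as the budget and `std::thread::sleep` as the sleeper. Randomising each delay is what keeps many clients from retrying in lockstep, which is the thundering herd concern raised above.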

@mjsterckx
Contributor Author

I usually set up retries using random jitter and exponential backoff (to avoid a thundering herd) for some reasonable amount of time based on the use-case, rather than some static retry count. Maybe 30 seconds is within reason here?

It seems like that logic should be added to all get_parameters functions and not just the ones added in this PR. Would that be acceptable to move to a later PR?

Added a validate-ssm command to validate SSM parameters, given
a JSON config file with regions and paths to files containing
the expected parameters.
@mjsterckx mjsterckx merged commit b59757b into bottlerocket-os:develop Apr 6, 2023
@mjsterckx mjsterckx deleted the canary-ssm-validation branch April 6, 2023 18:48