-
Notifications
You must be signed in to change notification settings - Fork 47
Calculate metrics of credential fetching from Pods & upload to s3 #512
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -84,10 +84,22 @@ spec: | |
| default: "200" | ||
| - name: cl2-uniform-qps | ||
| default: "100" | ||
| - name: cl2-metric-dimension-name | ||
| description: "default metric dimension name" | ||
| default: "ClusterName" | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You have the same defaults at task level, we don't have to carry this to pipeline. Pipeline would take task level defaults when not supplied. Same for other params as well.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. but when we make a runpipeline, that's on pipeline level? So we don't need to make another code change here to change the name
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, but run pipeline will take task level defaults when not supplied. |
||
| - name: cl2-metric-namespace | ||
| description: "default metric namespace for pod identity" | ||
| default: "EKSPodIdentityScalabilityTests" | ||
| - name: cl2-metric-latency-name | ||
| description: "default metric latency name for pod identity" | ||
| default: "CredentialFetchLatency" | ||
| - name: cl2-metric-period | ||
| description: "default metric period" | ||
| default: "300" | ||
| - name: timeout-pia-pod-creation | ||
| default: "20m" | ||
| default: "80s" | ||
| - name: timeout-pia-pod-startup | ||
| default: "5m" | ||
| default: "60s" | ||
| - name: launch-template-ami | ||
| default: "" | ||
| description: "Launch template ImageId value, which may be an AMI ID or resolve:ssm reference. By default resolve to the lates AL2023 ami for cluster version" | ||
|
|
@@ -189,7 +201,7 @@ spec: | |
| value: "$(params.kubernetes-version)" | ||
| - name: endpoint | ||
| value: $(params.endpoint) | ||
| - name: node-role-name | ||
| - name: node-role-name | ||
| value: $(params.cluster-name)-node-role | ||
| - name: ami | ||
| value: $(params.launch-template-ami) | ||
|
|
@@ -279,12 +291,22 @@ spec: | |
| value: $(params.cl2-default-burst) | ||
| - name: cl2-uniform-qps | ||
| value: $(params.cl2-uniform-qps) | ||
| - name: cl2-metric-dimension-name | ||
| value: $(params.cl2-metric-dimension-name) | ||
| - name: cl2-metric-namespace | ||
| value: $(params.cl2-metric-namespace) | ||
| - name: cl2-metric-latency-name | ||
| value: $(params.cl2-metric-latency-name) | ||
| - name: cl2-metric-period | ||
| value: $(params.cl2-metric-period) | ||
| - name: results-bucket | ||
| value: $(params.results-bucket) | ||
| - name: nodes | ||
| value: $(params.desired-nodes) | ||
| - name: cluster-name | ||
| value: $(params.cluster-name) | ||
| - name: endpoint | ||
| value: $(params.endpoint) | ||
| - name: namespace-prefix | ||
| value: $(params.namespace-prefix) | ||
| - name: namespace-count | ||
|
|
||
| Original file line number | Diff line number | Diff line change | ||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -25,6 +25,18 @@ spec: | |||||||||||||||||
| - name: cl2-uniform-qps | ||||||||||||||||||
| description: "uniform qps" | ||||||||||||||||||
| default: "100" | ||||||||||||||||||
| - name: cl2-metric-dimension-name | ||||||||||||||||||
| description: "default metric dimension name" | ||||||||||||||||||
| default: "ClusterName" | ||||||||||||||||||
| - name: cl2-metric-namespace | ||||||||||||||||||
| description: "default metric namespace for pod identity" | ||||||||||||||||||
| default: "EKSPodIdentityScalabilityTests" | ||||||||||||||||||
| - name: cl2-metric-latency-name | ||||||||||||||||||
| description: "default metric latency name for pod identity" | ||||||||||||||||||
| default: "CredentialFetchLatency" | ||||||||||||||||||
| - name: cl2-metric-period | ||||||||||||||||||
| description: "default metric period" | ||||||||||||||||||
| default: "300" | ||||||||||||||||||
| - name: nodes | ||||||||||||||||||
| description: "number of dataplane nodes to run the load test against" | ||||||||||||||||||
| default: "1000" | ||||||||||||||||||
|
|
@@ -35,6 +47,8 @@ spec: | |||||||||||||||||
| description: The region where the cluster is in. | ||||||||||||||||||
| - name: cluster-name | ||||||||||||||||||
| description: "The name of the EKS cluster you want to spin" | ||||||||||||||||||
| - name: endpoint | ||||||||||||||||||
| default: "" | ||||||||||||||||||
| - name: namespace-prefix | ||||||||||||||||||
| default: "default" | ||||||||||||||||||
| description: "The prefix of namespaces for EKS Pod Identity test." | ||||||||||||||||||
|
|
@@ -89,6 +103,11 @@ spec: | |||||||||||||||||
| CL2_DEFAULT_QPS: $(params.cl2-default-qps) | ||||||||||||||||||
| CL2_DEFAULT_BURST: $(params.cl2-default-burst) | ||||||||||||||||||
| CL2_UNIFORM_QPS: $(params.cl2-uniform-qps) | ||||||||||||||||||
| CL2_CLUSTER_NAME: $(params.cluster-name) | ||||||||||||||||||
| CL2_METRIC_DIMENSION_NAME: $(params.cl2-metric-dimension-name) | ||||||||||||||||||
| CL2_METRIC_NAMESPACE: $(params.cl2-metric-namespace) | ||||||||||||||||||
| CL2_METRIC_LATENCY_NAME: $(params.cl2-metric-latency-name) | ||||||||||||||||||
| CL2_METRIC_PERIOD: $(params.cl2-metric-period) | ||||||||||||||||||
| CL2_NAMESPACE_PREFIX: $(params.namespace-prefix) | ||||||||||||||||||
| CL2_NAMESPACE_COUNT: $(params.namespace-count) | ||||||||||||||||||
| CL2_TIMEOUT_EKS_POD_IDENTITY_POD_CREATION: $(params.timeout-pia-pod-creation) | ||||||||||||||||||
|
|
@@ -172,9 +191,90 @@ spec: | |||||||||||||||||
| image: amazon/aws-cli | ||||||||||||||||||
| workingDir: $(workspaces.results.path) | ||||||||||||||||||
| script: | | ||||||||||||||||||
| yum install -y jq | ||||||||||||||||||
|
|
||||||||||||||||||
| S3_RESULT_PATH=$(cat $(results.s3_result.path)) | ||||||||||||||||||
| echo "S3 Path: $S3_RESULT_PATH" | ||||||||||||||||||
| aws sts get-caller-identity | ||||||||||||||||||
|
|
||||||||||||||||||
| REGION=$(params.region) | ||||||||||||||||||
| ENDPOINT_FLAG="" | ||||||||||||||||||
| if [ -n "$(params.endpoint)" ]; then | ||||||||||||||||||
| ENDPOINT_FLAG="--endpoint $(params.endpoint)" | ||||||||||||||||||
| fi | ||||||||||||||||||
|
|
||||||||||||||||||
| CLUSTER_NAME=$(params.cluster-name) | ||||||||||||||||||
| NAMESPACE=$(params.cl2-metric-namespace) | ||||||||||||||||||
| DIMENSION_NAME=$(params.cl2-metric-dimension-name) | ||||||||||||||||||
| DIMENSION_VALUE=$CLUSTER_NAME | ||||||||||||||||||
| METRIC_LATENCY_NAME=$(params.cl2-metric-latency-name) | ||||||||||||||||||
| PERIOD=$(params.cl2-metric-period) | ||||||||||||||||||
|
|
||||||||||||||||||
| # since the scalability test is running with the same cluster name, with cluster recreation | ||||||||||||||||||
| # it is important to know the range of start and end time to query metrics for the current run | ||||||||||||||||||
| # here we use cluster creation time start as start time and the current time as end time | ||||||||||||||||||
| START_TIME=$(aws eks $ENDPOINT_FLAG --region $REGION describe-cluster \ | ||||||||||||||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. could you add comments on why you consider this as start time ? Also, please add comments overall to make it more readable for future users/consumers on your team, especially wherever you are making assumptions. |
||||||||||||||||||
| --name "$CLUSTER_NAME" \ | ||||||||||||||||||
| --query "cluster.createdAt" \ | ||||||||||||||||||
| --output text) | ||||||||||||||||||
|
|
||||||||||||||||||
| END_TIME=$(date -u +"%Y-%m-%dT%H:%M:%SZ") | ||||||||||||||||||
|
|
||||||||||||||||||
| response=$(aws cloudwatch get-metric-statistics \ | ||||||||||||||||||
| --region "$REGION" \ | ||||||||||||||||||
| --namespace "$NAMESPACE" \ | ||||||||||||||||||
| --metric-name "$METRIC_LATENCY_NAME" \ | ||||||||||||||||||
| --dimensions Name="$DIMENSION_NAME",Value="$DIMENSION_VALUE" \ | ||||||||||||||||||
| --start-time "$START_TIME" \ | ||||||||||||||||||
| --end-time "$END_TIME" \ | ||||||||||||||||||
| --period "$PERIOD" \ | ||||||||||||||||||
| --extended-statistics p50 p99 p99.95 \ | ||||||||||||||||||
| --output json) | ||||||||||||||||||
|
|
||||||||||||||||||
| # extract p50 p99 p99.95 of credential fetching | ||||||||||||||||||
| latest=$(echo "$response" | jq -r '.Datapoints | sort_by(.Timestamp) | last') | ||||||||||||||||||
| p50=$(echo "$latest" | jq -r '."ExtendedStatistics"."p50" // "N/A"') | ||||||||||||||||||
| p99=$(echo "$latest" | jq -r '."ExtendedStatistics"."p99" // "N/A"') | ||||||||||||||||||
| p9995=$(echo "$latest" | jq -r '."ExtendedStatistics"."p99.95" // "N/A"') | ||||||||||||||||||
|
|
||||||||||||||||||
| response=$(aws cloudwatch get-metric-statistics \ | ||||||||||||||||||
| --region "$REGION" \ | ||||||||||||||||||
| --namespace "$NAMESPACE" \ | ||||||||||||||||||
| --metric-name "$METRIC_LATENCY_NAME" \ | ||||||||||||||||||
| --dimensions Name="$DIMENSION_NAME",Value="$DIMENSION_VALUE" \ | ||||||||||||||||||
| --start-time "$START_TIME" \ | ||||||||||||||||||
| --end-time "$END_TIME" \ | ||||||||||||||||||
| --period "$PERIOD" \ | ||||||||||||||||||
| --statistics SampleCount \ | ||||||||||||||||||
| --output json) | ||||||||||||||||||
|
|
||||||||||||||||||
| total_samples=$(echo "$response" | jq '[.Datapoints[].SampleCount] | add // 0') | ||||||||||||||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We need to capture what is the Currently you have some kind of measuring by able to control the This will only give some kind of approximation but not fully accurate throughput of your service using this kind of measurement. |
||||||||||||||||||
| rate=$(params.cl2-default-qps) | ||||||||||||||||||
|
|
||||||||||||||||||
| # save metric results for s3 upload | ||||||||||||||||||
| cat <<EOF > eks_pod_identity_test_summary.json | ||||||||||||||||||
| { | ||||||||||||||||||
| "start_time": "$START_TIME", | ||||||||||||||||||
| "end_time": "$END_TIME", | ||||||||||||||||||
| "total_samples": $total_samples, | ||||||||||||||||||
| "rate": $rate, | ||||||||||||||||||
| "p50": $p50, | ||||||||||||||||||
| "p99": $p99, | ||||||||||||||||||
| "p99.95": $p9995 | ||||||||||||||||||
| } | ||||||||||||||||||
| EOF | ||||||||||||||||||
|
|
||||||||||||||||||
| # we expect to see all files from loadtest that clusterloader2 outputs here in this dir | ||||||||||||||||||
| ls -larth | ||||||||||||||||||
| aws s3 cp . s3://$S3_RESULT_PATH/ --recursive | ||||||||||||||||||
|
|
||||||||||||||||||
| # if p99.95 is equal to or more than 1 second, exit with failure | ||||||||||||||||||
| int_p9995=$(echo "$p9995" | awk '{printf "%d", $1}') | ||||||||||||||||||
| if [ "$int_p9995" -lt 1 ]; then | ||||||||||||||||||
| echo "p99.95 is less than 1 second" | ||||||||||||||||||
| echo "1" | tee $(results.datapoint.path) | ||||||||||||||||||
| else | ||||||||||||||||||
| echo "p99.95 is 1 second or more" | ||||||||||||||||||
| echo "0" | tee $(results.datapoint.path) | ||||||||||||||||||
| exit 1 | ||||||||||||||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We don't want to exit and fail the test, we need to capture the test result like this kubernetes-iteration-toolkit/tests/tekton-resources/tasks/generators/clusterloader/load-slos.yaml Lines 171 to 177 in 3f5523d
and use that to emit result like how we do it here - Line 360 in 3f5523d
So it can used configure alarm and cut tickets to your team. |
||||||||||||||||||
| fi | ||||||||||||||||||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think its better to verify the role being used to make sure we are not using the instance role.
aws sts get-caller-identity | grep <role_name>There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit concerned about sts quota we consume running in this account
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i understand about the limits but i think we need a way to make sure that the command is successful because of token by pod identity and not using the instance role.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a comment on how we can make sure this test is working as intended without checking on the assumed role identity