CloudWatchMetricPublisher fails to publish high QPS detailed metrics #3080

@Kurru

Describe the bug

When publishing a detailed metric to CloudWatch using CloudWatchMetricPublisher, the cloudWatchClient.putMetricData call fails with a request validation exception:

WARN cloudwatch:117 - Failed while publishing some or all AWS SDK client-side metrics to CloudWatch.

java.util.concurrent.CompletionException: software.amazon.awssdk.services.cloudwatch.model.InvalidParameterValueException: The collection MetricData.member.13.Values must not have a size greater than 150. (Service: CloudWatch, Status Code: 400, Request ID: a84e1dcb-bfba-495b-97f5-60d7316c1800)

I believe this happens because a single metric carries more than 150 data points, though I can't confirm it because the validation happens on the AWS side.

This limitation seems to be documented here:

For example, a single PutMetricData call can include 20 metrics and 150 data points.

Expected behavior

The client library should take this limit into account when constructing the request, either by sending the additional data points in later requests or by dropping them.
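To illustrate the "send the additional data points in later requests" option, here is a minimal sketch of splitting one metric's values into batches that each stay within the 150-value limit from the error message. ValueChunker and its method are hypothetical names made up for this example; they are not part of the AWS SDK, and the SDK's aggregator may structure this differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not an AWS SDK class.
final class ValueChunker {
    // Per-metric limit taken from the error message in this issue.
    static final int MAX_VALUES_PER_DATUM = 150;

    private ValueChunker() {
    }

    // Splits values into consecutive chunks of at most maxChunkSize elements.
    // Each chunk could then be sent as its own MetricDatum, possibly in a later request.
    static <T> List<List<T>> chunk(List<T> values, int maxChunkSize) {
        if (maxChunkSize <= 0) {
            throw new IllegalArgumentException("maxChunkSize must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int start = 0; start < values.size(); start += maxChunkSize) {
            int end = Math.min(start + maxChunkSize, values.size());
            // Copy the sub-list so each chunk is independent of the source list.
            chunks.add(new ArrayList<>(values.subList(start, end)));
        }
        return chunks;
    }
}
```

With the 1000 distinct values from the reproduction steps, ValueChunker.chunk(values, ValueChunker.MAX_VALUES_PER_DATUM) would produce six chunks of 150 values and one chunk of 100.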

Current behavior

cloudWatchClient.putMetricData fails and a warning is logged.

All metric values for this high QPS service are lost.

Steps to Reproduce

I haven't validated this, but I believe the following should trigger the error.

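import java.time.Duration;
import software.amazon.awssdk.core.metrics.CoreMetric;
import software.amazon.awssdk.metrics.MetricCollector;
import software.amazon.awssdk.metrics.publishers.cloudwatch.CloudWatchMetricPublisher;

// asyncClient is an existing CloudWatchAsyncClient.
var publisher = CloudWatchMetricPublisher.builder()
    .namespace("Test")
    .cloudWatchClient(asyncClient)
    .detailedMetrics(CoreMetric.API_CALL_DURATION)
    .dimensions(CoreMetric.OPERATION_NAME)
    .build();

// Report 1000 distinct durations for the same operation.
for (int i = 0; i < 1000; i++) {
    MetricCollector methodCollector = MetricCollector.create("RPC");
    // API_CALL_DURATION is a Duration metric, so wrap the raw value.
    methodCollector.reportMetric(CoreMetric.API_CALL_DURATION, Duration.ofMillis(i));
    methodCollector.reportMetric(CoreMetric.OPERATION_NAME, "YourRPC");
    publisher.publish(methodCollector.collect());
}

// Wait for the periodic metric flush
Thread.sleep(120 * 1000);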

Possible Solution

software/amazon/awssdk/metrics/publishers/cloudwatch/internal/transform/MetricCollectionAggregator.java:128 could be updated from:

MetricDatum data = detailedMetricDatum(timeBucket, detailedAggregator, startIndex, MAX_VALUES_PER_REQUEST - valuesInRequestCounter.get());

To something like:

MetricDatum data = detailedMetricDatum(timeBucket, detailedAggregator, startIndex, Math.min(MAX_VALUES_PER_REQUEST - valuesInRequestCounter.get(), 150));
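The arithmetic behind that clamp can be sketched in isolation. DatumBudget is a hypothetical name used only here, the request-wide limit is passed in as a parameter rather than taken from the SDK, and the guard against a negative remainder is an extra defensive step that the one-line change above does not include.

```java
// Hypothetical helper for illustration only; not an AWS SDK class.
final class DatumBudget {
    // Per-metric limit taken from the error message in this issue.
    static final int MAX_VALUES_PER_DATUM = 150;

    private DatumBudget() {
    }

    // Returns how many values the next MetricDatum may carry: whatever is left
    // of the request-wide budget, but never more than the per-datum limit.
    static int valuesForNextDatum(int maxValuesPerRequest, int valuesAlreadyInRequest) {
        // Defensive assumption: treat an overdrawn budget as zero remaining.
        int remaining = Math.max(0, maxValuesPerRequest - valuesAlreadyInRequest);
        return Math.min(remaining, MAX_VALUES_PER_DATUM);
    }
}
```

Assuming, purely for illustration, a request-wide budget of 300 values: an empty request would allow 150 values in the next datum instead of 300, and a request already holding 200 values would allow 100.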

Context

Trying to implement metrics to monitor the performance of my API server using custom CloudWatch metrics.

AWS Java SDK version used

17

JDK version used

Whatever is in docker image openjdk:17-oracle

Operating System and version

Docker image: openjdk:17-oracle linux-oracle?

Labels

bug (This issue is a bug.), p2 (This is a standard priority issue.)
