aws_applicationautoscaling: Error: Only direct metrics are supported for Target Tracking. Use Step Scaling or supply a Metric object. #20659

Open
mostafafarzaneh opened this issue Jun 8, 2022 · 13 comments
Labels
@aws-cdk/aws-applicationautoscaling Related to AWS Application Auto Scaling bug This issue is a bug. needs-cfn This issue is waiting on changes to CloudFormation before it can be addressed. p2

Comments

@mostafafarzaneh

mostafafarzaneh commented Jun 8, 2022

Describe the bug

I would like to use a MathExpression for a custom metric in TargetTrackingScalingPolicy, but I get this error:

Only direct metrics are supported for Target Tracking. Use Step Scaling or supply a Metric object.

Checking the code here, it only checks for metricStat, not mathExpression.
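
To illustrate why a MathExpression is rejected, here is a simplified, standalone model of the kind of check described above. It is not the CDK source: the `MetricConfigLike` type and this `renderCustomMetric` function are hypothetical stand-ins that only mirror the idea that a metric resolving to `metricStat` is rendered, while one resolving to `mathExpression` hits the error.

```typescript
// Hypothetical stand-in for the config a CloudWatch metric resolves to.
// This is an illustration only, not the aws-cdk-lib type.
interface MetricConfigLike {
  metricStat?: { namespace: string; metricName: string; statistic: string };
  mathExpression?: { expression: string };
}

// Simplified model of the check: only the metricStat branch is handled,
// so anything else (such as a math expression) is rejected.
function renderCustomMetric(config: MetricConfigLike): {
  Namespace: string;
  MetricName: string;
  Statistic: string;
} {
  if (!config.metricStat) {
    throw new Error(
      'Only direct metrics are supported for Target Tracking. Use Step Scaling or supply a Metric object.'
    );
  }
  return {
    Namespace: config.metricStat.namespace,
    MetricName: config.metricStat.metricName,
    Statistic: config.metricStat.statistic
  };
}
```

In this model a plain metric renders fine, and any config that carries only `mathExpression` throws the message quoted in this issue.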

Expected Behavior

It should be possible to define a math expression for Target Tracking.

Current Behavior

Only direct metrics are allowed

Reproduction Steps

Create a Target Tracking policy using a math expression.

Possible Solution

No response

Additional Information/Context

No response

CDK CLI Version

2.27.0

Framework Version

No response

Node.js Version

16.15.0

OS

Debian 10

Language

Python

Language Version

No response

Other information

No response

@mostafafarzaneh mostafafarzaneh added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Jun 8, 2022
@github-actions github-actions bot added the @aws-cdk/aws-applicationautoscaling Related to AWS Application Auto Scaling label Jun 8, 2022
@mostafafarzaneh
Author

I also tried to create a metric this way:

   custom_metric = cloudwatch.MathExpression(
      expression='SELECT AVG(ActiveConnections) FROM "myMetrics/custom"',
      period=Duration.minutes(1),
   )

and used it in a StepScalingPolicy. CDK complains:

Alarm contains invalid expressions. (Service: AmazonCloudWatch; Status Code: 400; Error Code: ValidationError; Request ID: 3c245f6f-9d5e-492e-b2e1-e0fa83422594; Proxy: null)

@peterwoodworth
Contributor

These properties are directly passed to the ScalingPolicy CloudFormation resource in this property.

Our Metric class supports these properties, while our MathExpression class doesn't. I think we would need additional functionality from CloudFormation for this to be implemented.

@peterwoodworth peterwoodworth added p2 needs-cfn This issue is waiting on changes to CloudFormation before it can be addressed. and removed needs-triage This issue or PR still needs to be triaged. labels Jun 9, 2022
@comcalvi comcalvi removed their assignment Jun 20, 2022
@shw1n

shw1n commented Feb 3, 2023

Also encountering this issue

@matthias-pichler-warrify
Contributor

> Our Metric class supports these properties, while our MathExpression class doesn't. I think we would need additional functionality from cloudformation for this to be implemented

It seems like CloudFormation's AWS::AutoScaling::ScalingPolicy is indeed lacking some configuration parameters. The CustomizedMetricSpecification type from the AutoScaling API has a Metrics member where expressions can be specified, as shown in the docs. On the other hand, the AWS::AutoScaling::ScalingPolicy CustomizedMetricSpecification does NOT have the Metrics property.

@hectorsouthern

It looks like AWS announced support for this recently: https://www.amazonaws.cn/en/new/2023/application-auto-scaling-supports-metric-math-for-target-tracking-policies/

and the documentation is now available: https://docs.aws.amazon.com/autoscaling/application/userguide/application-auto-scaling-target-tracking-metric-math.html

It would be great to see this feature in CDK too

@SamStephens
Contributor

> It looks like AWS announced support for this recently: https://www.amazonaws.cn/en/new/2023/application-auto-scaling-supports-metric-math-for-target-tracking-policies/
>
> and the documentation is now available: https://docs.aws.amazon.com/autoscaling/application/userguide/application-auto-scaling-target-tracking-metric-math.html
>
> It would be great to see this feature in CDK too

As per the second page you link to, "This feature is not yet available in AWS CloudFormation." So CDK either has to wait for CloudFormation support, or provide this via a custom resource.

@zubairzahoor

Any updates on this? Or any workarounds via CDK as of now?

@alexbaileyuk

alexbaileyuk commented Aug 2, 2023

@zubairzahoor

I came across this today and lost a few hours on it. As a temporary workaround, I've added a custom resource using AwsCustomResource. It's not ideal, but it could be an option for you in the meantime.

import { AwsCustomResource, AwsCustomResourcePolicy, PhysicalResourceId } from 'aws-cdk-lib/custom-resources';
import { Construct } from 'constructs';
import { Metric } from 'aws-cdk-lib/aws-cloudwatch';
import { IQueue } from 'aws-cdk-lib/aws-sqs';
import { Effect, PolicyStatement } from 'aws-cdk-lib/aws-iam';
import { buildVersion } from '../../../utils/build-version';

interface EcsSqsMathExpressionAutoScalingPolicyProps {
  targetValue: number;
  resourceId: string; // format service/{cluster_name}/{service_name}
  queue: IQueue;
  taskCountMetric: Metric;
}

export class EcsSqsMathExpressionAutoScalingPolicy extends Construct {
  constructor(scope: Construct, id: string, props: EcsSqsMathExpressionAutoScalingPolicyProps) {
    super(scope, id);

    new AwsCustomResource(this, 'scaling-put-autoscaling-policy', {
      onUpdate: {
        physicalResourceId: PhysicalResourceId.of(`sqs-backlog-scaling-policy/${props.resourceId}`),
        service: 'ApplicationAutoScaling',
        action: 'putScalingPolicy',
        parameters: {
          PolicyName: `sqs-backlog-scaling-policy-${props.resourceId}-${buildVersion}`,
          PolicyType: 'TargetTrackingScaling',
          ResourceId: props.resourceId,
          ScalableDimension: 'ecs:service:DesiredCount',
          ServiceNamespace: 'ecs',
          TargetTrackingScalingPolicyConfiguration: {
            TargetValue: props.targetValue,
            CustomizedMetricSpecification: {
              Metrics: [
                {
                  Id: 'm1',
                  Label: 'Approx. # of Messages Visible',
                  ReturnData: false,
                  MetricStat: {
                    Stat: 'Sum',
                    Metric: {
                      MetricName: props.queue.metricApproximateNumberOfMessagesVisible().metricName,
                      Namespace: props.queue.metricApproximateNumberOfMessagesVisible().namespace,
                      Dimensions: [
                        {
                          Name: 'QueueName',
                          Value: props.queue.queueName
                        }
                      ]
                    }
                  }
                },
                {
                  Id: 'm2',
                  Label: 'Running Instances Count',
                  ReturnData: false,
                  MetricStat: {
                    Stat: 'Average',
                    Metric: {
                      MetricName: props.taskCountMetric.metricName,
                      Namespace: props.taskCountMetric.namespace,
                      Dimensions: Object.entries(props.taskCountMetric.dimensions || {}).map(([key, value]) => ({
                        Name: key,
                        Value: value
                      }))
                    }
                  }
                },
                {
                  Label: 'Backlog per Instance',
                  Id: 'e1',
                  Expression: 'm1 / m2',
                  ReturnData: true
                }
              ]
            }
          }
        }
      },
      policy: AwsCustomResourcePolicy.fromStatements([
        new PolicyStatement({
          effect: Effect.ALLOW,
          actions: ['application-autoscaling:*', 'ecs:DescribeServices', 'ecs:UpdateService'],
          resources: ['*']
        })
      ])
    });
  }
}

You can use it like this:

    this.scaling = this.fargateService.autoScaleTaskCount({
      minCapacity: 0,
      maxCapacity: 100
    });

    const customScalingPolicy = new EcsSqsMathExpressionAutoScalingPolicy(this, 'scaling-policy', {
      targetValue: props.acceptableLatency.toSeconds() / props.averageMessageProcessingTime.toSeconds(),
      resourceId: `service/${props.cluster.clusterName}/${this.fargateService.serviceName}`,
      queue: queue,
      taskCountMetric: desiredCountMetric
    });

    customScalingPolicy.node.addDependency(this.scaling);

It may need some adaptations to meet your needs but it should give you a good starting point.

I should mention I've not fully tested this yet so if you notice anything weird then please share :)
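
The `targetValue` passed in the usage snippet above is the acceptable latency divided by the average time to process one message, which gives the backlog each task can carry. A minimal sketch of that arithmetic follows; the function name `backlogTargetPerTask` and the example numbers are assumptions made for illustration, not part of the workaround.

```typescript
// Target backlog per task = acceptable latency / average processing time per message.
// Both arguments are in seconds. The name and the validation are illustrative assumptions.
function backlogTargetPerTask(
  acceptableLatencySeconds: number,
  averageProcessingSeconds: number
): number {
  if (!(acceptableLatencySeconds > 0) || !(averageProcessingSeconds > 0)) {
    throw new RangeError('Both durations must be positive numbers of seconds');
  }
  return acceptableLatencySeconds / averageProcessingSeconds;
}

// Hypothetical numbers: a message may wait up to 10 minutes and takes about 2 seconds to process,
// so each task can carry a backlog of 600 / 2 = 300 messages.
const exampleTarget = backlogTargetPerTask(600, 2);
```

With these example numbers, target tracking would aim to keep the `m1 / m2` expression near 300.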

@zubairzahoor

@alexbaileyuk Thank you!
Tried this for my use case (with AmazonMq/ECS) and it seems to work. What are the minimum permissions needed for the execution role of the Lambda here?

@alexbaileyuk

@zubairzahoor due to difficulties with this method, I ended up writing a totally different function which pre-calculates the backlog per instance by pulling the queue length and desired task count and publishing the result as a custom metric. Something like this:

import { DescribeServicesCommand, ECSClient, paginateListServices } from '@aws-sdk/client-ecs';
import { CloudWatchClient, MetricDatum, PutMetricDataCommand } from '@aws-sdk/client-cloudwatch';
import { SQSClient, GetQueueAttributesCommand } from '@aws-sdk/client-sqs';

const ecsClient = new ECSClient({
  region: 'eu-west-1'
});

const cloudwatchClient = new CloudWatchClient({
  region: 'eu-west-1'
});

const sqsClient = new SQSClient({
  region: 'eu-west-1'
});

export const putInstanceBacklogMetrics = async (clusterName: string) => {
  const consumerServices = await listServices(clusterName);

  const backlogMetrics = await Promise.all(consumerServices.map((serviceArn) => calculateBacklogForConsumerService(clusterName, serviceArn)));

  const metrics = backlogMetrics.map((backlog) => {
    console.log(`Service ${backlog.serviceName} has desired count ${backlog.desiredCount} and queue length ${backlog.queueLength}`);

    let instanceBacklog = null;

    if (backlog.desiredCount === 0 && backlog.queueLength > 0) {
      // If there are no instances running we have to pretend the backlog is the acceptable backlog per instance + 1
      // so that we scale up to one instance. This allows us to scale down to zero instances when there is no backlog.
      // This will cause some jitter in the instance backlog metric, but it allows us to scale to zero. In test environments
      // it'll be fine, in production we'll have enough traffic that the jitter will be negligible and instances will usually be
      // scaled up to at least one.
      instanceBacklog = backlog.queueLength > backlog.acceptableBacklogPerInstance ? backlog.queueLength : backlog.acceptableBacklogPerInstance + 1;
    } else if (backlog.queueLength === 0) {
      instanceBacklog = 0;
    } else if (backlog.desiredCount > 0) {
      instanceBacklog = backlog.queueLength / backlog.desiredCount;
    } else {
      instanceBacklog = 0;
    }

    return {
      MetricName: 'ConsumerInstanceBacklog',
      Dimensions: [
        {
          Name: 'ClusterName',
          Value: clusterName
        },
        {
          Name: 'ServiceName',
          Value: backlog.serviceName
        }
      ],
      Value: instanceBacklog
    };
  });

  if (metrics.length === 0) {
    console.log('No consumer services found');
    return;
  }

  await putConsumerInstanceBacklogMetric(metrics);
};

const listServices = async (clusterName: string) => {
  const paginator = paginateListServices({ client: ecsClient }, { cluster: clusterName });

  const serviceArns: string[] = [];

  for await (const page of paginator) {
    for (const serviceArn of page.serviceArns ?? []) {
      if (await isConsumerService(clusterName, serviceArn)) {
        serviceArns.push(serviceArn);
      }
    }
  }

  console.log(`Found ${serviceArns.length} consumer services in cluster ${clusterName}`);

  return serviceArns;
};

const isConsumerService = async (clusterName: string, serviceArn: string) => {
  const serviceDetails = await ecsClient.send(
    new DescribeServicesCommand({
      cluster: clusterName,
      services: [serviceArn],
      include: ['TAGS']
    })
  );

  // services may be an empty array, so guard the first element before reading its tags.
  const tags = serviceDetails.services?.[0]?.tags ?? [];

  return (
    tags.some((tag) => tag.key === 'QueueUrl') &&
    tags.some((tag) => tag.key === 'AcceptableBacklogPerInstance')
  );
};

const calculateBacklogForConsumerService = async (clusterName: string, serviceArn: string) => {
  const serviceDetails = await ecsClient.send(
    new DescribeServicesCommand({
      cluster: clusterName,
      services: [serviceArn],
      include: ['TAGS']
    })
  );

  // services may be an empty array, so guard the first element as well.
  const service = serviceDetails.services?.[0];
  const desiredCount = service?.desiredCount || 0;
  // The tag holds the queue URL, which is what GetQueueAttributes expects.
  const queueUrl = service?.tags?.find((tag) => tag.key === 'QueueUrl')?.value || '';

  const queueLength = await getQueueLength(queueUrl);

  const acceptableBacklogPerInstance = parseInt(
    service?.tags?.find((tag) => tag.key === 'AcceptableBacklogPerInstance')?.value || '0',
    10
  );

  return {
    serviceName: service?.serviceName || 'UNKNOWN',
    desiredCount: desiredCount,
    queueLength: queueLength,
    acceptableBacklogPerInstance: acceptableBacklogPerInstance
  };
};

const getQueueLength = async (queueUrl: string) => {
  const queueDetails = await sqsClient.send(
    new GetQueueAttributesCommand({
      QueueUrl: queueUrl,
      AttributeNames: ['ApproximateNumberOfMessages']
    })
  );

  return parseInt(queueDetails.Attributes?.ApproximateNumberOfMessages || '0', 10);
};

const putConsumerInstanceBacklogMetric = async (metrics: MetricDatum[]) => {
  await cloudwatchClient.send(
    new PutMetricDataCommand({
      Namespace: 'ECS/CustomServiceMetrics',
      MetricData: metrics
    })
  );
};

It also relies on some tags on the ECS services. It's a bit messy and not well refined at the moment since I'm still testing and working on edge cases like the scale to zero ones. It loops through all services in a cluster and based on their tags and metrics, defines a new metric called ConsumerInstanceBacklog to do target tracking against.
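
The branching inside the function above, including the scale-to-zero case, can be pulled out into a pure function so the edge cases are easy to check without AWS calls. This is only a sketch under that assumption: `BacklogInput` and `instanceBacklog` are hypothetical names, and the logic is meant to mirror the branches in the code above rather than replace it.

```typescript
// Inputs gathered per service in the code above (hypothetical type name).
interface BacklogInput {
  desiredCount: number;
  queueLength: number;
  acceptableBacklogPerInstance: number;
}

// Mirrors the branch order of the original function.
function instanceBacklog(input: BacklogInput): number {
  const { desiredCount, queueLength, acceptableBacklogPerInstance } = input;

  if (desiredCount === 0 && queueLength > 0) {
    // No tasks are running but messages are waiting: report a value above the
    // acceptable backlog so target tracking scales out to at least one task.
    return queueLength > acceptableBacklogPerInstance
      ? queueLength
      : acceptableBacklogPerInstance + 1;
  }
  if (queueLength === 0) {
    return 0; // Empty queue: nothing to do, which allows scaling in to zero.
  }
  if (desiredCount > 0) {
    return queueLength / desiredCount; // Normal case: backlog shared across tasks.
  }
  return 0; // Defensive fallback for unexpected input.
}
```

For example, with no tasks running, 5 queued messages and an acceptable backlog of 10, the function reports 11 rather than 5, so the policy still sees a value above target and starts a task.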

I'd advise doing something similar. The main issues with the custom resource came on stack updates. You can't create the scaling policy without defining a name, and when defining a name I ended up with tons of issues trying to update/replace/rollback etc. I'd recommend not using the custom resource method above for those reasons.

@zubairzahoor

@alexbaileyuk I am more comfortable using the above; it works well for me. Were there any issues you encountered with scaling in using the custom resource?

@alexbaileyuk

@zubairzahoor we're going to production later in the week with a more refined version of the code. We've not found any major issues so far.

@bmeudre

bmeudre commented Jan 22, 2024

Very interesting discussion. I managed to fix it with CDK-only syntax. I hope it helps 😉

const resourceId = `endpoint/${this.endpointName}/variant/${variant.name}`;

// To define min/max values
const target = new ScalableTarget(this, 'ScalableTarget', {
  serviceNamespace: ServiceNamespace.SAGEMAKER,
  minCapacity: variant.autoScale.minCapacity,
  maxCapacity: variant.autoScale.maxCapacity,
  scalableDimension: 'sagemaker:variant:DesiredInstanceCount',
  resourceId,
});

// We need the endpoint before creating the autoscaling policy
target.node.addDependency(endpoint);

const scalingPolicy = new CfnScalingPolicy(this, 'ScalingPolicy', {
  policyName: resourceId,
  scalingTargetId: target.scalableTargetId,
  policyType: 'TargetTrackingScaling',
  targetTrackingScalingPolicyConfiguration: {
    targetValue: variant.autoScale.targetProcessingTime,
  },
});

// CDK doesn't support math expression in target tracking, adding it in cloudformation manually
scalingPolicy.addPropertyOverride(
  'TargetTrackingScalingPolicyConfiguration.CustomizedMetricSpecification',
  {
    Metrics: [
      {
        Id: 'm1',
        ReturnData: false,
        MetricStat: {
          Stat: 'Average',
          Metric: {
            MetricName: 'TotalProcessingTime',
            Namespace: 'AWS/SageMaker',
            Dimensions: [
              {
                Name: 'EndpointName',
                Value: this.endpointName,
              },
              {
                Name: 'VariantName',
                Value: variant.name,
              },
            ],
          },
        },
      },
      {
        Id: 'm2',
        ReturnData: true,
        Expression: 'FILL(m1, 0)',
      },
    ],
  }
);
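
Because a property override bypasses CDK's type checking, it can help to sanity-check the Metrics array before synthesizing. The sketch below is an assumption-laden helper, not part of CDK: `MetricQueryLike` and `checkMetricQueries` are hypothetical names, and the rules encoded (unique Ids, exactly one of Expression or MetricStat per entry, and ReturnData set to true only on the final expression) reflect my reading of the Application Auto Scaling metric math documentation.

```typescript
// Loose shape of one entry in CustomizedMetricSpecification.Metrics (hypothetical type).
interface MetricQueryLike {
  Id: string;
  ReturnData: boolean;
  Expression?: string;
  MetricStat?: unknown;
}

// Returns a list of human-readable problems; an empty list means the checks passed.
function checkMetricQueries(metrics: MetricQueryLike[]): string[] {
  const problems: string[] = [];
  const seenIds = new Set<string>();

  for (const entry of metrics) {
    if (seenIds.has(entry.Id)) {
      problems.push(`duplicate Id "${entry.Id}"`);
    }
    seenIds.add(entry.Id);

    const hasExpression = entry.Expression !== undefined;
    const hasMetricStat = entry.MetricStat !== undefined;
    if (hasExpression === hasMetricStat) {
      problems.push(`"${entry.Id}" needs exactly one of Expression or MetricStat`);
    }
  }

  const returned = metrics.filter((entry) => entry.ReturnData).length;
  if (returned !== 1) {
    problems.push(`expected exactly one entry with ReturnData: true, found ${returned}`);
  }
  return problems;
}
```

Running it against the two-entry array above (m1 as a MetricStat with ReturnData false, m2 as `FILL(m1, 0)` with ReturnData true) should yield no problems.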


10 participants