This repository has been archived by the owner on Jul 19, 2023. It is now read-only.

Error reconciling customizedMetricSpecification HostingAutoscalingPolicy #185

Closed
paytmci opened this issue Nov 10, 2021 · 6 comments · Fixed by #188
Labels
bug Something isn't working

Comments

@paytmci

paytmci commented Nov 10, 2021

What happened: The operator fails to reconcile a HostingAutoscalingPolicy (HAP) when a policyName other than the default is used, i.e., with a customizedMetricSpecification.

What you expected to happen: The operator reconciles the HAP by creating it, without error, on the SageMaker endpoint.

How to reproduce it (as minimally and precisely as possible):

  1. Create a HostingDeployment:
apiVersion: sagemaker.aws.amazon.com/v1
kind: HostingDeployment
metadata:
  name: my-model-endpoint
spec:
  region: ap-south-1
  endpointName: my-endpoint
  productionVariants:
    - variantName: AllTraffic
      modelName: my-model
      initialInstanceCount: 1
      instanceType: ml.c5.xlarge
      initialVariantWeight: 1
  models:
    - name: my-model
      executionRoleArn: arn:aws:iam::XXXXXXXX:role/sagemaker-role
      containers:
        - containerHostname: my-model
          modelDataUrl: s3://XXXX 
          image: XXXX # Removed
  2. Create a HostingAutoscalingPolicy using customizedMetricSpecification:
apiVersion: sagemaker.aws.amazon.com/v1
kind: HostingAutoscalingPolicy
metadata:
  name: my-model-endpoint-scaling-policy
spec:
    resourceId:
      - endpointName: my-model-endpoint
        variantName: AllTraffic
    region: ap-south-1
    policyName: CPU-ScalingPolicy
    policyType: TargetTrackingScaling
    minCapacity: 1
    maxCapacity: 3
    targetTrackingScalingPolicyConfiguration:
      targetValue: 80.0
      customizedMetricSpecification:
        metricName: CPUUtilization
        namespace: /aws/sagemaker/Endpoints
        dimensions:
           - name: EndpointName
             value: my-model-endpoint
           - name: VariantName
             value: AllTraffic
        statistic: Average
        unit: Percent

  3. The operator produces the following failure log:
2021-11-05T18:49:49.241Z        INFO    controllers.HostingAutoscalingPolicy    Getting resource        {"hostingautoscalingpolicy": "my-namespace/my-model-endpoint-scaling-policyl"}
2021-11-05T18:49:49.241Z        INFO    controllers.HostingAutoscalingPolicy    Loaded AWS config       {"hostingautoscalingpolicy": "my-namespace/my-model-endpoint-scaling-policyl"}
2021-11-05T18:49:49.488Z        INFO    controllers.HostingAutoscalingPolicy    Determined action for AutoscalingJob    {"hostingautoscalingpolicy": "my-namespace/my-model-endpoint-scaling-policyl", "action": "NeedsCreate"}
2021-11-05T18:49:49.918Z        INFO    controllers.HostingAutoscalingPolicy    Got an error while reconciling HostingAutoscalingPolicy, will retry     {"hostingautoscalingpolicy": "my-namespace/my-model-endpoint-scaling-policyl", "err": "Unable to apply HostingAutoscalingPolicy: hosting autoscaling policy was not applied, description is empty", "errVerbose": "hosting autoscaling policy was not applied, description is empty\nUnable to apply HostingAutoscalingPolicy\ngithub.com/aws/amazon-sagemaker-operator-for-k8s/controllers/hostingautoscalingpolicy.(*Reconciler).reconcileHostingAutoscalingPolicy\n\t/workspace/controllers/hostingautoscalingpolicy/hostingautoscalingpolicy_controller.go:192\ngithub.com/aws/amazon-sagemaker-operator-for-k8s/controllers/hostingautoscalingpolicy.(*Reconciler).Reconcile\n\t/workspace/controllers/hostingautoscalingpolicy/hostingautoscalingpolicy_controller.go:112\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.2/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.2/pkg/internal/controller/controller.go:209\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.2/pkg/internal/controller/controller.go:188\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.18.8/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.18.8/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.18.8/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/pkg/mod/k8s.io/apimachinery@v0.18.8/pkg/util/wait/wait.go:90\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357"}
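For context, Application Auto Scaling addresses a SageMaker endpoint variant with a resource ID of the form `endpoint/<endpoint-name>/variant/<variant-name>`, which is what the spec's `resourceId` fields map to. A minimal sketch of that mapping (the helper name is hypothetical, not the operator's actual code):

```python
def resource_id(endpoint_name: str, variant_name: str) -> str:
    """Build the Application Auto Scaling resource ID for a
    SageMaker endpoint variant: endpoint/<name>/variant/<variant>."""
    return f"endpoint/{endpoint_name}/variant/{variant_name}"

print(resource_id("my-model-endpoint", "AllTraffic"))
# → endpoint/my-model-endpoint/variant/AllTraffic
```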

Environment:

@mbaijal
Contributor

mbaijal commented Nov 10, 2021

@jgagnon1 Thanks for reporting the issue and also opening the PR to fix it. Are you seeing this error for every HAP resource you create? I have not seen that block of code prevent an HAP from being applied, so I will need to verify this on my side before I approve the PR. We are looking into it.

@jgagnon1
Contributor

@mbaijal As described in the initial issue, this only happens for HAPs with custom metrics using customizedMetricSpecification.

@mbaijal
Contributor

mbaijal commented Nov 16, 2021

Hi @jgagnon1,
We reviewed the code within the team and realized that the code block removed in PR #187 is actually required (I'll be sure to remove the TODO comment). It ensures that an empty description is not returned to the calling method, which in turn updates the current context. While we do think this logic can be refactored to be cleaner, it does not seem to be the actual cause of your issue.

To debug the problem, I applied the HAP using the spec file you provided but was unable to reproduce it as is; the HAP was applied successfully. So one of the following seems to be happening:

  1. I see the exact same error as you when the specified endpoint region is incorrect. Could you ensure that the endpoint exists in ap-south-1 and that all the endpoint-related fields in the YAML file are correct?

  2. For the HAP to work, the endpoint needs to be IN-SERVICE. The HAP controller checks the endpoint status and keeps reconciling until the endpoint is ready. In this case, though, I would expect the HAP controller to throw an appropriate error instead of the generic one we see. Could you please verify that this is not the issue either?

  3. If neither of the above applies, my hunch is that it could be a policy-creation latency issue: the check at line #440 runs before the policy is visible, leading to the error. I do not see this on my side, so we would need more details about your setup.
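The latency hypothesis above can be sketched as a retry loop: if a describe call briefly lags the create call, an immediate read can return an empty description even though the policy was applied, and retrying distinguishes latency from a real failure. A minimal sketch, assuming a caller-supplied `describe` function (the names here are hypothetical stand-ins, not the operator's actual code):

```python
import time

def describe_with_retry(describe, attempts=5, delay=1.0):
    """Retry a describe call that may briefly return an empty result
    right after the policy is created (eventual consistency)."""
    for i in range(attempts):
        policies = describe()
        if policies:  # non-empty description: the policy is visible
            return policies
        if i < attempts - 1:
            time.sleep(delay)
    raise RuntimeError(
        "hosting autoscaling policy was not applied, description is empty")
```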

In any of these cases, I would like to work with you to root-cause and fix the problem, so please let me know once you can confirm that the spec is correct and that neither of the first two cases applies to you.

Thanks!

@jgagnon1
Contributor

jgagnon1 commented Nov 16, 2021

Hi @mbaijal

First of all, thanks for your assistance, and sorry about the confusion on this one; details below.

I went ahead and did some validation and reviewed all my manifests, and it actually seems the issue was not the TODO block you indicated but the policy name I used from the sample. The fix is included in PR #188.

As for point number 2, I tested two ways:

  1. Applying the Deployment manifest first, waiting for the status to be IN-SERVICE, and then applying the HAP.
  2. Applying both at the same time (in the same manifest file).

As expected, both mechanisms work; however, in case 2 I am seeing the original error message, which threw me off in the first place. I would propose we improve the messaging a bit to make it clearer; what do you think?
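The first mechanism above, waiting for IN-SERVICE before applying the HAP, amounts to polling the endpoint status. A minimal sketch with a hypothetical status getter standing in for kubectl or the SageMaker API:

```python
import time

def wait_for_in_service(get_status, timeout=600, poll=10):
    """Poll the endpoint status until it reaches InService, mirroring
    what the HAP controller waits for before applying the policy."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status == "InService":
            return True
        if status == "Failed":
            raise RuntimeError("endpoint creation failed")
        time.sleep(poll)
    return False  # timed out before the endpoint became ready
```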

In any case, I think #188 should be good to merge; then feel free to close this issue, or keep it open to improve the logging message if you intend to.

Thanks !

@mbaijal
Contributor

mbaijal commented Nov 16, 2021

Thanks for investigating this further. The issue was automatically closed when I merged PR #188. Let me open a new one to track the log change.

@mbaijal
Contributor

mbaijal commented Nov 16, 2021

@jgagnon1 I opened a new issue to track this here - #189.

Do feel free to open more issues if you run into any more errors or have suggestions for enhancements.
