Actor autoscaling with Keda doesn't work as expected #4768

akkie · 2022-06-14T14:24:34Z

In what area(s)?

/area runtime

What version of Dapr?

1.7.2

Expected Behavior

If I use actor scaling, then actors should not fail.

Actual Behavior

Actor state API returns error "actor instance is missing" if actor scaling is enabled.

Detailed description about our findings

We have an actor that executes a process made up of multiple tasks. After each task we call the state store API to persist the current process state. We do this because we use a journal pattern, which lets us retrigger the process without re-executing steps that already succeeded. As a result, we call the state store many times during the actor's lifetime.

For this actor we have enabled actor scaling with Keda. During load testing we saw many "actor instance is missing" errors. Looking through the code, we found that this error is returned only by the actor state API, when the actor cannot be found in the actor table.
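The journal pattern described above can be sketched roughly as follows. This is a minimal illustration, not our actual actor code: the `journal`, `step`, and `runProcess` names are hypothetical, and an in-memory map stands in for the actor state store.

```go
package main

import (
	"errors"
	"fmt"
)

// errStepFailed is a hypothetical error used to simulate a failing task.
var errStepFailed = errors.New("step failed")

// journal records which steps have completed. In the setup described in this
// issue it would be persisted through the actor state API; here an in-memory
// map stands in for the state store (an assumption for illustration).
type journal struct {
	done map[string]bool
}

// step pairs a name with the work to perform; both are hypothetical.
type step struct {
	name string
	run  func() error
}

// runProcess executes steps in order, skips any step already journaled, and
// records each success before moving on. It returns the names of the steps
// it actually executed in this run.
func runProcess(j *journal, steps []step) ([]string, error) {
	var executed []string
	for _, s := range steps {
		if j.done[s.name] {
			continue // already successful in an earlier run
		}
		if err := s.run(); err != nil {
			return executed, fmt.Errorf("step %q failed: %w", s.name, err)
		}
		j.done[s.name] = true // in the real system: one state store call per step
		executed = append(executed, s.name)
	}
	return executed, nil
}

func main() {
	j := &journal{done: map[string]bool{}}
	failOnce := true
	steps := []step{
		{"reserve", func() error { return nil }},
		{"charge", func() error {
			if failOnce {
				failOnce = false
				return errStepFailed
			}
			return nil
		}},
		{"notify", func() error { return nil }},
	}

	first, err := runProcess(j, steps) // stops at "charge"
	fmt.Println(first, err)
	second, err := runProcess(j, steps) // resumes at "charge", skips "reserve"
	fmt.Println(second, err)
}
```

Because every successful step triggers a state write, a process with many steps makes many state API calls, which is why the error shows up so often under load.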

Store state:

dapr/pkg/http/api.go

Line 1489 in bbc1abc

    
           msg := NewErrorResponse("ERR_ACTOR_INSTANCE_MISSING", messages.ErrActorInstanceMissing)

Get state:

dapr/pkg/http/api.go

Line 1648 in bbc1abc

    
           msg := NewErrorResponse("ERR_ACTOR_INSTANCE_MISSING", messages.ErrActorInstanceMissing)

This means that a running actor tries to store its state, and the state API responds that the actor is not in the table. This should not normally happen, so we searched for places where the actor is deleted from the table. We found two places:

Actor deactivation:

dapr/pkg/actors/actors.go

Line 278 in bbc1abc

a.actorsTable.Delete(actorKey)

Actor rebalancing:

dapr/pkg/actors/actors.go

Line 653 in bbc1abc

a.actorsTable.Delete(key)

For rebalancing, we found the drainRebalancedActors configuration and tried disabling it, but that did not help: the issue still occurs. We then disabled automatic scaling by setting minReplicaCount and maxReplicaCount to the same number. When we tested again, the issue was gone.
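To make the suspected failure mode concrete, here is a toy model of it. It is a sketch under assumptions, not Dapr's code: the `sidecar` type, its methods, and the key format are hypothetical, and we assume the actor table behaves like a `sync.Map` because of the `Delete` calls quoted above.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errActorInstanceMissing stands in for the error returned by the state API.
var errActorInstanceMissing = errors.New("actor instance is missing")

// sidecar is a toy stand-in for the runtime on one host. It only tracks
// which actors are currently active locally.
type sidecar struct {
	actorsTable sync.Map // assumed table type; keys are hypothetical
}

// activate registers an actor in the local table.
func (s *sidecar) activate(key string) { s.actorsTable.Store(key, struct{}{}) }

// rebalance removes an actor from the local table, as happens when the
// actor is moved to another host during scaling.
func (s *sidecar) rebalance(key string) { s.actorsTable.Delete(key) }

// saveState mimics the check in the state API: the call is rejected when
// the actor is not present in the local table.
func (s *sidecar) saveState(key string) error {
	if _, ok := s.actorsTable.Load(key); !ok {
		return errActorInstanceMissing
	}
	return nil
}

func main() {
	s := &sidecar{}
	key := "ProcessActor/42" // hypothetical actor key

	s.activate(key)
	fmt.Println("before rebalancing:", s.saveState(key))

	// The actor method is still running, but scaling removes the table entry.
	s.rebalance(key)
	fmt.Println("after rebalancing:", s.saveState(key))
}
```

In this model the actor's own code has not failed. It only loses its entry in the table between two state calls, which matches what we see when the replica count changes.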

Steps to Reproduce the Problem

Enable actor scaling as described in https://docs.dapr.io/developing-applications/integrations/autoscale-keda/
Create an actor that stores its state through the actor state API
Put the system under load so that the actor will scale

Release Note

RELEASE NOTE: FIX "actor instance is missing" error during actor scaling


akkie · 2022-06-14T14:26:15Z

@fabistb Please add any missing information

yaron2 · 2022-06-14T14:59:09Z

> @fabistb Please add any missing information

It's normal for an actor to not be present for a short period of time as it gets rebalanced, and this should be a retriable error that resolves once the tables are updated. Can you confirm that this error occurs after retries that happen after scaling has finished?

akkie · 2022-06-14T15:22:04Z

It's gone after our retry mechanism kicks in. Do I understand correctly that Dapr will also retry automatically after rebalancing?

yaron2 · 2022-06-14T16:35:18Z

Dapr will only retry calls if the error is a transient network error or an authentication error from the target sidecar. It will not retry if it's a missing actor error, so retrying here is up to your app.
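An app-side retry of this kind could look roughly like the sketch below. It makes no claim about how any Dapr SDK surfaces the error: `errActorInstanceMissing` and `retryOnMissingActor` are hypothetical names, and the simulated `saveState` stands in for a real call to the actor state API.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errActorInstanceMissing is a hypothetical sentinel standing in for an
// ERR_ACTOR_INSTANCE_MISSING response from the actor state API.
var errActorInstanceMissing = errors.New("ERR_ACTOR_INSTANCE_MISSING")

// retryOnMissingActor calls op up to maxAttempts times. It retries only when
// op reports a missing actor instance and doubles the delay after each retry.
// Any other error is returned immediately.
func retryOnMissingActor(maxAttempts int, delay time.Duration, op func() error) error {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = op()
		if err == nil || !errors.Is(err, errActorInstanceMissing) {
			return err // success, or an error that should not be retried
		}
		if attempt < maxAttempts {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("actor still missing after %d attempts: %w", maxAttempts, err)
}

func main() {
	calls := 0
	// Simulated state save: fails twice while the actor is being rebalanced.
	saveState := func() error {
		calls++
		if calls < 3 {
			return errActorInstanceMissing
		}
		return nil
	}

	err := retryOnMissingActor(5, 10*time.Millisecond, saveState)
	fmt.Println("attempts:", calls, "error:", err)
}
```

Retrying only on the missing-actor error keeps other failures visible, and the attempt limit prevents the caller from looping forever if the tables never settle.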

artursouza · 2022-06-15T00:21:43Z

Can this be done via the resiliency feature? /cc @halspang

halspang · 2022-06-15T17:43:09Z

In this case, resiliency doesn't cover it. Resiliency for actor state operations is handled at the component level. We could always bump it up a little bit to add retries around actor discovery if we think something like this is likely to be transient.

yaron2 · 2022-06-17T06:46:26Z

> In this case, resiliency doesn't cover it. Resiliency for actor state operations is handled at the component level. We could always bump it up a little bit to add retries around actor discovery if we think something like this is likely to be transient.

Actor instances not found can be treated as transient IMO.

yaron2 · 2022-06-20T16:32:33Z

@akkie @fabistb This will be handled for 1.8.

fabistb · 2022-06-28T12:51:49Z

@halspang , @artursouza , @yaron2 , fyi.

Thank you very much for looking into this.

We tried Dapr 1.8.0-rc.3 and 1.8.0-rc.4 with our tests, and unfortunately we can still reproduce the "actor instance is missing" error.

Can we provide logs or anything similar to help get this sorted out?

akkie added the kind/bug Something isn't working label Jun 14, 2022

artursouza added P1 pinned labels Jun 15, 2022

artursouza added P0 and removed P1 labels Jun 20, 2022

artursouza added this to the v1.8 milestone Jun 20, 2022

halspang mentioned this issue Jun 20, 2022

Add built-in retry for finding actor in placement #4803

Merged

7 tasks

artursouza closed this as completed in #4803 Jun 20, 2022

artursouza assigned halspang Jun 22, 2022

halspang mentioned this issue Jun 27, 2022

Fix resiliency for .NET Actor invocation #4838

Merged

6 tasks
