Actor autoscaling with Keda doesn't work as expected #4768

Closed
akkie opened this issue Jun 14, 2022 · 9 comments · Fixed by #4803
Labels: kind/bug (Something isn't working), P0, pinned

Comments

@akkie
Contributor

akkie commented Jun 14, 2022

In what area(s)?

/area runtime

What version of Dapr?

1.7.2

Expected Behavior

When actor scaling is enabled, calls made by running actors should not fail.

Actual Behavior

The actor state API returns the error "actor instance is missing" when actor scaling is enabled.

Detailed description about our findings

We have an actor that executes a process consisting of multiple tasks. After each task we call the state store API to persist the current process state. We do this because we use a journal pattern, which lets us retrigger the process without re-executing steps that have already completed successfully. As a result, we call the state store many times during the actor's lifetime.

For this actor we have enabled actor scaling with KEDA. During load testing we saw a lot of "actor instance is missing" errors. After looking through the code, we found that this error occurs only in the actor state API, when the actor cannot be found in the actor table.

Store state:

msg := NewErrorResponse("ERR_ACTOR_INSTANCE_MISSING", messages.ErrActorInstanceMissing)

Get state:
msg := NewErrorResponse("ERR_ACTOR_INSTANCE_MISSING", messages.ErrActorInstanceMissing)

This means that a running actor tries to store its state, and the state API reports that the actor is not in the table. This should not normally happen, so we searched for places where an actor is deleted from the table. We found two:

Actor deactivation:

a.actorsTable.Delete(actorKey)

Actor rebalancing:
a.actorsTable.Delete(key)

For rebalancing, we found the drainRebalancedActors configuration option and tried disabling it, but that did not help: the issue still occurred. We then disabled automatic scaling by setting minReplicaCount and maxReplicaCount to the same number. When we tested again, the issue was gone.

Steps to Reproduce the Problem

Release Note

RELEASE NOTE: FIX Actor instance missing error during scaling

@akkie akkie added the kind/bug Something isn't working label Jun 14, 2022
@akkie
Contributor Author

akkie commented Jun 14, 2022

@fabistb Please add any missing information

@yaron2
Member

yaron2 commented Jun 14, 2022

> @fabistb Please add any missing information

It's normal for an actor to not be present for a short period of time as it gets rebalanced, and this should be a retriable error that resolves once the tables are updated. Can you confirm that this error occurs after retries that happen after scaling has finished?

@akkie
Contributor Author

akkie commented Jun 14, 2022

The error is gone once our retry mechanism kicks in. Do I understand correctly that Dapr will also retry automatically after rebalancing?

@yaron2
Member

yaron2 commented Jun 14, 2022

Dapr will only retry calls if the error is a transient network error or an authentication error from the target sidecar. It will not retry if it's a missing actor error, so retrying here is up to your app.

@artursouza
Member

Can this be done via the resiliency feature? /cc @halspang

@halspang
Contributor

In this case, resiliency doesn't cover it. Resiliency for actor state operations is handled at the component level. We could always bump it up a little bit to add retries around actor discovery if we think something like this is likely to be transient.

@yaron2
Member

yaron2 commented Jun 17, 2022

> In this case, resiliency doesn't cover it. Resiliency for actor state operations is handled at the component level. We could always bump it up a little bit to add retries around actor discovery if we think something like this is likely to be transient.

Actor instances not found can be treated as transient IMO.

@artursouza artursouza added P0 and removed P1 labels Jun 20, 2022
@artursouza artursouza added this to the v1.8 milestone Jun 20, 2022
@yaron2
Member

yaron2 commented Jun 20, 2022

@akkie @fabistb This will be handled for 1.8.

@fabistb

fabistb commented Jun 28, 2022

@halspang, @artursouza, @yaron2, FYI.

Thank you very much for looking into this.

We tested with Dapr 1.8.0-rc.3 and 1.8.0-rc.4, and unfortunately we can still reproduce the "actor instance is missing" error.

Can we provide logs or anything else to help get this sorted out?
