Worker lost leases on shards after error #136
@tyger we're experiencing a similar issue: no other worker tried to steal the lease when one instance went down (due to auto-scaling). Have you found the reason behind this?
We're also experiencing this issue. We have to restart all of our consumers periodically to ensure we catch any of these "dropped shards".
We just recently started experiencing this behavior as well.
@tyger this is what I gather from your comments and logs: it seems your workers were restarted. On restart, the KCL does not get back the leases it previously held; it tries to steal leases instead. Leases don't tell you whether a worker is alive or not; they have an expiration time, after which they can be stolen. When machine 2 restarted, the leases it previously held may not have expired/timed out yet, which is why the log message shows 0 available leases. During the reboot of the first machine, all the leases expired and machine 2 was able to acquire them. When machine 1 came back up, it saw that 5 leases were present and machine 2 held all of them. To balance the load, it would have tried to steal 2 leases. If you could provide some more logs, I could give you a more concrete answer.
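The expiry rule described above can be sketched as follows. This is a minimal illustration, not the actual KCL implementation; the class, field, and parameter names (`Lease`, `lastRenewalMillis`, `failoverTimeMillis`) are chosen for clarity and only the last corresponds to a real KCL configuration knob.

```java
// Hypothetical sketch of the lease-expiry rule: a lease carries no liveness
// flag for its owner; it only becomes stealable once the owner has failed to
// renew it within the failover time.
class Lease {
    final String owner;
    final long lastRenewalMillis;

    Lease(String owner, long lastRenewalMillis) {
        this.owner = owner;
        this.lastRenewalMillis = lastRenewalMillis;
    }

    // A lease counts as "available" to other workers only after it times out.
    boolean isExpired(long nowMillis, long failoverTimeMillis) {
        return nowMillis - lastRenewalMillis > failoverTimeMillis;
    }
}

public class LeaseExpiryDemo {
    public static void main(String[] args) {
        long failoverTimeMillis = 10_000; // assumed failover time
        long now = 100_000;
        // Machine 2 restarted 3s ago: its old leases have NOT expired yet,
        // so the freshly started worker sees "0 available leases".
        Lease heldByRestartedWorker = new Lease("machine2-old-worker", now - 3_000);
        // Machine 1 has been down for a minute: its lease is long expired.
        Lease longDead = new Lease("machine1-old-worker", now - 60_000);
        System.out.println(heldByRestartedWorker.isExpired(now, failoverTimeMillis)); // false
        System.out.println(longDead.isExpired(now, failoverTimeMillis));              // true
    }
}
```

This is why a restarted worker initially idles: it must wait out the failover time on its own previous leases before it can reclaim them.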
Just recently we encountered strange behaviour of KCL in our system.
Our configuration:
In short: we encountered a short network blip or something similar which caused a timeout for some leases; they were lost, and neither of the two workers picked them up again.
Before the error, the first worker had been processing one shard and the second worker four shards.
Then the error happened (repeating stack traces have been skipped); as can be seen from the first three records, the second worker lost the leases for three shards.
Then the system kept working with the following assignments:
So, obviously, the second worker lost the leases for three shards and nobody was processing them.
After we noticed this, the second machine was restarted:
What looks strange in this listing is this line:
c.a.s.k.l.impl.LeaseTaker - Worker 97998ed6-183e-4225-9a76-dc1a31e081b6 saw 5 total leases, 0 available leases, 3 workers. Target is 2 leases, I have 0 leases, I will take 1 leases
after the restart of the second machine, as if there were no available leases at all, even though at that moment one lease should have just been freed by the restarting machine and three should have been free for a long time already. After that, the system settled into the next assignment (the first machine still processed one shard, the second machine with the new worker processed three shards, and shard number 1 remained abandoned):
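The arithmetic in the quoted LeaseTaker line can be reproduced with a short sketch. This is illustrative only: the method names are not the actual KCL API, and it assumes the target is the ceiling of total leases over workers and that, when no leases are available, at most one lease is stolen per cycle (the KCL default for `maxLeasesToStealAtOneTime`).

```java
public class LeaseTargetDemo {
    // Each worker aims to own ceil(totalLeases / numWorkers) leases.
    static int targetLeases(int totalLeases, int numWorkers) {
        return (totalLeases + numWorkers - 1) / numWorkers;
    }

    // Take freely from the available pool first; if nothing is available,
    // steal at most maxToSteal leases per cycle.
    static int leasesToTake(int target, int owned, int available, int maxToSteal) {
        int needed = target - owned;
        if (needed <= 0) {
            return 0;
        }
        return available > 0 ? Math.min(needed, available) : Math.min(needed, maxToSteal);
    }

    public static void main(String[] args) {
        // "saw 5 total leases, 0 available leases, 3 workers.
        //  Target is 2 leases, I have 0 leases, I will take 1 leases"
        int target = targetLeases(5, 3);          // ceil(5/3) = 2
        int take = leasesToTake(target, 0, 0, 1); // 0 available, so steal 1
        System.out.println(target + " " + take);  // prints "2 1"
    }
}
```

Under these assumptions the log line is self-consistent: the new worker's target is 2, but with 0 available leases it can only steal 1 per cycle, which explains why it converged slowly rather than reclaiming everything at once.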
Then we decided to restart the first machine.
This time the second machine took all the leases, and only at this point were all shards being processed.
Then the first machine started and got the lease for shard number 4: