Client stuck 1 day behind, getting fewer records than requested, can't catch up, no CPU constraint #1050

Open
tayl opened this issue Feb 22, 2023 · 4 comments

tayl commented Feb 22, 2023

Hello, we have an instance of a Kinesis consumer in one of our customers' environments that is "stuck", but not in a typical way (typical to me, anyway). The consumer is requesting and processing records as it usually does, but the records coming back have fallen further and further behind, until they were exactly one day behind. Our iterator age for this shard hovers right around 86,100,000 ms. If that were our data retention period, that would make sense to me, but it isn't - this stream is set to 3 days. Additionally, the consumer is not burdened in any way; it has an abundance of CPU and memory available.

I think the key to solving this is that the consumer is requesting more records than it's getting back. If it were up to date, I'd understand that, as it would get only what is new and nothing more. However, that's not the case - it's as if the shard believes now() - 24 hours is real time. Additionally, CloudWatch is showing no read-throughput-exceeded errors.
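
To make that concrete, here's a minimal sketch of the kind of polling loop I'm describing. It's boto3/Python for illustration only, not our actual AWSSDK.Kinesis (.NET) code, and the stream and shard names are placeholders:

```python
import time
import boto3

kinesis = boto3.client("kinesis")

# TRIM_HORIZON is only for illustration; the real consumer resumes from a checkpoint.
iterator = kinesis.get_shard_iterator(
    StreamName="customer-stream",            # placeholder name
    ShardId="shardId-000000000007",          # the affected shard (placeholder ID)
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

while iterator:
    resp = kinesis.get_records(ShardIterator=iterator, Limit=10000)
    # On a healthy shard, MillisBehindLatest drops toward 0 as the backlog is consumed.
    # On the stuck shard, far fewer records than the Limit come back while
    # MillisBehindLatest stays pinned near 86,100,000 (~24 hours).
    print(len(resp["Records"]), resp["MillisBehindLatest"])
    iterator = resp.get("NextShardIterator")
    time.sleep(1)  # stay under the 5 GetRecords calls/sec per-shard limit
```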

I've confirmed, using the data viewer in the Kinesis web console, that there is data between now and 24 hours ago.

That's a long-winded way of saying this consumer seems to have time-traveled to 24 hours ago, and from that frame of reference it's processing data in real time. The other 31 consumers running on this 32-shard stream are all doing fine. It's just this one that is confused.

What could cause this? Any more info I can provide to help diagnose?

stair-aws (Contributor) commented Feb 22, 2023

> Any more info I can provide to help diagnose?

Hi!

  1. Which version of KCL are you using?
  2. How certain are you that the records are not skewed towards a "hot" partition?

> It has an abundance of CPU and memory available.

Obvious, friendly call-out: just because a host has resources does not ensure those resources are accessible to the process (e.g., JVM). Since the other consumers are self-reported as fine, I'll assume this isn't a factor but it might be worth a peek.

Off-the-cuff suggestions before diving deep:

  • Have you tried "turning it off and back on again"? (Cringe-worthy, but sometimes effective.) That is, can you replace the host on which this consumer is executing to potentially rule out any hardware issues?
  • Have you attempted to update the shard count to see whether that alleviates the issue? (A minimal resharding call is sketched just below.)
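
If you do try resharding, the call itself is small. This is a hedged boto3 sketch with a placeholder stream name, not your .NET code; doubling with UNIFORM_SCALING gives every consumer fresh child shards to read from:

```python
import boto3

kinesis = boto3.client("kinesis")

# Double the shard count; UNIFORM_SCALING keeps the hash-key ranges evenly sized.
kinesis.update_shard_count(
    StreamName="customer-stream",   # placeholder name
    TargetShardCount=64,            # the thread mentions 32 shards today
    ScalingType="UNIFORM_SCALING",
)
```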

tayl commented Feb 22, 2023

Hi stair, thanks for getting back to me.

  1. We're using AWSSDK.Kinesis 3.7.1.51.
  2. The algorithm we use to steer records results in a roughly normal distribution. Some shards have more data coming into them than others, but the range is pretty insignificant.

These consumers run in dedicated ECS tasks with no other processes alongside them, so at least according to the ECS-reported metrics they are under-utilized: ~10% CPU utilization and ~5% memory, with the min and max of each within a few points of the average.

We've restarted many times. While the consumers are down, data piles up in the Kinesis shards. When the consumers come back, all of them except the affected one burn through their backlog quickly. The one stuck at ~86M ms accumulates another ~500K ms of iterator age during the downtime; after the restart it burns through that 500K ms and settles right back around 86M.
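
For completeness, this is roughly how we watch the per-shard iterator age. It's a boto3 sketch with placeholder names, and it assumes enhanced (shard-level) monitoring is enabled on the stream:

```python
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

# Shard-level IteratorAgeMilliseconds is only published when enhanced
# shard-level monitoring is enabled on the stream.
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/Kinesis",
    MetricName="GetRecords.IteratorAgeMilliseconds",
    Dimensions=[
        {"Name": "StreamName", "Value": "customer-stream"},     # placeholder
        {"Name": "ShardId", "Value": "shardId-000000000007"},   # placeholder
    ],
    StartTime=datetime.utcnow() - timedelta(hours=12),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Maximum"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    # Healthy shards hover near 0; the affected one spikes during downtime
    # and then settles right back around 86,100,000 ms.
    print(point["Timestamp"].isoformat(), int(point["Maximum"]))
```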

Again, these are ECS tasks, so the hardware is abstracted away, but I assume it's a different machine each time.

Is there a benefit to upping the shard count if the shards are under-utilized and not showing any failures or throttling?

Thanks

tayl commented Feb 22, 2023

To rule it out, we temporarily doubled the CPU resources of the ECS tasks and saw no change. Again, the iterator age of the affected shard climbed above 86M during the downtime and then quickly settled back to 86M once the consumer was running again.

tayl commented Feb 23, 2023

Sorry for the back-to-back-to-back posts, just throwing in more info. Another consumer/shard combo has jumped up to that 86.1M ms iterator age and is stuck there. We've looked and see nothing in our application that should be producing that number, or 1 day, or 24 hours. Additionally, our retention period is 3 days, not 1, so anything near a day's worth of milliseconds of iterator age is unusual. The fact that two consumers that are decoupled (other than processing the same Kinesis stream) are suddenly doing this makes me think it's Kinesis-related?
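
For what it's worth, here's how we double-checked the retention setting (again a boto3 sketch with a placeholder stream name):

```python
import boto3

kinesis = boto3.client("kinesis")

# Confirm retention really is 3 days, since the iterator age is pinned
# suspiciously close to exactly 24 hours.
summary = kinesis.describe_stream_summary(StreamName="customer-stream")
print(summary["StreamDescriptionSummary"]["RetentionPeriodHours"])  # expected: 72
```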
