Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dynamodb-streams-kinesis-adapter broken on newest kcl 1.14.0 version #40

Closed
gjesse opened this issue Oct 26, 2020 · 12 comments
Closed

dynamodb-streams-kinesis-adapter broken on newest kcl 1.14.0 version #40

gjesse opened this issue Oct 26, 2020 · 12 comments

Comments

@gjesse
Copy link

gjesse commented Oct 26, 2020

See awslabs/amazon-kinesis-client#746 for more background:

Hello - yesterday I upgraded to 1.14.0 kcl client for our application that uses dynamodb streams for processing. Since then I've noticed these very consistent errors. we've seen 10s of thousands of these in just a few hours, and repeated for the same shard ids.

ERROR [2020-10-20 17:07:45,185] [RecordProcessor-0015] c.a.s.k.c.lib.worker.ProcessTask: ShardId shardId-00000001603151190143-090943ac: Caught exception: 
com.amazonaws.SdkClientException: Shard shardId-00000001603151190143-090943ac: GetRecordsResult is not valid. NextShardIterator: null. ChildShards: []
	at com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisDataFetcher$AdvancingResult.accept(KinesisDataFetcher.java:126)
	at com.amazonaws.services.kinesis.clientlibrary.lib.worker.SynchronousGetRecordsRetrievalStrategy.getRecords(SynchronousGetRecordsRetrievalStrategy.java:31)
	at com.amazonaws.services.kinesis.clientlibrary.lib.worker.BlockingGetRecordsCache.getNextResult(BlockingGetRecordsCache.java:50)
	at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask.getRecordsResultAndRecordMillisBehindLatest(ProcessTask.java:377)
	at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask.getRecordsResult(ProcessTask.java:342)
	at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask.call(ProcessTask.java:159)
	at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:49)
	at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:24)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

As best I can tell, a GetRecordsResult with a null NextShardIterator and no child shards is a valid response - in fact there is no field specified for child shards at all here:
docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_streams_GetRecords.html

Using kcl 1.14.0, and creating a worker using dynamodb-streams-kinesis-adapter 1.5.2. The worker is setup using this method: https://github.com/awslabs/dynamodb-streams-kinesis-adapter/blob/master/src/main/java/com/amazonaws/services/dynamodbv2/streamsadapter/StreamsWorkerFactory.java#L44

I am not setting any special configuration other than the following, which I believe shouldn't be relevant.

 KinesisClientLibConfiguration config =  new KinesisClientLibConfiguration(applicationName,
                                                                                  streamName,
                                                                                  credentialsProvider,
                                                                                  workerId)
                        .withInitialPositionInStream(initialPositionInStream)
                        .withMaxRecords(kclConfiguration.getMaxRecordsToFetch())
                        .withTaskBackoffTimeMillis(kclConfiguration.getBackOffTimeMillis())
                        .withIdleTimeBetweenReadsInMillis(kclConfiguration.getIdleTimeBetweenReads());
@barryoneill
Copy link

Also hitting this problem, and a downgrade for us is not trivial at this point :(

@barryoneill
Copy link

@gjesse
Copy link
Author

gjesse commented Dec 17, 2020

It's been feeling pretty dead for a long time. I wish AWS would either deprecate it officially or support it. We are looking at transitioning to kinesis.

@crossroad0201
Copy link

I had the same error with KCL 1.14.3 and Kinesis Adapter 1.5.3.
The DynamoDB Streams shard is said to be automatically recreated every few hours (about 4 hours), and this error starts to occur at that timing.

Cause

After reading the source code, it seems that the cause is that the Kinesis Adapter does not meet the specifications of the #isValidResult() check process that was added to KinesisDataFetcher in KCL 1.14.0.

KCL's KinesisDataFetcher#isValidResult() has the following comment.

// GetRecords result should contain childShard information. There are two valid combinations for the nextShardIterator and childShards
// If the GetRecords call does not reach the shard end, getRecords result should contain a non-null nextShardIterator and an empty list of childShards.
// If the GetRecords call does not reach the shard end, getRecords result should contain a null nextShardIterator and a non-empty list of childShards.
// All other combinations are invalid and indicating an issue with GetRecords result from Kinesis service.

However, the Kinesis Adapter's AmazonDynamoDBStreamsAdapterClient#getRecords() implementation, it seems that we are not getting the child shards in the first place.
Therefore, it does not meet the specification that KCL assumes, "If nextShardIterator is null, then childShards must be a non-empty list.

First aid

At first, this error could be avoided by using KCL 1.13.x before the checking process was implemented.

However, as it is, DynamoDB Streams consumer applications will have to continue using the old version of KCL.

Idea for supports

I have come up with some ideas for a radical solution.

Idea 1. Make Kinesis Adapter get the child shards.

Currently, AmazonDynamoDBStreamsAdapterClient#getRecords() in Kinesis Adapter doesn't get the child shard information, but we can fix it to get it and set it in GetRecordsResult and return it.

I tried to fix it, but to get the child shards in the DynamoDB Streams API, use describeStream to get a list of shards, and then find the child shards of the shard in question.
However, since only a ShardIterator is passed as an argument to getRecords(), the ID of the parent shard is not known, and it is not possible to determine which shard is the target child shard from the list of shards.

Therefore, we think this idea is not feasible.

Idea 2. Don't check with KCL

When KCL is used in combination with Kinesis Adapter, it should not be checked by KinesisDataFetcher#isValidResult().

In this case, we can't make KCL depend on Kinesis Adapter, so how do we determine "used in combination with Kinesis Adapter"?

I think this can be solved by passing a flag to the constructor of KinesisDataFetcher that specifies whether or not to perform the check. (This flag can be specified in the KCL configuration.)

Do you have any other ideas?

@gjesse
Copy link
Author

gjesse commented Oct 18, 2021

Almost happy birthday for this issue, with zero comments from the maintainers..

@dacevedo12
Copy link

dacevedo12 commented Sep 27, 2022

@gguptp gguptp closed this as completed Jan 23, 2023
@MihaiBogdanEugen
Copy link

@gguptp was this fixed or why is this issue closed now?

@gguptp
Copy link
Contributor

gguptp commented Feb 20, 2023

We have released the newest dynamodb-streams-kinesis-adapter version 1.6.0, which is compatible with KCL 1.14.9 version

@MihaiBogdanEugen
Copy link

Thanks for the update!

@mlanglet
Copy link

@gguptp we've made the migration to dynamodb-streams-kinesis-adapter version 1.6.0 using KCL 1.14.9. However, we're still seeing the ERROR level logs about "GetRecordsResult is not valid" from com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask. But the stream processing does seem to be working. Can we suppress this error log with good conscience?

@gguptp
Copy link
Contributor

gguptp commented Apr 19, 2023

Please make sure StreamsWorkerFactory is getting used to initialize KCL worker

@mlanglet
Copy link

Ah yes, thanks for support @gguptp!! We were already on 1.5.3 but we were not creating our workers with StreamsWorkerFactory yet and I didn't read the older release notes. Now it's running just fine on 1.6.0 with 1.14.9! We're seeing a slight increase in CPU usage but also lower latency. All good! 🙏🏻

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants