
Fix to prevent data loss and stuck shards in the event of failed records delivery in Polling readers #603

Merged
merged 3 commits into awslabs:master on Sep 3, 2019

Conversation

ashwing
Contributor

@ashwing ashwing commented Aug 28, 2019

Fix to prevent data loss and stuck shards in the event of failed records delivery in Polling readers

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@ashwing ashwing changed the title from "Fix to prevent data loss and stuck shards in the event of failed records delivery" to "Fix to prevent data loss and stuck shards in the event of failed records delivery in Polling readers" on Aug 28, 2019
return result == null ? result : result.prepareForPublish();
}

RecordsRetrieved evictNextResult() {

nit: pollNextResult? just to be in sync with Queue API?

Contributor Author

Hmmm no strong preference. Will change

Contributor Author

Updated to pollNextResultAndUpdatePrefetchCounters

private Instant lastEventDeliveryTime = Instant.EPOCH;
// This flag controls who should drain the next request in the prefetch queue.
// When set to false, the publisher and demand-notifier thread would have the control.
// When set to true, the event-notifier thread would have the control.

Better if we can define what each thread is doing. Maybe we can add a class comment for that.

Contributor Author

Okay

// This flag controls who should drain the next request in the prefetch queue.
// When set to false, the publisher and demand-notifier thread would have the control.
// When set to true, the event-notifier thread would have the control.
private AtomicBoolean shouldDrainEventOnAck = new AtomicBoolean(false);

shouldDrainEventOnlyOnAck?

It looks like the reason for shards getting stuck in the previous semaphore solution was that we were not releasing the semaphore when the health checker calls restartFrom. If we do that, shouldn't it solve the entire problem of shards getting stuck forever?

This solution fixes the above problem by handing control over to the publisher thread when restartFrom is called. But what do we gain by letting the ack-notifier thread drain the queue?

Contributor Author

  1. The publisher thread blocked on the semaphore would already be holding the read lock. To release the semaphore, the reset thread would need to acquire the read lock as well. Even if the reset thread acquires the read lock and releases the semaphore, we would still need to stop the publisher thread from publishing further until the reset thread acquires the write lock.
  2. Also, if a stale notification shows up after the queue is cleared in the reset (say after 60 seconds), it would go ahead and release the semaphore, which means the publisher could publish two events without receiving the ack for the first one, causing durability issues.

Considering these factors among others, we decided it would be better to validate the ack before publishing the next event, which prevents the durability issue. Also, to ensure that an event is scheduled as soon as possible when there are prefetched events, and to avoid any blocking calls, we introduced the concept of scheduling from different threads and exchanging control based on the queue state and demand.
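Editor's note: the following is a minimal, hypothetical Java sketch of the "validate the ack before publishing the next event" idea described above. The class name, the RecordsRetrieved shape, and the batchUniqueIdentifier() accessor are assumptions for illustration only, not the actual KCL publisher code.

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: the next event is published only after the ack is matched against
// the head of the prefetch queue, so a stale ack can never trigger a second in-flight delivery.
class AckValidatingDrainSketch {
    interface RecordsRetrieved {
        String batchUniqueIdentifier();   // assumed accessor, for illustration
    }

    private final Queue<RecordsRetrieved> prefetchQueue = new ConcurrentLinkedQueue<>();
    private final AtomicLong requestedResponses = new AtomicLong(0);

    // Runs on the ack-notifier thread.
    synchronized void notifyOfAck(RecordsRetrieved ack) {
        RecordsRetrieved head = prefetchQueue.peek();
        if (head != null && head.batchUniqueIdentifier().equals(ack.batchUniqueIdentifier())) {
            prefetchQueue.poll();                        // evict the acked event
            RecordsRetrieved next = prefetchQueue.peek();
            if (next != null && requestedResponses.get() > 0) {
                requestedResponses.decrementAndGet();
                deliver(next);                           // strictly one new delivery per valid ack
            }
        }
        // else: stale or unexpected ack -> log and ignore; nothing is published
    }

    private void deliver(RecordsRetrieved event) {
        // subscriber.onNext(event) in the real code
    }
}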

Aha.. yes. Makes sense. Thanks for the details.

But I still think we can achieve this without letting the ack thread drain the queue. We can discuss this later. :)

Contributor Author

The publisher thread spins in a while loop to offer the prefetched element to the queue. That's why the ack-thread (now) and demand-thread (earlier) call drainQueue directly. Let's discuss this offline.
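Editor's note: a minimal, hypothetical sketch of the hand-off described here, with illustrative names only (not the real class or its API): the publisher thread spins offering into the bounded queue and then only drains when the ack-notifier does not hold control, the demand-notifier uses the same guarded drain, and the ack-notifier drains directly while it owns the control flag.

import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

class DrainControlSketch {
    private final LinkedBlockingQueue<Object> prefetchQueue = new LinkedBlockingQueue<>(3);
    private final AtomicLong requestedResponses = new AtomicLong(0);
    private final AtomicBoolean shouldDrainEventOnAck = new AtomicBoolean(false);

    // Publisher thread: retries until the bounded prefetch queue has room, then kick-starts a guarded drain.
    void onFetchCompleted(Object result) throws InterruptedException {
        while (!prefetchQueue.offer(result, 100, TimeUnit.MILLISECONDS)) {
            // keep retrying; this blocking offer loop is why the publisher cannot be the only drainer
        }
        drainQueueForRequestsIfAllowed();
    }

    // Demand-notifier thread: records new demand, then attempts a guarded drain.
    void onRequest(long n) {
        requestedResponses.addAndGet(n);
        drainQueueForRequestsIfAllowed();
    }

    // Ack-notifier thread: after validating the ack it drains directly, since it owns the control.
    void onAck() {
        drainQueueForRequests();
    }

    private synchronized void drainQueueForRequestsIfAllowed() {
        if (!shouldDrainEventOnAck.get()) {
            drainQueueForRequests();
        }
    }

    private synchronized void drainQueueForRequests() {
        Object next = prefetchQueue.peek();
        if (next != null && requestedResponses.get() > 0) {
            requestedResponses.decrementAndGet();
            shouldDrainEventOnAck.set(true);   // the ack-notifier now owns the next drain
            // subscriber.onNext(next) in the real code
        } else {
            shouldDrainEventOnAck.set(false);  // hand control back to the publisher/demand threads
        }
    }
}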

* Method that will be called by the 'publisher thread' and the 'demand notifying thread',
* to drain the events if the 'event notifying thread' does not have the control.
*/
private synchronized void initiateDrainQueueForRequests() {

initiate may not be the best word here. maybe drainQueueForRequestsIfAllowed?

Contributor Author

Yep

if (requestedResponses.get() > 0 && recordsToDeliver != null) {
    lastEventDeliveryTime = Instant.now();
    subscriber.onNext(recordsToDeliver);
    if(!shouldDrainEventOnAck.get()) {

nit: missing space between if and (

Contributor Author

will address

} else {
    // Since we haven't scheduled the event delivery, give the drain control back to publisher/demand-notifier
    // thread.
    if(shouldDrainEventOnAck.get()){

nit: missing space between if and (. check all such places.

// Give the drain control to publisher/demand-notifier thread.
log.debug("{} : Publisher thread takes over the draining control. Queue Size : {}, Demand : {}", shardId,
        getRecordsResultQueue.size(), requestedResponses.get());
shouldDrainEventOnAck.set(false);

Before this change, we had the eventDeliveryLock semaphore which was only released upon receiving the ack. So if the ack is lost somewhere, the shard gets stuck for ever. Was there any reason why we didn't release it within restartFrom here?

Looks like we are fixing that issue in this change. Shouldn't adding this step to the semaphore solution fix the issue? What do we gain on top of that with this new solution?

Contributor Author

answered above

Yep, sorry about the duplicate. :)

    shouldDrainEventOnAck.set(false);
} else {
    // Else attempt to drain the queue.
    drainQueueForRequests();

In the happy case where we continue to receive acks for each event delivered, the ack-notifier thread will always be the one draining the queue. The publisher thread will only do the fetch part. I think this conflicts a bit with the design of this class and the purpose of the publisher thread.

Contributor Author

The publisher thread acts as a kick-starter when deliveries are paused. In the happy case, yes, the ack-notifier does the scheduling, to minimize the delay between deliveries and for simplicity.

I agree that it'll help a bit with the delay. But not with the simplicity. :) We have 3 different threads draining the queue. I think it would be better and simpler if we had only one thread publishing and the other two just updating the demand and registering the ack. Anyway, that's a bigger change and we can consider it in the future. :)

} else {
    // Log and ignore any other ack received. As long as an ack is received for head of the queue
    // we are good. Any stale or future ack received can be ignored, though the latter is not feasible
    // to happen.

Another reason to come here is when the queue is reset before the ack comes back. Add this too in the comment?

Contributor Author

That would be considered a stale ack.
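Editor's note: a small, hypothetical sketch of why an ack that arrives after a reset is treated as stale: restartFrom clears the prefetch queue and returns drain control to the publisher thread, so the late ack no longer matches the head of the queue and falls into the "log and ignore" branch. Names are illustrative, not the actual KCL code.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.atomic.AtomicBoolean;

class ResetAndStaleAckSketch {
    private final Deque<String> prefetchQueue = new ArrayDeque<>(); // batch identifiers
    private final AtomicBoolean shouldDrainEventOnAck = new AtomicBoolean(false);

    synchronized void restartFrom(String checkpoint) {
        prefetchQueue.clear();                 // drop prefetched results past the checkpoint
        shouldDrainEventOnAck.set(false);      // publisher thread takes back the drain control
    }

    synchronized void notifyOfAck(String ackedBatchId) {
        String head = prefetchQueue.peek();
        if (head != null && head.equals(ackedBatchId)) {
            prefetchQueue.poll();              // normal path: evict head, deliver the next event
        } else {
            // stale ack: either an old event, or the queue was reset before the ack came back;
            // log and ignore.
        }
    }
}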

@isurues isurues left a comment

Looks good.

@micah-jaffe micah-jaffe merged commit f6dec3e into awslabs:master Sep 3, 2019
@micah-jaffe micah-jaffe added this to the v2.2.3 milestone Sep 4, 2019