
bugfix: deadlock; flush delete buffer after stopping the subscriber #68

Merged
merged 5 commits into master from bugfix-remove-messages-from-delete-buffer on Jul 31, 2019

Conversation

muharem
Contributor

@muharem muharem commented Jun 27, 2019

A few fixes related to the AWS subscriber:

  • A deadlock occurs if the subscriber is stopped while messages received from AWS are waiting for space in the output channel; the consumer no longer reads from the output channel because it has stopped the subscriber.
  • Removed messages are not flushed from the buffer after the subscriber is stopped; as a result, messages that were handled and marked as removed are never deleted from the SQS queue.
  • The err variable is not reset to nil after each delete iteration.
  • Stop all goroutines when the subscriber is being stopped.
  • Do not add a removed message to the delete buffer if the subscriber is not running.
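The first bullet is the core of the fix: an unconditional send on the output channel blocks forever once the consumer has stopped reading. A minimal sketch of the pattern, with illustrative names (`pump`, `output`, `stop`) that are not the library's actual identifiers:

```go
package main

import "fmt"

// pump forwards messages to output, but selects against a stop signal so it
// never blocks forever on a consumer that has stopped reading. It returns
// how many messages were actually delivered.
func pump(messages []string, output chan<- string, stop <-chan struct{}) int {
	delivered := 0
	for _, msg := range messages {
		select {
		case output <- msg: // consumer is still reading
			delivered++
		case <-stop: // consumer stopped; abandon the remaining messages
			return delivered
		}
	}
	return delivered
}

func main() {
	output := make(chan string) // unbuffered, like a full output channel
	stop := make(chan struct{})

	// The consumer reads exactly one message, then stops the subscriber.
	done := make(chan struct{})
	go func() {
		<-output
		close(stop)
		close(done)
	}()

	n := pump([]string{"m1", "m2", "m3"}, output, stop)
	<-done
	fmt.Println("delivered:", n) // delivered: 1
}
```

With a plain `output <- msg` send instead of the `select`, this program would deadlock on the second message, which is exactly the bug described above.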

@jsafoodpanda
Contributor

@muharem What happens if multiple messages are received but not consumed?
So we get 10 messages from SQS, process 5 and then die. The last 5 should not be deleted. This PR is about the first 5 that were processed and then added to the deletion queue?

@muharem
Contributor Author

muharem commented Jun 28, 2019

@muharem What happens if multiple messages are received but not consumed?
So we get 10 messages from SQS, process 5 and then die. The last 5 should not be deleted. This PR is about the first 5 that were processed and then added to the deletion queue?

@jsafoodpanda Yes, only about those 5, which were explicitly marked as Done and put into the delete buffer.

Now the subscriber just stops without flushing the delete buffer.

@wilsont

wilsont commented Jun 28, 2019

LGTM, can you also provide a unit test for this?

@muharem
Contributor Author

muharem commented Jul 1, 2019

LGTM, can you also provide a unit test for this?

@wilsont done

@@ -65,6 +66,8 @@ var (
defaultSQSConsumeBase64 = true
)

var sqsClientFactoryFunc = createSqsClient
Contributor Author

This helps to mock the SQS client while using the actual constructor to create the subscriber in unit tests.

ReceiptHandle: m.message.ReceiptHandle,
}
if m.sub.isStopped() {
return m.sub.deleteMessageBatch(batchInput)
Contributor Author

If the subscriber is stopped there is no goroutine left to remove a message, so we remove it inside the same goroutine.
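The two delete paths the comment describes can be sketched as follows. This is a minimal illustration with hypothetical names (`fakeSub`, `done`); the real subscriber batches deletes through a channel drained by a background goroutine, and `deleteMessageBatch` here stands in for the SQS call:

```go
package main

import "fmt"

// fakeSub models the two delete paths: while running, deletes are queued on
// toDelete for a background goroutine to batch; once stopped, that goroutine
// is gone, so the delete must happen synchronously in the caller's goroutine.
type fakeSub struct {
	stopped  bool
	toDelete chan string
	deleted  []string
}

func (s *fakeSub) isStopped() bool { return s.stopped }

// deleteMessageBatch stands in for the actual sqs.DeleteMessageBatch call.
func (s *fakeSub) deleteMessageBatch(receipt string) error {
	s.deleted = append(s.deleted, receipt)
	return nil
}

// done mirrors the Done() path: synchronous delete after stop, buffered
// delete while running.
func (s *fakeSub) done(receipt string) error {
	if s.isStopped() {
		// no background goroutine left to drain toDelete; delete inline
		return s.deleteMessageBatch(receipt)
	}
	s.toDelete <- receipt // background goroutine will batch this
	return nil
}

func main() {
	s := &fakeSub{stopped: true}
	_ = s.done("receipt-1")
	fmt.Println(len(s.deleted)) // 1: deleted synchronously after stop
}
```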

atomic.SwapUint32(&s.stopped, uint32(0))
s.stop = make(chan chan error, 1)
s.flush = make(chan chan error, 1)
s.toDelete = make(chan *deleteRequest)
Contributor Author

All channels are initialized inside the Start func so a single subscriber instance can be started multiple times.
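A sketch of the restart-friendly Start this comment describes. The field names mirror the snippet above but the struct and methods are illustrative, not the library's actual types:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// sub models a restartable subscriber: every channel is (re)created inside
// Start and the stopped flag is reset, so one instance can go through
// repeated Start/Stop cycles.
type sub struct {
	stopped  uint32
	stop     chan chan error
	flush    chan chan error
	toDelete chan string
}

func (s *sub) Start() {
	atomic.StoreUint32(&s.stopped, 0)
	// fresh channels each run: the previous run may have closed or drained them
	s.stop = make(chan chan error, 1)
	s.flush = make(chan chan error, 1)
	s.toDelete = make(chan string)
}

func (s *sub) Stop() {
	atomic.StoreUint32(&s.stopped, 1)
}

func (s *sub) isStopped() bool { return atomic.LoadUint32(&s.stopped) == 1 }

func main() {
	s := &sub{}
	s.Start()
	s.Stop()
	s.Start() // second run on the same instance
	fmt.Println(s.isStopped()) // false
}
```

Had the channels been created once in the constructor, a second Start would reuse channels already closed or drained by the first Stop.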

@@ -224,49 +251,54 @@ func (s *subscriber) Start() <-chan pubsub.Message {
s.Logger.Printf("found %d messages", len(resp.Messages))
// for each message, pass to output
for _, msg := range resp.Messages {
output <- &subscriberMessage{
select {
Contributor Author

To avoid a deadlock when a consumer stops the subscriber and no longer reads from the output channel.

var delRequest *deleteRequest
var err error
select {
case flush := <-s.flush:
Contributor Author

Flush the delete buffer and shut down the goroutine.
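The flush-and-shutdown pattern this comment refers to can be sketched like this. Channel and variable names are illustrative, and the batch threshold is a made-up constant; the real code batches receipts into an SQS DeleteMessageBatch call:

```go
package main

import "fmt"

// deleteLoop selects between incoming delete requests and a flush signal.
// On flush it deletes whatever is still buffered, reports the result back to
// the caller (Stop), and exits, so no marked-done message is left behind.
func deleteLoop(toDelete <-chan string, flush <-chan chan error, deleted *[]string) {
	var buf []string
	for {
		select {
		case r := <-toDelete:
			buf = append(buf, r)
			if len(buf) >= 10 { // illustrative batch threshold
				*deleted = append(*deleted, buf...)
				buf = nil
			}
		case ack := <-flush:
			*deleted = append(*deleted, buf...) // drain the remaining entries
			buf = nil
			ack <- nil // report the flush result, then shut down
			return
		}
	}
}

func main() {
	toDelete := make(chan string)
	flush := make(chan chan error)
	var deleted []string

	go deleteLoop(toDelete, flush, &deleted)
	toDelete <- "receipt-1" // buffered, below the batch threshold

	ack := make(chan error)
	flush <- ack // Stop() would do this before returning
	fmt.Println(<-ack == nil, len(deleted)) // true 1
}
```

Without the flush case, "receipt-1" would sit in the buffer forever after Stop, which is the second bug listed in the PR description.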

@muharem muharem changed the title bugfix flush delete buffer after stopping the subscriber bugfix: deadlockj; flush delete buffer after stopping the subscriber Jul 2, 2019
@muharem muharem changed the title bugfix: deadlockj; flush delete buffer after stopping the subscriber bugfix: deadlock; flush delete buffer after stopping the subscriber Jul 2, 2019
Contributor

@dh-saurabh dh-saurabh left a comment

Please add explicit test cases for:

  1. The deadlock scenario you fixed
  2. Assurance that an unprocessed message can be recovered from the queue if the subscriber is stopped in between

@@ -168,12 +180,16 @@ func (m *subscriberMessage) ExtendDoneDeadline(d time.Duration) error {
// message has been deleted.
Contributor

@dh-saurabh dh-saurabh Jul 23, 2019

Does this description still hold good?

Contributor Author

The information that a message will be removed synchronously when the subscriber is stopped could be added, but in my opinion it does not matter to the client.
The doc should say that the message will be deleted by this function.
And the SQSDeleteBufferSize configuration's doc should tell the client about the delete buffer.

_, err = s.sqs.DeleteMessageBatch(batchInput)
delRequest.receipt <- err
}

Contributor

Also please add suitable comments describing what this method is doing

Contributor Author

Added

msq1.Done()
assert.True(t, len(sqstest.Deleted) == 0, "Message unexpectedly was removed from the delete buffer")

sub.Stop()
Contributor

So this scenario mimics the case when you have three messages to process, you process only one and stop the subscriber, and then you flush all the remaining messages without processing them. What happens when you restart the subscriber? Will you get the remaining two messages?

Contributor

Can you add tests which also confirm that you will not lose any unprocessed messages from the queue?

Contributor

What happens when you restart the subscriber ?

Is there a use case for this? The app should not restart consumption after stopping, as the only stop scenario is shutting down.

What do you mean by not losing any unprocessed messages? If they are not processed, they should remain in SQS and not be deleted.

Contributor

What happens when you restart the subscriber ?

Is there a use case for this? The app should not restart consumption after stopping, as the only stop scenario is shutting down.

Sure

Contributor

What do you mean by not losing any unprocessed messages? If they are not processed, they should remain in SQS and not be deleted.

Please ignore this one

Contributor Author

Will you get the remaining two messages ?

Yes

you are flushing all the remaining messages without processing them.

We are flushing only those which were marked as Done by the Done method.

Contributor Author

This test case tests the deadlock.
The subscriber gets 3 messages in one batch, but the consumer reads only one from the channel and stops.

You can run this test case against the current version and see the deadlock.

@muharem
Contributor Author

muharem commented Jul 24, 2019

Please add the explicit test cases for:

  1. Deadlock scenario you fixed

Already there

  1. Assurance that an unprocessed message can be recovered from the queue if the subscriber is stopped in between

I do not get it.
unprocessed message - unprocessed messages (the messages which were fetched from the queue but were not marked as Done by the Done method) will be returned to the queue by SQS functionality.

the subscriber is stopped in between - between what?

@dh-saurabh
Contributor

the subscriber is stopped in between - between what?

It means the subscriber is started and stopped at some point during whatever process you were using the subscriber for.

@dh-saurabh
Contributor

unprocessed message - unprocessed messages (the messages which were fetched from the queue but were not marked as Done by the Done method) will be returned to the queue by SQS functionality.

Fair enough

Contributor

@dh-saurabh dh-saurabh left a comment

As @muharem confirmed that this PR is only about removing the already processed and marked-as-Done messages from SQS, and the required test cases are there, this is approved from my side.

@muharem muharem merged commit 7d6318b into master Jul 31, 2019
@Maxibond Maxibond deleted the bugfix-remove-messages-from-delete-buffer branch December 4, 2019 04:45