Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Acked messages unexpectedly redelivered when others are negatively acked #5969

Closed
gmethvin opened this issue Dec 31, 2019 · 8 comments · Fixed by #6052
Closed

Acked messages unexpectedly redelivered when others are negatively acked #5969

gmethvin opened this issue Dec 31, 2019 · 8 comments · Fixed by #6052
Labels
type/bug The PR fixed a bug or issue reported a bug

Comments

@gmethvin
Copy link
Member

Describe the bug

We've encountered an issue in which acknowledged messages are redelivered one or more times after other messages are negatively acknowledged. This seems to occur when messages are produced in batches. This happens in the absence of any known broker or connection failures.

To Reproduce

I've modified the NegativeAcksTest to test for the correct behavior here: master...gmethvin:negative-ack-duplicates

As the test demonstrates, in some configurations positively acknowledged messages are redelivered. This is similar to a situation we see in production.

Expected behavior

Only the negatively acknowledged messages should be redelivered. Positively acknowledged messages should not be redelivered, at least not in a typical situation with no failures.

We produce messages in batches, but both the APIs and the documentation suggest that both acks and negative acks act on a per-message level. If negative acks act on batches, then the APIs and documentation should be changed to clearly indicate that.

@gmethvin gmethvin added the type/bug The PR fixed a bug or issue reported a bug label Dec 31, 2019
@sijie
Copy link
Member

sijie commented Jan 1, 2020

I think the fundamental problem of #5891 #5969 is the cursor in tracking at batch level not at message level. Hence failures can result in redelivering the whole batch. In order to address the fundamental problem here, we need to improve the cursor tracking at the message level in the broker side.

Loop in @jiazhai @codelipenghui in the discussion.

@jerrypeng
Copy link
Contributor

jerrypeng commented Jan 1, 2020

@sijie @gmethvin we could track cursors at the batchLevel i.e. track cursors on the granularity of LedgerId:EntryId:BatchIndex, however there are performance implications if we would want to redeliver only messages that are NOT ACK ed or NACK ed. To do so would require us to de-serialized BK entries on the broker side, filter the messages in the entry, re-serialize the message, send it to the client, and finally the client with have to deserialize. To do this would entail more latency and 2X cpu effort for serialization and de-serialization. We have generally avoid entry inspection in the broker for such a reason.

The question I have is if a user really wants to get only the messages that are NOT ACK ed or NACK ed, can the user just turn off batching? Or if the user wants fewer duplicates from redelivered messages/batches, can the user just tune the batch size to be smaller? Are these good enough workflows or knobs to turn for users to satisfy these use cases?

Or is this also just a documentation / education issue? The docs are NOT very clear on what the expected behavior of ACKs or NACKs when batching is enabled.

@merlimat can you also chime in?

@sijie
Copy link
Member

sijie commented Jan 2, 2020

@jerrypeng there are many ways to avoid serialization and deserialization. Broker can deliver a bit metadata along with the batches to tell the client what messages are acked. Clients can filter out the already delivered messages during deserialization at client side.

The ability to control cursor tracking at the message level is a requirement for transactions anyway. We'd resolve the problem in single place.

@jerrypeng
Copy link
Contributor

jerrypeng commented Jan 2, 2020

@sijie isn't that design just trading cpu usage for network usage? If you don't filter on the broker side and resend the whole batch to be filtered on the client side, there will be network implications especially if batches are big.

@sijie
Copy link
Member

sijie commented Jan 3, 2020

@jerrypeng yes it is a trade off here - whose resource to be used for this task. However i think the core problem here isn’t the resource problem so far. It is a semantic issue. Redelivering batches mostly are okay. But people doesn’t have the knowledge of which messages have acknowledged when a batch is redelivered. Semantic issue is what most of people care about first.

@gmethvin
Copy link
Member Author

gmethvin commented Jan 3, 2020

I agree @sijie. The main issue is that the API semantics don't match what users expect. If the negativeAcknowledge API acts on a single message, then it is a very surprising behavior for all the messages in the batch to be redelivered. If on the other hand the negativeAcknowledge API applies to a batch, then the current behavior would be more understandable. I would argue that single-message semantics are much more useful for the user though.

@cdbartholomew
Copy link
Contributor

@jerrypeng @sijie @gmethvin I agree that redelivering batches is OK in most cases. Redelivery of unacked/nacked messages is the exception, not the rule, but when it occurs it should behave in a way that makes sense to the user of the API.

It's not that hard for the client to filter out the already acked messages from the batch. It is already keeping track of this so it knows when it can ack the batch back to broker. It's just a matter of using this information to filter received messages.

That approach doesn't cover all failure cases (ex if client restarts), but in the case where the application is NACKing a message, it would give reasonable behavior. Plus it will reduce the number of duplicate messages that get sent to the client when using batch messages.

@zzzming and I are happy to work on a PR for this if everyone agrees this is the right approach.

@jiazhai
Copy link
Member

jiazhai commented Jan 4, 2020

Oh, since it is also related to transaction implementation, Penghui and me are already writing a PIP for this. We are going to share the PIP soon.

sijie pushed a commit that referenced this issue Jun 2, 2020
Master issue: #6253
Fixes #5969

### Motivation

Add support for ack batch message local index. Can be disabled at broker side by set batchIndexAcknowledgeEnable=false at broker.conf

PIP-54 documentation will be created soon.

### Modifications

1. Managed cursor support track and persistent local index of batch message.
2. Client support send batch index ack to broker.
3. The batch messages with index ack information dispatched to the client.
4. Client skip the acked index.

### Verifying this change

New unit tests added
Huanli-Meng pushed a commit to Huanli-Meng/pulsar that referenced this issue Jun 12, 2020
Master issue: apache#6253
Fixes apache#5969

### Motivation

Add support for ack batch message local index. Can be disabled at broker side by set batchIndexAcknowledgeEnable=false at broker.conf

PIP-54 documentation will be created soon.

### Modifications

1. Managed cursor support track and persistent local index of batch message.
2. Client support send batch index ack to broker.
3. The batch messages with index ack information dispatched to the client.
4. Client skip the acked index.

### Verifying this change

New unit tests added
huangdx0726 pushed a commit to huangdx0726/pulsar that referenced this issue Aug 24, 2020
Master issue: apache#6253
Fixes apache#5969

### Motivation

Add support for ack batch message local index. Can be disabled at broker side by set batchIndexAcknowledgeEnable=false at broker.conf

PIP-54 documentation will be created soon.

### Modifications

1. Managed cursor support track and persistent local index of batch message.
2. Client support send batch index ack to broker.
3. The batch messages with index ack information dispatched to the client.
4. Client skip the acked index.

### Verifying this change

New unit tests added
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants