
Support grouping ACK requests by time and size #957

Merged
merged 4 commits into apache:master on Feb 22, 2023

Conversation

BewareMyPower
Contributor

Fixes #949

Motivation

Currently the Go client does not support grouping ACK requests, so each time Ack (or a similar API) is called, an ACK request is sent, which can degrade performance. We need to support configuring the time and size thresholds for caching MessageIDs before sending ACK requests.

Modifications

  • Add an AckGroupingOptions field to ConsumerOptions; when it is nil, use 100ms as the max time and 1000 as the max size (see the configuration sketch after this list).
  • Add an ackGroupingTracker interface to support grouping ACK requests.
  • When AckWithResponse is false, add the MessageID instance to the tracker instead of sending the request to eventsCh.
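For illustration, here is a minimal sketch of a consumer opting into this grouping behaviour. The AckGroupingOptions shape (a pointer field with MaxSize and MaxTime) follows this description; the broker URL, topic, and subscription name are placeholders, and the values shown are simply the defaults mentioned above.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/apache/pulsar-client-go/pulsar"
)

func main() {
	client, err := pulsar.NewClient(pulsar.ClientOptions{URL: "pulsar://localhost:6650"})
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Group ACKs: flush after 1000 cached message IDs or after 100ms,
	// whichever comes first (the defaults described in this PR).
	consumer, err := client.Subscribe(pulsar.ConsumerOptions{
		Topic:            "my-topic",
		SubscriptionName: "my-sub",
		AckGroupingOptions: &pulsar.AckGroupingOptions{
			MaxSize: 1000,
			MaxTime: 100 * time.Millisecond,
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	defer consumer.Close()

	msg, err := consumer.Receive(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	// With grouping enabled, Ack only records the message ID in the tracker;
	// the actual ACK request is sent when the tracker flushes.
	consumer.Ack(msg)
}
```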

Verifying this change

  • Make sure that the change passes the CI checks.

This change added tests and can be verified as follows:

  • Added ack_grouping_tracker_test.go to verify ackGroupingTracker in various cases.
  • The consumer-side change is covered by existing tests because the default AckGroupingOptions config is { MaxSize: 1000, MaxTime: 100*time.Millisecond }.

Does this pull request potentially affect one of the following parts:

If yes was chosen, please highlight the changes

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API: (yes)
  • The schema: (yes / no / don't know)
  • The default values of configurations: (yes / no)
  • The wire protocol: (yes / no)

Documentation

  • Does this pull request introduce a new feature? (yes)
  • If yes, how is the feature documented? (GoDocs)
  • If a feature is not applicable for documentation, explain why?
  • If a feature is not documented yet in this PR, please create a followup issue for adding the documentation

BewareMyPower marked this pull request as ready for review on February 16, 2023 09:01
shibd merged commit 7d257b0 into apache:master on Feb 22, 2023
@panszobe
Contributor

Does the community have any performance test data?

I tested with the latest version (v0.9.1-0.20230222003822-7d257b01b121) today and still found that the test case with EnableBatchIndexAcknowledgment=true consumed messages much more slowly than the one with EnableBatchIndexAcknowledgment=false, and could not keep up with the producer's production rate.
When consuming from the same topic, Java client consumers with EnableBatchIndexAcknowledgment=true consumed messages smoothly and the backlog did not grow.

I think the ACK grouping feature itself is fine, but the EnableBatchIndexAcknowledgment feature may still have much room for improvement.

@shibd
Member

shibd commented Feb 22, 2023

Does the community have any performance test data?

I tested with the latest version (v0.9.1-0.20230222003822-7d257b01b121) today and still found that the test case with EnableBatchIndexAcknowledgment=true consumed messages much more slowly than the one with EnableBatchIndexAcknowledgment=false, and could not keep up with the producer's production rate. When consuming from the same topic, Java client consumers with EnableBatchIndexAcknowledgment=true consumed messages smoothly and the backlog did not grow.

I think the ACK grouping feature itself is fine, but the EnableBatchIndexAcknowledgment feature may still have much room for improvement.

Maybe we should set ackIndividualCh and ackCumulativeCh size > 0. /cc @BewareMyPower

@BewareMyPower
Contributor Author

BewareMyPower commented Feb 22, 2023

Maybe we should set ackIndividualCh and ackCumulativeCh size > 0

@shibd What's the benefit of setting this channel size greater than 0? I think we don't need to cache these MessageIDs again in the channel. They are already cached in cachedAcks.

Does the community have any performance test data?

Not currently. Could you provide more information for diagnosis? For example, the Prometheus metric ProcessingTime records the processing time between when a message is received and when it is acknowledged. You can also adjust AckGroupingOptions to see whether there is any improvement, e.g. by increasing MaxTime or MaxSize. A pprof flame graph would also be useful for the analysis. @panszobe
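As a general aside (not from this thread): a common way to capture such a CPU profile and flame graph from a running Go consumer is the standard net/http/pprof package; the port below is an arbitrary choice.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// In a real consumer application this goroutine would run alongside the consume loop.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	select {} // keep the process alive; replace with the actual consumer logic
}
```

A 30-second CPU profile can then be collected and browsed (including a flame graph view) with `go tool pprof -http=:8080 http://localhost:6060/debug/pprof/profile?seconds=30`.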

@shibd
Member

shibd commented Feb 22, 2023

@shibd What's the benefit of setting this channel size greater than 0? I think we don't need to cache these MessageIDs again in the channel. They are already cached in cachedAcks.

Just a guess. For example, if the channel has a buffer, the add operation will not block while the goroutine is flushing data.
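To illustrate the guess with a generic Go example (not the client's actual code): a send on an unbuffered channel blocks until a receiver is ready, whereas sends on a buffered channel complete immediately while the buffer has room.

```go
package main

import "fmt"

func main() {
	unbuffered := make(chan int)   // a send blocks until someone receives
	buffered := make(chan int, 10) // up to 10 sends complete with no receiver

	go func() { fmt.Println("received:", <-unbuffered) }()
	unbuffered <- 1 // would deadlock without the receiving goroutine above

	for i := 0; i < 10; i++ {
		buffered <- i // does not block: the buffer still has room
	}
	fmt.Println("queued without blocking:", len(buffered))
}
```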

@panszobe
Contributor

Does the community have any performance test data?

Not currently. Could you provide more information for diagnosis? For example, the Prometheus metric ProcessingTime records the processing time between when a message is received and when it is acknowledged. You can also adjust AckGroupingOptions to see whether there is any improvement, e.g. by increasing MaxTime or MaxSize. A pprof flame graph would also be useful for the analysis. @panszobe

Here are some Prometheus metrics from a Grafana graph:

[Grafana metric screenshots]

The version labels stand for:
enable_batch_master: the latest version with EnableBatchIndexAcknowledgment set to true
disable_batch_master: the latest version with EnableBatchIndexAcknowledgment set to false
disable_batch_v0.6.0: version 0.6.0 with EnableBatchIndexAcknowledgment not provided

As the metrics show, receiving and acking messages with enable_batch_master is slower than with the other two versions, while the processing time metric does not show a very large difference.

There are three flame graph screenshots for enable_batch_master:
[flame graph screenshots]

For disable_batch_master:
[flame graph screenshots]

As shown above, compared to disable_batch_master, runEventsLoop and internalSendRequest take much more time.

Next I will try adjusting AckGroupingOptions to see whether it brings any improvement.
Please take a look; it would be even better if you can diagnose what's wrong from the information provided above.
@BewareMyPower @shibd

@BewareMyPower
Contributor Author

@panszobe From the flame graph, I think the root cause is the inefficient process of sending ACKs. Most of the time in the internalAck method is spent in internal.(*rpcClient).RequestOnCnxNoWait. Within that method, SendRequestNoWait takes about 3/4 of the total time, and the internal.baseCommand function takes about another 1/4.

While I look forward to your test results with other AckGroupingOptions configs, I'm going to test the performance locally. Could you share the details of your test setup, including:

  • The batch configs of the producer
  • The message size
  • The ACK type (individual or cumulative?)

BTW, the flush method of the ACK grouping tracker only does the following things (a simplified sketch of this grouping behaviour follows below):

  1. Call partitionConsumer#sendIndividualAck or partitionConsumer#sendCumulativeAck, which just sends the ACKs to the eventsCh channel.
  2. Remove some elements from the map.

I didn't see the flushIndividualAcks method in the flame graph, so it doesn't look like it is the bottleneck. /cc @shibd
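To make the size-and-time grouping concrete, here is a minimal, hypothetical sketch of such a tracker. It is not the client's actual ackGroupingTracker: the int64 IDs stand in for MessageID values, the flush callback stands in for sendIndividualAck/eventsCh, and shutdown handling is omitted.

```go
package main

import (
	"sync"
	"time"
)

// groupingTracker buffers IDs and flushes them by size or by time,
// whichever comes first.
type groupingTracker struct {
	mu      sync.Mutex
	pending []int64 // real code would hold MessageID values
	maxSize int
	flush   func(ids []int64) // stands in for sending ACKs to eventsCh
}

func newGroupingTracker(maxSize int, maxTime time.Duration, flush func([]int64)) *groupingTracker {
	t := &groupingTracker{maxSize: maxSize, flush: flush}
	go func() {
		ticker := time.NewTicker(maxTime)
		defer ticker.Stop()
		for range ticker.C {
			t.flushPending() // time-based flush
		}
	}()
	return t
}

func (t *groupingTracker) add(id int64) {
	t.mu.Lock()
	t.pending = append(t.pending, id)
	full := len(t.pending) >= t.maxSize
	t.mu.Unlock()
	if full {
		t.flushPending() // size-based flush
	}
}

func (t *groupingTracker) flushPending() {
	t.mu.Lock()
	ids := t.pending
	t.pending = nil
	t.mu.Unlock()
	if len(ids) > 0 {
		t.flush(ids)
	}
}

func main() {
	tracker := newGroupingTracker(1000, 100*time.Millisecond, func(ids []int64) {
		// In the real client this would enqueue ACK requests for the flushed IDs.
	})
	for i := int64(0); i < 2500; i++ {
		tracker.add(i) // triggers size-based flushes at 1000 and 2000
	}
	time.Sleep(200 * time.Millisecond) // let the timer flush the remaining 500
}
```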

@panszobe
Contributor

@panszobe From the flame graph, I think the root cause is the inefficient process of sending ACKs. Most of the time in the internalAck method is spent in internal.(*rpcClient).RequestOnCnxNoWait. Within that method, SendRequestNoWait takes about 3/4 of the total time, and the internal.baseCommand function takes about another 1/4.

While I look forward to your test results with other AckGroupingOptions configs, I'm going to test the performance locally. Could you share the details of your test setup, including:

  • The batch configs of the producer
  • The message size
  • The ACK type (individual or cumulative?)

The producer's situation: about 20 MB/s, 200k messages/s, batch size = 10, driven by a Java load-testing program.
The ACK type is individual; we use the AckID method to acknowledge. The SubscriptionType is Shared.

I adjusted the AckGroupingOptions config at 09:49, increasing MaxSize to 1000 and MaxTime to 1000; the metrics are shown below:
[metrics screenshot]
It seems that the adjustment did not bring any improvement.

Looking forward to your local test. @BewareMyPower

@BewareMyPower
Contributor Author

BewareMyPower commented Feb 23, 2023

increasing MaxSize to 1000 and MaxTime to 1000,

The default MaxSize is 1000, which means an ACK request is sent every time 1000 messages have been acknowledged. Increasing MaxTime to 1000 won't help because the receiving rate is far higher than 1000 messages per second.

BTW, I just remembered an important thing: this PR does not support the list ACK. See

// TODO: support ack a list of MessageIDs

i.e. each time 1000 message IDs are flushed, 1000 ACK requests are sent, so it makes no difference compared with no grouping: the total number of ACK requests does not change.

I'm going to implement the list ACK soon and see if it works.
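To spell out what the list ACK would change, here is a conceptual sketch only, not the client's actual wire code; messageID, ackRequest, and sendAck are hypothetical placeholders for the real CommandAck plumbing.

```go
package main

// Hypothetical placeholders for the real MessageID and CommandAck types.
type messageID struct{ ledgerID, entryID int64 }
type ackRequest struct{ ids []messageID }

func sendAck(req ackRequest) {
	// In the real client this would serialize one CommandAck and write it
	// to the broker connection.
}

// Current behaviour: one CommandAck per flushed MessageID, so flushing
// 1000 cached IDs still sends 1000 requests and grouping saves nothing on the wire.
func flushPerID(ids []messageID) {
	for _, id := range ids {
		sendAck(ackRequest{ids: []messageID{id}})
	}
}

// List ACK: a single CommandAck carries every flushed MessageID.
func flushAsList(ids []messageID) {
	if len(ids) == 0 {
		return
	}
	sendAck(ackRequest{ids: ids})
}

func main() {
	ids := []messageID{{1, 0}, {1, 1}, {1, 2}}
	flushPerID(ids)  // 3 requests
	flushAsList(ids) // 1 request
}
```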

@panszobe
Contributor

BTW, I just remembered an important thing: this PR does not support the list ACK. See

// TODO: support ack a list of MessageIDs

Hmm, I made a mistake while adjusting the options. Following your diagnosis, I increased both settings to 10000, which still didn't make a difference.
Maybe the breakthrough will come from your new PR supporting the list ACK.

@BewareMyPower
Contributor Author

Okay. When I open a new PR, I will also perform a simple test in my local env to see if it works.

BewareMyPower deleted the bewaremypower/ack-tracker branch on February 24, 2023 11:11
@panszobe
Contributor

panszobe commented Mar 2, 2023

@BewareMyPower Hi, having seen the new PR, I tested with the latest version (v0.9.1-0.20230301160414-42ded0d59c46) today, but it seems there is no improvement.
[flame graph screenshots]

According to the flame graphs above, SendRequest and MessageReceive take up a lot of the time.

[metrics screenshot]

As shown above, the consume rate and message receiving rate also reflect the test result.

The subscription logic is very simple: it just receives and acks each message, with no business processing logic, as in the sketch below (the original screenshot is not reproduced here).
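For reference, a minimal sketch of such a receive-and-AckID loop (not panszobe's actual code; the broker URL, topic, and subscription name are placeholders):

```go
package main

import (
	"context"
	"log"

	"github.com/apache/pulsar-client-go/pulsar"
)

func main() {
	client, err := pulsar.NewClient(pulsar.ClientOptions{URL: "pulsar://localhost:6650"})
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	consumer, err := client.Subscribe(pulsar.ConsumerOptions{
		Topic:                          "my-topic",
		SubscriptionName:               "my-sub",
		Type:                           pulsar.Shared,
		EnableBatchIndexAcknowledgment: true,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer consumer.Close()

	// Receive and acknowledge only; no business processing logic.
	for {
		msg, err := consumer.Receive(context.Background())
		if err != nil {
			log.Fatal(err)
		}
		consumer.AckID(msg.ID())
	}
}
```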

But it still could not keep up with the producer's production rate, while the other consumer, with BatchIndexAcknowledgment disabled, has no large backlog and stays stable.

Do you have any other ideas? Or is there any optimization in progress now?

@BewareMyPower
Contributor Author

@panszobe I'm currently working on this issue, but since I'm busy with something else this week, the PR has not been opened yet. If you mean #968, it's not related to this issue; it's just a problem I found while writing the code.

@panszobe
Contributor

panszobe commented Mar 2, 2023

@panszobe I'm currently working on this issue, but since I'm busy with something else this week, the PR has not been opened yet. If you mean #968, it's not related to this issue; it's just a problem I found while writing the code.

OK. Looking forward to the new PR. Thanks~

BewareMyPower added a commit to BewareMyPower/pulsar-client-go that referenced this pull request Mar 5, 2023
BewareMyPower added a commit to BewareMyPower/pulsar-client-go that referenced this pull request Mar 6, 2023
BewareMyPower added a commit to BewareMyPower/pulsar-client-go that referenced this pull request Mar 8, 2023
### Motivation

apache#957 adds support for grouping MessageID instances to acknowledge. However, when flushing N cached MessageID instances, N CommandAck requests are sent, which degrades performance.

### Modifications

- When more than one MessageID instance is acknowledged, use a single CommandAck request to carry all of them.
- Reduce the number of MessageIDs stored by the ACK grouping tracker.
- Support configuring AckGroupingOptions.
- Add tests to cover the new logic of the ACK grouping tracker.
geniusjoe mentioned this pull request on Mar 4, 2024
Successfully merging this pull request may close these issues.

Consume Performance drops when set EnableBatchIndexAcknowledgment = true