Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix][client] Fix messages in the batch container timed out unexpectedly #21889

Conversation

BewareMyPower
Copy link
Contributor

Fixes #21884

Motivation

When ProducerImpl#run is called where pendingMessages is empty and batchMessageContainer is not empty, the whole batch's timestamp is treated as lastBatchSendNanoTime + batch latency.

Given send timeout as 5 seconds, assuming a batch was flushed at t1 and the next message was sent after 5 seconds:

  • t1: ProducerImpl#batchFlushTask(), lastBatchSendNanoTime is now t1
  • t1 + 0.0011 s: ProducerImpl#run(), no pending message and the batch container is empty
  • t1 + 5.0010 s: sendAsync(), the message is added to batch
  • t1 + 5.0011 s: ProducerImpl#run() is called before batchFlushTask()

createdAt = lastBatchSendNanoTime
+ TimeUnit.MICROSECONDS.toNanos(conf.getBatchingMaxPublishDelayMicros());

createdAt will be t1 + 0.001 s. However, the current time point is t1 + 5.0011 s, the interval is 5.001 s, timeout happened.

Modification

Record the timestamp when the 1st message is added to the batch container and use this timestamp instead of lastBatchSendNanoTime to compute the createdAt.

Add testSendTimerCheckForBatchContainer to cover this case.

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Fixes apache#21884

### Motivation

When `ProducerImpl#run` is called where `pendingMessages` is empty and
`batchMessageContainer` is not empty, the whole batch's timestamp is
treated as `lastBatchSendNanoTime + batch latency`.

Given send timeout as 5 seconds, assuming a batch was flushed at `t1`
and the next message was sent after 5 seconds:
- t1: `ProducerImpl#batchFlushTask()`, `lastBatchSendNanoTime` is now `t1`
- t1 + 0.0011 s: `ProducerImpl#run()`, no pending message and the batch container is empty
- t1 + 5.0010 s: `sendAsync()`, the message is added to batch
- t1 + 5.0011 s: `ProducerImpl#run()` is called before `batchFlushTask()`

https://github.com/apache/pulsar/blob/176bdeacd309e8c1e49358987a1946abd30ba34a/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ProducerImpl.java#L2014-L2015

`createdAt` will be `t1 + 0.001 s`. However, the current time point is
`t1 + 5.0011 s`, the interval is 5.001 s, timeout happened.

### Modification

Record the timestamp when the 1st message is added to the batch
container and use this timestamp instead of `lastBatchSendNanoTime` to
compute the `createdAt`.

Add `testSendTimerCheckForBatchContainer` to cover this case.
@BewareMyPower BewareMyPower self-assigned this Jan 12, 2024
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Jan 12, 2024
@nodece nodece self-requested a review January 15, 2024 04:11
@Technoboy- Technoboy- added this to the 3.3.0 milestone Jan 15, 2024
@Technoboy- Technoboy- closed this Jan 15, 2024
@Technoboy- Technoboy- reopened this Jan 15, 2024
Copy link
Contributor

@gaoran10 gaoran10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just left a trivial suggestion.

Copy link
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@nodece nodece left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Do you know how to reproduce this issue? In our environment, we have the same issue, but I cannot reproduce this issue.

@BewareMyPower
Copy link
Contributor Author

BewareMyPower commented Jan 16, 2024

@nodece It's not hard to reproduce. The open sourced KoP can also reproduce this issue.

  1. Run a KoP standalone (v3.1.1.1 with Pulsar 3.1.0)
  2. Run a Kafka consumer via CLI (./bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic my-topic --group my-group)
  3. Just wait. If it is not reproduced after some time, try unloading the public/__kafka namespace and wait. (./bin/pulsar-admin namespaces unload public/__kafka)

I reproduced again just now.

2024-01-16T11:33:12,414+0800 [pulsar-web-49-19] INFO  org.apache.pulsar.broker.admin.impl.NamespacesBase - [null] Unloading namespace public/__kafka
[2024-01-16 11:33:50,344] WARN [Consumer clientId=console-consumer, groupId=my-group] Offset commit failed on partition my-topic-0 at offset 0: The request timed out. (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)

After about 40 seconds since unloading, the timeout error occurred.

@codecov-commenter
Copy link

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (176bdea) 73.58% compared to head (4c868c3) 73.55%.
Report is 6 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #21889      +/-   ##
============================================
- Coverage     73.58%   73.55%   -0.03%     
- Complexity    32325    32347      +22     
============================================
  Files          1859     1859              
  Lines        138263   138373     +110     
  Branches      15153    15160       +7     
============================================
+ Hits         101736   101779      +43     
- Misses        28644    28702      +58     
- Partials       7883     7892       +9     
Flag Coverage Δ
inttests 24.18% <50.00%> (+0.08%) ⬆️
systests 23.68% <50.00%> (-0.10%) ⬇️
unittests 72.83% <93.75%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
...sar/client/impl/AbstractBatchMessageContainer.java 79.54% <100.00%> (+3.86%) ⬆️
.../pulsar/client/impl/BatchMessageContainerImpl.java 80.89% <100.00%> (+0.21%) ⬆️
...sar/client/impl/BatchMessageKeyBasedContainer.java 65.30% <100.00%> (+1.47%) ⬆️
...va/org/apache/pulsar/client/impl/ProducerImpl.java 83.01% <80.00%> (-0.29%) ⬇️

... and 64 files with indirect coverage changes

@BewareMyPower BewareMyPower merged commit b56e40c into apache:master Jan 16, 2024
47 checks passed
@BewareMyPower BewareMyPower deleted the bewaremypower/fix-send-timer-created-at branch January 16, 2024 06:38
@codelipenghui codelipenghui added the category/reliability The function does not work properly in certain specific environments or failures. e.g. data lost label Jan 17, 2024
@Technoboy- Technoboy- modified the milestones: 3.3.0, 3.2.0 Jan 19, 2024
mukesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Mar 1, 2024
mukesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Mar 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/client category/reliability The function does not work properly in certain specific environments or failures. e.g. data lost cherry-picked/branch-3.0 cherry-picked/branch-3.1 doc-not-needed Your PR changes do not impact docs ready-to-test release/3.0.3 release/3.1.3 type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] Send timeout happened immediately when batching is enabled
9 participants