Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Java Client] Fixed the producer OOM if got exception while add message to batch container #12170

Merged
merged 2 commits into from
Oct 20, 2021

Conversation

shoothzj
Copy link
Member

@shoothzj shoothzj commented Sep 24, 2021

Motivation

Currently, if BatchMessageContainerImpl's add method throw an Exception while adding the first messages.
Then, the message list is not empty, while the batchedMessageMetadataAndPayload is null.
The producer can't recovery unless we reboot.

Modifications

Support to recovery after add the first message throws exception

image

Documentation

bug fix, doesn't need doc

@shoothzj
Copy link
Member Author

@codelipenghui @BewareMyPower @eolivelli @merlimat Please take a first look if that change is right. I will add test cases if that's the right way to fix it.
Thanks

@Anonymitaet Anonymitaet added the doc-not-needed Your PR changes do not impact docs label Sep 24, 2021
@BewareMyPower BewareMyPower added area/client type/bug The PR fixed a bug or issue reported a bug labels Sep 24, 2021
@BewareMyPower
Copy link
Contributor

Could you explain why clear() could fix the memory exhausted problem?

Generally, for OOM issues, I suggest a fail fast solution. Otherwise, there might be a memory leak somewhere.

@shoothzj
Copy link
Member Author

Could you explain why clear() could fix the memory exhausted problem?

Generally, for OOM issues, I suggest a fail fast solution. Otherwise, there might be a memory leak somewhere.

Agree. The return of false can achieve the goal of fast fail.
Why clear is because numMessagesInBatch has been modified to 1, which make BatchMessageContainerImpl never allocate buffer again. We need a clear() or numMessagesInBatch=0

@BewareMyPower
Copy link
Contributor

But here it swallows the exception and returns false. What would happen if new messages arrived?

@shoothzj
Copy link
Member Author

But here it swallows the exception and returns false. What would happen if new messages arrived?

new messages will start a new journey, it can be add success if memory can be allocated now.

Copy link
Contributor

@codelipenghui codelipenghui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have details about the exception?

@shoothzj
Copy link
Member Author

Do you have details about the exception?

Yes, OutOfDirectMemoryError

@codelipenghui
Copy link
Contributor

Currently, if BatchMessageContainerImpl's add method throw an Exception while adding the first messages.

I mean you have mentioned will get an exception from the add method, do you see the exception details? which will help us to better understand the problem.

@shoothzj shoothzj marked this pull request as draft September 26, 2021 16:07
@eolivelli eolivelli modified the milestones: 2.9.0, 2.10.0 Oct 6, 2021
@shoothzj shoothzj marked this pull request as ready for review October 8, 2021 08:49
@shoothzj
Copy link
Member Author

shoothzj commented Oct 8, 2021

Currently, if BatchMessageContainerImpl's add method throw an Exception while adding the first messages.

I mean you have mentioned will get an exception from the add method, do you see the exception details? which will help us to better understand the problem.

OutOfDirectMemoryError. As same in my unit test, if alloc error, the producer can not normal work again. I am making it can self-healing.

@shoothzj
Copy link
Member Author

shoothzj commented Oct 8, 2021

@codelipenghui @merlimat PTAL

Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@eolivelli eolivelli merged commit ee62763 into apache:master Oct 20, 2021
@shoothzj shoothzj deleted the batch-msg-oom-recover branch October 20, 2021 08:00
zeo1995 pushed a commit to zeo1995/pulsar that referenced this pull request Oct 20, 2021
* up/master:
  [pulsar-java-client] Auto-recovery after exception like out of direct memory (apache#12170)
  Allow to config pulsar client allocator out of memory policy (apache#12200)
  [Transaction] Fix bugs, Exception thrower by TB::appendBufferToTxn must be ManagedLedgerException.  (apache#12376)
  Bumped version to 2.10.0-SNAPSHOT (apache#12285)
  [docs][Website] add docs of broker entry metadata (apache#12404)
  [C++] Use weak ref to ClientConnection for timeout task (apache#12409)
  fix windows test path probleam (apache#12398)
  [website][upgrade]feat: home page (apache#12383)
  [docs] [ISSUE 11526] Update the description of `topic` (apache#12375)
  [Docs] Add document label check robot (apache#12371)
  [Admin] Get schema validation enforce add applied. (apache#12349)
  [Doc] add scope explanations (apache#12380)
  Fix java doc for MultipleListenerValidator (apache#12389)
zeo1995 pushed a commit to zeo1995/pulsar that referenced this pull request Oct 20, 2021
….2-chapter-4

* website/branch-2.7.2-chapter-3:
  Update the nesting of forms in the document
  [pulsar-java-client] Auto-recovery after exception like out of direct memory (apache#12170)
  Allow to config pulsar client allocator out of memory policy (apache#12200)
  [Transaction] Fix bugs, Exception thrower by TB::appendBufferToTxn must be ManagedLedgerException.  (apache#12376)
  Bumped version to 2.10.0-SNAPSHOT (apache#12285)
  [docs][Website] add docs of broker entry metadata (apache#12404)
  [C++] Use weak ref to ClientConnection for timeout task (apache#12409)
  fix windows test path probleam (apache#12398)
  [website][upgrade]feat: home page (apache#12383)
  [docs] [ISSUE 11526] Update the description of `topic` (apache#12375)
  [Docs] Add document label check robot (apache#12371)
  [Admin] Get schema validation enforce add applied. (apache#12349)
  [Doc] add scope explanations (apache#12380)
  Fix java doc for MultipleListenerValidator (apache#12389)
codelipenghui pushed a commit that referenced this pull request Oct 21, 2021
@codelipenghui codelipenghui added the cherry-picked/branch-2.8 Archived: 2.8 is end of life label Oct 21, 2021
nicoloboschi pushed a commit to datastax/pulsar that referenced this pull request Oct 27, 2021
… memory (apache#12170)

(cherry picked from commit ee62763)
(cherry picked from commit 10301ee)
eolivelli pushed a commit to eolivelli/pulsar that referenced this pull request Nov 29, 2021
@codelipenghui codelipenghui changed the title [pulsar-java-client] Auto-recovery after exception like out of direct memory [Java Client] Fixed the producer OOM if got exception while add message to batch container Feb 21, 2022
@BewareMyPower
Copy link
Contributor

BewareMyPower commented May 24, 2022

Add the release/2.9.3 label since this PR is depended by #15413

BewareMyPower pushed a commit that referenced this pull request May 24, 2022
@BewareMyPower BewareMyPower added the cherry-picked/branch-2.9 Archived: 2.9 is end of life label May 24, 2022
shoothzj pushed a commit that referenced this pull request Jun 3, 2022
Fixes #15689
Fixes #15793
Fixes #15897

### Motivation

`PulsarByteBufAllocator.DEFAULT` instance was replaced in BatchMessageContainerImplTest, but never restored. This caused strange issues in many unit tests. OutOfMemoryError and NoClassDefFoundErrors  were most likely caused by the BatchMessageContainerImplTest. 

### Modifications

Fix the problems in BatchMessageContainerImplTest and reset the state after the test completes.


### Additional context

The flaky test was introduced by #12170 which has been cherry-picked to branch-2.8, branch-2.9 and included also in branch-2.10 .
lhotari added a commit that referenced this pull request Jun 6, 2022
Fixes #15689
Fixes #15793
Fixes #15897

### Motivation

`PulsarByteBufAllocator.DEFAULT` instance was replaced in BatchMessageContainerImplTest, but never restored. This caused strange issues in many unit tests. OutOfMemoryError and NoClassDefFoundErrors  were most likely caused by the BatchMessageContainerImplTest.

### Modifications

Fix the problems in BatchMessageContainerImplTest and reset the state after the test completes.

### Additional context

The flaky test was introduced by #12170 which has been cherry-picked to branch-2.8, branch-2.9 and included also in branch-2.10 .

(cherry picked from commit ef1e0aa)
nicoloboschi pushed a commit to datastax/pulsar that referenced this pull request Jun 6, 2022
…he#15911)

Fixes apache#15689
Fixes apache#15793
Fixes apache#15897

### Motivation

`PulsarByteBufAllocator.DEFAULT` instance was replaced in BatchMessageContainerImplTest, but never restored. This caused strange issues in many unit tests. OutOfMemoryError and NoClassDefFoundErrors  were most likely caused by the BatchMessageContainerImplTest.

### Modifications

Fix the problems in BatchMessageContainerImplTest and reset the state after the test completes.

### Additional context

The flaky test was introduced by apache#12170 which has been cherry-picked to branch-2.8, branch-2.9 and included also in branch-2.10 .

(cherry picked from commit ef1e0aa)
(cherry picked from commit 3f3d452)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/client cherry-picked/branch-2.8 Archived: 2.8 is end of life cherry-picked/branch-2.9 Archived: 2.9 is end of life doc-not-needed Your PR changes do not impact docs release/2.8.2 release/2.9.3 type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants