[CELEBORN-1580] ReadBufferDispatcher should notify exception to listener #2707

Closed
codenohup wants to merge 3 commits into apache:main from codenohup:fix-oom

@codenohup
Contributor

What changes were proposed in this pull request?

When the ReadBufferDispatcher encounters an exception, it should notify the listener of that exception. The listener is responsible for informing the Celeborn client of the error and initiating fault-tolerance strategies.

Why are the changes needed?

If the ReadBufferDispatcher does not notify the listener of an exception, the listener (MapPartitionDataReader) may be stuck in a prolonged wait state, which ultimately causes the job to hang.
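
To illustrate the idea, here is a minimal sketch, not the code in this patch. `BufferListener`, `SketchDispatcher`, and the `allocator` parameter are hypothetical names introduced only for this example; the real Celeborn listener interface and dispatcher may look different. The point it shows is that every exit path of the dispatch step calls exactly one listener method, so a waiting reader is always woken up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

// Hypothetical callback interface; the real Celeborn listener may differ.
interface BufferListener {
    void onBuffers(List<byte[]> buffers);

    void onFailure(Throwable cause);
}

// Hypothetical stand-in for the dispatcher's allocation step.
final class SketchDispatcher {
    // Assumed allocation hook: takes a size in bytes and returns a buffer.
    private final IntFunction<byte[]> allocator;

    SketchDispatcher(IntFunction<byte[]> allocator) {
        this.allocator = allocator;
    }

    // Allocates `count` buffers of `size` bytes and reports the outcome.
    void dispatch(int count, int size, BufferListener listener) {
        List<byte[]> buffers = new ArrayList<>(count);
        try {
            for (int i = 0; i < count; i++) {
                buffers.add(allocator.apply(size));
            }
        } catch (Throwable t) {
            // Catch Throwable rather than Exception: out-of-memory
            // conditions are Errors and would otherwise escape silently.
            listener.onFailure(t);
            return;
        }
        listener.onBuffers(buffers);
    }
}
```

Without the `catch` branch, a failed allocation would leave the listener with neither callback invoked, which is the hang described above.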

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a unit test case.

@codenohup codenohup marked this pull request as ready for review August 27, 2024 07:53
@RexXiong
Contributor

The ReadBufferDispatcher may encounter an exception when a Netty OutOfDirectMemoryError occurs. In this case, we should allow the map partition reader to retry; otherwise, the Flink Task Manager could hang. cc @SteNicholas @mridulm
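
A rough sketch of why the notification matters for retrying, under stated assumptions: `RetrySketch` and `readWithRetry` are hypothetical names, and each attempt is modeled as a `CompletableFuture` that the dispatcher completes normally with data or exceptionally with the failure. The actual retry path in Celeborn and Flink may be structured quite differently.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.function.Supplier;

final class RetrySketch {
    // Runs `request` up to `maxAttempts` times. A retry is only possible
    // because a failed attempt completes its future exceptionally; a future
    // that is never completed would block join() forever.
    static <T> T readWithRetry(Supplier<CompletableFuture<T>> request, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Throwable last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return request.get().join();
            } catch (CompletionException e) {
                // join() wraps the original failure; unwrap it for reporting.
                last = e.getCause() != null ? e.getCause() : e;
            }
        }
        throw new IllegalStateException(
            "read failed after " + maxAttempts + " attempts", last);
    }
}
```

The bounded attempt count is a design choice for the sketch: it turns an indefinite hang into either a successful read or a visible error.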

Member

@SteNicholas SteNicholas left a comment

LGTM.

SteNicholas pushed a commit that referenced this pull request Aug 28, 2024
Closes #2707 from codenohup/fix-oom.

Authored-by: codenohup <huangxu.walker@gmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>
(cherry picked from commit c233929)
Signed-off-by: SteNicholas <programgeek@163.com>
@SteNicholas
Member

@codenohup, thanks for the contribution. Merged to main (v0.6.0) and branch-0.5 (v0.5.2).

zaynt4606 pushed a commit to zaynt4606/celeborn that referenced this pull request Aug 29, 2024
s0nskar pushed a commit to s0nskar/celeborn that referenced this pull request Sep 16, 2024
wankunde pushed a commit to wankunde/celeborn that referenced this pull request Oct 11, 2024
3 participants