Skip to content

[CELEBORN-1700] Flink supports fallback to vanilla Flink built-in shuffle implementation#2932

Closed
SteNicholas wants to merge 3 commits into
apache:mainfrom
SteNicholas:CELEBORN-1700
Closed

[CELEBORN-1700] Flink supports fallback to vanilla Flink built-in shuffle implementation#2932
SteNicholas wants to merge 3 commits into
apache:mainfrom
SteNicholas:CELEBORN-1700

Conversation

@SteNicholas
Copy link
Copy Markdown
Member

@SteNicholas SteNicholas commented Nov 20, 2024

What changes were proposed in this pull request?

Flink supports fallback to vanilla Flink built-in shuffle implementation.

Why are the changes needed?

When quota is unenough or workers are unavailable, RemoteShuffleMaster does not support fallback to NettyShuffleMaster, and RemoteShuffleEnvironment does not support fallback to NettyShuffleEnvironment at present. Flink should support fallback to vanilla Flink built-in shuffle implementation for unenough quota and unavailable workers.

Flink Shuffle Fallback

Does this PR introduce any user-facing change?

  • Introduce ShuffleFallbackPolicy interface to determine whether fallback to vanilla Flink built-in shuffle implementation.
/**
 * The shuffle fallback policy determines whether fallback to vanilla Flink built-in shuffle
 * implementation.
 */
public interface ShuffleFallbackPolicy {

  /**
   * Returns whether fallback to vanilla flink built-in shuffle implementation.
   *
   * @param shuffleContext The job shuffle context of Flink.
   * @param celebornConf The configuration of Celeborn.
   * @param lifecycleManager The {@link LifecycleManager} of Celeborn.
   * @return Whether fallback to vanilla flink built-in shuffle implementation.
   */
  boolean needFallback(
      JobShuffleContext shuffleContext,
      CelebornConf celebornConf,
      LifecycleManager lifecycleManager);
}
  • Introduce celeborn.client.flink.shuffle.fallback.policy config to support shuffle fallback policy configuration.

How was this patch tested?

  • RemoteShuffleMasterSuiteJ#testRegisterJobWithForceFallbackPolicy
  • WordCountTestBase#celeborn flink integration test with fallback - word count

@SteNicholas SteNicholas marked this pull request as draft November 20, 2024 15:12
@SteNicholas SteNicholas marked this pull request as ready for review November 20, 2024 16:04
@SteNicholas
Copy link
Copy Markdown
Member Author

SteNicholas commented Nov 20, 2024

Ping @reswqa, @codenohup, @RexXiong.

Comment thread common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala Outdated
@SteNicholas SteNicholas force-pushed the CELEBORN-1700 branch 2 times, most recently from 1f57391 to dfc93da Compare November 22, 2024 03:28
Copy link
Copy Markdown
Member

@reswqa reswqa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update, LGTM.

Copy link
Copy Markdown
Contributor

@codenohup codenohup left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for your contribution!

@SteNicholas
Copy link
Copy Markdown
Member Author

Ping @RexXiong, @FMX.

Copy link
Copy Markdown
Contributor

@RexXiong RexXiong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, except a minor

@RexXiong RexXiong closed this in 9cd6d96 Nov 27, 2024
@RexXiong
Copy link
Copy Markdown
Contributor

Thanks, merge to main(v0.6.0)

FMX pushed a commit that referenced this pull request Dec 20, 2024
…lback to vanilla Flink built-in shuffle implementation

### What changes were proposed in this pull request?

 Support `ShuffleFallbackCount` metric for fallback to vanilla Flink built-in shuffle implementation.

### Why are the changes needed?

#2932 has already supported fallback to vanilla Flink built-in shuffle implementation, which is lack of `ShuffleFallbackCount` metric to feedback the situation of fallback.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`RemoteShuffleMasterSuiteJ#testRegisterPartitionWithProducerForForceFallbackPolicy`

Closes #3012 from SteNicholas/CELEBORN-1700.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants