[CELEBORN-1313] Custom Network Location Aware Replication #2367

Closed
wants to merge 5 commits into from

Contributor

@akpatnam25 akpatnam25 commented Mar 7, 2024

What changes were proposed in this pull request?

Enable custom network-location-aware replication, based on a custom implementation of DNSToSwitchMapping.

Why are the changes needed?

Resolving the network locations of many workers at the master can be expensive. With this change, each worker resolves its own network location and sends it to the master in the RegisterWorker transport message. If a worker cannot resolve its location, the master falls back to resolving it itself (during a meta update or when reloading a snapshot). Proposal: Celeborn Custom Network Location Aware Replication
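
The flow described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `SwitchMapping` is a local stand-in that mirrors the `resolve` method of Hadoop's `DNSToSwitchMapping` so the example compiles without Hadoop, and `StaticTableMapping`, `NetworkLocationSketch`, `resolveSelf`, `locationAtMaster` and the `/default-rack` value are hypothetical names chosen for the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Local stand-in mirroring the resolve method of Hadoop's DNSToSwitchMapping
// (assumption: declared here only so the sketch is self-contained).
interface SwitchMapping {
    List<String> resolve(List<String> names);
}

// Hypothetical custom mapping that looks hosts up in a static table.
class StaticTableMapping implements SwitchMapping {
    private final Map<String, String> table;

    StaticTableMapping(Map<String, String> table) {
        this.table = table;
    }

    @Override
    public List<String> resolve(List<String> names) {
        List<String> out = new ArrayList<>();
        for (String name : names) {
            out.add(table.get(name)); // null when the host is unknown
        }
        return out;
    }
}

class NetworkLocationSketch {
    // Assumed default value, used when nobody can place the host.
    static final String DEFAULT_RACK = "/default-rack";

    // Worker side: resolve only this worker's own host.
    // Returns empty when the mapping cannot place the host.
    static Optional<String> resolveSelf(SwitchMapping mapping, String host) {
        List<String> resolved = mapping.resolve(List.of(host));
        if (resolved == null || resolved.isEmpty() || resolved.get(0) == null) {
            return Optional.empty();
        }
        return Optional.of(resolved.get(0));
    }

    // Master side: prefer the location reported at registration,
    // otherwise try to resolve at the master, otherwise use the default.
    static String locationAtMaster(Optional<String> reported,
                                   SwitchMapping masterMapping,
                                   String host) {
        if (reported.isPresent()) {
            return reported.get();
        }
        return resolveSelf(masterMapping, host).orElse(DEFAULT_RACK);
    }
}
```

In this sketch the master performs a lookup only for workers that registered without a location, which is the cost saving the description is after.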

Does this PR introduce any user-facing change?

No

How was this patch tested?

Updated the unit tests.

@akpatnam25
Contributor Author

cc @waitinfuture @otterc @mridulm, could you help review? Thanks.


codecov bot commented Mar 7, 2024

Codecov Report

Attention: Patch coverage is 50.00%, with 1 line in your changes missing coverage. Please review.

Project coverage is 48.83%. Comparing base (0f60dce) to head (2d070a7).
Report is 16 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| ...born/common/protocol/message/ControlMessages.scala | 0.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2367      +/-   ##
==========================================
- Coverage   48.85%   48.83%   -0.01%     
==========================================
  Files         208      209       +1     
  Lines       12984    13089     +105     
  Branches     1115     1133      +18     
==========================================
+ Hits         6342     6391      +49     
- Misses       6232     6278      +46     
- Partials      410      420      +10     


Contributor

@mridulm mridulm left a comment

Should the use of resolveToMap in AbstractMetaManager.restoreMetaFromFile also be modified so that it queries only the workers whose rack is still unresolved?
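
One way to picture this suggestion, as a rough sketch rather than the actual `restoreMetaFromFile` code: collect only the hosts whose stored location is missing or still the default, and pass just those to a single batch lookup. The class name `UnresolvedRackSketch`, the method `fillUnresolved`, the `batchResolver` function and the `/default-rack` value are all assumptions made for illustration; the real signature of resolveToMap is not reproduced here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class UnresolvedRackSketch {
    // Assumed marker for "no rack known yet".
    static final String DEFAULT_RACK = "/default-rack";

    // hostToLocation: restored worker host -> stored network location
    //                 (may be null, empty, or DEFAULT_RACK).
    // batchResolver:  hypothetical batch lookup that returns one location
    //                 per requested host, in the same order.
    static Map<String, String> fillUnresolved(
            Map<String, String> hostToLocation,
            Function<List<String>, List<String>> batchResolver) {
        // Collect only the hosts that still need a lookup.
        List<String> unresolved = new ArrayList<>();
        for (Map.Entry<String, String> entry : hostToLocation.entrySet()) {
            String location = entry.getValue();
            if (location == null || location.isEmpty() || DEFAULT_RACK.equals(location)) {
                unresolved.add(entry.getKey());
            }
        }

        Map<String, String> result = new LinkedHashMap<>(hostToLocation);
        if (unresolved.isEmpty()) {
            return result; // nothing to resolve, so the resolver is never called
        }

        // One batch call covering only the unresolved hosts.
        List<String> resolved = batchResolver.apply(unresolved);
        for (int i = 0; i < unresolved.size(); i++) {
            String location =
                (resolved != null && i < resolved.size()) ? resolved.get(i) : null;
            result.put(unresolved.get(i), location != null ? location : DEFAULT_RACK);
        }
        return result;
    }
}
```

Workers that already carry a location from registration are left untouched, so the restore path pays for lookups only where the information is actually missing.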

@akpatnam25
Contributor Author

akpatnam25 commented Mar 11, 2024

cc @RexXiong @FMX @pan3793, could you also help review? Thanks.

@akpatnam25
Contributor Author

cc @SteNicholas

@pan3793
Member

pan3793 commented Mar 12, 2024

cc @zwangsheng

@waitinfuture
Contributor

cc @AngersZhuuuu

Contributor

@FMX FMX left a comment

LGTM. After some checks, this PR is correct and won't block clusters for rolling upgrades.

Contributor

@waitinfuture waitinfuture left a comment

Generally LGTM, thanks!

Contributor

@zwangsheng zwangsheng left a comment

LGTM on the design, though a few points confuse me.

Contributor

@AngersZhuuuu AngersZhuuuu left a comment

LGTM except for one minor comment.

@akpatnam25
Contributor Author

Addressed the review comments. cc @waitinfuture @AngersZhuuuu @FMX @zwangsheng

@waitinfuture
Contributor

Thanks, merging to main (v0.5.0).

FMX pushed a commit that referenced this pull request Aug 14, 2024
### What changes were proposed in this pull request?
Fixes a bug where `networkLocation` is not persisted in Ratis, so the master defaults to `DEFAULT_RACK` when it loads the snapshot. This was missed in #2367 and surfaced during our internal stress testing.

### Why are the changes needed?
Custom network-aware replication needs the `networkLocation` state to be kept in the snapshot file.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Updated the unit test to verify that serialization and deserialization (serde) are correct.

Closes #2669 from akpatnam25/CELEBORN-1549.

Authored-by: Aravind Patnam <apatnam@linkedin.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
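
The kind of round-trip check this follow-up commit mentions might look like the sketch below. It is not the project's test or its serialization format: plain `java.io` data streams stand in for whatever the snapshot actually uses, and `WorkerMetaSketch` with its two fields is a hypothetical, cut-down worker record.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical, cut-down worker record used only for this sketch.
class WorkerMetaSketch {
    final String host;
    final String networkLocation;

    WorkerMetaSketch(String host, String networkLocation) {
        this.host = host;
        this.networkLocation = networkLocation;
    }

    // Serialize every field, including networkLocation. Leaving that write
    // out is the kind of omission that makes a reload lose the rack.
    byte[] toBytes() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(host);
            out.writeUTF(networkLocation);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Read the fields back in the same order they were written.
    static WorkerMetaSketch fromBytes(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            String host = in.readUTF();
            String networkLocation = in.readUTF();
            return new WorkerMetaSketch(host, networkLocation);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A test built on this would serialize a record with a non-default location, deserialize it, and assert that `networkLocation` comes back unchanged rather than as the default rack.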
gotikkoxq added a commit to gotikkoxq/celeborn that referenced this pull request Aug 26, 2024