-
Notifications
You must be signed in to change notification settings - Fork 351
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CELEBORN-1313] Custom Network Location Aware Replication #2367
Conversation
cc @waitinfuture @otterc @mridulm can you help review? Thanks. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2367 +/- ##
==========================================
- Coverage 48.85% 48.83% -0.01%
==========================================
Files 208 209 +1
Lines 12984 13089 +105
Branches 1115 1133 +18
==========================================
+ Hits 6342 6391 +49
- Misses 6232 6278 +46
- Partials 410 420 +10 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Modify use of resolveToMap
in AbstractMetaManager.restoreMetaFromFile
as well to query only those which have an unresolved rack ?
worker/src/main/scala/org/apache/celeborn/service/deploy/worker/Worker.scala
Show resolved
Hide resolved
...src/main/java/org/apache/celeborn/service/deploy/master/clustermeta/AbstractMetaManager.java
Outdated
Show resolved
Hide resolved
cc @SteNicholas |
cc @zwangsheng |
common/src/test/scala/org/apache/celeborn/common/network/CelebornRackResolverSuite.scala
Show resolved
Hide resolved
...src/main/java/org/apache/celeborn/service/deploy/master/clustermeta/AbstractMetaManager.java
Show resolved
Hide resolved
...src/main/java/org/apache/celeborn/service/deploy/master/clustermeta/AbstractMetaManager.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. After some checks, this PR is correct and won't block clusters for rolling upgrades.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally LGTM, thanks!
...src/main/java/org/apache/celeborn/service/deploy/master/clustermeta/AbstractMetaManager.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM for design, with some confuse.
...src/main/java/org/apache/celeborn/service/deploy/master/clustermeta/AbstractMetaManager.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except one minor comment
worker/src/main/scala/org/apache/celeborn/service/deploy/worker/Worker.scala
Show resolved
Hide resolved
addressed the review comments cc @waitinfuture @AngersZhuuuu @FMX @zwangsheng |
Thanks, merging to main(v0.5.0) |
### What changes were proposed in this pull request? Fixing a bug where the `networkLocation` is not persisted in Ratis, and the master defaults to `DEFAULT_RACK` when it loads the snapshot. This was missed in #2367 unfortunately, and it came up during our stress testing internally. ### Why are the changes needed? Needed for custom network aware replication, so that networkLocation state is kept in snapshot file. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Updated unit test to ensure serde is correct. Closes #2669 from akpatnam25/CELEBORN-1549. Authored-by: Aravind Patnam <apatnam@linkedin.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request? Fixing a bug where the `networkLocation` is not persisted in Ratis, and the master defaults to `DEFAULT_RACK` when it loads the snapshot. This was missed in apache/celeborn#2367 unfortunately, and it came up during our stress testing internally. ### Why are the changes needed? Needed for custom network aware replication, so that networkLocation state is kept in snapshot file. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Updated unit test to ensure serde is correct. Closes #2669 from akpatnam25/CELEBORN-1549. Authored-by: Aravind Patnam <apatnam@linkedin.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
What changes were proposed in this pull request?
Enable custom network location aware replication, based on a custom impl of
DNSToSwitchMapping
.Why are the changes needed?
Resolution of network location of multiple workers at master can be expensive at times. This way, each worker resolves its own network location and sends to master via the RegisterWorker transport message. If worker cannot resolve, fallback to attempting to resolve at master (during update meta or reload of snapshot). Proposal: Celeborn Custom Network Location Aware Replication
Does this PR introduce any user-facing change?
No
How was this patch tested?
Updated the unit tests.