
[SPARK-32077][CORE] Support host-local shuffle data reading when external shuffle service is disabled #28911

Closed
wants to merge 48 commits

Conversation

Ngone51
Member

@Ngone51 Ngone51 commented Jun 23, 2020

What changes were proposed in this pull request?

This PR adds support to read host-local shuffle data from disk directly when external shuffle service is disabled.

Similar to #25299, we first try to get the local disk directories for the shuffle data, which is located on the same host as the current executor. The only difference is that #25299 gets the directories from the external shuffle service, while this PR gets them from the executors themselves.

To implement the feature, this PR extends HostLocalDirManager to work with both ExternalBlockStoreClient and NettyBlockTransferService. It also adds getHostLocalDirs to NettyBlockTransferService, mirroring ExternalBlockStoreClient, in order to send the get-dirs request to the corresponding executor. For simplicity, the existing request message GetLocalDirsForExecutors is reused.
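
For illustration, here is a rough, self-contained sketch of the idea (this is not the actual Spark internals; HypotheticalBlockId, PeerExecutor, resolveShuffleFile, and the directory layout are simplified stand-ins): an executor asks a peer executor on the same host for its local dirs once, then resolves and reads the shuffle file straight from disk instead of fetching it over the network.

import java.io.File

// Stand-in for a shuffle block identifier; the real naming scheme lives in Spark's BlockId.
final case class HypotheticalBlockId(shuffleId: Int, mapId: Long, reduceId: Int) {
  def name: String = s"shuffle_${shuffleId}_${mapId}_${reduceId}.data"
}

// Stand-in for the RPC that would carry a GetLocalDirsForExecutors-style request
// to an executor running on the same host and return its configured local dirs.
trait PeerExecutor {
  def getLocalDirs(executorId: String): Array[String]
}

object HostLocalReadSketch {
  // Placeholder layout: one hashed sub-directory level under each local dir.
  // Spark's real hashing scheme is more involved; this only illustrates the lookup.
  def resolveShuffleFile(localDirs: Array[String], blockId: HypotheticalBlockId): File = {
    val hash = math.abs(blockId.name.hashCode)
    val dir = localDirs(hash % localDirs.length)
    val subDir = f"${hash % 64}%02x"
    new File(new File(dir, subDir), blockId.name)
  }

  // Ask the peer executor for its dirs (directories only, not block data),
  // then read the host-local block from disk; fall back to a remote fetch if missing.
  def readHostLocal(peer: PeerExecutor, execId: String, blockId: HypotheticalBlockId): Option[File] = {
    val dirs = peer.getLocalDirs(execId)
    val file = resolveShuffleFile(dirs, blockId)
    if (file.exists()) Some(file) else None
  }
}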

Why are the changes needed?

After SPARK-27651 / #25299, Spark can read host-local shuffle data directly from disk when the external shuffle service is enabled. To extend the feature, we can also support it when the external shuffle service is disabled.

Does this PR introduce any user-facing change?

Yes. Before this PR, to use the host-local shuffle reading feature, users had to enable not only spark.shuffle.readHostLocalDisk but also spark.shuffle.service.enabled. After this PR, enabling spark.shuffle.readHostLocalDisk is enough, and the external shuffle service is no longer a prerequisite.
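
For illustration only, enabling the feature after this PR could look like the sketch below (the master URL and app name are placeholders for a setup that runs multiple executors on one host):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local-cluster[2, 1, 1024]")           // placeholder: two executors on one host
  .setAppName("host-local-shuffle-read-demo")
  .set("spark.shuffle.readHostLocalDisk", "true")   // the only switch needed after this PR
  // .set("spark.shuffle.service.enabled", "true")  // no longer required for this feature

val sc = new SparkContext(conf)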

How was this patch tested?

Added a test and tested manually.

@Ngone51
Member Author

Ngone51 commented Jun 23, 2020

@attilapiros @tgravescs @jiangxb1987 Please take a look, thanks!

@SparkQA

SparkQA commented Jun 23, 2020

Test build #124426 has finished for PR 28911 at commit 0d62ccb.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Contributor

thanks for working on this, I was interested in this as well. Can you update the description to include details on your overall approach - where do you get the directories from, etc.?

@Ngone51
Member Author

Ngone51 commented Jun 24, 2020

@tgravescs updated the description, thanks!

@Ngone51
Member Author

Ngone51 commented Jun 24, 2020

Jenkins, retest this please.

@SparkQA

SparkQA commented Jun 24, 2020

Test build #124454 has finished for PR 28911 at commit 0d62ccb.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 30, 2020

Test build #124662 has finished for PR 28911 at commit 446780a.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51
Member Author

Ngone51 commented Jul 2, 2020

I've updated the PR. Could you take another look?

@SparkQA

SparkQA commented Jul 2, 2020

Test build #124905 has finished for PR 28911 at commit da14484.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51
Member Author

Ngone51 commented Jul 9, 2020

Jenkins, retest this please.

@SparkQA

SparkQA commented Jul 9, 2020

Test build #125437 has finished for PR 28911 at commit da14484.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mridulm
Contributor

mridulm commented Jul 9, 2020

@Ngone51 I am still catching up on the changes; as part of #25299 or subsequently (or here), are we updating preferred locality for shuffle tasks to account for the ability to do node-local reads?
Essentially, all shuffle blocks on a node (irrespective of executor) should be treated with equal locality preference for computing pref locality for shuffle tasks.

Contributor

@tgravescs tgravescs left a comment

took a quick look and had a few comments, overall approach seems fine. I need to take a more in depth review.

@Ngone51
Member Author

Ngone51 commented Jul 9, 2020

@mridulm We don't, and there's no need to. The current implementation of getPreferredLocationsForShuffle already gives blocks on the same node the same locality preference (see L617):

def getPreferredLocationsForShuffle(dep: ShuffleDependency[_, _, _], partitionId: Int)
    : Seq[String] = {
  if (shuffleLocalityEnabled && dep.rdd.partitions.length < SHUFFLE_PREF_MAP_THRESHOLD &&
      dep.partitioner.numPartitions < SHUFFLE_PREF_REDUCE_THRESHOLD) {
    val blockManagerIds = getLocationsWithLargestOutputs(dep.shuffleId, partitionId,
      dep.partitioner.numPartitions, REDUCER_PREF_LOCS_FRACTION)
    if (blockManagerIds.nonEmpty) {
      blockManagerIds.get.map(_.host)
    } else {
      Nil
    }
  } else {
    Nil
  }
}

@Ngone51
Member Author

Ngone51 commented Jul 9, 2020

@tgravescs Thanks for the review. I'll try to address your comments tomorrow.

@mridulm
Contributor

mridulm commented Jul 9, 2020 via email

@Ngone51
Member Author

Ngone51 commented Jul 10, 2020

Ah, I get your point, and I can imagine how it may affect the current locality preference. Let's take an example to see if we're on the same page.

For example, say we have executor1 and executor2 on node1, and executor3 and executor4 on node2. There are 10 bytes of shuffle data on executor1 and executor2, from task1 and task2 respectively, and 40 bytes of shuffle data on executor3 and executor4, from task3 and task4 respectively. (Assume all the shuffle data is for the same reduce partition.)

With the current implementation of getLocationsWithLargestOutputs, we only count an executor's host as a preferred location when [shuffle data for a certain reduce partition on this executor] / [total shuffle data] >= fractionThreshold (default 0.2). So, in this case, only node2 is considered a preferred location, because 40 / (10 + 10 + 40 + 40) = 0.4 >= 0.2, while node1 is not, because 10 / (10 + 10 + 40 + 40) = 0.1 < 0.2.

However, node1 could also be a preferred location if we aggregated the size of the shuffle data on the same host, since (10 + 10) / (10 + 10 + 40 + 40) = 0.2 >= 0.2.
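
For illustration, the arithmetic above as a small Scala sketch (the sizes and the 0.2 threshold come from the example; the per-host grouping is the hypothetical change being discussed, not current behaviour):

val sizes = Map(
  ("node1", "executor1") -> 10L,
  ("node1", "executor2") -> 10L,
  ("node2", "executor3") -> 40L,
  ("node2", "executor4") -> 40L)
val total = sizes.values.sum.toDouble    // 100 bytes in total
val threshold = 0.2                      // the fraction threshold from the example

// Per-executor fractions: 0.1, 0.1, 0.4, 0.4 -> only node2's executors pass the threshold.
val perExecutor = sizes.map { case ((host, exec), size) => (host, exec, size / total) }

// Per-host fractions: node1 -> 0.2, node2 -> 0.8 -> node1 would now also qualify.
val perHost = sizes.groupBy(_._1._1).map { case (host, m) => host -> m.values.sum / total }
val preferredHosts = perHost.collect { case (host, f) if f >= threshold => host }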

It looks reasonable to me. cc @attilapiros @tgravescs @jiangxb1987 @holdenk Any ideas?

@mridulm
Contributor

mridulm commented Jul 10, 2020

The fix for this need not necessarily come in this PR; it can be a follow-up feature addition.
Note that host-local shuffle reads across executors on a node will really benefit only when locality preference also accounts for them; until then, the potential benefits will be reduced.

The solution is fairly straightforward given the existing implementation of getLocationsWithLargestOutputs: when aggregating, aggregate by host instead of by block manager id whenever local reads across executors on a node are possible. This PR and #25299 are candidates for when this can be enabled (with suitable flag checks, etc.).
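
A hedged sketch of that aggregation switch, with simplified stand-in types instead of Spark's MapStatus/BlockManagerId, just to show where the grouping key changes:

final case class ExecLocation(host: String, executorId: String)

// Return the hosts whose share of a reduce partition's shuffle output meets the threshold.
// When host-local reads across executors are possible, aggregate sizes by host;
// otherwise keep the existing per-executor (per block manager) behaviour.
def preferredHosts(
    sizes: Map[ExecLocation, Long],
    fractionThreshold: Double,
    aggregateByHost: Boolean): Seq[String] = {
  val total = sizes.values.sum.toDouble
  if (total == 0) return Nil
  if (aggregateByHost) {
    sizes.groupBy(_._1.host)
      .collect { case (host, m) if m.values.sum / total >= fractionThreshold => host }
      .toSeq
  } else {
    sizes.collect { case (loc, size) if size / total >= fractionThreshold => loc.host }
      .toSeq.distinct
  }
}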

@SparkQA

SparkQA commented Jul 10, 2020

Test build #125589 has finished for PR 28911 at commit 1126341.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Contributor

holdenk commented Jul 17, 2020

Personally, I'd save the locality changes for a follow-up PR. Making changes in core is pretty hard, so as long as we have a JIRA and it's a good incremental chunk of work, keeping it smaller for review (and for a potential revert if something goes wrong) is better. (Of course there are situations where that isn't possible, but I think changing the locality calculations would be strictly additive.)

Comment on lines 70 to 71
* (when this is a NettyBlockTransferService). Note there's only one executor when this is a
* NettyBlockTransferService because we ask one specific executor at a time.
Contributor

Can you clarify the last sentence here?

Contributor

Oh I got this.

When the external shuffle service is targeted by this request, we can collect the local dirs of multiple executors at once (all the host-local dirs are available in the external shuffle service running on the host, as it is the central component in this sense on that host).

But here we can request the local dirs for only one executor: the one which handles the request itself.

@Ngone51 what about adding an assert here:


Checking that the array contains only one executor ID and that it is equal to the executorId of the blockManager.
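
A minimal sketch of the suggested assertion (the method name is illustrative; as the reply below explains, only the first check was ultimately added, because NettyBlockRpcServer only sees a BlockDataManager and cannot read the executor id):

def checkGetLocalDirsRequest(requestedExecIds: Array[String], ownExecutorId: String): Unit = {
  assert(requestedExecIds.length == 1,
    s"Expected exactly one executor id but got ${requestedExecIds.length}")
  assert(requestedExecIds.head == ownExecutorId,
    s"Expected executor id $ownExecutorId but got ${requestedExecIds.head}")
}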

Member Author

I added the check to ensure there's only one executor id, but didn't check its equality with the blockManager's executor id, because we only have BlockDataManager in NettyBlockRpcServer, which does not expose the executor id.

I am still wondering whether it's worthwhile to expose it just for this sanity check.

@Ngone51
Member Author

Ngone51 commented Jul 20, 2020

Thank you for the review. I'll try to address the comments tomorrow!

@Ngone51 Ngone51 force-pushed the support_node_local_shuffle branch from 1126341 to bcb6012 Compare July 21, 2020 06:49
@SparkQA

SparkQA commented Jul 21, 2020

Test build #126227 has finished for PR 28911 at commit bcb6012.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class BlockTransferService extends BlockStoreClient

@Ngone51
Member Author

Ngone51 commented Jul 21, 2020

Jenkins, retest this please.

@SparkQA

SparkQA commented Jul 21, 2020

Test build #126234 has finished for PR 28911 at commit bcb6012.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class BlockTransferService extends BlockStoreClient

@dongjoon-hyun
Member

If you want, sure! @holdenk.

@SparkQA

SparkQA commented Aug 30, 2020

Test build #128022 has finished for PR 28911 at commit 2aa71f6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@attilapiros
Contributor

@Ngone51 The PR description mentions disabled dynamic allocation as a requirement, but this was changed as a result of a review finding. Could you please update it?

@Ngone51
Member Author

Ngone51 commented Aug 31, 2020

Thank you @dongjoon-hyun for the detailed review. It helps a lot to improve the PR.

@Ngone51
Member Author

Ngone51 commented Aug 31, 2020

@holdenk Sure, please feel free to add any comments!

@Ngone51
Member Author

Ngone51 commented Aug 31, 2020

@attilapiros Updated, thanks for the reminder!

@SparkQA

SparkQA commented Aug 31, 2020

Test build #128093 has finished for PR 28911 at commit 6b97be5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class HostLocalShuffleReadingSuite extends SparkFunSuite with Matchers with LocalSparkContext

Contributor

@holdenk holdenk left a comment

Two minor points of clarification, but no blocking concerns from me after this review.

@SparkQA

SparkQA commented Sep 1, 2020

Test build #128139 has finished for PR 28911 at commit a23ab17.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Retest this please.

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Merged to master for Apache Spark 3.1.0 in December 2020.
Thank you, @Ngone51 and all.

@Ngone51
Member Author

Ngone51 commented Sep 3, 2020

Thank you all!!

wangyum pushed a commit that referenced this pull request May 26, 2023
[SPARK-32077][CORE] Support host-local shuffle data reading when external shuffle service is disabled


Closes #28911 from Ngone51/support_node_local_shuffle.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>