
[SPARK-32077][CORE] Support host-local shuffle data reading when external shuffle service is disabled #28911

Closed
wants to merge 48 commits

Conversation

Ngone51
Member

@Ngone51 Ngone51 commented Jun 23, 2020

What changes were proposed in this pull request?

This PR adds support to read host-local shuffle data from disk directly when external shuffle service is disabled.

Similar to #25299, we first try to get the local disk directories for the shuffle data, which is located on the same host as the current executor. The only difference is that #25299 gets the directories from the external shuffle service, while this PR gets them from the executors themselves.

To implement the feature, this PR extends HostLocalDirManager to work with both ExternalBlockStoreClient and NettyBlockTransferService. It also adds getHostLocalDirs to NettyBlockTransferService, mirroring ExternalBlockStoreClient, in order to send the get-dirs request to the corresponding executor. For simplicity, the existing request message GetLocalDirsForExecutors is reused.
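
For illustration, here is a rough, self-contained sketch of the idea (this is not the actual Spark internals; HypotheticalBlockId, PeerExecutor, resolveShuffleFile, and the directory layout are simplified stand-ins): an executor asks a peer executor on the same host for its local dirs once, then resolves and reads the shuffle file straight from disk instead of fetching it over the network.

import java.io.File

// Stand-in for a shuffle block identifier; the real naming scheme lives in Spark's BlockId.
final case class HypotheticalBlockId(shuffleId: Int, mapId: Long, reduceId: Int) {
  def name: String = s"shuffle_${shuffleId}_${mapId}_${reduceId}.data"
}

// Stand-in for the RPC that would carry a GetLocalDirsForExecutors-style request
// to an executor running on the same host and return its configured local dirs.
trait PeerExecutor {
  def getLocalDirs(executorId: String): Array[String]
}

object HostLocalReadSketch {
  // Placeholder layout: one hashed sub-directory level under each local dir.
  // Spark's real hashing scheme is more involved; this only illustrates the lookup.
  def resolveShuffleFile(localDirs: Array[String], blockId: HypotheticalBlockId): File = {
    val hash = math.abs(blockId.name.hashCode)
    val dir = localDirs(hash % localDirs.length)
    val subDir = f"${hash % 64}%02x"
    new File(new File(dir, subDir), blockId.name)
  }

  // Ask the peer executor for its dirs (directories only, not block data),
  // then read the host-local block from disk; fall back to a remote fetch if missing.
  def readHostLocal(peer: PeerExecutor, execId: String, blockId: HypotheticalBlockId): Option[File] = {
    val dirs = peer.getLocalDirs(execId)
    val file = resolveShuffleFile(dirs, blockId)
    if (file.exists()) Some(file) else None
  }
}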

Why are the changes needed?

After SPARK-27651 / #25299, Spark can read host-local shuffle data directly from disk when the external shuffle service is enabled. To extend the feature, we can also support it when the external shuffle service is disabled.

Does this PR introduce any user-facing change?

Yes. Before this PR, to use the host-local shuffle reading feature, users had to enable not only spark.shuffle.readHostLocalDisk but also spark.shuffle.service.enabled. After this PR, enabling spark.shuffle.readHostLocalDisk is enough, and the external shuffle service is no longer a prerequisite.
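
For illustration only, enabling the feature after this PR could look like the sketch below (the master URL and app name are placeholders for a setup that runs multiple executors on one host):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local-cluster[2, 1, 1024]")           // placeholder: two executors on one host
  .setAppName("host-local-shuffle-read-demo")
  .set("spark.shuffle.readHostLocalDisk", "true")   // the only switch needed after this PR
  // .set("spark.shuffle.service.enabled", "true")  // no longer required for this feature

val sc = new SparkContext(conf)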

How was this patch tested?

Added a test and tested manually.

@Ngone51
Member Author

Ngone51 commented Jun 23, 2020

@attilapiros @tgravescs @jiangxb1987 Please take a look, thanks!

@SparkQA

SparkQA commented Jun 23, 2020

Test build #124426 has finished for PR 28911 at commit 0d62ccb.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Contributor

thanks for working on this, I was interested in this as well. Can you update the description to include details on your overall approach - where do you get the directories from, etc.?

@Ngone51
Member Author

Ngone51 commented Jun 24, 2020

@tgravescs updated the description, thanks!

@Ngone51
Member Author

Ngone51 commented Jun 24, 2020

Jenkins, retest this please.

@SparkQA

SparkQA commented Jun 24, 2020

Test build #124454 has finished for PR 28911 at commit 0d62ccb.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 30, 2020

Test build #124662 has finished for PR 28911 at commit 446780a.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51
Member Author

Ngone51 commented Jul 2, 2020

I've updated the PR. Could you take another look?

@SparkQA

SparkQA commented Jul 2, 2020

Test build #124905 has finished for PR 28911 at commit da14484.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51
Member Author

Ngone51 commented Jul 9, 2020

Jenkins, retest this please.

@SparkQA

SparkQA commented Jul 9, 2020

Test build #125437 has finished for PR 28911 at commit da14484.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mridulm
Contributor

mridulm commented Jul 9, 2020

@Ngone51 I am still catching up on the changes; as part of #25299 or subsequently (or here), are we updating preferred locality for shuffle tasks to account for the ability to do node-local reads?
Essentially, all shuffle blocks on a node (irrespective of executor) should be treated with equal locality preference for computing pref locality for shuffle tasks.

Contributor

@tgravescs tgravescs left a comment

took a quick look and had a few comments, overall approach seems fine. I need to take a more in depth review.

@Ngone51
Member Author

Ngone51 commented Jul 9, 2020

@mridulm We don't, and there's no need to. The current implementation of getPreferredLocationsForShuffle already gives blocks on the same node the same locality preference (see L617):

def getPreferredLocationsForShuffle(dep: ShuffleDependency[_, _, _], partitionId: Int)
    : Seq[String] = {
  if (shuffleLocalityEnabled && dep.rdd.partitions.length < SHUFFLE_PREF_MAP_THRESHOLD &&
      dep.partitioner.numPartitions < SHUFFLE_PREF_REDUCE_THRESHOLD) {
    val blockManagerIds = getLocationsWithLargestOutputs(dep.shuffleId, partitionId,
      dep.partitioner.numPartitions, REDUCER_PREF_LOCS_FRACTION)
    if (blockManagerIds.nonEmpty) {
      blockManagerIds.get.map(_.host)
    } else {
      Nil
    }
  } else {
    Nil
  }
}

@Ngone51
Member Author

Ngone51 commented Jul 9, 2020

@tgravescs Thanks for the review. I'll try to address your comments tomorrow.

@mridulm
Contributor

mridulm commented Jul 9, 2020 via email

@Ngone51
Member Author

Ngone51 commented Jul 10, 2020

Ah, I get your point, and I can imagine how it may affect the current locality preference. Let's take an example to see if we're on the same page.

For example, say we have executor1 and executor2 on node1, and executor3 and executor4 on node2. There are 10 bytes of shuffle data on executor1 and executor2, from task1 and task2 respectively, and 40 bytes of shuffle data on executor3 and executor4, from task3 and task4 respectively. (Assume all the shuffle data is for the same reduce partition.)

With the current implementation of getLocationsWithLargestOutputs, we only count an executor's host as a preferred location when [shuffle data for a certain reduce partition on this executor] / [total shuffle data] >= fractionThreshold (default 0.2). So, in this case, only node2 is considered a preferred location, because 40 / (10 + 10 + 40 + 40) = 0.4 >= 0.2, while node1 is not, because 10 / (10 + 10 + 40 + 40) = 0.1 < 0.2.

However, node1 could also be a preferred location if we aggregated the size of the shuffle data on the same host, since (10 + 10) / (10 + 10 + 40 + 40) = 0.2 >= 0.2.
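
For illustration, the arithmetic above as a small Scala sketch (the sizes and the 0.2 threshold come from the example; the per-host grouping is the hypothetical change being discussed, not current behaviour):

val sizes = Map(
  ("node1", "executor1") -> 10L,
  ("node1", "executor2") -> 10L,
  ("node2", "executor3") -> 40L,
  ("node2", "executor4") -> 40L)
val total = sizes.values.sum.toDouble    // 100 bytes in total
val threshold = 0.2                      // the fraction threshold from the example

// Per-executor fractions: 0.1, 0.1, 0.4, 0.4 -> only node2's executors pass the threshold.
val perExecutor = sizes.map { case ((host, exec), size) => (host, exec, size / total) }

// Per-host fractions: node1 -> 0.2, node2 -> 0.8 -> node1 would now also qualify.
val perHost = sizes.groupBy(_._1._1).map { case (host, m) => host -> m.values.sum / total }
val preferredHosts = perHost.collect { case (host, f) if f >= threshold => host }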

It looks reasonable to me. cc @attilapiros @tgravescs @jiangxb1987 @holdenk Any ideas?

@mridulm
Contributor

mridulm commented Jul 10, 2020

The fix for this need not necessarily come in this PR; it can be a follow-up feature addition.
Note that host-local shuffle reads across executors on a node will really benefit only when locality preference also accounts for them; until then, the potential benefits will be reduced.

The solution is fairly straightforward given the existing implementation of getLocationsWithLargestOutputs: when aggregating, aggregate by host instead of by block manager id whenever local reads across executors on a node are possible. This PR and #25299 are candidates for when this can be enabled (with suitable flag checks, etc.).
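
A hedged sketch of that aggregation switch, with simplified stand-in types instead of Spark's MapStatus/BlockManagerId, just to show where the grouping key changes:

final case class ExecLocation(host: String, executorId: String)

// Return the hosts whose share of a reduce partition's shuffle output meets the threshold.
// When host-local reads across executors are possible, aggregate sizes by host;
// otherwise keep the existing per-executor (per block manager) behaviour.
def preferredHosts(
    sizes: Map[ExecLocation, Long],
    fractionThreshold: Double,
    aggregateByHost: Boolean): Seq[String] = {
  val total = sizes.values.sum.toDouble
  if (total == 0) return Nil
  if (aggregateByHost) {
    sizes.groupBy(_._1.host)
      .collect { case (host, m) if m.values.sum / total >= fractionThreshold => host }
      .toSeq
  } else {
    sizes.collect { case (loc, size) if size / total >= fractionThreshold => loc.host }
      .toSeq.distinct
  }
}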

@SparkQA

SparkQA commented Jul 10, 2020

Test build #125589 has finished for PR 28911 at commit 1126341.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Contributor

holdenk commented Jul 17, 2020

Personally, I'd save the locality changes for a follow-up PR. Making changes in core is pretty hard, so as long as we have a JIRA and it's a good incremental chunk of work, keeping it smaller for review (and for a potential revert if something goes wrong) is better. (Of course there are situations where that isn't possible, but I think changing the locality calculations would be strictly additive.)

Comment on lines 70 to 71
* (when this is a NettyBlockTransferService). Note there's only one executor when this is a
* NettyBlockTransferService because we ask one specific executor at a time.
Contributor

Can you clarify the last sentence here?

Contributor

Oh I got this.

When the external shuffle service is targeted by this request, we can collect the local dirs of multiple executors at once (all the host-local dirs are available in the external shuffle service running on the host, as it is the central component in this sense on that host).

But here we can request the local dirs for only one executor: the one which handles the request itself.

@Ngone51 what about adding an assert here:


Checking that the array contains only one executor ID and that it is equal to the executorId of the blockManager.
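
A minimal sketch of the suggested assertion (the method name is illustrative; as the reply below explains, only the first check was ultimately added, because NettyBlockRpcServer only sees a BlockDataManager and cannot read the executor id):

def checkGetLocalDirsRequest(requestedExecIds: Array[String], ownExecutorId: String): Unit = {
  assert(requestedExecIds.length == 1,
    s"Expected exactly one executor id but got ${requestedExecIds.length}")
  assert(requestedExecIds.head == ownExecutorId,
    s"Expected executor id $ownExecutorId but got ${requestedExecIds.head}")
}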

Member Author

I added the check to ensure there's only one executor id, but didn't check its equality with the blockManager's executor id, because we only have BlockDataManager in NettyBlockRpcServer, which does not expose the executor id.

I am still wondering whether it's worthwhile to expose it just for this sanity check.

@Ngone51
Member Author

Ngone51 commented Jul 20, 2020

Thank you for the review. I'll try to address the comments tomorrow!

@Ngone51 Ngone51 force-pushed the support_node_local_shuffle branch from 1126341 to bcb6012 Compare July 21, 2020 06:49
@SparkQA

SparkQA commented Jul 21, 2020

Test build #126227 has finished for PR 28911 at commit bcb6012.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class BlockTransferService extends BlockStoreClient

@Ngone51
Member Author

Ngone51 commented Jul 21, 2020

Jenkins, retest this please.

@SparkQA

SparkQA commented Jul 21, 2020

Test build #126234 has finished for PR 28911 at commit bcb6012.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class BlockTransferService extends BlockStoreClient

@dongjoon-hyun
Member

If you want, sure! @holdenk.

@SparkQA

SparkQA commented Aug 30, 2020

Test build #128022 has finished for PR 28911 at commit 2aa71f6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@attilapiros
Contributor

@Ngone51 The PR description mentions disabled dynamic allocation as a requirement, but this was changed as a result of a review finding. Could you please update it?

@Ngone51
Member Author

Ngone51 commented Aug 31, 2020

Thank you @dongjoon-hyun for the detailed review. It helps a lot to improve the PR.

@Ngone51
Member Author

Ngone51 commented Aug 31, 2020

@holdenk Sure, please feel free to add any comments!

@Ngone51
Member Author

Ngone51 commented Aug 31, 2020

@attilapiros Updated, thanks for the reminder!

@SparkQA

SparkQA commented Aug 31, 2020

Test build #128093 has finished for PR 28911 at commit 6b97be5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class HostLocalShuffleReadingSuite extends SparkFunSuite with Matchers with LocalSparkContext

Contributor

@holdenk holdenk left a comment

Two minor points of clarification, but no blocking concerns from me after this review.

@SparkQA

SparkQA commented Sep 1, 2020

Test build #128139 has finished for PR 28911 at commit a23ab17.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Retest this please.

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Merged to master for Apache Spark 3.1.0 in December 2020.
Thank you, @Ngone51 and all.

@Ngone51
Member Author

Ngone51 commented Sep 3, 2020

Thank you all!!

wangyum pushed a commit that referenced this pull request May 26, 2023
[SPARK-32077][CORE] Support host-local shuffle data reading when external shuffle service is disabled


Closes #28911 from Ngone51/support_node_local_shuffle.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>