
[SPARK-44635][CORE] Handle shuffle fetch failures in decommissions #42296

Closed
wants to merge 7 commits

Conversation

bozhang2820
Contributor

@bozhang2820 bozhang2820 commented Aug 2, 2023

What changes were proposed in this pull request?

This change tries to handle shuffle fetch failures due to decommissions.

When encountering a fetch failure for a block, ShuffleBlockFetcherIterator will look up the latest map output location (BlockManagerId) for it and check whether the location has changed. When it has, the fetch failure is ignored and the updated location and the block ID are recorded. At the end of the fetch request (once all blocks in the request have been processed), new requests are assembled and enqueued to retry the fetches for those blocks from their updated locations.
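
A simplified, self-contained sketch of that per-request bookkeeping (the types and callbacks below are placeholders for illustration only, not Spark's actual classes or methods):

    import scala.collection.mutable

    // Placeholder types standing in for Spark's BlockId / BlockManagerId.
    case class Block(shuffleId: Int, mapId: Long, reduceId: Int)
    case class Location(host: String, port: Int)

    // Per-fetch-request bookkeeping: failed blocks whose location changed are
    // deferred, and retries are enqueued once every block has been processed.
    class DeferredRetryBookkeeping(
        blocks: Seq[Block],
        refreshLocation: Block => Option[Location],      // stand-in for the location refresh
        enqueueFetch: (Location, Seq[Block]) => Unit) {  // stand-in for enqueueing a FetchRequest

      private val remaining = mutable.HashSet[Block]() ++= blocks
      private val deferred = mutable.HashMap[Location, mutable.ArrayBuffer[Block]]()

      // Called when one block in the request fails to fetch.
      // Returns true if the failure was swallowed because the location changed.
      def onFetchFailure(block: Block, oldLocation: Location): Boolean = {
        refreshLocation(block) match {
          case Some(newLocation) if newLocation != oldLocation =>
            remaining -= block
            deferred.getOrElseUpdate(newLocation, mutable.ArrayBuffer[Block]()) += block
            maybeEnqueueDeferred()
            true
          case _ =>
            false // unchanged or unknown location: surface the failure as before
        }
      }

      // Called when one block in the request succeeds.
      def onFetchSuccess(block: Block): Unit = {
        remaining -= block
        maybeEnqueueDeferred()
      }

      // Once every block in the original request is accounted for,
      // assemble one retry request per refreshed location.
      private def maybeEnqueueDeferred(): Unit = {
        if (remaining.isEmpty && deferred.nonEmpty) {
          deferred.foreach { case (loc, blks) => enqueueFetch(loc, blks.toSeq) }
          deferred.clear()
        }
      }
    }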

Why are the changes needed?

This is to improve stability when decommission is enabled.

Does this PR introduce any user-facing change?

This change comes with a feature flag spark.storage.decommission.shuffleBlocks.refreshLocationsEnabled.
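
For illustration only (assuming the standard decommission config keys), the new flag would be enabled together with the existing decommission settings, e.g.:

    import org.apache.spark.SparkConf

    // Illustrative only: enable the new refresh flag alongside the existing
    // storage-decommission shuffle block migration settings.
    val conf = new SparkConf()
      .set("spark.decommission.enabled", "true")
      .set("spark.storage.decommission.enabled", "true")
      .set("spark.storage.decommission.shuffleBlocks.enabled", "true")
      .set("spark.storage.decommission.shuffleBlocks.refreshLocationsEnabled", "true")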

How was this patch tested?

Added a new unit test.

@github-actions github-actions bot added the CORE label Aug 2, 2023
currentLocationOpt = getMapOutputLocation(shuffleId, mapId)
}
if (currentLocationOpt.isEmpty) {
throw new MetadataUpdateFailedException(shuffleId, mapId,
Member

Could you reuse MetadataFetchFailedException? We can use the message field to distinguish the error case.

@mridulm
Contributor

mridulm commented Aug 3, 2023

+CC @otterc

// Try to get the cached location first in case other concurrent tasks
// fetched the fresh location already
var currentLocationOpt = getMapOutputLocation(shuffleId, mapId)
if (currentLocationOpt.isDefined && currentLocationOpt.get == prevLocation) {
Contributor

nit:

Suggested change
if (currentLocationOpt.isDefined && currentLocationOpt.get == prevLocation) {
if (currentLocationOpt.exists(_ == prevLocation)) {

Contributor Author

Will change to currentLocationOpt.contains(prevLocation).

if (currentLocationOpt.isDefined && currentLocationOpt.get == prevLocation) {
// Address in the cache unchanged. Try to clean cache and get a fresh location
unregisterShuffle(shuffleId)
currentLocationOpt = getMapOutputLocation(shuffleId, mapId)
Contributor

Note: we end up removing both map and merge status here - for this second call, pass canFetchMergeResult = true in getMapOutputLocation

Contributor Author

Good catch. Will do.

throw new MetadataFetchFailedException(shuffleId, -1,
message = s"Failed to get map output location for shuffleId $shuffleId, mapId $mapId")
}
currentLocationOpt.get
Contributor

nit: currentLocationOpt.getOrElse( throw ... )
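
i.e., roughly, applying the nit to the snippet above:

    currentLocationOpt.getOrElse {
      throw new MetadataFetchFailedException(shuffleId, -1,
        message = s"Failed to get map output location for shuffleId $shuffleId, mapId $mapId")
    }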

Contributor

When shuffle fallback storage is enabled, this currentLocationOpt can be the FALLBACK_BLOCK_MANAGER_ID, and DeferFetchRequestResult below doesn't handle this special case.
So we should either 1) check the FetchRequest for the fallback storage special ID, or 2) rewrite the RPC address to localhost so we get the blocks from the fallback storage.

Contributor

@mridulm mridulm Aug 13, 2023

Let us filter it out here (see the sketch below), and add support for fetching from fallback in a separate PR.

+CC @dongjoon-hyun as well.
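
One possible shape of that filter, sketched as a hypothetical helper inside the iterator (FallbackStorage.FALLBACK_BLOCK_MANAGER_ID is the special ID mentioned above; the helper name and placement are illustrative, not code from the PR):

    import org.apache.spark.storage.{BlockManagerId, FallbackStorage}

    // Hypothetical helper: only defer a retry when the refreshed location actually
    // changed and is not the fallback-storage pseudo block manager (fetching from
    // the fallback storage is deferred to a follow-up PR per the discussion above).
    private def isRetriableNewLocation(
        oldAddress: BlockManagerId,
        newAddress: BlockManagerId): Boolean = {
      newAddress != oldAddress &&
        newAddress != FallbackStorage.FALLBACK_BLOCK_MANAGER_ID
    }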

.doc("If true, executors will try to refresh the cached locations for the shuffle blocks" +
"when fetch failures happens (and decommission shuffle block migration is enabled), " +
"and retry fetching when the location changes.")
.version("3.5.0")
Contributor

Change to 4.0.0

@@ -264,18 +272,22 @@ final class ShuffleBlockFetcherIterator(
case FetchBlockInfo(blockId, size, mapIndex) => (blockId.toString, (size, mapIndex))
}.toMap
val remainingBlocks = new HashSet[String]() ++= infoMap.keys
val deferredBlocks = new ArrayBuffer[String]()
val deferredBlocks = new HashMap[BlockManagerId, Queue[String]]()
Contributor

nit:

Suggested change
val deferredBlocks = new HashMap[BlockManagerId, Queue[String]]()
val deferredBlocks = new HashMap[BlockManagerId, ArrayBuffer[String]]()

@@ -1288,6 +1288,30 @@ private[spark] class MapOutputTrackerWorker(conf: SparkConf) extends MapOutputTr
mapSizesByExecutorId.iter
}

def getMapOutputLocationWithRefresh(
Contributor

@ukby1234 ukby1234 Aug 14, 2023

Maybe return Option[BlockManagerId]?:

  def getMapOutputLocationWithRefresh(
                                       shuffleId: Int,
                                       mapId: Long,
                                       prevLocation: BlockManagerId): Option[BlockManagerId] = {
    // Try to get the cached location first in case other concurrent tasks
    // fetched the fresh location already
    getMapOutputLocation(shuffleId, mapId) match {
      case Some(location) =>
        if (location == prevLocation) {
          unregisterShuffle(shuffleId)
          getMapOutputLocation(shuffleId, mapId)
        } else {
          Some(location)
        }
      case _ =>
        None
    }
  }

Contributor Author

We still want to throw a MetadataFetchFailedException when failing to get a refreshed location here. So I would prefer returning a BlockManagerId and making it specific.

Contributor

We can do the following with Option:

                val currentAddressOpt = mapOutputTrackerWorker
                  .getMapOutputLocationWithRefresh(shuffleId, mapId, address)
                currentAddressOpt match {
                  case Some(currentAddress) =>
                    if (currentAddress != address) {
                      logInfo(s"Map status location for block $blockId changed from $address " +
                        s"to $currentAddress")
                      remainingBlocks -= blockId
                      deferredBlocks.getOrElseUpdate(currentAddress, new ArrayBuffer[String]())
                        .append(blockId)
                      enqueueDeferredFetchRequestIfNecessary()
                    } else {
                      results.put(FailureFetchResult(block, infoMap(blockId)._2, address, e))
                    }
                  case None =>
                    results.put(FailureFetchResult(block, infoMap(blockId)._2, address, e))
                }

It is also consistent with the signatures of other functions like getMapOutputLocation.

Contributor Author

@bozhang2820 bozhang2820 left a comment

Sorry for the late reply. Will work on this more actively from now on.

@bozhang2820
Contributor Author

Also CC @jiangxb1987

Member

@Ngone51 Ngone51 left a comment

LGTM

bozhang2820 and others added 2 commits September 12, 2023 09:25
Co-authored-by: wuyi <yi.wu@databricks.com>
Comment on lines +497 to +515
  val conf = SparkEnv.get.conf
  val (keys, values) = confPairs.unzip
  val currentValues = keys.map { key =>
    if (conf.contains(key)) {
      Some(conf.get(key))
    } else {
      None
    }
  }
  (keys, values).zipped.foreach { (key, value) =>
    conf.set(key, value)
  }
  try f finally {
    keys.zip(currentValues).foreach {
      case (key, Some(value)) => conf.set(key, value)
      case (key, None) => conf.remove(key)
    }
  }
}
Contributor

nit:

Suggested change
  val conf = SparkEnv.get.conf
  val (keys, values) = confPairs.unzip
  val currentValues = keys.map { key =>
    if (conf.contains(key)) {
      Some(conf.get(key))
    } else {
      None
    }
  }
  (keys, values).zipped.foreach { (key, value) =>
    conf.set(key, value)
  }
  try f finally {
    keys.zip(currentValues).foreach {
      case (key, Some(value)) => conf.set(key, value)
      case (key, None) => conf.remove(key)
    }
  }
}
def withConf[T](confPairs: (String, String)*)(f: => T): T = {
  val conf = SparkEnv.get.conf
  val inputConfMap = confPairs.toMap
  val modifiedValues = conf.getAll.filter(kv => inputConfMap.contains(kv._1)).toMap
  inputConfMap.foreach { kv =>
    conf.set(kv._1, kv._2)
  }
  try f finally {
    inputConfMap.keys.foreach { key =>
      if (modifiedValues.contains(key)) {
        conf.set(key, modifiedValues(key))
      } else {
        conf.remove(key)
      }
    }
  }
}

if (currentLocationOpt.contains(prevLocation)) {
// Address in the cache unchanged. Try to clean cache and get a fresh location
unregisterShuffle(shuffleId)
currentLocationOpt = getMapOutputLocation(shuffleId, mapId, canFetchMergeResult = true)
Contributor

Suggested change
currentLocationOpt = getMapOutputLocation(shuffleId, mapId, canFetchMergeResult = true)
currentLocationOpt = getMapOutputLocation(shuffleId, mapId, fetchMergeResult)

results.put(FailureFetchResult(block, infoMap(blockId)._2, address, e))
} else {
val (shuffleId, mapId) = BlockId.getShuffleIdAndMapId(block)
Contributor

Should we move the getShuffleIdAndMapId into the Try?
We would effectively block the shuffle indefinitely in case getShuffleIdAndMapId throws an exception (it should not currently - but code could evolve).

Something like:

                Try {
                  val (shuffleId, mapId) = BlockId.getShuffleIdAndMapId(block)
                  mapOutputTrackerWorker
                    .getMapOutputLocationWithRefresh(shuffleId, mapId, address)
                } match {

}
}
when(mapOutputTracker.getMapOutputLocationWithRefresh(any(), any(), any()))
.thenAnswer(_ => throw new MetadataFetchFailedException(0, 0, ""))
Contributor

super nit: set mapId to -1
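
Applied to the mocked answer above, that would look roughly like:

    when(mapOutputTracker.getMapOutputLocationWithRefresh(any(), any(), any()))
      .thenAnswer(_ => throw new MetadataFetchFailedException(0, -1, ""))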

}


test("metadata fetch failure in handle map output location change") {
Contributor

Can you also add a simple test to ensure existing behavior is preserved when no migration has happened?

@mridulm
Contributor

mridulm commented Sep 14, 2023

@bozhang2820, the test failure might be resolved by updating your branch to the latest master.

@Ngone51
Member

Ngone51 commented Sep 27, 2023

@bozhang2820 Could you rebase the PR?

val (shuffleId, mapId) = BlockId.getShuffleIdAndMapId(block)
val mapOutputTrackerWorker = mapOutputTracker.asInstanceOf[MapOutputTrackerWorker]
Try(mapOutputTrackerWorker
.getMapOutputLocationWithRefresh(shuffleId, mapId, address)) match {
Contributor

Refreshing map output locations in a Netty callback thread can cause a deadlock. Here is why:

  1. Some map output locations are stored via broadcast variables
  2. This code has a synchronization block
  3. The netty response to fetch broadcast variables might be blocked by other handlers like the shuffle success handler
  4. In the above case, because the shuffle success handler also requires the same lock from 2), this is a deadlock

I ran into this situation while testing this patch.

@ukby1234
Contributor

@bozhang2820 not sure if you still have time to work on this PR. I opened another PR to address some issues mentioned above.

SparkEnv.get.conf.get(config.STORAGE_DECOMMISSION_ENABLED) &&
SparkEnv.get.conf.get(config.STORAGE_DECOMMISSION_SHUFFLE_BLOCKS_ENABLED)

private val shouldPerformShuffleLocationRefresh =
Contributor

What about making this one of the constructor arguments? One benefit is that you wouldn't need to write tests with TestUtils.withConf.
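
A minimal, self-contained illustration of that idea (placeholder names, not the real ShuffleBlockFetcherIterator signature): the flag is computed once at the call site and injected, so a test can construct the object with it set explicitly:

    import org.apache.spark.SparkConf

    object ConstructorFlagSketch {
      // Placeholder standing in for ShuffleBlockFetcherIterator: the flag becomes
      // a plain constructor argument instead of being read from SparkEnv internally.
      class FetcherSketch(val shouldPerformShuffleLocationRefresh: Boolean)

      // Production call site: derive the flag from the conf once and pass it in.
      def makeFetcher(conf: SparkConf): FetcherSketch = {
        val refresh =
          conf.getBoolean("spark.storage.decommission.enabled", false) &&
            conf.getBoolean("spark.storage.decommission.shuffleBlocks.enabled", false) &&
            conf.getBoolean("spark.storage.decommission.shuffleBlocks.refreshLocationsEnabled", false)
        new FetcherSketch(refresh)
      }

      // Test call site: no conf mutation (and no TestUtils.withConf) needed.
      def fetcherForTest(): FetcherSketch =
        new FetcherSketch(shouldPerformShuffleLocationRefresh = true)
    }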

@bozhang2820
Contributor Author

> @bozhang2820 not sure if you still have time to work on this PR. I opened another PR to address some issues mentioned above.

We have encountered some performance issues during our tests with this change, and will have to address those before moving forward.

@mridulm
Contributor

mridulm commented Oct 20, 2023

Do you have details that can be shared, @bozhang2820? Thanks

@mridulm
Contributor

mridulm commented Jan 18, 2024

Any updates on this, @bozhang2820? Thanks


We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Apr 28, 2024
@github-actions github-actions bot closed this Apr 29, 2024