
[SPARK-26867][YARN] Spark Support of YARN Placement Constraint #32804

Closed
wants to merge 6 commits

Conversation

AngersZhuuuu
Contributor

@AngersZhuuuu AngersZhuuuu commented Jun 7, 2021

What changes were proposed in this pull request?

Support YARN Placement Constraints. In this PR:

  • Add LocalityPreferredSchedulingRequestContainerPlacementStrategy to compute locality preferences for SchedulingRequests
  • Add ContainerImpl for unit tests, since the Container interface's common API does not support setAllocationRequestId and getAllocationRequestId
  • Add YarnSchedulingRequestAllocator as the allocator implementation used when Placement Constraints are enabled (see the sketch after this list)
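
For context, here is a minimal sketch of what a constrained request could look like with the Hadoop 3.2+ SchedulingRequest / PlacementConstraints API; the allocation id, tag, resource sizing, and attribute values are placeholders, not code from this PR.

import java.util.Collections

import org.apache.hadoop.yarn.api.records.{ExecutionType, ExecutionTypeRequest, Priority, Resource, ResourceSizing, SchedulingRequest}
import org.apache.hadoop.yarn.api.resource.PlacementConstraints
import org.apache.hadoop.yarn.api.resource.PlacementConstraints.PlacementTargets

// Resource shape of a single executor container (placeholder values).
val executorResource = Resource.newInstance(1024, 1)

// Hard constraint: only nodes whose attribute "attr1" has value "1".
val attrConstraint = PlacementConstraints.build(
  PlacementConstraints.targetIn(
    PlacementConstraints.NODE,
    PlacementTargets.nodeAttribute("attr1", "1")))

// One SchedulingRequest asking for one executor under that constraint.
val request = SchedulingRequest.newBuilder()
  .allocationRequestId(1L)                                  // id used to match allocations back to this request
  .priority(Priority.newInstance(1))
  .executionType(ExecutionTypeRequest.newInstance(ExecutionType.GUARANTEED))
  .allocationTags(Collections.singleton("spark-executor"))  // tag that affinity/cardinality constraints can target
  .placementConstraintExpression(attrConstraint)
  .resourceSizing(ResourceSizing.newInstance(1, executorResource))
  .build()

Such requests would then go to the RM through AMRMClient#addSchedulingRequests rather than the classic addContainerRequest path.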

Why are the changes needed?

Spark can allow users to configure Placement Constraints so that they have more control over where executors are placed. For example (a rough sketch of the first three in the YARN constraint DSL follows this list):

  1. A Spark job wants to run only on machines where the Python version is x or the Java version is y (Node Attributes).
  2. A Spark job needs / does not need its executors to be placed on machines where an HBase RegionServer, ZooKeeper, or any other service is running (Affinity / Anti-Affinity).
  3. A Spark job wants no more than 2 of its executors on the same node (Cardinality).
  4. Job A's executors want / do not want to run where containers of Spark job B, or any other job B, run (Application Tag Namespace).
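
As an illustration only (the tag and attribute names below are made up, not from this PR), the first three examples map onto the PlacementConstraints DSL roughly like this:

import org.apache.hadoop.yarn.api.resource.PlacementConstraints._
import org.apache.hadoop.yarn.api.resource.PlacementConstraints.PlacementTargets._

// 1. Node attributes: only nodes where attribute "python" has value "3".
val byAttribute = targetIn(NODE, nodeAttribute("python", "3"))

// 2. Anti-affinity: avoid nodes already running containers tagged "hbase-rs".
val antiAffinity = targetNotIn(NODE, allocationTag("hbase-rs"))

// 3. Cardinality: at most 2 containers tagged "spark-executor" per node.
val atMostTwoPerNode = cardinality(NODE, 0, 2, "spark-executor")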

Does this PR introduce any user-facing change?

Users can set spark.yarn.schedulingRequestEnabled to use YARN placement constraints, and spark.yarn.executor.nodeAttributes to set the constraint.

How was this patch tested?

Added unit tests and manually tested this in a YARN cluster:

List the configured node attributes:

/usr/share/yarn-3/bin/yarn nodeattributes -attributestonodes -attributes attr1
Hostname    Attribute-value
rm.yarn.io/attr1:
    ip-xxxx    1

/usr/share/yarn-3/bin/yarn nodeattributes -attributestonodes -attributes attr2
Hostname    Attribute-value
rm.yarn.io/attr2:
    ip-xxx

Spark command:

export SPARK_CONF_DIR=/tmp/spark-conf-3.1.1
export SPARK_HOME=/tmp/spark-3.1.1
/tmp/spark-3.1.1/bin/spark-sql --queue infra \
--executor-memory 1g --executor-cores 1  \
--conf spark.yarn.schedulingRequestEnabled=true  \
--conf spark.executor.instances=1 \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.initialExecutors=1 \
--conf spark.dynamicAllocation.maxExecutors=2 \
--conf spark.dynamicAllocation.minExecutors=0    \
--conf spark.dynamicAllocation.executorIdleTimeout=10s \
--conf spark.yarn.executor.nodeAttribute='attr1=1'

Under dynamic allocation this works as expected: all allocated containers land on NM ip-xxx. If we set the node attribute to attr1=2, no executor containers can be allocated.

@github-actions github-actions bot added the YARN label Jun 7, 2021
@SparkQA

SparkQA commented Jun 7, 2021

Test build #139410 has finished for PR 32804 at commit c80b66f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 7, 2021

Test build #139414 has finished for PR 32804 at commit 4ef8f35.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 7, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43932/

@SparkQA

SparkQA commented Jun 7, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43936/

@SparkQA

SparkQA commented Jun 7, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43932/

@SparkQA

SparkQA commented Jun 7, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43936/

@mridulm
Contributor

mridulm commented Jun 7, 2021

+CC @otterc, @venkata91

@SparkQA

SparkQA commented Jun 8, 2021

Test build #139445 has finished for PR 32804 at commit ec9e127.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 8, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43968/

@SparkQA

SparkQA commented Jun 8, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43968/

}

val expression = (nodeAttributes.isDefined, locality.isDefined) match {
case (true, true) => Some(or(and(nodeAttributes.get, locality.get), nodeAttributes.get))
Contributor Author


Using the constraint below would be better here:

delayedOr(
  timedClockConstraint(and(nodeAttributes, locality), delayedOrIntervalMilliseconds, TimeUnit.MILLISECONDS),
  timedClockConstraint(nodeAttributes, delayedOrIntervalMilliseconds * 2, TimeUnit.MILLISECONDS))

But it always gets stuck when tested on a YARN 3.3.0 cluster. I'd appreciate help with this part; it's hard to balance the node attribute constraint against the locality requirement.
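
For reference, here is a self-contained sketch of how that delayed-or expression could slot into the match shown above; the two fallback cases and the interval parameter are assumptions, not code from this PR.

import java.util.concurrent.TimeUnit

import org.apache.hadoop.yarn.api.resource.PlacementConstraint.AbstractConstraint
import org.apache.hadoop.yarn.api.resource.PlacementConstraints._

def buildExpression(
    nodeAttributes: Option[AbstractConstraint],
    locality: Option[AbstractConstraint],
    delayedOrIntervalMilliseconds: Long): Option[AbstractConstraint] =
  (nodeAttributes.isDefined, locality.isDefined) match {
    case (true, true) =>
      // Try "node attribute AND locality" first; after the interval, relax to the attribute alone.
      Some(delayedOr(
        timedClockConstraint(and(nodeAttributes.get, locality.get),
          delayedOrIntervalMilliseconds, TimeUnit.MILLISECONDS),
        timedClockConstraint(nodeAttributes.get,
          delayedOrIntervalMilliseconds * 2, TimeUnit.MILLISECONDS)))
    case (true, false) => nodeAttributes
    case (false, true) => locality
    case _ => None
  }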

@SparkQA

SparkQA commented Jun 9, 2021

Test build #139553 has finished for PR 32804 at commit 2cd9e69.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 9, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44079/

@tgravescs
Contributor

Thanks for working on this, it looks very interesting.

Add LocalityPreferredSchedulingRequestContainerPlacementStrategy for compute locality for SchedulingRequest

Can you add more description about this? This seems like a lot of changes and not what I expected from the description. I was expecting us to just pass the node attributes along to YARN, but this is much more than that, so please describe in detail how this is working. How does this work exactly with constraints vs. data locality? Is there some sort of timed wait or error handling if a matching node never becomes available or the cluster doesn't have a node with that attribute? You gave 4 examples of things you could do, but are there more, and what dependencies does it need? How does an attribute allow you to get cardinality or affinity?

Are there any constraints on the Hadoop version? (I thought this was only introduced in 3.2.0.)

Eventually the documentation .md file would need to be updated.

@@ -27,6 +27,24 @@ import org.apache.spark.network.util.ByteUnit
package object config extends Logging {

/* Common app configuration. */
private[spark] val SCHEDULING_REQUEST_ENABLED =
ConfigBuilder("spark.yarn.schedulingRequestEnabled")
Contributor


I think we need to come up with a better name and more description. From just reading this, I have no idea what it does. Spark always requests containers from YARN to schedule on, so why would it be off? (That is what I read from the config name.)
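
For what it's worth, here is a sketch of a more self-describing entry; the doc text, version, and default below are placeholders, and the config name itself is what's under discussion.

private[spark] val SCHEDULING_REQUEST_ENABLED =
  ConfigBuilder("spark.yarn.schedulingRequestEnabled")
    .doc("When true, request executor containers through YARN SchedulingRequests, " +
      "which can carry placement constraints such as node attributes, instead of " +
      "the classic ContainerRequest path. Requires a YARN version that supports " +
      "placement constraints.")
    .version("3.2.0")
    .booleanConf
    .createWithDefault(false)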



/**
* This strategy is calculating the optimal locality preferences of YARN containers by considering
Contributor


This looks like a direct copy of the ContainerLocalityPreferences javadoc. I'm assuming this is different, so please update the description; if I missed the difference, I think we need to pull it up to the top to make clear how it's different.

hostToLocalTaskCount: Map[String, Int],
allocatedHostToContainersMap: HashMap[String, Set[ContainerId]],
localityMatchedPendingAllocations: Seq[SchedulingRequest],
schedulingRequestToNodes: mutable.HashMap[Long, Array[String]],
Contributor


not documented
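
For example, scaladoc along these lines would help; the descriptions below are guesses at the semantics and would need to be checked against the actual code.

/**
 * @param hostToLocalTaskCount              preferred hostname -> number of pending tasks that could
 *                                          run locally on that host (assumed meaning)
 * @param allocatedHostToContainersMap      host -> containers already allocated on that host
 * @param localityMatchedPendingAllocations pending SchedulingRequests whose locality preferences
 *                                          match the current pending tasks (assumed meaning)
 * @param schedulingRequestToNodes          allocationRequestId -> preferred node names for that
 *                                          outstanding SchedulingRequest (assumed meaning)
 */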

@github-actions

github-actions bot commented Oct 1, 2021

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Oct 1, 2021
@github-actions github-actions bot closed this Oct 2, 2021
@zuston
Member

zuston commented Aug 17, 2023

Thanks for your effort. Any update on this? @AngersZhuuuu
