[SPARK-27823][CORE] Refactor resource handling code #24856
mengxr wants to merge 28 commits into apache:master
Conversation
Test build #106432 has finished for PR 24856 at commit

Test build #106433 has finished for PR 24856 at commit

@dongjoon-hyun this is not a new feature.

Test build #106435 has finished for PR 24856 at commit

Thank you for the change, @mengxr!

Test build #106483 has finished for PR 24856 at commit

Test build #106484 has finished for PR 24856 at commit

Test build #106485 has finished for PR 24856 at commit

Test build #106487 has finished for PR 24856 at commit

Test build #106488 has finished for PR 24856 at commit

Test build #106489 has finished for PR 24856 at commit
    }
    } else {
      throw new SparkException(s"User is expecting to use resource: $resourceName but " +
        "didn't specify a discovery script!")

Review comment: "but neither allocated by resources file nor specified a discovery script!" ?

Reply: The resources file is provided by the cluster manager via a discovery script, not directly by the user, so the error message is correct.
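The fallback being discussed can be sketched as follows. This is a simplified, self-contained Scala sketch, not the PR's actual code: `runDiscoveryScript` is a hypothetical stand-in for invoking the script, and a plain `IllegalArgumentException` stands in for `SparkException`.

```scala
// Simplified sketch of the allocation fallback discussed above; not the PR's code.
case class ResourceInformation(name: String, addresses: Seq[String])

// Hypothetical stand-in for actually running the discovery script.
def runDiscoveryScript(script: String): ResourceInformation =
  ResourceInformation("gpu", Seq("0")) // stub for illustration

def acquireResource(
    resourceName: String,
    fromResourcesFile: Option[ResourceInformation],
    discoveryScript: Option[String]): ResourceInformation = {
  // Prefer the allocation from the cluster manager's resources file;
  // otherwise fall back to the user-specified discovery script.
  fromResourcesFile.getOrElse {
    discoveryScript match {
      case Some(script) => runDiscoveryScript(script)
      case None => throw new IllegalArgumentException( // SparkException in the real code
        s"User is expecting to use resource: $resourceName but didn't specify a discovery script!")
    }
  }
}
```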
        "didn't specify a discovery script!")
    }
    if (!result.name.equals(resourceName)) {
      throw new SparkException("Error running the resource discovery script, script returned " +

Review comment: "Error running the resource discovery script $scriptFile, script ...." ?
    val executorResourcesAndCounts = sc.conf.getAllWithPrefixAndSuffix(
      SPARK_EXECUTOR_RESOURCE_PREFIX, SPARK_RESOURCE_AMOUNT_SUFFIX).toMap
    val taskResourceRequirements = parseTaskResourceRequirements(sc.conf)
    val executorResourcesAndCounts =

Review comment: Shall we unify the executor's "count" to "amount"?
    if (execCount < taskReq.amount) {
      throw new SparkException("The executor resource config: " +
        ResourceID(SPARK_EXECUTOR_PREFIX, taskReq.resourceName).amountConf +
        s" = $execCount has to be >= the task config: " +

Review comment: Just for symmetry: "executor resource config" <-> "task resource config".
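The check in the quoted hunk can be sketched in a self-contained form. This is a simplification: the config key strings are spelled out inline rather than built via `ResourceID`, and `IllegalArgumentException` stands in for `SparkException`.

```scala
// Sketch of the executor-vs-task resource amount validation discussed above.
case class TaskResourceRequirement(resourceName: String, amount: Int)

def validateTaskResources(
    execAmounts: Map[String, Int],
    taskReqs: Seq[TaskResourceRequirement]): Unit = {
  taskReqs.foreach { taskReq =>
    val execCount = execAmounts.getOrElse(taskReq.resourceName, 0)
    // Each executor must hold at least as much of a resource as one task needs,
    // otherwise no task could ever be scheduled.
    if (execCount < taskReq.amount) {
      throw new IllegalArgumentException( // SparkException in the real code
        s"The executor resource config: spark.executor.resource.${taskReq.resourceName}.amount" +
        s" = $execCount has to be >= the task resource config: " +
        s"spark.task.resource.${taskReq.resourceName}.amount = ${taskReq.amount}")
    }
  }
}
```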
    logWarning(s"The configuration of resource: ${taskReq.resourceName} " +
      s"(limits tasks to $resourceNumSlots) will result in wasted resources of resource " +
      s"${limitingResourceName} (would allow for $numSlots tasks). " +
      "Please adjust your configuration.")

Review comment: I know it's the previous logic, but isn't this warning a little redundant compared to the following warning below?
    request = parseResourceRequest(conf, DRIVER_GPU_ID)
    assert(request.id.resourceName === GPU, "should only have GPU for resource")
    assert(request.amount === 2, "GPU count should be 2")
    assert(request.discoveryScript.get === discoveryScript, "discovery script should be empty")

Review comment: The message should be "discovery script should be discoveryScriptGPU".

Reply: Changed it to "should get discovery script"; putting discoveryScriptGPU in the message would duplicate info.
    assert(request.id.resourceName === GPU, "should only have GPU for resource")
    assert(request.amount === 2, "GPU count should be 2")
    assert(request.discoveryScript.get === discoveryScript, "discovery script should be empty")
    assert(request.vendor.get === vendor, "vendor should be empty")
      parse(json).extract[ResourceInformationJson].toResourceInformation
    } catch {
      case NonFatal(e) =>
        throw new SparkException(s"Error parsing JSON into ResourceInformation:\n$json\n", e)

Review comment: Maybe give the user a tip about what the right JSON format is?
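For reference, the JSON shape that maps onto a `ResourceInformation` looks like `{"name": "gpu", "addresses": ["0", "1"]}`. The hand-rolled serializer below only illustrates that shape; the PR itself uses json4s for parsing, which is not reproduced here.

```scala
// Illustrates the JSON shape a ResourceInformation corresponds to, e.g.
//   {"name": "gpu", "addresses": ["0", "1"]}
// Hand-rolled serializer for illustration only; the PR parses with json4s.
case class ResourceInformation(name: String, addresses: Seq[String]) {
  def toJson: String = {
    val addrs = addresses.map(a => "\"" + a + "\"").mkString("[", ", ", "]")
    s"""{"name": "$name", "addresses": $addrs}"""
  }
}
```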
import org.apache.spark.resource.TestResourceIDs._
import org.apache.spark.util.Utils

class ResourceUtilsSuite extends SparkFunSuite

Review comment: Could we add a case for the combination of resources file and discovery script?

Reply: Added one test. I will leave improving test coverage to a follow-up PR since the refactoring work is quite big already.
  def parseJson(json: String): ResourceInformation = {
    implicit val formats = DefaultFormats
    try {
      parse(json).extract[ResourceInformationJson].toResourceInformation

Review comment: Shall we check for duplicate addresses for the user?

Reply: We can add that check later. It is beyond the scope of this refactor.
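The duplicate-address check suggested above (deferred to a follow-up) could look roughly like this hypothetical sketch; it is not part of this PR.

```scala
// Hypothetical duplicate-address check, not part of this PR (deferred to a follow-up).
// Returns the addresses that appear more than once in the parsed resource info.
def duplicateAddresses(addresses: Seq[String]): Seq[String] =
  addresses.groupBy(identity).collect { case (addr, group) if group.size > 1 => addr }.toSeq
```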
jiangxb1987 left a comment: Generally looks good.
  def confPrefix: String = s"$componentName.resource.$resourceName." // with ending dot
  def amountConf: String = s"$confPrefix${ResourceUtils.AMOUNT}"
  def discoveryScriptConf: String = s"$confPrefix${ResourceUtils.DISCOVERY_SCRIPT}"
  def vendorConf: String = s"$confPrefix${ResourceUtils.VENDOR}"

Review comment: This is k8s specific, do we want to move it to the k8s package?

Reply: Currently it's only used by k8s, but I didn't want to make it k8s-specific. I can see others using it in the future, and it's just another suffix on spark.{executor/driver}.resource.{resourceName}. If we made it k8s-specific, I don't think it would be as user friendly, since users would then have to know which prefix to use.
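The key-building scheme in the quoted hunk can be reproduced as a self-contained sketch, with the suffix constants ("amount", "discoveryScript", "vendor") inlined from `ResourceUtils`:

```scala
// Self-contained sketch of the conf-key builders quoted above; the suffix
// constants are inlined here instead of referencing ResourceUtils.
case class ResourceID(componentName: String, resourceName: String) {
  def confPrefix: String = s"$componentName.resource.$resourceName." // with ending dot
  def amountConf: String = s"${confPrefix}amount"
  def discoveryScriptConf: String = s"${confPrefix}discoveryScript"
  def vendorConf: String = s"${confPrefix}vendor"
}
```

So `ResourceID("spark.executor", "gpu").amountConf` yields the familiar key `spark.executor.resource.gpu.amount`, and the vendor suffix is just one more key under the same prefix, which is the author's point about not making it k8s-specific.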
  def parseAllResourceRequests(
      sparkConf: SparkConf,
      componentName: String): Seq[ResourceRequest] = {
    listResourceIds(sparkConf, componentName).map { id =>

Review comment: This would call sparkConf.getAllWithPrefix multiple times; we could reduce it to a single call and manually turn the Map[String, String] into a Seq[ResourceRequest]. I understand this is a tradeoff between code readability and performance; I'm just worried the number of configs can be big, so we would want to reduce the calls to getAllWithPrefix. I'd prefer to leave a TODO here and consider the improvement in the future.

Reply: This is a one-time-only call for each service. I think we should optimize SparkConf and getAllWithPrefix instead.
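The reviewer's single-pass alternative could look roughly like the hypothetical sketch below, operating on a plain `Map[String, String]` rather than a real `SparkConf`, and returning raw per-resource key/value groups instead of `ResourceRequest` objects:

```scala
// Hypothetical single-pass variant discussed above: scan the conf once and
// group entries by resource name, instead of calling getAllWithPrefix per resource.
def parseAllResourceEntries(
    conf: Map[String, String],
    componentName: String): Map[String, Map[String, String]] = {
  val prefix = componentName + ".resource."
  conf
    // Keep only this component's resource entries, e.g. "gpu.amount" -> "2".
    .collect { case (k, v) if k.startsWith(prefix) => (k.stripPrefix(prefix), v) }
    // Group by the resource name, i.e. the segment before the first dot.
    .groupBy { case (k, _) => k.takeWhile(_ != '.') }
    // Strip the resource name, leaving suffix -> value, e.g. "amount" -> "2".
    .map { case (name, entries) =>
      name -> entries.map { case (k, v) => (k.drop(name.length + 1), v) }
    }
}
```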
Test build #106632 has finished for PR 24856 at commit

Test build #106635 has finished for PR 24856 at commit

test this please

The failed test succeeded on my local machine. @jiangxb1987 Could you take a look? https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/106635/testReport/org.apache.spark/SparkContextSuite/test_resource_scheduling_under_local_cluster_mode/

Test build #106640 has finished for PR 24856 at commit

@mengxr It's a known flaky test, but I haven't figured out the root cause.

Thanks, merged to master
What changes were proposed in this pull request?

Continue the work from #24821. Refactor resource handling code to make the code more readable. Major changes:

- Moved resource-related code to spark.resource from spark.
- Added TestResourceIDs to reference commonly used resource IDs in tests, like spark.executor.resource.gpu.

cc: @tgravescs @jiangxb1987 @Ngone51

How was this patch tested?

Unit tests for added utils and existing unit tests.