[SPARK-27823][CORE] Refactor resource handling code #24856

Closed
mengxr wants to merge 28 commits into apache:master from mengxr:SPARK-27823

Conversation

@mengxr
Contributor

@mengxr mengxr commented Jun 12, 2019

What changes were proposed in this pull request?

Continue the work from #24821. Refactor resource handling code to make the code more readable. Major changes:

  • Moved resource-related classes to spark.resource from spark.
  • Added ResourceUtils and helper classes so we don't need to directly deal with Spark conf.
  • ResourceID: a resource identifier that also provides the associated conf keys
  • ResourceRequest/Allocation: abstraction for requested and allocated resources
  • Added TestResourceIDs to reference commonly used resource IDs in tests like spark.executor.resource.gpu.
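Roughly, the new abstractions can be sketched like this (a simplified, hypothetical rendering for illustration only; the actual classes in org.apache.spark.resource carry more detail, and the suffix constants live in ResourceUtils):

```scala
// Simplified sketch (illustrative; the real Spark classes differ in detail).
case class ResourceID(componentName: String, resourceName: String) {
  // e.g. ("spark.executor", "gpu") -> "spark.executor.resource.gpu."
  def confPrefix: String = s"$componentName.resource.$resourceName."
  def amountConf: String = s"${confPrefix}amount"
  def discoveryScriptConf: String = s"${confPrefix}discoveryScript"
  def vendorConf: String = s"${confPrefix}vendor"
}

// A requested resource (user side) vs. an allocated one (cluster manager side).
case class ResourceRequest(
    id: ResourceID,
    amount: Int,
    discoveryScript: Option[String],
    vendor: Option[String])

case class ResourceAllocation(id: ResourceID, addresses: Seq[String])
```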

cc: @tgravescs @jiangxb1987 @Ngone51

How was this patch tested?

Unit tests for added utils and existing unit tests.

@SparkQA

SparkQA commented Jun 12, 2019

Test build #106432 has finished for PR 24856 at commit ac1ab1d.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 12, 2019

Test build #106433 has finished for PR 24856 at commit 4e241d6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Contributor Author

mengxr commented Jun 12, 2019

@dongjoon-hyun this is not a new feature.

@SparkQA

SparkQA commented Jun 12, 2019

Test build #106435 has finished for PR 24856 at commit 94896ed.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Thank you for the change, @mengxr !

@SparkQA

SparkQA commented Jun 13, 2019

Test build #106483 has finished for PR 24856 at commit 8275468.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 13, 2019

Test build #106484 has finished for PR 24856 at commit 5fe74b3.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 13, 2019

Test build #106485 has finished for PR 24856 at commit 9ea06a0.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr mengxr changed the title [WIP] [SPARK-27823] [CORE] Add an abstraction layer for resource handling [SPARK-27823] [CORE] Refactor resource handling code Jun 13, 2019
@SparkQA

SparkQA commented Jun 13, 2019

Test build #106487 has finished for PR 24856 at commit e4fc573.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 13, 2019

Test build #106488 has finished for PR 24856 at commit 4cebf70.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 14, 2019

Test build #106489 has finished for PR 24856 at commit 0f11e8c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

    }
  } else {
    throw new SparkException(s"User is expecting to use resource: $resourceName but " +
      "didn't specify a discovery script!")
Member

"but neither allocated by resources file nor specified a discovery script!" ?

Contributor Author

The resources file is provided by the cluster manager via a discovery script, not directly by user. So the error message is correct.

"didn't specify a discovery script!")
}
if (!result.name.equals(resourceName)) {
  throw new SparkException("Error running the resource discovery script, script returned " +
Member

"Error running the resource discovery script $scriptFile, script ...." ?

Contributor Author

Done.

val executorResourcesAndCounts = sc.conf.getAllWithPrefixAndSuffix(
  SPARK_EXECUTOR_RESOURCE_PREFIX, SPARK_RESOURCE_AMOUNT_SUFFIX).toMap
val taskResourceRequirements = parseTaskResourceRequirements(sc.conf)
val executorResourcesAndCounts =
Member

Shall we unify the executor's count to amount?

Contributor Author

Done.

if (execCount < taskReq.amount) {
  throw new SparkException("The executor resource config: " +
    ResourceID(SPARK_EXECUTOR_PREFIX, taskReq.resourceName).amountConf +
    s" = $execCount has to be >= the task config: " +
Member

Just for symmetry: executor resource config <-> task resource config

Contributor Author

Done.
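The executor-vs-task amount check being discussed can be sketched as follows. This is a minimal, self-contained approximation: checkResourceAmounts and the flattened Map arguments are illustrative, not Spark's actual signatures.

```scala
// Minimal approximation of the executor-vs-task amount validation.
// The map-based signature is illustrative, not Spark's actual API.
def checkResourceAmounts(
    execAmounts: Map[String, Int],
    taskAmounts: Map[String, Int]): Unit = {
  taskAmounts.foreach { case (name, taskAmount) =>
    val execAmount = execAmounts.getOrElse(name, 0)
    // Symmetric message: executor resource config <-> task resource config.
    require(execAmount >= taskAmount,
      s"The executor resource config: spark.executor.resource.$name.amount = $execAmount " +
        s"has to be >= the task resource config: spark.task.resource.$name.amount = $taskAmount")
  }
}
```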

logWarning(s"The configuration of resource: ${taskReq.resourceName} " +
  s"(limits tasks to $resourceNumSlots) will result in wasted resources of resource " +
  s"${limitingResourceName} (would allow for $numSlots tasks). " +
  "Please adjust your configuration.")
Member

I know it's previous logic, but isn't this warning a little redundant comparing to the following warning below ?

Contributor Author

Done. Good catch!

request = parseResourceRequest(conf, DRIVER_GPU_ID)
assert(request.id.resourceName === GPU, "should only have GPU for resource")
assert(request.amount === 2, "GPU count should be 2")
assert(request.discoveryScript.get === discoveryScript, "discovery script should be empty")
Member

"discovery script should be discoveryScriptGPU"

Contributor Author

Changed to "should get discovery script". Putting discoveryScriptGPU in the message would duplicate info.

assert(request.id.resourceName === GPU, "should only have GPU for resource")
assert(request.amount === 2, "GPU count should be 2")
assert(request.discoveryScript.get === discoveryScript, "discovery script should be empty")
assert(request.vendor.get === vendor, "vendor should be empty")
Member

"vendor should be nvidia.com"

Contributor Author

updated.

  parse(json).extract[ResourceInformationJson].toResourceInformation
} catch {
  case NonFatal(e) =>
    throw new SparkException(s"Error parsing JSON into ResourceInformation:\n$json\n", e)
Member

Maybe give a tip about the correct JSON format for the user?

Contributor Author

Added.

import org.apache.spark.resource.TestResourceIDs._
import org.apache.spark.util.Utils

class ResourceUtilsSuite extends SparkFunSuite
Member

Add a case for the combination of a resources file and a discovery script?

Contributor Author

added one test. I will leave improving test coverage to a follow-up PR since the refactoring work is quite big now.

def parseJson(json: String): ResourceInformation = {
  implicit val formats = DefaultFormats
  try {
    parse(json).extract[ResourceInformationJson].toResourceInformation
Member

Shall we check for duplicate addresses for the user?

Contributor Author

We can add that check later. It is beyond the scope of this refactor.
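For illustration, a minimal version of such a parser with a format hint in the error message, assuming json4s (which Spark depends on) is on the classpath. ResourceInformationJson and parseResourceJson here are simplified stand-ins, not the actual Spark internals:

```scala
import org.json4s.{DefaultFormats, Formats}
import org.json4s.jackson.JsonMethods.parse
import scala.util.control.NonFatal

// Simplified stand-in for Spark's internal JSON representation.
case class ResourceInformationJson(name: String, addresses: Seq[String])

def parseResourceJson(json: String): ResourceInformationJson = {
  implicit val formats: Formats = DefaultFormats
  try {
    parse(json).extract[ResourceInformationJson]
  } catch {
    case NonFatal(e) =>
      // Include a format hint so users can fix their input.
      throw new IllegalArgumentException(
        s"Error parsing JSON into ResourceInformation:\n$json\n" +
          """Expected format: {"name": "gpu", "addresses": ["0", "1"]}""", e)
  }
}
```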

Contributor

@jiangxb1987 left a comment

Generally looks good

def confPrefix: String = s"$componentName.resource.$resourceName." // with ending dot
def amountConf: String = s"$confPrefix${ResourceUtils.AMOUNT}"
def discoveryScriptConf: String = s"$confPrefix${ResourceUtils.DISCOVERY_SCRIPT}"
def vendorConf: String = s"$confPrefix${ResourceUtils.VENDOR}"
Contributor

This is k8s specific, do we want to move it to the k8s package?

Contributor

Currently it's only used by k8s, but I didn't want to make it k8s-specific. I can see others using it in the future, and it's just another suffix on spark.{executor/driver}.resource.{resourceName}. If we made it k8s-specific, it wouldn't be as user friendly, since users would then have to know which prefix to use.

def parseAllResourceRequests(
    sparkConf: SparkConf,
    componentName: String): Seq[ResourceRequest] = {
  listResourceIds(sparkConf, componentName).map { id =>
Contributor

This would call sparkConf.getAllWithPrefix multiple times; we could reduce it to a single call and manually turn the Map[String, String] into a Seq[ResourceRequest]. I understand this is a tradeoff between code readability and performance; I'm just worried the number of configs can be big, so we'd want to reduce the calls to getAllWithPrefix. I'd prefer to leave a TODO here and consider the improvement in the future.

Contributor Author

This is a one-time only call for each service. I think we should optimize SparkConf and getAllWithPrefix instead.
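To make the tradeoff concrete, here is a hypothetical sketch of deriving resource names from flattened conf entries. listResourceNames and the plain Map stand in for the actual listResourceIds and SparkConf.getAllWithPrefix:

```scala
// Hypothetical helper: derive distinct resource names from flattened conf
// entries. Stands in for listResourceIds + SparkConf.getAllWithPrefix.
def listResourceNames(conf: Map[String, String], componentName: String): Seq[String] = {
  val prefix = s"$componentName.resource."
  conf.keys.collect {
    // e.g. "spark.executor.resource.gpu.amount" -> "gpu"
    case key if key.startsWith(prefix) => key.stripPrefix(prefix).split('.').head
  }.toSeq.distinct
}
```

Each call scans all conf entries once, which is why calling it per resource ID (as discussed above) repeats the scan.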

@SparkQA

SparkQA commented Jun 18, 2019

Test build #106632 has finished for PR 24856 at commit 7fe9a04.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 18, 2019

Test build #106635 has finished for PR 24856 at commit f391585.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Contributor Author

mengxr commented Jun 18, 2019

test this please

@mengxr
Contributor Author

mengxr commented Jun 18, 2019

@SparkQA

SparkQA commented Jun 18, 2019

Test build #106640 has finished for PR 24856 at commit f391585.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Contributor

@mengxr It's a known flaky test, but I haven't figured out the root cause.

Contributor

@jiangxb1987 left a comment

LGTM

@jiangxb1987
Contributor

Thanks, merged to master


6 participants