
[SPARK-21859][CORE] Fix SparkFiles.get failed on driver in yarn-cluster and yarn-client mode #19102

Closed

lgrcyanny wants to merge 1 commit into apache:master from lgrcyanny:fix-spark-yarn-files-master


Conversation

@lgrcyanny

What changes were proposed in this pull request?

When using SparkFiles.get to access a file on the driver in yarn-client or yarn-cluster mode, a file-not-found exception is reported.
This exception only happens on the driver; SparkFiles.get on executors works fine.
The bug can be reproduced as follows:

```scala
import java.io.File
import scala.io.Source
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

val conf = new SparkConf().setAppName("SparkFilesTest")
val sc = new SparkContext(conf)

def testOnDriver(fileName: String): Unit = {
    val file = new File(SparkFiles.get(fileName))
    if (!file.exists()) {
        println(s"$file does not exist")
    } else {
        // print file content on driver
        val content = Source.fromFile(file).getLines().mkString("\n")
        println(s"File content: ${content}")
    }
}
// the output will be "file does not exist"
```
```python
import os
from pyspark import SparkConf, SparkContext, SparkFiles

conf = SparkConf().setAppName("test files")
sc = SparkContext(conf=conf)

def test_on_driver(filename):
    file = SparkFiles.get(filename)
    print("file path: {}".format(file))
    if os.path.exists(file):
        with open(file) as f:
            lines = f.readlines()
        print(lines)
    else:
        print("file doesn't exist")
        run_command("ls .")  # helper that shells out to list the working directory
```

The output will be "file doesn't exist".

How was this patch tested?

Tested with integration tests and manual tests.
Submitted the demo case in yarn-cluster and yarn-client mode and verified the results:

```
./bin/spark-submit --master yarn-cluster --files README.md --class "testing.SparkFilesTest" testing.jar
./bin/spark-submit --master yarn-client --files README.md --class "testing.SparkFilesTest" testing.jar
./bin/spark-submit --master yarn-cluster --files README.md test_get_files.py
./bin/spark-submit --master yarn-client --files README.md test_get_files.py
```

@AmplabJenkins

Can one of the admins verify this patch?

@vanzin (Contributor) left a comment:

It seems like exposing these files to the driver was added as a feature in SPARK-16787, whether intentionally or not. I'm fine with fixing things this way (aside from the YARN issue), but if doing this it's probably a good idea to fix the SparkSubmit help text (and maybe other documentation that may also be wrong). It currently says:

```
        |  --files FILES               Comma-separated list of files to be placed in the working
        |                              directory of each executor. File paths of these files
        |                              in executors can be accessed via SparkFiles.get(fileName).
```

Files are not necessarily placed in the working directory, and they're also now available to the driver, so this text should be updated accordingly.
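One possible rewording (a suggestion only, not necessarily the wording that ends up committed) that drops the working-directory claim and mentions the driver:

```
        |  --files FILES               Comma-separated list of files to be made available to
        |                              the driver and all executors. File paths of these files
        |                              can be accessed via SparkFiles.get(fileName).
```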

```diff
 OptionAssigner(args.totalExecutorCores, STANDALONE | MESOS, ALL_DEPLOY_MODES,
   sysProp = "spark.cores.max"),
-OptionAssigner(args.files, LOCAL | STANDALONE | MESOS, ALL_DEPLOY_MODES,
+OptionAssigner(args.files, ALL_CLUSTER_MGRS, ALL_DEPLOY_MODES,
```
@vanzin (Contributor) commented on this change:

I think @jerryshao mentioned this in your other PR, but this is not correct. YARN distributes these files through other means, so doing this might cause other issues.

What you want to do here to make the files show up in the driver is to add some code in SparkContext. There's this code at the end of def addFile, which was added in SPARK-16787:

```scala
val timestamp = System.currentTimeMillis
if (addedFiles.putIfAbsent(key, timestamp).isEmpty) {
  logInfo(s"Added file $path at $key with timestamp $timestamp")
  // Fetch the file locally so that closures which are run on the driver can still use the
  // SparkFiles API to access files.
  Utils.fetchFile(uri.toString, new File(SparkFiles.getRootDirectory()), conf,
    env.securityManager, hadoopConfiguration, timestamp, useCache = false)
  postEnvironmentUpdate()
}
```

You basically want to do that for all files in spark.yarn.dist.files when in YARN client mode.
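A minimal sketch of that idea, reusing the same Utils.fetchFile call quoted above; the helper name fetchYarnDistFilesToDriver and the client-mode check are illustrative only, not the actual patch:

```scala
import java.io.File
import org.apache.spark.SparkFiles
import org.apache.spark.util.Utils

// Hypothetical helper inside SparkContext: in YARN client mode, copy every
// file listed in spark.yarn.dist.files into the driver's SparkFiles root so
// that SparkFiles.get(fileName) also resolves on the driver.
private def fetchYarnDistFilesToDriver(): Unit = {
  if (master == "yarn" && deployMode == "client") {
    conf.getOption("spark.yarn.dist.files").toSeq
      .flatMap(_.split(",").map(_.trim))
      .filter(_.nonEmpty)
      .foreach { fileUri =>
        val timestamp = System.currentTimeMillis
        // Same fetch call as in addFile above: download into the SparkFiles root.
        Utils.fetchFile(fileUri, new File(SparkFiles.getRootDirectory()), conf,
          env.securityManager, hadoopConfiguration, timestamp, useCache = false)
      }
  }
}
```

Calling something like this once during SparkContext initialization would make the repro above print the file content instead of "file does not exist".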

@lgrcyanny (Author) replied:

For yarn-client mode, --files are already added to "spark.yarn.dist.files". I agree with you: it's enough to addFile in SparkContext for "spark.yarn.dist.files" in yarn-client mode. BTW, I will fix the Spark-Submit doc as well.
Thanks @vanzin

@lgrcyanny closed this Sep 6, 2017