[SPARK-28106][SQL] When Spark SQL use "add jar" , before add to SparkContext, check jar path exist first. #24909
Conversation
I vaguely remember that we don't want to do this, because the JAR might not yet exist at the time the driver is started, as it might be distributed by Spark? but I think I could be misremembering.
@@ -1799,6 +1799,20 @@ class SparkContext(config: SparkConf) extends Logging {
      // For local paths with backslashes on Windows, URI throws an exception
      addJarFile(new File(path))
    } else {
      /**
Nit: you don't want scaladoc syntax here, and the comment doesn't add anything anyway
      val hadoopPath = new Path(schemeCorrectedPath)
      val fs = hadoopPath.getFileSystem(hadoopConfiguration)
      if (!fs.exists(hadoopPath)) {
        throw new FileNotFoundException(s"Jar ${schemeCorrectedPath} not found")
      }
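For readers following along, the check in the diff above can be illustrated with a self-contained sketch. This is not Spark code: `ensureJarExists` is a hypothetical helper, and `java.nio.file` stands in for the Hadoop `FileSystem` that the real diff resolves from the path's scheme.

```scala
import java.io.FileNotFoundException
import java.nio.file.{Files, Paths}

// Hypothetical helper mirroring the check above: fail fast with
// FileNotFoundException when the jar path does not exist, instead of
// silently registering a bad path. java.nio.file stands in here for
// the Hadoop FileSystem used in the actual patch.
def ensureJarExists(path: String): Unit = {
  if (!Files.exists(Paths.get(path))) {
    throw new FileNotFoundException(s"Jar $path not found")
  }
}
```

The actual PR resolves the filesystem via `hadoopPath.getFileSystem(hadoopConfiguration)`, so HDFS and viewfs paths are covered, not just local ones.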
If anything, why not check this below?
When we use the "ADD JAR" SQL command, it calls SessionResourceBuilder's addJar method, which then calls SparkContext's addJar method. It does happen that when we add a jar path with an HDFS scheme, it isn't checked. Maybe we can add this check in SessionResourceBuilder?
@srowen I moved this check to SessionResourceBuilder. Then only SQL queries will trigger the check, so it won't impact the startup process.
What is the potential impact if we add this change in SparkContext#addJar? What I can think of is that it will delay the startup process, as each remote jar will be checked. Also, do we need to add a similar check in the SparkContext#addFile API?
@jerryshao when adding a file, it calls fs.getFileStatus, which checks whether the path is a file or a directory; this call throws an exception when we add a wrong file path:
19/06/20 14:59:45 ERROR org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation: "Error executing query, currentState RUNNING, "
java.io.FileNotFoundException: /userd
  at org.apache.hadoop.fs.viewfs.InodeTree.resolve(InodeTree.java:403)
  at org.apache.hadoop.fs.viewfs.ViewFileSystem.getFileStatus(ViewFileSystem.java:377)
  at org.apache.spark.SparkContext.addFile(SparkContext.scala:1546)
  at org.apache.spark.SparkContext.addFile(SparkContext.scala:1510)
  at org.apache.spark.sql.execution.command.AddFileCommand.run(resources.scala:50)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
  at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195)
  at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195)
  at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365)
  at org.apache.spark.sql.execution.SQLExecution$.withCustomJobTag(SQLExecution.scala:119)
  at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:79)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:143)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:195)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:80)
  at org.apache.spark.sql.SparkSession.sql(SparkSessi
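The behavior described here — a status lookup that throws when the path is wrong — can be sketched locally. `statPath` below is a hypothetical stand-in using `java.nio`; the real call is Hadoop's `FileSystem.getFileStatus`, which throws `FileNotFoundException` for a missing path, as the stack trace above shows.

```scala
import java.io.FileNotFoundException
import java.nio.file.{Files, Paths}

// Hypothetical local stand-in for Hadoop's FileSystem.getFileStatus:
// resolve whether the path is a file or a directory, throwing
// FileNotFoundException when it is absent. This is why SparkContext#addFile
// already fails fast on a bad path, while (before this PR) addJar did not.
def statPath(path: String): Boolean = {
  val p = Paths.get(path)
  if (!Files.exists(p)) throw new FileNotFoundException(path)
  Files.isDirectory(p) // true for a directory, false for a regular file
}
```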
I think this problem does not only exist when using ADD JAR; normally if you call SparkContext#addJar directly, it will also fail. So my thinking is that it could be fixed in addJar, rather than in a separate method.
@jerryshao I was too focused on the SQL engine; what you said is right. Maybe I should check more with @srowen. Once this problem is confirmed, I will make the change.
@jerryshao How about my latest change?
To me, I would prefer to add the check in addJar, not a separate method, which also keeps it aligned with addFile (which also throws an exception in place when the file is not found).
@jerryshao sorry, when I @-mentioned you, I forgot to push my code from local to GitHub.
Please change the PR title to follow the Spark pattern like others. Also please remove the PR description template sentence and add your own.
ok to test
cc @GregOwen Could you take a look at this PR?
Test build #106804 has finished for PR 24909 at commit
Can't it be possible that the jar path isn't accessible at driver, but only at executors?
Special case: some jars may be used only in executors, but it seems we can't check that in the driver.
Test build #106806 has finished for PR 24909 at commit
@gatorsmile This PR LGTM.
Test build #106924 has finished for PR 24909 at commit
Test build #106925 has finished for PR 24909 at commit
Test build #106926 has finished for PR 24909 at commit
Test build #106927 has finished for PR 24909 at commit
Test build #106928 has finished for PR 24909 at commit
Test build #107575 has finished for PR 24909 at commit
val jarPath = "hdfs:///no/path/to/TestUDTF.jar"
sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local"))
sc.addJar(jarPath)
assert(sc.listJars().filter(_.contains("TestUDTF.jar")).size == 0)
Nit: how about .forall(j => !j.contains("TestUDTF.jar"))? Or just check .filter(...).isEmpty.
I guess this is about the best that can be done for a test without an FS to test against. So the behavior change here is that the bad path isn't added.
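The two suggested spellings are equivalent; a quick illustration on a plain Seq (the jar names below are made up so the snippet is self-contained, rather than calling sc.listJars()):

```scala
// Two equivalent ways to assert "no jar named TestUDTF.jar was added";
// forall reads the intent directly, filter(...).isEmpty builds an
// intermediate collection first.
val jars = Seq("spark://host:12345/jars/other.jar")

val viaForall = jars.forall(j => !j.contains("TestUDTF.jar"))
val viaIsEmpty = jars.filter(_.contains("TestUDTF.jar")).isEmpty

assert(viaForall && viaIsEmpty)
```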
I have changed the test assertion code. Yeah, if the path isn't added, the error won't happen.
Test build #107604 has finished for PR 24909 at commit
val jarPath = "hdfs:///no/path/to/TestUDTF.jar"
sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local"))
sc.addJar(jarPath)
assert(sc.listJars().forall(!_..contains("TestUDTF.jar")))
I think you have an extra period here
Before committing the code I accidentally hit the keyboard; I have changed it.
Test build #107605 has finished for PR 24909 at commit
Test build #107625 has finished for PR 24909 at commit
@srowen
Test build #4820 has started for PR 24909 at commit
@@ -1792,12 +1792,36 @@ class SparkContext(config: SparkConf) extends Logging {
    }
  }

  def addRemoteJarFile(path: String): String = {
Better to change this to checkRemoteJarFile, since this method only checks the jar file.
The original addJarFile also checks that the jar exists; it's all the same as for a local jar file.
addJarFile also adds the jar file to the fileserver; that's the key purpose there, not just checking.
Reasonable, done.
Test build #107666 has finished for PR 24909 at commit
Jenkins, retest this please.
Recently, SparkQA always returns an unreasonable status.
Test build #107674 has finished for PR 24909 at commit
Test build #4822 has finished for PR 24909 at commit
Looks fine to me. Sorry for jumping in late on the reviews. On the old discussion about whether we need to let people add a jar which doesn't exist yet: I agree with everybody else that there isn't a good reason to keep the old behavior; we should change it.
Maybe to guarantee the core startup process, we should ignore a bad path or stop the core early. Throwing an exception is OK for the STS and SparkSQLCLI, but not good for the startup process.
Jenkins, retest this please.
To avoid some flaky tests, running Jenkins again. Overall LGTM.
Test build #107712 has finished for PR 24909 at commit
Thanks for the fix, merging to master branch.
…Context, check jar path exist first.

## What changes were proposed in this pull request?
ISSUE: https://issues.apache.org/jira/browse/SPARK-28106
When we use ADD JAR in SQL, there are three steps:
- add the jar to HiveClient's classloader
- HiveClientImpl.runHiveSQL("ADD JAR " + PATH)
- SessionStateBuilder.addJar

The second step seems to have no impact on the whole process, since even if it fails, we can still execute. The first step adds the jar path to HiveClient's ClassLoader, so we can then use the jar in HiveClientImpl. The third step adds this jar path to SparkContext: for a local file path it calls the RpcServer's FileServer to add it to the Env, so a wrong path causes an error; but if you pass an HDFS or VIEWFS path, it isn't checked and is just added to the jar path map. Then, when the next TaskSetManager sends out a task, this path is carried by the TaskDescription, and the executor calls updateDependencies, which checks all jar and file paths in the TaskDescription. Then an error happens like below:
![image](https://user-images.githubusercontent.com/46485123/59817635-4a527f80-9353-11e9-9e08-9407b2b54023.png)

## How was this patch tested?
Existing unit tests
Environment test

Closes apache#24909 from AngersZhuuuu/SPARK-28106.
Lead-authored-by: Angers <angers.zhu@gamil.com>
Co-authored-by: 朱夷 <zhuyi01@corp.netease.com>
Signed-off-by: jerryshao <jerryshao@tencent.com>
What changes were proposed in this pull request?
ISSUE: https://issues.apache.org/jira/browse/SPARK-28106
When we use ADD JAR in SQL, there are three steps:
- add the jar to HiveClient's classloader
- HiveClientImpl.runHiveSQL("ADD JAR " + PATH)
- SessionStateBuilder.addJar
The second step seems to have no impact on the whole process, since even if it fails, we can still execute.
The first step adds the jar path to HiveClient's ClassLoader, so we can then use the jar in HiveClientImpl.
The third step adds this jar path to SparkContext: for a local file path it calls the RpcServer's FileServer to add it to the Env, so a wrong path causes an error; but if you pass an HDFS or VIEWFS path, it isn't checked and is just added to the jar path map.
Then, when the next TaskSetManager sends out a task, this path is carried by the TaskDescription, and the executor calls updateDependencies, which checks all jar and file paths in the TaskDescription. Then an error happens, as shown in the screenshot attached to the PR.
How was this patch tested?
Existing unit tests
Environment test
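Putting the thread together: the merged change makes addJar validate remote paths up front, so a bad HDFS path fails at ADD JAR time rather than at task launch. A toy, in-memory sketch of that before/after behavior (none of these names are Spark APIs; the registry and its `reachable` set are purely illustrative):

```scala
import java.io.FileNotFoundException
import scala.collection.mutable

// Toy model of the behavior change, not Spark code: a registry that
// validates a remote jar path before recording it. `reachable` simulates
// what fs.exists(hadoopPath) would report for the cluster filesystem.
class JarRegistry(reachable: Set[String]) {
  private val jars = mutable.LinkedHashSet.empty[String]

  def addJar(path: String): Unit = {
    // After SPARK-28106: fail fast at ADD JAR time instead of letting a bad
    // path surface later in Executor.updateDependencies at task launch.
    if (!reachable.contains(path)) {
      throw new FileNotFoundException(s"Jar $path not found")
    }
    jars += path
  }

  def listJars(): Seq[String] = jars.toSeq
}
```

This mirrors the PR's unit test: after a failed addJar, listJars() must not contain the bad jar.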