[SPARK-28106][SQL] When Spark SQL use "add jar" , before add to SparkContext, check jar path exist first. #24909
Conversation
I vaguely remember that we don't want to do this, because the JAR might not yet exist at the time the driver is started, as it might be distributed by Spark? but I think I could be misremembering.
@@ -1799,6 +1799,20 @@ class SparkContext(config: SparkConf) extends Logging {
      // For local paths with backslashes on Windows, URI throws an exception
      addJarFile(new File(path))
    } else {
      /**
Nit: you don't want scaladoc syntax here, and the comment doesn't add anything anyway
      val hadoopPath = new Path(schemeCorrectedPath)
      val fs = hadoopPath.getFileSystem(hadoopConfiguration)
      if (!fs.exists(hadoopPath)) {
        throw new FileNotFoundException(s"Jar ${schemeCorrectedPath} not found")
      }
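For readers following along, the check in the diff above can be illustrated with a self-contained sketch. This is not Spark code: `ensureJarExists` is a hypothetical helper, and `java.nio.file` stands in for the Hadoop `FileSystem` that the real diff resolves from the path's scheme.

```scala
import java.io.FileNotFoundException
import java.nio.file.{Files, Paths}

// Hypothetical helper mirroring the check above: fail fast with
// FileNotFoundException when the jar path does not exist, instead of
// silently registering a bad path. java.nio.file stands in here for
// the Hadoop FileSystem used in the actual patch.
def ensureJarExists(path: String): Unit = {
  if (!Files.exists(Paths.get(path))) {
    throw new FileNotFoundException(s"Jar $path not found")
  }
}
```

The actual PR resolves the filesystem via `hadoopPath.getFileSystem(hadoopConfiguration)`, so HDFS and viewfs paths are covered, not just local ones.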
If anything, why not check this below?
When we use the "ADD JAR" SQL command, it calls SessionResourceBuilder's addJar method, which then calls SparkContext's addJar method. It does happen that when we add a jar path with an HDFS scheme, it isn't checked. Maybe we can add this check in SessionResourceBuilder?
@srowen I moved this check to SessionResourceBuilder. Then only SQL queries will trigger the check, so it won't impact the startup process.
What is the potential impact if we add this change in SparkContext#addJar? What I can think of is that it will delay the startup process, as each remote jar will be checked. Also, do we need to add a similar check in the SparkContext#addFile API?
@jerryshao when adding a file, it calls fs.getFileStatus, which checks whether the path is a file or a directory; this call throws an exception when we add a wrong file path:
19/06/20 14:59:45 ERROR org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation: "Error executing query, currentState RUNNING, "
java.io.FileNotFoundException: /userd
  at org.apache.hadoop.fs.viewfs.InodeTree.resolve(InodeTree.java:403)
  at org.apache.hadoop.fs.viewfs.ViewFileSystem.getFileStatus(ViewFileSystem.java:377)
  at org.apache.spark.SparkContext.addFile(SparkContext.scala:1546)
  at org.apache.spark.SparkContext.addFile(SparkContext.scala:1510)
  at org.apache.spark.sql.execution.command.AddFileCommand.run(resources.scala:50)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
  at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195)
  at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195)
  at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365)
  at org.apache.spark.sql.execution.SQLExecution$.withCustomJobTag(SQLExecution.scala:119)
  at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:79)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:143)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:195)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:80)
  at org.apache.spark.sql.SparkSession.sql(SparkSessi
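The behavior described here — a status lookup that throws when the path is wrong — can be sketched locally. `statPath` below is a hypothetical stand-in using `java.nio`; the real call is Hadoop's `FileSystem.getFileStatus`, which throws `FileNotFoundException` for a missing path, as the stack trace above shows.

```scala
import java.io.FileNotFoundException
import java.nio.file.{Files, Paths}

// Hypothetical local stand-in for Hadoop's FileSystem.getFileStatus:
// resolve whether the path is a file or a directory, throwing
// FileNotFoundException when it is absent. This is why SparkContext#addFile
// already fails fast on a bad path, while (before this PR) addJar did not.
def statPath(path: String): Boolean = {
  val p = Paths.get(path)
  if (!Files.exists(p)) throw new FileNotFoundException(path)
  Files.isDirectory(p) // true for a directory, false for a regular file
}
```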
I think this problem does not only exist when using ADD JAR; normally if you call SparkContext#addJar directly, it will also fail. So my thinking is that it could be fixed in addJar, rather than in a separate method.
@jerryshao I was too focused on the SQL engine; what you said is right. Maybe I should check more with @srowen. Once this problem is confirmed, I will make the change.
@jerryshao How about my latest change?
To me, I would prefer to add the check in addJar, not a separate method, which also keeps it aligned with addFile (which also throws an exception in place when the file is not found).
@jerryshao sorry, when I @-mentioned you, I forgot to push my code from local to GitHub.
Please change the PR title to follow the Spark pattern like others. Also please remove the PR description template sentence and add your own.
ok to test
cc @GregOwen Could you take a look at this PR?
Test build #106804 has finished for PR 24909 at commit
Can't it be possible that the jar path isn't accessible at driver, but only at executors?
Special case: some jars may be used only in executors, but it seems we can't check that in the driver.
Test build #106806 has finished for PR 24909 at commit
@gatorsmile This PR LGTM.
Test build #106924 has finished for PR 24909 at commit
Test build #106925 has finished for PR 24909 at commit
Test build #106926 has finished for PR 24909 at commit
Test build #106927 has finished for PR 24909 at commit
Test build #106928 has finished for PR 24909 at commit
Test build #107575 has finished for PR 24909 at commit
val jarPath = "hdfs:///no/path/to/TestUDTF.jar"
sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local"))
sc.addJar(jarPath)
assert(sc.listJars().filter(_.contains("TestUDTF.jar")).size == 0)
Nit: how about .forall(j => !j.contains("TestUDTF.jar"))? Or just check .filter(...).isEmpty.
I guess this is about the best that can be done for a test without an FS to test against. So the behavior change here is that the bad path isn't added.
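The two suggested spellings are equivalent; a quick illustration on a plain Seq (the jar names below are made up so the snippet is self-contained, rather than calling sc.listJars()):

```scala
// Two equivalent ways to assert "no jar named TestUDTF.jar was added";
// forall reads the intent directly, filter(...).isEmpty builds an
// intermediate collection first.
val jars = Seq("spark://host:12345/jars/other.jar")

val viaForall = jars.forall(j => !j.contains("TestUDTF.jar"))
val viaIsEmpty = jars.filter(_.contains("TestUDTF.jar")).isEmpty

assert(viaForall && viaIsEmpty)
```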
I have changed the test assertion code. Yeah, if the path isn't added, the error won't happen.
Test build #107604 has finished for PR 24909 at commit
val jarPath = "hdfs:///no/path/to/TestUDTF.jar"
sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local"))
sc.addJar(jarPath)
assert(sc.listJars().forall(!_..contains("TestUDTF.jar")))
I think you have an extra period here
Before committing the code I accidentally hit the keyboard; I have changed it.
Test build #107605 has finished for PR 24909 at commit
Test build #107625 has finished for PR 24909 at commit
@srowen
Test build #4820 has started for PR 24909 at commit
@@ -1792,12 +1792,36 @@ class SparkContext(config: SparkConf) extends Logging {
    }
  }

  def addRemoteJarFile(path: String): String = {
Better to change this to checkRemoteJarFile, since this method only checks the jar file.
The original addJarFile also checks that the jar exists; it's all the same as for a local jar file.
addJarFile also adds the jar file to the fileserver; that's the key purpose there, not just checking.
Reasonable, done.
Test build #107666 has finished for PR 24909 at commit
Jenkins, retest this please.
Recently, SparkQA always returns an unreasonable status.
Test build #107674 has finished for PR 24909 at commit
Test build #4822 has finished for PR 24909 at commit
Looks fine to me. Sorry for jumping in late on the reviews. On the old discussion about whether we need to let people add a jar which doesn't exist yet: I agree with everybody else that there isn't a good reason to keep the old behavior; we should change it.
Maybe to guarantee the core startup process, we should ignore a bad path or stop the core early. Throwing an exception is OK for the STS and SparkSQLCLI, but not good for the startup process.
Jenkins, retest this please.
To avoid some flaky tests, running Jenkins again. Overall LGTM.
Test build #107712 has finished for PR 24909 at commit
Thanks for the fix, merging to master branch.
…Context, check jar path exist first.

## What changes were proposed in this pull request?
ISSUE: https://issues.apache.org/jira/browse/SPARK-28106
When we use ADD JAR in SQL, there are three steps:
- add the jar to HiveClient's classloader
- HiveClientImpl.runHiveSQL("ADD JAR " + PATH)
- SessionStateBuilder.addJar

The second step seems to have no impact on the whole process, since even if it fails, we can still execute. The first step adds the jar path to HiveClient's ClassLoader, so we can then use the jar in HiveClientImpl. The third step adds this jar path to SparkContext: for a local file path it calls the RpcServer's FileServer to add it to the Env, so a wrong path causes an error; but if you pass an HDFS or VIEWFS path, it isn't checked and is just added to the jar path map. Then, when the next TaskSetManager sends out a task, this path is carried by the TaskDescription, and the executor calls updateDependencies, which checks all jar and file paths in the TaskDescription. Then an error happens like below:
![image](https://user-images.githubusercontent.com/46485123/59817635-4a527f80-9353-11e9-9e08-9407b2b54023.png)

## How was this patch tested?
Existing unit tests
Environment test

Closes apache#24909 from AngersZhuuuu/SPARK-28106.
Lead-authored-by: Angers <angers.zhu@gamil.com>
Co-authored-by: 朱夷 <zhuyi01@corp.netease.com>
Signed-off-by: jerryshao <jerryshao@tencent.com>
What changes were proposed in this pull request?
ISSUE: https://issues.apache.org/jira/browse/SPARK-28106
When we use ADD JAR in SQL, there are three steps:
- add the jar to HiveClient's classloader
- HiveClientImpl.runHiveSQL("ADD JAR " + PATH)
- SessionStateBuilder.addJar
The second step seems to have no impact on the whole process, since even if it fails, we can still execute.
The first step adds the jar path to HiveClient's ClassLoader, so we can then use the jar in HiveClientImpl.
The third step adds this jar path to SparkContext: for a local file path it calls the RpcServer's FileServer to add it to the Env, so a wrong path causes an error; but if you pass an HDFS or VIEWFS path, it isn't checked and is just added to the jar path map.
Then, when the next TaskSetManager sends out a task, this path is carried by the TaskDescription, and the executor calls updateDependencies, which checks all jar and file paths in the TaskDescription. Then an error happens, as shown in the screenshot attached to the PR.
How was this patch tested?
Existing unit tests
Environment test
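Putting the thread together: the merged change makes addJar validate remote paths up front, so a bad HDFS path fails at ADD JAR time rather than at task launch. A toy, in-memory sketch of that before/after behavior (none of these names are Spark APIs; the registry and its `reachable` set are purely illustrative):

```scala
import java.io.FileNotFoundException
import scala.collection.mutable

// Toy model of the behavior change, not Spark code: a registry that
// validates a remote jar path before recording it. `reachable` simulates
// what fs.exists(hadoopPath) would report for the cluster filesystem.
class JarRegistry(reachable: Set[String]) {
  private val jars = mutable.LinkedHashSet.empty[String]

  def addJar(path: String): Unit = {
    // After SPARK-28106: fail fast at ADD JAR time instead of letting a bad
    // path surface later in Executor.updateDependencies at task launch.
    if (!reachable.contains(path)) {
      throw new FileNotFoundException(s"Jar $path not found")
    }
    jars += path
  }

  def listJars(): Seq[String] = jars.toSeq
}
```

This mirrors the PR's unit test: after a failed addJar, listJars() must not contain the bad jar.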