
[SPARK-28106][SQL] When Spark SQL use "add jar" , before add to SparkContext, check jar path exist first. #24909

Closed
wants to merge 29 commits into from

Conversation

Contributor

@AngersZhuuuu AngersZhuuuu commented Jun 19, 2019

What changes were proposed in this pull request?

ISSUE: https://issues.apache.org/jira/browse/SPARK-28106
When we use ADD JAR in SQL, it goes through three steps:

  • add the jar to HiveClient's classloader
  • HiveClientImpl.runHiveSQL("ADD JAR " + PATH)
  • SessionStateBuilder.addJar

The second step seems to have no impact on the whole process, since even if it fails, we can still execute.
The first step adds the jar path to HiveClient's ClassLoader, so we can use the jar in HiveClientImpl.
The third step adds the jar path to SparkContext. For a local file path it calls the RPC server's FileServer to add it to the environment, so if you pass a wrong path it raises an error; but if you pass an HDFS or ViewFS path, it doesn't check the path and just adds it to the jar path map.

Then, when the next TaskSetManager sends out a task, this path is carried by the TaskDescription. The executor then calls updateDependencies, which checks every jar path and file path in the TaskDescription, and an error like the one below happens:

![image](https://user-images.githubusercontent.com/46485123/59817635-4a527f80-9353-11e9-9e08-9407b2b54023.png)
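The essence of the proposed fix is to resolve the jar path and fail fast at ADD JAR time when it does not exist, instead of silently adding it to the jar path map and failing later on the executor. A minimal, self-contained sketch of that idea — java.nio stands in for Hadoop's FileSystem existence check here, and `checkJarExists` is an illustrative name, not Spark's actual code:

```scala
// Illustrative sketch only (not Spark's actual code): fail fast on a missing
// jar path, mirroring the fs.exists(hadoopPath) check the patch adds.
import java.io.FileNotFoundException
import java.nio.file.{Files, Paths}

def checkJarExists(path: String): String = {
  if (!Files.exists(Paths.get(path))) {
    // Reject at ADD JAR time, not at task launch in updateDependencies
    throw new FileNotFoundException(s"Jar $path not found")
  }
  path
}

try {
  checkJarExists("/no/such/dir/TestUDTF.jar")
} catch {
  case e: FileNotFoundException => println(s"rejected: ${e.getMessage}")
}
```

With this shape of check, a bad path surfaces as a FileNotFoundException in the session that ran ADD JAR, rather than as a task failure later.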

How was this patch tested?

Existing unit tests
Environment test

@AngersZhuuuu AngersZhuuuu changed the title [spark-28106] check add jar path exist jar [SPARK-28106] check add jar path exist jar Jun 19, 2019
@srowen srowen left a comment (Member)

I vaguely remember that we don't want to do this, because the JAR might not yet exist at the time the driver is started, as it might be distributed by Spark? but I think I could be misremembering.

@@ -1799,6 +1799,20 @@ class SparkContext(config: SparkConf) extends Logging {
// For local paths with backslashes on Windows, URI throws an exception
addJarFile(new File(path))
} else {
/**
Member

Nit: you don't want scaladoc syntax here, and the comment doesn't add anything anyway

val hadoopPath = new Path(schemeCorrectedPath)
val fs = hadoopPath.getFileSystem(hadoopConfiguration)
if(!fs.exists(hadoopPath))
throw new FileNotFoundException(s"Jar ${schemeCorrectedPath} not found")
Member

If anything, why not check this below?

Contributor Author

If anything, why not check this below?
When we use the "ADD JAR" SQL command, it calls SessionResourceBuilder's addJar method, which then calls SparkContext's addJar method. It does happen that when we add a jar path with the HDFS scheme, it doesn't check the path. Maybe we can add this check in SessionResourceBuilder?

Contributor Author

@srowen I moved this check to SessionResourceBuilder. Then only SQL queries trigger this check, so it won't impact the startup process.

Contributor

What is the potential impact if we add this change in SparkContext#addJar? What I can think of is that will delay the start process as each remote jar will be checked. Also do we need to add a similar check in SparkContext#addFile API?

Contributor Author

@jerryshao when we add a file, it calls fs.getFileStatus, which checks whether the path is a file or a directory; that call already throws an exception when we add a wrong file path:

19/06/20 14:59:45 ERROR org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation: "Error executing query, currentState RUNNING, "
java.io.FileNotFoundException: /userd
  at org.apache.hadoop.fs.viewfs.InodeTree.resolve(InodeTree.java:403)
  at org.apache.hadoop.fs.viewfs.ViewFileSystem.getFileStatus(ViewFileSystem.java:377)
  at org.apache.spark.SparkContext.addFile(SparkContext.scala:1546)
  at org.apache.spark.SparkContext.addFile(SparkContext.scala:1510)
  at org.apache.spark.sql.execution.command.AddFileCommand.run(resources.scala:50)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
  at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195)
  at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195)
  at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365)
  at org.apache.spark.sql.execution.SQLExecution$.withCustomJobTag(SQLExecution.scala:119)
  at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:79)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:143)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:195)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:80)
  at org.apache.spark.sql.SparkSession.sql(SparkSessi

@jerryshao jerryshao (Contributor) Jun 25, 2019

I think this problem does not only exist with ADD JAR; normally if you call SparkContext#addJar directly, it will also fail. So my thinking is that it should be fixed in addJar, rather than in a separate method.

Contributor Author

@jerryshao I was too focused on the SQL engine. What you said is right. Maybe I should check more with @srowen. Once this is confirmed, I will make the change.

Contributor Author

@jerryshao How about my latest change?

Contributor

To me, I would prefer to add the check in addJar, not a separate method, which also keeps it aligned with addFile (it will also throw an exception in place when the file is not found).

Contributor Author

@jerryshao sorry, when I @-mentioned you, I forgot to push my code from local to GitHub.

@jerryshao
Contributor

Please change the PR title to follow the Spark pattern like others. Also please remove the PR description template sentence and add your own.

@AngersZhuuuu AngersZhuuuu changed the title [SPARK-28106] check add jar path exist jar [SPARK-28106][SQL] When add jar, check path exist . Jun 20, 2019
@AngersZhuuuu AngersZhuuuu changed the title [SPARK-28106][SQL] When add jar, check path exist . [SPARK-28106][SQL] When add jar, check path exist first. Jun 20, 2019
@AngersZhuuuu AngersZhuuuu changed the title [SPARK-28106][SQL] When add jar, check path exist first. [SPARK-28106][SQL] When Spark SQL use "add jar" , before add to SparkContext, check jar path exist first. Jun 20, 2019
@gatorsmile
Member

ok to test

@gatorsmile
Member

cc @GregOwen Could you take a look at this PR?

@SparkQA

SparkQA commented Jun 23, 2019

Test build #106804 has finished for PR 24909 at commit 44b5462.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya viirya left a comment (Member)

Can't it be possible that the jar path isn't accessible at driver, but only at executors?

@AngersZhuuuu
Contributor Author

AngersZhuuuu commented Jun 23, 2019

Can't it be possible that the jar path isn't accessible at driver, but only at executors?

Special case: some jars may be used only on executors, but it seems we can't check that on the driver.
For ADD JAR, a local file is added to the RPC file server, and executors can fetch it from there. For a remote file, we just make sure it exists, then let the executors fetch it. That's enough.
If the driver can reach the path but the executors can't, that is an environment setup problem.
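The local-versus-remote split described here can be sketched by branching on the URI scheme. This is a hypothetical, Spark-free illustration (the function name and returned labels are made up, not Spark's API):

```scala
// Hypothetical sketch: local paths are served to executors via the driver's
// RPC file server, while remote paths (hdfs://, viewfs://, ...) are only
// checked for existence on the driver and fetched directly by executors.
import java.net.URI

def classifyJarPath(path: String): String = {
  val scheme = Option(new URI(path).getScheme).getOrElse("file")
  scheme match {
    case "file" | "local" => "serve-via-rpc-file-server"
    case _                => "existence-check-then-executor-fetch"
  }
}

println(classifyJarPath("/tmp/udf.jar"))
println(classifyJarPath("hdfs:///libs/TestUDTF.jar"))
```

Under this split, an existence check on the driver covers the remote branch, while the local branch already fails when the file server cannot read the file.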

@SparkQA

SparkQA commented Jun 23, 2019

Test build #106806 has finished for PR 24909 at commit 63b7c6a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@GregOwen
Contributor

@gatorsmile This PR LGTM.
I checked with @yunzoud and she says that she doesn't know of any applications that currently use the "add a jar that doesn't yet exist" feature that @srowen mentions in his comment. If we're concerned about breaking those workflows, we can add a Spark conf to decide whether or not to fail fast.

@SparkQA

SparkQA commented Jun 26, 2019

Test build #106924 has finished for PR 24909 at commit cf98646.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 26, 2019

Test build #106925 has finished for PR 24909 at commit 71af716.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 26, 2019

Test build #106926 has finished for PR 24909 at commit e863d20.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 26, 2019

Test build #106927 has finished for PR 24909 at commit 4bb4e89.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 26, 2019

Test build #106928 has finished for PR 24909 at commit f53fe21.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 12, 2019

Test build #107575 has finished for PR 24909 at commit 8d0f3f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val jarPath = "hdfs:///no/path/to/TestUDTF.jar"
sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local"))
sc.addJar(jarPath)
assert(sc.listJars().filter(_.contains("TestUDTF.jar")).size == 0)
Member

Nit: How about .forall(j => !j.contains("TestUDTF.jar"))? or just check .filter(...).isEmpty
I guess this is about the best that can be done for a test without an FS to test against. So the behavior change here is that the bad path isn't added.
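The two assertion styles suggested here are interchangeable; a quick plain-Scala check (no SparkContext needed, the jar list below is made up):

```scala
// Both suggested styles express "no listed jar path mentions TestUDTF.jar".
val jars = Seq("spark://host:1234/jars/other.jar")

val viaForall = jars.forall(j => !j.contains("TestUDTF.jar"))
val viaFilter = jars.filter(_.contains("TestUDTF.jar")).isEmpty

println(viaForall) // true
println(viaFilter) // true
```

`forall` reads as the intended property directly, while `filter(...).isEmpty` builds an intermediate collection; either is fine in a test.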

Contributor Author

Have changed the test assertion code.
Yeah, if the path isn't added, the error won't happen.

@SparkQA

SparkQA commented Jul 12, 2019

Test build #107604 has finished for PR 24909 at commit da76d97.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val jarPath = "hdfs:///no/path/to/TestUDTF.jar"
sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local"))
sc.addJar(jarPath)
assert(sc.listJars().forall(!_..contains("TestUDTF.jar")))
Member

I think you have an extra period here

Contributor Author

Before committing the code I accidentally hit the keyboard; I have changed it.

@SparkQA

SparkQA commented Jul 12, 2019

Test build #107605 has finished for PR 24909 at commit 8820641.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 13, 2019

Test build #107625 has finished for PR 24909 at commit 03dcfaf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AngersZhuuuu
Contributor Author

@srowen
The test failed, but it doesn't seem to be caused by my change.

@SparkQA

SparkQA commented Jul 13, 2019

Test build #4820 has started for PR 24909 at commit 03dcfaf.

@@ -1792,12 +1792,36 @@ class SparkContext(config: SparkConf) extends Logging {
}
}

def addRemoteJarFile(path: String): String = {
Contributor

Better to change to checkRemoteJarFile, here in this method it only checks the jar file.

Contributor Author

The original addJarFile also checks that the jar exists; it's all the same as for a local jar file.

Contributor

addJarFile also adds the jar file to fileserver, that's the key purpose there, not just checking.

Contributor Author

Reasonable, done.

@SparkQA

SparkQA commented Jul 15, 2019

Test build #107666 has finished for PR 24909 at commit 780a2b5.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jerryshao
Contributor

Jenkins, retest this please.

@AngersZhuuuu
Contributor Author

AngersZhuuuu commented Jul 15, 2019

Jenkins, retest this please.

Recently, SparkQA has kept returning unreasonable statuses.
It reports a unit-test failure, but I can't find which test failed.

@SparkQA

SparkQA commented Jul 15, 2019

Test build #107674 has finished for PR 24909 at commit 780a2b5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 15, 2019

Test build #4822 has finished for PR 24909 at commit 780a2b5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito
Contributor

squito commented Jul 15, 2019

looks fine to me.

Sorry jumping in late on the reviews. on the old discussion about whether we need to let people add a jar which doesn't exist yet, I agree with everybody else that there isn't a good reason to keep the old behavior, we should change it.

@AngersZhuuuu
Contributor Author

looks fine to me.

Sorry jumping in late on the reviews. on the old discussion about whether we need to let people add a jar which doesn't exist yet, I agree with everybody else that there isn't a good reason to keep the old behavior, we should change it.

Maybe it was to guarantee the core startup process: either ignore the bad path or stop the core early. Throwing an exception is OK for STS and SparkSQLCLI, but not good for the startup process.

@jerryshao
Contributor

Jenkins, retest this please.

@jerryshao
Contributor

To avoid some flaky tests, run jenkins again. Overall LGTM.

@SparkQA

SparkQA commented Jul 16, 2019

Test build #107712 has finished for PR 24909 at commit 780a2b5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jerryshao
Contributor

Thanks for the fix, merging to master branch.

@jerryshao jerryshao closed this in be4a552 Jul 16, 2019
vinodkc pushed a commit to vinodkc/spark that referenced this pull request Jul 18, 2019
…Context, check jar path exist first.


Closes apache#24909 from AngersZhuuuu/SPARK-28106.

Lead-authored-by: Angers <angers.zhu@gamil.com>
Co-authored-by: 朱夷 <zhuyi01@corp.netease.com>
Signed-off-by: jerryshao <jerryshao@tencent.com>