
[SPARK-23729][CORE] Respect URI fragment when resolving globs #20853

Closed · wants to merge 5 commits

Conversation

misutoth (Contributor)

What changes were proposed in this pull request?

First, glob resolution no longer swallows the remote name part (the portion preceded by the # sign) when using the --files or --archives options.

Moreover, in the special case where the glob resolves to multiple files, the remote naming does not make sense and an error is returned.

How was this patch tested?

Enhanced an existing test and wrote an additional test for the error case.

…hich is meant to be the remote name

In case glob resolution results in multiple items for a file with a remote name, an error is displayed.
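A minimal sketch of the proposed behavior (in Java for illustration; the hypothetical resolveKeepingFragment helper stands in for Spark's actual glob-resolution code): split off the URI fragment, expand the glob, then re-attach the fragment to every match, failing when a fragment is combined with an ambiguous glob.

```java
import java.net.URI;
import java.util.Arrays;

public class GlobFragmentSketch {
    // Hypothetical helper: 'globMatches' stands in for the result of Hadoop's
    // fs.globStatus; Spark's real code resolves the glob itself.
    static String[] resolveKeepingFragment(String path, String[] globMatches) {
        String fragment = URI.create(path).getFragment();
        if (fragment == null) {
            return globMatches;                      // nothing to re-attach
        }
        if (globMatches.length > 1) {                // renaming several files to one name is ambiguous
            throw new IllegalArgumentException(
                path + " resolves ambiguously to multiple files: " + String.join(",", globMatches));
        }
        return Arrays.stream(globMatches)
                     .map(p -> p + "#" + fragment)   // keep the remote name on the resolved path
                     .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String[] out = resolveKeepingFragment("file:/tmp/*.zip#archive3",
                                              new String[] {"file:/tmp/a.zip"});
        System.out.println(Arrays.toString(out));    // [file:/tmp/a.zip#archive3]
    }
}
```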
@squito (Contributor)

squito commented Mar 19, 2018

Jenkins, ok to test

@vanzin (Contributor)

vanzin commented Mar 19, 2018

nit: pr title should describe the solution, not the problem. e.g. "Respect URI fragment when resolving globs" is a description of the solution.

@misutoth misutoth changed the title [SPARK-23729][SS] Glob resolution is done without the fragment part which is meant to be the remote name [SPARK-23729][SS] Respect URI fragment when resolving globs Mar 19, 2018
@vanzin (Contributor)

vanzin commented Mar 19, 2018

Also this is not related to structured streaming, so [ss] is wrong. This is a core change.

@misutoth misutoth changed the title [SPARK-23729][SS] Respect URI fragment when resolving globs [SPARK-23729][CORE] Respect URI fragment when resolving globs Mar 19, 2018
@gaborgsomogyi (Contributor) left a comment

Please change [SS] which is structured streaming to something else like [CORE].

@@ -105,11 +105,17 @@ class SparkSubmitSuite

// Necessary to make ScalaTest 3.x interrupt a thread on the JVM like ScalaTest 2.2.x
implicit val defaultSignaler: Signaler = ThreadSignaler
var dir: File = null
Contributor:

Most of the tests don't use this dir at all. Why create it for all the tests?

Contributor Author:

I was thinking about this too. I wanted to avoid making it an Option or doing a not very nice null check. I can do that later though...

Contributor:

I mean, put something here that is used by more than 2 tests. There are ~40 tests that just create and delete this directory without any benefit.

Contributor Author:

I wanted to make sure the directory is deleted even if the test fails

Contributor:

Yeah, there is a good example here: test("launch simple application with spark-submit with redaction")

val jars = "/jar1,/jar2" // --jars
val files = "local:/file1,file2" // --files
val archives = "file:/archive1,archive2" // --archives
val archives = s"file:/archive1,${dir.toPath.toAbsolutePath.toString}/*.zip#archive3"
// --archives
val pyFiles = "py-file1,py-file2" // --py-files
Contributor:

Does --py-files support renaming?

Contributor Author:

According to the doc only --files and --archives support it.

Contributor:

OK, thanks.

Contributor:

YARN's Client.scala supports renaming for everything that uses the distributed cache, even if that's not explicitly called out in the docs.
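The rename syntax under discussion can be sketched in shell (the archive path is made up; note that splitting on `#` by hand is exactly what the reviewers advise against in the Spark code itself, in favor of Utils.resolveURI):

```shell
# Everything before '#' is the (glob) path to resolve;
# everything after it is the name the file gets on the remote side.
path='/tmp/archives/*.zip#archive3'

glob=${path%%#*}   # strip the fragment suffix
name=${path#*#}    # strip everything up to the '#'

echo "glob=$glob name=$name"
```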

@@ -657,6 +667,31 @@ class SparkSubmitSuite
conf3.get(PYSPARK_PYTHON.key) should be ("python3.5")
}

var cleanExit = false
Contributor:

What is it used for?

} catch {
case e: SparkException =>
printErrorAndExit(e.getMessage)
throw new RuntimeException("Unreachable production code")
Contributor:

Nit: I have a feeling it's a bit overkill compared to the other occurrences.

Contributor Author:

Which part do you mean is overkill?

Contributor:

throw new RuntimeException...

Contributor Author:

Otherwise execution just continues in the test itself where exitFn does not stop it.

Contributor:

Maybe just let the exception propagate? That's what a lot of this code does... then you don't need to change this file at all.

Contributor:

+1

Contributor Author:

Actually, the directory deletion is hooked into JVM shutdown. So I will let that do the housekeeping for us, which also avoids a new field.

val renameAs = if (spath.length > 1) Some(spath(1)) else None
val resolved: Array[String] = resoloveGlobPath(spath(0), hadoopConf)
resolved match {
case array: Array[String] if !renameAs.isEmpty && array.length>1 =>
Contributor:

Nit: array.length > 1

val spath = path.split('#')
val renameAs = if (spath.length > 1) Some(spath(1)) else None
val resolved: Array[String] = resoloveGlobPath(spath(0), hadoopConf)
resolved match {
Contributor:

This can be simplified something like this: (renameAs, resolved) match...

Contributor:

This whole match block is a little ugly, but I'll wait to see how you implement Gabor's suggestion...

}.getOrElse(Array(path))
val spath = path.split('#')
val renameAs = if (spath.length > 1) Some(spath(1)) else None
val resolved: Array[String] = resoloveGlobPath(spath(0), hadoopConf)
Contributor:

Nit: resolveGlobPath

case array: Array[String] if !renameAs.isEmpty && array.length>1 =>
throw new SparkException(
s"${spath(1)} resolves ambiguously to multiple files: ${array.mkString(",")}")
case array: Array[String] if !renameAs.isEmpty => array.map( _ + "#" + renameAs.get)
Contributor:

Maybe we can find a more meaningful name for array; it makes the code hard for me to read.

Option(fs.globStatus(new Path(uri))).map { status =>
status.filter(_.isFile).map(_.getPath.toUri.toString)
}.getOrElse(Array(path))
val spath = path.split('#')
Contributor:

Why not use Utils.resolveURI as before? Parsing URIs by hand is very sketchy.

Contributor Author:

You are right. It took some time to clone a URI without the fragment part, but the next version will include that.
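For reference, cloning a URI minus its fragment can be done with the three-argument java.net.URI constructor; this is a generic JDK sketch (a hypothetical withoutFragment helper), not the exact Spark code.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class StripFragment {
    // Rebuild the URI from its scheme and scheme-specific part, passing null
    // for the fragment, which drops the '#...' suffix.
    static URI withoutFragment(URI uri) throws URISyntaxException {
        return new URI(uri.getScheme(), uri.getSchemeSpecificPart(), null);
    }

    public static void main(String[] args) throws URISyntaxException {
        URI u = new URI("file:/tmp/*.zip#archive3");
        System.out.println(withoutFragment(u)); // file:/tmp/*.zip
        System.out.println(u.getFragment());    // archive3
    }
}
```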

val spath = path.split('#')
val renameAs = if (spath.length > 1) Some(spath(1)) else None
val resolved: Array[String] = resoloveGlobPath(spath(0), hadoopConf)
resolved match {
Contributor:

This whole match block is a little ugly, but I'll wait to see how you implement Gabor's suggestion...

} catch {
case e: SparkException =>
printErrorAndExit(e.getMessage)
throw new RuntimeException("Unreachable production code")
Contributor:

Maybe just let the exception propagate? That's what a lot of this code does... then you don't need to change this file at all.

@misutoth (Contributor Author)

> Maybe just let the exception propagate? That's what a lot of this code does... then you don't need to change this file at all.

@vanzin I want to present an error on the CLI. That is what printErrorAndExit does, as I understood it.

On the other hand, I simplified it to just rethrow the exception e.

@SparkQA

SparkQA commented Mar 19, 2018

Test build #88379 has finished for PR 20853 at commit 50b5ad1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}.getOrElse(Array(path))
val (base, fragment) = splitOnFragment(Utils.resolveURI(path))
(resolveGlobPath(base, hadoopConf), fragment) match {
case (resolved: Array[String], Some(_)) if resolved.length > 1 => throw new SparkException(
Contributor:

Type inference is not working here?

(resolveGlobPath(base, hadoopConf), fragment) match {
case (resolved: Array[String], Some(_)) if resolved.length > 1 => throw new SparkException(
s"${base.toString} resolves ambiguously to multiple files: ${resolved.mkString(",")}")
case (resolved: Array[String], Some(namedAs)) => resolved.map( _ + "#" + namedAs)
Contributor:

Same here.

@SparkQA

SparkQA commented Mar 20, 2018

Test build #88385 has finished for PR 20853 at commit 9f391de.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 20, 2018

Test build #88389 has finished for PR 20853 at commit 8a12452.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val (base, fragment) = splitOnFragment(path)
(resolveGlobPath(base, hadoopConf), fragment) match {
case (resolved, Some(_)) if resolved.length > 1 => throw new SparkException(
s"${base.toString} resolves ambiguously to multiple files: ${resolved.mkString(",")}")
Contributor:

nit: resolved.mkString(", ")

Contributor Author:

There was no space used here before. Actually, there should not be any space in the resulting list; the tests also rely on this.

(withoutFragment, fragment)
}

private def resolveGlobPath(uri: URI, hadoopConf: Configuration): Array [String] = {
Contributor:

nit: Array[String]

try {
doPrepareSubmitEnvironment(args, conf)
} catch {
case e: SparkException => printErrorAndExit(e.getMessage); throw e
Contributor:

nit:

case e: SparkException =>
    printErrorAndExit(e.getMessage)
    throw e

@SparkQA

SparkQA commented Mar 20, 2018

Test build #88406 has finished for PR 20853 at commit 5515da6.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor)

vanzin commented Mar 20, 2018

retest this please

@SparkQA

SparkQA commented Mar 20, 2018

Test build #88430 has finished for PR 20853 at commit 5515da6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 20, 2018

Test build #88429 has finished for PR 20853 at commit 5515da6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor) left a comment

Some minor things; otherwise it looks good.

private def splitOnFragment(path: String): (URI, Option[String]) = {
val uri = Utils.resolveURI(path)
val withoutFragment = new URI(uri.getScheme, uri.getSchemeSpecificPart, null)
val fragment = if (uri.getFragment != null) Some(uri.getFragment) else None
Contributor:

Option(uri.getFragment)
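The suggestion replaces the explicit null check with the Option factory, which maps null to None. The same idiom in Java (an illustrative JDK analogue with a hypothetical fragmentOf helper, not Spark code) is Optional.ofNullable:

```java
import java.net.URI;
import java.util.Optional;

public class FragmentOption {
    // Optional.ofNullable(x) is Optional.empty() when x is null -- the Java
    // analogue of Scala's Option(uri.getFragment).
    static Optional<String> fragmentOf(String path) {
        return Optional.ofNullable(URI.create(path).getFragment());
    }

    public static void main(String[] args) {
        System.out.println(fragmentOf("file:/a.zip#archive3")); // Optional[archive3]
        System.out.println(fragmentOf("file:/a.zip"));          // Optional.empty
    }
}
```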

Files.createFile(archive2)
val jars = "/jar1,/jar2" // --jars
val files = "local:/file1,file2" // --files
// --archives
Contributor:

Unnecessary comment. I know the other test has them, but I'd just remove these from this new code, since they don't add any useful information.

val archives = s"file:/archive1,${dir.toPath.toAbsolutePath.toString}/*.zip#archive3"
val pyFiles = "py-file1,py-file2" // --py-files

// Test files and archives (Yarn)
Contributor:

Unnecessary comment.

@SparkQA

SparkQA commented Mar 21, 2018

Test build #88476 has finished for PR 20853 at commit ce5273d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor)

vanzin commented Mar 22, 2018

Merging to master / 2.3.

asfgit pushed a commit that referenced this pull request Mar 22, 2018
First, glob resolution no longer swallows the remote name part (preceded by the `#` sign) when using the `--files` or `--archives` options.

Moreover, in the special case where the glob resolves to multiple files, the remote naming does not make sense and an error is returned.

Enhanced an existing test and wrote an additional test for the error case.

Author: Mihaly Toth <misutoth@gmail.com>

Closes #20853 from misutoth/glob-with-remote-name.

(cherry picked from commit 0604bea)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
@asfgit asfgit closed this in 0604bea Mar 22, 2018
peter-toth pushed a commit to peter-toth/spark that referenced this pull request Oct 6, 2018

Author: Mihaly Toth <misutoth@gmail.com>

Closes apache#20853 from misutoth/glob-with-remote-name.

(cherry picked from commit 0604bea)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023
Ref: HADOOP-42709

Author: Mihaly Toth <misutoth@gmail.com>

Closes apache#20853 from misutoth/glob-with-remote-name.

(cherry picked from commit 0604bea)

RB=1500362
BUG=LIHADOOP-42709
G=superfriends-reviewers
R=fli,mshen,yezhou,edlu
A=fli,xhzhang