[SPARK-17339][SPARKR][CORE] Fix some R tests and use Path.toUri in SparkContext for Windows paths in SparkR #14960
Conversation
I would like to make sure that this passes both the AppVeyor CI for SparkR and the Jenkins one. So, I will remove
```scala
// Make sure /C:/ part becomes /C/.
val windowsUri = new URI(path.substring(1))
val driveLetter = windowsUri.getScheme
s"/$driveLetter/${windowsUri.getSchemeSpecificPart()}"
```
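For illustration only, here is a hypothetical standalone sketch (not the PR's code) of the same drive-letter normalization done with plain string handling; `normalizeWindowsPath` is a made-up name:

```scala
// Hypothetical helper, for illustration only (not the PR's code):
// turn "/C:/Users/me" into "/C/Users/me" by dropping the colon.
def normalizeWindowsPath(path: String): String =
  if (path.matches("^/[A-Za-z]:/.*")) path.patch(2, "", 1) else path

println(normalizeWindowsPath("/C:/Users/me")) // /C/Users/me
println(normalizeWindowsPath("/home/me"))     // /home/me (unchanged)
```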
BTW, this logic is adapted from Hadoop code. Please let me know if there are equivalent functions.
Is this change needed? I think resolveURI already handles Windows-style paths:
https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/util/UtilsSuite.scala#L398
Or did you find another case that this method cannot handle properly?
Hm, yeap. Initially, I didn't add this logic and ran the tests with this diff (HyukjinKwon/spark@master...HyukjinKwon:a136206dad6011d6ad112f33417790ad3c6a9912); it produced the output here:
https://gist.github.com/HyukjinKwon/f3f9a36dde88028ca09fd417b6ce5c68
(several test failures were removed here anyway).
So, I corrected this (without knowing we were testing that case). With the diff here (HyukjinKwon/spark@master...HyukjinKwon:b648e4f2d5aae072748a97550cfcf832c57d9315), it produced the output here: https://gist.github.com/HyukjinKwon/0c42b2c208e06c59525d91087252d9b0
(all test failures except for two cases were removed here).
This change seems to remove roughly ten test failures, so I thought it was a legitimate change.
I will double-check and look into this more deeply. I didn't notice we had that test anyway. Thanks for pointing this out.
Here are the original test failures, FYI - https://gist.github.com/shivaram/7693df7bd54dc81e2e7d1ce296c41134
The other option is, of course, to just use the Hadoop Path class and do something like new Path(path).toUri -- I think they handle C:/ correctly. I don't know if this affects other functionality though (like SPARK-11227), and we should check with @sarutak.
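For context, a quick runnable check with plain java.net.URI (not Hadoop's Path) shows why a bare drive letter confuses URI parsing in the first place: the drive letter is taken as a URI scheme.

```scala
// java.net.URI parses a leading drive letter as a URI scheme,
// which is why Windows paths like C:/Users/me need special handling.
val u = new java.net.URI("C:/Users/me")
println(u.getScheme)             // C
println(u.getSchemeSpecificPart) // /Users/me
```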
Yeap, meanwhile, I will try to use that and run tests. Thanks for your quick feedback.
The way @shivaram mentioned works well and doesn't affect SPARK-11227.
@HyukjinKwon You can fix this problem that way, but if you fix resolveURI, adding new test cases to UtilsSuite would be desirable.
Let me just use new Path(...).toUri directly to deal with this, if that seems okay.
I tried to fix resolveURI to use new Path(path).toUri instead of new URI(path), but I found it breaks existing tests for resolveURI. It seems to parse special characters, for example #, differently:
"hdfs:/root/spark.jar[%23]app.jar" did not equal "hdfs:/root/spark.jar[#]app.jar"
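The difference can be reproduced with plain java.net.URI. The single-string constructor treats # as a fragment delimiter, while the multi-argument constructor percent-encodes it as part of the path; a sketch (the attribution to Hadoop's Path internals is an assumption based on the %23 in the failure above):

```scala
// Parsing a string: '#' is taken as a fragment delimiter.
val parsed = new java.net.URI("hdfs:/root/spark.jar#app.jar")
println(parsed.getPath)      // /root/spark.jar
println(parsed.getFragment)  // app.jar

// Building from components: '#' is percent-encoded as part of the path,
// matching the %23 seen in the failing test.
val built = new java.net.URI("hdfs", null, "/root/spark.jar#app.jar", null)
println(built.toString)      // hdfs:/root/spark.jar%23app.jar
```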
OK, let's fix the other problem of resolveURI in another PR.
Test build #64936 has finished for PR 14960 at commit
Yup, finally we got a green - https://ci.appveyor.com/project/HyukjinKwon/spark/build/77-SPARK-17339-fix-r!
Test build #64972 has finished for PR 14960 at commit
```diff
@@ -992,7 +992,7 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
   // This is a hack to enforce loading hdfs-site.xml.
   // See SPARK-11227 for details.
-  FileSystem.get(new URI(path), hadoopConfiguration)
+  FileSystem.get(new Path(path).toUri, hadoopConfiguration)
```
One minor question I had was how this would work with a comma-separated list of file names, as we allow that in textFile (for example at https://github.com/HyukjinKwon/spark/blob/790d5b2304473555d1edf113f9bbee3034134fac/core/src/test/scala/org/apache/spark/SparkContextSuite.scala#L323)
I think that's handled upstream in SparkSession/SQLContext.
Hm, I didn't know it supports comma-separated paths. BTW, we can still use spark.sparkContext.textFile(..) though.
I took a look and it seems okay (but it's ugly and hacky). If the first given path is okay, it seems to work fine. It looks like only getScheme and getAuthority are used in FileSystem.get(..) (I tracked down FileSystem.get(..) and the related function calls). So, if the first path is correct, getAuthority and getScheme give the correct values to get a file system.
For example, the path http://localhost:8080/a/b,http://localhost:8081/c/d parses the URI as below:
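A quick check with plain java.net.URI (an assumption: Hadoop's Path yields a similar URI for this input) shows that only the first path contributes the scheme and authority, while the rest of the string ends up inside the path component:

```scala
// Only the first "scheme://authority" is recognized; the comma and the
// second URL are just treated as part of the path.
val uri = new java.net.URI("http://localhost:8080/a/b,http://localhost:8081/c/d")
println(uri.getScheme)    // http
println(uri.getAuthority) // localhost:8080
println(uri.getPath)      // /a/b,http://localhost:8081/c/d
```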
As this is known to be hacky and ugly, maybe we can split it out into another issue (although I am careful to say this)?
cc @sarutak - WDYT? Is my understanding correct?
Yeah, I'm not sure which part of the URI we are using here. If it's just the scheme and authority, then I think it's fine to use those from the first path. FWIW, there is a method in Hadoop to parse comma-separated path strings, but it's private [1].
IMHO this problem existed even before this PR, so I'm fine not fixing it here if that's okay with @sarutak
Thanks @HyukjinKwon - LGTM except for a minor comment I had inline.
I found we can replace these. @HyukjinKwon You can replace them in this PR or leave it as is.
@sarutak Ah, I will do this here. Thanks!
I re-ran the test after this commit - https://ci.appveyor.com/project/HyukjinKwon/spark/build/82-SPARK-17339-fix-r. Let's wait and see :) If there is any problem with this, I will revert the change.
Seems to fail to build:
Yeap, I quickly fixed it and re-ran :) Thanks!
Looks like it passed! LGTM pending our Jenkins tests
Test build #65026 has finished for PR 14960 at commit
retest this please
Test build #65030 has finished for PR 14960 at commit
LGTM. Merging this into
I noticed this PR cannot be merged cleanly to
Sure! I will do it tomorrow.
…n hadoopFile and newHadoopFile APIs
What changes were proposed in this pull request?
This PR backports #14960
How was this patch tested?
AppVeyor - https://ci.appveyor.com/project/HyukjinKwon/spark/build/86-backport-SPARK-17339-r
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #15008 from HyukjinKwon/backport-SPARK-17339.
What changes were proposed in this pull request?
This PR fixes the Windows path issues in several APIs. Please refer https://issues.apache.org/jira/browse/SPARK-17339 for more details.
How was this patch tested?
Tests via AppVeyor CI - https://ci.appveyor.com/project/HyukjinKwon/spark/build/82-SPARK-17339-fix-r