
[SPARK-17339][SPARKR][CORE] Fix some R tests and use Path.toUri in SparkContext for Windows paths in SparkR #14960

Closed · wants to merge 7 commits

Conversation

@HyukjinKwon (Member) commented Sep 5, 2016

What changes were proposed in this pull request?

This PR fixes Windows path handling in several APIs. Please refer to https://issues.apache.org/jira/browse/SPARK-17339 for more details.

How was this patch tested?

Tests via AppVeyor CI - https://ci.appveyor.com/project/HyukjinKwon/spark/build/82-SPARK-17339-fix-r

@HyukjinKwon (Member, Author)

I would like to make sure this passes both the AppVeyor CI for SparkR and the Jenkins builds, so I will remove [WIP] once the tests pass on both.

// Make sure /C:/ part becomes /C/.
val windowsUri = new URI(path.substring(1))
val driveLetter = windowsUri.getScheme
s"/$driveLetter/${windowsUri.getSchemeSpecificPart()}"
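For illustration, the parsing in the Scala snippet above can be reproduced with java.net.URI alone. This is a standalone sketch using a hypothetical path, not the Spark code itself: stripping the leading slash makes the drive letter parse as the URI scheme, which the snippet then reuses.

```java
import java.net.URI;

public class DriveLetterDemo {
    public static void main(String[] args) throws Exception {
        String path = "/C:/Users/spark";              // hypothetical Windows-style path
        URI windowsUri = new URI(path.substring(1));  // parses "C:/Users/spark"
        // The drive letter ends up as the URI scheme, the rest as the scheme-specific part.
        System.out.println(windowsUri.getScheme());             // C
        System.out.println(windowsUri.getSchemeSpecificPart()); // /Users/spark
        // Reassembling as in the snippet: "/C" + "/Users/spark" -> a /C/... style path.
    }
}
```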
Member Author

BTW, this logic is adapted from Hadoop's code. Please let me know if there is an equivalent function.

Member

Is this change needed? I think resolveURI already handles Windows-style paths.
https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/util/UtilsSuite.scala#L398

Or did you find another case that this method cannot handle properly?

Member Author

Hm, yes. At first I didn't add this logic, and I ran the tests with this diff (HyukjinKwon/spark@master...HyukjinKwon:a136206dad6011d6ad112f33417790ad3c6a9912), which produced this output:

https://gist.github.com/HyukjinKwon/f3f9a36dde88028ca09fd417b6ce5c68

(several test failures were resolved even so).

Then I made this correction (without knowing we were testing that case) with this diff (HyukjinKwon/spark@master...HyukjinKwon:b648e4f2d5aae072748a97550cfcf832c57d9315), which produced this output: https://gist.github.com/HyukjinKwon/0c42b2c208e06c59525d91087252d9b0

(all test failures except two were resolved).

It removes roughly ten test failures, so I thought this was a legitimate change.

Member Author

I will double-check and look into this more deeply. I didn't notice we had that test. Thanks for pointing this out.

Contributor

The other option, of course, is to just use the Hadoop Path class and do something like new Path(path).toUri -- I think it handles C:/ correctly. I don't know whether this affects other functionality, though (like SPARK-11227), so we should check with @sarutak.

Member Author

Yes; meanwhile, I will try that and run the tests. Thanks for your quick feedback.

Member

The approach @shivaram mentioned works well and doesn't affect SPARK-11227.
@HyukjinKwon You can fix this problem that way, but if you do fix resolveURI, adding new test cases to UtilsSuite would be desirable.

Member Author

Let me just use new Path(...).toUri directly to deal with this, if that seems okay.

I tried changing resolveURI to use new Path(path).toUri instead of new URI(path), but I found it breaks the existing tests for resolveURI. It seems to parse special characters such as # differently:

"hdfs:/root/spark.jar[%23]app.jar" did not equal "hdfs:/root/spark.jar[#]app.jar"
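The difference comes from fragment handling: java.net.URI treats # as a fragment delimiter, while Hadoop's Path keeps it as a literal path character and percent-encodes it as %23 when converting back to a URI. The java.net.URI side can be checked directly (the Hadoop side would need the hadoop-common dependency, so it is only described in comments here):

```java
import java.net.URI;

public class FragmentDemo {
    public static void main(String[] args) throws Exception {
        URI uri = new URI("hdfs:/root/spark.jar#app.jar");
        // java.net.URI splits at '#': the fragment is separated from the path.
        System.out.println(uri.getPath());     // /root/spark.jar
        System.out.println(uri.getFragment()); // app.jar
        // By contrast, Hadoop's new Path("hdfs:/root/spark.jar#app.jar").toUri()
        // would keep "#" as part of the path, escaped to %23 -- hence the test
        // failure quoted above.
    }
}
```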

Member

OK, let's fix that other resolveURI problem in a separate PR.


SparkQA commented Sep 5, 2016

Test build #64936 has finished for PR 14960 at commit f42c182.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member, Author)

Yup, we finally got a green build - https://ci.appveyor.com/project/HyukjinKwon/spark/build/77-SPARK-17339-fix-r!

@HyukjinKwon changed the title from [WIP][SPARK-17339][SPARKR][CORE] Fix Windows path issues in SparkContext causing the tests failures in SparkR on Windows to [SPARK-17339][SPARKR][CORE] Fix Windows path issues in SparkContext causing the tests failures in SparkR on Windows on Sep 6, 2016
@HyukjinKwon changed the title from [SPARK-17339][SPARKR][CORE] Fix Windows path issues in SparkContext causing the tests failures in SparkR on Windows to [SPARK-17339][SPARKR][CORE] Fix some R tests and use Path.toUri in SparkContext for Windows paths in SparkR on Sep 6, 2016

SparkQA commented Sep 6, 2016

Test build #64972 has finished for PR 14960 at commit 790d5b2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -992,7 +992,7 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli

// This is a hack to enforce loading hdfs-site.xml.
// See SPARK-11227 for details.
- FileSystem.get(new URI(path), hadoopConfiguration)
+ FileSystem.get(new Path(path).toUri, hadoopConfiguration)
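A minimal illustration of why the old new URI(path) call is fragile on Windows (hypothetical paths; this is a standalone sketch of the failure mode, not Spark code):

```java
import java.net.URI;
import java.net.URISyntaxException;

public class WindowsUriDemo {
    public static void main(String[] args) throws Exception {
        // Forward-slash Windows path: the drive letter is misread as the URI scheme.
        URI slashy = new URI("C:/tmp/data.txt");
        System.out.println(slashy.getScheme()); // C -- not a real filesystem scheme

        // Backslash Windows path: '\' is not a legal URI character, so parsing
        // fails outright with URISyntaxException.
        try {
            new URI("C:\\tmp\\data.txt");
        } catch (URISyntaxException e) {
            System.out.println("URISyntaxException: backslashes are illegal in URIs");
        }
        // Hadoop's new Path(path).toUri handles both forms of Windows path.
    }
}
```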
@shivaram (Contributor) commented Sep 6, 2016

One minor question I had: how would this work with a comma-separated list of file names, which we allow in textFile (for example at https://github.com/HyukjinKwon/spark/blob/790d5b2304473555d1edf113f9bbee3034134fac/core/src/test/scala/org/apache/spark/SparkContextSuite.scala#L323)?

Member

I think that's handled upstream in SparkSession/SQLContext.

Member Author

Hm, I didn't know it supported comma-separated paths. BTW, we can still use spark.sparkContext.textFile(..), though.
I took a look, and it seems okay (though it's ugly and hacky).

If the first given path is okay, it seems to work fine. It looks like only getScheme and getAuthority are used in FileSystem.get(..) (I traced through FileSystem.get(..) and the related function calls).

So, as long as the first path is correct, getAuthority and getScheme return the correct values to get a file system.

For example, the path http://localhost:8080/a/b,http://localhost:8081/c/d parses as the URI below:

(screenshot of the parsed URI fields)
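The screenshot's content can be approximated with java.net.URI directly: only the scheme and authority of the first path matter to FileSystem.get, and those parse correctly even with the comma-joined tail left over.

```java
import java.net.URI;

public class CommaPathDemo {
    public static void main(String[] args) throws Exception {
        URI uri = new URI("http://localhost:8080/a/b,http://localhost:8081/c/d");
        System.out.println(uri.getScheme());    // http
        System.out.println(uri.getAuthority()); // localhost:8080
        // The second path ends up inside the path component of the first URI,
        // which FileSystem.get(..) ignores when selecting a file system.
        System.out.println(uri.getPath());
    }
}
```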

Member Author

Since that is known to be hacky and ugly, maybe we can split it out into a separate issue (though I hesitate to suggest it)?

Member Author

cc @sarutak - WDYT? Is my understanding correct?

Contributor

Yeah, I'm not sure which parts of the URI we are using here. If it's just the scheme and authority, then I think it's fine to take them from the first path. FWIW, there is a method in Hadoop to parse comma-separated path strings, but it's private [1].

IMHO this problem existed even before this PR, so I'm fine not fixing it here if that's okay with @sarutak.

[1] https://hadoop.apache.org/docs/r2.7.1/api/src-html/org/apache/hadoop/mapred/FileInputFormat.html#line.467

@shivaram (Contributor) commented Sep 6, 2016

Thanks @HyukjinKwon - LGTM but for a minor comment I had inline.

@sarutak (Member) commented Sep 7, 2016

I found we can replace FileSystem.get in SparkContext#hadoopFile and SparkContext#newAPIHadoopFile with FileSystem.getLocal, as SparkContext#hadoopRDD does; once they are replaced, we no longer need to discuss the comma-separated file list case.

@HyukjinKwon You can replace them in this PR, or leave it for later.

@HyukjinKwon (Member, Author)

@sarutak Ah, I will do this here. Thanks!

@HyukjinKwon commented Sep 7, 2016

I re-ran the tests after this commit - https://ci.appveyor.com/project/HyukjinKwon/spark/build/82-SPARK-17339-fix-r

Let's wait and see :) - if there is any problem with this, I will revert this change.

@felixcheung (Member)

Seems to fail to build:

[INFO] Compiling 468 Scala sources and 74 Java sources to C:\projects\spark\core\target\scala-2.11\classes...
[ERROR] C:\projects\spark\core\src\main\scala\org\apache\spark\SparkContext.scala:995: type mismatch;
 found   : org.apache.spark.SparkConf
 required: org.apache.hadoop.conf.Configuration
[ERROR]     FileSystem.getLocal(conf)

@HyukjinKwon (Member, Author)

Yes, I quickly fixed it and re-ran :). Thanks!

@shivaram (Contributor) commented Sep 7, 2016

Looks like it passed! LGTM pending our Jenkins tests.


SparkQA commented Sep 7, 2016

Test build #65026 has finished for PR 14960 at commit 41aaaf1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member, Author)

retest this please


SparkQA commented Sep 7, 2016

Test build #65030 has finished for PR 14960 at commit 41aaaf1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sarutak (Member) commented Sep 7, 2016

LGTM. Merging this into master and branch-2.0. Thanks @HyukjinKwon !
And thanks @shivaram and @felixcheung for the review !

@asfgit closed this in 6b41195 on Sep 7, 2016
@sarutak (Member) commented Sep 7, 2016

I noticed this PR cannot be merged cleanly into branch-2.0.
@HyukjinKwon If you would like this in branch-2.0, please feel free to open another PR.

@HyukjinKwon (Member, Author)

Sure! I will do it tomorrow.

asfgit pushed a commit that referenced this pull request Sep 8, 2016
…n hadoopFile and newHadoopFile APIs

## What changes were proposed in this pull request?

This PR backports #14960

## How was this patch tested?

AppVeyor - https://ci.appveyor.com/project/HyukjinKwon/spark/build/86-backport-SPARK-17339-r

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #15008 from HyukjinKwon/backport-SPARK-17339.
@HyukjinKwon deleted the SPARK-17339 branch on January 2, 2018 03:44