-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-22146] FileNotFoundException while reading ORC files containing special characters #19368
Conversation
@@ -58,7 +58,7 @@ class OrcFileFormat extends FileFormat with DataSourceRegister with Serializable | |||
options: Map[String, String], | |||
files: Seq[FileStatus]): Option[StructType] = { | |||
OrcFileOperator.readSchema( | |||
files.map(_.getPath.toUri.toString), | |||
files.map(_.getPath.toString), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fix looks good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see, the URI is right but this isn't expecting a URI
@@ -274,4 +274,16 @@ class OrcSourceSuite extends OrcSuite { | |||
)).get.toString | |||
} | |||
} | |||
|
|||
test("SPARK-22146: read ORC files containing special characters") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, @mgaido91 .
Could we generalize a test case to cover all file-based data sources? Or, at least,for Parquet, too?
cc @gatorsmile
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @dongjoon-hyun,
thank you for your review.
My only concern is that while ORC support is in the hive module, all the other datasources are in the sql module. So, where should we put the generalized test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For DDL cases, we have HiveDDLSuite extends DDLSuite. But, in this case, I guess maybe HiveQuerySuite? I think the location will be settled during further reviews.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can do it in MetastoreDataSourcesSuite
Test build #3937 has finished for PR 19368 at commit
|
@@ -1345,6 +1344,17 @@ class MetastoreDataSourcesSuite extends QueryTest with SQLTestUtils with TestHiv | |||
} | |||
} | |||
|
|||
Seq("orc", "parquet").foreach { format => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about the other built-in formats?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which other formats should I include here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parquet
, json
, orc
, csv
, text
?
Retest this please. |
test this please |
Test build #82290 has finished for PR 19368 at commit
|
LGTM |
Thanks! Merged to master |
Thanks @gatorsmile! Since this is a bug, should I create a PR to backport this fix to branch-2.2 too? |
…g special characters ## What changes were proposed in this pull request? Reading ORC files containing special characters like '%' fails with a FileNotFoundException. This PR aims to fix the problem. ## How was this patch tested? Added UT. Author: Marco Gaido <marcogaido91@gmail.com> Author: Marco Gaido <mgaido@hortonworks.com> Closes apache#19368 from mgaido91/SPARK-22146.
…g special characters ## What changes were proposed in this pull request? Reading ORC files containing special characters like '%' fails with a FileNotFoundException. This PR aims to fix the problem. ## How was this patch tested? Added UT. Author: Marco Gaido <marcogaido91@gmail.com> Author: Marco Gaido <mgaido@hortonworks.com> Closes #19368 from mgaido91/SPARK-22146.
@mgaido91 I just pushed it to 2.2 branch, since the risk of this fix is small. |
Looking @ this, things would be a lot less brittle if there wasn't a round trip Path -> String -> Path. I'm thinking of Windows paths here in particular. Other than tests, which pass in a string (possibly from File.getAbsolutePath); everything appears be downconverting the path and then up again. Is there any fundamental reason why Path isn't being passed around? |
the main reason why I chose not to refactor to pass a |
…g special characters ## What changes were proposed in this pull request? Reading ORC files containing special characters like '%' fails with a FileNotFoundException. This PR aims to fix the problem. ## How was this patch tested? Added UT. Author: Marco Gaido <marcogaido91@gmail.com> Author: Marco Gaido <mgaido@hortonworks.com> Closes apache#19368 from mgaido91/SPARK-22146.
What changes were proposed in this pull request?
Reading ORC files containing special characters like '%' fails with a FileNotFoundException.
This PR aims to fix the problem.
How was this patch tested?
Added UT.