Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-22146] FileNotFoundException while reading ORC files containing special characters #19368

Closed
wants to merge 3 commits into from

Conversation

mgaido91
Copy link
Contributor

What changes were proposed in this pull request?

Reading ORC files containing special characters like '%' fails with a FileNotFoundException.
This PR aims to fix the problem.

How was this patch tested?

Added UT.

@@ -58,7 +58,7 @@ class OrcFileFormat extends FileFormat with DataSourceRegister with Serializable
options: Map[String, String],
files: Seq[FileStatus]): Option[StructType] = {
OrcFileOperator.readSchema(
files.map(_.getPath.toUri.toString),
files.map(_.getPath.toString),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix looks good to me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see, the URI is right but this isn't expecting a URI

@@ -274,4 +274,16 @@ class OrcSourceSuite extends OrcSuite {
)).get.toString
}
}

test("SPARK-22146: read ORC files containing special characters") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @mgaido91 .
Could we generalize a test case to cover all file-based data sources? Or, at least,for Parquet, too?

cc @gatorsmile

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @dongjoon-hyun,
thank you for your review.
My only concern is that while ORC support is in the hive module, all the other datasources are in the sql module. So, where should we put the generalized test?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For DDL cases, we have HiveDDLSuite extends DDLSuite. But, in this case, I guess maybe HiveQuerySuite? I think the location will be settled during further reviews.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can do it in MetastoreDataSourcesSuite

@SparkQA
Copy link

SparkQA commented Sep 27, 2017

Test build #3937 has finished for PR 19368 at commit c31990c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -1345,6 +1344,17 @@ class MetastoreDataSourcesSuite extends QueryTest with SQLTestUtils with TestHiv
}
}

Seq("orc", "parquet").foreach { format =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about the other built-in formats?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which other formats should I include here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parquet, json, orc, csv, text?

@dongjoon-hyun
Copy link
Member

Retest this please.

@gatorsmile
Copy link
Member

test this please

@SparkQA
Copy link

SparkQA commented Sep 28, 2017

Test build #82290 has finished for PR 19368 at commit ee8e88f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

LGTM

@gatorsmile
Copy link
Member

gatorsmile commented Sep 29, 2017

Thanks! Merged to master

@asfgit asfgit closed this in 161ba7e Sep 29, 2017
@mgaido91
Copy link
Contributor Author

Thanks @gatorsmile! Since this is a bug, should I create a PR to backport this fix to branch-2.2 too?

mgaido91 added a commit to mgaido91/spark that referenced this pull request Sep 29, 2017
…g special characters

## What changes were proposed in this pull request?

Reading ORC files containing special characters like '%' fails with a FileNotFoundException.
This PR aims to fix the problem.

## How was this patch tested?

Added UT.

Author: Marco Gaido <marcogaido91@gmail.com>
Author: Marco Gaido <mgaido@hortonworks.com>

Closes apache#19368 from mgaido91/SPARK-22146.
asfgit pushed a commit that referenced this pull request Sep 29, 2017
…g special characters

## What changes were proposed in this pull request?

Reading ORC files containing special characters like '%' fails with a FileNotFoundException.
This PR aims to fix the problem.

## How was this patch tested?

Added UT.

Author: Marco Gaido <marcogaido91@gmail.com>
Author: Marco Gaido <mgaido@hortonworks.com>

Closes #19368 from mgaido91/SPARK-22146.
@gatorsmile
Copy link
Member

@mgaido91 I just pushed it to 2.2 branch, since the risk of this fix is small.

@mgaido91 mgaido91 deleted the SPARK-22146 branch September 29, 2017 16:12
@steveloughran
Copy link
Contributor

Looking @ this, things would be a lot less brittle if there wasn't a round trip Path -> String -> Path. I'm thinking of Windows paths here in particular. Other than tests, which pass in a string (possibly from File.getAbsolutePath); everything appears be downconverting the path and then up again.

Is there any fundamental reason why Path isn't being passed around?

@mgaido91
Copy link
Contributor Author

mgaido91 commented Oct 2, 2017

the main reason why I chose not to refactor to pass a Path around is that the work in the end is performed by a function which accepts String and this function is used also in other cases and I wanted not to make the fix impact too big.

MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018
…g special characters

## What changes were proposed in this pull request?

Reading ORC files containing special characters like '%' fails with a FileNotFoundException.
This PR aims to fix the problem.

## How was this patch tested?

Added UT.

Author: Marco Gaido <marcogaido91@gmail.com>
Author: Marco Gaido <mgaido@hortonworks.com>

Closes apache#19368 from mgaido91/SPARK-22146.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants