
[SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark schema to read ORC table instead of ORC file schema #19470

Closed
wants to merge 4 commits

Conversation

dongjoon-hyun (Member) commented Oct 11, 2017

What changes were proposed in this pull request?

Before Hive 2.0, the ORC file schema has invalid column names like _col1 and _col2. This is a well-known limitation, and there are several Apache Spark issues with spark.sql.hive.convertMetastoreOrc=true. This PR ignores the ORC file schema and uses the Spark schema instead.
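As a rough illustration of the problem (a sketch only: the table name is hypothetical, and a Hive-enabled `spark` session is assumed):

// The metastore says the table has columns (a INT, b STRING), but ORC files written by
// Hive before 2.0 carry field names like _col0/_col1, so resolving `a`/`b` by name
// against the file schema cannot work; only the Spark (metastore) schema has the real names.
spark.sql("CREATE TABLE hive_orc_t (a INT, b STRING) STORED AS ORC")  // hypothetical table
// ... rows written through Hive < 2.0 end up in files whose schema looks like struct<_col0:int,_col1:string> ...
spark.sql("SET spark.sql.hive.convertMetastoreOrc=true")
spark.sql("SELECT a, b FROM hive_orc_t").show()  // should be answered using the Spark schema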

How was this patch tested?

Passes the newly added test case.

SparkQA commented Oct 11, 2017

Test build #82620 has finished for PR 19470 at commit d11ce09.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-18355][SQL] Use Spark schema to read ORC table instead of ORC file schema [SPARK-14387][SPARK-18355][SQL] Use Spark schema to read ORC table instead of ORC file schema Oct 11, 2017
dongjoon-hyun (Member Author)

Hi, @gatorsmile and @cloud-fan .
Could you review this PR?

gatorsmile (Member)

Could you create test cases with different schemas between the files and the Hive metastore?

dongjoon-hyun (Member Author)

Thank you for the review, @gatorsmile .
Sure. I assume you want to check for regressions here. Could you tell me what degree of difference you have in mind?

This PR focuses on the missing-columns scenario after ADD COLUMNS, which is a typical customer use case.
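A minimal sketch of that scenario (names are hypothetical; a Hive-enabled `spark` session is assumed):

// The existing ORC file contains only column `a`; after ADD COLUMNS the metastore schema
// also has `b`, so the read path must follow the Spark schema and return NULL for the
// missing column instead of failing on the old file.
spark.sql("CREATE TABLE orc_t (a INT) STORED AS ORC")
spark.sql("INSERT INTO orc_t VALUES (1)")
spark.sql("ALTER TABLE orc_t ADD COLUMNS (b STRING)")
spark.sql("SELECT a, b FROM orc_t").show()  // expected: (1, null)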

gatorsmile (Member) commented Oct 11, 2017

I remember we previously hit multiple issues due to the schema difference between the actual ORC file schema and the metastore schema. Just make sure that support still exists and that this change does not make the current behavior worse.

dongjoon-hyun (Member Author) commented Oct 11, 2017

Ya, that was my question, too.

  • What kind of difference does Spark support, especially in ORC? Apache Spark only supports HiveFileFormat so far, not the old OrcFileFormat.
  • In addition, there is no schema merging. Currently, one ORC file's schema (usually the biggest ORC file?) is picked more or less arbitrarily and used as the correct file schema. For old ORC files, those file schemas are meaningless, with names like _colX.

For me, the Hive metastore schema is the only valid one in Apache Spark.

dongjoon-hyun (Member Author)

To be clear, I'll file another JIRA about ORC status on mismatched column orders.

dongjoon-hyun (Member Author) commented Oct 12, 2017

Hi, @gatorsmile .

  • For mismatched column orders: SPARK-22267 shows the current status of ORC.
  • For mismatched column types: Parquet does not support this either.

Based on the above, I'll proceed to add more test cases in order to prevent regression.

}

// This test case is added to prevent regression.
test("SPARK-22267 Spark SQL incorrectly reads ORC files when column order is different") {
Member Author

This is added to prevent a regression, as you requested, @gatorsmile ~

Contributor

It's weird to have a test verifying a bug; I think it's good enough to have a JIRA tracking this bug.

SparkQA commented Oct 12, 2017

Test build #82701 has finished for PR 19470 at commit 8ac1acf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -138,8 +138,7 @@ class OrcFileFormat extends FileFormat with DataSourceRegister with Serializable
if (maybePhysicalSchema.isEmpty) {
Contributor

nit

val isEmptyFile = OrcFileOperator.readSchema(Seq(file.filePath), Some(conf)).isEmpty
if (isEmptyFile) {
  ...
} else ...

@@ -138,8 +138,7 @@ class OrcFileFormat extends FileFormat with DataSourceRegister with Serializable
if (maybePhysicalSchema.isEmpty) {
Iterator.empty
} else {
val physicalSchema = maybePhysicalSchema.get
OrcRelation.setRequiredColumns(conf, physicalSchema, requiredSchema)
OrcRelation.setRequiredColumns(conf, dataSchema, requiredSchema)
Contributor

Does it work? It seems that here we lie to the ORC reader about the physical schema.

Contributor

Oh, I see: we only need to pass the required column indices to the ORC reader.
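In other words, the required column ids can be computed from the Spark/metastore schema alone; a tiny illustration of the idea (not the actual OrcRelation.setRequiredColumns code):

import org.apache.spark.sql.types._

// Full table schema from the metastore, and the columns the query actually needs.
val dataSchema = StructType(Seq(
  StructField("a", IntegerType),
  StructField("b", StringType),
  StructField("c", LongType)))
val requiredSchema = StructType(Seq(
  StructField("c", LongType),
  StructField("a", IntegerType)))

// Positional ids, independent of the (possibly `_colX`) field names inside the ORC files.
val requiredIds = requiredSchema.fieldNames.map(dataSchema.fieldIndex)  // Array(2, 0)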

case (field, ordinal) =>
var ref = oi.getStructFieldRef(field.name)
if (ref == null) {
val maybeIndex = dataSchema.getFieldIndex(field.name)
Contributor

The requiredSchema is guaranteed to be contained in the dataSchema.


iterator.map { value =>
val raw = deserializer.deserialize(value)
var i = 0
val length = fieldRefs.length
while (i < length) {
val fieldValue = oi.getStructFieldData(raw, fieldRefs(i))
val fieldRef = fieldRefs(i)
val fieldValue = if (fieldRef == null) null else oi.getStructFieldData(raw, fieldRefs(i))
if (fieldValue == null) {
Contributor

nit:

if (fieldRef == null) {
  row.setNull...
} else {
  val fieldValue = ...
  ...
}

)

checkAnswer(
sql(s"SELECT * FROM $db.t"),
Contributor

Please list all columns here instead of *, to make the test clearer.

Row(null, "12"))

checkAnswer(
sql(s"SELECT * FROM $db.t"),
Contributor

ditto

cloud-fan (Contributor)

LGTM except some minor comments

dongjoon-hyun (Member Author)

@cloud-fan, thank you so much for the review!
I updated the PR except for one point: if fieldValue is null, we would call setNull again in the else branch anyway, so the current form is simpler.

if (fieldRef == null) {
  row.setNull...
} else {
  val fieldValue = ...
  ...
}

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-14387][SPARK-18355][SQL] Use Spark schema to read ORC table instead of ORC file schema [SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark schema to read ORC table instead of ORC file schema Oct 13, 2017
@@ -2050,4 +2050,60 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
}
}
}

test("SPARK-18355 Use Spark schema to read ORC table instead of ORC file schema") {
Member

Improve the test case to also check the other formats?

Contributor

Since it depends on the CONVERT_METASTORE_XXX conf, maybe also test Parquet.

Member Author

Yep. I'll add parquet, too.

cloud-fan (Contributor)

LGTM, pending jenkins

gatorsmile (Member)

LGTM too.

Seq("true", "false").foreach { value =>
withSQLConf(
HiveUtils.CONVERT_METASTORE_ORC.key -> value,
HiveUtils.CONVERT_METASTORE_PARQUET.key -> value) {
viirya (Member) Oct 13, 2017

As you in fact separate ORC and Parquet into two tests, maybe you just need to test against one config at a time, i.e., orc -> HiveUtils.CONVERT_METASTORE_ORC, parquet -> HiveUtils.CONVERT_METASTORE_PARQUET.key.

Member Author

Thank you for the review, @viirya . For that, yes, we could, but it would be a bit of overkill.

viirya (Member) commented Oct 13, 2017

One minor comment, but it doesn't affect this. LGTM.

SparkQA commented Oct 13, 2017

Test build #82718 has finished for PR 19470 at commit ef2123e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 13, 2017

Test build #82720 has finished for PR 19470 at commit 8e7fe9b.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

dongjoon-hyun (Member Author)

Retest this please.

dongjoon-hyun (Member Author)

The R failure seems to be unrelated.

[error] running /home/jenkins/workspace/SparkPullRequestBuilder/R/run-tests.sh ; process was terminated by signal 9
Attempting to post to Github...

SparkQA commented Oct 13, 2017

Test build #82724 has finished for PR 19470 at commit 8e7fe9b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

dongjoon-hyun (Member Author)

Now, it's passed again. :)

@asfgit asfgit closed this in e6e3600 Oct 13, 2017
asfgit pushed a commit that referenced this pull request Oct 13, 2017
… ORC table instead of ORC file schema

Before Hive 2.0, the ORC file schema has invalid column names like `_col1` and `_col2`. This is a well-known limitation, and there are several Apache Spark issues with `spark.sql.hive.convertMetastoreOrc=true`. This PR ignores the ORC file schema and uses the Spark schema instead.

Pass the newly added test case.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #19470 from dongjoon-hyun/SPARK-18355.

(cherry picked from commit e6e3600)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan (Contributor)

Thanks, merging to master/2.2!

dongjoon-hyun (Member Author)

Thank you so much, @cloud-fan , @gatorsmile , and @viirya !

@dongjoon-hyun dongjoon-hyun deleted the SPARK-18355 branch October 13, 2017 15:45
dongjoon-hyun (Member Author)

BTW, @cloud-fan, could you review #18460 too? I think we need your final approval. :)

MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018
… ORC table instead of ORC file schema

Before Hive 2.0, the ORC file schema has invalid column names like `_col1` and `_col2`. This is a well-known limitation, and there are several Apache Spark issues with `spark.sql.hive.convertMetastoreOrc=true`. This PR ignores the ORC file schema and uses the Spark schema instead.

Pass the newly added test case.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes apache#19470 from dongjoon-hyun/SPARK-18355.

(cherry picked from commit e6e3600)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>