
Conversation

@haiboself
Contributor

What changes were proposed in this pull request?

Hive 2.1.1 cannot read ORC tables created by Spark 2.4.0 by default, so I added this information to sql-migration-guide-upgrade.md. For details, see SPARK-26932.

How was this patch tested?

doc build

@AmplabJenkins

Can one of the admins verify this patch?


- Since Spark 2.0, Spark converts Parquet Hive tables by default for better performance. Since Spark 2.4, Spark converts ORC Hive tables by default, too. It means Spark uses its own ORC support by default instead of Hive SerDe. As an example, `CREATE TABLE t(id int) STORED AS ORC` would be handled with Hive SerDe in Spark 2.3, and in Spark 2.4, it would be converted into Spark's ORC data source table and ORC vectorization would be applied. Setting `spark.sql.hive.convertMetastoreOrc` to `false` restores the previous behavior.

- Since Spark 2.4, Spark uses native ORC by default, so Hive 2.1.1 cannot read ORC tables created by Spark 2.4. Refer to [HIVE-16683](https://issues.apache.org/jira/browse/HIVE-16683) for details. Setting `spark.sql.hive.convertMetastoreOrc` to `false` and `spark.sql.orc.impl` to `hive` restores the previous behavior.
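
The workaround described in the bullet above could be applied per session, for example, before creating a table that Hive 2.1.1 must read. This is a sketch under the text's assumptions (a Spark 2.4 SQL session; the table name `t` is illustrative, taken from the earlier example):

```sql
-- Sketch: restore pre-2.4 ORC behavior so Hive 2.1.1 can read the table.
SET spark.sql.hive.convertMetastoreOrc=false; -- handle metastore ORC tables with Hive SerDe
SET spark.sql.orc.impl=hive;                  -- write ORC with the Hive ORC implementation
CREATE TABLE t(id INT) STORED AS ORC;
```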
Member


Hi, @haiboself .

Line 172 is the better place to append this kind of notice.

@dongjoon-hyun (Member) commented Mar 3, 2019


BTW, "Refer to [HIVE-16683](https://issues.apache.org/jira/browse/HIVE-16683) for details." looks insufficient because it's too indirect.

If you want to say Hive 2.3.0+ is generating ORC tables which Hive 2.1.1 cannot read, please write directly here.

Contributor Author


Thanks for your proposal.

Member

@dongjoon-hyun left a comment


Also, `spark.sql.orc.impl=hive` is enough. Let's remove `spark.sql.hive.convertMetastoreOrc` since it is irrelevant to what you aim to describe here.

@haiboself
Contributor Author

haiboself commented Mar 4, 2019

@dongjoon-hyun Thanks for your proposal. I would like to change it to the following content; is it right?

- Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader for ORC files by default. To do that, `spark.sql.orc.impl` and `spark.sql.orc.filterPushdown` change their default values to `native` and `true` respectively. ORC tables created by the Spark 2.4 native ORC writer cannot be read by Hive 2.1.1. Setting `spark.sql.orc.impl` to `hive` restores the previous behavior.

@dongjoon-hyun changed the title from "[SPARK-26932][DOC]Hive 2.1.1 cannot read ORC table created by Spark 2.4.0 in default" to "[SPARK-26932][DOC] Add a warning for Hive 2.1.1 ORC reader issue" on Mar 5, 2019
- Since Spark 2.4, Spark will display table description column Last Access value as UNKNOWN when the value was Jan 01 1970.

- Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader for ORC files by default. To do that, `spark.sql.orc.impl` and `spark.sql.orc.filterPushdown` change their default values to `native` and `true` respectively.
- Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader for ORC files by default. To do that, `spark.sql.orc.impl` and `spark.sql.orc.filterPushdown` change their default values to `native` and `true` respectively. ORC tables created by the Spark 2.4 native ORC writer cannot be read by Hive 2.1.1. Setting `spark.sql.orc.impl` to `hive` restores the previous behavior.
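
For a cluster-wide equivalent of the per-session setting in the bullet above, the option could also be sketched in `spark-defaults.conf` (this file location and layout are standard Spark configuration, but applying it globally is an assumption, not something the PR recommends):

```
# Sketch: keep writing ORC files that Hive 2.1.1 can read (pre-2.4 behavior)
spark.sql.orc.impl    hive
```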
Member


Thanks. Almost good. I'll revise a little when I merge this.

Contributor Author


ok

dongjoon-hyun pushed a commit that referenced this pull request Mar 5, 2019
Hive 2.1.1 cannot read ORC tables created by Spark 2.4.0 by default, so this adds that information to sql-migration-guide-upgrade.md. For details, see [SPARK-26932](https://issues.apache.org/jira/browse/SPARK-26932)

doc build

Closes #23944 from haiboself/SPARK-26932.

Authored-by: Bo Hai <haibo-self@163.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit c27caea)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@dongjoon-hyun
Member

dongjoon-hyun commented Mar 5, 2019

Thank you, @haiboself . Merged to master/branch-2.4.

Oops. My bad. This did not pass Jenkins, although it is purely a doc change. I'll monitor the Jenkins build status.

@haiboself haiboself deleted the SPARK-26932 branch March 6, 2019 01:29
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 23, 2019
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 25, 2019
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Aug 1, 2019