
Conversation

@haiboself
Contributor

What changes were proposed in this pull request?

Hive 2.1.1 cannot read ORC tables created by Spark 2.4.0 by default, so I added this information to sql-migration-guide-upgrade.md. For details, see SPARK-26932.

How was this patch tested?

doc build

@AmplabJenkins

Can one of the admins verify this patch?


- Since Spark 2.0, Spark converts Parquet Hive tables by default for better performance. Since Spark 2.4, Spark converts ORC Hive tables by default, too. It means Spark uses its own ORC support by default instead of Hive SerDe. As an example, `CREATE TABLE t(id int) STORED AS ORC` would be handled with Hive SerDe in Spark 2.3, and in Spark 2.4, it would be converted into Spark's ORC data source table and ORC vectorization would be applied. Setting `spark.sql.hive.convertMetastoreOrc` to `false` restores the previous behavior.

- Since Spark 2.4, Spark uses native ORC by default, so Hive 2.1.1 cannot read ORC tables created by Spark 2.4. Refer to [HIVE-16683](https://issues.apache.org/jira/browse/HIVE-16683) for details. Setting `spark.sql.hive.convertMetastoreOrc` to `false` and `spark.sql.orc.impl` to `hive` restores the previous behavior.
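
The workaround described in the bullet above could be applied per session, for example, before creating a table that Hive 2.1.1 must read. This is a sketch under the text's assumptions (a Spark 2.4 SQL session; the table name `t` is illustrative, taken from the earlier example):

```sql
-- Sketch: restore pre-2.4 ORC behavior so Hive 2.1.1 can read the table.
SET spark.sql.hive.convertMetastoreOrc=false; -- handle metastore ORC tables with Hive SerDe
SET spark.sql.orc.impl=hive;                  -- write ORC with the Hive ORC implementation
CREATE TABLE t(id INT) STORED AS ORC;
```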
Member


Hi, @haiboself .

Line 172 is the better place to append this kind of notice.

@dongjoon-hyun (Member) commented Mar 3, 2019


BTW, "Refer to [HIVE-16683](https://issues.apache.org/jira/browse/HIVE-16683) for details." looks insufficient because it's too indirect.

If you want to say Hive 2.3.0+ is generating ORC tables which Hive 2.1.1 cannot read, please write directly here.

Contributor Author


Thanks for your proposal.

Member

@dongjoon-hyun left a comment


Also, `spark.sql.orc.impl=hive` is enough. Let's remove `spark.sql.hive.convertMetastoreOrc` since it is irrelevant to what you aim to describe here.

@haiboself
Contributor Author

haiboself commented Mar 4, 2019

@dongjoon-hyun Thanks for your proposal. I would like to change it to the following content; is it right?

- Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader for ORC files by default. To do that, `spark.sql.orc.impl` and `spark.sql.orc.filterPushdown` change their default values to `native` and `true` respectively. ORC tables created by the Spark 2.4 native ORC writer cannot be read by Hive 2.1.1. Setting `spark.sql.orc.impl` to `hive` restores the previous behavior.

@dongjoon-hyun changed the title from "[SPARK-26932][DOC]Hive 2.1.1 cannot read ORC table created by Spark 2.4.0 in default" to "[SPARK-26932][DOC] Add a warning for Hive 2.1.1 ORC reader issue" on Mar 5, 2019
- Since Spark 2.4, Spark will display table description column Last Access value as UNKNOWN when the value was Jan 01 1970.

- Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader for ORC files by default. To do that, `spark.sql.orc.impl` and `spark.sql.orc.filterPushdown` change their default values to `native` and `true` respectively.
- Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader for ORC files by default. To do that, `spark.sql.orc.impl` and `spark.sql.orc.filterPushdown` change their default values to `native` and `true` respectively. ORC tables created by the Spark 2.4 native ORC writer cannot be read by Hive 2.1.1. Setting `spark.sql.orc.impl` to `hive` restores the previous behavior.
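
For a cluster-wide equivalent of the per-session setting in the bullet above, the option could also be sketched in `spark-defaults.conf` (this file location and layout are standard Spark configuration, but applying it globally is an assumption, not something the PR recommends):

```
# Sketch: keep writing ORC files that Hive 2.1.1 can read (pre-2.4 behavior)
spark.sql.orc.impl    hive
```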
Member


Thanks. Almost good. I'll revise a little when I merge this.

Contributor Author


ok

dongjoon-hyun pushed a commit that referenced this pull request Mar 5, 2019
Hive 2.1.1 cannot read ORC tables created by Spark 2.4.0 by default, so this adds that information to sql-migration-guide-upgrade.md. For details, see [SPARK-26932](https://issues.apache.org/jira/browse/SPARK-26932)

doc build

Closes #23944 from haiboself/SPARK-26932.

Authored-by: Bo Hai <haibo-self@163.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit c27caea)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@dongjoon-hyun
Member

dongjoon-hyun commented Mar 5, 2019

Thank you, @haiboself . Merged to master/branch-2.4.

Oops. My bad. This did not pass Jenkins, although it is purely a doc change. I'll monitor the Jenkins build status.

@haiboself haiboself deleted the SPARK-26932 branch March 6, 2019 01:29
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 23, 2019
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 25, 2019
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Aug 1, 2019