[SPARK-26932][DOC] Add a warning for Hive 2.1.1 ORC reader issue #23944
Conversation
Can one of the admins verify this patch?
docs/sql-migration-guide-upgrade.md (outdated diff):

  - Since Spark 2.0, Spark converts Parquet Hive tables by default for better performance. Since Spark 2.4, Spark converts ORC Hive tables by default, too. It means Spark uses its own ORC support by default instead of Hive SerDe. As an example, `CREATE TABLE t(id int) STORED AS ORC` would be handled with Hive SerDe in Spark 2.3, and in Spark 2.4, it would be converted into Spark's ORC data source table and ORC vectorization would be applied. To set `false` to `spark.sql.hive.convertMetastoreOrc` restores the previous behavior.

+ - Since Spark 2.4, Spark uses native ORC in default, Which cause Hive 2.1.1 cannot read ORC table created by Spark 2.4. Refer to [HIVE-16683](https://issues.apache.org/jira/browse/HIVE-16683) for details. To set `false` to `spark.sql.hive.convertMetastoreOrc` and set `hive` to `spark.sql.orc.impl` restores the previous behavior.
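For readers following along, here is a minimal Scala sketch of the two read paths the context line above describes. The config key `spark.sql.hive.convertMetastoreOrc` and the table definition `t` come from the quoted doc text; the app name and session setup are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: how `spark.sql.hive.convertMetastoreOrc` switches the read path
// for a Hive ORC table. Table `t` is from the doc text; everything else
// here (app name, query) is illustrative.
object ConvertMetastoreOrcSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("convert-metastore-orc-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // With the Spark 2.4 default (true), this Hive ORC table is converted
    // into Spark's ORC data source table, so vectorization can apply.
    spark.sql("CREATE TABLE t(id int) STORED AS ORC")

    // Setting the flag to false restores the Spark 2.3 behavior:
    // the table is handled with Hive SerDe instead.
    spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
    spark.sql("SELECT * FROM t").show()

    spark.stop()
  }
}
```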
Hi, @haiboself.
Line 172 is the better place to append this kind of notice.
BTW, "Refer to [HIVE-16683](https://issues.apache.org/jira/browse/HIVE-16683) for details." looks insufficient because it's too indirect.
If you want to say that Hive 2.3.0+ is generating ORC tables which Hive 2.1.1 cannot read, please write that directly here.
Thanks for your proposal.
dongjoon-hyun left a comment:
Also, `spark.sql.orc.impl=hive` is enough. Let's remove `spark.sql.hive.convertMetastoreOrc` since it is irrelevant to what you aim to describe here.
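To illustrate the reviewer's suggestion, a hedged sketch that sets only `spark.sql.orc.impl`; the app name and table name are made-up examples, and whether Hive 2.1.1 can then read the result is the premise of this PR, not something this snippet verifies.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: write an ORC table with the Hive ORC implementation
// (`spark.sql.orc.impl=hive`) instead of the Spark 2.4 default (`native`).
// Only the single config from the review comment is set; the app name
// and table name are illustrative.
object OrcImplHiveSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("orc-impl-hive-sketch")
      .config("spark.sql.orc.impl", "hive")
      .enableHiveSupport()
      .getOrCreate()

    // Written through the older Hive ORC writer path rather than
    // Spark's native ORC writer.
    spark.range(10).write.format("orc").saveAsTable("orc_tbl")

    spark.stop()
  }
}
```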
@dongjoon-hyun Thanks for your proposal. I would like to change it to the following content; is that right?
docs/sql-migration-guide-upgrade.md:

  - Since Spark 2.4, Spark will display table description column Last Access value as UNKNOWN when the value was Jan 01 1970.

- - Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader for ORC files by default. To do that, `spark.sql.orc.impl` and `spark.sql.orc.filterPushdown` change their default values to `native` and `true` respectively.
+ - Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader for ORC files by default. To do that, `spark.sql.orc.impl` and `spark.sql.orc.filterPushdown` change their default values to `native` and `true` respectively. ORC tables created by Spark 2.4 native ORC writer cannot be read by Hive 2.1.1. Use `spark.sql.orc.impl=hive` will restores the previous behavior.
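As a sanity check on the defaults the note quotes, a small hypothetical snippet; the assertions reflect the documented 2.4 defaults, and the session setup is boilerplate.

```scala
import org.apache.spark.sql.SparkSession

// Sketch of the Spark 2.4 defaults quoted in the migration note, plus
// the fallback from the added sentence. Config keys come from the diff.
object OrcDefaultsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("orc-defaults-sketch")
      .getOrCreate()

    // Defaults changed in Spark 2.4, per the note above.
    assert(spark.conf.get("spark.sql.orc.impl") == "native")
    assert(spark.conf.get("spark.sql.orc.filterPushdown") == "true")

    // Fallback so Hive 2.1.1 can read ORC tables Spark writes.
    spark.conf.set("spark.sql.orc.impl", "hive")

    spark.stop()
  }
}
```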
Thanks. Almost good. I'll revise a little when I merge this.
ok
Hive 2.1.1 cannot read ORC tables created by Spark 2.4.0 by default, so this adds the information to sql-migration-guide-upgrade.md. For details, see [SPARK-26932](https://issues.apache.org/jira/browse/SPARK-26932).

doc build

Closes #23944 from haiboself/SPARK-26932.

Authored-by: Bo Hai <haibo-self@163.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit c27caea)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Thank you, @haiboself. Merged to master/branch-2.4. Oops, my bad. This did not pass Jenkins even though it is purely a doc change. I'll monitor the Jenkins build status.
What changes were proposed in this pull request?
Hive 2.1.1 cannot read ORC tables created by Spark 2.4.0 by default, so I added the information to sql-migration-guide-upgrade.md. For details, see SPARK-26932.
How was this patch tested?
doc build