[SPARK-22279][SQL] Enable convertMetastoreOrc by default #21186

Conversation
Test build #89941 has finished for PR 21186 at commit
@gatorsmile and @cloud-fan.
Changed the title: “convertMetastoreOrc and add convertMetastore.TableProperty conf” → “convertMetastoreOrc and add convertMetastoreTableProperty conf”
Hi, @gatorsmile.
Retest this please.
Test build #89986 has finished for PR 21186 at commit
The failures are irrelevant to this PR.
Retest this please.
Test build #89998 has finished for PR 21186 at commit
Yeah, I also thought that before, @cloud-fan. But please consider existing customer environments like the ones in the unit test cases, where some Parquet tables have table properties set. Since this is a behavior change, we need to document it and should provide an option for it. We can remove this in Apache Spark 3.0.
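For context, a minimal sketch of the scenario under discussion (Scala, assuming a Hive-enabled `spark` session such as spark-shell; the table name `parquet_none` is illustrative):

```scala
// Spark 2.3 behavior: with spark.sql.hive.convertMetastoreParquet=true
// (the default), conversion to the native Parquet source ignored
// table-level properties, so this insert produced Snappy-compressed
// files even though the property asks for no compression.
spark.sql("""
  CREATE TABLE parquet_none(id INT) STORED AS PARQUET
  TBLPROPERTIES ('parquet.compression'='NONE')
""")
spark.sql("INSERT INTO parquet_none VALUES (1)")
// Spark 2.4 (with this change) honors the property during conversion
// and writes uncompressed Parquet files instead.
```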
Test build #90149 has finished for PR 21186 at commit
docs/sql-programming-guide.md
Outdated
@@ -1812,6 +1812,9 @@ working with timestamps in `pandas_udf`s to get the best performance, see
- Since Spark 2.4, creating a managed table with a nonempty location is not allowed. An exception is thrown when attempting to create a managed table with a nonempty location. Setting `spark.sql.allowCreatingManagedTableUsingNonemptyLocation` to `true` restores the previous behavior. This option will be removed in Spark 3.0.
- Since Spark 2.4, the type coercion rules can automatically promote the argument types of the variadic SQL functions (e.g., IN/COALESCE) to the widest common type, regardless of the order of the input arguments. In prior Spark versions, the promotion could fail in some specific orders (e.g., TimestampType, IntegerType and StringType) and throw an exception.
- In version 2.3 and earlier, `to_utc_timestamp` and `from_utc_timestamp` respect the timezone in the input timestamp string, which breaks the assumption that the input timestamp is in a specific timezone. Therefore, these two functions can return unexpected results. In version 2.4 and later, this problem has been fixed. `to_utc_timestamp` and `from_utc_timestamp` will return null if the input timestamp string contains a timezone. As an example, `from_utc_timestamp('2000-10-10 00:00:00', 'GMT+1')` will return `2000-10-10 01:00:00` in both Spark 2.3 and 2.4. However, `from_utc_timestamp('2000-10-10 00:00:00+00:00', 'GMT+1')`, assuming a local timezone of GMT+8, will return `2000-10-10 09:00:00` in Spark 2.3 but `null` in 2.4. If you do not care about this problem and want to retain the previous behavior to keep your queries unchanged, set `spark.sql.function.rejectTimezoneInString` to `false`. This option will be removed in Spark 3.0 and should only be used as a temporary workaround.
- Since Spark 2.4, Spark uses its own ORC support by default instead of Hive SerDe for better performance during Hive metastore table access. Setting `spark.sql.hive.convertMetastoreOrc` to `false` restores the previous behavior.
- Since Spark 2.4, Spark supports table properties while converting Parquet/ORC Hive tables. Setting `spark.sql.hive.convertMetastoreTableProperty` to `false` restores the previous behavior.
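To make the opt-outs concrete, a hedged sketch of disabling both new behaviors from a Spark session (Scala; note that `spark.sql.hive.convertMetastoreTableProperty` is the conf proposed in this revision of the PR):

```scala
// Pre-2.4 behavior for ORC Hive tables: let Hive SerDe handle them
// instead of Spark's native ORC data source.
spark.sql("SET spark.sql.hive.convertMetastoreOrc=false")

// Pre-2.4 behavior for table properties: ignore Parquet/ORC table
// properties during conversion (conf proposed in this revision of the PR).
spark.sql("SET spark.sql.hive.convertMetastoreTableProperty=false")
```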
Please polish the migration guide w.r.t. https://issues.apache.org/jira/browse/SPARK-24175
Sure!
docs/sql-programming-guide.md
Outdated
@@ -1812,6 +1812,9 @@ working with timestamps in `pandas_udf`s to get the best performance, see
- Since Spark 2.4, creating a managed table with a nonempty location is not allowed. An exception is thrown when attempting to create a managed table with a nonempty location. Setting `spark.sql.allowCreatingManagedTableUsingNonemptyLocation` to `true` restores the previous behavior. This option will be removed in Spark 3.0.
- Since Spark 2.4, the type coercion rules can automatically promote the argument types of the variadic SQL functions (e.g., IN/COALESCE) to the widest common type, regardless of the order of the input arguments. In prior Spark versions, the promotion could fail in some specific orders (e.g., TimestampType, IntegerType and StringType) and throw an exception.
- In version 2.3 and earlier, `to_utc_timestamp` and `from_utc_timestamp` respect the timezone in the input timestamp string, which breaks the assumption that the input timestamp is in a specific timezone. Therefore, these two functions can return unexpected results. In version 2.4 and later, this problem has been fixed. `to_utc_timestamp` and `from_utc_timestamp` will return null if the input timestamp string contains a timezone. As an example, `from_utc_timestamp('2000-10-10 00:00:00', 'GMT+1')` will return `2000-10-10 01:00:00` in both Spark 2.3 and 2.4. However, `from_utc_timestamp('2000-10-10 00:00:00+00:00', 'GMT+1')`, assuming a local timezone of GMT+8, will return `2000-10-10 09:00:00` in Spark 2.3 but `null` in 2.4. If you do not care about this problem and want to retain the previous behavior to keep your queries unchanged, set `spark.sql.function.rejectTimezoneInString` to `false`. This option will be removed in Spark 3.0 and should only be used as a temporary workaround.
- In version 2.3 and earlier, Spark converts Parquet Hive tables by default but ignores table properties like `TBLPROPERTIES (parquet.compression 'NONE')`. This also happens for ORC Hive table properties like `TBLPROPERTIES (orc.compress 'NONE')` when `spark.sql.hive.convertMetastoreOrc=true`. Since Spark 2.4, Spark respects Parquet/ORC-specific table properties while converting Parquet/ORC Hive tables. As an example, `CREATE TABLE t(id int) STORED AS PARQUET TBLPROPERTIES (parquet.compression 'NONE')` would generate Snappy-compressed Parquet files during insertion in Spark 2.3, while in Spark 2.4 the result would be uncompressed Parquet files. Setting `spark.sql.hive.convertMetastoreTableProperty` to `false` restores the previous behavior.
- Since Spark 2.0, Spark converts Parquet Hive tables by default for better performance. Since Spark 2.4, Spark converts ORC Hive tables by default, too. This means Spark uses its own ORC support by default instead of Hive SerDe. As an example, `CREATE TABLE t(id int) STORED AS ORC` would be handled with Hive SerDe in Spark 2.3, while in Spark 2.4 it would be converted into Spark's ORC data source table and ORC vectorization would be applied. Setting `spark.sql.hive.convertMetastoreOrc` to `false` restores the previous behavior.
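A minimal spark-shell-style sketch (Scala) of the ORC change described above; this assumes a Hive-enabled session, and the table name `orc_t` is illustrative:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative local session; Hive support is required for Hive SerDe tables.
val spark = SparkSession.builder()
  .appName("convertMetastoreOrc-demo")
  .master("local[*]")
  .enableHiveSupport()
  .getOrCreate()

// Spark 2.4 default: the Hive ORC table is converted to Spark's native
// ORC data source on read/write, so the vectorized ORC reader can apply.
spark.sql("CREATE TABLE orc_t(id INT) STORED AS ORC")
spark.sql("INSERT INTO orc_t VALUES (1)")
spark.table("orc_t").show()

// Restore the Spark 2.3 behavior: let Hive SerDe handle the table.
spark.sql("SET spark.sql.hive.convertMetastoreOrc=false")
```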
@cloud-fan and @gatorsmile, I updated it according to the SPARK-24175 guideline.
Test build #90213 has finished for PR 21186 at commit
Retest this please.
Test build #90263 has finished for PR 21186 at commit
I'll split this into two PRs to make it easier to review.
Changed the title: “convertMetastoreOrc and add convertMetastoreTableProperty conf” → “convertMetastoreOrc by default”
To reduce the review scope, …
Test build #90332 has finished for PR 21186 at commit
Can you resolve the conflicts?
Sure, it's rebased now.
Test build #90417 has finished for PR 21186 at commit
Thanks, merging to master!
Thank you, @cloud-fan!
We reverted `spark.sql.hive.convertMetastoreOrc` at apache#20536 because we should not ignore the table-specific compression conf. Now, it's resolved via [SPARK-23355](apache@8aa1d7b).

Pass the Jenkins.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes apache#21186 from dongjoon-hyun/SPARK-24112.
What changes were proposed in this pull request?

We reverted `spark.sql.hive.convertMetastoreOrc` at #20536 because we should not ignore the table-specific compression conf. Now, it's resolved via SPARK-23355.

How was this patch tested?

Pass the Jenkins.