[SPARK-22279][SQL] Enable convertMetastoreOrc by default #21186

Conversation
Test build #89941 has finished for PR 21186 at commit
@gatorsmile and @cloud-fan.
Changed the title: “convertMetastoreOrc and add convertMetastore.TableProperty conf” → “convertMetastoreOrc and add convertMetastoreTableProperty conf”
Hi, @gatorsmile.
Retest this please.
Test build #89986 has finished for PR 21186 at commit
The failures are irrelevant to this PR.
Retest this please.
Test build #89998 has finished for PR 21186 at commit
Yeah, I also thought that before, @cloud-fan. But please consider existing customer environments like the ones in the unit test cases, where some Parquet tables have table properties set. Since this is a behavior change, we need to document it and should provide an option for it. We can remove this in Apache Spark 3.0.
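For context, a minimal sketch of the scenario under discussion (Scala, assuming a Hive-enabled `spark` session such as spark-shell; the table name `parquet_none` is illustrative):

```scala
// Spark 2.3 behavior: with spark.sql.hive.convertMetastoreParquet=true
// (the default), conversion to the native Parquet source ignored
// table-level properties, so this insert produced Snappy-compressed
// files even though the property asks for no compression.
spark.sql("""
  CREATE TABLE parquet_none(id INT) STORED AS PARQUET
  TBLPROPERTIES ('parquet.compression'='NONE')
""")
spark.sql("INSERT INTO parquet_none VALUES (1)")
// Spark 2.4 (with this change) honors the property during conversion
// and writes uncompressed Parquet files instead.
```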
Test build #90149 has finished for PR 21186 at commit
docs/sql-programming-guide.md
Outdated
@@ -1812,6 +1812,9 @@ working with timestamps in `pandas_udf`s to get the best performance, see
- Since Spark 2.4, creating a managed table with a nonempty location is not allowed. An exception is thrown when attempting to create a managed table with a nonempty location. Setting `spark.sql.allowCreatingManagedTableUsingNonemptyLocation` to `true` restores the previous behavior. This option will be removed in Spark 3.0.
- Since Spark 2.4, the type coercion rules can automatically promote the argument types of the variadic SQL functions (e.g., IN/COALESCE) to the widest common type, regardless of the order of the input arguments. In prior Spark versions, the promotion could fail in some specific orders (e.g., TimestampType, IntegerType and StringType) and throw an exception.
- In version 2.3 and earlier, `to_utc_timestamp` and `from_utc_timestamp` respect the timezone in the input timestamp string, which breaks the assumption that the input timestamp is in a specific timezone. Therefore, these two functions can return unexpected results. In version 2.4 and later, this problem has been fixed. `to_utc_timestamp` and `from_utc_timestamp` will return null if the input timestamp string contains a timezone. As an example, `from_utc_timestamp('2000-10-10 00:00:00', 'GMT+1')` will return `2000-10-10 01:00:00` in both Spark 2.3 and 2.4. However, `from_utc_timestamp('2000-10-10 00:00:00+00:00', 'GMT+1')`, assuming a local timezone of GMT+8, will return `2000-10-10 09:00:00` in Spark 2.3 but `null` in 2.4. If you do not care about this problem and want to retain the previous behavior to keep your queries unchanged, set `spark.sql.function.rejectTimezoneInString` to `false`. This option will be removed in Spark 3.0 and should only be used as a temporary workaround.
- Since Spark 2.4, Spark uses its own ORC support by default instead of Hive SerDe for better performance during Hive metastore table access. Setting `spark.sql.hive.convertMetastoreOrc` to `false` restores the previous behavior.
- Since Spark 2.4, Spark supports table properties while converting Parquet/ORC Hive tables. Setting `spark.sql.hive.convertMetastoreTableProperty` to `false` restores the previous behavior.
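To make the opt-outs concrete, a hedged sketch of disabling both new behaviors from a Spark session (Scala; note that `spark.sql.hive.convertMetastoreTableProperty` is the conf proposed in this revision of the PR):

```scala
// Pre-2.4 behavior for ORC Hive tables: let Hive SerDe handle them
// instead of Spark's native ORC data source.
spark.sql("SET spark.sql.hive.convertMetastoreOrc=false")

// Pre-2.4 behavior for table properties: ignore Parquet/ORC table
// properties during conversion (conf proposed in this revision of the PR).
spark.sql("SET spark.sql.hive.convertMetastoreTableProperty=false")
```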
Please polish the migration guide w.r.t. https://issues.apache.org/jira/browse/SPARK-24175
Sure!
docs/sql-programming-guide.md
Outdated
@@ -1812,6 +1812,9 @@ working with timestamps in `pandas_udf`s to get the best performance, see
- Since Spark 2.4, creating a managed table with a nonempty location is not allowed. An exception is thrown when attempting to create a managed table with a nonempty location. Setting `spark.sql.allowCreatingManagedTableUsingNonemptyLocation` to `true` restores the previous behavior. This option will be removed in Spark 3.0.
- Since Spark 2.4, the type coercion rules can automatically promote the argument types of the variadic SQL functions (e.g., IN/COALESCE) to the widest common type, regardless of the order of the input arguments. In prior Spark versions, the promotion could fail in some specific orders (e.g., TimestampType, IntegerType and StringType) and throw an exception.
- In version 2.3 and earlier, `to_utc_timestamp` and `from_utc_timestamp` respect the timezone in the input timestamp string, which breaks the assumption that the input timestamp is in a specific timezone. Therefore, these two functions can return unexpected results. In version 2.4 and later, this problem has been fixed. `to_utc_timestamp` and `from_utc_timestamp` will return null if the input timestamp string contains a timezone. As an example, `from_utc_timestamp('2000-10-10 00:00:00', 'GMT+1')` will return `2000-10-10 01:00:00` in both Spark 2.3 and 2.4. However, `from_utc_timestamp('2000-10-10 00:00:00+00:00', 'GMT+1')`, assuming a local timezone of GMT+8, will return `2000-10-10 09:00:00` in Spark 2.3 but `null` in 2.4. If you do not care about this problem and want to retain the previous behavior to keep your queries unchanged, set `spark.sql.function.rejectTimezoneInString` to `false`. This option will be removed in Spark 3.0 and should only be used as a temporary workaround.
- In version 2.3 and earlier, Spark converts Parquet Hive tables by default but ignores table properties like `TBLPROPERTIES (parquet.compression 'NONE')`. This also happens for ORC Hive table properties like `TBLPROPERTIES (orc.compress 'NONE')` when `spark.sql.hive.convertMetastoreOrc=true`. Since Spark 2.4, Spark respects Parquet/ORC-specific table properties while converting Parquet/ORC Hive tables. As an example, `CREATE TABLE t(id int) STORED AS PARQUET TBLPROPERTIES (parquet.compression 'NONE')` would generate Snappy-compressed Parquet files during insertion in Spark 2.3, while in Spark 2.4 the result would be uncompressed Parquet files. Setting `spark.sql.hive.convertMetastoreTableProperty` to `false` restores the previous behavior.
- Since Spark 2.0, Spark converts Parquet Hive tables by default for better performance. Since Spark 2.4, Spark converts ORC Hive tables by default, too. This means Spark uses its own ORC support by default instead of Hive SerDe. As an example, `CREATE TABLE t(id int) STORED AS ORC` would be handled with Hive SerDe in Spark 2.3, while in Spark 2.4 it would be converted into Spark's ORC data source table and ORC vectorization would be applied. Setting `spark.sql.hive.convertMetastoreOrc` to `false` restores the previous behavior.
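A minimal spark-shell-style sketch (Scala) of the ORC change described above; this assumes a Hive-enabled session, and the table name `orc_t` is illustrative:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative local session; Hive support is required for Hive SerDe tables.
val spark = SparkSession.builder()
  .appName("convertMetastoreOrc-demo")
  .master("local[*]")
  .enableHiveSupport()
  .getOrCreate()

// Spark 2.4 default: the Hive ORC table is converted to Spark's native
// ORC data source on read/write, so the vectorized ORC reader can apply.
spark.sql("CREATE TABLE orc_t(id INT) STORED AS ORC")
spark.sql("INSERT INTO orc_t VALUES (1)")
spark.table("orc_t").show()

// Restore the Spark 2.3 behavior: let Hive SerDe handle the table.
spark.sql("SET spark.sql.hive.convertMetastoreOrc=false")
```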
@cloud-fan and @gatorsmile, I updated it according to the SPARK-24175 guideline.
Test build #90213 has finished for PR 21186 at commit
Retest this please.
Test build #90263 has finished for PR 21186 at commit
I'll split this into two PRs to make it easier to review.
Changed the title: “convertMetastoreOrc and add convertMetastoreTableProperty conf” → “convertMetastoreOrc by default”
To reduce the review scope, …
Test build #90332 has finished for PR 21186 at commit
Can you resolve the conflicts?
Sure, it's rebased now.
Test build #90417 has finished for PR 21186 at commit
Thanks, merging to master!
Thank you, @cloud-fan!
We reverted `spark.sql.hive.convertMetastoreOrc` at apache#20536 because we should not ignore the table-specific compression conf. Now, it's resolved via [SPARK-23355](apache@8aa1d7b).

Pass the Jenkins.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes apache#21186 from dongjoon-hyun/SPARK-24112.
What changes were proposed in this pull request?

We reverted `spark.sql.hive.convertMetastoreOrc` at #20536 because we should not ignore the table-specific compression conf. Now, it's resolved via SPARK-23355.

How was this patch tested?

Pass the Jenkins.