[SPARK-10400] [SQL] Renames SQLConf.PARQUET_FOLLOW_PARQUET_FORMAT_SPEC #8566
Conversation
Opened #8568 for the same purpose but against branch-1.5.

This PR was originally part of the closed PR #7679, which aimed to refactor the Parquet write path for better interoperability. I put too many things into that one and decided to split it into several smaller ones to ease code review.
Test build #41917 has finished for PR 8566 at commit
Force-pushed from b3f7877 to 85bbfde
Test build #42413 has finished for PR 8566 at commit
retest this please
Test build #43161 has finished for PR 8566 at commit
LGTM. Merging to master.
## What changes were proposed in this pull request?

Some improvements:

1. Point out that we are using both Spark SQL native syntax and HQL syntax in the example.
2. Avoid using the same table name as the temp view, to not confuse users.
3. Create the external Hive table with a directory that already has data, which is a more common use case.
4. Remove the usage of `spark.sql.parquet.writeLegacyFormat`. This config was introduced by #8566 and has nothing to do with Hive.
5. Remove the `repartition` and `coalesce` examples. These two are not Hive specific; we should put them in a different example file. Besides, they can't accurately control the number of output files, since `spark.sql.files.maxRecordsPerFile` also controls it.

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #20081 from cloud-fan/minor.
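To illustrate point 5, here is a minimal sketch (paths and names are hypothetical, and a running SparkSession is assumed) of why `repartition` alone cannot accurately control the number of output files once `spark.sql.files.maxRecordsPerFile` is set:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("file-count-demo")
  .getOrCreate()

// Ask for 2 partitions via repartition...
val df = spark.range(1000).repartition(2)

// ...but cap each output file at 100 records. Each of the 2 partitions
// is then split into multiple files, so the final file count is driven
// by both settings, not by repartition alone.
spark.conf.set("spark.sql.files.maxRecordsPerFile", "100")
df.write.parquet("/tmp/many-files")
```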
We introduced the SQL option `spark.sql.parquet.followParquetFormatSpec` while working on implementing Parquet backwards-compatibility rules in SPARK-6777. It indicates whether we should write Parquet files using the legacy format adopted by Spark 1.4 and prior versions, or the standard format defined in the parquet-format spec. This option defaults to `false` and is marked as a non-public option (`isPublic = false`) because we haven't finished refactoring the Parquet write path.

The problem is that the name of this option is somewhat confusing, because it's not intuitive why we shouldn't follow the spec. It would be nice to rename it to `spark.sql.parquet.writeLegacyFormat` and invert its default value (the two option names have opposite meanings). Although this option is private in 1.5, we'll make it public in 1.6 after refactoring the Parquet write path, so that users can decide whether to write Parquet files in the standard format or the legacy format.
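As a sketch of how a user would toggle the renamed option (the session setup and output paths are illustrative, and a Spark runtime is assumed):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-legacy-format-demo")
  .getOrCreate()

// Default after this PR: write in the standard parquet-format layout.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "false")
spark.range(10).write.parquet("/tmp/standard-format")

// Opt back into the legacy Spark 1.4-era layout, e.g. for older
// readers that expect it.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
spark.range(10).write.parquet("/tmp/legacy-format")
```

Note that the new flag reads naturally: `writeLegacyFormat = true` means "use the old layout", whereas the old flag required users to reason about why `followParquetFormatSpec` would ever be `false`.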