
[SPARK-10400] [SQL] Renames SQLConf.PARQUET_FOLLOW_PARQUET_FORMAT_SPEC #8566

Conversation

liancheng
Contributor

We introduced the SQL option spark.sql.parquet.followParquetFormatSpec while implementing the Parquet backwards-compatibility rules in SPARK-6777. It indicates whether we should use the legacy Parquet format adopted by Spark 1.4 and prior versions, or the standard format defined in the parquet-format spec, when writing Parquet files.

This option defaults to false and is marked as a non-public option (isPublic = false) because we haven't finished refactoring the Parquet write path. The problem is that the name of this option is somewhat confusing, because it's not intuitive why we shouldn't follow the spec. It would be nice to rename it to spark.sql.parquet.writeLegacyFormat and invert its default value (the two option names have opposite meanings).

Although this option is private in 1.5, we'll make it public in 1.6 after refactoring the Parquet write path, so that users can decide whether to write Parquet files in the standard format or the legacy format.
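The rename inverts both the name and the default: the new option is the logical negation of the old one, so existing behavior is preserved. A minimal Python sketch of that relationship (the `migrate_conf` helper is hypothetical, for illustration only, and is not part of Spark):

```python
# Old and new option keys, as described in the PR.
OLD_KEY = "spark.sql.parquet.followParquetFormatSpec"
NEW_KEY = "spark.sql.parquet.writeLegacyFormat"


def migrate_conf(conf):
    """Translate the old option into the new, inverted one.

    followParquetFormatSpec=false (the old default) meant "write the
    legacy format", which maps to writeLegacyFormat=true; following the
    spec maps to writeLegacyFormat=false.
    """
    new_conf = {k: v for k, v in conf.items() if k != OLD_KEY}
    follow_spec = conf.get(OLD_KEY, "false") == "true"
    new_conf[NEW_KEY] = "false" if follow_spec else "true"
    return new_conf
```

With no explicit setting, the old default (false, i.e. legacy format) maps to `writeLegacyFormat=true`, matching the "invert its default value" note above.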

@liancheng
Contributor Author

Opened #8568 for the same purpose but against branch-1.5.

@liancheng
Contributor Author

This PR was originally part of the closed PR #7679, which aimed to refactor the Parquet write path for better interoperability. I put too many things into that one and decided to split it into several smaller ones to ease code review.

@SparkQA

SparkQA commented Sep 2, 2015

Test build #41917 has finished for PR 8566 at commit b3f7877.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng liancheng force-pushed the spark-10400/deprecate-follow-parquet-format-spec branch from b3f7877 to 85bbfde Compare September 14, 2015 07:24
@SparkQA

SparkQA commented Sep 14, 2015

Test build #42413 has finished for PR 8566 at commit 85bbfde.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Stddev(child: Expression) extends StddevAgg(child)
    • case class StddevPop(child: Expression) extends StddevAgg(child)
    • case class StddevSamp(child: Expression) extends StddevAgg(child)
    • abstract class StddevAgg(child: Expression) extends AlgebraicAggregate
    • abstract class StddevAgg1(child: Expression) extends UnaryExpression with PartialAggregate1
    • case class Stddev(child: Expression) extends StddevAgg1(child)
    • case class StddevPop(child: Expression) extends StddevAgg1(child)
    • case class StddevSamp(child: Expression) extends StddevAgg1(child)
    • case class ComputePartialStd(child: Expression) extends UnaryExpression with AggregateExpression1
    • case class ComputePartialStdFunction (
    • case class MergePartialStd(
    • case class MergePartialStdFunction(
    • case class StddevFunction(

@liancheng
Contributor Author

retest this please

@SparkQA

SparkQA commented Oct 2, 2015

Test build #43161 has finished for PR 8566 at commit 85bbfde.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Contributor

yhuai commented Oct 2, 2015

LGTM. Merging to master.

@asfgit asfgit closed this in 01cd688 Oct 2, 2015
@liancheng liancheng deleted the spark-10400/deprecate-follow-parquet-format-spec branch October 2, 2015 01:04
@liancheng liancheng restored the spark-10400/deprecate-follow-parquet-format-spec branch October 6, 2015 23:11
asfgit pushed a commit that referenced this pull request Dec 26, 2017
## What changes were proposed in this pull request?
Some improvements:
1. Point out that we are using both Spark SQL native syntax and HQL syntax in the example.
2. Avoid using the same table name as the temp view, so as not to confuse users.
3. Create the external Hive table with a directory that already has data, which is a more common use case.
4. Remove the usage of `spark.sql.parquet.writeLegacyFormat`. This config was introduced by #8566 and has nothing to do with Hive.
5. Remove the `repartition` and `coalesce` examples. These two are not Hive-specific, so we should put them in a different example file. Besides, they can't accurately control the number of output files, since `spark.sql.files.maxRecordsPerFile` also affects it.

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #20081 from cloud-fan/minor.