
[SPARK-25356][SQL]Add Parquet block size option to SparkSQL configuration #22350

Closed
wants to merge 1 commit

Conversation

@10110346 (Contributor) commented Sep 6, 2018

What changes were proposed in this pull request?

I think we should make the Parquet block size configurable through Spark SQL when writing in the Parquet format.
For HDFS, dfs.block.size is configurable, and we sometimes want the Parquet block size to be consistent with it.
Likewise, it is probably best to keep spark.sql.files.maxPartitionBytes consistent with the Parquet block size when reading Parquet data.
We may also want to shrink the Parquet block size in some tests.
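
For illustration (not part of this PR's diff), here is a minimal sketch of the intent using the existing parquet.block.size Hadoop setting; the app name, size, and output path below are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object ParquetBlockSizeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("parquet-block-size-sketch").getOrCreate()

    // Hypothetical target: 128 MB, e.g. to match dfs.blocksize.
    val blockSize = 128 * 1024 * 1024

    // Existing Parquet-Hadoop setting that this PR proposed to expose as a SQL conf.
    spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", blockSize)

    // Keep the read-side partition size roughly consistent with the row-group size.
    spark.conf.set("spark.sql.files.maxPartitionBytes", blockSize.toString)

    spark.range(1000000L).write.mode("overwrite").parquet("/tmp/parquet_block_size_sketch")
    spark.stop()
  }
}
```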

How was this patch tested?

N/A

@@ -123,6 +123,9 @@ class ParquetFileFormat
// Sets compression scheme
conf.set(ParquetOutputFormat.COMPRESSION, parquetOptions.compressionCodecClassName)

// Sets Parquet block size
conf.setInt(ParquetOutputFormat.BLOCK_SIZE, sparkSession.sessionState.conf.parquetBlockSize)
Member:

To clarify: we can already set this via parquet.block.size, and this PR proposes an alias for it, right?

Contributor Author (@10110346):

Yes, we can already set this via parquet.block.size.
I also think we should document this parameter in sql-programming-guide.md.

Member:

I doubt this is common enough to deserve an alias and documentation in sql-programming-guide.md. In my experience, other configurations such as parquet.page.size, parquet.enable.dictionary, or parquet.writer.version are used about as often as this one.

I wouldn't add this for now.
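
For reference, a rough sketch of how such sibling Parquet-Hadoop settings can already be passed per write today, assuming a spark-shell session with `spark` in scope; the values and output path are made up:

```scala
// Sketch only: existing Parquet-Hadoop settings passed as per-write options,
// with no dedicated Spark SQL configuration. Values and path are hypothetical.
val df = spark.range(100000L).toDF("id")

df.write
  .option("parquet.block.size", (128 * 1024 * 1024).toString)
  .option("parquet.page.size", (1 * 1024 * 1024).toString)
  .option("parquet.enable.dictionary", "true")
  .mode("overwrite")
  .parquet("/tmp/parquet_write_options_sketch")
```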

Contributor Author (@10110346):

Sounds reasonable. I'll close it now, thanks.

@SparkQA commented Sep 6, 2018

Test build #95752 has finished for PR 22350 at commit 3485b52.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@10110346 closed this Sep 7, 2018
@AnirudhVyas:

Sorry, it's not clear what the setting is or how to adjust it. I wish to adjust the Parquet block size.

@HyukjinKwon (Member):

@AnirudhVyas, set the parquet.block.size Hadoop configuration on the Spark context, or pass it as a per-write option, e.g. option("parquet.block.size", "...").
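
A short sketch of both approaches, assuming a spark-shell session with `spark` and a DataFrame `df` in scope; the sizes and output paths are hypothetical:

```scala
// Approach 1: set the Hadoop configuration on the Spark context (applies to subsequent writes).
spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 128 * 1024 * 1024)
df.write.mode("overwrite").parquet("/tmp/parquet_out_ctx")

// Approach 2: pass it as a per-write option (applies to this write only).
df.write
  .option("parquet.block.size", (64 * 1024 * 1024).toString)
  .mode("overwrite")
  .parquet("/tmp/parquet_out_opt")
```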
