
[SPARK-25356][SQL]Add Parquet block size option to SparkSQL configuration #22350

Closed
wants to merge 1 commit

Conversation

@10110346 (Contributor) commented Sep 6, 2018

What changes were proposed in this pull request?

I think we should make the Parquet block size configurable through Spark SQL when writing in the Parquet format.
For HDFS, dfs.block.size is configurable, and we sometimes want the Parquet block size to be consistent with it.
Likewise, it is probably best to keep spark.sql.files.maxPartitionBytes consistent with the Parquet block size when reading Parquet data.
We may also want to shrink the Parquet block size in some tests.
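
For illustration (not part of this PR's diff), here is a minimal sketch of the intent using the existing parquet.block.size Hadoop setting; the app name, size, and output path below are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object ParquetBlockSizeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("parquet-block-size-sketch").getOrCreate()

    // Hypothetical target: 128 MB, e.g. to match dfs.blocksize.
    val blockSize = 128 * 1024 * 1024

    // Existing Parquet-Hadoop setting that this PR proposed to expose as a SQL conf.
    spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", blockSize)

    // Keep the read-side partition size roughly consistent with the row-group size.
    spark.conf.set("spark.sql.files.maxPartitionBytes", blockSize.toString)

    spark.range(1000000L).write.mode("overwrite").parquet("/tmp/parquet_block_size_sketch")
    spark.stop()
  }
}
```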

How was this patch tested?

N/A

@@ -123,6 +123,9 @@ class ParquetFileFormat
// Sets compression scheme
conf.set(ParquetOutputFormat.COMPRESSION, parquetOptions.compressionCodecClassName)

// Sets Parquet block size
conf.setInt(ParquetOutputFormat.BLOCK_SIZE, sparkSession.sessionState.conf.parquetBlockSize)
Member:

To clarify: we can already set this via parquet.block.size, and this PR proposes an alias for it, right?

Contributor Author (@10110346):

Yes, we can already set this via parquet.block.size.
I also think we should document this parameter in sql-programming-guide.md.

Member:

I doubt this is common enough to deserve an alias and documentation in sql-programming-guide.md. In my experience, other configurations such as parquet.page.size, parquet.enable.dictionary, or parquet.writer.version are used about as often as this one.

I wouldn't add this for now.
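
For reference, a rough sketch of how such sibling Parquet-Hadoop settings can already be passed per write today, assuming a spark-shell session with `spark` in scope; the values and output path are made up:

```scala
// Sketch only: existing Parquet-Hadoop settings passed as per-write options,
// with no dedicated Spark SQL configuration. Values and path are hypothetical.
val df = spark.range(100000L).toDF("id")

df.write
  .option("parquet.block.size", (128 * 1024 * 1024).toString)
  .option("parquet.page.size", (1 * 1024 * 1024).toString)
  .option("parquet.enable.dictionary", "true")
  .mode("overwrite")
  .parquet("/tmp/parquet_write_options_sketch")
```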

Contributor Author (@10110346):

Sounds reasonable. I'll close it now, thanks.

@SparkQA commented Sep 6, 2018

Test build #95752 has finished for PR 22350 at commit 3485b52.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@10110346 closed this Sep 7, 2018
@AnirudhVyas:

Sorry, it's not clear what the setting is or how to adjust it. I wish to adjust the Parquet block size.

@HyukjinKwon (Member):

@AnirudhVyas, set the parquet.block.size Hadoop configuration on the Spark context, or pass it as a per-write option, e.g. option("parquet.block.size", "...").
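
A short sketch of both approaches, assuming a spark-shell session with `spark` and a DataFrame `df` in scope; the sizes and output paths are hypothetical:

```scala
// Approach 1: set the Hadoop configuration on the Spark context (applies to subsequent writes).
spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 128 * 1024 * 1024)
df.write.mode("overwrite").parquet("/tmp/parquet_out_ctx")

// Approach 2: pass it as a per-write option (applies to this write only).
df.write
  .option("parquet.block.size", (64 * 1024 * 1024).toString)
  .mode("overwrite")
  .parquet("/tmp/parquet_out_opt")
```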
