[SPARK-47222][SQL] fileCompressionFactor also applied to the size of the table #45329

Closed
wants to merge 1 commit into from

@wangyum wangyum commented Feb 29, 2024

What changes were proposed in this pull request?

This PR also applies spark.sql.sources.fileCompressionFactor to the table size taken from table statistics, not only to the size computed from the file location.

Why are the changes needed?

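To keep the estimated size consistent. Currently the factor is applied when sizeInBytes is derived from the table location, but not when it is taken from table statistics. For example: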

bin/spark-shell --conf spark.sql.catalogImplementation=in-memory
scala> spark.range(5).write.parquet("/tmp/spark/parquet")
                                                                                
scala> spark.sql("set spark.sql.sources.fileCompressionFactor=2.0")
res1: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.sql("create table t(id long) using parquet location '/tmp/spark/parquet'")
res2: org.apache.spark.sql.DataFrame = []

scala> spark.sql("select * from t").explain("cost")
== Optimized Logical Plan ==
Relation spark_catalog.default.t[id#13L] parquet, Statistics(sizeInBytes=5.2 KiB) // The sizeInBytes is location.sizeInBytes * spark.sql.sources.fileCompressionFactor

scala> sql("ANALYZE TABLE t COMPUTE STATISTICS noscan")
res4: org.apache.spark.sql.DataFrame = []

scala> spark.sql("select * from t").explain("cost")
== Optimized Logical Plan ==
Relation spark_catalog.default.t[id#17L] parquet, Statistics(sizeInBytes=2.6 KiB) // The sizeInBytes now comes from table statistics, and the compression factor is not applied

After this PR:

scala> spark.sql("select * from t").explain("cost")
== Optimized Logical Plan ==
Relation spark_catalog.default.t[id#17L] parquet, Statistics(sizeInBytes=5.2 KiB) // The sizeInBytes is table statistics * spark.sql.sources.fileCompressionFactor
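
The difference between the three plans above can be modeled with a small, Spark-free sketch. This is not the code changed by the PR: the object name, the `estimate` function, the `scaleTableStats` flag, and the 2662-byte size (roughly the 2.6 KiB shown above) are all assumptions made for illustration.

```scala
// Hypothetical model of the size estimate seen in the transcript above.
// None of these names come from Spark's internals.
object SizeEstimateSketch {

  /** Size reported to the optimizer.
    *
    * @param locationSizeInBytes total size of the files under the table location
    * @param statsSizeInBytes    size recorded by ANALYZE TABLE, if statistics exist
    * @param compressionFactor   value of spark.sql.sources.fileCompressionFactor
    * @param scaleTableStats     false models the behavior before the PR, true after it
    */
  def estimate(
      locationSizeInBytes: Long,
      statsSizeInBytes: Option[Long],
      compressionFactor: Double,
      scaleTableStats: Boolean): Long = statsSizeInBytes match {
    // After the PR: table statistics are scaled by the factor as well.
    case Some(size) if scaleTableStats => (size * compressionFactor).toLong
    // Before the PR: table statistics are used unscaled.
    case Some(size) => size
    // No statistics: fall back to the location size, scaled by the factor.
    case None => (locationSizeInBytes * compressionFactor).toLong
  }

  def main(args: Array[String]): Unit = {
    val onDisk = 2662L // assumed byte count, about 2.6 KiB
    println(estimate(onDisk, None, 2.0, scaleTableStats = false))         // 5324
    println(estimate(onDisk, Some(onDisk), 2.0, scaleTableStats = false)) // 2662
    println(estimate(onDisk, Some(onDisk), 2.0, scaleTableStats = true))  // 5324
  }
}
```

With the flag set to false, running ANALYZE TABLE halves the reported size even though the data is unchanged. With it set to true, both paths report the same value, which is the consistency this PR is after.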

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test.

Was this patch authored or co-authored using generative AI tooling?

No.

github-actions bot added the SQL label Feb 29, 2024

wangyum commented Feb 29, 2024

cc @cloud-fan

github-actions bot commented Jun 9, 2024

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions bot added the Stale label Jun 9, 2024
github-actions bot closed this Jun 10, 2024