[SPARK-34251][SQL] Fix table stats calculation by TRUNCATE TABLE #31350

Closed
MaxGekk wants to merge 4 commits into apache:master from MaxGekk:fix-stats-in-trunc-table


@MaxGekk MaxGekk commented Jan 26, 2021

What changes were proposed in this pull request?

  1. Take into account the SQL config spark.sql.statistics.size.autoUpdate.enabled in the TRUNCATE TABLE command as other commands do.
  2. Re-calculate the actual table size from the file system. Before the changes, TRUNCATE TABLE always set the table size to 0 in the stats.
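
The two changes can be pictured with a small toy model. The Python sketch below is not Spark's code: `stats_after_truncate` and its parameters are hypothetical names, and the behaviour with the flag off (dropping the stored size instead of writing 0) is an assumption made only for illustration.

```python
from typing import Iterable, Optional


def stats_after_truncate(
    remaining_file_sizes: Iterable[int],
    auto_update_enabled: bool,
) -> Optional[int]:
    """Return the table size in bytes to record after a truncate, or None.

    Toy model only: every name here is hypothetical, not a Spark API.
    """
    if auto_update_enabled:
        # Models spark.sql.statistics.size.autoUpdate.enabled=true:
        # recompute the size from the files that are still present.
        return sum(remaining_file_sizes)
    # Assumption for this sketch: with the flag off, the stale size is
    # dropped rather than being overwritten with 0.
    return None


# Two 2-byte partitions, one of them truncated: one 2-byte file remains.
print(stats_after_truncate([2], auto_update_enabled=True))
```

The point of the sketch is that the recorded size comes from what is left on disk, so a partial truncate no longer reports an empty table.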

Why are the changes needed?

This fixes a bug demonstrated by the following example:

  1. Create a partitioned table with 2 non-empty partitions:
spark-sql> CREATE TABLE tbl (c0 int, part int) PARTITIONED BY (part);
spark-sql> INSERT INTO tbl PARTITION (part=0) SELECT 0;
spark-sql> INSERT INTO tbl PARTITION (part=1) SELECT 1;
spark-sql> ANALYZE TABLE tbl COMPUTE STATISTICS;
spark-sql> DESCRIBE TABLE EXTENDED tbl;
...
Statistics	4 bytes, 2 rows
...
  2. Truncate only one partition:
spark-sql> TRUNCATE TABLE tbl PARTITION (part=1);
spark-sql> SELECT * FROM tbl;
0	0
  3. The table is still non-empty, but TRUNCATE TABLE reset the stats:
spark-sql> DESCRIBE TABLE EXTENDED tbl;
...
Statistics	0 bytes, 0 rows
...
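
The size arithmetic in this example can be reproduced without Spark. The following is a minimal sketch that stands in for the table with a temporary directory holding two 2-byte partition files; `dir_size`, `truncate_partition`, and `demo` are hypothetical helpers, and the directory layout is an assumption rather than what Spark writes.

```python
import os
import tempfile


def dir_size(root: str) -> int:
    """Sum the sizes of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total


def truncate_partition(table_dir: str, partition: str) -> None:
    """Delete every file in one partition directory, keeping the directory."""
    part_dir = os.path.join(table_dir, partition)
    for name in os.listdir(part_dir):
        os.remove(os.path.join(part_dir, name))


def demo() -> tuple:
    """Build a two-partition toy table, truncate one partition, report sizes."""
    with tempfile.TemporaryDirectory() as table_dir:
        for part in ("part=0", "part=1"):
            os.makedirs(os.path.join(table_dir, part))
            with open(os.path.join(table_dir, part, "data"), "wb") as f:
                f.write(b"xx")  # 2 bytes per partition, 4 bytes in total
        before = dir_size(table_dir)
        truncate_partition(table_dir, "part=1")
        after = dir_size(table_dir)
    return before, after


before, after = demo()
print(before, after)
```

Under these assumptions the recomputed size after the truncate is 2 bytes, not the 0 bytes that the old behaviour recorded.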

Does this PR introduce any user-facing change?

Yes. It could affect the performance of subsequent queries, since table statistics feed into query planning.

How was this patch tested?

Added a new test to StatisticsCollectionSuite and ran:

$ build/sbt -Phive -Phive-thriftserver "test:testOnly *StatisticsCollectionSuite"
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *StatisticsSuite"

SparkQA commented Jan 26, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39103/

SparkQA commented Jan 26, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39103/

@github-actions github-actions bot added the SQL label Jan 26, 2021

MaxGekk commented Jan 26, 2021

@dongjoon-hyun @sunchao @cloud-fan @HyukjinKwon Could you review this fix, please?

@sunchao sunchao left a comment

LGTM (non-binding) - thanks @MaxGekk

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 0d08e22 Jan 27, 2021
skestle pushed a commit to skestle/spark that referenced this pull request Feb 3, 2021

Closes apache#31350 from MaxGekk/fix-stats-in-trunc-table.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>