[SPARK-24626] [SQL] Improve location size calculation in Analyze Table command #21608
What changes were proposed in this pull request?
Currently, the Analyze Table command calculates the table size by computing the size of each partition sequentially. We can parallelize the size calculation across partitions.
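To illustrate the idea, here is a minimal sketch of sequential versus parallel size calculation over partition directories. It is not the code in this patch: it assumes partitions are local directories read through `java.nio.file` rather than Hadoop's `FileSystem`, and the class and method names (`PartitionSizes`, `directorySize`, `totalSizeSequential`, `totalSizeParallel`) are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper for illustration only; not part of Spark.
public class PartitionSizes {

    // Sum the sizes (in bytes) of all regular files under one partition directory.
    static long directorySize(Path dir) {
        try (Stream<Path> files = Files.walk(dir)) {
            return files
                .filter(Files::isRegularFile)
                .mapToLong(p -> {
                    try {
                        return Files.size(p);
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                })
                .sum();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Baseline: visit the partitions one after another.
    static long totalSizeSequential(List<Path> partitionDirs) {
        return partitionDirs.stream()
            .mapToLong(PartitionSizes::directorySize)
            .sum();
    }

    // Same computation, but each partition is sized on a worker thread
    // (assumption: partition directories are independent of each other).
    static long totalSizeParallel(List<Path> partitionDirs) {
        return partitionDirs.parallelStream()
            .mapToLong(PartitionSizes::directorySize)
            .sum();
    }
}
```

Both methods return the same total; only the scheduling differs. Parallel streams run on the JVM's common fork-join pool, so for high-latency stores such as S3 a dedicated thread pool may be a better fit for blocking listing calls.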
Results: tested on a table with 100 partitions and data stored in S3.
Without changes:
How was this patch tested?
Simple unit test.
Yes. In the case where the data is stored in S3, I noticed a significant difference.
Some rough numbers: for a table in S3 with 100 partitions, the calculateTotalSize method took about 90 seconds when run serially versus 30-40 seconds when run in parallel.