Use ThreadUtils.parmap for optimize #1315

Closed

Conversation

Kimahriman (Contributor)

Description

Resolves #1220

Uses ThreadUtils.parmap to parallelize the compaction instead of parallel collections. This should improve the "tail" of the compaction execution. It seems that the parallel collections method buckets the jobs into maxThreads groups and then executes each group with one of the threads in the pool; parmap is more of a proper queue-based approach, so any remaining tasks can be picked up by any free thread.
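
For illustration, here is a rough Scala sketch of the before/after shape (not the exact Delta diff; jobs, runBin, and maxThreads stand in for the compaction bins, the per-bin compaction job, and the spark.databricks.delta.optimize.maxThreads setting):

import java.util.concurrent.ForkJoinPool
import scala.collection.parallel.ForkJoinTaskSupport
import org.apache.spark.util.ThreadUtils

// Before: parallel collections on a bounded ForkJoinPool. Per the description
// above, the jobs effectively get bucketed into maxThreads groups up front.
val parJobs = jobs.par
parJobs.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(maxThreads))
val updatesBefore = parJobs.map(runBin).seq

// After: ThreadUtils.parmap submits one Future per job to the pool, so any
// free thread can pick up whatever jobs remain.
val updatesAfter = ThreadUtils.parmap(jobs, "OptimizeJob", maxThreads)(runBin)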

How was this patch tested?

Existing UTs.

Does this PR introduce any user-facing changes?

Just a tail performance gain for OPTIMIZE commands.

@Kimahriman (Contributor, Author)

Note: I haven't actually run this at scale to know for sure that it improves the tail of the execution as described; it's more that the change logically makes sense to me.

Also, I might make a follow-up to allow failures by default instead of the fail-fast behavior I was going for before. Having something run for hours and write dozens of TB of data, only for none of it to be committed at the end because of a bad partition, is painful 😅
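
Purely as a hypothetical sketch of what that follow-up could look like (nothing in this PR does this; jobs, runBin, and maxThreads are the same placeholders as above): each bin job could be wrapped in Try so one bad partition doesn't abort the whole run.

import scala.util.{Failure, Success, Try}
import org.apache.spark.util.ThreadUtils

// Hypothetical "allow failures" shape: run every bin, commit the successes,
// and surface the failures instead of failing the whole command.
val results = ThreadUtils.parmap(jobs, "OptimizeJob", maxThreads)(job => Try(runBin(job)))
val updates = results.collect { case Success(actions) => actions }.flatten
val errors  = results.collect { case Failure(e) => e }
// commit `updates` as usual and report `errors` in the command's output/metrics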

@Kimahriman (Contributor, Author)

Threw this together to try to induce some skew.

from datetime import datetime
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.config('spark.databricks.delta.optimize.maxThreads', '2').getOrCreate()

path = 'optimize-test'  # table location; adjust as needed
columns = 100

# Large partitions: 1M rows across parts 0-4, 10 files per partition
df = (spark.range(1_000_000)
    .withColumn('part', F.col('id') % 5)
    .withColumns({f'c{i}': F.rand() for i in range(columns)})
    .repartition('part')
)
for _ in range(10):
    df.write.format('delta').partitionBy('part').mode('append').save(path)

# Medium partitions: 100K rows across parts 5-9, 10 files per partition
df = (spark.range(100_000)
    .withColumn('part', (F.col('id') % 5) + 5)
    .withColumns({f'c{i}': F.rand() for i in range(columns)})
    .repartition('part')
)
for _ in range(10):
    df.write.format('delta').partitionBy('part').mode('append').save(path)

# Small partitions: 10K rows across parts 10-14, 10 files per partition
df = (spark.range(10_000)
    .withColumn('part', (F.col('id') % 5) + 10)
    .withColumns({f'c{i}': F.rand() for i in range(columns)})
    .repartition('part')
)
for _ in range(10):
    df.write.format('delta').partitionBy('part').mode('append').save(path)

table = DeltaTable.forPath(spark, path)
start = datetime.now()
table.optimize().executeCompaction()
duration = datetime.now() - start
print(duration)

Ran it three times on each branch:
Master: 142-160 seconds
This branch: 119-129 seconds

@zsxwing (Member) left a comment

LGTM. Thanks for the benchmark result!

@tdas closed this in de7ba23 on Aug 11, 2022
@allisonport-db added this to the 2.1.0 milestone on Aug 28, 2022
@talecsander

Could you please also add this to version 2.0?

Development

Successfully merging this pull request may close these issues.

OPTIMIZE jobs aren't fully parallelized until the end of the execution