
Merge produces too many small files in each directory. #345

Closed · jaehc opened this issue Mar 5, 2020 · 3 comments

jaehc commented Mar 5, 2020

Hi,

I am not sure whether I am allowed to post a question here; if not, please bear with me.

I am currently evaluating whether we could use Delta for our data repository. I have followed the use cases in the tutorial and found a few things I would like to clarify before I start using it.

One of them is that when I merge new data into an existing table with a MERGE command, I end up with too many files in each directory.

The Delta table I intend to use is partitioned by ('dt', 'dept') and will be accessed by Presto backed by the Hive Metastore.

Before this, I would normally use 'DISTRIBUTE BY', which reorganizes the data according to the partition spec so that all rows belonging to the same partition are guaranteed to go to a single writer task. This reduces the number of files in each partition directory.
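A minimal sketch of that pre-Delta pattern (table and column names here are illustrative, not from our actual job):

```scala
// `spark` is the active SparkSession. DISTRIBUTE BY routes all rows of a
// given (dt, dept) partition to a single writer task, so each partition
// directory ends up with one file per task instead of one per shuffle task.
spark.sql("""
  INSERT OVERWRITE TABLE warehouse.events PARTITION (dt, dept)
  SELECT * FROM staged_events
  DISTRIBUTE BY dt, dept
""")
```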

With Delta, this approach doesn't seem possible, because there appears to be no way to repartition the data before it is written when I use a MERGE command. The number of files generated is determined by 'spark.sql.shuffle.partitions' after the new data is joined with the old data.
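For context, the merge I am running looks roughly like this (the path and join condition are illustrative):

```scala
import io.delta.tables.DeltaTable

// Illustrative sketch: `spark` is the active SparkSession and `newData` is
// the incoming batch DataFrame for this cycle.
val target = DeltaTable.forPath(spark, "/warehouse/events")
target.as("t")
  .merge(newData.as("s"), "t.dt = s.dt AND t.dept = s.dept AND t.id = s.id")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()
// The join inside merge shuffles into spark.sql.shuffle.partitions tasks,
// and each task writes its own files into every partition directory it
// touches, hence the file explosion.
```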

Is there any way to handle this? Or any workaround?

Thank you.
Jaehong.


tdas commented Mar 7, 2020

This is a known issue. As you correctly said, the only knob right now is spark.sql.shuffle.partitions. Other folks have faced the same issue: https://delta-users.slack.com/archives/CJ70UCSHM/p1582931151094700
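That is, the only lever available today is something like:

```scala
// Fewer shuffle partitions => fewer output files per merge, at the cost of
// less write parallelism. The value here is illustrative.
spark.conf.set("spark.sql.shuffle.partitions", "32")
```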

The suggested solution is to add a configuration that repartitions the data before writing. Here is the issue that officially tracks this: #349
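A sketch of how such a knob might be used once #349 lands; the flag name here is hypothetical until that issue is resolved:

```scala
// Hypothetical flag (pending #349): repartition the merge output by the
// table's partition columns before writing, collapsing the files written
// into each partition directory.
spark.conf.set("spark.databricks.delta.merge.repartitionBeforeWrite.enabled", "true")
```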


jaehc commented Mar 7, 2020

Thank you @tdas for your answer and for pointing to the Slack channel. I suspected there would be others hitting the same issue.

I will close this issue and follow #349.

Incidentally, if many small files were the only problem, I could resort to running compaction regularly, but my main problem is write performance. Each batch cycle produces several hundred partitions, which causes the writer tasks to shuffle data on their own and really slows down the whole batch process. So, as a workaround for now, I am trying to reduce the number of partitions generated.
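For reference, the kind of compaction I had in mind is the usual rewrite-in-place pattern (the path, predicate, and file count are illustrative):

```scala
// Illustrative compaction pass: rewrite one partition into fewer files
// without logically changing the table's contents.
val path = "/warehouse/events"
spark.read.format("delta").load(path)
  .where("dt = '2020-03-05' AND dept = 'sales'")
  .repartition(4)                  // target file count, illustrative
  .write
  .format("delta")
  .mode("overwrite")
  .option("dataChange", "false")   // rearrangement only, no logical change
  .option("replaceWhere", "dt = '2020-03-05' AND dept = 'sales'")
  .save(path)
```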

jaehc closed this as completed Mar 7, 2020

tdas commented Mar 9, 2020

@jaehc thank you. Yes, we understand the issue, and hopefully we can have the repartition-based solution soon enough.
