
Merge produces too many small files in each directory. #345

Closed · jaehc opened this issue Mar 5, 2020 · 3 comments

jaehc commented Mar 5, 2020

Hi,

I am not sure whether I am allowed to post a question here; if not, please bear with me.

I am currently evaluating whether we could use Delta for our data repository. I have followed the use cases in the tutorial and found a few things I would like to clarify before I start using it.

One of them is that when I merge new data into an existing table with a MERGE command, I end up with too many files in each directory.

The Delta table I intend to use is partitioned by ('dt', 'dept') and will be accessed by Presto backed by the Hive Metastore.

Before this, I would normally use 'DISTRIBUTE BY', which reorganizes the data according to the partition spec so that all rows belonging to the same partition are guaranteed to go to a single writer task. This reduces the number of files in each partition directory.
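A minimal sketch of that pre-Delta pattern (table and column names here are illustrative, not from our actual job):

```scala
// `spark` is the active SparkSession. DISTRIBUTE BY routes all rows of a
// given (dt, dept) partition to a single writer task, so each partition
// directory ends up with one file per task instead of one per shuffle task.
spark.sql("""
  INSERT OVERWRITE TABLE warehouse.events PARTITION (dt, dept)
  SELECT * FROM staged_events
  DISTRIBUTE BY dt, dept
""")
```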

With Delta, this approach doesn't seem possible, because there appears to be no way to repartition the data before it is written when I use a MERGE command. The number of files generated is determined by 'spark.sql.shuffle.partitions' after the new data is joined with the old data.
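For context, the merge I am running looks roughly like this (the path and join condition are illustrative):

```scala
import io.delta.tables.DeltaTable

// Illustrative sketch: `spark` is the active SparkSession and `newData` is
// the incoming batch DataFrame for this cycle.
val target = DeltaTable.forPath(spark, "/warehouse/events")
target.as("t")
  .merge(newData.as("s"), "t.dt = s.dt AND t.dept = s.dept AND t.id = s.id")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()
// The join inside merge shuffles into spark.sql.shuffle.partitions tasks,
// and each task writes its own files into every partition directory it
// touches, hence the file explosion.
```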

Is there any way to handle this? Or any workaround?

Thank you.
Jaehong.


tdas commented Mar 7, 2020

This is a known issue. As you correctly said, the only knob right now is spark.sql.shuffle.partitions. Other folks have faced the same issue: https://delta-users.slack.com/archives/CJ70UCSHM/p1582931151094700
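That is, the only lever available today is something like:

```scala
// Fewer shuffle partitions => fewer output files per merge, at the cost of
// less write parallelism. The value here is illustrative.
spark.conf.set("spark.sql.shuffle.partitions", "32")
```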

The suggested solution is to add a configuration that repartitions the data before writing. Here is the issue that officially tracks this: #349
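A sketch of how such a knob might be used once #349 lands; the flag name here is hypothetical until that issue is resolved:

```scala
// Hypothetical flag (pending #349): repartition the merge output by the
// table's partition columns before writing, collapsing the files written
// into each partition directory.
spark.conf.set("spark.databricks.delta.merge.repartitionBeforeWrite.enabled", "true")
```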


jaehc commented Mar 7, 2020

Thank you @tdas for your answer and for pointing to the Slack channel. I suspected there would be others hitting the same issue.

I will close this issue and follow #349.

Incidentally, if many small files were the only problem, I could resort to running compaction regularly, but my main problem is write performance. Each batch cycle produces several hundred partitions, which causes the writer tasks to shuffle data on their own and really slows down the whole batch process. So, as a workaround for now, I am trying to reduce the number of partitions generated.
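For reference, the kind of compaction I had in mind is the usual rewrite-in-place pattern (the path, predicate, and file count are illustrative):

```scala
// Illustrative compaction pass: rewrite one partition into fewer files
// without logically changing the table's contents.
val path = "/warehouse/events"
spark.read.format("delta").load(path)
  .where("dt = '2020-03-05' AND dept = 'sales'")
  .repartition(4)                  // target file count, illustrative
  .write
  .format("delta")
  .mode("overwrite")
  .option("dataChange", "false")   // rearrangement only, no logical change
  .option("replaceWhere", "dt = '2020-03-05' AND dept = 'sales'")
  .save(path)
```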

jaehc closed this as completed Mar 7, 2020

tdas commented Mar 9, 2020

@jaehc thank you. Yes, we understand the issue, and hopefully we can have the repartition-based solution soon enough.
