Merge is slower than expected and loads more than expected into memory. #2573
Comments
It needs to scan the entire table if you don't use partitioning. If you do partition, then you need to give an explicit partition predicate to reduce the number of partitions you read.
Thanks for the reply @ion-elgreco. Why does it need to scan the entire table into memory before it starts writing data? Is this just a lack of optimization, or is there something fundamental to what merge is doing that prevents this kind of optimization?
It needs to scan the entire table because it needs to find out which rows the merge condition applies to.
I understand this. However, it doesn't need to hold the entire table in memory while it is performing the merge. It could do this in a streaming fashion; this is more or less what you get out of the box with DataFusion.
To answer my own questions here:
This could be avoided if we somehow explicitly told DataFusion to fully exhaust one file at a time so that data could be flushed. I can imagine using partitioning to break the merge into many operations so that all the data is not pulled into memory at once. But in my case, I'd probably rather put the effort towards not using Merge. |
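For illustration, here is a hedged sketch of the per-partition approach described above, assuming a hypothetical hash-bucket partition column named `ride_bucket`; the column names and types are placeholders, not the actual schema from this report:

```rust
use deltalake::datafusion::prelude::{col, lit, DataFrame};
use deltalake::{DeltaOps, DeltaTable, DeltaTableError};

// Hypothetical: merge only the rows destined for one hash bucket, so the scan
// and rewrite are limited to that partition instead of the whole table.
async fn merge_one_bucket(
    table: DeltaTable,
    bucket_source: DataFrame,
    bucket: i32,
) -> Result<DeltaTable, DeltaTableError> {
    // Pinning the partition column in the merge condition lets delta-rs prune
    // every other partition's files.
    let predicate = col("target.ride_bucket")
        .eq(lit(bucket))
        .and(col("target.rideid").eq(col("source.rideid")));

    let (table, _metrics) = DeltaOps(table)
        .merge(bucket_source, predicate)
        .with_source_alias("source")
        .with_target_alias("target")
        .when_matched_update(|u| u.update("fare", col("source.fare")))?
        .when_not_matched_insert(|i| {
            i.set("rideid", col("source.rideid"))
                .set("ride_bucket", col("source.ride_bucket"))
                .set("fare", col("source.fare"))
        })?
        .await?;
    Ok(table)
}
```

Running one such merge per bucket keeps each operation's working set to a single partition, at the cost of issuing many commits.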
Environment
Delta-rs version: 0.17.3
Binding: Rust
Environment:
Bug
What happened:
Given an existing table with ~100M rows, stored as a Delta Lake table in S3 and sorted by "rideid" (roughly 38 Parquet files of about 100 MB each), I'm trying to "upsert" 1 row using code that looks like this:
(COLUMNS is simply an array containing all 13 of the table's columns.) This consumes all of my computer's memory and then crashes.
I've tried partitioning the data using a hash of the rideid, but this doesn't change the fact that I run out of memory and cannot run this operation.
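The original snippet isn't reproduced above; the following is a minimal sketch of the kind of merge-based upsert being described, using the delta-rs Rust MergeBuilder API. The table URI, the column list, and the `new_rows` batch are placeholders, and S3 credential setup is omitted:

```rust
use deltalake::arrow::record_batch::RecordBatch;
use deltalake::datafusion::prelude::{col, SessionContext};
use deltalake::{open_table, DeltaOps, DeltaTableError};

// Placeholder: the real table has 13 columns; only a couple are listed here.
const COLUMNS: &[&str] = &["rideid", "fare" /* ...remaining columns... */];

async fn upsert(new_rows: RecordBatch) -> Result<(), DeltaTableError> {
    let table = open_table("s3://my-bucket/rides").await?;

    // Wrap the single-row batch in a DataFusion DataFrame to use as the merge source.
    let ctx = SessionContext::new();
    let source = ctx.read_batch(new_rows)?;

    let (_table, metrics) = DeltaOps(table)
        .merge(source, col("target.rideid").eq(col("source.rideid")))
        .with_source_alias("source")
        .with_target_alias("target")
        // When a matching rideid exists, overwrite every column from the source row.
        .when_matched_update(|update| {
            COLUMNS.iter().fold(update, |u, c| {
                u.update(*c, col(format!("source.{c}").as_str()))
            })
        })?
        // When no match exists, insert the source row as-is.
        .when_not_matched_insert(|insert| {
            COLUMNS.iter().fold(insert, |i, c| {
                i.set(*c, col(format!("source.{c}").as_str()))
            })
        })?
        .await?;

    println!("merge metrics: {metrics:?}");
    Ok(())
}
```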
What you expected to happen:
This is pretty surprising, because I can run the join in DataFusion pretty efficiently:
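The original query isn't shown above; a rough sketch of the kind of DataFusion join being described might look like this (the table URI, registered table names, and the one-row `updates` batch are placeholders, and the `datafusion` feature of delta-rs is assumed):

```rust
use std::sync::Arc;

use deltalake::arrow::record_batch::RecordBatch;
use deltalake::datafusion::error::Result;
use deltalake::datafusion::prelude::SessionContext;

async fn find_matching_rows(updates: RecordBatch) -> Result<()> {
    let ctx = SessionContext::new();

    // DeltaTable implements DataFusion's TableProvider, so it can be registered directly.
    let table = deltalake::open_table("s3://my-bucket/rides").await.unwrap();
    ctx.register_table("rides", Arc::new(table))?;
    ctx.register_batch("updates", updates)?;

    // A plain join against the 100M-row table; DataFusion streams the scan
    // rather than materializing the whole table in memory.
    let df = ctx
        .sql("SELECT r.* FROM rides r INNER JOIN updates u ON r.rideid = u.rideid")
        .await?;
    df.show().await?;
    Ok(())
}
```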
This query takes about 2 seconds and consumes only ~500 MB of memory. I can build my own upsert on top of this kind of DataFusion query followed by a delete and a write, and that seems to work fine.
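A hedged sketch of the hand-rolled upsert alluded to above, using the delta-rs delete and write builders (the table URI, predicate column, and replacement batch are placeholders):

```rust
use deltalake::arrow::record_batch::RecordBatch;
use deltalake::datafusion::prelude::{col, lit};
use deltalake::protocol::SaveMode;
use deltalake::{open_table, DeltaOps, DeltaTableError};

async fn manual_upsert(new_rows: RecordBatch, ride_id: i64) -> Result<(), DeltaTableError> {
    // Step 1: delete any existing row with the same rideid.
    let table = open_table("s3://my-bucket/rides").await?;
    let (table, _delete_metrics) = DeltaOps(table)
        .delete()
        .with_predicate(col("rideid").eq(lit(ride_id)))
        .await?;

    // Step 2: append the replacement row.
    let _table = DeltaOps(table)
        .write(vec![new_rows])
        .with_save_mode(SaveMode::Append)
        .await?;
    Ok(())
}
```

Unlike merge, this produces two separate commits, so the upsert is not atomic; the trade-off here is lower memory use in exchange for weaker transactional guarantees.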
How to reproduce it:
Implement an "upsert" operation using merge on a table with ~100M rows, observe how much memory this consumes.
More details: