
Merge is slower than expected and loads more than expected into memory. #2573

Open
adamfaulkner-at opened this issue Jun 5, 2024 · 5 comments
Labels: bug (Something isn't working), question (Further information is requested)

Comments

@adamfaulkner-at

Environment

Delta-rs version: 0.17.3

Binding: Rust

Environment:

  • Cloud provider: AWS
  • OS: Ubuntu 22.04.3 LTS
  • Other:

Bug

What happened:

Given a table with ~100M rows in it, stored as a Delta Lake table in S3 and sorted by "rideid" (about 38 Parquet files of roughly 100MB each), I'm trying to "upsert" 1 row using code that looks like this:

        (table, _) = DeltaOps(table)
            .merge(source_df, col("source.rideid").eq(col("target.rideid")))
            .with_source_alias("source")
            .with_target_alias("target")
            .when_not_matched_insert(|insert| {
                COLUMNS
                    .iter()
                    .fold(insert, |insert, &column| {
                        insert.set(column, col(format!("source.{}", column)))
                    })
            })?
            .when_matched_update(|update| {
                COLUMNS.iter().fold(update, |update, &column| {
                    update.update(column, col(format!("source.{}", column)))
                })
            })?
            .await?;

(COLUMNS is simply an array that contains all 13 of the columns in the table.) This consumes all of my computer's memory and then crashes.

I've tried partitioning the data by a hash of the rideid, but that doesn't change anything: I still run out of memory and cannot run this operation.

What you expected to happen:

This is pretty surprising, because I can run the join in DataFusion pretty efficiently:

        let overlap_results = ctx.sql("SELECT target.rideid FROM target LEFT JOIN source ON target.rideid = source.rideid WHERE source.rideid IS NOT NULL").await?.collect().await?;

This query takes about 2 seconds and consumes only ~500MB of memory. I can build my own upsert on top of this sort of DataFusion query plus a delete and a write, and that seems to work fine.
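For reference, a minimal sketch of that query-then-delete-then-append path is below. The function and variable names (manual_upsert, new_rows) are illustrative, and exact module paths may differ slightly between deltalake versions:

    use deltalake::arrow::record_batch::RecordBatch;
    use deltalake::datafusion::prelude::{col, lit};
    use deltalake::protocol::SaveMode;
    use deltalake::{DeltaOps, DeltaTable, DeltaTableError};

    // Sketch of a manual "upsert": delete any existing row with the same key,
    // then append the replacement rows. `rideid` and `new_rows` are illustrative.
    async fn manual_upsert(
        table: DeltaTable,
        rideid: i64,
        new_rows: RecordBatch,
    ) -> Result<DeltaTable, DeltaTableError> {
        // Delete any existing row with this key.
        let (table, _metrics) = DeltaOps(table)
            .delete()
            .with_predicate(col("rideid").eq(lit(rideid)))
            .await?;

        // Append the new version of the row.
        let table = DeltaOps(table)
            .write(vec![new_rows])
            .with_save_mode(SaveMode::Append)
            .await?;

        Ok(table)
    }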

How to reproduce it:

Implement an "upsert" operation using merge on a table with ~100M rows, observe how much memory this consumes.


@adamfaulkner-at added the bug label Jun 5, 2024
@ion-elgreco
Collaborator

It needs to scan the entire table if you don't use partitioning. If you do partition, then you need to give an explicit partition predicate to reduce the number of partitions you read.
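One reading of that advice, based on the snippet from the issue: if the table were partitioned by a (hypothetical) ride_bucket column, the partition constraint can be AND-ed into the merge predicate so the scan can prune files from other partitions. Untested sketch, with lit coming from the DataFusion prelude:

    // Hypothetical partition column `ride_bucket`; here the incoming row falls
    // into bucket 17. Constraining the partition column in the merge predicate
    // lets the scan skip files from other partitions.
    (table, _) = DeltaOps(table)
        .merge(
            source_df,
            col("target.ride_bucket")
                .eq(lit(17))
                .and(col("source.rideid").eq(col("target.rideid"))),
        )
        .with_source_alias("source")
        .with_target_alias("target")
        .when_not_matched_insert(|insert| {
            COLUMNS.iter().fold(insert, |insert, &column| {
                insert.set(column, col(format!("source.{}", column)))
            })
        })?
        .when_matched_update(|update| {
            COLUMNS.iter().fold(update, |update, &column| {
                update.update(column, col(format!("source.{}", column)))
            })
        })?
        .await?;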

@ion-elgreco added the question label Jun 7, 2024
@adamfaulkner-at
Author

Thanks for the reply @ion-elgreco. Why does it need to scan the entire table into memory before it starts writing data? Is this just a lack of optimization, or is there something fundamental to what merge is doing that prevents this kind of optimization?

@vegarsti
Contributor

> Thanks for the reply @ion-elgreco. Why does it need to scan the entire table into memory before it starts writing data? Is this just a lack of optimization, or is there something fundamental to what merge is doing that prevents this kind of optimization?

It needs to scan the entire table because it has to find out which rows the merge condition applies to.

@adamfaulkner-at
Author

> It needs to scan the entire table because it has to find out which rows the merge condition applies to.

I understand this. However, it doesn't need to hold the entire table in memory while it is performing the merge. It could do this in a streaming fashion; this is more or less what you get out of the box with DataFusion.

@adamfaulkner-at
Author

To answer my own questions here:

> why does it need to scan the entire table into memory before it starts writing data

MergeBarrier holds on to all records for a particular file until either a delete, update, or insert is encountered, or until the input data is fully exhausted. This means that in workloads with a large input set, where the merge does not typically produce deletes, updates, or inserts, the entire dataset will usually be buffered in memory.

This could be avoided if we somehow explicitly told DataFusion to fully exhaust one file at a time so that data could be flushed.

I can imagine using partitioning to break the merge into many operations so that all the data is not pulled into memory at once. But in my case, I'd probably rather put the effort towards not using Merge.
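A rough sketch of that per-partition idea, assuming a hypothetical ride_bucket partition column, a NUM_BUCKETS constant, and a merge_one_bucket helper that wraps the merge call from the original snippet with the bucket AND-ed into the predicate (untested):

    // Untested sketch: merge one partition at a time so only that partition's
    // files are buffered at once. `NUM_BUCKETS`, `ride_bucket`, and
    // `merge_one_bucket` are hypothetical; the helper is the merge call from
    // the snippet above with `col("target.ride_bucket").eq(lit(bucket))`
    // added to the predicate.
    for bucket in 0..NUM_BUCKETS {
        // Only the source rows that belong to this bucket.
        let bucket_df = source_df
            .clone()
            .filter(col("ride_bucket").eq(lit(bucket)))?;

        table = merge_one_bucket(table, bucket_df, bucket).await?;
    }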
