
Reusing intermediate results causes memory issues #854

Open
hendrikmakait opened this issue Feb 6, 2024 · 6 comments

@hendrikmakait
Member

hendrikmakait commented Feb 6, 2024

Problem

Whenever we reuse an intermediate result and one of its consumers is a pipeline breaker (such as a shuffle, join, reduction, or groupby operation), we are forced to materialize the entire intermediate result. This breaks the pipelining we could otherwise exploit for reuse, for example between multiple element-wise operations.

This materialization puts a hard limit on our ability to scale, as I have observed in multiple TPC-H benchmark queries.

To illustrate this, run these two snippets on a cluster of your choice:

With full intermediate result materialization

from dask_expr.datasets import timeseries

from distributed import Client

if __name__ == "__main__":
    with Client() as client:
        print(client.dashboard_link)
        df = timeseries(start="2000-01-01", end="2020-12-31", freq="100ms", dtypes={"x": float})
        # To compute the mean, we have to fully materialize df, and we won't free
        # its data until its chunks have been reused to compute the partials of the sum.
        mean = df["x"].mean()
        df[df["x"] > mean].sum().compute()

Without full intermediate result materialization

from dask_expr.datasets import timeseries

from distributed import Client

if __name__ == "__main__":
    with Client() as client:
        print(client.dashboard_link)
        df = timeseries(start="2000-01-01", end="2020-12-31", freq="100ms", dtypes={"x": float})
        # Compute the mean beforehand so that we don't have to keep all of `df` in memory
        mean = df["x"].mean().compute()
        df[df["x"] > mean].sum().compute()

Possible solution

The easiest approach would be to never reuse any intermediate results. This has a few downsides:

  • Non-deterministic functions will lead to unexpected results
  • We waste a lot of computational resources on recomputations

...but it would allow us to scale.

We can certainly get smarter about intermediate result materialization, but this will require some effort depending on how smart we want to be. (There's a body of (ongoing) research and implementations in the database world we could draw from.)
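
To make the trade-off concrete, here is a toy sketch using hand-written, dask-style task-graph dicts (all functions and key names are made up for illustration; this is not dask-expr internals). With reuse, both consumers depend on the same "load-*" keys, so every partition stays pinned in memory until the reduction finishes; without reuse, the IO tasks are duplicated per branch, which wastes compute but lets each branch release partitions independently.

from dask import get  # synchronous scheduler, executes raw graph dicts


def load(i):
    # Stand-in for an IO task producing one partition.
    return list(range(i * 3, i * 3 + 3))


def mean(*parts):
    # Stand-in for a reduction, i.e. a pipeline breaker.
    flat = [x for part in parts for x in part]
    return sum(flat) / len(flat)


def filter_gt(part, threshold):
    # Stand-in for an element-wise operation that also needs the reduction result.
    return [x for x in part if x > threshold]


# With reuse: the reduction and the filters share the "load-*" keys, so every
# partition must be held until the reduction is done and the filter has read it.
graph_with_reuse = {
    "load-0": (load, 0),
    "load-1": (load, 1),
    "mean": (mean, "load-0", "load-1"),
    "filtered-0": (filter_gt, "load-0", "mean"),
    "filtered-1": (filter_gt, "load-1", "mean"),
}

# Without reuse: the IO tasks are duplicated per branch ("load-a-*" vs "load-b-*"),
# so the reduction branch can drop each partition right after reading it, at the
# cost of loading the data twice.
graph_without_reuse = {
    "load-a-0": (load, 0),
    "load-a-1": (load, 1),
    "mean": (mean, "load-a-0", "load-a-1"),
    "load-b-0": (load, 0),
    "load-b-1": (load, 1),
    "filtered-0": (filter_gt, "load-b-0", "mean"),
    "filtered-1": (filter_gt, "load-b-1", "mean"),
}

print(get(graph_with_reuse, ["filtered-0", "filtered-1"]))
print(get(graph_without_reuse, ["filtered-0", "filtered-1"]))

Both graphs produce the same results; the difference is only in how long the loaded partitions have to stay alive and how often the data is read.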

@phofl
Collaborator

phofl commented Feb 7, 2024

This is concerning

@fjetter
Member

fjetter commented Feb 7, 2024

The two code examples above are identical, aren't they?

@mrocklin
Member

mrocklin commented Feb 7, 2024

FWIW this is a long-standing issue. dask/dask#874

@hendrikmakait
Member Author

The two code examples above are identical, aren't they?

Fixed, I messed up the copy+paste.

@hendrikmakait
Member Author

Summary from multiple offline conversations with @phofl (and @fjetter):

We are mostly concerned about reuse after a reducer. Another example is df["y"] = df["x"] + df["x"].sum(). There are other possible scenarios where reuse may be problematic because of long processing chains that need to finish first, but they are also dependent on our scheduling algorithm and should be tackled once we see them happening in the real world.
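
As a scaled-down illustration of that pattern (smaller date range and frequency than the snippets above, and not a benchmark), the example and the eager workaround from the issue description would look roughly like this:

from dask_expr.datasets import timeseries

df = timeseries(start="2000-01-01", end="2000-12-31", freq="1s", dtypes={"x": float})

# Reuse after a reducer: df["x"] feeds both the sum (a pipeline breaker) and the
# element-wise addition, so its partitions cannot be released until the sum is done.
df["y"] = df["x"] + df["x"].sum()

# Workaround in the spirit of the second snippet above: evaluate the reducer
# eagerly, so the element-wise addition only needs a single streaming pass.
eager = timeseries(start="2000-01-01", end="2000-12-31", freq="1s", dtypes={"x": float})
total = eager["x"].sum().compute()
eager["y"] = eager["x"] + total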

@phofl
Collaborator

phofl commented Feb 15, 2024

@fjetter and I chatted a little about this offline. There are 2 different approaches we can take:

  • Disconnect the IO nodes from the rest of the expression when we encounter a pipeline breaker, so that we can compute them independently. This means that we have to adjust the IO nodes from the reduction, which might be far away in the expression tree.

This is relatively easy to do, but it breaks our assumption that our optimisations are only local and messes up dependency tracking in the process. I have moved away from this a little because I don't like the band-aids we would need, and it doesn't fit properly into our model.

  • Add a parameter like _branch_id to all of our expressions, initialised to 0 by default. When a pipeline breaker is encountered, it increments this id, and we bubble the increment down to the IO nodes, i.e. we basically iterate over the dependencies of the pipeline breaker until we reach an IO node (a very simplified explanation). This fits nicely into our local optimization approach, since we can do one step at a time, similar to how projections move around. That said, I don't want to add this new parameter to every class by hand, so we would need to enforce it somehow in the base class; that's something I am thinking about at the moment.
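
To illustrate the second approach, here is a toy sketch with entirely hypothetical class and method names (this is not the actual dask-expr Expr class): the pipeline breaker bumps a branch id and pushes it down to the IO leaves, so identical IO expressions on different branches stop producing the same key and are no longer deduplicated.

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Expr:
    """Minimal stand-in for an expression node; every expression carries a branch id."""

    operands: tuple = ()
    _branch_id: int = 0

    def push_branch_id(self, branch_id: int) -> "Expr":
        # One local rewrite step, similar in spirit to how projections are pushed
        # around: retag this node and recurse into its dependencies until the
        # IO leaves are reached.
        new_operands = tuple(
            op.push_branch_id(branch_id) if isinstance(op, Expr) else op
            for op in self.operands
        )
        return replace(self, operands=new_operands, _branch_id=branch_id)

    @property
    def key(self) -> str:
        # The branch id takes part in the key, so the same IO expression under
        # branch 0 and branch 1 is scheduled (and released) independently.
        return f"{type(self).__name__.lower()}-{self._branch_id}-{abs(hash(self.operands))}"


@dataclass(frozen=True)
class ReadParquet(Expr):
    """Stand-in for an IO node."""


@dataclass(frozen=True)
class Sum(Expr):
    """Stand-in for a reducer, i.e. a pipeline breaker."""

    def rewrite(self) -> "Expr":
        # Encountering the pipeline breaker: bubble an incremented branch id
        # down to the IO nodes beneath it.
        return self.push_branch_id(self._branch_id + 1)


io = ReadParquet(("s3://bucket/data.parquet",))
reduced = Sum((io,)).rewrite()

print(io.key)                   # readparquet-0-...
print(reduced.operands[0].key)  # readparquet-1-..., no longer deduplicated with io

Because the branch id participates in the key, reuse is broken exactly where a pipeline breaker sits, and enforcing the parameter once in the base class keeps the individual subclasses untouched.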
