-
I'm trying to figure out how to run DAG steps in parallel while passing data between them through memory, to minimize I/O latency. I've written the following code to test this scenario. When I run `./run_materialize.py`, the data transfer between the DAG steps is fast (thanks to it being in-memory), but the parallel parts within `graph_asset_example` are not running in parallel as expected. Is there a way to run these portions in parallel (multi-process) while keeping the data in-memory between steps, to avoid additional I/O latency?

Here is the content of `./tutorial/__init__.py`:

```python
from dagster import (
    asset,
    define_asset_job,
    AssetSelection,
    Definitions,
    FilesystemIOManager,
    op,
    # graph_asset,
    graph,
    DynamicOut,
    DynamicOutput,
    AssetsDefinition,
    AssetKey,
)
from typing import List, Any
import random
import time

@asset()
def example_asset_0() -> int:
    return 1

@asset(io_manager_key="fs_io_manager")
def example_asset_1(example_asset_0: int) -> int:
    return example_asset_0 + 1

@asset()
def example_asset_2(example_asset_1: int) -> int:
    return example_asset_1 + 1

@op(out=DynamicOut(int))
def return_dynamic(input_val: int):
    print(input_val)
    # outputs = []
    for idx, page_key in enumerate(range(random.randint(5, 10))):
        yield DynamicOutput(page_key, mapping_key=str(idx))

@op()
def op1(value: int) -> int:
    time.sleep(10)
    return value

@op()
def op2(value: int) -> int:
    return value

@op()
def path_through(collected_list: List[Any]) -> List[Any]:
    return collected_list

@graph()
def graph_asset_example(input_val: int):
    result = path_through(return_dynamic(input_val).map(op1).map(op2).collect())
    return result

dynamic_graph_asset = AssetsDefinition.from_graph(
    graph_asset_example,
    keys_by_input_name={"input_val": AssetKey("example_asset_2")},
    keys_by_output_name={"result": AssetKey("dynamic_graph_asset_yeah")},
    metadata_by_output_name={"result": {"num_records": 1}},
)

images_job = define_asset_job(name="images_job", selection=AssetSelection.all())

defs = Definitions(
    assets=[
        example_asset_0,
        example_asset_1,
        example_asset_2,
        dynamic_graph_asset,
    ],
    jobs=[
        images_job,
    ],
    resources={
        "fs_io_manager": FilesystemIOManager(),
    },
)
```

And here is the content of `./run_materialize.py`:

```python
import tutorial
from dagster import load_assets_from_modules, materialize_to_memory

all_assets = load_assets_from_modules([tutorial])
materialize_to_memory(all_assets)
```
-
Hi @takerfume - it's not currently possible to run DAG steps in parallel while passing data in memory. This largely boils down to limitations of Python itself: because of Python's global interpreter lock (GIL), multithreaded Python code mostly cannot take advantage of underlying OS-level parallelism. #4041 is the issue where we're tracking adding multi-threaded in-process execution. The reason we haven't prioritized it so far is that, because of the GIL, it would only help with parallelism when most of the computation is delegated outside of Python (e.g. to native extensions).
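To see the GIL limitation concretely outside of Dagster, here is a minimal, self-contained sketch: the same CPU-bound pure-Python function run serially and then across four threads. On CPython the results match, but the threaded run typically takes about as long as the serial one, because the interpreter lock serializes pure-Python bytecode.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def cpu_bound(n: int) -> int:
    # Pure-Python loop: it holds the GIL while it runs, so threads
    # cannot execute it in parallel on CPython.
    total = 0
    for i in range(n):
        total += i * i
    return total

N = 1_000_000

start = time.perf_counter()
serial = [cpu_bound(N) for _ in range(4)]
serial_time = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(cpu_bound, [N] * 4))
threaded_time = time.perf_counter() - start

print(serial == threaded)  # True: same results either way
# On CPython, threaded_time is usually close to serial_time (no speedup).
# Work that releases the GIL (NumPy, I/O, native extensions) is the exception.
print(f"serial: {serial_time:.2f}s, threaded: {threaded_time:.2f}s")
```

This is why multi-threaded in-process execution would mostly help workloads that delegate computation outside of Python; pure-Python ops need separate processes to run in parallel.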
-
Reopened discussion because closed discussions don't appear in search results
-
@takerfume thanks for sharing your code. I am facing a very similar problem, and it's helpful.
@sryza can you elaborate on how you would use those hooks? My main concern is the scenario where an Op
Yes, I do think it would be possible to implement a custom IO manager that uses Redis as a backend for in-memory data exchange. You might be able to use op hooks for freeing the memory.
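For concreteness, here is a rough, stdlib-only sketch of that idea. A dict-backed `FakeRedis` stands in for a real redis-py client, and plain `(run_id, step_key, output_name)` arguments stand in for Dagster's `OutputContext`/`InputContext`; in a real deployment you would subclass `dagster.IOManager` and trigger `free()` from something like a success hook once downstream steps have consumed the value. All class and method names here are hypothetical, not part of Dagster's API.

```python
import pickle

class FakeRedis:
    """Stand-in for a redis-py client; only the calls this sketch needs."""
    def __init__(self):
        self._store = {}
    def set(self, key, value):
        self._store[key] = value
    def get(self, key):
        return self._store.get(key)
    def delete(self, key):
        self._store.pop(key, None)

class RedisIOManager:
    """Sketch of an IO manager that shuttles step outputs through Redis.

    In real Dagster this would subclass dagster.IOManager and derive the
    key from the OutputContext/InputContext instead of raw arguments.
    """
    def __init__(self, client):
        self.client = client

    def _key(self, run_id: str, step_key: str, name: str) -> str:
        return f"{run_id}/{step_key}/{name}"

    def handle_output(self, run_id, step_key, name, obj):
        # Pickle so arbitrary Python objects survive the round trip
        # between processes.
        self.client.set(self._key(run_id, step_key, name), pickle.dumps(obj))

    def load_input(self, run_id, step_key, name):
        return pickle.loads(self.client.get(self._key(run_id, step_key, name)))

    def free(self, run_id, step_key, name):
        # Called from e.g. a success hook after downstream steps finish,
        # so intermediate values don't pile up in Redis for the whole run.
        self.client.delete(self._key(run_id, step_key, name))

# Usage: store an upstream output, load it downstream, then free it.
client = FakeRedis()
manager = RedisIOManager(client)
manager.handle_output("run1", "example_asset_1", "result", 2)
print(manager.load_input("run1", "example_asset_1", "result"))  # 2
manager.free("run1", "example_asset_1", "result")
```

Because Redis lives outside the worker processes, this would let a multiprocess executor pass data between steps without touching disk, at the cost of a serialize/deserialize hop per edge.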