
Initial DaskRunner for Beam #22421

Merged
merged 69 commits into apache:master on Oct 25, 2022

Conversation

alxmrs
Contributor

@alxmrs alxmrs commented Jul 22, 2022

Here, I've created a minimum viable Apache Beam runner for Dask. My approach is to visit a Beam Pipeline and translate PCollections into Dask Bags.

In this version, I have supported enough operations to make the test pipeline asserts work. The tests themselves are not comprehensive. Further, there are many Bag operations that could be used in the translation for greater efficiency.
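
For context, a minimal usage sketch (not part of this PR description; it assumes DaskRunner is the runner class exported from sdks/python/apache_beam/runners/dask/dask_runner.py added here):

import apache_beam as beam

from apache_beam.runners.dask.dask_runner import DaskRunner

# Run a small pipeline on Dask: transforms are translated into Dask Bag
# operations and executed on the default (local) scheduler.
with beam.Pipeline(runner=DaskRunner()) as p:
    _ = (p
         | beam.Create([1, 2, 3])
         | beam.Map(lambda x: x * 2))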

CC: @pabloem

Fixes: #18962

Original PR discussion can be found here: alxmrs#1


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make the review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

See CI.md for more information about GitHub Actions CI.

@codecov

codecov bot commented Sep 5, 2022

Codecov Report

Merging #22421 (f9cf45a) into master (107a43d) will decrease coverage by 0.14%.
The diff coverage is 82.04%.

@@            Coverage Diff             @@
##           master   #22421      +/-   ##
==========================================
- Coverage   73.35%   73.21%   -0.15%     
==========================================
  Files         719      728       +9     
  Lines       95800    96272     +472     
==========================================
+ Hits        70276    70482     +206     
- Misses      24212    24479     +267     
+ Partials     1312     1311       -1     
Flag Coverage Δ
python 82.80% <86.85%> (-0.25%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
sdks/go/pkg/beam/core/runtime/graphx/translate.go 38.42% <0.00%> (ø)
sdks/go/pkg/beam/core/runtime/xlangx/expand.go 0.00% <0.00%> (ø)
sdks/go/pkg/beam/schema.go 35.29% <ø> (ø)
...ython/apache_beam/runners/interactive/sql/utils.py 76.09% <ø> (ø)
sdks/python/apache_beam/transforms/combiners.py 93.43% <ø> (ø)
sdks/python/apache_beam/typehints/row_type.py 100.00% <ø> (ø)
...apache_beam/typehints/native_type_compatibility.py 85.52% <33.33%> (-1.06%) ⬇️
sdks/python/apache_beam/typehints/opcodes.py 85.35% <50.00%> (-0.26%) ⬇️
...dks/python/apache_beam/runners/dask/dask_runner.py 86.45% <86.45%> (ø)
.../python/apache_beam/typehints/trivial_inference.py 96.15% <87.50%> (-0.27%) ⬇️
... and 42 more


@alxmrs
Contributor Author

alxmrs commented Sep 21, 2022

@TomAugspurger: I'm having trouble running my unit tests. My tests used to work, but now I'm noticing infinite loops when running them on a local cluster (the default scheduler).

In my last commit, I changed the client.gather call to async mode, and this let me hit a timeout error. Here, it appears that the Beam tests pass (asserts behave as expected within a compute graph); however, the client never stops running and times out after 4 seconds.

Do you have any idea what's going on? One key difference between my setup now and when I wrote this is that I'm now on an M1 Mac (ARM64). Could this be causing my problem?

CC: @rabernat @cisaacstern @pabloem

@TomAugspurger

TomAugspurger commented Sep 22, 2022

In my last commit, I changed the client.gather command to async mode, and this let me hit a timeout error

The call

self.client.gather(self.futures, errors='raise', asynchronous=True)

looks incorrect inside of a regular def function. That would typically need to be await self.client.gather inside of an async function, since asynchronous=True makes that return a coroutine that needs to be awaited.

Can you expand on the desire for asynchronous=True there? The timeout wasn't working properly without it? FWIW, I don't see the infinite loops locally, even with asynchronous=True.
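
For reference, a minimal sketch (not from this PR) of the distinction described above, assuming a distributed.Client connected to a running cluster:

from dask.distributed import Client

def blocking_wait(client: Client, futures):
    # Synchronous usage: gather blocks until the futures finish and
    # returns their results directly.
    return client.gather(futures, errors='raise')

async def async_wait(client: Client, futures):
    # With asynchronous=True, gather returns a coroutine, so it has to be
    # awaited inside an async def; calling it from a regular def leaves
    # the coroutine unawaited.
    return await client.gather(futures, errors='raise', asynchronous=True)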


dask_options = options.view_as(DaskOptions).get_all_options(
    drop_default=True)
client = ddist.Client(**dask_options)


How does Beam typically handle the lifetime of runners? In the tests, I see warnings about re-using port 8787 from Dask, since the client (and cluster) aren't being completely cleaned up between tests.

Is it more common for beam to create (and clean up) the runner? Or would users typically create it?

Contributor Author


This is my first runner – @pabloem can probably weigh in better than I can regarding your question. However, what makes sense to me is that each Beam runner should clean up its environment between runs, including in tests.

This probably should happen in the DaskRunnerResult object. Do you have any recommendations on the best way to clean up dask (distributed)?


In a single scope,

with distributed.Client(...) as client:
    ...

But in this case, as you say, you'll need to call it after the results are done. So I think that something like

client.close()
client.cluster.close()

should do the trick (assuming that beam is the one managing the lifetime of the client).

If you want to rely on the user having a client active, you can call dask.distributed.get_client(), which will raise a ValueError if one hasn't already been created.
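
A rough sketch of that cleanup (hypothetical, not the PR's actual DaskRunnerResult; it assumes beam owns the client's lifetime):

from dask.distributed import Client

class DaskRunnerResult:  # simplified illustration only
    def __init__(self, client: Client, futures):
        self.client = client
        self.futures = futures

    def wait_until_finish(self, duration=None):
        try:
            # Block until all submitted work completes (or raises).
            self.client.gather(self.futures, errors='raise')
        finally:
            # Tear down the client, and its cluster if it created one,
            # so nothing (like port 8787) leaks between runs.
            cluster = self.client.cluster
            self.client.close()
            if cluster is not None:
                cluster.close()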

@alxmrs
Contributor Author

alxmrs commented Sep 22, 2022

looks incorrect inside of a regular def function.

Yes – thanks for pointing this out. This makes sense to me, looking further at the documentation.

Can you expand on the desire for asynchronous=True there?

I... really am just trying things to stop hitting an infinite loop. This got me to a timeout error when run in tests. Though, when running e2e in Pangeo-Forge, I definitely experienced a runtime error complaining that I wasn't in an async def.

FWIW, I don't see the infinite loops locally, even with asynchronous=True.

Interesting! Do the tests pass for you? What is your environment like? I'm concerned that I'm hitting another architecture issue with ARM.

Thanks for taking a look at this, Tom.

@pabloem
Member

pabloem commented Oct 19, 2022

Run Python PreCommit

@pabloem
Member

pabloem commented Oct 19, 2022

Run Python PreCommit

@pabloem
Member

pabloem commented Oct 20, 2022

The only test that is giving trouble should be easy to fix or skip for now. I'll review the PR as is and maybe we'll merge it soon.

@alxmrs
Contributor Author

alxmrs commented Oct 20, 2022

Thanks Pablo. I think I can easily fix it – I'm having trouble reproducing the issue on my local environment due to my M1 woes.

@pabloem
Member

pabloem commented Oct 21, 2022

ok I've taken a look. The code, in fact, looks so clean that I'm very happy to merge.

@pabloem
Member

pabloem commented Oct 21, 2022

Run Python PreCommit

@pabloem
Member

pabloem commented Oct 21, 2022

Run Python PreCommit

@pabloem
Member

pabloem commented Oct 21, 2022

Run Python PreCommit

@pabloem
Member

pabloem commented Oct 24, 2022

ugggg haha can't get a passing precommit even though the tests are unrelated.

@pabloem
Member

pabloem commented Oct 24, 2022

Run Python PreCommit

@pabloem
Member

pabloem commented Oct 24, 2022

Run Python PreCommit

@pabloem
Member

pabloem commented Oct 24, 2022

sorry about the crazy flakiness. Something is going on recently with our precommits...

@pabloem
Member

pabloem commented Oct 25, 2022

ugggg incredibly enough, this issue reproduces only very occasionally in my environment.

@pabloem
Member

pabloem commented Oct 25, 2022

Run Python PreCommit

@pabloem
Member

pabloem commented Oct 25, 2022

given no changes anywhere close to the current flaky tests, I will merge.

@pabloem
Member

pabloem commented Oct 25, 2022

LGTM

@pabloem pabloem merged commit 76761db into apache:master Oct 25, 2022
@alxmrs
Contributor Author

alxmrs commented Oct 25, 2022

Wohoo!

@Abacn
Contributor

Abacn commented Oct 25, 2022

Thanks. The Python PreCommit is showing the following test failure: https://ci-beam.apache.org/job/beam_PreCommit_Python_Cron/6286/

apache_beam.runners.dask.dask_runner_test.DaskOptionsTest.test_parser_destinations__agree_with_dask_client

AssertionError: 'vpt_vp_arg3' not found in ['address', 'loop', 'timeout', 'set_as_default', 'scheduler_file', 'security', 'asynchronous', 'name', 'heartbeat_interval', 'serializers', 'deserializers', 'extensions', 'direct_to_workers', 'connection_limit', 'kwargs']

apache_beam.runners.dask.dask_runner_test.DaskRunnerRunPipelineTest.test_create

TypeError: __init__() got an unexpected keyword argument 'vpt_vp_arg3'
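
For context, test_parser_destinations__agree_with_dask_client appears to compare the DaskOptions parser destinations against the distributed.Client constructor via reflection; a minimal sketch of that kind of check (not the test's exact code):

import inspect

from dask.distributed import Client

# Names of Client.__init__'s parameters (dropping 'self'); this roughly
# mirrors the list shown in the assertion error above.
client_params = set(inspect.signature(Client.__init__).parameters) - {'self'}

# A parser destination that is not a real Client parameter (such as
# 'vpt_vp_arg3' in the error above) fails this membership check.
assert 'timeout' in client_params
assert 'vpt_vp_arg3' not in client_params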

@pabloem
Member

pabloem commented Oct 26, 2022

thanks Yi for pointing this out

ruslan-ikhsan pushed a commit to akvelon/beam that referenced this pull request Nov 11, 2022
* WIP: Created a skeleton dask runner implementation.

* WIP: Idea for a translation evaluator.

* Added overrides and a visitor that translates operations.

* Fixed a dataclass typo.

* Expanded translations.

* Core idea seems to be kinda working...

* First iteration on DaskRunnerResult (keep track of pipeline state).

* Added minimal set of DaskRunner options.

* WIP: Alllmost got asserts to work! The current status is:
- CoGroupByKey is broken due to how tags are used with GroupByKey
- GroupByKey should output `[('0', None), ('1', 1)]`, however it actually outputs: [(None, ('1', 1)), (None, ('0', None))]
- Once that is fixed, we may have test pipelines work on Dask.

* With a great 1-liner from @pabloem, groupby is fixed! Now, all three initial tests pass.

* Self-review: Cleaned up dask runner impl.

* Self-review: Remove TODOs, delete commented out code, other cleanup.

* First pass at linting rules.

* WIP, include dask dependencies + test setup.

* WIP: maybe better dask deps?

* Skip dask tests depending on successful import.

* Fixed setup.py (missing `,`).

* Added an additional comma.

* Moved skipping logic to be above dask import.

* Fix lint issues with dask runner tests.

* Adding destination for client address.

* Changing to async produces a timeout error instead of stuck in infinite loop.

* Close client during `wait_until_finish`; rm async.

* Supporting side-inputs for ParDo.

* Revert "Close client during `wait_until_finish`; rm async."

This reverts commit 09365f6.

* Revert "Changing to async produces a timeout error instead of stuck in infinite loop."

This reverts commit 676d752.

* Adding -dask tox targets onto the gradle build

* wip - added print stmt.

* wip - prove side inputs is set.

* wip - prove side inputs is set in Pardo.

* wip - rm asserts, add print

* wip - adding named inputs...

* Experiments: non-named side inputs + del `None` in named inputs.

* None --> 'None'

* No default side input.

* Pass along args + kwargs.

* Applied yapf to dask sources.

* Dask sources passing pylint.

* Added dask extra to docs gen tox env.

* Applied yapf from tox.

* Include dask in mypy checks.

* Upgrading mypy support to python 3.8 since py37 support is deprecated in dask.

* Manually installing an old version of dask before 3.7 support was dropped.

* fix lint: line too long.

* Fixed type errors with DaskRunnerResult. Disabled mypy type checking in dask.

* Fix pytype errors (in transform_evaluator).

* Ran isort.

* Ran yapf again.

* Fix imports (one per line)

* isort -- alphabetical.

* Added feature to CHANGES.md.

* ran yapf via tox on linux machine

* Change an import to pass CI.

* Skip isort error; needed to get CI to pass.

* Skip test logic may favor better with isort.

* (Maybe) the last isort fix.

* Tested pipeline options (added one fix).

* Improve formatting of test.

* Self-review: removing side inputs.

In addition, adding a more helpful property to the base DaskBagOp (tranform).

* add dask to coverage suite in tox.

* Capture value error in assert.

* Change timeout value to 600 seconds.

* ignoring broken test

* Update CHANGES.md

* Using reflection to test the Dask client constructor.

* Better method of inspecting the constructor parameters (thanks @TomAugspurger!).

Co-authored-by: Pablo E <pabloem@apache.org>
Co-authored-by: Pablo <pabloem@users.noreply.github.com>
Successfully merging this pull request may close these issues: Support for a Dask runner.