New execution interface #19

Merged
merged 25 commits on Feb 24, 2022

Conversation

davidselassie
Contributor

As you can tell by the branch name, this started out small and then
snowballed! Let me know how you think these examples feel and we can
discuss if this is the right course for our API. I'm open to new
ideas, but I also think this helps guide users along to better use
cases.

Backstory
=========

_Previously, on Bytewax:_

`Executor.build_and_run()` wasn't great because a single function was
used both for starting up a dataflow you want to execute locally and
for manually assembling a cluster of processes.

The distinction between "singleton" and "partitioned" input iterators,
and making the input part of the Dataflow definition, wasn't great
because it tied the behavior of the dataflow to the behavior of the
input.

The capture operator wasn't great because it didn't give you the
control you needed to return data from multiple workers or background
processes.

I think all these things are symptoms of trying to handle all use
cases from all points in the stack, rather than building primitives
and abstraction layers.

Changes
=======

See the docstrings and tests for examples!

Execution
---------

Let's make it explicit that there are different contexts in which you
might run a dataflow. This removes `Executor.build_and_run` and adds
four entry points in a "stack" of complexity:

- `run_sync()` which takes a dataflow and some input, runs it
  synchronously as a single worker in the existing Python thread, and
  returns the output to that thread. This is what you'd use in tests
  and simple notebook work.

- `run_cluster()` which takes a dataflow and some input, starts a
  local cluster of processes, runs it, waits for the cluster to finish
  work, then collects the results, and returns the output to that
  thread. This is what you'd use in a notebook if you need parallelism
  or higher throughput.

- `main_cluster()` which starts up a cluster of local processes,
  coordinates the addresses and process IDs between them, runs a
  dataflow on it, and waits for it to finish. This has a partitioned
  "input builder" and an "output builder" (discussed below). This is
  what you'd use if you want to write a standalone script or example
  that does some higher throughput processing.

- `main_proc()` which lets you manually craft a single process for use
  in a cluster. This is what you'd use when crafting your k8s cluster.

`main_cluster()` is built upon `main_proc()` but adds in some process
pool management.

`run_cluster()` is built upon `main_cluster()` but adds in the IPC
necessary to get your data back to the calling process super
conveniently.
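
To make the stack concrete, here's a rough sketch of how I picture
calling into the top two layers. The exact signatures and keyword names
(e.g. `proc_count`) are illustrative guesses, not the final API:

    from bytewax import Dataflow, run_sync, run_cluster

    def double(x):
        return x * 2

    flow = Dataflow()
    flow.map(double)
    flow.capture()

    inp = [0, 1, 2, 3]

    # Single worker in the current thread; output comes straight back.
    out = run_sync(flow, inp)

    # Local cluster of processes; output is shipped back here over IPC.
    # proc_count is a made-up keyword just for illustration.
    out = run_cluster(flow, inp, proc_count=2)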

Input
-----

Since you can express "singleton" input as a "partitioned" input, the
latter is the more fundamental concept, so that's what the lowest
level `main_proc()` and `main_cluster()` functions take: an input
builder function which will be called once on each worker and returns
the input that worker should work on.

The higher level execution contexts that you'd want to use from a
notebook or a single thread (`run_sync()` and `run_cluster()`) then
handle partitioning for you and provide a nice interface where you can
send in a whole Python list and not need to think more about it.
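
For example, a hand-written partitioned input builder might look
roughly like this sketch (the worker index/count arguments and the
events.log file are assumptions for illustration, not the settled
signature):

    # Hypothetical partitioned input builder: called once per worker and
    # returns the input that worker should work on.
    def input_builder(worker_index, worker_count):
        with open("events.log") as f:
            for i, line in enumerate(f):
                # Round-robin the lines so each worker gets its own slice.
                if i % worker_count == worker_index:
                    yield line.rstrip("\n")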

Output
------

_Make yours, like mine._

Since the above scheme for input feels good, let's try copying the
approach for output: At the lowest level there's a "partitioned output
builder" which returns a callback that each worker thread can use to
write the output it sees.

The higher level execution contexts can then make an output builder
function that collects data to send back to the main process for you.
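
As a sketch (again with guessed argument names), a partitioned output
builder could be as simple as each worker opening its own file and
returning a write callback:

    # Hypothetical partitioned output builder: called once per worker and
    # returns a callback invoked for each item reaching a capture step.
    def output_builder(worker_index, worker_count):
        out_file = open(f"out_part_{worker_index}.txt", "w")

        def write_item(item):
            out_file.write(f"{item}\n")

        return write_item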

This change means that the capture operator doesn't need to take any
functions; it just marks what parts of the dataflow are output. I
think this will change slightly with the introduction of branching
dataflows (something like marking different captures with a tag
maybe?).

Arg Parsers
-----------

Since the above execution entry points are all Python functions, add
some convenience methods which parse arguments from the command line or
env vars. This will make it easier to craft your "main" function in a
cluster or standalone script context.
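
As a sketch of what these save you from, here's a hand-rolled
`argparse` version of a standalone cluster "main". The `main_cluster()`
keyword names are guesses based on the description above, not the
actual convenience parser:

    # Illustrative only: roughly the boilerplate the convenience parsers
    # are meant to replace. Keyword names for main_cluster() are guesses.
    import argparse

    from bytewax import Dataflow, main_cluster

    def input_builder(worker_index, worker_count):
        return (i for i in range(100) if i % worker_count == worker_index)

    def output_builder(worker_index, worker_count):
        return print

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument("-w", dest="workers_per_proc", type=int, default=1)
        parser.add_argument("-n", dest="procs", type=int, default=1)
        args = parser.parse_args()

        flow = Dataflow()
        flow.map(str)
        flow.capture()

        main_cluster(
            flow,
            input_builder,
            output_builder,
            proc_count=args.procs,
            workers_per_proc=args.workers_per_proc,
        )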

Updates all the examples to use these parsers so we can go back to
using `-w2` and `-n2` on the command line. Some of the examples show
off using `main_proc()`, though, and so need different command line
arguments.
@davidselassie
Contributor Author

Hmm. I guess tests are failing. I tried to add a dep on a new library, multiprocess, because it supports sending lambdas between processes (which I guess vanilla Python doesn't!). I thought adding it to the pyproject.toml would be enough, but I guess not? My tests are currently passing locally.
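
(Context for that, reflecting my understanding rather than anything
verified in this repo: stdlib pickle refuses to serialize lambdas,
while dill, which I believe multiprocess uses under the hood, handles
them.)

    import pickle

    import dill  # what multiprocess uses for serialization, I believe

    try:
        pickle.dumps(lambda x: x * 2)
    except Exception as e:
        print("stdlib pickle can't do it:", e)

    # dill can round-trip the lambda just fine.
    payload = dill.dumps(lambda x: x * 2)
    print(dill.loads(payload)(21))  # 42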

@davidselassie
Contributor Author

Gosh, I found some more pickling dark magic: it seems like you can't use isinstance on dataclasses sent through pickling because they are dynamically generated. It's something related to https://stackoverflow.com/questions/620844/why-do-i-get-unexpected-behavior-in-python-isinstance-after-pickling but I'm not exactly sure what's up.

flow.map(double)
flow.map(minus_one)
flow.map(stringy)
flow.inspect(peek)
flow.capture()
Contributor

There is a strong dependency now between flow.capture() and the shape of dataflows like this one. I wonder if we can make this more explicit when running dataflows this way.

If you remove this line, you don't see any output from running the dataflow, which makes sense, but may be confusing at first.

Contributor Author

Recap of my understanding of our discussion for posterity:

Our old pattern of using inspect(print) as "output" is a bad habit in the new era of formal output. Output needs to live in a layer above the dataflow itself so that it can adapt to the different execution contexts; otherwise it breaks our rule of "the same dataflow should have the same behavior". We should update our examples with more context to show proper, more nuanced usage.

Capture also takes on more of a role in non-linear dataflows. You can't just output "the last step" since there isn't necessarily a well-defined "last step". So we won't be able to do this automatically in general.

I do think it's worthwhile to raise an exception if a dataflow is missing any capture steps. Perhaps there are use cases where that makes sense, but my spidey sense is that it's an indication that you're running a dataflow for its side effects and are possibly playing fast and loose with execution order or worker identity. We can be conservative now and remove that exception later if we find valid use cases.

@whoahbot
Contributor

Great work! I really like the new shape of input/output.

I was brainstorming over the weekend about whether there is a set of descriptive names for main_proc, run_cluster, etc. that would help describe the different modes of running dataflow processes and workers, but I haven't really come up with anything convincing.

@davidselassie
Contributor Author

davidselassie commented Feb 23, 2022

Putting our decision tree here for posterity. We did some brainstorming today on what the shape of the API should look like and the decisions you need to make to pick your execution type. This isn't any final naming or anything.

graph TD
    S1(Wake Up) -.Eat Coffee.-> S2(Start)

    S2 --> P(In Process)
    --> P2(No Process Coordination)
    --> P3(No IO Distribution)
    --> P4[run]

    S2 --> M(Multiprocess)

    M --> MA(Automatic Process Coordination)
    
    MA --> MAA1(Automatic IO Distribution)
    --> MAA2[run_cluster]

    MA --> MAM1(Manual IO Distribution)
    --> MAM2[spawn_cluster]

    M --> M1(Manual Process Coordination)
    --> M2(Manual IO Distribution)
    --> M3[cluster_main]

@davidselassie
Contributor Author

Thanks for all the discussion today. Updated with some new names. But, yes, let's wait and see if there are any insights from building more demos and examples in the next few days.
