
[BEAM-7926] Data-centric Interactive Part2#10346

Merged
pabloem merged 3 commits into apache:master from nika-qubit:BEAM-7926-part2
Jan 28, 2020

Conversation

@nika-qubit
Contributor

@nika-qubit nika-qubit commented Dec 11, 2019

  1. Added pipeline_fragment module to build pipeline fragments including
    only transforms necessary to produce user-desired PCollections and
    implicitly execute the fragment to emit data for user-desired
    PCollections.
  2. Added the notion of cached PCollection completeness. Only cache of
    completely computed PCollections can be read. p.run() of the
    InteractiveRunner by default displays the pipeline graph and doesn't use
    any cached intermediate PCollections but generates new cache for them.
  3. Whenever a pipeline fragment is executed, by default the implicit
    pipeline run doesn't display the pipeline graph but uses available cached
    intermediate PCollections and generates cache for PCollections that
    haven't been computed.
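The fragment-building idea in item 1 can be sketched as a reverse reachability walk over the transform graph (a minimal illustration with hypothetical names, not Beam's actual pipeline_fragment implementation): starting from the user-desired PCollections, keep only the transforms on the paths that produce them.

```python
def required_transforms(producers, targets):
    """Collect the transforms needed to compute the target PCollections.

    producers maps each PCollection name to (producing_transform, input_names).
    Walks backwards from `targets`, so unrelated transforms are pruned.
    """
    needed = set()
    stack = list(targets)
    while stack:
        pcoll = stack.pop()
        if pcoll not in producers:
            continue  # a root input with no producing transform
        transform, inputs = producers[pcoll]
        if transform not in needed:
            needed.add(transform)
            stack.extend(inputs)
    return needed

# Hypothetical pipeline: 'Read' feeds both 'Square' and 'Cube', but the
# user only asked to see 'squares', so 'Cube' is pruned from the fragment.
producers = {
    'numbers': ('Read', []),
    'squares': ('Square', ['numbers']),
    'cubes': ('Cube', ['numbers']),
}
print(sorted(required_transforms(producers, ['squares'])))  # ['Read', 'Square']
```

The real module operates on Pipeline objects via a visitor rather than plain dicts, but the pruning principle is the same.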

Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

Post-Commit Tests Status (on master branch)

[Build status badge matrix omitted: per-language (Go, Java, Python, XLang) × per-runner (SDK, Apex, Dataflow, Flink, Gearpump, Samza, Spark) Jenkins badges.]

Pre-Commit Tests Status (on master branch)

[Build status badge matrix omitted: non-portable and portable pre-commit Jenkins badges for Java, Python, Go, and Website.]

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

@nika-qubit
Contributor Author

R: @davidyan74
R: @rohdesamuel
PTAL.

Adding Pablo as the committer.
R: @pabloem

Thank you all!

@rohdesamuel
Contributor

Taking a look

@nika-qubit nika-qubit changed the title [BEAM-7926] Data-centric Interactive Part2 [WIP]Separating to 2 PRs [BEAM-7926] Data-centric Interactive Part2 Dec 11, 2019
@nika-qubit
Contributor Author

Taking a look

Split into 2 PRs.
This adds pipeline_fragment module with necessary dependency changes.
The following PR will implement show and add other minor changes.

@nika-qubit nika-qubit changed the title [WIP]Separating to 2 PRs [BEAM-7926] Data-centric Interactive Part2 [BEAM-7926] Data-centric Interactive Part2 Dec 11, 2019
Contributor

Can we also add a test that the original pipeline is intact? I.e., that p still has the PTransform "Cube".

Contributor Author

Added the test

@rohdesamuel
Contributor

Thank you for splitting it up, currently taking a look.

Contributor

The method already does bounds checking, this is redundant. It's okay to just do data_sample = data.head(25).

Contributor Author

Got it, will remove the bounds checking.
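The reviewer's point is verifiable in plain pandas: `DataFrame.head(n)` already clamps `n` to the number of available rows, so an explicit bounds check before sampling is redundant (an illustrative snippet, not the PR's code):

```python
import pandas as pd

data = pd.DataFrame({'value': range(5)})  # only 5 rows exist
data_sample = data.head(25)               # no error; head() clamps to 5 rows
print(len(data_sample))                   # 5
```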

Contributor

"Displays the first n entries of the PCollection". It might also be helpful to parameterize n for the user, also.

Contributor Author

This is not an exposed API, so users wouldn't be able to invoke it directly. Think of it as the UI displayed in the terminal when the user invokes show(pcoll) in an IPython shell.
We can have a head() API in the interactive_beam module; it doesn't necessarily need to display a data table and could instead return a list, like result.get(pcoll).

Comment on lines 266 to 269
Contributor

You can replace these four lines with self._computed_pcolls.update(pc for pc in pcolls).

Contributor Author

Nice.
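For reference, `set.update` accepts any iterable, so the loop collapses to one call; the generator expression in the suggestion is itself optional, since the collection can be passed directly (a generic illustration, not the actual PR code):

```python
computed_pcolls = set()
pcolls = ['pcoll_a', 'pcoll_b', 'pcoll_a']  # hypothetical cache keys

# One call replaces a for-loop of .add() calls; duplicates are absorbed.
computed_pcolls.update(pcolls)  # update(pc for pc in pcolls) is equivalent
print(sorted(computed_pcolls))  # ['pcoll_a', 'pcoll_b']
```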

Contributor

I know you didn't add it in this PR, but in general it is easier to understand "positive" parameters. enable_display with a default of True is easier to understand than skip_display with default of False (since skip_display with False is a double negative).

Contributor Author

This was added by an internal Googler, and their code in Google3 needs and sets this value. Let's leave it as is for now, since we can't make an atomic change without breaking things in Google3 at some point. I'll contact the author once our changes have settled down.

Contributor

Ah gotcha, np then

Comment on lines 193 to 213
Contributor

I'm curious why you use the PipelineVisitor to manipulate the graph instead of doing something like the PipelineAnalyzer, which manipulates the proto?

Contributor Author

I think the proto approach is less intuitive than traversing the pipeline as a graph directly, at least for the user pipeline.
With PipelineVisitor, all the logic can be applied by traversing the pipeline graph only once.
When traversing the user-defined pipeline instance, you can immediately query metadata such as id(pcoll), which PTransform produces the pcoll, and what the inputs, side inputs, and outputs of a PTransform are.

With the proto, there is a missing link between the protos and whatever the user has defined that lives in memory.
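The single-traversal argument can be illustrated with a stripped-down visitor (hypothetical classes; Beam's real PipelineVisitor has a richer interface with composite-transform handling): one pass over the in-memory graph records all the metadata needed later, with no proto round-trip.

```python
class Transform:
    """A toy stand-in for a PTransform node in the pipeline graph."""
    def __init__(self, name, inputs=()):
        self.name = name
        self.inputs = list(inputs)  # upstream Transform objects

class MetadataVisitor:
    """Collects input metadata for every transform in a single traversal."""
    def __init__(self):
        self.inputs_of = {}
    def visit(self, transform):
        self.inputs_of[transform.name] = [t.name for t in transform.inputs]

def traverse(transforms, visitor):
    for t in transforms:  # one pass; objects are queried directly in memory
        visitor.visit(t)

read = Transform('Read')
square = Transform('Square', [read])
visitor = MetadataVisitor()
traverse([read, square], visitor)
print(visitor.inputs_of['Square'])  # ['Read']
```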

Contributor

I'll leave it be for now, but I still have my reservations. I think this works by some weird trick of the implementation. It's actually more surprising that the API allows modifying the graph while iterating over it.

Contributor Author

Got it. Yeah, I had the same hunch. But consider how we mutate a linked list, and here, a graph (actually a tree). It's very common to move things around in a tree or graph: all the tree-balancing algorithms, for example, happen in place without converting the tree to a serialized representation. It's not a bad practice as long as we are not mutating the pipeline defined by the user.
A proto and a pipeline object are basically two representations of a graph, similar to an adjacency list vs. a graph object. The proto is platform-agnostic and language-agnostic, while the pipeline object is only platform-agnostic (it's represented in Python and would need different implementations on different platforms).
To me, the proto is something that gets passed across systems, and I'd like it to be an immutable medium when passing the representation of a pipeline object around. When a mutation is needed, we deserialize it into a mutable pipeline object.
The pipeline object, on the other hand, is designed to be mutable and used to construct pipelines, so we just mutate it when we see fit.

Contributor

Why do we need to build correlations?

Contributor Author

show(pcoll1, pcoll2, pcoll3) given by the user shows PCollections defined in the user pipeline.
When building a fragment, the pipeline deduced is a runner pipeline (a mutated standalone copy of the user pipeline). Without the correlations, we no longer know which PCollections in the runner pipeline correspond to pcoll1, pcoll2, and pcoll3.
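A toy sketch of why the correlation is needed (hypothetical names, not the actual implementation): after the user pipeline is copied into a runner pipeline, the copies are distinct objects, so a map keyed by the user objects' id() is the way back from pcoll1 to its runner-side counterpart.

```python
import copy

class PColl:
    """Toy stand-in for a PCollection."""
    def __init__(self, tag):
        self.tag = tag

pcoll1, pcoll2 = PColl('a'), PColl('b')  # defined in the "user pipeline"

# The runner pipeline is a standalone copy: same structure, new identities.
correlation = {id(p): copy.deepcopy(p) for p in (pcoll1, pcoll2)}

# show(pcoll1) can now locate the corresponding runner-pipeline PCollection.
runner_pcoll = correlation[id(pcoll1)]
print(runner_pcoll is pcoll1, runner_pcoll.tag)  # False a
```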

Contributor

Can you also please make a test that the PipelineFragment produces the correct output?

Contributor Author

Sure, adding a test to run the pipeline fragment and check the result.

Contributor

Why do we need this map? What does it do?

Contributor Author

@nika-qubit nika-qubit Dec 11, 2019

It tells us the correlation between PCollections from the runner pipeline instance and PCollections from the user pipeline instance.
When querying the values(), it also tells us what PCollections are cached.
You can even build a pipeline fragment or call show() with the values().

Contributor

Can you test this please?

Contributor Author

Added a test that runs a pipeline, verifies the computed PCollections and checks the result of computed PCollections.

Contributor

Why do you need to mark the completeness here?

Contributor Author

Because with our design, there are only two APIs that mark completeness: show and run.
This unit test uses neither, but it tests the execution path where the cache is already available for reading.

We could change it to p.run() and then instrument again, but the current approach is more descriptive.

@nika-qubit
Contributor Author

Run Portable_Python PreCommit

@rohdesamuel
Contributor

LGTM

@nika-qubit
Contributor Author

R: @pabloem

This PR is ready for your review! Thank you very much!

@nika-qubit
Contributor Author

Resolved merge conflicts and force pushed.

@nika-qubit
Contributor Author

Run Python PreCommit

3 similar comments
@nika-qubit
Contributor Author

Run Python PreCommit

@nika-qubit
Contributor Author

nika-qubit commented Jan 2, 2020

Run Python PreCommit

@nika-qubit
Contributor Author

Run Python PreCommit

@nika-qubit
Contributor Author

retest this please

rohdesamuel added a commit to rohdesamuel/beam that referenced this pull request Jan 15, 2020
* Manual merge of apache#10346
* use real coders
* Modify the PipelineInstrument class to add the TestStream for unbounded PCollections
* address comments
* Modify the PipelineInstrument class to add the TestStream for unbounded PCollections
* Implements the StreamingCacheManager

cache_manager.py:
 - Add the 'write' method to write an element to the cache
 - Refactor the 'source/sink' methods to return a PTransform
 - Refactor the 'read' method to return a generator instead of the full
 list
 - Refactor the ReadCache and WriteCache methods to use the source/sink
 methods directly

cache_manager_test.py:
 - Refactor to use the new implementations

streaming_cache.py:
 - Refactor to implement the CacheManager interface
 - Refactor 'Reader' class to take in a list of headers
 - Refactor 'read' method to read the headers from the cache
 - Refactor the 'reader' method into the 'read_multiple' method to read
 multiple PCollections from cache given their headers.

streaming_cache_test.py:
 - Refactor to use the new implementations
 - Create a new class, "InMemoryCache", that is able to write and read
 PCollections directly from memory instead of from a file.

pipeline_instrument.py:
 - Refactor the _read_cache method to only use the ReadCache method

pipeline_instrument_test.py:
 - Create a new class, "InMemoryCache", that is able to write and read
 PCollections directly from memory instead of from a file.
 - Simplify the mock writes and tests to use the InMemoryCache
@nika-qubit
Contributor Author

Rebased to upstream head.
retest this please.

@nika-qubit
Contributor Author

R: @pabloem
Friendly ping :)

@pabloem
Member

pabloem commented Jan 22, 2020

looking

@nika-qubit
Contributor Author

nika-qubit commented Jan 22, 2020

Rebased to upstream head.
Retest this please.

@pabloem
Member

pabloem commented Jan 22, 2020

Retest this please

1 similar comment
@pabloem
Member

pabloem commented Jan 22, 2020

Retest this please

@tvalentyn
Contributor

Run Python PreCommit

2 similar comments
@nika-qubit
Contributor Author

Run Python PreCommit

@nika-qubit
Contributor Author

Run Python PreCommit

@nika-qubit
Contributor Author

Retest this please

@tvalentyn
Contributor

Run Python PreCommit

@tvalentyn
Contributor

retest this please

@tvalentyn
Contributor

Run PythonLint PreCommit

2 similar comments
@tvalentyn
Contributor

Run PythonLint PreCommit

@pabloem
Member

pabloem commented Jan 28, 2020

Run PythonLint PreCommit

Ning Kang added 3 commits January 27, 2020 17:54
1. Added pipeline_fragment module to build pipeline fragments including
only transforms necessary to produce user-desired PCollections and
implicitly execute the fragment to emit data for user-desired
PCollections.
2. Added the notion of cached PCollection completeness. Only cache of
completely computed PCollections can be read. `p.run()` of the
InteractiveRunner by default displays the pipeline graph and doesn't use
any cached intermediate PCollections but generates new cache for them.
3. Whenever a pipeline fragment is executed, by default the implicit
pipeline run doesn't display the pipeline graph but uses available cached
intermediate PCollections and generates cache for PCollections that haven't
been computed.
@nika-qubit
Contributor Author

Rebased to upstream head to pick up a lint change.

@nika-qubit
Contributor Author

Run PythonLint PreCommit

@nika-qubit
Contributor Author

Run Python PreCommit

1 similar comment
@tvalentyn
Contributor

Run Python PreCommit

@nika-qubit
Contributor Author

Run PythonLint PreCommit

3 similar comments
@pabloem
Member

pabloem commented Jan 28, 2020

Run PythonLint PreCommit

@pabloem
Member

pabloem commented Jan 28, 2020

Run PythonLint PreCommit

@nika-qubit
Contributor Author

Run PythonLint PreCommit

@nika-qubit
Contributor Author

Run Python PreCommit

@nika-qubit
Contributor Author

Retest this please.

@nika-qubit
Contributor Author

Retest this please

@nika-qubit
Contributor Author

Run PythonLint PreCommit

@pabloem
Member

pabloem commented Jan 28, 2020

Run Python PreCommit

@pabloem
Member

pabloem commented Jan 28, 2020

Run PythonLint PreCommit

@nika-qubit
Contributor Author

Retest this please.

@pabloem pabloem merged commit 302e4d9 into apache:master Jan 28, 2020
nika-qubit pushed a commit to nika-qubit/beam that referenced this pull request Feb 12, 2020