[BEAM-7926] Data-centric Interactive Part2#10346
Conversation
|
R: @davidyan74 Adding Pablo as the committer. Thank you all! |
|
Taking a look |
Force-pushed 0712ad7 to dd1a7c5
Split into 2 PRs. |
Can we also add a test that the original pipeline is intact? i.e. p still has the PTransform "Cube".
|
Thank you for splitting it up, currently taking a look. |
The method already does bounds checking, this is redundant. It's okay to just do data_sample = data.head(25).
Got it, will remove the bound checking.
"Displays the first n entries of the PCollection". It might also be helpful to parameterize n for the user.
This is not an exposed API, so users wouldn't be able to invoke it directly. Think of it as the UI displayed in the terminal when the user invokes show(pcoll) in an IPython shell.
We can have a head() API in the interactive_beam module, and we don't necessarily need to display a data table; it could return a list, like result.get(pcoll).
You can replace these four lines with self._computed_pcolls.update(pc for pc in pcolls).
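A minimal sketch of the suggested simplification, assuming `self._computed_pcolls` is a `set` and `pcolls` is any iterable of PCollections (plain strings stand in for PCollections here; `Tracker` is a hypothetical stand-in class):

```python
class Tracker:
    def __init__(self):
        self._computed_pcolls = set()

    def mark_computed(self, pcolls):
        # One update() call replaces a manual loop that adds each element.
        # Note update(pcolls) would be equivalent to update(pc for pc in pcolls).
        self._computed_pcolls.update(pc for pc in pcolls)


tracker = Tracker()
tracker.mark_computed(['pcoll_a', 'pcoll_b'])
tracker.mark_computed(['pcoll_b', 'pcoll_c'])  # duplicates are deduplicated
```

Since `set.update` accepts any iterable, the generator expression is optional sugar.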
I know you didn't add it in this PR, but in general it is easier to understand "positive" parameters. enable_display with a default of True is easier to understand than skip_display with default of False (since skip_display with False is a double negative).
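A small illustration of the naming point (both function names and the `pipeline` parameter are hypothetical): a "positive" flag with a `True` default reads directly, while a negated flag with a `False` default forces the reader through a double negative.

```python
def render_graph(pipeline, enable_display=True):
    # Reads as "display unless told not to".
    return 'displayed' if enable_display else 'hidden'


def render_graph_negated(pipeline, skip_display=False):
    # Reads as "do not skip displaying" -- a double negative.
    return 'hidden' if skip_display else 'displayed'
```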
This was added by an internal Googler, and their code in Google3 needs/sets this value. So let's leave it be for now, since we cannot make an atomic change that doesn't break things in Google3 at some point. I'll contact the author once our changes have settled down.
I'm curious why we use the PipelineVisitor to manipulate the graph instead of doing something like the PipelineAnalyzer, which manipulates the proto?
I think the proto way is less intuitive than traversing the pipeline as a graph directly, at least for the user pipeline.
With PipelineVisitor, all the logic can be applied by traversing the pipeline graph only once.
If you are traversing the user-defined pipeline instance, you can immediately query metadata such as id(pcoll), which PTransform produces the pcoll, and what the inputs, side inputs, and outputs of a PTransform are.
With the proto, there is a missing link between the protos and whatever the user defined that lives in memory.
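A simplified stand-in (these are not Beam's actual classes) for the idea above: a single visitor-style pass over the in-memory graph can collect producer/consumer metadata directly, with no proto round trip.

```python
class PTransformNode:
    """Toy transform node: a label plus named input/output PCollections."""
    def __init__(self, label, inputs, outputs):
        self.label, self.inputs, self.outputs = label, inputs, outputs


class MetadataVisitor:
    """Collects which transform produces and which transforms consume each PCollection."""
    def __init__(self):
        self.producer = {}   # pcoll name -> label of the transform that produces it
        self.consumers = {}  # pcoll name -> labels of transforms that read it

    def visit_transform(self, node):
        for out in node.outputs:
            self.producer[out] = node.label
        for inp in node.inputs:
            self.consumers.setdefault(inp, []).append(node.label)


# A tiny linear pipeline: Create -> Square -> Write
pipeline = [
    PTransformNode('Create', [], ['pc1']),
    PTransformNode('Square', ['pc1'], ['pc2']),
    PTransformNode('Write', ['pc2'], []),
]
visitor = MetadataVisitor()
for node in pipeline:  # one traversal gathers all the metadata
    visitor.visit_transform(node)
```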
I'll leave it be for now, but I still have my reservations. I think this works by some odd trick of the implementation. It's actually surprising that the API allows modifying the graph while iterating over it.
Got it. Yeah, I had the same hunch. But consider how we mutate a linked list, and here, a graph (actually a tree). It's very common to move things around in a tree/graph in place: all the tree-balancing logic, for example, happens without converting the tree to a serialized representation first. It's not a "bad" practice as long as we are not mutating the pipeline defined by the user.
A proto and a pipeline object are basically two representations of a graph, similar to an adjacency list vs. a graph object. The proto is both platform-agnostic and language-agnostic, while the pipeline object is only platform-agnostic (it's represented in Python and would need different implementations on different platforms).
To me, the proto is something that gets passed across systems, and I'd like it to be used as an immutable medium when passing the representation of a pipeline around. When a mutation is needed, we deserialize it into a mutable pipeline object.
The pipeline object, on the other hand, is designed to be mutable and to be used to construct pipelines, so we just mutate it when we see fit.
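A toy analogy for that division of labor (this is JSON, not Beam's actual proto APIs): treat the serialized blob as an immutable medium, deserialize it into a mutable in-memory object when a change is needed, mutate that object, and re-serialize.

```python
import json

def from_serialized(blob):
    """Deserialize the wire form into a mutable in-memory graph."""
    return json.loads(blob)

def to_serialized(graph):
    """Serialize the in-memory graph back to an immutable wire form."""
    return json.dumps(graph, sort_keys=True)


wire_form = '{"edges": [["Create", "Cube"]]}'   # the "proto" on the wire
graph = from_serialized(wire_form)
graph['edges'].append(['Cube', 'Write'])        # mutate the object, not the blob
new_wire_form = to_serialized(graph)            # original wire_form is untouched
```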
Why do we need to build correlations?
show(pcoll1, pcoll2, pcoll3) given by the user shows PCollections defined in the user pipeline.
When building a fragment, the pipeline deduced is a runner pipeline (mutated standalone copy of the user pipeline). Without the correlations, we don't know what PCollections are pcoll1, pcoll2, pcoll3 in the runner pipeline anymore.
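A hypothetical sketch of the correlation idea (toy classes, not the PR's actual implementation): when the user pipeline is copied into a runner pipeline, keep a map from each user PCollection to its copy, so show(pcoll1, ...) can resolve the corresponding runner PCollection.

```python
class PColl:
    """Toy stand-in for a PCollection."""
    def __init__(self, tag):
        self.tag = tag


user_pcolls = [PColl('pcoll1'), PColl('pcoll2'), PColl('pcoll3')]
# Copying the pipeline produces distinct objects with the same structure.
runner_pcolls = [PColl(pc.tag) for pc in user_pcolls]
# Correlate by the identity of the user-pipeline object.
correlation = {id(u): r for u, r in zip(user_pcolls, runner_pcolls)}

def resolve(user_pcoll):
    """Find the runner-pipeline PCollection for a user-pipeline one."""
    return correlation[id(user_pcoll)]
```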
Can you also please make a test that the PipelineFragment produces the correct output?
Sure, adding a test to run the pipeline fragment and check the result.
Why do we need this map? What does it do?
It tells us the correlation between PCollections from the runner pipeline instance and PCollections from the user pipeline instance.
When querying the values(), it also tells us what PCollections are cached.
You can even build a pipeline fragment or call show() with the values().
Can you test this please?
Added a test that runs a pipeline, verifies the computed PCollections and checks the result of computed PCollections.
Why do you need to mark the completeness here?
Because with our design, only 2 APIs mark completeness: show and run.
This unit test uses neither, but it tries to test the execution path where the cache is already available for reading.
We could change it to p.run() and then instrument again, but this way might be more descriptive.
|
Run Portable_Python PreCommit |
|
LGTM |
|
R: @pabloem This PR is ready for your review! Thank you very much! |
Force-pushed 53e9cc2 to a1a72be
|
Resolved merge conflicts and force pushed. |
|
Run Python PreCommit |
3 similar comments
|
Run Python PreCommit |
|
Run Python PreCommit |
|
Run Python PreCommit |
Force-pushed bf8e862 to ff85619
|
retest this please |
* Manual merge of apache#10346
* use real coders
* Modify the PipelineInstrument class to add the TestStream for unbounded PCollections
* address comments
* Modify the PipelineInstrument class to add the TestStream for unbounded PCollections
* Implements the StreamingCacheManager

cache_manager.py:
- Add the 'write' method to write an element to the cache
- Refactor the 'source/sink' methods to return a PTransform
- Refactor the 'read' method to return a generator instead of the full list
- Refactor the ReadCache and WriteCache methods to use the source/sink methods directly

cache_manager_test.py:
- Refactor to use the new implementations

streaming_cache.py:
- Refactor to implement the CacheManager interface
- Refactor the 'Reader' class to take in a list of headers
- Refactor the 'read' method to read the headers from the cache
- Refactor the 'reader' method into the 'read_multiple' method to read multiple PCollections from cache given their headers

streaming_cache_test.py:
- Refactor to use the new implementations
- Create a new class, "InMemoryCache", that is able to write and read PCollections directly from memory instead of from file

pipeline_instrument.py:
- Refactor the _read_cache method to only use the ReadCache method

pipeline_instrument_test.py:
- Create a new class, "InMemoryCache", that is able to write and read PCollections directly from memory instead of from file
- Simplify the mock writes and tests to use the InMemoryCache
Force-pushed ff85619 to 59c9703
|
Rebased to upstream head. |
|
R: @pabloem |
|
looking |
Force-pushed 59c9703 to 94c3b0c
|
Rebased to upstream head. |
|
Retest this please |
1 similar comment
|
Retest this please |
|
Run Python PreCommit |
2 similar comments
|
Run Python PreCommit |
|
Run Python PreCommit |
|
Retest this please |
|
Run Python PreCommit |
|
retest this please |
|
Run PythonLint PreCommit |
2 similar comments
|
Run PythonLint PreCommit |
|
Run PythonLint PreCommit |
1. Added the pipeline_fragment module to build pipeline fragments that include only the transforms necessary to produce user-desired PCollections, and to implicitly execute the fragment to emit data for those PCollections.
2. Added the notion of cached PCollection completeness. Only the cache of completely computed PCollections can be read. `p.run()` of the InteractiveRunner by default displays the pipeline graph, doesn't use any cached intermediate PCollections, and generates new cache for them.
3. Whenever a pipeline fragment is executed, by default the implicit pipeline run doesn't display the pipeline graph, but it uses available cached intermediate PCollections and generates cache for PCollections that haven't been computed.
Force-pushed 2bf1a2b to d77778e
|
Rebased to upstream head to pick up a lint change. |
|
Run PythonLint PreCommit |
|
Run Python PreCommit |
1 similar comment
|
Run Python PreCommit |
|
Run PythonLint PreCommit |
3 similar comments
|
Run PythonLint PreCommit |
|
Run PythonLint PreCommit |
|
Run PythonLint PreCommit |
|
Run Python PreCommit |
|
Retest this please. |
|
Retest this please |
|
Run PythonLint PreCommit |
|
Run Python PreCommit |
|
Run PythonLint PreCommit |
|
Retest this please. |