-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BEAM-5501] Interactive Beam -- display issue: number of PTransform executed wrongly displayed #6418
Conversation
Added read_cache_ids() and write_cache_ids() in pipeline ananlyzer. This change unblocks the display to show PTransform actually executed.
This commit shortens some of the variable/function names in PipelineAnalyzer. * pcollection_id => pcoll_id * transform_id => trans_id * top_level => tl The reason for this is because those long variable and function names have been causing quite a pain in fixing linter errors. e.g. super_ultra_long_variable_name = some_class_instance.a_very_long_function_name() becuase Python does not allow wrapping the line before or after '='
R: @pabloem Hi, Pablo. Can you review this for me? |
Added two small comments. LGTM. |
pipeline_proto: (Pipeline proto) | ||
caches_used: (set of str) A set of PCollection IDs of those whose cached | ||
results are used in the execution. | ||
pipeline_analyzer: (PipelineAnalyzer) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please document what this is in a little bit more detail.
|
||
# _pcollection_stats maps pcoll_id to | ||
# { 'cache_label': cache_label, version': version, 'sample': pcoll_in_list } | ||
self._pcollection_stats = {} | ||
for pcoll_id in pipeline_info.all_pcollections(): | ||
for pcoll_id in self._analyzer.tl_referenced_pcoll_ids(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we go from checking all PCs to checking only top-level-referenced ones?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PCollections are filtered around line 140 by referenced_pcollection, which contains only top_level ones.
The non-top level ones were added to _pcollection_stats but not used anywhere.
This commit is targetting the display issue: the number of (PTransform actually executed) is wrongly displayed. See: display_manager.py: Executin %s of %s transforms. Other changes: * Interface of display_manager. Instead of taking a bunch of parameters, it now queries the variables from pipeline_analyzer. * A couple of comment changes in pipeline_graph and interactive_pipeline_graph
Thank you Sindy. It seems that the Beam community expects merged PRs to have a JIRA associated to them. Sorry about the trouble, but could you add one? |
Created the corresponding JIRA issue: https://issues.apache.org/jira/browse/BEAM-5501 |
This PR broke Python precommits (task :beam-sdks-python:lintPy27_3). |
16:01:15 ************* Module apache_beam.runners.interactive.pipeline_analyzer |
""" | ||
return self._top_level_required_transforms | ||
return self._top_level_required_transforms.keys() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should return list(self._top_level_required_transforms) to avoid the lint complaint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also same change maybe needed in line 124 for Python 3 compatibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm trying to do the change now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
The main focus of this PR is to fix the display issue -- the number of PTransform executed is wrongly displayed.
"Executing 8 of 3 transforms" now becomes "Executing 3 of 3 transforms".
Follow this checklist to help us incorporate your contribution quickly and easily:
[BEAM-XXX] Fixes bug in ApproximateQuantiles
, where you replaceBEAM-XXX
with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.It will help us expedite review of your Pull Request if you tag someone (e.g.
@username
) to look at it.Post-Commit Tests Status (on master branch)