Skip to content

[BEAM-7923] Include side effects in p.run#11141

Merged
aaltay merged 1 commit intoapache:masterfrom
nika-qubit:BEAM-7923-fix
Mar 17, 2020
Merged

[BEAM-7923] Include side effects in p.run#11141
aaltay merged 1 commit intoapache:masterfrom
nika-qubit:BEAM-7923-fix

Conversation

@nika-qubit
Copy link
Contributor

@nika-qubit nika-qubit commented Mar 16, 2020

  1. The problem: PCollections never used as inputs and not watched, such as
    sinks without being assigned to variables is currently pruned before p.run().
  2. The change makes sure that these "side effect" PCollections are now
    considered as extended targets and will be executed on p.run().
  3. Note the change will not affect show, head and collect because
    they have an additional pipeline fragment logic that already prunes
    everything unrelated before the instrumenting and the prune logic inside
    instrumenting.

Please add a meaningful description for your change here


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

Post-Commit Tests Status (on master branch)

Lang SDK Apex Dataflow Flink Gearpump Samza Spark
Go Build Status --- --- Build Status --- --- Build Status
Java Build Status Build Status Build Status
Build Status
Build Status
Build Status
Build Status
Build Status Build Status Build Status
Build Status
Build Status
Python Build Status
Build Status
Build Status
Build Status
--- Build Status
Build Status
Build Status
Build Status
Build Status
--- --- Build Status
XLang --- --- --- Build Status --- --- Build Status

Pre-Commit Tests Status (on master branch)

--- Java Python Go Website
Non-portable Build Status Build Status
Build Status
Build Status Build Status
Portable --- Build Status --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test was flaky because the dataframe columns can be built in arbitrary orders. This option makes sure it doesn't take column positioning into consideration since we only care about the equivalence of data.

@nika-qubit
Copy link
Contributor Author

Formatted with yapf.
Lint passed locally.

R: @aaltay
R: @davidyan74
R: @rohdesamuel

PTAL, thx!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to track, mark side effects differently? Does users want to specifically track these pcollections?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not necessary. The intended behavior is not ambiguous: When the user uses show, head, collect APIs, these PCollections are excluded completely as the user explicitly wishes. And when the user invokes p.run(), all transforms in the pipeline should be executed as expected.

This change is only to make sure that the prune logic doesn't affect the above intended behavior.

@aaltay
Copy link
Member

aaltay commented Mar 17, 2020

Could you resolve the conflict?

1. PCollections never used as inputs and not watched, such as sinks without being assigned
to variables will be pruned before `p.run()`. The change makes sure that
these side effect PCollections are now considered as extended targets
and will be executed on `p.run()`.
2. Note the change will not affect `show`, `head` and `collect` because
they have an additional pipeline fragment logic that already prunes
everything unrelated before the instrumenting and the prune logic inside
instrumenting.
@nika-qubit
Copy link
Contributor Author

Rebased to resolve merge conflicts and force pushed!

@aaltay aaltay merged commit df482df into apache:master Mar 17, 2020
@aaltay
Copy link
Member

aaltay commented Mar 17, 2020

Merged this, without noticing that test did not run. Fooled by githubs "all green check signs". Please watch the tests, especially the cron ones and see if anything is failing.

Or better, create an empty PR to run the tests.

@nika-qubit nika-qubit deleted the BEAM-7923-fix branch March 18, 2020 17:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants