Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-20441][SPARK-20432][SS] Within the same streaming query, one StreamingRelation should only be transformed to one StreamingExecutionRelation #17735

Closed
wants to merge 4 commits into from

Conversation

lw-lin
Copy link
Contributor

@lw-lin lw-lin commented Apr 23, 2017

What changes were proposed in this pull request?

Within the same streaming query, when one StreamingRelation is referred multiple times – e.g. df.union(df) – we should transform it only to one StreamingExecutionRelation, instead of two or more different StreamingExecutionRelations (each of which would have a separate set of source, source logs, ...).

How was this patch tested?

Added two test cases, each of which would fail without this patch.

@SparkQA
Copy link

SparkQA commented Apr 23, 2017

Test build #76080 has finished for PR 17735 at commit bf502a7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@lw-lin lw-lin changed the title [WIP][SS] Within the same streaming query, one StreamingRelation should only be transformed to one StreamingExecutionRelation [SPARK-20441][SPARK-20432][SS] Within the same streaming query, one StreamingRelation should only be transformed to one StreamingExecutionRelation Apr 23, 2017
@lw-lin
Copy link
Contributor Author

lw-lin commented Apr 25, 2017

@zsxwing @brkyvz would you take a look at this? thanks!

.collect { case ser: StreamingExecutionRelation => ser }
assert(executionRelations.size == 2)
assert(executionRelations.distinct.size == 1)
query.stop()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please wrap this in a finally?

@SparkQA
Copy link

SparkQA commented Apr 27, 2017

Test build #76208 has finished for PR 17735 at commit 98eb3dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 27, 2017

Test build #76211 has finished for PR 17735 at commit db11db0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@lw-lin
Copy link
Contributor Author

lw-lin commented Apr 28, 2017

@brkyvz please take another look

@lw-lin
Copy link
Contributor Author

lw-lin commented May 1, 2017

Jenkins retest this please

@lw-lin
Copy link
Contributor Author

lw-lin commented May 1, 2017

@zsxwing @brkyvz would you take a look at this? thanks!

@SparkQA
Copy link

SparkQA commented May 1, 2017

Test build #76347 has finished for PR 17735 at commit db11db0.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@lw-lin
Copy link
Contributor Author

lw-lin commented May 1, 2017

Jenkins retest this please

@SparkQA
Copy link

SparkQA commented May 2, 2017

Test build #76363 has finished for PR 17735 at commit db11db0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

.streamingQuery
val executionRelations =
query
.logicalPlan
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need to call query.awaitInitialization before accessing logicalPlan. Otherwise this test will be flaky.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, i see.
fixed. thanks!

@zsxwing
Copy link
Member

zsxwing commented May 2, 2017

Looks pretty good except one minor issue in tests.

@SparkQA
Copy link

SparkQA commented May 3, 2017

Test build #76398 has finished for PR 17735 at commit 63ed28a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@brkyvz
Copy link
Contributor

brkyvz commented May 3, 2017

LGTM. Merging to master/2.2. Thanks for the PR!

asfgit pushed a commit that referenced this pull request May 3, 2017
…treamingRelation should only be transformed to one StreamingExecutionRelation

## What changes were proposed in this pull request?

Within the same streaming query, when one `StreamingRelation` is referred multiple times – e.g. `df.union(df)` – we should transform it only to one `StreamingExecutionRelation`, instead of two or more different `StreamingExecutionRelation`s (each of which would have a separate set of source, source logs, ...).

## How was this patch tested?

Added two test cases, each of which would fail without this patch.

Author: Liwei Lin <lwlin7@gmail.com>

Closes #17735 from lw-lin/SPARK-20441.

(cherry picked from commit 27f543b)
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
@asfgit asfgit closed this in 27f543b May 3, 2017
@lw-lin lw-lin deleted the SPARK-20441 branch May 4, 2017 14:19
cenyuhai pushed a commit to cenyuhai/spark that referenced this pull request Oct 8, 2017
…treamingRelation should only be transformed to one StreamingExecutionRelation

## What changes were proposed in this pull request?

Within the same streaming query, when one `StreamingRelation` is referred multiple times – e.g. `df.union(df)` – we should transform it only to one `StreamingExecutionRelation`, instead of two or more different `StreamingExecutionRelation`s (each of which would have a separate set of source, source logs, ...).

## How was this patch tested?

Added two test cases, each of which would fail without this patch.

Author: Liwei Lin <lwlin7@gmail.com>

Closes apache#17735 from lw-lin/SPARK-20441.

(cherry picked from commit 27f543b)
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>

(cherry picked from commit b1a732f)
cenyuhai pushed a commit to cenyuhai/spark that referenced this pull request Oct 8, 2017
…treamingRelation should only be transformed to one StreamingExecutionRelation

## What changes were proposed in this pull request?

Within the same streaming query, when one `StreamingRelation` is referred multiple times – e.g. `df.union(df)` – we should transform it only to one `StreamingExecutionRelation`, instead of two or more different `StreamingExecutionRelation`s (each of which would have a separate set of source, source logs, ...).

## How was this patch tested?

Added two test cases, each of which would fail without this patch.

Author: Liwei Lin <lwlin7@gmail.com>

Closes apache#17735 from lw-lin/SPARK-20441.

(cherry picked from commit 27f543b)
// "df.logicalPlan" has already used attributes of the previous `output`.
StreamingExecutionRelation(source, output)
case streamingRelation@StreamingRelation(dataSource, _, output) =>
toExecutionRelationMap.getOrElseUpdate(streamingRelation, {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not using QueryPlan.sameResult? Our dedup could break it, right? cc @zsxwing @brkyvz

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants