
[SPARK-14393][SQL] values generated by non-deterministic functions shouldn't change after coalesce or union #15567

Closed
wants to merge 19 commits

Conversation

@mengxr (Contributor) commented Oct 20, 2016

What changes were proposed in this pull request?

When a user appends a column to a DataFrame using a "nondeterministic" function, e.g., rand, randn, or monotonically_increasing_id, the expected semantics are the following:

  • The value in each row should remain unchanged, as if we materialize the column immediately, regardless of later DataFrame operations.

However, since we use TaskContext.getPartitionId to get the partition index from the current thread, the values in nondeterministic columns might change if we call union or coalesce afterwards. TaskContext.getPartitionId returns the partition index of the current Spark task, which might not be the corresponding partition index of the DataFrame where we defined the column.

See the unit tests below or JIRA for examples.

This PR uses the partition index from RDD.mapPartitionsWithIndex instead of TaskContext and fixes the partition initialization logic in whole-stage codegen, normal codegen, and codegen fallback. initializeStatesForPartition(partitionIndex: Int) was added to Projection, Nondeterministic, and Predicate (codegen) and is called right after object creation in mapPartitionsWithIndex. newPredicate now returns a Predicate instance rather than a function, so it can be properly initialized.
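The core of the bug can be illustrated without Spark. The sketch below (hypothetical simplified names; the real Spark expressions differ) models a rand-like expression whose per-row values are fully determined by a seed derived from the partition index: initializing with the index of the partition where the column was *defined* reproduces the values, while initializing with whatever index the current task happens to have (the TaskContext behavior) does not.

```scala
import scala.util.Random

// Simplified stand-in for a nondeterministic expression such as rand():
// its value sequence is determined by baseSeed + partitionIndex.
class RandExpr(baseSeed: Long) {
  private var rng: Random = _

  // Must be called once per partition, before any eval() call.
  def initialize(partitionIndex: Int): Unit = {
    rng = new Random(baseSeed + partitionIndex)
  }

  def eval(): Double = rng.nextDouble()
}

// Initializing two instances with the same partition index yields the
// same sequence, i.e., the column behaves as if it were materialized.
val a = new RandExpr(42L); a.initialize(3)
val b = new RandExpr(42L); b.initialize(3)
assert(a.eval() == b.eval())

// Initializing with a different index (e.g., the running task's index
// after coalesce collapses partitions) generally yields different values,
// which is exactly the value change this PR fixes.
val c = new RandExpr(42L); c.initialize(0)
```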

How was this patch tested?

Unit tests. (Actually I'm not very confident that this PR fixed all issues without introducing new ones ...)

cc: @rxin @davies

@SparkQA commented Oct 20, 2016

Test build #67258 has finished for PR 15567 at commit 5b915a7.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 20, 2016

Test build #67257 has finished for PR 15567 at commit 4dc0255.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 20, 2016

Test build #67260 has finished for PR 15567 at commit e2ebd88.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


/**
* [performance] Spark's internal mapPartitions method that skips closure cleaning.
*/
private[spark] def mapPartitionsInternal[U: ClassTag](
Contributor:

can we get rid of this?

Contributor (Author):

There are 20+ probably valid uses of mapPartitionsInternal. The main problem is that changing it to mapPartitionsWithIndexInternal doesn't really force people to initialize the partition state.
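The index-taking variant matters because it lets the closure initialize per-partition state before processing any rows. A Spark-free sketch of the pattern (partitions simulated as nested sequences; names hypothetical):

```scala
// Simulate RDD partitions as a sequence of sequences.
val partitions = Seq(Seq("a", "b"), Seq("c", "d"))

// A toy analogue of RDD.mapPartitionsWithIndex: the closure receives the
// partition's own index together with its row iterator.
def mapPartitionsWithIndex[T, U](parts: Seq[Seq[T]])(
    f: (Int, Iterator[T]) => Iterator[U]): Seq[Seq[U]] =
  parts.zipWithIndex.map { case (part, idx) => f(idx, part.iterator).toSeq }

val result = mapPartitionsWithIndex(partitions) { (index, iter) =>
  // Per-partition initialization happens here, with the correct index,
  // before the first row is processed.
  iter.map(row => s"$row@$index")
}

assert(result == Seq(Seq("a@0", "b@0"), Seq("c@1", "d@1")))
```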

@@ -274,12 +274,12 @@ trait Nondeterministic extends Expression {

private[this] var initialized = false

-  final def setInitialValues(): Unit = {
-    initInternal()
+  final def initializeStatesForPartition(partitionIndex: Int): Unit = {
Contributor:

while you are at it, it'd be great to add some comments documenting the function ...

Contributor:

How about just naming this "initialize"? It is fairly long right now ....

And we just document to say initialize must be called prior to task execution on a partition.

Contributor (Author):

I don't want to overload the name initialize, which is a little vague; how about initStates? Again, the issue is that even with comments we cannot force users to call it.

*/
val partitionInitializationStatements: mutable.ArrayBuffer[String] = mutable.ArrayBuffer.empty

def addPartitionInitializationStatement(statement: String): Unit = {
Contributor:

any reason you are creating this rather than just using addMutableState?

Contributor (Author):

I'm a little worried about introducing more issues by moving initMutableStates out from the constructor. The current implementation at least maintains the existing behavior if we missed initializeStatesForPartition somewhere.

@@ -274,12 +274,12 @@ trait Nondeterministic extends Expression {

private[this] var initialized = false
Contributor:

should we change this to transient? then it will always get reset to false on a new partition.

Contributor (Author):

will do
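The transient suggestion relies on standard JVM serialization behavior: a @transient field is skipped when the object is serialized and comes back with its default value (false here) on deserialization, which is exactly what Spark does when it ships an expression tree to executors. A minimal, Spark-free illustration:

```scala
import java.io._

// A serializable expression-like class whose initialization flag is
// deliberately NOT carried across serialization.
class Expr extends Serializable {
  @transient private var initialized = false
  def init(): Unit = { initialized = true }
  def isInitialized: Boolean = initialized
}

val e = new Expr
e.init()
assert(e.isInitialized)

// Round-trip through Java serialization, as happens when Spark sends
// the expression to an executor for a new partition.
val bos = new ByteArrayOutputStream()
new ObjectOutputStream(bos).writeObject(e)
val e2 = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray))
  .readObject().asInstanceOf[Expr]

// The transient flag is reset to false, so the executor-side copy is
// forced to re-initialize for its own partition.
assert(!e2.isInitialized)
```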

*/
@ExpressionDescription(
-  usage = "_FUNC_() - Returns the current partition id of the Spark task",
+  usage = "_FUNC_() - Returns the current partition id",
Contributor:

hmmm this is behavior changing, and there is some value to the old partition id ...

Contributor:

I'd consider introducing a new expression for the proper id and leave the old one as is.

Contributor (Author):

I thought about this. But I don't think the current behavior is the expected behavior from users. This is the same issue as in monotonically_increasing_id.

@rxin (Contributor), Oct 20, 2016:

Yea but it is consistent with TaskContext.partitionId (which is also the name of the function)

Contributor (Author):

The name is SparkPartitionID, not TaskContextPartitionID. We should follow the same semantics as the other non-deterministic expressions.

* This is used by non-deterministic expressions to set initial states.
* The default implementation does nothing.
*/
def initializeStatesForPartition(partitionIndex: Int): Unit = {}
Contributor:

to make this safer, i'd create an internal variable "isInitialized" similar to the one in nondeterministic expression, and assert in eval if isInitialized is false.

Contributor (Author):

I didn't test it. Would doing that hurt performance?

Contributor:

I don't think so, since it is in the interpreted path which is already very slow. Also in the normal case the condition will always be false, so CPU branch prediction should work its magic.
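The guard being discussed could look roughly like this (a hypothetical simplified trait, not the actual Spark code): eval fails fast if initialize was never called, instead of silently evaluating with a wrong or missing partition index.

```scala
// Sketch of the proposed safety check on the interpreted path.
trait Stateful {
  private var initialized = false
  protected var partitionIndex: Int = -1

  final def initialize(index: Int): Unit = {
    partitionIndex = index
    initialized = true
    initInternal()
  }

  protected def initInternal(): Unit

  final def eval(): Any = {
    // Cheap branch; in the normal case it is never taken, so branch
    // prediction makes the cost negligible on the already-slow
    // interpreted path.
    require(initialized, s"${getClass.getName} was not initialized before eval()")
    evalInternal()
  }

  protected def evalInternal(): Any
}

class PartitionId extends Stateful {
  protected def initInternal(): Unit = {}
  protected def evalInternal(): Any = partitionIndex
}

val p = new PartitionId
// p.eval()  // would throw IllegalArgumentException: not initialized
p.initialize(7)
assert(p.eval() == 7)
```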

@rxin (Contributor) commented Oct 20, 2016

Reviewing this code makes me realize how painful it is when project/filter are just scala functions ... it'd be much easier to review if they have methods defined (e.g. eval, or execute) ...

@rxin (Contributor) commented Oct 20, 2016

So my biggest question is whether you've changed all the places to call initialize where projection/predicate are used.

@SparkQA commented Oct 20, 2016

Test build #67270 has finished for PR 15567 at commit 6659795.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 20, 2016

Test build #67277 has finished for PR 15567 at commit 80f26c6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nchammas (Contributor):
@mengxr - I think this PR will also address SPARK-14241.

@shearerpmm (Contributor):
SPARK-14241 doesn't just occur with union and coalesce; it also occurs with filter and probably other operations. Hopefully this PR will address all of those situations. I strongly agree with the expected semantics in the original PR message by mengxr; this has bitten me on multiple occasions.

@mengxr (Author) commented Nov 1, 2016

@rxin I updated the implementation to force initialization in Projection/Expression. This will fail many tests; I fixed all of them in catalyst, but not yet in sql. I want to propose the following change:

  • add Deterministic extends PartitionDependent (always set initialized to true)
  • remove Nondeterministic

Basically, we assume non-deterministic by default unless marked as deterministic. This will require updating all expressions but make the code less messy.
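The proposed hierarchy might look like the sketch below (trait names from the comment; the details are an assumption, not the merged design). Everything is partition-dependent by default, and marking an expression Deterministic pre-initializes it so it is always safe to evaluate:

```scala
// Default: expressions carry per-partition state and must be initialized.
trait PartitionDependent {
  private var _initialized = false
  def isInitialized: Boolean = _initialized
  def initialize(partitionIndex: Int): Unit = { _initialized = true }
}

// Opt-out: deterministic expressions have no per-partition state, so any
// index works and they are initialized at construction time.
trait Deterministic extends PartitionDependent {
  initialize(0)
}

class LiteralLike extends Deterministic       // safe without initialize()
class RandLike extends PartitionDependent     // must call initialize() first

assert((new LiteralLike).isInitialized)
assert(!(new RandLike).isInitialized)
```

This inverts the default: forgetting to mark an expression deterministic costs a redundant initialize call, whereas today forgetting to mark one nondeterministic silently produces wrong values.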

* Trait for expressions that require initialization based on the partition index prior to task
* execution on a partition.
*/
trait PartitionDependent {
Contributor:

as discussed offline this is StatefulExpression

Contributor (Author):

Actually, Projection also needs the same trait. Shall we call it PartitionedStateful?

@SparkQA commented Nov 1, 2016

Test build #67907 has finished for PR 15567 at commit 9dcb249.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr (Author) commented Nov 2, 2016

I reverted the changes I made to enforce Projection.initialize, which touched too many files, most of which don't really need to handle nondeterministic expressions. The current implementation at least throws a runtime error if a nondeterministic expression didn't get initialized, instead of returning incorrect results.

Renamed initializeStatesForPartition to initialize.

@SparkQA commented Nov 2, 2016

Test build #67957 has finished for PR 15567 at commit 553c6a5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 2, 2016

Test build #67959 has finished for PR 15567 at commit ababaa9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor) commented Nov 2, 2016

Merging in master/branch-2.1. Thanks.

asfgit pushed a commit that referenced this pull request Nov 2, 2016
[SPARK-14393][SQL] values generated by non-deterministic functions shouldn't change after coalesce or union
Author: Xiangrui Meng <meng@databricks.com>

Closes #15567 from mengxr/SPARK-14393.

(cherry picked from commit 02f2031)
Signed-off-by: Reynold Xin <rxin@databricks.com>
@asfgit asfgit closed this in 02f2031 Nov 2, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
[SPARK-14393][SQL] values generated by non-deterministic functions shouldn't change after coalesce or union
Author: Xiangrui Meng <meng@databricks.com>

Closes apache#15567 from mengxr/SPARK-14393.
val predicate = newPredicate(condition, child.output)
predicate.initialize(0)
Member:

Just wondering why FilterExec is not using the partition index to initialize the condition?
