
[SPARK-14393][SQL] values generated by non-deterministic functions shouldn't change after coalesce or union #15567

Closed
wants to merge 19 commits

Conversation

@mengxr (Contributor) commented Oct 20, 2016

What changes were proposed in this pull request?

When a user appends a column to a DataFrame using a "nondeterministic" function, e.g., rand, randn, or monotonically_increasing_id, the expected semantics are the following:

  • The value in each row should remain unchanged, as if we materialize the column immediately, regardless of later DataFrame operations.

However, since we use TaskContext.getPartitionId to get the partition index from the current thread, the values in nondeterministic columns might change if we call union or coalesce afterwards. TaskContext.getPartitionId returns the partition index of the current Spark task, which might not be the corresponding partition index of the DataFrame where we defined the column.

See the unit tests below or JIRA for examples.

This PR uses the partition index from RDD.mapPartitionsWithIndex instead of TaskContext and fixes the partition initialization logic in whole-stage codegen, normal codegen, and codegen fallback. initializeStatesForPartition(partitionIndex: Int) was added to Projection, Nondeterministic, and Predicate (codegen) and is called right after object creation in mapPartitionsWithIndex. newPredicate now returns a Predicate instance rather than a function, so it can be properly initialized.
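The core of the bug can be illustrated without Spark. The sketch below (hypothetical simplified names; the real Spark expressions differ) models a rand-like expression whose per-row values are fully determined by a seed derived from the partition index: initializing with the index of the partition where the column was *defined* reproduces the values, while initializing with whatever index the current task happens to have (the TaskContext behavior) does not.

```scala
import scala.util.Random

// Simplified stand-in for a nondeterministic expression such as rand():
// its value sequence is determined by baseSeed + partitionIndex.
class RandExpr(baseSeed: Long) {
  private var rng: Random = _

  // Must be called once per partition, before any eval() call.
  def initialize(partitionIndex: Int): Unit = {
    rng = new Random(baseSeed + partitionIndex)
  }

  def eval(): Double = rng.nextDouble()
}

// Initializing two instances with the same partition index yields the
// same sequence, i.e., the column behaves as if it were materialized.
val a = new RandExpr(42L); a.initialize(3)
val b = new RandExpr(42L); b.initialize(3)
assert(a.eval() == b.eval())

// Initializing with a different index (e.g., the running task's index
// after coalesce collapses partitions) generally yields different values,
// which is exactly the value change this PR fixes.
val c = new RandExpr(42L); c.initialize(0)
```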

How was this patch tested?

Unit tests. (Actually I'm not very confident that this PR fixed all issues without introducing new ones ...)

cc: @rxin @davies

@SparkQA commented Oct 20, 2016

Test build #67258 has finished for PR 15567 at commit 5b915a7.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 20, 2016

Test build #67257 has finished for PR 15567 at commit 4dc0255.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 20, 2016

Test build #67260 has finished for PR 15567 at commit e2ebd88.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


/**
* [performance] Spark's internal mapPartitions method that skips closure cleaning.
*/
private[spark] def mapPartitionsInternal[U: ClassTag](
Contributor:

can we get rid of this?

Contributor (Author):

There are 20+ probably valid uses of mapPartitionsInternal. The main problem is that changing it to mapPartitionsWithIndexInternal doesn't really force people to initialize the partition state.
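The index-taking variant matters because it lets the closure initialize per-partition state before processing any rows. A Spark-free sketch of the pattern (partitions simulated as nested sequences; names hypothetical):

```scala
// Simulate RDD partitions as a sequence of sequences.
val partitions = Seq(Seq("a", "b"), Seq("c", "d"))

// A toy analogue of RDD.mapPartitionsWithIndex: the closure receives the
// partition's own index together with its row iterator.
def mapPartitionsWithIndex[T, U](parts: Seq[Seq[T]])(
    f: (Int, Iterator[T]) => Iterator[U]): Seq[Seq[U]] =
  parts.zipWithIndex.map { case (part, idx) => f(idx, part.iterator).toSeq }

val result = mapPartitionsWithIndex(partitions) { (index, iter) =>
  // Per-partition initialization happens here, with the correct index,
  // before the first row is processed.
  iter.map(row => s"$row@$index")
}

assert(result == Seq(Seq("a@0", "b@0"), Seq("c@1", "d@1")))
```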

@@ -274,12 +274,12 @@ trait Nondeterministic extends Expression {

private[this] var initialized = false

-  final def setInitialValues(): Unit = {
-    initInternal()
+  final def initializeStatesForPartition(partitionIndex: Int): Unit = {
Contributor:

while you are at it, it'd be great to add some comments documenting the function ...

Contributor:

How about just naming this "initialize"? It is fairly long right now ....

And we just document to say initialize must be called prior to task execution on a partition.

Contributor (Author):

I don't want to overload the name initialize, which is a little vague; how about initStates? Again, the issue is that even with comments we cannot force users to call it.

*/
val partitionInitializationStatements: mutable.ArrayBuffer[String] = mutable.ArrayBuffer.empty

def addPartitionInitializationStatement(statement: String): Unit = {
Contributor:

any reason you are creating this rather than just using addMutableState?

Contributor (Author):

I'm a little worried about introducing more issues by moving initMutableStates out from the constructor. The current implementation at least maintains the existing behavior if we missed initializeStatesForPartition somewhere.

@@ -274,12 +274,12 @@ trait Nondeterministic extends Expression {

private[this] var initialized = false
Contributor:

should we change this to transient? then it will always get reset to false on a new partition.

Contributor (Author):

will do
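The transient suggestion relies on standard JVM serialization behavior: a @transient field is skipped when the object is serialized and comes back with its default value (false here) on deserialization, which is exactly what Spark does when it ships an expression tree to executors. A minimal, Spark-free illustration:

```scala
import java.io._

// A serializable expression-like class whose initialization flag is
// deliberately NOT carried across serialization.
class Expr extends Serializable {
  @transient private var initialized = false
  def init(): Unit = { initialized = true }
  def isInitialized: Boolean = initialized
}

val e = new Expr
e.init()
assert(e.isInitialized)

// Round-trip through Java serialization, as happens when Spark sends
// the expression to an executor for a new partition.
val bos = new ByteArrayOutputStream()
new ObjectOutputStream(bos).writeObject(e)
val e2 = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray))
  .readObject().asInstanceOf[Expr]

// The transient flag is reset to false, so the executor-side copy is
// forced to re-initialize for its own partition.
assert(!e2.isInitialized)
```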

*/
@ExpressionDescription(
-  usage = "_FUNC_() - Returns the current partition id of the Spark task",
+  usage = "_FUNC_() - Returns the current partition id",
Contributor:

hmmm this is behavior changing, and there is some value to the old partition id ...

Contributor:

I'd consider introducing a new expression for the proper id and leave the old one as is.

Contributor (Author):

I thought about this. But I don't think the current behavior is the expected behavior from users. This is the same issue as in monotonically_increasing_id.

@rxin (Contributor), Oct 20, 2016:

Yea but it is consistent with TaskContext.partitionId (which is also the name of the function)

Contributor (Author):

The name is SparkPartitionID, not TaskContextPartitionID. We should follow the same semantics as the other non-deterministic expressions.

* This is used by non-deterministic expressions to set initial states.
* The default implementation does nothing.
*/
def initializeStatesForPartition(partitionIndex: Int): Unit = {}
Contributor:

to make this safer, i'd create an internal variable "isInitialized" similar to the one in nondeterministic expression, and assert in eval if isInitialized is false.

Contributor (Author):

I didn't test it. Would doing that hurt performance?

Contributor:

I don't think so, since it is in the interpreted path which is already very slow. Also in the normal case the condition will always be false, so CPU branch prediction should work its magic.
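The guard being discussed could look roughly like this (a hypothetical simplified trait, not the actual Spark code): eval fails fast if initialize was never called, instead of silently evaluating with a wrong or missing partition index.

```scala
// Sketch of the proposed safety check on the interpreted path.
trait Stateful {
  private var initialized = false
  protected var partitionIndex: Int = -1

  final def initialize(index: Int): Unit = {
    partitionIndex = index
    initialized = true
    initInternal()
  }

  protected def initInternal(): Unit

  final def eval(): Any = {
    // Cheap branch; in the normal case it is never taken, so branch
    // prediction makes the cost negligible on the already-slow
    // interpreted path.
    require(initialized, s"${getClass.getName} was not initialized before eval()")
    evalInternal()
  }

  protected def evalInternal(): Any
}

class PartitionId extends Stateful {
  protected def initInternal(): Unit = {}
  protected def evalInternal(): Any = partitionIndex
}

val p = new PartitionId
// p.eval()  // would throw IllegalArgumentException: not initialized
p.initialize(7)
assert(p.eval() == 7)
```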

@rxin (Contributor) commented Oct 20, 2016

Reviewing this code makes me realize how painful it is when project/filter are just scala functions ... it'd be much easier to review if they have methods defined (e.g. eval, or execute) ...

@rxin (Contributor) commented Oct 20, 2016

So my biggest question is whether you've changed all the places to call initialize where projection/predicate are used.

@SparkQA commented Oct 20, 2016

Test build #67270 has finished for PR 15567 at commit 6659795.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 20, 2016

Test build #67277 has finished for PR 15567 at commit 80f26c6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nchammas (Contributor):
@mengxr - I think this PR will also address SPARK-14241.

@shearerpmm (Contributor):
SPARK-14241 doesn't just occur with union and coalesce; it also occurs with filter and probably other operations. Hopefully this PR will address all of those situations. I strongly agree with the expected semantics in the original PR message by mengxr; this has bitten me on multiple occasions.

@mengxr (Author) commented Nov 1, 2016

@rxin I updated the implementation to force initialization in Projection/Expression. This will fail many tests; I fixed all of them in catalyst, but not yet in sql. I want to propose the following change:

  • add Deterministic extends PartitionDependent (always set initialized to true)
  • remove Nondeterministic

Basically, we assume non-deterministic by default unless marked as deterministic. This will require updating all expressions but make the code less messy.
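The proposed hierarchy might look like the sketch below (trait names from the comment; the details are an assumption, not the merged design). Everything is partition-dependent by default, and marking an expression Deterministic pre-initializes it so it is always safe to evaluate:

```scala
// Default: expressions carry per-partition state and must be initialized.
trait PartitionDependent {
  private var _initialized = false
  def isInitialized: Boolean = _initialized
  def initialize(partitionIndex: Int): Unit = { _initialized = true }
}

// Opt-out: deterministic expressions have no per-partition state, so any
// index works and they are initialized at construction time.
trait Deterministic extends PartitionDependent {
  initialize(0)
}

class LiteralLike extends Deterministic       // safe without initialize()
class RandLike extends PartitionDependent     // must call initialize() first

assert((new LiteralLike).isInitialized)
assert(!(new RandLike).isInitialized)
```

This inverts the default: forgetting to mark an expression deterministic costs a redundant initialize call, whereas today forgetting to mark one nondeterministic silently produces wrong values.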

* Trait for expressions that require initialization based on the partition index prior to task
* execution on a partition.
*/
trait PartitionDependent {
Contributor:

as discussed offline this is StatefulExpression

Contributor (Author):

Actually, Projection also needs the same trait. Shall we call it PartitionedStateful?

@SparkQA commented Nov 1, 2016

Test build #67907 has finished for PR 15567 at commit 9dcb249.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr (Author) commented Nov 2, 2016

I reverted the changes I made to enforce Projection.initialize, which touched too many files, most of which don't really need to handle nondeterministic expressions. The current implementation at least throws a runtime error if a nondeterministic expression didn't get initialized, instead of returning incorrect results.

Renamed initializeStatesForPartition to initialize.

@SparkQA commented Nov 2, 2016

Test build #67957 has finished for PR 15567 at commit 553c6a5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 2, 2016

Test build #67959 has finished for PR 15567 at commit ababaa9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor) commented Nov 2, 2016

Merging in master/branch-2.1. Thanks.

asfgit pushed a commit that referenced this pull request Nov 2, 2016
[SPARK-14393][SQL] values generated by non-deterministic functions shouldn't change after coalesce or union
Author: Xiangrui Meng <meng@databricks.com>

Closes #15567 from mengxr/SPARK-14393.

(cherry picked from commit 02f2031)
Signed-off-by: Reynold Xin <rxin@databricks.com>
@asfgit asfgit closed this in 02f2031 Nov 2, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
[SPARK-14393][SQL] values generated by non-deterministic functions shouldn't change after coalesce or union
Author: Xiangrui Meng <meng@databricks.com>

Closes apache#15567 from mengxr/SPARK-14393.
val predicate = newPredicate(condition, child.output)
predicate.initialize(0)
Member:

Just wondering why FilterExec is not using the partition index to initialize the condition?
