Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-8797] [SPARK-9146] [SPARK-9145] [SPARK-9147] Support NaN ordering and equality comparisons in Spark SQL #7194

Closed
wants to merge 31 commits into from

Conversation

JoshRosen
Copy link
Contributor

This patch addresses an issue where queries that sorted float or double columns containing NaN values could fail with "Comparison method violates its general contract!" errors from TimSort. The root of this problem is that NaN > anything, NaN == anything, and NaN < anything all return false.

Per the design specified in SPARK-9079, we have decided that NaN = NaN should return true and that NaN should appear last when sorting in ascending order (i.e. it is larger than any other numeric value).

In addition to implementing these semantics, this patch also adds canonicalization of NaN values in UnsafeRow, which is necessary in order to be able to do binary equality comparisons on equal NaNs that might have different bit representations (see SPARK-9147).

@SparkQA
Copy link

SparkQA commented Jul 2, 2015

Test build #36419 has finished for PR 7194 at commit 630ebc5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Copy link
Contributor Author

One subtlety: there can be multiple float / double bitpatterns that are NaN, so clustered sorting based on the bitpatterns is not always sufficient to properly implement COUNT DISTINCT over a set of grouping columns which may contain NaN values.

@davies
Copy link
Contributor

davies commented Jul 13, 2015

@JoshRosen There are also some problems when join or aggregation (NaN is used a part of key in HashMap). I prefer to turn all NaN into null (during inbound conversion and update mutable row).

@JoshRosen
Copy link
Contributor Author

I'm going to close this PR for now while we explore whether to do the NaN -> null conversions. If we decide not to go with that approach, then we can re-open and revisit.

@JoshRosen JoshRosen closed this Jul 14, 2015
@JoshRosen
Copy link
Contributor Author

Re-opening this after some discussion with @rxin; I'm going to re-work this so that NaN is treated as the maximum value when sorting. Note that this will not fix some of the more general correctness issues with NaNs that appear in grouping keys, etc., but it will at least prevent crashes.

@JoshRosen JoshRosen reopened this Jul 18, 2015
@JoshRosen JoshRosen changed the title [SPARK-8797] [WIP] Fix comparison of NaN values in Spark SQL [SPARK-8797] [SPARK-9146] [WIP] Fix comparison of NaN values in Spark SQL Jul 18, 2015
@JoshRosen
Copy link
Contributor Author

(Still in the process of cleaning this up; pushing only so I can view diffs more nicely in GitHub).

@SparkQA
Copy link

SparkQA commented Jul 18, 2015

Test build #37683 has finished for PR 7194 at commit d907b5b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class NaiveBayes(override val uid: String)
    • class KMeans(override val uid: String) extends Estimator[KMeansModel] with KMeansParams
    • class KMeansModel(JavaModel):
    • class KMeans(JavaEstimator, HasFeaturesCol, HasMaxIter, HasSeed):
    • class PCA(JavaEstimator, HasInputCol, HasOutputCol):
    • class PCAModel(JavaModel):
    • abstract class Expression extends TreeNode[Expression] with Product
    • trait Generator extends Expression
    • abstract class UnaryLogExpression(f: Double => Double, name: String)
    • case class Conv(numExpr: Expression, fromBaseExpr: Expression, toBaseExpr: Expression)
    • case class Log(child: Expression) extends UnaryLogExpression(math.log, "LOG")
    • case class Log10(child: Expression) extends UnaryLogExpression(math.log10, "LOG10")
    • case class Log1p(child: Expression) extends UnaryLogExpression(math.log1p, "LOG1P")
    • trait NamedExpression extends Expression
    • abstract class Attribute extends LeafExpression with NamedExpression
    • case class IsNaN(child: Expression) extends UnaryExpression
    • abstract class LogicalPlan extends QueryPlan[LogicalPlan] with Logging with Product
    • abstract class SparkPlan extends QueryPlan[SparkPlan] with Logging with Product with Serializable
    • trait HashSemiJoin

@SparkQA
Copy link

SparkQA commented Jul 18, 2015

Test build #37682 has finished for PR 7194 at commit 630ebc5.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@@ -97,7 +93,8 @@ class UnsafeExternalSortSuite extends SparkPlanTest with BeforeAndAfterAll {
inputDf,
UnsafeExternalSort(sortOrder, global = true, _: SparkPlan, testSpillFrequency = 23),
Sort(sortOrder, global = true, _: SparkPlan),
sortAnswers = false
sortAnswers = false,
compareStrings = true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of comparing as String, could you update the Row.equals() to handle NaN (we need to do this eventually)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, good point; I'll just roll the equality changes into this patch.

@JoshRosen JoshRosen changed the title [SPARK-8797] [SPARK-9146] Fix comparison of NaN values in Spark SQL [SPARK-8797] [SPARK-9146] [SPARK-9145] Support NaN ordering and equality comparisons in Spark SQL Jul 19, 2015
def testNaN(nan: Expression): Unit = {
checkEvaluation(nan === nan, true)
checkEvaluation(nan <=> nan, true)
// checkEvaluation(nan <= nan, true)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interestingly, this test case fails even though I updated GeneratedOrdering and the interpreted orderings to support our defined NaN semantics. This implies that we may be using the wrong ordering in the implementation of these expressions.

If it turns out that those expressions are mis-handling orderings in a more general way, then I'll open a separate PR to fix that (I suspect that we'll see similar failures when trying to order byte arrays).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I also see now that I should remove this test and add NaN literals to the equalValues list above.

@yjshen
Copy link
Member

yjshen commented Jul 19, 2015

@JoshRosen , Get it, thanks for explanation.

@JoshRosen JoshRosen changed the title [SPARK-8797] [SPARK-9146] [SPARK-9145] Support NaN ordering and equality comparisons in Spark SQL [SPARK-8797] [SPARK-9146] [SPARK-9145] [SPARK-9147] Support NaN ordering and equality comparisons in Spark SQL Jul 19, 2015
@JoshRosen
Copy link
Contributor Author

Alright, I've updated this to fix the binary comparison expression issues and have also implemented canonicalization of NaN values in UnsafeRow.

}
case f1: Float =>
if (!o2.isInstanceOf[Float] ||
(java.lang.Float.isNaN(f1) && !java.lang.Float.isNaN(o2.asInstanceOf[Float]))) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should compare o2 and o1, can we call nanSafeCompare() ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Argh; looks like I was a bit sloppy here. Yeah, the few extra comparisons in nanSafeCompare isn't a big deal; I'll update this to use that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, I don't think that we can use nanSafeCompare without breaking existing test code / user code: the old code would allow integers and floats to be compared because Java would handle implicit type conversions. Therefore, for compatibility I think we need to do the same here.

I think it will be clearer to rework this as something like "if f1 is a float and it's NaN, then the other value had better be a NaN float, otherwise fall back to the regular == branch).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's better.

@SparkQA
Copy link

SparkQA commented Jul 19, 2015

Test build #37748 has finished for PR 7194 at commit 7fe67af.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 19, 2015

Test build #37750 has finished for PR 7194 at commit fbb2a29.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 19, 2015

Test build #37751 has finished for PR 7194 at commit a702e2e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class UnresolvedAttribute(nameParts: Seq[String]) extends Attribute with Unevaluable
    • case class UnresolvedFunction(name: String, children: Seq[Expression])
    • case class UnresolvedStar(table: Option[String]) extends Star with Unevaluable
    • case class ResolvedStar(expressions: Seq[NamedExpression]) extends Star with Unevaluable
    • case class UnresolvedAlias(child: Expression)
    • case class Cast(child: Expression, dataType: DataType)
    • trait Unevaluable
    • case class SortOrder(child: Expression, direction: SortDirection)
    • trait AggregateExpression extends Expression with Unevaluable
    • case class Abs(child: Expression)
    • trait CodegenFallback
    • case class CreateArray(children: Seq[Expression]) extends Expression with CodegenFallback
    • case class CreateStruct(children: Seq[Expression]) extends Expression with CodegenFallback
    • case class CreateNamedStruct(children: Seq[Expression]) extends Expression with CodegenFallback
    • case class CurrentDate() extends LeafExpression with CodegenFallback
    • case class CurrentTimestamp() extends LeafExpression with CodegenFallback
    • case class Explode(child: Expression) extends UnaryExpression with Generator with CodegenFallback
    • case class Literal protected (value: Any, dataType: DataType)
    • case class Hex(child: Expression)
    • case class Unhex(child: Expression)
    • case class PrettyAttribute(name: String) extends Attribute with Unevaluable
    • case class In(value: Expression, list: Seq[Expression]) extends Predicate with CodegenFallback
    • case class NewSet(elementType: DataType) extends LeafExpression with CodegenFallback
    • case class AddItemToSet(item: Expression, set: Expression)
    • case class CombineSets(left: Expression, right: Expression)
    • case class CountSet(child: Expression) extends UnaryExpression with CodegenFallback
    • case class Upper(child: Expression)
    • case class StringFormat(children: Expression*) extends Expression with CodegenFallback
    • case class StringSpace(child: Expression)
    • case class Ascii(child: Expression)
    • case class Base64(child: Expression)
    • case class UnBase64(child: Expression)

@SparkQA
Copy link

SparkQA commented Jul 19, 2015

Test build #37753 has finished for PR 7194 at commit 88bd73c.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • case class UnresolvedAttribute(nameParts: Seq[String]) extends Attribute with Unevaluable
    • case class UnresolvedFunction(name: String, children: Seq[Expression])
    • case class UnresolvedStar(table: Option[String]) extends Star with Unevaluable
    • case class ResolvedStar(expressions: Seq[NamedExpression]) extends Star with Unevaluable
    • case class UnresolvedAlias(child: Expression)
    • case class Cast(child: Expression, dataType: DataType)
    • trait Unevaluable
    • case class SortOrder(child: Expression, direction: SortDirection)
    • trait AggregateExpression extends Expression with Unevaluable
    • case class Abs(child: Expression)
    • trait CodegenFallback
    • case class CreateArray(children: Seq[Expression]) extends Expression with CodegenFallback
    • case class CreateStruct(children: Seq[Expression]) extends Expression with CodegenFallback
    • case class CreateNamedStruct(children: Seq[Expression]) extends Expression with CodegenFallback
    • case class CurrentDate() extends LeafExpression with CodegenFallback
    • case class CurrentTimestamp() extends LeafExpression with CodegenFallback
    • case class Explode(child: Expression) extends UnaryExpression with Generator with CodegenFallback
    • case class Literal protected (value: Any, dataType: DataType)
    • case class Hex(child: Expression)
    • case class Unhex(child: Expression)
    • case class PrettyAttribute(name: String) extends Attribute with Unevaluable
    • case class In(value: Expression, list: Seq[Expression]) extends Predicate with CodegenFallback
    • case class NewSet(elementType: DataType) extends LeafExpression with CodegenFallback
    • case class AddItemToSet(item: Expression, set: Expression)
    • case class CombineSets(left: Expression, right: Expression)
    • case class CountSet(child: Expression) extends UnaryExpression with CodegenFallback
    • case class Upper(child: Expression)
    • case class StringFormat(children: Expression*) extends Expression with CodegenFallback
    • case class StringSpace(child: Expression)
    • case class Ascii(child: Expression)
    • case class Base64(child: Expression)
    • case class UnBase64(child: Expression)

@davies
Copy link
Contributor

davies commented Jul 19, 2015

LGTM

@SparkQA
Copy link

SparkQA commented Jul 19, 2015

Test build #37754 has finished for PR 7194 at commit 983d4fc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Copy link
Contributor Author

Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Jul 21, 2015

Test build #37877 has finished for PR 7194 at commit 983d4fc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Copy link
Contributor Author

Test failures are due to unrelated known flaky tests, so I'm going to merge this into master.

@rxin
Copy link
Contributor

rxin commented Jul 21, 2015

YAY

@asfgit asfgit closed this in c032b0b Jul 21, 2015
@JoshRosen JoshRosen deleted the nan branch August 29, 2016 19:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants