
[SPARK-23799][SQL] FilterEstimation.evaluateInSet produces division by zero in the case of an empty table with analyzed statistics #21052

Closed

Conversation

@mshtelma commented Apr 12, 2018

What changes were proposed in this pull request?

During evaluation of IN conditions, if the source DataFrame is represented by a plan that uses a Hive table with previously analyzed columns, and the plan has conditions on these fields that cannot be satisfied (which leads to an empty DataFrame), the FilterEstimation.evaluateInSet method produces a NumberFormatException and a ClassCastException.
To fix this bug, FilterEstimation.evaluateInSet first checks that the distinct count is not zero and that colStat.min and colStat.max are defined, and only then proceeds with the calculation. If at least one of these conditions is not satisfied, zero is returned.
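The guard described above can be sketched outside of Spark with simplified, hypothetical types: ColumnStat below is a stand-in for Spark's real column-statistics class, and evaluateInSetSelectivity is an illustrative name, not the actual method signature.

```scala
// Simplified stand-in for Spark's column statistics (illustrative, not the real API).
case class ColumnStat(distinctCount: Option[BigInt], min: Option[Any], max: Option[Any])

// Hypothetical sketch of the guarded selectivity estimate for an IN filter:
// before dividing by the number of distinct values (ndv), bail out with zero
// selectivity when the table is empty or min/max statistics are missing.
def evaluateInSetSelectivity(colStat: ColumnStat, querySetSize: Int): Option[Double] = {
  val ndv = colStat.distinctCount.getOrElse(BigInt(0))
  if (ndv == BigInt(0) || colStat.min.isEmpty || colStat.max.isEmpty) {
    Some(0.0) // empty table or missing stats: estimate zero matching rows
  } else {
    // Without the guard above, ndv == 0 would make this division fail.
    val newNdv = math.min(ndv.toDouble, querySetSize.toDouble)
    Some(newNdv / ndv.toDouble)
  }
}

println(evaluateInSetSelectivity(ColumnStat(Some(0), None, None), 3))       // empty table
println(evaluateInSetSelectivity(ColumnStat(Some(10), Some(1), Some(10)), 3))
```

This mirrors only the shape of the fix: check the degenerate cases first, so the division by ndv is never reached for an empty table.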

How was this patch tested?

Two tests were implemented for this PR: one in FilterEstimationSuite, which tests a plan whose statistics violate the conditions mentioned above, and another in StatisticsCollectionSuite, which tests the whole process of analysis/optimization of a query that leads to the problems mentioned in the first section.

Mykhailo Shtelma added 3 commits April 12, 2018 09:42
…ision by zero can occur. In order to fix this, a check was added.
… IN conditions, if the source table is empty, division by zero can occur. In order to fix this, a check was added.
…ich were not satisfied) is queried and CBO is turned on, wrong statistics are used, which leads to ClassCastException in FilterEstimation.evaluateInSet
@mshtelma (Author)

Regarding the division by zero in EstimationUtils.scala#L166, I was not able to reproduce it here.
I can add a check there too, in order to be really sure that this never happens.

@maropu (Member) commented Apr 12, 2018

@wzhfy @gatorsmile could you trigger the tests?

@gatorsmile (Member)

ok to test

@gatorsmile (Member)

cc @wzhfy Please review this.

@SparkQA commented Apr 13, 2018

Test build #89316 has finished for PR 21052 at commit 74b6ebd.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy (Contributor) commented Apr 13, 2018

@mshtelma Usually we describe a PR using two sections: "What changes were proposed in this pull request?" and "How was this patch tested?". They should be in the template when we open a PR. Could you please update the PR description based on the template?

@wzhfy (Contributor) commented Apr 13, 2018

retest this please

test("evaluateInSet with all zeros") {
  validateEstimatedStats(
    Filter(InSet(attrString, Set(3, 4, 5)),
      StatsTestPlan(Seq(attrString), 10,
Contributor: change rowCount from 10 to 0? This is more reasonable for an empty table.

Author: Yes, this makes sense. Done.

Filter(InSet(attrString, Set(3, 4, 5)),
  StatsTestPlan(Seq(attrString), 10,
    AttributeMap(Seq(attrString ->
      ColumnStat(distinctCount = Some(0), min = Some(0), max = Some(0),
Contributor: min and max should be None?

Author: done


import testImplicits._

test("Simple queries must be working, if CBO is turned on") {
Contributor: Shall we move it to StatisticsCollectionSuite?
And I think a simple EXPLAIN command on an empty table can just cover the case? We can check the plan's stats (e.g. rowCount == 0) after explain.

Author: I have moved the test to StatisticsCollectionSuite. Done.

val validQuerySet = hSet.filter { v =>
v != null && statsInterval.contains(Literal(v, dataType))
}
if (colStat.min.isDefined && colStat.max.isDefined) {
Contributor: check ndv == 0 at the beginning and return Some(0.0)? Then we don't have to make all these changes.

Author: Yes, I have removed the bigger if and implemented all three checks with one small if, and reduced the number of changed lines in FilterEstimation.evaluateInSet.
@SparkQA commented Apr 13, 2018

Test build #89339 has finished for PR 21052 at commit 74b6ebd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 13, 2018

Test build #89349 has finished for PR 21052 at commit 0faa789.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mshtelma changed the title from "[SPARK-23799] FilterEstimation.evaluateInSet produces division by zero in the case of an empty table with analyzed statistics" to "[SPARK-23799][SQL] FilterEstimation.evaluateInSet produces division by zero in the case of an empty table with analyzed statistics" on Apr 18, 2018
@mshtelma (Author)

@wzhfy @maropu Hi guys, is there anything else I should add or change in this PR?

@gatorsmile (Member)

retest this please

@SparkQA commented Apr 18, 2018

Test build #89521 has finished for PR 21052 at commit 0faa789.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mshtelma (Author)

@gatorsmile The failed tests are not connected to the changes introduced in this PR. Would it make sense to run the tests again?

@maropu (Member) commented Apr 19, 2018

retest this please

@SparkQA commented Apr 20, 2018

Test build #89596 has finished for PR 21052 at commit 0faa789.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mshtelma (Author) commented Apr 20, 2018

@maropu Thank you!
This time it was a completely different set of tests (HiveClientSuites) that failed.

@kiszk (Member) commented Apr 20, 2018

retest this please

@SparkQA commented Apr 20, 2018

Test build #89656 has finished for PR 21052 at commit 0faa789.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -382,4 +382,34 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
}
}
}

test("Simple queries must be working, if CBO is turned on") {
withSQLConf(("spark.sql.cbo.enabled", "true")) {
Member: nit: withSQLConf(SQLConf.CBO_ENABLED.key -> "true")

Author: done

spark.sql("SELECT * FROM tbl2 WHERE fld3 IN ('qqq', 'qwe') ").explain()
}
}

Member: nit: drop this line

Author: done

}

}

Member: ditto

Author: done

.bucketBy(10, "id", "FLD1", "FLD2")
.sortBy("id", "FLD1", "FLD2")
.saveAsTable("TBL")
spark.sql("ANALYZE TABLE TBL COMPUTE STATISTICS ")
Member: nit: you don't need the spark. prefix

Author: done

WHERE t1.fld3 IN (-123.23,321.23)
""".stripMargin)
df2.createTempView("TBL2")
spark.sql("SELECT * FROM tbl2 WHERE fld3 IN ('qqq', 'qwe') ").explain()
Member: Why is this explain() called?

Author: @wzhfy suggested calling explain() in order to trigger query optimization and the FilterEstimation.evaluateInSet method.
I can call collect() instead, but I think explain() is sufficient for this test.

@mshtelma (Author)

@maropu thank you for the suggestions! I have implemented them and pushed the changes.

@SparkQA commented Apr 21, 2018

Test build #89675 has finished for PR 21052 at commit 8d21488.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

WHERE t1.fld3 IN (-123.23,321.23)
""".stripMargin)
df2.createTempView("TBL2")
sql("SELECT * FROM tbl2 WHERE fld3 IN ('qqq', 'qwe') ").explain()
Member: Please do not use explain(). It will output the strings to the console. You can just do this:

sql("SELECT * FROM tbl2 WHERE fld3 IN ('qqq', 'qwe')").queryExecution.executedPlan

Author: done

FROM tbl t1
JOIN tbl t2 on t1.id=t2.id
WHERE t1.fld3 IN (-123.23,321.23)
""".stripMargin)
Member: Nit:

          """
            |SELECT t1.id, t1.fld1, t1.fld2, t1.fld3
            |FROM tbl t1
            |JOIN tbl t2 on t1.id=t2.id
            |WHERE  t1.fld3 IN (-123.23,321.23)
          """.stripMargin)

Author: done

@gatorsmile (Member)

LGTM except two minor comments.

@mshtelma (Author)

@gatorsmile I have removed explain() and changed the formatting.

@SparkQA commented Apr 22, 2018

Test build #89684 has finished for PR 21052 at commit 8369cbc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member)

Thanks! Merged to master/2.3

asfgit pushed a commit that referenced this pull request Apr 22, 2018
…y zero in a case of empty table with analyzed statistics

>What changes were proposed in this pull request?

During evaluation of IN conditions, if the source DataFrame is represented by a plan that uses a Hive table with previously analyzed columns, and the plan has conditions on these fields that cannot be satisfied (which leads to an empty DataFrame), the FilterEstimation.evaluateInSet method produces a NumberFormatException and a ClassCastException.
To fix this bug, FilterEstimation.evaluateInSet first checks that the distinct count is not zero and that colStat.min and colStat.max are defined, and only then proceeds with the calculation. If at least one of these conditions is not satisfied, zero is returned.

>How was this patch tested?

Two tests were implemented for this PR: one in FilterEstimationSuite, which tests a plan whose statistics violate the conditions mentioned above, and another in StatisticsCollectionSuite, which tests the whole process of analysis/optimization of a query that leads to the problems mentioned in the first section.

Author: Mykhailo Shtelma <mykhailo.shtelma@bearingpoint.com>
Author: smikesh <mshtelma@gmail.com>

Closes #21052 from mshtelma/filter_estimation_evaluateInSet_Bugs.

(cherry picked from commit c48085a)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
@asfgit asfgit closed this in c48085a Apr 22, 2018
@tdas (Contributor) commented Apr 23, 2018

@gatorsmile (Member)

Let me revert it from Spark 2.3

@gatorsmile (Member)

@mshtelma Could you submit a backport PR to Spark 2.3?

@@ -392,6 +392,10 @@ case class FilterEstimation(plan: Filter) extends Logging {
val dataType = attr.dataType
var newNdv = ndv

if (ndv.toDouble == 0 || colStat.min.isEmpty || colStat.max.isEmpty) {
Contributor: Why does colStat.min.isEmpty || colStat.max.isEmpty mean empty output? A string type always has no max/min.

Member: Yeah, we need to correct it in the next PR.
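The concern raised above can be illustrated with a simplified, hypothetical stand-in (ColumnStat below is illustrative, not Spark's actual class): a string column legitimately carries no min/max statistics, so a guard that treats missing min/max as "empty table" misfires for a non-empty string column.

```scala
// Simplified stand-in for Spark's column statistics (illustrative only).
case class ColumnStat(distinctCount: Option[BigInt], min: Option[Any], max: Option[Any])

// Sketch of the merged guard from this PR: treat the column as producing
// zero rows when ndv is 0 or when min/max statistics are absent.
def guardSaysEmpty(colStat: ColumnStat): Boolean =
  colStat.distinctCount.getOrElse(BigInt(0)) == BigInt(0) ||
    colStat.min.isEmpty || colStat.max.isEmpty

// A genuinely empty table: the guard is correct here.
val emptyStat = ColumnStat(distinctCount = Some(0), min = None, max = None)

// A string column with 42 distinct values: strings carry no min/max,
// so the guard wrongly classifies the column as empty.
val stringStat = ColumnStat(distinctCount = Some(42), min = None, max = None)

println(guardSaysEmpty(emptyStat))  // true, as intended
println(guardSaysEmpty(stringStat)) // also true: the false positive being discussed
```

This is why the reviewers note that the min/max part of the check needs to be corrected in a follow-up PR.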

@mshtelma (Author)

@gatorsmile Should I create a new PR with these changes for the 2.3 branch? I will do this. Do we need a new JIRA for 2.3, or should I reference the existing one?

@gatorsmile (Member)

See my PR #21147. We need to fix the issue first.
