[SPARK-41378][SQL] Support Column Stats in DS v2 #38904

huaxingao · 2022-12-05T00:05:40Z

What changes were proposed in this pull request?

Support Col Stats in DS v2

Why are the changes needed?

Currently only table stats are supported in DS v2. Column stats should be supported too.

Does this PR introduce any user-facing change?

Yes
ColumnStatistics interface is introduced and added as a part of Statistics
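
To make the shape of the change concrete, here is a minimal sketch of a source reporting per-column stats alongside table-level stats. `ColStatsSketch`, `TableStatsSketch` and `ExampleStats` are hypothetical, simplified stand-ins written for this example, not the Spark interfaces themselves: the real `Statistics` keys the map by `NamedReference` rather than by a plain string, and the sketch uses the plain-map return type that the follow-up settled on.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalLong;

// Hypothetical stand-in for ColumnStatistics: every stat is optional,
// so a source only overrides what it can actually provide.
interface ColStatsSketch {
  default OptionalLong distinctCount() { return OptionalLong.empty(); }
  default OptionalLong nullCount() { return OptionalLong.empty(); }
}

// Hypothetical stand-in for Statistics with the new column-level accessor.
interface TableStatsSketch {
  OptionalLong sizeInBytes();
  OptionalLong numRows();
  // Assumption: keyed by plain column name to keep the sketch self-contained.
  default Map<String, ColStatsSketch> columnStats() { return new HashMap<>(); }
}

// Example source that knows NDV and null count for a single column "id".
final class ExampleStats implements TableStatsSketch {
  @Override public OptionalLong sizeInBytes() { return OptionalLong.of(4096L); }
  @Override public OptionalLong numRows() { return OptionalLong.of(100L); }

  @Override public Map<String, ColStatsSketch> columnStats() {
    Map<String, ColStatsSketch> stats = new HashMap<>();
    stats.put("id", new ColStatsSketch() {
      @Override public OptionalLong distinctCount() { return OptionalLong.of(100L); }
      @Override public OptionalLong nullCount() { return OptionalLong.of(0L); }
    });
    return stats;
  }
}
```

A source that has no column stats would simply not override `columnStats()` and inherit the empty default.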

How was this patch tested?

new test

sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala

sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2Suite.scala

dongjoon-hyun · 2022-12-05T06:11:47Z

Thank you for the updates.

sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryBaseTable.scala

dongjoon-hyun

cc @aokolnychyi , @sunchao , @liancheng

huaxingao · 2022-12-05T07:10:07Z

also cc @cloud-fan

sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java

sunchao

Also curious how this is to be used by Spark

sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java

sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/Histogram.java

sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/Statistics.java

viirya · 2022-12-06T21:35:23Z

sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryBaseTable.scala

+        // put some fake data for testing only
+        val bin1 = InMemoryHistogramBin(1, 2, 5L)
+        val bin2 = InMemoryHistogramBin(3, 4, 5L)
+        val bin3 = InMemoryHistogramBin(5, 6, 5L)
+        val bin4 = InMemoryHistogramBin(7, 8, 5L)
+        val bin5 = InMemoryHistogramBin(9, 10, 5L)


Hmm, not sure if fake statistics will cause unexpected results later? Ideally we should compute real statistics like sizeInBytes and numRows from the data.

If it's too complicated, maybe we can just compute max/min for test purposes.

I removed the fake data and computed NDV and null count for testing purposes.
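
For readers unfamiliar with the two statistics, a minimal sketch of computing them from in-memory column values could look like this. `ColumnStatsCalc` is a hypothetical helper, not the code in `InMemoryBaseTable`, and excluding nulls from the distinct count is an assumption of this sketch.

```java
import java.util.List;
import java.util.Objects;

// Hypothetical helper: derives NDV and null count from a column's values.
final class ColumnStatsCalc {
  // Number of distinct values; nulls are excluded (assumption of this sketch).
  static long distinctCount(List<?> values) {
    return values.stream().filter(Objects::nonNull).distinct().count();
  }

  // Number of null entries in the column.
  static long nullCount(List<?> values) {
    return values.stream().filter(Objects::isNull).count();
  }
}
```

For the values `1, 2, 2, null, 3, null` this gives a distinct count of 3 and a null count of 2.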

huaxingao · 2022-12-07T07:26:49Z

Also curious how this is to be used by Spark

The newly added ColumnStatistics is converted to logical ColumnStat in this method and is used in CBO

cloud-fan · 2022-12-07T08:05:04Z

sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java

+  /**
+   * @return number of distinct values in the column
+   */
+  default Optional<BigInteger> distinctCount() {


In CBO, we need the distinct count as a BigInteger because the estimated row count can be very large due to join, generate, etc. But for a single table, do we really need BigInteger?

So, do you suggest java.util.OptionalLong?

Changed to OptionalLong. Thanks for the suggestion!
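
A rough illustration of why `OptionalLong` is enough at the connector boundary: a per-table count fits in a `long`, and it can be widened to `BigInteger` at the point where the optimizer needs it. `DistinctCountConversion` is a hypothetical helper for this sketch, not the conversion code in Spark.

```java
import java.math.BigInteger;
import java.util.Optional;
import java.util.OptionalLong;

// Hypothetical helper: widens a connector-side count for optimizer-side use.
final class DistinctCountConversion {
  static Optional<BigInteger> toBigInteger(OptionalLong count) {
    // OptionalLong has no map(), so branch on presence explicitly.
    return count.isPresent()
        ? Optional.of(BigInteger.valueOf(count.getAsLong()))
        : Optional.empty();
  }
}
```

An absent count stays absent after the conversion, so a source that reports nothing is not mistaken for one that reports zero.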

sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java

...lyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala

dongjoon-hyun

+1, LGTM from my side.

dongjoon-hyun · 2022-12-07T21:16:42Z

Merged to master for Apache Spark 3.4.0. Thank you, @huaxingao and all!

huaxingao · 2022-12-07T21:23:32Z

Thank you all very much!

cloud-fan · 2022-12-09T08:45:19Z

sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/Statistics.java

@@ -31,4 +35,7 @@
 public interface Statistics {
  OptionalLong sizeInBytes();
  OptionalLong numRows();
+  default Optional<Map<NamedReference, ColumnStatistics>> columnStats() {


Shall we use an empty map to indicate no column stats? Catalyst column stats also use a map directly.
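
The difference on the caller side can be sketched as follows. `ColumnStatsLookup` is a hypothetical class for this example and the map value type is simplified to `Long`; it only contrasts the two return shapes under discussion.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical caller code contrasting the two return shapes.
final class ColumnStatsLookup {
  // Originally proposed shape: "no stats" is an empty Optional,
  // so callers unwrap before they can touch the map.
  static int countWithOptional(Optional<Map<String, Long>> stats) {
    return stats.map(Map::size).orElse(0);
  }

  // Suggested shape: "no stats" is an empty map, so callers use it directly.
  static int countWithMap(Map<String, Long> stats) {
    return stats.size();
  }
}
```

With a plain map there is also only one way to say "no column stats", whereas `Optional<Map>` allows both an empty `Optional` and a present but empty map.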

cloud-fan · 2022-12-09T08:53:00Z

sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryBaseTable.scala

+      val colNames = tableSchema.fields.map(_.name)
+      var i = 0
+      for (col <- colNames) {
+        val fieldReference = FieldReference(col)


Use FieldReference.column(col) here since it's a plain column name; FieldReference.apply parses the string.
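
Why the distinction matters can be sketched with a column whose name contains a dot. `FieldRefSketch` below is a hypothetical stand-in, and its naive dot split is an assumption made for illustration only; it is not Spark's identifier parser, which also handles quoting.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in contrasting "parse the string" with "take it as-is".
final class FieldRefSketch {
  // Parsing stand-in: dots separate the parts of a nested field reference.
  static List<String> parse(String name) {
    return Arrays.asList(name.split("\\."));
  }

  // Plain-column stand-in: the whole string is a single name part.
  static List<String> column(String name) {
    return Collections.singletonList(name);
  }
}
```

For a top-level column literally named `a.b`, the parsing variant yields two parts (`a`, `b`) while the plain-column variant keeps the single part `a.b`, which is what a loop over schema field names wants.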

HyukjinKwon · 2022-12-12T11:51:04Z

sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala

@@ -2772,6 +2773,26 @@ class DataSourceV2SQLSuiteV1Filter
    }
  }

+  test("SPARK-41378: test column stats") {


This test fails with Scala 2.13:

- SPARK-41378: test column stats *** FAILED *** (19 milliseconds)
  5 did not equal 3 (DataSourceV2SQLSuite.scala:2789)
  org.scalatest.exceptions.TestFailedException:
  at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
  at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
  at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
  at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
  at org.apache.spark.sql.connector.DataSourceV2SQLSuiteV1Filter$$anonfun$$nestedInanonfun$new$386$1.applyOrElse(DataSourceV2SQLSuite.scala:2789)
  at org.apache.spark.sql.connector.DataSourceV2SQLSuiteV1Filter$$anonfun$$nestedInanonfun$new$386$1.applyOrElse(DataSourceV2SQLSuite.scala:2782)
  at scala.PartialFunction$Lifted.apply(PartialFunction.scala:338)
  at scala.PartialFunction$Lifted.apply(PartialFunction.scala:334)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$collect$1(TreeNode.scala:326)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$collect$1$adapted(TreeNode.scala:326)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:285)
  at org.apache.spark.sql.catalyst.trees.TreeNode.collect(TreeNode.scala:326)
  at org.apache.spark.sql.connector.DataSourceV2SQLSuiteV1Filter.$anonfun$new$386(DataSourceV2SQLSuite.scala:2782)
  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  at org.scalatest.Transformer.apply(Transformer.scala:20)
  at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
  at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:207)
  at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
  at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)

https://github.com/apache/spark/actions/runs/3670384591/jobs/6204890447
https://github.com/apache/spark/actions/runs/3665545037/jobs/6196700142
https://github.com/apache/spark/actions/runs/3660066892/jobs/6186794437

Mind taking a look please?

Let me take a look at this.

Here is the followup PR to fix Scala 2.13.

[SPARK-41378][SQL][FOLLOWUP] Use toAttributeMap before comparison #39038

### What changes were proposed in this pull request?
follow-up PR

### Why are the changes needed?
to address comments #38904 (comment) #38904 (comment)

### Does this PR introduce _any_ user-facing change?
Change the return type of `columnStats()` from `Optional<Map>` to `Map`.

### How was this patch tested?
existing test

Closes #39027 from huaxingao/colstats_followup.

Authored-by: huaxingao <huaxin_gao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>


Suport Col Stats in DS v2 (5b5019f)
github-actions bot added the SQL label Dec 5, 2022
fix mima (c77252e)
github-actions bot added the BUILD label Dec 5, 2022
dongjoon-hyun reviewed Dec 5, 2022: sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala (outdated, resolved)
dongjoon-hyun reviewed Dec 5, 2022: sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2Suite.scala (outdated, resolved)
address comments (2a1422b)
github-actions bot removed the BUILD label Dec 5, 2022
remove unnecessary change (a6089ec)
dongjoon-hyun reviewed Dec 5, 2022: sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryBaseTable.scala (outdated, resolved)
dongjoon-hyun reviewed Dec 5, 2022
viirya reviewed Dec 6, 2022: sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java (outdated, resolved)
sunchao reviewed Dec 6, 2022
viirya reviewed Dec 6, 2022
address comments (0cddab9)
cloud-fan reviewed Dec 7, 2022: sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java (outdated, resolved)
cloud-fan reviewed Dec 7, 2022: sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java (resolved)
LuciferYang reviewed Dec 7, 2022: ...lyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala (outdated, resolved)
address comments and fix tests failure (5a872d8)
dongjoon-hyun approved these changes Dec 7, 2022
viirya approved these changes Dec 7, 2022
dongjoon-hyun closed this in 9684385 Dec 7, 2022
huaxingao deleted the colStats branch December 7, 2022 21:23
cloud-fan reviewed Dec 9, 2022
huaxingao mentioned this pull request Dec 12, 2022: [SPARK-41378][SQL][FOLLOWUP] DS V2 ColStats follow up #39027 (closed)
HyukjinKwon reviewed Dec 12, 2022
