
[SPARK-23445] ColumnStat refactoring #20624

Closed
wants to merge 5 commits

Conversation

juliuszsompolski
Contributor

@juliuszsompolski juliuszsompolski commented Feb 16, 2018

What changes were proposed in this pull request?

Refactor ColumnStat to be more flexible.

  • Split ColumnStat and CatalogColumnStat just like CatalogStatistics is split from Statistics. This detaches how the statistics are stored from how they are processed in the query plan. CatalogColumnStat keeps min and max as String, making it not depend on dataType information.
  • For CatalogColumnStat, parse column names from property names in the metastore (KEY_VERSION property), not from metastore schema. This means that CatalogColumnStats can be created for columns even if the schema itself is not stored in the metastore.
  • Make all fields optional. min, max and histogram for columns were optional already. Having them all optional is more consistent, and gives flexibility to e.g. drop some of the fields through transformations if they are difficult / impossible to calculate.

The added flexibility will make it possible to have alternative implementations for stats, and separates stats collection from stats and estimation processing in plans.
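The split described above can be illustrated with a simplified, hypothetical sketch (field names mirror the PR, but this is not the actual Spark implementation — the real `toPlanStat` dispatches on Spark's `DataType`, sketched here with a plain string tag):

```scala
// Plan-side representation: values are typed, used during estimation.
// All fields are optional, per the third bullet above.
case class ColumnStat(
    distinctCount: Option[BigInt] = None,
    min: Option[Any] = None,
    max: Option[Any] = None,
    nullCount: Option[BigInt] = None,
    avgLen: Option[Long] = None,
    maxLen: Option[Long] = None)

// Catalog-side representation: min/max are kept as Strings, so storing and
// loading them needs no dataType information.
case class CatalogColumnStat(
    distinctCount: Option[BigInt] = None,
    min: Option[String] = None,
    max: Option[String] = None,
    nullCount: Option[BigInt] = None,
    avgLen: Option[Long] = None,
    maxLen: Option[Long] = None) {

  // Converting to the plan-side stat is where the type is finally needed;
  // only the "int" case is sketched here.
  def toPlanStat(dataType: String): ColumnStat = {
    def parse(s: String): Any = dataType match {
      case "int" => s.toInt
      case _     => s
    }
    ColumnStat(distinctCount, min.map(parse), max.map(parse),
      nullCount, avgLen, maxLen)
  }
}
```

The key point of the split is that `CatalogColumnStat` can be written to and read from the metastore without knowing the column's type; the type is only consulted when converting to the plan-side `ColumnStat`.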

How was this patch tested?

Refactored existing tests to work with refactored ColumnStat and CatalogColumnStat.
New tests added in StatisticsSuite checking that backwards / forwards compatibility is not broken.

@juliuszsompolski
Contributor Author

cc @gatorsmile @cloud-fan @marmbrus

@SparkQA

SparkQA commented Feb 16, 2018

Test build #87500 has finished for PR 20624 at commit cf36020.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class CatalogColumnStat(

* [[ColumnStat.fromExternalString]].
*
* As part of the protocol, the returned map always contains a key called "version".
* In the case min/max values are null (None), they won't appear in the map.
Contributor

now all fields are optional, we should update this comment.

@@ -305,15 +260,15 @@ object ColumnStat extends Logging {
percentiles: Option[ArrayData]): ColumnStat = {
// The first 6 fields are basic column stats, the 7th is ndvs for histogram bins.
val cs = ColumnStat(
distinctCount = BigInt(row.getLong(0)),
distinctCount = Option(BigInt(row.getLong(0))),
Contributor

nit: we should use Some(value) if value is expected to be not null.

Contributor Author

I'd keep it an Option, just to be prepared for more flexibility and more optionality, unless you have a strong opinion. (note: this code has moved to AnalyzeColumnCommand)

@@ -32,13 +32,18 @@ object AggregateEstimation {
val childStats = agg.child.stats
// Check if we have column stats for all group-by columns.
val colStatsExist = agg.groupingExpressions.forall { e =>
e.isInstanceOf[Attribute] && childStats.attributeStats.contains(e.asInstanceOf[Attribute])
e.isInstanceOf[Attribute] && (
childStats.attributeStats.get(e.asInstanceOf[Attribute]) match {
Contributor

@cloud-fan cloud-fan Feb 20, 2018

nit: childStats.attributeStats.get(e.asInstanceOf[Attribute]).exists(_.hasCountStats)
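The reviewer's one-liner collapses the explicit pattern match into `Option.exists`. A standalone illustration with stub types (not Spark's classes) shows the two forms are equivalent:

```scala
// Stub stat type standing in for Spark's ColumnStat.
case class Stat(distinctCount: Option[BigInt], nullCount: Option[BigInt]) {
  def hasCountStats: Boolean = distinctCount.isDefined && nullCount.isDefined
}

val attributeStats: Map[String, Stat] = Map(
  "a" -> Stat(Some(10), Some(0)),  // full count stats
  "b" -> Stat(None, Some(0)))      // distinctCount missing

// Verbose form: pattern match on the Option.
def hasStatsVerbose(col: String): Boolean =
  attributeStats.get(col) match {
    case Some(s) => s.hasCountStats
    case None    => false
  }

// Suggested form: Option.exists returns false for None, so "present and
// satisfies the predicate" becomes a single call.
def hasStatsConcise(col: String): Boolean =
  attributeStats.get(col).exists(_.hasCountStats)
```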

nullCount: Option[BigInt] = None,
avgLen: Option[Long] = None,
maxLen: Option[Long] = None,
histogram: Option[Histogram] = None) {
Member

Nit: indents.


try {
Some(CatalogColumnStat(
distinctCount = map.get(s"${colName}.${KEY_DISTINCT_COUNT}").map(v => BigInt(v.toLong)),
Member

Do we have migration issue here? Now, the key is changed. Can Spark 2.4 read the catalog prop wrote by Spark 2.3?

Contributor Author

The keys or format of stats in the metastore didn't change. After this patch it remains backwards compatible with stats created before.

What changed here is that the map passed here used to contain stats for just one column, stripped of the columnName prefix, and now I'm passing a map that has all statistics for all columns, with keys prefixed by columnName.

It reduces complexity in statsFromProperties, see https://github.com/apache/spark/pull/20624/files#diff-159191585e10542f013cb3a714f26075R1057
It used to create a filtered map for every column, stripping the prefix together with the column name.
Now it just passes the map of all columns' stat properties, and each column picks up what it needs.

I'll add a bit of doc / comments about that.
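The scheme the author describes can be sketched as follows. This is a hypothetical, standalone illustration of one flat property map for all columns, keys prefixed with the column name, and column names discovered by matching the `KEY_VERSION` property (the key names mimic the PR's `colName.distinctCount` convention but are not copied from Spark's code):

```scala
val KEY_VERSION = "version"

// One shared map holding stat properties for all columns.
val colStatsProps: Map[String, String] = Map(
  "a.version"       -> "1",
  "a.distinctCount" -> "10",
  "a.nullCount"     -> "0",
  "b.version"       -> "1",
  "b.distinctCount" -> "7")

// Find all column names by matching their KEY_VERSION property, so columns
// can be discovered even when the schema itself is not in the metastore.
val fieldNames: Set[String] = colStatsProps.keys
  .filter(_.endsWith(s".$KEY_VERSION"))
  .map(_.stripSuffix(s".$KEY_VERSION"))
  .toSet

// Each column reads the values it needs straight from the shared map;
// no per-column filtered copy is built.
def distinctCount(col: String): Option[BigInt] =
  colStatsProps.get(s"$col.distinctCount").map(v => BigInt(v.toLong))
```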

Member

Could you add a test case? BTW, forwards compatibility is also needed since Hive metastore is being shared by different Spark versions.

Contributor

IIUC the format doesn't change, we just change the way to save/restore stats in metastore, which looks cleaner.

Contributor Author

The format doesn't change.
There is an existing test, StatisticsSuite."verify serialized column stats after analyzing columns", which checks that the format of the serialized stats in the metastore doesn't change, by comparing it against a manually constructed map of properties.
I will add a test that verifies the other direction: it adds the properties manually as TBLPROPERTIES and verifies that they are parsed successfully.

Contributor Author

Added "verify column stats can be deserialized from tblproperties" test.

* The key is the name of the column and name of the field (e.g. "colName.distinctCount"),
* and the value is the string representation for the value.
* min/max values are stored as Strings. They can be deserialized using
* [[ColumnStat.fromExternalString]].
Member

nit: Shall we move fromExternalString from ColumnStat to CatalogColumnStat?

Contributor Author

I think that actually everything from ColumnStat object should move.
fromExternalString / toExternalString -> CatalogColumnStat

And also:
supportsDatatype / supportsHistogram -> AnalyzeColumnCommand
statExprs / rowToColumnStat -> AnalyzeColumnCommand
because they are tied to that specific method of stats collection.
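A hedged sketch of what the `toExternalString` / `fromExternalString` pair could look like once moved: serialization is trivial, and deserialization dispatches on the column's type (simplified here to a small sealed trait; the real methods take Spark's `DataType`):

```scala
// Simplified stand-in for Spark's DataType hierarchy.
sealed trait SimpleType
case object IntType    extends SimpleType
case object DoubleType extends SimpleType

// min/max are stored in the metastore as Strings...
def toExternalString(v: Any): String = v.toString

// ...and parsed back only when the column's type is known.
def fromExternalString(s: String, dataType: SimpleType): Any = dataType match {
  case IntType    => s.toInt
  case DoubleType => s.toDouble
}
```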

}

/** Convert [[CatalogColumnStat]] to [[ColumnStat]]. */
def toPlanStat(
Member

toPlanStat is the same name as CatalogStatistics.toPlanStat. Should we use toColumnStat?

Contributor Author

I intentionally made it the same.
CatalogStatistics.toPlanStat converts it to Statistics. CatalogColumnStat.toPlanStat converts it to ColumnStat. The name signifies that it is used to convert both of these objects to their counterparts that are used in the query plan.

@SparkQA

SparkQA commented Feb 25, 2018

Test build #87657 has finished for PR 20624 at commit 0406f52.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@gatorsmile gatorsmile left a comment

LGTM except a few comments.

colStats += field.name -> cs
}
val colStats = new mutable.HashMap[String, CatalogColumnStat]
val statPropsForField = new mutable.HashMap[String, mutable.HashMap[String, String]]
Member

This is useless.

}

// Find all the column names by matching the KEY_VERSION properties for them.
val fieldNames = colStatsProps.keys.filter {
Member

fieldNames is not being used.

val colStats = stats.attributeStats.get(col)
if (colStats.get.nullCount > 0) {
val colStats = stats.attributeStats.get(col).get
if (!colStats.hasCountStats || colStats.nullCount.get > 0) {
Member

Do we need to check whether it is defined before calling .get?

Contributor Author

hasCountStats == distinctCount.isDefined && nullCount.isDefined.
So if evaluation reaches the second operand of the ||, then hasCountStats is true, which implies nullCount.isDefined, and nullCount.get is safe.
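The author's short-circuit argument can be reproduced with a small stub (not Spark's code): when `hasCountStats` is false, `||` short-circuits and `nullCount.get` is never evaluated.

```scala
// Stub reproducing the guard under discussion.
case class Stat(distinctCount: Option[BigInt], nullCount: Option[BigInt]) {
  def hasCountStats: Boolean = distinctCount.isDefined && nullCount.isDefined
}

// Mirrors: if (!colStats.hasCountStats || colStats.nullCount.get > 0)
// The .get only runs when hasCountStats is true, i.e. nullCount is defined.
def rejected(colStats: Stat): Boolean =
  !colStats.hasCountStats || colStats.nullCount.get > 0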

avgLen = Option(row.getLong(4)),
maxLen = Option(row.getLong(5))
)
if (row.isNullAt(6) || !cs.nullCount.isDefined) {
Member

!cs.nullCount.isDefined -> cs.nullCount.isEmpty

case _: DecimalType => v.asInstanceOf[Decimal].toJavaBigDecimal
// This version of Spark does not use min/max for binary/string types so we ignore it.
case _ =>
throw new AnalysisException("Column statistics deserialization is not supported for " +
Member

deserialization -> serialization?

@SparkQA

SparkQA commented Feb 26, 2018

Test build #87671 has finished for PR 20624 at commit a006bab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

gatorsmile commented Feb 27, 2018

LGTM

Thanks! Merged to master

@asfgit asfgit closed this in 8077bb0 Feb 27, 2018