[SPARK-20881][SQL] Clearly document the mechanism to choose between two sources of statistics

## What changes were proposed in this pull request?

We now have two sources of statistics: Spark's stats and Hive's stats. Spark's stats are generated by running the "analyze" command in Spark. Once they are available, we respect them over Hive's.

This PR clearly documents, in the related code, the mechanism for choosing between these two sources of stats.

## How was this patch tested?

Not related.

Author: Zhenhua Wang <wzh_zju@163.com>

Closes #18105 from wzhfy/cboSwitchStats.
apache · May 28, 2017 · 9d0db5a
1 parent 24d3428
commit 9d0db5a
Showing 2 changed files with 5 additions and 1 deletion.
diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala
@@ -681,9 +681,11 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
       }
     }
 
-    // construct Spark's statistics from information in Hive metastore
+    // Restore Spark's statistics from information in Metastore.
     val statsProps = table.properties.filterKeys(_.startsWith(STATISTICS_PREFIX))
 
+    // Currently we have two sources of statistics: one from Hive and the other from Spark.
+    // In our design, if Spark's statistics is available, we respect it over Hive's statistics.
     if (statsProps.nonEmpty) {
       val colStats = new mutable.HashMap[String, ColumnStat]
 

diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala
@@ -434,6 +434,8 @@ private[hive] class HiveClientImpl(
       }
       val comment = properties.get("comment")
 
+      // Here we are reading statistics from Hive.
+      // Note that this statistics could be overridden by Spark's statistics if that's available.
       val totalSize = properties.get(StatsSetupConst.TOTAL_SIZE).map(BigInt(_))
       val rawDataSize = properties.get(StatsSetupConst.RAW_DATA_SIZE).map(BigInt(_))
       val rowCount = properties.get(StatsSetupConst.ROW_COUNT).map(BigInt(_)).filter(_ >= 0)
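
The precedence rule documented by the comments above can be sketched in isolation. This is a minimal, self-contained illustration rather than Spark's actual implementation: `TableStats`, `StatsPrecedence`, `sparkStats`, `hiveStats`, and `resolve` are hypothetical names, the table properties are modeled as a plain `Map[String, String]`, and the literal property keys (including the `spark.sql.statistics.` prefix standing in for `STATISTICS_PREFIX`) are assumptions made for the example. The fallback from `totalSize` to `rawDataSize` is likewise a simplification chosen here, not something this diff shows.

```scala
// Hypothetical, simplified model of table-level statistics (not Spark's class).
final case class TableStats(sizeInBytes: BigInt, rowCount: Option[BigInt])

object StatsPrecedence {
  // Assumed stand-in for STATISTICS_PREFIX in HiveExternalCatalog.
  val StatisticsPrefix = "spark.sql.statistics."

  // Statistics written by Spark's "analyze" command, stored as prefixed table properties.
  def sparkStats(props: Map[String, String]): Option[TableStats] = {
    val statsProps = props.filter { case (key, _) => key.startsWith(StatisticsPrefix) }
    statsProps.get(StatisticsPrefix + "totalSize").map { size =>
      TableStats(BigInt(size), statsProps.get(StatisticsPrefix + "numRows").map(BigInt(_)))
    }
  }

  // Statistics maintained by Hive, read from unprefixed table properties.
  def hiveStats(props: Map[String, String]): Option[TableStats] = {
    val totalSize = props.get("totalSize").map(BigInt(_))
    val rawDataSize = props.get("rawDataSize").map(BigInt(_))
    val rowCount = props.get("numRows").map(BigInt(_)).filter(_ >= 0)
    // Simplification for this sketch: prefer a positive totalSize, else a positive rawDataSize.
    totalSize.filter(_ > 0).orElse(rawDataSize.filter(_ > 0)).map(TableStats(_, rowCount))
  }

  // The documented rule: if Spark's statistics are available, they win over Hive's.
  def resolve(props: Map[String, String]): Option[TableStats] =
    sparkStats(props).orElse(hiveStats(props))
}
```

With both sets of properties present, `StatsPrecedence.resolve` returns the Spark-side values; with only Hive's properties present, it falls back to them; with neither, it returns `None`.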