[SPARK-22626][SQL] It deals with wrong Hive's statistics (zero rowCount) #19831

wangyum · 2017-11-28T11:02:03Z

What changes were proposed in this pull request?

This PR ensures that when Hive's statistics totalSize (or rawDataSize) is > 0, rowCount must also be > 0. Otherwise, the wrong statistics may cause OOM when CBO is enabled.

How was this patch tested?

unit tests

SparkQA · 2017-11-28T12:20:00Z

Test build #84255 has finished for PR 19831 at commit b16f88e.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-11-28T14:55:56Z

Test build #84259 has finished for PR 19831 at commit 5c43b2a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

wangyum · 2017-11-28T15:14:49Z

cc @wzhfy

wzhfy · 2017-11-29T02:02:38Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala

@@ -418,7 +418,7 @@ private[hive] class HiveClientImpl(
      // Note that this statistics could be overridden by Spark's statistics if that's available.
      val totalSize = properties.get(StatsSetupConst.TOTAL_SIZE).map(BigInt(_))
      val rawDataSize = properties.get(StatsSetupConst.RAW_DATA_SIZE).map(BigInt(_))
-      val rowCount = properties.get(StatsSetupConst.ROW_COUNT).map(BigInt(_)).filter(_ >= 0)
+      val rowCount = properties.get(StatsSetupConst.ROW_COUNT).map(BigInt(_)).filter(_ > 0)


Hive has a flag called StatsSetupConst.COLUMN_STATS_ACCURATE. If I remember correctly, this flag becomes false if the user changes table properties or table data. Can you check whether the flag exists in your case? If so, we can use the flag to decide whether to read statistics from Hive.

The root problem is that users can set "wrong" table properties. So if we want to prevent using wrong stats, we need to detect changes in properties. Otherwise your case can't be avoided.

StatsSetupConst.COLUMN_STATS_ACCURATE only indicates that the statistics have been updated; it cannot guarantee that they are correct:

    cat <<EOF > data
    1,1
    2,2
    3,3
    4,4
    5,5
    EOF
    hive -e "CREATE TABLE spark_22626(c1 int, c2 int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';"
    hive -e "LOAD DATA local inpath 'data' into table spark_22626;"
    hive -e "INSERT INTO table spark_22626 values(6, 6);"
    hive -e "desc extended spark_22626;"

The result is:

parameters:{totalSize=24, numRows=1, rawDataSize=3, COLUMN_STATS_ACCURATE={"BASIC_STATS":"true"}

numRows should be 6, but got 1.

Maybe this would be clearer:

    val rowCount = properties.get(StatsSetupConst.ROW_COUNT).map(BigInt(_))
    val stats = if (totalSize.isDefined && totalSize.get > 0L) {
      Some(CatalogStatistics(sizeInBytes = totalSize.get, rowCount = rowCount.filter(_ > 0)))
    } else if (rawDataSize.isDefined && rawDataSize.get > 0) {
      Some(CatalogStatistics(sizeInBytes = rawDataSize.get, rowCount = rowCount.filter(_ > 0)))
    } else {
      None
    }
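To make the effect of this logic concrete, here is a minimal, self-contained sketch. It is not the code in HiveClientImpl: `SimpleStats` and `readHiveStats` are hypothetical stand-ins for `CatalogStatistics` and the surrounding method, and the property keys are written as the literal strings shown in the `desc extended` output (`totalSize`, `rawDataSize`, `numRows`) instead of the `StatsSetupConst` constants.

```scala
object HiveStatsSketch {
  // Hypothetical stand-in for Spark's CatalogStatistics.
  final case class SimpleStats(sizeInBytes: BigInt, rowCount: Option[BigInt])

  // Use a size only when it is > 0, preferring totalSize over rawDataSize,
  // and keep the row count only when it is > 0.
  def readHiveStats(properties: Map[String, String]): Option[SimpleStats] = {
    val totalSize = properties.get("totalSize").map(BigInt(_))
    val rawDataSize = properties.get("rawDataSize").map(BigInt(_))
    val rowCount = properties.get("numRows").map(BigInt(_)).filter(_ > 0)

    totalSize.filter(_ > 0)
      .orElse(rawDataSize.filter(_ > 0))
      .map(size => SimpleStats(size, rowCount))
  }

  def main(args: Array[String]): Unit = {
    // Positive size but numRows = 0: the inconsistent case this PR targets.
    println(readHiveStats(Map("totalSize" -> "26596461123", "numRows" -> "0")))
    // Consistent statistics are passed through unchanged.
    println(readHiveStats(Map("totalSize" -> "24", "numRows" -> "6")))
  }
}
```

With this shape, a zero row count is dropped while the size is kept, so size-based decisions still work and row-count-based estimation is skipped.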

Thanks for the investigation. It seems Hive can't protect its stats properties.

wzhfy · 2017-11-29T02:08:05Z

BTW, the case here is not about join reorder, it's actually about broadcast decision. Could you update the title of this PR?

wzhfy · 2017-11-29T02:10:04Z

Besides, if the size stats totalSize or rawDataSize is wrong, the problem exists whether CBO is enabled or not. We need to change that in the title too.

wangyum · 2017-11-30T11:31:35Z

If CBO is enabled and outputRowCount == 0, getOutputSize returns 1, so sizeInBytes is 1 and this side can be broadcast:

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala

Lines 65 to 88 in b803b66

    
    def getOutputSize(
        attributes: Seq[Attribute],
        outputRowCount: BigInt,
        attrStats: AttributeMap[ColumnStat] = AttributeMap(Nil)): BigInt = {
      // We assign a generic overhead for a Row object, the actual overhead is different for different
      // Row format.
      val sizePerRow = 8 + attributes.map { attr =>
        if (attrStats.contains(attr)) {
          attr.dataType match {
            case StringType =>
              // UTF8String: base + offset + numBytes
              attrStats(attr).avgLen + 8 + 4
            case _ =>
              attrStats(attr).avgLen
          }
        } else {
          attr.dataType.defaultSize
        }
      }.sum
      // Output size can't be zero, or sizeInBytes of BinaryNode will also be zero
      // (simple computation of statistics returns product of children).
      if (outputRowCount > 0) outputRowCount * sizePerRow else 1
    }
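The arithmetic in the CBO case can be sketched as follows. This is a simplified model, not Spark's code: `defaultSizes` is a hypothetical list of per-column sizes standing in for `attributes`, `canBroadcast` is a simplified size check, and the 10 MB threshold is the default value of `spark.sql.autoBroadcastJoinThreshold`.

```scala
object CboSizeSketch {
  // Default of spark.sql.autoBroadcastJoinThreshold: 10 MB.
  val broadcastThreshold: BigInt = BigInt(10L * 1024 * 1024)

  // Simplified getOutputSize: 8 bytes of row overhead plus per-column sizes,
  // falling back to 1 when the row count is not positive.
  def outputSize(defaultSizes: Seq[Int], outputRowCount: BigInt): BigInt = {
    val sizePerRow = 8 + defaultSizes.sum
    if (outputRowCount > 0) outputRowCount * sizePerRow else BigInt(1)
  }

  // Simplified broadcast check: the estimated size must fit under the threshold.
  def canBroadcast(sizeInBytes: BigInt): Boolean =
    sizeInBytes >= 0 && sizeInBytes <= broadcastThreshold

  def main(args: Array[String]): Unit = {
    val twoIntColumns = Seq(4, 4)
    // A wrong row count of 0 collapses the estimate to 1 byte.
    val wrong = outputSize(twoIntColumns, BigInt(0))
    println(s"rowCount = 0: size = $wrong, canBroadcast = ${canBroadcast(wrong)}")
    // A hypothetical realistic row count gives an estimate far above the threshold.
    val realistic = outputSize(twoIntColumns, BigInt(1000000000L))
    println(s"rowCount = 1e9: size = $realistic, canBroadcast = ${canBroadcast(realistic)}")
  }
}
```

In this model, the zero row count alone is enough to make a very large table look broadcastable, regardless of its totalSize.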

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala

Lines 45 to 64 in e26dac5

    def estimate: Option[Statistics] = {
      if (childStats.rowCount.isEmpty) return None

      // Estimate selectivity of this filter predicate, and update column stats if needed.
      // For not-supported condition, set filter selectivity to a conservative estimate 100%
      val filterSelectivity = calculateFilterSelectivity(plan.condition).getOrElse(BigDecimal(1))

      val filteredRowCount: BigInt = ceil(BigDecimal(childStats.rowCount.get) * filterSelectivity)
      val newColStats = if (filteredRowCount == 0) {
        // The output is empty, we don't need to keep column stats.
        AttributeMap[ColumnStat](Nil)
      } else {
        colStatsMap.outputColumnStats(rowsBeforeFilter = childStats.rowCount.get,
          rowsAfterFilter = filteredRowCount)
      }
      val filteredSizeInBytes: BigInt = getOutputSize(plan.output, filteredRowCount, newColStats)

      Some(childStats.copy(sizeInBytes = filteredSizeInBytes, rowCount = Some(filteredRowCount),
        attributeStats = newColStats))
    }

If CBO is disabled, sizeInBytes = (p.child.stats.sizeInBytes * outputRowSize) / childRowSize and this side can't be broadcast:

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala

Lines 30 to 49 in ae253e5

    
    /**
     * A default, commonly used estimation for unary nodes. We assume the input row number is the
     * same as the output row number, and compute sizes based on the column types.
     */
    private def visitUnaryNode(p: UnaryNode): Statistics = {
      // There should be some overhead in Row object, the size should not be zero when there is
      // no columns, this help to prevent divide-by-zero error.
      val childRowSize = p.child.output.map(_.dataType.defaultSize).sum + 8
      val outputRowSize = p.output.map(_.dataType.defaultSize).sum + 8
      // Assume there will be the same number of rows as child has.
      var sizeInBytes = (p.child.stats.sizeInBytes * outputRowSize) / childRowSize
      if (sizeInBytes == 0) {
        // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also be zero
        // (product of children).
        sizeInBytes = 1
      }
      // Don't propagate rowCount and attributeStats, since they are not estimated here.
      Statistics(sizeInBytes = sizeInBytes, hints = p.child.stats.hints)
    }
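For comparison, the size-only path can be sketched the same way. Again this is a simplified model under stated assumptions: `unarySize` is a hypothetical reduction of visitUnaryNode to plain numbers, the column sizes and the child size are illustrative values, and the 10 MB threshold is the default of `spark.sql.autoBroadcastJoinThreshold`.

```scala
object SizeOnlySketch {
  // Default of spark.sql.autoBroadcastJoinThreshold: 10 MB.
  val broadcastThreshold: BigInt = BigInt(10L * 1024 * 1024)

  // Simplified visitUnaryNode: scale the child's size by the ratio of row widths,
  // never returning zero. The row count is not consulted at all.
  def unarySize(
      childSizeInBytes: BigInt,
      childColumnSizes: Seq[Int],
      outputColumnSizes: Seq[Int]): BigInt = {
    val childRowSize = childColumnSizes.sum + 8
    val outputRowSize = outputColumnSizes.sum + 8
    val size = (childSizeInBytes * outputRowSize) / childRowSize
    if (size == BigInt(0)) BigInt(1) else size
  }

  def main(args: Array[String]): Unit = {
    // Illustrative child: two int columns and a large size in bytes; project one column.
    val size = unarySize(BigInt("26596461123"), Seq(4, 4), Seq(4))
    println(s"size = $size, under threshold = ${size <= broadcastThreshold}")
  }
}
```

Because only the size is used here, a wrong zero row count does not shrink the estimate, which is why the broadcast decision differs between the two paths.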

wzhfy · 2017-12-01T01:35:45Z

Besides, if the size stats totalSize or rawDataSize is wrong, the problem exists whether CBO is enabled or not.

If CBO is enabled and outputRowCount == 0, getOutputSize returns 1, so sizeInBytes is 1 and this side can be broadcast:
If CBO is disabled, sizeInBytes = (p.child.stats.sizeInBytes * outputRowSize) / childRowSize and this side can't be broadcast:

@wangyum totalSize or rawDataSize can also be wrong, right?

wzhfy · 2017-12-01T01:42:46Z

Since Hive doesn't detect when a user sets wrong stats properties, I think this solution can alleviate the problem. Besides, it's consistent with what we do for totalSize and rawDataSize (only use the stats when > 0).

wangyum · 2017-12-01T05:26:42Z

Yes, I saw some of these tables in my cluster, but the user did not manually modify this parameter:

# Detailed Table Information		
Database	dw	
Table	prod	
Owner	bi	
Created Time	Tue Nov 03 16:33:52 CST 2015	
Last Access	Thu Jan 01 08:00:00 CST 1970	
Created By	Spark 2.2 or prior	
Type	EXTERNAL	
Provider	hive	
Comment	Product list	
Table Properties	[transient_lastDdlTime=1508260780, last_modified_time=1473154014, last_modified_by=bi]	
Statistics	26596461123 bytes, 0 rows	
Location	viewfs://cluster9/user/hive/warehouse/dw.db/prod	
Serde Library	org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe	
InputFormat	org.apache.hadoop.mapred.TextInputFormat	
OutputFormat	org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat	
Storage Properties	[serialization.format=1]	
Partition Provider	Catalog	
Time taken: 1.241 seconds, Fetched 70 row(s)

wangyum · 2017-12-01T09:09:36Z

cc @gatorsmile @cloud-fan

cloud-fan · 2017-12-01T15:34:31Z

Is it really an issue? If you manually set wrong statistics, what would you expect the system to do? I think data source tables don't allow you to set the statistics manually, so this problem is inherited from Hive. cc @wzhfy to confirm.

This PR treats a 0 row count as invalid, which is arguable: if we analyze an empty table, a 0 row count is valid.

wangyum · 2017-12-01T19:38:36Z

I'm not suggesting that users set table statistics manually; I set them this way only to simulate the statistics of these tables.
If totalSize (or rawDataSize) > 0 and rowCount = 0, at least one of the parameters is incorrect, and we should not optimize based on these incorrect statistics.

wzhfy · 2017-12-02T15:55:21Z

@cloud-fan Yes, Spark doesn't allow users to set (Spark's) statistics manually.

This PR only changes how a 0 row count in Hive's stats is treated; it doesn't affect the logic for Spark's stats. Besides, Spark currently only uses Hive's totalSize and rawDataSize when they are > 0. This PR changes the behavior for rowCount to be consistent with that, so I think it's fine. But the title of the PR should be more specific, i.e. it deals with wrong Hive's statistics (zero rowCount).

wzhfy · 2017-12-02T16:02:37Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala

@@ -1187,6 +1187,22 @@ class HiveQuerySuite extends HiveComparisonTest with SQLTestUtils with BeforeAnd
      }
    }
  }
+
+  test("Wrong Hive table statistics may trigger OOM if enables join reorder in CBO") {


IMHO you can just test the read logic for Hive's stats properties in StatisticsSuite instead of adding an end-to-end test case; developers may not understand what this test case is checking.

SparkQA · 2017-12-03T02:47:22Z

Test build #84394 has finished for PR 19831 at commit 5b744e3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-12-03T15:07:42Z

thanks, merging to master!

wangyum added 2 commits November 28, 2017 16:56

if dataSize > 0, rowCount should bigger than 0.

ed7352f

add test

b16f88e

wangyum changed the title ~~[SPARK-22489][SQL] Wrong Hive table statistics may trigger OOM if enables join reorder in CBO~~ [SPARK-22626][SQL] Wrong Hive table statistics may trigger OOM if enables join reorder in CBO Nov 28, 2017

fix test error

5c43b2a

wzhfy reviewed Nov 29, 2017


wangyum changed the title ~~[SPARK-22626][SQL] Wrong Hive table statistics may trigger OOM if enables join reorder in CBO~~ [SPARK-22626][SQL] Wrong Hive table statistics may trigger OOM if enables CBO Nov 30, 2017

wzhfy reviewed Dec 2, 2017


Move test to StatisticsSuite

5b744e3

wangyum changed the title ~~[SPARK-22626][SQL] Wrong Hive table statistics may trigger OOM if enables CBO~~ [SPARK-22626][SQL] t deals with wrong Hive's statistics (zero rowCount) Dec 3, 2017

wangyum changed the title ~~[SPARK-22626][SQL] t deals with wrong Hive's statistics (zero rowCount)~~ [SPARK-22626][SQL] It deals with wrong Hive's statistics (zero rowCount) Dec 3, 2017

asfgit closed this in dff440f Dec 3, 2017
