[SPARK-22626][SQL] Deal with wrong Hive statistics (zero rowCount) #19831
Conversation
Test build #84255 has finished for PR 19831 at commit
Test build #84259 has finished for PR 19831 at commit
cc @wzhfy
@@ -418,7 +418,7 @@ private[hive] class HiveClientImpl(
     // Note that this statistics could be overridden by Spark's statistics if that's available.
     val totalSize = properties.get(StatsSetupConst.TOTAL_SIZE).map(BigInt(_))
     val rawDataSize = properties.get(StatsSetupConst.RAW_DATA_SIZE).map(BigInt(_))
-    val rowCount = properties.get(StatsSetupConst.ROW_COUNT).map(BigInt(_)).filter(_ >= 0)
+    val rowCount = properties.get(StatsSetupConst.ROW_COUNT).map(BigInt(_)).filter(_ > 0)
Hive has a flag called StatsSetupConst.COLUMN_STATS_ACCURATE. If I remember correctly, this flag becomes false if the user changes table properties or table data. Can you check whether the flag exists in your case? If so, we can use it to decide whether to read statistics from Hive.
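To illustrate the idea, here is a hypothetical sketch (in Python, not Spark's actual code) of checking such a flag, assuming the property holds the JSON form Hive writes (e.g. `{"BASIC_STATS":"true"}`) or a legacy plain boolean:

```python
import json

def basic_stats_accurate(properties: dict) -> bool:
    """Return True only if Hive marked basic stats as accurate.

    Assumes COLUMN_STATS_ACCURATE holds JSON like {"BASIC_STATS":"true"};
    older Hive versions store a plain "true" instead.
    """
    raw = properties.get("COLUMN_STATS_ACCURATE")
    if raw is None:
        return False
    try:
        flags = json.loads(raw)
    except ValueError:
        # Legacy plain-boolean format, e.g. "true"
        return raw.lower() == "true"
    if isinstance(flags, dict):
        return flags.get("BASIC_STATS") == "true"
    return flags is True

# Properties as they might appear on a Hive table
props = {"numRows": "1", "COLUMN_STATS_ACCURATE": '{"BASIC_STATS":"true"}'}
print(basic_stats_accurate(props))          # True
print(basic_stats_accurate({"numRows": "1"}))  # False: flag absent
```

Under this approach, statistics would only be read from Hive when the flag check passes.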
The root problem is that the user can set "wrong" table properties. So if we want to avoid using wrong stats, we need to detect changes in the properties; otherwise your case can't be avoided.
StatsSetupConst.COLUMN_STATS_ACCURATE only indicates that the statistics have been updated; it cannot guarantee they are correct:
cat <<EOF > data
1,1
2,2
3,3
4,4
5,5
EOF
hive -e "CREATE TABLE spark_22626(c1 int, c2 int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';"
hive -e "LOAD DATA local inpath 'data' into table spark_22626;"
hive -e "INSERT INTO table spark_22626 values(6, 6);"
hive -e "desc extended spark_22626;"
The result is:
parameters:{totalSize=24, numRows=1, rawDataSize=3, COLUMN_STATS_ACCURATE={"BASIC_STATS":"true"}
numRows should be 6, but we got 1.
Maybe this could be clearer:
val rowCount = properties.get(StatsSetupConst.ROW_COUNT).map(BigInt(_))
val stats =
  if (totalSize.isDefined && totalSize.get > 0L) {
    Some(CatalogStatistics(sizeInBytes = totalSize.get, rowCount = rowCount.filter(_ > 0)))
  } else if (rawDataSize.isDefined && rawDataSize.get > 0) {
    Some(CatalogStatistics(sizeInBytes = rawDataSize.get, rowCount = rowCount.filter(_ > 0)))
  } else {
    None
  }
Thanks for the investigation. It seems Hive can't protect its stats properties.
BTW, the case here is not about join reorder; it's actually about the broadcast decision. Could you update the title of this PR?
Besides, if the size stats
If CBO is enabled, the relevant code is at Lines 65 to 88 in b803b66 and Lines 45 to 64 in e26dac5.
If CBO is disabled, the relevant code is at Lines 30 to 49 in ae253e5.
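To make the broadcast-decision point concrete, here is a simplified, hypothetical Python sketch (Spark's actual planner logic differs; the threshold name mirrors `spark.sql.autoBroadcastJoinThreshold`, default 10 MB) of why a stale rowCount matters:

```python
# With CBO, a plan's size can be estimated from rowCount * average row width.
# A stale rowCount of 1 makes a huge table look tiny, so it falls under the
# broadcast threshold and gets broadcast, which can OOM the driver/executors.

AUTO_BROADCAST_JOIN_THRESHOLD = 10 * 1024 * 1024  # 10 MB, in bytes

def estimated_size(row_count: int, avg_row_bytes: int) -> int:
    return row_count * avg_row_bytes

def should_broadcast(size_in_bytes: int) -> bool:
    return size_in_bytes <= AUTO_BROADCAST_JOIN_THRESHOLD

# Hive reported numRows=1 for a table that really has billions of rows:
stale = estimated_size(row_count=1, avg_row_bytes=100)
actual = estimated_size(row_count=2_000_000_000, avg_row_bytes=100)

print(should_broadcast(stale))   # True: the huge table would be broadcast
print(should_broadcast(actual))  # False
```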
@wangyum
Since Hive can't detect that a user has set wrong stats properties, I think this solution can alleviate the problem. Besides, it's consistent with what we do for
Yes, I saw some of these tables in my cluster, but the user did not manually modify this parameter.
Is it really an issue? If you manually set wrong statistics, what would you expect the system to do? I think data source tables don't allow you to set statistics manually, so this problem is inherited from Hive. cc @wzhfy to confirm. This PR treats a 0 row count as invalid, which is arguable: if we analyze an empty table, then a 0 row count is valid.
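The ambiguity being raised can be sketched in a few lines of Python (a minimal illustration, not Spark's code): filtering rowCount with `> 0` also discards a legitimate 0 produced by analyzing a genuinely empty table.

```python
def read_row_count(properties: dict):
    """Mirror of the PR's rowCount handling: keep only values > 0."""
    raw = properties.get("numRows")
    if raw is None:
        return None
    count = int(raw)
    return count if count > 0 else None

# A stale 0 and a valid empty-table 0 are indistinguishable here:
print(read_row_count({"numRows": "0"}))  # None -- both cases dropped
print(read_row_count({"numRows": "6"}))  # 6
```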
Instead of manually setting table statistics, I'm just trying to simulate the statistics of these tables in this way.
@cloud-fan Yes, Spark doesn't allow users to set (Spark's) statistics manually. This PR treats a 0 row count in Hive's stats as invalid; it doesn't affect the logic for Spark's stats. Besides, Spark currently only uses Hive's
@@ -1187,6 +1187,22 @@ class HiveQuerySuite extends HiveComparisonTest with SQLTestUtils with BeforeAnd
       }
     }
   }
+
+  test("Wrong Hive table statistics may trigger OOM if enables join reorder in CBO") {
IMHO you can just test the read logic for Hive's stats properties in StatisticsSuite, instead of an end-to-end test case; developers may not know what's going on from this test case.
Test build #84394 has finished for PR 19831 at commit
thanks, merging to master!
What changes were proposed in this pull request?
This PR ensures that when reading Hive's statistics, totalSize (or rawDataSize) must be > 0 and rowCount must also be > 0. Otherwise a wrong rowCount may cause OOM when CBO is enabled.

How was this patch tested?
Unit tests.
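The rule this PR settles on can be sketched as follows (a Python illustration assuming property names mirror Hive's StatsSetupConst keys, not Spark's actual code): pick a positive size stat for sizeInBytes, and attach rowCount only when it is greater than zero.

```python
def read_hive_stats(properties: dict):
    """Read table-level stats from Hive properties, rejecting non-positive values."""
    def positive(value):
        return int(value) if value is not None and int(value) > 0 else None

    size = positive(properties.get("totalSize")) or positive(properties.get("rawDataSize"))
    if size is None:
        return None  # no usable size stat; fall back to other estimation
    return {"sizeInBytes": size, "rowCount": positive(properties.get("numRows"))}

# With the stale properties from the reproduction above, a numRows of 0
# would now be dropped instead of feeding CBO a zero row count:
print(read_hive_stats({"totalSize": "24", "rawDataSize": "3", "numRows": "0"}))
# {'sizeInBytes': 24, 'rowCount': None}
```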