[SPARK-47444][SQL] Validate numeric table stats in ALTER TABLE SET TBLPROPERTIES #55550
shrirangmhalgi wants to merge 1 commit into apache:master from
Conversation
Force-pushed from f3aaea1 to 38f8a27
Could somebody please help with the PR review?
Hi @shrirangmhalgi, SPARK-30262 (cited as motivation) was a read-side issue caused by Hive Metastore's internal behavior, not by users writing invalid values. The JIRA argues that Hive validates these properties, but the context is different. In Hive,
What changes were proposed in this pull request?
This PR adds validation for the table statistics properties `numRows`, `totalSize`, and `rawDataSize` in `ALTER TABLE SET TBLPROPERTIES` to reject non-numeric values. The PR changes the following 4 files:
- `error-conditions.json`: Added a new error condition `INVALID_TABLE_STATS_VALUE` (SQLSTATE 22023) with the message: "The value <value> for table statistics property <key> is not a valid numeric value."
- `CheckAnalysis.scala`: Added a new `SetTableProperties(_, properties)` case match in `checkAnalysis0()` that validates that stats property values can be parsed as `BigInt`. This catches invalid values at analysis time for the v2 catalog code path (e.g., `ALTER TABLE ... SET TBLPROPERTIES` resolved through DataSourceV2).
- `ddl.scala`: Added the same validation in `AlterTableSetPropertiesCommand.run()` before properties are written to the catalog. This catches invalid values at execution time for the v1 catalog code path (Hive/in-memory catalog).
- `AlterTableSetTblPropertiesSuiteBase.scala`: Added a new test that covers both invalid and valid inputs for all three stats properties.

Why are the changes needed?
As reported in SPARK-47444, `ALTER TABLE SET TBLPROPERTIES` currently accepts empty strings and non-numeric values for `numRows`, `totalSize`, and `rawDataSize`. While SPARK-30262 added a defensive filter when reading stats (to avoid `NumberFormatException`), invalid values can still be written to the catalog. Downstream tools and applications that consume these stats may break or produce incorrect results. As mentioned in SPARK-47444, Hive and Beeline already validate these properties on write; Spark should do the same.
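A minimal, self-contained sketch of the kind of write-side check described above. The object and method names here are illustrative, not the actual patch: the real change lives in `CheckAnalysis.scala` and `AlterTableSetPropertiesCommand.run()` and raises an `AnalysisException` carrying the `INVALID_TABLE_STATS_VALUE` error condition rather than a plain `IllegalArgumentException`.

```scala
import scala.util.Try

// Illustrative sketch only; names are hypothetical.
object StatsPropertyValidation {
  // The three statistics properties the PR validates.
  private val statsKeys = Set("numRows", "totalSize", "rawDataSize")

  /** Throws if a stats property value cannot be parsed as a BigInt. */
  def validate(properties: Map[String, String]): Unit =
    properties.foreach { case (key, value) =>
      if (statsKeys.contains(key) && Try(BigInt(value)).isFailure) {
        // The actual patch uses error condition INVALID_TABLE_STATS_VALUE (SQLSTATE 22023).
        throw new IllegalArgumentException(
          s"The value $value for table statistics property $key is not a valid numeric value.")
      }
    }
}
```

Note that both `BigInt("")` and `BigInt("abc")` throw `NumberFormatException`, so a `BigInt`-parse check rejects empty strings along with other non-numeric input, matching the behavior the PR describes.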
Does this PR introduce any user-facing change?
Yes.
`ALTER TABLE SET TBLPROPERTIES` now throws an `AnalysisException` with error condition `INVALID_TABLE_STATS_VALUE` if `numRows`, `totalSize`, or `rawDataSize` is set to a non-numeric value (including empty strings). Previously, these invalid values were silently accepted.

How was this patch tested?
Added a new test in `AlterTableSetTblPropertiesSuiteBase` covering all three stats properties (`numRows`, `totalSize`, `rawDataSize`): invalid values (e.g., 'abc') are rejected for all three stats properties, and valid values (e.g., '100', '5000') continue to be accepted without error. All assertions use `checkError` to verify the exact error condition (`INVALID_TABLE_STATS_VALUE`) and parameter values. The full `AlterTableSetTblPropertiesSuite` passes (4/4 tests including the new one).

Was this patch authored or co-authored using generative AI tooling?
Yes. All changes were reviewed and verified by the author.