[Spark] Support OPTIMIZE tbl FULL for clustered table #3793
allisonport-db merged 10 commits into delta-io:master
  inputOtherFiles = ClusteringFileStats(4, SKIP_CHECK_SIZE_VALUE),
  inputNumZCubes = 1,
- mergedFiles = ClusteringFileStats(6, SKIP_CHECK_SIZE_VALUE),
+ mergedFiles = ClusteringFileStats(4, SKIP_CHECK_SIZE_VALUE),
Could you explain this change?
This is related to comment https://github.com/delta-io/delta/pull/3793/files#r1815830851
After fixing validateClusteringMetrics, this assertion started failing. I had to correct it because validateClusteringMetrics is also used by the new tests.
  var finalActualMetrics = actualMetrics
  if (expectedMetrics.inputZCubeFiles.size == SKIP_CHECK_SIZE_VALUE) {
-   val stats = expectedMetrics.inputZCubeFiles
+   val stats = finalActualMetrics.inputZCubeFiles
This is a test bug left over from the commit that added this test. I have to fix it in this PR because the new tests depend on validateClusteringMetrics to check that the metrics are correct. Without this fix, the validation could pass even when the actual metrics were wrong, so a passing test did not show that the program was correct.
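To illustrate why the original helper could pass without checking anything, here is a minimal, self-contained sketch. It is not the actual Delta test code: the case classes, the `numFiles` field name, the sentinel value, and the `normalizeBuggy`/`normalizeFixed` helpers are assumptions made for illustration. Only `SKIP_CHECK_SIZE_VALUE`, `ClusteringFileStats`, `inputZCubeFiles`, and the `size` field appear in the diff.

```scala
object ClusteringMetricsSketch {
  // Sentinel meaning "do not compare the byte size" (the value -1 is an assumption).
  val SKIP_CHECK_SIZE_VALUE: Long = -1L

  // Hypothetical stand-ins for the real metric classes.
  final case class ClusteringFileStats(numFiles: Long, size: Long)
  final case class ClusteringMetrics(inputZCubeFiles: ClusteringFileStats)

  // Buggy variant: takes the stats from the *expected* metrics, so the
  // "actual" value handed to the comparison is just a copy of the expectation.
  def normalizeBuggy(
      expected: ClusteringMetrics,
      actual: ClusteringMetrics): ClusteringMetrics =
    if (expected.inputZCubeFiles.size == SKIP_CHECK_SIZE_VALUE) {
      val stats = expected.inputZCubeFiles
      actual.copy(inputZCubeFiles = stats.copy(size = SKIP_CHECK_SIZE_VALUE))
    } else actual

  // Fixed variant: keeps the actual file count and masks only the size.
  def normalizeFixed(
      expected: ClusteringMetrics,
      actual: ClusteringMetrics): ClusteringMetrics =
    if (expected.inputZCubeFiles.size == SKIP_CHECK_SIZE_VALUE) {
      val stats = actual.inputZCubeFiles
      actual.copy(inputZCubeFiles = stats.copy(size = SKIP_CHECK_SIZE_VALUE))
    } else actual
}
```

With the buggy variant, an expectation of 6 files compares equal to an actual result of 4 files, because the actual count is overwritten before the comparison. With the fixed variant, the mismatch surfaces, which would explain why a previously passing assertion had to be corrected.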
    }
  }

+ test("OPTIMIZE FULL") {
Can we add a test case for a different clusteringProvider with OPTIMIZE FULL?
Added a new test, "OPTIMIZE FULL - change clustering provider".
  inputOtherFiles = ClusteringFileStats(4, SKIP_CHECK_SIZE_VALUE),
  inputNumZCubes = 1,
- mergedFiles = ClusteringFileStats(6, SKIP_CHECK_SIZE_VALUE),
+ mergedFiles = ClusteringFileStats(4, SKIP_CHECK_SIZE_VALUE),
I'm guessing this is for the same reason as above?
Yes, this is fixing the test bug introduced in https://github.com/delta-io/delta/pull/3793/files#r1815830851
rahulsmahadev left a comment
LGTM! Thanks for working on this.
Which Delta project/connector is this regarding?
Description
How was this patch tested?
New unit tests were added.
Does this PR introduce any user-facing changes?
Yes
Previously, a clustered table would not re-cluster data that had already been clustered against different clustering keys. With OPTIMIZE tbl FULL, that data is re-clustered against the new keys.
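The behavior change described above can be pictured with a toy model in plain Scala. This is only a sketch under stated assumptions, not Delta's file-selection logic: `DataFile`, `clusteredBy`, and `filesToCluster` are hypothetical names, and the rule that files already clustered by the current keys are left alone in both modes is an assumption of the sketch.

```scala
object OptimizeFullSketch {
  // Hypothetical file descriptor. clusteredBy = None means the file has
  // not been clustered yet; Some(keys) records the keys it was clustered by.
  final case class DataFile(path: String, clusteredBy: Option[Seq[String]])

  // Returns the files a clustering run would rewrite in this toy model.
  def filesToCluster(
      files: Seq[DataFile],
      currentKeys: Seq[String],
      full: Boolean): Seq[DataFile] =
    files.filter { f =>
      f.clusteredBy match {
        // Never clustered: always a candidate.
        case None => true
        // Already clustered by the current keys: left alone (assumption).
        case Some(keys) if keys == currentKeys => false
        // Clustered by different keys: picked up only when FULL is requested.
        case Some(_) => full
      }
    }
}
```

In this model, a file clustered by an old key `a` is skipped by a plain run after the table switches to key `b`, and is selected only when `full = true`, mirroring the user-facing change the PR describes.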