[SPARK-57860][ML][PYTHON] Add HasIntermediateStorageLevel shared param and apply to KMeans #56949
Closed
maoli67660 wants to merge 1 commit into
…ply to KMeans

Introduce a `HasIntermediateStorageLevel` shared param (generated into both `sharedParams.scala` and PySpark `shared.py`) and apply it to KMeans as the reference implementation for SPARK-47103. This lets users control the StorageLevel of the intermediate datasets MLlib persists internally during training, which is currently hardcoded to MEMORY_AND_DISK. The default is unchanged, so existing behavior is preserved.

Follows the per-estimator param approach suggested by @zhengruifeng on apache#45182, consistent with ALS's existing `intermediateStorageLevel` param.
zhengruifeng reviewed Jul 2, 2026
 * Trait for shared param intermediateStorageLevel (default: "MEMORY_AND_DISK"). This trait may be changed or
 * removed between minor versions.
 */
trait HasIntermediateStorageLevel extends Params {
Contributor
Shall we also make ALS extend this new trait?
This can be done in a separate PR.
Contributor
Author
Done in #56979 (SPARK-57910). Thanks for the suggestion!
zhengruifeng approved these changes Jul 2, 2026
zhengruifeng left a comment
Contributor
looks pretty good, thanks
Contributor
The failing `org.apache.spark.sql.connect.service.SparkConnectSessionHolderSuite` should be unrelated to this change.
zhengruifeng pushed a commit that referenced this pull request Jul 2, 2026
…m and apply to KMeans

### What changes were proposed in this pull request?

This is the first sub-task of [SPARK-47103](https://issues.apache.org/jira/browse/SPARK-47103), which aims to make the default storage level of MLlib's intermediate datasets configurable.

This PR:

1. Adds a new shared param `HasIntermediateStorageLevel` (default `"MEMORY_AND_DISK"`, cannot be `"NONE"`) by extending the code generators on both sides:
   - Scala: `SharedParamsCodeGen.scala` -> regenerated `sharedParams.scala`
   - Python: `_shared_params_code_gen.py` -> regenerated `shared.py`
2. Applies it to `KMeans` (Scala and PySpark): mixes in the trait, adds `setIntermediateStorageLevel`, and uses `$(intermediateStorageLevel)` at the `persist` call site instead of the hardcoded `StorageLevel.MEMORY_AND_DISK`.
3. Adds test coverage in `KMeansSuite`.

The design follows the suggestion from zhengruifeng on the earlier PR #45182 (a per-estimator param via a shared `HasIntermediateStorageLevel` trait, consistent with ALS's existing `intermediateStorageLevel`), rather than the global SQL config explored there. The remaining estimators are tracked as sibling sub-tasks under SPARK-47103.

### Why are the changes needed?

MLlib persists *intermediate* datasets internally during training (e.g. blockified instances), with the storage level hardcoded to `MEMORY_AND_DISK`. These datasets are created inside the algorithm and are not the user's input `DataFrame`, so users currently have **no way** to change their storage level -- unlike the input, which they can already cache themselves.

Making this configurable (e.g. `DISK_ONLY`) improves resilience to executor loss: since SPARK-27677, the External Shuffle Service can serve disk-persisted cached blocks, so disk-based intermediate storage survives executor failures. `ALS` already exposes exactly this via `intermediateStorageLevel`; this PR starts extending the same capability to the rest of MLlib.

### Does this PR introduce _any_ user-facing change?

Yes. `KMeans` gains a new expert param `intermediateStorageLevel` and a `setIntermediateStorageLevel` setter. The default is `"MEMORY_AND_DISK"`, so **behavior is unchanged unless the user sets it**.

Before (no way to change intermediate storage level):

```python
kmeans = KMeans(k=3)  # intermediate data always MEMORY_AND_DISK
```

After:

```python
kmeans = KMeans(k=3).setIntermediateStorageLevel("DISK_ONLY")
```

### How was this patch tested?

- Extended `KMeansSuite` to assert the new param's default value, that it can be set, and that invalid values (`"NONE"` and non-existent levels) are rejected. `KMeansSuite` passes (15/15).
- PySpark param parity is covered by the existing `pyspark.ml.tests.test_param.test_java_params`, which checks that Python params match their Scala counterparts.
- `dev/mima` (mllib `mimaReportBinaryIssues`) reports no binary compatibility problems; no `MimaExcludes` entries were needed.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.8)

Closes #56949 from maoli67660/SPARK-57860.

Lead-authored-by: Mao Li <63109264+maoli67660@users.noreply.github.com>
Co-authored-by: Mao Li <maoli@roku.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
(cherry picked from commit 6313ea6)
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
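For readers who want to see the param contract from the commit message in isolation (default `"MEMORY_AND_DISK"`, `"NONE"` and unknown names rejected, chainable setter), here is a minimal pure-Python sketch. It does not import or reproduce Spark code. `ToyEstimator`, `VALID_STORAGE_LEVELS`, and `is_valid_intermediate_level` are hypothetical names made up for illustration, and the list of level names is an assumption based on the names Spark's `StorageLevel.fromString` is understood to accept.

```python
# Standalone illustration only; none of these names come from the Spark code base.

# Assumed set of storage level names (believed to match Spark's StorageLevel.fromString).
VALID_STORAGE_LEVELS = frozenset({
    "NONE", "DISK_ONLY", "DISK_ONLY_2", "DISK_ONLY_3",
    "MEMORY_ONLY", "MEMORY_ONLY_2", "MEMORY_ONLY_SER", "MEMORY_ONLY_SER_2",
    "MEMORY_AND_DISK", "MEMORY_AND_DISK_2",
    "MEMORY_AND_DISK_SER", "MEMORY_AND_DISK_SER_2", "OFF_HEAP",
})

DEFAULT_LEVEL = "MEMORY_AND_DISK"


def is_valid_intermediate_level(name: str) -> bool:
    """Accept any known storage level name except "NONE"."""
    return name in VALID_STORAGE_LEVELS and name != "NONE"


class ToyEstimator:
    """Hypothetical stand-in for an estimator that carries the shared param."""

    def __init__(self) -> None:
        # The default matches the previously hardcoded level, so behavior
        # only changes when the caller sets the param explicitly.
        self._intermediate_storage_level = DEFAULT_LEVEL

    def getIntermediateStorageLevel(self) -> str:
        return self._intermediate_storage_level

    def setIntermediateStorageLevel(self, value: str) -> "ToyEstimator":
        if not is_valid_intermediate_level(value):
            raise ValueError(f"Invalid intermediate storage level: {value!r}")
        self._intermediate_storage_level = value
        return self  # returning self allows chained setter calls


# Usage: the setter is chainable, and invalid values fail fast.
est = ToyEstimator().setIntermediateStorageLevel("DISK_ONLY")
print(est.getIntermediateStorageLevel())
```

In this sketch the check runs in the setter, so a bad value is reported before any training work starts; where the real param performs its validation is not shown in this thread.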
Contributor
thanks, merged into master/branch-4.x