[SPARK-33592][ML][PYTHON][3.0] Backport Fix: Pyspark ML Validator params in estimatorParamMaps may be lost after saving and reloading #30590

WeichenXu123 · 2020-12-03T11:45:19Z

Fix: Pyspark ML Validator params in estimatorParamMaps may be lost after saving and reloading

When saving a validator's estimatorParamMaps, the writer now checks all nested stages of the tuned estimator to find the correct parent for each param.
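The idea of resolving a param's parent through nested stages can be sketched in plain Python. This is only an illustration, not the code in this patch: the `Stage` class, the helper names, and the uids below are hypothetical stand-ins rather than PySpark APIs.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Stage:
    """Hypothetical stand-in for an ML stage; not a PySpark class."""
    uid: str
    children: List["Stage"] = field(default_factory=list)


def all_nested_stages(stage: Stage) -> List[Stage]:
    """Return the stage itself followed by every stage nested below it."""
    found = [stage]
    for child in stage.children:
        found.extend(all_nested_stages(child))
    return found


def uid_map(root: Stage) -> Dict[str, Stage]:
    """Index every nested stage by uid so a param's parent uid can be resolved."""
    stages = all_nested_stages(root)
    mapping = {s.uid: s for s in stages}
    if len(mapping) != len(stages):
        raise ValueError("duplicate stage uid in the nested estimator")
    return mapping


lr = Stage("LogisticRegression_1")
hashing_tf = Stage("HashingTF_1")
pipeline = Stage("Pipeline_1", [Stage("Tokenizer_1"), hashing_tf, lr])
stages_by_uid = uid_map(pipeline)

# A param such as lr.maxIter is owned by the nested stage, not by the pipeline,
# so its parent has to be looked up among all nested stages.
param_parent_uid = "LogisticRegression_1"
owner = stages_by_uid[param_parent_uid]
```

Looking only at the top-level estimator would miss `owner` here, which is the kind of situation the two manual test cases exercise.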

Two typical cases to manually test:

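```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

# Case 1: the tuned params belong to stages nested inside a Pipeline.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression()
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100]) \
    .addGrid(lr.maxIter, [100, 200]) \
    .build()
tvs = TrainValidationSplit(estimator=pipeline,
                           estimatorParamMaps=paramGrid,
                           evaluator=MulticlassClassificationEvaluator())

# tvsPath is any writable path that does not exist yet.
tvs.save(tvsPath)
loadedTvs = TrainValidationSplit.load(tvsPath)
```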

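```python
from pyspark.ml.classification import LogisticRegression, OneVsRest
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

# Case 2: the tuned param belongs to the classifier wrapped by OneVsRest (use a fresh tvsPath).
lr = LogisticRegression()
ova = OneVsRest(classifier=lr)
grid = ParamGridBuilder().addGrid(lr.maxIter, [100, 200]).build()
evaluator = MulticlassClassificationEvaluator()
tvs = TrainValidationSplit(estimator=ova, estimatorParamMaps=grid, evaluator=evaluator)

tvs.save(tvsPath)
loadedTvs = TrainValidationSplit.load(tvsPath)
```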
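For either case, one way to confirm that nothing was lost is to compare the param maps before and after the round trip. The helper below is a hedged sketch, not part of this patch: `normalize` and `same_param_maps` are hypothetical names, and the sketch assumes each key exposes `parent` (the owning stage's uid) and `name`, as PySpark `Param` objects do, and that the order of the maps is preserved on reload.

```python
from typing import Any, Dict, Iterable, List, Tuple


def normalize(param_maps: Iterable[Dict[Any, Any]]) -> List[Dict[Tuple[str, str], Any]]:
    """Key each entry by (parent uid, param name) so maps from different objects compare."""
    return [
        {(param.parent, param.name): value for param, value in param_map.items()}
        for param_map in param_maps
    ]


def same_param_maps(original: Iterable[Dict[Any, Any]],
                    loaded: Iterable[Dict[Any, Any]]) -> bool:
    """True when both lists hold the same params, owners, and values in the same order."""
    return normalize(original) == normalize(loaded)
```

With the objects from the cases above, the check would be `same_param_maps(tvs.getEstimatorParamMaps(), loadedTvs.getEstimatorParamMaps())`.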

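Why are the changes needed? Bug fix.

Does this PR introduce any user-facing change? No.

How was this patch tested? Unit test.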

Closes #30539 from WeichenXu123/fix_tuning_param_maps_io.

Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
(cherry picked from commit 8016123)
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

SparkQA · 2020-12-03T12:59:01Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36727/

SparkQA · 2020-12-03T13:18:54Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36727/

SparkQA · 2020-12-03T14:15:46Z

Test build #132126 has finished for PR 30590 at commit e3c04ea.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng

LGTM pending tests

WeichenXu123 · 2020-12-04T09:38:41Z

Jenkins retest this

zhengruifeng · 2020-12-04T09:38:45Z

retest this please

SparkQA · 2020-12-04T11:11:34Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36822/

SparkQA · 2020-12-04T11:27:54Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36822/

SparkQA · 2020-12-04T11:39:22Z

Test build #132222 has finished for PR 30590 at commit e3c04ea.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

WeichenXu123 · 2020-12-04T13:46:17Z

Jenkins retest this

WeichenXu123 · 2020-12-05T00:38:46Z

The failed tests are unrelated to this PR:

Test Name | Duration | Age
-- | -- | --
org.apache.spark.launcher.LauncherBackendSuite.local: launcher handle | 2 min 51 sec | 1
org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite.stress test for failOnDataLoss=false | 1 min 20 sec | 1
org.apache.spark.sql.kafka010.consumer.KafkaDataConsumerSuite.SPARK-23623: concurrent use of KafkaDataConsumer | 2 min 53 sec | 1
org.apache.spark.sql.kafka010.consumer.KafkaDataConsumerSuite.SPARK-25151 Handles multiple tasks in executor fetching same (topic, partition) pair | 1 min 0 sec | 1

WeichenXu123 · 2020-12-05T00:42:16Z

retest this please

SparkQA · 2020-12-05T01:30:21Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36857/

SparkQA · 2020-12-05T01:56:16Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36857/

SparkQA · 2020-12-05T03:29:31Z

Test build #132256 has finished for PR 30590 at commit e3c04ea.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

Closes #30590 from WeichenXu123/SPARK-33592-bp-3.0.

Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

WeichenXu123 · 2020-12-07T03:43:37Z

Merged to branch-3.0

WeichenXu123 added 2 commits December 3, 2020 19:38

update

e3c04ea

zhengruifeng approved these changes Dec 4, 2020

WeichenXu123 closed this Dec 7, 2020
