[SPARK-43981][PYTHON][ML] Basic saving / loading implementation for ML on spark connect by WeichenXu123 · Pull Request #41478 · apache/spark

WeichenXu123 · 2023-06-06T12:32:47Z

What changes were proposed in this pull request?

Base class / helper functions for saving/loading estimator / transformer / evaluator / model.
Add saving/loading implementation for feature transformers.
Add saving/loading implementation for logistic regression estimator.

Design goals:

The model format is decoupled from Spark, i.e. we can run model inference without a Spark service.
Models can be saved to either the local file system or a cloud storage file system.
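
To make the first design goal concrete, here is a minimal sketch of what a Spark-independent model format could look like. Everything in it is an assumption for illustration: the `model.json` file name, the `save_model` and `predict_probability` helpers, and the JSON layout are hypothetical and are not taken from this PR. It only shows that a logistic regression model stored as plain data can be scored with the Python standard library alone. Writing to cloud storage is not shown.

```python
import json
import math
import os
import tempfile
from typing import List

# Hypothetical file name; the PR's actual on-disk layout is not shown in this thread.
MODEL_FILE = "model.json"


def save_model(path: str, coefficients: List[float], intercept: float) -> None:
    """Persist the model as plain JSON so no Spark session is needed to read it."""
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, MODEL_FILE), "w", encoding="utf-8") as f:
        json.dump({"coefficients": coefficients, "intercept": intercept}, f)


def predict_probability(path: str, features: List[float]) -> float:
    """Load the saved model and apply the logistic function to one feature vector."""
    with open(os.path.join(path, MODEL_FILE), "r", encoding="utf-8") as f:
        model = json.load(f)
    margin = model["intercept"] + sum(
        w * x for w, x in zip(model["coefficients"], features)
    )
    return 1.0 / (1.0 + math.exp(-margin))


# Round trip through a temporary local directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    save_model(tmp_dir, coefficients=[1.0, -1.0], intercept=0.0)
    probability = predict_probability(tmp_dir, [2.0, 2.0])
```

The point of the sketch is only that the saved artifact contains nothing Spark-specific; the format the PR actually uses may differ.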

Why are the changes needed?

We need to support saving/loading estimator / transformer / evaluator / model.

Does this PR introduce any user-facing change?

Yes.

How was this patch tested?

Unit tests.

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

zhengruifeng · 2023-06-08T02:57:21Z

python/pyspark/mlv2/io_utils.py

+from pyspark.errors import PySparkNotImplementedError, PySparkRuntimeError
+from pyspark.util import VersionUtils
+
+from pyspark import __version__ as pyspark_version


On the Python client, this version info may differ from the spark.version reported by the server side.

WeichenXu123 · Jun 8, 2023

For the new PySpark ML module, I think we should treat it as a client-side package, so using the client-side PySpark version should be fine. The PySpark server side just runs wrapped Python UDFs, which are unaware of the concrete ML algorithm logic.
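
As an illustration of the point above, the following sketch records a client-side version string in the saved metadata and never consults the server. The helper names `major_minor` and `build_metadata` and the metadata keys are hypothetical; the diff only shows that `pyspark.__version__` and `VersionUtils` are imported, not how they are used.

```python
import re
from typing import Any, Dict, Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '3.5.0.dev0'."""
    match = re.match(r"^(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def build_metadata(class_name: str, client_version: str) -> Dict[str, Any]:
    """Record the client-side version; the server version is never consulted."""
    major, minor = major_minor(client_version)
    return {
        "class": class_name,
        "client_version": client_version,
        "major_minor": [major, minor],
    }


# In real code the version would come from `pyspark.__version__` on the client;
# a literal is used here so the sketch runs without PySpark installed.
metadata = build_metadata("LogisticRegression", "3.5.0.dev0")
```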

zhengruifeng · 2023-06-08T02:59:30Z

python/pyspark/mlv2/io_utils.py

+        return instance
+
+
+class ModelReadWrite:


Will ModelReadWrite not deal with the parameters?

WeichenXu123 · Jun 13, 2023

See my latest code: "ModelReadWrite" now inherits from "ParamsReadWrite".
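
The inheritance relationship described in this reply can be sketched roughly as follows. Only the two class names come from the thread; the method names, the `params.json` and `model.json` files, and the `coefficients` field are assumptions made for illustration and do not reflect the PR's actual code. The idea is that the parent class persists parameters and the model subclass reuses that logic before adding its own data.

```python
import json
import os
import tempfile
from typing import Any, Dict, List


class ParamsReadWrite:
    """Illustrative base class: persists only the parameter dictionary."""

    def __init__(self, params: Dict[str, Any]) -> None:
        self.params = dict(params)

    def save(self, path: str) -> None:
        os.makedirs(path, exist_ok=True)
        with open(os.path.join(path, "params.json"), "w", encoding="utf-8") as f:
            json.dump(self.params, f)

    def load_params(self, path: str) -> None:
        with open(os.path.join(path, "params.json"), "r", encoding="utf-8") as f:
            self.params = json.load(f)


class ModelReadWrite(ParamsReadWrite):
    """Illustrative subclass: reuses parameter handling and adds model data."""

    def __init__(self, params: Dict[str, Any], coefficients: List[float]) -> None:
        super().__init__(params)
        self.coefficients = list(coefficients)

    def save(self, path: str) -> None:
        super().save(path)  # parameters are written by the parent class
        with open(os.path.join(path, "model.json"), "w", encoding="utf-8") as f:
            json.dump({"coefficients": self.coefficients}, f)

    @classmethod
    def load(cls, path: str) -> "ModelReadWrite":
        instance = cls(params={}, coefficients=[])
        instance.load_params(path)  # parameters are restored by the parent class
        with open(os.path.join(path, "model.json"), "r", encoding="utf-8") as f:
            instance.coefficients = json.load(f)["coefficients"]
        return instance


# Round trip: both the parameters and the model data survive save/load.
with tempfile.TemporaryDirectory() as tmp_dir:
    ModelReadWrite({"maxIter": 10}, [0.5, -1.25]).save(tmp_dir)
    loaded = ModelReadWrite.load(tmp_dir)
```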

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

…L on spark connect

### What changes were proposed in this pull request?

* Base class / helper functions for saving/loading estimator / transformer / evaluator / model.
* Add saving/loading implementation for feature transformers.
* Add saving/loading implementation for logistic regression estimator.

Design goals:

* The model format is decoupled from spark, i.e. we can run model inference without spark service.
* We can save model to either local file system or cloud storage file system.

### Why are the changes needed?

We need to support saving/loading estimator / transformer / evaluator / model.

### Does this PR introduce _any_ user-facing change?

Yes.

### How was this patch tested?

Unit tests.

Closes apache#41478 from WeichenXu123/mlv2-read-write.

Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

WeichenXu123 added 2 commits June 6, 2023 18:36

init

2085bba

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

add read/write support for mlv2

f0bed71

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

github-actions bot added CORE PYTHON labels Jun 6, 2023

zhengruifeng reviewed Jun 8, 2023

WeichenXu123 added 9 commits June 9, 2023 20:29

Merge branch 'master' into mlv2-read-write

cb0dc65

update

3479737

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

update

f8c6e63

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

update

72f9c71

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

update

fe80715

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

update

635f5b5

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

support legacy mode save

3fccd77

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

fix

a16a167

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

fix

ac662cd

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

github-actions bot added the ML label Jun 11, 2023

WeichenXu123 added 5 commits June 13, 2023 20:34

format

094f5c3

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

update feature transformers

7195067

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

update tests

26be121

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

fix-mypy

046e9c6

fix black

47295fa

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

WeichenXu123 changed the title ~~[WIP][SPARK-43981][PYTHON][ML] Basic saving / loading implementation for ML on spark connect~~ [SPARK-43981][PYTHON][ML] Basic saving / loading implementation for ML on spark connect Jun 13, 2023

fix format

86649ab

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

zhengruifeng approved these changes Jun 14, 2023

WeichenXu123 closed this in a5d3bea Jun 14, 2023

WeichenXu123 wants to merge 17 commits into apache:master from WeichenXu123:mlv2-read-write
