[SPARK-43981][PYTHON][ML] Basic saving / loading implementation for ML on spark connect #41478

Closed
WeichenXu123 wants to merge 17 commits into apache:master from WeichenXu123:mlv2-read-write


@WeichenXu123 WeichenXu123 commented Jun 6, 2023

What changes were proposed in this pull request?

  • Base classes and helper functions for saving and loading estimators, transformers, evaluators, and models.
  • Add a saving/loading implementation for feature transformers.
  • Add a saving/loading implementation for the logistic regression estimator.

Design goals:

  • The model format is decoupled from Spark, i.e. model inference can run without a Spark service.
  • A model can be saved to either the local file system or a cloud storage file system.
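
To illustrate the first design goal, here is a minimal sketch of a Spark-independent save/load round trip. It is not the code in this PR: the directory layout, the file names (`metadata.json`, `weights.json`), and the helpers `save_model`, `load_model`, and `predict` are all hypothetical, and only the local file system is covered (cloud storage would need a separate upload step).

```python
import json
import os
import tempfile


def save_model(path, class_name, params, weights):
    """Write params and weights as plain JSON files (hypothetical layout)."""
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "metadata.json"), "w") as f:
        json.dump({"class": class_name, "params": params}, f)
    with open(os.path.join(path, "weights.json"), "w") as f:
        json.dump(weights, f)


def load_model(path):
    """Read the two JSON files back; no Spark session is involved."""
    with open(os.path.join(path, "metadata.json")) as f:
        metadata = json.load(f)
    with open(os.path.join(path, "weights.json")) as f:
        weights = json.load(f)
    return metadata, weights


def predict(weights, features):
    """Plain-Python linear decision rule: label 1 if the score is positive."""
    score = weights["intercept"] + sum(
        c * x for c, x in zip(weights["coefficients"], features)
    )
    return 1 if score > 0 else 0


# Round trip through a temporary local directory.
model_dir = os.path.join(tempfile.mkdtemp(), "lr_model")
save_model(
    model_dir,
    "LogisticRegressionModel",
    {"maxIter": 10},
    {"coefficients": [0.5, -0.25], "intercept": 0.1},
)
metadata, weights = load_model(model_dir)
prediction = predict(weights, [1.0, 2.0])
```

Because everything on disk is plain JSON, the loading side needs only the Python standard library, which is the sense in which inference can run without a Spark service.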

Why are the changes needed?

We need to support saving and loading estimators, transformers, evaluators, and models.

Does this PR introduce any user-facing change?

Yes.

How was this patch tested?

Unit tests.

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
from pyspark.errors import PySparkNotImplementedError, PySparkRuntimeError
from pyspark.util import VersionUtils

from pyspark import __version__ as pyspark_version
Contributor

On the Python client, this version may differ from the `spark.version` reported by the server side.

Contributor Author

For the new PySpark ML module, I think we should regard it as a client-side package, so using the client-side PySpark version should be fine. The PySpark server side just runs wrapped Python UDFs, which are unaware of the concrete ML algorithm logic.
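
As a rough sketch of what recording the client-side version could look like: the helpers `major_minor`, `build_metadata`, and `is_compatible` below are hypothetical, the version strings are made-up stand-ins for `pyspark.__version__`, and the "same major version" rule is an assumed policy rather than anything stated in this PR.

```python
def major_minor(version):
    """Parse 'X.Y...' into (X, Y); suffixes such as '.dev0' are ignored."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def build_metadata(class_name, client_version):
    """Metadata written at save time, stamped with the client-side version."""
    return {"class": class_name, "version": client_version}


def is_compatible(metadata, current_version):
    """Assumed policy: a saved model loads only under the same major version."""
    saved_major, _ = major_minor(metadata["version"])
    current_major, _ = major_minor(current_version)
    return saved_major == current_major


# Stand-in for the client package version; no server round trip is needed.
saved = build_metadata("LogisticRegression", "3.5.0.dev0")
same_major = is_compatible(saved, "3.5.1")
other_major = is_compatible(saved, "4.0.0")
```

The point of the sketch is only that the stamp comes from the client package, so the check never has to ask the server for its `spark.version`.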

return instance


class ModelReadWrite:
Contributor

Will `ModelReadWrite` not handle the parameters?

Contributor Author

See my latest code: `ModelReadWrite` inherits from `ParamsReadWrite`.
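
For readers without the diff at hand, the relationship can be sketched roughly as follows. Only the two class names come from this thread; the method bodies, the file names, and the `coefficients` attribute are assumptions made for illustration and do not reproduce the PR's implementation.

```python
import json
import os
import tempfile


class ParamsReadWrite:
    """Base class: persists only the parameters (hypothetical sketch)."""

    def __init__(self, **params):
        self.params = dict(params)

    def save(self, path):
        os.makedirs(path, exist_ok=True)
        metadata = {"class": type(self).__name__, "params": self.params}
        with open(os.path.join(path, "metadata.json"), "w") as f:
            json.dump(metadata, f)

    @classmethod
    def load(cls, path):
        with open(os.path.join(path, "metadata.json")) as f:
            metadata = json.load(f)
        return cls(**metadata["params"])


class ModelReadWrite(ParamsReadWrite):
    """Subclass: reuses the parameter handling and adds the model data."""

    def __init__(self, coefficients=None, **params):
        super().__init__(**params)
        self.coefficients = list(coefficients or [])

    def save(self, path):
        super().save(path)  # parameters are written by the base class
        with open(os.path.join(path, "model.json"), "w") as f:
            json.dump({"coefficients": self.coefficients}, f)

    @classmethod
    def load(cls, path):
        instance = super().load(path)  # parameters are restored by the base class
        with open(os.path.join(path, "model.json")) as f:
            instance.coefficients = json.load(f)["coefficients"]
        return instance


# Round trip: both the parameters and the model data survive.
model_path = os.path.join(tempfile.mkdtemp(), "model")
ModelReadWrite(coefficients=[1.0, 2.0], maxIter=5).save(model_path)
restored = ModelReadWrite.load(model_path)
```

With this shape, a model gets parameter persistence through inheritance, which is the answer to the question above.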

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
@github-actions github-actions bot added the ML label Jun 11, 2023
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
@WeichenXu123 WeichenXu123 changed the title from "[WIP][SPARK-43981][PYTHON][ML] Basic saving / loading implementation for ML on spark connect" to "[SPARK-43981][PYTHON][ML] Basic saving / loading implementation for ML on spark connect" Jun 13, 2023
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
czxm pushed a commit to czxm/spark that referenced this pull request Jun 19, 2023
…L on spark connect

Closes apache#41478 from WeichenXu123/mlv2-read-write.

Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>