[SPARK-45130][CONNECT][ML][PYTHON] Avoid Spark connect ML model to change input pandas dataframe by WeichenXu123 · Pull Request #42887 · apache/spark

WeichenXu123 · 2023-09-12T08:55:57Z

What changes were proposed in this pull request?

Currently, to avoid a data copy, a Spark Connect ML model modifies the input pandas dataframe in place when appending prediction columns. Instead, we can use pandas_df.copy(deep=False) to make a shallow copy and append the prediction columns to the copy. This is easier for users to work with.

Why are the changes needed?

This makes the transform method of pyspark.ml.connect models behave more like that of pyspark.ml models, i.e., the input dataframe is left intact after transform is called. Otherwise, users might be surprised by the new behavior and have to change more code to migrate their workloads to pyspark.ml.connect.

Does this PR introduce any user-facing change?

Yes.
Previous behavior:
In pyspark.ml.connect, model.transform appends the new columns to the input pandas dataframe and returns the input dataframe object.

Changed behavior:
In pyspark.ml.connect, model.transform makes a shallow copy of the input pandas dataframe, appends the new columns to the copy, and returns the copy.
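The changed behavior can be illustrated with plain pandas. This is a minimal sketch, not the code from this PR: `transform_without_mutation` and the column names `feature` and `prediction` are hypothetical stand-ins for a model's transform method and its columns. Only `DataFrame.copy(deep=False)` comes from the description above.

```python
import pandas as pd


def transform_without_mutation(input_df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for model.transform on a pandas dataframe."""
    # Shallow copy: a new DataFrame object is created without eagerly
    # copying the existing column data.
    output_df = input_df.copy(deep=False)
    # The new column is attached to the copy only; the set of columns on
    # the caller's dataframe is unchanged.
    output_df["prediction"] = (output_df["feature"] > 0.5).astype(int)
    return output_df


input_df = pd.DataFrame({"feature": [0.1, 0.9, 0.7]})
output_df = transform_without_mutation(input_df)

print(list(input_df.columns))
print(list(output_df.columns))
```

The caller keeps its original dataframe and receives a separate object carrying the extra column, which matches how the description says pyspark.ml models treat their input.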

How was this patch tested?

Unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

python/pyspark/ml/tests/connect/test_legacy_mode_feature.py

HyukjinKwon · 2023-09-13T02:55:14Z

https://github.com/WeichenXu123/spark/runs/16710260795

python/pyspark/ml/tests/connect/test_legacy_mode_feature.py

python/pyspark/ml/connect/base.py

harupy

LGTM!

HyukjinKwon · 2023-09-13T11:31:23Z

python/pyspark/ml/tests/connect/test_legacy_mode_classification.py

 import tempfile
 import unittest
 import numpy as np
+import pandas as pd


Can you import this within test_binary_classes_logistic_regression or under should_test_connect, so that the tests are skipped cleanly when pandas is not installed?

Isn't pandas always installed in the CI image?
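The reviewer's suggestion above amounts to keeping the pandas import out of module scope, so that merely importing the test module cannot fail. A minimal sketch of that pattern follows. The `have_pandas` flag and the `GuardedPandasTest` class are hypothetical names used for illustration; they stand in for a project-level check such as should_test_connect and are not the code that was merged.

```python
import importlib.util
import unittest

# Assumption: a local availability flag, standing in for a project-level
# check such as should_test_connect.
have_pandas = importlib.util.find_spec("pandas") is not None


@unittest.skipIf(not have_pandas, "pandas is not installed")
class GuardedPandasTest(unittest.TestCase):
    def test_input_is_not_modified(self):
        # Imported inside the test, so importing this module never
        # requires pandas.
        import pandas as pd

        df = pd.DataFrame({"a": [1, 2]})
        copied = df.copy(deep=False)
        copied["b"] = [3, 4]
        # The original dataframe does not gain the new column.
        self.assertEqual(list(df.columns), ["a"])
```

With this layout the test is reported as skipped, rather than erroring at import time, on an environment without pandas.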

HyukjinKwon · 2023-09-13T11:31:33Z

python/pyspark/ml/tests/connect/test_legacy_mode_feature.py

 import os
 import pickle
 import numpy as np
+import pandas as pd


HyukjinKwon · 2023-09-13T11:31:39Z

python/pyspark/ml/tests/connect/test_legacy_mode_pipeline.py

 import tempfile
 import unittest
 import numpy as np
+import pandas as pd


zhengruifeng · 2023-09-18T04:54:48Z

merged to master

update

a2d4ad6

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

github-actions bot added ML PYTHON CONNECT labels Sep 12, 2023

WeichenXu123 mentioned this pull request Sep 12, 2023

Make spark flavor support saving / loading spark connect ML models mlflow/mlflow#9534

Merged

37 tasks

zhengruifeng approved these changes Sep 12, 2023

harupy reviewed Sep 12, 2023

python/pyspark/ml/tests/connect/test_legacy_mode_feature.py (outdated)

WeichenXu123 added 3 commits September 13, 2023 10:55

Merge branch 'master' into spark-ml-connect-model-avoid-change-input-…

fc1c0b6

…dataframe

update

35dcc10

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

update

900be86

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

harupy reviewed Sep 13, 2023

python/pyspark/ml/tests/connect/test_legacy_mode_feature.py (outdated)

harupy reviewed Sep 13, 2023

python/pyspark/ml/tests/connect/test_legacy_mode_feature.py (outdated)

update

970adfe

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

harupy reviewed Sep 13, 2023

python/pyspark/ml/connect/base.py (outdated)

harupy approved these changes Sep 13, 2023

update

f8bacf8

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

HyukjinKwon reviewed Sep 13, 2023

WeichenXu123 added 2 commits September 18, 2023 10:19

Merge branch 'master' into spark-ml-connect-model-avoid-change-input-…

23c0f72

…dataframe

update

5120760

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

HyukjinKwon approved these changes Sep 18, 2023

zhengruifeng closed this in 99a979d Sep 18, 2023

WeichenXu123 wants to merge 8 commits into apache:master from WeichenXu123:spark-ml-connect-model-avoid-change-input-dataframe

4 participants
