Skip to content

Comments

[SPARK-45130][CONNECT][ML][PYTHON] Avoid Spark connect ML model to change input pandas dataframe#42887

Closed
WeichenXu123 wants to merge 8 commits intoapache:masterfrom
WeichenXu123:spark-ml-connect-model-avoid-change-input-dataframe
Closed

[SPARK-45130][CONNECT][ML][PYTHON] Avoid Spark connect ML model to change input pandas dataframe#42887
WeichenXu123 wants to merge 8 commits intoapache:masterfrom
WeichenXu123:spark-ml-connect-model-avoid-change-input-dataframe

Conversation

@WeichenXu123
Copy link
Contributor

What changes were proposed in this pull request?

Currently, to avoid data copy, Spark connect ML model directly changes input pandas dataframe for appending prediction columns. But we can use pandas_df.copy(deep=False) to shallow copy it and then append prediction columns in copied dataframe. This is easier for user to use it.

Why are the changes needed?

This makes pyspark.ml.connect model transform method has more similar behavior with pyspark.ml model, i.e., the input dataframe is intact after transform is called. Otherwise user might be surprise at the new behavior and have to change more code to migrate their workload to pyspark.ml.connect

Does this PR introduce any user-facing change?

Yes.
Previous behavior:
In pyspark.ml.connect, model.transform will append new columns into input pandas dataframe, and return input dataframe object.

Changed behavior:
In pyspark.ml.connect, model.transform will shallow copy input pandas dataframe and append new columns into shallow copied pandas dataframe, then return copied pandas dataframe.

How was this patch tested?

Unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
@HyukjinKwon
Copy link
Member

https://github.com/WeichenXu123/spark/runs/16710260795

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Copy link
Contributor

@harupy harupy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
import tempfile
import unittest
import numpy as np
import pandas as pd
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you import this within test_binary_classes_logistic_regression or under should_test_connect so the tests can be skipped fine when pandas is not installed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't pandas always installed in CI image ?

import os
import pickle
import numpy as np
import pandas as pd
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

import tempfile
import unittest
import numpy as np
import pandas as pd
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
@zhengruifeng
Copy link
Contributor

merged to master

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants