
Conversation

@zhengruifeng
Contributor

What changes were proposed in this pull request?

FPGrowth model save/load now supports the local filesystem.

Why are the changes needed?

To make FPGrowth work with the local filesystem.

Does this PR introduce any user-facing change?

Yes, FPGrowth will work when the local saving mode is on.

How was this patch tested?

Updated tests.

Was this patch authored or co-authored using generative AI tooling?

No.

Using.resource(
  new ObjectInputStream(new BufferedInputStream(new FileInputStream(path)))
) { ois =>
  val schema = ois.readObject().asInstanceOf[StructType]
Contributor

This uses the Java deserializer, which seems unsafe (risk of remote code execution).

Related commit: #50922
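For context, a minimal sketch of the JSON-based alternative that avoids readObject entirely. The writeSchema/readSchema helper names are hypothetical; StructType.json and DataType.fromJson are standard Spark APIs:

```scala
import java.io._
import scala.util.Using
import org.apache.spark.sql.types.{DataType, StructType}

// Hypothetical helpers: persist the schema as JSON text rather than a
// Java-serialized object, so loading never goes through readObject.
def writeSchema(path: String, schema: StructType): Unit =
  Using.resource(new DataOutputStream(new BufferedOutputStream(new FileOutputStream(path)))) {
    dos => dos.writeUTF(schema.json) // note: writeUTF caps at 64 KB; fine for a sketch
  }

def readSchema(path: String): StructType =
  Using.resource(new DataInputStream(new BufferedInputStream(new FileInputStream(path)))) {
    dis => DataType.fromJson(dis.readUTF()).asInstanceOf[StructType]
  }
```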

Contributor Author

Resorted to the Arrow format, as suggested by @cloud-fan.

@WeichenXu123 (Contributor) left a comment

We need to address the RCE issue :)

val spark = df.sparkSession
val schema = df.schema
val maxRecordsPerBatch = spark.sessionState.conf.arrowMaxRecordsPerBatch
df.queryExecution.executedPlan.execute().mapPartitionsInternal { iter =>
Contributor

Dataset already has def toArrowBatchRdd; shall we reuse it?

Contributor Author

We can reuse it, with some changes.

val schema: StructType = df.schema
dos.writeUTF(schema.json)

val iter = DatasetUtils.toArrowBatchRDD(df, "UTC").toLocalIterator
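Roughly, the snippet above could slot into a complete local save helper like the following sketch. The saveDataFrame name, the length-prefix framing, and the -1 sentinel are assumptions of this sketch, not the PR's actual layout; DatasetUtils.toArrowBatchRDD is the Spark-internal helper used in the snippet (its import is omitted here):

```scala
import java.io.{BufferedOutputStream, DataOutputStream, FileOutputStream}
import scala.util.Using
import org.apache.spark.sql.DataFrame

// Sketch: schema as JSON text, then length-prefixed serialized Arrow batches.
def saveDataFrame(path: String, df: DataFrame): Unit =
  Using.resource(
    new DataOutputStream(new BufferedOutputStream(new FileOutputStream(path)))
  ) { dos =>
    dos.writeUTF(df.schema.json)
    // Each element is one serialized ArrowRecordBatch as Array[Byte].
    val iter = DatasetUtils.toArrowBatchRDD(df, "UTC").toLocalIterator
    iter.foreach { batch =>
      dos.writeInt(batch.length) // length prefix so the reader can frame batches
      dos.write(batch)
    }
    dos.writeInt(-1) // end-of-stream sentinel (assumption of this sketch)
  }
```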
Contributor

Does the Arrow library provide APIs to write to a local file?

@holdenk (Contributor) left a comment

Arrow isn't intended for long-term storage; it's intended as a wire protocol -- I don't love using it for persisting models. I'm -0.9 on this change for now. Parquet seems like a better choice, most likely.

@zhengruifeng
Contributor Author

> Arrow isn't intended for long-term storage; it's intended as a wire protocol -- I don't love using it for persisting models. I'm -0.9 on this change for now. Parquet seems like a better choice, most likely.

> Does the Arrow library provide APIs to write to a local file?

@holdenk @cloud-fan Arrow supports Random Access Files, and it provides APIs to write to a local file. But our Arrow utils mainly work with serialized ArrowRecordBatches as Array[Byte]; we would need to add new helper functions for ArrowRecordBatches if we want to use the Arrow file APIs.
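For reference, the Arrow Java library exposes the random-access file format through ArrowFileWriter. A minimal sketch; the example data and path are illustrative, and this is not what the PR ends up doing:

```scala
import java.io.FileOutputStream
import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.{IntVector, VectorSchemaRoot}
import org.apache.arrow.vector.ipc.ArrowFileWriter

val allocator = new RootAllocator()
val vector = new IntVector("freq", allocator) // illustrative column
vector.allocateNew(3)
(0 until 3).foreach(i => vector.setSafe(i, i + 1))
vector.setValueCount(3)

val root = VectorSchemaRoot.of(vector)
root.setRowCount(3)
val out = new FileOutputStream("/tmp/model.arrow")
val writer = new ArrowFileWriter(root, null, out.getChannel)
writer.start()
writer.writeBatch() // writes root's current contents as one record batch
writer.end()        // writes the footer that enables random access on read
writer.close()
out.close()
root.close()
allocator.close()
```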

@WeichenXu123
Contributor

> Arrow isn't intended for long-term storage; it's intended as a wire protocol -- I don't love using it for persisting models. I'm -0.9 on this change for now. Parquet seems like a better choice, most likely.

@holdenk

The Spark ML model "saveToLocal" is an internal API; it is only used on the Spark Connect server side to cache a model within one session, not for long-term storage. So it should be fine to use Arrow here.

@zhengruifeng
Contributor Author

The PR to apply the Arrow file format: #53232
