[SPARK-54446][ML][CONNECT] FPGrowth supports local filesystem with Arrow file format #53232
base: master
Conversation
This PR is another attempt to save ML models containing DataFrames to the driver's local filesystem.
fileWriter.start()
while (batchBytesIter.hasNext) {
  val batchBytes = batchBytesIter.next()
  val batch = ArrowConverters.loadBatch(batchBytes, allocator)
The batch: ArrowRecordBatch doesn't extend Serializable, so the PR still uses Array[Byte] as the underlying data.
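For context, a minimal sketch of the write loop being discussed, assuming the schema/path setup shown elsewhere in the diff; ArrowConverters.loadBatch is the Spark-internal helper used above, the rest are standard Arrow Java APIs, and this is an illustration rather than the exact PR code:

import java.io.FileOutputStream

import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.{VectorLoader, VectorSchemaRoot}
import org.apache.arrow.vector.ipc.ArrowFileWriter
import org.apache.arrow.vector.types.pojo.Schema

import org.apache.spark.sql.execution.arrow.ArrowConverters

// Each iterator element is one Arrow record batch serialized to IPC bytes,
// because ArrowRecordBatch itself is not Serializable and cannot be collected as-is.
def writeBatches(arrowSchema: Schema, path: String, batchBytesIter: Iterator[Array[Byte]]): Unit = {
  val allocator = new RootAllocator(Long.MaxValue)
  val root = VectorSchemaRoot.create(arrowSchema, allocator)
  val loader = new VectorLoader(root)
  val out = new FileOutputStream(path)
  val fileWriter = new ArrowFileWriter(root, null, out.getChannel)
  try {
    fileWriter.start()
    while (batchBytesIter.hasNext) {
      // Deserialize the IPC bytes back into an ArrowRecordBatch, load it into the
      // shared VectorSchemaRoot, and append it to the Arrow file.
      val batch = ArrowConverters.loadBatch(batchBytesIter.next(), allocator)
      loader.load(batch)
      fileWriter.writeBatch()
      batch.close()
    }
    fileWriter.end()
  } finally {
    fileWriter.close()
    out.close()
    root.close()
    allocator.close()
  }
}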
protected val root = VectorSchemaRoot.create(arrowSchema, allocator)
protected val loader = new VectorLoader(root)
protected val arrowWriter = ArrowWriter.create(root)
Where is arrowWriter used?
good catch!
  fileWriter.close()
}

def write(batchBytesIter: Iterator[Array[Byte]]): Unit = {
This looks like it does the following:
Dataset -> Arrow batches -> bytes -> Arrow batches -> write the Arrow batches with ArrowFileWriter
It looks like the intermediate bytes could be skipped?
I think he's doing it because the local data has to go to the executors, and for that the Arrow batches should be in IPC format.
The Dataset is already distributed on the executors, and rows are written into Arrow batches on the executors. If they are not to be distributed again, they could stay as Arrow batches, no?
Regarding writer.write(rdd.toLocalIterator) below: I think the code path here collects the Arrow batches into the Spark driver and writes them on the driver. So it should collect the Arrow batches from the executors to the driver.
I guess it's because they need to be written to the driver's local filesystem.
Oh I see. Okay.
def saveDataFrame(path: String, df: DataFrame): Unit = {
  if (localSavingModeState.get()) {
    val filePath = Paths.get(path)
    Files.createDirectories(filePath.getParent)

    df match {
      case d: org.apache.spark.sql.classic.DataFrame =>
        ArrowFileReadWrite.save(d, path)
      case _ => throw new UnsupportedOperationException("Unsupported dataframe type")
    }
  } else {
    df.write.parquet(path)
  }
}

def loadDataFrame(path: String, spark: SparkSession): DataFrame = {
  if (localSavingModeState.get()) {
    spark match {
      case s: org.apache.spark.sql.classic.SparkSession =>
        ArrowFileReadWrite.load(s, path)
      case _ => throw new UnsupportedOperationException("Unsupported session type")
    }
  } else {
    spark.read.parquet(path)
  }
}
So if localSavingModeState is set to true, this will write out an Arrow file, which is not a stable format. It does look like localSavingModeState is only set to true in internal Scala methods, but looking at the PySpark docstrings I see we tell people to use this API, so I remain -0.9.
Hi @holdenk, as @WeichenXu123 explained in #53150 (comment), this is a runtime temporary file on the Spark Connect server side, and it will be cleaned up after the session closes.
So I think we don't have to use a stable format here.
localSavingModeState is also only used internally (only Spark driver code can set the flag). Where does the docstring mention it? We should remove it from the docs and mark localSavingModeState as a private field.
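For reference, a driver-only flag like this is typically a thread-local that only internal code flips; a hedged sketch (the container object and exact declaration are assumptions, not Spark's actual code):

object ReadWriteFlags {  // hypothetical container; the real flag lives in Spark ML's read/write utils
  val localSavingModeState: ThreadLocal[Boolean] = new ThreadLocal[Boolean] {
    override def initialValue(): Boolean = false
  }
}

// Driver-internal code would then wrap a save call, e.g.:
// ReadWriteFlags.localSavingModeState.set(true)
// try { saveDataFrame(path, df) } finally { ReadWriteFlags.localSavingModeState.set(false) }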
Hmm, even if it is just a temporary session file, is there any reason to use the Arrow file format rather than Parquet?
We can read/write Parquet with Arrow, but it requires a new dependency:
<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-arrow</artifactId>
</dependency>
Otherwise, I am not sure whether we have utils to read/write Parquet.
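By contrast, reading the temporary Arrow IPC file back on the driver only needs the Arrow Java libraries Spark already depends on. A minimal sketch using standard Arrow APIs (the file path is a made-up example):

import java.nio.channels.FileChannel
import java.nio.file.{Paths, StandardOpenOption}

import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.ipc.ArrowFileReader

val allocator = new RootAllocator(Long.MaxValue)
val channel = FileChannel.open(Paths.get("/tmp/fpgrowth-model/data"), StandardOpenOption.READ)
val reader = new ArrowFileReader(channel, allocator)
try {
  val root = reader.getVectorSchemaRoot  // schema and vectors shared across batches
  while (reader.loadNextBatch()) {
    println(s"read a batch of ${root.getRowCount} rows")
  }
} finally {
  reader.close()   // also releases the underlying channel
  allocator.close()
}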
viirya
left a comment
I wonder why the Arrow file format was chosen now instead of Parquet?
Due to the batch -> bytes -> batch -> bytes process (when writing to the file), it doesn't look like an efficient approach.
val rdd = df.toArrowBatchRdd(maxRecordsPerBatch, "UTC", true, false)
val arrowSchema = ArrowUtils.toArrowSchema(df.schema, "UTC", true, false)
val writer = new SparkArrowFileWriter(arrowSchema, path)
writer.write(rdd.toLocalIterator)
Instead, can we call toLocalIterator on the original DataFrame's RDD and write the rows into Arrow batches locally? Then we wouldn't need the redundant bytes?
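A hedged sketch of that alternative: pull InternalRows to the driver and fill the VectorSchemaRoot directly with Spark's internal ArrowWriter, skipping the IPC byte hop. toRdd and ArrowWriter are Spark-internal APIs, and the wiring of root, fileWriter and maxRecordsPerBatch is assumed rather than taken from the PR:

import org.apache.arrow.vector.VectorSchemaRoot
import org.apache.arrow.vector.ipc.ArrowFileWriter

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.arrow.ArrowWriter

def writeRowsDirectly(
    rows: Iterator[InternalRow],
    root: VectorSchemaRoot,
    fileWriter: ArrowFileWriter,
    maxRecordsPerBatch: Int): Unit = {
  val arrowWriter = ArrowWriter.create(root)
  fileWriter.start()
  rows.grouped(maxRecordsPerBatch).foreach { group =>
    group.foreach(arrowWriter.write)  // append each row into the Arrow vectors
    arrowWriter.finish()              // finalize the current batch in the root
    fileWriter.writeBatch()           // persist the root's contents to the file
    arrowWriter.reset()               // clear the root for the next batch
  }
  fileWriter.end()
}

// e.g. writeRowsDirectly(df.queryExecution.toRdd.toLocalIterator, root, fileWriter, 10000)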
val extraMetadata: JObject = Map("numTrainingRecords" -> instance.numTrainingRecords)
DefaultParamsWriter.saveMetadata(instance, path, sparkSession,
  extraMetadata = Some(extraMetadata))
val dataPath = new Path(path, "data").toString
Can we pass the Path object to saveDataFrame directly?
sounds good!
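One possible shape for that change, sketched as a hypothetical overload rather than the PR's actual code:

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.DataFrame

// Accept a Hadoop Path and delegate to the existing String-based saveDataFrame,
// so call sites can pass new Path(path, "data") directly instead of calling .toString first.
def saveDataFrame(dataPath: Path, df: DataFrame): Unit =
  saveDataFrame(dataPath.toString, df)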
/** Convert to an RDD of serialized ArrowRecordBatches. */
private[sql] def toArrowBatchRdd(plan: SparkPlan): RDD[Array[Byte]] = {
private def toArrowBatchRddImpl(
    plan: SparkPlan,
nit: 4 spaces indentation
import org.apache.spark.sql.util.ArrowUtils

private[sql] class SparkArrowFileWriter(
    arrowSchema: Schema,
nit: 4 spaces indentation
WeichenXu123
left a comment
LGTM
Can we have a shared util to produce an RDD of Arrow batches? Then we can either turn it into an RDD of bytes or write it to local files.
This is actually already reusing a lot of existing utils. Basically, the code below is for ...
spark match {
  case s: org.apache.spark.sql.classic.SparkSession =>
    ArrowFileReadWrite.load(s, path)
  case _ => throw new UnsupportedOperationException("Unsupported session type")
Can we show the actual session type in the error?
Makes sense, will update!
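For example, the match could surface the concrete class; a small sketch of the follow-up (not the committed code):

spark match {
  case s: org.apache.spark.sql.classic.SparkSession =>
    ArrowFileReadWrite.load(s, path)
  case other =>
    throw new UnsupportedOperationException(
      s"Unsupported session type: ${other.getClass.getName}")
}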
What changes were proposed in this pull request?
FPGrowth supports the local filesystem.
Why are the changes needed?
To make FPGrowth work with the local filesystem.
Does this PR introduce any user-facing change?
Yes, FPGrowth will work when local saving mode is on.
How was this patch tested?
updated tests
Was this patch authored or co-authored using generative AI tooling?
no