Why no bulk Arrow→Parquet write API in Java? How to avoid row-by-row RecordConsumer + optimize? #907

@Fenil-v

Describe the usage question you have. Please include as many useful details as possible.

I have objects of roughly 20 KB each that I need to write to Parquet efficiently from Java.
In C++, C#, and Python there is a direct, bulk Arrow-Parquet write (e.g. WriteTable / write_table) that avoids row-by-row iteration, but in Java I only see row-by-row paths via RecordConsumer or internal/unstable column writers.
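To make the overhead concrete, here is a minimal sketch of the two call patterns. `RowSink` and `ColumnSink` are hypothetical stand-ins made up for illustration: they only mimic a start/add/end call shape and a per-column hand-off, and are not the real parquet-java or Arrow interfaces. The sketch counts calls on dummy data rather than writing any Parquet.

```java
public class RowVsColumnarSketch {

    /** Hypothetical row-oriented sink; mirrors the start/add/end call shape only. */
    interface RowSink {
        void startMessage();
        void startField(String name, int index);
        void addLong(long value);
        void endField(String name, int index);
        void endMessage();
    }

    /** Hypothetical columnar sink: one call hands over a whole column of a batch. */
    interface ColumnSink {
        void writeColumn(String name, long[] values);
    }

    /** Counts every call instead of encoding anything. */
    static final class CountingRowSink implements RowSink {
        long calls;
        public void startMessage() { calls++; }
        public void startField(String name, int index) { calls++; }
        public void addLong(long value) { calls++; }
        public void endField(String name, int index) { calls++; }
        public void endMessage() { calls++; }
    }

    /** Counts one call per column handed over. */
    static final class CountingColumnSink implements ColumnSink {
        long calls;
        public void writeColumn(String name, long[] values) { calls++; }
    }

    /** Builds `fields` columns of `rows` dummy values each. */
    static long[][] makeBatch(int rows, int fields) {
        long[][] columns = new long[fields][rows];
        for (int f = 0; f < fields; f++) {
            for (int r = 0; r < rows; r++) {
                columns[f][r] = (long) f * rows + r;
            }
        }
        return columns;
    }

    /** Row-by-row pattern: every value is wrapped in start/end calls. */
    public static long rowByRowCalls(int rows, int fields) {
        long[][] columns = makeBatch(rows, fields);
        CountingRowSink sink = new CountingRowSink();
        for (int r = 0; r < rows; r++) {
            sink.startMessage();
            for (int f = 0; f < fields; f++) {
                String name = "f" + f;
                sink.startField(name, f);
                sink.addLong(columns[f][r]);
                sink.endField(name, f);
            }
            sink.endMessage();
        }
        return sink.calls;
    }

    /** Columnar pattern: one call per column, independent of the row count. */
    public static long columnarCalls(int rows, int fields) {
        long[][] columns = makeBatch(rows, fields);
        CountingColumnSink sink = new CountingColumnSink();
        for (int f = 0; f < fields; f++) {
            sink.writeColumn("f" + f, columns[f]);
        }
        return sink.calls;
    }

    public static void main(String[] args) {
        int rows = 1000, fields = 3;
        System.out.println("row-by-row calls: " + rowByRowCalls(rows, fields));
        System.out.println("columnar calls:   " + columnarCalls(rows, fields));
    }
}
```

In this toy model the row-by-row path makes `rows * (2 + 3 * fields)` calls per batch while the columnar path makes `fields` calls. That difference in per-value call overhead is what I would like to avoid.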
Questions:

  1. Is there a supported bulk/columnar Arrow-Parquet write API in Java (e.g., VectorSchemaRoot → Parquet) that avoids row-by-row calls?
  2. If not, why is Java limited to row-by-row writes today? Any roadmap for feature parity with C++/Python/C#?
  3. For now, what's the recommended optimization path to write 20KB objects at high throughput from Java (without JNI), or is JNI/Dataset the recommended route?
  4. Any best practices (batch sizing, encodings, writer settings) to mitigate the row-by-row overhead?

Component(s)

Java
