[Rust] [Parquet] Arrow writer benchmarks #27944

Closed
asfimport opened this issue Mar 28, 2021 · 1 comment

Comments

@asfimport

A common concern with Parquet's Arrow readers and writers is that they're slow.
My diagnosis is that we rely on a chain of separate processing steps, each of which introduces overhead.
For example, writing an Arrow RecordBatch involves the following:

  1. Iterate through arrays to create def/rep levels
  2. Extract Parquet primitive values from arrays using these levels
  3. Write primitive values, validating them in the process (even though they should already have been validated)
  4. Split the already materialised values into small batches for Parquet chunks (consider a batch with 1e6 values)
  5. Write these batches, computing the stats of each batch, and encoding values
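
To make the chain above concrete, here is a minimal std-only sketch of steps 1, 2, 4 and 5 for a flat, nullable `i32` column. It is not the actual writer code: `Levelled`, `levels_and_values`, `BatchStats` and `write_in_batches` are hypothetical names, a plain `&[Option<i32>]` stands in for an Arrow array, and validation (step 3), encoding and repetition levels are left out. For a flat nullable column the definition level is 1 where a value is present and 0 where it is null.

```rust
/// Output of steps 1 and 2 for a flat, nullable column (hypothetical type).
pub struct Levelled {
    /// One definition level per slot: 1 = value present, 0 = null.
    pub def_levels: Vec<i16>,
    /// Only the non-null values, in their original order.
    pub values: Vec<i32>,
}

/// Steps 1-2: walk the array, emit a definition level per slot and
/// materialise the non-null values into a separate buffer.
pub fn levels_and_values(array: &[Option<i32>]) -> Levelled {
    let mut def_levels = Vec::with_capacity(array.len());
    let mut values = Vec::new();
    for slot in array {
        match slot {
            Some(v) => {
                def_levels.push(1);
                values.push(*v);
            }
            None => def_levels.push(0),
        }
    }
    Levelled { def_levels, values }
}

/// Min/max statistics for one mini-batch (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct BatchStats {
    pub min: i32,
    pub max: i32,
    pub count: usize,
}

/// Steps 4-5: split the already materialised values into mini-batches and
/// compute the statistics of each one. Encoding is omitted in this sketch.
pub fn write_in_batches(values: &[i32], batch_size: usize) -> Vec<BatchStats> {
    assert!(batch_size > 0, "batch_size must be non-zero");
    values
        .chunks(batch_size)
        .map(|chunk| BatchStats {
            // `chunks` never yields an empty slice, so min/max always exist.
            min: *chunk.iter().min().unwrap(),
            max: *chunk.iter().max().unwrap(),
            count: chunk.len(),
        })
        .collect()
}
```

Even in this simplified form, the data is traversed several times (once for the levels and values, then again per mini-batch for min and max), and an intermediate `values` buffer is allocated along the way.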

The above is a side effect of convenience: bypassing some of these steps would likely require a lot more effort.

I have ideas around going from step 1 to 5 directly, but I won't know whether that is better without performance benchmarks. I also struggle to see whether I'm making improvements while I clean up the writer code, especially when removing the allocations that I introduced to reduce the complexity of the level calculations.

With ARROW-12120 (random array & batch generator), it becomes more convenient to benchmark (and test many combinations of) the Arrow writer.

I would thus like to start adding benchmarks for the Arrow writer.
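
As a rough illustration of the kind of comparison such benchmarks would make possible, the sketch below times a staged pipeline (materialise values, then chunk and compute min/max) against a fused single pass over the same input. Everything here is an assumption for illustration: `make_input`, `staged_min_max`, `fused_min_max` and `mean_time` are hypothetical helpers, `&[Option<i32>]` stands in for an Arrow array, and the small LCG only mimics a random array generator. A real benchmark would more likely use a proper harness such as Criterion rather than a hand-rolled timer.

```rust
use std::time::{Duration, Instant};

/// Deterministic pseudo-random nullable input (roughly 10% nulls).
/// A stand-in for a real random array generator.
pub fn make_input(len: usize) -> Vec<Option<i32>> {
    let mut state: u32 = 12345;
    (0..len)
        .map(|_| {
            state = state.wrapping_mul(1_664_525).wrapping_add(1_013_904_223);
            if state % 10 == 0 {
                None
            } else {
                Some((state >> 8) as i32)
            }
        })
        .collect()
}

/// Staged: materialise the non-null values first, then chunk them and
/// compute (min, max) per mini-batch.
pub fn staged_min_max(array: &[Option<i32>], batch_size: usize) -> Vec<(i32, i32)> {
    assert!(batch_size > 0, "batch_size must be non-zero");
    let values: Vec<i32> = array.iter().flatten().copied().collect();
    values
        .chunks(batch_size)
        .map(|c| (*c.iter().min().unwrap(), *c.iter().max().unwrap()))
        .collect()
}

/// Fused: a single pass with no intermediate values buffer.
pub fn fused_min_max(array: &[Option<i32>], batch_size: usize) -> Vec<(i32, i32)> {
    assert!(batch_size > 0, "batch_size must be non-zero");
    let mut out = Vec::new();
    let mut current: Option<(i32, i32)> = None;
    let mut filled = 0usize;
    for v in array.iter().flatten().copied() {
        current = Some(match current {
            Some((lo, hi)) => (lo.min(v), hi.max(v)),
            None => (v, v),
        });
        filled += 1;
        if filled == batch_size {
            // Mini-batch is full: emit its stats and start a new one.
            out.push(current.take().unwrap());
            filled = 0;
        }
    }
    // Emit the trailing, partially filled mini-batch if there is one.
    if let Some(last) = current {
        out.push(last);
    }
    out
}

/// Run `f` `iters` times and return the mean wall-clock duration per run.
pub fn mean_time<T>(iters: u32, mut f: impl FnMut() -> T) -> Duration {
    assert!(iters > 0, "iters must be non-zero");
    let start = Instant::now();
    for _ in 0..iters {
        // black_box keeps the optimiser from discarding the result.
        std::hint::black_box(f());
    }
    start.elapsed() / iters
}
```

Calling `mean_time(100, || staged_min_max(&input, 1024))` and `mean_time(100, || fused_min_max(&input, 1024))` on the same `input` gives two numbers to compare. Which variant is faster depends on the data and the machine, which is exactly why measurements are needed before restructuring the writer.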

Reporter: Neville Dipale / @nevi-me
Assignee: Neville Dipale / @nevi-me

PRs and other links:

Note: This issue was originally created as ARROW-12121. Please see the migration documentation for further details.

@asfimport
Author

Neville Dipale / @nevi-me:
Issue resolved by pull request 9825
#9825

@asfimport asfimport added this to the 4.0.0 milestone Jan 11, 2023