[Rust] [Parquet] Implement parquet writer #24603

Closed
16 tasks done
asfimport opened this issue Apr 13, 2020 · 2 comments

asfimport commented Apr 13, 2020

This is the parent story. See subtasks for more information.

Notes from @wesm:

A couple of initial things to keep in mind:

  • Support writes of both nullable (OPTIONAL) and non-nullable (REQUIRED) fields

  • You can optimize the special case where a nullable field's data has no nulls

  • A good amount of code is required to convert from the Arrow physical form of various logical types to the equivalent Parquet form; see https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc for details

  • It would be worth thinking up front about how dictionary-encoded data is handled on both the Arrow write and Arrow read paths. In parquet-cpp we initially discarded Arrow DictionaryArrays on write (casting e.g. Dictionary to dense String), and through real-world need I was forced to revisit this (quite painfully) to enable Arrow dictionaries to survive roundtrips to Parquet format, and also to achieve better performance and memory use in both reads and writes. You can certainly do a dictionary-to-dense conversion like we did, but you may someday find yourselves doing the same painful refactor that I did to make dictionary write and read not only more efficient but also dictionary-order preserving.
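
The dictionary concern in the last note can be illustrated with a toy model. The sketch below is not parquet-cpp's or the Rust crate's code; `DictColumn`, `to_dense`, and `re_encode` are hypothetical names, and the re-encoding strategy (first-seen order) is an assumption chosen only to show how a dictionary-to-dense cast on write can lose the original dictionary order and any unused entries.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an Arrow DictionaryArray of strings.
#[derive(Debug, PartialEq)]
pub struct DictColumn {
    pub dictionary: Vec<String>,
    pub keys: Vec<u32>,
}

/// Dictionary-to-dense cast: look every key up in the dictionary.
/// Returns None if a key points outside the dictionary.
pub fn to_dense(col: &DictColumn) -> Option<Vec<String>> {
    col.keys
        .iter()
        .map(|&k| col.dictionary.get(k as usize).cloned())
        .collect()
}

/// Rebuild a dictionary from dense values, assigning keys in first-seen order.
/// This is what a reader can do when only the dense values were written.
pub fn re_encode(dense: &[String]) -> DictColumn {
    let mut index: HashMap<&str, u32> = HashMap::new();
    let mut dictionary: Vec<String> = Vec::new();
    let mut keys = Vec::with_capacity(dense.len());
    for value in dense {
        let next = dictionary.len() as u32;
        let key = *index.entry(value.as_str()).or_insert_with(|| {
            dictionary.push(value.clone());
            next
        });
        keys.push(key);
    }
    DictColumn { dictionary, keys }
}
```

With a dictionary of `["b", "a", "c"]` and keys `[1, 1, 0]`, the dense form is `["a", "a", "b"]`, and re-encoding it yields a dictionary of `["a", "b"]` with keys `[0, 0, 1]`: the values survive, but the original dictionary does not. Preserving it requires carrying the dictionary and keys through the write path instead of casting to dense.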

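The first two notes above (OPTIONAL vs. REQUIRED fields, and the no-nulls special case) can be sketched for a flat, non-nested column, where the maximum definition level is 1. This is a minimal sketch under stated assumptions: `Int32Column`, `LeafData`, and `to_leaf_data` are hypothetical names, not the parquet crate's API, and the validity mask is assumed to have the same length as the values.

```rust
/// Hypothetical stand-in for an Arrow Int32 array: one slot per row, plus an
/// optional validity mask (`None` means every slot is valid).
pub struct Int32Column {
    pub values: Vec<i32>,
    pub validity: Option<Vec<bool>>,
}

/// What a flat column writer consumes: definition levels (absent for
/// REQUIRED fields) and only the non-null values.
#[derive(Debug, PartialEq)]
pub struct LeafData {
    pub def_levels: Option<Vec<i16>>,
    pub values: Vec<i32>,
}

pub fn to_leaf_data(col: &Int32Column, nullable: bool) -> Result<LeafData, String> {
    let has_nulls = col
        .validity
        .as_ref()
        .map_or(false, |mask| mask.iter().any(|valid| !*valid));

    if !nullable {
        // REQUIRED: no definition levels are written, and nulls are an error.
        if has_nulls {
            return Err("REQUIRED field contains nulls".to_string());
        }
        return Ok(LeafData { def_levels: None, values: col.values.clone() });
    }

    if !has_nulls {
        // Special case: OPTIONAL field with no nulls. Every level is 1 and
        // the values need no per-slot filtering.
        return Ok(LeafData {
            def_levels: Some(vec![1; col.values.len()]),
            values: col.values.clone(),
        });
    }

    // General case: level 0 for a null slot, level 1 plus the value otherwise.
    let mask = col.validity.as_ref().expect("has_nulls implies a mask");
    let mut def_levels = Vec::with_capacity(mask.len());
    let mut values = Vec::new();
    for (value, valid) in col.values.iter().zip(mask.iter()) {
        if *valid {
            def_levels.push(1);
            values.push(*value);
        } else {
            def_levels.push(0);
        }
    }
    Ok(LeafData { def_levels: Some(def_levels), values })
}
```

A real writer would likely avoid the clones and encode the all-ones levels compactly; the sketch only shows where the three cases diverge.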
Notes from [~sunchao]:

I roughly skimmed through the C++ implementation and think that, at a high level, we need to do the following:

  1. Implement a method similar to WriteArrow in column_writer.cc. We can further break this up into smaller pieces such as dictionary/non-dictionary, primitive types, booleans, timestamps, dates, and so on.
  2. Implement an arrow writer in the parquet crate here. This needs to offer APIs similar to those in writer.h.
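
The per-type breakdown in step 1 could be sketched as a mapping from an Arrow logical type to a Parquet physical type, plus a value conversion where the two representations differ. Everything named here (`ArrowKind`, `PhysicalType`, `physical_type`, `date64_to_days`, `seconds_to_millis`) is hypothetical and covers only a handful of types. It assumes that Date64 values are narrowed to 32-bit day counts and that second-resolution timestamps are widened to milliseconds, since Parquet has no seconds unit; rounding toward negative infinity for pre-epoch dates is a choice made for this sketch.

```rust
/// Hypothetical subset of Arrow logical types.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ArrowKind {
    Boolean,
    Int32,
    Int64,
    Date32,
    Date64,
    TimestampSecond,
    TimestampMillisecond,
}

/// Hypothetical subset of Parquet physical types.
#[derive(Debug, PartialEq)]
pub enum PhysicalType {
    Boolean,
    Int32,
    Int64,
}

/// Pick the physical type each logical type is stored as.
pub fn physical_type(kind: ArrowKind) -> PhysicalType {
    match kind {
        ArrowKind::Boolean => PhysicalType::Boolean,
        // Dates are stored as 32-bit day counts.
        ArrowKind::Int32 | ArrowKind::Date32 | ArrowKind::Date64 => PhysicalType::Int32,
        ArrowKind::Int64
        | ArrowKind::TimestampSecond
        | ArrowKind::TimestampMillisecond => PhysicalType::Int64,
    }
}

const MILLIS_PER_DAY: i64 = 86_400_000;

/// Date64 (milliseconds since epoch) to days since epoch, flooring so that
/// instants before the epoch land on the previous day. The cast truncates
/// day counts outside the i32 range.
pub fn date64_to_days(millis: &[i64]) -> Vec<i32> {
    millis
        .iter()
        .map(|ms| ms.div_euclid(MILLIS_PER_DAY) as i32)
        .collect()
}

/// Second-resolution timestamps to milliseconds; None on overflow.
pub fn seconds_to_millis(seconds: &[i64]) -> Option<Vec<i64>> {
    seconds.iter().map(|s| s.checked_mul(1000)).collect()
}
```

Splitting the conversions into small pure functions like these keeps each type's rules testable in isolation before they are wired into a column writer.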

Reporter: Andy Grove / @andygrove
Assignee: Neville Dipale / @nevi-me

Subtasks:

PRs and other links:

Note: This issue was originally created as ARROW-8421. Please see the migration documentation for further details.


Jorge Leitão / @jorgecarleitao:
[~nevi_me], will this be in time for 4.0 or should we bump it to 5.0?


Andrew Lamb / @alamb:
Migrated to github: apache/arrow-rs#216

asfimport added this to the 5.0.0 milestone Jan 11, 2023