Why Parquet is a part of Arrow? #1715

Closed
HaoYang670 opened this issue May 20, 2022 · 6 comments
Labels
arrow (Changes to the arrow crate) · parquet (Changes to the parquet crate) · question (Further information is requested)

Comments

@HaoYang670
Contributor

Which part is this question about
I find this description in the README of parquet crate:

This crate contains the official Native Rust implementation of Apache Parquet, which is part of the Apache Arrow project.

Describe your question

  1. Why is Parquet part of Arrow? Shouldn't they be independent?
  2. Arrow2 and Parquet2 are independent crates. Could we move parquet to a top level repo, for example Apache/parquet-rs?
    https://github.com/jorgecarleitao/parquet2
  3. Parquet should be maintained by the Apache Parquet committee. Isn't it a little odd for Apache Arrow contributors to maintain this crate?

Additional context
No.

HaoYang670 added the "question" label on May 20, 2022
HaoYang670 changed the title from "Why Parquet is a part of `Apache Arrow?" to "Why Parquet is a part of Arrow?" on May 20, 2022
@HaoYang670
Contributor Author

@alamb Look forward to your opinion.

@alamb
Contributor

alamb commented May 20, 2022

Why is Parquet part of Arrow? Shouldn't they be independent?

My understanding is that the parquet project is a separate top level ASF project.

https://projects.apache.org/committee.html?arrow

https://projects.apache.org/committee.html?parquet

Arrow2 and Parquet2 are independent crates. Could we move parquet to a top level repo, for example Apache/parquet-rs?

Yes, that would be fine -- right now they are in the same repo because the same people maintain them, and keeping them together lowers the maintenance burden. I would personally not be opposed to separating them.

I think it is a similar setup to the C++ implementation https://github.com/apache/arrow/tree/master/cpp which has arrow and parquet in the same format

Parquet should be maintained by the Apache Parquet committee. Isn't it a little odd for Apache Arrow contributors to maintain this crate?

I don't disagree -- the reason I am helping with both is that we need both in our project.

Also, I think fast conversion between arrow <--> parquet is important and having them in the same repo may help with that.

@Ted-Jiang
Member

Also, I think fast conversion between arrow <--> parquet is important and having them in the same repo may help with that.
+1 👍

@alamb I also have a question: is parquet-rs only schema compatible with parquet-mr, or does it offer full functional support?
Is there any benchmark comparing the Java and Rust implementations?

@tustvold
Contributor

tustvold commented May 24, 2022

We intend to be fully functionally compatible with parquet-mr, please do file feature requests if you find any areas where we aren't. I think we're mostly there, aside from page index support.

As for performance comparison, I would be disappointed if we aren't significantly faster reading to arrow, but I do not have any benchmarks to verify this. Again I would be very interested in areas where we are slower.

FWIW reading to arrow is likely faster than the row level APIs, especially for byte arrays where columnar decoding makes a huge difference
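To make the byte array point concrete, here is a minimal std-only sketch, not the actual parquet or arrow crate code. It decodes a PLAIN-encoded BYTE_ARRAY buffer (each value is a 4-byte little-endian length followed by its bytes) in two ways: row by row into a `Vec<Vec<u8>>`, and into an Arrow-style columnar layout with one offsets buffer and one contiguous values buffer. The names `ByteColumn`, `split_value`, `encode_plain`, `decode_rows` and `decode_columnar` are hypothetical and exist only for this illustration.

```rust
/// Columnar byte-array layout in the style of Arrow's variable-length binary
/// arrays: one contiguous values buffer plus an offsets buffer.
/// (Hypothetical type for illustration, not the arrow crate's array type.)
#[derive(Debug, PartialEq)]
struct ByteColumn {
    /// Value `i` occupies `values[offsets[i]..offsets[i + 1]]`.
    offsets: Vec<i32>,
    values: Vec<u8>,
}

impl ByteColumn {
    /// Borrows value `i` without copying it.
    fn value(&self, i: usize) -> &[u8] {
        &self.values[self.offsets[i] as usize..self.offsets[i + 1] as usize]
    }
}

/// Splits the next length-prefixed value off the front of `buf`.
/// Returns `None` if the buffer is truncated.
fn split_value(buf: &[u8]) -> Option<(&[u8], &[u8])> {
    if buf.len() < 4 {
        return None;
    }
    let len = u32::from_le_bytes([buf[0], buf[1], buf[2], buf[3]]) as usize;
    let rest = &buf[4..];
    if rest.len() < len {
        return None;
    }
    Some(rest.split_at(len))
}

/// Test helper: PLAIN-encodes byte arrays as length prefix + bytes.
fn encode_plain(values: &[&[u8]]) -> Vec<u8> {
    let mut out = Vec::new();
    for v in values {
        out.extend_from_slice(&(v.len() as u32).to_le_bytes());
        out.extend_from_slice(v);
    }
    out
}

/// Row-level decoding: one heap allocation per value.
fn decode_rows(mut buf: &[u8]) -> Option<Vec<Vec<u8>>> {
    let mut rows = Vec::new();
    while !buf.is_empty() {
        let (value, rest) = split_value(buf)?;
        rows.push(value.to_vec());
        buf = rest;
    }
    Some(rows)
}

/// Columnar decoding: every value is appended to a single shared buffer,
/// so there is no per-value allocation.
fn decode_columnar(mut buf: &[u8]) -> Option<ByteColumn> {
    let mut col = ByteColumn {
        offsets: vec![0],
        values: Vec::with_capacity(buf.len()),
    };
    while !buf.is_empty() {
        let (value, rest) = split_value(buf)?;
        col.values.extend_from_slice(value);
        // 32-bit offsets cap the values buffer at i32::MAX bytes.
        if col.values.len() > i32::MAX as usize {
            return None;
        }
        col.offsets.push(col.values.len() as i32);
        buf = rest;
    }
    Some(col)
}
```

Both functions read the same bytes; the difference is that the row-level path allocates once per value while the columnar path grows one buffer and records offsets, which is one plausible reason the gap is largest for byte arrays.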

I have not spent much time optimising the write path, and there is likely a lot of low hanging fruit.

As for why this is part of arrow, the best argument I can give is that most users want an interface to parquet that is performant and easy to understand. Arrow provides this. Whilst some users may want to integrate at a lower level and deal with all the complexities of Dremel, encodings, etc., most users will want this to be handled for them.
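As a small taste of the Dremel bookkeeping mentioned above, here is a hedged std-only sketch of the simplest case: a flat OPTIONAL column, where the maximum definition level is 1. A level of 1 means the value is present and 0 means it is null, and nulls are not stored in the values buffer at all. The function names `to_levels` and `from_levels` are hypothetical, and nested or repeated fields (which also need repetition levels) are deliberately left out.

```rust
/// Splits a nullable column into Dremel-style definition levels and a dense
/// values buffer. Assumes a flat OPTIONAL field, so the max level is 1.
fn to_levels(column: &[Option<i32>]) -> (Vec<i16>, Vec<i32>) {
    let mut def_levels = Vec::with_capacity(column.len());
    let mut values = Vec::new();
    for slot in column {
        match *slot {
            Some(v) => {
                def_levels.push(1); // defined: the value is stored
                values.push(v);
            }
            None => def_levels.push(0), // null: nothing is stored
        }
    }
    (def_levels, values)
}

/// Rebuilds the nullable column from levels and dense values.
/// Returns `None` if the two buffers are inconsistent with each other.
fn from_levels(def_levels: &[i16], values: &[i32]) -> Option<Vec<Option<i32>>> {
    let mut remaining = values.iter();
    let mut column = Vec::with_capacity(def_levels.len());
    for &level in def_levels {
        match level {
            1 => column.push(Some(*remaining.next()?)),
            0 => column.push(None),
            _ => return None, // levels above 1 only occur with nesting
        }
    }
    // Leftover values mean the levels did not account for every value.
    if remaining.next().is_some() {
        return None;
    }
    Some(column)
}
```

For example, the column `[Some(1), None, Some(3)]` becomes definition levels `[1, 0, 1]` and values `[1, 3]`. Reading into Arrow hides this step: the caller sees a nullable array rather than two parallel buffers.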

@ahmedriza

ahmedriza commented May 24, 2022

We've been using arrow-rs in place of a legacy Scala code base that uses parquet-mr and found that arrow-rs is catching up quite nicely now.

I think that #1718 will blow parquet-mr away by quite a margin.

@tustvold
Contributor

tustvold commented Jun 8, 2022

I think the question has been answered, so closing. Feel free to reopen if I'm mistaken.
