Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-1392: Read multiple RowGroups at once into an Arrow table #492

Closed
wants to merge 2 commits into from

Conversation

xhochy
Copy link
Member

@xhochy xhochy commented Aug 19, 2018

Decided to go with the more simplistic approach and only introduce a convenience API for now. Once is merged, I'll do some work that at least primitive arrays are read into a single Array in this path.

@xhochy xhochy changed the title Read multiple RowGroups at once into an Arrow table PARQUET-1392: Read multiple RowGroups at once into an Arrow table Aug 22, 2018
@xhochy xhochy force-pushed the PARQUET-1392 branch 3 times, most recently from 617d49e to 57da415 Compare August 22, 2018 09:17
Copy link

@majetideepak majetideepak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 LGTM apart from a minor suggestion.


for (size_t i = 0; i < row_groups.size(); ++i) {
std::shared_ptr<Table> rg_table;
RETURN_NOT_OK(ReadRowGroup(row_groups[i], indices, &rg_table));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a small optimization here is to use
std::vector<std::shared_ptr<Table>> tables(row_group.size(), nullptr) above and
use &tables[i] here. Afaik certain tools like Spark generate parquet files with many small RowGroups.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just saw your TODO. So this change might be less relevant.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still, added it as it saves some cycles.

@xhochy xhochy closed this in 22c7cbb Aug 23, 2018
@xhochy xhochy deleted the PARQUET-1392 branch August 23, 2018 13:17
@wesm
Copy link
Member

wesm commented Aug 23, 2018

+1 also. I will look at the remaining PR for cpp-1.5.0 this morning

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants