
[Python] Addition of option to allow empty Parquet row groups #19381

Closed
asfimport opened this issue Aug 8, 2018 · 3 comments

@asfimport

While our use case is not common, I was able to find one related request from roughly a year ago. Could this be added as a feature?

https://issues.apache.org/jira/browse/PARQUET-1047

Motivation

We have an application where each row is associated with one of N contexts, though a minority of contexts may have no associated rows. When we encounter the nth context, we wish to retrieve all of its associated rows. Row groups would provide a natural way to index the data, with the nth context mapping to the nth row group.

Unfortunately, this is not possible at the present time, as pyarrow does not support writing empty row groups. If one writes a pyarrow.Table containing zero rows using pyarrow.parquet.ParquetWriter, it is omitted from the final file, and this distorts the indexing.
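Below is a minimal sketch of the failure mode described above, using hypothetical context data and a placeholder file name. With a pyarrow release from the time of this report, the zero-row table is silently dropped, so the file ends up with two row groups rather than the three needed for context indexing.

import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([("value", pa.int64())])

# One table per context; the second context happens to have no rows.
tables = [
    pa.table({"value": [1, 2, 3]}, schema=schema),
    pa.table({"value": []}, schema=schema),
    pa.table({"value": [4, 5]}, schema=schema),
]

writer = pq.ParquetWriter("contexts.parquet", schema)
for t in tables:
    writer.write_table(t)
writer.close()

metadata = pq.ParquetFile("contexts.parquet").metadata
# Reported behavior: prints 2, not the 3 required for nth-context lookup.
print(metadata.num_row_groups)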

Reporter: Alex Mendelson
Assignee: Wes McKinney / @wesm

Note: This issue was originally created as ARROW-3020. Please see the migration documentation for further details.

@asfimport

Wes McKinney / @wesm:
Patches welcome

@asfimport

Tanya Schlusser / @tanyaschlusser:
I looked into this, and I do not believe the Parquet code permits it at the moment, despite a comment on the linked PARQUET-1047 ticket suggesting that it did. pyarrow's ParquetWriter eventually calls the C++ FileWriter class; here is the current FileWriter::WriteTable implementation. If table.num_rows() is zero, the loop body never executes, so no row group is ever written.

Status FileWriter::WriteTable(const Table& table, int64_t chunk_size) {
  if (chunk_size <= 0) {
    return Status::Invalid("chunk size per row_group must be greater than 0");
  } else if (chunk_size > impl_->properties().max_row_group_length()) {
    chunk_size = impl_->properties().max_row_group_length();
  }

  // When table.num_rows() == 0, this condition is false on the first
  // iteration, so NewRowGroup is never called and no row group is written.
  for (int chunk = 0; chunk * chunk_size < table.num_rows(); chunk++) {
    int64_t offset = chunk * chunk_size;
    int64_t size = std::min(chunk_size, table.num_rows() - offset);

    RETURN_NOT_OK_ELSE(NewRowGroup(size), PARQUET_IGNORE_NOT_OK(Close()));
    for (int i = 0; i < table.num_columns(); i++) {
      auto chunked_data = table.column(i)->data();
      RETURN_NOT_OK_ELSE(WriteColumnChunk(chunked_data, offset, size),
                         PARQUET_IGNORE_NOT_OK(Close()));
    }
  }
  return Status::OK();
}

@asfimport

Uwe Korn / @xhochy:
Issue resolved by pull request #3269
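For later readers, here is a hedged sketch of the post-fix behavior, assuming a pyarrow release that includes this change (0.12.0 per the milestone below) and a placeholder file name. An empty table now yields an empty row group, so the nth context maps to the nth row group and can be read back directly with read_row_group.

import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([("value", pa.int64())])

writer = pq.ParquetWriter("contexts.parquet", schema)
writer.write_table(pa.table({"value": [1, 2, 3]}, schema=schema))
writer.write_table(pa.table({"value": []}, schema=schema))  # context with no rows
writer.close()

f = pq.ParquetFile("contexts.parquet")
print(f.metadata.num_row_groups)     # 2: the empty row group is preserved
print(f.read_row_group(1).num_rows)  # 0: rows for the empty context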

@asfimport asfimport added this to the 0.12.0 milestone Jan 11, 2023