
[Python] Addition of option to allow empty Parquet row groups #19381

Closed
asfimport opened this issue Aug 8, 2018 · 3 comments

@asfimport

While our use case is not common, I was able to find one related request from roughly a year ago. Could this be added as a feature?

https://issues.apache.org/jira/browse/PARQUET-1047

Motivation

We have an application where each row is associated with one of N contexts, though a minority of contexts may have no associated rows. When we encounter the nth context, we wish to retrieve all of its associated rows. Row groups would provide a natural way to index the data, with the nth context mapping to the nth row group.

Unfortunately, this is not possible at the present time, as pyarrow does not support writing empty row groups. If one writes a pyarrow.Table containing zero rows using pyarrow.parquet.ParquetWriter, it is omitted from the final file, and this distorts the indexing.
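Below is a minimal sketch of the failure mode described above, using hypothetical context data and a placeholder file name. With a pyarrow release from the time of this report, the zero-row table is silently dropped, so the file ends up with two row groups rather than the three needed for context indexing.

import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([("value", pa.int64())])

# One table per context; the second context happens to have no rows.
tables = [
    pa.table({"value": [1, 2, 3]}, schema=schema),
    pa.table({"value": []}, schema=schema),
    pa.table({"value": [4, 5]}, schema=schema),
]

writer = pq.ParquetWriter("contexts.parquet", schema)
for t in tables:
    writer.write_table(t)
writer.close()

metadata = pq.ParquetFile("contexts.parquet").metadata
# Reported behavior: prints 2, not the 3 required for nth-context lookup.
print(metadata.num_row_groups)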

Reporter: Alex Mendelson
Assignee: Wes McKinney / @wesm

Note: This issue was originally created as ARROW-3020. Please see the migration documentation for further details.

@asfimport

Wes McKinney / @wesm:
Patches welcome

@asfimport

Tanya Schlusser / @tanyaschlusser:
I looked into this, and I do not believe the Parquet code permits it at the moment, despite a comment on the linked PARQUET-1047 ticket suggesting that it did. pyarrow's ParquetWriter eventually calls the C++ FileWriter class; here is the current FileWriter::WriteTable implementation. If table.num_rows() is zero, the loop body never executes, so no row group is ever written.

Status FileWriter::WriteTable(const Table& table, int64_t chunk_size) {
  if (chunk_size <= 0) {
    return Status::Invalid("chunk size per row_group must be greater than 0");
  } else if (chunk_size > impl_->properties().max_row_group_length()) {
    chunk_size = impl_->properties().max_row_group_length();
  }

  // When table.num_rows() == 0, this condition is false on the first
  // iteration, so NewRowGroup is never called and no row group is written.
  for (int chunk = 0; chunk * chunk_size < table.num_rows(); chunk++) {
    int64_t offset = chunk * chunk_size;
    int64_t size = std::min(chunk_size, table.num_rows() - offset);

    RETURN_NOT_OK_ELSE(NewRowGroup(size), PARQUET_IGNORE_NOT_OK(Close()));
    for (int i = 0; i < table.num_columns(); i++) {
      auto chunked_data = table.column(i)->data();
      RETURN_NOT_OK_ELSE(WriteColumnChunk(chunked_data, offset, size),
                         PARQUET_IGNORE_NOT_OK(Close()));
    }
  }
  return Status::OK();
}

@asfimport

Uwe Korn / @xhochy:
Issue resolved by pull request #3269
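For later readers, here is a hedged sketch of the post-fix behavior, assuming a pyarrow release that includes this change (0.12.0 per the milestone below) and a placeholder file name. An empty table now yields an empty row group, so the nth context maps to the nth row group and can be read back directly with read_row_group.

import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([("value", pa.int64())])

writer = pq.ParquetWriter("contexts.parquet", schema)
writer.write_table(pa.table({"value": [1, 2, 3]}, schema=schema))
writer.write_table(pa.table({"value": []}, schema=schema))  # context with no rows
writer.close()

f = pq.ParquetFile("contexts.parquet")
print(f.metadata.num_row_groups)     # 2: the empty row group is preserved
print(f.read_row_group(1).num_rows)  # 0: rows for the empty context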

@asfimport asfimport added this to the 0.12.0 milestone Jan 11, 2023