We have an application where each row is associated with one of N contexts, though a minority of contexts may have no associated rows. When we encounter the nth context, we want to retrieve all of its associated rows. Row groups would provide a natural way to index the data, as the nth context could map directly to the nth row group.
Unfortunately, this is not possible at the present time, as pyarrow does not support writing empty row groups. If one writes a pyarrow.Table containing zero rows using pyarrow.parquet.ParquetWriter, it is omitted from the final file, and this distorts the indexing.
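To make the indexing distortion concrete, here is a minimal pure-Python sketch. It assumes one write per context, in order, and that a zero-row write is silently dropped as described above. The function name `row_group_positions` and the sample row counts are hypothetical and are not part of pyarrow.

```python
from typing import List, Optional


def row_group_positions(rows_per_context: List[int]) -> List[Optional[int]]:
    """Map each context index to the row group it actually lands in.

    Assumption: one write per context, in order, and a zero-row
    write produces no row group at all.
    """
    positions: List[Optional[int]] = []
    next_group = 0
    for n_rows in rows_per_context:
        if n_rows == 0:
            # No row group is written for an empty context.
            positions.append(None)
        else:
            positions.append(next_group)
            next_group += 1
    return positions


# Hypothetical row counts for four contexts; context 1 is empty.
positions = row_group_positions([3, 0, 2, 5])
print(positions)  # [0, None, 1, 2]
```

Context 2 ends up in row group 1 and context 3 in row group 2, so after the first empty context the nth context no longer corresponds to the nth row group.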
Tanya Schlusser / @tanyaschlusser:
I looked into this and do not believe the Parquet code permits this at the moment, despite the comment in the OP's hyperlink saying they thought it did. pyarrow's ParquetWriter eventually uses the FileWriter class, and here's the current code for FileWriter::WriteTable. If table.num_rows() is zero, the loop condition is false on the first check, so NewRowGroup is never called and nothing is written.
Status FileWriter::WriteTable(const Table& table, int64_t chunk_size) {
  if (chunk_size <= 0) {
    return Status::Invalid("chunk size per row_group must be greater than 0");
  } else if (chunk_size > impl_->properties().max_row_group_length()) {
    chunk_size = impl_->properties().max_row_group_length();
  }
  for (int chunk = 0; chunk * chunk_size < table.num_rows(); chunk++) {
    int64_t offset = chunk * chunk_size;
    int64_t size = std::min(chunk_size, table.num_rows() - offset);
    RETURN_NOT_OK_ELSE(NewRowGroup(size), PARQUET_IGNORE_NOT_OK(Close()));
    for (int i = 0; i < table.num_columns(); i++) {
      auto chunked_data = table.column(i)->data();
      RETURN_NOT_OK_ELSE(WriteColumnChunk(chunked_data, offset, size),
                         PARQUET_IGNORE_NOT_OK(Close()));
    }
  }
  return Status::OK();
}
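The chunking arithmetic in the loop above can be mirrored in plain Python to see which row groups it would create. This is only a sketch of the loop's control flow, not of the Parquet writer itself. The name `plan_row_groups` and its parameters are illustrative and do not exist in Arrow or Parquet.

```python
from typing import List, Tuple


def plan_row_groups(
    num_rows: int, chunk_size: int, max_row_group_length: int
) -> List[Tuple[int, int]]:
    """Return the (offset, size) pairs the quoted loop would visit."""
    if chunk_size <= 0:
        raise ValueError("chunk size per row_group must be greater than 0")
    # Clamp the chunk size to the configured maximum, as the C++ code does.
    chunk_size = min(chunk_size, max_row_group_length)

    groups: List[Tuple[int, int]] = []
    chunk = 0
    # Same condition as the C++ for loop: false immediately when num_rows == 0.
    while chunk * chunk_size < num_rows:
        offset = chunk * chunk_size
        size = min(chunk_size, num_rows - offset)
        groups.append((offset, size))
        chunk += 1
    return groups


print(plan_row_groups(10, 4, 1_000_000))  # [(0, 4), (4, 4), (8, 2)]
print(plan_row_groups(0, 4, 1_000_000))   # []
```

With zero rows the planned list is empty, which matches the observation that no row group is ever started for an empty table.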
While our use case is not common, I was able to find one related request from roughly a year ago. Could this be added as a feature?
https://issues.apache.org/jira/browse/PARQUET-1047
Reporter: Alex Mendelson
Assignee: Wes McKinney / @wesm
PRs and other links:
Note: This issue was originally created as ARROW-3020. Please see the migration documentation for further details.