ARROW-8729: [C++][Dataset] Ensure non-empty batches when only virtual columns are projected #7534
Conversation
Looks good to me. Is the interaction with batch size tested?
The python dataset tests are crashing on Mac: https://github.com/apache/arrow/runs/806974457
@jorisvandenbossche that build comes from the first of two
Hmm, it failed on the last commit as well, but I restarted that one, and now it does indeed appear to be green.
That crash is ARROW-8999. I'm fairly confident it's a real bug, given that it happens about 1-5% of the time.
There's a segfault in the Python tests on OSX and Windows.
I think #7704 needs to be merged first
There's a rebase issue (it looks like it's pulling in a Rust change), but otherwise this looks fine, with the caveat that we need a plan to constrain memory use when using the batch readers.
… columns are projected
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
cpp/src/parquet/arrow/reader.cc
  if (chunk == nullptr) {
    return ::arrow::IterationTraits<RecordBatchIterator>::End();
  }
} while (chunk->length() == 0);
IMHO this check should be internal to the ColumnReaders. I tried embedding it in TransferColumnData, but discovered that other locations in the codebase rely on empty chunks. Is this expected, or should it be fixed?
+1
This bug is inherited from parquet::arrow::RowGroupRecordBatchReader, which yielded empty record batches when no columns were projected, because no field readers were available from which to derive a batch size. I've added logic to get usable batch sizes from file metadata in the empty-columns case.