ARROW-8729: [C++][Dataset] Ensure non-empty batches when only virtual columns are projected #7534

Closed
wants to merge 9 commits into from


bkietz
Member

@bkietz bkietz commented Jun 24, 2020

This bug is inherited from parquet::arrow::RowGroupRecordBatchReader, which yielded empty record batches when no columns were projected: with no field readers available, there was nothing from which to derive a batch size. I've added logic to get usable batch sizes from file metadata when no columns are projected.
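For illustration only, here is a minimal standalone sketch of the idea, not the code in this PR. `BatchLengthsFromMetadata` is a hypothetical helper that uses no Arrow or Parquet APIs; it assumes the per-row-group row counts have already been read from file metadata and that batches do not span row groups.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of Arrow or Parquet: given the row count of
// each row group (as recorded in file metadata) and a maximum batch size,
// return the length of every batch a reader could yield when no physical
// columns are projected.
// Assumption: batches never span a row group boundary.
std::vector<int64_t> BatchLengthsFromMetadata(
    const std::vector<int64_t>& row_group_num_rows, int64_t batch_size) {
  assert(batch_size > 0);
  std::vector<int64_t> lengths;
  for (int64_t remaining : row_group_num_rows) {
    // Split this row group into batches of at most batch_size rows.
    // An empty row group contributes no batches at all.
    while (remaining > 0) {
      const int64_t n = std::min(remaining, batch_size);
      lengths.push_back(n);
      remaining -= n;
    }
  }
  return lengths;
}
```

With lengths known up front, a reader can emit column-less batches that still carry the right row count, which is what virtual (partition) columns need in order to be materialized.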

Member

@jorisvandenbossche jorisvandenbossche left a comment

Looks good to me. Is the interaction with batch size tested?

cpp/src/parquet/arrow/reader.cc (outdated, resolved)
python/pyarrow/tests/test_dataset.py (outdated, resolved)
@jorisvandenbossche
Member

The python dataset tests are crashing on Mac: https://github.com/apache/arrow/runs/806974457

@bkietz
Member Author

bkietz commented Jun 25, 2020

@jorisvandenbossche that build comes from the first of two suggestion commits, and it doesn't seem to have crashed with both commits in place. Maybe it was ephemeral?

@jorisvandenbossche
Member

Hmm, it failed on the last commit as well, but I restarted that one, and it now appears to be green indeed.

@wesm
Member

wesm commented Jun 25, 2020

That crash is ARROW-8999. I'm fairly confident it's a real bug, given that it happens about 1-5% of the time.

cpp/src/parquet/arrow/reader.cc (outdated, resolved)
@bkietz bkietz force-pushed the 8729-empty-virtual-columns branch from f45a051 to 78a48dd Compare June 25, 2020 16:11
@bkietz bkietz force-pushed the 8729-empty-virtual-columns branch from 0c705bf to 0171cea Compare June 27, 2020 13:22
@fsaintjacques fsaintjacques self-assigned this Jun 29, 2020
Contributor

@fsaintjacques fsaintjacques left a comment

There's a segfault in the Python tests on OSX and Windows.

@bkietz bkietz force-pushed the 8729-empty-virtual-columns branch from 0171cea to e3f17ce Compare July 12, 2020 19:28
@wesm
Member

wesm commented Jul 12, 2020

I think #7704 needs to be merged first.

Member

@wesm wesm left a comment

There's a rebase issue (it looks like a Rust change was pulled in), but otherwise this looks fine, with the caveat that we need a plan to constrain memory use when using the batch readers.

rust/datafusion/README.md (outdated, resolved)
cpp/src/parquet/arrow/reader.cc (outdated, resolved)
cpp/src/parquet/arrow/reader.h (outdated, resolved)
cpp/src/parquet/arrow/reader.h (outdated, resolved)
@bkietz bkietz force-pushed the 8729-empty-virtual-columns branch from e3f17ce to 75e3b0a Compare July 13, 2020 20:00
    if (chunk == nullptr) {
      return ::arrow::IterationTraits<RecordBatchIterator>::End();
    }
  } while (chunk->length() == 0);
Member Author

IMHO this check should be internal to the ColumnReaders. I tried embedding it in TransferColumnData but discovered that other locations in the codebase rely on empty chunks. Is this expected, or should it be fixed?
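For readers without the full diff, the skip-empty loop in the hunk above can be sketched in isolation. This is a simplified illustration, not the Arrow implementation: `Chunk` and `ChunkSource` are hypothetical stand-ins, and a null pointer plays the role of the iterator's end sentinel.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a decoded chunk; only its length matters here.
struct Chunk {
  int length;
};

// Hypothetical source that hands out chunks in order and nullptr once exhausted.
class ChunkSource {
 public:
  explicit ChunkSource(std::vector<int> lengths) : lengths_(std::move(lengths)) {}

  // Returns the next chunk as-is, including empty ones.
  std::shared_ptr<Chunk> NextRaw() {
    if (pos_ == lengths_.size()) return nullptr;
    return std::make_shared<Chunk>(Chunk{lengths_[pos_++]});
  }

  // Returns the next non-empty chunk, or nullptr at end of stream.
  std::shared_ptr<Chunk> NextNonEmpty() {
    std::shared_ptr<Chunk> chunk;
    do {
      chunk = NextRaw();
      if (chunk == nullptr) return nullptr;  // end of stream
    } while (chunk->length == 0);            // keep pulling past empty chunks
    return chunk;
  }

 private:
  std::vector<int> lengths_;
  std::size_t pos_ = 0;
};
```

Keeping the filter in a wrapper like `NextNonEmpty` leaves `NextRaw` free to return empty chunks for callers that rely on them, which is the trade-off raised in this comment.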

@bkietz bkietz self-assigned this Jul 14, 2020
Member

@wesm wesm left a comment

+1


4 participants