ARROW-7390: [C++][Dataset] Fix RecordBatchProjector race #6661

fsaintjacques
Contributor

The RecordBatchProjector is shared across ScanTasks of the same Fragment, and the resize operation it performs for missing columns is not thread safe. This change ensures that each ScanTask gets its own projector. The copy should not be costly since it only copies empty vectors and one shared pointer.
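
To illustrate the idea, here is a minimal sketch of the copy-per-task pattern. `ToyProjector` and `ProjectInParallel` are hypothetical stand-ins invented for this example and are not the Arrow API; the sketch only assumes a projector that holds one shared pointer to immutable state plus scratch vectors that are resized on every call.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for RecordBatchProjector (not the Arrow class):
// a shared, immutable target schema plus scratch state resized by Project().
struct ToyProjector {
  std::shared_ptr<const std::vector<std::string>> to_schema;
  std::vector<int> missing_columns;  // mutable scratch state, empty until first use

  std::size_t Project(std::size_t batch_length) {
    // This mutates the object, so concurrent calls on ONE instance would race.
    missing_columns.assign(to_schema->size(), static_cast<int>(batch_length));
    return missing_columns.size();
  }
};

// Each task copies the shared prototype before projecting, so no mutable
// state is shared between threads. The prototype itself is only read.
std::vector<std::size_t> ProjectInParallel(const ToyProjector& prototype,
                                           std::size_t num_tasks) {
  std::vector<std::size_t> results(num_tasks, 0);
  std::vector<std::thread> workers;
  for (std::size_t i = 0; i < num_tasks; ++i) {
    workers.emplace_back([&prototype, &results, i] {
      // Cheap copy: one shared_ptr and an empty vector.
      ToyProjector local_projector(prototype);
      results[i] = local_projector.Project(i + 1);  // each thread writes its own slot
    });
  }
  for (auto& worker : workers) worker.join();
  return results;
}
```

Because every thread works on its own copy, the prototype's scratch vector stays empty and is never written concurrently; only the shared pointer's reference count is touched by several threads, which is safe.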

@fsaintjacques
Contributor Author

I could not come up with a deterministic unit test for this; the race also only reproduces in release mode. I tested the fix by scanning the NYC dataset locally.

Member

@pitrou pitrou left a comment

Are there multithreaded tests in the dataset tests? Did you try running them with TSAN enabled?

@@ -44,7 +44,10 @@ static inline RecordBatchIterator ProjectRecordBatch(RecordBatchIterator it,
RecordBatchProjector* projector,
MemoryPool* pool) {
return MakeMaybeMapIterator(
-      [=](std::shared_ptr<RecordBatch> in) { return projector->Project(*in, pool); },
+      [=](std::shared_ptr<RecordBatch> in) {
+        RecordBatchProjector local_projector{*projector};
Member

Hmm... perhaps add a comment explaining why this is needed?

@fsaintjacques fsaintjacques force-pushed the ARROW-7390-fix-project-concurrency branch from 5cc5cb3 to 523ddfc Compare March 19, 2020 17:37
Member

@wesm wesm left a comment

+1

@@ -63,7 +63,7 @@ Status RecordBatchProjector::SetDefaultValue(FieldRef ref,

Result<std::shared_ptr<RecordBatch>> RecordBatchProjector::Project(
const RecordBatch& batch, MemoryPool* pool) {
-  if (from_ == nullptr || !batch.schema()->Equals(*from_)) {
+  if (from_ == nullptr || !batch.schema()->Equals(*from_, false /*check_metadata*/)) {
Member

Nit: we should try to use /*param=*/ consistently (this is used elsewhere)

Contributor Author

Ok!

@bkietz
Member

bkietz commented Mar 20, 2020

+1

At some point we should examine Fragment::splitable() to determine whether ScanTasks from a single Fragment should be kept in a single thread. For such fragments, local_projector will be unnecessary.
