Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-11607: [C++][Parquet] Update values_capacity_ when resetting. #9498

Closed
wants to merge 4 commits into from

Conversation

emkornfield
Copy link
Contributor

I'm not sure why values_capacity_ is different kept separately
from the buffer, but there is check which does not reserve
capacity again values_capacity_ is already the needed size.
When ReleaseValues is called, we allocate a brand new empty buffer.

I'm not really sure why this hasn't caused users more issues (maybe
increasingly large row groups or some other phenonemon). This bug
also highlight that our C++ tests have very limited coverage on batched
reads. To fix this I added an batch read for every round trip test
to confirm it yields the same values.

I'm not sure why values_capacity_ is different kept separately
from the buffer, but there is check which does not reserve
capacity again values_capacity_ is already the needed size.
When ReleaseValues is called, we allocate a brand new empty buffer.

I'm not really sure why this hasn't caused users more issues (maybe
increasingly large row groups or some other phenonemon). This bug
also highlight that our C++ tests have very limited coverage on batched
reads.  To fix this I added an batch read for every round trip test
to confirm it yields the same values.
@github-actions
Copy link

ASSERT_EQ(table->num_rows(), kValueSize + 1);
std::shared_ptr<Table> result;

ArrowReaderProperties read_options(/*use_threads=*/true);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry I left in some junk here, I'll clean itup tomorrow.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you.

@@ -2475,6 +2520,27 @@ TEST(TestArrowReadWrite, TableWithChunkedColumns) {
}
}

TEST(TestArrowReadWrite, ManySmallLists) {
// ARROW-11607: The actual scenaio this forces is no data reads for
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"scenario"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes.

@emkornfield
Copy link
Contributor Author

@pitrou the python failures look like something CI infrastructure related or am I just not seeing the exception?

@pitrou
Copy link
Member

pitrou commented Feb 17, 2021

I've restarted the CI failures, I suppose it's just a CI glitch.

@pitrou pitrou closed this in 848c803 Feb 17, 2021
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
I'm not sure why values_capacity_ is different kept separately
from the buffer, but there is check which does not reserve
capacity again values_capacity_ is already the needed size.
When ReleaseValues is called, we allocate a brand new empty buffer.

I'm not really sure why this hasn't caused users more issues (maybe
increasingly large row groups or some other phenonemon). This bug
also highlight that our C++ tests have very limited coverage on batched
reads.  To fix this I added an batch read for every round trip test
to confirm it yields the same values.

Closes apache#9498 from emkornfield/ARROW-11607

Lead-authored-by: Micah Kornfield <emkornfield@gmail.com>
Co-authored-by: emkornfield <micahk@google.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
I'm not sure why values_capacity_ is different kept separately
from the buffer, but there is check which does not reserve
capacity again values_capacity_ is already the needed size.
When ReleaseValues is called, we allocate a brand new empty buffer.

I'm not really sure why this hasn't caused users more issues (maybe
increasingly large row groups or some other phenonemon). This bug
also highlight that our C++ tests have very limited coverage on batched
reads.  To fix this I added an batch read for every round trip test
to confirm it yields the same values.

Closes apache#9498 from emkornfield/ARROW-11607

Lead-authored-by: Micah Kornfield <emkornfield@gmail.com>
Co-authored-by: emkornfield <micahk@google.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants