New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-11607: [C++][Parquet] Update values_capacity_ when resetting. #9498
Conversation
I'm not sure why values_capacity_ is different kept separately from the buffer, but there is check which does not reserve capacity again values_capacity_ is already the needed size. When ReleaseValues is called, we allocate a brand new empty buffer. I'm not really sure why this hasn't caused users more issues (maybe increasingly large row groups or some other phenonemon). This bug also highlight that our C++ tests have very limited coverage on batched reads. To fix this I added an batch read for every round trip test to confirm it yields the same values.
ASSERT_EQ(table->num_rows(), kValueSize + 1); | ||
std::shared_ptr<Table> result; | ||
|
||
ArrowReaderProperties read_options(/*use_threads=*/true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry I left in some junk here, I'll clean itup tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thank you.
@@ -2475,6 +2520,27 @@ TEST(TestArrowReadWrite, TableWithChunkedColumns) { | |||
} | |||
} | |||
|
|||
TEST(TestArrowReadWrite, ManySmallLists) { | |||
// ARROW-11607: The actual scenaio this forces is no data reads for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"scenario"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes.
@pitrou the python failures look like something CI infrastructure related or am I just not seeing the exception? |
I've restarted the CI failures, I suppose it's just a CI glitch. |
I'm not sure why values_capacity_ is different kept separately from the buffer, but there is check which does not reserve capacity again values_capacity_ is already the needed size. When ReleaseValues is called, we allocate a brand new empty buffer. I'm not really sure why this hasn't caused users more issues (maybe increasingly large row groups or some other phenonemon). This bug also highlight that our C++ tests have very limited coverage on batched reads. To fix this I added an batch read for every round trip test to confirm it yields the same values. Closes apache#9498 from emkornfield/ARROW-11607 Lead-authored-by: Micah Kornfield <emkornfield@gmail.com> Co-authored-by: emkornfield <micahk@google.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
I'm not sure why values_capacity_ is different kept separately from the buffer, but there is check which does not reserve capacity again values_capacity_ is already the needed size. When ReleaseValues is called, we allocate a brand new empty buffer. I'm not really sure why this hasn't caused users more issues (maybe increasingly large row groups or some other phenonemon). This bug also highlight that our C++ tests have very limited coverage on batched reads. To fix this I added an batch read for every round trip test to confirm it yields the same values. Closes apache#9498 from emkornfield/ARROW-11607 Lead-authored-by: Micah Kornfield <emkornfield@gmail.com> Co-authored-by: emkornfield <micahk@google.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
I'm not sure why values_capacity_ is different kept separately
from the buffer, but there is check which does not reserve
capacity again values_capacity_ is already the needed size.
When ReleaseValues is called, we allocate a brand new empty buffer.
I'm not really sure why this hasn't caused users more issues (maybe
increasingly large row groups or some other phenonemon). This bug
also highlight that our C++ tests have very limited coverage on batched
reads. To fix this I added an batch read for every round trip test
to confirm it yields the same values.