[Go][Parquet] Inaccurate RowGroupTotalCompressedBytes/RowGroupTotalBytesWritten with go parquet file writer #39870
Comments
Thanks for tracking down the cause! It's likely related to, if not directly the cause of, the issue you linked. Is this something you'd be willing to contribute a PR for? I don't have the bandwidth immediately, so it'll take me a week or so to dig into this. But I can definitely review any PRs.
About totalBytesWritten, I've summarized a similar interface in C++: Don't know if there are better solutions. Let me fix
@matthewmcnew Did #39922 fix this issue, or is it still exhibiting this problem?
#39922 should fix the inaccuracy with However, it would still be useful to estimate buffered data pages within
@matthewmcnew That makes sense to me.
@zeroshade #40105 should be ready for review now.
### Rationale for this change
Currently, buffered data pages are not included in TotalBytesWritten. This means that there is no accurate estimate of the current size.
### Are there any user-facing changes?
`RowGroupTotalBytesWritten` will include the TotalBytes in buffered DataPages minus the buffered data pages headers.
* Closes: #39870

Authored-by: Matthew McNew <me@mattmcnew.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
Describe the bug, including details regarding any error messages, version, and platform.
There does not appear to be an accurate way to identify or estimate the size of the current row group with `pqarrow.FileWriter`.

`.RowGroupTotalCompressedBytes()` provides the total bytes from created data pages, but when the dictionary page size limit is reached, the buffered data pages are flushed and the total size is reset to "0". This means `RowGroupTotalCompressedBytes` will only provide the size of pages created after the dictionary page size limit was reached. Ideally, `TotalCompressedBytes` should include all created data pages.

`RowGroupTotalBytesWritten()` will provide the total bytes of DataPages once they are written, but not while a page is buffered because the dictionary page is still being created. This causes `RowGroupTotalBytesWritten` to inaccurately report a "0" bytes estimate until the dictionary page size limit is reached.

Perhaps related to: #39789.
Component(s)
Go