Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Go][Parquet] Inaccurate RowGroupTotalCompressedBytes/RowGroupTotalBytesWritten with go parquet file writer #39870

Closed
matthewmcnew opened this issue Jan 31, 2024 · 6 comments · Fixed by #40105
Assignees
Milestone

Comments

@matthewmcnew
Copy link
Contributor

Describe the bug, including details regarding any error messages, version, and platform.

There does not appear to be an accurate way to identify or estimate the size of the current row group with pqarrow.FileWriter.

RowGroupTotalCompressedBytes()provides the total bytes from created data pages but, when the dictionary page size limit is reached the buffered data pages are flushed and the total size is reset to "0". This means the RowGroupTotalCompressedBytes will only provide the size of pages created after the dictionary page size was reached. Ideally the size the TotalCompressedBytes size should include all created data pages.

RowGroupTotalBytesWritten() will provide the total bytes of DataPages when they are written but, not if the the page is buffered due to the dictionary page still being created. This causes the RowGroupTotalBytesWritten to inaccurately provide a "0" bytes estimate until the dictionary page size limit is reached.

Perhaps related to: #39789.

Component(s)

Go

@zeroshade
Copy link
Member

Thanks for tracking down the cause! It's likely related if not directly the cause of the issue you linked.

Is this something you think you'd be willing to contribute a PR for? I don't have the bandwidth immediately, so it'll take me a week or so to dig into this. But I can definitely review any PRs

@kou kou changed the title Inaccurate RowGroupTotalCompressedBytes/RowGroupTotalBytesWritten with go parquet file writer [Go] Inaccurate RowGroupTotalCompressedBytes/RowGroupTotalBytesWritten with go parquet file writer Feb 1, 2024
@mapleFU
Copy link
Member

mapleFU commented Feb 2, 2024

About totalBytesWritten, I've summarize similiar interface in C++:
#33897 (comment)

Don't know if there're better solutions. Let me fix RowGroupTotalCompressedBytes in another separate issue

@mapleFU mapleFU changed the title [Go] Inaccurate RowGroupTotalCompressedBytes/RowGroupTotalBytesWritten with go parquet file writer [Go][Parquet] Inaccurate RowGroupTotalCompressedBytes/RowGroupTotalBytesWritten with go parquet file writer Feb 2, 2024
@zeroshade
Copy link
Member

@matthewmcnew Did #39922 fix this issue? or is it still exhibiting this problem?

@matthewmcnew
Copy link
Contributor Author

#39922 should fix the inaccuracy with RowGroupTotalCompressedBytes being reset.

However, it would still be useful to estimate buffered data pages within RowGroupTotalBytesWritten. Would an approach like this make sense? To include the buffered data pages within the size of RowGroupTotalBytesWritten?

@zeroshade
Copy link
Member

@matthewmcnew That makes sense to me

@matthewmcnew
Copy link
Contributor Author

@zeroshade #40105 should be ready for review now.

zeroshade pushed a commit that referenced this issue Feb 21, 2024
### Rationale for this change

Currently, buffered data pages are not included in TotalBytesWritten this means that their is not an accurate estimate of the size of the current size. 

### Are there any user-facing changes?
`RowGroupTotalBytesWritten` will include the TotalBytes in buffered DataPages minus the buffered data pages headers. 

* Closes: #39870

Authored-by: Matthew McNew <me@mattmcnew.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
@zeroshade zeroshade added this to the 16.0.0 milestone Feb 21, 2024
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this issue Feb 28, 2024
…che#40105)

### Rationale for this change

Currently, buffered data pages are not included in TotalBytesWritten this means that their is not an accurate estimate of the size of the current size. 

### Are there any user-facing changes?
`RowGroupTotalBytesWritten` will include the TotalBytes in buffered DataPages minus the buffered data pages headers. 

* Closes: apache#39870

Authored-by: Matthew McNew <me@mattmcnew.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
thisisnic pushed a commit to thisisnic/arrow that referenced this issue Mar 8, 2024
…che#40105)

### Rationale for this change

Currently, buffered data pages are not included in TotalBytesWritten this means that their is not an accurate estimate of the size of the current size. 

### Are there any user-facing changes?
`RowGroupTotalBytesWritten` will include the TotalBytes in buffered DataPages minus the buffered data pages headers. 

* Closes: apache#39870

Authored-by: Matthew McNew <me@mattmcnew.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants