[Go][Parquet] Potential inconsistency between TotalBytesWritten tracked by RowGroupWriter and actual bytes written to io.Writer #39789

Closed
joellubi opened this issue Jan 25, 2024 · 1 comment · Fixed by #43326

@joellubi

Describe the bug, including details regarding any error messages, version, and platform.

When a ParquetWriter is configured with the following properties, the sum of RowGroupTotalBytesWritten() taken after each Write() call does not match the number of bytes actually received by the target io.Writer.

parquetProps := parquet.NewWriterProperties(
	parquet.WithAllocator(memory.DefaultAllocator),
	parquet.WithCompression(compress.Codecs.Snappy),
	parquet.WithCompressionLevel(flate.DefaultCompression),
	parquet.WithDictionaryDefault(false),
	parquet.WithStats(false),
	parquet.WithMaxRowGroupLength(math.MaxInt64),
)
arrowProps := pqarrow.NewArrowWriterProperties(pqarrow.WithAllocator(memory.DefaultAllocator))

In this specific case, a 13 MB file reported only about 10 MB written via RowGroupTotalBytesWritten() calls. Part of the discrepancy can be attributed to file-level metadata, which is not counted in the row groups, but that is unlikely to explain the entire difference. We should investigate the root cause and either fix it or document the explanation for future users of this API.

Related to arrow-adbc@1456

Component(s)

Go, Parquet

@zeroshade

Issue resolved by pull request #43326.

@zeroshade zeroshade added this to the 18.0.0 milestone Jul 19, 2024