[Go][Parquet] Enable writing of Parquet footer without closing file #40630

Closed
petenewcomb opened this issue Mar 18, 2024 · 1 comment

@petenewcomb
Contributor

Describe the enhancement requested

The Parquet file format allows a file to continue to accumulate row groups after a footer has been written, as long as a new and cumulative footer is written afterward. This is useful if one is writing a stream of data directly to Parquet and needs to make sure that the data is fully durable and readable within some time bound. For this purpose I propose a new method FlushWithFooter on file.Writer that, like its sibling Close, would close any open row group and prepare and write out the file footer. Unlike Close, it would leave the writer's metadata structures intact, allowing subsequent row groups to be written without starting over. This ensures that the metadata written into subsequent footers via FlushWithFooter or Close is inclusive of all row groups written since the beginning of the file.
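
To make the cumulative-footer idea concrete, here is a toy model in Go. It is only a sketch and does not use the Arrow API: `toyWriter`, `WriteRowGroup`, and `readFooter` are hypothetical names, and the "footer" holds nothing but row group start offsets. The one thing it borrows from the real format is the file tail layout (footer bytes, a 4-byte little-endian footer length, then the `PAR1` magic), which lets a reader find the most recent footer from the end of the file.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// toyWriter is a hypothetical stand-in for a Parquet writer, not the Arrow
// file.Writer. Its "footer" is just the list of row group start offsets.
type toyWriter struct {
	sink    bytes.Buffer
	offsets []uint64 // cumulative metadata: one entry per row group so far
}

// WriteRowGroup appends one row group and records where it starts.
func (w *toyWriter) WriteRowGroup(data []byte) {
	w.offsets = append(w.offsets, uint64(w.sink.Len()))
	w.sink.Write(data)
}

// FlushWithFooter writes a footer covering every row group written so far,
// followed by the footer length and the magic bytes. It leaves w.offsets
// intact, so any later footer is still cumulative.
func (w *toyWriter) FlushWithFooter() {
	var buf [8]byte
	for _, off := range w.offsets {
		binary.LittleEndian.PutUint64(buf[:], off)
		w.sink.Write(buf[:])
	}
	binary.LittleEndian.PutUint32(buf[:4], uint32(8*len(w.offsets)))
	w.sink.Write(buf[:4])
	w.sink.WriteString("PAR1")
}

// readFooter locates the last footer by walking back from the end of the
// file, the way a reader would.
func readFooter(file []byte) ([]uint64, bool) {
	n := len(file)
	if n < 8 || string(file[n-4:]) != "PAR1" {
		return nil, false
	}
	footerLen := int(binary.LittleEndian.Uint32(file[n-8 : n-4]))
	if footerLen%8 != 0 || footerLen > n-8 {
		return nil, false
	}
	footer := file[n-8-footerLen : n-8]
	offsets := make([]uint64, 0, footerLen/8)
	for i := 0; i < footerLen; i += 8 {
		offsets = append(offsets, binary.LittleEndian.Uint64(footer[i:i+8]))
	}
	return offsets, true
}

func main() {
	w := &toyWriter{}
	w.WriteRowGroup([]byte("aaaa"))
	w.FlushWithFooter() // the file is readable here, with one row group
	first, _ := readFooter(w.sink.Bytes())

	w.WriteRowGroup([]byte("bbbbbb"))
	w.FlushWithFooter() // the newer footer lists both row groups
	second, _ := readFooter(w.sink.Bytes())

	fmt.Println(len(first), len(second))
}
```

In this model the earlier footer stays in the file as dead bytes between row groups, which is the "extra footers" cost that a later compaction would collapse.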

The alternative, and what is supported today, is to close the open file once the time bound has been reached and start a new one. This works for durability, but is inefficient for readers since they must now open and process the footers of a potentially much larger number of files. The typical workflow is to have a second process "compact" these smaller files to produce larger files that not only consolidate footers but apply other optimizations (such as z-ordering) that holistically reorganize the consolidated data to match observed or expected query patterns. While effective for readers of older data, such compactions take time and significant resources to execute, putting a practical lower bound on the freshness of their outputs.

This feature, if adopted, would allow writers to produce data into a modest and predictable number of files within a strict time bound for durability such that readers enjoy that same time bound and modest number of files to efficiently query fresh data without intervening compaction. Compaction would still be recommended, both to apply holistic optimizations and to collapse the extra footers inserted into the original files, but it would be less urgent since compaction would no longer be a constraint on freshness or the manageability of file cardinality.

Component(s)

Go, Parquet

@kou kou changed the title Enable writing of Parquet footer without closing file [Go][Parquet] Enable writing of Parquet footer without closing file Mar 18, 2024
zeroshade pushed a commit that referenced this issue Mar 25, 2024
…ing file (#40654)

### Rationale for this change

See #40630

### What changes are included in this PR?

1. Added `FlushWithFooter` method to `*file.Writer`
2. To support `FlushWithFooter`, refactored `Close` in a way that changes the order of operations in two ways:
   a. closure of open row group writers is now done after using `defer` to ensure closure of the sink, instead of before
   b. wiping out of encryption keys is now done by the same deferred function, ensuring that it happens even upon error
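
The ordering described in item 2 can be illustrated with a small sketch. This is not the code from the PR: `sketchWriter`, its fields, and `flush` are hypothetical, and the `failFlush` flag merely simulates an I/O error. The point it shows is that registering cleanup with `defer` before the fallible work means the key is wiped and the sink is marked closed even when that work returns an error.

```go
package main

import (
	"errors"
	"fmt"
)

// sketchWriter is a hypothetical writer used only to illustrate ordering.
type sketchWriter struct {
	key        []byte // stands in for encryption key material
	sinkClosed bool   // stands in for closing the underlying sink
	failFlush  bool   // when true, flush simulates an I/O error
}

// flush stands in for closing the open row group and writing the footer.
func (w *sketchWriter) flush() error {
	if w.failFlush {
		return errors.New("simulated I/O error")
	}
	return nil
}

// Close registers cleanup first, then does the fallible work. The deferred
// function runs whether or not flush returns an error.
func (w *sketchWriter) Close() error {
	defer func() {
		for i := range w.key {
			w.key[i] = 0 // wipe key material
		}
		w.sinkClosed = true
	}()
	return w.flush()
}

func main() {
	w := &sketchWriter{key: []byte{1, 2, 3}, failFlush: true}
	err := w.Close()
	fmt.Println(err != nil, w.key, w.sinkClosed)
}
```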

### Are these changes tested?

`file_writer_test.go` has been extended to cover `FlushWithFooter` in a manner equivalent to the existing coverage.

### Are there any user-facing changes?

Only the addition of a new public method as described above.  No breaking changes to any existing public interfaces, unless the two minor order-of-operation changes described above are somehow a problem.

I'm not sure it's a critical fix, but one of the minor changes described above may reduce the likelihood that an attacker could inject an error (e.g., an I/O error) to prevent an encryption key from being wiped from memory.

* GitHub Issue: #40630

Authored-by: Peter Newcomb <peter.newcomb@walmart.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
@zeroshade zeroshade added this to the 16.0.0 milestone Mar 25, 2024
@zeroshade
Member

Issue resolved by pull request 40654
#40654
