ARROW-5510: [C++][Python][R][GLib] Implement Feather "V2" using Arrow IPC file format

This is based on top of ARROW-7979, so I will need to rebase once that is merged.

Excluding the changes from ARROW-7979, this patch is a substantial code reduction in Feather-related code. I removed a lot of cruft from the V1 implementation and made things a lot simpler without altering the user-facing functionality.

To summarize:

* V2 is exactly the Arrow IPC file format, with the option for the experimental "trivial" body buffer compression implemented in ARROW-7979. `read_feather` functions distinguish the files based on the magic bytes at the beginning of the file ("FEA1" versus "ARROW1")
* A `ipc::feather::WriteProperties` struct has been introduced to allow setting the file version, as well as chunksize (since large tables are broken up into smaller chunks when writing), compression type, and compression level (compressor-specific)
* LZ4 and ZSTD are the only codecs intended to be supported (also in line with mailing list discussion about IPC compression). The default is LZ4 unless -DARROW_WITH_LZ4=OFF in which case it's uncompressed
* Unit tests in Python now test both versions
* R tests are only running the V2 version. I'll need some help adding options to set the version as well as the compression type and compression level

Since 0.17.0 is likely to be released without formalizing IPC compression, I will plan to support an "ARROW:experimental_compression" metadata member in 0.17.0 Feather files.

Other notes:

* Column decompression is currently serial. I'll work on making this parallel ASAP as it will impact benchmarks significantly.
* Compression (both chunk-level and column-level) is serial. Write performance would be much improved, especially at higher compression levels, by compressing in parallel at least at the column level
* Write performance could be improved by compressing chunks and writing them to disk concurrently. It's done serially at the moment, so will open a follow up JIRA about this

Closes #6694 from wesm/feather-v2

Lead-authored-by: Wes McKinney <wesm+git@apache.org>
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Wes McKinney <wesm+git@apache.org>
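
The version detection mentioned in the summary (telling V1 from V2 files by their leading magic bytes) can be sketched roughly as follows. This is a minimal illustration in Python, not the code from this patch: the function name `sniff_feather_version` and the constant names are hypothetical, and only the two magic strings ("FEA1" and "ARROW1") come from the commit message.

```python
import io

# Magic strings quoted in the commit message; the constant names are illustrative.
FEATHER_V1_MAGIC = b"FEA1"
ARROW_IPC_MAGIC = b"ARROW1"


def sniff_feather_version(stream):
    """Return 1 or 2 based on the leading magic bytes of a binary stream.

    Hypothetical helper: reads from the current position of `stream`
    and does not rewind it afterwards.
    """
    # Read enough bytes to cover the longer of the two magic strings.
    header = stream.read(len(ARROW_IPC_MAGIC))
    if header.startswith(ARROW_IPC_MAGIC):
        return 2  # Arrow IPC file format, i.e. Feather V2
    if header.startswith(FEATHER_V1_MAGIC):
        return 1  # original Feather format
    raise ValueError("unrecognized magic bytes: %r" % header)


if __name__ == "__main__":
    # In-memory stand-ins for real files; only the leading bytes matter here.
    print(sniff_feather_version(io.BytesIO(b"ARROW1\x00\x00payload")))
    print(sniff_feather_version(io.BytesIO(b"FEA1payload")))
```

With a real file, the same helper could be called as `sniff_feather_version(open(path, "rb"))`; a reader would then seek back to the start before dispatching to the matching parser.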