[C++] Parquet support read page with crc32 checking #33115
Comments
Antoine Pitrou / @pitrou: |
Xuwei Fu / @mapleFU: Maybe I can use <boost/crc32.h> here? |
Antoine Pitrou / @pitrou: |
Antoine Pitrou / @pitrou: |
Antoine Pitrou / @pitrou: |
Xuwei Fu / @mapleFU: MySQL uses zlib's crc32 implementation ( https://github.com/mysql/mysql-server/blob/8.0/storage/innobase/ut/crc32.cc ); introducing it might need some code updates to Arrow's cpuinfo. Gandiva already includes Arrow includes I'm not familiar with other crc32 implementations; maybe there are some faster ones. |
Kouhei Sutou / @kou: How about using zlib's |
Kouhei Sutou / @kou: We don't want to bundle zlib's |
Isn't this overkill? We can vendor a crc32 implementation and always enable this feature. No need for zlib or an optional flag. |
Antoine Pitrou / @pitrou: |
Antoine Pitrou / @pitrou: |
Gang Wu / @wgtmac: Or we can do something similar to the logger implementation, which uses CerrLog by default and uses a CMake option to plug in glog. We can port a simple default crc32 implementation, and use the one from zlib if ARROW_WITH_ZLIB is ON and its performance is better. |
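The "simple default, optional zlib" idea above could look roughly like the sketch below. It is not Arrow's code: the `sketch` namespace, the function names, and the `SKETCH_USE_ZLIB_CRC32` macro are hypothetical, and how a CMake option such as ARROW_WITH_ZLIB would map to a preprocessor define is an assumption. The only external call is zlib's `crc32(crc, buf, len)`.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

#ifdef SKETCH_USE_ZLIB_CRC32  // hypothetical macro, not an Arrow define
#include <zlib.h>
#endif

namespace sketch {

// Lazily built 256-entry lookup table for CRC-32 (reflected polynomial
// 0xEDB88320, the variant used by zlib and gzip).
inline const std::array<uint32_t, 256>& Crc32Table() {
  static const std::array<uint32_t, 256> table = [] {
    std::array<uint32_t, 256> t{};
    for (uint32_t i = 0; i < 256; ++i) {
      uint32_t c = i;
      for (int bit = 0; bit < 8; ++bit) {
        c = (c >> 1) ^ ((c & 1u) ? 0xEDB88320u : 0u);
      }
      t[i] = c;
    }
    return t;
  }();
  return table;
}

// Portable default: byte-at-a-time table lookup, no external dependency.
inline uint32_t PortableCrc32(const uint8_t* data, size_t len) {
  const auto& table = Crc32Table();
  uint32_t crc = 0xFFFFFFFFu;
  for (size_t i = 0; i < len; ++i) {
    crc = table[(crc ^ data[i]) & 0xFFu] ^ (crc >> 8);
  }
  return crc ^ 0xFFFFFFFFu;
}

// Single entry point: callers never see which backend was compiled in.
inline uint32_t Crc32(const uint8_t* data, size_t len) {
#ifdef SKETCH_USE_ZLIB_CRC32
  // zlib takes a 32-bit length; this sketch assumes len fits in uInt.
  return static_cast<uint32_t>(::crc32(0L, data, static_cast<uInt>(len)));
#else
  return PortableCrc32(data, len);
#endif
}

}  // namespace sketch
```

Both backends compute the same checksum, so the choice only affects speed and dependencies, which is the trade-off being debated in this thread.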
Antoine Pitrou / @pitrou: |
Kouhei Sutou / @kou: If we use Cyrus's implementation, we should use https://github.com/cyrusimap/cyrus-imapd/blob/master/lib/crc32.c, not a forked one. If we can't find a fast, simple CRC32 implementation, I prefer to use (not vendor) zlib's implementation for maintainability. |
Xuwei Fu / @mapleFU: |
@mapleFU could you "take" this one too? |
take |
… DATA_PAGE (v1) (#14351)
This patch adds CRC checksums when writing and reading DATA_PAGE. CRC for dictionary pages and DATA_PAGE_V2 will be added in upcoming patches.
* [x] Implement crc in writing DATA_PAGE
* [x] Implement crc in reading DATA_PAGE
* [x] Adding config for write crc page and checking
* [x] Testing DATA_PAGE with crc; the tests may be borrowed from `parquet-mr`
* [x] Using crc library in https://issues.apache.org/jira/browse/ARROW-17904
There are some open questions. I found that in thirdparty, Arrow imports `crc32c`, which is extracted from leveldb's crc library. But it seems that our standard uses crc32, which has a different magic number. So I vendored the implementation mentioned in https://issues.apache.org/jira/browse/ARROW-17904 .
The default config of `enable crc` in parquet-mr for the writer is true, but here I use `false`, because setting it to true may slow down the writer.
* Closes: #33115
Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: Will Jones <willjones127@gmail.com>
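To illustrate why the existing `crc32c` dependency could not simply be reused, here is a minimal sketch, assuming the "magic number" mentioned in the commit message refers to the generator polynomial. CRC-32 (the zlib/gzip variant) and CRC-32C (Castagnoli) share the same structure but use different polynomials, so they yield different checksums for the same bytes. The function names, including `PageChecksumMatches`, are hypothetical and are not Arrow or Parquet APIs; the bitwise loop is chosen for clarity, not speed.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Reflected generator polynomials. Both variants start from 0xFFFFFFFF and
// apply a final XOR with 0xFFFFFFFF; only the polynomial differs.
constexpr uint32_t kCrc32Poly = 0xEDB88320u;   // CRC-32, as used by zlib/gzip
constexpr uint32_t kCrc32cPoly = 0x82F63B78u;  // CRC-32C (Castagnoli)

// Table-free, bit-at-a-time CRC over a byte buffer.
inline uint32_t ReflectedCrc(uint32_t poly, const uint8_t* data, size_t len) {
  uint32_t crc = 0xFFFFFFFFu;
  for (size_t i = 0; i < len; ++i) {
    crc ^= data[i];
    for (int bit = 0; bit < 8; ++bit) {
      crc = (crc >> 1) ^ ((crc & 1u) ? poly : 0u);
    }
  }
  return crc ^ 0xFFFFFFFFu;
}

inline uint32_t Crc32(std::string_view bytes) {
  return ReflectedCrc(kCrc32Poly,
                      reinterpret_cast<const uint8_t*>(bytes.data()),
                      bytes.size());
}

inline uint32_t Crc32c(std::string_view bytes) {
  return ReflectedCrc(kCrc32cPoly,
                      reinterpret_cast<const uint8_t*>(bytes.data()),
                      bytes.size());
}

// Hypothetical reader-side check: recompute the checksum over the page bytes
// and compare it with the value stored alongside the page.
inline bool PageChecksumMatches(std::string_view page_bytes,
                                uint32_t stored_crc) {
  return Crc32(page_bytes) == stored_crc;
}
```

For the standard check string `"123456789"`, `Crc32` gives `0xCBF43926` and `Crc32c` gives `0xE3069283`, so a reader that verified with the wrong variant would reject every valid page.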
Currently, the C++ Parquet implementation supports writing pages with a checksum, but ReadPage does not verify any checksum. I would like to fix this. I'd like to split this patch into different parts:
Reporter: Xuwei Fu / @mapleFU
Assignee: Xuwei Fu / @mapleFU
PRs and other links:
Note: This issue was originally created as ARROW-17904. Please see the migration documentation for further details.