
[C++] Parquet support read page with crc32 checking #33115

Closed
asfimport opened this issue Sep 30, 2022 · 21 comments · Fixed by #14351

Comments

@asfimport (Collaborator)

Currently, the C++ Parquet implementation supports writing pages with a checksum, but ReadPage does not verify any checksum. I would like to fix that.

I'd like to split this patch into several parts:

  1. Implement CRC for DataPageV1: a write-side CRC config, CRC computation on read, CRC verification, and tests migrated from parquet-mr
  2. Implement CRC for DataPageV2
  3. Implement CRC for dictionary pages
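The read-side check in step 1 boils down to recomputing a CRC over the page bytes and comparing it with the value stored in the page header. A minimal sketch (the `VerifyPageCrc` helper and its arguments are hypothetical illustrations, not Arrow's actual API; per the Parquet spec the CRC covers the page's compressed bytes):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Standard CRC-32 (the polynomial the Parquet spec mandates), bitwise,
// reflected form: polynomial 0xEDB88320, init and final-xor 0xFFFFFFFF.
inline uint32_t Crc32(const uint8_t* data, size_t len) {
  uint32_t crc = 0xFFFFFFFFu;
  for (size_t i = 0; i < len; ++i) {
    crc ^= data[i];
    for (int bit = 0; bit < 8; ++bit) {
      crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
  }
  return crc ^ 0xFFFFFFFFu;
}

// Hypothetical read-side check: the page header carries an optional crc
// field; verification recomputes the CRC over the page bytes and compares.
inline bool VerifyPageCrc(const uint8_t* page, size_t len,
                          uint32_t header_crc) {
  return Crc32(page, len) == header_crc;
}
```

The bitwise loop is slow but unambiguous; a real implementation would use a table-driven or SIMD variant with the same polynomial.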

Reporter: Xuwei Fu / @mapleFU
Assignee: Xuwei Fu / @mapleFU

PRs and other links:

Note: This issue was originally created as ARROW-17904. Please see the migration documentation for further details.

@asfimport (Collaborator Author)

Antoine Pitrou / @pitrou:
That would certainly be a welcome addition!

@asfimport (Collaborator Author)

Xuwei Fu / @mapleFU:
@pitrou I was on holiday last week. I have now found that, in thirdparty, Arrow imports crc32c, which is extracted from leveldb's CRC library. But it seems our standard uses crc32, which has a different polynomial: https://github.com/apache/parquet-format#checksumming

Maybe I can use `<boost/crc.hpp>` here?
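For reference, the mismatch here is just a difference in polynomial: CRC-32 (the variant the Parquet spec mandates) and CRC-32C (the Castagnoli variant leveldb's library implements) run the same reflected bit loop with different constants, so they produce entirely different checksums. A small sketch making that concrete:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Both CRCs below are reflected; the only difference is the polynomial:
//   CRC-32  (zlib / Parquet spec):        0xEDB88320 (reflected 0x04C11DB7)
//   CRC-32C (Castagnoli / leveldb crc32c): 0x82F63B78 (reflected 0x1EDC6F41)
inline uint32_t CrcReflected(const uint8_t* data, size_t len, uint32_t poly) {
  uint32_t crc = 0xFFFFFFFFu;
  for (size_t i = 0; i < len; ++i) {
    crc ^= data[i];
    for (int b = 0; b < 8; ++b) {
      crc = (crc >> 1) ^ (poly & (0u - (crc & 1u)));
    }
  }
  return crc ^ 0xFFFFFFFFu;
}
```

The standard check input "123456789" yields 0xCBF43926 under CRC-32 but 0xE3069283 under CRC-32C, which is why the vendored crc32c library cannot be reused as-is.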

@asfimport (Collaborator Author)

Antoine Pitrou / @pitrou:
I think we'd rather avoid a mandatory Boost dependency. Surely there's a standalone CRC32 implementation somewhere under an Apache-compatible license that we can vendor?

@asfimport (Collaborator Author)

Antoine Pitrou / @pitrou:
Also cc @kou

@asfimport (Collaborator Author)

Xuwei Fu / @mapleFU:
Gandiva already imports boost/crc, and I have a simple draft using it: #14351. It's WIP because it still needs some tests, and it only implements the DATA_PAGE v1 case. Mind taking a look? @pitrou

@asfimport (Collaborator Author)

Antoine Pitrou / @pitrou:
Well, the fact that Gandiva depends on Boost doesn't mean we want to depend on it for other components (many people probably don't enable Gandiva when building Arrow).

@asfimport (Collaborator Author)

Xuwei Fu / @mapleFU:
kudu/rocksdb/leveldb use crc32c rather than standard crc32, maybe because crc32c is faster. ClickHouse even uses other checksum functions.

MySQL uses zlib's crc32 implementation ( https://github.com/mysql/mysql-server/blob/8.0/storage/innobase/ut/crc32.cc ); introducing it might need some updates to Arrow's cpuinfo code.

Gandiva already includes boost/crc, so using it is quite convenient.

Arrow vendors Google's crc32c, which is taken from leveldb. But it's crc32c, which does not match the Parquet standard.

I'm not familiar with other crc32 implementations; maybe there are faster ones.

@asfimport (Collaborator Author)

Kouhei Sutou / @kou:
It seems that there are only a few candidates...
https://github.com/search?q=crc32+simd
https://github.com/ttsuki/arkana may be usable...?

How about using zlib's crc32() as the first version?
We can replace it with an optimized version in follow-up tasks.
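zlib's crc32() is incremental: you seed the running checksum with 0 and feed the data in chunks (the real signature is `uLong crc32(uLong crc, const Bytef* buf, uInt len)`). The sketch below mirrors that calling convention with a self-contained bitwise loop instead of linking zlib itself, so it is only an illustration of the API shape, not zlib's optimized code:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Incremental CRC-32 with zlib's calling convention: start from 0, feed
// chunks, and each call returns the running checksum. The complement at
// entry/exit lets the finalized value double as resumable state, exactly
// as zlib's crc32() does.
inline uint32_t Crc32Update(uint32_t crc, const uint8_t* buf, size_t len) {
  crc = ~crc;
  for (size_t i = 0; i < len; ++i) {
    crc ^= buf[i];
    for (int b = 0; b < 8; ++b) {
      crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
  }
  return ~crc;
}
```

Feeding "12345" then "6789" produces the same result as one pass over "123456789", which is the property a page reader needs when data arrives in buffers.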

@asfimport (Collaborator Author)

Xuwei Fu / @mapleFU:
That's OK, I'll try to use zlib. Should I extract zlib's crc32 like MySQL does, or keep ARROW_WITH_ZLIB ON for Arrow and Parquet? @kou @pitrou

@asfimport (Collaborator Author)

Kouhei Sutou / @kou:
Could you add PARQUET_REQUIRE_CRC32_CHECK or something, like PARQUET_REQUIRE_ENCRYPTION? (See also cpp/cmake_modules/DefineOptions.cmake and cpp/cmake_modules/ThirdpartyToolchain.cmake.) If PARQUET_REQUIRE_CRC32_CHECK is enabled, we find zlib and enable this feature.

We don't want to bundle zlib's crc32() implementation. We would use it only as a fallback implementation.
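A rough sketch of what that could look like, following the suggestion above; the option name comes from the comment, but the exact `define_option` form is an assumption modeled on how Arrow's cpp/cmake_modules/DefineOptions.cmake declares similar toggles:

```cmake
# Hypothetical sketch: declare an opt-in Parquet CRC32 feature, mirroring
# how PARQUET_REQUIRE_ENCRYPTION is declared in DefineOptions.cmake.
define_option(PARQUET_REQUIRE_CRC32_CHECK
              "Build support for Parquet page CRC32 verification (requires zlib)"
              OFF)

# In ThirdpartyToolchain.cmake: pull in zlib only when the feature is requested.
if(PARQUET_REQUIRE_CRC32_CHECK)
  set(ARROW_WITH_ZLIB ON)
endif()
```

As the thread goes on to conclude, vendoring a CRC32 implementation removes the need for this flag entirely.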

@asfimport (Collaborator Author)

Xuwei Fu / @mapleFU:
That's OK. I'll use a separate patch to enable PARQUET_REQUIRE_CRC32_CHECK, and #14351 will be finished after that patch is merged.

@asfimport (Collaborator Author)

Antoine Pitrou / @pitrou:

> If PARQUET_REQUIRE_CRC32_CHECK is enabled, we find zlib and enable this feature.

Isn't this overkill? We can vendor a crc32 implementation and always enable this feature. No need for zlib or an optional flag.

@asfimport (Collaborator Author)

Antoine Pitrou / @pitrou:
Or we can have Parquet always require zlib.

@asfimport (Collaborator Author)

Antoine Pitrou / @pitrou:
Also, if you're interested in performance, some numbers were published here:
https://fastmail.blog/advanced/the-search-for-a-faster-crc32/

@asfimport (Collaborator Author)

Xuwei Fu / @mapleFU:
I think Parquet should always enable PARQUET_REQUIRE_CRC32_CHECK; vendoring a crc32 implementation is fine. crc32 is part of the standard, so, just like encryption, it should always be enabled in CMake. When actually using Parquet, we can disable it if we want reads and writes to be faster.

@asfimport (Collaborator Author)

Gang Wu / @wgtmac:
+1 for vendoring a crc32 implementation.

Or we could do something similar to the logger implementation, which uses CerrLog by default and a CMake option to plug in glog: port a simple default crc32 implementation, and use zlib's when ARROW_WITH_ZLIB is ON and its performance is better.

@asfimport (Collaborator Author)

Antoine Pitrou / @pitrou:
Let's keep it simple and just vendor a single "fast" implementation, such as the "slice-by-16" approach from Cyrus: https://github.com/robn/cyrus-imapd/blob/master/lib/crc32.c
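The "slice-by-16" trick extends the classic table-driven CRC: it precomputes 16 lookup tables so each loop iteration consumes 16 input bytes at once. A simplified single-table version, shown only to illustrate the idea (same polynomial and same result as the bitwise form, just one byte per step instead of eight bit operations):

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Build the 256-entry lookup table for reflected CRC-32 (poly 0xEDB88320).
// Slice-by-16 would build 16 such tables; this sketch keeps only the first.
inline const std::array<uint32_t, 256>& Crc32Table() {
  static const std::array<uint32_t, 256> table = [] {
    std::array<uint32_t, 256> t{};
    for (uint32_t n = 0; n < 256; ++n) {
      uint32_t c = n;
      for (int k = 0; k < 8; ++k) {
        c = (c >> 1) ^ (0xEDB88320u & (0u - (c & 1u)));
      }
      t[n] = c;
    }
    return t;
  }();
  return table;
}

// Byte-at-a-time table-driven CRC-32: one table lookup per input byte.
inline uint32_t Crc32Tabular(const uint8_t* data, size_t len) {
  const auto& t = Crc32Table();
  uint32_t crc = 0xFFFFFFFFu;
  for (size_t i = 0; i < len; ++i) {
    crc = (crc >> 8) ^ t[(crc ^ data[i]) & 0xFFu];
  }
  return crc ^ 0xFFFFFFFFu;
}
```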

@asfimport (Collaborator Author)

Kouhei Sutou / @kou:
I'm OK with the vendoring approach if we can find a fast, simple CRC32 implementation.

> https://github.com/robn/cyrus-imapd/blob/master/lib/crc32.c

If we use Cyrus's implementation, we should use https://github.com/cyrusimap/cyrus-imapd/blob/master/lib/crc32.c, not a fork.

If we can't find a fast, simple CRC32 implementation, I prefer to use (not vendor) zlib's implementation for maintainability.

@asfimport (Collaborator Author)

Xuwei Fu / @mapleFU:
@kou @pitrou The crc32 from cyrus-imapd is introduced in https://github.com/apache/arrow/pull/14351
Mind taking a look?

@wjones127 (Member)

@mapleFU could you "take" this one too?

@mapleFU (Member)

mapleFU commented Feb 10, 2023

take

wjones127 pushed a commit that referenced this issue Feb 10, 2023
… DATA_PAGE (v1) (#14351)

This patch adds CRC support when writing and reading DATA_PAGE. CRC for dictionary pages and DATA_PAGE_V2 will be added in coming patches.

* [x] Implement CRC when writing DATA_PAGE
* [x] Implement CRC when reading DATA_PAGE
* [x] Add config options for writing page CRCs and for checking them
* [x] Test DATA_PAGE with CRC, with tests partly borrowed from `parquet-mr`
* [x] Use the CRC library discussed in https://issues.apache.org/jira/browse/ARROW-17904

One open question: I found that, in thirdparty, Arrow imports `crc32c`, which is extracted from leveldb's CRC library, but our standard uses crc32, which has a different polynomial. So I vendored the implementation mentioned in https://issues.apache.org/jira/browse/ARROW-17904 .

The default for `enable crc` in parquet-mr's writer is true, but here I use `false`, because enabling it may slow down the writer.
* Closes: #33115

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: Will Jones <willjones127@gmail.com>
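The opt-in behaviour described in the commit message can be sketched as follows. The property and helper names here are hypothetical stand-ins, not Arrow's actual API: the writer only records a page CRC when asked to (default false in this patch, unlike parquet-mr's default of true), and the reader only verifies when asked to, accepting files whose headers carry no CRC at all:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

// Hypothetical property objects mirroring the defaults described above.
struct WriterProperties { bool page_checksum_enabled = false; };
struct ReaderProperties { bool page_checksum_verification = false; };

// Reflected CRC-32 with the polynomial the Parquet spec mandates.
inline uint32_t Crc32(const uint8_t* d, size_t n) {
  uint32_t crc = 0xFFFFFFFFu;
  for (size_t i = 0; i < n; ++i) {
    crc ^= d[i];
    for (int b = 0; b < 8; ++b) {
      crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
  }
  return crc ^ 0xFFFFFFFFu;
}

// Writer side: fill the page header's optional crc field only when enabled.
inline std::optional<uint32_t> MakePageCrc(const WriterProperties& props,
                                           const uint8_t* page, size_t n) {
  if (!props.page_checksum_enabled) return std::nullopt;
  return Crc32(page, n);
}

// Reader side: verification passes when disabled, or when the header
// carries no CRC (older files); otherwise recompute and compare.
inline bool CheckPageCrc(const ReaderProperties& props,
                         std::optional<uint32_t> header_crc,
                         const uint8_t* page, size_t n) {
  if (!props.page_checksum_verification || !header_crc) return true;
  return Crc32(page, n) == *header_crc;
}
```

Making both sides opt-in keeps the write path as fast as before by default while remaining compatible with files written without checksums.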
@wjones127 wjones127 added this to the 12.0.0 milestone Feb 10, 2023