-
Notifications
You must be signed in to change notification settings - Fork 6.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handoff checksum Implementation #7523
Conversation
e1e1984
to
9c9ed06
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhichao-cao has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We seem to have disparate configurations that need to work together to make this work. If I set checksum_handoff_file_types, that does nothing on its own. If I support the DVI versions of FSWritableFile functions, that does nothing on its own. Why should I have to add something on both ends?
Echoing some comments on the design document, I think the DB and/or column families should only have a notion of which files are considered "critical" and let the storage layer handle that appropriately, when there is a distinction to be made in handling them. It then makes sense for there to be a reasonable set of defaults considered "critical."
(Perhaps revisiting old debates.) If handoff checksums are computed in the storage layer anyway, why can't the Env / FileSystem provide a function to generate the DataVerificationInfo it expects from data. And if we do that, why not let DataVerificationInfo live entirely under Env / FileSystem (in appropriate implementations)?
Thanks for the review. !
My point is, this is a three-ways contracts. 1) user want to do handoff on which type of files, 2) RocksDB provides the path, 3) the customized File System support it. From RocksDB's view, it does not know File System support it or not. From user's view, it will set the option only when it knows the customized File System will do the job.
In our case, we hope to calculate the checksum as early as possible, then the customized File System can use/verfiy/pass it. So, it is better to calculate the checksum in WritableFileWrite with the control option.
The motivation of this PR is to move the checksum calculation from Env/File system up, and do it as early as possible. Even in WritableFileWriter, we do memory copy. There will be a follow up PR to bridge the gap, which covers the checksum verification from source buffer, to the place where the handoff checksum is calculated (due to the buffer write and write rate limiter, we are not able to calculate the checksum for the data being append that early) |
Summary: As suggested by pdillinger ,The name of kLogFile is misleading, in some tests, kLogFile is defined as info log. Replace it with kWalFile and move it to public, which will be used in #7523 Pull Request resolved: #7580 Test Plan: make check Reviewed By: riversand963 Differential Revision: D24485420 Pulled By: zhichao-cao fbshipit-source-id: 955e3dacc1021bb590fde93b0a568ffe9ad80799
9c9ed06
to
bf8b2e8
Compare
@zhichao-cao has updated the pull request. You must reimport the pull request before landing. |
1 similar comment
@zhichao-cao has updated the pull request. You must reimport the pull request before landing. |
I'm finding it difficult to approve this complication of the Env interface without actually seeing a data integrity improvement that is not possible with the existing interface. Computing checksum on the buffer passed to Append in caller rather than callee to me is not a meaningful data integrity improvement. |
You mean there is nearly no different between Append itself do the checksum calculation for the buffer and the caller of Append to do it? |
If it's the same buffer, I don't see a difference between computing checksum right before call vs. right at beginning of call. |
Due to the rate limiter, yes, the buffer/size is the same as what we pass down. But based on the WS and Zippy requirement, if RocksDB can provide the calculation in higher layer instead of FileSystem, they can avoid re-calculate it again. In current implementation+the follow up one, we can achieve from WritableFileWriter::append(data)->buffer for WritableFile::Append->WS client validation. Without this interface, even we verify the buffer in WritableFileWriter, FileSystem still need to calculate again, so that's one more calculation. |
Just to clarify, is the
I tend to think it's ok to trust the memory's reliability features for data at rest, so computing in callee would work for me. However, any gap in the software-level protection feels like somewhere users will place the blame in corruption investigations, so I could go either way.
Do you mean in the future there'll be integrity protection computed in edit: s/rolling checksum/cumulative file checksum/g |
Yes. The motivation of this project, is to handle the potential corruption of WAL and Manifest, with the discussion with Zippy, they will use the new FileSystem Append with verification to handle the checksum handoff for their log. So they also want the hand off can be used for other files like SST files. The protection will be in this way: by using the rolling checksum of the data in the buffer of |
Understood, thanks for elaborating. In this PR, it looks like the checksums passed to edit: s/rolling checksum/cumulative file checksum/g |
Based on the current information from WS, they can only use the checksum information aligned with the data being passed in |
Got it, thanks for the explanation.
My understanding is with a edit: s/rolling checksum/cumulative file checksum/g |
My personal consideration is, in the API, we pass a data structure: DataVerificationInfo. It can bring any verification information we want. First, it is natural to bring the checksum align with the data in |
Except not really, because right now the generation of DataVerificationInfo is fixed in the implementation. We are directly tied up in any updates or evolution. Wouldn't it be easier for both sides if we just provide a way for WritableFileWriter to do no buffering (IMHO at the choice of Env/FS), and pass original data directly to FSWritableFile for each Append? Then the FS/Env implementation can worry about its own buffering and coordinating that with hand-off checksums. The fact that we provide buffering above the FS layer seems to me a convenience to FS implementors, and should be optional so that our implementation does not become intertwined will all possible concerns of buffering for all possible FS implementations. Re-implementing a buffering layer into an already complex, niche FS implementation should not be complicated, and not nearly complicated enough to justify intertwining our code and releases with this problem. |
It is a possible direction, but may not be this PR. If we let WritableFileWriter to do no buffering and directly pass what is appended to FS, we will not be able to control write rate limiter, the FS need to deal with the small writes. This can be a future implementation to give an option to control if the user want to let FS to fully control the write process of Append. But here, since Zippy and WS relies on the current flow in WritableFileWriter, it would be better to follow the current buffering and rate limit control in WritableFileWriter. |
This is already handled in And I propose that the theoretical problem of data being corrupted while waiting for the rate limiter be left for future work. (AFAICT, my proposal allows an FS implementation to cover more of the pipeline with checksums sooner than this PR.) |
Can you explain more on your proposal? Or a design doc for it. It is hard for me to pick up the overall information from the comments here. |
675716a
to
e8d89fb
Compare
@zhichao-cao has updated the pull request. You must reimport the pull request before landing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhichao-cao has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
e8d89fb
to
ce1b47e
Compare
@zhichao-cao has updated the pull request. You must reimport the pull request before landing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhichao-cao has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@zhichao-cao merged this pull request in d1c510b. |
Summary: As suggested by pdillinger ,The name of kLogFile is misleading, in some tests, kLogFile is defined as info log. Replace it with kWalFile and move it to public, which will be used in facebook#7523 Pull Request resolved: facebook#7580 Test Plan: make check Reviewed By: riversand963 Differential Revision: D24485420 Pulled By: zhichao-cao fbshipit-source-id: 955e3dacc1021bb590fde93b0a568ffe9ad80799
Summary: in PR facebook#7419 , we introduce the new Append and PositionedAppend APIs to WritableFile at File System, which enable RocksDB to pass the data verification information (e.g., checksum of the data) to the lower layer. In this PR, we use the new API in WritableFileWriter, such that the file created via WritableFileWrite can pass the checksum to the storage layer. To control which types file should apply the checksum handoff, we add checksum_handoff_file_types to DBOptions. User can use this option to control which file types (Currently supported file tyes: kLogFile, kTableFile, kDescriptorFile.) should use the new Append and PositionedAppend APIs to handoff the verification information. Pull Request resolved: facebook#7523 Test Plan: add new unit test, pass make check/ make asan_check Reviewed By: pdillinger Differential Revision: D24313271 Pulled By: zhichao-cao fbshipit-source-id: aafd69091ae85c3318e3e17cbb96fe7338da11d0
Summary: in PR facebook/rocksdb#7419 , we introduce the new Append and PositionedAppend APIs to WritableFile at File System, which enable RocksDB to pass the data verification information (e.g., checksum of the data) to the lower layer. In this PR, we use the new API in WritableFileWriter, such that the file created via WritableFileWrite can pass the checksum to the storage layer. To control which types file should apply the checksum handoff, we add checksum_handoff_file_types to DBOptions. User can use this option to control which file types (Currently supported file tyes: kLogFile, kTableFile, kDescriptorFile.) should use the new Append and PositionedAppend APIs to handoff the verification information. Pull Request resolved: facebook/rocksdb#7523 Test Plan: add new unit test, pass make check/ make asan_check Reviewed By: pdillinger Differential Revision: D24313271 Pulled By: zhichao-cao fbshipit-source-id: aafd69091ae85c3318e3e17cbb96fe7338da11d0 Signed-off-by: Changlong Chen <levisonchen@live.cn>
Summary: As suggested by pdillinger ,The name of kLogFile is misleading, in some tests, kLogFile is defined as info log. Replace it with kWalFile and move it to public, which will be used in facebook/rocksdb#7523 Pull Request resolved: facebook/rocksdb#7580 Test Plan: make check Reviewed By: riversand963 Differential Revision: D24485420 Pulled By: zhichao-cao fbshipit-source-id: 955e3dacc1021bb590fde93b0a568ffe9ad80799 Signed-off-by: Changlong Chen <levisonchen@live.cn>
Summary: in PR facebook/rocksdb#7419 , we introduce the new Append and PositionedAppend APIs to WritableFile at File System, which enable RocksDB to pass the data verification information (e.g., checksum of the data) to the lower layer. In this PR, we use the new API in WritableFileWriter, such that the file created via WritableFileWrite can pass the checksum to the storage layer. To control which types file should apply the checksum handoff, we add checksum_handoff_file_types to DBOptions. User can use this option to control which file types (Currently supported file tyes: kLogFile, kTableFile, kDescriptorFile.) should use the new Append and PositionedAppend APIs to handoff the verification information. Pull Request resolved: facebook/rocksdb#7523 Test Plan: add new unit test, pass make check/ make asan_check Reviewed By: pdillinger Differential Revision: D24313271 Pulled By: zhichao-cao fbshipit-source-id: aafd69091ae85c3318e3e17cbb96fe7338da11d0 Signed-off-by: Changlong Chen <levisonchen@live.cn>
Summary: in PR facebook/rocksdb#7419 , we introduce the new Append and PositionedAppend APIs to WritableFile at File System, which enable RocksDB to pass the data verification information (e.g., checksum of the data) to the lower layer. In this PR, we use the new API in WritableFileWriter, such that the file created via WritableFileWrite can pass the checksum to the storage layer. To control which types file should apply the checksum handoff, we add checksum_handoff_file_types to DBOptions. User can use this option to control which file types (Currently supported file tyes: kLogFile, kTableFile, kDescriptorFile.) should use the new Append and PositionedAppend APIs to handoff the verification information. Pull Request resolved: facebook/rocksdb#7523 Test Plan: add new unit test, pass make check/ make asan_check Reviewed By: pdillinger Differential Revision: D24313271 Pulled By: zhichao-cao fbshipit-source-id: aafd69091ae85c3318e3e17cbb96fe7338da11d0 Signed-off-by: Changlong Chen <levisonchen@live.cn>
Summary: As suggested by pdillinger ,The name of kLogFile is misleading, in some tests, kLogFile is defined as info log. Replace it with kWalFile and move it to public, which will be used in facebook/rocksdb#7523 Pull Request resolved: facebook/rocksdb#7580 Test Plan: make check Reviewed By: riversand963 Differential Revision: D24485420 Pulled By: zhichao-cao fbshipit-source-id: 955e3dacc1021bb590fde93b0a568ffe9ad80799 Signed-off-by: Changlong Chen <levisonchen@live.cn>
Summary: in PR facebook/rocksdb#7419 , we introduce the new Append and PositionedAppend APIs to WritableFile at File System, which enable RocksDB to pass the data verification information (e.g., checksum of the data) to the lower layer. In this PR, we use the new API in WritableFileWriter, such that the file created via WritableFileWrite can pass the checksum to the storage layer. To control which types file should apply the checksum handoff, we add checksum_handoff_file_types to DBOptions. User can use this option to control which file types (Currently supported file tyes: kLogFile, kTableFile, kDescriptorFile.) should use the new Append and PositionedAppend APIs to handoff the verification information. Pull Request resolved: facebook/rocksdb#7523 Test Plan: add new unit test, pass make check/ make asan_check Reviewed By: pdillinger Differential Revision: D24313271 Pulled By: zhichao-cao fbshipit-source-id: aafd69091ae85c3318e3e17cbb96fe7338da11d0 Signed-off-by: Changlong Chen <levisonchen@live.cn>
Summary: in PR facebook/rocksdb#7419 , we introduce the new Append and PositionedAppend APIs to WritableFile at File System, which enable RocksDB to pass the data verification information (e.g., checksum of the data) to the lower layer. In this PR, we use the new API in WritableFileWriter, such that the file created via WritableFileWrite can pass the checksum to the storage layer. To control which types file should apply the checksum handoff, we add checksum_handoff_file_types to DBOptions. User can use this option to control which file types (Currently supported file tyes: kLogFile, kTableFile, kDescriptorFile.) should use the new Append and PositionedAppend APIs to handoff the verification information. Pull Request resolved: facebook/rocksdb#7523 Test Plan: add new unit test, pass make check/ make asan_check Reviewed By: pdillinger Differential Revision: D24313271 Pulled By: zhichao-cao fbshipit-source-id: aafd69091ae85c3318e3e17cbb96fe7338da11d0 Signed-off-by: Changlong Chen <levisonchen@live.cn>
Summary: in PR facebook/rocksdb#7419 , we introduce the new Append and PositionedAppend APIs to WritableFile at File System, which enable RocksDB to pass the data verification information (e.g., checksum of the data) to the lower layer. In this PR, we use the new API in WritableFileWriter, such that the file created via WritableFileWrite can pass the checksum to the storage layer. To control which types file should apply the checksum handoff, we add checksum_handoff_file_types to DBOptions. User can use this option to control which file types (Currently supported file tyes: kLogFile, kTableFile, kDescriptorFile.) should use the new Append and PositionedAppend APIs to handoff the verification information. Pull Request resolved: facebook/rocksdb#7523 Test Plan: add new unit test, pass make check/ make asan_check Reviewed By: pdillinger Differential Revision: D24313271 Pulled By: zhichao-cao fbshipit-source-id: aafd69091ae85c3318e3e17cbb96fe7338da11d0 Signed-off-by: Changlong Chen <levisonchen@live.cn>
…AL (#8412) Summary: In PR #7523 , checksum handoff is introduced in RocksDB for WAL, Manifest, and SST files. When user enable checksum handoff for a certain type of file, before the data is written to the lower layer storage system, we calculate the checksum (crc32c) of each piece of data and pass the checksum down with the data, such that data verification can be down by the lower layer storage system if it has the capability. However, it cannot cover the whole lifetime of the data in the memory and also it potentially introduces extra checksum calculation overhead. In this PR, we introduce a new interface in WritableFileWriter::Append, which allows the caller be able to pass the data and the checksum (crc32c) together. In this way, WritableFileWriter can directly use the pass-in checksum (crc32c) to generate the checksum of data being passed down to the storage system. It saves the calculation overhead and achieves higher protection coverage. When a new checksum is added with the data, we use Crc32cCombine #8305 to combine the existing checksum and the new checksum. To avoid the segmenting of data by rate-limiter before it is stored, rate-limiter is called enough times to accumulate enough credits for a certain write. This design only support Manifest and WAL which use log_writer in the current stage. Pull Request resolved: #8412 Test Plan: make check, add new testing cases. Reviewed By: anand1976 Differential Revision: D29151545 Pulled By: zhichao-cao fbshipit-source-id: 75e2278c5126cfd58393c67b1efd18dcc7a30772
Summary: in PR facebook/rocksdb#7419 , we introduce the new Append and PositionedAppend APIs to WritableFile at File System, which enable RocksDB to pass the data verification information (e.g., checksum of the data) to the lower layer. In this PR, we use the new API in WritableFileWriter, such that the file created via WritableFileWrite can pass the checksum to the storage layer. To control which types file should apply the checksum handoff, we add checksum_handoff_file_types to DBOptions. User can use this option to control which file types (Currently supported file tyes: kLogFile, kTableFile, kDescriptorFile.) should use the new Append and PositionedAppend APIs to handoff the verification information. Pull Request resolved: facebook/rocksdb#7523 Test Plan: add new unit test, pass make check/ make asan_check Reviewed By: pdillinger Differential Revision: D24313271 Pulled By: zhichao-cao fbshipit-source-id: aafd69091ae85c3318e3e17cbb96fe7338da11d0 Signed-off-by: Changlong Chen <levisonchen@live.cn>
in PR #7419 , we introduce the new Append and PositionedAppend APIs to WritableFile at File System, which enable RocksDB to pass the data verification information (e.g., checksum of the data) to the lower layer. In this PR, we use the new API in WritableFileWriter, such that the file created via WritableFileWrite can pass the checksum to the storage layer. To control which types file should apply the checksum handoff, we add checksum_handoff_file_types to DBOptions. User can use this option to control which file types (Currently supported file tyes: kLogFile, kTableFile, kDescriptorFile.) should use the new Append and PositionedAppend APIs to handoff the verification information.
Test plan: add new unit test, pass make check/ make asan_check