Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use FileChecksumGenFactory for SST file checksum #6600

Closed
wants to merge 7 commits into from

Conversation

zhichao-cao
Copy link
Contributor

In the current implementation, sst file checksum is calculated by a shared checksum function object, which may make some checksum function hard to be applied here such as SHA1. In this implementation, each sst file will have its own checksum generator obejct, created by FileChecksumGenFactory. User needs to implement its own FilechecksumGenerator and Factory to plugin the in checksum calculation method.

Test plan: tested with make asan_check

Copy link
Contributor

@siying siying left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to update HISTORY.md for public interface change.

// Return a processed value of the checksum for store in somewhere
virtual std::string ProcessChecksum(const std::string& checksum) = 0;
// Get the checksum
virtual std::string GetChecksum() = 0;

// Returns a name that identifies the current file checksum function.
virtual const char* Name() const = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this? Or is it enough to only have it in FileChecksumGenFactory?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this? Or is it enough to only have it in FileChecksumGenFactory?

It might be more clear to user different name for FilechecksumGenerator and the Factory. User can decide to use the same name or not I think.

Copy link
Contributor

@siying siying Mar 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it used in code for now? Where do you plan to use this Name()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it used in code for now? Where do you plan to use this Name()?

Each time, after the SST file is generated, table builder will call to get the checksum generator name and checksum. The name of ChecksumGenFactory is not used yet.

assert(data != nullptr);
return Uint32ToString(crc32c::Value(data, n));
if (is_inintilized_ == false) {
checksum_ = crc32c::Value(data, n);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the implementation of crc32c::Value():

inline uint32_t Value(const char* data, size_t n) {
  return Extend(0, data, n);
}

I don't think we need this if and is_inintilized_.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@siying Thanks for pointing out. I will change accordingly.

private:
uint32_t checksum_;
bool is_inintilized_;
std::string file_name_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is it used for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@siying Will remove.

@zhichao-cao
Copy link
Contributor Author

We need to update HISTORY.md for public interface change.

Sure.


// Returns a name that identifies the current file checksum function.
virtual const char* Name() const = 0;
};

// Create the FileChecksumGenerator object for each SST file.
class FileChecksumGenFactory {
public:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: explicitly write the default constructor.

include/rocksdb/file_checksum.h Outdated Show resolved Hide resolved
file/writable_file_writer.h Outdated Show resolved Hide resolved
if (checksum_generator_ != nullptr) {
return checksum_generator_->GetChecksum();
} else {
return kUnknownFileChecksum;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's not a part of the PR but does it make more sense to call it kNoFileChecksum? Similarly for kUnknownFileChecksumFuncName

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@siying So change to kNoFileChecksum and kNoFileChecksumGenName ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@siying. I will change them in the next PR.

// Return a processed value of the checksum for store in somewhere
virtual std::string ProcessChecksum(const std::string& checksum) = 0;
// Get the checksum
virtual std::string GetChecksum() = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought you would add a Finalize() function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@siying Yes. Just pushed the updates.

@@ -216,9 +216,18 @@ Status WritableFileWriter::Flush() {
return s;
}

std::string WritableFileWriter::GetFileChecksum() {
if (checksum_generator_ != nullptr) {
checksum_generator_->Finalize();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finalize() will be called each time we call GetFileChecksum(). This does not sound intended usage pattern in which I expect Finalize() to be called only once.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe considering adding a member bool finalized_?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@riversand963 Good suggestion! I have not considered this situation. PR is updated accordingly

uint32_t checksum_value = StringToUint32(checksum);
return Uint32ToString(crc32c::Mask(checksum_value));
}
void Finalize() override { checksum_str_ = Uint32ToString(checksum_); }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should avoid/disallow calling Finalize multiple times. Currently it's possible due to the implementation of GetFileChecksum().

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhichao-cao has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Copy link
Contributor

@siying siying left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have blocking comment for now. I'll defer to @riversand963 to approve.

if (checksum_generator_ != nullptr) {
return checksum_generator_->GetChecksum();
} else {
return kUnknownFileChecksum;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

std::string WritableFileWriter::GetFileChecksum() {
if (checksum_generator_ != nullptr) {
if (!checksum_finalized_) {
checksum_generator_->Finalize();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only calling Finalize() inside GetFileChecksum() defeats the purpose of Finalize(). The application can do it if they want to. In order for Finalize() to make sense, we should call it when we close the file. Otherwise, there is no point of having such a function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@siying I see the logic here. Will move to Close and update the GetChecksum order in builder.cc

@facebook-github-bot
Copy link
Contributor

@zhichao-cao has updated the pull request. Re-import the pull request

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhichao-cao has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Copy link
Contributor

@siying siying left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change looks good to me. Defer to @riversand963 to approve.

db/builder.cc Outdated Show resolved Hide resolved
Copy link
Contributor

@riversand963 riversand963 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @zhichao-cao for working on this.

db/builder.cc Outdated Show resolved Hide resolved
file/writable_file_writer.cc Outdated Show resolved Hide resolved
@zhichao-cao
Copy link
Contributor Author

LGTM. Thanks @zhichao-cao for working on this.

Thanks for the review and comments! Will change accordingly.

@zhichao-cao
Copy link
Contributor Author

The change looks good to me. Defer to @riversand963 to approve.

Thanks for the review and comments!

@facebook-github-bot
Copy link
Contributor

@zhichao-cao has updated the pull request. Re-import the pull request

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhichao-cao has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@zhichao-cao merged this pull request in e8d332d.

levichen94 pushed a commit to bytedance/terarkdb that referenced this pull request Jun 18, 2021
Summary:
In the current implementation, sst file checksum is calculated by a shared checksum function object, which may make some checksum function hard to be applied here such as SHA1. In this implementation, each sst file will have its own checksum generator obejct, created by FileChecksumGenFactory. User needs to implement its own FilechecksumGenerator and Factory to plugin the in checksum calculation method.
Pull Request resolved: facebook/rocksdb#6600

Test Plan: tested with make asan_check

Reviewed By: riversand963

Differential Revision: D20717670

Pulled By: zhichao-cao

fbshipit-source-id: 2a74c1c280ac11a07a1980185b43b671acaa71c6
Signed-off-by: Changlong Chen <levisonchen@live.cn>
levichen94 pushed a commit to bytedance/terarkdb that referenced this pull request Jun 18, 2021
Summary:
In the current implementation, sst file checksum is calculated by a shared checksum function object, which may make some checksum function hard to be applied here such as SHA1. In this implementation, each sst file will have its own checksum generator obejct, created by FileChecksumGenFactory. User needs to implement its own FilechecksumGenerator and Factory to plugin the in checksum calculation method.
Pull Request resolved: facebook/rocksdb#6600

Test Plan: tested with make asan_check

Reviewed By: riversand963

Differential Revision: D20717670

Pulled By: zhichao-cao

fbshipit-source-id: 2a74c1c280ac11a07a1980185b43b671acaa71c6
Signed-off-by: Changlong Chen <levisonchen@live.cn>
levichen94 pushed a commit to bytedance/terarkdb that referenced this pull request Jun 22, 2021
Summary:
In the current implementation, sst file checksum is calculated by a shared checksum function object, which may make some checksum function hard to be applied here such as SHA1. In this implementation, each sst file will have its own checksum generator obejct, created by FileChecksumGenFactory. User needs to implement its own FilechecksumGenerator and Factory to plugin the in checksum calculation method.
Pull Request resolved: facebook/rocksdb#6600

Test Plan: tested with make asan_check

Reviewed By: riversand963

Differential Revision: D20717670

Pulled By: zhichao-cao

fbshipit-source-id: 2a74c1c280ac11a07a1980185b43b671acaa71c6
Signed-off-by: Changlong Chen <levisonchen@live.cn>
mm304321141 pushed a commit to bytedance/terarkdb that referenced this pull request Jun 23, 2021
Summary:
In the current implementation, sst file checksum is calculated by a shared checksum function object, which may make some checksum function hard to be applied here such as SHA1. In this implementation, each sst file will have its own checksum generator obejct, created by FileChecksumGenFactory. User needs to implement its own FilechecksumGenerator and Factory to plugin the in checksum calculation method.
Pull Request resolved: facebook/rocksdb#6600

Test Plan: tested with make asan_check

Reviewed By: riversand963

Differential Revision: D20717670

Pulled By: zhichao-cao

fbshipit-source-id: 2a74c1c280ac11a07a1980185b43b671acaa71c6
Signed-off-by: Changlong Chen <levisonchen@live.cn>
mm304321141 pushed a commit to bytedance/terarkdb that referenced this pull request Jun 23, 2021
Summary:
In the current implementation, sst file checksum is calculated by a shared checksum function object, which may make some checksum function hard to be applied here such as SHA1. In this implementation, each sst file will have its own checksum generator obejct, created by FileChecksumGenFactory. User needs to implement its own FilechecksumGenerator and Factory to plugin the in checksum calculation method.
Pull Request resolved: facebook/rocksdb#6600

Test Plan: tested with make asan_check

Reviewed By: riversand963

Differential Revision: D20717670

Pulled By: zhichao-cao

fbshipit-source-id: 2a74c1c280ac11a07a1980185b43b671acaa71c6
Signed-off-by: Changlong Chen <levisonchen@live.cn>
mm304321141 pushed a commit to bytedance/terarkdb that referenced this pull request Jun 23, 2021
Summary:
In the current implementation, sst file checksum is calculated by a shared checksum function object, which may make some checksum function hard to be applied here such as SHA1. In this implementation, each sst file will have its own checksum generator obejct, created by FileChecksumGenFactory. User needs to implement its own FilechecksumGenerator and Factory to plugin the in checksum calculation method.
Pull Request resolved: facebook/rocksdb#6600

Test Plan: tested with make asan_check

Reviewed By: riversand963

Differential Revision: D20717670

Pulled By: zhichao-cao

fbshipit-source-id: 2a74c1c280ac11a07a1980185b43b671acaa71c6
Signed-off-by: Changlong Chen <levisonchen@live.cn>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants