IngestFile: support ingest SST files generated by live DB #5602
base: main
Conversation
Force-pushed from 71148cb to ae44e53
@zxylvlp Can you describe the use case for this feature?
Feature Request: Merge two RocksDB instances. We want to implement a merge operation between two RocksDB instances in our distributed table store (similar to HBase). First we tried RepairDB, but we found that RepairDB requires the same CF name to map to the same CF id, which is not strongly guaranteed by the RocksDB implementation. Then we tried ingest file, but ingestion requires the table properties ingest-file.version and ingest-file.global_seq, which do not exist in normal SST files. So we added ingest-file.version and ingest-file.global_seq to all normal files, with version=2 and global_seq=-1 (which makes RocksDB read each key's real seqno; if we don't set it, RocksDB reads the largest seqno from the MANIFEST, and if we set it to any other value, that value becomes the global seqno instead of the key's real seqno). Since all SST files now have version=2, we cannot tell whether the largest seqno is a global seqno without changing global_seq in the table property. But changing it requires a pwrite, which distributed file systems often do not implement, and it also mutates SST files that have been checkpointed (hard-linked). I think adding a flag like ingest-file.version to the MANIFEST may make sense.
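A minimal sketch of the property semantics described above, assuming a sentinel value for "use the key's real seqno" (the names `kDisableGlobalSequenceNumber` and `EffectiveSeqno` are illustrative, not RocksDB's actual internals):

```cpp
#include <cstdint>

// Illustrative sentinel: global_seq "disabled" means each key keeps
// its own seqno; any other value overrides every key in the file.
constexpr uint64_t kDisableGlobalSequenceNumber = UINT64_MAX;

// Hypothetical helper showing how a reader might resolve the
// effective sequence number for one key.
uint64_t EffectiveSeqno(uint64_t global_seq_property, uint64_t key_seqno) {
  if (global_seq_property == kDisableGlobalSequenceNumber) {
    return key_seqno;  // global seqno disabled: use the key's real seqno
  }
  return global_seq_property;  // override: every key gets the global seqno
}
```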
Thank you for working on this interesting feature. I left some comments. If you are interested, we can have a short VC call to explain my comments.
@@ -55,6 +55,7 @@ enum CustomTag : uint32_t {
   // kMinLogNumberToKeep as part of a CustomTag as a hack. This should be
   // removed when manifest becomes forward-comptabile.
   kMinLogNumberToKeepHack = 3,
+  kHasGlobalSequence = 4,
I think this one cannot be safely ignored, so it should be 66.
The comment "Mask for an identified tag from the future which can be safely ignored." might be ambiguous.
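A sketch of the tag-masking convention under discussion, assuming the "non-safe-ignore" bit is `1 << 6` (so 66 = 64 | 2); the constant name `kNonSafeIgnoreMask` is an assumption for illustration, not necessarily RocksDB's actual identifier:

```cpp
#include <cstdint>

// Assumed convention: custom tags with this bit set must NOT be
// silently skipped by readers that do not understand them.
constexpr uint32_t kNonSafeIgnoreMask = 1u << 6;  // 64

// An old reader may ignore an unknown tag only if the mask bit is clear.
bool SafeToIgnoreUnknownTag(uint32_t tag) {
  return (tag & kNonSafeIgnoreMask) == 0;
}
```

Under this convention, encoding the new tag as 66 instead of 4 forces old readers to fail loudly rather than drop the global-sequence information.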
OK
@@ -247,6 +254,7 @@ const char* VersionEdit::DecodeNewFile4From(Slice* input) {
   uint64_t file_size;
   SequenceNumber smallest_seqno;
   SequenceNumber largest_seqno;
+  bool has_global_seqno = false;
If we want a general solution, we may want a more straightforward name for what this is. Maybe something like 'ignore_seqno_in_file'?
Maybe 'largest_seqno_is_global_seqno' is better?
I feel 'ignore_seqno_in_file' is better than 'largest_seqno_is_global_seqno'.
  // table property.
  // Warning: setting it to false may make RepairDB useless after MANIFEST
  // corruption.
  bool check_cf_id_property_before_ingest = true;
The name "check" is ambiguous. We should say "allow" or not.
'allow_cf_id_property_check_before_ingest' ?
'allow_ingest_files_with_different_cf_id_property' ?
'allow_ingest_files_with_different_cf_id_property' sounds better to me.
  // assert(value_type == ValueType::kTypeValue ||
  //        value_type == ValueType::kTypeMerge ||
  //        value_type == ValueType::kTypeDeletion ||
  //        value_type == ValueType::kTypeRangeDeletion);
I didn't see where multiple versions of the same user key are handled. Right now I assume we generate the same key with the same seqno multiple times? That might not be a good idea. What we can do here is compare each key with the previous key; if they are the same, ignore the next key.
I think db_iter & compaction_iter can handle this.
I am not sure that we handle the same key with the same sequence number correctly. Even if we do today (I believe we don't), it's not an assumption we want to make in the future.
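The reviewer's suggestion (compare each key with the previous one and skip duplicates) can be sketched as a simple pass over entries already sorted in internal-key order, newest version first. This is an illustration with plain containers, not the PR's actual iterator code:

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Given entries sorted by (user_key asc, seqno desc), keep only the
// first (newest) version of each user key and drop the rest, so no
// two surviving entries share a user key.
std::vector<std::pair<std::string, uint64_t>> DropOlderVersions(
    const std::vector<std::pair<std::string, uint64_t>>& sorted_entries) {
  std::vector<std::pair<std::string, uint64_t>> out;
  for (const auto& e : sorted_entries) {
    if (!out.empty() && out.back().first == e.first) {
      continue;  // same user key as the previous entry: skip it
    }
    out.push_back(e);
  }
  return out;
}
```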
@@ -1376,7 +1378,7 @@ Status BlockBasedTable::ReadPropertiesBlock(
       rep_->index_type == BlockBasedTableOptions::kBinarySearchWithFirstKey;

   s = GetGlobalSequenceNumber(*(rep_->table_properties), largest_seqno,
-                              &(rep_->global_seqno));
+                              has_global_seqno, &(rep_->global_seqno));
I didn't see anything changed in binary search dealing with seqno. Right now, binary search still works because all keys are encoded with seqno, so when we seek to (user_key, seqno), we are guaranteed to see all entries with that user_key. That is not the case if the seqno is not always 0. Something needs to be done here to make it work.
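The ordering the binary search relies on can be sketched as follows: internal keys sort by user key ascending, then by sequence number descending, so seeking with the maximum seqno lands before every version of that user key. This is a simplified illustration; RocksDB's real InternalKeyComparator also decodes a packed (seqno, value-type) trailer:

```cpp
#include <cstdint>
#include <string>

// Simplified internal-key ordering: user key ascending, then seqno
// DESCENDING, so newer versions of the same user key sort earlier.
bool InternalKeyLess(const std::string& uk_a, uint64_t seq_a,
                     const std::string& uk_b, uint64_t seq_b) {
  if (uk_a != uk_b) {
    return uk_a < uk_b;
  }
  return seq_a > seq_b;  // higher seqno sorts first within a user key
}
```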
OK, I will change it.
  s = test::DestroyDir(env_, test::TmpDir(env_) + "/external_live_db");
  ASSERT_OK(s);
}
The fact that the unit test doesn't cover the critical cases I mentioned worries me. Multiple versions of the same key, multiple merge operands for the same key, a snapshot lower than a seqno in the file, etc. are all important cases prone to failure, and they should all be covered.
If you feel handling all the cases is too complicated, we can step back and chat about how to do things step by step.
@zxylvlp also, please update the summary of the pull request.
Is work on this ongoing? If not, I am interested in taking this over and trying to move it forward.
                                 SequenceNumber* seqno) {
   const auto& props = table_properties.user_collected_properties;
   const auto version_pos = props.find(ExternalSstFilePropertyNames::kVersion);
   const auto seqno_pos = props.find(ExternalSstFilePropertyNames::kGlobalSeqno);

-  *seqno = kDisableGlobalSequenceNumber;
+  *seqno = has_global_seqno ? largest_seqno : kDisableGlobalSequenceNumber;
Is this taking the seqno from the table in the old DB and using that as the seqno for the table in the new DB? If so, that'll break the invariant that newer versions of a given user key are in newer components of the LSM. Breaking that invariant means the DB contents will appear to change when compactions happen, and that Get() will return different results from an iterator.
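The invariant at stake can be sketched with the visibility rule a read follows: at a given snapshot, the visible version of a user key is the one with the largest seqno that is at most the snapshot. If an ingested file reuses stale seqnos from the old DB, a write the user makes *later* can carry a *smaller* seqno than the ingested entry and lose this comparison. This is an illustrative model, not RocksDB's actual read path:

```cpp
#include <cstdint>
#include <vector>

struct Version {
  uint64_t seqno;
  int value;
};

// Return the version with the largest seqno <= snapshot, or nullptr
// if no version is visible at that snapshot.
const Version* Visible(const std::vector<Version>& versions,
                       uint64_t snapshot) {
  const Version* best = nullptr;
  for (const auto& v : versions) {
    if (v.seqno <= snapshot && (best == nullptr || v.seqno > best->seqno)) {
      best = &v;
    }
  }
  return best;
}
```

With an ingested entry at seqno 900 and a later user write at seqno 50, a read at snapshot 1000 returns the ingested value even though the write happened afterward, which is exactly the anomaly the reviewer describes.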
largest_seqno is the global_seqno stored in the MANIFEST when has_global_seqno is set.
I already use this feature in our production deployment, but I haven't been tracking this PR recently. I would be glad to cooperate on completing it.
No description provided.