
IngestFile: support ingest SST files generated by live DB #5602

Open · wants to merge 1 commit into base: main
Conversation

@zxylvlp (Contributor) commented Jul 21, 2019

No description provided.

@anand1976 (Contributor):

@zxylvlp Can you describe the use case for this feature?

@zxylvlp (Contributor, Author) commented Jul 23, 2019

Feature Request: Merge two RocksDB instances

We want to implement a merge operation between two RocksDB instances in our distributed table store, which is similar to HBase.

First, we tried RepairDB, but found that it requires identical column family names to map to identical column family IDs, which the RocksDB implementation does not strongly guarantee.

Next, we tried file ingestion, but ingestion requires the table properties ingest-file.version and ingest-file.global_seq, which do not exist in normal SST files.

So we add ingest-file.version and ingest-file.global_seq to all normal SST files, with version=2 and global_seq=-1 (which lets RocksDB read each key's real seqno; if the property is not set, RocksDB reads the largest seqno from the MANIFEST, and if it is set to any other value, that value becomes the global seqno for every key instead of the keys' real seqnos).

Since all SST files then have version=2, we cannot tell whether the largest seqno is a global seqno without changing global_seq in the table properties, so we must rewrite global_seq in place.

But rewriting in place requires pwrite, which distributed file systems often do not implement.

It also mutates SST files that have been checkpointed (hard-linked).

I think adding a flag like ingest-file.version to the MANIFEST instead may make sense.

ref: https://www.facebook.com/groups/rocksdb.dev/
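The property convention described above might be sketched as follows. This is an illustrative model, not RocksDB code: real table properties are strings, and `ResolveGlobalSeqno` is a hypothetical stand-in for the real GetGlobalSequenceNumber; only the two property names mirror RocksDB's ExternalSstFilePropertyNames.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Illustrative model of the proposed convention (not RocksDB code).
static const uint64_t kDisableGlobalSequenceNumber = UINT64_MAX;

// Decide which seqno a reader should apply to every key in the file:
//  - no version property            -> normal SST, keys keep real seqnos
//  - global_seqno == -1 (all ones)  -> proposed: keys keep real seqnos
//  - any other global_seqno value   -> that value overrides every key
uint64_t ResolveGlobalSeqno(const std::map<std::string, uint64_t>& props) {
  auto ver = props.find("rocksdb.external_sst_file.version");
  if (ver == props.end()) {
    return kDisableGlobalSequenceNumber;  // not an ingestion-marked file
  }
  auto gs = props.find("rocksdb.external_sst_file.global_seqno");
  if (gs == props.end() || gs->second == UINT64_MAX) {
    return kDisableGlobalSequenceNumber;  // read each key's real seqno
  }
  return gs->second;  // force this seqno for all keys
}
```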

@zxylvlp (Contributor, Author) commented Jul 23, 2019

@anand1976

@siying (Contributor) left a review:

Thank you for working on this interesting feature. I left some comments. If you are interested, we can have a short VC call to explain my comments.

@@ -55,6 +55,7 @@ enum CustomTag : uint32_t {
// kMinLogNumberToKeep as part of a CustomTag as a hack. This should be
// removed when manifest becomes forward-comptabile.
kMinLogNumberToKeepHack = 3,
kHasGlobalSequence = 4,
Contributor:

I think this one cannot be safely ignored, so it should be 66.
The comment "Mask for an identified tag from the future which can be safely ignored." might be ambiguous.

Contributor (Author):

OK
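The numbering rule behind "it should be 66" could be sketched like this. The enum values are assumed from RocksDB's version_edit.h at the time, and `kHasGlobalSequence = 66` reflects the reviewer's suggestion (65 being taken by kPathId): tags with the mask bit set must not be silently skipped by older readers.

```cpp
#include <cstdint>

// Sketch of the custom-tag numbering rule (values assumed, illustrative).
enum CustomTag : uint32_t {
  kTerminate = 1,
  kNeedCompaction = 2,
  kMinLogNumberToKeepHack = 3,
  kCustomTagNonSafeIgnoreMask = 1 << 6,  // 64: marks must-understand tags
  kPathId = 65,
  kHasGlobalSequence = 66,  // cannot be safely ignored by old readers
};

// An old reader may skip an unknown tag only if the mask bit is clear.
bool IsSafeToIgnore(uint32_t tag) {
  return (tag & kCustomTagNonSafeIgnoreMask) == 0;
}
```

Under this scheme, assigning the new tag the value 4 would let an old reader skip it silently, which is exactly what the reviewer warns against.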

@@ -247,6 +254,7 @@ const char* VersionEdit::DecodeNewFile4From(Slice* input) {
uint64_t file_size;
SequenceNumber smallest_seqno;
SequenceNumber largest_seqno;
bool has_global_seqno = false;
Contributor:

If we want to have a general solution, we may want a more straightforward name for what it is. Maybe something like 'ignore_seqno_in_file'?

Contributor (Author):

Maybe 'largest_seqno_is_global_seqno' is better?

Contributor:

I feel 'ignore_seqno_in_file' is better than "largest_seqno_is_global_seqno".

// table property.
// Warning: setting it to false may make RepairDB useless after MANIFEST
// corruption.
bool check_cf_id_property_before_ingest = true;
Contributor:

Name "check" is ambiguous. We should say "allow" or not.

Contributor (Author):

'allow_cf_id_property_check_before_ingest' ?

Contributor (Author):

'allow_ingest_files_with_different_cf_id_property' ?

Contributor:

'allow_ingest_files_with_different_cf_id_property' sounds better to me.

// assert(value_type == ValueType::kTypeValue ||
// value_type == ValueType::kTypeMerge ||
// value_type == ValueType::kTypeDeletion ||
// value_type == ValueType::kTypeRangeDeletion);
Contributor:

I didn't see where multiple versions of the same user key are handled. Right now I assume we generate the same key with the same seqno multiple times? That might not be a good idea. What we can do here is compare each key with the previous key; if they are the same, ignore the next key.

Contributor (Author):

I think db_iter & compaction_iter can handle this.

Contributor:

I am not sure that we handle the same key with the same sequence number correctly. Even if we do today (I believe we don't), it's not an assumption we want to make in the future.
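The reviewer's suggestion (compare each key with the previous one and drop exact duplicates) could be sketched as below. `DedupAdjacent` is a hypothetical helper, not code from the PR; keys are modeled as (user_key, seqno) pairs.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not from the PR): drop an entry whose user key AND
// seqno exactly match the previous entry, so the same version is never
// emitted twice. Input is assumed sorted, as SST writers require.
std::vector<std::pair<std::string, uint64_t>> DedupAdjacent(
    const std::vector<std::pair<std::string, uint64_t>>& in) {
  std::vector<std::pair<std::string, uint64_t>> out;
  for (const auto& kv : in) {
    if (!out.empty() && out.back() == kv) {
      continue;  // exact duplicate of the previous (key, seqno)
    }
    out.push_back(kv);
  }
  return out;
}
```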

@@ -1376,7 +1378,7 @@ Status BlockBasedTable::ReadPropertiesBlock(
rep_->index_type == BlockBasedTableOptions::kBinarySearchWithFirstKey;

s = GetGlobalSequenceNumber(*(rep_->table_properties), largest_seqno,
&(rep_->global_seqno));
has_global_seqno, &(rep_->global_seqno));
Contributor:

I didn't see anything changed in binary search dealing with seqnos. Right now, binary search still works because all keys are encoded with a seqno, so when we seek to (user_key, seqno), we are guaranteed to see all entries for user_key. That is not the case if the seqno is not always 0. Something needs to be done here to make it work.

Contributor (Author):

OK, I will change it.
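The seek semantics under discussion can be sketched with a toy internal-key comparator: user keys sort ascending and seqnos descending, so seeking to (user_key, max seqno) lands before every version of that user key, but only when the seqnos actually stored in the file are meaningful. `InternalKeyLess` is an illustrative stand-in, not RocksDB's InternalKeyComparator.

```cpp
#include <cstdint>
#include <string>

// Toy internal-key ordering (illustrative): user key ascending, then
// sequence number DESCENDING, so higher seqnos (newer versions) of the
// same user key sort first.
bool InternalKeyLess(const std::string& uk_a, uint64_t seq_a,
                     const std::string& uk_b, uint64_t seq_b) {
  if (uk_a != uk_b) {
    return uk_a < uk_b;  // primary order: user key ascending
  }
  return seq_a > seq_b;  // same user key: higher seqno sorts first
}
```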

s = test::DestroyDir(env_, test::TmpDir(env_) + "/external_live_db");
ASSERT_OK(s);
}

Contributor:

The fact that the unit test doesn't cover the critical cases I mentioned worries me. Multiple versions of the same key, multiple merge operands for the same key, a snapshot lower than a seqno in the file, etc. are all important cases prone to failure, and they should all be covered.

If you feel handling all the cases is too complicated, we can step back and chat about how to do things step by step.

@siying (Contributor) commented Jul 25, 2019

@zxylvlp also, please update the summary of the pull request.

@siying self-assigned this Sep 10, 2019
@ajkr (Contributor) commented Jan 18, 2020

Is work on this ongoing? If not, I am interested in taking this and trying to move it forward.

SequenceNumber* seqno) {
const auto& props = table_properties.user_collected_properties;
const auto version_pos = props.find(ExternalSstFilePropertyNames::kVersion);
const auto seqno_pos = props.find(ExternalSstFilePropertyNames::kGlobalSeqno);

*seqno = kDisableGlobalSequenceNumber;
*seqno = has_global_seqno ? largest_seqno : kDisableGlobalSequenceNumber;
@ajkr (Contributor), Jan 19, 2020:

Is this taking the seqno from the table in the old DB and using that as the seqno for the table in the new DB? If so that'll break the invariant that newer versions of a given user key are in newer components of the LSM. Breaking that invariant means the DB contents will appear to change when compactions happen, and that Get() will return different results from an iterator.
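The invariant concern above can be demonstrated with a toy point lookup: a read returns the version of a key with the largest visible seqno, so if an ingested file carries seqnos from a different DB's seqno domain, a stale value can shadow a newer write. This is an illustrative sketch, not RocksDB code; `Lookup` and its inputs are hypothetical.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Toy point lookup (hypothetical): among the stored versions of one key,
// given as (seqno, value) pairs, the version with the largest seqno wins.
// If ingested seqnos come from another DB, a stale value can carry a
// larger seqno than the destination DB's newer write and shadow it.
std::string Lookup(
    const std::vector<std::pair<uint64_t, std::string>>& versions) {
  uint64_t best_seq = 0;
  std::string result;
  for (const auto& v : versions) {
    if (v.first >= best_seq) {
      best_seq = v.first;
      result = v.second;
    }
  }
  return result;
}
```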

Contributor (Author):

largest_seqno is the global_seqno stored in the MANIFEST when has_global_seqno is set.

@zxylvlp (Contributor, Author) commented Jan 19, 2020

Is work on this ongoing? If not, I am interested in taking this and trying to move it forward.

I have already used this feature in our production, but I haven't been tracking this PR recently. I would be glad to cooperate on finishing it.
