-
Notifications
You must be signed in to change notification settings - Fork 6.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add seqno to time mapping #10338
Add seqno to time mapping #10338
Conversation
@jay-zhuang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@jay-zhuang has updated the pull request. You must reimport the pull request before landing. |
r->next_file_number); | ||
TableFileCreationReason::kMisc, 0 /* oldest_key_time */, | ||
0 /* file_creation_time */, "SST Writer" /* db_id */, r->db_session_id, | ||
0 /* target_file_size */, r->next_file_number); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess if a user only ingests data through bulkloading, it should still work since we are adding data points of oldest_ancester_time and smallest_seqno? It's better to add a code comment somewhere on this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, but ingested file has the "oldest_ancester_time" set to 0 (kUnknownOldestAncesterTime), so even the smallest_seqno -> oldest_ancester_time pair is useless.
I'm not sure how we should handle ingested file usecase, I'll add a TODO for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm.... Maybe use file creation time in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
file_creation_time is also set to 0 for ingested sst. not sure if we should set it to current time, I'll leave it as it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels like a bug. Bulkloaded files also needed to be periodically compacted for options.periodic_compaction_seconds
for (const auto& it : copy) { | ||
// If sequence number is the same, pick the one with larger time, which is | ||
// more accurate than the older time. | ||
if (it.seqno == prev.seqno) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since duplication should be very common (since compaction input files from non-L0 tend to have the same source of seqno->time mapping), I wonder whether we should do Sort(), or always maintain deduplicated order in Append().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Append()
does maintain the order and dedup. (which is used for flush).
Sort()
is mostly needed in compaction, which add all information together and do sort once. (ideally we should do k-merge sort, but just to keep it simple and also make it easier to add smallest_seqno -> oldest_ancestor_time pair.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My question is about the compaction case. Compaction case will have lots of duplication. For example, after a L1->L2, many entries in L2 files are the same, as they come from the same L1 file. And then another L1->L2 comes and these entries need to be dedup. So duplication is common here and my understanding is that right now Sort() is critical to make sure we don't store duplicate data here. Maybe we can even consider to use a std::map?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
L0 files typically should not have duplicated entries, as assume they do not have seqno overlap.
On the other hand, the data set is pretty small here. The maximum is 100, but typically for higher level files, it's pretty small. For example, if the preclude_last_level_data_seconds
is set to 3 days, it samples a seqno->time pair every 1 hour, if memtable life-span is less than 1 hour, the map is actually empty for most of L0 files.
@jay-zhuang has updated the pull request. You must reimport the pull request before landing. |
1 similar comment
@jay-zhuang has updated the pull request. You must reimport the pull request before landing. |
0b5cca3
to
0577d3f
Compare
@jay-zhuang has updated the pull request. You must reimport the pull request before landing. |
1 similar comment
@jay-zhuang has updated the pull request. You must reimport the pull request before landing. |
@jay-zhuang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@jay-zhuang has updated the pull request. You must reimport the pull request before landing. |
db/db_impl/db_impl.cc
Outdated
min_first_mem_seqno = | ||
std::min(cfd->GetFirstMemtableSequenceNumber(), min_first_mem_seqno); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is min_first_mem_seqno only for binary search hint? If that is the case, it feels too complex and might not necessarily to be cheaper. Some applications might have thousands of CFs. Looping through all CFs is generally encouraged to be avoided if possible, especially within DB mutex (I know it is now outside but still).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. I'm going to remove that part for now as it's just an optimization to reduce the memory usage. As now the memory usage is capped at 1.6K per DB instance (100 * (8+8))
or 16K for the worst case (multiple CFs with different settings). It's not worth doing I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I got it. It is for retention. I think there is another counter for that. In that case, this number is better to be updated in flush/compaction path where we already loop through all CFs for something else, just like VersionSet::PreComputeMinLogNumberWithUnflushedData(). The information can be used by something else. We might even have the information precalculated, though I didn't find it.
Of course we don't have to do it here.
@jay-zhuang has updated the pull request. You must reimport the pull request before landing. |
1 similar comment
@jay-zhuang has updated the pull request. You must reimport the pull request before landing. |
@jay-zhuang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Which will be used for tiered storage to preclude hot data from compacting to the cold tier (the last level). Internally, adding seqno to time mapping. A periodic_task is scheduled to record the current_seqno -> current_time in certain cadence. When memtable flush, the mapping informaiton is stored in sstable property. During compaction, the mapping information are merged and get the approximate time of sequence number, which is used to determine if a key is recently inserted or not and preclude it from the last level if it's recently inserted (within the `preclude_last_level_data_seconds`). Test Plan: CI
7609f5b
to
0006484
Compare
@jay-zhuang has updated the pull request. You must reimport the pull request before landing. |
@jay-zhuang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
1 similar comment
@jay-zhuang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@jay-zhuang has updated the pull request. You must reimport the pull request before landing. |
@jay-zhuang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Which will be used for tiered storage to preclude hot data from
compacting to the cold tier (the last level).
Internally, adding seqno to time mapping. A periodic_task is scheduled
to record the current_seqno -> current_time in certain cadence. When
memtable flush, the mapping informaiton is stored in sstable property.
During compaction, the mapping information are merged and get the
approximate time of sequence number, which is used to determine if a key
is recently inserted or not and preclude it from the last level if it's
recently inserted (within the
preclude_last_level_data_seconds
).Test Plan: CI