
Add seqno to time mapping #10338

Closed · wants to merge 6 commits
Conversation

@jay-zhuang (Contributor) commented on Jul 11, 2022

This will be used for tiered storage to preclude hot data from
compacting to the cold tier (the last level).
Internally, it adds a seqno-to-time mapping. A periodic_task is scheduled
to record the current_seqno -> current_time pair at a certain cadence. When
a memtable is flushed, the mapping information is stored as an SST table
property. During compaction, the mapping information from the input files
is merged and used to get the approximate insert time of a sequence number,
which determines whether a key was recently inserted; if it was (within
preclude_last_level_data_seconds), the key is precluded from the last level.

Test Plan: CI
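
To make the description above concrete, here is a minimal standalone sketch of such a seqno-to-time table (illustrative only; the class and method names below are invented and do not claim to match the PR's actual db/seqno_to_time_mapping.h API): periodic samples are appended in seqno order, capped at 100 entries, and queried for a lower bound on when a given sequence number was written.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <deque>

// Sketch of a periodically sampled seqno -> time table, capped at N entries.
class SeqnoTimeTable {
 public:
  explicit SeqnoTimeTable(size_t max_entries = 100)
      : max_entries_(max_entries) {}

  // Called by the periodic task: "at `time`, the current seqno was `seqno`".
  void Append(uint64_t seqno, uint64_t time) {
    if (!pairs_.empty() && pairs_.back().seqno == seqno) {
      // Same seqno sampled again: keep the larger (more accurate) time.
      pairs_.back().time = std::max(pairs_.back().time, time);
      return;
    }
    pairs_.push_back({seqno, time});
    if (pairs_.size() > max_entries_) {
      pairs_.pop_front();  // cap memory by dropping the oldest sample
    }
  }

  // Lower bound on the write time of `seqno`: the time of the newest sample
  // whose seqno is <= the queried seqno. Returns 0 (unknown) if none exists.
  uint64_t GetApproximateTime(uint64_t seqno) const {
    uint64_t result = 0;
    for (const auto& p : pairs_) {
      if (p.seqno > seqno) break;
      result = p.time;
    }
    return result;
  }

 private:
  struct Pair {
    uint64_t seqno;
    uint64_t time;
  };
  std::deque<Pair> pairs_;  // kept sorted by seqno (samples arrive in order)
  size_t max_entries_;
};

A compaction-side check would then compare this estimate with the current time to decide whether the key falls within preclude_last_level_data_seconds.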

@jay-zhuang requested a review from siying on July 12, 2022
@jay-zhuang marked this pull request as ready for review on July 13, 2022
@facebook-github-bot: @jay-zhuang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot: @jay-zhuang has updated the pull request. You must reimport the pull request before landing.

Resolved review threads (outdated): db/column_family.cc, db/compaction/compaction_iterator.cc, table/table_properties.cc
r->next_file_number);
TableFileCreationReason::kMisc, 0 /* oldest_key_time */,
0 /* file_creation_time */, "SST Writer" /* db_id */, r->db_session_id,
0 /* target_file_size */, r->next_file_number);
Contributor:
I guess that if a user only ingests data through bulkloading, it should still work, since we are adding data points for oldest_ancester_time and smallest_seqno? It's better to add a code comment somewhere about this case.

Contributor (Author):
Yeah, but an ingested file has "oldest_ancester_time" set to 0 (kUnknownOldestAncesterTime), so even the smallest_seqno -> oldest_ancester_time pair is useless.
I'm not sure how we should handle the ingested-file use case; I'll add a TODO for that.

Contributor:
Hmm.... Maybe use file creation time in this case?

Contributor (Author):
file_creation_time is also set to 0 for an ingested SST. Not sure if we should set it to the current time; I'll leave it as it is.

Contributor:
This feels like a bug. Bulkloaded files also need to be periodically compacted for options.periodic_compaction_seconds.
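
As a side note on the thread above, here is a hypothetical fallback helper (not code from this PR; the function name is made up) that prefers oldest_ancester_time and falls back to file_creation_time, treating 0 as unknown for both, per the discussion:

#include <cstdint>

// Hypothetical sketch: pick the best available time signal for a file.
// In the discussion above, both properties are 0 ("unknown") for ingested
// (bulkloaded) SSTs, so neither fallback currently helps that case.
uint64_t ApproximateFileTime(uint64_t oldest_ancester_time,
                             uint64_t file_creation_time) {
  if (oldest_ancester_time != 0) {
    return oldest_ancester_time;  // preferred: age of the oldest data
  }
  if (file_creation_time != 0) {
    return file_creation_time;    // fallback suggested by the reviewer
  }
  return 0;                       // still unknown
}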

Resolved review threads: include/rocksdb/advanced_options.h; db/seqno_to_time_mapping.h (x2, outdated); table/block_based/block_based_table_builder.cc (x2, outdated); db/seqno_to_time_mapping.cc
@jay-zhuang changed the title from "[Draft] Add seqno to time mapping" to "Add seqno to time mapping" on Jul 13, 2022
for (const auto& it : copy) {
// If sequence number is the same, pick the one with larger time, which is
// more accurate than the older time.
if (it.seqno == prev.seqno) {
Contributor:
Since duplication should be very common (compaction input files from non-L0 levels tend to have the same source of seqno->time mapping), I wonder whether we should do Sort(), or always maintain deduplicated order in Append().

Contributor (Author):
Append() does maintain the order and dedup (it is used for flush).
Sort() is mostly needed in compaction, which adds all the information together and sorts once. (Ideally we should do a k-way merge sort, but this keeps it simple and also makes it easier to add the smallest_seqno -> oldest_ancestor_time pair.)

Contributor:
My question is about the compaction case, which will have lots of duplication. For example, after an L1->L2 compaction, many entries in the L2 files are the same, as they come from the same L1 file. Then another L1->L2 comes and those entries need to be deduped. So duplication is common here, and my understanding is that right now Sort() is critical to make sure we don't store duplicate data. Maybe we could even consider using a std::map?

Contributor (Author):
L0 files typically should not have duplicated entries, assuming they do not have seqno overlap.
On the other hand, the data set is pretty small here. The maximum is 100 entries, and for higher-level files it's typically much smaller. For example, if preclude_last_level_data_seconds is set to 3 days, it samples a seqno->time pair every 1 hour; if the memtable life-span is less than 1 hour, the map is actually empty for most L0 files.
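
For reference, a simplified standalone sketch of the sort-and-dedup step this thread discusses (the struct and function names are illustrative; the PR's real logic lives in db/seqno_to_time_mapping.cc): sort by seqno, and for equal seqnos keep the larger, more accurate time, as the excerpted loop above describes. The roughly one-sample-per-hour cadence mentioned above is consistent with spreading a 100-entry budget over a 3-day window (259,200 s / 100 ≈ 43 minutes), though the exact sampling rule is defined elsewhere in the PR.

#include <algorithm>
#include <cstdint>
#include <vector>

struct SeqnoTimePair {
  uint64_t seqno;
  uint64_t time;
};

// Illustrative sort-and-dedup: after concatenating the mappings from all
// compaction input files, order by seqno and keep only the entry with the
// largest time for each duplicated seqno.
void SortAndDedup(std::vector<SeqnoTimePair>& pairs) {
  std::sort(pairs.begin(), pairs.end(),
            [](const SeqnoTimePair& a, const SeqnoTimePair& b) {
              return a.seqno != b.seqno ? a.seqno < b.seqno : a.time > b.time;
            });
  // After sorting, the first entry of each run of equal seqnos has the
  // largest time, so std::unique keeps exactly that one.
  auto last = std::unique(pairs.begin(), pairs.end(),
                          [](const SeqnoTimePair& a, const SeqnoTimePair& b) {
                            return a.seqno == b.seqno;
                          });
  pairs.erase(last, pairs.end());
}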


min_first_mem_seqno =
std::min(cfd->GetFirstMemtableSequenceNumber(), min_first_mem_seqno);
}
}
Contributor:
Is min_first_mem_seqno only a binary-search hint? If that is the case, it feels too complex and might not necessarily be cheaper. Some applications might have thousands of CFs, and looping through all CFs is generally best avoided if possible, especially within the DB mutex (I know it is outside here, but still).

Contributor (Author):
That makes sense. I'm going to remove that part for now, as it's just an optimization to reduce memory usage. The memory usage is now capped at 1.6 KB per DB instance (100 * (8 + 8) bytes), or 16 KB in the worst case (multiple CFs with different settings), so I don't think it's worth doing.

Contributor:
Oh, I got it, it is for retention. I think there is another counter for that. In that case, this number would be better updated in the flush/compaction path, where we already loop through all CFs for something else, just like VersionSet::PreComputeMinLogNumberWithUnflushedData(). The information could then be used by something else; we might even have it precalculated already, though I didn't find it.
Of course, we don't have to do it here.
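
For illustration of the alternative being suggested, a hypothetical standalone version of the min-first-memtable-seqno computation (the surrounding types are stand-ins; only GetFirstMemtableSequenceNumber() is taken from the excerpt above, and the PR ended up dropping this optimization):

#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

// Stand-in for a column family handle; only the one accessor from the code
// excerpt above is modeled here.
struct ColumnFamilyData {
  uint64_t first_memtable_seqno = 0;
  uint64_t GetFirstMemtableSequenceNumber() const {
    return first_memtable_seqno;
  }
};

// Sketch of the suggestion: compute the minimum first-memtable seqno across
// all CFs inside a loop the flush/compaction path already performs (similar
// in spirit to VersionSet::PreComputeMinLogNumberWithUnflushedData()).
uint64_t MinFirstMemtableSeqno(const std::vector<ColumnFamilyData*>& cfds) {
  uint64_t min_first_mem_seqno = std::numeric_limits<uint64_t>::max();
  for (const ColumnFamilyData* cfd : cfds) {
    min_first_mem_seqno =
        std::min(cfd->GetFirstMemtableSequenceNumber(), min_first_mem_seqno);
  }
  return min_first_mem_seqno;
}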

Resolved review thread (outdated): db/db_impl/db_impl.cc
