Add a BlobDB-specific table property collector #8316
Conversation
@ltamasi has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@ltamasi has updated the pull request. You must reimport the pull request before landing.
Should such aggregated information be put into a meta block, instead of a table property?
assert(dynamic_cast<const BlobTablePropertiesCollectorFactory*>(
    int_tbl_prop_collector_factories_.front().get()));
I feel the restriction that the BlobTablePropertiesCollectorFactory has to be the first one is not so obvious.
It's added before the user-specified ones (see lines 558-559 in column_family.cc), so it's guaranteed to be at the head of the list. I can add a comment to point this out, though.
// COPYING file in the root directory) and Apache 2.0 License
// (found in the LICENSE.Apache file in the root directory).

#include "db/blob/blob_stats_record.h"
Just a general code style question: should, for example, blob_stats.h, blob_stats_record.h, and blob_stats_collection.h be combined into one file? And the same for the *.cc files. I have the impression that our codebase has lots of related structures defined in one file.
I don't think we have a hard and fast rule for this; I personally prefer to have separate source files because they can reduce dependencies. (For example, blob_table_properties_collector.h only includes blob_stats.h but not the other two.) As for the pre-existing stuff, there has been some cleanup work done recently to split up large source files to make file sizes more manageable.
ASSERT_NE(user_props.find(TablePropertiesNames::kBlobFileMapping),
          user_props.end());
Why not also check the value of the property?
I felt that would be redundant since we have dedicated unit tests for that. Here, I'm just trying to make sure BlobDB's property collector was active as expected.
Thanks for the review, @jay-zhuang!
You mean a dedicated metablock of its own? (Table properties are also stored in a metablock.) I feel that would be overkill; it would involve a lot of changes for the sake of a single property. The table property collector framework gives us exactly what we need in this case.
Yeah, a dedicated metablock, maybe one just for BlobDB; I think it would be more customizable if we want to store more information in the future. But anyway, a property is easier to implement and use, I guess.
…head of the vector
…action (#8426)

Summary:
This is part of an alternative approach to #8316. Unlike that approach, this one relies on key-values getting processed one by one during compaction, and does not involve persistence.

Specifically, the patch adds a class `BlobGarbageMeter` that can track the number and total size of blobs in a (sub)compaction's input and output on a per-blob file basis. This information can then be used to compute the amount of additional garbage generated by the compaction for any given blob file by subtracting the "outflow" from the "inflow."

Note: this patch only adds `BlobGarbageMeter` and associated unit tests. I plan to hook up this class to the input and output of `CompactionIterator` in a subsequent PR.

Pull Request resolved: #8426

Test Plan: `make check`

Reviewed By: jay-zhuang

Differential Revision: D29242250

Pulled By: ltamasi

fbshipit-source-id: 597e50ad556540e413a50e804ba15bc044d809bb
Closing since we ended up going for a different approach (see #8450).
Summary:
The patch adds a BlobDB-specific table property collector that processes
any blob indexes added to an SST and keeps track of the total number and
size of blobs referenced by the SST on a per-blob file basis. This aggregated
information is then persisted as a new table property,
rocksdb.blob.file.mapping.

This information will be used to calculate the amount of garbage generated by
compactions. Namely, the amount of additional garbage for any given blob file
can be computed by subtracting the "outflow" (total number/size of blobs in the
output SSTs) from the "inflow" (total number/size of blobs in the input SSTs).
Tracking the amount of garbage in blob files in turn will allow us to optimize GC
performance. (Note: one might consider counting blobs e.g. in CompactionIterator
as we process key-values during compaction; however, this would be very
intrusive, and there are actually cases when CompactionIterator does not process
all keys individually, for instance with deletion/TTL compactions or when a
compaction filter returns kRemoveAndSkipUntil.)

Test Plan:
make check