os,osd: initial work to drop onreadable/onapplied callbacks #20177
Conversation
src/os/ObjectStore.h (outdated):

     * Clients of ObjectStore create and maintain their own Sequencer objects.
     * When a list of transactions is queued the caller specifies a Sequencer to be used.
     *
     * ObjectStore users my get collection handles with open_collection() (or,
typo: my->may
src/os/bluestore/BlueStore.cc (outdated):

    ObjectStore::CollectionHandle BlueStore::create_new_collection(
      const coll_t& cid)
    {
      RWLock::WLocker l(coll_lock);
Can be taken after collection creation.
Force-pushed from e540753 to d24a843.
@rzarzynski I updated the last commit to fix a hang; please retest! Thanks.
Force-pushed from d24a843 to a7ab0fa.
From @rzarzynski: I've got results for the latest commit. It's faster: the regression in comparison to master is below 2% for randwrites, 1.93% and 1.39% respectively. :-)

Reference point [1] vs wip-kill-onreadable~ [2] vs the newest wip-kill-onreadable [3]:

| workload | [1] bw | [2] bw | [3] bw |
| --- | --- | --- | --- |
| seqwrite 4 KiB | 30161 | 30693 | 30588 |
| seqwrite 64 KiB | 480848 | 511888 | 509298 |
| randwrite 4 KiB | 120254 | 117502 | 117914 |
| randwrite 64 KiB | 1510461 | 1482712 | 1489362 |

[1] cb396a78d42f034dd61606a327cc703211e49cda
[2] 0f03c857c18475a97b80165f2696570730b208e5
[3] a7ab0faed16125527e7c3b5683e79da162d37006
My results are on my local box and pretty noisy. Everything seems to speed up over time... hrm.

- base:  165252 157698 167683 (master)
- foo-a: 170962 163602 162909 (wip-os-ch)
- foo-b: 181918 174897 167302 (wip-os-ch + slow filestore tracking)
- foo-c: 178579 164342 162355 (wip-os-ch + fast filestore tracking)
- foo-d: 179306 176254 164431 (wip-os-ch + removal of some onreadable callbacks)
- foo-e: 175176 179313 172316 (wip-os-ch + fast filestore tracking + removal of callbacks)
- foo-f: 166346 174985 170836 (wip-os-ch + slow filestore tracking + removal of callbacks)
More results for a7ab0fa from incerta:
@liewegas: got the results for reads:
looks like 2-3% degradation on read. @jdurgin @gregsfortytwo @markhpc ok to proceed?

That seems like a reasonable cost to pay for eliminating ~5k lines throughout the rest of the code base, though I didn't do a full review.
Force-pushed from dc27f53 to eccabf1.
Force-pushed from f5c127e to b5be286.
Note that this is *slight* overkill in that a *source* object of a clone will also appear in the applying map, even though it is not being modified. Given that those clone operations are normally coupled with another transaction that does write (which is why we are cloning in the first place) this should not make any difference. Signed-off-by: Sage Weil <sage@redhat.com>
On any read, wait for any updates to the object to apply first. Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Prevent a collection delete + recreate sequence from allowing two conflicting OpSequencers for the same collection to exist as this can lead to racing async apply threads. Signed-off-by: Sage Weil <sage@redhat.com>
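One way to model what this commit describes is a sequencer registry keyed by collection id, where a delete-then-recreate of the same collection reuses the still-live sequencer instead of minting a second, conflicting one. The sketch below is an assumption-laden illustration, not the PR's actual implementation; `OpSequencer`, `sequencers`, and `get_sequencer` are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Hypothetical minimal OpSequencer.
struct OpSequencer { std::string cid; };

// Registry keyed by collection id. weak_ptr lets entries expire naturally
// once nothing references the sequencer any more.
std::map<std::string, std::weak_ptr<OpSequencer>> sequencers;

std::shared_ptr<OpSequencer> get_sequencer(const std::string& cid) {
  auto& w = sequencers[cid];
  if (auto s = w.lock())
    return s;  // reuse: delete + recreate shares one apply ordering
  auto s = std::make_shared<OpSequencer>(OpSequencer{cid});
  w = s;
  return s;
}
```

The point of reuse is that operations from the deleted incarnation and the recreated one drain through a single ordered queue, so async apply threads cannot race on the same collection.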
Force-pushed from a71d9ec to f126f8d.
Signed-off-by: Sage Weil <sage@redhat.com>
Make sure SnapMapper's ContainerContexts don't outlive the mapper itself. Signed-off-by: Sage Weil <sage@redhat.com>
Note that we don't need to worry about the internal get_omap_iterator callers (e.g., omap_rmkeyrange) because the apply thread does these ops sequentially and in order. Signed-off-by: Sage Weil <sage@redhat.com>
… is done The onreadable completions go through a finisher; add a final event in that stream that keeps the PG alive while prior events flush. flush() isn't quite sufficient since it doesn't wait for the finisher events to flush too--only for the actual apply to have happened. Signed-off-by: Sage Weil <sage@redhat.com>
Force-pushed from c5f66c9 to 42060fd.
We need to flush between splits. This requirement unfortunately doesn't quite go away with the FileStore tracking. Also, flush for each batch; this is just because the test environment may have a low open-file ulimit. (The old code did apply_transaction, so it's functionally equivalent to this.) Signed-off-by: Sage Weil <sage@redhat.com>
The transactions are independent in each collection/sequencer, so we can't record to a single txn object with racing transactions. Fix it by doing one in each collection, and when reading the latest op, use the highest txn value we see. Signed-off-by: Sage Weil <sage@redhat.com>
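The fix described in this commit message can be sketched as one counter per collection, with the read side taking the maximum across them. This is an illustrative model under assumed names (`latest_txn`, `record_txn`, `read_latest_txn`), not the PR's code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// One record per collection, so racing transactions in different
// sequencers never clobber a single shared txn object.
std::map<std::string, uint64_t> latest_txn;  // cid -> highest txn seen

void record_txn(const std::string& cid, uint64_t txn) {
  auto& v = latest_txn[cid];
  v = std::max(v, txn);  // a stale arrival never lowers the record
}

// "Latest op" across the store is the highest value over all collections.
uint64_t read_latest_txn() {
  uint64_t max_txn = 0;
  for (const auto& [cid, txn] : latest_txn)
    max_txn = std::max(max_txn, txn);
  return max_txn;
}
```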
Force-pushed from 42060fd to 08324f7.
Avoid EIO on, say, osdmaps until we fix http://tracker.ceph.com/issues/23029 Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Force-pushed from 825e0e0 to 448b696.
@jdurgin this is passing tests now. ready for review!

@markhpc mind doing a final check on this for small reads and writes, filestore and bluestore?

@liewegas on it
NVMe tests are done, though I am also running some (much slower to populate) HDD tests. This is 72ad764 vs wip-kill-onreadable, 4k IO size to 4 NVMe-backed OSDs. High concurrency, 512GB total RBD volume size with 3x replication.

Columns: FS, 72ad764 | FS, wip-kill-onreadable | BS, 72ad764 | BS, wip-kill-onreadable

In terms of percentage differences: FS read: 0.99%; BS read: 1.74%. I should note this is only a single iteration of these tests, so there may be some variability.
Generally we should keep a ref of each newly received map until we get it written to disk. This is important because as long as the refs are alive, the OSDMaps will be pinned in the cache and we won't try to read them off of disk. Otherwise these maps will probably not stay in the cache, and reading those OSDMaps before they are actually written can result in a crash. ceph#20177 kills the onapplied callbacks entirely, hence here we add the pinned maps back into the on_committed structure instead. Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
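The pinning mechanism this commit describes boils down to keeping `shared_ptr` refs to the new maps inside the commit context, so they cannot be evicted until the write is durable. The sketch below models just that lifetime relationship; `OSDMap`, `OSDMapRef`, and `CommitContext` here are simplified stand-ins, not the real OSD types.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Simplified stand-in for the real OSDMap (src/osd/OSDMap.h).
struct OSDMap { int epoch; };
using OSDMapRef = std::shared_ptr<OSDMap>;

// Hypothetical on_committed context: it holds refs to the newly received
// maps, so they stay pinned (alive, hence cached) until commit completes.
struct CommitContext {
  std::vector<OSDMapRef> pinned_maps;
  void finish() {          // called once the maps are safely on disk
    pinned_maps.clear();   // only now do the refs drop
  }
};
```

The invariant being protected: nothing ever needs to read one of these maps back from disk while it is still uncommitted, because a live ref guarantees the in-memory copy survives.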
I added these comments a few years ago. Since bluestore can read things that aren't committed and ceph#20177 should have made it work for filestore too, this shouldn't matter any more. Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
The ondisk_{read,write}_lock infrastructure was long gone with ceph#20177 merged - c244300, to be specific. Hence the related comments must die since they could be super-misleading. Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
The flush() operation is now only needed when ordering is required between collections/sequencers. Split is the main example: the actual split happens on the parent collection, and we have to wait for that to flush before we start using the new child collection/sequencer.
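The split ordering described above can be modeled very simply: the split op is queued on the parent's sequencer, and `flush()` drains everything queued so far before the child is touched. This is a single-threaded toy model under assumed names (`Sequencer`, `queue_op`), not Ceph's actual sequencer.

```cpp
#include <cassert>
#include <functional>
#include <queue>

// Toy sequencer: ops run in FIFO order; flush() drains all pending ops.
struct Sequencer {
  std::queue<std::function<void()>> ops;
  void queue_op(std::function<void()> f) { ops.push(std::move(f)); }
  void flush() {
    while (!ops.empty()) {
      ops.front()();
      ops.pop();
    }
  }
};
```

Usage mirrors the split case: queue the split on the parent, flush, and only then start queuing ops on the child collection's sequencer.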
There are still several on_applied users left:
...and probably a bit more.
Performance for FileStore is about the same. I haven't checked yet whether there is a measurable BlueStore boost.