os,osd: initial work to drop onreadable/onapplied callbacks #20177
Conversation
src/os/ObjectStore.h (outdated):

     * Clients of ObjectStore create and maintain their own Sequencer objects.
     * When a list of transactions is queued the caller specifies a Sequencer to be used.
     *
     * ObjectStore users my get collection handles with open_collection() (or,
typo: my->may
src/os/bluestore/BlueStore.cc (outdated):

    ObjectStore::CollectionHandle BlueStore::create_new_collection(
      const coll_t& cid)
    {
      RWLock::WLocker l(coll_lock);
Can be taken after collection creation.
Force-pushed from e540753 to d24a843.
@rzarzynski I updated the last commit to fix a hang; please retest! Thanks.
Force-pushed from d24a843 to a7ab0fa.
From @rzarzynski: I've got results for the latest commit. It's faster: the regression in comparison to master is below 2% for randwrites, 1.93% and 1.39% respectively. :-)

Reference point [1] vs wip-kill-onreadable~ [2] vs the newest wip-kill-onreadable [3]:

| workload | [1] bw | [2] bw | [3] bw |
| --- | --- | --- | --- |
| seqwrite 4 KiB | 30161 | 30693 | 30588 |
| seqwrite 64 KiB | 480848 | 511888 | 509298 |
| randwrite 4 KiB | 120254 | 117502 | 117914 |
| randwrite 64 KiB | 1510461 | 1482712 | 1489362 |

[1] cb396a78d42f034dd61606a327cc703211e49cda
[2] 0f03c857c18475a97b80165f2696570730b208e5
[3] a7ab0faed16125527e7c3b5683e79da162d37006
My results are on my local box and pretty noisy. Everything seems to speed up over time... hrm.

- base:  165252 157698 167683 (master)
- foo-a: 170962 163602 162909 (wip-os-ch)
- foo-b: 181918 174897 167302 (wip-os-ch + slow filestore tracking)
- foo-c: 178579 164342 162355 (wip-os-ch + fast filestore tracking)
- foo-d: 179306 176254 164431 (wip-os-ch + removal of some onreadable callbacks)
- foo-e: 175176 179313 172316 (wip-os-ch + fast filestore tracking + removal of callbacks)
- foo-f: 166346 174985 170836 (wip-os-ch + slow filestore tracking + removal of callbacks)
More results for a7ab0fa from incerta:
@liewegas: got the results for reads:
looks like 2-3% degradation on read. @jdurgin @gregsfortytwo @markhpc ok to proceed?

That seems like a reasonable cost to pay for eliminating ~5k lines throughout the rest of the code base, though I didn't do a full review.
Force-pushed from dc27f53 to eccabf1.
Force-pushed from f5c127e to b5be286.
Note that this is *slight* overkill in that a *source* object of a clone will also appear in the applying map, even though it is not being modified. Given that those clone operations are normally coupled with another transaction that does write (which is why we are cloning in the first place) this should not make any difference. Signed-off-by: Sage Weil <sage@redhat.com>
On any read, wait for any updates to the object to apply first. Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Prevent a collection delete + recreate sequence from allowing two conflicting OpSequencers for the same collection to exist as this can lead to racing async apply threads. Signed-off-by: Sage Weil <sage@redhat.com>
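One way to model what this commit describes is a sequencer registry keyed by collection id, where a delete-then-recreate of the same collection reuses the still-live sequencer instead of minting a second, conflicting one. The sketch below is an assumption-laden illustration, not the PR's actual implementation; `OpSequencer`, `sequencers`, and `get_sequencer` are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Hypothetical minimal OpSequencer.
struct OpSequencer { std::string cid; };

// Registry keyed by collection id. weak_ptr lets entries expire naturally
// once nothing references the sequencer any more.
std::map<std::string, std::weak_ptr<OpSequencer>> sequencers;

std::shared_ptr<OpSequencer> get_sequencer(const std::string& cid) {
  auto& w = sequencers[cid];
  if (auto s = w.lock())
    return s;  // reuse: delete + recreate shares one apply ordering
  auto s = std::make_shared<OpSequencer>(OpSequencer{cid});
  w = s;
  return s;
}
```

The point of reuse is that operations from the deleted incarnation and the recreated one drain through a single ordered queue, so async apply threads cannot race on the same collection.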
Force-pushed from a71d9ec to f126f8d.
Signed-off-by: Sage Weil <sage@redhat.com>
Make sure SnapMapper's ContainerContexts don't outlive the mapper itself. Signed-off-by: Sage Weil <sage@redhat.com>
Note that we don't need to worry about the internal get_omap_iterator callers (e.g., omap_rmkeyrange) because the apply thread does these ops sequentially and in order. Signed-off-by: Sage Weil <sage@redhat.com>
… is done The onreadable completions go through a finisher; add a final event in that stream that keeps the PG alive while prior events flush. flush() isn't quite sufficient since it doesn't wait for the finisher events to flush too--only for the actual apply to have happened. Signed-off-by: Sage Weil <sage@redhat.com>
Force-pushed from c5f66c9 to 42060fd.
We need to flush between splits. This requirement unfortunately doesn't quite go away with the FileStore tracking. Also, flush for each batch; this is just because the test environment may have a low open-file ulimit. (The old code did apply_transaction, so it's functionally equivalent to this.) Signed-off-by: Sage Weil <sage@redhat.com>
The transactions are independent in each collection/sequencer, so we can't record to a single txn object with racing transactions. Fix it by doing one in each collection, and when reading the latest op, use the highest txn value we see. Signed-off-by: Sage Weil <sage@redhat.com>
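The fix described in this commit message can be sketched as one counter per collection, with the read side taking the maximum across them. This is an illustrative model under assumed names (`latest_txn`, `record_txn`, `read_latest_txn`), not the PR's code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// One record per collection, so racing transactions in different
// sequencers never clobber a single shared txn object.
std::map<std::string, uint64_t> latest_txn;  // cid -> highest txn seen

void record_txn(const std::string& cid, uint64_t txn) {
  auto& v = latest_txn[cid];
  v = std::max(v, txn);  // a stale arrival never lowers the record
}

// "Latest op" across the store is the highest value over all collections.
uint64_t read_latest_txn() {
  uint64_t max_txn = 0;
  for (const auto& [cid, txn] : latest_txn)
    max_txn = std::max(max_txn, txn);
  return max_txn;
}
```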
Force-pushed from 42060fd to 08324f7.
Avoid EIO on, say, osdmaps until we fix http://tracker.ceph.com/issues/23029 Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Force-pushed from 825e0e0 to 448b696.
@jdurgin this is passing tests now. ready for review!

@markhpc mind doing a final check on this for small reads and writes, filestore and bluestore?

@liewegas on it
NVMe tests are done, though I am also running some (much slower to populate) HDD tests. This is 72ad764 vs wip-kill-onreadable, 4k IO size to 4 NVMe-backed OSDs. High concurrency, 512GB total RBD volume size with 3x replication.

Columns: FS, 72ad764 | FS, wip-kill-onreadable | BS, 72ad764 | BS, wip-kill-onreadable

In terms of percentage differences: FS read: 0.99%; BS read: 1.74%. I should note this is only a single iteration of these tests, so there may be some variability.
Generally we should keep a ref of each newly received map until we get it written to disk. This is important because as long as the refs are alive, the OSDMaps will be pinned in the cache and we won't try to read them off of disk. Otherwise these maps will probably not stay in the cache, and reading those OSDMaps before they are actually written can result in a crash. ceph#20177 kills the onapplied callbacks entirely, hence here we add the pinned maps back into the on_committed structure instead. Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
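The pinning mechanism this commit describes boils down to keeping `shared_ptr` refs to the new maps inside the commit context, so they cannot be evicted until the write is durable. The sketch below models just that lifetime relationship; `OSDMap`, `OSDMapRef`, and `CommitContext` here are simplified stand-ins, not the real OSD types.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Simplified stand-in for the real OSDMap (src/osd/OSDMap.h).
struct OSDMap { int epoch; };
using OSDMapRef = std::shared_ptr<OSDMap>;

// Hypothetical on_committed context: it holds refs to the newly received
// maps, so they stay pinned (alive, hence cached) until commit completes.
struct CommitContext {
  std::vector<OSDMapRef> pinned_maps;
  void finish() {          // called once the maps are safely on disk
    pinned_maps.clear();   // only now do the refs drop
  }
};
```

The invariant being protected: nothing ever needs to read one of these maps back from disk while it is still uncommitted, because a live ref guarantees the in-memory copy survives.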
I added these comments a few years ago. Since bluestore can read things that aren't committed and ceph#20177 should have made it work for filestore too, this shouldn't matter any more. Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
The ondisk_{read,write}_lock infrastructure was long gone with ceph#20177 merged - c244300, to be specific. Hence the related comments must die since they could be super-misleading. Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
The flush() operation is now only needed when ordering is required between collections/sequencers. Split is the main example: the actual split happens on the parent collection, and we have to wait for that to flush before we start using the new child collection/sequencer.
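The split ordering described above can be modeled very simply: the split op is queued on the parent's sequencer, and `flush()` drains everything queued so far before the child is touched. This is a single-threaded toy model under assumed names (`Sequencer`, `queue_op`), not Ceph's actual sequencer.

```cpp
#include <cassert>
#include <functional>
#include <queue>

// Toy sequencer: ops run in FIFO order; flush() drains all pending ops.
struct Sequencer {
  std::queue<std::function<void()>> ops;
  void queue_op(std::function<void()> f) { ops.push(std::move(f)); }
  void flush() {
    while (!ops.empty()) {
      ops.front()();
      ops.pop();
    }
  }
};
```

Usage mirrors the split case: queue the split on the parent, flush, and only then start queuing ops on the child collection's sequencer.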
There are still several on_applied users left:
...and probably a bit more.
Performance for FileStore is about the same. I haven't checked yet whether there is a measurable BlueStore boost.