
os,osd: initial work to drop onreadable/onapplied callbacks #20177

Merged
merged 31 commits into from
Feb 20, 2018

Conversation

liewegas
Member

@liewegas liewegas commented Jan 30, 2018

  • adds tracking for in-flight writes, and wait_for_apply on read operations for FileStore
  • removes lots of on_readable/applied callbacks in OSD
  • more fixes to ceph_test_filestore_idempotent_sequence
  • drops the "pinning" on handle_osd_map that is no longer necessary
  • apply_transactions -> queue_transaction

The flush() operation is now only needed when ordering is required between collections/sequencers. Split is the main example: the actual split happens on the parent collection, and we have to wait for that to flush before we start using the new child collection/sequencer.
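To illustrate the ordering requirement, here is a minimal toy sketch (not the Ceph API; `Sequencer`, `demo_split`, and the `std::vector` child are all stand-ins) of flushing the parent's sequencer before touching the child collection produced by a split:

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Toy sequencer: counts unapplied transactions; flush() waits for zero.
class Sequencer {
  std::mutex m;
  std::condition_variable cv;
  int inflight = 0;
public:
  void queue_begin() { std::lock_guard<std::mutex> l(m); ++inflight; }
  void queue_end() {
    { std::lock_guard<std::mutex> l(m); --inflight; }
    cv.notify_all();
  }
  // Block until every queued transaction has applied.
  void flush() {
    std::unique_lock<std::mutex> l(m);
    cv.wait(l, [this]{ return inflight == 0; });
  }
};

// The split applies asynchronously on the parent's sequencer; flush it
// before using the child collection.
std::size_t demo_split() {
  Sequencer parent;
  std::vector<int> child;          // stands in for the new child collection
  parent.queue_begin();
  std::thread apply([&]{           // async apply of the split transaction
    child.push_back(42);           // the split populates the child
    parent.queue_end();
  });
  parent.flush();                  // ordering point: wait for the split
  std::size_t n = child.size();    // now safe to use the child
  apply.join();
  return n;
}
```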

There are still several on_applied users left:

  • SnapMapper's OSDriver. Here it keeps the in-memory cached mapping in place until it is applied and readable. This isn't strictly necessary, since reads would still be correct (FileStore--and sometimes BlueStore--would block), but it is faster.
  • The OSD recovery code hasn't been changed to remove the applied callbacks yet. Nothing really prevents us from doing that, except that we need to be a bit careful because some of it might be used to throttle the recovery work.

...and probably a bit more.

Performance for FileStore is about the same. Haven't checked yet whether there is a measurable BlueStore boost.

@liewegas liewegas changed the title os,osd: initial work to drop onreadable/onapplied callbacks WIP os,osd: initial work to drop onreadable/onapplied callbacks Jan 30, 2018
* Clients of ObjectStore create and maintain their own Sequencer objects.
* When a list of transactions is queued the caller specifies a Sequencer to be used.
*
* ObjectStore users my get collection handles with open_collection() (or,
Contributor

typo: my->may

ObjectStore::CollectionHandle BlueStore::create_new_collection(
const coll_t& cid)
{
RWLock::WLocker l(coll_lock);
Contributor

Can be taken after collection creation.
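The reviewer's suggestion, sketched with illustrative names (a `std::mutex` and a plain map rather than the actual `RWLock`/`coll_map` code): construct the new collection outside the lock, and hold the write lock only for the shared-map update that actually needs it.

```cpp
#include <map>
#include <memory>
#include <mutex>
#include <string>

struct Collection {
  std::string cid;
  explicit Collection(std::string c) : cid(std::move(c)) {}
};

std::mutex coll_lock;
std::map<std::string, std::shared_ptr<Collection>> coll_map;

std::shared_ptr<Collection> create_new_collection(const std::string& cid) {
  auto c = std::make_shared<Collection>(cid);  // creation: no lock needed
  std::lock_guard<std::mutex> l(coll_lock);    // lock only for the map insert
  coll_map[cid] = c;
  return c;
}
```

Shrinking the critical section this way keeps unrelated collection lookups from stalling behind the allocation.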

@liewegas
Member Author

@rzarzynski i updated the last commit to fix a hang.. please retest! thanks

@liewegas
Member Author

from @rzarzynski

I've got results for the latest commit. It's faster: the regression
in comparison to master is below 2% for randwrites, 1.93% and 1.39%
respectively. :-)

Reference point [1] vs wip-kill-onreadable~ [2] vs the newest
wip-kill-onreadable [3]:

  seqwrite   4 KiB       "bw" :   30161 vs   30693 vs   30588,
  seqwrite  64 KiB       "bw" :  480848 vs  511888 vs  509298,
  randwrite  4 KiB       "bw" :  120254 vs  117502 vs  117914,
  randwrite 64 KiB       "bw" : 1510461 vs 1482712 vs 1489362.

[1] cb396a78d42f034dd61606a327cc703211e49cda
[2] 0f03c857c18475a97b80165f2696570730b208e5
[3] a7ab0faed16125527e7c3b5683e79da162d37006

@liewegas
Member Author

My results are on my local box and pretty noisy. Everything seems to speed up over time... hrm..

 base: 165252 157698 167683  (master)
foo-a: 170962 163602 162909  (wip-os-ch)
foo-b: 181918 174897 167302  (wip-os-ch + slow filestore tracking)
foo-c: 178579 164342 162355  (wip-os-ch + fast filestore tracking)
foo-d: 179306 176254 164431  (wip-os-ch + removal of some onreadable callbacks)
foo-e: 175176 179313 172316  (wip-os-ch + fast filestore tracking + removal of callbacks)
foo-f: 166346 174985 170836  (wip-os-ch + slow filestore tracking + removal of callbacks)

@rzarzynski
Contributor

More results for a7ab0fa from incerta:

randwrite  4 KiB   117328,  116363,  116678,  117929,  115937
randwrite 64 KiB  1487512, 1482242, 1489939, 1487119, 1487347
seqwrite   4 KiB    30993,   31126,   30884,   31418,   30884
seqwrite  64 KiB   516277,  509546,  509979,  516726,  513052

@rzarzynski
Contributor

@liewegas: got the results for reads:

Reference point: cb396a78d42f034dd61606a327cc703211e49cda

randread,  4 KiB: 167427, 164852, 163403, 162503, 165675,
randread, 64 KiB: 1940050, 1944297, 1933320, 1938220, 1934929,
read,      4 KiB: 88122, 87789, 87821, 90193, 88627,
read,     64 KiB: 1099596, 1085303, 1095368, 1090855, 1089028,

wip-kill-onreadable: a7ab0faed16125527e7c3b5683e79da162d37006

randread,  4 KiB: 165292, 162285, 167443, 163996, 169634,
randread, 64 KiB: 1953484, 1928152, 1916863, 1933165, 1924817,
read,      4 KiB: 85601, 87416, 89353, 87323, 86968,
read,     64 KiB: 1084973, 1086560, 1092684, 1094833, 1087578,

@liewegas
Member Author

liewegas commented Feb 2, 2018

looks like 2-3% degradation on read. @jdurgin @gregsfortytwo @markhpc ok to proceed?

@gregsfortytwo
Member

That seems like a reasonable cost to pay for eliminating ~5k lines throughout the rest of the code base, though I didn't do a full review.

@liewegas liewegas changed the title WIP os,osd: initial work to drop onreadable/onapplied callbacks os,osd: initial work to drop onreadable/onapplied callbacks Feb 7, 2018
@liewegas liewegas force-pushed the wip-kill-onreadable branch 2 times, most recently from f5c127e to b5be286 Compare February 8, 2018 17:16
Note that this is *slight* overkill in that a *source* object of a clone
will also appear in the applying map, even though it is not being
modified.  Given that those clone operations are normally coupled with
another transaction that does write (which is why we are cloning in the
first place) this should not make any difference.

Signed-off-by: Sage Weil <sage@redhat.com>
On any read, wait for any updates to the object to apply first.

Signed-off-by: Sage Weil <sage@redhat.com>
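A minimal sketch of the wait-on-read idea from this commit, with illustrative names (`ApplyTracker`, string oids; the real FileStore tracking is keyed differently): writes register the object as applying, and the read path blocks until any in-flight writes to that object have applied.

```cpp
#include <condition_variable>
#include <map>
#include <mutex>
#include <string>

class ApplyTracker {
  std::mutex m;
  std::condition_variable cv;
  std::map<std::string, int> applying;  // object -> in-flight write count
public:
  void start_apply(const std::string& oid) {
    std::lock_guard<std::mutex> l(m);
    ++applying[oid];
  }
  void finish_apply(const std::string& oid) {
    {
      std::lock_guard<std::mutex> l(m);
      if (--applying[oid] == 0)
        applying.erase(oid);
    }
    cv.notify_all();
  }
  bool is_applying(const std::string& oid) {
    std::lock_guard<std::mutex> l(m);
    return applying.count(oid) != 0;
  }
  // Read path: wait until oid has no in-flight writes.
  void wait_for_apply(const std::string& oid) {
    std::unique_lock<std::mutex> l(m);
    cv.wait(l, [&]{ return applying.find(oid) == applying.end(); });
  }
};
```

With no writes in flight, `wait_for_apply()` returns immediately, so the common read path only pays for a map lookup under the lock.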
Signed-off-by: Sage Weil <sage@redhat.com>
Prevent a collection delete + recreate sequence from allowing two
conflicting OpSequencers for the same collection to exist as this
can lead to racing async apply threads.

Signed-off-by: Sage Weil <sage@redhat.com>
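The fix described above can be sketched as a registry of live sequencers keyed by collection (illustrative names, not the BlueStore code): a delete + recreate of a collection then reuses the surviving OpSequencer instead of racing a fresh one against it.

```cpp
#include <map>
#include <memory>
#include <mutex>
#include <string>

struct OpSequencer {
  std::string cid;
};

class SequencerRegistry {
  std::mutex m;
  std::map<std::string, std::weak_ptr<OpSequencer>> osr_set;
public:
  std::shared_ptr<OpSequencer> get(const std::string& cid) {
    std::lock_guard<std::mutex> l(m);
    if (auto p = osr_set[cid].lock())
      return p;                        // still alive: reuse, don't race it
    auto osr = std::make_shared<OpSequencer>();
    osr->cid = cid;
    osr_set[cid] = osr;                // weak ref: dies with its last user
    return osr;
  }
};
```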
@liewegas liewegas force-pushed the wip-kill-onreadable branch 3 times, most recently from a71d9ec to f126f8d Compare February 12, 2018 20:04
Signed-off-by: Sage Weil <sage@redhat.com>
Make sure SnapMapper's ContainerContexts don't outlive the mapper
itself.

Signed-off-by: Sage Weil <sage@redhat.com>
Note that we don't need to worry about the internal get_omap_iterator
callers (e.g., omap_rmkeyrange) because the apply thread does these
ops sequentially and in order.

Signed-off-by: Sage Weil <sage@redhat.com>
… is done

The onreadable completions go through a finisher; add a final event
in that stream that keeps the PG alive while prior events flush.

flush() isn't quite sufficient since it doesn't wait for the finisher
events to flush too--only for the actual apply to have happened.

Signed-off-by: Sage Weil <sage@redhat.com>
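The "final event in the finisher stream" idea from this commit, as a toy sketch (not the Ceph Finisher class; names are illustrative): flush() queues a sentinel completion and waits on it, so everything queued before the sentinel is guaranteed to have run.

```cpp
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>

class Finisher {
  std::mutex m;
  std::condition_variable cv;
  std::queue<std::function<void()>> q;
  bool stop = false;
  std::thread worker;
public:
  Finisher() : worker([this]{
    for (;;) {
      std::function<void()> fn;
      {
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [this]{ return stop || !q.empty(); });
        if (q.empty()) return;         // stopped and drained
        fn = std::move(q.front());
        q.pop();
      }
      fn();                            // events run strictly in queue order
    }
  }) {}
  void queue(std::function<void()> fn) {
    { std::lock_guard<std::mutex> l(m); q.push(std::move(fn)); }
    cv.notify_one();
  }
  // Queue a sentinel and wait on it: every earlier event has completed.
  void flush() {
    std::promise<void> p;
    queue([&p]{ p.set_value(); });
    p.get_future().wait();
  }
  ~Finisher() {
    { std::lock_guard<std::mutex> l(m); stop = true; }
    cv.notify_one();
    worker.join();
  }
};
```

Because the worker executes events in FIFO order, the sentinel cannot complete before any completion queued ahead of it, which is exactly the property a plain apply-side flush() doesn't give you.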
@liewegas liewegas force-pushed the wip-kill-onreadable branch 2 times, most recently from c5f66c9 to 42060fd Compare February 16, 2018 14:37
We need to flush between splits.  This requirement unfortunately
doesn't quite go away with the FileStore tracking.

Also, flush for each batch.  This is just because the test environment
may have a low open file ulimit.  (The old code did apply_transaction,
so it's functionally equivalent to this.)

Signed-off-by: Sage Weil <sage@redhat.com>
The transactions are independent in each collection/sequencer, so we
can't record to a single txn object with racing transactions.  Fix
it by recording one in each collection, and when reading the latest op,
use the highest txn value we see.

Signed-off-by: Sage Weil <sage@redhat.com>
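The per-collection recording scheme from this commit can be sketched as follows (illustrative names; the real test records ops, not bare counters): each collection tracks its own latest txn number, and the reader takes the maximum across collections.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Collection {
  uint64_t last_txn = 0;   // latest txn recorded in this collection
};

void record_op(Collection& c, uint64_t txn) {
  // Writes race only within a collection's own sequencer, so each
  // collection can safely keep its own record.
  c.last_txn = std::max(c.last_txn, txn);
}

uint64_t read_latest_op(const std::vector<Collection>& colls) {
  uint64_t latest = 0;
  for (const auto& c : colls)
    latest = std::max(latest, c.last_txn);  // highest value wins
  return latest;
}
```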
Avoid EIO on, say, osdmaps until we fix
http://tracker.ceph.com/issues/23029

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
@liewegas
Member Author

@jdurgin this is passing tests now. ready for review!

@liewegas
Member Author

@markhpc mind doing a final check on this for small reads and writes, filestore and bluestore?

@markhpc
Member

markhpc commented Feb 19, 2018

@liewegas on it

@markhpc
Member

markhpc commented Feb 20, 2018

NVMe tests are done, though I'm also running some (much slower to populate) HDD tests. This is 72ad764 vs wip-kill-onreadable, 4k IO size, to 4 NVMe-backed OSDs. High concurrency, 512GB total RBD volume size with 3x replication:

           | FS, 72ad764 | FS, wip-kill-onreadable | BS, 72ad764 | BS, wip-kill-onreadable
 read      |       728.1 |                   735.3 |       332.5 |                   338.3
 write     |        81.9 |                   77.17 |     65.5518 |                 77.9762
 randread  |        1225 |                    1226 |        1093 |                    1181
 randwrite |     66.1807 |                 64.0098 |       105.3 |                   112.7
 rw        |     55.1074 |                 56.7705 |     34.0918 |                 39.3594
 randrw    |     63.0459 |                 59.6758 |       102.8 |                   108.9

In terms of percentage differences:

FS:

read: 0.99%
write: -6.13%
randread: 0.08%
randwrite: -3.39%
rw: 3.02%
randrw: -5.65%

BS:

read: 1.74%
write: 18.95%
randread: 8.05%
randwrite: 7.03%
rw: 15.45%
randrw: 5.93%

I should note this is only a single iteration of these tests so there may be some variability.

@liewegas liewegas merged commit 1e34922 into ceph:master Feb 20, 2018
@liewegas liewegas deleted the wip-kill-onreadable branch February 20, 2018 20:59
xiexingguo added a commit to xiexingguo/ceph that referenced this pull request Dec 15, 2018
Generally we should keep a ref of each newly received map until we get it
written onto disk. This is important because as long as the refs are alive,
the OSDMaps will be pinned in the cache and we won't try to read them
off of disk. Otherwise these maps will probably not stay in the cache,
and reading those OSDMaps before they are actually written can result
in a crash.

ceph#20177 kills the onapplied callbacks
entirely, hence here we add the pinned maps back into the on_committed
structure instead.

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
@xiexingguo xiexingguo mentioned this pull request Dec 15, 2018
3 tasks
xiexingguo added a commit to xiexingguo/ceph that referenced this pull request Dec 17, 2018
I added these comments a few years ago.
Since bluestore can read things that aren't committed and
ceph#20177 should have made it work for
filestore too, this shouldn't matter any more.

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
xiexingguo added a commit to xiexingguo/ceph that referenced this pull request Aug 17, 2019
The ondisk_{read,write}_lock infrastructure was long gone with
ceph#20177 merged - c244300,
to be specific. Hence the related comments must die since they
could be super-misleading.

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>