
os/filestore: add option to deploy omap to a separate device(path) #6421

Merged

merged 2 commits into ceph:master from wip-omap-out
May 5, 2016

Conversation

xuechendi
Contributor

In a test with HDDs as OSDs and an SSD as the journal, we saw a large
throughput improvement in the randwrite case when moving omap to an SSD
device.

This patch adds a config option 'filestore_omap_backend_path' that lets
users configure the omap path before deployment.

Signed-off-by: Chendi Xue chendi.xue@intel.com
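For illustration only, the option would be set in ceph.conf before deploying the OSD; the path below is a hypothetical example, not something prescribed by this PR:

```ini
[osd]
# Hypothetical example: point FileStore's omap (leveldb) directory at an SSD.
# The option name comes from this patch; the path is just an illustration.
filestore_omap_backend_path = /mnt/ssd/osd-0-omap
```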

@mslovy
Contributor

mslovy commented Nov 1, 2015

@liewegas, in FileStore::sync_entry(), syncfs() is used. If omap_dir is on a different filesystem, the leveldb log may not be synced up to committed_seq. Could this corrupt pg_log and pg_info when the OSD reboots?
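For context, a minimal sketch of the concern (my own illustration, not FileStore code; the paths are hypothetical): Linux syncfs(2) flushes only the filesystem that contains the given fd, so an omap directory on a second filesystem would need its own flush to be durable.

```cpp
// Illustration only, not FileStore code: syncfs(2) flushes just the one
// filesystem containing the fd, so a basedir on the HDD and an omap_dir on
// an SSD would each need their own call. Error handling omitted for brevity.
#include <fcntl.h>
#include <unistd.h>

int main() {
  int basedir_fd = ::open("/var/lib/ceph/osd/ceph-0/current", O_RDONLY);
  int omap_fd    = ::open("/mnt/ssd/osd-0-omap", O_RDONLY);
  ::syncfs(basedir_fd);  // flushes the HDD filesystem only
  ::syncfs(omap_fd);     // the SSD filesystem needs a separate flush
  ::close(omap_fd);
  ::close(basedir_fd);
  return 0;
}
```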

@yuyuyu101
Member

@mslovy partially correct; please refer to #4718

@mslovy
Contributor

mslovy commented Nov 1, 2015

@yuyuyu101, Great

@xiaoxichen
Contributor

Refer to the discussion in liewegas@249782d.

Would it be better to land this change in ceph-disk?

@markhpc
Member

markhpc commented Nov 4, 2015

Hi Chendi,

Do you have any benchmarks at all? It's been a while since we tried this, but in the past we hadn't seen a big improvement. It may be that with some of the other improvements this is helping now, though!

@xuechendi
Contributor Author

@markhpc, we simply used fio: 140 VMs doing randwrite against a 40-HDD-OSD ceph cluster, which is to say we can basically push each HDD to nearly 95%+ util.

There are two benefits to moving omap out:

  1. We get more randwrite IOPS if there are no seqwrites to the HDD device.
     When an HDD handles randwrite IOPS plus some omap (leveldb) writes, we only get 175 disk-write IOPS per HDD when util is nearly full.
     When an HDD handles only randwrites without any omap writes, we get 325 disk-write IOPS per HDD when util is nearly full.
  2. With omap moved out, one rbd write causes only 2.5 HDD writes at the backend, whereas with omap on the same device one rbd write causes 5 HDD writes (including the 2 replica writes and the omap updates).

====== Here is some data ======
2 replica
filestore_max_inline_xattr_xfs=0
filestore_max_inline_xattr_size_xfs=0

Before moving omap to a separate SSD, we saw a frontend-to-backend IOPS ratio of 1:5.8 (rbd-side total IOPS 1206, HDD total IOPS 7034). As we discussed, the 5.8 consists of the 2 replica writes plus inode and omap writes.

| op_size | op_type | QD | rbdNum | runtime | fio_iops | fio_bw | fio_latency | osd_iops | osd_bw | osd_latency |
|---|---|---|---|---|---|---|---|---|---|---|
| 4k | randwrite | qd8 | 140 | 400 sec | 1206.000 | 4.987 MB/s | 884.617 msec | 7034.975 | 47.407 MB/s | 242.620 msec |

After moving omap to a separate SSD, the frontend-to-backend ratio drops to 1:2.6 (rbd-side total IOPS 5006, HDD total IOPS 13089).

| op_size | op_type | QD | rbdNum | runtime | fio_iops | fio_bw | fio_latency | osd_iops | osd_bw | osd_latency |
|---|---|---|---|---|---|---|---|---|---|---|
| 4k | randwrite | qd8 | 140 | 400 sec | 5006.000 | 19.822 MB/s | 222.296 msec | 13089.020 | 82.897 MB/s | 482.203 msec |

@xuechendi
Contributor Author

(image attached)

@markhpc

@markhpc
Member

markhpc commented Nov 11, 2015

Very interesting Chendi! At some point it may be useful to do a blktrace on the HDD for both cases to see if there are ways to improve the behavior when a user doesn't have SSDs.

@varadakari
Contributor

Don't we need to create a partition and mount it at current/omap (or the relevant partition)? We might need ceph-disk changes for this. And how do we make sure the path we specified stays intact after a reboot? It would be better to add some kind of magic or fsid to this partition indicating which OSD it belongs to.

@mslovy
Contributor

mslovy commented Nov 11, 2015

@varadakari, can we do the same trick we use for mounting the OSD partition, i.e. using partuuid, so that ceph-disk activate-all will auto-mount the right partition?

@xiaoxichen
Contributor

@markhpc I believe only the HDD + SSD case can benefit from this; we are not actually saving any IO, just using SSD IO instead of HDD IO.

I happened to try the all-SSD case, and it changes nothing :)

@markhpc
Member

markhpc commented Nov 18, 2015

@xiaoxichen I understand, but these results make me wonder about leveldb's behavior. Chendi's results are interesting because they may have uncovered something we haven't seen before. I'm curious if the behavior has changed since a couple of years ago when we looked at this last.

@xuechendi
Contributor Author

@markhpc, I re-ran a comparison test with/without omap on the OSD device (HDD) to measure the benefit of moving omap out. The attachment is below.
2015_12_Chendi_Hammer_omap.pptx

@yuyuyu101
Member

cool job

@varadakari
Contributor

@xuechendi Results look good. Did you create a partition on the journal device for the omap, and a filesystem on that partition? As mentioned in the previous comments, we have to remember that this partition belongs to this OSD, like we do for the journal, or in some other feasible way. Otherwise a wrong config entry might point us at the wrong omap partition.

@xuechendi
Contributor Author

@varadakari, are you asking whether there is any check that the omap is the correct one for the corresponding OSD device, since they are on different devices now? That part is not done in the current code, but I think it is easy to add a check function there. I think we can just assert if the omap and OSD do not match; what do you think?

@xuechendi
Contributor Author

@varadakari, then we can leave the mount-by-uuid part to the ceph-disk code. If that is doable, I will update this PR.

@varadakari
Contributor

@xuechendi yes, it is better to add a check. We can do the ceph-disk change in a different PR. And if we add the part-uuid handling to ceph-disk while creating the OSD, we can automatically mount across reboots via udev rules. We might need to add some more functionality to the activate part of ceph-disk for that. You can refer to http://www.spinics.net/lists/ceph-devel/msg25887.html for the generic ceph-disk we have in mind.

@liewegas
Member

FWIW, I would focus on the 'activate' step initially. A recent change makes block device labeling (with uuid) more generic (to accommodate newstore/bluestore). We should pick a generic way to say "this file system is part of osd with uuid $foo", with a new GPT UUID type (and encrypted variant) to go along with it. Does that make sense?

http://tracker.ceph.com/issues/13942

has a few notes on the block device probing.

I'm less certain about the best way to handle the "create" part. I'm worried the JSON spec in the previous thread might be needlessly complex.

@varadakari
Contributor

@liewegas Sure Sage, I agree with you. But how do we get the required parameters for the create? One approach could be to get them from ceph.conf (we already refer to the conf for mount options etc., so maybe we can fall back to that approach), though we would be adding more to ceph.conf in that case. Otherwise we can find a conf/format through which we direct ceph-disk to create things in the required layout. A JSON format complicates the current implementation, but if we finalize a format as a first step, we can improve ceph-disk in incremental steps.

@liewegas
Member

liewegas commented Dec 16, 2015 via email

@xiaoxichen
Contributor

It doesn't seem necessary to have one partition per leveldb (omap); in a common deployment people will probably just create a single FS on the journal SSD and put all the DBs onto that FS. Can we have a UUID in that kind of deployment? Adding an OSD_UUID key to the omap might be better.

@varadakari
Contributor

@xiaoxichen seems reasonable; we have to look at the write amplification and wear-out problems. It might be the same issue as before. Do you have any results for multiple OSDs? With multiple OSDs writing to the same SSD, I want to see how it scales.

@liewegas
Member

In that case, I don't think any special ceph-disk tooling is needed... you just wouldn't get support for swapping devices into another host and having things automagically work (e.g., you'd need to mount the leveldb file system manually).

@@ -577,8 +577,12 @@ FileStore::FileStore(const std::string &base, const std::string &jdev, osflagbit
current_op_seq_fn = sss.str();

ostringstream omss;
omss << basedir << "/current/omap";
omap_dir = omss.str();
if (g_conf->filestore_omap_backend_path != ""){
Member

s/){/) {/

@xuechendi
Contributor Author

@varadakari @liewegas, the new commit creates and checks the omap fsid during mkfs and mount, so we can mark and verify which OSD the omap belongs to. Should I also open a new PR on the ceph-disk side?
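A rough sketch of what such a marker could look like (my own illustration with hypothetical helper names, not the code in this commit): write the OSD fsid into an osd_uuid file inside the omap directory at mkfs time, then compare it at mount time, treating a missing file as a match.

```cpp
// Hypothetical sketch (not the actual FileStore implementation): tag the
// omap directory with the OSD's fsid so a mismatched omap can be detected.
#include <fstream>
#include <string>

// Called at mkfs time: record which OSD this omap belongs to.
void write_omap_fsid(const std::string &omap_dir, const std::string &osd_fsid) {
  std::ofstream f(omap_dir + "/osd_uuid");
  f << osd_fsid << "\n";
}

// Called at mount time: returns true if the omap belongs to this OSD.
// A missing marker file is treated as a match (pre-existing deployments).
bool check_omap_fsid(const std::string &omap_dir, const std::string &osd_fsid) {
  std::ifstream f(omap_dir + "/osd_uuid");
  if (!f)
    return true;
  std::string stored;
  std::getline(f, stored);
  return stored == osd_fsid;
}
```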

@varadakari
Contributor

@xiaoxichen I agree, multiple partitions or multiple directories don't matter for an SSD. But being LSM trees in nature, leveldb/rocksdb have higher write amplification unless we align the writes to page sizes. That's the reason I asked about the write amplification and wear-leveling part.

@varadakari
Contributor

@xuechendi yes, a new PR for ceph-disk would be easier to review; I think that should precede this PR.
If we follow the approach suggested by Sage, we can avoid ceph-disk changes, but we would have to manually mount the leveldb partition before OSD startup.

@liewegas
Member

> So based on the above discussion, I think what I should do is:
>
> 1. create a uuid file under the omap dir when creating this osd

I'd suggest naming it 'osd_uuid'.

> 2. add an omap check when starting the osd daemon; if it does not match, assert.

Yep, and we should make sure that if the file isn't present, we assume it matches.

@xiaoxichen
Contributor

@xuechendi why do we need a separate file? I think we could just add an osd_uuid key-value pair into the omap db; that might look better.

@liewegas what's your take?

@xuechendi xuechendi force-pushed the wip-omap-out branch 2 times, most recently from 0ab799a to 173f581 Compare December 23, 2015 07:08
@xuechendi
Contributor Author

@ldachary, sorry for the trouble. make check failed on one test (FAIL: ceph-detect-init/run-tox.sh), a part I didn't change any code in. Can you give me a hint? It passed my local make check.

PS: I re-pushed the code and it passed the check... kind of confused.

@xuechendi xuechendi force-pushed the wip-omap-out branch 2 times, most recently from f1f82d3 to d0694ed Compare December 24, 2015 01:12
@liewegas
Member

liewegas commented Feb 4, 2016

Also, can you please rebase? Otherwise, I think this is okay.

@liewegas liewegas self-assigned this Feb 4, 2016
@xuechendi xuechendi force-pushed the wip-omap-out branch 3 times, most recently from 95f7e35 to d45c913 Compare February 5, 2016 02:46
In a test with HDDs as OSDs and an SSD as the journal, we saw a large
throughput improvement in the randwrite case when moving omap to an SSD
device.

This patch adds a config option 'filestore_omap_backend_path' that lets
users configure the omap path before deployment.

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
1. write osd_uuid to the omap dir when doing filestore mkfs
2. check that the omap fsid matches the osd fsid when doing filestore mount
   (if there is no osd_uuid under omap, assume it matches)

Signed-off-by: Chendi.Xue <chendi.xue@intel.com>
@xuechendi
Contributor Author

@liewegas , rebase is done.

@liewegas liewegas changed the title Added option to deploy omap to a separate device(path) os/filestore: add option to deploy omap to a separate device(path) May 3, 2016
@liewegas liewegas merged commit 46be9ba into ceph:master May 5, 2016
@mslovy
Contributor

mslovy commented Jun 21, 2016

@xuechendi
Hi Chendi, can we use ceph-disk to maintain the omap_path when creating the OSD, and automatically mount the right raw device at the omap_dir when calling ceph-disk activate?

@liewegas
Member

I would rather not complicate the filestore case here unless we really have to. I'd rather invest our efforts in bluestore instead...

@mslovy
Contributor

mslovy commented Jun 21, 2016

But BlueStore is still under development. Furthermore, we find that although it will not bring a large improvement for the block storage service, it would be helpful for the rgw gateway service, especially when maintaining a large number of small files (smaller than 100KB). An object in the rgw gateway uses several xattrs for its metadata, like ACL, Content-Type, tags and so on, so a large number of small files leads to bad performance under FileStore. I think if that data can be redirected to omap and stored on SSD, it will bring a significant improvement to the current FileStore backend. I am sure BlueStore is eventually the best solution for this. However, if we currently just use the SSD as journal (maybe 10G per OSD), that still wastes space on most SSDs. So I think this may make the best use of resources under current Ceph versions.

@yuyuyu101
Member

Hmm, actually this PR won't block any outstanding performance improvement work. If we really get great improvements, nothing will be a problem. I think Sage wants to see a real difference before getting involved in this.


@athanatos
Contributor

@mslovy Well, we pretty definitely aren't going to extend FileStore to put small files into the omap piece, so it would only help you with the index. I don't really think it's worth complicating ceph-disk and the documentation to make this easy.
