
os/filestore: add option to deploy omap to a separate device(path) #6421

Merged

merged 2 commits into ceph:master from wip-omap-out
May 5, 2016

Conversation

xuechendi
Contributor

In a test with HDDs as OSDs and an SSD as the journal, we saw a large
throughput improvement in the randwrite case when moving omap to an SSD
device.

This patch adds a config option 'filestore_omap_backend_path' that lets
users configure the omap path before deployment.

Signed-off-by: Chendi Xue chendi.xue@intel.com
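For illustration only, the option would be set in ceph.conf before deploying the OSD; the path below is a hypothetical example, not something prescribed by this PR:

```ini
[osd]
# Hypothetical example: point FileStore's omap (leveldb) directory at an SSD.
# The option name comes from this patch; the path is just an illustration.
filestore_omap_backend_path = /mnt/ssd/osd-0-omap
```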

@mslovy
Contributor

mslovy commented Nov 1, 2015

@liewegas, in FileStore::sync_entry(), syncfs() is used. If omap_dir is on a different filesystem, the leveldb log may not be synced up to committed_seq. Could this corrupt pg_log and pg_info when the OSD reboots?
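For context, a minimal sketch of the concern (my own illustration, not FileStore code; the paths are hypothetical): Linux syncfs(2) flushes only the filesystem that contains the given fd, so an omap directory on a second filesystem would need its own flush to be durable.

```cpp
// Illustration only, not FileStore code: syncfs(2) flushes just the one
// filesystem containing the fd, so a basedir on the HDD and an omap_dir on
// an SSD would each need their own call. Error handling omitted for brevity.
#include <fcntl.h>
#include <unistd.h>

int main() {
  int basedir_fd = ::open("/var/lib/ceph/osd/ceph-0/current", O_RDONLY);
  int omap_fd    = ::open("/mnt/ssd/osd-0-omap", O_RDONLY);
  ::syncfs(basedir_fd);  // flushes the HDD filesystem only
  ::syncfs(omap_fd);     // the SSD filesystem needs a separate flush
  ::close(omap_fd);
  ::close(basedir_fd);
  return 0;
}
```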

@yuyuyu101
Member

@mslovy partially correct; please refer to #4718

@mslovy
Contributor

mslovy commented Nov 1, 2015

@yuyuyu101, Great

@xiaoxichen
Contributor

Refer to the discussion in liewegas@249782d.

Would it be better to land this change in ceph-disk?

@markhpc
Member

markhpc commented Nov 4, 2015

Hi Chendi,

Do you have any benchmarks at all? It's been a while since we tried this, but in the past we hadn't seen a big improvement. It may be that with some of the other improvements this is helping now, though!

@xuechendi
Contributor Author

@markhpc, we simply used fio: 140 VMs doing randwrite against a 40-HDD-OSD ceph cluster, which is to say we can basically push each HDD to nearly 95%+ util.

There are two benefits to moving omap out:

  1. We get more randwrite IOPS if there are no seqwrites to the HDD device.
     When an HDD handles randwrite IOPS plus some omap (leveldb) writes, we only get 175 disk-write IOPS per HDD when util is nearly full.
     When an HDD handles only randwrites without any omap writes, we get 325 disk-write IOPS per HDD when util is nearly full.
  2. With omap moved out, one rbd write causes only 2.5 HDD writes at the backend, whereas with omap on the same device one rbd write causes 5 HDD writes (including the 2 replica writes and the omap updates).

====== Here is some data ======
2 replica
filestore_max_inline_xattr_xfs=0
filestore_max_inline_xattr_size_xfs=0

Before moving omap to a separate SSD, we saw a frontend-to-backend IOPS ratio of 1:5.8 (rbd-side total IOPS 1206, HDD total IOPS 7034). As we discussed, the 5.8 consists of the 2 replica writes plus inode and omap writes.

| op_size | op_type | QD | rbdNum | runtime | fio_iops | fio_bw | fio_latency | osd_iops | osd_bw | osd_latency |
|---|---|---|---|---|---|---|---|---|---|---|
| 4k | randwrite | qd8 | 140 | 400 sec | 1206.000 | 4.987 MB/s | 884.617 msec | 7034.975 | 47.407 MB/s | 242.620 msec |

After moving omap to a separate SSD, the frontend-to-backend ratio drops to 1:2.6 (rbd-side total IOPS 5006, HDD total IOPS 13089).

| op_size | op_type | QD | rbdNum | runtime | fio_iops | fio_bw | fio_latency | osd_iops | osd_bw | osd_latency |
|---|---|---|---|---|---|---|---|---|---|---|
| 4k | randwrite | qd8 | 140 | 400 sec | 5006.000 | 19.822 MB/s | 222.296 msec | 13089.020 | 82.897 MB/s | 482.203 msec |

@xuechendi
Contributor Author

(image attached)

@markhpc

@markhpc
Member

markhpc commented Nov 11, 2015

Very interesting Chendi! At some point it may be useful to do a blktrace on the HDD for both cases to see if there are ways to improve the behavior when a user doesn't have SSDs.

@varadakari
Contributor

Don't we need to create a partition and mount it at current/omap (or the relevant partition)? We might need ceph-disk changes for this. And how do we make sure the path we specified stays intact after a reboot? It would be better to add some kind of magic or fsid to this partition indicating which OSD it belongs to.

@mslovy
Contributor

mslovy commented Nov 11, 2015

@varadakari, can we do the same trick we use for mounting the OSD partition, i.e. using partuuid, so that ceph-disk activate-all will auto-mount the right partition?

@xiaoxichen
Contributor

@markhpc I believe only the HDD + SSD case can benefit from this; we are not actually saving any IO, just using SSD IO instead of HDD IO.

I happened to try the all-SSD case, and it changes nothing :)

@markhpc
Member

markhpc commented Nov 18, 2015

@xiaoxichen I understand, but these results make me wonder about leveldb's behavior. Chendi's results are interesting because they may have uncovered something we haven't seen before. I'm curious if the behavior has changed since a couple of years ago when we looked at this last.

@xuechendi
Contributor Author

@markhpc, I re-ran a comparison test with/without omap on the OSD device (HDD) to measure the benefit of moving omap out. The attachment is below.
2015_12_Chendi_Hammer_omap.pptx

@yuyuyu101
Member

cool job

@varadakari
Contributor

@xuechendi Results look good. Did you create a partition on the journal device for the omap, and a filesystem on that partition? As mentioned in the previous comments, we have to remember that this partition belongs to this OSD, like we do for the journal, or in some other feasible way. Otherwise a wrong config entry might point us at the wrong omap partition.

@xuechendi
Contributor Author

@varadakari, are you asking whether there is any check that the omap is the correct one for the corresponding OSD device, since they are on different devices now? That part is not done in the current code, but I think it is easy to add a check function there. I think we can just assert if the omap and OSD do not match; what do you think?

@xuechendi
Contributor Author

@varadakari, then we can leave the mount-by-uuid part to the ceph-disk code. If that is doable, I will update this PR.

@varadakari
Contributor

@xuechendi yes, it is better to add a check. We can do the ceph-disk change in a different PR. And if we add the part-uuid handling to ceph-disk while creating the OSD, we can automatically mount across reboots via udev rules. We might need to add some more functionality to the activate part of ceph-disk for that. You can refer to http://www.spinics.net/lists/ceph-devel/msg25887.html for the generic ceph-disk we have in mind.

@liewegas
Member

FWIW, I would focus on the 'activate' step initially. A recent change makes block device labeling (with uuid) more generic (to accommodate newstore/bluestore). We should pick a generic way to say "this file system is part of osd with uuid $foo", with a new GPT UUID type (and encrypted variant) to go along with it. Does that make sense?

http://tracker.ceph.com/issues/13942

has a few notes on the block device probing.

I'm less certain about the best way to handle the "create" part. I'm worried the JSON spec in the previous thread might be needlessly complex.

@varadakari
Contributor

@liewegas Sure Sage, I agree with you. But how do we get the required parameters for the create? One approach could be to get them from ceph.conf (we already refer to the conf for mount options etc., so maybe we can fall back to that approach), though we would be adding more to ceph.conf in that case. Otherwise we can find a conf/format through which we direct ceph-disk to create things in the required layout. A JSON format complicates the current implementation, but if we finalize a format as a first step, we can improve ceph-disk in incremental steps.

@liewegas
Member

liewegas commented Dec 16, 2015 via email

@xiaoxichen
Contributor

It doesn't seem necessary to have one partition per leveldb (omap); in a common deployment people will probably just create a single FS on the journal SSD and put all the DBs onto that FS. Can we have a UUID in that kind of deployment? Adding an OSD_UUID key to the omap might be better.

@varadakari
Contributor

@xiaoxichen seems reasonable; we have to look at the write amplification and wear-out problems. It might be the same issue as before. Do you have any results for multiple OSDs? With multiple OSDs writing to the same SSD, I want to see how it scales.

@liewegas
Member

In that case, I don't think any special ceph-disk tooling is needed... you just wouldn't get support for swapping devices into another host and having things automagically work (e.g., you'd need to mount the leveldb file system manually).

@@ -577,8 +577,12 @@ FileStore::FileStore(const std::string &base, const std::string &jdev, osflagbit
current_op_seq_fn = sss.str();

ostringstream omss;
omss << basedir << "/current/omap";
omap_dir = omss.str();
if (g_conf->filestore_omap_backend_path != ""){
Member

s/){/) {/

@xuechendi
Contributor Author

@varadakari @liewegas, the new commit creates and checks the omap fsid during mkfs and mount, so we can mark and verify which OSD the omap belongs to. Should I also open a new PR on the ceph-disk side?
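A rough sketch of what such a marker could look like (my own illustration with hypothetical helper names, not the code in this commit): write the OSD fsid into an osd_uuid file inside the omap directory at mkfs time, then compare it at mount time, treating a missing file as a match.

```cpp
// Hypothetical sketch (not the actual FileStore implementation): tag the
// omap directory with the OSD's fsid so a mismatched omap can be detected.
#include <fstream>
#include <string>

// Called at mkfs time: record which OSD this omap belongs to.
void write_omap_fsid(const std::string &omap_dir, const std::string &osd_fsid) {
  std::ofstream f(omap_dir + "/osd_uuid");
  f << osd_fsid << "\n";
}

// Called at mount time: returns true if the omap belongs to this OSD.
// A missing marker file is treated as a match (pre-existing deployments).
bool check_omap_fsid(const std::string &omap_dir, const std::string &osd_fsid) {
  std::ifstream f(omap_dir + "/osd_uuid");
  if (!f)
    return true;
  std::string stored;
  std::getline(f, stored);
  return stored == osd_fsid;
}
```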

@varadakari
Contributor

@xiaoxichen I agree, multiple partitions or multiple directories don't matter for an SSD. But being LSM trees in nature, leveldb/rocksdb have higher write amplification unless we align the writes to page sizes. That's the reason I asked about the write amplification and wear-leveling part.

@varadakari
Contributor

@xuechendi yes, a new PR for ceph-disk would be easier to review; I think that should precede this PR.
If we follow the approach suggested by Sage, we can avoid ceph-disk changes, but we would have to manually mount the leveldb partition before OSD startup.

@liewegas
Member

> So based on the above discussion, I think what I should do is:
>
> 1. create a uuid file under the omap dir when creating this osd

I'd suggest naming it 'osd_uuid'.

> 2. add an omap check when starting the osd daemon; if it does not match, assert.

Yep, and we should make sure that if the file isn't present, we assume it matches.

@xiaoxichen
Contributor

@xuechendi why do we need a separate file? I think we could just add an osd_uuid key-value pair into the omap db; that might look better.

@liewegas what's your take?

@xuechendi xuechendi force-pushed the wip-omap-out branch 2 times, most recently from 0ab799a to 173f581 Compare December 23, 2015 07:08
@xuechendi
Contributor Author

@ldachary, sorry for the trouble. make check failed on one test (FAIL: ceph-detect-init/run-tox.sh), a part I didn't change any code in. Can you give me a hint? It passed my local make check.

PS: I re-pushed the code and it passed the check... kind of confused.

@xuechendi xuechendi force-pushed the wip-omap-out branch 2 times, most recently from f1f82d3 to d0694ed Compare December 24, 2015 01:12
@liewegas
Member

liewegas commented Feb 4, 2016

Also, can you please rebase? Otherwise, I think this is okay.

@liewegas liewegas self-assigned this Feb 4, 2016
@xuechendi xuechendi force-pushed the wip-omap-out branch 3 times, most recently from 95f7e35 to d45c913 Compare February 5, 2016 02:46
In a test with HDDs as OSDs and an SSD as the journal, we saw a large
throughput improvement in the randwrite case when moving omap to an SSD
device.

This patch adds a config option 'filestore_omap_backend_path' that lets
users configure the omap path before deployment.

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
1. write osd_uuid to the omap dir when doing filestore mkfs
2. check that the omap fsid matches the osd fsid when doing filestore mount
   (if there is no osd_uuid under omap, assume it matches)

Signed-off-by: Chendi.Xue <chendi.xue@intel.com>
@xuechendi
Contributor Author

@liewegas , rebase is done.

@liewegas liewegas changed the title Added option to deploy omap to a separate device(path) os/filestore: add option to deploy omap to a separate device(path) May 3, 2016
@liewegas liewegas merged commit 46be9ba into ceph:master May 5, 2016
@mslovy
Contributor

mslovy commented Jun 21, 2016

@xuechendi
Hi Chendi, can we use ceph-disk to maintain the omap_path when creating the OSD, and automatically mount the right raw device at the omap_dir when calling ceph-disk activate?

@liewegas
Member

I would rather not complicate the filestore case here unless we really have to. I'd rather invest our efforts in bluestore instead...

@mslovy
Contributor

mslovy commented Jun 21, 2016

But BlueStore is still under development. Furthermore, we find that although it will not bring a large improvement for the block storage service, it would be helpful for the rgw gateway service, especially when maintaining a large number of small files (smaller than 100KB). An object in the rgw gateway uses several xattrs for its metadata, like ACL, Content-Type, tags and so on, so a large number of small files leads to bad performance under FileStore. I think if that data can be redirected to omap and stored on SSD, it will bring a significant improvement to the current FileStore backend. I am sure BlueStore is eventually the best solution for this. However, if we currently just use the SSD as journal (maybe 10G per OSD), that still wastes space on most SSDs. So I think this may make the best use of resources under current Ceph versions.

@yuyuyu101
Member

Hmm, actually this PR won't block any outstanding performance improvement work. If we really get great improvements, nothing will be a problem. I think Sage wants to see a real difference before getting involved in this.


@athanatos
Contributor

@mslovy Well, we pretty definitely aren't going to extend FileStore to put small files into the omap piece, so it would only help you with the index. I don't really think it's worth complicating ceph-disk and the documentation to make this easy.
