os/filestore: add option to deploy omap to a separate device(path) #6421
Conversation
@liewegas , In FileStore::sync_entry(), syncfs() is used. If omap_dir is on a different file system, the leveldb log may not be synced up to the committed_seq. Could this cause pg_log and pg_info corruption when the osd reboots?
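To illustrate the concern, here is a minimal sketch (not the PR's code; the helper name and parameters are assumptions) of how a second syncfs() against the omap file system could close the gap when omap lives on a separate device:

```cpp
// Illustrative sketch only: sync_entry()-style logic where a second
// syncfs() covers an omap directory on a different file system than
// the osd data dir.
#define _GNU_SOURCE 1   // for syncfs() on Linux/glibc
#include <fcntl.h>
#include <unistd.h>

int sync_osd_and_omap(int basedir_fd, const char *omap_dir,
                      bool omap_on_separate_fs)
{
  int r = ::syncfs(basedir_fd);        // flushes the osd data file system
  if (r < 0 || !omap_on_separate_fs)
    return r;                          // single fs: one syncfs suffices

  int omap_fd = ::open(omap_dir, O_RDONLY | O_DIRECTORY);
  if (omap_fd < 0)
    return -1;
  r = ::syncfs(omap_fd);               // flush the leveldb WAL's fs too
  ::close(omap_fd);
  return r;
}
```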
@yuyuyu101, Great
Refer to the discussion in liewegas@249782d. Would it be better to land this change in ceph-disk?
Hi Chendi, Do you have any benchmarks at all? It's been a while since we tried this, but in the past we hadn't seen a big improvement. It may be that with some of the other improvements this is helping now, though!
@markhpc , we simply used fio: 140 VMs doing randwrite against a 40-HDD-OSD ceph cluster, which is to say we can basically push each HDD to nearly 95%+ utilization. There are two benefits to moving omap out:
Before moving omap to a separate SSD, we saw a frontend-to-backend IOPS ratio of 1:5.8 (rbd-side total IOPS 1206, HDD total IOPS 7034). After moving omap to a separate SSD, the frontend-to-backend ratio dropped to 1:2.6 (rbd-side total IOPS 5006, HDD total IOPS 13089).
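For reference, an illustrative fio job approximating the described workload (the exact job file was not posted, so block size, queue depth, runtime, and target device are assumptions):

```ini
; hypothetical job file: random small writes from inside each VM
[randwrite]
ioengine=libaio
direct=1
rw=randwrite
bs=4k
iodepth=16
runtime=300
filename=/dev/vdb   ; the rbd-backed disk inside the VM (assumed)
```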
Very interesting, Chendi! At some point it may be useful to run blktrace on the HDDs for both cases to see if there are ways to improve the behavior when a user doesn't have SSDs.
Don't we need to create a partition and mount it at current/omap, or on the relevant partition? ceph-disk might need changes to support this. And how do we make sure the path we specified is always intact after a reboot? It would be better to add some kind of magic or fsid to this partition recording which osd it belongs to.
@varadakari, can we do the same trick as we do for mounting the osd partition, using the partuuid, so that ceph-disk activate-all will auto-mount the right partition?
@markhpc I believe only the HDD + SSD case could benefit from this; we are not actually saving any IO, just using SSD IO instead of HDD IO. I happened to try the all-SSD case, and it changes nothing :)
@xiaoxichen I understand, but these results make me wonder about leveldb's behavior. Chendi's results are interesting because they may have uncovered something we haven't seen before. I'm curious whether the behavior has changed since a couple of years ago, when we last looked at this.
@markhpc , I re-ran a comparison test with and without omap on the osd device (HDD) to measure the benefit that moving omap out can bring. The results are attached below.
Cool job
@xuechendi Results look good. Did you create a partition on the journal device for the omap, with a filesystem on that partition? As mentioned in the previous comments, we have to remember that this partition belongs to this osd, as we do for the journal, or in some other feasible way. Otherwise a wrong config entry might point us at the wrong omap partition.
@varadakari , are you asking whether there is any check that the omap is the correct one for the corresponding osd device, since they are on different devices now? That part is not done in the current code, but I think it is easy to add a check function there. I think we can just assert if the omap and osd do not match; what do you think?
@varadakari , then leave the mount-by-uuid part to the ceph-disk code; if that is doable, I will update this PR.
@xuechendi yes, it is better to add a check. We can do the ceph-disk change in a different PR. And if we can add the part-uuid handling to ceph-disk while creating the osd, we can automatically mount across reboots via udev rules. We might need to add some more functionality to the activate part of ceph-disk for that. You can refer to http://www.spinics.net/lists/ceph-devel/msg25887.html for the generic ceph-disk design we have in mind.
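To make the udev idea concrete, a hypothetical rule along the lines of ceph's existing 95-ceph-osd.rules might look like this (both the partition type GUID and the activate-omap subcommand are placeholders; neither exists):

```
# Hypothetical: auto-activate an omap partition by GPT partition type,
# mirroring how osd data/journal partitions are handled today.
# The GUID below is a placeholder, not a registered ceph type code.
ACTION=="add", SUBSYSTEM=="block", \
  ENV{ID_PART_ENTRY_TYPE}=="4fbd7e29-0000-0000-0000-0000c0ffee00", \
  RUN+="/usr/sbin/ceph-disk activate-omap /dev/$name"
```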
FWIW, I would focus on the 'activate' step initially. A recent change has a few notes on the block device probing. I'm less certain about the best way to handle the 'create' part.
@liewegas Sure Sage, agreed. But how do we get the required parameters for the create step? One approach could be to take them from ceph.conf (we already consult the conf for mount options etc., so we could fall back to that approach), though we would be adding more to ceph.conf in that case. Alternatively, we could find a conf format that directs ceph-disk to create the required layout. A JSON format complicates the current implementation, but if we finalize a format as a first step, we can improve ceph-disk in incremental steps.
I think the right user interface is to take an objectstore type (filestore, bluestore, etc.) as an argument to ceph-disk (--osd-objectstore bluestore) or pull that same option from ceph.conf. An easy initial step is to make ceph-disk explicitly understand 'filestore' and 'bluestore'. The JSON approach is more complex. It aims to be future-proof, but in practice I fear we'll fail to anticipate what the next backend needs and we'll need to modify ceph-disk anyway.
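Concretely, the proposed interface might be invoked like this (hypothetical at the time of this discussion; only the --osd-objectstore flag name comes from the comment above):

```sh
# explicit flag on the command line
ceph-disk prepare --osd-objectstore filestore /dev/sdb

# or picked up from ceph.conf:
#   [osd]
#   osd objectstore = filestore
```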
It seems not really necessary to have one partition per leveldb (omap); in a common deployment people will probably just create a single FS on the journal SSD and put all the DBs onto that FS. Can we have a UUID in that kind of deployment? Might adding an OSD_UUID key to the omap be better?
@xiaoxichen seems reasonable; we have to look at the write-amplification and wear-out problems. It might be the same issue as before. Do you have any results for multiple OSDs? With multiple writers on the same SSD, I want to see how it scales.
In that case, I don't think any special ceph-disk tooling is needed.. you just wouldn't get support for swapping devices into another host and having things automagically work (e.g., you'd need to mount the leveldb file system manually).
```
@@ -577,8 +577,12 @@ FileStore::FileStore(const std::string &base, const std::string &jdev, osflagbit
  current_op_seq_fn = sss.str();

  ostringstream omss;
  omss << basedir << "/current/omap";
  omap_dir = omss.str();
  if (g_conf->filestore_omap_backend_path != ""){
```
s/){/) {/
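For users, the new option would presumably be set before deployment; a hypothetical ceph.conf fragment (the mount point shown is an assumption, using ceph.conf's $id metavariable):

```ini
[osd]
    ; point the FileStore omap (leveldb) at a directory on an SSD;
    ; the path here is illustrative, not a recommendation
    filestore omap backend path = /mnt/ssd-omap/$id
```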
@varadakari @liewegas , the new commit creates and checks the omap fsid during mkfs and mount, so we can mark and verify which osd the omap belongs to. Should I also open a new PR on the ceph-disk side?
@xiaoxichen I agree, multiple partitions or multiple directories doesn't matter for an SSD. But being LSM trees in nature, leveldb/rocksdb incur more write amplification unless we align writes to page sizes. That's why I asked about the write-amplification and wear-leveling part.
@xuechendi yes, a new PR for the ceph-disk change would be easier to review; I think it should precede this PR.
I'd suggest naming it 'osd_uuid'.
Yep, and we should make sure that if the file isn't present, we assume it matches.
@xuechendi Why do we need a separate file? I think we could just add an osd_uuid key-value pair into the omap db; that might look better. @liewegas what's your idea?
force-pushed from 0ab799a to 173f581
@ldachary , sorry for the trouble; make check failed on one test (FAIL: ceph-detect-init/run-tox.sh). P.S.: I re-pushed the code, and it passed the check... kind of confused.
force-pushed from f1f82d3 to d0694ed
Also, can you please rebase? Otherwise, I think this is okay.
force-pushed from 95f7e35 to d45c913
In the HDD-as-OSD, SSD-as-journal test, we saw a great throughput improvement when moving omap to an SSD device in the randwrite case. This patch aims to add a config option 'filestore_omap_backend_path' so users can configure the omap path before deployment. Signed-off-by: Chendi Xue <chendi.xue@intel.com>
1. Write osd_uuid to the omap dir when doing filestore mkfs. 2. Check whether the omap fsid matches the osd fsid when doing filestore mount (if there is no osd_uuid under omap, assume it matches). Signed-off-by: Chendi.Xue <chendi.xue@intel.com>
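A minimal sketch of the guard described in this commit message (function and file handling here are illustrative, not the PR's exact code):

```cpp
#include <fstream>
#include <string>

// mkfs: record which osd this omap belongs to.
void write_omap_fsid(const std::string& omap_dir, const std::string& osd_fsid)
{
  std::ofstream f(omap_dir + "/osd_uuid", std::ios::trunc);
  f << osd_fsid << "\n";
}

// mount: refuse to pair an omap dir with the wrong osd.
// Returns true when the uuids match, or when no osd_uuid file exists
// (an older omap dir is assumed to match).
bool check_omap_fsid(const std::string& omap_dir, const std::string& osd_fsid)
{
  std::ifstream f(omap_dir + "/osd_uuid");
  if (!f)
    return true;              // no marker: legacy omap, assume match
  std::string recorded;
  std::getline(f, recorded);
  return recorded == osd_fsid;
}
```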
force-pushed from d45c913 to 2b15e2b
@liewegas , rebase is done.
@xuechendi
I would rather not complicate the filestore case here unless we really have to. I'd rather invest our efforts in bluestore instead...
But BlueStore is still under development. Furthermore, we find that although this will not bring a very large improvement for the block storage service, it would be helpful for the rgw gateway service, especially when maintaining a large number of small files (smaller than 100KB). An rgw object uses several xattrs to handle its metadata, such as ACL, Content-Type, Tags, and so on, so large numbers of small files lead to bad performance under FileStore. I think that if those data can be redirected to omap and stored on SSD, it will bring a significant improvement to the current FileStore backend. I am sure that BlueStore is eventually the best solution for this. However, if we currently just use the SSD as journal (maybe 10G/osd), for most SSDs that still leaves most of the space wasted. Therefore, I think this may make the best use of resources under current ceph versions.
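For what it's worth, FileStore already has knobs controlling when xattrs spill from the file system into omap; a hypothetical tuning sketch along these lines (values are illustrative, not recommendations, and the exact spill semantics should be verified):

```ini
[osd]
    ; with omap on an SSD, lowering these thresholds pushes more
    ; rgw xattr metadata out of the HDD file system and into omap
    filestore max inline xattr size = 64
    filestore max inline xattrs = 2
```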
Hmm, actually this PR won't block any outstanding performance improvement.
@mslovy Well, we pretty definitely aren't going to extend FileStore to put small files into the omap piece, so it would only help you with the index. I don't really think it's worth complicating ceph-disk and the documentation to make this easy.