
osd: various changes for preventing internal ENOSPC condition #13425

Merged: 6 commits into ceph:master from dzafman:wip-15912 on Mar 6, 2017

Conversation

dzafman (Member) commented Feb 15, 2017

- Don't log "not implemented", since this should never be seen when properly configured
- Increase mon_osd_min_in_ratio for most clusters to 75%
- Add a warning when the difference in usage percent between OSDs exceeds mon_warn_osd_usage_percent (10%)

@dzafman dzafman force-pushed the dzafman:wip-15912 branch from 5fb1daf to d9dd9d2 Feb 15, 2017

@tchaikov tchaikov added the needs-doc label Feb 15, 2017

tchaikov (Contributor) commented Feb 15, 2017

[ RUN      ] TestMockImageReplayer.DecodeError
/home/jenkins-build/build/workspace/ceph-pull-requests/src/tools/rbd_mirror/ImageReplayer.cc: In function 'rbd::mirror::ImageReplayer<ImageCtxT>::~ImageReplayer() [with ImageCtxT = librbd::{anonymous}::MockTestImageCtx]' thread 7fdd96cf1c40 time 2017-02-15 03:06:26.332065
/home/jenkins-build/build/workspace/ceph-pull-requests/src/tools/rbd_mirror/ImageReplayer.cc: 316: FAILED assert(m_replay_status_formatter == nullptr)
 ceph version 12.0.0-347-g1953a57 (1953a576ce6c5280ba29c6d5d8774ef9bc6321e2)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x10e) [0x7fdd8d1d865e]
 2: (()+0x3c3ac9) [0x7fdd96547ac9]
 3: (()+0x3c3ad9) [0x7fdd96547ad9]
 4: (rbd::mirror::TestMockImageReplayer::TearDown()+0x16) [0x7fdd965530f6]
 5: (void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*)+0x33) [0x7fdd966d99f3]
 6: (testing::Test::Run()+0x80) [0x7fdd966cd620]
 7: (testing::TestInfo::Run()+0x9e) [0x7fdd966cd6fe]
 8: (testing::TestCase::Run()+0xa5) [0x7fdd966cd805]
 9: (testing::internal::UnitTestImpl::RunAllTests()+0x248) [0x7fdd966cdab8]
 10: (testing::UnitTest::Run()+0x54) [0x7fdd966cdd74]
 11: (main()+0xee) [0x7fdd964f83ce]
 12: (__libc_start_main()+0xf5) [0x7fdd8bf5ff45]
 13: (()+0x38349a) [0x7fdd9650749a]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

https://jenkins.ceph.com/job/ceph-pull-requests/18471/consoleFull#-243017103d63714d2-c8d8-41fc-a9d4-8dee30be4c32 , @dillaman is this caused by an environmental issue, and hence a false alarm?

retest this please.

dillaman (Contributor) commented Feb 15, 2017

@tchaikov Yes -- no way that error could be related

-OPTION(mon_osd_min_in_ratio, OPT_DOUBLE, .3) // min osds required to be in to mark things out
+OPTION(mon_osd_max_small_cluster, OPT_INT, 10) // Maximum number of osds that constitutes a small cluster for min in ratio
+OPTION(mon_osd_small_min_in_ratio, OPT_DOUBLE, .30) // min osds required to be in to mark things out in small cluster
+OPTION(mon_osd_min_in_ratio, OPT_DOUBLE, .75) // min osds required to be in to mark things out

ktdreyer (Member) commented Feb 15, 2017

The new mon osd min in ratio default will need to be updated in doc/rados/configuration/mon-osd-interaction.rst. And it would be great to document all these new options as well.

dzafman (Author) commented Feb 17, 2017

@ktdreyer Fixed the doc for the default of mon_osd_min_in_ratio, but we decided not to add any other configs.

ktdreyer (Member) commented Feb 18, 2017

I don't understand why we would add configs but not document them. Why not document and add a huge warning?

dzafman (Author) commented Feb 18, 2017

@ktdreyer The code doesn't look like that anymore. I commented on it because I did have to adjust the default for the config that did change. Look at e8fd3e2

ktdreyer (Member) commented Feb 18, 2017

@dzafman d52105a adds mon_warn_osd_usage_percent without a doc change

dzafman (Author) commented Feb 18, 2017

@ktdreyer Oops. Yes, thanks.

@dzafman dzafman force-pushed the dzafman:wip-15912 branch 2 times, most recently from 38d853d to 586f4a3 Feb 17, 2017

@dzafman dzafman changed the title Various changes for preventing internal ENOSPC condition [DNM] Various changes for preventing internal ENOSPC condition Feb 17, 2017

dzafman (Author) commented Feb 17, 2017

@liewegas I added a fix for BlueStore and changed its error message.

@dzafman dzafman force-pushed the dzafman:wip-15912 branch from 586f4a3 to e38f381 Feb 17, 2017

@dzafman dzafman added core needs-test and removed needs-doc labels Feb 17, 2017

dzafman (Author) commented Feb 18, 2017

@ktdreyer There isn't an existing document where the information about mon_warn_osd_usage_percent could be added, so I filed http://tracker.ceph.com/issues/18986

@dzafman dzafman force-pushed the dzafman:wip-15912 branch 2 times, most recently from 32e8a3c to 178bdd3 Feb 18, 2017

detail->push_back(make_pair(HEALTH_WARN, ss.str()));
}
}

liewegas (Member) commented Feb 24, 2017

PGMonitor is going to be turned off in luminous, so we probably want to add this to ceph-mgr (unless it's needed for a jewel backport... I'm thinking not?)

@@ -626,6 +626,9 @@ OPTION(osd_backfill_full_ratio, OPT_FLOAT, 0.85)
// Seconds to wait before retrying refused backfills
OPTION(osd_backfill_retry_interval, OPT_DOUBLE, 10.0)

// Seconds to wait before retrying refused backfills

liewegas (Member) commented Feb 24, 2017

s/backfills/recovery/

// -- Recovery Request Scheduling --
Mutex recovery_request_lock;
SafeTimer recovery_request_timer;

liewegas (Member) commented Feb 24, 2017

We should be able to reuse the existing timer and lock (maybe rename it recovery_* instead of backfill_*)

athanatos (Contributor) commented Feb 24, 2017

Agreed, no real reason not to.

assert(0);
}
}

liewegas (Member) commented Feb 24, 2017

This doesn't seem like enough. We've already gotten the reservation; what we really need is a way to abort recovery or backfill. Until we have that figured out, I don't think this patch makes sense.

athanatos (Contributor) commented Feb 24, 2017

Yeah, you'll need to wire up those events to release the reservations etc.

liewegas (Member) commented Feb 24, 2017

I made some changes to the way these fullness thresholds are handled in #13615. I don't think anything conflicts, but it might make sense to have a discussion about how the recovery/backfill aborts should work in the context of that change. Basically, I'm wondering whether OSDs should either (1) set flags for "too full for backfill" or similar in the OSDMap (similar to nearfull), or (2) the recovery and backfill reservation requests should include an estimate of the size of the pg. (The latter sounds nicer, except that we don't account for omap space utilization, so it's not a complete solution.)

dzafman (Author) commented Feb 24, 2017

@liewegas For item (1): on the one hand, it would be nice if nearfull (defaults to 85%) were the same as "too full for backfill" (which also defaults to 85%), so we wouldn't need another indicator; on the other hand, it might be better to be able to configure backfill separately, in which case it would be nice to know when we are too full to start a backfill.
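For illustration, a hedged sketch of that two-knob idea (the struct, names, and 0.85 defaults below are hypothetical, not the actual OSD code):

 // Hypothetical: with a single threshold, "nearfull" and "too full for
 // backfill" trip together; separate knobs would let an OSD keep serving
 // client I/O at nearfull while already refusing new backfill work.
 struct FullnessThresholds {
   double nearfull_ratio = 0.85;      // warn the operator
   double backfillfull_ratio = 0.85;  // refuse new backfill reservations
 };

 bool too_full_for_backfill(double used_ratio, const FullnessThresholds& t) {
   return used_ratio >= t.backfillfull_ratio;
 }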

dzafman added some commits Feb 14, 2017

osd: Fix ENOSPC crash message text
Signed-off-by: David Zafman <dzafman@redhat.com>
mon: Bump min in ratio to 75%
Signed-off-by: David Zafman <dzafman@redhat.com>

@@ -472,6 +472,8 @@ class FileJournal :

void set_wait_on_full(bool b) { wait_on_full = b; }

off64_t get_journal_size_estimate(void);

athanatos (Contributor) commented Feb 24, 2017

No need for (void).

@dzafman dzafman force-pushed the dzafman:wip-15912 branch from d2f052b to 57a9c66 Feb 24, 2017

@dzafman dzafman changed the title [DNM] Various changes for preventing internal ENOSPC condition Various changes for preventing internal ENOSPC condition Feb 24, 2017

dzafman (Author) commented Feb 24, 2017

@liewegas I pulled the more complex commit for recovery/backfill. I don't have additional tests for this, but I think including it in a wip-testing branch rados run is sufficient.

@@ -318,6 +318,7 @@ OPTION(mon_crush_min_required_version, OPT_STR, "firefly")
OPTION(mon_warn_on_crush_straw_calc_version_zero, OPT_BOOL, true) // warn if crush straw_calc_version==0
OPTION(mon_warn_on_osd_down_out_interval_zero, OPT_BOOL, true) // warn if 'mon_osd_down_out_interval == 0'
OPTION(mon_warn_on_cache_pools_without_hit_sets, OPT_BOOL, true)
OPTION(mon_warn_osd_usage_percent, OPT_FLOAT, .10) // warn if difference in usage percent between OSDs exceeds specified percent

jdurgin (Member) commented Feb 24, 2017

Add this to the docs, and add both of these settings changes to PendingReleaseNotes.

.10 seems like a pretty low default for this - 45%-55% is a pretty flat distribution, especially for small clusters. How about something like .4?

dzafman (Author) commented Feb 24, 2017

@jdurgin Thanks, I was hoping we could come up with a better value for this. .4 sounds reasonable.

if (g_conf->mon_warn_osd_usage_percent) {
float max_osd_perc_avail = 0.0, min_osd_perc_avail = 1.0;
for (auto p = pg_map.osd_stat.begin(); p != pg_map.osd_stat.end(); ++p) {
float perc_avail = ((float)(p->second.kb - p->second.kb_avail)) / ((float)p->second.kb);

jdurgin (Member) commented Feb 24, 2017

should guard against bad stats causing divide by 0
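For illustration, a self-contained sketch of such a guard (the simplified osd_stat_t and the function name usage_spread_exceeds are hypothetical stand-ins for the PGMonitor code):

 #include <algorithm>
 #include <cstdint>
 #include <map>

 // Simplified stand-in for Ceph's osd_stat_t, for illustration only.
 struct osd_stat_t { int64_t kb = 0; int64_t kb_avail = 0; };

 // Compute the spread between the most- and least-used OSDs, skipping any
 // OSD whose reported capacity is zero so bad stats can never divide by zero.
 bool usage_spread_exceeds(const std::map<int, osd_stat_t>& osd_stat,
                           float warn_percent) {
   float max_used = 0.0f, min_used = 1.0f;
   bool saw_any = false;
   for (const auto& [osd, st] : osd_stat) {
     (void)osd;
     if (st.kb <= 0)  // guard: down OSD or bogus stats
       continue;
     float used = float(st.kb - st.kb_avail) / float(st.kb);
     max_used = std::max(max_used, used);
     min_used = std::min(min_used, used);
     saw_any = true;
   }
   return saw_any && (max_used - min_used) > warn_percent;
 }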

@dzafman dzafman changed the title Various changes for preventing internal ENOSPC condition DNM: Various changes for preventing internal ENOSPC condition Feb 27, 2017

dzafman added some commits Feb 15, 2017

mon: Add warning if diff in OSD usage > config mon_warn_osd_usage_percent (10%)

Signed-off-by: David Zafman <dzafman@redhat.com>
osd: Round fullness in message to correspond to df -h
This really only works after journal drains because
we adjust for the journal.

Signed-off-by: David Zafman <dzafman@redhat.com>
bluestore: Fix BlueStore::statfs available amount to not include bluefs min reserved

This fixes OSD crashes because checking osd_failsafe_full_ratio won't work
without accurate statfs information.

Signed-off-by: David Zafman <dzafman@redhat.com>

@dzafman dzafman force-pushed the dzafman:wip-15912 branch from 57a9c66 to 72d83f8 Feb 27, 2017

@dzafman dzafman changed the title DNM: Various changes for preventing internal ENOSPC condition Various changes for preventing internal ENOSPC condition Feb 27, 2017

@liewegas liewegas changed the title Various changes for preventing internal ENOSPC condition osd: various changes for preventing internal ENOSPC condition Feb 28, 2017

dzafman (Author) commented Mar 4, 2017

Testing passed; the 2 failures are already in the tracker, and 2 dead jobs were due to infrastructure.

http://pulpito.ceph.com/dzafman-2017-03-03_09:17:05-rados-wip-15912-distro-basic-smithi

@tchaikov tchaikov self-requested a review Mar 6, 2017

@dzafman dzafman merged commit 4536da7 into ceph:master Mar 6, 2017

3 checks passed:

- Signed-off-by: all commits in this PR are signed
- Unmodified Submodules: submodules for project are unmodified
- default: Build finished.

@dzafman dzafman deleted the dzafman:wip-15912 branch Mar 6, 2017

@@ -318,6 +318,7 @@ OPTION(mon_crush_min_required_version, OPT_STR, "firefly")
OPTION(mon_warn_on_crush_straw_calc_version_zero, OPT_BOOL, true) // warn if crush straw_calc_version==0
OPTION(mon_warn_on_osd_down_out_interval_zero, OPT_BOOL, true) // warn if 'mon_osd_down_out_interval == 0'
OPTION(mon_warn_on_cache_pools_without_hit_sets, OPT_BOOL, true)
OPTION(mon_warn_osd_usage_percent, OPT_FLOAT, .40) // warn if difference in usage percent between OSDs exceeds specified percent

liupan1111 (Contributor) commented Mar 27, 2017

@dzafman, could you say more about this change? I didn't catch why the issue could be resolved by increasing this option. Thanks!

dzafman (Author) commented Mar 27, 2017

mon_warn_osd_usage_percent is a new config parameter that warns when OSDs aren't being evenly used. This can be caused by many different things, like not enough PGs, bad weighting, or even storing non-OSD data on an OSD disk. Being aware of this situation lets someone take action before the most full OSD runs out of space and crashes while the cluster appears to have lots of space available overall.

mon_osd_min_in_ratio is the config parameter that was increased. Say a network failure causes the cluster to partition and a significant part of the cluster goes down. If we allow the automatic mechanism to mark too many of the OSDs out, an attempt to recover with the remaining nodes could result in running out of space.
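To make that second point concrete, here is a minimal, hypothetical sketch of how such a mark-out gate works (the real OSDMonitor logic differs in detail):

 // Hypothetical sketch: stop automatically marking OSDs "out" once doing
 // so would drop the fraction of "in" OSDs below mon_osd_min_in_ratio.
 // With the new 0.75 default, at most a quarter of the cluster can be
 // auto-marked out, which bounds how much data recovery can push onto
 // the surviving OSDs.
 bool may_mark_out(int num_osds, int num_in, double min_in_ratio) {
   if (num_osds <= 0)
     return false;
   double in_ratio_after_markout = double(num_in - 1) / num_osds;
   return in_ratio_after_markout >= min_in_ratio;
 }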

liupan1111 (Contributor) commented Mar 28, 2017

Very clear, thanks!

@@ -718,6 +718,14 @@ int FileStore::statfs(struct store_statfs_t *buf0)
}
buf0->total = buf.f_blocks * buf.f_bsize;
buf0->available = buf.f_bavail * buf.f_bsize;
// Adjust for writes pending in the journal
if (journal) {
uint64_t estimate = journal->get_journal_size_estimate();

liupan1111 (Contributor) commented Mar 28, 2017

@dzafman, as I understand it, this change is meant to prevent the ENOSPC condition, am I right? Actually, I think this change only resolves the issue partially: statfs is called in update_osd_stat, which is called on each heartbeat (5s). But when the throughput from clients is very high, this buf0 may be out of date. Do you agree?

dzafman (Author) commented Mar 28, 2017

@liupan1111 It isn't so much that buf0 is out of date; it's that the journal may contain data which is pending to be written to the store, so buf0 is not up to date with respect to the writes that have been received but not yet flushed. The journal itself uses a fixed amount of space, so that space is already accounted for. The function name get_journal_size_estimate() is misleading because it isn't the actual size of the journal file or block device, but rather the amount of unflushed data pending to be written. This data will consume space when it flushes, so we should reflect that in determining how much more client I/O we should accept.
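A minimal sketch of the adjustment being described, with assumed names (not the actual FileStore::statfs code):

 #include <algorithm>
 #include <cstdint>

 // Sketch: shrink the 'available' figure reported by statfs by the amount
 // of journaled-but-unflushed data, since that data will consume store
 // space once it is applied. Clamped so we never underflow.
 uint64_t adjusted_available(uint64_t fs_available, uint64_t journal_pending) {
   return fs_available - std::min(fs_available, journal_pending);
 }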

liupan1111 (Contributor) commented Mar 29, 2017

@dzafman One question: is it possible that this statfs double-counts? The case is: some data is written to the journal disk and also written to the page cache of the data disk (not yet flushed to the physical data disk). Would that data be included both in the estimate and in the used part of buf0?

dzafman (Author) commented Mar 29, 2017

@liupan1111 No. The filesystem (e.g. xfs, ext4) that has accepted a write must have allocated the space and tracked it in memory, so statfs is always accurate. The page cache might delay the raw data getting onto the disk, but it doesn't affect the filesystem's space accounting.

tchaikov (Contributor) commented Mar 31, 2017

@liupan1111 there are two cases:

1. journal and data are co-located in the same filesystem (fs for short), in which case FileStore::statfs() should return the space still available for new data, so:

   fs.available = fs.all - data.used - journal.max_size // the space used by journal is pre-allocated
   data.all = fs.all - journal.max_size
   buf0.available = fs.available - journal.used

2. journal and data are living in different filesystems, in which case:

   fs.available = fs.all - data.used
   data.all = fs.all
   buf0.available = fs.available - journal.used

so I think the change is correct.
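To make case 1 concrete, a worked example with made-up numbers (in GiB; purely illustrative):

 #include <cassert>
 #include <cstdint>

 // Hypothetical worked example of case 1 above (co-located journal and data).
 int main() {
   uint64_t fs_all = 100, data_used = 40;
   uint64_t journal_max_size = 10;  // journal space is pre-allocated
   uint64_t journal_used = 2;       // journaled but not yet flushed

   uint64_t fs_available = fs_all - data_used - journal_max_size;  // 50
   uint64_t buf0_available = fs_available - journal_used;          // 48

   assert(fs_available == 50 && buf0_available == 48);
   return 0;
 }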

tchaikov (Contributor) commented Mar 31, 2017

Yes, there is a small time window, after the journal is committed and before the journal's apply callback is called, where the journaled data is counted both in (data.total - data.available) and in (journal.used). Is that what you mean?

liupan1111 (Contributor) commented Mar 31, 2017

@tchaikov Yup! Sorry for my poor English; that is exactly what I wanted to say.

liupan1111 (Contributor) commented Mar 31, 2017

In addition, the OSD only calls statfs every 5 seconds (each heartbeat), so that is another source of inaccuracy, especially under high throughput.

tchaikov (Contributor) commented Mar 31, 2017

I don't have an estimate of how large the time window is, but I think it's very small. If you believe this would put us in trouble, could you show us more evidence or statistics?

liupan1111 (Contributor) commented Mar 31, 2017

I was just studying this PR and noticed this possible inaccuracy; I was just curious about it. Yes, I agree this window shouldn't be large. I don't have any evidence yet, but I will keep this in mind and test it when I hit a similar case.
