feature: Health warnings on long network ping times, add "dump_osd_network" to get a report #28755
Merged
18 commits
66d44e7 osd mon: Track heartbeat ping times and report health warning (dzafman)
025b10a osd: Add "dump_osd_network" osd admin request to get a sorted report (dzafman)
5d3c185 mgr: Add "dump_osd_network" mgr admin request to get a sorted report (dzafman)
0d1bbd3 osd mgr mon: Add mon_warn_on_slow_ping_ratio config as 5% of osd_hear… (dzafman)
f4a0be2 doc: Add documentation and release notes (dzafman)
297a0e7 osd mgr: Add minimum and maximum tracking to network ping time (dzafman)
3f846d7 osd mgr: Store last pingtime for possible graphing (dzafman)
6555699 osd: After first interval populate vectors so 5min/15min values aren't 0 (dzafman)
ea20d35 osd mon: Add last_update to osd_stat_t heartbeat info (dzafman)
5ab145d mon: Indicate when an osd with slow ping time is down (dzafman)
048f809 osd mgr: Add osd_mon_heartbeat_stat_stale option to time out ping info (dzafman)
f2b26d8 osd: Add debug_disable_randomized_ping config for use in testing (dzafman)
573aea2 osd: Add debug_heartbeat_testing_span to allow quicker testing (dzafman)
4fb42ea test: Add basic test for network ping tracking (dzafman)
8ac1562 common: Add support routines to generate strings for fixed point (dzafman)
9d02e5d osd mon mgr: Convert all network ping time output to milliseconds (dzafman)
5f83a61 osd doc mon mgr: To milliseconds for config value, user input and thr… (dzafman)
71015b9 doc: Document network performance monitoring (dzafman)
@@ -0,0 +1,145 @@
#!/usr/bin/env bash

source $CEPH_ROOT/qa/standalone/ceph-helpers.sh

function run() {
    local dir=$1
    shift

    export CEPH_MON="127.0.0.1:7146" # git grep '\<7146\>' : there must be only one
    export CEPH_ARGS
    CEPH_ARGS+="--fsid=$(uuidgen) --auth-supported=none "
    CEPH_ARGS+="--mon-host=$CEPH_MON "
    CEPH_ARGS+="--debug_disable_randomized_ping=true "
    CEPH_ARGS+="--debug_heartbeat_testing_span=5 "
    CEPH_ARGS+="--osd_heartbeat_interval=1 "
    local funcs=${@:-$(set | sed -n -e 's/^\(TEST_[0-9a-z_]*\) .*/\1/p')}
    for func in $funcs ; do
        setup $dir || return 1
        $func $dir || return 1
        teardown $dir || return 1
    done
}

function TEST_network_ping_test1() {
    local dir=$1

    run_mon $dir a || return 1
    run_mgr $dir x || return 1
    run_osd $dir 0 || return 1
    run_osd $dir 1 || return 1
    run_osd $dir 2 || return 1

    sleep 5

    create_pool foo 16

    # write some objects
    timeout 20 rados bench -p foo 10 write -b 4096 --no-cleanup || return 1

    # Get 1 cycle worth of ping data "1 minute"
    sleep 10
    flush_pg_stats

    CEPH_ARGS='' ceph daemon $(get_asok_path osd.0) dump_osd_network | tee $dir/json
    test "$(cat $dir/json | jq '.entries | length')" = "0" || return 1
    test "$(cat $dir/json | jq '.threshold')" = "1000" || return 1

    CEPH_ARGS='' ceph daemon $(get_asok_path mgr.x) dump_osd_network | tee $dir/json
    test "$(cat $dir/json | jq '.entries | length')" = "0" || return 1
    test "$(cat $dir/json | jq '.threshold')" = "1000" || return 1

    CEPH_ARGS='' ceph daemon $(get_asok_path osd.0) dump_osd_network 0 | tee $dir/json
    test "$(cat $dir/json | jq '.entries | length')" = "4" || return 1
    test "$(cat $dir/json | jq '.threshold')" = "0" || return 1

    CEPH_ARGS='' ceph daemon $(get_asok_path mgr.x) dump_osd_network 0 | tee $dir/json
    test "$(cat $dir/json | jq '.entries | length')" = "12" || return 1
    test "$(cat $dir/json | jq '.threshold')" = "0" || return 1

    # Wait another 4 cycles to get "5 minute interval"
    sleep 20
    flush_pg_stats
    CEPH_ARGS='' ceph daemon $(get_asok_path osd.0) dump_osd_network | tee $dir/json
    test "$(cat $dir/json | jq '.entries | length')" = "0" || return 1
    test "$(cat $dir/json | jq '.threshold')" = "1000" || return 1

    CEPH_ARGS='' ceph daemon $(get_asok_path mgr.x) dump_osd_network | tee $dir/json
    test "$(cat $dir/json | jq '.entries | length')" = "0" || return 1
    test "$(cat $dir/json | jq '.threshold')" = "1000" || return 1

    CEPH_ARGS='' ceph daemon $(get_asok_path osd.0) dump_osd_network 0 | tee $dir/json
    test "$(cat $dir/json | jq '.entries | length')" = "4" || return 1
    test "$(cat $dir/json | jq '.threshold')" = "0" || return 1

    CEPH_ARGS='' ceph daemon $(get_asok_path mgr.x) dump_osd_network 0 | tee $dir/json
    test "$(cat $dir/json | jq '.entries | length')" = "12" || return 1
    test "$(cat $dir/json | jq '.threshold')" = "0" || return 1

    # Wait another 10 cycles to get "15 minute interval"
    sleep 50
    flush_pg_stats
    CEPH_ARGS='' ceph daemon $(get_asok_path osd.0) dump_osd_network | tee $dir/json
    test "$(cat $dir/json | jq '.entries | length')" = "0" || return 1
    test "$(cat $dir/json | jq '.threshold')" = "1000" || return 1

    CEPH_ARGS='' ceph daemon $(get_asok_path mgr.x) dump_osd_network | tee $dir/json
    test "$(cat $dir/json | jq '.entries | length')" = "0" || return 1
    test "$(cat $dir/json | jq '.threshold')" = "1000" || return 1

    CEPH_ARGS='' ceph daemon $(get_asok_path osd.0) dump_osd_network 0 | tee $dir/json
    test "$(cat $dir/json | jq '.entries | length')" = "4" || return 1
    test "$(cat $dir/json | jq '.threshold')" = "0" || return 1

    CEPH_ARGS='' ceph daemon $(get_asok_path mgr.x) dump_osd_network 0 | tee $dir/json
    test "$(cat $dir/json | jq '.entries | length')" = "12" || return 1
    test "$(cat $dir/json | jq '.threshold')" = "0" || return 1

    # Just check the threshold output matches the input
    CEPH_ARGS='' ceph daemon $(get_asok_path mgr.x) dump_osd_network 99 | tee $dir/json
    test "$(cat $dir/json | jq '.threshold')" = "99" || return 1
    CEPH_ARGS='' ceph daemon $(get_asok_path osd.0) dump_osd_network 98 | tee $dir/json
    test "$(cat $dir/json | jq '.threshold')" = "98" || return 1

    rm -f $dir/json
}

# Test setting of mon_warn_on_slow_ping_time very low to
# get health warning
function TEST_network_ping_test2() {
    local dir=$1

    export CEPH_ARGS
    export EXTRA_OPTS=" --mon_warn_on_slow_ping_time=1"
    run_mon $dir a || return 1
    run_mgr $dir x || return 1
    run_osd $dir 0 || return 1
    run_osd $dir 1 || return 1
    run_osd $dir 2 || return 1

    sleep 5

    create_pool foo 16

    # write some objects
    timeout 20 rados bench -p foo 10 write -b 4096 --no-cleanup || return 1

    # Get at least 1 cycle of ping data (this test runs with 5 second cycles of 1 second pings)
    sleep 10
    flush_pg_stats

    ceph health | tee $dir/health
    grep -q "Long heartbeat" $dir/health || return 1

    ceph health detail | tee $dir/health
    grep -q "OSD_SLOW_PING_TIME_BACK" $dir/health || return 1
    grep -q "OSD_SLOW_PING_TIME_FRONT" $dir/health || return 1
    rm -f $dir/health
}

main network-ping "$@"

# Local Variables:
# compile-command: "cd ../.. ; make -j4 && ../qa/run-standalone.sh network-ping.sh"
# End:
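
The sleep values in TEST_network_ping_test1 appear to follow from the test configuration: with `--debug_heartbeat_testing_span=5`, the test comments treat each nominal "minute" of ping history as one 5-second cycle, so the 1-, 5- and 15-minute windows need roughly 5, 25 and 75 seconds of data. The sketch below only reproduces that arithmetic. The function name is hypothetical, and the reading of the span option is inferred from the comments in the test, not from the OSD code.

```bash
#!/usr/bin/env bash

# Hypothetical helper, not part of this PR: compare the cumulative sleep time
# in TEST_network_ping_test1 with the time each averaging window should need.
function print_window_coverage() {
    local span=5              # seconds per cycle (--debug_heartbeat_testing_span=5)
    local sleeps=(10 20 50)   # the three sleeps in TEST_network_ping_test1
    local windows=(1 5 15)    # nominal window lengths in "minutes" (cycles)
    local elapsed=0 needed i

    for i in 0 1 2; do
        elapsed=$((elapsed + sleeps[i]))
        needed=$((windows[i] * span))
        echo "${windows[i]}-minute window: needs ${needed}s, slept ${elapsed}s"
    done
}

print_window_coverage
```

Under these assumptions each sleep leaves a 5-second margin over the time the corresponding window needs.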
It might make sense to add a little about the admin commands to the main docs too, perhaps in doc/rados/operations/monitoring.rst.
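
For reference, a doc example could be as small as the sketch below. It is based only on how the test in this PR calls the command (`ceph daemon <admin socket> dump_osd_network [threshold]`, with the threshold in milliseconds per the later commits). The wrapper name and the socket path in the usage comment are hypothetical.

```bash
#!/usr/bin/env bash

# Hypothetical wrapper, not part of this PR: ask one daemon (OSD or mgr) for
# its sorted network ping report through the admin socket.
#   $1 = path to the daemon's admin socket
#   $2 = optional threshold in milliseconds; when omitted, the daemon's
#        default threshold applies (the test above sees 1000 reported back)
function dump_osd_network_report() {
    local asok=$1
    local threshold=${2:-}

    if [ -n "$threshold" ]; then
        ceph daemon "$asok" dump_osd_network "$threshold"
    else
        ceph daemon "$asok" dump_osd_network
    fi
}

# Example (the socket path is an assumption; adjust to the local deployment):
#   dump_osd_network_report /var/run/ceph/ceph-mgr.x.asok 0
```

Passing a threshold of 0 is what the test uses to list every entry. In the test cluster of three OSDs that yields 4 entries from a single OSD and 12 from the mgr.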