storage 👎 #3
Closed · billimek opened this issue Dec 18, 2018 · 22 comments
Labels: bug Something isn't working

Comments

@billimek (Owner) commented Dec 18, 2018

Persistent Storage is a really big pain in the ass

NFS

  • Centralized NFS has worked really well with few or no problems
  • Challenges with this approach are:
    • This is file-level, shared-access storage, which is apparently very bad for things like databases (sqlite, mariadb, postgres, elasticsearch, etc.) that need block-level storage (see the sketch after this list)
    • All of the NFS storage is centralized on the proxmox node, which means the other nodes are effectively 'down' whenever that first node needs to go down. Realistically not sure how big of an issue this is
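For context, a minimal sketch of how this NFS storage gets consumed from kubernetes (the server address, export path, and object names here are hypothetical placeholders, not the actual values):

kubectl apply -f - <<'EOF'
# Hypothetical NFS-backed PersistentVolume + Claim; ReadWriteMany is what makes
# this file-level shared-access storage (fine for configs/media, bad for databases)
apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-nfs-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: 10.0.7.10        # hypothetical NFS host on the proxmox node
    path: /tank/kubernetes   # hypothetical export
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-nfs-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  volumeName: example-nfs-pv
  resources:
    requests:
      storage: 10Gi
EOF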

ceph

proxmox-provided ceph

  • ancient version of ceph that is difficult to change (don't want to go outside the proxmox-provided ceph packages)
  • Need to do some special steps to make it work in the cluster which I don't care for
  • if I'm going to do ceph, I'd prefer to use rook
  • problems encountered:
    • observed big issues with the entire ceph system (OSDs?) being non-responsive after a node reboot, requiring a further reboot of the other nodes, or manually stopping/removing/adding/starting the OSD to get it into a recovery state
    • after running, untouched, for a week, something suddenly happened with ceph and the OSDs started writing a ton of crap to the logs, filling up the /var/log filesystem, which is also used by the proxmox root filesystem. The mons detected no free disk space and shut themselves down, which resulted in a completely unusable ceph system
    • when ceph 'goes down' (which seems to happen frequently), any VMs with storage backed by ceph completely lock up, making them unusable until a manual ceph recovery - this is unacceptable

rook-provided ceph

  • always being updated which is nice and can pin to recent versions of ceph (mimic for example)
  • requires direct passthrough of drives to the VMs - not a big deal but an extra step is required during VM setup
  • problems encountered:
    • without warning or apparent reason, the OSDs get into a state where they think they cannot see each other and the entire ceph system locks up. a reboot of the node is required to recover
    • If using the same network as the rest of the k8s cluster and LAN, when ceph gets into a problem state, tcp connections start piling up until the haproxy loadbalancer runs out of tcp connections and ALL of haproxy stops responding, which completely fucks the entire network. This is NOT COOL
    • k8s nodes cannot be rebooted without draining first, because otherwise the node will hang forever with libceph errors being spit out in dmesg (a drain sketch follows this list). Not cool
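A rough sketch of the drain-before-reboot dance this forces (node name is just an example):

# Drain first so the kubelet isn't left hanging on rbd/libceph mounts
kubectl drain k8s-2 --ignore-daemonsets --delete-local-data
# Reboot the node, then allow scheduling on it again
sudo systemctl reboot
kubectl uncordon k8s-2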

longhorn

It's 'alpha' and I should have known better:

  • randomly loses a node and you have to manually click on a bunch of things in the UI console to get it repaired
  • randomly went read-only, completely breaking the application relying on it
@billimek (Owner Author) commented:

Steps taken so far to work around the issues and/or mitigate the problems:

  1. Running rook/ceph in both a kubeadm and rke provided cluster. This suggests that the problem isn't related to the cluster orchestration
  2. Running ceph at the proxmox layer but eventually the same issues manifest
  3. Changing drives (in case the issue was with a failing drive)
  4. Closely monitoring dmesg for network-related issues (i.e. network connection bouncing) - did not observe any issues at the network layer
  5. Running mimic ceph instead of luminous ceph (in rook)
  6. (rook) Running with a filesystem-backed bluestore instead of directly using ssd devices
  7. Telling ceph to use the 10.0.7.0/24 network for front-end (public) stuff and the 10.0.10.0/24 network for back-end (cluster) OSD stuff (see the config sketch after this list)
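In ceph terms, step 7 comes down to the public/cluster network settings; a sketch of the override (whether this lands in rook-config-override or a plain ceph.conf, the options are the same):

[global]
# front-end traffic: clients, mons, mgr
public network = 10.0.7.0/24
# back-end traffic: OSD replication and heartbeats
cluster network = 10.0.10.0/24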

@billimek (Owner Author) commented Dec 18, 2018

  • Will try building out a new cluster that uses only the 10.0.7.0/24 network (or maybe a completely new subnet?) and run rook/ceph with the public/private network model again
  • The nodes will run linux with a very recent kernel (4.11 or 4.14+). There were some hints that the OSD bad state could be mitigated by a timeout built into the newer kernels

@runningman84 commented:

You might want to try openebs instead.

@billimek (Owner Author) commented:

You might want to try openebs instead.

Thanks @runningman84, going to take a look at this!

@billimek billimek transferred this issue from billimek/k8s-templates Jan 8, 2019
@billimek (Owner Author) commented Jan 8, 2019

New cluster is built on ubuntu 18.10 (linux 4.18.0). The dedicated 10G network is not in play except for NFS usage.

@billimek billimek added the bug Something isn't working label Jan 8, 2019
@billimek (Owner Author) commented Jan 10, 2019

Putting elasticsearch persistent data on ceph seems to be a pretty good litmus test for issues, based on past experience.

Last night, deployed elasticsearch on rook-ceph and then followed up by deploying fluentd. As soon as fluentd started dumping cluster logs into elasticsearch, the following things were observed:

  • node k8s-2 started throwing 'blocked for more than 120 seconds' errors in dmesg. This did not occur on any of the other nodes
  • k8s-2 glances display stopped updating
  • k8s-2 netdata continued to work and report metrics, however. According to netdata, the ceph rbd0 and rbd1 'block devices' were showing 100% disk utilization and rbd0 showed an IO wait time staying steady at 7s (rbd1 showed it being at like 500ms). sda (the root drive) and sdb (the passed-through SSD for ceph) didn't show any activity or issues
  • k8s-2 netdata showed the system load pegging around 40 which is really high. There was plenty of memory and most of the CPU was in iowait
  • All of the other kubernetes nodes appeared to be operating properly and were otherwise healthy
  • The ceph dashboard showed no new activity in the OSDs. Basically it looked like it was frozen even though I could interact with it.

After letting this state persist for over 60 minutes, I decided that it was not going to heal itself and took action by killing the OSD pod on k8s-2 to restart it (see the sketch below). This was the only 'action' I took. As soon as the pod was killed and restarted, everything started working again. The logs all drained into elasticsearch and I haven't observed any issues since then.
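For reference, 'killing the OSD pod' is nothing more exotic than the following (the label is rook's usual one for OSD pods; the pod name is whatever happens to be scheduled on k8s-2):

# Find the OSD pod running on k8s-2
kubectl -n rook-ceph get pods -l app=rook-ceph-osd -o wide
# Delete the wedged pod; its deployment recreates it, restarting the OSD daemon
kubectl -n rook-ceph delete pod <rook-ceph-osd-pod-on-k8s-2>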

@billimek (Owner Author) commented Jan 13, 2019

Issue occurred again around 7pm EST (just after midnight UTC) on Friday 2019-01-11

All three worker nodes were sitting at a load of around 100 and were very unresponsive.

From rook-ceph-osd-2-7679fc657f-b2srh:

2019-01-12 23:48:19.779 7fb6ad7db700  0 log_channel(cluster) log [DBG] : 1.4e scrub starts
2019-01-12 23:48:19.783 7fb6a97d3700  0 log_channel(cluster) log [DBG] : 1.4e scrub ok
2019-01-13 00:00:38.509 7fb6c8010700 -1 osd.2 81 get_health_metrics reporting 8 slow ops, oldest is osd_op(client.116621.0:3656012 1.20 1:053e46c1:::rbd_data.1c8996b8b4567.000000000000000a:head [set-alloc-hint object_size 4194304 write_size 4194304,write 16384~16384] snapc 0=[] ondisk+write+known_if_redirected e81)
2019-01-13 00:00:38.509 7fb6c8010700 -1 osd.2 81 get_health_metrics reporting 8 slow ops, oldest is osd_op(client.116621.0:3656012 1.20 1:053e46c1:::rbd_data.1c8996b8b4567.000000000000000a:head [set-alloc-hint object_size 4194304 write_size 4194304,write 16384~16384] snapc 0=[] ondisk+write+known_if_redirected e81)
2019-01-13 00:00:39.525 7fb6c8010700 -1 osd.2 81 get_health_metrics reporting 21 slow ops, oldest is osd_op(client.116621.0:3656012 1.20 1:053e46c1:::rbd_data.1c8996b8b4567.000000000000000a:head [set-alloc-hint object_size 4194304 write_size 4194304,write 16384~16384] snapc 0=[] ondisk+write+known_if_redirected e81)
2019-01-13 00:00:39.525 7fb6c8010700 -1 osd.2 81 get_health_metrics reporting 21 slow ops, oldest is osd_op(client.116621.0:3656012 1.20 1:053e46c1:::rbd_data.1c8996b8b4567.000000000000000a:head [set-alloc-hint object_size 4194304 write_size 4194304,write 16384~16384] snapc 0=[] ondisk+write+known_if_redirected e81)
2019-01-13 00:00:40.557 7fb6c8010700 -1 osd.2 81 get_health_metrics reporting 21 slow ops, oldest is osd_op(client.116621.0:3656012 1.20 1:053e46c1:::rbd_data.1c8996b8b4567.000000000000000a:head [set-alloc-hint object_size 4194304 write_size 4194304,write 16384~16384] snapc 0=[] ondisk+write+known_if_redirected e81)
2019-01-13 00:00:40.557 7fb6c8010700 -1 osd.2 81 get_health_metrics reporting 21 slow ops, oldest is osd_op(client.116621.0:3656012 1.20 1:053e46c1:::rbd_data.1c8996b8b4567.000000000000000a:head [set-alloc-hint object_size 4194304 write_size 4194304,write 16384~16384] snapc 0=[] ondisk+write+known_if_redirected e81)
2019-01-13 00:00:41.581 7fb6c8010700 -1 osd.2 81 heartbeat_check: no reply from 10.42.4.188:6803 osd.1 since back 2019-01-13 00:00:21.574262 front 2019-01-13 00:00:21.574262 (cutoff 2019-01-13 00:00:21.583165)
2019-01-13 00:00:41.581 7fb6c8010700 -1 osd.2 81 heartbeat_check: no reply from 10.42.4.188:6803 osd.1 since back 2019-01-13 00:00:21.574262 front 2019-01-13 00:00:21.574262 (cutoff 2019-01-13 00:00:21.583165)
... forever

Similar things appear in the other OSD logs. The MON logs looked 'normal'.

dmesg from k8s-1:

[Jan13 00:00] libceph: osd1 down
[Jan13 00:01] libceph: mon0 10.43.89.158:6790 socket closed (con state OPEN)
[  +0.000025] libceph: mon0 10.43.89.158:6790 session lost, hunting for new mon
[  +2.902934] libceph: mon1 10.43.133.254:6790 session established
[  +6.098799] libceph: osd1 up
[ +24.835488] libceph: osd0 down
[Jan13 00:02] libceph: osd0 up
[Jan13 06:45] libceph: osd1 down
[  +0.000025] libceph: osd2 down
[Jan13 06:55] libceph: osd2 weight 0x0 (out)
[Jan13 07:20] libceph: mon1 10.43.133.254:6790 socket closed (con state OPEN)
[  +0.000019] libceph: mon1 10.43.133.254:6790 session lost, hunting for new mon
[ +10.000247] libceph: mon1 10.43.133.254:6790 socket closed (con state OPEN)
[Jan13 07:23] libceph: mon1 10.43.133.254:6790 socket closed (con state OPEN)
[Jan13 07:25] libceph: mon1 10.43.133.254:6790 socket closed (con state OPEN)
[Jan13 07:26] libceph: mon1 10.43.133.254:6790 socket closed (con state OPEN)
[Jan13 07:27] libceph: mon1 10.43.133.254:6790 socket closed (con state OPEN)
... forever

dmesg from k8s-2:

[Jan13 00:00] libceph: osd1 down
[Jan13 00:01] libceph: osd1 up
[ +24.770080] libceph: osd0 down
[  +9.947094] libceph: osd0 up
[Jan13 05:29] IPv6: ADDRCONF(NETDEV_UP): cali0bbf66667f6: link is not ready
[  +0.000029] IPv6: ADDRCONF(NETDEV_CHANGE): cali0bbf66667f6: link becomes ready
[Jan13 06:45] libceph: osd1 down
[  +0.000002] libceph: osd2 down
[Jan13 06:55] libceph: osd2 weight 0x0 (out)
[Jan13 07:00] IPv6: ADDRCONF(NETDEV_UP): caliab5db8e7fd3: link is not ready
[  +0.001321] IPv6: ADDRCONF(NETDEV_CHANGE): caliab5db8e7fd3: link becomes ready
[Jan13 07:20] libceph: mon1 10.43.133.254:6790 socket closed (con state OPEN)
[  +0.000038] libceph: mon1 10.43.133.254:6790 session lost, hunting for new mon
[ +45.002674] libceph: mon1 10.43.133.254:6790 socket closed (con state OPEN)
[Jan13 07:22] libceph: mon1 10.43.133.254:6790 socket closed (con state OPEN)
... forever

dmesg from k8s-3:

[Jan13 00:00] libceph: osd1 down
[  +1.537953] rbd: rbd0: encountered watch error: -107
[Jan13 00:01] libceph: osd1 up
[ +24.840642] libceph: osd0 down
[  +7.941309] libceph: osd0 up
[Jan13 00:16] libceph: osd2 10.42.5.113:6800 socket closed (con state OPEN)
[Jan13 06:45] libceph: osd1 down
[  +0.000002] libceph: osd2 down
[Jan13 06:55] libceph: osd2 weight 0x0 (out)
[Jan13 07:20] libceph: mon1 10.43.133.254:6790 socket closed (con state OPEN)
[  +0.000060] libceph: mon1 10.43.133.254:6790 session lost, hunting for new mon
[ +10.000319] libceph: mon1 10.43.133.254:6790 socket closed (con state OPEN)
[ +35.003237] libceph: mon1 10.43.133.254:6790 socket closed (con state OPEN)
[Jan13 07:21] libceph: mon1 10.43.133.254:6790 socket closed (con state OPEN)
... forever

ceph dashboard:

image

rook-ceph-mon-a-8686f9cd9c-brml6 log:

2019-01-13 00:00:27.666977 mon.b mon.0 10.43.89.158:6790/0 5732 : cluster [WRN] Health check failed: 2 slow ops, oldest one blocked for 262920 sec, mon.b has slow ops (SLOW_OPS)
2019-01-13 00:00:32.878549 mon.b mon.0 10.43.89.158:6790/0 5733 : cluster [INF] Health check cleared: SLOW_OPS (was: 2 slow ops, oldest one blocked for 262925 sec, mon.b has slow ops)
2019-01-13 00:00:32.878606 mon.b mon.0 10.43.89.158:6790/0 5734 : cluster [INF] Cluster is now healthy
2019-01-13 00:00:49.256068 mon.b mon.0 10.43.89.158:6790/0 5737 : cluster [INF] osd.1 failed (root=default,host=k8s-2) (2 reporters from different host after 23.000089 >= grace 20.000000)
2019-01-13 00:00:49.678358 mon.b mon.0 10.43.89.158:6790/0 5738 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
2019-01-13 00:00:49.678528 mon.b mon.0 10.43.89.158:6790/0 5739 : cluster [WRN] Health check failed: 1 host (1 osds) down (OSD_HOST_DOWN)
2019-01-13 00:01:01.478 7f54aefdd700  1 mon.a@2(peon).paxos(paxos active c 288400..289127) lease_timeout -- calling new election
2019-01-13 00:01:01.482 7f54ac7d8700  0 log_channel(cluster) log [INF] : mon.a calling monitor election
2019-01-13 00:00:51.004661 mon.b mon.0 10.43.89.158:6790/0 5741 : cluster [WRN] Health check failed: 134 slow ops, oldest one blocked for 262945 sec, daemons [osd.0,osd.1,osd.2,mon.b] have slow ops. (SLOW_OPS)
2019-01-13 00:01:01.488467 mon.a mon.2 10.43.227.102:6790/0 3482 : cluster [INF] mon.a calling monitor election
2019-01-13 00:01:01.646784 mon.c mon.1 10.43.133.254:6790/0 1335 : cluster [INF] mon.c calling monitor election
2019-01-13 00:01:06.793025 mon.c mon.1 10.43.133.254:6790/0 1336 : cluster [INF] mon.c is new leader, mons c,a in quorum (ranks 1,2)
2019-01-13 00:01:07.047233 mon.c mon.1 10.43.133.254:6790/0 1341 : cluster [WRN] Health check failed: 1/3 mons down, quorum c,a (MON_DOWN)
2019-01-13 00:01:07.207546 mon.c mon.1 10.43.133.254:6790/0 1342 : cluster [WRN] overall HEALTH_WARN 1 osds down; 1 host (1 osds) down; 134 slow ops, oldest one blocked for 262945 sec, daemons [osd.0,osd.1,osd.2,mon.b] have slow ops.; 1/3 mons down, quorum c,a
2019-01-13 00:01:09.994 7f54ac7d8700  0 log_channel(cluster) log [INF] : mon.a calling monitor election
2019-01-13 00:01:09.421024 mon.b mon.0 10.43.89.158:6790/0 5743 : cluster [INF] mon.b calling monitor election
2019-01-13 00:01:09.997792 mon.a mon.2 10.43.227.102:6790/0 3483 : cluster [INF] mon.a calling monitor election
2019-01-13 00:01:10.210804 mon.c mon.1 10.43.133.254:6790/0 1347 : cluster [INF] mon.c calling monitor election
2019-01-13 00:01:13.338321 mon.b mon.0 10.43.89.158:6790/0 5744 : cluster [INF] mon.b calling monitor election
2019-01-13 00:01:13.776578 mon.b mon.0 10.43.89.158:6790/0 5745 : cluster [INF] mon.b is new leader, mons b,c,a in quorum (ranks 0,1,2)
2019-01-13 00:01:16.162033 mon.b mon.0 10.43.89.158:6790/0 5750 : cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum c,a)
2019-01-13 00:01:16.534357 mon.b mon.0 10.43.89.158:6790/0 5751 : cluster [WRN] Health check update: Degraded data redundancy: 1968/6750 objects degraded (29.156%), 88 pgs degraded (PG_DEGRADED)
2019-01-13 00:01:16.534393 mon.b mon.0 10.43.89.158:6790/0 5752 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 4 pgs peering)
2019-01-13 00:01:17.814913 mon.b mon.0 10.43.89.158:6790/0 5753 : cluster [WRN] overall HEALTH_WARN 1 osds down; 1 host (1 osds) down; Reduced data availability: 4 pgs peering; Degraded data redundancy: 851/6750 objects degraded (12.607%), 37 pgs degraded
...

eventually it's mostly all this:

2019-01-13 01:49:52.830 7f54aefdd700 -1 mon.a@2(peon) e3 get_health_metrics reporting 1 slow ops, oldest is osd_failure(failed timeout osd.2 10.42.5.113:6800/12471 for 21sec e86 v86)
2019-01-13 01:49:57.829 7f54aefdd700 -1 mon.a@2(peon) e3 get_health_metrics reporting 1 slow ops, oldest is osd_failure(failed timeout osd.2 10.42.5.113:6800/12471 for 21sec e86 v86)
2019-01-13 01:49:57.829 7f54aefdd700 -1 mon.a@2(peon) e3 get_health_metrics reporting 1 slow ops, oldest is osd_failure(failed timeout osd.2 10.42.5.113:6800/12471 for 21sec e86 v86)
2019-01-13 01:50:02.829 7f54aefdd700 -1 mon.a@2(peon) e3 get_health_metrics reporting 1 slow ops, oldest is osd_failure(failed timeout osd.2 10.42.5.113:6800/12471 for 21sec e86 v86)
2019-01-13 01:50:02.829 7f54aefdd700 -1 mon.a@2(peon) e3 get_health_metrics reporting 1 slow ops, oldest is osd_failure(failed timeout osd.2 10.42.5.113:6800/12471 for 21sec e86 v86)
2019-01-13 01:50:07.829 7f54aefdd700 -1 mon.a@2(peon) e3 get_health_metrics reporting 1 slow ops, oldest is osd_failure(failed timeout osd.2 10.42.5.113:6800/12471 for 21sec e86 v86)
2019-01-13 01:50:07.829 7f54aefdd700 -1 mon.a@2(peon) e3 get_health_metrics reporting 1 slow ops, oldest is osd_failure(failed timeout osd.2 10.42.5.113:6800/12471 for 21sec e86 v86)
2019-01-13 01:50:12.829 7f54aefdd700 -1 mon.a@2(peon) e3 get_health_metrics reporting 1 slow ops, oldest is osd_failure(failed timeout osd.2 10.42.5.113:6800/12471 for 21sec e86 v86)
...
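For anyone poking at the same state, a hedged sketch of the inspection commands (run the cluster-wide ones from the rook toolbox pod, and the admin-socket ones from inside the affected OSD pod):

# Cluster-wide view of health and which daemons report slow ops
ceph -s
ceph health detail
# From inside the affected OSD pod (admin socket): what is actually stuck
ceph daemon osd.2 dump_ops_in_flight
ceph daemon osd.2 dump_blocked_ops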

@billimek (Owner Author) commented Jan 13, 2019

Eventual recovery was only possible by forcefully rebooting (from proxmox) the 3 worker nodes.

My only analysis is that around midnight UTC there is a large amount of network IO when the stash restic jobs kick in to back up the volumes from the pods to an NFS mount. Why this would trigger ceph to go completely unusable and take down the cluster, I do not know.

When not running ceph, this doesn't happen. This situation is directly related to ceph. What causes ceph to get into this state is still a mystery. I can say with some degree of confidence that it is not:

  • network interface related (happens on both the built-in 1G network and the DAC 10G network)
  • the version of the kernel (running 4.18 now and it still happens)
  • the presence of rancher (this was a stretch anyway but worth noting)
  • the version of ceph (the latest mimic, released last week, doesn't make a difference)

@billimek (Owner Author) commented:

From this, edited the config via

k -n rook-ceph edit configmap rook-config-override

and set

[osd]
osd snap trim sleep = 0.6

... then restarted the OSDs.
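(osd snap trim sleep just inserts a pause between snapshot-trim operations on the OSD.) The restart itself is a sketch like the following, assuming rook's usual app=rook-ceph-osd label:

# Bounce the OSD pods so they pick up the rook-config-override change
kubectl -n rook-ceph delete pod -l app=rook-ceph-osd
# Verify a running OSD picked up the value (admin socket, from inside that OSD pod)
ceph daemon osd.0 config get osd_snap_trim_sleep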

This may not be necessary, and may not even do anything in ceph mimic, but going to try it after seeing the locked-up condition a couple of times now.

@billimek (Owner Author) commented Jan 17, 2019

There appears to be a direct correlation between the stash restic backups that all kick off at midnight UTC (7pm ET) and the ceph cluster getting into a bad state.

The restic backups traverse a different physical network interface (10Gb DAC network cards) targeting an NFS mount on a different server. Two of the restic backups read data from the elasticsearch-data pods, which run on rook/ceph. All of the other backups pull data from workloads whose data persists to an NFS mount.

Symptoms on the nodes running the elasticsearch-data pods include:

  • iowait spikes immediately at 7pm ET when the backups start
  • (sometimes) one or both of those nodes start dropping kernel messages about hung processes
  • the OSDs on one or both of the nodes become 'locked', in that they do not respond to anything
  • the nodes themselves peg at a load of about 100
  • the nodes themselves report 0 disk utilization on the physical drives but 100% disk utilization on the ceph 'rbd' mounts where elasticsearch-data lives, with disk wait times stuck at around 1-7 seconds (see the iostat sketch below)
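A quick way to watch the same picture outside netdata (device names as reported above; rbd0/rbd1 are the ceph-backed mounts):

# Per-device utilization and wait times every 5 seconds; during the event the rbd*
# devices sit at ~100% util while sda/sdb show almost nothing
iostat -x -d sda sdb rbd0 rbd1 5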

Running netdata on k8s-2 shows that the display basically freezes minutes after the 7pm backup start time:
image

Netdata on node k8s-3 shows it basically pegged in iowait continually since 7pm ET:
image

glances on k8s-2 (frozen as of 34 seconds past midnight UTC (7pm ET)):
image

glances on k8s-3 (lagging minutes behind current time):
image

dmesg on k8s-2, 3 minutes past midnight UTC (7pm ET):
image

dmesg on k8s-3:
image

ceph dashboard as of 8:27pm ET:
image

The last command I was able to run on k8s-2 - it's completely hung now and the node is entirely unresponsive:
image

@billimek (Owner Author) commented Jan 17, 2019

rook/ceph were never able to recover on their own, and the only way to effect a change was to forcefully reboot the k8s-2 node about 80 minutes after everything broke.

As soon as this happened, the ceph slow ops errors went away, and the 'locked' or 'blocked' operations on the other nodes immediately recovered as well:
image

k8s-2 recovering according to netdata:
image

k8s-3 recovering according to netdata:
image

dmesg on k8s-2 after rebooting:
image

@billimek (Owner Author) commented Jan 17, 2019

It is important to note that this same behavior was also observed when:

  • running in an older kernel (4.4 series under ubuntu 16.04)
  • running ceph on a different dedicated network
  • running ceph OSDs on completely different drives
  • running ceph OSDs as local bluestore objects on the local filesystem

@billimek (Owner Author) commented Jan 17, 2019

Given the seeming correlation between stash/restic backups and this condition, I'm going to do the following experiments:

  1. disable restic backups for the elasticsearch-data pods, which are the only things running in ceph right now. This could test the notion that a massive amount of reads against the ceph-backed filesystem inside the elasticsearch pod, and/or heavy IO on the node itself, is the triggering event
  2. if this doesn't yield different results, will completely disable all stash/restic backups of all volumes to see if there is a correlation between heavy IO and ceph getting into this bad state.

(edit)
elasticsearch backups are now paused with this commit
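For the record, a hedged sketch of what pausing a Stash backup can look like (the namespace and Restic object names are hypothetical, and the spec.paused field is from my reading of the Stash v1alpha1 Restic CRD - verify against the Stash docs for the deployed version):

# Pause the Restic backup covering the elasticsearch-data volumes
kubectl -n logs patch restic elasticsearch-data-backup --type merge -p '{"spec":{"paused":true}}'
# ...or simply delete the Restic object and re-apply it later
kubectl -n logs delete restic elasticsearch-data-backup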

Even with the above stated, I do not believe that this situation is directly related to stash/restic, as the problem condition has occurred at times other than the 7pm ET backup start. For example, when installing elasticsearch & fluentd for the first time, ceph got into a bad state which required a node reboot to recover from.

It 'feels' like heavy IO is related to the root cause. It just so happens that the restic backups are a big burst of IO at the same time every day, so it's easy to pinpoint them as a trigger.

@billimek (Owner Author) commented:

Tuning advice links to try later:

In particular,

# raise the max PID/thread count
echo 4194303 > /proc/sys/kernel/pid_max
# use the noop IO scheduler for the SSD
echo noop > /sys/block/sda/queue/scheduler
# deepen the device request queue
echo 1024 > /sys/block/sda/queue/nr_requests
# increase readahead
echo "8192" > /sys/block/sda/queue/read_ahead_kb

Keep up with iostats via S_COLORS=always watch -c -n 10 iostat -x

@billimek (Owner Author) commented:

The same thing happened at 7pm ET tonight - despite applying those kernel and device tweaks. Also observed that the stash backups for the elasticsearch pods still appeared to attempt to run, so I'm not sure how they are supposed to be disabled.

@billimek (Owner Author) commented:

deployments with sqlite3 databases, which don't play well with NFS:

  • plex
  • sonarr
  • radarr
  • grafana
  • home-assistant (in some cases)
  • nzbget (maybe)

@billimek (Owner Author) commented Jan 21, 2019

Update:

  • After several more days of testing stash/restic backups of the elasticsearch pods, it is fairly easy to reproduce the problem condition with ceph as described above.
  • As soon as the stash/restic backups were disabled, the daily 7pm ET OSD lockup issue went away.
  • However, after doing a lot more testing with removing and recreating elasticsearch backed by ceph, any period of heavy IO seems to trigger the problem condition. Changing the proxmox passthrough drives to use directsync caching or threads IO didn't make a difference.

After more googling around about this issue, it seems to be a known thing that a kernel client deadlock can occur when running a ceph client (the kernel rbd client) on the same host as the ceph OSD itself. Something about a deadlock condition in the kernel code. The description seems to jibe with the symptoms I've been observing, in that the problems seem to only happen during periods of high IO.

Links discussing this issue:

From the ceph documentation:

In older kernels, Ceph can deadlock if you try to mount CephFS or RBD client services on the same host that runs your test Ceph cluster. This is not a Ceph-related issue. It’s related to memory pressure and needing to relieve free memory. Recent kernels with up-to-date glibc and syncfs(2) reduce this issue considerably. However, a memory pool large enough to handle incoming requests is the only thing that guarantees against the deadlock occurring. When you run Ceph clients on a Ceph cluster machine, loopback NFS can experience a similar problem related to buffer cache management in the kernel. You can avoid these scenarios entirely by using a separate client host, which is more realistic for deployment scenarios anyway.

So, armed with this possible explanation, I explored how to run ceph in a way that keeps it 'external' to the clients. One obvious option is to go back to the proxmox-provided ceph cluster. Another is to run rook in a way that the OSDs run on the master nodes. I tried the rook approach. Unfortunately there isn't a way right now to properly run rook ceph on the rke-provided master nodes. I tried for a few hours and spoke with folks in slack and realized that it's not quite there yet.

With that setback, I deployed the external ceph storageclass and provisioner and will try it that way again. This time I will not use it to host the VM disk images directly, as that would recreate the same problem of running the ceph client and server on the same host.
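The external storageclass ends up looking roughly like this - a sketch only: whether the in-tree kubernetes.io/rbd provisioner or an external rbd provisioner is used, the parameters are essentially the same, and every address/name/pool here is a placeholder:

kubectl apply -f - <<'EOF'
# RBD StorageClass pointing at the external (proxmox-managed) ceph cluster
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-rbd
provisioner: kubernetes.io/rbd
parameters:
  monitors: 10.0.7.1:6789,10.0.7.2:6789,10.0.7.3:6789   # placeholder mon addresses
  pool: kube                                            # placeholder RBD pool
  adminId: admin
  adminSecretName: ceph-admin-secret
  adminSecretNamespace: kube-system
  userId: kube
  userSecretName: ceph-user-secret
  fsType: ext4
  imageFormat: "2"
  imageFeatures: layering
EOF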

@billimek (Owner Author) commented Jan 22, 2019

After migrating to an externalized ceph cluster (provided by proxmox),

  • no issues during initial ingestion of data running elasticsearch
  • no issues during the 7pm ET (midnight UTC) daily stash/restic backup

So far, this seems to support the narrative that running client workloads (rbd clients) on the same host as the server (ceph OSD) can cause kernel client deadlocks.

@billimek (Owner Author) commented:

Almost a full week of running a lot of workload on the external ceph storage, and not a single issue or disruption so far. This may be the final answer.

Storage volumes migrated from NFS to ceph so far:

chronograf-chronograf
grafana
hass-mysql
influxdb-influxdb
nzbget-config
plex-kube-plex-config
prometheus-alertmanager
prometheus-server
radarr-config
sonarr-config
unifi
data-elasticsearch-data-0
data-elasticsearch-data-1
data-elasticsearch-master-0
data-elasticsearch-master-1
data-elasticsearch-master-2

Remaining volumes to migrate:

deluge-config
hass-home-assistant
mc-minecraft-datadir
mcsv-minecraft-datadir
node-red
rutorrent-config
datadir-consul-0
datadir-consul-1
datadir-consul-2

@billimek (Owner Author) commented:

With commit 1af3b71 this can finally be closed.

In summary:

It is apparently a known thing that ceph cluster server components (OSDs) should not co-exist in the same kernel runtime space as ceph clients (i.e. pod workloads consuming storage from the OSDs).

This means that ceph should only be run in a way that isolates the two runtimes. Currently this is difficult to do with rook. Will revisit rook/ceph once it's easier to do this (like running the OSDs on the 3 master nodes, away from the worker nodes).

@samvdb commented Mar 13, 2019

deployments with sqlite3 databases which don't play well with nfs:

  • plex
  • sonarr
  • radarr
  • grafana
  • home-assistant (in some cases)
  • nzbget (maybe)

Thanks for your detailed comments - wish I had found this issue sooner :) NFS really doesn't play nice with these guys.

@iMartyn commented Jul 25, 2019

@billimek I really don't think it's clients and OSDs on the same nodes that's causing this; in my scenario I could replicate it whilst doing an rsync from an mdraid array to cephfs on a machine that has no OSDs on it. I'm glad you found a solution that works for you, but I think the cephfs bug is potentially a version-related one. Could you check which version of ceph you're running in proxmox?
