storage 👎 #3
Steps taken so far to work around the issues and/or mitigate the problems:
You might want to try openebs instead.
Thanks @runningman84, going to take a look at this!
New cluster is built on ubuntu 18.10 (linux 4.18.0). Dedicated 10G network is not in play except for NFS usage.
Putting elasticsearch persistent data on ceph seems to be a pretty good litmus test for issues, based on past experience. Last night, deployed elasticsearch on rook-ceph and then followed up with deploying fluentd. As soon as fluentd started dumping cluster logs into elasticsearch, the following things were observed:
After letting this state remain for over 60 minutes, I decided that it was not going to heal itself and took action by killing the OSD pod on the affected node.
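For reference, restarting a stuck OSD in rook just means deleting its pod and letting the deployment recreate it; a minimal sketch, assuming the standard rook-ceph namespace and labels (the pod name is a placeholder):

```shell
# List the rook OSD pods and the nodes they are scheduled on
kubectl -n rook-ceph get pods -l app=rook-ceph-osd -o wide

# Delete the stuck OSD pod (placeholder name); its deployment recreates it
kubectl -n rook-ceph delete pod rook-ceph-osd-2-xxxxxxxxxx-xxxxx
```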
Issue occurred again around 7pm EST (just after midnight UTC) on Friday 2019-01-11. All three worker nodes were at a load of around 100 and very unresponsive. From one of the OSD logs:
Similar things in the other OSD logs. MON logs looked 'normal'.
ceph dashboard (screenshot). From the rook-ceph-mon-a-8686f9cd9c-brml6 log, eventually it's mostly all this:
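The log excerpts themselves weren't captured above. For reference, this is roughly how the OSD/MON logs and overall cluster health were being pulled, assuming the standard rook-ceph labels and that the rook toolbox deployment (rook-ceph-tools) is installed:

```shell
# Tail logs from the OSD pods and the named mon pod
kubectl -n rook-ceph logs -l app=rook-ceph-osd --tail=100
kubectl -n rook-ceph logs rook-ceph-mon-a-8686f9cd9c-brml6 --tail=100

# Check overall cluster health via the rook toolbox (if deployed)
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd tree
```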
Eventual recovery was only possible by forcefully rebooting (from proxmox) the 3 worker nodes. My only analysis is that around midnight UTC there is a large amount of network IO when the stash restic jobs kick in to back up the volumes from the pods to an NFS mount. Why this would trigger ceph to go completely unusable and take down the cluster, I do not know. When not running ceph, this doesn't happen. This situation is directly related to ceph. What causes ceph to get into this state is still a mystery. I can say with some degree of confidence that it is not:
From this, edited the config via
and set
... then restarted the OSDs. This may not be necessary, or may not even do anything in ceph mimic, but going to try it after seeing lock-up conditions a couple of times now.
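The exact option that was set isn't captured above. In rook, extra ceph settings are normally injected via the rook-config-override ConfigMap and picked up when the daemons restart; a sketch, with the [osd] option shown purely as a placeholder:

```shell
# Edit the override ConfigMap that rook merges into the daemons' ceph.conf
kubectl -n rook-ceph edit configmap rook-config-override

# Example override (placeholder section/value, not the actual setting used):
#   [osd]
#   osd_op_thread_suicide_timeout = 300

# Restart the OSD pods so they pick up the change
kubectl -n rook-ceph delete pod -l app=rook-ceph-osd
```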
It is important to note that this same behavior was also observed when:
Given the seeming correlation between stash/restic backups and this condition, I'm going to do the following experiments:
(edit) Even with the above stated, I do not believe that this situation is directly related to stash/restic, as the problem condition has occurred at times other than the 7pm ET backup start. For example, when installing elasticsearch & fluentd for the first time, ceph got into a bad state which required a node reboot to recover from. It 'feels' like heavy IO is related to the root cause. It just so happens that the restic backups are a big burst of IO at the same time every day, so it's easy to pinpoint this as a trigger.
Tuning advice links to try later:
In particular,
Keep up with iostats:
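The specific command wasn't captured above; a common way to watch per-device IO during the backup window, assuming the sysstat package is installed (the device path is a placeholder):

```shell
# Extended per-device stats every 5 seconds, skipping idle devices
iostat -xz 5

# Or watch a specific disk backing an OSD
iostat -xz 5 /dev/sdb
```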
7pm ET: same thing happened tonight, despite applying those kernel and device tweaks. Also observed that the stash backups for the elasticsearch pods appeared to attempt to run, so not sure how they are supposed to be disabled.
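A sketch of how I'd expect those backups to be inspected and switched off, assuming the v1alpha1 Restic CRD that Stash 0.x used at the time; the resource name may differ by Stash version, and the namespace/object name below are placeholders:

```shell
# List Stash backup definitions (assumes the Restic CRD from Stash 0.x)
kubectl get restics --all-namespaces

# Hypothetical: delete the Restic object covering elasticsearch so its
# scheduled backups stop firing (namespace and name are placeholders)
kubectl -n logging delete restic elasticsearch-stash
```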
Deployments with sqlite3 databases which don't play well with NFS:
Update:
After more googling around about this issue, it seems to be a known thing that a kernel client deadlock can occur when running a ceph kernel client (the rbd client) on the same host as the ceph OSD itself. Something about a deadlock condition in the kernel code. The description seems to jibe with the symptoms I've been observing, in that the problems seem to only happen during periods of high IO. Links discussing this issue:
From the ceph documentation:
So, armed with this possible explanation, I explored how to run ceph in a way that keeps it 'external' to the clients. One obvious option is to go back to the proxmox-provided ceph cluster. Another is to run rook in a way that the OSDs run on the master nodes. I tried the rook approach. Unfortunately there isn't a way right now to properly run rook ceph on the rke-provided master nodes. I tried for a few hours and spoke with folks in Slack and realized that it's not quite there yet. With that setback, I deployed the external ceph storageclass and provisioner (sketched below) and will try it that way again. This time I will not use it to host the VM disk images directly, as that seems to recreate the same problem of running the ceph client and server on the same host.
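For illustration, a minimal sketch of an external-ceph StorageClass, assuming the in-tree kubernetes.io/rbd provisioner; the monitor addresses, pool, and secret names are placeholders, and the actual manifests may use the out-of-tree rbd-provisioner instead:

```shell
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-rbd
provisioner: kubernetes.io/rbd
parameters:
  # Placeholder mon addresses for the external (proxmox-hosted) cluster
  monitors: 10.0.0.1:6789,10.0.0.2:6789,10.0.0.3:6789
  adminId: admin
  adminSecretName: ceph-admin-secret
  adminSecretNamespace: kube-system
  pool: kube
  userId: kube
  userSecretName: ceph-user-secret
  fsType: ext4
  imageFormat: "2"
  imageFeatures: layering
EOF
```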
After migrating to an externalized ceph cluster (provided by proxmox), the problems have not recurred.
So far, this seems to support the narrative that running client workloads (rbd clients) on the same host as the server (ceph OSD) can cause kernel client deadlocks.
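For future reference, a quick way to check a node for that deadlock symptom (libceph errors plus tasks stuck in uninterruptible sleep); the exact log strings vary by kernel version:

```shell
# Kernel log: libceph/rbd errors and hung-task warnings are the usual symptom
dmesg -T | grep -iE 'libceph|rbd|blocked for more than'

# Load and processes stuck in uninterruptible (D) state
uptime
ps -eo state,pid,comm | awk '$1 == "D"'
```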
Almost a full week of running a lot of workload on the external ceph storage and not a single issue or disruption so far. This may be the final answer. Storage volumes migrated from NFS to ceph so far:
Remaining volumes to migrate:
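The per-volume lists aren't captured above. The data moves themselves can be done with a throwaway pod that mounts both the old NFS PVC and the new ceph PVC; a sketch with placeholder claim names:

```shell
# Hypothetical helper pod mounting the source (NFS) and destination (ceph) PVCs
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: volume-migrate
spec:
  containers:
  - name: migrate
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: src
      mountPath: /src
    - name: dst
      mountPath: /dst
  volumes:
  - name: src
    persistentVolumeClaim:
      claimName: app-data-nfs
  - name: dst
    persistentVolumeClaim:
      claimName: app-data-ceph
EOF

# Copy the data, then retarget the workload at the new PVC
kubectl exec volume-migrate -- sh -c 'cp -a /src/. /dst/'
kubectl delete pod volume-migrate
```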
With commit 1af3b71 this can finally be closed. In summary: It is apparently known that ceph cluster server components (OSDs) should not co-exist in the same kernel runtime space as ceph clients (i.e. pod workloads consuming storage from the OSDs). This means that ceph should only be run in a way that isolates the two runtimes. Currently this is difficult to do with rook. Will revisit rook/ceph once it's easier to do this (like running the OSDs on the 3 master nodes, away from the worker nodes).
Thanks for your detailed comments, wish I had found this issue sooner :) NFS really doesn't play nice with these guys.
@billimek I really don't think it's clients and OSDs on the same nodes that's causing this; in my scenario I could replicate it whilst doing an rsync from an mdraid array to cephfs on a machine that has no OSDs on it. I'm glad you found a solution that works for you, but I think the cephfs bug is potentially a version-related one. Could you check which version of ceph you're running in proxmox?
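For reference, checking the running ceph version on the proxmox host (or from the rook toolbox for an in-cluster deployment) is just:

```shell
# Version of the local ceph packages on the proxmox host
ceph version

# Versions reported by every running mon/mgr/osd daemon (luminous and later)
ceph versions
```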
Persistent Storage is a really big pain in the ass.

- NFS: hosted on the proxmox node, which means that other nodes are 'down' if the first node needs to go down. Realistically not sure how big of an issue this is.
- ceph:
  - proxmox-provided ceph: the /var/log filesystem filled up, which is also used by the proxmox root filesystem. The mons detected no disk space and shut themselves down, which resulted in a completely unusable ceph system.
  - rook-provided ceph: libceph errors being spit out in dmesg. Not cool.
- longhorn: It's 'alpha' and I should have known better.