
k8s-dqlite spiking cpu core to 100% #3227

Open
WilliamG-LORA opened this issue Jun 8, 2022 · 175 comments
@WilliamG-LORA

WilliamG-LORA commented Jun 8, 2022

Summary

I've set up a 4-node microk8s cluster on bare metal machines. Every now and then the /snap/microk8s/3204/bin/k8s-dqlite process will spike one of the cores on one of my nodes to 100% usage, sending my fans into overdrive.

All the other cores are running at <6% usage and RAM is hardly used:
[htop screenshot]

The specs of the machines are as follows:

  • Node 1:
    • CPU: AMD Threadripper 1950X
    • RAM: 64GB
  • Node 2:
    • CPU: i7-7820X
    • RAM: 64GB
  • Node 3:
    • CPU: i7-9700
    • RAM: 32GB
  • Node 4:
    • CPU: i7-9700K
    • RAM: 64GB

The cluster has the metallb, dns, rbac, and storage addons enabled.
I've also deployed Rook-Ceph on the cluster.

What Should Happen Instead?

It shouldn't be using over 100% of a core.

Reproduction Steps

  1. Create a microk8s cluster
  2. Deploy Rook-Ceph
  3. Wait a bit.
  4. I'm not sure how to properly reproduce this issue...

Introspection Report

inspection-report-20220608_143601.tar.gz

@maximemoreillon

maximemoreillon commented Jun 14, 2022

I am facing the same issue on multiple AWS EC2 instances, each running a single-node MicroK8s instance.

Microk8s version: 1.23 classic

Enabled addons:

  • DNS
  • Storage
  • RBAC

For example, here is a screenshot of htop on an EC2.XLarge (16 GB memory):

[htop screenshot]

Microk8s was running smoothly until this week.

On the other hand, instances running microk8s version 1.18 were not affected.

@bc185174

We found similar results, where the dqlite service on the leader node was hitting 100% usage. As mentioned in a few other issues, dqlite is sensitive to slow disk performance. In other scenarios, such as a node drain, it took a while to write to the database. You should see something similar to microk8s.daemon-kubelite[3802920]: Trace[736557743]: ---"Object stored in database" 7755ms in the logs. We found that on our cluster it took around 18000ms to write to the datastore and dqlite could not cope with it. As a result, it led to leader election failures and kubelite service panics.
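For anyone checking whether they are hitting the same slow-write pattern, a rough sketch of how to surface these traces on a node (the unit name is the kubelite service as listed by microk8s.inspect; the duration shown is just whatever the apiserver logged):

sudo journalctl -u snap.microk8s.daemon-kubelite --since "1 hour ago" \
  | grep "Object stored in database"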

We monitored the CPU and RAM utilisation for dqlite and compared it to etcd under the same workload and conditions.

Dqlite Idle

image

Etcd Idle

image

@benben

benben commented Dec 1, 2022

Can confirm this on Proxmox virtualized VMs. There is a constant high load on a 3-node cluster:

[monitoring screenshot]

@djjudas21

I'm seeing something similar. I was running a 4-node HA cluster but it failed (see #3735), so I removed 2 nodes to disable HA mode and hopefully restore quorum; I'm now running 2 nodes, 1 of which is master. The master has a dqlite process rammed at 100% CPU. Running iotop shows aggregate disk transfer of only a hundred KB/s. The dqlite log shows various transaction traces; mostly they complete in 500-700ms, but occasionally I get a much slower one.

Feb 08 09:08:51 kube05 microk8s.daemon-kubelite[1456693]: Trace[1976283013]: ["GuaranteedUpdate etcd3" audit-id:70abc60c-f365-4744-96fc-a404d34de11b,key:/leases/kube-system/kube-apiserver-b3nhikmrwakntwutkwiesxox4e,type:*coordination.Lease,resource:leases.coordination.k8s.io 7005ms (09:08:44.693)
Feb 08 09:08:51 kube05 microk8s.daemon-kubelite[1456693]: Trace[1976283013]:  ---"Txn call completed" 7005ms (09:08:51.699)]
Feb 08 09:08:51 kube05 microk8s.daemon-kubelite[1456693]: Trace[1976283013]: [7.005829845s] [7.005829845s] END

Hardware isn't exactly a rocketship but all my nodes are i5-6500T with 4 cores, 16GB memory, 256GB SSD and that should be adequate. Most of my workloads are not running at the moment, either.

@cole-miller
Contributor

cole-miller commented Feb 9, 2023

Hi all. I work on dqlite, and I'm going to try to figure out what's causing these CPU usage spikes. If you're experiencing this issue on a continuing basis and are in a position to collect some diagnostic info, I could use your help! Short of a reproducer that I can run myself (which I realize is difficult for this kind of complex system), the data I'd find most helpful would be a sampling profiler report showing where the k8s-dqlite process is spending its time during one of these spikes. A separate report for the same process and workload during a period of nominal CPU usage would also be great, so I can compare the two and see if anything stands out. You can gather this information as follows:

  1. Install the perf command-line tool on all machines in your cluster. On Ubuntu this is part of the linux-tools package (you'll have to pick a "flavor" like linux-tools-generic).
  2. Collect a profile by ssh-ing into the affected node and running perf record -F 99 --call-graph dwarf -p <pid>, where <pid> is the PID of the k8s-dqlite process. That command will keep running and collecting samples until you kill it with Ctrl-C.
  3. Upload the generated perf.data file to somewhere I can access. (It doesn't contain a core dump or anything else that might be sensitive, just backtraces.) Please also share the version of microk8s that you're using.

If the spikes last long enough that you can notice one happening and have time to ssh in and gather data before it ends, do that. Otherwise, since it's probably not feasible to just leave perf running (perf.data would get too huge), you could have a script like

# Rotate between two capture files so perf.data doesn't grow without bound;
# each pass records ~60s of samples, so the last couple of minutes are always on disk.
i=0
while true; do
        rm -f perf.data.$i
        i=$(( ( i + 1 ) % 2 ))
        perf record -F 99 --call-graph dwarf -o perf.data.$i -p <pid> sleep 60
done

Or with a longer timeout, etc. Then if you notice after the fact that a spike has occurred, you hopefully still have the perf.data file for that time period around.
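One extra step that can save a round trip: before uploading, it's worth sanity-checking that the capture resolved symbols (a sketch, assuming perf is installed as described above):

perf report --stdio -i perf.data | head -n 40

If most entries show up as raw hex addresses rather than function names, the report is much less useful; installing the linux-tools flavour matching your kernel and the debug symbols for the libraries involved usually helps.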

Thanks for your reports -- let's get this issue fixed!

@djjudas21

Hey @cole-miller, thanks for looking at this.

I've done a perf capture for you, but it is worth noting a couple of things:

  1. Mine aren't transient CPU spikes, dqlite just hammers the CPU at 100% from the moment it is started
  2. In a separate issue I'm investigating with @ktsakalozos whether this high CPU is caused by corruption of the dqlite database

With those caveats, here's my attached perf.data, captured over about a minute on MicroK8s v1.26.1 (rev 4595). Hope it is useful to you.

perf.tar.gz

@doctorpangloss

@cole-miller likewise, k8s-dqlite doesn't spike, it's just at 100% all the time.

perf.data.zip

@doctorpangloss

It seems like a lot of people are affected by this issue.

@djjudas21

As well as my prod cluster being affected by this, last week I quickly threw together a 3-node MicroK8s cluster on v1.26 in VirtualBox to test something. No workloads. Initially it worked normally, but then I shut down all 3 VMs. When I booted them up again later, I had the dqlite 100% CPU problem. I didn't have time to look into it as I was working on something else, but it does show that it can happen on a new cluster that hasn't been "messed with".

@djjudas21

djjudas21 commented Mar 2, 2023

I understand that MicroK8s is free software, no guarantees, etc, but it is run by a prominent company like Canonical, so it is surprising/disappointing that there are quite a few serious, long-standing issues, affecting multiple users, that don't appear to be getting much attention from maintainers (for example this one, #3735 and #3204).

I have been running MicroK8s since v1.17 I think and it has generally been rock solid and only broken when I've fiddled with it. Since v1.25/v1.26 there seem to be chronic issues affecting stability. I have personally lost data due to this dqlite CPU problem (long story short, I lost dqlite quorum, which broke the kube api-server, but I was using OpenEBS/cStor clustered storage, which depends on kube quorum for its own quorum. When it lost quorum and the kube api-server became silently read-only, the storage controller got itself into a bad state and volumes could not be mounted).

A lot of people are talking about switching to k3s, and I don't want to be that guy who rants about switching, but it is something I will consider doing at next refresh. I note that k3s ditched dqlite in favour of etcd in v1.19. I don't know what their reasons were, but it was probably a good move.

@AlexsJones
Contributor

> I understand that MicroK8s is free software, no guarantees, etc, but it is run by a prominent company like Canonical so it is surprising/disappointing that there are quite a few serious, long-standing issues, affecting multiple users, that don't appear to be getting much attention from maintainers (for example this one, #3735 and #3204)
>
> I have been running MicroK8s since v1.17 I think and it has generally been rock solid and only broken when I've fiddled with it. Since v1.25/v1.26 there seem to be chronic issues affecting the stability. I have personally lost data due to this dqlite CPU problem (long story short, I lost dqlite quorum which broke the kube api-server, but I was using OpenEBS/cStor clustered storage, which depends on kube quorum for its own quorum. When it lost quorum and the kube api-server became silently read-only, the storage controller got itself into a bad state and volumes could not be mounted).
>
> A lot of people are talking about switching to k3s, and I don't want to be that guy who rants about switching, but it is something I will consider doing at next refresh. I note that k3s ditched dqlite in favour of etcd in v1.19. I don't know what their reasons were, but it was probably a good move.

Hi Jonathan,

I lead Kubernetes at Canonical and I wanted to firstly offer both an apology and a thank you for your help thus far. We know that there are situations where people like yourself are using MicroK8s and suddenly, unbeknownst to you, something goes wrong; not being able to easily solve that only compounds the problem.

Our ambition for MicroK8s is to keep it as simple as possible, both in day-to-day operation and in upgrading. I wanted to thank you again for taking the time to engage with us and help us try and improve our projects. Projects plural, as DQLite is also something we build and maintain, hence our investment in it as a low-ops K8s database backend. (That said, there is the option to run etcd with MicroK8s should you desire.)

This resilience issue is being taken extremely seriously and we are configuring machines to reproduce your environment, as we understand it, to the best of our abilities, and to work with our DQLite team counterparts to resolve any performance issues. (Please do let us know what your storage configuration is - localpath etc.)

I believe one thing that really sets MicroK8s apart from alternatives is that we have no secret agenda here.
We are here to serve and help our community grow and in that is a promise of working together to make sure our end-users and community members are assisted as much as humanly possible. We will not rest until any potential issues have been exhaustively analysed; independently of whether this is a quirk of setup or environment.

All that said, I will do the following immediately:

  • Ensure our team has picked this up with the DQLite team to analyze the performance results
  • Run a benchmark and sanity test against our own lab equipment
  • Test scenarios like those mentioned (e.g. node drain with a workload like Ceph on dqlite)
  • Keep you abreast of updates in this thread.

@djjudas21

Thanks @AlexsJones, I really appreciate the detailed response. It's good to hear that there is more going on behind the scenes than I was aware of. I'll be happy to help with testing and providing inspection reports, etc. I've generally had really good experiences interacting with the Canonical team on previous issues (@ktsakalozos in particular has been really helpful).

My specific environment is 4 identical hardware nodes, each with a SATA SSD which has the OS (Ubuntu LTS 22.04) and also the Snap stuff, so that will include the dqlite database. Each node also has an M.2 NVMe which is claimed by OpenEBS/cStor for use as clustered storage.

I don't use any localpath storage. I do also have an off-cluster TrueNAS server which provides NFS volumes via a RWX storage class.

I'm actually part-way through a series of blog posts and the first part covers my architecture. The third part was going to be about OpenEBS/cStor but then it all went wrong, so I'm holding off on writing that!

@AlexsJones
Contributor

Thanks for the additional detail. I've set up a bare-metal cluster (albeit lower scale than yours) and will look to install OpenEBS/cStor with rook-ceph. We will then conduct a soak test and a variety of interrupts to generate data.

@djjudas21

Thanks. I installed my cStor from Helm directly, rather than using Rook. I've just made a gist with my values file so you can create a similar environment if you need to, although I think the root cause of my problem was with dqlite rather than cStor.

@AlexsJones
Contributor

> Thanks. I installed my cStor from Helm directly, rather than using Rook. I've just made a gist with my values file so you can create a similar environment if you need to, although I think the root cause of my problem was with dqlite rather than cStor.

It's still worth investigating as there might be a disk activity correlation - Will set this up now

@AlexsJones
Contributor

> Thanks @AlexsJones, I really appreciate the detailed response. It's good to hear that there is more going on behind the scenes than I was aware of. I'll be happy to help with testing and providing inspection reports, etc. I've generally had really good experiences interacting with the Canonical team on previous issues (@ktsakalozos in particular has been really helpful).
>
> My specific environment is 4 identical hardware nodes, each with a SATA SSD which has the OS (Ubuntu LTS 22.04) and also the Snap stuff, so that will include the dqlite database. Each node also has an M.2 NVMe which is claimed by OpenEBS/cStor for use as clustered storage.
>
> I don't use any localpath storage. I do also have an off-cluster TrueNAS server which provides NFS volumes via a RWX storage class.
>
> I'm actually part-way through a series of blog posts and the first part covers my architecture. The third part was going to be about OpenEBS/cStor but then it all went wrong, so I'm holding off on writing that!

Are you using CephFS or RBD? If so, how's that interacting with the cStor SC?

@djjudas21

I'm not using CephFS or RBD, only OpenEBS & cStor

@AlexsJones
Contributor

>   • Create a microk8s cluster
>   • Deploy Rook-Ceph

Okay, I saw it was mentioned at the top of the thread, thanks!

@djjudas21

> Okay, I saw it was mentioned at the top of the thread, thanks!

No worries, this isn't my thread originally, but I am affected by the same dqlite CPU spike

@cole-miller
Contributor

cole-miller commented Mar 2, 2023

Hi @djjudas21, @doctorpangloss -- thanks very much for uploading the perf files, they're quite useful for narrowing down the root of the problem. Here are the resulting flamegraphs:

  1. For @djjudas21's data: djjudas21.svg
  2. For @doctorpangloss's data: doctorpangloss.svg

As you can see, CPU usage is dominated by calls to sqlite3_step. Looking at just the children of that function, the big contributions are from SQLite code, with some contributions also from dqlite's custom VFS (about 14% of the grand total), much of which boils down to calls to memcpy (9%). So my preliminary conclusion is that most of the CPU cycles are spent in SQLite, in which case the route to fixing this problem likely lies in optimizing the requests sent by microk8s to dqlite (via the kine bridge). But I'll continue to investigate whether any parts of dqlite stand out as causing excessive CPU usage.
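For anyone who wants to produce the same view from their own perf.data, this is roughly the standard FlameGraph workflow (a sketch; it assumes Brendan Gregg's FlameGraph scripts from https://github.com/brendangregg/FlameGraph are cloned next to the capture, which may not be exactly what was used here):

perf script -i perf.data > out.perf
./FlameGraph/stackcollapse-perf.pl out.perf > out.folded
./FlameGraph/flamegraph.pl out.folded > flamegraph.svg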

@cole-miller
Contributor

cole-miller commented Mar 2, 2023

One possible issue: dqlite runs sqlite3_step in the libuv main thread, so if calls to sqlite3_step are taking quite a long time then we're effectively blocking the event loop -- which could have bad downstream consequences like Raft requests timing out and causing leadership churn. @djjudas21, @doctorpangloss, and anyone else who's experiencing this issue, it'd be very helpful if you could follow these steps to generate some data about the distribution of time spent in sqlite3_step:

  1. On an affected node, install bpftrace: sudo apt install bpftrace
  2. Find the PID of k8s-dqlite, then run
    $ sudo bpftrace -p $k8s_dqlite_pid -e 'uprobe:libsqlite3:sqlite3_step { @start[tid] = nsecs; } uretprobe:libsqlite3:sqlite3_step { @times = hist(nsecs - @start[tid]); delete(@start[tid]); }'
    
    That will keep running and gathering data until you kill it with Ctrl-C, and print an ASCII art histogram when it exits, which you can post in this issue thread.

Thanks again for your willingness to help debug this issue!

@sbidoul

sbidoul commented Mar 2, 2023

@cole-miller I could capture the requested histogram during about 3 minutes of such an event:

[512, 1K)          11070 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1K, 2K)            3758 |@@@@@@@@@@@@@@@@@                                   |
[2K, 4K)            4186 |@@@@@@@@@@@@@@@@@@@                                 |
[4K, 8K)            3882 |@@@@@@@@@@@@@@@@@@                                  |
[8K, 16K)            882 |@@@@                                                |
[16K, 32K)           966 |@@@@                                                |
[32K, 64K)          1048 |@@@@                                                |
[64K, 128K)          494 |@@                                                  |
[128K, 256K)         428 |@@                                                  |
[256K, 512K)          81 |                                                    |
[512K, 1M)            18 |                                                    |
[1M, 2M)               8 |                                                    |
[2M, 4M)            2208 |@@@@@@@@@@                                          |
[4M, 8M)            1271 |@@@@@                                               |
[8M, 16M)            267 |@                                                   |
[16M, 32M)            50 |                                                    |
[32M, 64M)            18 |                                                    |
[64M, 128M)            1 |                                                    |
[128M, 256M)           0 |                                                    |
[256M, 512M)           0 |                                                    |
[512M, 1G)             0 |                                                    |
[1G, 2G)               0 |                                                    |
[2G, 4G)              10 |                                                    |

This is a 3-node cluster with two 12-CPU nodes and one 2-CPU node. One particularity (?) is that there is relatively high (~10ms) latency between nodes 1 and 2 and node 4. Below is the CPU load measurement during the event. This event was on the little node, but it can equally happen on the big nodes, sending CPU load to 300% (apparently due to iowait, not sure).

[CPU load graph]

Here is a perf capture (which is probably not good quality due to missing /proc/kallsyms?).

@sbidoul

sbidoul commented Mar 2, 2023

Here is another view of the same 15-minute event, obtained with Python's psutil.Process(pid).cpu_times().

[CPU time graph]

@sbidoul

sbidoul commented Mar 2, 2023

As a side note, I have always been wondering what dqlite is doing to consume 0.2 CPU when the cluster is otherwise idle. Although I don't want to divert this thread if this is unrelated.

@doctorpangloss

sudo bpftrace -p $(pidof k8s-dqlite) -e 'uprobe:libsqlite3:sqlite3_step { @start[tid] = nsecs; } uretprobe:libsqlite3:sqlite3_step { @times = hist(nsecs - @start[tid]); delete(@start[tid]); }'
Attaching 2 probes...
^C

@start[13719]: 503234295412429
@times: 
[1K, 2K)            6297 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2K, 4K)            5871 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    |
[4K, 8K)            1597 |@@@@@@@@@@@@@                                       |
[8K, 16K)           4113 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                   |
[16K, 32K)           816 |@@@@@@                                              |
[32K, 64K)           542 |@@@@                                                |
[64K, 128K)          397 |@@@                                                 |
[128K, 256K)         500 |@@@@                                                |
[256K, 512K)          59 |                                                    |
[512K, 1M)            17 |                                                    |
[1M, 2M)              13 |                                                    |
[2M, 4M)               0 |                                                    |
[4M, 8M)               0 |                                                    |
[8M, 16M)              0 |                                                    |
[16M, 32M)             0 |                                                    |
[32M, 64M)          3078 |@@@@@@@@@@@@@@@@@@@@@@@@@                           |
[64M, 128M)           71 |                                                    |
[128M, 256M)           0 |                                                    |
[256M, 512M)           0 |                                                    |
[512M, 1G)             0 |                                                    |
[1G, 2G)               0 |                                                    |
[2G, 4G)              16 |                                                    |

@cole-miller
Contributor

cole-miller commented Mar 2, 2023

@sbidoul, @doctorpangloss -- thanks! It does look like a substantial number of our calls to sqlite3_step are taking milliseconds to tens of milliseconds, which is not outrageous but is long enough that we should probably try to move those calls out of the main thread. (We will want to do this in any case for the experimental disk mode.) I will try to work up a patch that does that and we'll make it available for you to try out when it's ready.

(I don't know what to think about the small number of calls that took entire seconds to complete.)

@doctorpangloss

Would it be helpful to have a reproducible environment? I can give you access to a hardware cluster reproduction of what is executing here.

@cole-miller
Contributor

@doctorpangloss Yes, that would be very helpful! My SSH keys are listed at https://github.com/cole-miller.keys, and you can contact me at cole.miller@canonical.com to share any private details.

@cole-miller
Contributor

cole-miller commented Mar 3, 2023

@sbidoul I had some trouble generating a flamegraph or perf report from your uploaded data -- if you get the chance, could you try following these steps (first source block, with your own perf.data file) on the machine in question and uploading the SVG? It seems like some debug symbols may be missing from my repro environment, or perhaps we have different builds of SQLite.

Re: your second graph, I'm a little confused because it looks like CPU usage for dqlite goes down during the spike event, and it's the kubelite process that's responsible for the spike. Am I misinterpreting?

> As a side note, I have always been wondering what dqlite is doing to consume 0.2 CPU when the cluster is otherwise idle. Although I don't want to divert this thread if this is unrelated.

The main thing that dqlite has to do even in the steady state where it's not receiving any client requests is to exchange Raft "heartbeat" messages with other nodes, so that they don't think it has crashed. If you can gather perf data for one of those idle periods I'd be happy to try to interpret the results (it would be educational for me too, and we might uncover something unexpected).

@rich0

rich0 commented Feb 3, 2024

I ran into this issue, with my cluster becoming very unstable, dqlite going to 100% on the leader node, and kubelite constantly crashing.

I ran the query:
select TIME(),max(id)-(SELECT crkv.prev_revision FROM kine crkv WHERE crkv.name = 'compact_rev_key') from kine;
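(For reference, this can be run through the dqlite client bundled with the snap; a sketch, based on the client invocation shared further down this thread, run on a datastore node:)

sudo /snap/microk8s/current/bin/dqlite \
  -s file:///var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml \
  -c /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt \
  -k /var/snap/microk8s/current/var/kubernetes/backend/cluster.key \
  k8s "select TIME(),max(id)-(SELECT crkv.prev_revision FROM kine crkv WHERE crkv.name = 'compact_rev_key') from kine"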

The result was 50-100k records and it was trending upwards slowly.

I did some of the dqlite recovery suggestions from the blog post (copying the files from the most current dqlite node to the others) and that didn't seem to help.

However, what did seem to help was stopping kubelite on all nodes and then just leaving dqlite running. The uncompacted record count gradually trended downwards. The dqlite process eventually dropped below 100% CPU and the NVMe went to 100%, so it became IO-bound instead of CPU-bound. The uncompacted record count dropped to a few thousand. For some reason the memory use on that process started climbing and I had to restart the unit when it got to 28GB of RAM. Then a new leader took over, went to 100% CPU, and the record count went all the way back up to 90k or so. However, it eventually dropped and seemed stable.

At this point I restarted kubelite and for the most part it ran. API calls (e.g. kubectl get nodes) returned results very quickly; previously they would take maybe 10+ seconds to return (not normal in this cluster).

However, I suspect there is some kind of damage within the database as API calls return data that appears incorrect. For example, I'm running rook, and it looks like all my OSDs are running (ceph status reports this). However, a dump of the OSD pods from kubectl lists many as terminating/pending. The processes are clearly running.

I might play around with this a bit more to see if I can resurrect the cluster. If not I'll try to back up the rook state and probably will follow the rook disaster recovery and try to get my OSDs working in k3s or something, and rebuild the cluster.

@rich0

rich0 commented Feb 3, 2024

A bit of an update: after rebooting a few nodes I now have a consistent API that looks normal (pod states reflect reality). However, if I start increasing the load on the cluster (especially with monitoring - prometheus/loki/etc) the dqlite leader saturates IO on its SSD and the record count starts creeping upwards again. Stopping kubelite on all nodes allows it to catch up. However, even with a fairly minimal load on the cluster dqlite is running near 100% CPU on the leader.

My theory is that there is something wrong with the database causing it to require more resources to do the same work over time. If it saturates IO or CPU then it turns into a positive feedback loop.

@miro-balaz
Contributor

@snydergd @rich0 I just finished a blog post about the issue: https://miro-balaz.github.io/microk8s/sqlite/k8s-dqlite/2024/02/03/fixing-microk8s-database-overgrowth.html. It has more queries and details.

It would be good if you mention the microk8s version that you are using.

@rich0

rich0 commented Feb 3, 2024

@miro-balaz that was extremely helpful.

I'm running v1.28.3 (mostly). I did roll back two nodes to 1.27 to see if an update a few days ago caused the issue. That didn't seem to help and so I probably should get them back on 1.28.

I had 147k gap records in the database. After deleting them I have only 4.6k records, with no name having more than 300 individually. CPU use did drop a bit, though with my reduced cluster size it wasn't too unstable.
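(For others following along: a rough way to see how many kine rows are these filler records, and which names dominate, is something like the queries below, run in the dqlite shell. The 'gap-%' name pattern is an assumption based on kine's gap-fill logic; see the blog post linked above for the actual cleanup queries, and back up /var/snap/microk8s/current/var/kubernetes/backend before deleting anything.)

-- count suspected gap/filler rows (assumes kine names them 'gap-<id>')
SELECT COUNT(*) FROM kine WHERE name LIKE 'gap-%';
-- see which keys carry the most revisions
SELECT name, COUNT(*) AS revisions FROM kine GROUP BY name ORDER BY revisions DESC LIMIT 10;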

Honestly, I'm not eager to rebuild my entire cluster and recover Rook, so if this is salvageable and ideally fixed in a future patch release that would be wonderful!

@miro-balaz
Contributor

@rich0 It seems there was a big change between 1.26 and 1.27, so if your situation is still bad you can try to downgrade once more. Or just grab a k8s-dqlite binary of version at most 1.1.2 from somewhere and replace it. I am not sure which version of microk8s uses which version of k8s-dqlite; I think 1.27 might have used anything from 1.1.0 to 1.1.5 depending on the minor version. The big change is between 1.1.2 and 1.1.3.

But there will be a fix for this later, so if you are fine you can wait.

@rich0

rich0 commented Feb 3, 2024

So, this definitely fixed the immediate problem, but I noticed that at least on my slower SSDs dqlite is struggling to keep up if I have monitoring enabled. I suspect it is just too much event/API/etc traffic, plus there are a lot of pods/daemonsets.

So, I suspect that maybe my cluster has just outgrown dqlite performance, at least in its current state and with my hardware (hardly high-end, but still, even cheap SSDs should be able to run a few dozen pods). I'm going to look to migrate to etcd after some planning, but this bought me some time. I can definitely endorse your blog, and I suspect that a mix of the performance of dqlite itself and its tendency to create gap records is causing issues. Stopping kubelite on most nodes will help it catch up once the gap records are purged, though obviously that means failing pods etc. will no longer be restarted.

@CEDDM

CEDDM commented Mar 12, 2024

Hi all !
I'm facing the same kind of issue with my dev cluster on a VM running with VirtualBox (Ubuntu 22.04, 32GB, 8 vCPU).
Strange thing: the VM was working fine on Win10 and it seems this very same VM became very slow after upgrading to Win11.
I tried to create a brand new VM, still on Ubuntu 22.04, 32GB, 8 vCPU, and I had the same problems right after installing microk8s (nothing installed on K8s!).

Here is a top with kubelite and dqlite consuming a lot of CPU with microk8s (nothing installed):
[top screenshot]

Another weird thing: if I give fewer CPUs to the VM, it seems to be better. Top with 4 vCPU:
[top screenshot]

But in fact, in both cases microk8s.status tells me that microk8s is not running, and microk8s.inspect responds:

Inspecting system
Inspecting Certificates
Inspecting services
  Service snap.microk8s.daemon-cluster-agent is running
  Service snap.microk8s.daemon-containerd is running
  Service snap.microk8s.daemon-kubelite is running
  Service snap.microk8s.daemon-k8s-dqlite is running
  Service snap.microk8s.daemon-apiserver-kicker is running
  Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
  Copy processes list to the final report tarball
  Copy disk usage information to the final report tarball
  Copy memory usage information to the final report tarball
  Copy server uptime to the final report tarball
  Copy openSSL information to the final report tarball
  Copy snap list to the final report tarball
  Copy VM name (or none) to the final report tarball
  Copy current linux distribution to the final report tarball
  Copy asnycio usage and limits to the final report tarball
  Copy inotify max_user_instances and max_user_watches to the final report tarball
  Copy network configuration to the final report tarball
Inspecting kubernetes cluster
  Inspect kubernetes cluster
Inspecting dqlite
  Inspect dqlite
cp: cannot stat '/var/snap/microk8s/6364/var/kubernetes/backend/localnode.yaml': No such file or directory

Building the report tarball
  Report tarball is at /var/snap/microk8s/6364/inspection-report-20240312_100123.tar.gz

Hope it can help.

@ktsakalozos
Member

Thank you for reporting this @CEDDM. If you could attach the inspection tarball we may be able to figure out what is happening.

@doctorpangloss

doctorpangloss commented Mar 12, 2024

@ktsakalozos It's been a long time. microk8s has to switch back to exclusively using etcd, especially for HA setups; etcd works flawlessly and is not challenging to deploy, as k0s shows. The underlying problem is the lack of telemetry of workloads, which you are never going to be able to collect.

@CEDDM

CEDDM commented Mar 13, 2024

Hi @ktsakalozos
Here is the tarball from the fresh install (microk8s 1.29) which does not start:
inspection-report-20240312_154557.tar.gz
I can't share the tarball of the other VM (microk8s 1.25) for now because it contains some information about my company. I will see if I can clean it up before sharing.

@CEDDM

CEDDM commented Mar 13, 2024

Here is a zip of the cleaned tarball from the other VM (Ubuntu 22.04, 32GB, 8 vCPU) with microk8s 1.25 running but consuming all CPU:
inspection-report-20240312_101634.zip
[top screenshot]
Some pods go through tens of restarts before running; others never manage to start even after waiting half an hour (it's a very small cluster, I only have 10 pods).

@masterkain

masterkain commented May 8, 2024

Same issue. Today my microk8s completely crashed after just a few days of running with no real workloads, eating up all memory and CPU. After a hard reset it just eats all the CPU/RAM again and won't start any container.

May 08 13:03:09 microk8s microk8s.daemon-k8s-dqlite[919]: time="2024-05-08T13:03:09Z" level=error msg="fail to list latest changes: query (try: 0): theid"
May 08 13:03:09 microk8s microk8s.daemon-k8s-dqlite[919]: time="2024-05-08T13:03:09Z" level=error msg="fail to list latest changes: query (try: 0): theid"
May 08 13:03:09 microk8s microk8s.daemon-k8s-dqlite[919]: time="2024-05-08T13:03:09Z" level=error msg="fail to list latest changes: query (try: 0): theid"
May 08 13:03:09 microk8s microk8s.daemon-k8s-dqlite[919]: time="2024-05-08T13:03:09Z" level=error msg="fail to list latest changes: query (try: 0): theid"
May 08 13:03:09 microk8s microk8s.daemon-k8s-dqlite[919]: time="2024-05-08T13:03:09Z" level=error msg="fail to list latest changes: query (try: 0): theid"
May 08 13:03:09 microk8s microk8s.daemon-k8s-dqlite[919]: time="2024-05-08T13:03:09Z" level=error msg="fail to list latest changes: query (try: 0): theid"
May 08 13:03:09 microk8s microk8s.daemon-k8s-dqlite[919]: time="2024-05-08T13:03:09Z" level=error msg="fail to list latest changes: query (try: 0): theid"
May 08 13:03:09 microk8s microk8s.daemon-k8s-dqlite[919]: time="2024-05-08T13:03:09Z" level=error msg="fail to list latest changes: query (try: 0): theid"
May 08 13:03:09 microk8s microk8s.daemon-k8s-dqlite[919]: time="2024-05-08T13:03:09Z" level=error msg="error while range on /registry/crd.projectcalico.org/ipamblocks/10-1-128-192-26 : query (try: 0): theid"
May 08 13:03:09 microk8s microk8s.daemon-k8s-dqlite[919]: time="2024-05-08T13:03:09Z" level=error msg="fail to list latest changes: query (try: 0): theid"
May 08 13:03:09 microk8s microk8s.daemon-k8s-dqlite[919]: time="2024-05-08T13:03:09Z" level=error msg="fail to list latest changes: query (try: 0): theid"
May 08 13:03:09 microk8s microk8s.daemon-k8s-dqlite[919]: time="2024-05-08T13:03:09Z" level=error msg="fail to list latest changes: query (try: 0): theid"
May 08 13:03:09 microk8s microk8s.daemon-k8s-dqlite[919]: time="2024-05-08T13:03:09Z" level=error msg="fail to list latest changes: query (try: 0): theid"

after months of no fixes I switched to k3s

@doctorpangloss

@cole-miller @ktsakalozos @AlexsJones this is you guys for the last 2 years
[image]

@bigwookie

Has anyone tried v1.30 of MicroK8s to see if the mentioned dqlite improvements have made a difference?

@r2DoesInc

r2DoesInc commented May 22, 2024

My cluster has been inoperable for the past two weeks due to this bug. I have a few nodes running on 8GB RPi4 boards that are so impacted by this issue that I can't even SSH into them. In the few cases where I have been able to reach them, I'm actually seeing upwards of 200% CPU usage and 100% memory usage.

Even my 32GB RAM Mac mini node is running into this same issue.

At this point my cluster has completely imploded due to this CPU/memory overutilization.

@r2DoesInc

> Has anyone tried the v1.30 of m8s to see if the mentioned improvement of dqlite has made a difference?

I am using 1.30/stable currently and saw no difference

@maximemoreillon

maximemoreillon commented Jun 2, 2024

Same problem here: dqlite is using massive amounts of CPU on a 3-node Microk8s v1.30 cluster.
Nodes are AWS EC2 instances running Ubuntu Server 20.04 LTS.
Running sudo systemctl restart snap.microk8s.daemon-k8s-dqlite.service helps mitigate the problem.

@r2DoesInc

r2DoesInc commented Jun 2, 2024 via email

@EmilCarpenter

EmilCarpenter commented Jun 18, 2024

microk8s-k8s-dqlite-debug-github.com-issue-3227-A.txt
microk8s-k8s-dqlite-debug-github.com-issue-3227-B.txt

@ktsakalozos log files with --debug according to your instructions from a comment above.

No pods created over the standard ones.

microceph is installed, too.
microceph has a bug that fills the disk with logs at a rate of about 1G/h.
Even if log level is set to 'warning', it resets to default 'debug' after reboot.

The high CPU usage by k8s-dqlite, kubelite and rsyslog started on most nodes when disks got full of logs.

The attached log files are from worker1, which has more than 1G of disk available and goes to 100% CPU usage maybe 5-10 minutes after reboot.

There are lots of entries with the IP 10.0.2.15, which maybe shouldn't be there at all. That is the VirtualBox NAT IP. There is another host-only network which was used to add the worker nodes, 192.168.60.0/24. I should probably configure the microk8s cluster for the 192 network at install time; I will look into that.


Edit 1:
Without this issue, total CPU usage is 2%-15%.

Edit 2:
The 'workers' are just named that way; they were joined without the '--worker' flag.

Edit 3: Possible reason

Time sync was not implemented/working (due to me not installing Guest Additions on each VirtualBox VM).
Observations showed the time being off by several days between the servers/VMs.
Not tested yet with time in sync.
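A quick way to confirm whether the clocks agree across the VMs (a sketch; assumes systemd-based Ubuntu guests, and chronyc is only available if chrony is installed):

timedatectl status        # shows "System clock synchronized: yes/no" and the NTP service state
date -u                   # run on every node and compare the timestamps
chronyc tracking          # if chrony is used, shows the current offset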


4 x Ubuntu 24.04 Server VirtualBox VMs on an Ubuntu desktop host.
Master: 2 CPUs, 2G RAM
Workers x 3: 1 CPU, 1G RAM
microk8s version
MicroK8s v1.29.4 revision 6809

microceph --version
ceph-version: 18.2.0-0ubuntu3~cloud0; microceph-git: cba31e8c75

# Edit:
#  /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml
- ID: 3297041220608546238
  Address: 192.168.60.11:19001
  Role: 0
- ID: 1760248264364118143
  Address: 192.168.60.21:19001
  Role: 1
- ID: 12809918262664895569
  Address: 192.168.60.22:19001
  Role: 0
- ID: 6417727055371815105
  Address: 192.168.60.23:19001
  Role: 0
kubemaster@kmaster:~$ sudo microk8s status
microk8s is running
high-availability: yes
  datastore master nodes: 192.168.60.11:19001 192.168.60.22:19001 192.168.60.23:19001
  datastore standby nodes: 192.168.60.21:19001
addons:
  enabled:
    dashboard            # (core) The Kubernetes dashboard
    dns                  # (core) CoreDNS
    ha-cluster           # (core) Configure high availability on the current node
    helm                 # (core) Helm - the package manager for Kubernetes
    helm3                # (core) Helm 3 - the package manager for Kubernetes
    metallb              # (core) Loadbalancer for your Kubernetes cluster
    metrics-server       # (core) K8s Metrics Server for API access to service metrics
  disabled:
    ...

[screenshot]

Hope this helps in troubleshooting.

@bananu7

bananu7 commented Jun 27, 2024

I've just upgraded my cluster from 1.27 to 1.30. I run a small homelab with 3 nodes: 8GB, 8GB and 4GB of RAM, with 2 cores each. After the upgrade I noticed a lack of stability. The 4GB node got hit especially badly. Even though I drained the nodes during the update, the added load caused them to misbehave. Right now the 4GB node is cordoned and sitting at 90% CPU. I tried disabling and enabling the control plane, and it didn't help. That's when I found this thread.

I'm perplexed as to why this issue can stay open so long without a mitigation. I understand that m8s is an open-source project, but if it's meant to be production-ready, I should at least have a way of fixing my cluster with a hotfix. I've read through all 100+ comments and it seems the best option is to back up my Longhorn, pack everything in boxes and move to a fresh k3s install.

@r2DoesInc

r2DoesInc commented Jun 27, 2024 via email

@bananu7

bananu7 commented Jun 27, 2024

It seems that a downgrade of the offending node to 1.29.4 did help a bit (silver is 4GB):

[monitoring graph]

I'll downgrade the other ones and see what happens. I'd really like to avoid going for 1 control plane node if I can.

@doctorpangloss

> I'm perplexed as to why this issue can be open so long without a mitigation.

Beats me. I would have started the migration to etcd away from dqlite 2 years ago. k3s used to have the same problems and stopped using dqlite in 2022.

@Dunge
Contributor

Dunge commented Jun 27, 2024

I don't believe version changes have much impact on whether the issue triggers or not. Nearly 2 years ago I was on 1.26 and got it sporadically; sometimes rebooting one node would fix it. I tried to upgrade to 1.27 and never managed to get it working, it was always stuck in this mode. I downgraded back to 1.26 and it's been running flawlessly ever since (nearly two years). I do believe it just requires one heavy change of node structure to trigger it, and it's the same issue that has lingered for years. So no, @bananu7, switching from 1.30 to 1.29 fixing it was probably dumb luck.

I somehow also believe it's related to disk usage. I have issues related to Longhorn, and other comments here mentioned Ceph and microceph struggling, and low disk space remaining.

Suggesting disabling HA is also completely ridiculous; HA is the only reason why we use Kubernetes in the first place.

@fmiqbal

fmiqbal commented Jun 28, 2024

Got this problem on my 3 masters, 8c/16G each (my mistake that I didn't isolate the master nodes from the worker nodes on my 500-pod cluster, but nonetheless).

One of the 3 nodes became unresponsive, hogging 100% CPU and 100% memory, and because swap is disabled as per the recommendation it had nowhere to go, so I tried to force-reboot it. But then another node became like that, so I ended up doing recovery, downgrading to a single-node cluster, and having to leave and rejoin the worker nodes, which also broke my Longhorn state and faulted 40 volumes (also my mistake).

During the migration to standalone master nodes (which will solely host microk8s, without any other k8s workload; 3 nodes with 4c/8g each, because I thought that would be enough), I found a few things:

  • a single-node master can start flapping: the kubelet times out because dqlite becomes slower, showing "Context Deadline Exceeded" with Txn calls completing in 10000ms
  • when the third node joins to form the HA cluster, dqlite receives and inflates the data, then grows in memory, draining all free memory, swapping, and maxing out the CPU; this only happens on the third node. The "solution" is just to wait it out. Adding swap helps a bit, but generally you want more memory (my 8G couldn't cope)
  • after I waited it out, memory and CPU usage dropped and it was kind of stable, but then it started flapping again because of slow database calls and something got killed. The weird thing is that when I do kubectl describe node, it says "kubelet : OOM, victim dqlite" or something like that; I can't quite get the exact log because I needed it to work ASAP
  • the main cause of the slow database calls, I think, is the gaps: I have 60k total records in kine and 40k of them are gaps. How the gaps appeared in the first place I don't know, but after I deleted them it became more stable

What helped me with debugging was enabling dqlite debug via a flag in /var/snap/microk8s/current/args/k8s-dqlite-env, checking which node is the dqlite leader with

sudo -E /snap/microk8s/current/bin/dqlite -c /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt -k /var/snap/microk8s/current/var/kubernetes/backend/cluster.key -s file:///var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml k8s ".leader"

accessing the dqlite database with

/snap/microk8s/current/bin/dqlite -s 10.10.8.62:19001 -c /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt -k /var/snap/microk8s/current/var/kubernetes/backend/cluster.key k8s

and checking the kubelite Txn durations with

sudo journalctl -f | grep Txn

The questions that remain are:

  • why is the HA-cluster creation on the third node so resource-heavy on dqlite?
  • and if it's because of the clutter of gaps, how can it be mitigated?

What I currently use is throw_more_ram™, although I think we need a benchmark on how many k8s resources dqlite can realistically handle, how big the storage under /var/snap/microk8s/current/var/kubernetes/backend/ gets, and how much memory is needed. Currently I have 20K kine entries, with a 700MB snapshots size, and RAM usage sitting at 11 GB (dqlite and kubelet).

On the other hand (or rather, back to the main topic of this thread), there is still occasionally constant 100% dqlite usage, but with my 8 cores I didn't worry too much.

@lucian1094

I'm also facing this issue. I want to install an external etcd cluster and restore the data from a dqlite backup. Has anyone tried this? Can it be done?

@doctorpangloss

>   • why is the HA-cluster creation on the third node so resource-heavy on dqlite?

because dqlite is buggy

> and if it's because of the clutter of gaps, how can it be mitigated?

By adopting etcd

> I'm also facing this issue. I want to install an external etcd cluster and restore the data from a dqlite backup. Has anyone tried this? Can it be done?

For your sake, I'd hope so.
