database locked situation while creating namespace should be handled more gracefully in the Charm #126

Open
motjuste opened this issue Dec 15, 2023 · 1 comment


@motjuste

Summary

The charm does not gracefully handle the "database is locked" RPC error when creating a new namespace via Juju.

2023-12-14-09:31:17 root ERROR [localhost] Command failed: juju add-model -c foundations-maas juju-system-microk8s microk8s_cloud
2023-12-14-09:31:17 root ERROR 1[localhost] STDOUT follows:
b''
2023-12-14-09:31:17 root ERROR 2[localhost] STDERR follows:
ERROR creating namespace "juju-system-microk8s": rpc error: code = Unknown desc = exec (try: 500): database is locked

In the syslog for the main machine:

Dec 14 09:31:11 microk8s-55-1-3 microk8s.daemon-kubelite[9310]: I1214 09:31:11.024783    9310 job_controller.go:562] "enqueueing job" key="rook-ceph/rook-ceph-csi-detect-version"
Dec 14 09:31:11 microk8s-55-1-3 microk8s.daemon-kubelite[9310]: I1214 09:31:11.106641    9310 replica_set.go:676] "Finished syncing" kind="ReplicaSet" key="rook-ceph/csi-rbdplugin-provisioner-5885496bf5" duration="83.646528ms"
Dec 14 09:31:11 microk8s-55-1-3 microk8s.daemon-kubelite[9310]: I1214 09:31:11.116299    9310 replica_set.go:676] "Finished syncing" kind="ReplicaSet" key="rook-ceph/csi-rbdplugin-provisioner-5885496bf5" duration="651.454µs"
Dec 14 09:31:11 microk8s-55-1-3 microk8s.daemon-kubelite[9310]: I1214 09:31:11.709716    9310 job_controller.go:562] "enqueueing job" key="rook-ceph/rook-ceph-csi-detect-version"
Dec 14 09:31:17 microk8s-55-1-3 microk8s.daemon-k8s-dqlite[11911]: time="2023-12-14T09:31:17Z" level=error msg="error in txn: exec (try: 500): database is locked"
Dec 14 09:31:17 microk8s-55-1-3 microk8s.daemon-kubelite[9310]: E1214 09:31:17.734550    9310 status.go:71] apiserver received an error that is not an metav1.Status: &status.Error{s:(*status.Status)(0xc00950fd48)}: rpc error: code = Unknown desc = exec (try: 500): database is locked
Dec 14 09:31:17 microk8s-55-1-3 microk8s.daemon-kubelite[9310]: I1214 09:31:17.734937    9310 trace.go:236] Trace[1854421033]: "Create" accept:application/json, */*,audit-id:c1a0ffa6-961c-4f85-92c5-da79207e583e,client:10.246.167.34,protocol:HTTP/2.0,resource:namespaces,scope:resource,url:/api/v1/namespaces,user-agent:Go-http-client/2.0,verb:POST (14-Dec-2023 09:31:15.682) (total time: 2052ms):
Dec 14 09:31:17 microk8s-55-1-3 microk8s.daemon-kubelite[9310]: Trace[1854421033]: ["Create etcd3" audit-id:c1a0ffa6-961c-4f85-92c5-da79207e583e,key:/namespaces/juju-system-microk8s,type:*core.Namespace,resource:namespaces 2030ms (09:31:15.704)
Dec 14 09:31:17 microk8s-55-1-3 microk8s.daemon-kubelite[9310]: Trace[1854421033]:  ---"Txn call failed" err:rpc error: code = Unknown desc = exec (try: 500): database is locked 2029ms (09:31:17.734)]
Dec 14 09:31:17 microk8s-55-1-3 microk8s.daemon-kubelite[9310]: Trace[1854421033]: [2.052580676s] [2.052580676s] END

What Should Happen Instead?

A "database is locked" error, which is most likely a transient condition while creating a namespace, should be handled more gracefully, for example by retrying after some time. If the retry has to happen on the Juju side instead, the error should at least be reported to Juju in a clearer way. See the feedback from the Juju team in the related issue filed with them, LP#2046471.
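For illustration only, the following is a minimal sketch of the kind of retry loop that could absorb such a transient lock. It assumes the kubernetes Python client and a plain time-based backoff; the names create_namespace_with_retry, attempts and delay are hypothetical, and whether the retry belongs in the charm or in Juju is exactly the open question tracked in LP#2046471.

import time

from kubernetes import client, config
from kubernetes.client.rest import ApiException


def create_namespace_with_retry(name: str, attempts: int = 5, delay: float = 2.0):
    """Create a namespace, retrying transient 'database is locked' failures."""
    config.load_kube_config()
    api = client.CoreV1Api()
    body = client.V1Namespace(metadata=client.V1ObjectMeta(name=name))

    for attempt in range(1, attempts + 1):
        try:
            return api.create_namespace(body)
        except ApiException as exc:
            # dqlite surfaces the lock as an opaque error string, so matching
            # on the message is the only signal available at this layer.
            if "database is locked" not in str(exc) or attempt == attempts:
                raise
            time.sleep(delay * attempt)  # linear backoff before the next try

Called as create_namespace_with_retry("juju-system-microk8s"), a loop like this would ride out a lock that clears within a few seconds while still surfacing persistent failures to the caller.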

Reproduction Steps

Such issues are difficult to reproduce, but a lot of information and a crash dump are available in our Solutions QA labs (accessible only to Canonical employees). In short, the steps were as follows:

  1. Using Juju, create a new microk8s_cloud with three nodes (provided by MAAS).
  2. Attempt to create a namespace in this cloud via Juju:
    $ juju add-model -c ... juju-system-microk8s microk8s_cloud

Environment

MicroK8s charm track: 1.28/stable
Juju version: 3.1.6
Cloud: MAAS

Additional info, logs

Apart from the log excerpts above, much more information can be found in the Solutions QA labs, which are accessible only to Canonical employees. There, the relevant crash-dump artefact is at generated/generated/microk8s/juju-crashdump-microk8s-2023-12-14-09.32.16.tar.gz.

@ktsakalozos
Member

Hi @motjuste, thank you for reporting this. We often see this kind of behavior when the node is under I/O starvation. Usually there is a background process (it does not need to be a pod/container) performing some I/O-heavy task, so the Kubernetes datastore does not get the chance to write to disk and distribute the write operation to its peers. The failure to create the namespace is just a symptom. In the reported case the write failed because the DB could not be reached for two seconds, during which the request was retried 500 times. What else was happening on the physical node at that time? Maybe Ceph was syncing?
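One hedged way to check the I/O-starvation theory on such a node is to sample the kernel's Pressure Stall Information while the deployment runs. The snippet below is an assumed diagnostic aid, not part of the charm, and requires a kernel with PSI enabled (i.e. /proc/pressure/io present).

def read_io_pressure(path: str = "/proc/pressure/io") -> dict:
    """Parse the PSI 'some' and 'full' I/O stall lines into nested dicts."""
    pressure = {}
    with open(path) as psi:
        for line in psi:
            kind, *fields = line.split()  # e.g. "some avg10=1.23 avg60=..."
            pressure[kind] = dict(field.split("=") for field in fields)
    return pressure


if __name__ == "__main__":
    # Sustained high avg10/avg60 values on the "full" line around 09:31 would
    # support the idea that the datastore was starved of disk I/O.
    for kind, stats in read_io_pressure().items():
        print(kind, stats)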
