Bug Description

On a deployment of COS stable with a 3-node microk8s cluster and a microceph back-end, the grafana unit stayed stuck in the executing state for over 4 hours, running into our timeout for hung deployments.

The unit kept executing config-changed events for the whole duration but never settled. The grafana logs show that it still ran a handful of commands, then stopped with

`Exec 19: timeout waiting for websocket connections: context deadline exceeded`

and never ran anything else over the 4-hour period. Nothing else stands out.
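For context, a minimal sketch of how a stuck unit like this shows up from the Juju CLI; these are standard Juju commands, with the `grafana/0` unit name taken from this deployment:

```shell
# Agent status: a hung hook shows up as a long-lived "executing" state.
juju status grafana

# Status history: makes the repeated config-changed executions visible.
juju show-status-log grafana/0

# Follow the unit's hook and charm logs.
juju debug-log --include grafana/0
```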
The overlays used for the deployment:

```yaml
# Storage overlay
#
# This sample overlay shows how to configure the amount of
# storage to grant to the various COS components.
applications:
  alertmanager:
    storage:
      # The Alertmanager operator uses the `database` storage to
      # store nflog and silence snapshots.
      # See the `--storage.path` argument of Alertmanager
      data: 2G
  grafana:
    storage:
      # The Grafana operator uses the `database` storage to store
      # configurations, plugins, user data, etc. That is, everything
      # that is normally stored under /var/lib/grafana.
      database: 2G
  loki:
    storage:
      # This is the volume used by loki for the index of boltdb-shipper.
      # See https://grafana.com/docs/loki/latest/configuration/examples/
      active-index-directory: 2G
      # Loki will store in this volume the chunks, which contain the
      # actual log content.
      # See https://grafana.com/docs/loki/latest/operations/storage/
      loki-chunks: 10G
  prometheus:
    storage:
      # Prometheus operator uses the `database` storage to store TSDB data.
      # See the `--storage.tsdb.path` argument of Prometheus.
      database: 10G
```

```yaml
# TLS overlay
applications:
  ca:
    charm: self-signed-certificates
    channel: edge
    scale: 1
    options:
      ca-common-name: ca.demo.local
  external-ca:
    # This charm needs to be replaced with a real CA charm.
    # Use `juju refresh --switch` to replace via a "crossgrade refresh".
    charm: self-signed-certificates
    channel: edge
    scale: 1
    options:
      ca-common-name: external-ca.example.com
relations:
  # This is a more general CA (e.g. root CA) that signs traefik's own CSR.
  - [external-ca, traefik:certificates]
  # This is the local CA that signs CSRs from COS charms (excluding traefik).
  # Traefik is trusting this CA so that it could load balance via TLS.
  - [ca, traefik:receive-ca-cert]
  - [ca, alertmanager:certificates]
  - [ca, prometheus:certificates]
  - [ca, grafana:certificates]
  - [ca, loki:certificates]
  - [ca, catalogue:certificates]
```

```yaml
# Testing Overlay
applications:
  alertmanager:
    scale: 2
    # Set low resource requests to accommodate small VMs
    options:
      cpu: 1m
      memory: 1Mi
  prometheus:
    # Set low resource requests to accommodate small VMs
    options:
      cpu: 1m
      memory: 1Mi
  grafana:
    # Set low resource requests to accommodate small VMs
    options:
      cpu: 1m
      memory: 1Mi
  loki:
    # Set low resource requests to accommodate small VMs
    options:
      cpu: 1m
      memory: 1Mi
  avalanche:
    charm: avalanche-k8s
    channel: edge
    scale: 2
    trust: true
    options:
      metric_count: 10
      series_count: 2
relations:
  - [avalanche:metrics-endpoint, prometheus]
```
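For reference, a minimal sketch of how overlays like these are typically applied when deploying the bundle; the file names are illustrative assumptions, not taken from the report:

```shell
# Deploy cos-lite with the three overlays above.
# File names are hypothetical; use whatever the overlays are saved as.
juju deploy cos-lite --trust \
  --overlay ./storage-overlay.yaml \
  --overlay ./tls-overlay.yaml \
  --overlay ./testing-overlay.yaml
```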
From the grafana logs:

```
2024-01-16T03:37:32.044959467Z stderr F 2024-01-16T03:37:32.044Z [pebble] GET /v1/changes/16/wait?timeout=4.000s 1.0201003s 200
2024-01-16T03:37:32.110507133Z stdout F 2024-01-16T03:37:32.110Z [grafana] logger=migrator t=2024-01-16T03:37:32.110169145Z level=info msg="Executing migration" id="alter alert_rule table data column to mediumtext in mysql"
2024-01-16T03:37:32.118688051Z stderr F 2024-01-16T03:37:32.117Z [pebble] POST /v1/files 53.25769ms 200
2024-01-16T03:37:32.2238537Z stderr F 2024-01-16T03:37:32.222Z [pebble] POST /v1/exec 100.911324ms 202
2024-01-16T03:37:32.228219063Z stderr F 2024-01-16T03:37:32.226Z [pebble] GET /v1/tasks/17/websocket/control 512.422µs 200
2024-01-16T03:37:32.228252585Z stderr F 2024-01-16T03:37:32.228Z [pebble] GET /v1/tasks/17/websocket/stdio 58.318µs 200
2024-01-16T03:37:32.2308767Z stderr F 2024-01-16T03:37:32.229Z [pebble] GET /v1/tasks/17/websocket/stderr 56.178µs 200
2024-01-16T03:37:32.275297745Z stdout F 2024-01-16T03:37:32.274Z [grafana] logger=migrator t=2024-01-16T03:37:32.266897897Z level=info msg="Executing migration" id="add column for to alert_rule"
2024-01-16T03:37:32.298035757Z stderr F 2024-01-16T03:37:32.297Z [pebble] GET /v1/changes/17/wait?timeout=4.000s 67.806172ms 200
2024-01-16T03:37:32.362580189Z stdout F 2024-01-16T03:37:32.362Z [grafana] logger=migrator t=2024-01-16T03:37:32.362289259Z level=info msg="Executing migration" id="add column annotations to alert_rule"
2024-01-16T03:37:32.510454928Z stdout F 2024-01-16T03:37:32.510Z [grafana] logger=migrator t=2024-01-16T03:37:32.510290484Z level=info msg="Executing migration" id="add column labels to alert_rule"
2024-01-16T03:37:32.515924111Z stderr F 2024-01-16T03:37:32.515Z [pebble] GET /v1/files?action=read&path=%2Fetc%2Fgrafana%2Fgrafana-config.ini 97.751µs 200
2024-01-16T03:37:32.519063318Z stderr F 2024-01-16T03:37:32.518Z [pebble] GET /v1/plan?format=yaml 135.037µs 200
2024-01-16T03:37:32.522480764Z stderr F 2024-01-16T03:37:32.522Z [pebble] GET /v1/files?action=list&path=%2Fetc%2Fgrafana%2Fgrafana.crt&itself=true 138.467µs 200
2024-01-16T03:37:32.523503602Z stderr F 2024-01-16T03:37:32.523Z [pebble] GET /v1/plan?format=yaml 154.979µs 200
2024-01-16T03:37:32.612165021Z stdout F 2024-01-16T03:37:32.611Z [grafana] logger=migrator t=2024-01-16T03:37:32.610363226Z level=info msg="Executing migration" id="remove unique index from alert_rule on org_id, title columns"
2024-01-16T03:37:32.751029367Z stdout F 2024-01-16T03:37:32.750Z [grafana] logger=migrator t=2024-01-16T03:37:32.74940012Z level=info msg="Executing migration" id="add index in alert_rule on org_id, namespase_uid and title columns"
2024-01-16T03:37:32.930828644Z stdout F 2024-01-16T03:37:32.922Z [grafana] logger=migrator t=2024-01-16T03:37:32.922290584Z level=info msg="Executing migration" id="add dashboard_uid column to alert_rule"
2024-01-16T03:37:33.083516937Z stdout F 2024-01-16T03:37:33.083Z [grafana] logger=migrator t=2024-01-16T03:37:33.083346919Z level=info msg="Executing migration" id="add panel_id column to alert_rule"
2024-01-16T03:37:33.274148687Z stdout F 2024-01-16T03:37:33.274Z [grafana] logger=migrator t=2024-01-16T03:37:33.273938776Z level=info msg="Executing migration" id="add index in alert_rule on org_id, dashboard_uid and panel_id columns"
2024-01-16T03:37:33.389053587Z stderr F 2024-01-16T03:37:33.388Z [pebble] POST /v1/files 15.649629ms 200
2024-01-16T03:37:33.400407947Z stderr F 2024-01-16T03:37:33.396Z [pebble] GET /v1/files?action=list&path=%2Fetc%2Fgrafana%2Fgrafana.crt&itself=true 91.165µs 200
2024-01-16T03:37:33.402971885Z stderr F 2024-01-16T03:37:33.402Z [pebble] GET /v1/plan?format=yaml 178µs 200
2024-01-16T03:37:33.413484984Z stderr F 2024-01-16T03:37:33.413Z [pebble] POST /v1/layers 206.291µs 200
2024-01-16T03:37:33.416391526Z stderr F 2024-01-16T03:37:33.415Z [pebble] GET /v1/services?names=grafana 48.368µs 200
2024-01-16T03:37:33.424911275Z stdout F 2024-01-16T03:37:33.423Z [grafana] logger=migrator t=2024-01-16T03:37:33.423545041Z level=info msg="Executing migration" id="add rule_group_idx column to alert_rule"
2024-01-16T03:37:33.429173723Z stderr F 2024-01-16T03:37:33.428Z [pebble] POST /v1/services 10.490339ms 202
2024-01-16T03:37:33.480006617Z stderr F 2024-01-16T03:37:33.479Z [pebble] Service "grafana" stopped
2024-01-16T03:37:33.487917974Z stderr F 2024-01-16T03:37:33.487Z [pebble] GET /v1/changes/18/wait?timeout=4.000s 57.393551ms 200
2024-01-16T03:37:33.494897995Z stderr F 2024-01-16T03:37:33.493Z [pebble] POST /v1/files 4.487638ms 200
2024-01-16T03:37:33.504232314Z stderr F 2024-01-16T03:37:33.503Z [pebble] POST /v1/files 9.445754ms 200
2024-01-16T03:37:33.510299309Z stderr F 2024-01-16T03:37:33.510Z [pebble] POST /v1/files 5.822565ms 200
2024-01-16T03:37:33.524761196Z stderr F 2024-01-16T03:37:33.523Z [pebble] POST /v1/exec 8.118187ms 202
2024-01-16T03:37:38.542857354Z stderr F 2024-01-16T03:37:38.541Z [pebble] Exec 19: timeout waiting for websocket connections: context deadline exceeded
```
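Note the shape of the failure in the log above: for exec change 17, the client attaches its control/stdio/stderr websockets right after `POST /v1/exec`, but the final `POST /v1/exec` (the `Exec 19` change) is never followed by any websocket GETs, so Pebble gives up waiting for the client to connect. A hedged sketch for inspecting the stuck change from outside the pod; the socket and binary paths below are the usual Juju sidecar-charm locations and may differ:

```shell
# List recent Pebble changes for the grafana workload container.
# Socket/binary paths follow the standard sidecar layout (assumption).
juju ssh grafana/0 \
  "PEBBLE_SOCKET=/charm/containers/grafana/pebble.socket /charm/bin/pebble changes"

# Show the tasks of the stuck change (the id from the "Exec 19" message).
juju ssh grafana/0 \
  "PEBBLE_SOCKET=/charm/containers/grafana/pebble.socket /charm/bin/pebble tasks 19"
```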
To Reproduce
Environment
Running on:
- microk8s - v1.28.3
- microceph (snap) - latest/edge
- cos-lite - latest/stable:11
- metallb - v0.13.10
Relevant log output

See the grafana logs above.
Additional context
A crashdump can be found at: https://oil-jenkins.canonical.com/artifacts/39937174-e6c2-401b-940d-64cb8655ef02/generated/generated/microk8s/juju-crashdump-microk8s-2024-01-16-07.40.28.tar.gz
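The crashdump is a regular tarball; a quick sketch for pulling it down for inspection (URL as given above):

```shell
# Download and unpack the crashdump referenced above.
curl -LO https://oil-jenkins.canonical.com/artifacts/39937174-e6c2-401b-940d-64cb8655ef02/generated/generated/microk8s/juju-crashdump-microk8s-2024-01-16-07.40.28.tar.gz
tar -xzf juju-crashdump-microk8s-2024-01-16-07.40.28.tar.gz
```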