prometheus fails to restart after loki upgrade #521

Closed
PietroPasotti opened this issue Sep 7, 2023 · 4 comments · Fixed by #524


PietroPasotti commented Sep 7, 2023

Bug Description

Prometheus is stuck in an error state after refreshing Loki in a cos-lite deployment (with the TLS overlay).

To Reproduce

juju deploy cos-lite --trust --overlay tls-overlay.yaml
juju refresh loki --channel edge

Environment

microk8s
edge prometheus (rev 143)

Relevant log output

unit-prometheus-0: 17:27:29 INFO unit.prometheus/0.juju-log HTTP Request: GET https://10.152.183.1/api/v1/namespaces/bar/pods/prometheus-0 "HTTP/1.1 200 OK"
unit-prometheus-0: 17:27:30 INFO unit.prometheus/0.juju-log reqs=ResourceRequirements(claims=None, limits={}, requests={'cpu': '0.25', 'memory': '200Mi'}), templated=ResourceRequirements(claims=None, limits=None, requests={'cpu': '250m', 'memory': '200Mi'}), actual=ResourceRequirements(claims=None, limits=None, requests={'cpu': '250m', 'memory': '200Mi'})
unit-prometheus-0: 17:27:30 INFO unit.prometheus/0.juju-log HTTP Request: GET https://10.152.183.1/apis/apps/v1/namespaces/bar/statefulsets/prometheus "HTTP/1.1 200 OK"
unit-prometheus-0: 17:27:30 INFO unit.prometheus/0.juju-log HTTP Request: GET https://10.152.183.1/api/v1/namespaces/bar/pods/prometheus-0 "HTTP/1.1 200 OK"
unit-prometheus-0: 17:27:30 INFO unit.prometheus/0.juju-log HTTP Request: GET https://10.152.183.1/api/v1/namespaces/bar/pods/prometheus-0 "HTTP/1.1 200 OK"
unit-prometheus-0: 17:27:31 INFO unit.prometheus/0.juju-log HTTP Request: GET https://10.152.183.1/api/v1/namespaces/bar/persistentvolumeclaims/prometheus-database-88da8e7c-prometheus-0 "HTTP/1.1 200 OK"
unit-prometheus-0: 17:27:31 INFO unit.prometheus/0.juju-log HTTP Request: GET https://10.152.183.1/api/v1/namespaces/bar/pods/prometheus-0 "HTTP/1.1 200 OK"
unit-prometheus-0: 17:27:31 INFO unit.prometheus/0.juju-log HTTP Request: GET https://10.152.183.1/api/v1/namespaces/bar/persistentvolumeclaims/prometheus-database-88da8e7c-prometheus-0 "HTTP/1.1 200 OK"
unit-prometheus-0: 17:27:31 ERROR unit.prometheus/0.juju-log Failed to replan; pebble layer: {'summary': 'Prometheus layer', 'description': 'Pebble layer configuration for Prometheus', 'services': {'prometheus': {'summary': 'prometheus daemon', 'startup': 'enabled', 'override': 'replace', 'command': '/bin/prometheus --config.file=/etc/prometheus/prometheus.yml --storage.tsdb.path=/var/lib/prometheus --web.enable-lifecycle --web.console.templates=/usr/share/prometheus/consoles --web.console.libraries=/usr/share/prometheus/console_libraries --web.config.file=/etc/prometheus/prometheus-web-config.yml --web.external-url=https://prometheus-0.prometheus-endpoints.bar.svc.cluster.local:9090/bar-prometheus-0 --web.route-prefix=/ --log.level=info --storage.tsdb.retention.time=15d --storage.tsdb.retention.size=0.8GB'}}}; cannot perform the following tasks:
- Start service "prometheus" (cannot start service: exited quickly with code 1)
----- Logs from task 0 -----
2023-09-07T15:27:31Z INFO Most recent service output:
    ts=2023-09-07T15:27:31.263Z caller=main.go:585 level=info msg="Starting Prometheus Server" mode=server version="(version=2.46.0, branch=HEAD, revision=cbb69e51423565ec40f46e74f4ff2dbb3b7fb4f0)"
    ts=2023-09-07T15:27:31.263Z caller=main.go:590 level=info build_context="(go=go1.19.11, platform=linux/amd64, user=root@rockcraft-prometheus-322591, date=20230730-12:08:01, tags=netgo,builtinassets,stringlabels)"
    ts=2023-09-07T15:27:31.263Z caller=main.go:591 level=info host_details="(Linux 6.2.0-32-generic #32~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Aug 18 10:40:13 UTC 2 x86_64 prometheus-0 (none))"
    ts=2023-09-07T15:27:31.263Z caller=main.go:592 level=info fd_limits="(soft=65536, hard=65536)"
    ts=2023-09-07T15:27:31.263Z caller=main.go:593 level=info vm_limits="(soft=unlimited, hard=unlimited)"
    ts=2023-09-07T15:27:31.265Z caller=web.go:563 level=info component=web msg="Start listening for connections" address=0.0.0.0:9090
    ts=2023-09-07T15:27:31.265Z caller=main.go:846 level=error msg="Unable to validate web configuration file" err="failed to load X509KeyPair: open /etc/prometheus/server.cert: no such file or directory"
2023-09-07T15:27:31Z ERROR cannot start service: exited quickly with code 1
-----
unit-prometheus-0: 17:27:31 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)

Additional context

No response


sed-i commented Sep 7, 2023

error msg="Unable to validate web configuration file" err="failed to load X509KeyPair: open /etc/prometheus/server.cert: no such file or directory"

Duplicates #506. Partial fix in #509 (?)
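
A minimal sketch of the failure mode quoted above, not the fix that landed: the web config names a certificate file that is not on disk yet, so Prometheus exits as soon as it validates the config. The helper name `missing_tls_files`, the key path and the injectable `exists` callback are assumptions for illustration. Only the `tls_server_config`, `cert_file` and `key_file` keys come from the Prometheus web configuration format. In a charm the callback would have to check the workload container rather than the local filesystem.

```python
import os
from typing import Callable, Dict, List


def missing_tls_files(
    web_config: Dict,
    exists: Callable[[str], bool] = os.path.exists,
) -> List[str]:
    """Return the TLS file paths named in a web config that are not on disk."""
    tls = web_config.get("tls_server_config") or {}
    paths = [tls.get("cert_file"), tls.get("key_file")]
    return [p for p in paths if p and not exists(p)]


# Hypothetical config mirroring the failing unit; the key path is an assumption.
web_config = {
    "tls_server_config": {
        "cert_file": "/etc/prometheus/server.cert",
        "key_file": "/etc/prometheus/server.key",
    }
}

# Simulate a container where only the key has been written so far.
on_disk = {"/etc/prometheus/server.key"}
print(missing_tls_files(web_config, exists=on_disk.__contains__))
```

A check like this could gate the replan, so the service is only (re)started with a TLS web config once every file it references exists.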


sed-i commented Sep 8, 2023

@PietroPasotti
Contributor Author

Just got this right after deploying the bundle (with the TLS overlay):

unit-loki-0: 16:15:22 ERROR unit.loki/0.juju-log Uncaught exception while in charm code:                                                                 
Traceback (most recent call last):                                                                                                                       
  File "/var/lib/juju/agents/unit-loki-0/charm/venv/ops/model.py", line 2792, in _run                                                                    
    result = subprocess.run(args, **kwargs)  # type: ignore                                                                                              
  File "/usr/lib/python3.8/subprocess.py", line 516, in run                                                                                              
    raise CalledProcessError(retcode, process.args,                                                                                                      
subprocess.CalledProcessError: Command '('/var/lib/juju/tools/unit-loki-0/relation-get', '-r', '3', '-', 'loki/0', '--format=json')' returned non-zero exit status 1.
                                                                                                                                                         
During handling of the above exception, another exception occurred:                                                                                      
                                                                                                                                                         
Traceback (most recent call last):                                                                                                                       
  File "./src/charm.py", line 523, in <module>                                                                                                           
    main(LokiOperatorCharm, use_juju_for_storage=True)                                                                                                   
  File "/var/lib/juju/agents/unit-loki-0/charm/venv/ops/main.py", line 429, in main                                                                      
    charm = charm_class(framework)                                                                                                                       
  File "./src/charm.py", line 118, in __init__                                                                                                           
    source_url=self._external_url,                                                                                                                       
  File "./src/charm.py", line 249, in _external_url                                                                                                      
    scheme = "https" if self.server_cert.cert else "http"                                                                                                
  File "/var/lib/juju/agents/unit-loki-0/charm/lib/charms/observability_libs/v0/cert_handler.py", line 323, in cert                                      
    return self._server_cert                                                                                                                             
  File "/var/lib/juju/agents/unit-loki-0/charm/lib/charms/observability_libs/v0/cert_handler.py", line 333, in _server_cert                              
    return self._peer_relation.data[self.charm.unit].get("certificate", None)                                                                            
  File "/usr/lib/python3.8/_collections_abc.py", line 660, in get                                                                                        
    return self[key]                                                                                                                                     
  File "/var/lib/juju/agents/unit-loki-0/charm/venv/ops/model.py", line 1582, in __getitem__                                                             
    return super().__getitem__(key)                                                                                                                      
  File "/var/lib/juju/agents/unit-loki-0/charm/venv/ops/model.py", line 686, in __getitem__                                                              
    return self._data[key]                                                                                                                               
  File "/var/lib/juju/agents/unit-loki-0/charm/venv/ops/model.py", line 670, in _data                                                                    
    data = self._lazy_data = self._load()                                                                                                                
  File "/var/lib/juju/agents/unit-loki-0/charm/venv/ops/model.py", line 1466, in _load                                                                   
    return self._backend.relation_get(self.relation.id, self._entity.name, self._is_app)                                                                 
  File "/var/lib/juju/agents/unit-loki-0/charm/venv/ops/model.py", line 2873, in relation_get                                                            
    raw_data_content = self._run(*args, return_output=True, use_json=True)                                                                               
  File "/var/lib/juju/agents/unit-loki-0/charm/venv/ops/model.py", line 2794, in _run                                                                    
    raise ModelError(e.stderr)                                                                                                                           
ops.model.ModelError: ERROR permission denied                                                                                                            
                                                                                                                                                         
unit-loki-0: 16:15:22 ERROR juju.worker.uniter.operation hook "loki-chunks-storage-detaching" (via hook dispatching script: dispatch) failed: exit status 1
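
The traceback shows the charm's `__init__` reading the peer relation databag to choose between `http` and `https`, and the `ModelError` ("permission denied") from `relation-get` propagating all the way up. A rough sketch of one way to tolerate that follows. It assumes that falling back to "no certificate" is acceptable while the databag is unreadable, which may not be the approach the maintainers took. `ModelError` here is a local stand-in for `ops.model.ModelError` so the snippet runs without `ops`, and `read_certificate` and `external_scheme` are hypothetical names.

```python
from typing import Callable, Mapping, Optional


class ModelError(Exception):
    """Local stand-in for ops.model.ModelError, so the sketch runs without ops."""


def read_certificate(load_databag: Callable[[], Mapping[str, str]]) -> Optional[str]:
    """Return the stored certificate, or None if the databag cannot be read."""
    try:
        return load_databag().get("certificate")
    except ModelError:
        # relation-get failed (e.g. "ERROR permission denied"); treat as "no cert".
        return None


def external_scheme(load_databag: Callable[[], Mapping[str, str]]) -> str:
    """Mirror the `"https" if cert else "http"` decision from the traceback."""
    return "https" if read_certificate(load_databag) else "http"


def denied() -> Mapping[str, str]:
    # Simulates the relation-get failure seen in the log.
    raise ModelError("ERROR permission denied")


print(external_scheme(denied))
print(external_scheme(lambda: {"certificate": "-----BEGIN CERTIFICATE-----"}))
```

With the guard in place the hook would not crash, although the unit would briefly advertise an `http` URL until the databag becomes readable again.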


PietroPasotti commented Sep 11, 2023

The first update-status clears it.
