
mgr/dashboard: add prometheus federation config for multi-cluster monitoring #54964

Merged
1 commit merged into ceph:main from add-prometheus-federation-cli on Mar 6, 2024

Conversation

@aaSharma14 (Contributor) commented Dec 19, 2023:

Introduce Prometheus federation in the Ceph dashboard. This is done by adding a federate job to the Prometheus configuration. We can add/remove targets (the remote cluster's Prometheus service endpoint) to this job to scrape data from different clusters. These targets are added to the Prometheus config file by exposing two new orch CLI commands (see the configuration sketch after this list):

  1. ceph orch prometheus set-target
  2. ceph orch prometheus remove-target
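
For orientation, the resulting Prometheus configuration would typically carry a federation scrape job along these lines. This is only a hedged sketch: the job name, match[] selector, port, and target address are illustrative assumptions, not the exact output generated by this PR.

  scrape_configs:
    - job_name: 'federate'
      honor_labels: true
      metrics_path: '/federate'
      scheme: http                    # https if the remote Prometheus endpoint is TLS-protected
      params:
        'match[]':
          - '{job=~".+"}'             # pull all series; narrow the selector in real deployments
      static_configs:
        - targets:
            - '192.168.10.5:9095'     # remote cluster's Prometheus, given as host:port

Each set-target/remove-target call then adds or drops an entry in that targets list, and the orchestrator regenerates the Prometheus config accordingly.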

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Available Jenkins commands:
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@nizamial09 (Member) left a comment:

some initial impressions!

Review threads (outdated, resolved) on: src/pybind/mgr/cephadm/module.py, src/pybind/mgr/cephadm/services/monitoring.py, src/pybind/mgr/orchestrator/_interface.py, src/pybind/mgr/orchestrator/module.py
@aaSharma14 (Contributor Author):
jenkins retest this please

@nizamial09 (Member):
jenkins test make check

@adk3798 (Contributor) left a comment:

Minor comments. I can't speak much to the changes to the actual Prometheus conf, but generally the code looks okay, aside from the failing tests.

Review threads (outdated, resolved) on: src/pybind/mgr/cephadm/services/monitoring.py, src/pybind/mgr/orchestrator/module.py, src/pybind/mgr/cephadm/module.py
@aaSharma14 force-pushed the add-prometheus-federation-cli branch 2 times, most recently from 96182fd to c21d7ac on February 5, 2024 11:14
@rkachach (Contributor) left a comment:

I just left some minor comments, plus some others specific to security. Please, I'd like to know whether we have done any security assessment of the implications of enabling security together with this new feature. Is the system still secure? If not, what security issues could we face when enabling this new feature, and what should we do to overcome them?

Review threads on: src/pybind/mgr/cephadm/module.py (outdated, resolved), src/pybind/mgr/cephadm/services/monitoring.py (resolved)
Comment on lines 41 to 44
relabel_configs:
  - source_labels: [__address__]
    target_label: cluster
    replacement: {{ cluster_fsid }}
Contributor:

This section assumes you are using secure communication. Is this the case? What security implications does this new feature have? Are we taking them into account? Have we done any security assessment of the impact?

@@ -964,6 +964,18 @@ def _set_prometheus_access_info(self, username: Optional[str] = None, password:
        except ArgumentError as e:
            return HandleCommandResult(-errno.EINVAL, "", (str(e)))

    @_cli_write_command('orch prometheus set-target')
Member:

If I give a target like http://<ip>:port it kind of kills the Prometheus daemon, and I had to remove the target and restart the prometheus module to get it working. Should it fail like that for such a simple error? And if this is a crucial mistake, then it should have proper validation set up, or we might end up breaking a deployment.

Member:

At least some helper text mentioning what the Prometheus target should look like would be helpful.

@aaSharma14 (Contributor Author):

@nizamial09, this issue is being tracked here: https://tracker.ceph.com/issues/64369. I will open a separate PR for the mentioned issues soon.
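
For context, a plausible explanation for the breakage described above (an assumption on my part, not verified against this PR's code): Prometheus expects static_configs target entries as bare host:port addresses, with the URL scheme set separately via the job's scheme field, so a target carrying an http:// prefix can leave the generated configuration invalid. A minimal illustrative snippet:

  static_configs:
    - targets:
        - '192.168.10.5:9095'            # valid target address: host:port only
        # - 'http://192.168.10.5:9095'   # a scheme prefix is not a valid target address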

@cloudbehl (Contributor):

@adk3798 can we merge this if the teuthology run is okay? (asking since there are a series of bug fixes planned after #55574, which currently depends on this PR) thanks.

yeah, it looks like my testing tag got removed so this wasn't in my last run, but I'll do another run in the next day or 2 and if this doesn't break anything we can merge.

@adk3798 any updates?

@aaSharma14 force-pushed the add-prometheus-federation-cli branch from d7faaaf to 852ecb8 on March 4, 2024 06:03
@nizamial09 (Member):

@aaSharma14 I saw these in the unit test failures:

DEBUG    cephadm.serve:serve.py:828 Daemons that will be removed: []
DEBUG    cephadm.serve:serve.py:909 Placing haproxy.ingress.test.mhaubs on host test
DEBUG    cephadm.services.ingress:ingress.py:71 prepare_create haproxy.ingress.test.mhaubs on host test with spec IngressSpec.from_json(yaml.safe_load('''service_type: ingress
service_id: ingress
service_name: ingress.ingress
placement:
  count: 2
spec:
  backend_service: rgw.foo
  first_virtual_router_id: 50
  frontend_port: 8089
  keepalived_password: '12345'
  monitor_password: '12345'
  monitor_port: 8999
  monitor_user: admin
  virtual_ip: 1.2.3.4/32
'''))
DEBUG    cephadm.services.ingress:ingress.py:171 enabled default server opts: []
DEBUG    asyncio:selector_events.py:54 Using selector: EpollSelector
INFO     cephadm.serve:serve.py:1340 Deploying daemon haproxy.ingress.test.mhaubs on test
DEBUG    cephadm.inventory:inventory.py:894 Host test has no devices to save
DEBUG    cephadm.serve:serve.py:909 Placing keepalived.ingress.test.uonwnj on host test
DEBUG    cephadm.services.ingress:ingress.py:215 prepare_create keepalived.ingress.test.uonwnj on host test with spec IngressSpec.from_json(yaml.safe_load('''service_type: ingress
service_id: ingress
service_name: ingress.ingress
placement:
  count: 2
spec:
  backend_service: rgw.foo
  first_virtual_router_id: 50
  frontend_port: 8089
  keepalived_password: '12345'
  monitor_password: '12345'
  monitor_port: 8999
  monitor_user: admin
  virtual_ip: 1.2.3.4/32
'''))
INFO     cephadm.services.ingress:ingress.py:263 1.2.3.4 is in 1.2.3.0/24 on test interface if0

@adk3798 (Contributor) commented Mar 4, 2024:

@adk3798 can we merge this if the teuthology run is okay? (asking since there are a series of bug fixes planned after #55574, which currently depends on this PR) thanks.

yeah, it looks like my testing tag got removed so this wasn't in my last run, but I'll do another run in the next day or 2 and if this doesn't break anything we can merge.

@adk3798 any updates?

https://pulpito.ceph.com/adking-2024-03-02_21:18:48-orch:cephadm-wip-adk-testing-2024-03-01-1302-distro-default-smithi/

The failures were all in the cluster log stuff, which is okay as we work on the ignorelist for the orch/cephadm suite; the mds upgrade sequence test failures are a known issue; and the test_repos test failing to get jammy packages for quincy is also a known issue. So there are no regressions caused by this, and I'm okay with merging once the CI here is passing.

mgr/dashboard: add prometheus federation config for multi-cluster monitoring

Signed-off-by: Aashish Sharma <aasharma@redhat.com>
@aaSharma14 force-pushed the add-prometheus-federation-cli branch from f8c0940 to 82b50b4 on March 5, 2024 06:29
@aaSharma14 (Contributor Author):
jenkins test api

@nizamial09 merged commit 7d58640 into ceph:main on Mar 6, 2024
12 of 14 checks passed
@nizamial09 deleted the add-prometheus-federation-cli branch on March 6, 2024 06:29
frittentheke added a commit to frittentheke/ceph that referenced this pull request Mar 22, 2024
Rendering the dashboards and alerts with showMultiCluster=True allows for
them to work with multiple clusters storing their metrics in a single
Prometheus instance. This works via the cluster label and that functionality
already existed. This just fixes some inconsistencies in applying the label
filters.

Additionally this contains updates to the tests to have them succeed with
both configurations and avoid the introduction of regressions with
regard to multiCluster in the future.

There also are some consistency cleanups here and there:
 * `datasource` was not used consistently
 * `cluster` label_values are determined from `ceph_health_status`
 * `job` template and filters on this label were removed to align multi cluster
    support solely via the `cluster` label
 * `ceph_hosts` filter now uses label_values from any ceph_metadata metric
    to not show all instance values, but only those of hosts with some Ceph
    component / daemon.
 *  Enable showMultiCluster=True since `cluster` label is now always present,
    via ceph#54964

Fixes: https://tracker.ceph.com/issues/64321
Signed-off-by: Christian Rohmann <christian.rohmann@inovex.de>
frittentheke added a commit to frittentheke/ceph that referenced this pull request Mar 25, 2024
frittentheke added a commit to frittentheke/ceph that referenced this pull request Apr 8, 2024
frittentheke added a commit to frittentheke/ceph that referenced this pull request Apr 12, 2024
frittentheke added a commit to frittentheke/ceph that referenced this pull request Apr 12, 2024
frittentheke added a commit to frittentheke/ceph that referenced this pull request Apr 19, 2024
frittentheke added a commit to frittentheke/ceph that referenced this pull request Apr 22, 2024
adk3798 pushed a commit to adk3798/ceph that referenced this pull request Apr 24, 2024 (cherry picked from commit 2457451; Resolves: rhbz#2275936)
joscollin pushed a commit to joscollin/ceph that referenced this pull request May 6, 2024
frittentheke added a commit to frittentheke/ceph that referenced this pull request May 14, 2024 (cherry picked from commit 090b8e1)