
quincy: Revert "ceph-exporter: cephadm changes" #51053

Merged: 1 commit merged into ceph:quincy from quincy-revert-cephadm-ceph-exporter on May 24, 2023

Conversation

@adk3798 (Contributor) commented Apr 12, 2023

This reverts commit 5f04222.

Issues were found with the ceph-exporter service in 17.2.6.


  • Ceph metrics could be duplicated: once from the existing centralized mgr/prometheus exporter (the legacy approach), and once from the new ceph-exporter, which is deployed as a side-car container on each node and reports Ceph metrics for all colocated services on that node.

  • When a metric name contains characters Prometheus does not support (e.g. the + in ceph_mds_mem_cap+), Prometheus would complain about a malformed metric name, although the metric is still properly reported via mgr/prometheus as ceph_mds_mem_cap_plus.

  • For metric names ending in -, the two sources would report different names in Prometheus: ceph_mds_mem_cap_minus when coming from mgr/prometheus and ceph_mds_mem_cap_ when coming from ceph-exporter (see the sketch after this list).

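Here is a hypothetical illustration (Python, not Ceph's actual sanitization code; the function names and regex are assumptions made only for this example) of how two scrapers applying different rules to the same raw counter name produce the divergence described above:

```python
import re

def mgr_prometheus_name(raw: str) -> str:
    # Readable mapping, matching the ceph_mds_mem_cap_plus /
    # ceph_mds_mem_cap_minus names reported via mgr/prometheus.
    return raw.replace('+', '_plus').replace('-', '_minus')

def underscore_name(raw: str) -> str:
    # Blanket replacement of characters outside Prometheus's allowed
    # set [a-zA-Z0-9_:], which collapses '+' and '-' to '_', matching
    # the trailing-underscore names attributed to ceph-exporter.
    return re.sub(r'[^a-zA-Z0-9_:]', '_', raw)

for raw in ('ceph_mds_mem_cap+', 'ceph_mds_mem_cap-'):
    print(raw, '->', mgr_prometheus_name(raw), 'vs', underscore_name(raw))
# ceph_mds_mem_cap+ -> ceph_mds_mem_cap_plus vs ceph_mds_mem_cap_
# ceph_mds_mem_cap- -> ceph_mds_mem_cap_minus vs ceph_mds_mem_cap_
```

With both exporters scraping the same daemons, the result is metrics that are both duplicated and inconsistently named.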

Therefore we've decided it's best to revert cephadm's support for ceph-exporter in quincy.

Conflicts:
src/cephadm/cephadm
src/pybind/mgr/cephadm/tests/test_services.py


I tested this change using an image based on the 17.2.6 release plus this reversion. The branch used for that can be found here: https://github.com/adk3798/ceph/commits/17-2-6-cephadm-ceph-exporter-reversion. The testing deployed a 17.2.6 cluster, including the ceph-exporter service, and then upgraded to an image based on the linked branch that includes the ceph-exporter reversion. Before the upgrade, daemons and services were:

[ceph: root@vm-00 /]# ceph orch ps
NAME                 HOST   PORTS        STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID  
alertmanager.vm-00   vm-00  *:9093,9094  running (85m)     6m ago  89m    20.9M        -  0.23.0   ba2b418f427c  7706462e3cb7  
ceph-exporter.vm-00  vm-00               running (79m)     6m ago  79m    7344k        -  17.2.6   9cea3956c04b  47b2a6bd0212  
ceph-exporter.vm-01  vm-01               running (78m)     6m ago  78m    7402k        -  17.2.6   9cea3956c04b  73fe9aed14dc  
ceph-exporter.vm-02  vm-02               running (79m)     6m ago  79m    7319k        -  17.2.6   9cea3956c04b  e20ea3a4b85a  
crash.vm-00          vm-00               running (89m)     6m ago  89m    7415k        -  17.2.6   9cea3956c04b  bb0514e714c8  
crash.vm-01          vm-01               running (87m)     6m ago  87m    7428k        -  17.2.6   9cea3956c04b  20c9f6c8969a  
crash.vm-02          vm-02               running (86m)     6m ago  86m    7419k        -  17.2.6   9cea3956c04b  84b8dd65649e  
grafana.vm-00        vm-00  *:3000       running (85m)     6m ago  88m    56.4M        -  8.3.5    dad864ee21e9  300c43bc851d  
mgr.vm-00.omfhaj     vm-00  *:9283       running (90m)     6m ago  90m     436M        -  17.2.6   9cea3956c04b  ab4f32f01014  
mgr.vm-01.ojvcxa     vm-01  *:8443,9283  running (87m)     6m ago  87m     475M        -  17.2.6   9cea3956c04b  c35e352ddabd  
mon.vm-00            vm-00               running (90m)     6m ago  90m    69.9M    2048M  17.2.6   9cea3956c04b  f588c90066e1  
mon.vm-01            vm-01               running (87m)     6m ago  87m    65.1M    2048M  17.2.6   9cea3956c04b  e464653ca34d  
mon.vm-02            vm-02               running (86m)     6m ago  86m    63.9M    2048M  17.2.6   9cea3956c04b  40f0dae4f643  
node-exporter.vm-00  vm-00  *:9100       running (88m)     6m ago  88m    13.5M        -  1.3.1    1dbe0e931976  9cfe7d2c4bf9  
node-exporter.vm-01  vm-01  *:9100       running (87m)     6m ago  87m    23.0M        -  1.3.1    1dbe0e931976  5c59aae80dd4  
node-exporter.vm-02  vm-02  *:9100       running (86m)     6m ago  86m    19.9M        -  1.3.1    1dbe0e931976  bbcd2965b854  
prometheus.vm-01     vm-01  *:9095       running (78m)     6m ago  86m    77.0M        -  2.33.4   514e6a882f6e  5c1406872133  
[ceph: root@vm-00 /]# ceph orch ls
NAME           PORTS        RUNNING  REFRESHED  AGE  PLACEMENT  
alertmanager   ?:9093,9094      1/1  7m ago     90m  count:1    
ceph-exporter                   3/3  7m ago     79m  *          
crash                           3/3  7m ago     90m  *          
grafana        ?:3000           1/1  7m ago     90m  count:1    
mgr                             2/2  7m ago     90m  count:2    
mon                             3/5  7m ago     90m  count:5    
node-exporter  ?:9100           3/3  7m ago     90m  *          
prometheus     ?:9095           1/1  7m ago     90m  count:1    

Luckily, it seems the impact of the reversion was minimal. I saw a single log message:

[WRN] unable to load spec for ceph-exporter: ServiceSpec: __init__() got an unexpected keyword argument 'prio_limit'

followed by cephadm discarding the ceph-exporter spec, since it could not load it, and then removing the ceph-exporter daemons because they no longer had a matching service spec. No other issues were seen. Daemons and services post-upgrade were:

[ceph: root@vm-00 /]# ceph orch ps
NAME                 HOST   PORTS        STATUS        REFRESHED  AGE  MEM USE  MEM LIM  VERSION               IMAGE ID      CONTAINER ID  
alertmanager.vm-00   vm-00  *:9093,9094  running (3m)     3m ago  97m    16.0M        -  0.23.0                ba2b418f427c  7460239b02bb  
crash.vm-00          vm-00               running (4m)     3m ago  97m    7424k        -  17.2.6-129-g2e256435  e49ad829f207  3617d5f2746c  
crash.vm-01          vm-01               running (4m)     4m ago  95m    7482k        -  17.2.6-129-g2e256435  e49ad829f207  702aef98c870  
crash.vm-02          vm-02               running (4m)     4m ago  94m    7448k        -  17.2.6-129-g2e256435  e49ad829f207  b461874595d5  
grafana.vm-00        vm-00  *:3000       running (3m)     3m ago  96m    35.8M        -  8.3.5                 dad864ee21e9  5e887c0ed520  
mgr.vm-00.omfhaj     vm-00  *:8443,9283  running (6m)     3m ago  98m     438M        -  17.2.6-129-g2e256435  e49ad829f207  f1cd5e3ab612  
mgr.vm-01.ojvcxa     vm-01  *:8443,9283  running (6m)     4m ago  95m     479M        -  17.2.6-129-g2e256435  e49ad829f207  5799f00c3046  
mon.vm-00            vm-00               running (5m)     3m ago  98m    51.0M    2048M  17.2.6-129-g2e256435  e49ad829f207  f57d6bce57fd  
mon.vm-01            vm-01               running (5m)     4m ago  95m    41.6M    2048M  17.2.6-129-g2e256435  e49ad829f207  f3a1eef00d51  
mon.vm-02            vm-02               running (4m)     4m ago  94m    26.4M    2048M  17.2.6-129-g2e256435  e49ad829f207  9f192be8abe6  
node-exporter.vm-00  vm-00  *:9100       running (4m)     3m ago  96m    10.6M        -  1.3.1                 1dbe0e931976  82a108beffa6  
node-exporter.vm-01  vm-01  *:9100       running (4m)     4m ago  95m    16.7M        -  1.3.1                 1dbe0e931976  a28e91a8aae9  
node-exporter.vm-02  vm-02  *:9100       running (4m)     4m ago  94m    3623k        -  1.3.1                 1dbe0e931976  acbf509b02fc  
prometheus.vm-01     vm-01  *:9095       running (4m)     4m ago  94m    24.7M        -  2.33.4                514e6a882f6e  5ddad65370ac  
[ceph: root@vm-00 /]# ceph orch ls
NAME           PORTS        RUNNING  REFRESHED  AGE  PLACEMENT  
alertmanager   ?:9093,9094      1/1  3m ago     98m  count:1    
crash                           3/3  4m ago     98m  *          
grafana        ?:3000           1/1  3m ago     98m  count:1    
mgr                             2/2  4m ago     98m  count:2    
mon                             3/5  4m ago     98m  count:5    
node-exporter  ?:9100           3/3  4m ago     98m  *          
prometheus     ?:9095           1/1  4m ago     98m  count:1    

which is the same as before but with the ceph-exporter daemons/service removed. Given the minimal impact of a single warning-level log message (no health warnings were seen either), I think upgrading from a 17.2.6 cluster with the ceph-exporter deployed to one with cephadm's support for it reverted should be fine.
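For context on that single warning, here is a minimal sketch (assumed for illustration, not the actual cephadm code; the class body and stored dict are simplified) of why loading the stored spec fails on the reverted build and why discarding it is the graceful fallback: the spec store still holds a ceph-exporter field that the reverted ServiceSpec class no longer accepts, so deserialization raises a TypeError that cephadm catches.

```python
# Post-revert spec class: ceph-exporter fields such as 'prio_limit'
# no longer exist (body simplified for the sketch).
class ServiceSpec:
    def __init__(self, service_type: str, placement: str = '*'):
        self.service_type = service_type
        self.placement = placement

# Spec JSON persisted by the pre-revert 17.2.6 mgr.
stored = {'service_type': 'ceph-exporter', 'placement': '*', 'prio_limit': 5}

try:
    spec = ServiceSpec(**stored)
except TypeError as e:
    # Mirrors the observed log line; cephadm logs the warning, drops the
    # spec, and then removes the daemons left without a matching spec.
    print(f"unable to load spec for ceph-exporter: {e}")
```

Since Python raises on the first unexpected keyword argument, a single warning per stored spec is the expected blast radius, which matches what was observed.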


@adk3798 added this to the quincy milestone on Apr 12, 2023
@adk3798 requested a review from a team as a code owner on April 12, 2023 17:53
@adk3798 force-pushed the quincy-revert-cephadm-ceph-exporter branch from f0a5d68 to 524390a on April 12, 2023 18:06
@idryomov (Contributor) left a comment:

I repeated the revert based on 17.2.6 tag instead of the tip of the quincy branch and came up with a nearly identical commit (just a tiny context change). The two conflicts in src/cephadm/cephadm and src/pybind/mgr/cephadm/tests/test_services.py are trivial, caused by ec770e8 and 4191614 respectively.

@idryomov (Contributor) commented:

> [ceph: root@vm-00 /]# ceph orch ps
> NAME                 HOST   PORTS        STATUS        REFRESHED  AGE  MEM USE  MEM LIM  VERSION               IMAGE ID      CONTAINER ID  
> alertmanager.vm-00   vm-00  *:9093,9094  running (3m)     3m ago  97m    16.0M        -  0.23.0                ba2b418f427c  7460239b02bb  
> crash.vm-00          vm-00               running (4m)     3m ago  97m    7424k        -  17.2.6-129-g2e256435  e49ad829f207  3617d5f2746c

I don't think this invalidates your testing by any means, but technically this was an upgrade to the tip of the quincy branch, not to 17.2.6 + revert.

Can we also try an upgrade from 17.2.5 just in case?

@adk3798 (Contributor, Author) commented Apr 12, 2023

> I repeated the revert based on 17.2.6 tag instead of the tip of the quincy branch and came up with a nearly identical commit (just a tiny context change). The two conflicts in src/cephadm/cephadm and src/pybind/mgr/cephadm/tests/test_services.py are trivial, caused by ec770e8 and 4191614 respectively.

Yeah, the patches seem to be nearly identical. I've tried cherry-picking the commit in this PR directly on top of the 17.2.6 tag and there are no merge conflicts, since the conflicts are the same whether cherry-picking onto 17.2.6 or the current quincy branch (and, as you mentioned, they're all trivial anyway).

@idryomov (Contributor) commented Apr 12, 2023

> the upgrade was to a build based off of https://github.com/adk3798/ceph/commits/17-2-6-cephadm-ceph-exporter-reversion which is the 17.2.6 tag plus the reversion.

The daemons landed on 17.2.6-129-g2e256435 which is the tip of the quincy branch (as of yesterday), not the tag plus the reversion.

@adk3798 (Contributor, Author) commented Apr 12, 2023

> > [ceph: root@vm-00 /]# ceph orch ps
> > NAME                 HOST   PORTS        STATUS        REFRESHED  AGE  MEM USE  MEM LIM  VERSION               IMAGE ID      CONTAINER ID  
> > alertmanager.vm-00   vm-00  *:9093,9094  running (3m)     3m ago  97m    16.0M        -  0.23.0                ba2b418f427c  7460239b02bb  
> > crash.vm-00          vm-00               running (4m)     3m ago  97m    7424k        -  17.2.6-129-g2e256435  e49ad829f207  3617d5f2746c
>
> I don't think this invalidates your testing by any means, but technically this was an upgrade to the tip of the quincy branch, not to 17.2.6 + revert.
>
> Can we also try an upgrade from 17.2.5 just in case?

Sure, I hadn't tried upgrading from 17.2.5, since that didn't allow deploying the ceph-exporter and I figured that was the only sticking point. I'll test 17.2.5 to 17.2.6 + reversion. As for "technically this was an upgrade to the tip of the quincy branch, not to 17.2.6 + revert": the build was based on https://github.com/adk3798/ceph/commits/17-2-6-cephadm-ceph-exporter-reversion which is 17.2.6 + the reversion. Neither of the builds involved in the testing was based on the current quincy branch.
Edit: I was wrong here. I redid the test with an actual 17.2.6 + reversion build and posted it below.

@adk3798 (Contributor, Author) commented Apr 12, 2023

> > the upgrade was to a build based off of https://github.com/adk3798/ceph/commits/17-2-6-cephadm-ceph-exporter-reversion which is the 17.2.6 tag plus the reversion.
>
> The daemons landed on 17.2.6-129-g2e256435 which is the tip of the quincy branch (as of yesterday), not the tag plus the reversion.

ACK, you're right. I only guaranteed that the Python code for the cephadm/orchestrator stuff was on 17.2.6; everything else was latest quincy. I'll redo it real quick with the changes on an actual 17.2.6 build.

@adk3798 (Contributor, Author) commented Apr 12, 2023

> > the upgrade was to a build based off of https://github.com/adk3798/ceph/commits/17-2-6-cephadm-ceph-exporter-reversion which is the 17.2.6 tag plus the reversion.
>
> The daemons landed on 17.2.6-129-g2e256435 which is the tip of the quincy branch (as of yesterday), not the tag plus the reversion.

Retest with an actual 17.2.6 + reversion build:

[ceph: root@vm-00 /]# ceph orch ps
NAME                 HOST   PORTS        STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID  
alertmanager.vm-00   vm-00  *:9093,9094  running (12m)    39s ago  24m    23.3M        -  0.23.0   ba2b418f427c  a6e84411ec9a  
ceph-exporter.vm-00  vm-00               running (10m)    39s ago  10m    6127k        -  17.2.6   9cea3956c04b  0324a529bfe7  
ceph-exporter.vm-01  vm-01               running (10m)     6m ago  10m    6237k        -  17.2.6   9cea3956c04b  920edb43816f  
ceph-exporter.vm-02  vm-02               running (10m)    40s ago  10m    6135k        -  17.2.6   9cea3956c04b  1221ea249519  
crash.vm-00          vm-00               running (13m)    39s ago  24m    7415k        -  17.2.6   9cea3956c04b  70ec2930e7a4  
crash.vm-01          vm-01               running (13m)     6m ago  22m    7407k        -  17.2.6   9cea3956c04b  2bd0575bd3ec  
crash.vm-02          vm-02               running (13m)    40s ago  21m    7402k        -  17.2.6   9cea3956c04b  b53763888e23  
grafana.vm-00        vm-00  *:3000       running (12m)    39s ago  23m    53.3M        -  8.3.5    dad864ee21e9  afebb50901bc  
mgr.vm-00.wltqgk     vm-00  *:8443,9283  running (15m)    39s ago  25m     476M        -  17.2.6   9cea3956c04b  96c1263cdb59  
mgr.vm-01.mipazh     vm-01  *:8443,9283  running (7m)      6m ago  22m    47.6M        -  17.2.6   9cea3956c04b  3527a6a0e29d  
mon.vm-00            vm-00               running (14m)    39s ago  25m    61.3M    2048M  17.2.6   9cea3956c04b  6226387c8a4d  
mon.vm-01            vm-01               running (14m)     6m ago  22m    49.1M    2048M  17.2.6   9cea3956c04b  3c4dc29eac05  
mon.vm-02            vm-02               running (13m)    40s ago  21m    51.4M    2048M  17.2.6   9cea3956c04b  9e4983082521  
node-exporter.vm-00  vm-00  *:9100       running (12m)    39s ago  23m    11.5M        -  1.3.1    1dbe0e931976  741934b33440  
node-exporter.vm-01  vm-01  *:9100       running (12m)     6m ago  22m    21.6M        -  1.3.1    1dbe0e931976  b83a8b3bf028  
node-exporter.vm-02  vm-02  *:9100       running (12m)    40s ago  21m    20.9M        -  1.3.1    1dbe0e931976  4198f572cca7  
prometheus.vm-01     vm-01  *:9095       running (10m)     6m ago  22m    58.4M        -  2.33.4   514e6a882f6e  c966081e31fe  
[ceph: root@vm-00 /]# ceph orch ls
NAME           PORTS        RUNNING  REFRESHED  AGE  PLACEMENT  
alertmanager   ?:9093,9094      1/1  42s ago    25m  count:1    
ceph-exporter                   3/3  7m ago     11m  *          
crash                           3/3  7m ago     25m  *          
grafana        ?:3000           1/1  42s ago    25m  count:1    
mgr                             2/2  7m ago     25m  count:2    
mon                             3/5  7m ago     25m  count:5    
node-exporter  ?:9100           3/3  7m ago     25m  *          
prometheus     ?:9095           1/1  7m ago     25m  count:1    
[ceph: root@vm-00 /]# ceph orch upgrade start quay.io/adk3798/ceph:quincy-testing
Initiating upgrade to quay.io/adk3798/ceph:quincy-testing
[ceph: root@vm-00 /]# ceph orch ps
NAME                 HOST   PORTS        STATUS          REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID  
alertmanager.vm-00   vm-00  *:9093,9094  running (59s)     45s ago  28m    16.0M        -  0.23.0   ba2b418f427c  a4269e6117be  
crash.vm-00          vm-00               running (114s)    45s ago  28m    7419k        -  17.2.6   962a54d7f68f  a1061dc917c1  
crash.vm-01          vm-01               running (109s)    68s ago  26m    7470k        -  17.2.6   962a54d7f68f  d7b98b5ac0c1  
crash.vm-02          vm-02               running (106s)    87s ago  25m    7465k        -  17.2.6   962a54d7f68f  61f97fd65fb8  
grafana.vm-00        vm-00  *:3000       running (48s)     45s ago  27m    38.0M        -  8.3.5    dad864ee21e9  c32f4a3735be  
mgr.vm-00.wltqgk     vm-00  *:8443,9283  running (3m)      45s ago  30m     473M        -  17.2.6   962a54d7f68f  69b1e6ce554b  
mgr.vm-01.mipazh     vm-01  *:8443,9283  running (2m)      68s ago  26m     427M        -  17.2.6   962a54d7f68f  cc4e3a1d4c68  
mon.vm-00            vm-00               running (2m)      45s ago  30m    50.9M    2048M  17.2.6   962a54d7f68f  5dd247b77334  
mon.vm-01            vm-01               running (2m)      68s ago  26m    40.6M    2048M  17.2.6   962a54d7f68f  19a69795ac9c  
mon.vm-02            vm-02               running (2m)      87s ago  25m    39.0M    2048M  17.2.6   962a54d7f68f  d6c6bc28d589  
node-exporter.vm-00  vm-00  *:9100       running (95s)     45s ago  27m    9512k        -  1.3.1    1dbe0e931976  ac49d3d9c341  
node-exporter.vm-01  vm-01  *:9100       running (91s)     68s ago  26m    9.98M        -  1.3.1    1dbe0e931976  4f24e162466c  
node-exporter.vm-02  vm-02  *:9100       running (88s)     87s ago  25m    5406k        -  1.3.1    1dbe0e931976  c5d5cb56c4ad  
prometheus.vm-01     vm-01  *:9095       running (70s)     68s ago  26m    22.8M        -  2.33.4   514e6a882f6e  b3c1527ee543  
[ceph: root@vm-00 /]# ceph orch ls
NAME           PORTS        RUNNING  REFRESHED  AGE  PLACEMENT  
alertmanager   ?:9093,9094      1/1  48s ago    29m  count:1    
crash                           3/3  90s ago    29m  *          
grafana        ?:3000           1/1  48s ago    29m  count:1    
mgr                             2/2  71s ago    29m  count:2    
mon                             3/5  90s ago    29m  count:5    
node-exporter  ?:9100           3/3  90s ago    29m  *          
prometheus     ?:9095           1/1  71s ago    29m  count:1    
[ceph: root@vm-00 /]# ceph version
ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)

Results are the same. I will test 17.2.5 to this build as well.

@adk3798 (Contributor, Author) commented Apr 12, 2023

> > [ceph: root@vm-00 /]# ceph orch ps
> > NAME                 HOST   PORTS        STATUS        REFRESHED  AGE  MEM USE  MEM LIM  VERSION               IMAGE ID      CONTAINER ID  
> > alertmanager.vm-00   vm-00  *:9093,9094  running (3m)     3m ago  97m    16.0M        -  0.23.0                ba2b418f427c  7460239b02bb  
> > crash.vm-00          vm-00               running (4m)     3m ago  97m    7424k        -  17.2.6-129-g2e256435  e49ad829f207  3617d5f2746c
>
> I don't think this invalidates your testing by any means, but technically this was an upgrade to the tip of the quincy branch, not to 17.2.6 + revert.
>
> Can we also try an upgrade from 17.2.5 just in case?

The 17.2.5 to 17.2.6 + reversion upgrade saw no issues either:

[ceph: root@vm-00 /]# ceph orch ps
NAME                 HOST   PORTS        STATUS          REFRESHED   AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID  
alertmanager.vm-00   vm-00  *:9093,9094  running (40s)     18s ago    3m    13.8M        -           ba2b418f427c  7113f722e7f3  
crash.vm-00          vm-00               running (3m)      18s ago    3m    6983k        -  17.2.5   768e01abdf0b  25a2a222958f  
crash.vm-01          vm-01               running (117s)    18s ago  117s    7142k        -  17.2.5   768e01abdf0b  f8b401310dc6  
crash.vm-02          vm-02               running (56s)     19s ago   56s    7134k        -  17.2.5   768e01abdf0b  74cc3975f84b  
grafana.vm-00        vm-00  *:3000       running (36s)     18s ago    2m    41.7M        -  8.3.5    dad864ee21e9  e1a77cd40b50  
mgr.vm-00.bpmtqk     vm-00  *:9283       running (5m)      18s ago    5m     459M        -  17.2.5   768e01abdf0b  74a1f1c521dd  
mgr.vm-01.znuocs     vm-01  *:8443,9283  running (115s)    18s ago  115s     427M        -  17.2.5   768e01abdf0b  47db1c12c923  
mon.vm-00            vm-00               running (5m)      18s ago    5m    45.1M    2048M  17.2.5   768e01abdf0b  b6eccb100c98  
mon.vm-01            vm-01               running (111s)    18s ago  111s    36.7M    2048M  17.2.5   768e01abdf0b  23b44413e547  
mon.vm-02            vm-02               running (53s)     19s ago   53s    33.1M    2048M  17.2.5   768e01abdf0b  a131de769804  
node-exporter.vm-00  vm-00  *:9100       running (2m)      18s ago    2m    11.4M        -           1dbe0e931976  c296f37d07b0  
node-exporter.vm-01  vm-01  *:9100       running (107s)    18s ago  106s    20.1M        -           1dbe0e931976  a944abffd253  
node-exporter.vm-02  vm-02  *:9100       running (50s)     19s ago   49s    13.3M        -           1dbe0e931976  5dc1bb19cc52  
prometheus.vm-01     vm-01  *:9095       running (26s)     18s ago   92s    30.3M        -           514e6a882f6e  13a8ded33c1f  
[ceph: root@vm-00 /]# 
[ceph: root@vm-00 /]# ceph orch ls
NAME           PORTS        RUNNING  REFRESHED  AGE  PLACEMENT  
alertmanager   ?:9093,9094      1/1  25s ago    4m   count:1    
crash                           3/3  26s ago    4m   *          
grafana        ?:3000           1/1  25s ago    4m   count:1    
mgr                             2/2  26s ago    4m   count:2    
mon                             3/5  26s ago    4m   count:5    
node-exporter  ?:9100           3/3  26s ago    4m   *          
prometheus     ?:9095           1/1  26s ago    4m   count:1    
[ceph: root@vm-00 /]# ceph orch upgrade start quay.io/adk3798/ceph:quincy-testing
Initiating upgrade to quay.io/adk3798/ceph:quincy-testing
[ceph: root@vm-00 /]# ceph orch ps
NAME                 HOST   PORTS        STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID  
alertmanager.vm-00   vm-00  *:9093,9094  running (25s)    12s ago   8m    13.9M        -  0.23.0   ba2b418f427c  0ac8d31ebab6  
crash.vm-00          vm-00               running (82s)    12s ago   8m    7411k        -  17.2.6   962a54d7f68f  473a74dd4ca0  
crash.vm-01          vm-01               running (78s)    35s ago   6m    7486k        -  17.2.6   962a54d7f68f  009d5d0dc957  
crash.vm-02          vm-02               running (75s)    53s ago   5m    7478k        -  17.2.6   962a54d7f68f  f8479bd502a8  
grafana.vm-00        vm-00  *:3000       running (14s)    12s ago   7m    36.1M        -  8.3.5    dad864ee21e9  c9a534055c8c  
mgr.vm-00.bpmtqk     vm-00  *:8443,9283  running (3m)     12s ago   9m     470M        -  17.2.6   962a54d7f68f  4f0387dbbc7e  
mgr.vm-01.znuocs     vm-01  *:8443,9283  running (2m)     35s ago   6m     428M        -  17.2.6   962a54d7f68f  295321cf6326  
mon.vm-00            vm-00               running (2m)     12s ago  10m    44.8M    2048M  17.2.6   962a54d7f68f  be6b47800695  
mon.vm-01            vm-01               running (2m)     35s ago   6m    38.7M    2048M  17.2.6   962a54d7f68f  e6085cb2f58d  
mon.vm-02            vm-02               running (99s)    53s ago   5m    36.0M    2048M  17.2.6   962a54d7f68f  a098100bf191  
node-exporter.vm-00  vm-00  *:9100       running (61s)    12s ago   7m    5440k        -  1.3.1    1dbe0e931976  a1e389cf22f6  
node-exporter.vm-01  vm-01  *:9100       running (58s)    35s ago   6m    17.3M        -  1.3.1    1dbe0e931976  58571a79a271  
node-exporter.vm-02  vm-02  *:9100       running (55s)    53s ago   5m    5427k        -  1.3.1    1dbe0e931976  38d79ff25ab5  
prometheus.vm-01     vm-01  *:9095       running (37s)    35s ago   6m    23.7M        -  2.33.4   514e6a882f6e  aafa4516f83c  
[ceph: root@vm-00 /]# ceph orch ls
NAME           PORTS        RUNNING  REFRESHED  AGE  PLACEMENT  
alertmanager   ?:9093,9094      1/1  15s ago    9m   count:1    
crash                           3/3  57s ago    9m   *          
grafana        ?:3000           1/1  15s ago    9m   count:1    
mgr                             2/2  38s ago    9m   count:2    
mon                             3/5  57s ago    9m   count:5    
node-exporter  ?:9100           3/3  57s ago    9m   *          
prometheus     ?:9095           1/1  38s ago    9m   count:1    

There was no ceph-exporter support in 17.2.5, so this was just a standard upgrade.

@adk3798 (Contributor, Author) commented May 2, 2023

https://pulpito.ceph.com/adking-2023-04-26_04:14:04-orch:cephadm-wip-adk3-testing-2023-04-25-1440-quincy-distro-default-smithi/

Reruns of failed jobs: https://pulpito.ceph.com/adking-2023-05-01_12:44:48-orch:cephadm-wip-adk3-testing-2023-04-25-1440-quincy-distro-default-smithi/

After reruns, 9 failed jobs:

  • 1 is the test_non_existent_cluster test, which is currently known to fail in quincy runs
  • the other 8 are upgrade failures caused by the PR in this run that added the mon crush location work

Overall, the mon crush location PR and any upgrade-related PRs shouldn't be merged, but most others should be fine.

@adk3798 (Contributor, Author) commented May 24, 2023

https://pulpito.ceph.com/adking-2023-05-21_23:26:59-orch:cephadm-wip-adk3-testing-2023-05-21-1607-quincy-distro-default-smithi/

Reruns of failed/dead jobs: https://pulpito.ceph.com/adking-2023-05-22_14:09:29-orch:cephadm-wip-adk3-testing-2023-05-21-1607-quincy-distro-default-smithi/

After reruns, 2 failed and 1 dead job:

  • the nfs-ingress test failed while zipping logs after the actual test was complete; from what I can tell from the logs, the test itself completed successfully
  • test_non_existent_cluster failed in the test_nfs test; this is a known issue specifically on quincy
  • upgrade_with_workload timed out on the workload portion post-upgrade; also a known issue

Overall, nothing to block merging.

@adk3798 merged commit c6d53b1 into ceph:quincy on May 24, 2023
10 of 11 checks passed