resume-upgrade fails if highest unit is also the leader unit #303

Closed
phvalguima opened this issue May 7, 2024 · 6 comments
Labels: bug (Something isn't working)

Comments

@phvalguima
Contributor

The `resume-upgrade` action fails with:

Running operation 7 with 1 task
  - task 8 on unit-failover-1

Waiting for task 8...
Action id 8 failed: Highest number unit is unhealthy. Upgrade will not resume.

This happens if the leader unit is also the unit with the highest identifier.

Using pdb, I can confirm the following backtrace:

  /var/lib/juju/agents/unit-failover-1/charm/src/charm.py(267)<module>()
-> main(OpenSearchOperatorCharm)
  /var/lib/juju/agents/unit-failover-1/charm/venv/ops/main.py(544)main()
-> manager.run()
  /var/lib/juju/agents/unit-failover-1/charm/venv/ops/main.py(520)run()
-> self._emit()
  /var/lib/juju/agents/unit-failover-1/charm/venv/ops/main.py(509)_emit()
-> _emit_charm_event(self.charm, self.dispatcher.event_name)
  /var/lib/juju/agents/unit-failover-1/charm/venv/ops/main.py(143)_emit_charm_event()
-> event_to_emit.emit(*args, **kwargs)
  /var/lib/juju/agents/unit-failover-1/charm/venv/ops/framework.py(352)emit()
-> framework._emit(event)
  /var/lib/juju/agents/unit-failover-1/charm/venv/ops/framework.py(851)_emit()
-> self._reemit(event_path)
  /var/lib/juju/agents/unit-failover-1/charm/venv/ops/framework.py(941)_reemit()
-> custom_handler(event)
  /var/lib/juju/agents/unit-failover-1/charm/src/charm.py(188)_on_resume_upgrade_action()
-> self._upgrade.reconcile_partition(action_event=event)
> /var/lib/juju/agents/unit-failover-1/charm/src/machine_upgrade.py(114)reconcile_partition()
-> unhealthy = state is not upgrade.UnitState.HEALTHY

The charm fails because `state` reports:

(Pdb) state
<UnitState.UPGRADING: 'upgrading'>
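
For reference, a minimal sketch of the check that trips here, based on the traceback above (only the `UnitState.UPGRADING` value and the quoted comparison come from the charm; the enum's other value and the helper name are illustrative):

import enum

class UnitState(enum.Enum):
    HEALTHY = "healthy"
    UPGRADING = "upgrading"

def highest_unit_blocks_resume(state: UnitState) -> bool:
    # `state` is the upgrade state reported by the highest-numbered unit.
    # When that unit is also the leader running the resume-upgrade action,
    # it still reports UPGRADING, so this returns True and the action fails
    # with "Highest number unit is unhealthy. Upgrade will not resume."
    unhealthy = state is not UnitState.HEALTHY
    return unhealthy

# Reproduces the observed failure:
assert highest_unit_blocks_resume(UnitState.UPGRADING) is True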

Full Status:

Model                                Controller           Cloud/Region         Version  SLA          Timestamp
test-large-deployment-upgrades-36oo  localhost-localhost  localhost/localhost  3.4.2    unsupported  16:59:24+02:00

App                       Version  Status   Scale  Charm                               Channel        Rev  Exposed  Message
failover                           blocked      2  opensearch                                           1  no       Upgrading. Verify highest unit is healthy & run `resume-upgrade` action. To rollback, `juju refresh` to last revision
main                               active       1  pguimaraes-opensearch-upgrade-test  latest/edge     19  no       
opensearch                         active       3  opensearch                                           0  no       
self-signed-certificates           active       1  self-signed-certificates            latest/stable   72  no       

Unit                         Workload  Agent      Machine  Public address  Ports     Message
failover/0                   active    idle       0        10.173.208.166  9200/tcp  OpenSearch 2.12.0 running; Snap rev 40 (outdated); Charmed operator 1+3cebf31-dirty+3cebf31-dirty+3cebf31-dirty+3cebf...
failover/1*                  active    executing  1        10.173.208.236  9200/tcp  (resume-upgrade) OpenSearch 2.12.0 running; Snap rev 44; Charmed operator 1+3cebf31-dirty+3cebf31-dirty+3cebf31-dirty...
main/0*                      active    idle       2        10.173.208.119  9200/tcp  
opensearch/0                 active    idle       3        10.173.208.182  9200/tcp  
opensearch/1*                active    idle       4        10.173.208.21   9200/tcp  
opensearch/2                 active    idle       5        10.173.208.245  9200/tcp  
self-signed-certificates/0*  active    idle       6        10.173.208.15             

Machine  State    Address         Inst id        Base          AZ  Message
0        started  10.173.208.166  juju-bb32e7-0  ubuntu@22.04      Running
1        started  10.173.208.236  juju-bb32e7-1  ubuntu@22.04      Running
2        started  10.173.208.119  juju-bb32e7-2  ubuntu@22.04      Running
3        started  10.173.208.182  juju-bb32e7-3  ubuntu@22.04      Running
4        started  10.173.208.21   juju-bb32e7-4  ubuntu@22.04      Running
5        started  10.173.208.245  juju-bb32e7-5  ubuntu@22.04      Running
6        started  10.173.208.15   juju-bb32e7-6  ubuntu@22.04      Running
phvalguima added the bug label on May 7, 2024
@phvalguima
Contributor Author

I believe we should accept either the UPGRADING or the HEALTHY state in this check.
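
A minimal sketch of that broadened check, reusing the UnitState enum from the sketch above (illustrative only; the actual change is in the commit referenced below):

def highest_unit_blocks_resume(state: UnitState) -> bool:
    # Treat UPGRADING as acceptable for the highest unit, since the leader
    # reports UPGRADING while it is itself handling the resume-upgrade action.
    return state not in (UnitState.HEALTHY, UnitState.UPGRADING)

# The action is no longer blocked in the reported scenario:
assert highest_unit_blocks_resume(UnitState.UPGRADING) is False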

phvalguima added a commit that referenced this issue May 7, 2024
If the leader happens to be the unit with the highest id, then `resume-upgrade` will fail.

This PR broadens the health check to also accept the `UnitState.UPGRADING` status as a valid healthy status.

Closes #303
@phvalguima
Contributor Author

Likewise, there is another point in the code where this is a problem.

@carlcsaposs-canonical
Contributor

I don't think this is a bug

the highest unit should have upgraded & be healthy before the upgrade is resumed (without force)

@carlcsaposs-canonical
Contributor

For the record, the conclusion: the actual issue (the reason resume-upgrade failed) was

unit-failover-1: 12:39:13 INFO unit.failover/1.juju-log Current health of cluster: ignore
unit-failover-1: 12:39:13 ERROR unit.failover/1.juju-log Cluster is not healthy after upgrade. Manual intervention required. To rollback, `juju refresh` to the previous revision

and the cluster health check, done here:

health = self.health.apply(wait_for_green_first=True, app=False)

should not have returned ignore.

@phvalguima
Contributor Author

As discussed with @carlcsaposs-canonical, the issue was in the call to `self.health.apply`, and the fix is moving to `self.health.get`.
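
A hedged sketch of what that change could look like at the call site (the `apply()` call and its arguments are quoted earlier in this thread; the parameters of `get()` and the helper name are assumptions):

# Hypothetical call-site sketch; the real code lives in the opensearch charm.
def cluster_health_for_upgrade(charm):
    # Old: apply() came back with "ignore" in this scenario, which the
    # upgrade path then treated as "Cluster is not healthy after upgrade".
    # health = charm.health.apply(wait_for_green_first=True, app=False)

    # New, per the discussion above: read the cluster health via get().
    # NOTE: the parameters accepted by health.get() are an assumption here.
    health = charm.health.get(wait_for_green_first=True, app=False)
    return health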
