resume-upgrade fails if highest unit is also the leader unit #303

Closed
phvalguima opened this issue May 7, 2024 · 6 comments
Labels: bug (Something isn't working)

Comments

@phvalguima
Contributor

The `resume-upgrade` action fails with:

Running operation 7 with 1 task
  - task 8 on unit-failover-1

Waiting for task 8...
Action id 8 failed: Highest number unit is unhealthy. Upgrade will not resume.

This happens if the leader unit is also the unit with the highest identifier.

Using pdb, I can confirm the following backtrace:

  /var/lib/juju/agents/unit-failover-1/charm/src/charm.py(267)<module>()
-> main(OpenSearchOperatorCharm)
  /var/lib/juju/agents/unit-failover-1/charm/venv/ops/main.py(544)main()
-> manager.run()
  /var/lib/juju/agents/unit-failover-1/charm/venv/ops/main.py(520)run()
-> self._emit()
  /var/lib/juju/agents/unit-failover-1/charm/venv/ops/main.py(509)_emit()
-> _emit_charm_event(self.charm, self.dispatcher.event_name)
  /var/lib/juju/agents/unit-failover-1/charm/venv/ops/main.py(143)_emit_charm_event()
-> event_to_emit.emit(*args, **kwargs)
  /var/lib/juju/agents/unit-failover-1/charm/venv/ops/framework.py(352)emit()
-> framework._emit(event)
  /var/lib/juju/agents/unit-failover-1/charm/venv/ops/framework.py(851)_emit()
-> self._reemit(event_path)
  /var/lib/juju/agents/unit-failover-1/charm/venv/ops/framework.py(941)_reemit()
-> custom_handler(event)
  /var/lib/juju/agents/unit-failover-1/charm/src/charm.py(188)_on_resume_upgrade_action()
-> self._upgrade.reconcile_partition(action_event=event)
> /var/lib/juju/agents/unit-failover-1/charm/src/machine_upgrade.py(114)reconcile_partition()
-> unhealthy = state is not upgrade.UnitState.HEALTHY

The charm fails because `state` reports:

(Pdb) state
<UnitState.UPGRADING: 'upgrading'>
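
For reference, a minimal sketch of the check that trips here, based on the traceback above (only the `UnitState.UPGRADING` value and the quoted comparison come from the charm; the enum's other value and the helper name are illustrative):

import enum

class UnitState(enum.Enum):
    HEALTHY = "healthy"
    UPGRADING = "upgrading"

def highest_unit_blocks_resume(state: UnitState) -> bool:
    # `state` is the upgrade state reported by the highest-numbered unit.
    # When that unit is also the leader running the resume-upgrade action,
    # it still reports UPGRADING, so this returns True and the action fails
    # with "Highest number unit is unhealthy. Upgrade will not resume."
    unhealthy = state is not UnitState.HEALTHY
    return unhealthy

# Reproduces the observed failure:
assert highest_unit_blocks_resume(UnitState.UPGRADING) is True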

Full Status:

Model                                Controller           Cloud/Region         Version  SLA          Timestamp
test-large-deployment-upgrades-36oo  localhost-localhost  localhost/localhost  3.4.2    unsupported  16:59:24+02:00

App                       Version  Status   Scale  Charm                               Channel        Rev  Exposed  Message
failover                           blocked      2  opensearch                                           1  no       Upgrading. Verify highest unit is healthy & run `resume-upgrade` action. To rollback, `juju refresh` to last revision
main                               active       1  pguimaraes-opensearch-upgrade-test  latest/edge     19  no       
opensearch                         active       3  opensearch                                           0  no       
self-signed-certificates           active       1  self-signed-certificates            latest/stable   72  no       

Unit                         Workload  Agent      Machine  Public address  Ports     Message
failover/0                   active    idle       0        10.173.208.166  9200/tcp  OpenSearch 2.12.0 running; Snap rev 40 (outdated); Charmed operator 1+3cebf31-dirty+3cebf31-dirty+3cebf31-dirty+3cebf...
failover/1*                  active    executing  1        10.173.208.236  9200/tcp  (resume-upgrade) OpenSearch 2.12.0 running; Snap rev 44; Charmed operator 1+3cebf31-dirty+3cebf31-dirty+3cebf31-dirty...
main/0*                      active    idle       2        10.173.208.119  9200/tcp  
opensearch/0                 active    idle       3        10.173.208.182  9200/tcp  
opensearch/1*                active    idle       4        10.173.208.21   9200/tcp  
opensearch/2                 active    idle       5        10.173.208.245  9200/tcp  
self-signed-certificates/0*  active    idle       6        10.173.208.15             

Machine  State    Address         Inst id        Base          AZ  Message
0        started  10.173.208.166  juju-bb32e7-0  ubuntu@22.04      Running
1        started  10.173.208.236  juju-bb32e7-1  ubuntu@22.04      Running
2        started  10.173.208.119  juju-bb32e7-2  ubuntu@22.04      Running
3        started  10.173.208.182  juju-bb32e7-3  ubuntu@22.04      Running
4        started  10.173.208.21   juju-bb32e7-4  ubuntu@22.04      Running
5        started  10.173.208.245  juju-bb32e7-5  ubuntu@22.04      Running
6        started  10.173.208.15   juju-bb32e7-6  ubuntu@22.04      Running
phvalguima added the bug label on May 7, 2024
@phvalguima
Contributor Author

I believe we should accept either the UPGRADING or the HEALTHY state in this check.
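
A minimal sketch of that broadened check, reusing the UnitState enum from the sketch above (illustrative only; the actual change is in the commit referenced below):

def highest_unit_blocks_resume(state: UnitState) -> bool:
    # Treat UPGRADING as acceptable for the highest unit, since the leader
    # reports UPGRADING while it is itself handling the resume-upgrade action.
    return state not in (UnitState.HEALTHY, UnitState.UPGRADING)

# The action is no longer blocked in the reported scenario:
assert highest_unit_blocks_resume(UnitState.UPGRADING) is False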

phvalguima added a commit that referenced this issue May 7, 2024
If the leader happens to be the unit with the highest id, then `resume-upgrade` will fail.

This PR broadens the health check to also accept the `UnitState.UPGRADING` status as a valid healthy status.

Closes #303
@phvalguima
Contributor Author

Likewise, there is another point in the code where this is a problem.

@carlcsaposs-canonical
Contributor

I don't think this is a bug

the highest unit should have upgraded & be healthy before the upgrade is resumed (without force)

@carlcsaposs-canonical
Contributor

For the record, the conclusion: the actual issue (the reason resume-upgrade failed) was

unit-failover-1: 12:39:13 INFO unit.failover/1.juju-log Current health of cluster: ignore
unit-failover-1: 12:39:13 ERROR unit.failover/1.juju-log Cluster is not healthy after upgrade. Manual intervention required. To rollback, `juju refresh` to the previous revision

and the cluster health check, done here:

health = self.health.apply(wait_for_green_first=True, app=False)

should not have returned ignore.

@phvalguima
Contributor Author

As discussed with @carlcsaposs-canonical, the issue was in the call to `self.health.apply`, and the fix is moving to `self.health.get`.
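
A hedged sketch of what that change could look like at the call site (the `apply()` call and its arguments are quoted earlier in this thread; the parameters of `get()` and the helper name are assumptions):

# Hypothetical call-site sketch; the real code lives in the opensearch charm.
def cluster_health_for_upgrade(charm):
    # Old: apply() came back with "ignore" in this scenario, which the
    # upgrade path then treated as "Cluster is not healthy after upgrade".
    # health = charm.health.apply(wait_for_green_first=True, app=False)

    # New, per the discussion above: read the cluster health via get().
    # NOTE: the parameters accepted by health.get() are an assumption here.
    health = charm.health.get(wait_for_green_first=True, app=False)
    return health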
