
Blocked status failed to recover cluster. #329

Open
carlcsaposs-canonical opened this issue Oct 23, 2023 · 6 comments
Labels: bug (Something isn't working)

Comments

carlcsaposs-canonical (Contributor) commented Oct 23, 2023

Steps to reproduce

  1. Steps 3-6 from https://microstack.run/#get-started
  2. juju refresh mysql --channel 8.0/edge
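The reproduction above can be sketched as the following shell session (a sketch only: it assumes the microstack get-started steps have already deployed a `mysql` application into the current model, and that app/model names match the report):

```shell
# Prerequisite: steps 3-6 from https://microstack.run/#get-started
# have deployed the mysql application (rev 99, 8.0/stable).

# Confirm the app is healthy before refreshing.
juju status mysql

# Refresh the charm from 8.0/stable to 8.0/edge (rev 109 in this report).
juju refresh mysql --channel 8.0/edge

# Watch the unit; in this report it enters blocked status with
# message "failed to recover cluster."
juju status mysql --watch 5s
```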

Expected behavior

mysql app upgrades successfully and goes into active state

Actual behavior

mysql app enters blocked status with message "failed to recover cluster."

Versions

Operating system: Ubuntu 22.04.3 LTS

Juju CLI: 3.2.3-genericlinux-amd64

Juju agent: 3.2.0

Charm revision: 99 before refresh (current 8.0/stable), 109 after refresh (current 8.0/edge)

microk8s: MicroK8s v1.26.9 revision 6059

Log output

Juju debug log:
sunbeam-debug-log.txt
sunbeam-debug-log-filtered.txt

unit-mysql-0: 09:22:26 INFO juju.cmd running containerAgent [3.2.0 c7107ada8c471aa3ba105e5433e61861227e2ed4 gc go1.20.4]
unit-mysql-0: 09:22:26 INFO juju.worker.upgradesteps upgrade steps for 3.2.0 have already been run.
unit-mysql-0: 09:22:26 INFO juju.api connection established to "wss://10.150.15.206:17070/model/9b07ebf5-8cf1-4858-8a94-3086f8416535/api"
unit-mysql-0: 09:22:26 INFO juju.worker.migrationminion migration phase is now: NONE
unit-mysql-0: 09:22:26 INFO juju.worker.caasupgrader abort check blocked until version event received
unit-mysql-0: 09:22:26 WARNING juju.worker.proxyupdater unable to set snap core settings [proxy.http= proxy.https= proxy.store=]: exec: "snap": executable file not found in $PATH, output: ""
unit-mysql-0: 09:22:26 INFO juju.agent.tools ensure jujuc symlinks in /var/lib/juju/tools/unit-mysql-0
unit-mysql-0: 09:22:27 INFO juju.worker.uniter hooks are retried true
unit-mysql-0: 09:22:27 INFO juju.downloader downloading from ch:amd64/jammy/mysql-k8s-109
unit-mysql-0: 09:22:27 INFO juju.downloader download verified ("ch:amd64/jammy/mysql-k8s-109")
unit-mysql-0: 09:22:37 INFO juju.worker.uniter found queued "upgrade-charm" hook
unit-mysql-0: 09:22:39 ERROR unit.mysql/0.juju-log Cluster upgrade failed, ensure pre-upgrade checks are ran first.
unit-mysql-0: 09:22:39 INFO juju.worker.uniter found queued "config-changed" hook
unit-mysql-0: 09:22:40 INFO juju.worker.uniter.operation ran "config-changed" hook (via hook dispatching script: dispatch)
unit-mysql-0: 09:22:40 INFO juju.worker.uniter reboot detected; triggering implicit start hook to notify charm
unit-mysql-0: 09:22:41 INFO unit.mysql/0.juju-log Running legacy hooks/start.
unit-mysql-0: 09:22:44 INFO unit.mysql/0.juju-log Setting up the logrotate configurations
unit-mysql-0: 09:22:51 INFO unit.mysql/0.juju-log Unit workload member-state is offline with member-role unknown
unit-mysql-0: 09:22:52 ERROR unit.mysql/0.juju-log Failed to reboot cluster
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-mysql-0/charm/src/mysql_k8s_helpers.py", line 684, in _run_mysqlsh_script
    stdout, _ = process.wait_output()
  File "/var/lib/juju/agents/unit-mysql-0/charm/venv/ops/pebble.py", line 1359, in wait_output
    raise ExecError[AnyStr](self._command, exit_code, out_value, err_value)
ops.pebble.ExecError: non-zero exit code 1 executing ['/usr/bin/mysqlsh', '--no-wizard', '--python', '--verbose=1', '-f', '/tmp/script.py', ';', 'rm', '/tmp/script.py'], stdout='', stderr="Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory\nverbose: 2023-10-23T09:22:52Z: Loading startup files...\nverbose: 2023-10-23T09:22:52Z: Loading plugins...\nverbose: 2023-10-23T09:22:52Z: Connecting to MySQL at: clusteradmin@mysql-0.mysql-endpoints\nverbose: 2023-10-23T09:22:52Z: Shell.connect: tid=33: CONNECTED: mysql-0.mysql-endpoints\nverbose: 2023-10-23T09:22:52Z: Connecting to MySQL at: mysql://clusteradmin@mysql-0.mysql-endpoints:3306?connect-timeout=5000\nverbose: 2023-10-23T09:22:52Z: Dba.reboot_cluster_from_complete_outage: tid=34: CONNECTED: mysql-0.mysql-endpoints:3306\nverbose: 2023-10-23T09:22:52Z: Connecting to MySQL at: mysql://clusteradmin@mysql-0.mysql-endpoints:3306?connect-timeout=5000\nverbose: 2023-10-23T09:22:52Z: Dba.reboot_cluster_from_complete_outage: tid=35: CONNECTED: mysql-0.mysql-endpoints:3306\nverbose: 2023-10-23T09:22:52Z: Group Replication 'group_name' value: 072799b1-7180-11ee-bc9f-76d5c7fb0362\nverbose: 2023-10-23T09:22:52Z: Metadata 'group_name' value: 072799b1-718" [truncated]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-mysql-0/charm/lib/charms/mysql/v0/mysql.py", line 1989, in reboot_from_complete_outage
    self._run_mysqlsh_script("\n".join(reboot_from_outage_command))
  File "/var/lib/juju/agents/unit-mysql-0/charm/src/mysql_k8s_helpers.py", line 687, in _run_mysqlsh_script
    raise MySQLClientError(e.stderr)
charms.mysql.v0.mysql.MySQLClientError: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
verbose: 2023-10-23T09:22:52Z: Loading startup files...
verbose: 2023-10-23T09:22:52Z: Loading plugins...
verbose: 2023-10-23T09:22:52Z: Connecting to MySQL at: clusteradmin@mysql-0.mysql-endpoints
verbose: 2023-10-23T09:22:52Z: Shell.connect: tid=33: CONNECTED: mysql-0.mysql-endpoints
verbose: 2023-10-23T09:22:52Z: Connecting to MySQL at: mysql://clusteradmin@mysql-0.mysql-endpoints:3306?connect-timeout=5000
verbose: 2023-10-23T09:22:52Z: Dba.reboot_cluster_from_complete_outage: tid=34: CONNECTED: mysql-0.mysql-endpoints:3306
verbose: 2023-10-23T09:22:52Z: Connecting to MySQL at: mysql://clusteradmin@mysql-0.mysql-endpoints:3306?connect-timeout=5000
verbose: 2023-10-23T09:22:52Z: Dba.reboot_cluster_from_complete_outage: tid=35: CONNECTED: mysql-0.mysql-endpoints:3306
verbose: 2023-10-23T09:22:52Z: Group Replication 'group_name' value: 072799b1-7180-11ee-bc9f-76d5c7fb0362
verbose: 2023-10-23T09:22:52Z: Metadata 'group_name' value: 072799b1-7180-11ee-bc9f-76d5c7fb0362
verbose: 2023-10-23T09:22:52Z: Connecting to MySQL at: mysql://clusteradmin@mysql-0.mysql-endpoints.openstack.svc.cluster.local:3306?connect-timeout=5000
verbose: 2023-10-23T09:22:52Z: Dba.reboot_cluster_from_complete_outage: tid=36: CONNECTED: mysql-0.mysql-endpoints.openstack.svc.cluster.local:3306
verbose: 2023-10-23T09:22:52Z: Connecting to MySQL at: mysql://clusteradmin@mysql-0.mysql-endpoints.openstack.svc.cluster.local:3306?connect-timeout=5000
verbose: 2023-10-23T09:22:52Z: Dba.reboot_cluster_from_complete_outage: tid=37: CONNECTED: mysql-0.mysql-endpoints.openstack.svc.cluster.local:3306
No PRIMARY member found for cluster 'cluster-b56bbe7bd4a6cc012b44ba93360df3b5'
verbose: 2023-10-23T09:22:52Z: ClusterSet info: member, primary, not primary_invalidated, not removed from set, primary status: UNKNOWN
Restoring the Cluster 'cluster-b56bbe7bd4a6cc012b44ba93360df3b5' from complete outage...

ERROR: RuntimeError: The current session instance does not belong to the Cluster: 'cluster-b56bbe7bd4a6cc012b44ba93360df3b5'.
Traceback (most recent call last):
  File "<string>", line 2, in <module>
RuntimeError: Dba.reboot_cluster_from_complete_outage: The current session instance does not belong to the Cluster: 'cluster-b56bbe7bd4a6cc012b44ba93360df3b5'.


unit-mysql-0: 09:22:53 INFO juju.worker.uniter.operation ran "mysql-pebble-ready" hook (via hook dispatching script: dispatch)

Additional context

Attempted to reproduce an issue encountered by @javacruft

carlcsaposs-canonical added the bug label Oct 23, 2023
carlcsaposs-canonical transferred this issue from canonical/mysql-operator Oct 23, 2023
canonical deleted a comment from github-actions bot Oct 23, 2023
carlcsaposs-canonical (Contributor, Author) commented:

Potential cause: ERROR unit.mysql/0.juju-log Cluster upgrade failed, ensure pre-upgrade checks are ran first.

carlcsaposs-canonical (Contributor, Author) commented Oct 23, 2023

Tried running pre-upgrade-check before juju refresh.

Result: blocked status with message "upgrade failed. Check logs for rollback instruction"

pre-upgrade-debug-log.txt
pre-upgrade-debug-log-filtered.txt
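The pre-upgrade flow tried above can be sketched as follows (a sketch under assumptions: it uses the charm's `pre-upgrade-check` action with Juju 3.x `juju run` syntax, and the rollback command is inferred from the revision numbers in the report, not quoted from it):

```shell
# Run the pre-upgrade check on the leader unit before refreshing.
juju run mysql/leader pre-upgrade-check

# Then refresh; in this report the app still ended up blocked with
# "upgrade failed. Check logs for rollback instruction".
juju refresh mysql --channel 8.0/edge

# Possible rollback to the pre-refresh revision (rev 99 on 8.0/stable
# per the report) -- syntax assumed, verify against your Juju version.
juju refresh mysql --channel 8.0/stable --revision 99
```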

gboutry commented Oct 31, 2023

Encountered the same issue in a deployment with 7 mysql servers. 5 out of the 7 failed to recover after a machine reboot with the same error.

Complete debug log:
debug-log.log
Each failing mysql server logs:
cinder-mysql.log
heat-mysql.log
keystone-mysql.log
nova-mysql.log
placement-mysql.log

paulomach (Contributor) commented:

> Encountered the same issue in a deployment with 7 mysql servers. 5 out of the 7 failed to recover after a machine reboot with the same error.

@gboutry there's a fix in PR #324, released in the edge channel. We are working to promote it to stable.

paulomach (Contributor) commented:

@gboutry have you had the chance to validate the fix?
