mysql-router-k8s: bootstrap failure #345

Closed
javacruft opened this issue Nov 7, 2023 · 4 comments · Fixed by canonical/mysql-router-k8s-operator#187 or canonical/mysql-router-operator#100
Labels
bug Something isn't working

Comments

javacruft commented Nov 7, 2023

Steps to reproduce

A failed multi-node test run from the Canonical Solutions QA team.

Multi-node microstack deployment on bare metal in many-mysql mode (one MySQL per service).

The majority of MySQL apps deploy and scale correctly; however, one mysql-router-k8s instance failed to bootstrap.

Expected behavior

All mysql-router-k8s units bootstrap correctly.

Actual behavior

Failure of a single mysql-router-k8s unit.

Versions

Operating system: 22.04

Juju CLI: 3.2.3
Juju agent: 3.2.3

mysql-k8s charm revision: 99
mysql-router-k8s charm revision: 69

microk8s: 1.26-strict/stable

Log output

Logs from the failed deployment are linked from:

https://bugs.launchpad.net/snap-openstack/+bug/2042906

direct link:

https://oil-jenkins.canonical.com/artifacts/628e5903-4772-4a3e-9b0a-80cc04d3c6d3/index.html

Additional context

https://bugs.launchpad.net/snap-openstack/+bug/2042906

javacruft added the bug label (Something isn't working) on Nov 7, 2023

github-actions bot commented Nov 7, 2023

javacruft changed the title from "mysql-router fails to br" to "mysql-router-k8s: bootstrap failure" on Nov 7, 2023
@carlcsaposs-canonical

Waiting for canonical/data-platform-libs#108 before investigating

@carlcsaposs-canonical

It looks like the connection to the MySQL server was quite unreliable.

My interpretation of the logs for the failed unit:

```
2023-11-06T19:52:22.166Z [container-agent] 2023-11-06 19:52:22 ERROR juju-log backend-database:159: Failed to run logged_commands=["shell.connect('relation-159:***@heat-mysql-primary.openstack.svc.cluster.local:3306')", 'result = session.run_sql("SELECT USER, ATTRIBUTE->>\'$.router_id\' FROM INFORMATION_SCHEMA.USER_ATTRIBUTES WHERE ATTRIBUTE->\'$.created_by_user\'=\'relation-159\' AND ATTRIBUTE->\'$.created_by_juju_unit\'=\'heat-cfn-mysql-router/0\'")', 'print(result.fetch_all())']
2023-11-06T19:52:22.166Z [container-agent] stderr:
2023-11-06T19:52:22.166Z [container-agent] Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
2023-11-06T19:52:22.166Z [container-agent] Traceback (most recent call last):
2023-11-06T19:52:22.166Z [container-agent]   File "<string>", line 1, in <module>
2023-11-06T19:52:22.166Z [container-agent] mysqlsh.DBError: MySQL Error (2003): Shell.connect: Can't connect to MySQL server on 'heat-mysql-primary.openstack.svc.cluster.local:3306' (111)
```

Router fails here when checking whether an old router user and its metadata need to be cleaned up:
https://github.com/canonical/mysql-router-k8s-operator/blob/1704b4e190e394cfaba6b68b06debd7ae2b9a606/src/workload.py#L230

```
2023-11-06T19:53:02.306Z [container-agent] 2023-11-06 19:53:02 ERROR juju-log backend-database:159: Failed to bootstrap router
2023-11-06T19:53:02.306Z [container-agent] logged_command=['--bootstrap', 'relation-159:***@heat-mysql-primary.openstack.svc.cluster.local:3306', '--strict', '--conf-set-option', 'http_server.bind_address=127.0.0.1', '--conf-use-gr-notifications']
2023-11-06T19:53:02.306Z [container-agent] stderr:
2023-11-06T19:53:02.306Z [container-agent] Error: The provided server is currently not in a InnoDB cluster group with quorum and thus may contain inaccurate or outdated data.
```

Router has succeeded in cleaning up the old router user and metadata (since it now fails one line later):
https://github.com/canonical/mysql-router-k8s-operator/blob/1704b4e190e394cfaba6b68b06debd7ae2b9a606/src/workload.py#L231

```
2023-11-06T19:54:20.935Z [container-agent] 2023-11-06 19:54:20 ERROR juju-log backend-database:159: Failed to bootstrap router
2023-11-06T19:54:20.935Z [container-agent] logged_command=['--bootstrap', 'relation-159:***@heat-mysql-primary.openstack.svc.cluster.local:3306', '--strict', '--conf-set-option', 'http_server.bind_address=127.0.0.1', '--conf-use-gr-notifications']
2023-11-06T19:54:20.935Z [container-agent] stderr:
2023-11-06T19:54:20.935Z [container-agent] Error: It appears that a router instance named 'system' has been previously configured in this host. If that instance no longer exists, use the --force option to overwrite it.
```

I'm guessing that, while the MySQL server was recovering, it overrode or reverted the changes mysql-router made to the router metadata (but not the deletion of the router user).


I believe the cause of this issue is the same as here: #260 (comment)

The MySQL server is providing connection information to MySQL Router when it is not ready to serve traffic (i.e., when it is not part of a group with quorum).

Router, when it sees the connection information, assumes that the cluster is available and that any operations it performs will be persisted. Router deletes the router user and the router cluster metadata, assuming that if one of those changes goes through, both will (it deletes the user after the metadata as a safeguard). However, during the server's recovery process, the user deletion goes through but the metadata deletion is reverted, causing router to fail to bootstrap.
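To make this sequence concrete, here is a toy Python model of the pre-fix cleanup logic. It is not the charm's code: `ClusterState`, `cleanup_if_old_user_exists`, `bootstrap`, and `BootstrapError` are hypothetical names, and the cluster is reduced to two independent sets (router users and router metadata) so that the server's recovery can revert one change without the other. The intermediate "not in quorum" bootstrap failure from the logs is left out.

```python
# Toy model of the cleanup race (hypothetical names; NOT the charm's code).
# Users and router metadata are kept as independent sets so that one change
# can be reverted without the other, as during the server's recovery.
from dataclasses import dataclass, field


class BootstrapError(Exception):
    """Stands in for mysqlrouter refusing to bootstrap."""


@dataclass
class ClusterState:
    users: set = field(default_factory=set)    # router users
    routers: set = field(default_factory=set)  # router metadata entries


def cleanup_if_old_user_exists(state: ClusterState, user: str, router: str) -> None:
    """Pre-fix behaviour: only clean up when the old router user still exists."""
    if user in state.users:
        state.routers.discard(router)  # delete the metadata first...
        state.users.discard(user)      # ...then the user, as a safeguard


def bootstrap(state: ClusterState, router: str) -> None:
    """Bootstrap fails if metadata for this router is still registered."""
    if router in state.routers:
        raise BootstrapError(f"router instance {router!r} has been previously configured")
    state.routers.add(router)


def simulate() -> str:
    state = ClusterState(users={"old-router-user"}, routers={"system"})
    cleanup_if_old_user_exists(state, "old-router-user", "system")
    # Server recovery reverts the metadata deletion; the user deletion sticks.
    state.routers.add("system")
    # Next hook: the old user is gone, so no cleanup is attempted.
    cleanup_if_old_user_exists(state, "old-router-user", "system")
    try:
        bootstrap(state, "system")
    except BootstrapError as exc:
        return f"bootstrap failed: {exc}"
    return "bootstrap succeeded"


if __name__ == "__main__":
    print(simulate())
```

Running it prints `bootstrap failed: router instance 'system' has been previously configured`: once the user is gone, nothing triggers the metadata cleanup again, so every later bootstrap attempt hits the same leftover entry.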

carlcsaposs-canonical transferred this issue from canonical/mysql-router-k8s-operator on Dec 6, 2023

github-actions bot commented Dec 6, 2023

carlcsaposs-canonical added a commit to canonical/mysql-router-k8s-operator that referenced this issue Jan 10, 2024
Instead of relying on the existence of the old router user to clean up the router from the cluster metadata, force bootstrap so that the metadata does not need to be cleaned up.

Fixes canonical/mysql-k8s-operator#345. The issue was that the router charm would delete the user & router metadata, but only the user deletion would go through. Then, on the router charm's next hook, the bootstrap failed because the metadata was not cleaned up.
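The fix described in this commit message can be sketched as follows. This assumes the charm builds a mysqlrouter argument list like the `logged_command` shown in the logs; `build_bootstrap_args` is a hypothetical helper, and the actual change in canonical/mysql-router-k8s-operator#187 may be structured differently. `--force` is the option suggested by the "previously configured" error message, which says it overwrites the existing router instance.

```python
from typing import List


def build_bootstrap_args(username: str, password: str, host: str, port: int = 3306) -> List[str]:
    """Hypothetical helper: mysqlrouter bootstrap arguments mirroring the logs, plus --force."""
    return [
        "--bootstrap",
        f"{username}:{password}@{host}:{port}",
        "--strict",
        # Overwrite a router instance still registered in the cluster metadata,
        # instead of relying on the old router user to trigger a cleanup first.
        "--force",
        "--conf-set-option",
        "http_server.bind_address=127.0.0.1",
        "--conf-use-gr-notifications",
    ]


if __name__ == "__main__":
    args = build_bootstrap_args(
        "relation-159", "example-password", "heat-mysql-primary.openstack.svc.cluster.local"
    )
    # A caller would then run mysqlrouter with these arguments (not shown here).
    print(args)
```

With `--force`, a leftover router entry in the metadata no longer blocks bootstrap, so it stops mattering whether the earlier metadata deletion was persisted or reverted.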
carlcsaposs-canonical added a commit to canonical/mysql-router-k8s-operator that referenced this issue Jan 10, 2024

Additional context:
https://chat.canonical.com/canonical/pl/temrphcp3in5xqftkaxgowqtkr
carlcsaposs-canonical added a commit to canonical/mysql-router-operator that referenced this issue Jan 10, 2024
Instead of relying on the existence of the old router user to clean up the router from the cluster metadata, force bootstrap so that the metadata does not need to be cleaned up.

Fixes canonical/mysql-k8s-operator#345. The issue was that the router charm would delete the user & router metadata, but only the user deletion would go through. Then, on the router charm's next hook, the bootstrap failed because the metadata was not cleaned up.

Ported from canonical/mysql-router-k8s-operator#187
carlcsaposs-canonical added a commit to canonical/mysql-router-operator that referenced this issue Mar 5, 2024
carlcsaposs-canonical added a commit to canonical/mysql-router-operator that referenced this issue Mar 5, 2024