mysql-router-k8s: One of three units down #173

Closed
gnuoy opened this issue Dec 4, 2023 · 12 comments

Labels
bug Something isn't working

Comments

gnuoy commented Dec 4, 2023

Steps to reproduce

Failed multi-node test run from the Canonical Solutions QA team.

Multi-node MicroStack deployment on bare metal in many-mysql mode (one mysql per service).
Telemetry was then enabled with sunbeam enable telemetry.

All other mysql and mysql-router apps deployed and scaled correctly; however, one mysql-router-k8s instance failed to bootstrap.

Expected behavior

All mysql-router-k8s units bootstrap correctly.

Actual behavior

A single mysql-router-k8s unit failed, and the failure is not reflected in the unit's workload status:

$ juju status -m openstack  --format=short aodh-mysql-router
- aodh-mysql-router/0:  (agent:idle, workload:active)
- aodh-mysql-router/1:  (agent:idle, workload:active)
- aodh-mysql-router/2:  (agent:idle, workload:active)
$ juju ssh -m openstack --container mysql-router aodh-mysql-router/0 "ps -ef | grep mysqlrouter | grep -v grep"
mysql         26       1  0 09:45 ?        00:00:12 mysqlrouter --config /etc/mysqlrouter/mysqlrouter.conf
$ juju ssh -m openstack --container mysql-router aodh-mysql-router/1 "ps -ef | grep mysqlrouter | grep -v grep"
ERROR command terminated with exit code 1
$ juju ssh -m openstack --container mysql-router aodh-mysql-router/2 "ps -ef | grep mysqlrouter | grep -v grep"
mysql         27       1  0 09:45 ?        00:00:19 mysqlrouter --config /etc/mysqlrouter/mysqlrouter.conf

Versions

Operating system: Ubuntu 22.04.3

Juju CLI: 3.2.4-genericlinux-amd64

Juju agent: 3.2.4

mysql-k8s charm revision: 99
mysql-router-k8s charm revision: 69

microk8s: 1.28-strict/stable

Log output

Juju debug log: see the artifacts from the failed run at https://oil-jenkins.canonical.com/artifacts/e149fed6-4aee-4a4a-b6ee-d6d931a503f5/index.html

# cat mysqlrouter.log 
2023-12-04 09:45:55 io INFO [7fd9e2ebe240] starting 4 io-threads, using backend 'linux_epoll'
2023-12-04 09:45:55 http_server INFO [7fd9e2ebe240] listening on 127.0.0.1:8443
2023-12-04 09:45:55 metadata_cache_plugin INFO [7fd9d4ff9640] Starting Metadata Cache
2023-12-04 09:45:55 metadata_cache INFO [7fd9d4ff9640] Connections using ssl_mode 'PREFERRED'
2023-12-04 09:45:55 metadata_cache INFO [7fd9dc5ea640] Starting metadata cache refresh thread
2023-12-04 09:45:55 routing INFO [7fd9a9ffb640] [routing:bootstrap_ro] started: routing strategy = round-robin-with-fallback
2023-12-04 09:45:55 routing INFO [7fd9a9ffb640] Start accepting connections for routing routing:bootstrap_ro listening on 6447
2023-12-04 09:45:55 routing INFO [7fd9a97fa640] [routing:bootstrap_rw] started: routing strategy = first-available
2023-12-04 09:45:55 routing INFO [7fd9a97fa640] Start accepting connections for routing routing:bootstrap_rw listening on 6446
2023-12-04 09:45:55 routing INFO [7fd9a8ff9640] [routing:bootstrap_x_ro] started: routing strategy = round-robin-with-fallback
2023-12-04 09:45:55 routing INFO [7fd9a8ff9640] Start accepting connections for routing routing:bootstrap_x_ro listening on 6449
2023-12-04 09:45:55 routing INFO [7fd9a3fff640] [routing:bootstrap_x_rw] started: routing strategy = first-available
2023-12-04 09:45:55 routing INFO [7fd9a3fff640] Start accepting connections for routing routing:bootstrap_x_rw listening on 6448
2023-12-04 09:45:55 metadata_cache INFO [7fd9dc5ea640] Connected with metadata server running on aodh-mysql-0.aodh-mysql-endpoints.openstack.svc.cluster.local:3306
2023-12-04 09:45:55 metadata_cache INFO [7fd9dc5ea640] New router options read from the metadata '{"target_cluster": "primary", "stats_updates_frequency": 0, "use_replica_primary_as_rw": false, "invalidated_cluster_policy": "drop_all"}', was ''
2023-12-04 09:45:55 metadata_cache INFO [7fd9dc5ea640] New target cluster assigned in the metadata: 'cluster-96cae741a2541f69d730bc71ff97820d'
2023-12-04 09:45:55 metadata_cache WARNING [7fd9dc5ea640] GR member aodh-mysql-2.aodh-mysql-endpoints.openstack.svc.cluster.local:3306 (ca55f21f-9289-11ee-958d-66e6c3edb049) Online, missing in the metadata
2023-12-04 09:45:55 metadata_cache WARNING [7fd9dc5ea640] aodh-mysql-0.aodh-mysql-endpoints.openstack.svc.cluster.local:3306 is not part of quorum for cluster 'cluster-96cae741a2541f69d730bc71ff97820d'
2023-12-04 09:45:55 metadata_cache ERROR [7fd9dc5ea640] Unable to fetch live group_replication member data from any server in cluster 'cluster-96cae741a2541f69d730bc71ff97820d'

Additional context

This is very similar to canonical/mysql-k8s-operator#345, but in this case there is no mysql-router hook error; the mysql-router bootstrap appears to have failed silently.

gnuoy added the bug label on Dec 4, 2023

javacruft commented

I just finished a triage run-through of our last set of tests from Solutions QA, and I can identify 4-5 other test runs that appear to suffer from the same underlying issue: the mysql-router apps are all active from Juju's perspective, but the consuming service cannot reach its DB and gets a connection-refused error.

carlcsaposs-canonical commented

2023-12-04 09:45:55 metadata_cache WARNING [7fd9dc5ea640] aodh-mysql-0.aodh-mysql-endpoints.openstack.svc.cluster.local:3306 is not part of quorum for cluster 'cluster-96cae741a2541f69d730bc71ff97820d'
2023-12-04 09:45:55 metadata_cache ERROR [7fd9dc5ea640] Unable to fetch live group_replication member data from any server in cluster 'cluster-96cae741a2541f69d730bc71ff97820d'

This makes me think it's related to canonical/mysql-k8s-operator#345.

But I'm quite surprised that the juju status is active. Did the router recover afterwards?

Do you have juju debug logs for the openstack model, or logs for the pods? I'd like to see whether the router charm logged any errors in the debug log.

From the mysqlrouter.log, it looks like the bootstrap was successful but the cluster went offline after the bootstrap. I'm not 100% sure, though. Also, if that happened, the mysql-router charm's health check should have shown an error on the next update-status.
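
For context, the kind of update-status health check being referred to could look roughly like this in an ops charm. This is an illustrative sketch only; the class name and service names here are assumptions, not the actual mysql-router-k8s implementation:

import ops

class RouterCharm(ops.CharmBase):  # hypothetical charm class
    def __init__(self, *args):
        super().__init__(*args)
        self.framework.observe(self.on.update_status, self._on_update_status)

    def _on_update_status(self, event: ops.EventBase) -> None:
        container = self.unit.get_container("mysql-router")
        if not container.can_connect():
            return  # Pebble not ready yet; nothing to check
        service = container.get_services().get("mysql_router")
        if service is None or not service.is_running():
            # If mysqlrouter crashed after bootstrap (as suspected here), the
            # next update-status would surface it as a non-active status.
            self.unit.status = ops.BlockedStatus("mysql_router service is not running")
        else:
            self.unit.status = ops.ActiveStatus()

With a check like this, a unit whose router process died (like aodh-mysql-router/1 above, where ps found no mysqlrouter process) should stop reporting workload:active within one update-status interval.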

carlcsaposs-canonical commented

It's quite difficult to figure out what happened without the juju debug log.

carlcsaposs-canonical commented

I just finished a triage run-through of our last set of tests from Solutions QA, and I can identify 4-5 other test runs that appear to suffer from the same underlying issue: the mysql-router apps are all active from Juju's perspective, but the consuming service cannot reach its DB and gets a connection-refused error.

@javacruft do you have debug logs for these runs? Do the routers remain active (i.e., did an update-status fire and they're still active)? Is the MySQL server active? Are you able to connect to the server directly?
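
(For reference, debug-level unit logs can be enabled ahead of a run with Juju's logging-config model setting, and collected afterwards with juju debug-log; the redirect file name here is just an example:

$ juju model-config -m openstack logging-config="<root>=INFO;unit=DEBUG"
$ juju debug-log -m openstack --replay > openstack-debug.log
)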

carlcsaposs-canonical commented

Okay, I am seeing an error in the debug log:

unit-aodh-mysql-router-1: 09:45:42 ERROR unit.aodh-mysql-router/1.juju-log backend-database:179: Failed to bootstrap router
logged_command=['--bootstrap', 'relation-179:***@aodh-mysql-primary.openstack.svc.cluster.local:3306', '--strict', '--conf-set-option', 'http_server.bind_address=127.0.0.1', '--conf-use-gr-notifications']
stderr:
Error: The provided server is currently not in a InnoDB cluster group with quorum and thus may contain inaccurate or outdated data.


unit-aodh-mysql-router-1: 09:45:42 ERROR unit.aodh-mysql-router/1.juju-log backend-database:179: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-aodh-mysql-router-1/charm/./src/kubernetes_charm.py", line 177, in <module>
    ops.main.main(KubernetesRouterCharm)
  File "/var/lib/juju/agents/unit-aodh-mysql-router-1/charm/venv/ops/main.py", line 441, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-aodh-mysql-router-1/charm/venv/ops/main.py", line 149, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-aodh-mysql-router-1/charm/venv/ops/framework.py", line 342, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-aodh-mysql-router-1/charm/venv/ops/framework.py", line 839, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-aodh-mysql-router-1/charm/venv/ops/framework.py", line 928, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-aodh-mysql-router-1/charm/lib/charms/data_platform_libs/v0/data_interfaces.py", line 1059, in _on_relation_changed_event
    getattr(self.on, "database_created").emit(
  File "/var/lib/juju/agents/unit-aodh-mysql-router-1/charm/venv/ops/framework.py", line 342, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-aodh-mysql-router-1/charm/venv/ops/framework.py", line 839, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-aodh-mysql-router-1/charm/venv/ops/framework.py", line 928, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-aodh-mysql-router-1/charm/src/abstract_charm.py", line 216, in reconcile
    workload_.enable(tls=self._tls_certificate_saved, unit_name=self.unit.name)
  File "/var/lib/juju/agents/unit-aodh-mysql-router-1/charm/src/workload.py", line 216, in enable
    self._bootstrap_router(tls=tls)
  File "/var/lib/juju/agents/unit-aodh-mysql-router-1/charm/src/workload.py", line 188, in _bootstrap_router
    raise Exception("Failed to bootstrap router") from None
Exception: Failed to bootstrap router

Did it recover afterwards? It appears so, but I'm not sure without DEBUG-level logs.

Did it show active status even though connections didn't work?
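
For readers following along, here is a rough Python reconstruction of the bootstrap step implied by the logged command and the traceback above. It is a sketch under assumptions, not the actual workload.py source:

import subprocess

def bootstrap_router(connection_string: str) -> None:
    # The charm shells out to `mysqlrouter --bootstrap ...` with the options
    # seen in logged_command above; connection_string is the (redacted)
    # relation user @ primary endpoint, e.g. "relation-179:***@host:3306".
    command = [
        "mysqlrouter",
        "--bootstrap", connection_string,
        "--strict",
        "--conf-set-option", "http_server.bind_address=127.0.0.1",
        "--conf-use-gr-notifications",
    ]
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        # In this issue, stderr was: "Error: The provided server is currently
        # not in a InnoDB cluster group with quorum ...", so the hook raised
        # here and failed.
        raise Exception("Failed to bootstrap router") from None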

carlcsaposs-canonical commented

Attached 1.txt: debug log for unit 1.

carlcsaposs-canonical commented

@javacruft are you able to reproduce this with debug-level logs?

While this issue initially looks similar to canonical/mysql-k8s-operator#345, on closer inspection it seems like it might be a different issue.

What's surprising to me is this part of the logs:

unit-aodh-mysql-router-1: 09:45:55 ERROR unit.aodh-mysql-router/1.juju-log backend-database:179: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-aodh-mysql-router-1/charm/./src/kubernetes_charm.py", line 177, in <module>
    ops.main.main(KubernetesRouterCharm)
  File "/var/lib/juju/agents/unit-aodh-mysql-router-1/charm/venv/ops/main.py", line 441, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-aodh-mysql-router-1/charm/venv/ops/main.py", line 149, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-aodh-mysql-router-1/charm/venv/ops/framework.py", line 342, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-aodh-mysql-router-1/charm/venv/ops/framework.py", line 839, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-aodh-mysql-router-1/charm/venv/ops/framework.py", line 928, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-aodh-mysql-router-1/charm/lib/charms/data_platform_libs/v0/data_interfaces.py", line 1059, in _on_relation_changed_event
    getattr(self.on, "database_created").emit(
  File "/var/lib/juju/agents/unit-aodh-mysql-router-1/charm/venv/ops/framework.py", line 342, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-aodh-mysql-router-1/charm/venv/ops/framework.py", line 839, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-aodh-mysql-router-1/charm/venv/ops/framework.py", line 928, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-aodh-mysql-router-1/charm/src/abstract_charm.py", line 216, in reconcile
    workload_.enable(tls=self._tls_certificate_saved, unit_name=self.unit.name)
  File "/var/lib/juju/agents/unit-aodh-mysql-router-1/charm/src/workload.py", line 220, in enable
    self._container.update_mysql_router_service(enabled=True, tls=tls)
  File "/var/lib/juju/agents/unit-aodh-mysql-router-1/charm/src/rock.py", line 109, in update_mysql_router_service
    self._container.restart(self._SERVICE_NAME)
  File "/var/lib/juju/agents/unit-aodh-mysql-router-1/charm/venv/ops/model.py", line 2045, in restart
    self._pebble.restart_services(service_names)
  File "/var/lib/juju/agents/unit-aodh-mysql-router-1/charm/venv/ops/pebble.py", line 1746, in restart_services
    return self._services_action('restart', services, timeout, delay)
  File "/var/lib/juju/agents/unit-aodh-mysql-router-1/charm/venv/ops/pebble.py", line 1767, in _services_action
    raise ChangeError(change.err, change)
ops.pebble.ChangeError: cannot perform the following tasks:
- Start service "mysql_router" (cannot start service: exited quickly with code 2)
----- Logs from task 0 -----
2023-12-04T09:45:55Z INFO Service "mysql_router" has never been started.
----- Logs from task 1 -----
2023-12-04T09:45:55Z INFO Most recent service output:
    Application got fatal signal: 11
    stack_bottom = 0 thread_stack 0x0
    /usr/lib/mysql-router/libmysqlharness.so.1(my_print_stacktrace(unsigned char const*, unsigned long)+0x41) [0x7fd9e3a87b81]
    /usr/lib/mysql-router/libmysqlharness.so.1(+0x8fb01) [0x7fd9e3a79b01]
    /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7fd9e3172520]
    /usr/lib/mysql-router/libmysqlrouter_metadata_cache.so.1(GRClusterSetMetadataBackend::find_rw_server()+0x33) [0x7fd9e1f74d93]
    /usr/lib/mysql-router/libmysqlrouter_metadata_cache.so.1(GRClusterSetMetadataBackend::update_clusterset_status_from_gr(metadata_cache::ClusterTopology&, bool, bool, RouterClusterSetOptions const&, std::vector<mysql_harness::TCPAddress, std::allocator<mysql_harness::TCPAddress> > const&)+0x229) [0x7fd9e1f767a9]
    /usr/lib/mysql-router/libmysqlrouter_metadata_cache.so.1(GRClusterSetMetadataBackend::fetch_cluster_topology(mysqlrouter::MySQLSession::Transaction&, mysqlrouter::MetadataSchemaVersion const&, mysqlrouter::TargetCluster&, unsigned int, mysql_harness::TCPAddress const&, std::vector<mysql_harness::TCPAddress, std::allocator<mysql_harness::TCPAddress> > const&, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)+0xaff) [0x7fd9e1f7900f]
    /usr/lib/mysql-router/libmysqlrouter_metadata_cache.so.1(GRClusterMetadata::fetch_cluster_topology(std::atomic<bool> const&, mysqlrouter::TargetCluster&, unsigned int, std::vector<mysql_harness::TCPAddress, std::allocator<mysql_harness::TCPAddress> > const&, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, unsigned long&)+0x538) [0x7fd9e1f7ad98]
    /usr/lib/mysql-router/libmysqlrouter_metadata_cache.so.1(GRMetadataCache::refresh(bool)+0x99) [0x7fd9e1f86979]
    /usr/lib/mysql-router/libmysqlrouter_metadata_cache.so.1(MetadataCache::refresh_thread()+0xa9) [0x7fd9e1f76c29]
    /usr/lib/mysql-router/libmysqlrouter_metadata_cache.so.1(MetadataCache::run_thread(void*)+0x1e) [0x7fd9e1f771de]
    /lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7fd9e31c4b43]
    /lib/x86_64-linux-gnu/libc.so.6(+0x126a00) [0x7fd9e3256a00]
2023-12-04T09:45:55Z ERROR cannot start service: exited quickly with code 2
-----

It seems like a transient issue that went away on a retry, but it also sounded like the router was showing active status with the router configuration file missing; if it's the latter case, debug logs would be quite helpful.
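
One hypothetical mitigation (an assumption on my part, not necessarily what the eventual fix does) would be to treat the fast crash as retryable: catch the Pebble ChangeError raised by the restart and defer the event instead of letting the hook fail:

import ops

def restart_router(unit: ops.Unit, event: ops.EventBase) -> None:
    # Sketch only: restart the Pebble-managed service, and treat a fast
    # crash (e.g. the signal 11 above while the backend cluster lacks
    # quorum) as transient rather than a hook failure.
    container = unit.get_container("mysql-router")
    try:
        container.restart("mysql_router")
    except ops.pebble.ChangeError:
        event.defer()  # retry on a later event once the cluster recovers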

carlcsaposs-canonical commented

Likely fixed by #187. Please re-open if you're able to reproduce, or if you have debug-level logs.

carlcsaposs-canonical closed this as not planned on Jan 10, 2024