New master is not replicating after a failover #613

@michaellzc


Info

Version: the latest operator and cluster chart from this repository.

Description

Our test environment consists of three nodes:

  • $release-mysqlcluster-db-0 (master)
  • $release-mysqlcluster-db-1
  • $release-mysqlcluster-db-2

We are simulating a master node failure by killing the mysqld process in db-0.

When the DeadMaster event fires, orchestrator automatically promotes db-1 to master, but the new master then stays stuck with a "not replicating" error.

What would be the correct recovery process?

Operator log

{"severity":"INFO","timestamp":"2020-10-15T21:21:41.371071513Z","logger":"orchestrator-reconciler","message":"cluster not ready for acknowledge","key":"$namespace/$release-mysqlcluster-db","threshold":600}
{"severity":"ERROR","timestamp":"2020-10-15T21:26:14.340158975Z","logger":"kubebuilder.controller","message":"Reconciler error","controller":"mysqlbackup-controller","request":"$namespace/$release-mysql-cluster-db-auto-2020-10-14t19-24-00","error":"MysqlCluster.mysql.presslabs.org \"$release-mysql-cluster-db\" not found","stacktrace":"github.com/presslabs/mysql-operator/vendor/github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/presslabs/mysql-operator/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/presslabs/mysql-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/presslabs/mysql-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:217\ngithub.com/presslabs/mysql-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/src/github.com/presslabs/mysql-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\ngithub.com/presslabs/mysql-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/presslabs/mysql-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\ngithub.com/presslabs/mysql-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/presslabs/mysql-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\ngithub.com/presslabs/mysql-operator/vendor/k8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/presslabs/mysql-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
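
For anyone skimming the operator output, here is a minimal sketch for pulling out just the severity, logger, message and error fields from JSON log lines like the two above. The function name summarize_operator_log is hypothetical, and the second sample line is abbreviated (stacktrace and timestamp dropped) to keep the example short.

```python
import json


def summarize_operator_log(lines):
    """Return (severity, logger, message, error) for each JSON log line."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between entries
        record = json.loads(line)
        entries.append((
            record.get("severity"),
            record.get("logger"),
            record.get("message"),
            record.get("error"),  # None for entries without an error field
        ))
    return entries


# Sample input: the first line is copied from the operator log above,
# the second is the ERROR entry with its stacktrace and timestamp removed.
sample = [
    '{"severity":"INFO","timestamp":"2020-10-15T21:21:41.371071513Z",'
    '"logger":"orchestrator-reconciler",'
    '"message":"cluster not ready for acknowledge",'
    '"key":"$namespace/$release-mysqlcluster-db","threshold":600}',
    '{"severity":"ERROR","logger":"kubebuilder.controller",'
    '"message":"Reconciler error","controller":"mysqlbackup-controller",'
    '"error":"MysqlCluster.mysql.presslabs.org '
    '\\"$release-mysql-cluster-db\\" not found"}',
]

if __name__ == "__main__":
    for severity, logger, message, error in summarize_operator_log(sample):
        print(severity, logger, message, error or "")
```

Note that the only ERROR entry comes from the mysqlbackup-controller, not from the orchestrator reconciler, so it may be unrelated to the failover itself.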

Dead master pod (db-0) log after restart

2020-10-15T21:11:23.994309Z 0 [Note] mysqld: ready for connections.
Version: '5.7.26-29-log'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  Percona Server (GPL), Release 29, Revision 11ad961
2020-10-15T21:11:23.995177Z 3 [Note] Got an error reading communication packets
2020-10-15T21:11:23.995742Z 4 [Note] Got an error reading communication packets
2020-10-15T21:11:24.173001Z 6 [Note] Start binlog_dump to master_thread_id(6) slave_server(101), pos(, 4)
2020-10-15T21:11:28.133393Z 18 [Warning] Neither --relay-log nor --relay-log-index were used; so replication may break when this MySQL server acts as a slave and has his hostname changed!! Please use '--relay-log=$release-mysqlcluster-db-mysql-0-relay-bin' to avoid this problem.
2020-10-15T21:11:28.150677Z 18 [Note] 'CHANGE MASTER TO FOR CHANNEL '' executed'. Previous state master_host='', master_port= 3306, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='$release-mysqlcluster-db-mysql-1.mysql.$release', master_port= 3306, master_log_file='', master_log_pos= 4, master_bind=''.
2020-10-15T21:11:28.175401Z 20 [Warning] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
2020-10-15T21:11:28.176719Z 21 [Note] Slave SQL thread for channel '' initialized, starting replication in log 'FIRST' at position 0, relay log './$release-mysqlcluster-db-mysql-0-relay-bin.000001' position: 4
2020-10-15T21:11:28.184056Z 20 [Note] Slave I/O thread for channel '': connected to master 'sys_replication@$release-mysqlcluster-db-mysql-1.mysql.$release:3306',replication started in log 'FIRST' at position 4
2020-10-15T21:11:31.551274Z 6 [Note] Aborted connection 6 to db: 'unconnected' user: 'sys_replication' host: '172.30.254.227' (failed on flush_net())

New master node (db-1) log

2020-10-15T21:40:51.873465Z 576 [ERROR] Slave I/O for channel '': error connecting to master 'sys_replication@//$release-mysqlcluster-db-mysql-0.mysql.$namespace:3306' - retry-time: 1  retries: 1755, Error_code: 2005
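
To make the db-1 error easier to read, here is a rough sketch that splits the line above into its parts and looks up the error code. The helper name parse_io_error and the small CLIENT_ERRORS table are my own additions for illustration. The meaning of 2005 (CR_UNKNOWN_HOST, unknown server host) comes from the MySQL client error reference. The host is kept exactly as logged, including the leading //.

```python
import re

# Small lookup of MySQL client error codes relevant to connection failures.
CLIENT_ERRORS = {
    2003: "CR_CONN_HOST_ERROR (can't connect to MySQL server on host)",
    2005: "CR_UNKNOWN_HOST (unknown MySQL server host)",
}

# Matches the "error connecting to master" message written by the I/O thread.
PATTERN = re.compile(
    r"error connecting to master '(?P<user>[^@]+)@(?P<host>.+):(?P<port>\d+)'"
    r".*retries: (?P<retries>\d+), Error_code: (?P<code>\d+)"
)


def parse_io_error(line):
    """Extract user, host, port, retry count and error code, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    code = int(match.group("code"))
    return {
        "user": match.group("user"),
        "host": match.group("host"),
        "port": int(match.group("port")),
        "retries": int(match.group("retries")),
        "error_code": code,
        "meaning": CLIENT_ERRORS.get(code, "unknown"),
    }


# The db-1 log line from above, copied verbatim.
line = (
    "2020-10-15T21:40:51.873465Z 576 [ERROR] Slave I/O for channel '': "
    "error connecting to master "
    "'sys_replication@//$release-mysqlcluster-db-mysql-0.mysql.$namespace:3306' "
    "- retry-time: 1  retries: 1755, Error_code: 2005"
)

if __name__ == "__main__":
    print(parse_io_error(line))
```

If that reading is right, db-1 is still trying to replicate from db-0 after being promoted, and it fails because the db-0 host name does not resolve rather than because the connection is refused.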
