New master is not replicating after a failover #613

## Info

Version: the latest `operator` and `cluster` chart from this repository.

## Description

Our test environment consists of 3 nodes:

- $release-mysqlcluster-db-0 (master)
- $release-mysqlcluster-db-1
- $release-mysqlcluster-db-2

We are simulating a master node failure by killing the `mysqld` process in `db-0`.

When the `DeadMaster` event fires, `orchestrator` automatically promotes `db-1` to master, but the new master node then stays stuck in the `not replicating` error state.

What would be the correct recovery process?

### Operator log

```
{"severity":"INFO","timestamp":"2020-10-15T21:21:41.371071513Z","logger":"orchestrator-reconciler","message":"cluster not ready for acknowledge","key":"$namespace/$release-mysqlcluster-db","threshold":600}
```

```
{"severity":"ERROR","timestamp":"2020-10-15T21:26:14.340158975Z","logger":"kubebuilder.controller","message":"Reconciler error","controller":"mysqlbackup-controller","request":"$namespace/$release-mysql-cluster-db-auto-2020-10-14t19-24-00","error":"MysqlCluster.mysql.presslabs.org \"$release-mysql-cluster-db\" not found","stacktrace":"github.com/presslabs/mysql-operator/vendor/github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/presslabs/mysql-operator/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/presslabs/mysql-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/presslabs/mysql-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:217\ngithub.com/presslabs/mysql-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/src/github.com/presslabs/mysql-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\ngithub.com/presslabs/mysql-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/presslabs/mysql-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\ngithub.com/presslabs/mysql-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/presslabs/mysql-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\ngithub.com/presslabs/mysql-operator/vendor/k8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/presslabs/mysql-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
```

### Dead master pod (db-0) log after restart

```
2020-10-15T21:11:23.994309Z 0 [Note] mysqld: ready for connections.
Version: '5.7.26-29-log'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  Percona Server (GPL), Release 29, Revision 11ad961
2020-10-15T21:11:23.995177Z 3 [Note] Got an error reading communication packets
2020-10-15T21:11:23.995742Z 4 [Note] Got an error reading communication packets
2020-10-15T21:11:24.173001Z 6 [Note] Start binlog_dump to master_thread_id(6) slave_server(101), pos(, 4)
2020-10-15T21:11:28.133393Z 18 [Warning] Neither --relay-log nor --relay-log-index were used; so replication may break when this MySQL server acts as a slave and has his hostname changed!! Please use '--relay-log=$release-mysqlcluster-db-mysql-0-relay-bin' to avoid this problem.
2020-10-15T21:11:28.150677Z 18 [Note] 'CHANGE MASTER TO FOR CHANNEL '' executed'. Previous state master_host='', master_port= 3306, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='$release-mysqlcluster-db-mysql-1.mysql.$release', master_port= 3306, master_log_file='', master_log_pos= 4, master_bind=''.
2020-10-15T21:11:28.175401Z 20 [Warning] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
2020-10-15T21:11:28.176719Z 21 [Note] Slave SQL thread for channel '' initialized, starting replication in log 'FIRST' at position 0, relay log './$release-mysqlcluster-db-mysql-0-relay-bin.000001' position: 4
2020-10-15T21:11:28.184056Z 20 [Note] Slave I/O thread for channel '': connected to master 'sys_replication@$release-mysqlcluster-db-mysql-1.mysql.$release:3306',replication started in log 'FIRST' at position 4
2020-10-15T21:11:31.551274Z 6 [Note] Aborted connection 6 to db: 'unconnected' user: 'sys_replication' host: '172.30.254.227' (failed on flush_net())
```

### New master node (db-1)

```
2020-10-15T21:40:51.873465Z 576 [ERROR] Slave I/O for channel '': error connecting to master 'sys_replication@//$release-mysqlcluster-db-mysql-0.mysql.$namespace:3306' - retry-time: 1  retries: 1755, Error_code: 2005
```
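For triage, a minimal sketch of pulling the target host and error code out of the `db-1` line above. The regular expression and the small code table are assumptions written for this one log format, not something the operator ships. Client error 2005 is `CR_UNKNOWN_HOST`, which would mean the master hostname could not be resolved. Note also that the extracted host starts with `//` and ends in `.mysql.$namespace`, while the `db-0` log shows `.mysql.$release`; that may only be a side effect of redacting the names.

```python
import re

# Assumed subset of MySQL client error codes; not exhaustive.
CLIENT_ERRORS = {
    2002: "CR_CONNECTION_ERROR (cannot connect through local socket)",
    2003: "CR_CONN_HOST_ERROR (cannot connect to server on host)",
    2005: "CR_UNKNOWN_HOST (unknown MySQL server host)",
}

# Hypothetical pattern matching the "Slave I/O ... error connecting to master" line.
IO_ERROR_RE = re.compile(
    r"error connecting to master '(?P<user>[^@']+)@(?P<host>[^':]+):(?P<port>\d+)'"
    r".*?retries: (?P<retries>\d+), Error_code: (?P<code>\d+)"
)


def parse_io_error(line):
    """Return the fields of a slave I/O connection error, or None if no match."""
    match = IO_ERROR_RE.search(line)
    if match is None:
        return None
    code = int(match.group("code"))
    return {
        "user": match.group("user"),
        "host": match.group("host"),
        "port": int(match.group("port")),
        "retries": int(match.group("retries")),
        "code": code,
        "meaning": CLIENT_ERRORS.get(code, "unknown"),
    }


if __name__ == "__main__":
    sample = (
        "2020-10-15T21:40:51.873465Z 576 [ERROR] Slave I/O for channel '': "
        "error connecting to master "
        "'sys_replication@//$release-mysqlcluster-db-mysql-0.mysql.$namespace:3306'"
        " - retry-time: 1  retries: 1755, Error_code: 2005"
    )
    print(parse_io_error(sample))
```

If the parsed host really does contain the leading `//`, that alone would explain a name resolution failure on the new master.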
