
MySQL doesn't start completely and readiness fails #416

Open

cpanato opened this issue Oct 24, 2019 · 24 comments

@cpanato
Contributor

cpanato commented Oct 24, 2019

I noticed that sometimes when the pod restarts, for any reason, it never comes up again. For example:

NAME                       READY   STATUS     RESTARTS   AGE
db-mysql-0                 3/4     Running    0          5m29s

events:

Events:
  Type     Reason     Age                From                                    Message
  ----     ------     ----               ----                                    -------
  Normal   Scheduled  2m2s               default-scheduler                       Successfully assigned 6c7951pozpb68x3w3es6pbodqe/db-mysql-0 to ipxxxxxxxx
  Normal   Pulled     103s               kubelet, ipxxxxxxxx  Container image "quay.io/presslabs/mysql-operator-sidecar:0.3.2" already present on machine
  Normal   Created    103s               kubelet, ipxxxxxxxx  Created container
  Normal   Started    103s               kubelet, ipxxxxxxxx  Started container
  Normal   Pulled     102s               kubelet, ipxxxxxxxx  Container image "percona@sha256:713c1817615b333b17d0fbd252b0ccc53c48a665d4cfcb42178167435a957322" already present on machine
  Normal   Created    102s               kubelet, ipxxxxxxxx  Created container
  Normal   Started    102s               kubelet, ipxxxxxxxx  Started container
  Normal   Pulled     101s               kubelet, ipxxxxxxxx  Container image "prom/mysqld-exporter:v0.11.0" already present on machine
  Normal   Created    101s               kubelet, ipxxxxxxxx  Created container
  Normal   Started    101s               kubelet, ipxxxxxxxx  Started container
  Normal   Pulled     101s               kubelet, ipxxxxxxxx  Container image "percona@sha256:713c1817615b333b17d0fbd252b0ccc53c48a665d4cfcb42178167435a957322" already present on machine
  Normal   Created    101s               kubelet, ipxxxxxxxx  Created container
  Normal   Started    101s               kubelet, ipxxxxxxxx  Started container
  Normal   Pulled     101s               kubelet, ipxxxxxxxx  Container image "quay.io/presslabs/mysql-operator-sidecar:0.3.2" already present on machine
  Normal   Created    100s               kubelet, ipxxxxxxxx  Created container
  Normal   Started    100s               kubelet, ipxxxxxxxx  Started container
  Normal   Pulled     100s               kubelet, ipxxxxxxxx  Container image "quay.io/presslabs/mysql-operator-sidecar:0.3.2" already present on machine
  Normal   Created    100s               kubelet, ipxxxxxxxx  Created container
  Normal   Started    100s               kubelet, ipxxxxxxxx  Started container
  Warning  Unhealthy  84s (x7 over 96s)  kubelet, ipxxxxxxxx  Readiness probe failed:

Logs from the pods:

Create rclone.conf file.
2019-10-24T15:47:32.823Z	INFO	sidecar	environment is not set	{"key": "INIT_BUCKET_URI"}
2019-10-24T15:47:32.823Z	INFO	sidecar	cloning command	{"host": "db-mysql-0"}
2019-10-24T15:47:32.823Z	INFO	sidecar	data already exists! Remove manually PVC to cleanup or to reinitialize.
2019-10-24T15:47:32.824Z	INFO	sidecar	configuring server	{"host": "db-mysql-0"}
2019-10-24T15:47:32.824Z	INFO	sidecar	error while reading PURGE GTID from xtrabackup info file	{"error": "open /var/lib/mysql/xtrabackup_binlog_info: no such file or directory"}
Initialization complete, now exiting!
2019-10-24T15:47:34.526691Z 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details).
2019-10-24T15:47:34.528701Z 0 [Note] mysqld (mysqld 5.7.26-29-log) starting as process 1 ...
2019-10-24T15:47:34.538056Z 0 [Note] InnoDB: PUNCH HOLE support available
2019-10-24T15:47:34.538155Z 0 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2019-10-24T15:47:34.538178Z 0 [Note] InnoDB: Uses event mutexes
2019-10-24T15:47:34.538215Z 0 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
2019-10-24T15:47:34.538233Z 0 [Note] InnoDB: Compressed tables use zlib 1.2.7
2019-10-24T15:47:34.538251Z 0 [Note] InnoDB: Using Linux native AIO
2019-10-24T15:47:34.538754Z 0 [Note] InnoDB: Number of pools: 1
2019-10-24T15:47:34.538969Z 0 [Note] InnoDB: Using CPU crc32 instructions
2019-10-24T15:47:34.545153Z 0 [Note] InnoDB: Initializing buffer pool, total size = 512M, instances = 1, chunk size = 128M
2019-10-24T15:47:34.569645Z 0 [Note] InnoDB: Completed initialization of buffer pool
2019-10-24T15:47:34.578085Z 0 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
2019-10-24T15:47:34.595271Z 0 [Note] InnoDB: Crash recovery did not find the parallel doublewrite buffer at /var/lib/mysql/xb_doublewrite
2019-10-24T15:47:34.597393Z 0 [Note] InnoDB: Highest supported file format is Barracuda.
2019-10-24T15:47:34.633116Z 0 [Note] InnoDB: Created parallel doublewrite buffer at /var/lib/mysql/xb_doublewrite, size 3932160 bytes
2019-10-24T15:47:34.867432Z 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2019-10-24T15:47:34.867512Z 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2019-10-24T15:47:35.011558Z 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2019-10-24T15:47:35.012721Z 0 [Note] InnoDB: 96 redo rollback segment(s) found. 96 redo rollback segment(s) are active.
2019-10-24T15:47:35.012737Z 0 [Note] InnoDB: 32 non-redo rollback segment(s) are active.
2019-10-24T15:47:35.013453Z 0 [Note] InnoDB: Waiting for purge to start
2019-10-24T15:47:35.064136Z 0 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.7.26-29 started; log sequence number 203893653
2019-10-24T15:47:35.064470Z 0 [Note] InnoDB: Loading buffer pool(s) from /var/lib/mysql/ib_buffer_pool
2019-10-24T15:47:35.064657Z 0 [Note] Plugin 'FEDERATED' is disabled.
2019-10-24T15:47:35.095022Z 0 [Note] Found ca.pem, server-cert.pem and server-key.pem in data directory. Trying to enable SSL support using them.
2019-10-24T15:47:35.095166Z 0 [Note] Skipping generation of SSL certificates as certificate files are present in data directory.
2019-10-24T15:47:35.099564Z 0 [Warning] CA certificate ca.pem is self signed.
2019-10-24T15:47:35.099626Z 0 [Note] Skipping generation of RSA key pair as key files are present in data directory.
2019-10-24T15:47:35.100921Z 0 [Note] Server hostname (bind-address): '*'; port: 3306
2019-10-24T15:47:35.100962Z 0 [Note] IPv6 is available.
2019-10-24T15:47:35.100976Z 0 [Note]   - '::' resolves to '::';
2019-10-24T15:47:35.100998Z 0 [Note] Server socket created on IP: '::'.
2019-10-24T15:47:35.174622Z 0 [Note] InnoDB: Buffer pool(s) load completed at 191024 15:47:35
2019-10-24T15:47:35.214181Z 0 [Note] Event Scheduler: Loaded 0 events
2019-10-24T15:47:35.214376Z 0 [Note] Execution of init_file '/etc/mysql/conf.d/operator-init.sql' started.
2019-10-24T15:47:35.224302Z 0 [Note] Execution of init_file '/etc/mysql/conf.d/operator-init.sql' ended.
2019-10-24T15:47:35.224498Z 0 [Note] mysqld: ready for connections.
Version: '5.7.26-29-log'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  Percona Server (GPL), Release 29, Revision 11ad961
Create rclone.conf file.
2019-10-24T15:47:34.787Z	INFO	sidecar	environment is not set	{"key": "INIT_BUCKET_URI"}
2019-10-24T15:47:34.788Z	INFO	sidecar	start http server for backups
time="2019-10-24T15:47:35Z" level=info msg="Starting mysqld_exporter (version=0.11.0, branch=HEAD, revision=5d7179615695a61ecc3b5bf90a2a7c76a9592cdd)" source="mysqld_exporter.go:206"
time="2019-10-24T15:47:35Z" level=info msg="Build context (go=go1.10.3, user=root@3d3ff666b0e4, date=20180629-15:00:35)" source="mysqld_exporter.go:207"
time="2019-10-24T15:47:35Z" level=info msg="Enabled scrapers:" source="mysqld_exporter.go:218"
time="2019-10-24T15:47:35Z" level=info msg=" --collect.global_status" source="mysqld_exporter.go:222"
time="2019-10-24T15:47:35Z" level=info msg=" --collect.global_variables" source="mysqld_exporter.go:222"
time="2019-10-24T15:47:35Z" level=info msg=" --collect.slave_status" source="mysqld_exporter.go:222"
time="2019-10-24T15:47:35Z" level=info msg=" --collect.info_schema.tables" source="mysqld_exporter.go:222"
time="2019-10-24T15:47:35Z" level=info msg=" --collect.heartbeat" source="mysqld_exporter.go:222"
time="2019-10-24T15:47:35Z" level=info msg="Listening on 0.0.0.0:9125" source="mysqld_exporter.go:232"
Create rclone.conf file.
Usage: /usr/local/bin/sidecar-entrypoint.sh {files-config|clone|config-and-serve}
Now runs your command.
pt-heartbeat --update --replace --check-read-only --create-table --database sys_operator --table heartbeat --defaults-file /etc/mysql/heartbeat.conf --fail-successive-errors=20

Any advice on how to recover from that?

/cc @AMecea @jwilander

@AMecea
Contributor

AMecea commented Oct 29, 2019

Hi @cpanato, can you please describe the pod that is not ready (by running kubectl describe <pod>) so I can see which container is not ready? If it's the mysql container, then I also need some logs from the operator, or at least check the operator logs for reconciliation failures.

The operator marks the pod ready by updating the configured key in the OperatorStatusTableName table; if that update fails, the pod will not become ready. See the code here.
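
Checking that flag by hand mirrors what the readiness probe does. A minimal sketch, assuming the pod name from the output above, the sys_operator.status table mentioned later in this thread, and the client credentials at /etc/mysql/client.conf:

# The probe passes only once the operator has set the "configured" row
# of sys_operator.status to "1"; this prints the current value.
kubectl exec db-mysql-0 -c mysql -- \
  mysql --defaults-file=/etc/mysql/client.conf -NB -e \
  'SELECT value FROM sys_operator.status WHERE name="configured"'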

Hope this helps,

@imriss
Contributor

imriss commented Apr 8, 2020

I have a similar issue with the readiness probe. The workaround has been to delete the mysql-operator pod. Any advice? The mysql operator is on 0.3.8. Thanks

@onelittleant

onelittleant commented Aug 19, 2020

Similar issue here. If a mysql pod is killed and crash recovery takes too long, the readiness probe fails and the pod is killed again before it can finish crash recovery. The result is a pod that just keeps restarting and will never finish recovery and rejoin the cluster.

Warning  Unhealthy               5m43s (x5 over 9m43s)  kubelet, gke-btc2-dev-btc2-dev-db-pods-ef5dc68d-9jzf  Readiness probe failed: ERROR 2003 (HY000): Can't connect to MySQL server on '127.0.0.1' (111)
/bin/sh: line 0: test: -eq: unary operator expected
  Normal   Killing    5m18s                  kubelet, gke-btc2-dev-btc2-dev-db-pods-ef5dc68d-9jzf  Container mysql failed liveness probe, will be restarted
helm ls -n db
NAME          	NAMESPACE	REVISION	UPDATED                             	STATUS  	CHART               	APP VERSION
btc2dev       	db       	31      	2020-08-19 12:36:02.215646 -0600 MDT	deployed	mysql-cluster-0.2.0 	1.0        
mysql-operator	db       	10      	2020-08-19 13:08:29.740631 -0600 MDT	deployed	mysql-operator-0.4.0	v0.4.0     
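
To confirm it is this crash-recovery loop, watching the restart counter and the mysql container log is enough. A minimal sketch with standard kubectl; the pod name is an assumption based on the release name above:

# Watch the restart counter climb while recovery keeps being interrupted
kubectl -n db get pod btc2dev-mysql-0 -w

# Follow InnoDB crash recovery in the mysql container until the
# liveness probe kills it again
kubectl -n db logs btc2dev-mysql-0 -c mysql -f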

@Erically

resolved?

@enoliveira

still happening.

@onelittleant

Because we are unable to configure the readinessProbe, we cannot keep the cluster stable when a mysql pod requires crash recovery. As a result, we are unable to use this solution in a production environment. We really like the operator model and have been eager to use it in production. If we could supply readinessProbe configuration in the mysql podSpec it would get us there, but as-is we can only use this for dev/test environments.

@HendrikRoehm

This is a serious bug, which is not resolved yet. In my case, the mysql container was not ready.

After ssh-ing into the mysql container and manually executing UPDATE sys_operator.status SET value="1" WHERE name="configured", the pod turned ready for me...
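
In kubectl terms, that workaround looks roughly like this. A minimal sketch, assuming the pod and container names used earlier in this thread and the client credentials at /etc/mysql/client.conf:

# Mark the instance as configured so the readiness probe starts passing.
kubectl exec db-mysql-0 -c mysql -- \
  mysql --defaults-file=/etc/mysql/client.conf -e \
  'UPDATE sys_operator.status SET value="1" WHERE name="configured"'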

@HendrikRoehm

One side note: we had two instances with more than 10 MB of data and one with approx. 1.5 MB of data. Only the one with 1.5 MB of data completed the readiness check successfully.

@hmilkovi

Same here and it's easy to reproduce:
kubectl delete pod ha-mysql-db-mysql-0 --grace-period=0 --force

@haslersn

After ssh-ing into the mysql container and manually executing UPDATE sys_operator.status SET value="1" WHERE name="configured", the pod turned ready for me...

@HendrikRoehm Can configured have different values in different Pods of the same MySQL cluster? Are these from different (local) databases? So do I need to do this workaround in multiple Pods?

@HendrikRoehm

@haslersn I don't really know as I am no expert on this. In my case this was a small test mysql instance with no replication and thus only 1 pod. I would guess that you should set it on the master instance of the mysql cluster if there are multiple instances.

@worldofgeese

We ran into this issue today when running a helm upgrade.

@frjaraur

Is there any workaround for this issue? I managed to make my cluster work by updating sys_operator.status manually from inside the mysql pod, and everything came up, but this is not the right way (if you have to move the pod to another node, the issue comes back). Does the operator manage this value?

@worldofgeese

Can @calind take a look at this? This bug wipes out any serious use of this operator.

@codebard

codebard commented Dec 24, 2021

Just encountered this with a cluster whose node was recycled. The mysql pod reports itself as unready despite mysqld appearing healthy, and MySQL is not accessible from outside. This killed my plans to use this operator in production; I had to move on.

@tebaly

tebaly commented Dec 24, 2021

#777 (comment)

Are you sure it has nothing to do with Calico? In my case, the problem seems related to it.

@mazizang

mazizang commented Jan 5, 2022

@tebaly
Our production environment is using Flannel; same problem.

milero added the bug label on Jan 5, 2022
@imriss
Contributor

imriss commented Jan 6, 2022

This could be a bug in this DROP TABLE logic: it assumes the file does not exist while the file actually exists:

_, err := os.Stat(path.Join(dataDir, constants.OperatorDbName, constants.OperatorStatusTableName+".ibd"))
if os.IsNotExist(err) {
	queries = append(queries, fmt.Sprintf("DROP TABLE IF EXISTS %s.%s",
		constants.OperatorDbName, constants.OperatorStatusTableName))
}

I was able to make the mysql cluster become ready by deleting this file and then deleting the pod:
/var/lib/mysql/sys_operator/status.ibd
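
As shell commands, that workaround is roughly the following. A minimal sketch, assuming the pod and container names from the output earlier in this thread:

# Remove the stale status tablespace file, then recreate the pod so the
# operator rebuilds sys_operator.status and marks the instance configured.
kubectl exec db-mysql-0 -c mysql -- rm /var/lib/mysql/sys_operator/status.ibd
kubectl delete pod db-mysql-0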

@codebard

Looking forward to the fix for this bug to be able to reconsider this operator for use in production.

@proligde

proligde commented Feb 17, 2022

It seems like performing a manual failover, i.e. promoting one of the read replicas to be the new master, fixes this issue as well. At least that did the trick in my case.

This issue occurred in a freshly set up test cluster where I had shut down (gracefully, using ACPI powerdown) all three nodes at the same time on purpose. After starting all three up again, I ended up in exactly that state: basically everything worked fine, except Kubernetes didn't activate the MySQL service because the master node was stuck with the mysql container in the NotReady state, and manual intervention was necessary. (Also using Flannel here, by the way.)

@randywatson1979

My only workaround is (rough kubectl equivalent below):

  • scale the mysql-operator deployment to 0 replicas
  • delete the mysqlcluster StatefulSet
  • scale the mysql-operator deployment back to 1 replica
  • re-apply the mysql-cluster YAML

Not pretty though.
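
A minimal sketch of those steps with kubectl; the deployment, StatefulSet, and manifest names are assumptions for illustration:

# Scale the operator down, delete the cluster StatefulSet,
# scale the operator back up, and re-apply the cluster manifest.
kubectl scale deployment mysql-operator --replicas=0
kubectl delete statefulset db-mysql
kubectl scale deployment mysql-operator --replicas=1
kubectl apply -f mysql-cluster.yaml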

@olivernadj

It is happening to me on kind.sigs.k8s.io v1.23.5 with mysql-operator 0.6.2.

Looks like the node_controller somehow "forgets" to reconcile. Any edit on the database pod triggers a reconcile and eventually sets status=1.

I still don't understand why the reconcile was not happening. The only way I can reproduce the issue is to shut down my PC and start it again, and even then it only triggers the issue from time to time, randomly.
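
Based on that observation, a throwaway edit such as an annotation can be used to nudge a reconcile. A minimal sketch; the annotation key is arbitrary and the pod name is the one used earlier in this thread:

# Touch the pod with an arbitrary annotation so the operator reconciles it
# again and, per the observation above, eventually sets status=1.
kubectl annotate pod db-mysql-0 debug.example.com/touched="$(date +%s)" --overwrite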

@voarsh2

voarsh2 commented Mar 7, 2023

readinessProbe/livenessProbe configs would be most helpful... this is basic.....

@mdshi5

mdshi5 commented Jun 9, 2024

I still have the issue of the readiness probe failing. I described the pod and found:

exec [/bin/sh -c test $(mysql --defaults-file=/etc/mysql/client.conf -NB -e 'SELECT COUNT(*) FROM sys_operator.status WHERE name="configured" AND value="1"') -eq 1] delay=5s timeout=5s period=2s #success=1 #failure=3

I had to change the value in sys_operator.status to fix this, but if I change it in one pod, the value in another pod gets changed, and I had to repeat the process to fix it temporarily. Is there a workaround where I can configure this in the operator values or in the mysqlcluster CR, so that I can fix the readiness issue without this hassle?
