xtrabackup-v2 SST donor stuck in DONOR/DESYNCED state when joiner is killed #333

Open
GeoffMontee opened this issue Jun 4, 2018 · 0 comments


This is related to the following MariaDB Jira issue:

https://jira.mariadb.org/browse/MDEV-15442

The xtrabackup-v2 SST script appears to have a problem: the donor node does not always detect that the joiner has died, so it can end up streaming the backup to nowhere.

In this scenario, the donor executed the following command:

WSREP_SST: [INFO] Evaluating innobackupex --no-version-check $tmpopts $INNOEXTRA --galera-info --stream=$sfmt $itmpdir 2>${DATA}/innobackup.backup.log | socat -u stdio openssl-connect:node000002512.domain.com:4444,cert=/mariadb/conf/mariadbSST.pem,key=/mariadb/conf/mariadbSST.pem,cafile=/mariadb/source/dbautils/templates//etc/ca.pem; RC=( ${PIPESTATUS[@]} ) (20180228 16:55:28.090)

Maybe the "keepalive", "connect-timeout=", and/or "linger=" options from socat's socket option group would be helpful here?

http://www.dest-unreach.org/socat/doc/socat.html#GROUP_SOCKET

Or maybe the "keepcnt=" and/or "abort-threshold=" options from socat's TCP option group?

http://www.dest-unreach.org/socat/doc/socat.html#GROUP_TCP
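
As a rough sketch of what this could look like, the helper below builds an `openssl-connect` address with TCP keepalive probing enabled. `build_socat_addr` is a hypothetical name, not something in the SST script, and the timing values (60, 10, 3) are placeholders rather than tested recommendations. The cert/key/cafile options from the real command are left out for brevity.

```shell
# Hypothetical helper, not part of wsrep_sst_xtrabackup-v2.sh.
# Prints a socat address that enables keepalive probes on the connection.
# The values below are illustrative assumptions:
#   keepidle=60  : seconds of idle time before the first probe
#   keepintvl=10 : seconds between probes
#   keepcnt=3    : unanswered probes before the connection is dropped
build_socat_addr() {
    host=$1
    port=$2
    printf 'openssl-connect:%s:%s,keepalive,keepidle=60,keepintvl=10,keepcnt=3\n' \
        "$host" "$port"
}

# Example: print the address the donor would hand to "socat -u stdio ...".
build_socat_addr joiner.example.com 4444
```

Whether keepalive alone is enough when the donor is blocked on a write to a dead peer would still need testing.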

It also looks like the donor decides whether socat failed by checking whether its exit status was exactly 1:

https://github.com/MariaDB/server/blob/a15ab358fc1ea75634de266fa8150b3e89ac5593/scripts/wsrep_sst_xtrabackup-v2.sh#L975

Is this a reliable way to detect failure? The socat manual does not indicate that 1 is a special value reserved for failure. It seems to say that any positive or negative value could mean failure:

"On exit, socat gives status 0 if it terminated due to EOF or
inactivity timeout, with a positive value on error, and with a
negative value on fatal error."

http://www.dest-unreach.org/socat/doc/socat.html#DIAGNOSTICS
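
If any non-zero status should count as a failure, the check could be sketched roughly as follows. `all_zero` is a hypothetical helper and not a function in the SST script. Since the shell reports exit statuses in the range 0 to 255, a "negative" socat status should show up as a large positive number, so a non-zero test would cover both cases.

```shell
# Hypothetical helper, not part of wsrep_sst_xtrabackup-v2.sh.
# Succeeds only if every exit status passed as an argument is 0.
all_zero() {
    for rc in "$@"; do
        [ "$rc" -eq 0 ] || return 1
    done
    return 0
}

# In bash this would be called right after the pipeline, e.g.:
#   all_zero "${PIPESTATUS[@]}"
# Literal statuses stand in for a real pipeline here: the first command
# exited 0 and the second (socat's position) exited 2.
if ! all_zero 0 2; then
    echo "pipeline failed"
fi
```

With the current comparison against 1, a status of 2 from socat would be treated as success.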
