Drop node when disk full #281

Open
prupert opened this issue Sep 5, 2016 · 10 comments

@prupert

prupert commented Sep 5, 2016

When the disk fills up on a single node, the whole cluster can grind to a halt. I recently experienced this issue in a 3-node setup with MariaDB-Galera-server-5.5.51-1.el6.x86_64 and galera-25.3.17-1.rhel6.el6.x86_64.

Upon investigation I found that somebody else had the exact same issue, see https://groups.google.com/forum/#!topic/codership-team/qWkhRfa7xcM. It was suggested there to file an issue, but searching this repository I couldn't find one, so I decided to file the report myself.

Situation:

  • Three node cluster.
  • Disk on one node is full.
  • Galera/wsrep doesn't detect this, keeps operating the same cluster.
  • All queries on all nodes stall.

Logs on node with disk full:

160903  6:42:47 [ERROR] mysqld: Disk full (/tmp/#sql_a94_1); waiting for someone to free some space...
mysqldump: Couldn't execute 'SHOW TRIGGERS LIKE 'subscription'': Disk full (/tmp/#sql_a94_1); waiting for someone to free some space... (1021)
160903  6:42:47 [Note] WSREP: resuming provider at 12210618
160903  6:42:47 [Note] WSREP: Provider resumed.

Nothing was logged on the other 2 nodes. The wsrep cluster state UUID did not change. The nodes still accepted queries but were unable to answer them: the queries just stayed stalled in some state (statistics, query end, ...) depending on the query type.

After freeing up disk space on the node with the full disk, the cluster continued normal operations as if nothing happened (and still nothing was being logged on the other nodes).

Suggested solution:

  • Galera/wsrep should detect a full-disk situation and remove the bad node from the cluster.

@arjenlentz

...ping!

@cjs6891

cjs6891 commented Mar 13, 2017

just experienced that exact scenario last week... I have a monitoring system that alerted at 95% disk usage, but it was too late by the time I was able to get on a computer & fix the problem.

had the node dropped, it wouldn't have been an issue.

please consider adding this feature to galera...

@prupert
Author

prupert commented Mar 13, 2017

It may be useful for the code maintainers to know which version(s) of Galera wsrep you are using.

@cjs6891

cjs6891 commented Mar 13, 2017

mysql-wsrep-client-5.6-5.6.35-25.18.20170106.1f9ae89.el6.x86_64
mysql-wsrep-devel-5.6-5.6.35-25.18.20170106.1f9ae89.el6.x86_64
mysql-wsrep-libs-compat-5.6-5.6.35-25.18.20170106.1f9ae89.el6.x86_64
mysql-wsrep-server-5.6-5.6.35-25.18.20170106.1f9ae89.el6.x86_64
mysql-wsrep-5.6-5.6.35-25.18.20170106.1f9ae89.el6.x86_64
mysql-wsrep-shared-5.6-5.6.35-25.18.20170106.1f9ae89.el6.x86_64

@igremmerlb

We have the same issue. We are running a recent MariaDB image 10.1.22 with the wsrep components built in. I think this scenario will happen frequently as people are putting database nodes into Docker containers on systems with shared disk. Related to this, we have also experienced several recent SSD failures where the SSD will no longer accept writes and instead returns a disk full error. After a cold reboot the SSD becomes alive again. A fix to drop the node out of the cluster on write errors would help both scenarios (running out of disk space and having a bad node).
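
For the failing-SSD case, free-space numbers alone would not reveal the problem, so a check would have to attempt a real write. The following is only a sketch of such a probe, not anything Galera or MariaDB provides: the function name `can_write` and the set of errno values treated as fatal are assumptions made for illustration.

```python
import errno
import os
import tempfile

# errno values treated as "this node can no longer write".
# Which codes belong here is an assumption, not taken from Galera or MariaDB.
FATAL_WRITE_ERRNOS = {errno.ENOSPC, errno.EIO, errno.EROFS}


def can_write(directory, size=4096):
    """Write and fsync a small temporary file inside `directory`.

    Returns (True, None) on success, or (False, errno_code) when the write
    fails with one of FATAL_WRITE_ERRNOS. Any other error is re-raised.
    """
    try:
        # The temporary file is deleted automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=directory) as probe:
            probe.write(b"\0" * size)
            probe.flush()
            os.fsync(probe.fileno())  # force the data to the device
        return True, None
    except OSError as exc:
        if exc.errno in FATAL_WRITE_ERRNOS:
            return False, exc.errno
        raise
```

A monitoring job could call `can_write` on the data directory and take the node out of service when it returns `False`. What "out of service" means in practice is left to the operator.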

@lchanouha

Occurred with the most recent Debian-packaged version: 10.2.9-MariaDB-10.2.9+maria~jessie-log.
In a 3-node multi-master cluster, one node ran out of space and the whole cluster got stuck.
Queries timed out, and a few minutes later the max connection limit was reached.

@Sir-Will

Does anyone have a solution for this yet?

@scaarup

scaarup commented Oct 7, 2020

Just experienced this also. It seems like a really simple issue to resolve.

@blatouchm

So is it a Debian-specific bug and not a feature? Then the only solution is a crontab script that checks the remaining space every minute and shuts down the node when the disk is full.
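
A rough sketch of such a cron-driven check, written in Python because the thread contains no code of its own. Everything specific here is an assumption: the watched paths, the 5% threshold, and the `systemctl stop mariadb` command are placeholders to adapt, and the helper names are made up for this example.

```python
import shutil
import subprocess
import sys

# All values below are assumptions for illustration; adjust them for the real host.
WATCHED_PATHS = ["/var/lib/mysql", "/tmp"]        # hypothetical datadir and tmpdir
MIN_FREE_FRACTION = 0.05                          # act when less than 5% is free
STOP_COMMAND = ["systemctl", "stop", "mariadb"]   # service name is an assumption


def free_fraction(path):
    """Return the fraction of the filesystem holding `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def paths_below_threshold(fractions, min_free=MIN_FREE_FRACTION):
    """Given {path: free_fraction}, return the sorted paths under the threshold."""
    return sorted(p for p, f in fractions.items() if f < min_free)


def main():
    """Check every watched path once and stop the local node if any is low."""
    fractions = {p: free_fraction(p) for p in WATCHED_PATHS}
    low = paths_below_threshold(fractions)
    if low:
        print("low disk space on: " + ", ".join(low), file=sys.stderr)
        # A clean shutdown should let the node leave the cluster gracefully.
        subprocess.run(STOP_COMMAND, check=True)
        return 1
    return 0
```

Calling `main()` from a script that cron runs every minute would give the behaviour described above. Stopping slightly before the disk is completely full, rather than at 100%, leaves room for the shutdown itself to write its logs.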

@jsfrerot

Experienced it too with MariaDB 10.4.28 and galera 4-26.4.14-1.
