Database corruption on host/docker daemon restart "Failed to start sitename: db service health check timed " #748

Closed
rfay opened this issue Mar 21, 2018 · 40 comments
@rfay rfay commented Mar 21, 2018

What happened (or feature request):

We sometimes see "Failed to start sitename: db service health check timed " situations. The log often has something like "InnoDB: The log sequence number 1600607 in ibdata file do not match the log sequence number 10696720565 in the ib_logfiles!" and implies a crash.

Perhaps as in https://stackoverflow.com/questions/19378565/log-sequence-number-in-ibdata-files-does-not-match

This may be happening on a host restart; the db container may not have a reasonable shutdown sequence, so we end up with a corrupted db. We could delete the logfiles on start (so they'd be created automatically). There are probably other options.
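As a sketch of that "delete the logfiles" idea (purely illustrative; <project> is a placeholder, this is not something ddev does automatically, and discarding the redo log can lose the most recent transactions):

# with the project stopped, remove the InnoDB redo logs so mysqld recreates them on the next start
rm ~/.ddev/<project>/mysql/ib_logfile*
ddev start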

Note: This is almost always caused by a corrupted database, most often a larger database. Since v0.16.0, this is generally only caused by Docker being shut down in an ungraceful way. Unfortunately, both Docker for Windows and Docker for Mac shut down without notifying the containers during an upgrade, on a manual Docker exit, or at system shutdown. To fix this, remove the database with ddev remove --remove-data, then ddev start and re-import the db. It can be avoided by stopping or removing your projects before letting Docker exit.


@rfay rfay commented Mar 21, 2018

Studying https://docs.docker.com/compose/compose-file/#stop_grace_period, I see a stop_grace_period setting that we might want to increase. The default stop_signal (SIGTERM) that is normally sent should work fine with the db container, and mysqld is PID 1, so that part is correct.
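For reference, a minimal docker-compose sketch of those two settings on the db service (the 1m30s value is just an illustration, not what ddev ships):

services:
  db:
    stop_signal: SIGTERM        # the compose default; mysqld runs as PID 1 and handles it
    stop_grace_period: 1m30s    # how long compose waits for a clean shutdown before SIGKILL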

@rfay rfay commented Mar 21, 2018

I am able to recreate this with a modest-sized database (226MB uncompressed, 14MB compressed SQL): import the database, visit the site, stop Docker, start Docker again, ddev start; it fails with corruption.

@rfay rfay changed the title Database corruption on host restart (maybe other times) Database corruption on host or docker daemon restart (maybe other times) Mar 21, 2018
@rfay rfay commented Mar 21, 2018

I added stop_grace_period: 1m30s to the docker-compose.yaml for the db service, but it did not affect the behavior. However, on a stopped/broken container I did rm ~/.ddev/<project>/mysql/ib* and then a ddev start, and it came up fine, as those files are automatically recreated from scratch if they don't exist.

@rfay rfay commented Mar 21, 2018

Other interesting paths might include Docker's live restore (tried it; no impact).

@rfay rfay commented Mar 21, 2018

Adding this to .ddev/mysql/recovery.cnf seems to completely resolve this problem:

[mysqld]
innodb_force_recovery = 1

I think we should consider just adding that to the mariadb config (see the MariaDB documentation on forcing InnoDB recovery). This would be questionable on a prod system, but that's not what we have here. The '1' value is the gentlest, and it works for me.
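For anyone applying this by hand, a minimal sketch, assuming any .cnf file dropped into .ddev/mysql is picked up as a MySQL override (the recovery.cnf filename is just what this thread uses):

mkdir -p .ddev/mysql
printf '[mysqld]\ninnodb_force_recovery = 1\n' > .ddev/mysql/recovery.cnf   # create the override
ddev start                                                                  # db should now start in forced-recovery mode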

@cweagans cweagans commented Mar 21, 2018

It almost sounds like Docker for Mac doesn't gently stop containers before going down. If it's just a hard power off to the VM, that's really not a great situation, and it's no surprise that InnoDB is complaining about it. I think this solution works as a stopgap, but we should also file an issue upstream to stop containers nicely before killing the VM. I would also say that this solution may not be 100% effective. For instance, if Docker for Mac stops while MySQL is in the middle of writing data to the disk, I think there would be a substantially worse problem that may not be easy to recover from.

@rfay rfay commented Mar 21, 2018

Update: The PR in drud/mariadb-local#52 doesn't work because, apparently, you can't create a new db with that setting. However, the approach in #748 (comment) does work perfectly for an existing project whose db has already been created. Now that we understand this, I think we'll be able to get it sorted out.

@rickmanelius rickmanelius commented Apr 5, 2018

@rfay I'm assuming this was completed given the PR #763 merge. Leaving it open in case you have any subsequent actions to take pre-release.

@rfay rfay commented Apr 5, 2018

No, #763 doesn't solve the problem when docker is shut down (or updates) or the host reboots. I'm not sure if there's a way to solve that but it's bitten some people for sure.

@rickmanelius rickmanelius referenced this issue Apr 5, 2018
@rickmanelius rickmanelius modified the milestones: v0.17.0, v0.18.0 Apr 17, 2018
@rfay rfay changed the title Database corruption on host or docker daemon restart (maybe other times) Database corruption on host or docker daemon restart (maybe other times) "Failed to start sitename: db service health check timed " Apr 23, 2018
@rfay rfay changed the title Database corruption on host or docker daemon restart (maybe other times) "Failed to start sitename: db service health check timed " Database corruption on host/docker daemon restart "Failed to start sitename: db service health check timed " Apr 24, 2018
@rfay rfay commented Apr 24, 2018

This has happened enough times that we need to look for ways to alert people to the actual cause and help them debug or solve it. The fact that we have no control over Docker's failure here is no excuse; we need to look for better ways.

  • Just have the database write everything without caching? Pretty sure you can do that; it might have a performance impact. (A config sketch follows this list.)
  • Explain to people what the failure to start the db container might mean?
  • It might be possible, though ugly, to repair the db often by doing something like what is in drud/mariadb-local#52
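On that first bullet, a hedged sketch of the durability-oriented InnoDB settings involved. These are standard MariaDB options, not something ddev is known to set; the first two are typically already the defaults, and relaxing them is what trades crash safety for speed:

[mysqld]
innodb_flush_log_at_trx_commit = 1   # flush and sync the redo log on every commit (values 2 or 0 are faster but less crash-safe)
innodb_doublewrite = 1               # protect against torn page writes
sync_binlog = 1                      # sync the binary log on every commit, if binary logging is enabled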
@rickmanelius rickmanelius modified the milestones: v0.18.0, v0.19.0 May 7, 2018
@rickmanelius rickmanelius self-assigned this May 7, 2018
@codemonkey1988 codemonkey1988 commented Jun 13, 2018

@rickmanelius I think you tagged the wrong person.
Should this file only be added when the database is corrupted? Because when I add it by default after setting up ddev, the database does not start either, even when it's a new and empty one. Without this file, the startup went well, until I quit Docker, of course.

@codemonkey1988 codemonkey1988 commented Jun 13, 2018

Steps to reproduce for TYPO3 8.7

  1. composer create-project typo3/cms-base-distribution typo3-ddev ^8
  2. cd typo3-ddev
  3. ddev config --docroot public --projecttype typo3 --projectname typo3v8
  4. ddev start
  5. ddev exec /var/www/html/vendor/bin/typo3cms install:setup --no-interaction --database-user-name=db --database-user-password=db --database-host-name=db --database-port=3306 --database-name=db --use-existing-database --admin-user-name=admin --admin-password=admin123! --site-setup-type=site
  6. Quit docker or restart OS (without stopping the project)
  7. Start the project again with ddev start. You should get an error.
@andrewfrench andrewfrench commented Jun 13, 2018

I was able to reproduce and confirm the issue with @codemonkey1988's steps above (thanks!). Probing with the same versions used by @rickmanelius, here's what I found:

  • TYPO3 v8.7 w/ ddev v0.19: db service remains unhealthy
  • TYPO3 v8.7 w/ ddev v0.18: db service health check times out, db service is stopped
  • TYPO3 v8.7 w/ ddev v0.16: db service health check times out, db service is stopped
  • TYPO3 v8.7 w/ ddev v0.13: all services started normally

Seems to point to some kind of regression in either ddev itself or the container images between 0.13 and 0.16. Alternatively, the health check might be reporting issues that were a blind spot in previous versions. I'll keep digging.

Update: I was able to narrow the breaking change down to the 0.14 → 0.15 range.

@Tuurlijk Tuurlijk commented Jun 14, 2018

Any way I can restore the data on the db image?

OK, the innodb_force_recovery trick worked:

cat .ddev/mysql/innodb.cnf
[mysqld]
innodb_force_recovery = 1
@rickmanelius rickmanelius commented Jun 15, 2018

@Tuurlijk whew. Glad that at least got you back to operational. BTW, @andrewfrench has narrowed down the answer space further, and we should be able to review/address this early next week. Thank you for your patience and feedback throughout this process!

@andrewfrench andrewfrench commented Jun 15, 2018

I have, and to write it down for posterity:

  • ddev v0.14.0 with mariadb-local:v0.7.1 (default combination): issue is not reproduced
  • ddev v0.15.0 with mariadb-local:v0.8.0 (default combination): issue is reproduced
  • ddev v0.14.0 with mariadb-local:v0.8.0: issue is reproduced
  • ddev v0.15.0 with mariadb-local:v0.7.1: issue is not reproduced

This leads me to believe that the issue was introduced in the mariadb-local container update from v0.7.1 to v0.8.0, which took place around 8 March 2018 in the mariadb-local repository (before containers were maintained in the ddev repo).

@rfay rfay commented Jun 16, 2018

Just noting:

  • It is normal for a ddev site not to start automatically after a reboot of the system or of docker (it comes up stopped). This is a result of docker-compose's restart policy being set to "no", and it has several good effects. Normally (without database corruption) a simple ddev start will bring everything up.
  • The workaround with innodb_force_recovery is a one-time thing: you put that in there, start the db and get it going, then remove it from the mysql config (see the sketch after this list). The value can also be raised as far as '6' for more aggressive recovery.
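A minimal sketch of that cleanup step, assuming the override was created as .ddev/mysql/recovery.cnf as described earlier in this issue:

rm .ddev/mysql/recovery.cnf   # drop the forced-recovery override once the db is healthy again
ddev rm && ddev start         # recreate the containers so mysqld runs with its normal settings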
@rfay rfay commented Jun 17, 2018

@jeffsheltren Would you be willing to take a look at our MariaDB config and make suggestions about how we can survive the "server goes belly-up with no notice" problem that corrupts the db? In many cases on Windows and Mac, Docker just kills the container with no notice (when Docker is going down, during host shutdown, or during a Docker upgrade), and the user's db does not fare well. Of course Docker can do better, but can we do better at handling this? Thanks!

@rfay rfay commented Jun 27, 2018

I'm trying to set up a standard recreation technique for this one.

Drupal6 codebase: https://github.com/rfay/satirtraining.org
Enlarged db: (ask me for the link). It's a 192MB database, with loads of garbage in the watchdog table.

This site is no longer live; it has been retired. The db I can provide has been sanitized.

Recreation technique:

ddev config
ddev start
ddev import-db --src=.tarballs/satirtraining.big_watchdog.sql.gz

The import-db takes about 3:05 on my system.

Hit the project with a 404 (to insert an entry into the watchdog table): http://satirtraining.org.ddev.local/junk

(I actually used this technique between switches of the ddev version: ddev rm -R; rm .ddev/docker-compose.yaml; rm -rf ~/.ddev/satirtraining/; ddev start; time ddev import-db --src=.tarballs/satirtraining.big_watchdog.sql.gz)

Then ddev ssh -s db. I like to tail -f /var/log/mysqld.log.
Then exit Docker (on macOS, pull down the menu and choose "Quit Docker").

You'll be blown out of the container immediately.

Now restart Docker (just run Docker.app on macOS).

ddev list # Will show the project as stopped
ddev start

The start will fail due to broken database.

(Before beginning any new tests, start with a clean ddev rm -R)

@rfay rfay removed the needs decision label Jun 28, 2018
@rfay rfay commented Jun 28, 2018

Using the technique in #748 (comment), I was able to (hopefully) confirm @andrewfrench's results showing the problem getting at least worse between v0.14.0 and v0.15.0.

The key change in the container between mariadb-local versions v0.7.1 and v0.8.0 was using a starter db when creating the project from scratch. I wouldn't expect that to come into play here, since the project is not being recreated in this scenario. See drud/mariadb-local@c0d8df2.

We also

  • added mysql overrides in v0.15.0, and
  • this was the version in which we changed to docker-compose 'restart' => 'no' (meaning that the project comes up stopped)
@rfay rfay commented Jun 28, 2018

Setting docker compose 'restart' to 'always' in v0.15.0 does not fix this.

@rfay rfay commented Jun 28, 2018

So with this process I get a happy restart of ddev after a docker down event:

ddev v0.14.0
ddev import-db (from scratch, database completely deleted beforehand)
Hit http://satirtraining.org.ddev.local/junk
Stop docker
Start docker
ddev start (everything is fine)
ddev rm
brew switch ddev 0.20.0
ddev start (everything fine, as expected)
Stop docker
Start docker
ddev start (Everything is fine, as NOT expected)

To me, since the key difference in this approach is when the base mysql and 'db' databases were created (and by whom), this says that something about the base db creation or installation has some kind of problem. We could always go back to the old approach (creating the db inline), but the speedup in initial project creation was really significant, so it would be nice if we could sort out the actual problem.

@rfay rfay commented Jul 3, 2018

#748 improves this behavior quite a lot, but the Docker behavior is still wrong. There's quite a lot of diagnostic information in #748, and the upstream docker-for-mac issue is docker/for-mac#3047.

@rfay rfay closed this in 75e91c4 Jul 3, 2018
@garske garske commented Jul 11, 2018

Thanks for this thread.

I had to use this when my Windows computer locked up, and of course nothing I did with Docker cleaned up the crashed mess.

Create a file at .ddev/mysql/innodb.cnf:
[mysqld]
innodb_force_recovery = 1

The site is now working again.

@mac-dev mac-dev commented Jul 23, 2018

I do have some random crashes, most recently while adding a field to a Drupal content type. I couldn't see any Docker restarts at that moment, and the web container was still up and running.
Now I guess this might be some kind of db-size limit, as our db is about 3GB excluding cache tables, and those can grow a lot too.
Thanks for the info in this thread; I can now start debugging this instead of re-importing the db every time.

@rfay rfay commented Jul 23, 2018

@mac-dev are you using v1.0.0? #748, which went in there, should make things a lot better, I'm hoping; I would love to hear about your experience.

@mac-dev mac-dev commented Aug 1, 2018

Just got stuck with the same problem again with v1.0.0:
Failed to restart hessen: db service health check timed out: labels map[com.ddev.site-name:hessen com.docker.compose.service:db] timed out without becoming healthy, status=exited

I've added the recovery.cnf, but I guess it just takes too long.

@rfay rfay commented Aug 1, 2018

I would love to hear how big your database is, @mac-dev, and whether anything might have happened that would affect the db container's stability. Docker on Mac and Windows still does not notify its containers when it is shutting down, so you will definitely be better off stopping or rm-ing your projects before rebooting or exiting Docker, but we're seeing better overall behavior with v1.0.0 (although we do see that the db goes through a recovery process after one of those events).

@mac-dev mac-dev commented Aug 6, 2018

@rfay it might have happened, as you said, during Docker starts/stops or Windows reboots due to automated updates. I always try to stop the projects manually, but sometimes I fail to do so.
The DB is up to 20GB in size, depending on caches. Actually, I can see about 10GB in the project's mysql directory inside the user's .ddev dir.
Meanwhile I changed the RAM for Docker to 8GB and CPUs to 4.
Trying ddev start looked okay in the beginning, but afterwards I saw that there was no connection to the db container even though it was running.
Then I tried a ddev remove, but the db container is still there and I can't use it again.
I guess I have to start from scratch again.
I've attached some logs:
ddev_crash_logs.txt

@rfay rfay commented Aug 6, 2018

The crash log you show:

ERROR: for ddev-marsupilami-db Cannot create container for service db: Conflict. The container name "/ddev-marsupilami-db" is already in use by container "e9a2c1cb8ab03529c13ba7fdff065afb4796d7b68917e77f4ea4807f53633d67". You have to rem

means that you have a stopped container of the same name (probably for the same reason). You'll definitely have to kill it off before proceeding. A ddev rm will do that (and it doesn't kill off the db). But based on what you're describing, you have a broken db anyway, and you'll have a hard time getting it going.

Is there any possibility you have two codebases using projects with the name marsupilami? Because that will cause enormous, permanent pain.

I do think you're going to have never-ending pain trying to run a 20GB db in a dev environment. It's just not that likely to be successful. You may want to see if there are ways you can pare that db down to what's really important (like removing unnecessary users or content). The reality is that 20GB databases in production are not often that successful, and they often need special care, feeding, and infrastructure.

FYI, you can use ddev list to see existing projects that ddev knows about. You can use docker ps -a to see all containers that docker knows about (including stopped ones). You can use ddev rm to get rid of an existing project; if you need to use docker to kill everything off, you can do docker stop $(docker ps -aq) && docker rm $(docker ps -aq).

@mac-dev mac-dev commented Aug 7, 2018

I guess I ran into a combination of issues. I had to change my password yesterday, so I ran into docker/for-win#1761.
The DB files and everything else didn't get mounted anymore, so I decided to start from scratch.
Meanwhile we decided to truncate some db contents, and we will continue testing this.
As an alternative, we have already run ddev under Linux with native MariaDB and a modified db connection in settings.php.

@rfay rfay commented Aug 7, 2018

The alternative of running MariaDB elsewhere, on a machine that's up to it, is not a bad idea. Thanks for letting me know about docker/for-win#1761; I think I knew about that as a practical matter but didn't know the actual issue.

One thing you may want to consider is using mariabackup to load the db from a mount (inside the mariadb container). It is way faster than running mysql against a text dump file (e.g. ddev import-db). It will also work better with your external db server.
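For reference, a minimal mariabackup restore sketch, assuming the backup was taken with mariabackup --backup and is mounted at /mnt/backup inside the db container, and that the datadir is the stock /var/lib/mysql (both paths are assumptions, not ddev specifics):

# inside the db container, with mysqld stopped
mariabackup --prepare --target-dir=/mnt/backup     # apply redo logs so the backup is consistent
mariabackup --copy-back --target-dir=/mnt/backup   # copy the prepared files into the datadir
chown -R mysql:mysql /var/lib/mysql                # restore ownership before starting mysqld again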
