Make ContainerWait and healthchecks smarter, fixes #1287 #1301

rfay · 2018-11-24T21:26:49Z

The Problem/Issue/Bug:

ContainerWait() sometimes waited a very long time when one of the containers (usually db) had already exited. (The code does seem to handle that case already. I think the actual problem is waiting too long for an unhealthy container; in other words, the container is not reported as unhealthy soon enough.)

It turns out that this behavior was mostly because we polled the health of the web container first and waited for its healthcheck to complete, and only then checked the db container, which may have long since exited.

People have mentioned (and I've noticed) that ddev projects use quite a lot of CPU and battery, and it's mostly the healthcheck. (I experimented with a number of solutions to this, but all of them resulted in ddev start taking longer, which I think is probably unacceptable, so I backed out my experiments.)

How this PR Solves The Problem:

- Check the db container first.
- Add healthcheck timeouts and retries to the docker-compose.yaml (and related files) so we control them at the docker-compose level.
- Reduce how long we'll wait for the db container. It was set up for 30 retries, trying every 2 seconds, so 60 seconds before it became "unhealthy".
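
As a rough sketch of how the timeout and retry settings bound the wait, a compose-level healthcheck for the db service might look like the fragment below. The previous numbers (2-second interval, 30 retries) come from the description above; the timeout and the lower retry count are illustrative assumptions, not the values this PR ships. The fragment also assumes the image defines its own healthcheck command, so no test entry is given.

```yaml
version: '3.6'
services:
  db:
    healthcheck:
      # Previously (per the description above): checks every 2s with
      # 30 retries, so roughly 30 x 2s = 60s before "unhealthy".
      interval: 2s
      # Illustrative values only (assumptions, not taken from this PR):
      timeout: 2s    # abandon a single check that runs longer than 2s
      retries: 10    # roughly 10 x 2s = 20s before "unhealthy"
```

The time until a failing container is marked unhealthy is approximately retries multiplied by interval, so lowering either one shortens the wait.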

Manual Testing Instructions:

Try ddev rm && ddev start.
Try breaking a container deliberately. My technique for this was to do a docker-compose.breakit.yaml with contents like this:

version: '3.6'
services:
  web:
    command: "bash -c 'sleep 2 && exit '"

Do things to break the db container:
- In mysql, run DROP DATABASE db; and watch the behavior as it changes (ddev list). Then run ddev rm && ddev start and see how long it takes to fail.
- In the db container, run rm /var/lib/mysql/mysql/*.MYD and watch the behavior afterward. Then try ddev rm && ddev start.

Automated Testing Overview:

The overall behavior has not changed in any major way, so I don't think this needs new tests or test changes yet.

Release/Deployment notes:

It may be worth writing a Stack Overflow post to tell people how to use longer healthchecks to save battery, or perhaps adding a ddev start --slow-healthcheck option or something similar.
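
One way the longer-healthcheck idea might look, as a sketch only: an extra compose file in the same style as the docker-compose.breakit.yaml example above. The file name and the 30-second interval are hypothetical, and it is an assumption that a partial healthcheck override merges cleanly with the generated settings, so verify that before relying on it.

```yaml
# .ddev/docker-compose.slowhealth.yaml (hypothetical file name)
version: '3.6'
services:
  web:
    healthcheck:
      interval: 30s   # illustrative value: check less often to reduce CPU use
  db:
    healthcheck:
      interval: 30s   # illustrative value
```

The trade-off is the one noted in the problem description: with less frequent checks, ddev start is likely to take longer to see the containers as healthy.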


andrewfrench

Code changes look good to me, and testing showed healthy and unhealthy containers being reported reliably.

- ddev rm && ddev start reported healthy containers.
- ddev rm && ddev start with .ddev/compose.breakit.yaml reported an unhealthy web container within a few seconds.
- After dropping the db database, ddev list reported healthy containers for a few seconds, then reported an unhealthy db container. After ddev rm && ddev start, the container was again reported as unhealthy within a few seconds (time ddev start reports 11s; the second execution, after other containers are running, takes 3.5s).
- After deleting mysql files, ddev rm && ddev start reported unhealthy containers in around the same time as above.

rfay added this to the v1.5.0 milestone Nov 24, 2018

rfay self-assigned this Nov 24, 2018

rfay requested a review from andrewfrench November 24, 2018 21:26

rfay force-pushed the 20181124_containerwait branch from 03efdf9 to 73abcc2 Compare November 24, 2018 21:39

rfay changed the title ~~Make ContainerWait smarter, less healthcheck churn, fixes #1287~~ Make ContainerWait and healthchecks smarter, fixes #1287 Nov 25, 2018

rfay force-pushed the 20181124_containerwait branch from 3cca345 to f4daacb Compare November 30, 2018 03:30

rfay added 6 commits November 30, 2018 05:55

ContainerWait accepts 'starting' as OK status

b84e26c

Make the web and db containers healthcheck intervals longer

64dfb75

Check db container first, as it's most likely to fail; they get checked in order

8a18d8a

Minor change to router health message

e4ee7ff

Massage router status; allow 'starting' as normal

68bb291

Give up on the idea of accepting 'starting' state as healthy. Will never work with tests

f779d38

rfay force-pushed the 20181124_containerwait branch from f4daacb to f779d38 Compare November 30, 2018 12:56

andrewfrench approved these changes Nov 30, 2018

rfay merged commit 7909843 into ddev:master Nov 30, 2018

rfay deleted the 20181124_containerwait branch November 30, 2018 20:03

rfay mentioned this pull request Dec 15, 2018

v1.5.0 Release Checklist Due 2018-12-18 #1293

Closed

8 tasks
