This repository has been archived by the owner on Nov 27, 2023. It is now read-only.

Healthcheck for EFS fails when part of db server initialisation process is stopping and restarting #2202

Closed
ahmed2m opened this issue Nov 25, 2022 · 1 comment


ahmed2m commented Nov 25, 2022

Description

I have a docker-compose setup that includes a Postgres database service.
Apparently, as part of its initialisation process, the Postgres server stops and restarts.

Below is part of a normal Postgres log. I briefly saw parts of it in the EFS log on the console before the compose-cli deleted everything.

local-postgres9.5 | LOG:  received fast shutdown request
local-postgres9.5 | LOG:  aborting any active transactions
local-postgres9.5 | LOG:  autovacuum launcher shutting down
local-postgres9.5 | LOG:  shutting down
local-postgres9.5 | waiting for server to shut down....LOG:  database system is shut down
local-postgres9.5 |  done
local-postgres9.5 | server stopped
local-postgres9.5 |
local-postgres9.5 | PostgreSQL init process complete; ready for start up.
local-postgres9.5 |
local-postgres9.5 | LOG:  database system was shut down at 2016-05-16 16:51:55 UTC
local-postgres9.5 | LOG:  MultiXact member wraparound protections are now enabled
local-postgres9.5 | LOG:  database system is ready to accept connections

(Taken from this Stack Overflow question, because I wasn't fast enough to copy it from the console.)
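The log above is from a first start with an empty data directory: the image's entrypoint runs a temporary server, stops it ("server stopped"), and then starts the final one. Postgres logs "database system is ready to accept connections" on every startup, so the temporary server can emit that line too, and a naive readiness check can pass just before the shutdown. The Python sketch below uses a hypothetical helper (not part of Postgres or the compose CLI) that only treats the server as ready when that line appears after the "init process complete" marker. It assumes the log wording shown above and does not cover later restarts where initialisation is skipped.

```python
# Hypothetical helper: decide whether the *final* Postgres server is up,
# based on docker-entrypoint log lines like the ones shown above.
INIT_DONE = "PostgreSQL init process complete"
READY = "database system is ready to accept connections"


def final_server_ready(log_lines):
    """Return True only if a READY line appears after the INIT_DONE marker.

    During first-time initialisation the entrypoint starts a temporary
    server (which also logs READY), stops it, then starts the real one,
    so a READY line seen before INIT_DONE is not a reliable signal.
    """
    init_done = False
    for line in log_lines:
        if INIT_DONE in line:
            init_done = True
        elif init_done and READY in line:
            return True
    return False


if __name__ == "__main__":
    sample = [
        "LOG:  database system is ready to accept connections",  # assumed: temporary server
        "LOG:  received fast shutdown request",
        "PostgreSQL init process complete; ready for start up.",
        "LOG:  database system is ready to accept connections",  # final server
    ]
    print(final_server_ready(sample[:2]))  # False: only the temporary server was up
    print(final_server_ready(sample))      # True
```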

Steps to reproduce the issue:

  1. Run docker compose --project-name name --file file.yml up with a docker-compose file similar to mine.

Describe the results you received:
After DbService is created successfully, and while three other services are still being created, the health check for DbService fails. After everything is deleted, the error I get is:
DbService ServiceSchedulerInitiated: Task failed ELB health checks in (target-group arn:aws:elasticloadbalancing:us-east-1:[...])
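This error comes from the load balancer's target group rather than from the container's own healthcheck. Assuming the target group performs a TCP-level check (an assumption; target groups can also use HTTP checks), a target passes only if something accepts a TCP connection on the probed port. A minimal Python sketch of that kind of probe, using a hypothetical tcp_health_check helper:

```python
import socket


def tcp_health_check(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout.

    This only mirrors a TCP-style load balancer check: it proves that
    something is listening on the port, not that the service is usable.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


if __name__ == "__main__":
    # Open a throwaway listener on an ephemeral port to show both outcomes.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))
        server.listen()
        port = server.getsockname()[1]
        print(tcp_health_check("127.0.0.1", port))  # True while listening
    print(tcp_health_check("127.0.0.1", port, timeout=1.0))  # False once closed
```

A successful connection only shows that the port is open; it says nothing about whether Postgres is ready to serve queries.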

Additional information you deem important (e.g. issue happens only occasionally):

Output of docker-compose --version:
Docker Compose version 2.12.2

Output of docker version:

Client:
 Cloud integration: v1.0.29
 Version:           20.10.21
 API version:       1.41
 Go version:        go1.19.2
 Git commit:        baeda1f82a
 Built:             Thu Oct 27 21:30:31 2022
 OS/Arch:           linux/amd64
 Context:           myecscontext
 Experimental:      true

Server:
 Engine:
  Version:          20.10.21
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.19.2
  Git commit:       3056208812
  Built:            Thu Oct 27 21:29:34 2022
  OS/Arch:          linux/amd64
  Experimental:     true
 containerd:
  Version:          v1.6.9
  GitCommit:        1c90a442489720eec95342e1789ee8a5e1b9536f.m
 runc:
  Version:          1.1.4
  GitCommit:        
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Output of docker context show:
You can also run docker context inspect context-name to give us more details but don't forget to remove sensitive content.

[
    {
        "Name": "myecscontext",
        "Metadata": {
            "Type": "ecs"
        },
        "Endpoints": {
            "docker": {
                "SkipTLSVerify": false
            },
            "ecs": {
                "Profile": "default"
            }
        },
        "TLSMaterial": {},
        "Storage": {
            "MetadataPath": "/home/ahmed/.docker/contexts/meta/8bb20fb47c4248774ab660063891d2332facb5997d914f61b3bdfeb471eaddba",
            "TLSPath": "/home/ahmed/.docker/contexts/tls/8bb20fb47c4248774ab660063891d2332facb5997d914f61b3bdfeb471eaddba"
        }
    }
]

Output of docker info:

Client:
 Context:    default
 Debug Mode: false
 Plugins:
  compose: Docker Compose (Docker Inc., v2.12.2)

Server:
 Containers: 15
  Running: 0
  Paused: 0
  Stopped: 15
 Images: 87
 Server Version: 20.10.21
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: false
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc io.containerd.runc.v2 io.containerd.runtime.v1.linux
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 1c90a442489720eec95342e1789ee8a5e1b9536f.m
 runc version: 
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.17.9-1-MANJARO
 Operating System: Manjaro Linux
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 15.06GiB
 Name: probook
 ID: WFF4:4RZ7:KAVK:OCJC:BUIH:CNV6:4LT2:E3QF:YU6D:YQJN:5ZFA:S3B2
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: true
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Additional environment details (AWS ECS, Azure ACI, local, etc.):
My docker-compose file generates about 88 tasks, and some of the services take a while to start up.

version: "3"

networks:
  the-network:
    driver: bridge

services:
  db:
    image: postgres:14.5
    env_file:
      - .env
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - postgres_backups:/backups
    networks:
      - the-network
    ports:
      - "5435:5435"

  redis:
    image: redis:5.0

  api: &api
    image: example.com:5050/repo/api:${VERSION}
    x-aws-pull_credentials: arn:aws:secretsmanager:us-east-1:[...]
    build:
      context: backend
    env_file:
      - .env
    environment:
      DJANGO_SETTINGS_MODULE: "api.settings.testing_env"
    volumes:
      - media_storage:/storage/django_media/:rw
      - static_storage:/storage/django_static/:rw
    networks:
      - the-network
    ports:
      - "8000:8000"
    depends_on:
      - db
      - redis

  celeryworker:
    <<: *api
    image: example.com:5050/repo/celeryworker:${VERSION}
    x-aws-pull_credentials: arn:aws:secretsmanager:us-east-1:[...]
    ports:
      - "8001:8001"
    command: bash -c "cd api;
      poetry run celery -A api worker -l info"
    depends_on:
      - api

  celerybeat:
    <<: *api
    image: example.com:5050/repo/celerybeat:${VERSION}
    x-aws-pull_credentials: arn:aws:secretsmanager:us-east-1:[...]
    networks:
      - the-network
    ports:
      - "8002:8002"
    command: bash -c "cd api;
      poetry run celery -A api beat -l info --pidfile=''"
    depends_on:
      - api

  frontend:
    image: example.com:5050/repo/frontend:${VERSION}
    x-aws-pull_credentials: arn:aws:secretsmanager:us-east-1:[...]
    build:
      context: frontend
      dockerfile: Dockerfile.prod
    networks:
      - the-network
    ports:
      - "3000:3000"
    env_file:
      - .env

  nginx:
    image: nginx:latest
    build:
      context: nginx
      args:
        API_INTERNAL_HOST: api:8000
        WEB_INTERNAL_HOST: frontend:3000
        API_HOSTNAME: ${API_HOSTNAME}
        WEB_HOSTNAME: ${WEB_HOSTNAME}
    command: '/bin/sh -c ''while :; do sleep 6h & wait $${!}; nginx -s reload; done & nginx -g "daemon off;"'''
    depends_on:
      - api
      - frontend
    volumes:
      - static_storage:/storage/django_static/:rw
      - media_storage:/storage/django_media/:rw
    env_file:
      - .env
    networks:
      - the-network
    ports:
      - "80:80"

volumes:
  postgres_data: {}
  postgres_backups: {}
  static_storage: {}
  media_storage: {}
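Because some of these services take a while to start, it can help to estimate how long a failing container healthcheck runs before the container is marked unhealthy. The Python sketch below gives a rough upper bound. It relies on Docker's documented healthcheck semantics (failures during start_period are not counted toward retries; the container becomes unhealthy after retries consecutive failures) plus two further assumptions: each failing probe hangs for the full timeout, and the next probe starts interval seconds after the previous one finishes. It does not model the load balancer's own health checks, and the function name is hypothetical.

```python
def worst_case_seconds_to_unhealthy(interval, timeout, retries, start_period=0):
    """Rough upper bound (in seconds) before a container with a failing
    healthcheck is marked unhealthy.

    Assumes failures during start_period are ignored, every probe takes the
    full timeout, and probes are spaced `interval` seconds after the
    previous probe finishes.
    """
    return start_period + retries * (interval + timeout)


if __name__ == "__main__":
    # Example values: interval 30s, timeout 60s, retries 5, no start period.
    print(worst_case_seconds_to_unhealthy(interval=30, timeout=60, retries=5))  # 450
```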

NOTE: I tried disabling the health check to confirm that everything would work after a successful deployment:

    healthcheck:
      disable: true

That generated HealthCheck: {} in DbTaskDefinition, but the deployment then failed with this error:

Resource handler returned message: "Invalid request provided: Create TaskDefinition: You must specify a health check command for container 'db' (Service: AmazonECS; Status Code: 400; Error Code: ClientException; Request ID: 865908ff-bb32-4b54-b541-6c5df3cdb0c5; Proxy: null)" (RequestToken: 94984544-ffd6-d54f-bf7a-b942b6ba606a, HandlerErrorCode: InvalidRequest)
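Amazon ECS rejects a container definition whose healthCheck has no command, which matches the empty HealthCheck: {} generated above. The Python sketch below is a hypothetical pre-flight check, not part of the compose CLI or any AWS SDK. The accepted prefixes (CMD, CMD-SHELL) and numeric ranges reflect my reading of the ECS HealthCheck API reference and should be double-checked there; the database name mydb is made up.

```python
# Ranges as I understand them from the Amazon ECS HealthCheck documentation;
# verify against the current API reference before relying on them.
LIMITS = {
    "interval": (5, 300),
    "timeout": (2, 60),
    "retries": (1, 10),
    "startPeriod": (0, 300),
}


def validate_ecs_health_check(hc):
    """Return a list of problems with an ECS-style healthCheck dict (empty if OK)."""
    problems = []
    command = hc.get("command") or []
    if not command:
        problems.append("command is required")
    elif command[0] not in ("CMD", "CMD-SHELL"):
        problems.append("command must start with CMD or CMD-SHELL")
    elif len(command) < 2:
        problems.append("command has nothing to run after " + command[0])
    for field, (low, high) in LIMITS.items():
        value = hc.get(field)
        if value is not None and not low <= value <= high:
            problems.append(f"{field}={value} outside {low}-{high}")
    return problems


if __name__ == "__main__":
    # An empty health check, like the generated HealthCheck: {}.
    print(validate_ecs_health_check({}))
    print(validate_ecs_health_check({
        "command": ["CMD-SHELL", "pg_isready -d mydb"],
        "interval": 30, "timeout": 60, "retries": 5,
    }))
```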
--

ahmed2m commented Nov 26, 2022

If anyone ends up here wondering: it's related to the image being a Postgres container, and a healthcheck needs to be defined, for example:

    healthcheck:
      test: ["CMD-SHELL", "pg_isready -d ${POSTGRES_DB}"]
      interval: 30s
      timeout: 60s
      retries: 5
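One caveat about the CMD-SHELL form: Docker runs the array elements after CMD-SHELL through the container's shell, roughly as /bin/sh -c elem1 elem2 ..., and sh -c uses only the first element as the script, assigning the rest to $0, $1 and so on. Splitting the command across several elements therefore silently drops its arguments. The Python sketch below demonstrates that shell behaviour, with echo standing in for pg_isready and a hypothetical database name mydb:

```python
import subprocess

# Split form, like ["CMD-SHELL", "pg_isready", "-d", "mydb"]:
# sh runs only "echo"; "-d" becomes $0 and "mydb" becomes $1.
split_form = subprocess.run(
    ["/bin/sh", "-c", "echo", "-d", "mydb"],
    capture_output=True, text=True, check=True,
)

# Single-string form, like ["CMD-SHELL", "pg_isready -d mydb"]:
# the whole command line is the script, so the arguments are kept.
single_string_form = subprocess.run(
    ["/bin/sh", "-c", "echo -d mydb"],
    capture_output=True, text=True, check=True,
)

print(repr(split_form.stdout))          # '\n'  (arguments were dropped)
print(repr(single_string_form.stdout))  # '-d mydb\n'
```

So with CMD-SHELL the whole command belongs in one string (for example "pg_isready -d ${POSTGRES_DB}"); with the CMD form, each argument goes in its own element instead.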

ahmed2m closed this as completed Nov 26, 2022