pgbouncer readinessProbe timeouts cause livenessProbe failures #525

Closed
emilianomoscato opened this issue Feb 18, 2022 · 1 comment · Fixed by #547
Labels
kind/bug (things not working properly)
Milestone
airflow-8.6.0

Chart Version

8.5.3

Kubernetes Version

Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.3", GitCommit:"816c97ab8cff8a1c72eccca1026f7820e93e0d25", GitTreeState:"clean", BuildDate:"2022-01-25T21:25:17Z", GoVersion:"go1.17.6", Compiler:"gc", Platform:"linux/amd64"}

Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.5-eks-bc4871b", GitCommit:"5236faf39f1b7a7dabea8df12726f25608131aa9", GitTreeState:"clean", BuildDate:"2021-10-29T23:32:16Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}

Helm Version

version.BuildInfo{Version:"v3.8.0", GitCommit:"d14138609b01886f544b2025f5000351c9eb092e", GitTreeState:"clean", GoVersion:"go1.17.5"}

Description

When deploying the chart to our AWS EKS cluster, the rollout never completes and the pods stay in a CrashLoopBackOff state.

Disabling PgBouncer resolves the issue.
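
For reference, the temporary workaround is flipping a single chart value (the same pgbouncer.enabled key that appears under "Custom Helm Values" below), at the cost of every component connecting to the database directly:

pgbouncer:
  enabled: false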

Relevant Logs

$ kubectl get pods
NAME                                           READY   STATUS                  RESTARTS   AGE
airflow-db-migrations-XXXXXXXX                 0/2     Init:CrashLoopBackOff   8          21m
airflow-pgbouncer-XXXXXX                       1/1     CrashLoopBackOff        9          21m
airflow-scheduler-XXXXXX                       0/2     Init:CrashLoopBackOff   8          21m
airflow-web-69b8564fd6-XXXXX                   0/2     Init:CrashLoopBackOff   7          21m

##################################

$ kubectl describe pod airflow-pgbouncer-XXXXXXXXXX
Name:         airflow-pgbouncer-XXXXXXX
Namespace:    airflow
Priority:     0
Node:         XXXXXXXXXXX
Start Time:   Fri, 18 Feb 2022 15:17:54 -0300
Labels:       app=airflow
              component=pgbouncer
              pod-template-hash=XXXXXX
              release=airflow
Annotations:  checksum/secret-config-envs: e5d68d97d93fdf19d5a40e6ceeba766e4ef2776ab0625bedb76958d78a15330d
              checksum/secret-pgbouncer: 5b0f432f568b291e6308fe9ad6f7ebd35e83976fc43447e1ff5cca374178e1cb
              cluster-autoscaler.kubernetes.io/safe-to-evict: true
              kubernetes.io/psp: eks.privileged
Status:       Running
IP:           XXXXXXXX
IPs:
  IP:           XXXXXXXX
Controlled By:  ReplicaSet/airflow-pgbouncer-XXXXXXXX
Containers:
  pgbouncer:
    Container ID:  docker://a78c0d0174f1f7d80843a2ef553020468d387d2eede7346ee911c61c2604a10e
    Image:         ghcr.io/airflow-helm/pgbouncer:1.15.0-patch.0
    Image ID:      docker-pullable://ghcr.io/airflow-helm/pgbouncer@sha256:XXXXXXXX
    Port:          6432/TCP
    Host Port:     0/TCP
    Command:
      /usr/bin/dumb-init
      --rewrite=15:2
      --
    Args:
      /bin/sh
      -c
      /home/pgbouncer/config/gen_auth_file.sh && \
      exec pgbouncer /home/pgbouncer/config/pgbouncer.ini
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 18 Feb 2022 15:31:13 -0300
      Finished:     Fri, 18 Feb 2022 15:32:29 -0300
    Ready:          False
    Restart Count:  7
    Liveness:       exec [/bin/sh -c psql $(eval $DATABASE_PSQL_CMD) --tuples-only --command="SELECT 1;" | grep -q "1"] delay=5s timeout=60s period=30s #success=1 #failure=3
    Readiness:      tcp-socket :6432 delay=5s timeout=5s period=10s #success=1 #failure=3
    Environment Variables from:
      airflow-config-envs  Secret  Optional: false
    Environment:
      DATABASE_PASSWORD:               <set to the key 'password' in secret 'airflow-rds-root'>  Optional: false
      REDIS_PASSWORD:                  
      CONNECTION_CHECK_MAX_COUNT:      0
      AIRFLOW__CORE__FERNET_KEY:       <set to the key 'fernet_key' in secret 'airflow-secrets'>            Optional: false
      AIRFLOW__GOOGLE__CLIENT_ID:      <set to the key 'google_client_id' in secret 'airflow-secrets'>      Optional: false
      AIRFLOW__GOOGLE__CLIENT_SECRET:  <set to the key 'google_client_secret' in secret 'airflow-secrets'>  Optional: false
      AWS_DEFAULT_REGION:              eu-west-1
      AWS_REGION:                      eu-west-1
      AWS_ROLE_ARN:                    arn:aws:iam::XXXXXXXXX:role/airflow
      AWS_WEB_IDENTITY_TOKEN_FILE:     /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    Mounts:
      /home/pgbouncer/certs from pgbouncer-certs (ro)
      /home/pgbouncer/config from pgbouncer-config (ro)
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dl9fv (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  aws-iam-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  86400
  pgbouncer-config:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  airflow-pgbouncer
    Optional:    false
  pgbouncer-certs:
    Type:                Projected (a volume that contains injected data from multiple sources)
    SecretName:          airflow-pgbouncer-certs
    SecretOptionalName:  <nil>
  kube-api-access-dl9fv:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                     From               Message
  ----     ------     ----                    ----               -------
  Normal   Scheduled  17m                     default-scheduler  Successfully assigned airflow/airflow-pgbouncer-XXXXXXXXX to ip-XXXXXX.compute.internal
  Warning  Unhealthy  16m                     kubelet            Readiness probe failed: dial tcp 10.12.3.188:6432: connect: connection refused
  Normal   Pulled     13m (x4 over 17m)       kubelet            Container image "ghcr.io/airflow-helm/pgbouncer:1.15.0-patch.0" already present on machine
  Normal   Created    13m (x4 over 17m)       kubelet            Created container pgbouncer
  Normal   Started    13m (x4 over 17m)       kubelet            Started container pgbouncer
  Normal   Killing    13m (x3 over 16m)       kubelet            Container pgbouncer failed liveness probe, will be restarted
  Warning  Unhealthy  7m51s (x20 over 17m)    kubelet            Liveness probe failed: psql: error: could not translate host name "airflow-pgbouncer.airflow.svc.cluster.local" to address: Try again
  Warning  BackOff    2m48s (x19 over 7m21s)  kubelet            Back-off restarting failed container


#####################################


$ kubectl logs airflow-pgbouncer-64d75984c7-9hmlr

Successfully generated auth_file: /home/pgbouncer/users.txt
2022-02-18 18:31:13.751 UTC [7] LOG kernel file descriptor limit: 1048576 (hard: 1048576); max_client_conn: 100, max expected fd use: 112
2022-02-18 18:31:13.752 UTC [7] LOG listening on 0.0.0.0:6432
2022-02-18 18:31:13.752 UTC [7] LOG listening on [::]:6432
2022-02-18 18:31:13.753 UTC [7] LOG listening on unix:/tmp/.s.PGSQL.6432
2022-02-18 18:31:13.753 UTC [7] LOG process up: PgBouncer 1.15.0, libevent 2.1.12-stable (epoll), adns: c-ares 1.17.1, tls: OpenSSL 1.1.1k  25 Mar 2021
2022-02-18 18:32:13.753 UTC [7] LOG stats: 0 xacts/s, 0 queries/s, in 0 B/s, out 0 B/s, xact 0 us, query 0 us, wait 0 us
2022-02-18 18:32:29.267 UTC [7] LOG got SIGINT, shutting down
2022-02-18 18:32:29.419 UTC [7] LOG server connections dropped, exiting


########################################


$ kubectl describe pod airflow-db-migrations-XXXX

Name:         airflow-db-migrations-XXXXXXXX
Namespace:    airflow
Priority:     0
Node:         XXXXXXXXX
Start Time:   Fri, 18 Feb 2022 15:17:55 -0300
Labels:       app=airflow
              component=db-migrations
              pod-template-hash=XXXXXXX
              release=airflow
Annotations:  checksum/db-migrations-script: 5f00610c570937a76488380602536f1a0487c0dea26e2a421a63560257180aae
              checksum/secret-config-envs: e5d68d97d93fdf19d5a40e6ceeba766e4ef2776ab0625bedb76958d78a15330d
              checksum/secret-local-settings: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
              cluster-autoscaler.kubernetes.io/safe-to-evict: true
              kubernetes.io/psp: eks.privileged
Status:       Pending
IP:           XXXXXX
IPs:
  IP:           XXXXXXX
Controlled By:  ReplicaSet/airflow-db-migrations-XXXXXX
Init Containers:
  dags-git-clone:
    Container ID:   docker://XXXXXXX
    Image:          k8s.gcr.io/git-sync/git-sync:v3.2.2
    Image ID:       docker-pullable://k8s.gcr.io/git-sync/git-sync@sha256:XXXXXXXXX
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 18 Feb 2022 15:17:56 -0300
      Finished:     Fri, 18 Feb 2022 15:18:01 -0300
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     50m
      memory:  64Mi
    Requests:
      cpu:     50m
      memory:  64Mi
    Environment Variables from:
      airflow-config-envs  Secret  Optional: false
    Environment:
      GIT_SYNC_ONE_TIME:               true
      GIT_SYNC_ROOT:                   /dags
      GIT_SYNC_DEST:                   repo
      GIT_SYNC_REPO:                   http://XXXXX/airflow-dags.git
      GIT_SYNC_BRANCH:                 master
      GIT_SYNC_REV:                    HEAD
      GIT_SYNC_DEPTH:                  1
      GIT_SYNC_WAIT:                   60
      GIT_SYNC_TIMEOUT:                120
      GIT_SYNC_ADD_USER:               true
      GIT_SYNC_MAX_SYNC_FAILURES:      -1
      GIT_KNOWN_HOSTS:                 false
      GIT_SYNC_USERNAME:               <set to the key 'username' in secret 'airflow-http-git-secret'>  Optional: false
      GIT_SYNC_PASSWORD:               <set to the key 'password' in secret 'airflow-http-git-secret'>  Optional: false
      DATABASE_PASSWORD:               <set to the key 'password' in secret 'airflow-rds-root'>         Optional: false
      REDIS_PASSWORD:                  
      CONNECTION_CHECK_MAX_COUNT:      0
      AIRFLOW__CORE__FERNET_KEY:       <set to the key 'fernet_key' in secret 'airflow-secrets'>            Optional: false
      AIRFLOW__GOOGLE__CLIENT_ID:      <set to the key 'google_client_id' in secret 'airflow-secrets'>      Optional: false
      AIRFLOW__GOOGLE__CLIENT_SECRET:  <set to the key 'google_client_secret' in secret 'airflow-secrets'>  Optional: false
      AWS_DEFAULT_REGION:              eu-west-1
      AWS_REGION:                      eu-west-1
      AWS_ROLE_ARN:                    arn:aws:iam::XXXXXXXXX:role/airflow
      AWS_WEB_IDENTITY_TOKEN_FILE:     /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    Mounts:
      /dags from dags-data (rw)
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-s929h (ro)
  check-db:
    Container ID:  docker://XXXXXXXXXXXX
    Image:         XXXXXX/airflow:master
    Image ID:      docker-pullable://XXXXXXXX/airflow@sha256:XXXXXXX
    Port:          <none>
    Host Port:     <none>
    Command:
      /usr/bin/dumb-init
      --
      /entrypoint
    Args:
      bash
      -c
      exec timeout 60s airflow db check
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Fri, 18 Feb 2022 15:20:07 -0300
      Finished:     Fri, 18 Feb 2022 15:20:15 -0300
    Ready:          False
    Restart Count:  4
    Environment Variables from:
      airflow-config-envs  Secret  Optional: false
    Environment:
      DATABASE_PASSWORD:               <set to the key 'password' in secret 'airflow-rds-root'>  Optional: false
      REDIS_PASSWORD:                  
      CONNECTION_CHECK_MAX_COUNT:      0
      AIRFLOW__CORE__FERNET_KEY:       <set to the key 'fernet_key' in secret 'airflow-secrets'>            Optional: false
      AIRFLOW__GOOGLE__CLIENT_ID:      <set to the key 'google_client_id' in secret 'airflow-secrets'>      Optional: false
      AIRFLOW__GOOGLE__CLIENT_SECRET:  <set to the key 'google_client_secret' in secret 'airflow-secrets'>  Optional: false
      AWS_DEFAULT_REGION:              eu-west-1
      AWS_REGION:                      eu-west-1
      AWS_ROLE_ARN:                    arn:aws:iam::XXXXXXX:role/airflow
      AWS_WEB_IDENTITY_TOKEN_FILE:     /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    Mounts:
      /opt/airflow/dags from dags-data (rw)
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-s929h (ro)
Containers:
  db-migrations:
    Container ID:  
    Image:         XXXXXXX/airflow:master
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      /usr/bin/dumb-init
      --
      /entrypoint
    Args:
      python
      -u
      /mnt/scripts/db_migrations.py
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment Variables from:
      airflow-config-envs  Secret  Optional: false
    Environment:
      DATABASE_PASSWORD:               <set to the key 'password' in secret 'airflow-rds-root'>  Optional: false
      REDIS_PASSWORD:                  
      CONNECTION_CHECK_MAX_COUNT:      0
      AIRFLOW__CORE__FERNET_KEY:       <set to the key 'fernet_key' in secret 'airflow-secrets'>            Optional: false
      AIRFLOW__GOOGLE__CLIENT_ID:      <set to the key 'google_client_id' in secret 'airflow-secrets'>      Optional: false
      AIRFLOW__GOOGLE__CLIENT_SECRET:  <set to the key 'google_client_secret' in secret 'airflow-secrets'>  Optional: false
      AWS_DEFAULT_REGION:              eu-west-1
      AWS_REGION:                      eu-west-1
      AWS_ROLE_ARN:                    arn:aws:iam::XXXXXXXXX:role/airflow
      AWS_WEB_IDENTITY_TOKEN_FILE:     /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    Mounts:
      /mnt/scripts from scripts (ro)
      /opt/airflow/dags from dags-data (rw)
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-s929h (ro)
  dags-git-sync:
    Container ID:   
    Image:          k8s.gcr.io/git-sync/git-sync:v3.2.2
    Image ID:       
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     50m
      memory:  64Mi
    Requests:
      cpu:     50m
      memory:  64Mi
    Environment Variables from:
      airflow-config-envs  Secret  Optional: false
    Environment:
      GIT_SYNC_ROOT:                   /dags
      GIT_SYNC_DEST:                   repo
      GIT_SYNC_REPO:                   http://XXXXX/airflow-dags.git
      GIT_SYNC_BRANCH:                 master
      GIT_SYNC_REV:                    HEAD
      GIT_SYNC_DEPTH:                  1
      GIT_SYNC_WAIT:                   60
      GIT_SYNC_TIMEOUT:                120
      GIT_SYNC_ADD_USER:               true
      GIT_SYNC_MAX_SYNC_FAILURES:      -1
      GIT_KNOWN_HOSTS:                 false
      GIT_SYNC_USERNAME:               <set to the key 'username' in secret 'airflow-http-git-secret'>  Optional: false
      GIT_SYNC_PASSWORD:               <set to the key 'password' in secret 'airflow-http-git-secret'>  Optional: false
      DATABASE_PASSWORD:               <set to the key 'password' in secret 'airflow-rds-root'>         Optional: false
      REDIS_PASSWORD:                  
      CONNECTION_CHECK_MAX_COUNT:      0
      AIRFLOW__CORE__FERNET_KEY:       <set to the key 'fernet_key' in secret 'airflow-secrets'>            Optional: false
      AIRFLOW__GOOGLE__CLIENT_ID:      <set to the key 'google_client_id' in secret 'airflow-secrets'>      Optional: false
      AIRFLOW__GOOGLE__CLIENT_SECRET:  <set to the key 'google_client_secret' in secret 'airflow-secrets'>  Optional: false
      AWS_DEFAULT_REGION:              eu-west-1
      AWS_REGION:                      eu-west-1
      AWS_ROLE_ARN:                    arn:aws:iam::XXXXXXX:role/airflow
      AWS_WEB_IDENTITY_TOKEN_FILE:     /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    Mounts:
      /dags from dags-data (rw)
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-s929h (ro)
Conditions:
  Type              Status
  Initialized       False 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  aws-iam-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  86400
  dags-data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  scripts:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  airflow-db-migrations
    Optional:    false
  kube-api-access-s929h:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  2m57s                 default-scheduler  Successfully assigned airflow/airflow-db-migrations-XXXX to ip-XXXX.XXXX.compute.internal
  Normal   Pulled     2m56s                 kubelet            Container image "k8s.gcr.io/git-sync/git-sync:v3.2.2" already present on machine
  Normal   Created    2m56s                 kubelet            Created container dags-git-clone
  Normal   Started    2m56s                 kubelet            Started container dags-git-clone
  Normal   Pulled     2m50s                 kubelet            Successfully pulled image "XXXXXX/airflow:master" in 117.507452ms
  Normal   Pulled     2m38s                 kubelet            Successfully pulled image "XXXXXX/airflow:master" in 94.706844ms
  Normal   Pulled     2m15s                 kubelet            Successfully pulled image "XXXXXX/airflow:master" in 108.802107ms
  Normal   Created    103s (x4 over 2m50s)  kubelet            Created container check-db
  Normal   Started    103s (x4 over 2m50s)  kubelet            Started container check-db
  Normal   Pulling    103s (x4 over 2m50s)  kubelet            Pulling image "XXXXXX/airflow:master"
  Normal   Pulled     103s                  kubelet            Successfully pulled image "XXXXXX/airflow:master" in 108.808873ms
  Warning  BackOff    70s (x6 over 2m29s)   kubelet            Back-off restarting failed container


########################################


$ kubectl logs -c check-db airflow-db-migrations-74f9cddb8f-vjdnt

[2022-02-18 18:29:52,705] {cli_action_loggers.py:105} WARNING - Failed to log action with (psycopg2.OperationalError) could not connect to server: Connection refused
	Is the server running on host "airflow-pgbouncer.airflow.svc.cluster.local" (172.20.32.223) and accepting
	TCP/IP connections on port 6432?

(Background on this error at: http://sqlalche.me/e/13/e3q8)
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 2336, in _wrap_pool_connect
    return fn()
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 364, in connect
    return _ConnectionFairy._checkout(self)
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 778, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 495, in checkout
    rec = pool._do_get()
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/impl.py", line 140, in _do_get
    self._dec_overflow()
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
    compat.raise_(
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
    raise exception
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/impl.py", line 137, in _do_get
    return self._create_connection()
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 309, in _create_connection
    return _ConnectionRecord(self)
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 440, in __init__
    self.__connect(first_connect_check=True)
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 661, in __connect
    pool.logger.debug("Error on connect(): %s", e)
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
    compat.raise_(
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
    raise exception
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 656, in __connect
    connection = pool._invoke_creator(self)
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/strategies.py", line 114, in connect
    return dialect.connect(*cargs, **cparams)
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 508, in connect
    return self.dbapi.connect(*cargs, **cparams)
  File "/home/airflow/.local/lib/python3.8/site-packages/psycopg2/__init__.py", line 122, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: could not connect to server: Connection refused
	Is the server running on host "airflow-pgbouncer.airflow.svc.cluster.local" (172.20.32.223) and accepting
	TCP/IP connections on port 6432?


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/airflow/.local/bin/airflow", line 8, in <module>
    sys.exit(main())
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/__main__.py", line 40, in main
    args.func(args)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/cli_parser.py", line 48, in command
    return func(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/cli.py", line 92, in wrapper
    return f(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/commands/db_command.py", line 96, in check
    db.check()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/session.py", line 70, in wrapper
    return func(*args, session=session, **kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/db.py", line 782, in check
    session.execute('select 1 as is_alive;')
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 1295, in execute
    return self._connection_for_bind(bind, close_with_result=True).execute(
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 1150, in _connection_for_bind
    return self.transaction._connection_for_bind(
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 433, in _connection_for_bind
    conn = bind._contextual_connect()
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 2302, in _contextual_connect
    self._wrap_pool_connect(self.pool.connect, None),
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 2339, in _wrap_pool_connect
    Connection._handle_dbapi_exception_noconnection(
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1583, in _handle_dbapi_exception_noconnection
    util.raise_(
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
    raise exception
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 2336, in _wrap_pool_connect
    return fn()
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 364, in connect
    return _ConnectionFairy._checkout(self)
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 778, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 495, in checkout
    rec = pool._do_get()
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/impl.py", line 140, in _do_get
    self._dec_overflow()
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
    compat.raise_(
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
    raise exception
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/impl.py", line 137, in _do_get
    return self._create_connection()
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 309, in _create_connection
    return _ConnectionRecord(self)
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 440, in __init__
    self.__connect(first_connect_check=True)
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 661, in __connect
    pool.logger.debug("Error on connect(): %s", e)
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
    compat.raise_(
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
    raise exception
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 656, in __connect
    connection = pool._invoke_creator(self)
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/strategies.py", line 114, in connect
    return dialect.connect(*cargs, **cparams)
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 508, in connect
    return self.dbapi.connect(*cargs, **cparams)
  File "/home/airflow/.local/lib/python3.8/site-packages/psycopg2/__init__.py", line 122, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not connect to server: Connection refused
	Is the server running on host "airflow-pgbouncer.airflow.svc.cluster.local" (172.20.32.223) and accepting
	TCP/IP connections on port 6432?

(Background on this error at: http://sqlalche.me/e/13/e3q8)

Custom Helm Values

###################################
# Airflow - Custom Configs
###################################
airflow:
  
  ## configs for the docker image of the web/scheduler/worker
  ##
  image:
    
    repository: <custom-repository>/airflow-2.1.4
    tag: master
    pullPolicy: Always
    
  
  ## the airflow executor type to use
  executor: KubernetesExecutor

  config:

    
    AIRFLOW__LOGGING__REMOTE_LOGGING: "true"
    AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER: "s3://XXXX/logs"

    # AUTH
    #AIRFLOW__WEBSERVER__RBAC: "true"
    AIRFLOW__WEBSERVER__BASE_URL: "https://XXXX/"
    AIRFLOW__WEBSERVER__ENABLE_PROXY_FIX: "True"
    AIRFLOW__WEBSERVER__EXPOSE_CONFIG: "true"

    # Workaround to avoid randomly getting SIGTERM in dag execution.
    # https://github.com/apache/airflow/issues/14672#issuecomment-844485176
    AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION: "False"

    PYTHONPATH: "/opt/airflow/dags/repo/"
    AIRFLOW__CORE__XCOM_BACKEND: "include.s3_xcom_backend.S3XComBackend"

  extraEnv:
    - name: AIRFLOW__CORE__FERNET_KEY
      valueFrom:
        secretKeyRef:
          name: XXXX
          key: fernet_key
    - name: AIRFLOW__GOOGLE__CLIENT_ID
      valueFrom:
        secretKeyRef:
          name: XXXX
          key: google_client_id
    - name: AIRFLOW__GOOGLE__CLIENT_SECRET
      valueFrom:
        secretKeyRef:
          name: XXXX
          key: google_client_secret 

###################################
# Airflow - WebUI Configs
###################################
web:

  webserverConfig:
    stringOverride: |-
      import os

      from airflow import configuration as conf
      from flask_appbuilder.security.manager import AUTH_OAUTH

      basedir = os.path.abspath(os.path.dirname(__file__))

      SQLALCHEMY_DATABASE_URI = conf.get('core', 'SQL_ALCHEMY_CONN')
      CSRF_ENABLED = True
      
      AUTH_TYPE = AUTH_OAUTH
      
      # registration configs
      AUTH_USER_REGISTRATION = True  # allow users who are not already in the FAB DB
      AUTH_USER_REGISTRATION_ROLE = "Viewer"  # this role will be given in addition to any AUTH_ROLES_MAPPING

      # the list of providers which the user can choose from
      OAUTH_PROVIDERS = [
          {
              'name': 'google',
              'icon': 'fa-google',
              'token_key': 'access_token',
              'remote_app': {
                  'client_id': os.environ.get("AIRFLOW__GOOGLE__CLIENT_ID"),
                  'client_secret': os.environ.get("AIRFLOW__GOOGLE__CLIENT_SECRET"),
                  'api_base_url': 'https://www.googleapis.com/oauth2/v2/',
                  'client_kwargs': {
                      'scope': 'email profile'
                  },
                  'request_token_url': None,
                  'access_token_url':'https://accounts.google.com/o/oauth2/token',
                  'authorize_url':'https://accounts.google.com/o/oauth2/auth',
              }
          }
      ]
      
      # force users to re-auth after 30min of inactivity (to keep roles in sync)
      PERMANENT_SESSION_LIFETIME = 1800

###################################
# Airflow - Worker Configs
###################################
workers:
  enabled: false

###################################
# Airflow - Flower Configs
###################################
flower:
  enabled: false

###################################
## CONFIG | Airflow DAGs
###################################
dags:
  ## the airflow dags folder
  ##
  path: /opt/airflow/dags
  gitSync:
    ## if the git-sync sidecar container is enabled
    ##
    enabled: true
    repo: "http://XXXX/airflow-dags.git"

###################################
## CONFIG | Kubernetes Ingress
###################################
ingress:
  enabled: True
  web:
    annotations:
      kubernetes.io/ingress.class: traefik
      traefik.ingress.kubernetes.io/router.tls: 'true'
      traefik.ingress.kubernetes.io/router.tls.certresolver: dns-resolver
    host: "XXXX"

###################################
# Kubernetes - RBAC
###################################
rbac:
  create: true
  events: False

###################################
# Kubernetes - Service Account
###################################
serviceAccount:
  create: False

###################################
## DATABASE | PgBouncer
###################################
pgbouncer:
  enabled: true

###################################
## DATABASE | Embedded Postgres
###################################
postgresql:
  enabled: False

###################################
## DATABASE | External Database
###################################
externalDatabase:
  type: postgres
  host: XXXX.rds.amazonaws.com
  port: 5432
  database: airflow
  user: airflow
  passwordSecret: "airflow-XXXX"
  passwordSecretKey: "XXXX"
  properties: ""

###################################
## DATABASE | Embedded Redis
###################################
redis:
  enabled: false

###################################
# Prometheus - ServiceMonitor
###################################
serviceMonitor:
  enabled: false

###################################
# Prometheus - PrometheusRule
###################################
prometheusRule:
  enabled: false

emilianomoscato added the kind/bug label on Feb 18, 2022
thesuperzapper changed the title from "Can not deploy with pgbouncer enabled" to "pgbouncer readinessProbe timeouts cause livenessProbe failures" on Feb 22, 2022
@thesuperzapper (Member)

@emilianomoscato thanks so much for raising this, it will really help others!

I believe the issue is:

  1. the readinessProbe times out (or otherwise fails)
  2. the pgbouncer pod becomes "unready"
  3. airflow-pgbouncer.{NAMESPACE}.svc.cluster.local stops resolving to an IP, because the Service no longer has any ready pods behind it (see the quick check below)
  4. the livenessProbe then fails with "connection refused", so the kubelet restarts the pod
  5. (possibly, the flood of connections re-established after the restart overloads the pod, causing the readinessProbe to time out and the whole cycle to start again)
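
A quick way to confirm step 3 while the pod is stuck unready (a sketch; the namespace matches the logs above):

# does the Service have any ready endpoints right now?
$ kubectl get endpoints airflow-pgbouncer -n airflow

# does the Service name still resolve from inside the cluster?
$ kubectl run dns-test --rm -it --restart=Never --image=busybox -- \
    nslookup airflow-pgbouncer.airflow.svc.cluster.local

An empty ENDPOINTS column plus a failing lookup matches the "could not translate host name" livenessProbe error in the report above.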

I have raised issue #526 to propose replacing the readinessProbe with a startupProbe, which should hopefully address this issue.
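
In the meantime, here is a minimal sketch of that idea against the pgbouncer Deployment. These are standard Kubernetes probe fields (startupProbe is stable since Kubernetes 1.20, so the v1.21 EKS cluster in this report supports it), but the thresholds are illustrative, not what the chart will necessarily ship:

containers:
  - name: pgbouncer
    ports:
      - containerPort: 6432
    # a startupProbe replaces the readinessProbe: the kubelet allows up to
    # 12 x 5s = 60s for pgbouncer to start accepting TCP connections, and
    # keeps the livenessProbe disabled until this probe first succeeds
    startupProbe:
      tcpSocket:
        port: 6432
      periodSeconds: 5
      failureThreshold: 12
    # deliberately no readinessProbe: with a single-pod Service, marking the
    # pod unready removes its only endpoint, which is what breaks DNS in
    # step 3 above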
