
a potential data corruption using EBS volumes #2815

Closed
jakubhajek opened this issue Sep 19, 2023 · 8 comments

Comments

@jakubhajek

I am running the CNPG operator (v1.20.1) on EKS. The worker nodes use io2 EBS volumes attached via a storage class, and the example cluster has separate volumes assigned for PGDATA and PGWAL. I ran some benchmarks using pgbench and got the following errors:

pgbench: error: client 978 script 0 aborted in command 5 query 0: ERROR:  could not read block 35006 in file "base/57705/57726": Bad address
pgbench: error: client 983 script 0 aborted in command 5 query 0: ERROR:  could not read block 56875 in file "base/57705/57726": Bad address
pgbench: error: client 992 script 0 aborted in command 5 query 0: ERROR:  could not read block 115132 in file "base/57705/57718": Bad address

which could indicate data corruption. Could you please guide me on what might be the reason for these errors?
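For reference, the workload was a standard pgbench run against the cluster's read-write service. A rough sketch of the kind of invocation used (the scale factor, client count, duration, and service name are my assumptions, not the exact commands):

# initialize a pgbench dataset; the scale factor only controls the data size
pgbench -i -s 1000 -h pgbench0-rw -U app app

# default TPC-B-like workload; roughly 1000 clients is inferred from the
# client numbers in the errors above
pgbench -c 1000 -j 16 -T 600 -h pgbench0-rw -U app app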

@gbartolini
Contributor

Can you please share the cluster definition here? Thanks.

@jakubhajek
Author

jakubhajek commented Sep 19, 2023

Absolutely. Here is the cluster definition with some of the configuration parameters anonymized. Please let me know if you need any other details about this issue. Thank you.

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pgbench0
  namespace: cnpg
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cnpg-workload
            operator: In
            values:
            - enabled
    podAntiAffinityType: preferred
    tolerations:
    - effect: NoSchedule
      key: cnpg-workload
      operator: Equal
      value: enabled
    topologyKey: topology.kubernetes.io/zone
  backup:
    barmanObjectStore:
      destinationPath: s3://xxx
      s3Credentials:
        inheritFromIAMRole: true
      wal:
        compression: gzip
        encryption: AES256
    retentionPolicy: 30d
    target: prefer-standby
  bootstrap:
    initdb:
      database: app
      encoding: UTF8
      localeCType: C
      localeCollate: C
      owner: app
  enableSuperuserAccess: true
  failoverDelay: 0
  imageName: private-repo/postgresql:13.11-6-202307041345
  imagePullPolicy: Always
  instances: 1
  logLevel: info
  maxSyncReplicas: 0
  minSyncReplicas: 0
  monitoring:
    customQueriesConfigMap:
    - key: queries
      name: cnpg-default-monitoring
    disableDefaultQueries: false
    enablePodMonitor: false
  postgresGID: 26
  postgresUID: 26
  postgresql:
    parameters:
      archive_mode: "on"
      archive_timeout: 5min
      auto_explain.log_min_duration: 10s
      autovacuum_analyze_scale_factor: "0.05"
      autovacuum_naptime: "20"
      autovacuum_vacuum_cost_delay: "10"
      autovacuum_vacuum_scale_factor: "0.05"
      checkpoint_completion_target: "0.9"
      checkpoint_timeout: "300"
      client_encoding: UTF8
      dynamic_shared_memory_type: posix
      effective_cache_size: 11GB
      huge_pages: "on"
      idle_in_transaction_session_timeout: "300000"
      log_checkpoints: "1"
      log_destination: csvlog
      log_directory: /controller/log
      log_filename: postgres
      log_hostname: "1"
      log_min_duration_statement: "5000"
      log_rotation_age: "0"
      log_rotation_size: "0"
      log_truncate_on_rotation: "false"
      logging_collector: "on"
      maintenance_work_mem: 320MB
      max_connections: "2000"
      max_locks_per_transaction: "64"
      max_parallel_workers: "32"
      max_replication_slots: "10"
      max_stack_depth: "6144"
      max_wal_senders: "15"
      max_wal_size: "2048"
      max_worker_processes: "10"
      min_wal_size: "192"
      pg_stat_statements.max: "10000"
      pg_stat_statements.track: all
      pgaudit.log: all, -misc
      pgaudit.log_catalog: "off"
      pgaudit.log_parameter: "on"
      pgaudit.log_relation: "on"
      shared_buffers: 2048MB
      shared_memory_type: mmap
      shared_preload_libraries: ""
      timezone: UTC
      track_activity_query_size: "2048"
      track_commit_timestamp: "on"
      track_io_timing: "on"
      wal_keep_size: 512MB
      wal_receiver_timeout: 5s
      wal_sender_timeout: 5s
      work_mem: 32MB
    pg_hba:
    - hostssl app all all cert
    shared_preload_libraries:
    - pg_stat_statements
    - pg_stat_monitor
    - pgaudit
    - auto_explain
    syncReplicaElectionConstraint:
      enabled: false
  primaryUpdateMethod: restart
  primaryUpdateStrategy: unsupervised
  priorityClassName: infra-cluster-high
  resources:
    limits:
      cpu: "2"
      hugepages-2Mi: 2Gi
      memory: 12Gi
    requests:
      cpu: "2"
      hugepages-2Mi: 2Gi
      memory: 8Gi
  serviceAccountTemplate:
    metadata:
      annotations:
        eks.amazonaws.com/role-arn: xxx
  startDelay: 30
  stopDelay: 30
  storage:
    resizeInUseVolumes: true
    size: 100Gi
    storageClass: io2-pg-data
  switchoverDelay: 40000000
  walStorage:
    resizeInUseVolumes: true
    size: 20Gi
    storageClass: io2

@gbartolini
Contributor

@jakubhajek - I can't spot anything out of the ordinary in this definition, and unfortunately I don't have any comments on the io2 storage classes (maybe others can help here).

The only thing I suggest is to enable dataChecksums at initdb time - but only if you can repeat the experiment.
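For reference, in CloudNativePG that is set under bootstrap.initdb; a minimal sketch based on the definition above (it only takes effect when a new cluster is bootstrapped, so it means re-initializing the cluster):

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pgbench0
  namespace: cnpg
spec:
  bootstrap:
    initdb:
      database: app
      owner: app
      dataChecksums: true   # initdb runs with data checksums enabled, so page corruption is detected on read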

@jakubhajek
Author

@gbartolini - I think it might be related to the carelessly configured memory settings for the PG instance.
Configuring memory-related parameters too close to the resources assigned to the pod can affect stability. The pod did not run out of memory, but it crashed under load, and I think that was the root cause of the data corruption I reported.
I don't have evidence for that assumption - and I don't have logs for the crash I mentioned - but I will try to replicate it. It seems to be a user configuration issue rather than a bug in the operator itself.
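To make that concrete, here is a rough worst-case estimate from the parameters in the cluster definition above (an upper bound, not a measurement; actual usage depends on how many sessions allocate their work_mem at the same time):

# shared_buffers                        2048 MB, pre-allocated, backed by huge pages
# work_mem * max_connections            32 MB * 2000 = 64000 MB if every session
#                                       ran a sort or hash simultaneously
# maintenance_work_mem                  320 MB per autovacuum / maintenance worker
#
# Even a modest fraction of the 2000 allowed connections using their work_mem
# would exceed the pod's 12Gi memory limit.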

Anyway, thanks a lot for investigating my issue.

@armru
Member

armru commented Sep 21, 2023

The concept of checking pod resources against Postgres configuration parameters is starting to be introduced in #2840 (although in this particular case it wouldn't have helped).

@jakubhajek
Author

I have been configuring vm.nr_hugepages at the worker node level to be able to address 4096 MB of shared buffers, so I set vm.nr_hugepages to 4199. I am not on PG 15, so I can't use the feature that calculates that parameter for me.
Then I configured the pod resources accordingly to allocate 4000Mi of hugepages-2Mi, set shared_buffers to 4GB, and restarted the cluster. After the restart I ran a simple count(*) query against one of the large tables (~10 GB), and the session crashed with: DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
Once the server had recovered from the crash, the data was corrupted. I was not able to list the tables in the schema:

my_schema=# \dt my_schema.*
ERROR:  could not read block 18 in file "base/16448/2691": Bad address
LINE 4:   pg_catalog.pg_get_userbyid(c.relowner) as "Owner"

There were more entries in the log files saying "could not read block ... in file ...".
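For context, the huge pages arithmetic behind those numbers (a sketch of the reasoning; the headroom is my own estimate rather than an exact requirement):

# with 2 MB huge pages, 4096 MB of shared_buffers needs at least 4096 / 2 = 2048 pages;
# PostgreSQL's shared memory segment is larger than shared_buffers alone
# (WAL buffers, lock tables, and so on), so some headroom is needed on top
sysctl -w vm.nr_hugepages=4199   # 4199 * 2 MB is roughly 8.2 GiB reserved on the node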

@jakubhajek
Author

Based on my further investigation, I believe the issue is related to the memory settings and is not a CNPG bug. The data corruption is a side effect of the postmaster process crashing because the memory settings and pod resources were carelessly configured. Configuring the Postgres memory parameters and pod resources, especially when hugepages-2Mi are used, is a crucial aspect and has to be calculated appropriately.
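For anyone hitting the same symptom, here is a sketch of a consistent combination for a 4 GB shared_buffers instance (the headroom figure is an assumption; the point is that the pod's hugepages-2Mi allocation has to cover PostgreSQL's whole shared memory segment, which is larger than shared_buffers, whereas in the scenario above the 4000Mi request was smaller than shared_buffers alone):

spec:
  postgresql:
    parameters:
      shared_buffers: 4096MB
      huge_pages: "on"          # refuse to start instead of silently falling back to regular pages
  resources:
    limits:
      hugepages-2Mi: 4608Mi     # shared_buffers plus headroom for the rest of the shared segment
      memory: 12Gi
    requests:
      hugepages-2Mi: 4608Mi     # Kubernetes requires huge pages requests and limits to be equal
      memory: 8Gi

The node's vm.nr_hugepages then has to reserve at least as much, i.e. 4608 / 2 = 2304 pages on the worker node.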

@gbartolini
Contributor

Thanks!
