a potential data corruption using EBS volumes #2815
Comments
Can you please share the cluster definition here? Thanks. |
Absolutely. Here is the cluster definition with some anonymized configuration parameters. Please let me know if you need any other details concerning this issue. Thank you.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pgbench0
  namespace: cnpg
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cnpg-workload
            operator: In
            values:
            - enabled
    podAntiAffinityType: preferred
    tolerations:
    - effect: NoSchedule
      key: cnpg-workload
      operator: Equal
      value: enabled
    topologyKey: topology.kubernetes.io/zone
  backup:
    barmanObjectStore:
      destinationPath: s3://xxx
      s3Credentials:
        inheritFromIAMRole: true
      wal:
        compression: gzip
        encryption: AES256
    retentionPolicy: 30d
    target: prefer-standby
  bootstrap:
    initdb:
      database: app
      encoding: UTF8
      localeCType: C
      localeCollate: C
      owner: app
  enableSuperuserAccess: true
  failoverDelay: 0
  imageName: private-repo/postgresql:13.11-6-202307041345
  imagePullPolicy: Always
  instances: 1
  logLevel: info
  maxSyncReplicas: 0
  minSyncReplicas: 0
  monitoring:
    customQueriesConfigMap:
    - key: queries
      name: cnpg-default-monitoring
    disableDefaultQueries: false
    enablePodMonitor: false
  postgresGID: 26
  postgresUID: 26
  postgresql:
    parameters:
      archive_mode: "on"
      archive_timeout: 5min
      auto_explain.log_min_duration: 10s
      autovacuum_analyze_scale_factor: "0.05"
      autovacuum_naptime: "20"
      autovacuum_vacuum_cost_delay: "10"
      autovacuum_vacuum_scale_factor: "0.05"
      checkpoint_completion_target: "0.9"
      checkpoint_timeout: "300"
      client_encoding: UTF8
      dynamic_shared_memory_type: posix
      effective_cache_size: 11GB
      huge_pages: "on"
      idle_in_transaction_session_timeout: "300000"
      log_checkpoints: "1"
      log_destination: csvlog
      log_directory: /controller/log
      log_filename: postgres
      log_hostname: "1"
      log_min_duration_statement: "5000"
      log_rotation_age: "0"
      log_rotation_size: "0"
      log_truncate_on_rotation: "false"
      logging_collector: "on"
      maintenance_work_mem: 320MB
      max_connections: "2000"
      max_locks_per_transaction: "64"
      max_parallel_workers: "32"
      max_replication_slots: "10"
      max_stack_depth: "6144"
      max_wal_senders: "15"
      max_wal_size: "2048"
      max_worker_processes: "10"
      min_wal_size: "192"
      pg_stat_statements.max: "10000"
      pg_stat_statements.track: all
      pgaudit.log: all, -misc
      pgaudit.log_catalog: "off"
      pgaudit.log_parameter: "on"
      pgaudit.log_relation: "on"
      shared_buffers: 2048MB
      shared_memory_type: mmap
      shared_preload_libraries: ""
      timezone: UTC
      track_activity_query_size: "2048"
      track_commit_timestamp: "on"
      track_io_timing: "on"
      wal_keep_size: 512MB
      wal_receiver_timeout: 5s
      wal_sender_timeout: 5s
      work_mem: 32MB
    pg_hba:
    - hostssl app all all cert
    shared_preload_libraries:
    - pg_stat_statements
    - pg_stat_monitor
    - pgaudit
    - auto_explain
    syncReplicaElectionConstraint:
      enabled: false
  primaryUpdateMethod: restart
  primaryUpdateStrategy: unsupervised
  priorityClassName: infra-cluster-high
  resources:
    limits:
      cpu: "2"
      hugepages-2Mi: 2Gi
      memory: 12Gi
    requests:
      cpu: "2"
      hugepages-2Mi: 2Gi
      memory: 8Gi
  serviceAccountTemplate:
    metadata:
      annotations:
        eks.amazonaws.com/role-arn: xxx
  startDelay: 30
  stopDelay: 30
  storage:
    resizeInUseVolumes: true
    size: 100Gi
    storageClass: io2-pg-data
  switchoverDelay: 40000000
  walStorage:
    resizeInUseVolumes: true
    size: 20Gi
    storageClass: io2
```
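The memory parameters in this definition are worth a second look against the pod's 12Gi limit. A back-of-the-envelope check (a sketch under the simplifying assumption that every backend can allocate one full `work_mem` buffer; real Postgres memory use is more nuanced, and queries can allocate `work_mem` more than once) shows how far the worst case exceeds the limit:

```python
# Rough worst-case memory estimate for the posted settings. All values are
# taken from the cluster definition above; the formula is a simplification,
# not an exact model of PostgreSQL memory behaviour.

MIB = 1024 * 1024

shared_buffers = 2048 * MIB         # shared_buffers: 2048MB
maintenance_work_mem = 320 * MIB    # maintenance_work_mem: 320MB
work_mem = 32 * MIB                 # work_mem: 32MB
max_connections = 2000              # max_connections: "2000"
pod_memory_limit = 12 * 1024 * MIB  # resources.limits.memory: 12Gi

# Worst case: every allowed backend active, each using one work_mem buffer.
worst_case = shared_buffers + maintenance_work_mem + max_connections * work_mem

print(f"worst case: {worst_case // MIB} MiB")          # 66368 MiB
print(f"pod limit:  {pod_memory_limit // MIB} MiB")    # 12288 MiB
print("exceeds limit:", worst_case > pod_memory_limit)
```

Even this crude estimate is more than five times the pod limit, so an OOM kill of a backend or the postmaster under load is plausible.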
@jakubhajek - I can't spot anything out of the ordinary in this definition, and unfortunately I don't have any comments on the io2 storage classes (maybe others can help here). The only thing I suggest is to enable
@gbartolini - I think it might be related to the carelessly configured memory settings for the PG instance. Anyway, thanks a lot for investigating my issue.
The concept of checking pod resources and Postgres configuration parameters is starting to get introduced somewhat in #2840 (in this particular case it wouldn't have helped).
I have been getting errors like this one when listing the tables in the affected schema:

```
my_schema=# \dt my_schema.*
ERROR:  could not read block 18 in file "base/16448/2691": Bad address
LINE 4: pg_catalog.pg_get_userbyid(c.relowner) as "Owner"
```

There were more entries in the log files saying
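As an aside, the failing path in that error message encodes which database and relation are affected: in `base/<oid>/<filenode>`, the first number is the database's OID (`pg_database.oid`) and the second is the relation's on-disk file number (`pg_class.relfilenode`). A tiny illustrative parser (hypothetical helper, just to show the path layout):

```python
# Decode a PostgreSQL data-directory relative path of the form
# "base/<database_oid>/<relfilenode>" into its two identifiers.

def parse_relfile_path(path: str) -> dict:
    parts = path.split("/")
    if len(parts) != 3 or parts[0] != "base":
        raise ValueError(f"unexpected path layout: {path!r}")
    return {"database_oid": int(parts[1]), "relfilenode": int(parts[2])}

info = parse_relfile_path("base/16448/2691")
print(info)  # {'database_oid': 16448, 'relfilenode': 2691}
```

With those numbers, `SELECT relname FROM pg_class WHERE relfilenode = 2691;` in the affected database identifies the relation. OIDs below 16384 are typically reserved for system objects, so a filenode of 2691 suggests the damaged file belongs to a system catalog or one of its indexes.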
Based on my further investigation, I believe the issue is related to the memory settings and is not a CNPG bug. The data corruption is a side effect of the postmaster process crashing, since the memory settings and pod resources were carelessly configured. Configuring the Postgres parameters and pod resources, especially if the
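For anyone hitting the same problem, one common sizing heuristic (an assumption on my part, not an official CNPG or PostgreSQL recommendation) is to subtract `shared_buffers` and some overhead from the pod memory limit, then divide the remainder across the maximum number of backends:

```python
# Heuristic work_mem sizing: leave headroom for shared_buffers plus a
# fixed overhead fraction, then split the rest across all possible
# backends. Parameter names and the 20% overhead are assumptions.

def suggest_work_mem(pod_limit_mib: int, shared_buffers_mib: int,
                     max_connections: int, overhead_fraction: float = 0.2) -> int:
    """Return a conservative per-backend work_mem in MiB (at least 1)."""
    usable = pod_limit_mib * (1 - overhead_fraction) - shared_buffers_mib
    return max(1, int(usable // max_connections))

# With the posted settings (12Gi limit, 2048MB shared_buffers, 2000 conns):
print(suggest_work_mem(12288, 2048, 2000))  # 3 -> ~3MB, far below 32MB
# With a more realistic 100-connection ceiling:
print(suggest_work_mem(12288, 2048, 100))   # 77 -> 32MB would be fine
```

By this measure, either `work_mem` had to come down drastically or `max_connections` was set an order of magnitude too high for the pod size (a connection pooler such as PgBouncer is the usual way to keep the backend count low).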
Thanks!
I am running the CNPG operator (v1.20.1) on the EKS platform. The worker nodes use io2 EBS volumes attached via a storage class. The example cluster has separate volumes assigned for PGDATA and PGWAL. I ran some benchmarks using pgbench, and I got the following error:
which could mean potential data corruption. Could you please guide me on what might be the cause of that error?