
etcd timeout #9320

Closed · Lokicity opened this issue Feb 13, 2018 · 6 comments

@Lokicity

My Kubernetes master and etcd run on the same VM, and I am running into intermittent issues with "etcd timing out" and "apiserver received an error that is not a metav1.Status: etcdserver: request timed out":

Feb 09 15:09:41 master-a4995c9c-0da9-11e8-94fe-02fa846d2442 kube-apiserver[1267]: logging error output: "k8s\x00\n\f\n\x02v1\x12\x06Status\x125\n\x06\n\x00\x12\x00\x1a\x00\x12\aFailure\x1a\x1detcdserver: request timed out\"\x000\xf4\x03\x1a\x00\"\x00"
Feb 09 15:09:41 master-a4995c9c-0da9-11e8-94fe-02fa846d2442 kube-apiserver[1267]: I0209 15:09:36.085202    1267 trace.go:76] Trace[376525130]: "GuaranteedUpdate etcd3: *api.Node" (started: 2018-02-09 15:09:29.07814287 +0000 UTC m=+423.501354923) (total time: 7.007027719s):
Feb 09 15:09:41 master-a4995c9c-0da9-11e8-94fe-02fa846d2442 kube-apiserver[1267]: E0209 15:09:36.085234    1267 status.go:62] apiserver received an error that is not an metav1.Status: etcdserver: request timed out
Feb 09 15:09:41 master-a4995c9c-0da9-11e8-94fe-02fa846d2442 kube-apiserver[1267]: logging error output: "k8s\x00\n\f\n\x02v1\x12\x06Status\x125\n\x06\n\x00\x12\x00\x1a\x00\x12\aFailure\x1a\x1detcdserver: request timed out\"\x000\xf4\x03\x1a\x00\"\x00"
Feb 09 15:09:44 master-a4995c9c-0da9-11e8-94fe-02fa846d2442 etcd[1185]: timed out waiting for read index response
Feb 09 15:09:51 master-a4995c9c-0da9-11e8-94fe-02fa846d2442 etcd[1185]: timed out waiting for read index response
Feb 09 15:09:53 master-a4995c9c-0da9-11e8-94fe-02fa846d2442 etcd[1185]: sync duration of 23.383339906s, expected less than 1s
Feb 09 15:10:06 master-a4995c9c-0da9-11e8-94fe-02fa846d2442 etcd[1185]: apply entries took too long [36.026037646s for 7 entries]
Feb 09 15:10:06 master-a4995c9c-0da9-11e8-94fe-02fa846d2442 etcd[1185]: avoid queries with large range/delete range!
Feb 09 15:10:13 master-a4995c9c-0da9-11e8-94fe-02fa846d2442 etcd[1185]: timed out waiting for read index response

Attached are my etcd metrics. Can somebody tell me whether there are any obvious problems in these metrics? For example, do I need to switch etcd to a better disk?

# HELP etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds Bucketed histogram of db compaction pause duration.
# TYPE etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds histogram
etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds_bucket{le="1"} 0
etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds_bucket{le="2"} 0
etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds_bucket{le="4"} 0
etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds_bucket{le="8"} 0
etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds_bucket{le="16"} 0
etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds_bucket{le="32"} 0
etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds_bucket{le="64"} 0
etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds_bucket{le="128"} 0
etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds_bucket{le="256"} 0
etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds_bucket{le="512"} 0
etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds_bucket{le="1024"} 0
etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds_bucket{le="2048"} 0
etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds_bucket{le="4096"} 0
etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds_bucket{le="+Inf"} 0
etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds_sum 0
etcd_debugging_mvcc_db_compaction_pause_duration_milliseconds_count 0
# HELP etcd_debugging_mvcc_db_compaction_total_duration_milliseconds Bucketed histogram of db compaction total duration.
# TYPE etcd_debugging_mvcc_db_compaction_total_duration_milliseconds histogram
etcd_debugging_mvcc_db_compaction_total_duration_milliseconds_bucket{le="100"} 82
etcd_debugging_mvcc_db_compaction_total_duration_milliseconds_bucket{le="200"} 82
etcd_debugging_mvcc_db_compaction_total_duration_milliseconds_bucket{le="400"} 82
etcd_debugging_mvcc_db_compaction_total_duration_milliseconds_bucket{le="800"} 82
etcd_debugging_mvcc_db_compaction_total_duration_milliseconds_bucket{le="1600"} 82
etcd_debugging_mvcc_db_compaction_total_duration_milliseconds_bucket{le="3200"} 82
etcd_debugging_mvcc_db_compaction_total_duration_milliseconds_bucket{le="6400"} 82
etcd_debugging_mvcc_db_compaction_total_duration_milliseconds_bucket{le="12800"} 82
etcd_debugging_mvcc_db_compaction_total_duration_milliseconds_bucket{le="25600"} 82
etcd_debugging_mvcc_db_compaction_total_duration_milliseconds_bucket{le="51200"} 82
etcd_debugging_mvcc_db_compaction_total_duration_milliseconds_bucket{le="102400"} 82
etcd_debugging_mvcc_db_compaction_total_duration_milliseconds_bucket{le="204800"} 82
etcd_debugging_mvcc_db_compaction_total_duration_milliseconds_bucket{le="409600"} 82
etcd_debugging_mvcc_db_compaction_total_duration_milliseconds_bucket{le="819200"} 82
etcd_debugging_mvcc_db_compaction_total_duration_milliseconds_bucket{le="+Inf"} 82
etcd_debugging_mvcc_db_compaction_total_duration_milliseconds_sum 0
etcd_debugging_mvcc_db_compaction_total_duration_milliseconds_count 82
# HELP etcd_debugging_mvcc_db_total_size_in_bytes Total size of the underlying database in bytes.
# TYPE etcd_debugging_mvcc_db_total_size_in_bytes gauge
etcd_debugging_mvcc_db_total_size_in_bytes 2.17088e+06
# HELP etcd_debugging_mvcc_delete_total Total number of deletes seen by this member.
# TYPE etcd_debugging_mvcc_delete_total counter
etcd_debugging_mvcc_delete_total 0
# HELP etcd_debugging_mvcc_events_total Total number of events sent by this member.
# TYPE etcd_debugging_mvcc_events_total counter
etcd_debugging_mvcc_events_total 27284
# HELP etcd_debugging_mvcc_index_compaction_pause_duration_milliseconds Bucketed histogram of index compaction pause duration.
# TYPE etcd_debugging_mvcc_index_compaction_pause_duration_milliseconds histogram
etcd_debugging_mvcc_index_compaction_pause_duration_milliseconds_bucket{le="0.5"} 0
etcd_debugging_mvcc_index_compaction_pause_duration_milliseconds_bucket{le="1"} 69
etcd_debugging_mvcc_index_compaction_pause_duration_milliseconds_bucket{le="2"} 75
etcd_debugging_mvcc_index_compaction_pause_duration_milliseconds_bucket{le="4"} 76
etcd_debugging_mvcc_index_compaction_pause_duration_milliseconds_bucket{le="8"} 76
etcd_debugging_mvcc_index_compaction_pause_duration_milliseconds_bucket{le="16"} 76
etcd_debugging_mvcc_index_compaction_pause_duration_milliseconds_bucket{le="32"} 77
etcd_debugging_mvcc_index_compaction_pause_duration_milliseconds_bucket{le="64"} 77
etcd_debugging_mvcc_index_compaction_pause_duration_milliseconds_bucket{le="128"} 77
etcd_debugging_mvcc_index_compaction_pause_duration_milliseconds_bucket{le="256"} 77
etcd_debugging_mvcc_index_compaction_pause_duration_milliseconds_bucket{le="512"} 81
etcd_debugging_mvcc_index_compaction_pause_duration_milliseconds_bucket{le="1024"} 82
etcd_debugging_mvcc_index_compaction_pause_duration_milliseconds_bucket{le="+Inf"} 82
etcd_debugging_mvcc_index_compaction_pause_duration_milliseconds_sum 2309
etcd_debugging_mvcc_index_compaction_pause_duration_milliseconds_count 82
# HELP etcd_debugging_mvcc_keys_total Total number of keys.
# TYPE etcd_debugging_mvcc_keys_total gauge
etcd_debugging_mvcc_keys_total 179
# HELP etcd_debugging_mvcc_pending_events_total Total number of pending events to be sent.
# TYPE etcd_debugging_mvcc_pending_events_total gauge
etcd_debugging_mvcc_pending_events_total 0
# HELP etcd_debugging_mvcc_put_total Total number of puts seen by this member.
# TYPE etcd_debugging_mvcc_put_total counter
etcd_debugging_mvcc_put_total 0
# HELP etcd_debugging_mvcc_range_total Total number of ranges seen by this member.
# TYPE etcd_debugging_mvcc_range_total counter
etcd_debugging_mvcc_range_total 104375
# HELP etcd_debugging_mvcc_slow_watcher_total Total number of unsynced slow watchers.
# TYPE etcd_debugging_mvcc_slow_watcher_total gauge
etcd_debugging_mvcc_slow_watcher_total 0
# HELP etcd_debugging_mvcc_txn_total Total number of txns seen by this member.
# TYPE etcd_debugging_mvcc_txn_total counter
etcd_debugging_mvcc_txn_total 27952
# HELP etcd_debugging_mvcc_watch_stream_total Total number of watch streams.
# TYPE etcd_debugging_mvcc_watch_stream_total gauge
etcd_debugging_mvcc_watch_stream_total 60
# HELP etcd_debugging_mvcc_watcher_total Total number of watchers.
# TYPE etcd_debugging_mvcc_watcher_total gauge
etcd_debugging_mvcc_watcher_total 60
# HELP etcd_debugging_snap_save_marshalling_duration_seconds The marshalling cost distributions of save called by snapshot.
# TYPE etcd_debugging_snap_save_marshalling_duration_seconds histogram
etcd_debugging_snap_save_marshalling_duration_seconds_bucket{le="0.001"} 10
etcd_debugging_snap_save_marshalling_duration_seconds_bucket{le="0.002"} 10
etcd_debugging_snap_save_marshalling_duration_seconds_bucket{le="0.004"} 10
etcd_debugging_snap_save_marshalling_duration_seconds_bucket{le="0.008"} 10
etcd_debugging_snap_save_marshalling_duration_seconds_bucket{le="0.016"} 10
etcd_debugging_snap_save_marshalling_duration_seconds_bucket{le="0.032"} 10
etcd_debugging_snap_save_marshalling_duration_seconds_bucket{le="0.064"} 10
etcd_debugging_snap_save_marshalling_duration_seconds_bucket{le="0.128"} 10
etcd_debugging_snap_save_marshalling_duration_seconds_bucket{le="0.256"} 10
etcd_debugging_snap_save_marshalling_duration_seconds_bucket{le="0.512"} 10
etcd_debugging_snap_save_marshalling_duration_seconds_bucket{le="1.024"} 10
etcd_debugging_snap_save_marshalling_duration_seconds_bucket{le="2.048"} 10
etcd_debugging_snap_save_marshalling_duration_seconds_bucket{le="4.096"} 10
etcd_debugging_snap_save_marshalling_duration_seconds_bucket{le="8.192"} 10
etcd_debugging_snap_save_marshalling_duration_seconds_bucket{le="+Inf"} 10
etcd_debugging_snap_save_marshalling_duration_seconds_sum 0.0003179209999999999
etcd_debugging_snap_save_marshalling_duration_seconds_count 10
# HELP etcd_debugging_snap_save_total_duration_seconds The total latency distributions of save called by snapshot.
# TYPE etcd_debugging_snap_save_total_duration_seconds histogram
etcd_debugging_snap_save_total_duration_seconds_bucket{le="0.001"} 0
etcd_debugging_snap_save_total_duration_seconds_bucket{le="0.002"} 0
etcd_debugging_snap_save_total_duration_seconds_bucket{le="0.004"} 8
etcd_debugging_snap_save_total_duration_seconds_bucket{le="0.008"} 8
etcd_debugging_snap_save_total_duration_seconds_bucket{le="0.016"} 8
etcd_debugging_snap_save_total_duration_seconds_bucket{le="0.032"} 9
etcd_debugging_snap_save_total_duration_seconds_bucket{le="0.064"} 9
etcd_debugging_snap_save_total_duration_seconds_bucket{le="0.128"} 10
etcd_debugging_snap_save_total_duration_seconds_bucket{le="0.256"} 10
etcd_debugging_snap_save_total_duration_seconds_bucket{le="0.512"} 10
etcd_debugging_snap_save_total_duration_seconds_bucket{le="1.024"} 10
etcd_debugging_snap_save_total_duration_seconds_bucket{le="2.048"} 10
etcd_debugging_snap_save_total_duration_seconds_bucket{le="4.096"} 10
etcd_debugging_snap_save_total_duration_seconds_bucket{le="8.192"} 10
etcd_debugging_snap_save_total_duration_seconds_bucket{le="+Inf"} 10
etcd_debugging_snap_save_total_duration_seconds_sum 0.12088795499999999
etcd_debugging_snap_save_total_duration_seconds_count 10
# HELP etcd_debugging_store_expires_total Total number of expired keys.
# TYPE etcd_debugging_store_expires_total counter
etcd_debugging_store_expires_total 0
# HELP etcd_debugging_store_reads_total Total number of reads action by (get/getRecursive), local to this member.
# TYPE etcd_debugging_store_reads_total counter
etcd_debugging_store_reads_total{action="get"} 128
etcd_debugging_store_reads_total{action="getRecursive"} 42
# HELP etcd_debugging_store_watch_requests_total Total number of incoming watch requests (new or reestablished).
# TYPE etcd_debugging_store_watch_requests_total counter
etcd_debugging_store_watch_requests_total 2
# HELP etcd_debugging_store_watchers Count of currently active watchers.
# TYPE etcd_debugging_store_watchers gauge
etcd_debugging_store_watchers 2
# HELP etcd_debugging_store_writes_total Total number of writes (e.g. set/compareAndDelete) seen by this member.
# TYPE etcd_debugging_store_writes_total counter
etcd_debugging_store_writes_total{action="create"} 8
etcd_debugging_store_writes_total{action="delete"} 8
etcd_debugging_store_writes_total{action="set"} 25
# HELP etcd_disk_backend_commit_duration_seconds The latency distributions of commit called by backend.
# TYPE etcd_disk_backend_commit_duration_seconds histogram
etcd_disk_backend_commit_duration_seconds_bucket{le="0.001"} 0
etcd_disk_backend_commit_duration_seconds_bucket{le="0.002"} 23166
etcd_disk_backend_commit_duration_seconds_bucket{le="0.004"} 23601
etcd_disk_backend_commit_duration_seconds_bucket{le="0.008"} 23675
etcd_disk_backend_commit_duration_seconds_bucket{le="0.016"} 23707
etcd_disk_backend_commit_duration_seconds_bucket{le="0.032"} 23751
etcd_disk_backend_commit_duration_seconds_bucket{le="0.064"} 23832
etcd_disk_backend_commit_duration_seconds_bucket{le="0.128"} 23934
etcd_disk_backend_commit_duration_seconds_bucket{le="0.256"} 24247
etcd_disk_backend_commit_duration_seconds_bucket{le="0.512"} 25122
etcd_disk_backend_commit_duration_seconds_bucket{le="1.024"} 25844
etcd_disk_backend_commit_duration_seconds_bucket{le="2.048"} 25933
etcd_disk_backend_commit_duration_seconds_bucket{le="4.096"} 25944
etcd_disk_backend_commit_duration_seconds_bucket{le="8.192"} 25950
etcd_disk_backend_commit_duration_seconds_bucket{le="+Inf"} 25966
etcd_disk_backend_commit_duration_seconds_sum 1444.4355971160062
etcd_disk_backend_commit_duration_seconds_count 25966
# HELP etcd_disk_wal_fsync_duration_seconds The latency distributions of fsync called by wal.
# TYPE etcd_disk_wal_fsync_duration_seconds histogram
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.001"} 81426
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.002"} 85807
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.004"} 91673
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.008"} 92134
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.016"} 93055
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.032"} 98054
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.064"} 98651
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.128"} 99059
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.256"} 99718
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.512"} 100661
etcd_disk_wal_fsync_duration_seconds_bucket{le="1.024"} 101216
etcd_disk_wal_fsync_duration_seconds_bucket{le="2.048"} 101353
etcd_disk_wal_fsync_duration_seconds_bucket{le="4.096"} 101371
etcd_disk_wal_fsync_duration_seconds_bucket{le="8.192"} 101376
etcd_disk_wal_fsync_duration_seconds_bucket{le="+Inf"} 101390
etcd_disk_wal_fsync_duration_seconds_sum 1650.672015552
etcd_disk_wal_fsync_duration_seconds_count 101390
# HELP etcd_grpc_proxy_cache_hits_total Total number of cache hits
# TYPE etcd_grpc_proxy_cache_hits_total gauge
etcd_grpc_proxy_cache_hits_total 0
# HELP etcd_grpc_proxy_cache_misses_total Total number of cache misses
# TYPE etcd_grpc_proxy_cache_misses_total gauge
etcd_grpc_proxy_cache_misses_total 0
# HELP etcd_grpc_proxy_events_coalescing_total Total number of events coalescing
# TYPE etcd_grpc_proxy_events_coalescing_total counter
etcd_grpc_proxy_events_coalescing_total 0
# HELP etcd_grpc_proxy_watchers_coalescing_total Total number of current watchers coalescing
# TYPE etcd_grpc_proxy_watchers_coalescing_total gauge
etcd_grpc_proxy_watchers_coalescing_total 0
# HELP etcd_http_failed_total Counter of handle failures of requests (non-watches), by method (GET/PUT etc.) and code (400, 500 etc.).
# TYPE etcd_http_failed_total counter
etcd_http_failed_total{code="404",method="DELETE"} 8
etcd_http_failed_total{code="404",method="QGET"} 27
etcd_http_failed_total{code="412",method="PUT"} 2
etcd_http_failed_total{code="500",method="PUT"} 1
etcd_http_failed_total{code="500",method="QGET"} 1
# HELP etcd_http_received_total Counter of requests received into the system (successfully parsed and authd).
# TYPE etcd_http_received_total counter
etcd_http_received_total{method="DELETE"} 8
etcd_http_received_total{method="PUT"} 30
etcd_http_received_total{method="QGET"} 60
# HELP etcd_http_successful_duration_seconds Bucketed histogram of processing time (s) of successfully handled requests (non-watches), by method (GET/PUT etc.).
# TYPE etcd_http_successful_duration_seconds histogram
etcd_http_successful_duration_seconds_bucket{method="PUT",le="0.0005"} 4
etcd_http_successful_duration_seconds_bucket{method="PUT",le="0.001"} 4
etcd_http_successful_duration_seconds_bucket{method="PUT",le="0.002"} 4
etcd_http_successful_duration_seconds_bucket{method="PUT",le="0.004"} 4
etcd_http_successful_duration_seconds_bucket{method="PUT",le="0.008"} 4
etcd_http_successful_duration_seconds_bucket{method="PUT",le="0.016"} 6
etcd_http_successful_duration_seconds_bucket{method="PUT",le="0.032"} 7
etcd_http_successful_duration_seconds_bucket{method="PUT",le="0.064"} 10
etcd_http_successful_duration_seconds_bucket{method="PUT",le="0.128"} 15
etcd_http_successful_duration_seconds_bucket{method="PUT",le="0.256"} 19
etcd_http_successful_duration_seconds_bucket{method="PUT",le="0.512"} 24
etcd_http_successful_duration_seconds_bucket{method="PUT",le="1.024"} 26
etcd_http_successful_duration_seconds_bucket{method="PUT",le="2.048"} 27
etcd_http_successful_duration_seconds_bucket{method="PUT",le="+Inf"} 27
etcd_http_successful_duration_seconds_sum{method="PUT"} 6.325096843000001
etcd_http_successful_duration_seconds_count{method="PUT"} 27
etcd_http_successful_duration_seconds_bucket{method="QGET",le="0.0005"} 25
etcd_http_successful_duration_seconds_bucket{method="QGET",le="0.001"} 25
etcd_http_successful_duration_seconds_bucket{method="QGET",le="0.002"} 25
etcd_http_successful_duration_seconds_bucket{method="QGET",le="0.004"} 25
etcd_http_successful_duration_seconds_bucket{method="QGET",le="0.008"} 25
etcd_http_successful_duration_seconds_bucket{method="QGET",le="0.016"} 25
etcd_http_successful_duration_seconds_bucket{method="QGET",le="0.032"} 25
etcd_http_successful_duration_seconds_bucket{method="QGET",le="0.064"} 25
etcd_http_successful_duration_seconds_bucket{method="QGET",le="0.128"} 26
etcd_http_successful_duration_seconds_bucket{method="QGET",le="0.256"} 27
etcd_http_successful_duration_seconds_bucket{method="QGET",le="0.512"} 30
etcd_http_successful_duration_seconds_bucket{method="QGET",le="1.024"} 31
etcd_http_successful_duration_seconds_bucket{method="QGET",le="2.048"} 32
etcd_http_successful_duration_seconds_bucket{method="QGET",le="+Inf"} 32
etcd_http_successful_duration_seconds_sum{method="QGET"} 3.5796803930000007
etcd_http_successful_duration_seconds_count{method="QGET"} 32
# HELP etcd_network_client_grpc_received_bytes_total The total number of bytes received from grpc clients.
# TYPE etcd_network_client_grpc_received_bytes_total counter
etcd_network_client_grpc_received_bytes_total 1.9240446e+07
# HELP etcd_network_client_grpc_sent_bytes_total The total number of bytes sent to grpc clients.
# TYPE etcd_network_client_grpc_sent_bytes_total counter
etcd_network_client_grpc_sent_bytes_total 5.9652246e+07
# HELP etcd_server_has_leader Whether or not a leader exists. 1 is existence, 0 is not.
# TYPE etcd_server_has_leader gauge
etcd_server_has_leader 1
# HELP etcd_server_leader_changes_seen_total The number of leader changes seen.
# TYPE etcd_server_leader_changes_seen_total counter
etcd_server_leader_changes_seen_total 1
# HELP etcd_server_proposals_applied_total The total number of consensus proposals applied.
# TYPE etcd_server_proposals_applied_total gauge
etcd_server_proposals_applied_total 103076
# HELP etcd_server_proposals_committed_total The total number of consensus proposals committed.
# TYPE etcd_server_proposals_committed_total gauge
etcd_server_proposals_committed_total 103076
# HELP etcd_server_proposals_failed_total The total number of failed proposals seen.
# TYPE etcd_server_proposals_failed_total counter
etcd_server_proposals_failed_total 130
# HELP etcd_server_proposals_pending The current number of pending proposals to commit.
# TYPE etcd_server_proposals_pending gauge
etcd_server_proposals_pending 0
# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 2.7515e-05
go_gc_duration_seconds{quantile="0.25"} 6.2877e-05
go_gc_duration_seconds{quantile="0.5"} 0.000112538
go_gc_duration_seconds{quantile="0.75"} 0.000194799
go_gc_duration_seconds{quantile="1"} 0.246447656
go_gc_duration_seconds_sum 1.108936047
go_gc_duration_seconds_count 324
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 966
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 2.378836e+07
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 3.045720936e+09
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
# TYPE go_memstats_buck_hash_sys_bytes gauge
go_memstats_buck_hash_sys_bytes 1.59714e+06
# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 2.4471211e+07
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 2.332672e+06
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 2.378836e+07
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
# TYPE go_memstats_heap_idle_bytes gauge
go_memstats_heap_idle_bytes 2.7303936e+07
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
# TYPE go_memstats_heap_inuse_bytes gauge
go_memstats_heap_inuse_bytes 3.2169984e+07
# HELP go_memstats_heap_objects Number of allocated objects.
# TYPE go_memstats_heap_objects gauge
go_memstats_heap_objects 56163
# HELP go_memstats_heap_released_bytes_total Total number of heap bytes released to OS.
# TYPE go_memstats_heap_released_bytes_total counter
go_memstats_heap_released_bytes_total 1.9423232e+07
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
# TYPE go_memstats_heap_sys_bytes gauge
go_memstats_heap_sys_bytes 5.947392e+07
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge
go_memstats_last_gc_time_seconds 1.5182259817514844e+09
# HELP go_memstats_lookups_total Total number of pointer lookups.
# TYPE go_memstats_lookups_total counter
go_memstats_lookups_total 20744
# HELP go_memstats_mallocs_total Total number of mallocs.
# TYPE go_memstats_mallocs_total counter
go_memstats_mallocs_total 2.4527374e+07
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
# TYPE go_memstats_mcache_inuse_bytes gauge
go_memstats_mcache_inuse_bytes 2400
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
# TYPE go_memstats_mcache_sys_bytes gauge
go_memstats_mcache_sys_bytes 16384
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
# TYPE go_memstats_mspan_inuse_bytes gauge
go_memstats_mspan_inuse_bytes 402648
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
# TYPE go_memstats_mspan_sys_bytes gauge
go_memstats_mspan_sys_bytes 540672
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
# TYPE go_memstats_next_gc_bytes gauge
go_memstats_next_gc_bytes 4.6783152e+07
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
# TYPE go_memstats_other_sys_bytes gauge
go_memstats_other_sys_bytes 606756
# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
# TYPE go_memstats_stack_inuse_bytes gauge
go_memstats_stack_inuse_bytes 5.013504e+06
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
# TYPE go_memstats_stack_sys_bytes gauge
go_memstats_stack_sys_bytes 5.013504e+06
# HELP go_memstats_sys_bytes Number of bytes obtained by system. Sum of all system allocations.
# TYPE go_memstats_sys_bytes gauge
go_memstats_sys_bytes 6.9581048e+07
# HELP grpc_server_handled_total Total number of RPCs completed on the server, regardless of success or failure.
# TYPE grpc_server_handled_total counter
grpc_server_handled_total{grpc_code="Canceled",grpc_method="LeaseKeepAlive",grpc_service="etcdserverpb.Lease",grpc_type="bidi_stream"} 96
grpc_server_handled_total{grpc_code="Internal",grpc_method="LeaseKeepAlive",grpc_service="etcdserverpb.Lease",grpc_type="bidi_stream"} 167
grpc_server_handled_total{grpc_code="Internal",grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"} 77
grpc_server_handled_total{grpc_code="OK",grpc_method="Compact",grpc_service="etcdserverpb.KV",grpc_type="unary"} 82
grpc_server_handled_total{grpc_code="OK",grpc_method="LeaseGrant",grpc_service="etcdserverpb.Lease",grpc_type="unary"} 365
grpc_server_handled_total{grpc_code="OK",grpc_method="Range",grpc_service="etcdserverpb.KV",grpc_type="unary"} 49462
grpc_server_handled_total{grpc_code="OK",grpc_method="Txn",grpc_service="etcdserverpb.KV",grpc_type="unary"} 27490
grpc_server_handled_total{grpc_code="Unavailable",grpc_method="LeaseGrant",grpc_service="etcdserverpb.Lease",grpc_type="unary"} 64
grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Range",grpc_service="etcdserverpb.KV",grpc_type="unary"} 27
grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Txn",grpc_service="etcdserverpb.KV",grpc_type="unary"} 56
grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"} 166
grpc_server_handled_total{grpc_code="Unknown",grpc_method="Txn",grpc_service="etcdserverpb.KV",grpc_type="unary"} 1
# HELP grpc_server_msg_received_total Total number of RPC stream messages received on the server.
# TYPE grpc_server_msg_received_total counter
grpc_server_msg_received_total{grpc_method="Compact",grpc_service="etcdserverpb.KV",grpc_type="unary"} 82
grpc_server_msg_received_total{grpc_method="LeaseGrant",grpc_service="etcdserverpb.Lease",grpc_type="unary"} 429
grpc_server_msg_received_total{grpc_method="Range",grpc_service="etcdserverpb.KV",grpc_type="unary"} 49489
grpc_server_msg_received_total{grpc_method="Txn",grpc_service="etcdserverpb.KV",grpc_type="unary"} 27547
grpc_server_msg_received_total{grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"} 303
# HELP grpc_server_msg_sent_total Total number of gRPC stream messages sent by the server.
# TYPE grpc_server_msg_sent_total counter
grpc_server_msg_sent_total{grpc_method="Compact",grpc_service="etcdserverpb.KV",grpc_type="unary"} 82
grpc_server_msg_sent_total{grpc_method="LeaseGrant",grpc_service="etcdserverpb.Lease",grpc_type="unary"} 365
grpc_server_msg_sent_total{grpc_method="Range",grpc_service="etcdserverpb.KV",grpc_type="unary"} 49462
grpc_server_msg_sent_total{grpc_method="Txn",grpc_service="etcdserverpb.KV",grpc_type="unary"} 27490
grpc_server_msg_sent_total{grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"} 27587
# HELP grpc_server_started_total Total number of RPCs started on the server.
# TYPE grpc_server_started_total counter
grpc_server_started_total{grpc_method="Compact",grpc_service="etcdserverpb.KV",grpc_type="unary"} 82
grpc_server_started_total{grpc_method="LeaseGrant",grpc_service="etcdserverpb.Lease",grpc_type="unary"} 429
grpc_server_started_total{grpc_method="LeaseKeepAlive",grpc_service="etcdserverpb.Lease",grpc_type="bidi_stream"} 328
grpc_server_started_total{grpc_method="Range",grpc_service="etcdserverpb.KV",grpc_type="unary"} 49489
grpc_server_started_total{grpc_method="Txn",grpc_service="etcdserverpb.KV",grpc_type="unary"} 27547
grpc_server_started_total{grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"} 303
# HELP http_request_duration_microseconds The HTTP request latencies in microseconds.
# TYPE http_request_duration_microseconds summary
http_request_duration_microseconds{handler="prometheus",quantile="0.5"} NaN
http_request_duration_microseconds{handler="prometheus",quantile="0.9"} NaN
http_request_duration_microseconds{handler="prometheus",quantile="0.99"} NaN
http_request_duration_microseconds_sum{handler="prometheus"} 0
http_request_duration_microseconds_count{handler="prometheus"} 0
# HELP http_request_size_bytes The HTTP request sizes in bytes.
# TYPE http_request_size_bytes summary
http_request_size_bytes{handler="prometheus",quantile="0.5"} NaN
http_request_size_bytes{handler="prometheus",quantile="0.9"} NaN
http_request_size_bytes{handler="prometheus",quantile="0.99"} NaN
http_request_size_bytes_sum{handler="prometheus"} 0
http_request_size_bytes_count{handler="prometheus"} 0
# HELP http_response_size_bytes The HTTP response sizes in bytes.
# TYPE http_response_size_bytes summary
http_response_size_bytes{handler="prometheus",quantile="0.5"} NaN
http_response_size_bytes{handler="prometheus",quantile="0.9"} NaN
http_response_size_bytes{handler="prometheus",quantile="0.99"} NaN
http_response_size_bytes_sum{handler="prometheus"} 0
http_response_size_bytes_count{handler="prometheus"} 0
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 241.24
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 65536
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 92
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 6.3168512e+07
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.51818851942e+09
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 1.082361856e+10
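As a quick sanity check on the dump above, the cumulative `etcd_disk_wal_fsync_duration_seconds` buckets can be turned into rough quantiles with a short script. The bucket values below are copied from the metrics pasted earlier; `quantile_upper_bound` is just an illustrative helper, not an etcd tool. The etcd tuning docs suggest WAL fsync should generally complete in well under 10ms, so a p99 landing in the 0.512s bucket points at slow disk:

```python
# Cumulative bucket counts for etcd_disk_wal_fsync_duration_seconds,
# copied from the metrics dump above. Upper bounds are in seconds.
fsync_buckets = [
    (0.001, 81426), (0.002, 85807), (0.004, 91673), (0.008, 92134),
    (0.016, 93055), (0.032, 98054), (0.064, 98651), (0.128, 99059),
    (0.256, 99718), (0.512, 100661), (1.024, 101216), (2.048, 101353),
    (4.096, 101371), (8.192, 101376), (float("inf"), 101390),
]

def quantile_upper_bound(buckets, q):
    """Smallest bucket boundary whose cumulative count covers quantile q."""
    total = buckets[-1][1]
    for le, count in buckets:
        if count >= q * total:
            return le
    return float("inf")

p99 = quantile_upper_bound(fsync_buckets, 0.99)
mean_ms = 1650.672015552 / 101390 * 1000  # _sum / _count from the dump
print(f"p99 fsync <= {p99}s, mean fsync ~ {mean_ms:.1f}ms")
# -> p99 fsync <= 0.512s, mean fsync ~ 16.3ms
```

The same arithmetic on `etcd_disk_backend_commit_duration_seconds` (sum 1444.44s over 25966 commits, i.e. a ~56ms mean) tells the same story: both write paths are far slower than what a healthy SSD-backed etcd would show.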

@gyuho (Contributor) commented Feb 13, 2018

sync duration of 23.383339906s, expected less than 1s

Is this using an HDD?

etcd is I/O intensive; it persists Raft entries along with application data on disk.
You need a better disk.
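One way to confirm this independently of etcd is to measure raw fsync latency on the volume that holds the etcd data directory. A minimal sketch (the `/var/lib/etcd` path is an assumption; point it at whatever filesystem etcd actually uses):

```python
# Minimal sketch: measure raw fsync latency on the disk etcd writes to.
import os
import statistics
import time

def measure_fsync_ms(path, writes=100, block=b"x" * 4096):
    """Write small blocks, fsync after each, and return latencies in ms."""
    samples = []
    with open(path, "wb") as f:
        for _ in range(writes):
            f.write(block)
            f.flush()
            start = time.monotonic()
            os.fsync(f.fileno())
            samples.append((time.monotonic() - start) * 1000.0)
    os.remove(path)  # clean up the probe file
    return samples

if __name__ == "__main__":
    # Assumption: /var/lib/etcd is on the same volume as the etcd WAL.
    latencies = measure_fsync_ms("/var/lib/etcd/fsync-probe")
    print(f"median fsync: {statistics.median(latencies):.2f}ms, "
          f"max: {max(latencies):.2f}ms")
```

If the median here is already in the tens of milliseconds, no etcd tuning will help; the disk itself is the bottleneck.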

@fanminshi (Member)

Yeah, something doesn't look right with the sync duration.

@Lokicity (Author)

Thank you for getting back so promptly. Is disk the only thing that can cause high sync duration and etcd timeouts? Why would etcd time out so quickly at the beginning of cluster creation? Is Kubernetes stressing etcd too much? Do we have CPU/memory guidelines for etcd/k8s-apiserver VMs?

Thanks a lot.

@Lokicity (Author)

I am using either a t2.medium or an m4.large instance on AWS, and according to the AWS documentation, those are EBS-only instances. I think that is slower than SSD.

@gyuho (Contributor) commented Feb 13, 2018

We have a general hardware setup guide at https://github.com/coreos/etcd/blob/master/Documentation/op-guide/hardware.md, but nothing Kubernetes-specific.

The metrics

grpc_server_handled_total{grpc_code="Unavailable",grpc_method="LeaseGrant",grpc_service="etcdserverpb.Lease",grpc_type="unary"} 64
grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Range",grpc_service="etcdserverpb.KV",grpc_type="unary"} 27
grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Txn",grpc_service="etcdserverpb.KV",grpc_type="unary"} 56
grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"} 166
grpc_server_handled_total{grpc_code="Unknown",grpc_method="Txn",grpc_service="etcdserverpb.KV",grpc_type="unary"} 1

indicate that a bunch of v3 requests timed out, but these workloads are not that intensive. etcd with a better disk should be able to handle much heavier workloads.

@fanminshi (Member)

For an EBS-backed instance, you might consider using Provisioned IOPS (SSD) volumes for better I/O performance. They offer storage with consistent, low-latency performance and are designed for I/O-intensive applications such as large relational or NoSQL databases.
