Lost data on single vm-storage instance in cluster #294

Closed
freeseacher opened this issue Jan 23, 2020 · 4 comments
Labels: bug, enhancement

@freeseacher commented Jan 23, 2020

Describe the bug
I have a VictoriaMetrics setup with two vminsert nodes and one vmstorage node.
Today I shut down the storage node, and after starting it back up I did not get all of my points from Prometheus.

To Reproduce
prometheus -1-> vm_balancer -2-> vm_insert -3-> vm_storage

1. Prometheus remote_write config:

- url: http://vm_balancer:8480/insert/0/prometheus/
  remote_timeout: 30s
  queue_config:
    capacity: 30000
    max_shards: 30
    min_shards: 4
    max_samples_per_send: 10000
    batch_send_deadline: 5s
    min_backoff: 30ms

vm_balancer is a GCP internal TCP load balancer, defined in Terraform as:

resource "google_compute_forwarding_rule" "vm_insert_lb_tf" {
  all_ports             = "false"
  backend_service       = google_compute_region_backend_service.vm_insert_target_tf.self_link
  ip_protocol           = "TCP"
  load_balancing_scheme = "INTERNAL"
  name                  = "vm-insert-lb"
  network               = "projects/${var.project}/global/networks/${var.network}"
  ports                 = ["8480"]
  project               = var.project_id
  region                = var.region
  allow_global_access   = true
  provider              = google-beta
  subnetwork            = "projects/${var.vpc_project}/regions/${var.region}/subnetworks/${var.subnetwork}"
  ip_address            = google_compute_address.vminsert_lb_addr_tf.address
}

vm_insert is vminsert-20200117-162442-tags-v1.32.5-cluster-0-g29d21259

ExecStart=/opt/victoria-metrics/vminsert-prod \
    -storageNode 10.104.66.25:8400 \
    -httpListenAddr=":8480" \
    -loggerLevel="INFO" 

vm-storage is vmstorage-20200117-162449-tags-v1.32.5-cluster-0-g29d21259

ExecStart=/opt/victoria-metrics/vmstorage-prod \
  -httpListenAddr=":8482" \
  -loggerLevel="INFO" \
  -retentionPeriod="12" \
  -storageDataPath="/var/lib/victoria-metrics" \
  -vminsertAddr=":8400" \
  -vmselectAddr=":8401" 

Expected behavior
Once the storage node is back up, all points from Prometheus appear in vm_storage.

Screenshots
(four screenshots omitted)

Additional context
vm_insert logs

-- Logs begin at Wed 2020-01-22 17:17:01 UTC. --
Jan 23 09:44:08 vm-insert-c vminsert[1290]: 2020-01-23T09:44:08.151Z        error        VictoriaMetrics@/app/vminsert/netstorage/netstorage.go:100        cannot send data to vmstorage 10.104.66.25:8400: cannot write data with size 33333336: cannot flush internal buffer to the underlying writer: write tcp4 10.104.66.21:40014->10.104.66.25:8400: write: broken pipe; re-routing data to healthy vmstorage nodes
Jan 23 09:44:08 vm-insert-c vminsert[1290]: 2020-01-23T09:44:08.652Z        error        VictoriaMetrics@/app/vminsert/netstorage/netstorage.go:219        cannot reroute data among healthy vmstorage nodes: all the storage nodes are unhealthy
Jan 23 09:44:08 vm-insert-c vminsert[1290]: 2020-01-23T09:44:08.824Z        error        VictoriaMetrics@/lib/httpserver/httpserver.go:368        error in "/insert/0/prometheus/": cannot send 5408711 bytes to storageNode "10.104.66.25:8400": 9455 rows dropped because of reroutedBuf overflows 66522777 bytes
Jan 23 09:44:08 vm-insert-c vminsert[1290]: 2020-01-23T09:44:08.845Z        error        VictoriaMetrics@/lib/httpserver/httpserver.go:368        error in "/insert/0/prometheus/": cannot send 5968876 bytes to storageNode "10.104.66.25:8400": 10000 rows dropped because of reroutedBuf overflows 66522777 bytes
Jan 23 09:44:08 vm-insert-c vminsert[1290]: 2020-01-23T09:44:08.863Z        error        VictoriaMetrics@/lib/httpserver/httpserver.go:368        error in "/insert/0/prometheus/": cannot send 6035237 bytes to storageNode "10.104.66.25:8400": 10000 rows dropped because of reroutedBuf overflows 66522777 bytes
Jan 23 09:44:08 vm-insert-c vminsert[1290]: 2020-01-23T09:44:08.886Z        error        VictoriaMetrics@/lib/httpserver/httpserver.go:368        error in "/insert/0/prometheus/": cannot send 5967087 bytes to storageNode "10.104.66.25:8400": 10000 rows dropped because of reroutedBuf overflows 66522777 bytes
Jan 23 09:44:08 vm-insert-c vminsert[1290]: 2020-01-23T09:44:08.919Z        error        VictoriaMetrics@/lib/httpserver/httpserver.go:368        error in "/insert/0/prometheus/": cannot send 5408711 bytes to storageNode "10.104.66.25:8400": 9455 rows dropped because of reroutedBuf overflows 66522777 bytes
Jan 23 09:44:08 vm-insert-c vminsert[1290]: 2020-01-23T09:44:08.920Z        error        VictoriaMetrics@/lib/httpserver/httpserver.go:368        error in "/insert/0/prometheus/": cannot send 5968876 bytes to storageNode "10.104.66.25:8400": 10000 rows dropped because of reroutedBuf overflows 66522777 bytes
Jan 23 09:44:08 vm-insert-c vminsert[1290]: 2020-01-23T09:44:08.938Z        error        VictoriaMetrics@/lib/httpserver/httpserver.go:368        error in "/insert/0/prometheus/": cannot send 6035237 bytes to storageNode "10.104.66.25:8400": 10000 rows dropped because of reroutedBuf overflows 66522777 bytes
Jan 23 09:44:08 vm-insert-c vminsert[1290]: 2020-01-23T09:44:08.955Z        error        VictoriaMetrics@/lib/httpserver/httpserver.go:368        error in "/insert/0/prometheus/": cannot send 5967087 bytes to storageNode "10.104.66.25:8400": 10000 rows dropped because of reroutedBuf overflows 66522777 bytes
Jan 23 09:44:09 vm-insert-c vminsert[1290]: 2020-01-23T09:44:09.263Z        error        VictoriaMetrics@/lib/httpserver/httpserver.go:368        error in "/insert/0/prometheus/": cannot send 6015044 bytes to storageNode "10.104.66.25:8400": 10000 rows dropped because of reroutedBuf overflows 66522777 bytes
Jan 23 09:44:09 vm-insert-c vminsert[1290]: 2020-01-23T09:44:09.865Z        error        VictoriaMetrics@/app/vminsert/netstorage/netstorage.go:100        cannot send data to vmstorage 10.104.66.25:8400: cannot dial "10.104.66.25:8400": dial tcp4 10.104.66.25:8400: connect: connection refused; re-routing data to healthy vmstorage nodes

Prometheus logs

ts=2020-01-23T12:16:07.652Z caller=dedupe.go:111 component=remote level=info remote_name=05e8cd url=http://vm_balancer:8480/insert/0/prometheus/ msg="Remote storage resharding" from=4 to=7
ts=2020-01-23T12:16:17.652Z caller=dedupe.go:111 component=remote level=info remote_name=05e8cd url=http://vm_balancer:8480/insert/0/prometheus/ msg="Remote storage resharding" from=7 to=11
ts=2020-01-23T12:16:27.652Z caller=dedupe.go:111 component=remote level=info remote_name=05e8cd url=http://vm_balancer:8480/insert/0/prometheus/ msg="Remote storage resharding" from=11 to=16
ts=2020-01-23T12:16:37.652Z caller=dedupe.go:111 component=remote level=info remote_name=05e8cd url=http://vm_balancer:8480/insert/0/prometheus/ msg="Remote storage resharding" from=16 to=21
ts=2020-01-23T12:17:07.652Z caller=dedupe.go:111 component=remote level=info remote_name=05e8cd url=http://vm_balancer:8480/insert/0/prometheus/ msg="Remote storage resharding" from=21 to=14
ts=2020-01-23T12:17:37.652Z caller=dedupe.go:111 component=remote level=info remote_name=05e8cd url=http://vm_balancer:8480/insert/0/prometheus/ msg="Remote storage resharding" from=14 to=9
ts=2020-01-23T12:18:07.675Z caller=dedupe.go:111 component=remote level=info remote_name=05e8cd url=http://vm_balancer:8480/insert/0/prometheus/ msg="Remote storage resharding" from=9 to=6
ts=2020-01-23T12:18:37.652Z caller=dedupe.go:111 component=remote level=info remote_name=05e8cd url=http://vm_balancer:8480/insert/0/prometheus/ msg="Remote storage resharding" from=6 to=4
valyala added the bug and enhancement labels Jan 24, 2020
valyala added a commit that referenced this issue Jan 24, 2020
@valyala (Collaborator) commented Jan 24, 2020

It turned out that vminsert could drop pending rows when all the vmstorage nodes were unavailable. This should be fixed in commit 4d70a81. The commit will be included in the next release of the cluster version of VictoriaMetrics.
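
For illustration, here is a minimal Go sketch of the failure mode described above: a bounded reroute buffer that silently drops rows once it overflows, instead of propagating an error back to the client. All names (rerouteBuf, maxBytes, push) are hypothetical and only illustrate the pattern, not the actual vminsert code.

// failure-mode sketch: rows queued for re-routing are silently dropped
// once the bounded buffer is full, so the client never gets a chance to retry.
package main

import (
	"fmt"
	"sync"
)

type rerouteBuf struct {
	mu       sync.Mutex
	buf      []byte
	maxBytes int
	dropped  int
}

// push appends a serialized block of rows to the buffer; when the buffer
// would overflow, the block is counted as dropped and discarded.
func (b *rerouteBuf) push(block []byte) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if len(b.buf)+len(block) > b.maxBytes {
		b.dropped++
		return false // data is lost here -- the caller still reports success upstream
	}
	b.buf = append(b.buf, block...)
	return true
}

func main() {
	b := &rerouteBuf{maxBytes: 16}
	for i := 0; i < 4; i++ {
		b.push([]byte("rows-block"))
	}
	fmt.Printf("buffered=%d bytes, dropped=%d blocks\n", len(b.buf), b.dropped)
}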

@valyala (Collaborator) commented Jan 24, 2020

After commit 4d70a81, vminsert always returns a 503 error to Prometheus when it cannot write data to vmstorage, so Prometheus can keep retrying until vmstorage recovers. This should prevent data loss during temporary unavailability of all the vmstorage nodes in the cluster.
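
For illustration, a minimal Go sketch of the retry-friendly behavior described in this comment (hypothetical handler and function names, not the actual vminsert code): returning 503 instead of silently dropping rows lets the Prometheus remote_write queue keep the samples and resend them once vmstorage is reachable again.

// sketch: return 503 on storage failure so the remote_write client retries
// instead of losing the samples.
package main

import (
	"errors"
	"log"
	"net/http"
)

// sendToStorage stands in for forwarding the decoded rows to vmstorage nodes.
func sendToStorage(body []byte) error {
	return errors.New("all the storage nodes are unhealthy")
}

func insertHandler(w http.ResponseWriter, r *http.Request) {
	// ... decode the remote_write payload from r.Body here ...
	if err := sendToStorage(nil); err != nil {
		// 503 marks the write as retryable: Prometheus keeps the samples
		// in its queue and retries until vmstorage recovers.
		http.Error(w, err.Error(), http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusNoContent)
}

func main() {
	http.HandleFunc("/insert/0/prometheus/", insertHandler)
	log.Fatal(http.ListenAndServe(":8480", nil))
}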

@valyala (Collaborator) commented Jan 24, 2020

@freeseacher, try v1.32.7 - this release should contain the fix.

@valyala (Collaborator) commented Jan 27, 2020

Closing the issue as fixed. @freeseacher, feel free to re-open it if VictoriaMetrics v1.32.7 or a newer release continues dropping data when all the vmstorage nodes are unavailable.

valyala closed this as completed Jan 27, 2020