Lost data on single vm-storage instance in cluster #294

Closed
freeseacher opened this issue Jan 23, 2020 · 4 comments
Labels: bug, enhancement

@freeseacher commented Jan 23, 2020

Describe the bug
I have a VictoriaMetrics setup with two vminsert nodes and one vmstorage node.
Today I shut down the storage node, and after starting it back up I did not get all of my points from Prometheus.

To Reproduce
prometheus -1-> vm_balancer -2-> vm_insert -3-> vm_storage

1. Prometheus remote_write config:

- url: http://vm_balancer:8480/insert/0/prometheus/
  remote_timeout: 30s
  queue_config:
    capacity: 30000
    max_shards: 30
    min_shards: 4
    max_samples_per_send: 10000
    batch_send_deadline: 5s
    min_backoff: 30ms

vm_balancer is a GCP internal TCP load balancer, defined in Terraform as:

resource "google_compute_forwarding_rule" "vm_insert_lb_tf" {
  all_ports             = "false"
  backend_service       = google_compute_region_backend_service.vm_insert_target_tf.self_link
  ip_protocol           = "TCP"
  load_balancing_scheme = "INTERNAL"
  name                  = "vm-insert-lb"
  network               = "projects/${var.project}/global/networks/${var.network}"
  ports                 = ["8480"]
  project               = var.project_id
  region                = var.region
  allow_global_access   = true
  provider              = google-beta
  subnetwork            = "projects/${var.vpc_project}/regions/${var.region}/subnetworks/${var.subnetwork}"
  ip_address            = google_compute_address.vminsert_lb_addr_tf.address
}

vm_insert is vminsert-20200117-162442-tags-v1.32.5-cluster-0-g29d21259

ExecStart=/opt/victoria-metrics/vminsert-prod \
    -storageNode 10.104.66.25:8400 \
    -httpListenAddr=":8480" \
    -loggerLevel="INFO" 

vm-storage is vmstorage-20200117-162449-tags-v1.32.5-cluster-0-g29d21259

ExecStart=/opt/victoria-metrics/vmstorage-prod \
  -httpListenAddr=":8482" \
  -loggerLevel="INFO" \
  -retentionPeriod="12" \
  -storageDataPath="/var/lib/victoria-metrics" \
  -vminsertAddr=":8400" \
  -vmselectAddr=":8401" 

Expected behavior
Once the storage node is back up, all points from Prometheus appear in vm_storage.

Screenshots
(four screenshots omitted)

Additional context
vm_insert logs

-- Logs begin at Wed 2020-01-22 17:17:01 UTC. --
Jan 23 09:44:08 vm-insert-c vminsert[1290]: 2020-01-23T09:44:08.151Z        error        VictoriaMetrics@/app/vminsert/netstorage/netstorage.go:100        cannot send data to vmstorage 10.104.66.25:8400: cannot write data with size 33333336: cannot flush internal buffer to the underlying writer: write tcp4 10.104.66.21:40014->10.104.66.25:8400: write: broken pipe; re-routing data to healthy vmstorage nodes
Jan 23 09:44:08 vm-insert-c vminsert[1290]: 2020-01-23T09:44:08.652Z        error        VictoriaMetrics@/app/vminsert/netstorage/netstorage.go:219        cannot reroute data among healthy vmstorage nodes: all the storage nodes are unhealthy
Jan 23 09:44:08 vm-insert-c vminsert[1290]: 2020-01-23T09:44:08.824Z        error        VictoriaMetrics@/lib/httpserver/httpserver.go:368        error in "/insert/0/prometheus/": cannot send 5408711 bytes to storageNode "10.104.66.25:8400": 9455 rows dropped because of reroutedBuf overflows 66522777 bytes
Jan 23 09:44:08 vm-insert-c vminsert[1290]: 2020-01-23T09:44:08.845Z        error        VictoriaMetrics@/lib/httpserver/httpserver.go:368        error in "/insert/0/prometheus/": cannot send 5968876 bytes to storageNode "10.104.66.25:8400": 10000 rows dropped because of reroutedBuf overflows 66522777 bytes
Jan 23 09:44:08 vm-insert-c vminsert[1290]: 2020-01-23T09:44:08.863Z        error        VictoriaMetrics@/lib/httpserver/httpserver.go:368        error in "/insert/0/prometheus/": cannot send 6035237 bytes to storageNode "10.104.66.25:8400": 10000 rows dropped because of reroutedBuf overflows 66522777 bytes
Jan 23 09:44:08 vm-insert-c vminsert[1290]: 2020-01-23T09:44:08.886Z        error        VictoriaMetrics@/lib/httpserver/httpserver.go:368        error in "/insert/0/prometheus/": cannot send 5967087 bytes to storageNode "10.104.66.25:8400": 10000 rows dropped because of reroutedBuf overflows 66522777 bytes
Jan 23 09:44:08 vm-insert-c vminsert[1290]: 2020-01-23T09:44:08.919Z        error        VictoriaMetrics@/lib/httpserver/httpserver.go:368        error in "/insert/0/prometheus/": cannot send 5408711 bytes to storageNode "10.104.66.25:8400": 9455 rows dropped because of reroutedBuf overflows 66522777 bytes
Jan 23 09:44:08 vm-insert-c vminsert[1290]: 2020-01-23T09:44:08.920Z        error        VictoriaMetrics@/lib/httpserver/httpserver.go:368        error in "/insert/0/prometheus/": cannot send 5968876 bytes to storageNode "10.104.66.25:8400": 10000 rows dropped because of reroutedBuf overflows 66522777 bytes
Jan 23 09:44:08 vm-insert-c vminsert[1290]: 2020-01-23T09:44:08.938Z        error        VictoriaMetrics@/lib/httpserver/httpserver.go:368        error in "/insert/0/prometheus/": cannot send 6035237 bytes to storageNode "10.104.66.25:8400": 10000 rows dropped because of reroutedBuf overflows 66522777 bytes
Jan 23 09:44:08 vm-insert-c vminsert[1290]: 2020-01-23T09:44:08.955Z        error        VictoriaMetrics@/lib/httpserver/httpserver.go:368        error in "/insert/0/prometheus/": cannot send 5967087 bytes to storageNode "10.104.66.25:8400": 10000 rows dropped because of reroutedBuf overflows 66522777 bytes
Jan 23 09:44:09 vm-insert-c vminsert[1290]: 2020-01-23T09:44:09.263Z        error        VictoriaMetrics@/lib/httpserver/httpserver.go:368        error in "/insert/0/prometheus/": cannot send 6015044 bytes to storageNode "10.104.66.25:8400": 10000 rows dropped because of reroutedBuf overflows 66522777 bytes
Jan 23 09:44:09 vm-insert-c vminsert[1290]: 2020-01-23T09:44:09.865Z        error        VictoriaMetrics@/app/vminsert/netstorage/netstorage.go:100        cannot send data to vmstorage 10.104.66.25:8400: cannot dial "10.104.66.25:8400": dial tcp4 10.104.66.25:8400: connect: connection refused; re-routing data to healthy vmstorage nodes

Prometheus logs

ts=2020-01-23T12:16:07.652Z caller=dedupe.go:111 component=remote level=info remote_name=05e8cd url=http://vm_balancer:8480/insert/0/prometheus/ msg="Remote storage resharding" from=4 to=7
ts=2020-01-23T12:16:17.652Z caller=dedupe.go:111 component=remote level=info remote_name=05e8cd url=http://vm_balancer:8480/insert/0/prometheus/ msg="Remote storage resharding" from=7 to=11
ts=2020-01-23T12:16:27.652Z caller=dedupe.go:111 component=remote level=info remote_name=05e8cd url=http://vm_balancer:8480/insert/0/prometheus/ msg="Remote storage resharding" from=11 to=16
ts=2020-01-23T12:16:37.652Z caller=dedupe.go:111 component=remote level=info remote_name=05e8cd url=http://vm_balancer:8480/insert/0/prometheus/ msg="Remote storage resharding" from=16 to=21
ts=2020-01-23T12:17:07.652Z caller=dedupe.go:111 component=remote level=info remote_name=05e8cd url=http://vm_balancer:8480/insert/0/prometheus/ msg="Remote storage resharding" from=21 to=14
ts=2020-01-23T12:17:37.652Z caller=dedupe.go:111 component=remote level=info remote_name=05e8cd url=http://vm_balancer:8480/insert/0/prometheus/ msg="Remote storage resharding" from=14 to=9
ts=2020-01-23T12:18:07.675Z caller=dedupe.go:111 component=remote level=info remote_name=05e8cd url=http://vm_balancer:8480/insert/0/prometheus/ msg="Remote storage resharding" from=9 to=6
ts=2020-01-23T12:18:37.652Z caller=dedupe.go:111 component=remote level=info remote_name=05e8cd url=http://vm_balancer:8480/insert/0/prometheus/ msg="Remote storage resharding" from=6 to=4
valyala added the bug and enhancement labels Jan 24, 2020
valyala added a commit that referenced this issue Jan 24, 2020
@valyala (Collaborator) commented Jan 24, 2020

It turned out that vminsert could drop pending rows when all the vmstorage nodes were unavailable. This should be fixed in commit 4d70a81. The commit will be included in the next release of the cluster version of VictoriaMetrics.
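
For illustration, here is a minimal Go sketch of the failure mode described above: a bounded reroute buffer that silently drops rows once it overflows, instead of propagating an error back to the client. All names (rerouteBuf, maxBytes, push) are hypothetical and only illustrate the pattern, not the actual vminsert code.

// failure-mode sketch: rows queued for re-routing are silently dropped
// once the bounded buffer is full, so the client never gets a chance to retry.
package main

import (
	"fmt"
	"sync"
)

type rerouteBuf struct {
	mu       sync.Mutex
	buf      []byte
	maxBytes int
	dropped  int
}

// push appends a serialized block of rows to the buffer; when the buffer
// would overflow, the block is counted as dropped and discarded.
func (b *rerouteBuf) push(block []byte) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if len(b.buf)+len(block) > b.maxBytes {
		b.dropped++
		return false // data is lost here -- the caller still reports success upstream
	}
	b.buf = append(b.buf, block...)
	return true
}

func main() {
	b := &rerouteBuf{maxBytes: 16}
	for i := 0; i < 4; i++ {
		b.push([]byte("rows-block"))
	}
	fmt.Printf("buffered=%d bytes, dropped=%d blocks\n", len(b.buf), b.dropped)
}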

@valyala (Collaborator) commented Jan 24, 2020

After commit 4d70a81, vminsert always returns a 503 error to Prometheus when it cannot write data to vmstorage, so Prometheus can keep retrying until vmstorage recovers. This should prevent data loss during temporary unavailability of all the vmstorage nodes in the cluster.
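
For illustration, a minimal Go sketch of the retry-friendly behavior described in this comment (hypothetical handler and function names, not the actual vminsert code): returning 503 instead of silently dropping rows lets the Prometheus remote_write queue keep the samples and resend them once vmstorage is reachable again.

// sketch: return 503 on storage failure so the remote_write client retries
// instead of losing the samples.
package main

import (
	"errors"
	"log"
	"net/http"
)

// sendToStorage stands in for forwarding the decoded rows to vmstorage nodes.
func sendToStorage(body []byte) error {
	return errors.New("all the storage nodes are unhealthy")
}

func insertHandler(w http.ResponseWriter, r *http.Request) {
	// ... decode the remote_write payload from r.Body here ...
	if err := sendToStorage(nil); err != nil {
		// 503 marks the write as retryable: Prometheus keeps the samples
		// in its queue and retries until vmstorage recovers.
		http.Error(w, err.Error(), http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusNoContent)
}

func main() {
	http.HandleFunc("/insert/0/prometheus/", insertHandler)
	log.Fatal(http.ListenAndServe(":8480", nil))
}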

@valyala (Collaborator) commented Jan 24, 2020

@freeseacher, try v1.32.7 - this release should contain the fix.

@valyala (Collaborator) commented Jan 27, 2020

Closing the issue as fixed. @freeseacher, feel free to re-open it if VictoriaMetrics v1.32.7 or a newer release continues dropping data when all the vmstorage nodes are unavailable.

valyala closed this as completed Jan 27, 2020