
[vmagent] update 1.39.0 causes duplicate data #653

Closed
posix opened this issue Jul 25, 2020 · 13 comments
Labels: bug (Something isn't working)

Comments

posix commented Jul 25, 2020

Describe the bug
After updating the agent and the database to the new version, I saw duplicate data in Grafana. Duplication occurs only with data coming in via the Influx protocol. I tried rolling back the database to version 1.38.1, but that did not fix the problem. I rolled back the agent to 1.38.1 and the duplicates stopped.
Also, before updating, I tried the new version in a test environment and did not notice any problems there.
We use a local telegraf as a message hub for the other telegraf agents.

To Reproduce
To reproduce the bug, you need to use: the Influx protocol, telegraf (1.15.1-1 or 1.14.4-1), vmagent 1.39.0, a VictoriaMetrics database (1.39.0 or 1.38.1), Oracle Linux 7.8 with kernel 5.4.17-2011.3.2.1.el7uek.x86_64, and probably a high insertion rate.
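(For illustration only, not from the original report: the same ingestion path can be exercised by hand with a single Influx-protocol line POSTed to vmagent's /write endpoint, assuming it listens on :8428 as in the flags quoted below; the metric values are made up.)

# Sketch: mimic what telegraf's influxdb output sends to vmagent
# (assumed listen address :8428; db=telegraf is telegraf's default database name)
curl -X POST 'http://127.0.0.1:8428/write?db=telegraf' \
  --data-binary 'cpu,cpu=cpu-total,host=testhost usage_user=1.5,usage_system=0.7'
# each field becomes measurement_field, e.g. cpu_usage_user{cpu="cpu-total",host="testhost"}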

Screenshots
Screenshot 2020-07-25 at 20 37 32
Screenshot 2020-07-25 at 20 37 07
Screenshot 2020-07-25 at 20 36 43
Screenshot 2020-07-25 at 20 36 32

Version

vmagent-prod --version
vmagent-20200725-094220-heads-master-0-ga0906270

[root@s4877]# victoria-metrics-prod --version
victoria-metrics-20200714-161623-tags-v1.38.1-0-gb442a42d

[root@s4876]# victoria-metrics-prod --version
victoria-metrics-20200725-094117-heads-master-0-ga0906270

[root@s4876]# telegraf --version
Telegraf 1.15.1 (git: HEAD 002696d8)

Used command-line flags

[root@s4876]# cat /etc/default/vmagent
VMA_OPTS='-httpListenAddr=:8428 -remoteWrite.queues 4 -remoteWrite.tmpDataPath /data/vm/agent -maxConcurrentInserts 512 -memory.allowedPercent 20 -remoteWrite.maxDiskUsagePerURL 0 -loggerLevel ERROR -promscrape.config /data/vm/etc/vma.yml -remoteWrite.url http://127.0.0.1:8429/api/v1/write -remoteWrite.url http://s4877:8429/api/v1/write'

cat /etc/default/victoriametrics
VM_OPTS='-storageDataPath /data/vm/dt -maxConcurrentInserts 1024 -memory.allowedPercent 80 -retentionPeriod 5 -search.maxQueryDuration 120s -httpListenAddr=:8429 -loggerLevel ERROR'

telegraf.conf:
[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 100000000
  collection_jitter = "0s"
  flush_interval = "1s"
  flush_jitter = "0s"
  debug = false
  quiet = true
  logfile = "/var/log/telegraf/telegraf.log"
  hostname = ""
  omit_hostname = false
[[outputs.influxdb]]
  urls = ["http://127.0.0.1:8428"]
  timeout = "30s"
[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false
  report_active = true
[[inputs.disk]]
  interval = "60s"
  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "nfs", "nfsd", "sysfs", "proc", "autofs", "rootfs"]
[[inputs.diskio]]
  interval = "1s"
  skip_serial_number = true
[[inputs.kernel]]
[[inputs.mem]]
[[inputs.processes]]
[[inputs.swap]]
[[inputs.system]]
[[inputs.net]]
  interval = "1s"
[[inputs.internal]]
  collect_memstats = true
[[inputs.influxdb_listener]]
  service_address = ":8189"
  read_timeout = "30s"
  write_timeout = "30s"
posix commented Jul 25, 2020

I tried changing the config of vmagent, the local telegraf and the VM database, but did not get any good result.
After several attempts, I decided to exclude the local telegraf from the chain and started a second vmagent on port 8189, after which the data returned to normal and the duplicates stopped.

stopped_telegraf 2020-07-26 at 01 32 49
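(A sketch of what a second vmagent taking over the hub port might look like; the exact flags used are not shown in the issue, these simply mirror the /etc/default/vmagent values quoted above, and the tmpDataPath is an assumption.)

vmagent-prod -httpListenAddr=:8189 \
  -remoteWrite.tmpDataPath /data/vm/agent2 \
  -remoteWrite.url http://127.0.0.1:8429/api/v1/write \
  -remoteWrite.url http://s4877:8429/api/v1/write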

valyala added the bug label Jul 27, 2020
valyala commented Jul 27, 2020

This may be related to the commit ad62909, which was added in vmagent v1.39.0. That commit introduced a bug which triggers after vmagent fails to send data to remote storage. The issue has been fixed in the commit cb8c690.

@posix, could you build vmagent from the commit cb8c690 and verify whether this fixes the issue? See build instructions for vmagent.
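(A build sketch, assuming the standard make targets in the VictoriaMetrics repository; adjust the commit hash as needed.)

git clone https://github.com/VictoriaMetrics/VictoriaMetrics.git
cd VictoriaMetrics
git checkout cb8c690
make vmagent            # builds ./bin/vmagent
./bin/vmagent --version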

posix commented Jul 28, 2020

I updated and tried working with the old scheme again.
The problem still exists.

Screenshot description:
Switched to the old scheme at 11:19:56
Jul 28 11:19:56 s4876 telegraf[38627]: 2020-07-28T08:19:56Z I! Starting Telegraf 1.15.1
I updated the agent to version 1.39.1 at 11:39:24
Jul 28 11:39:24 s4876 systemd[1]: Started VictoriaMetrics Agent.
It worked until 11:45:23.
Jul 28 11:45:23 s4876 telegraf[1079]: 2020-07-28T08:45:23Z I! Starting Telegraf 1.15.1
Jul 28 11:45:28 s4876 systemd[1]: Started VictoriaMetrics Agent.
<-- second agent with the same version
Screenshot 2020-07-28 at 11 55 19
It looks like version 1.39.1 works fine, but only without the telegraf hub.

valyala commented Jul 28, 2020

Thanks for the update!

Could you look into the logs for vmagent and VictoriaMetrics over the time range when the issue triggers?
It would also be great if you could share the query used for building the CPU Usage graph from the last screenshot.

posix commented Jul 28, 2020

I ran a new test with telegraf debug enabled and turned on logging for vmagent and the VM database (commented out the ERROR logger level).
Screenshot 2020-07-28 at 20 24 05

logs.zip

Query:

sort_by_label(label_transform({__name__=~"cpu_usage_.*", host=~"$host", cpu="cpu-total"}, "__name__", "cpu_usage_", ""), "__name__")
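(The expression strips the cpu_usage_ prefix from the metric names via label_transform and sorts the series by __name__. For reference, it can also be run by hand against VictoriaMetrics' Prometheus-compatible query API; a sketch with the Grafana $host variable replaced by a made-up host value:)

curl -G 'http://127.0.0.1:8429/api/v1/query' \
  --data-urlencode 'query=sort_by_label(label_transform({__name__=~"cpu_usage_.*", host="s4876", cpu="cpu-total"}, "__name__", "cpu_usage_", ""), "__name__")'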

valyala added a commit that referenced this issue Jul 28, 2020
…g data to remote storage

Missing body close could disable HTTP keep-alive connections.

Updates #653
valyala commented Jul 28, 2020

@posix , could you remove the {{__name__}} string from the Legend field on the graph above and then check the difference in labels for the metrics with duplicate names?

posix commented Jul 28, 2020

It looks like the labels drift
Screenshot 2020-07-28 at 21 16 35
Screenshot 2020-07-28 at 21 25 01

valyala added a commit that referenced this issue Jul 28, 2020
…passed to Influx line protocol query

Previously the `db` tag from the query string wasn't added to metrics after encountering a `db` tag in the Influx line

Updates #653
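(To illustrate the mechanism, a hedged sketch with made-up metrics: a batch is POSTed with ?db=telegraf and one of its lines already carries its own db tag. With the bug, lines after that one no longer received db="telegraf" from the query string, so the same metric could show up both with and without the db label, e.g. cpu_usage_user{cpu="cpu-total",host="s4876",db="telegraf"} and cpu_usage_user{cpu="cpu-total",host="s4876"}.)

curl -X POST 'http://127.0.0.1:8428/write?db=telegraf' --data-binary '
example_metric,db=telegraf,host=s4876 value=1
cpu,cpu=cpu-total,host=s4876 usage_user=1.5'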
valyala commented Jul 28, 2020

@posix , it looks like the root cause of the issue has been nailed down and fixed in the commit 0f63da3 . Could you build vmagent from the commit 0f63da3 and verify whether the issue is gone?

posix commented Jul 28, 2020

It looks pretty good. I also checked other graphs and everything is fine there too.
Thank you very much!
Screenshot 2020-07-28 at 21 53 49
Screenshot 2020-07-28 at 21 57 27

valyala commented Jul 28, 2020

@posix , thanks for the help in determining the root cause for the issue! The bugfix will be included in the next release.

posix commented Jul 28, 2020

Nice! It was really fast! Thank you again!

posix closed this as completed Jul 28, 2020
valyala commented Jul 30, 2020

FYI, vmagent versions v1.39.0 and v1.39.1 had a bug which prevented re-using HTTP keep-alive connections between vmagent and remote storage systems. This could result in increased resource usage on both vmagent and the remote storage. This has been fixed in v1.39.2.

valyala commented Jul 30, 2020

The fix for the bug that could lead to duplicate time series with and without the db tag has been included in v1.39.2 too.
