input.prometheus plugin collects incomplete data #510

Closed
song-yunfei opened this issue May 25, 2023 · 7 comments

Comments

@song-yunfei

song-yunfei commented May 25, 2023

Relevant config.toml

----config.toml----
[global]
# whether to print configs
print_configs = false

# add label(agent_hostname) to series
# "" -> auto detect hostname
# "xx" -> use specified string xx
# "$hostname" -> auto detect hostname
# "$ip" -> auto detect ip
# "$hostname-$ip" -> auto detect hostname and ip to replace the vars
hostname = ""

# will not add label(agent_hostname) if true
omit_hostname = false

# s | ms
precision = "ms"

# global collect interval
interval = 60

# input provider settings; optional: local / http
providers = ["local"]

disable_usage_report = true

[global.labels]

datacenter = "IDC"
region = "North_China"
zone = "BeiJing"

[log]
# file_name is the file to write logs to
file_name = "stdout"

# options below will not work when file_name is stdout or stderr
# max_size is the maximum size in megabytes of the log file before it gets rotated. It defaults to 100 megabytes.
max_size = 100
# max_age is the maximum number of days to retain old log files based on the timestamp encoded in their filename.
max_age = 1
# max_backups is the maximum number of old log files to retain.
max_backups = 1
# local_time determines if the time used for formatting the timestamps in backup files is the computer's local time.
local_time = true
# Compress determines if the rotated log files should be compressed using gzip.
compress = false

[writer_opt]
batch = 1000
chan_size = 1000000

[[writers]]
url = "http://n9e.xx.xx:19000/prometheus/v1/write"

# Basic auth username
basic_auth_user = ""

# Basic auth password
basic_auth_pass = ""

## Optional headers
# headers = ["X-From", "categraf", "X-Xyz", "abc"]

# timeout settings, unit: ms
timeout = 5000
dial_timeout = 2500
max_idle_conns_per_host = 100

[http]
enable = false
address = ":9100"
print_access = false
run_mode = "release"

[ibex]
enable = false
## ibex flush interval
interval = "1000ms"
## n9e ibex server rpc address
servers = ["n9e.xx.xx:20090"]
## temp script dir
meta_dir = "./meta"

[heartbeat]
enable = true

# report os version cpu.util mem.util metadata
url = "http://xx.xx.xx:19000/v1/n9e/heartbeat"

# interval, unit: s
interval = 10

# Basic auth username
basic_auth_user = ""

# Basic auth password
basic_auth_pass = ""

## Optional headers
# headers = ["X-From", "categraf", "X-Xyz", "abc"]

# timeout settings, unit: ms
timeout = 5000
dial_timeout = 2500
max_idle_conns_per_host = 100



----input.prometheus/prometheus.toml----
interval = 30

[[instances]]
urls = [
    "http://127.0.0.1:8030/metrics"
]

url_label_key = "instance"
url_label_value = "{{.Host}}"
timeout = "10s"
## Scrape Services available in Consul Catalog
#  [[instances.consul.query]]
#    name = "a service name"
#    tag = "a service tag"
#    url = 'http://{{if ne .ServiceAddress ""}}{{.ServiceAddress}}{{else}}{{.Address}}{{end}}:{{.ServicePort}}/{{with .ServiceMeta.metrics_path}}{{.}}{{else}}metrics{{end}}'
#    [instances.consul.query.tags]
#      host = "{{.Node}}"
#
# bearer_token_string = ""

# e.g. /run/secrets/kubernetes.io/serviceaccount/token
# bearer_token_file = ""

# # basic auth
# username = ""
# password = ""

# headers = ["X-From", "categraf"]

# # interval = global.interval * interval_times
# interval_times = 4

labels = {group="fe",job="Doris-fe",cluster="online"}

# support glob
ignore_metrics = [ "go_*" ]
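
For context on the url_label_key / url_label_value pair above: {{.Host}} is a Go template field rendered from the parsed target URL (Host typically includes the port), so every series from this instance should carry the configured labels plus an instance label. An illustrative sample line, assumed from the config above rather than captured output:

doris_fe_query_latency_ms{instance="127.0.0.1:8030", group="fe", job="Doris-fe", cluster="online", quantile="0.75"} 14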

Logs from categraf

No abnormal logs.

System info

v0.3.0-3ae9599251088bae5414c8c0c776d3649613c8cc

Docker

No response

Steps to reproduce

  1. The scrape target is the metrics endpoint of a Doris cluster, version 1.1.3-rc02-b4364b451.
  2. Example of the missing data (I suspect it is a label data problem; see the check sketched below this list):
     doris_fe_editlog_write_latency_ms{quantile="0.75"} 1.0
     doris_fe_query_latency_ms{quantile="0.75"} 14.0

  I tested 0.2.29 and 0.2.28; neither works. After rolling categraf back to 0.2.9, everything returned to normal. My server version is v5.15.0 and the data store is vmcluster.
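
As a sanity check, the raw exposition of the target can be fetched directly to confirm that the endpoint itself emits these quantile-labeled series (a minimal sketch using the URL from prometheus.toml above; curl and grep are assumed to be available on the host):

# Fetch the Doris FE /metrics endpoint and show the summary series
# whose quantile-labeled samples go missing after collection.
curl -s http://127.0.0.1:8030/metrics | grep -E 'doris_fe_(editlog_write|query)_latency_ms'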

Expected behavior

Report the complete metrics data.

Actual behavior

Metrics data is lost.

Additional info

No response

@kongfei605
Collaborator

kongfei605 commented May 26, 2023

Take a look with ./categraf --test --inputs prometheus --debug

@song-yunfei
Author

[screenshot: metric name comparison between categraf v0.2.9 and v0.3.0]
I tested it and found that the metrics do exist, but a suffix has been appended to the metric name, as shown in the screenshot.
In the screenshot, 2.9 is the normal one; the names with the suffix are from version 0.3.0.
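
A quick way to make that comparison reproducible is to run both binaries in test mode and filter for one of the affected series (a sketch; the side-by-side binary paths are hypothetical, and the flags are the ones suggested above):

# Print collected samples to stdout with each version and compare the names.
# The two binary paths below are hypothetical placeholders.
./categraf-0.2.9/categraf --test --inputs prometheus | grep doris_fe_query_latency_ms
./categraf-0.3.0/categraf --test --inputs prometheus | grep doris_fe_query_latency_ms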

@song-yunfei
Author

Take a look with ./categraf --test --inputs prometheus --debug

@kongfei605
Collaborator

So this issue is saying that the metric names changed after the version upgrade, right?

@song-yunfei
Author

Yes, categraf appended a suffix.

@kongfei605
Collaborator

kongfei605 commented Jun 29, 2023

Possibly. I went through the changelog; the quantile-related changes are basically here: https://github.com/flashcatcloud/categraf/compare/v0.2.35...v0.2.36, but that looks like code being moved around rather than a metric suffix being added. A basic goal of the community is to keep metric names as stable as possible; we do not change metrics arbitrarily without a special reason.

If this is only a metric name change, I will close this issue for now; feel free to reopen it if you have other questions.

@song-yunfei
Author

You can close it.
