[RIP-46][Task1]: Define the specification of metric. #5366

Closed
yangwenting-ywt opened this issue Oct 20, 2022 · 7 comments
yangwenting-ywt commented Oct 20, 2022

Metrics

RocketMQ exposes the following metrics in Prometheus format. You can monitor your clusters with those metrics.

  • Broker metrics
  • Producer metrics
  • Consumer metrics

Details of metrics

Metric types

RocketMQ metric definitions follow the conventions of open source Prometheus. RocketMQ offers three metric types: counters, gauges, and histograms. For more information, see "Metric types" in the Prometheus documentation.

Broker metrics

The following labels apply to the metrics that are related to the RocketMQ broker.

  • cluster: the name of the RocketMQ cluster.
  • node_type: the type of the service node, which is one of the following: proxy, broker, nameserver.
  • node_id: the ID of the service node.
  • topic: the RocketMQ topic.
  • message_type: the type of a message, which is one of the following:
    Normal: normal messages;
    FIFO: ordered messages;
    Transaction: transactional messages;
    Delay: scheduled or delayed messages.
  • consumer_group: the ID of the consumer group.

| Type | Name | Unit | Description | Label |
|---|---|---|---|---|
| counter | rocketmq_messages_in_total | count | The number of messages that are produced. | cluster, node_type, node_id, topic, message_type |
| counter | rocketmq_messages_out_total | count | The number of messages that are consumed. | cluster, node_type, node_id, topic, consumer_group |
| counter | rocketmq_throughput_in_total | byte | The write throughput: the total size of messages that are produced. | cluster, node_type, node_id, topic, message_type |
| counter | rocketmq_throughput_out_total | byte | The read throughput: the total size of messages that are consumed. | cluster, node_type, node_id, topic, consumer_group |
| histogram | rocketmq_message_size | byte | The distribution of message sizes. This metric is counted only when messages are sent. Distribution ranges: le_1_kb: ≤ 1 KB; le_4_kb: ≤ 4 KB; le_512_kb: ≤ 512 KB; le_1_mb: ≤ 1 MB; le_2_mb: ≤ 2 MB; le_4_mb: ≤ 4 MB; le_overflow: > 4 MB. | cluster, node_type, node_id, topic, message_type |
| gauge | rocketmq_consumer_ready_messages | count | The number of ready messages. | cluster, node_type, node_id, topic, consumer_group |
| gauge | rocketmq_consumer_inflight_messages | count | The number of inflight messages. | cluster, node_type, node_id, topic, consumer_group |
| gauge | rocketmq_consumer_queueing_latency | millisecond | The queueing delay of ready messages. | cluster, node_type, node_id, topic, consumer_group |
| gauge | rocketmq_consumer_lag_latency | millisecond | The delay before messages are consumed. | cluster, node_type, node_id, topic, consumer_group |
| counter | rocketmq_send_to_dlq_messages_total | count | The number of messages that are sent to the dead-letter queue. | cluster, node_type, node_id, topic, consumer_group |
| histogram | rocketmq_rpc_latency | millisecond | The RPC call latency. | cluster, node_type, node_id, protocol_type, request_code, response_code |
| gauge | rocketmq_storage_size | byte | The size of the storage space that is used by the node. | cluster, node_type, node_id |
| counter | rocketmq_storage_read_bytes_total | byte | The amount of data read by the storage layer. | cluster, node_type, node_id, topic |
| gauge | rocketmq_storage_read_bytes_max | byte | The peak amount of data read per second by the storage layer. | cluster, node_type, node_id, topic |
| counter | rocketmq_storage_write_bytes_total | byte | The amount of data written to the storage layer. | cluster, node_type, node_id, topic |
| gauge | rocketmq_storage_write_bytes_max | byte | The peak amount of data written per second to the storage layer. | cluster, node_type, node_id, topic |
| histogram | rocketmq_storage_write_latency | millisecond | The distribution of write latency in the storage layer. This metric is counted only when messages are sent. Distribution ranges: le_1_kb: ≤ 1 KB; le_4_kb: ≤ 4 KB; le_512_kb: ≤ 512 KB; le_1_mb: ≤ 1 MB; le_2_mb: ≤ 2 MB; le_4_mb: ≤ 4 MB; le_overflow: > 4 MB. | cluster, node_type, node_id, topic, message_type |
| gauge | rocketmq_storage_message_reserve_time | millisecond | The message retention time. | cluster, node_type, node_id |
| gauge | rocketmq_storage_dispatch_behind_bytes | byte | The size of undispatched messages. | cluster, node_type, node_id |
| gauge | rocketmq_storage_flush_behind_bytes | byte | The size of unflushed messages. | cluster, node_type, node_id |
| gauge | rocketmq_thread_pool_wartermark | count | The number of tasks queued in the thread pool. | cluster, node_type, node_id, name |
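
As an illustration of the rocketmq_message_size ranges listed above, the following Python sketch maps a message size to the name of the smallest range that contains it. It is only a sketch under stated assumptions: 1 KB is taken to be 1024 bytes, the helper names `size_bucket` and `count_by_bucket` are hypothetical, and the counts are per range rather than cumulative as in a real Prometheus histogram.

```python
from bisect import bisect_left
from collections import Counter

KB = 1024          # assumption: binary kilobytes
MB = 1024 * KB

# Inclusive upper bounds and names, copied from the rocketmq_message_size ranges.
BOUNDS = [1 * KB, 4 * KB, 512 * KB, 1 * MB, 2 * MB, 4 * MB]
NAMES = ["le_1_kb", "le_4_kb", "le_512_kb", "le_1_mb", "le_2_mb", "le_4_mb", "le_overflow"]


def size_bucket(size_bytes: int) -> str:
    """Return the name of the smallest range that contains size_bytes."""
    # bisect_left returns the index of the first bound >= size_bytes,
    # so each upper bound is inclusive; sizes above 4 MB fall through to le_overflow.
    return NAMES[bisect_left(BOUNDS, size_bytes)]


def count_by_bucket(sizes):
    """Count how many sizes fall into each range (hypothetical helper, non-cumulative)."""
    return Counter(size_bucket(s) for s in sizes)
```

For example, `count_by_bucket([100, 2000, 3000])` places one message in le_1_kb and two in le_4_kb.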

Producer metrics

The following labels apply to the metrics that are related to RocketMQ producers.

  • cluster: the name of the RocketMQ cluster.
  • node_type: the type of the service node, which is one of the following: proxy, broker, nameserver.
  • node_id: the ID of the service node.
  • topic: the RocketMQ topic.
  • message_type: the type of a message, which is one of the following:
    Normal: normal messages;
    FIFO: ordered messages;
    Transaction: transactional messages;
    Delay: scheduled or delayed messages.
  • client_id: the ID of the client.
  • invocation_status: the result of the API call for sending messages, which is either success or failure.

| Type | Name | Unit | Description | Label |
|---|---|---|---|---|
| histogram | rocketmq_send_cost_time | millisecond | The distribution of production API call time. Distribution ranges: le_1_ms, le_5_ms, le_10_ms, le_20_ms, le_50_ms, le_200_ms, le_500_ms, le_overflow. | topic, client_id, invocation_status |

Consumer metrics

The following labels apply to the metrics that are related to RocketMQ consumers.

  • topic: the RocketMQ topic.
  • consumer_group: the ID of the consumer group.
  • client_id: the ID of the client.
  • invocation_status: the result of processing a message, which is either success or failure.

| Type | Name | Unit | Description | Label |
|---|---|---|---|---|
| histogram | rocketmq_process_time | millisecond | The distribution of message processing time. Distribution ranges: le_1_ms, le_5_ms, le_10_ms, le_100_ms, le_10000_ms, le_60000_ms, le_overflow. | topic, consumer_group, client_id, invocation_status |
| gauge | rocketmq_consumer_cached_messages | message | The number of messages in the local buffer queue of PushConsumer. | topic, consumer_group, client_id |
| gauge | rocketmq_consumer_cached_bytes | byte | The total size of messages in the local buffer queue of PushConsumer. | topic, consumer_group, client_id |
| histogram | rocketmq_await_time | millisecond | The distribution of queuing time for messages in the local buffer queue of PushConsumer. Distribution ranges: le_1_ms, le_5_ms, le_20_ms, le_100_ms, le_1000_ms, le_5000_ms, le_10000_ms, le_overflow. | topic, consumer_group, client_id |

Background information

RocketMQ defines metrics based on the following business scenarios.

Message accumulation scenarios

[Figure: RocketMQ queue message status]
The figure above shows the number of messages in each stage and how long they stay there. By monitoring these metrics, you can determine whether consumption is abnormal. The following table describes the meaning of these metrics and the formulas that are used to calculate them.

| Name | Description | Formula |
|---|---|---|
| Inflight messages | The number of messages that are being processed by consumers but are not yet acknowledged. | Offset of the latest pulled message - Offset of the latest committed message |
| Ready messages | The number of messages that are ready for consumption. | Maximum offset - Offset of the latest pulled message |
| Ready time | Normal or ordered message: the time when the message is stored on the broker. Scheduled message: the time when the scheduled delay ends. Transactional message: the time when the transaction is committed. | -- |
| Ready message queue time | The time interval between the ready time of the earliest ready message and the current time. This time reflects how promptly consumers pull messages. | Current time - Ready time of the earliest ready message |
| Consumer lag time | The time interval between the ready time of the earliest unacked message and the current time. This time reflects how promptly consumers finish processing messages. | Current time - Ready time of the earliest unacked message |
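
The formulas in the table above can be worked through with a small Python sketch. The `QueueState` type, its field names, and the result keys are hypothetical names chosen for illustration; only the arithmetic comes from the table.

```python
from dataclasses import dataclass


@dataclass
class QueueState:
    """Hypothetical snapshot of one queue as seen by one consumer group."""
    max_offset: int                # maximum offset in the queue
    pulled_offset: int             # offset of the latest pulled message
    committed_offset: int          # offset of the latest committed message
    earliest_ready_time_ms: int    # ready time of the earliest ready message
    earliest_unacked_time_ms: int  # ready time of the earliest unacked message


def accumulation_metrics(state: QueueState, now_ms: int) -> dict:
    """Apply the four formulas from the table to one snapshot."""
    return {
        "inflight_messages": state.pulled_offset - state.committed_offset,
        "ready_messages": state.max_offset - state.pulled_offset,
        "ready_message_queue_time_ms": now_ms - state.earliest_ready_time_ms,
        "consumer_lag_time_ms": now_ms - state.earliest_unacked_time_ms,
    }


state = QueueState(
    max_offset=100,
    pulled_offset=80,
    committed_offset=70,
    earliest_ready_time_ms=9_000,
    earliest_unacked_time_ms=5_000,
)
metrics = accumulation_metrics(state, now_ms=10_000)
```

With this snapshot, 10 messages are inflight, 20 are ready, the earliest ready message has waited 1000 ms, and the consumer lag time is 5000 ms.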

PushConsumer consumption scenarios

In PushConsumer, real-time message processing is implemented inside the SDK based on the typical Reactor thread model. As shown in the following figure, the SDK has a built-in long polling thread that asynchronously pulls messages into the SDK's built-in buffer queue. The messages are then submitted separately to the consumer threads, which trigger the listener to execute the local consumption logic.
[Figure: PushConsumer client]
The metrics of local buffer queues in the PushConsumer scenario are as follows:

  • Number of messages in the local buffer queue: Total number of messages in the local buffer queue.
  • Message size in the local buffer queue: The sum of all message sizes in the local buffer queue.
  • Message waiting time: the time that the message is temporarily cached in the local buffer queue waiting to be processed.
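
The three buffer queue metrics above can be sketched in Python as follows. This is not the SDK's implementation: the `LocalBufferQueue` class and its members are hypothetical, and the injectable clock exists only so that the waiting time can be checked deterministically.

```python
import time
from collections import deque


class LocalBufferQueue:
    """Hypothetical local buffer queue that tracks the three metrics above."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock          # returns seconds; injectable for testing
        self._items = deque()        # entries are (enqueue_time_s, body)
        self._bytes = 0
        self.await_times_ms = []     # one observation per message taken

    def put(self, body: bytes) -> None:
        """Called by the polling thread when a message is pulled."""
        self._items.append((self._clock(), body))
        self._bytes += len(body)

    def take(self) -> bytes:
        """Called when a message is handed to a consumer thread."""
        enqueued_at, body = self._items.popleft()
        self._bytes -= len(body)
        # Message waiting time: how long the message sat in the buffer.
        self.await_times_ms.append((self._clock() - enqueued_at) * 1000.0)
        return body

    @property
    def cached_messages(self) -> int:
        """Number of messages in the local buffer queue."""
        return len(self._items)

    @property
    def cached_bytes(self) -> int:
        """Sum of all message sizes in the local buffer queue."""
        return self._bytes
```

In this sketch, `cached_messages` and `cached_bytes` would feed gauges, and each value appended to `await_times_ms` would be one histogram observation.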
@francisoliverlee
Member

@yangwenting-ywt are there any pop metrics on the broker or consumer?

@wegod

wegod commented Nov 9, 2022

Will you add gauge metrics for consumeOKTPS and consumeFailedTPS from ConsumeStatus.java? This data on the consumer client is very helpful for finding errors caused by dirty data or virtual machine issues.

And will you add metrics similar to consumeOKTPS and consumeFailedTPS on the producer client, such as producerOKTPS or producerFailedTPS? Although this is not as important as consumer TPS, it would help the business find out which client is sending a large amount of garbage.

@yangwenting-ywt
Author

yangwenting-ywt commented Nov 10, 2022



A histogram also tracks the number of observations. That number is exposed as a time series with a _count suffix, which is inherently a counter.
So, to calculate consumeOKTPS, we can use the following expression:
rate(rocketmq_process_time_count{invocation_status="success"}[1m])
To calculate producerOKTPS, we can use the following expression:
rate(rocketmq_send_cost_time_count{invocation_status="success"}[1m])
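
To make the idea behind the rate(...[1m]) expressions above concrete, here is a simplified Python sketch of a per-second rate over counter samples. It is deliberately not Prometheus's exact algorithm: the real rate() also handles counter resets and extrapolates to the window boundaries. The function name `simple_rate` and the sample values are hypothetical.

```python
def simple_rate(samples, window_s):
    """Per-second increase of a counter over the trailing window.

    samples: list of (timestamp_s, counter_value) pairs sorted by time.
    Simplification: no counter-reset handling and no extrapolation.
    """
    if not samples:
        return 0.0
    latest_ts = samples[-1][0]
    in_window = [(ts, v) for ts, v in samples if ts >= latest_ts - window_s]
    if len(in_window) < 2:
        return 0.0  # a rate needs at least two samples
    (t0, v0), (t1, v1) = in_window[0], in_window[-1]
    return (v1 - v0) / (t1 - t0)


# Hypothetical scrapes of a _count series, one every 15 seconds.
scrapes = [(0, 100), (15, 130), (30, 160), (45, 190), (60, 220)]
ok_tps = simple_rate(scrapes, window_s=60)
```

Here the counter grows by 120 over 60 seconds, so `ok_tps` is 2.0 successful calls per second.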

@wegod

wegod commented Nov 10, 2022

Thanks for your reply.
I only see the rocketmq_send_cost_time metric. Will you add rocketmq_send_msg_count to the producer metrics?

I also see the topic and client_id labels in the producer metrics. Does that mean the metrics can show the per-topic send count for each producer machine?
For example, suppose one producer group has four producer clients whose send counts are 1, 2, 3, and 4. Will I see 1, 2, 3, 4 in the metrics, or only the total of 10?

@ShadowySpirits
Member

> I only see rocketmq_send_cost_time metrics, will you add rocketmq_send_msg_count in Producer Metrics?

In the Prometheus metrics spec, the histogram rocketmq_send_cost_time is exposed as rocketmq_send_cost_time_count, rocketmq_send_cost_time_sum, and rocketmq_send_cost_time_bucket. I think rocketmq_send_cost_time_count is what you need.

> For example, there are one producer group and four producer clients those send msg count are 1,2,3,4. Will I see 1,2,3,4 in Metrics or only a total num 10?

Each label combination generates its own time series, so you will see 1, 2, 3, 4.

You can read the "Data model" page of the Prometheus documentation for more information.
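
The point that each label combination produces its own time series can be sketched with a toy in-memory histogram in Python. This is an illustration only, not a Prometheus or OpenTelemetry client API: the `LabeledHistogram` class is hypothetical, and the bounds are taken from the rocketmq_send_cost_time ranges in the issue description.

```python
from collections import defaultdict

# Upper bounds in milliseconds, from the rocketmq_send_cost_time ranges.
BOUNDS_MS = [1, 5, 10, 20, 50, 200, 500]


class LabeledHistogram:
    """Toy histogram that keeps one series per distinct label combination."""

    def __init__(self, name, bounds):
        self.name = name
        self.bounds = bounds
        # The last bucket slot stands for the overflow (+Inf) bucket.
        self._series = defaultdict(
            lambda: {"buckets": [0] * (len(bounds) + 1), "sum": 0.0, "count": 0}
        )

    def observe(self, value, **labels):
        key = tuple(sorted(labels.items()))   # label combination = series identity
        series = self._series[key]
        # Prometheus-style buckets are cumulative: count every bound >= value.
        for i, bound in enumerate(self.bounds):
            if value <= bound:
                series["buckets"][i] += 1
        series["buckets"][-1] += 1
        series["sum"] += value                # backs the _sum series
        series["count"] += 1                  # backs the _count series

    def counts(self):
        """Return the _count value of every series, keyed by label combination."""
        return {key: s["count"] for key, s in self._series.items()}


hist = LabeledHistogram("rocketmq_send_cost_time", BOUNDS_MS)
# Four hypothetical producer clients sending 1, 2, 3, and 4 messages.
for client_number, sends in enumerate([1, 2, 3, 4], start=1):
    for _ in range(sends):
        hist.observe(3.0, topic="demo", client_id=f"client-{client_number}",
                     invocation_status="success")
per_client = hist.counts()
```

`per_client` holds four separate entries with counts 1, 2, 3, and 4. The total of 10 only appears when the series are summed at query time.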

@wegod

wegod commented Nov 10, 2022

Thanks, I got it.

Where does the rocketmq_send_cost_time data come from? The broker's in-memory stats, or somewhere new?

And when will the metrics be released? Only in 5.x, or also in 4.9.x?

@ShadowySpirits
Member

The rocketmq_send_cost_time metric is collected by the producer and reported to the OpenTelemetry collector. We will release metrics in 5.x first. The server metrics are easy to backport to 4.x, but the client metrics probably can't be.
