Skip to content

Conversation

@xichen01
Copy link
Contributor

What changes were proposed in this pull request?

Some of the metrics in CMSMetrics are in milliseconds, some are in nanoseconds, and here they are standardized to milliseconds

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-8428

How was this patch tested?

@ChenSammi
Copy link
Contributor

@xichen01 , do you aware the common value of applyTransaction and writeStateMachineData? Are their value always higher than 1ms?

@xichen01
Copy link
Contributor Author

@xichen01 , do you aware the common value of applyTransaction and writeStateMachineData? Are their value always higher than 1ms?

I mainly want to standardize the units of metrics, in CMSMetrics, some metrics are in ms and some are in ns, which is not directly known from the names of the metrics, which may cause misunderstanding.

According to my observation the average value writeStateMachineData is usually 1~10ms, the average value of applyTransaction is usually less than 1ms, but the maximum value of both is more than 1ms.

For values under 1ms, whether we can consider as normal, need not to care? So a uniform unit of metrics is more friendly.

@ChenSammi
Copy link
Contributor

@xichen01 , do you aware the common value of applyTransaction and writeStateMachineData? Are their value always higher than 1ms?

I mainly want to standardize the units of metrics, in CMSMetrics, some metrics are in ms and some are in ns, which is not directly known from the names of the metrics, which may cause misunderstanding.

According to my observation the average value writeStateMachineData is usually 1~10ms, the average value of applyTransaction is usually less than 1ms, but the maximum value of both is more than 1ms.

For values under 1ms, whether we can consider as normal, need not to care? So a uniform unit of metrics is more friendly.

If "applyTransaction" is usually less than 1ms, then it will lose accuracy from ns to ms. If the goal is unify the time unit, then maybe switch other metrics from ms to ns is what we can go.

@xichen01
Copy link
Contributor Author

@xichen01 , do you aware the common value of applyTransaction and writeStateMachineData? Are their value always higher than 1ms?

I mainly want to standardize the units of metrics, in CMSMetrics, some metrics are in ms and some are in ns, which is not directly known from the names of the metrics, which may cause misunderstanding.
According to my observation the average value writeStateMachineData is usually 1~10ms, the average value of applyTransaction is usually less than 1ms, but the maximum value of both is more than 1ms.
For values under 1ms, whether we can consider as normal, need not to care? So a uniform unit of metrics is more friendly.

If "applyTransaction" is usually less than 1ms, then it will lose accuracy from ns to ms. If the goal is unify the time unit, then maybe switch other metrics from ms to ns is what we can go.

It's a good idea that switch other metrics from ms to ns.
But is there a more direct way to confirm the units of a metric? Currently, if I want to confirm the units of an metric, I can only check the code to confirm

@ChenSammi
Copy link
Contributor

It's a good idea that switch other metrics from ms to ns. But is there a more direct way to confirm the units of a metric? Currently, if I want to confirm the units of an metric, I can only check the code to confirm

Yeah, it's not easy to tell whether it's ns or ms for a time metrics in Ozone now. One solution is we use one granularity for all latency/time metrics, say ns. Another solution is like in this patch, we add the suffix "ns" or "ms" for every metrics name.

@kerneltime
Copy link
Contributor

@xBis7 @tanvipenumudy @hemantk-12 can you please take a look?

@xBis7
Copy link
Contributor

xBis7 commented Apr 18, 2023

@xichen01 I think some test cases results would be useful to get an idea of the average numbers. That way, we can tell If we end up losing accuracy. If that's the case, I agree with @ChenSammi, why not keep everything as it is and modify the metric names?

- private @Metric MutableRate applyTransaction;
+ private @Metric MutableRate applyTransactionInMs;

@xichen01
Copy link
Contributor Author

I think this solution is acceptable,
maybe I can add a new metrics ending with MS / NS, and keep the original metrics, but add @Deprecated

such as:
Add:
metrics variable name: applyTransactionNs
-->
Prometheus metrics name: csm_metrics_write_state_machine_data_ns_avg_time and csm_metrics_write_state_machine_data_ns_num_ops

@xBis7 How about this?

@xBis7
Copy link
Contributor

xBis7 commented Apr 18, 2023

@xichen01 I'm fine with either approach but please run a Freon command twice and share what the metrics look like before and after your changes. I don't think there will be an accuracy issue but it will be nice to verify it.

@xichen01
Copy link
Contributor Author

@xichen01 I'm fine with either approach but please run a Freon command twice and share what the metrics look like before and after your changes. I don't think there will be an accuracy issue but it will be nice to verify it.

[root@centos ~/community/ozone]$ curl -s http://0.0.0.0:10021/prom | grep -i CSM |grep time  | grep -v '#'
csm_metrics_apply_transaction_avg_time{context="dfs",hostname="centos"} 1230705.6332916145
csm_metrics_apply_transaction_ns_avg_time{context="dfs",hostname="centos"} 1230705.6332916145
csm_metrics_close_container_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_close_container_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_compact_chunk_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_compact_chunk_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_create_container_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_create_container_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_delete_block_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_delete_block_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_delete_chunk_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_delete_chunk_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_delete_container_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_delete_container_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_get_block_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_get_block_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_get_committed_block_length_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_get_committed_block_length_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_get_small_file_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_get_small_file_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_list_block_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_list_block_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_list_chunk_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_list_chunk_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_list_container_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_list_container_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_put_block_avg_time{context="dfs",hostname="centos"} 11.3025
csm_metrics_put_block_ms_avg_time{context="dfs",hostname="centos"} 11.3025
csm_metrics_put_small_file_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_put_small_file_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_read_chunk_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_read_chunk_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_read_container_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_read_container_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_stream_init_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_stream_init_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_stream_write_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_stream_write_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_transaction_latency_avg_time{context="dfs",hostname="centos"} 10.891113892365457
csm_metrics_transaction_latency_ms_avg_time{context="dfs",hostname="centos"} 10.891113892365457
csm_metrics_update_container_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_update_container_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_write_chunk_avg_time{context="dfs",hostname="centos"} 10.478696741854636
csm_metrics_write_chunk_ms_avg_time{context="dfs",hostname="centos"} 10.478696741854636
csm_metrics_write_state_machine_data_avg_time{context="dfs",hostname="centos"} 980656.7719298246
csm_metrics_write_state_machine_data_ns_avg_time{context="dfs",hostname="centos"} 980656.7719298246

@xichen01 xichen01 changed the title HDDS-8428. Inconsistent Metrics units in CMSMetrics HDDS-8428. Add MS or NS time unit suffix for CSMMetrics Apr 19, 2023
@xichen01 xichen01 changed the title HDDS-8428. Add MS or NS time unit suffix for CSMMetrics HDDS-8428. Add MS or NS time unit suffix for CSMMetrics Apr 19, 2023
Copy link
Contributor

@xBis7 xBis7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xichen01 Thanks for the changes and sharing the metrics! New additions look good. I would suggest to remove the old metrics altogether and just keep the new ones with the suffixes, to avoid duplication and confusion.

e.g.

csm_metrics_apply_transaction_avg_time{context="dfs",hostname="centos"} 1230705.6332916145
csm_metrics_apply_transaction_ns_avg_time{context="dfs",hostname="centos"} 1230705.6332916145

@xichen01
Copy link
Contributor Author

@xichen01 Thanks for the changes and sharing the metrics! New additions look good. I would suggest to remove the old metrics altogether and just keep the new ones with the suffixes, to avoid duplication and confusion.

e.g.

csm_metrics_apply_transaction_avg_time{context="dfs",hostname="centos"} 1230705.6332916145
csm_metrics_apply_transaction_ns_avg_time{context="dfs",hostname="centos"} 1230705.6332916145

So we just need to directly change the Metric name and needn't keep the old Metric name?
Do you think the suffix is needed for ms, or do we just need to add the suffix for ns? This may need to have a specification for Ozone Metrics.

@xBis7
Copy link
Contributor

xBis7 commented Apr 19, 2023

@xichen01 You are adding the same metrics twice but this time with a more descriptive name.

So we just need to directly change the Metric name and needn't keep the old Metric name?

If we are not changing the metric values, there is no point in keeping the old metric names as well. We are ending up with duplicate entries.

Do you think the suffix is needed for ms, or do we just need to add the suffix for ns?

I have no particular preference. If you keep the old values and the numbers were in ms just add the ms suffix to the metric name. Similarly, if the numbers were referring to ns, add the ns suffix to the name.

@xichen01
Copy link
Contributor Author

The latest Metrics

[root@centos ~/community/ozone]$ curl -s http://0.0.0.0:10021/prom | grep -i CSM |grep time  | grep -v '#'
csm_metrics_apply_transaction_ns_avg_time{context="dfs",hostname="centos"} 1254298.1583333334
csm_metrics_close_container_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_compact_chunk_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_create_container_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_delete_block_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_delete_chunk_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_delete_container_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_get_block_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_get_committed_block_length_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_get_small_file_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_list_block_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_list_chunk_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_list_container_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_put_block_ms_avg_time{context="dfs",hostname="centos"} 10.958333333333334
csm_metrics_put_small_file_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_read_chunk_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_read_container_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_stream_init_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_stream_write_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_transaction_latency_ms_avg_time{context="dfs",hostname="centos"} 10.61111111111111
csm_metrics_update_container_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_write_chunk_ms_avg_time{context="dfs",hostname="centos"} 10.26388888888889
csm_metrics_write_state_machine_data_ns_avg_time{context="dfs",hostname="centos"} 1138597.7222222222
[root@centos ~/community/ozone]$

Copy link
Contributor

@xBis7 xBis7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xichen01 Thanks for the changes. LGTM!

@adoroszlai
Copy link
Contributor

Copy link
Contributor

@hemantk-12 hemantk-12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the patch @xichen01

Overall looks good to me. Gave one minor comment.


public MutableRate getApplyTransactionLatency() {
return applyTransaction;
public MutableRate getApplyTransactionNsLatency() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
public MutableRate getApplyTransactionNsLatency() {
public MutableRate getApplyTransactionLatencyNs() {

Ms and Ns are used as suffix in variable names but in functions it is not consistent. Can you please change it here as well as recordApplyTransactionNsCompletion to recordApplyTransactionCompletionNs and recordWriteStateMachineNsCompletion torecordWriteStateMachineCompletionNs? Similar to incPipelineLatencyMs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@adoroszlai adoroszlai requested a review from ChenSammi May 14, 2023 17:59
@adoroszlai adoroszlai merged commit 99af599 into apache:master May 18, 2023
@adoroszlai
Copy link
Contributor

Thanks @xichen01 for the improvement, @ChenSammi, @hemantk-12, @xBis7 for the review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants