HDDS-8428. Add MS or NS time unit suffix for CSMMetrics #4574
Conversation
|
@xichen01, are you aware of the typical values of applyTransaction and writeStateMachineData? Are they always higher than 1 ms? |
I mainly want to standardize the units of the metrics in [...]. According to my observation, the average value [...]. For values under 1 ms, can we consider them normal and not worth caring about? A uniform unit for the metrics is friendlier. |
If "applyTransaction" is usually less than 1 ms, then converting it from ns to ms will lose accuracy. If the goal is to unify the time unit, then maybe we can switch the other metrics from ms to ns instead. |
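To illustrate the accuracy concern above, here is a minimal sketch of what happens when a sub-millisecond duration recorded in nanoseconds is converted to whole milliseconds. The class name, method names, and sample durations are hypothetical and are not taken from the patch; only `java.util.concurrent.TimeUnit` is a real JDK API.

```java
import java.util.concurrent.TimeUnit;

public class NsToMsPrecision {
    // Integer conversion truncates toward zero, so any duration under 1 ms becomes 0.
    static long toWholeMillis(long nanos) {
        return TimeUnit.NANOSECONDS.toMillis(nanos);
    }

    // Floating-point conversion keeps the sub-millisecond detail.
    static double toFractionalMillis(long nanos) {
        return nanos / 1_000_000.0;
    }

    public static void main(String[] args) {
        long fastOp = 400_000L;    // hypothetical 0.4 ms operation
        long slowOp = 1_230_705L;  // hypothetical ~1.23 ms operation

        System.out.println("fast, whole ms:      " + toWholeMillis(fastOp));
        System.out.println("fast, fractional ms: " + toFractionalMillis(fastOp));
        System.out.println("slow, whole ms:      " + toWholeMillis(slowOp));
    }
}
```

Under these assumptions the 0.4 ms sample collapses to 0 when stored as a whole number of milliseconds, which is the loss being discussed. Whether it matters in practice depends on how the value is actually recorded.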
It's a good idea to switch the other metrics from ms to |
Yeah, it's not easy to tell whether a time metric in Ozone is in ns or ms now. One solution is to use one granularity for all latency/time metrics, say ns. Another solution, as in this patch, is to add the suffix "ns" or "ms" to every metric name. |
|
@xBis7 @tanvipenumudy @hemantk-12 can you please take a look? |
|
@xichen01 I think some test case results would be useful to get an idea of the average numbers. That way, we can tell whether we end up losing accuracy. If that's the case, I agree with @ChenSammi: why not keep everything as it is and modify the metric names?
- private @Metric MutableRate applyTransaction;
+ private @Metric MutableRate applyTransactionInMs; |
|
I think this solution is acceptable, such as: @xBis7 How about this? |
|
@xichen01 I'm fine with either approach but please run a Freon command twice and share what the metrics look like before and after your changes. I don't think there will be an accuracy issue but it will be nice to verify it. |
[root@centos ~/community/ozone]$ curl -s http://0.0.0.0:10021/prom | grep -i CSM |grep time | grep -v '#'
csm_metrics_apply_transaction_avg_time{context="dfs",hostname="centos"} 1230705.6332916145
csm_metrics_apply_transaction_ns_avg_time{context="dfs",hostname="centos"} 1230705.6332916145
csm_metrics_close_container_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_close_container_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_compact_chunk_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_compact_chunk_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_create_container_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_create_container_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_delete_block_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_delete_block_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_delete_chunk_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_delete_chunk_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_delete_container_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_delete_container_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_get_block_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_get_block_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_get_committed_block_length_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_get_committed_block_length_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_get_small_file_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_get_small_file_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_list_block_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_list_block_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_list_chunk_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_list_chunk_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_list_container_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_list_container_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_put_block_avg_time{context="dfs",hostname="centos"} 11.3025
csm_metrics_put_block_ms_avg_time{context="dfs",hostname="centos"} 11.3025
csm_metrics_put_small_file_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_put_small_file_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_read_chunk_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_read_chunk_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_read_container_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_read_container_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_stream_init_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_stream_init_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_stream_write_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_stream_write_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_transaction_latency_avg_time{context="dfs",hostname="centos"} 10.891113892365457
csm_metrics_transaction_latency_ms_avg_time{context="dfs",hostname="centos"} 10.891113892365457
csm_metrics_update_container_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_update_container_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_write_chunk_avg_time{context="dfs",hostname="centos"} 10.478696741854636
csm_metrics_write_chunk_ms_avg_time{context="dfs",hostname="centos"} 10.478696741854636
csm_metrics_write_state_machine_data_avg_time{context="dfs",hostname="centos"} 980656.7719298246
csm_metrics_write_state_machine_data_ns_avg_time{context="dfs",hostname="centos"} 980656.7719298246 |
@xichen01 Thanks for the changes and for sharing the metrics! The new additions look good. I would suggest removing the old metrics altogether and keeping only the new ones with the suffixes, to avoid duplication and confusion.
e.g.
csm_metrics_apply_transaction_avg_time{context="dfs",hostname="centos"} 1230705.6332916145
csm_metrics_apply_transaction_ns_avg_time{context="dfs",hostname="centos"} 1230705.6332916145
So we should just change the metric names directly, without keeping the old metric names? |
|
@xichen01 You are adding the same metrics twice but this time with a more descriptive name.
If we are not changing the metric values, there is no point in keeping the old metric names as well. We are ending up with duplicate entries.
I have no particular preference. If you keep the old values and the numbers were in |
|
The latest metrics:
[root@centos ~/community/ozone]$ curl -s http://0.0.0.0:10021/prom | grep -i CSM |grep time | grep -v '#'
csm_metrics_apply_transaction_ns_avg_time{context="dfs",hostname="centos"} 1254298.1583333334
csm_metrics_close_container_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_compact_chunk_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_create_container_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_delete_block_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_delete_chunk_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_delete_container_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_get_block_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_get_committed_block_length_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_get_small_file_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_list_block_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_list_chunk_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_list_container_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_put_block_ms_avg_time{context="dfs",hostname="centos"} 10.958333333333334
csm_metrics_put_small_file_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_read_chunk_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_read_container_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_stream_init_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_stream_write_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_transaction_latency_ms_avg_time{context="dfs",hostname="centos"} 10.61111111111111
csm_metrics_update_container_ms_avg_time{context="dfs",hostname="centos"} 0.0
csm_metrics_write_chunk_ms_avg_time{context="dfs",hostname="centos"} 10.26388888888889
csm_metrics_write_state_machine_data_ns_avg_time{context="dfs",hostname="centos"} 1138597.7222222222
[root@centos ~/community/ozone]$ |
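With the unit now part of each metric name, a reader of the `/prom` output can normalize values without knowing the CSMMetrics internals. The following is a rough sketch of that idea, assuming the `_ns_avg_time` and `_ms_avg_time` name endings seen in the output above; the class and method names are hypothetical and not part of Ozone.

```java
import java.util.Locale;

public class CsmMetricUnits {
    // Returns the average time in milliseconds for one Prometheus sample line,
    // inferring the unit from the metric-name suffix.
    static double avgTimeInMillis(String promLine) {
        String name = promLine.substring(0, promLine.indexOf('{'));
        double value = Double.parseDouble(
            promLine.substring(promLine.lastIndexOf(' ') + 1));
        if (name.endsWith("_ns_avg_time")) {
            return value / 1_000_000.0;  // nanoseconds to milliseconds
        }
        if (name.endsWith("_ms_avg_time")) {
            return value;                // already in milliseconds
        }
        throw new IllegalArgumentException("no unit suffix in metric name: " + name);
    }

    public static void main(String[] args) {
        // Sample lines copied from the output above.
        String applyTxn = "csm_metrics_apply_transaction_ns_avg_time"
            + "{context=\"dfs\",hostname=\"centos\"} 1254298.1583333334";
        String putBlock = "csm_metrics_put_block_ms_avg_time"
            + "{context=\"dfs\",hostname=\"centos\"} 10.958333333333334";

        System.out.println(String.format(Locale.ROOT, "%.3f ms", avgTimeInMillis(applyTxn)));
        System.out.println(String.format(Locale.ROOT, "%.3f ms", avgTimeInMillis(putBlock)));
    }
}
```

Read this way, the apply-transaction average in this run is roughly 1.25 ms, slightly above the 1 ms threshold raised earlier in the thread.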
xBis7
left a comment
@xichen01 Thanks for the changes. LGTM!
|
Please take a look at the CI run in your fork before starting the PR run. https://github.com/xichen01/ozone/actions/runs/4746296731/jobs/8429756367#step:5:3705 |
hemantk-12
left a comment
Thanks for the patch @xichen01
Overall looks good to me. Gave one minor comment.
- public MutableRate getApplyTransactionLatency() {
-   return applyTransaction;
+ public MutableRate getApplyTransactionNsLatency() {
nit:
- public MutableRate getApplyTransactionNsLatency() {
+ public MutableRate getApplyTransactionLatencyNs() {
Ms and Ns are used as suffixes in variable names, but the function names are not consistent with that. Can you please change it here, and also rename recordApplyTransactionNsCompletion to recordApplyTransactionCompletionNs and recordWriteStateMachineNsCompletion to recordWriteStateMachineCompletionNs? Similar to incPipelineLatencyMs.
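As an illustration of the naming convention requested above, here is a minimal sketch in which the unit suffix always comes last in field names, recorders, and getters. The class `CsmMetricsNamingSketch` and the `SimpleRate` stand-in are hypothetical; the real class uses Hadoop's `MutableRate`, which is left out here so the example stays self-contained.

```java
public class CsmMetricsNamingSketch {
    // Hypothetical stand-in for a rate metric: tracks a sample count and a running total.
    static final class SimpleRate {
        private long samples;
        private long total;

        synchronized void add(long value) {
            samples++;
            total += value;
        }

        synchronized double mean() {
            return samples == 0 ? 0.0 : (double) total / samples;
        }
    }

    // Fields carry the unit as a trailing suffix.
    private final SimpleRate applyTransactionNs = new SimpleRate();
    private final SimpleRate writeStateMachineDataNs = new SimpleRate();

    // Methods follow the same rule: the unit suffix is the last token of the name.
    public void recordApplyTransactionCompletionNs(long latencyNanos) {
        applyTransactionNs.add(latencyNanos);
    }

    public void recordWriteStateMachineCompletionNs(long latencyNanos) {
        writeStateMachineDataNs.add(latencyNanos);
    }

    public SimpleRate getApplyTransactionLatencyNs() {
        return applyTransactionNs;
    }

    public static void main(String[] args) {
        CsmMetricsNamingSketch metrics = new CsmMetricsNamingSketch();
        metrics.recordApplyTransactionCompletionNs(1_000L);
        metrics.recordApplyTransactionCompletionNs(3_000L);
        System.out.println(metrics.getApplyTransactionLatencyNs().mean());
    }
}
```

Keeping the suffix in the final position means a caller can tell the unit from the end of any identifier, whether it is a field, a recorder, or a getter.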
done
|
Thanks @xichen01 for the improvement, @ChenSammi, @hemantk-12, @xBis7 for the review. |
What changes were proposed in this pull request?
Some of the metrics in CSMMetrics are in milliseconds and some are in nanoseconds; here they are standardized to milliseconds.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-8428
How was this patch tested?