HIVE-21740: Collect LLAP execution latency metrics #633
Some minor changes (such as isMapTask) should be made.
I am also not sure that exponential decay is helping the cause here.
…de queue is full, we should not reset the counter every time
I did a pass and left some minor comments and nit recommendations that can be ignored, but overall it looks good to me.
Thanks.
@@ -4438,6 +4438,8 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
    LLAP_COLLECT_LOCK_METRICS("hive.llap.lockmetrics.collect", false,
        "Whether lock metrics (wait times, counts) are collected for LLAP "
        + "related locks"),
    LLAP_LATENCY_METRIC_WINDOW_SIZE("hive.llap.metrics.latency.window.size", 2048,
Any clue why 2k is the default?
Across several TPC-DS tests, it seems we need at least around 1000 measurements for the average to be stable. I set it to 2k to be safe.
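To illustrate what a window size means for an averaged latency, here is a minimal sketch of a mean over the last N samples. The class name SlidingWindowAverage and its methods are hypothetical and are not taken from the patch; the patch itself may compute its window differently (for example with decay).

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical helper (not from the patch): mean of the last windowSize samples. */
final class SlidingWindowAverage {
  private final int windowSize;
  private final Deque<Long> samples = new ArrayDeque<>();
  private long sum;

  SlidingWindowAverage(int windowSize) {
    if (windowSize <= 0) {
      throw new IllegalArgumentException("windowSize must be positive");
    }
    this.windowSize = windowSize;
  }

  /** Records one latency sample and evicts the oldest one once the window is full. */
  synchronized void add(long latencyMillis) {
    samples.addLast(latencyMillis);
    sum += latencyMillis;
    if (samples.size() > windowSize) {
      sum -= samples.removeFirst();
    }
  }

  /** Returns the mean of the samples currently in the window, or 0.0 if it is empty. */
  synchronized double average() {
    return samples.isEmpty() ? 0.0 : (double) sum / samples.size();
  }
}
```

With a window of 2048, the average only stops reflecting an old measurement after 2048 newer ones have arrived, which is the trade-off between stability and responsiveness discussed above.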
@@ -1175,6 +1178,9 @@ public boolean deallocateTask(
      LOG.debug("Processing deallocateTask for task={}, taskSucceeded={}, endReason={}", task,
          taskSucceeded, endReason);
    }
    if (task instanceof TaskAttemptImpl && metrics != null) {
      updateMetrics((TaskAttemptImpl)task);
The style is off, and I am not sure in what case metrics is null.
The metrics can be null when conf.getBoolean(ConfVars.HIVE_IN_TEST.varname, false) is true.
I tried to find the style problem, but I am not sure which part should be written differently.
    metrics.addTaskLatency(nodeInfo.shortStringBase, taskAttempt.getFinishTime() - taskAttempt.getLaunchTime());
  }

  private boolean isMapTask(TaskAttemptImpl taskAttempt) {
Is this really the best way to find out whether this is a map task?
I am not yet entirely familiar with this part of the code.
I thought the vertex could tell us more, but getVertex is a package-private method of TaskAttemptImpl, and getting the Vertex from the DAG would need something like this (found in getTransitiveVertexOutputs):
DagInfo info = getContext().getCurrentDagInfo();
if (!(info instanceof DAG)) {
LOG.warn("DAG info is not a DAG");
return;
}
DAG dag = (DAG) info;
Vertex vertex = dag.getVertex(taskAttempt.getVertexID());
I found casting DagInfo to DAG shadier than relying on counters, but feel free to disagree.
I am also open to any suggestions on where I should dig around to find a better solution!
Add a decaying metric that measures the task latency.
Refactored the MockMetricsCollector so that it can be reused in the test.
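
For readers unfamiliar with decaying metrics, one common form is an exponentially weighted moving average, where each new sample is blended with the previous value. The sketch below is only an illustration under that assumption; the class DecayingAverage and the parameter alpha are hypothetical names, and the patch may implement decay in a different way.

```java
/** Hypothetical illustration (not from the patch): exponentially weighted moving average. */
final class DecayingAverage {
  private final double alpha;   // weight of the newest sample, in (0, 1]
  private double value;
  private boolean initialized;

  DecayingAverage(double alpha) {
    if (alpha <= 0.0 || alpha > 1.0) {
      throw new IllegalArgumentException("alpha must be in (0, 1]");
    }
    this.alpha = alpha;
  }

  /** Blends the new sample into the running value; the first sample seeds it. */
  synchronized void add(double sample) {
    if (!initialized) {
      value = sample;
      initialized = true;
    } else {
      value = alpha * sample + (1.0 - alpha) * value;
    }
  }

  /** Returns the current decayed average, or 0.0 before any sample was added. */
  synchronized double get() {
    return value;
  }
}
```

A larger alpha makes the value follow recent task latencies more closely, while a smaller alpha smooths out short spikes at the cost of reacting more slowly.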