(WIP) Task delay metrics endpoints on TaskDriver #1056
NealSun96 wants to merge 5 commits into apache:task_pool from
Conversation
I have 2 questions,
// first level, and its children nodes will represent each job type. Since the metrics are
// aggregated by job types, the metric values will be recorded in each ZNode belonging to a job
// type as simple fields.
private static final String SYNC_ZK_ROOT_PATH = "/JOB_MONITOR_METRICS";
Why do we need a ZNode to persist metrics data?
Is there a reason we don't use MBean for such metrics?
Right now the metrics are emitted by the controller. We want to expose the metrics to the client side, so ZNodes are used as a medium to store the metrics for access.
MBean is still used; the ZNodes are for "syncing" or "copying" the metrics to the client side.
It's not necessary to do it this way:
- These metrics can be computed in the controller by looking at the stats of the ZNode property.
- Even if you would like to have them on the client side, you should report them through MBean, since all the data can be derived from config/context. InGraph can then fetch the MBean to report them. It is not a good idea for metrics reporting to introduce new ZNode create/read/write operations.
I'm not trying to say no to this idea before I fully understand the context here. But,
- Who is this "we"?
- Do we have a design that illustrates the motivation and potential risks? I would like to take a look and see if there are any other ways.
I don't think it is a good idea to use ZK as a syncing store for metrics. ZK is not designed for this purpose, and it is a core part of the infrastructure. Syncing metrics to ZK definitely increases read/write traffic to ZK, which may impact other critical services. If we have a better design that does not use ZK as a syncing store, exposing metrics to the client is fine.
Could you describe a case of "users that cannot rely directly on the controller emitted metrics to see the metric values"? I would like to see if there is an alternative solution that doesn't use ZK.
// first level, and its children nodes will represent each job type. Since the metrics are
// aggregated by job types, the metric values will be recorded in each ZNode belonging to a job
// type as simple fields.
private static final String SYNC_ZK_ROOT_PATH = "/JOB_MONITOR_METRICS";
Let's review whether creating this root path would be appropriate. Can you find a better place to persist the data?
 * Sync the current SubmissionToProcessDelay mean to ZooKeeper
 * @param baseDataAccessor
 */
public void syncSubmissionToProcessDelayToZk(BaseDataAccessor<ZNRecord> baseDataAccessor) {
As discussed offline, there could be potentially better designs to achieve this.
- This is too specific and neither easy to maintain nor scalable: if you want to persist more metrics, you'd have to implement an additional method for each one.
- Can you look into inversion of control? Perhaps you could create a component with the appropriate set of interface methods that allow persisting of metric numbers to a metadata store in a more generic fashion. For example, you could have something like,
<I> MetricStorage
<Class> ZkMetricStorage (takes a Zk connection to initialize)
Also, this MetricStorage implementation could take in a dynamic list of metric names to be persisted, etc. You could get creative with this. No need to feel like you have to cover all use cases, but designing a component in a way that is extendable and generic enough will save you time and help with maintainability.
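The suggestion above could look roughly like the following. This is only a minimal sketch under stated assumptions: the `MetricStorage` method names (`persist`, `read`) and the `InMemoryMetricStorage` class are hypothetical and not part of Helix. An in-memory map stands in for the metadata store so the example runs on its own; a `ZkMetricStorage` would implement the same interface on top of a ZK connection.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical generic storage interface; the method names are illustrative only. */
interface MetricStorage {
  /** Persists one value and returns false if the metric name is not tracked. */
  boolean persist(String jobType, String metricName, double value);

  /** Reads the last persisted value, if any. */
  Optional<Double> read(String jobType, String metricName);
}

/** In-memory stand-in for a real backend such as a ZooKeeper-based implementation. */
final class InMemoryMetricStorage implements MetricStorage {
  // The set of metric names to persist is supplied by the caller, not hard-coded.
  private final Set<String> trackedMetrics;
  // jobType -> (metricName -> value)
  private final Map<String, Map<String, Double>> store = new ConcurrentHashMap<>();

  InMemoryMetricStorage(Collection<String> trackedMetrics) {
    this.trackedMetrics = new HashSet<>(trackedMetrics);
  }

  @Override
  public boolean persist(String jobType, String metricName, double value) {
    if (!trackedMetrics.contains(metricName)) {
      return false; // unknown metrics are ignored rather than stored
    }
    store.computeIfAbsent(jobType, k -> new ConcurrentHashMap<>()).put(metricName, value);
    return true;
  }

  @Override
  public Optional<Double> read(String jobType, String metricName) {
    Map<String, Double> fields = store.get(jobType);
    return fields == null ? Optional.empty() : Optional.ofNullable(fields.get(metricName));
  }
}
```

With this shape, adding another metric means adding a name to the tracked list rather than writing a new sync method per metric.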
  protected static void reportSubmissionToProcessDelay(BaseControllerDataProvider dataProvider,
      final ClusterStatusMonitor clusterStatusMonitor, final WorkflowConfig workflowConfig,
-     final JobConfig jobConfig, final long currentTimestamp) {
+     final JobConfig jobConfig, final long currentTimestamp, final HelixManager helixManager) {
It's strongly recommended that we check whether the given ZkConnection is valid. What if helixManager is null or not connected?
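A guard along these lines is one way to address the concern above. It is a sketch only: `ConnectionStatus` is a hypothetical stand-in for the single `isConnected()` check on the manager so that the example compiles without Helix on the classpath, and `MetricSyncGuard` is an illustrative name, not an existing class.

```java
/** Hypothetical stand-in exposing only the connectivity check used below. */
interface ConnectionStatus {
  boolean isConnected();
}

final class MetricSyncGuard {
  private MetricSyncGuard() {
  }

  /** Returns true only when a manager is present and reports itself as connected. */
  static boolean canSync(ConnectionStatus manager) {
    return manager != null && manager.isConnected();
  }
}
```

The reporting method could call `canSync` first and skip the ZooKeeper write, with a warning log, when it returns false.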
    JobMonitorMetricZnodeField field, HistogramDynamicMetric metric) {
  String zkPath = buildZkPathForJobMonitorMetric(_jobType);

  if (!baseDataAccessor.update(zkPath, currentData -> {
Is there a reason we're doing an update instead of set?
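For context on the question, the general difference between the two styles can be sketched with a plain in-memory map. This is an analogy under stated assumptions, not the Helix `BaseDataAccessor` implementation: `FieldStore` and its methods are hypothetical. An update is a read-modify-write that can preserve fields written earlier, while a set replaces the whole record.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical in-memory record store used only to contrast update and set. */
final class FieldStore {
  private final Map<String, Map<String, String>> nodes = new HashMap<>();

  /** Overwrites the whole record at the path; fields not in data are lost. */
  synchronized void set(String path, Map<String, String> data) {
    nodes.put(path, new HashMap<>(data));
  }

  /** Read-modify-write: the updater receives a copy of the current fields (empty if none). */
  synchronized void update(String path, UnaryOperator<Map<String, String>> updater) {
    Map<String, String> current = nodes.get(path);
    Map<String, String> copy = current == null ? new HashMap<>() : new HashMap<>(current);
    nodes.put(path, updater.apply(copy));
  }

  /** Returns a copy of the record at the path, or an empty map if it does not exist. */
  synchronized Map<String, String> get(String path) {
    Map<String, String> current = nodes.get(path);
    return current == null ? new HashMap<>() : new HashMap<>(current);
  }
}
```

If each metric is stored as a separate field of one record per job type, update keeps the other metrics' fields intact, whereas set would require the caller to supply the full record every time.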
4ab7271 to 8027951
After discussion, this work item will not be done.
Issues
Fixes #1055
Description
We are adding 3 endpoints to TaskDriver that return the values of SubmissionToProcessDelay, SubmissionToScheduleDelay, and ControllerInducedDelay. The purpose of the endpoints is to offer an option for users who cannot rely directly on the controller-emitted metrics to see the metric values. To make the metrics accessible from TaskDriver, the metric values are written to ZooKeeper every time a new metric value is reported. The value synced to ZK is the mean. Note: since the values are synced at reporting time, they are not updated when nothing is reported. The endpoint values are fresh when there are task framework resources actively running.
Tests
TestJobMonitor, testGetSubmissionToProcessDelay, testGetSubmissionToProcessDelayIllegalArgument, testGetSubmissionToScheduleDelay, testGetSubmissionToScheduleDelayIllegalArgument, testControllerInducedDelay, testControllerInducedDelayIllegalArgument
Rerun:
Commits
Code Quality
(helix-style-intellij.xml if IntelliJ IDE is used)