Add zk-stats instrumentation to get zk-client stats #507

Merged
merged 6 commits into from
Jul 11, 2017

rdhabalia
Contributor

Motivation

Right now, the broker has no zk-client stats to measure zk-operation counts and their latency. This PR adds instrumentation to measure zk-client operation latency.

Modifications

  • Introduce a new ZkClientFactory, ZookeeperBkClientFactoryImpl, which uses the ZookeeperClient implemented in BookKeeper. By default that client sends the request-creation time in ctx, which can be used to measure latency when the broker receives the zk-response.
  • Add instrumentation that measures zk-operation latency and captures it as part of the broker metrics.
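
The mechanism in the list above (take the request-creation time from the context, subtract it from the time the response arrives, and hand the result to registered listeners) can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `ZkLatencyNotifier`, `Listener`, and `onResponse` are hypothetical names, and the real aspect obtains the start time from the ZooKeeper client's context object rather than from a plain argument.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical stand-in for the aspect in this PR; all names are illustrative only.
final class ZkLatencyNotifier {
    /** Receives the operation type and its latency in milliseconds. */
    interface Listener {
        void onLatency(String eventType, long latencyMs);
    }

    // Copy-on-write list: registration is rare, notification is frequent.
    private static final List<Listener> LISTENERS = new CopyOnWriteArrayList<>();

    static void addListener(Listener listener) {
        LISTENERS.add(listener);
    }

    /**
     * Called when a response arrives. startTimeMs is assumed to be the
     * request-creation time that the client stored in the callback context.
     */
    static void onResponse(String eventType, long startTimeMs, long nowMs) {
        // Clamp at zero in case the two timestamps are slightly out of order.
        long latencyMs = Math.max(0, nowMs - startTimeMs);
        for (Listener listener : LISTENERS) {
            listener.onLatency(eventType, latencyMs);
        }
    }
}
```

A broker-side consumer would register once at startup, for example `ZkLatencyNotifier.addListener((type, ms) -> stats.record(type, ms))`, where `stats` is whatever metrics object the broker already owns.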

Result

  • The broker's metrics will include zk-operation latency.

@rdhabalia rdhabalia added the type/feature The PR added a new feature or issue requested a new feature label Jun 20, 2017
@rdhabalia rdhabalia added this to the 1.19 milestone Jun 20, 2017
@rdhabalia rdhabalia self-assigned this Jun 20, 2017
@rdhabalia rdhabalia force-pushed the zk_aspect branch 4 times, most recently from 59dac64 to cda9d0a Compare June 21, 2017 18:31
@rdhabalia rdhabalia force-pushed the zk_aspect branch 2 times, most recently from acd8341 to c7c465b Compare June 28, 2017 00:55
@merlimat
Contributor

@rdhabalia I'll get to this one soon

Contributor

@merlimat merlimat left a comment

Few minor comments. Though I don't completely understand if the AOT part is needed if we are going to wrap the ZooKeeper client class.

Can't the latency measurement be done in the ZK client wrapper?

this.dimensionCounts = dimensionHistogram.getTotalCount();
}

public void recordDimensionTimeValue(long latency) {
Contributor

latency should have the time unit in the var name, or we should accept a separate TimeUnit arg
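
Both options in this comment can be sketched side by side. This is only an illustration of the two signatures under assumed names (`LatencyRecorder`, `recordLatencyMs`, `record`); it is not the `DimensionStats` code from the PR.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical recorder; only the method signatures matter here.
final class LatencyRecorder {
    private final LongAdder totalMicros = new LongAdder();
    private final LongAdder count = new LongAdder();

    /** Option 1: the unit is part of the method and parameter name. */
    void recordLatencyMs(long latencyMs) {
        record(latencyMs, TimeUnit.MILLISECONDS);
    }

    /** Option 2: the caller passes the unit explicitly. */
    void record(long latency, TimeUnit unit) {
        // Normalize to microseconds so mixed-unit callers stay comparable.
        totalMicros.add(unit.toMicros(latency));
        count.increment();
    }

    /** Mean latency in milliseconds, or 0 if nothing was recorded. */
    double meanMs() {
        long n = count.sum();
        return n == 0 ? 0.0 : (totalMicros.sum() / 1000.0) / n;
    }
}
```

Option 2 costs one extra argument per call but removes any ambiguity at the call site, which is the concern raised here.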


public double elapsedIntervalMs;

private Recorder dimensionTimeRecorder = new Recorder(TimeUnit.MINUTES.toMillis(10), 2);
Contributor

Instead of using the HdrHistogram here, we could use the Prometheus client lib. It has the same support for the quantiles and it will get automatically exposed and reported in the /metrics REST handler

Contributor Author

Here we have renamed PulsarStats to DimensionStats, which is used to record BrokerOperabilityMetrics (topicLoadLatency plus zk read/write latency); we use those metrics alongside Prometheus.
Do you think we can keep it here and also add it to Prometheus? Otherwise we would have to change our monitoring tool to parse the Prometheus output just to get zk-latency.

Contributor

Even if you're not getting it from /metrics, you can still use the Prometheus Java lib in the same way as the HdrHistogram, querying it to get the percentiles.

Contributor Author

Sure, I have replaced HdrHistogram with Prometheus.Summary.

return null;
}

public static void addListner(EventListner listener) {
Contributor

addListener

Record response = (Record) field.get(packet);
return response;
}
} catch (Exception e) {
Contributor

This would silently mask the exception

Contributor Author

Do you think we should log at warn level here?
Since we are using reflection, if a field name or anything else changes in the future and this starts throwing, it could flood the logs, so I made it a debug log.
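
One possible middle ground between masking the exception and flooding the logs is to log the first failure at a visible level and later ones at debug level. The sketch below is only an assumption about how that could look; it uses `java.util.logging` to stay dependency-free, and `ReflectionFailureLog` is a hypothetical helper, not part of this PR.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper: warn once, then fall back to debug-level (FINE) logging.
final class ReflectionFailureLog {
    private static final Logger LOG = Logger.getLogger(ReflectionFailureLog.class.getName());
    private final AtomicBoolean warned = new AtomicBoolean(false);

    /** Logs the failure and returns the level used, so the decision is observable. */
    Level report(Exception e) {
        // compareAndSet succeeds only for the first caller, so WARNING is emitted once.
        Level level = warned.compareAndSet(false, true) ? Level.WARNING : Level.FINE;
        LOG.log(level, "Failed to read ZooKeeper response via reflection", e);
        return level;
    }
}
```

With this shape, a renamed field would surface as a single warning after an upgrade, while repeated failures stay out of the default log output.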

@@ -189,6 +190,10 @@ public BrokerService(PulsarService pulsar) throws Exception {

this.multiLayerTopicsMap = new ConcurrentOpenHashMap<>();
this.pulsarStats = new PulsarStats(pulsar);
// register listener to capture zk-latency
ClientCnxnAspect.addListner((eventType, latencyMs) -> {
this.pulsarStats.recordZkLatencyTimeValue(eventType, latencyMs);
Contributor

We should also record fractions of Millis, since the getData() operations are going to be in the order of 0.1 to 0.5 millis most of the time

Contributor Author

Right now we are using the BK-Yahoo version, where the BK ZK client captures startTime in milliseconds and does not have OpStatsLogger yet. Both seem to be fixed in BK master.
Because startTime is in milliseconds, it is not possible to capture times below 1 ms for now. We can do that once we move to the latest BK version.
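
The resolution limit described here can be shown with a small sketch. It assumes nothing about the BookKeeper client itself; `LatencyResolution` and its methods are hypothetical names used only to contrast millisecond timestamps with `System.nanoTime()` readings.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper contrasting the two timestamp resolutions.
final class LatencyResolution {
    /**
     * Latency from millisecond timestamps. A sub-millisecond operation
     * shows up as 0 or 1 depending on where the clock tick falls.
     */
    static long latencyMs(long startMs, long endMs) {
        return endMs - startMs;
    }

    /** Latency from System.nanoTime() readings, reported as fractional milliseconds. */
    static double latencyMsFromNanos(long startNanos, long endNanos) {
        return (endNanos - startNanos) / (double) TimeUnit.MILLISECONDS.toNanos(1);
    }
}
```

A 300 µs read captured with nanosecond readings comes out as roughly 0.3 ms, whereas the millisecond variant cannot distinguish it from an instantaneous call.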

Contributor

What about importing the BookkeeperClient.java source file from bk master?

@rdhabalia
Contributor Author

Can't the latency measurement be done in the ZK client wrapper?

Latency is measured against ZK-Client org.apache.zookeeper.ClientCnxn only. So, you mean ClientCnxnAspect should not be part of pulsar-broker and part of other module?

Though I don't completely understand if the AOT part is needed if we are going to wrap the ZooKeeper client class.

I don't completely understand your comment. In this PR we load the AOT agent at runtime, so we don't have to define the agent in the JVM args and we can test the aspect with a unit test.
In the previous ZK-server-aspect PR we had no unit tests because we couldn't weave the AspectJ advice in a unit test, which made it hard to test. So do you think the zk-server AOT should also load the AspectJ agent at runtime rather than passing it in the JVM args?

@merlimat
Contributor

Latency is measured against ZK-Client org.apache.zookeeper.ClientCnxn only. So, you mean ClientCnxnAspect should not be part of pulsar-broker and part of other module?

What I was meaning is that since you mention the ZookeeperClient from BK, that wrapper is already collecting the latency stats (through the StatsLogger mechanism). Isn't that enough to collect and report the latencies?

@rdhabalia
Contributor Author

What I was meaning is that since you mention the ZookeeperClient from BK, that wrapper is already collecting the latency stats (through the StatsLogger mechanism).

OpStatsLogger is not present in the BK-Yahoo version.

@merlimat
Contributor

merlimat commented Jul 11, 2017 via email

@rdhabalia
Contributor Author

@merlimat Addressed all the changes.

Contributor

@merlimat merlimat left a comment

👍 Looks good. Just 2 very minor suggestions.

For the long run, let's try to collect the stats in the wrapper (maybe when using latest BK) so that we don't need to instrument the code and we can get sub-millis resolution

}

public double getDimensionSum() {
return defaultRegistry.getSampleValue(name + "_sum").doubleValue();
Contributor

Can save the name + "_sum" (and count) into a member variable to avoid recreating the string each time


private double getQuantile(double q) {
return defaultRegistry
.getSampleValue(name, new String[] { "quantile" }, new String[] { Collector.doubleToGoString(q) })
Contributor

Can the String[] be cached?
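
The two caching suggestions in this review (the `name + "_sum"` string and the `String[]` label arrays) could look roughly like the sketch below. To keep it self-contained, `SampleSource` is a hypothetical stand-in for the registry lookup and is not the Prometheus API; `CachedSummaryReader` and the fixed 0.5 quantile are likewise assumptions for illustration.

```java
// Hypothetical stand-in for a metrics registry lookup; this is not the Prometheus API.
interface SampleSource {
    Double getSampleValue(String name, String[] labelNames, String[] labelValues);
}

final class CachedSummaryReader {
    // Shared, never-mutated arrays: allocated once instead of on every call.
    private static final String[] NO_LABELS = new String[0];
    private static final String[] QUANTILE_LABEL = { "quantile" };

    private final SampleSource source;
    private final String name;
    private final String sumName;   // name + "_sum", built once in the constructor
    private final String countName; // name + "_count", built once in the constructor
    private final String[] medianValue = { "0.5" };

    CachedSummaryReader(SampleSource source, String name) {
        this.source = source;
        this.name = name;
        this.sumName = name + "_sum";
        this.countName = name + "_count";
    }

    double getSum() {
        return orZero(source.getSampleValue(sumName, NO_LABELS, NO_LABELS));
    }

    double getCount() {
        return orZero(source.getSampleValue(countName, NO_LABELS, NO_LABELS));
    }

    double getMedian() {
        return orZero(source.getSampleValue(name, QUANTILE_LABEL, medianValue));
    }

    // A missing sample is reported as 0 rather than causing a NullPointerException.
    private static double orZero(Double value) {
        return value == null ? 0.0 : value;
    }
}
```

Sharing the arrays is only safe if the lookup never mutates its arguments, which is assumed here.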

pom.xml Outdated
@@ -479,6 +479,13 @@ flexible messaging model and an intuitive client API.</description>
<artifactId>aspectjweaver</artifactId>
<version>${aspectj.version}</version>
</dependency>

<!-- TODO: should we callout BSD license: https://github.com/electronicarts/ea-agent-loader -->
Contributor Author

@merlimat Do you think we should add BSD license into NOTICE file?

Contributor

I think it should only be added to the NOTICE file that gets included in the binary distribution, not in the sources one, because there we're not including it.

I did some initial work in collecting the list of dependencies and the licenses to adjust the LICENSE and NOTICE files for both bin and src tgzs. You can leave it out for now, and remove the TODO comment. We'll fix it along with all the other dependencies.

@rdhabalia rdhabalia merged commit bb0a8fb into apache:master Jul 11, 2017
@merlimat
Contributor

@rdhabalia I think the build started failing after this commit got merged:

https://travis-ci.org/apache/incubator-pulsar/builds/252621888

Do you have any idea what it could be?

@rdhabalia
Contributor Author

Let me check. I actually merged only after the PR's Travis build had passed.

@merlimat
Contributor

Yes, I saw. Initially I thought the problem was about my commit upgrading the Jna library, but it fails consistently on local build even before that commit.

@merlimat
Contributor

@rdhabalia I did a git bisect locally and it started failing at this commit :)

It reproduces consistently on local build.
