-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add zk-stats instrumentation to get zk-client stats #507
Conversation
59dac64
to
cda9d0a
Compare
acd8341
to
c7c465b
Compare
@rdhabalia I'll get to this one soon |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few minor comments. Though I don't completely understand if the AOT part is needed if we are going to wrap the ZooKeeper client class.
Can't the latency measurement be done in the ZK client wrapper?
this.dimensionCounts = dimensionHistogram.getTotalCount(); | ||
} | ||
|
||
public void recordDimensionTimeValue(long latency) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
latency
should have the time unit in the var name, or we should accept a separate TimeUnit
arg
|
||
public double elapsedIntervalMs; | ||
|
||
private Recorder dimensionTimeRecorder = new Recorder(TimeUnit.MINUTES.toMillis(10), 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of using the HdrHistogram
here, we could use the Prometheus client lib. It has the same support for the quantiles and it will get automatically exposed and reported in the /metrics
REST handler
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here, we have renamed PulsarStats
to DimensionStats
which is used to record BrokerOperatabilityMetrics- topicLoadLatency + Zk-read/write latency which metrics we use along with Prometheus
.
Do you think we can keep it here and add into Prometheus
also else, we have to make changes into monitoring tool to parse Prometheus
output to just get zk-latency.??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if you're not getting it from /metics
, you can anyway use the prometheus Java lib in the same way as the HdrHistogram
, querying to get the percentiles out of it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I have replaced HdrHistogram
with Prometheus.Summary
.
return null; | ||
} | ||
|
||
public static void addListner(EventListner listener) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addListener
Record response = (Record) field.get(packet); | ||
return response; | ||
} | ||
} catch (Exception e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would silently mask the exception
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you think we should do warn
logging here?
actually, as we are using reflection and if in future field-name or anything changed which throws exception then it can entirely fill out logs so, I made it as debug-log.
@@ -189,6 +190,10 @@ public BrokerService(PulsarService pulsar) throws Exception { | |||
|
|||
this.multiLayerTopicsMap = new ConcurrentOpenHashMap<>(); | |||
this.pulsarStats = new PulsarStats(pulsar); | |||
// register listener to capture zk-latency | |||
ClientCnxnAspect.addListner((eventType, latencyMs) -> { | |||
this.pulsarStats.recordZkLatencyTimeValue(eventType, latencyMs); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also record fractions of Millis, since the getData()
operations are going to be in the order of 0.1 to 0.5 millis most of the time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, right now, we are using BK-Yahoo version where BK-ZK-Client captures startTime in Msec and it doesn't have OpStatsLoggger yet. And it seems both the things are fixed in BK-master.
Therefore, startTime is milliseconds, it's not possible to capture time < 1 msec for now. We can do it once we move to latest BK version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about importing the BookkeeperClient.java
source file from bk master?
Latency is measured against ZK-Client
Not, completely understand your comment. But in this PR we are loading AOT-agent at runtime because of that we don't have to define agent in jvm-args and we can test aspect with unit-test. |
What I was meaning is that since you mention the |
OpStatsLogger is not present into BK-Yahoo version |
I know, that's why I was saying whether it was easier to just import the
code, until we sync up with future BK releases :)
…--
Matteo Merli
<matteo.merli@gmail.com>
On Mon, Jul 10, 2017 at 5:29 PM, Rajan Dhabalia ***@***.***> wrote:
What I was meaning is that since you mention the ZookeeperClient from BK,
that wrapper is already collecting the latency stats (through the
StatsLogger mechanism).
OpStatsLogger
<https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/zookeeper/ZooWorker.java#L45>
is not present into BK-Yahoo version
<https://github.com/yahoo/bookkeeper/blob/yahoo-4.3/bookkeeper-server/src/main/java/org/apache/bookkeeper/zookeeper/ZooWorker.java>
—
You are receiving this because your review was requested.
Reply to this email directly, view it on GitHub
<#507 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAD0JPZ8ODbPwZi9Sz61Dmx2NK-dJS0cks5sMsHhgaJpZM4OARCW>
.
|
@merlimat Addressed all the changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Looks good. Just 2 very minor suggestions.
For the long run, let's try to collect the stats in the wrapper (maybe when using latest BK) so that we don't need to instrument the code and we can get sub-millis resolution
} | ||
|
||
public double getDimensionSum() { | ||
return defaultRegistry.getSampleValue(name + "_sum").doubleValue(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can save the name + "_sum"
(and count) into a member variable to avoid recreating the string each time
|
||
private double getQuantile(double q) { | ||
return defaultRegistry | ||
.getSampleValue(name, new String[] { "quantile" }, new String[] { Collector.doubleToGoString(q) }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the String[]
be cached?
pom.xml
Outdated
@@ -479,6 +479,13 @@ flexible messaging model and an intuitive client API.</description> | |||
<artifactId>aspectjweaver</artifactId> | |||
<version>${aspectj.version}</version> | |||
</dependency> | |||
|
|||
<!-- TODO: should we callout BSD license: https://github.com/electronicarts/ea-agent-loader --> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should only be added to the NOTICE
file that gets included in the binary distribution, not in the sources one, because there we're not including it.
I did some initial work in collecting the list of dependencies and the licenses to adjust the LICENSE and NOTICE files for both bin and src tgzs. You can leave it out for now, and remove the TODO comment. We'll fix it along with all the other dependencies.
@rdhabalia I think the build started failing after this commit got merged: https://travis-ci.org/apache/incubator-pulsar/builds/252621888 Do you have any idea what could it be? |
let me check.. actually I have merged after PR-travis build passed. |
Yes, I saw. Initially I thought the problem was about my commit upgrading the Jna library, but it fails consistently on local build even before that commit. |
@rdhabalia I did a It reproduces consistently on local build. |
Motivation
Right now, broker doesn't have zk-client stats to measure zk-operation count and its latency. So, added instrumentation to measure zk-client operation latency.
Modifications
ZookeeperBkClientFactoryImpl
which uses ZookeeperClient implemented in Bookkeeper which by default sends request-creation time in ctx which can be used to measure latency when broker receives zk-response back.Result