ZOOKEEPER-3432 Improving zookeeper trace for performance and scalability #994
gmcatsf wants to merge 1 commit into apache:master from gmcatsf:ZOOKEEPER-3432
|
@gmcatsf awesome work.
|
hanm
left a comment
This is a nice improvement, as tracing has never really worked as expected in ZK.
Very high-level comments:
- It might be worth considering a feature switch to turn tracing on / off.
- This adds new system properties / configuration options, and these should be properly documented.
- Documentation on how to set up end-to-end tracing would be great.
My thoughts on this: the approach this patch takes and third-party tracing integration are not mutually exclusive, and it's probably good to have a self-contained tracing solution for ZK without a third-party dependency. |
This implementation sends traces to an external process, and the pull request itself includes no persistence; how the traces are persisted is up to the external process. We have a separate external process in C++ that receives those traces and then writes them to Scribe or any other framework.
I think it could co-exist with third-party tracing integration, though the code would need some refactoring, such as changing TraceLogger to an interface and adding a trace logger factory to load different implementations. The reason behind the current design is to export traces through a single stateful socket connection without adding a new third-party library dependency to ZooKeeper. |
There is a system property named "zookeeper.disableTraceLogger" to turn off all tracing (this was added by Ben Reed when he was at FB).
Where is the best place to add documentation? Or shall it be a separate doc? |
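For readers unfamiliar with this kind of switch, here is a minimal sketch of how a system-property kill switch can be combined with a per-category bitmask like the existing ZooTrace.traceMask. Only the property name "zookeeper.disableTraceLogger" comes from this thread; the class name `TraceSwitch`, the mask constants, and their bit positions are assumptions made for illustration and are not taken from the patch.

```java
// Hypothetical sketch: only the property name below appears in this thread.
final class TraceSwitch {
    // Assumed category bits, for illustration only.
    static final long CLIENT_REQUEST_MASK = 1L << 1;
    static final long SERVER_PACKET_MASK = 1L << 3;

    private final boolean disabled;
    private final long traceMask;

    TraceSwitch(boolean disabled, long traceMask) {
        this.disabled = disabled;
        this.traceMask = traceMask;
    }

    // Boolean.getBoolean returns true only if the property is set to "true".
    static TraceSwitch fromSystemProperties(long traceMask) {
        boolean off = Boolean.getBoolean("zookeeper.disableTraceLogger");
        return new TraceSwitch(off, traceMask);
    }

    // A trace is recorded only if tracing is enabled and its category bit is set.
    boolean shouldTrace(long category) {
        return !disabled && (traceMask & category) != 0;
    }
}
```

Under these assumptions, starting the server with `-Dzookeeper.disableTraceLogger=true` would make `shouldTrace` return false for every category, regardless of the mask.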
Adding a TraceLoggerFactory and a TraceLogger interface might help generalize the implementation; please kindly let me know if I should make the change here. |
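The refactoring proposed above could look roughly like the following sketch. It is not code from the pull request: the single-method `TraceLogger` interface, the `NoopTraceLogger` and `InMemoryTraceLogger` implementations, and the property name `zookeeper.traceLoggerImpl` are all hypothetical and exist only to show how a factory could load different implementations by class name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface; the real TraceLogger in the patch is a concrete class.
interface TraceLogger {
    void log(String event);
}

// Default used when no implementation is configured.
final class NoopTraceLogger implements TraceLogger {
    @Override
    public void log(String event) {
        // Intentionally does nothing.
    }
}

// Simple implementation that keeps events in memory, useful for tests.
final class InMemoryTraceLogger implements TraceLogger {
    final List<String> events = new ArrayList<>();

    @Override
    public void log(String event) {
        events.add(event);
    }
}

final class TraceLoggerFactory {
    // Assumed property name, not defined anywhere in the patch.
    static final String IMPL_PROPERTY = "zookeeper.traceLoggerImpl";

    static TraceLogger fromSystemProperty() {
        return create(System.getProperty(IMPL_PROPERTY));
    }

    // Loads the named class reflectively and falls back to a no-op logger.
    static TraceLogger create(String className) {
        if (className == null || className.isEmpty()) {
            return new NoopTraceLogger();
        }
        try {
            return Class.forName(className)
                    .asSubclass(TraceLogger.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot load trace logger: " + className, e);
        }
    }
}
```

With a factory of this shape, a socket-based logger and a third-party tracing adapter could sit behind the same interface and be selected by configuration.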
I searched around, and there's no good place to add this doc. You can create a separate doc, e.g.
|
|
@maoling I will add
There is no traceId for trace message, could you specify which line this refers to?
TraceLogger can be turned off by ZooTrace.disableTraceLogger via system property zookeeper.disableTraceLogger. There is also an existing ZooTrace.traceMask which controls which traces to record.
I assume you mean TraceLogger.TraceMessage.send()? This method asynchronously adds the trace to a queue with limited capacity and returns immediately to avoid slowing down the server. Traces added to the queue may be dropped if the queue overflows. The performance issue with the existing ZooTrace is the exact reason for this pull request. Note that the implementation uses only a single socket and a single thread, and we have been running this on hundreds of Facebook production ensembles for over a year without noticing any performance impact.
The pull request includes a testing class named TraceLoggerServer which shows how this works. It may not be enough, and I will enhance it to show how this external process should read traces. This is the same as the comment from @hanm.
OpenTracing has a span/span context data model, which is different from the existing ZooTrace implementation of log messages. The focus of this pull request is to replace the part that writes to log4j with writing to an external process through a socket, so that server performance won't be affected. Would it make sense to create a separate JIRA for OpenTracing? |
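The non-blocking behaviour described for send() can be illustrated with a small sketch: a bounded queue, a single sender thread, and dropping on overflow. This is not the patch's implementation (which sends over a netty socket); the class `AsyncTraceQueue`, its string payload, and the `Consumer` sink standing in for the socket are assumptions for illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

// Hypothetical sketch of a bounded, drop-on-overflow trace queue.
final class AsyncTraceQueue {
    private final BlockingQueue<String> queue;
    private final AtomicLong dropped = new AtomicLong();
    private final Thread sender;
    private volatile boolean running = true;

    AsyncTraceQueue(int capacity, Consumer<String> sink) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        // Single background thread drains the queue into the sink.
        this.sender = new Thread(() -> {
            try {
                while (running || !queue.isEmpty()) {
                    String msg = queue.poll(50, TimeUnit.MILLISECONDS);
                    if (msg != null) {
                        sink.accept(msg);
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "trace-sender");
        this.sender.setDaemon(true);
        this.sender.start();
    }

    // Never blocks the caller: offer() fails immediately when the queue is full.
    boolean send(String msg) {
        boolean accepted = queue.offer(msg);
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    long droppedCount() {
        return dropped.get();
    }

    // Stops accepting work, lets the sender drain what is queued, then waits for it.
    void close() {
        running = false;
        try {
            sender.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The design choice here is that request-processing threads pay only the cost of a failed or successful `offer`, so a slow or unavailable trace receiver results in lost traces rather than slower requests.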
|
|
|
@maoling it makes sense to leave open the possibility of integrating with a third party like OpenTracing, but it seems to me we can do that in a separate JIRA. Most of the code here focuses on where we collect and generate the trace events; it's not so much about how to send the traces, and we can adapt that to OpenTracing later. Given that we've already been running this in prod for a long time, I would suggest keeping changes to this minimal for now and improving it separately. What do you think? @gmcatsf can you rebase this feature and keep it up to date? |
eolivelli
left a comment
I did a first pass.
It looks good, but this patch is huge to review.
It would be great to split it a little.
I will do a second pass as soon as possible
|
I'm probably already a dinosaur, because logging in general, especially logging to a separate process/server is just However I like the approach in this patch especially if it's already proven stable in production, so I'm not against committing a solution which already works nicely. I agree with @hanm 's suggestions: documentation is super critical here and an on/off switch would also be beneficial. @gmcatsf This patch has been outstanding for a long time. Any plans on rebasing? You definitely want to do it sooner rather than later, if you want to hit 3.6.0. |
|
Refer to this link for build results (access rights to CI server needed):
Build result: FAILURE
[...truncated 828.13 KB...]
[JENKINS] Archiving /home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-github-pr-build-maven/pom.xml to org.apache.zookeeper/parent/3.7.0-SNAPSHOT/parent-3.7.0-SNAPSHOT.pom
[JENKINS] Archiving /home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-github-pr-build-maven/zookeeper-recipes/zookeeper-recipes-lock/pom.xml to org.apache.zookeeper/zookeeper-recipes-lock/3.7.0-SNAPSHOT/zookeeper-recipes-lock-3.7.0-SNAPSHOT.pom
[JENKINS] Archiving /home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-github-pr-build-maven/zookeeper-contrib/pom.xml to org.apache.zookeeper/zookeeper-contrib/3.7.0-SNAPSHOT/zookeeper-contrib-3.7.0-SNAPSHOT.pom
[JENKINS] Archiving /home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-github-pr-build-maven/zookeeper-client/pom.xml to org.apache.zookeeper/zookeeper-client/3.7.0-SNAPSHOT/zookeeper-client-3.7.0-SNAPSHOT.pom
[JENKINS] Archiving /home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-github-pr-build-maven/zookeeper-jute/pom.xml to org.apache.zookeeper/zookeeper-jute/3.7.0-SNAPSHOT/zookeeper-jute-3.7.0-SNAPSHOT.pom
[JENKINS] Archiving /home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-github-pr-build-maven/zookeeper-jute/target/zookeeper-jute-3.7.0-SNAPSHOT.jar to org.apache.zookeeper/zookeeper-jute/3.7.0-SNAPSHOT/zookeeper-jute-3.7.0-SNAPSHOT.jar
[JENKINS] Archiving /home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-github-pr-build-maven/zookeeper-jute/target/zookeeper-jute-3.7.0-SNAPSHOT-tests.jar to org.apache.zookeeper/zookeeper-jute/3.7.0-SNAPSHOT/zookeeper-jute-3.7.0-SNAPSHOT-tests.jar
[JENKINS] Archiving /home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-github-pr-build-maven/zookeeper-jute/target/zookeeper-jute-3.7.0-SNAPSHOT-sources.jar to org.apache.zookeeper/zookeeper-jute/3.7.0-SNAPSHOT/zookeeper-jute-3.7.0-SNAPSHOT-sources.jar
[JENKINS] Archiving /home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-github-pr-build-maven/zookeeper-jute/target/zookeeper-jute-3.7.0-SNAPSHOT-javadoc.jar to org.apache.zookeeper/zookeeper-jute/3.7.0-SNAPSHOT/zookeeper-jute-3.7.0-SNAPSHOT-javadoc.jar
[JENKINS] Archiving /home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-github-pr-build-maven/zookeeper-recipes/zookeeper-recipes-queue/pom.xml to org.apache.zookeeper/zookeeper-recipes-queue/3.7.0-SNAPSHOT/zookeeper-recipes-queue-3.7.0-SNAPSHOT.pom
[JENKINS] Archiving /home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-github-pr-build-maven/zookeeper-docs/pom.xml to org.apache.zookeeper/zookeeper-docs/3.7.0-SNAPSHOT/zookeeper-docs-3.7.0-SNAPSHOT.pom
[JENKINS] Archiving /home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-github-pr-build-maven/zookeeper-docs/target/zookeeper-docs-3.7.0-SNAPSHOT.jar to org.apache.zookeeper/zookeeper-docs/3.7.0-SNAPSHOT/zookeeper-docs-3.7.0-SNAPSHOT.jar
[JENKINS] Archiving /home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-github-pr-build-maven/zookeeper-docs/target/zookeeper-docs-3.7.0-SNAPSHOT-tests.jar to org.apache.zookeeper/zookeeper-docs/3.7.0-SNAPSHOT/zookeeper-docs-3.7.0-SNAPSHOT-tests.jar
[JENKINS] Archiving /home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-github-pr-build-maven/zookeeper-docs/target/zookeeper-docs-3.7.0-SNAPSHOT-sources.jar to org.apache.zookeeper/zookeeper-docs/3.7.0-SNAPSHOT/zookeeper-docs-3.7.0-SNAPSHOT-sources.jar
[JENKINS] Archiving /home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-github-pr-build-maven/zookeeper-recipes/pom.xml to org.apache.zookeeper/zookeeper-recipes/3.7.0-SNAPSHOT/zookeeper-recipes-3.7.0-SNAPSHOT.pom
[JENKINS] Archiving /home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-github-pr-build-maven/zookeeper-client/zookeeper-client-c/pom.xml to org.apache.zookeeper/zookeeper-client-c/3.7.0-SNAPSHOT/zookeeper-client-c-3.7.0-SNAPSHOT.pom
[JENKINS] Archiving /home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-github-pr-build-maven/zookeeper-contrib/zookeeper-contrib-rest/pom.xml to org.apache.zookeeper/zookeeper-contrib-rest/3.7.0-SNAPSHOT/zookeeper-contrib-rest-3.7.0-SNAPSHOT.pom
[JENKINS] Archiving /home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-github-pr-build-maven/zookeeper-assembly/pom.xml to org.apache.zookeeper/zookeeper-assembly/3.7.0-SNAPSHOT/zookeeper-assembly-3.7.0-SNAPSHOT.pom
[JENKINS] Archiving /home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-github-pr-build-maven/zookeeper-contrib/zookeeper-contrib-zooinspector/pom.xml to org.apache.zookeeper/zookeeper-contrib-zooinspector/3.7.0-SNAPSHOT/zookeeper-contrib-zooinspector-3.7.0-SNAPSHOT.pom
channel stopped
[SpotBugs] Skipping execution of recorder since overall result is 'FAILURE'
Setting status of fe704b0 to FAILURE with url https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build-maven/1758/ and message: 'FAILURE '
Using context: JenkinsMaven |
Author: Mocheng Guo <gmcatsf@gmail.com>
Date: Wed Jun 19 17:25:35 2019 -0700
Added trace logger server to collect and write traces from trace logger.
Added bin/zkTraceServer.sh to run trace logger server.
Added zookeeperTrace.md to describe how trace works.
|
@maoling @eolivelli @anmolnar I have rebased on top of the master branch. Documentation was already added based on @hanm's comment. Please help review and see what further changes are necessary. |
|
@lvfangmin @anmolnar @hanm ping |
|
retest maven build |
lvfangmin
left a comment
+1
Mocheng is not at FB now; I'll rebase this diff when I have time.
Added TraceLogger to send traces to an external process via Netty. Each trace has a strongly typed schema defined inside TraceField.
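As a rough illustration of what a strongly typed trace schema can mean, the sketch below pairs each field with a Java type and rejects values of the wrong type when a message is built. The names `TraceFieldSketch` and `TypedTraceMessage`, and the specific fields listed, are hypothetical; the actual contents of TraceField in the patch are not shown in this thread.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical schema: each field declares the type of value it accepts.
enum TraceFieldSketch {
    ZXID(Long.class),
    SESSION_ID(Long.class),
    PATH(String.class),
    REQUEST_TYPE(Integer.class);

    final Class<?> type;

    TraceFieldSketch(Class<?> type) {
        this.type = type;
    }
}

// Message builder that enforces the schema at the point a value is set.
final class TypedTraceMessage {
    private final Map<TraceFieldSketch, Object> values = new EnumMap<>(TraceFieldSketch.class);

    TypedTraceMessage put(TraceFieldSketch field, Object value) {
        // isInstance is false for null, so null values are rejected too.
        if (!field.type.isInstance(value)) {
            throw new IllegalArgumentException(
                    field + " expects " + field.type.getSimpleName());
        }
        values.put(field, value);
        return this;
    }

    Object get(TraceFieldSketch field) {
        return values.get(field);
    }
}
```

A schema of this kind lets the receiving process decode fields without guessing their types, which matters when traces are forwarded to a downstream store.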