Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

should handle channel disconnected better in NettyHttpClient or Druid #3393

Closed
kaijianding opened this issue Aug 24, 2016 · 12 comments
Closed
Labels

Comments

@kaijianding
Copy link
Contributor

kaijianding commented Aug 24, 2016

I'm facing such a race condition:
Broker runs a query, finally calls the httpClient.go() in DirectDruidClient, the httpClient is instance of NettyHttpClient.
Inside the NettyHttpClient.go() method, after the channel is taken from the pool and verified in good shape and ready to fire the query, then the historical side reaches the idle time and disconnect this channel.

Then handler.channelDisconnected callback is called, and finally the query is failed in broker.
This exception occurs hundreds of times in our druid cluster (it is a busy cluster).

This channel never has a chance to actually connect to the historical to get a single byte back, thus ,I think, the NettyHttpClient.go() should retry internally one time to get a new channel to finish the "go" instead of set the retVal future to fail.

The same thing can happen when channel.write(httpRequest).addListener(...) in NettyHttpClient. the channel to write can be closed by historical server after it is taken from the pool channel, thus it also should be retried.

here is the trace:

2016-08-24T14:03:29,029 ERROR [qtp1369854401-202[timeseries_eternal_olap_click_da76287c-4eec-4c84-a8b8-2e12ea92aa05]] io.druid.server.QueryResource - Exception handling request: {class=io.druid.server.QueryResource, exceptionType=class com.metamx.common.RE, exceptionMessage=Failure getting results from[http://hdppic0101.et2.tbsite.net:8083/druid/v2/] because of [org.jboss.netty.channel.ChannelException: Channel disconnected], exception=com.metamx.common.RE: Failure getting results from[http://hdppic0101.et2.tbsite.net:8083/druid/v2/] because of [org.jboss.netty.channel.ChannelException: Channel disconnected], query=TimeseriesQuery{dataSource='eternal_olap_click', querySegmentSpec=LegacySegmentSpec{intervals=[2016-08-24T00:00:00.000+08:00/2016-08-24T13:54:00.000+08:00]}, descending=false, dimFilter=(scene_tag = sort_type^C_coefp && bu_src = taobao_topic && item_id = tm6048738 && exper_token_str = wl_topic_sort && action_type = click), granularity='PeriodGranularity{period=PT1M, timeZone=+08:00, origin=null}', aggregatorSpecs=[DoubleSumAggregatorFactory{fieldName='ipv', name='ipv'}], postAggregatorSpecs=[], context={queryId=da76287c-4eec-4c84-a8b8-2e12ea92aa05, timeout=30000}}, peer=10.197.16.234}
com.metamx.common.RE: Failure getting results from[http://hdppic0101.et2.tbsite.net:8083/druid/v2/] because of [org.jboss.netty.channel.ChannelException: Channel disconnected]
        at io.druid.client.DirectDruidClient$JsonParserIterator.init(DirectDruidClient.java:498) ~[druid-server-0.9.0.jar:0.9.0]
        at io.druid.client.DirectDruidClient$JsonParserIterator.hasNext(DirectDruidClient.java:442) ~[druid-server-0.9.0.jar:0.9.0]
        at com.metamx.common.guava.BaseSequence.makeYielder(BaseSequence.java:103) ~[java-util-0.27.7.jar:?]
        at com.metamx.common.guava.BaseSequence.toYielder(BaseSequence.java:81) ~[java-util-0.27.7.jar:?]
        at com.metamx.common.guava.MappedSequence.toYielder(MappedSequence.java:46) ~[java-util-0.27.7.jar:?]
        at com.metamx.common.guava.MergeSequence$2.accumulate(MergeSequence.java:66) ~[java-util-0.27.7.jar:?]
        at com.metamx.common.guava.MergeSequence$2.accumulate(MergeSequence.java:62) ~[java-util-0.27.7.jar:?]
        at com.metamx.common.guava.YieldingAccumulators$1.accumulate(YieldingAccumulators.java:32) ~[java-util-0.27.7.jar:?]
        at com.metamx.common.guava.BaseSequence.makeYielder(BaseSequence.java:104) ~[java-util-0.27.7.jar:?]
        at com.metamx.common.guava.BaseSequence.toYielder(BaseSequence.java:81) ~[java-util-0.27.7.jar:?]
        at com.metamx.common.guava.BaseSequence.accumulate(BaseSequence.java:67) ~[java-util-0.27.7.jar:?]
        at com.metamx.common.guava.MergeSequence.toYielder(MergeSequence.java:59) ~[java-util-0.27.7.jar:?]
        at com.metamx.common.guava.LazySequence.toYielder(LazySequence.java:43) ~[java-util-0.27.7.jar:?]
        at io.druid.query.RetryQueryRunner$1.toYielder(RetryQueryRunner.java:105) ~[druid-processing-0.9.0.jar:0.9.0]
        at io.druid.common.guava.CombiningSequence.toYielder(CombiningSequence.java:79) ~[druid-common-0.9.0.jar:0.9.0]
        at com.metamx.common.guava.MappedSequence.toYielder(MappedSequence.java:46) ~[java-util-0.27.7.jar:?]
        at io.druid.query.CPUTimeMetricQueryRunner$1.toYielder(CPUTimeMetricQueryRunner.java:93) ~[druid-processing-0.9.0.jar:0.9.0]
        at com.metamx.common.guava.Sequences$1.toYielder(Sequences.java:98) ~[java-util-0.27.7.jar:?]
        at io.druid.server.QueryResource.doPost(QueryResource.java:170) [druid-server-0.9.0.jar:0.9.0]
        at sun.reflect.GeneratedMethodAccessor52.invoke(Unknown Source) ~[?:?]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_66]
        at java.lang.reflect.Method.invoke(Method.java:497) ~[?:1.8.0_66]
        at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60) [jersey-server-1.19.jar:1.19]
        at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:205) [jersey-server-1.19.jar:1.19]
        at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75) [jersey-server-1.19.jar:1.19]
        at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302) [jersey-server-1.19.jar:1.19]
        at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) [jersey-server-1.19.jar:1.19]
        at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:205) [jersey-server-1.19.jar:1.19]
        at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75) [jersey-server-1.19.jar:1.19]
        at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302) [jersey-server-1.19.jar:1.19]
        at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) [jersey-server-1.19.jar:1.19]
        at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) [jersey-server-1.19.jar:1.19]
        at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84) [jersey-server-1.19.jar:1.19]
        at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1542) [jersey-server-1.19.jar:1.19]
        at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1473) [jersey-server-1.19.jar:1.19]
        at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1419) [jersey-server-1.19.jar:1.19]
        at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1409) [jersey-server-1.19.jar:1.19]
        at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409) [jersey-servlet-1.19.jar:1.19]
        at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558) [jersey-servlet-1.19.jar:1.19]
        at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:733) [jersey-servlet-1.19.jar:1.19]
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) [javax.servlet-api-3.1.0.jar:3.1.0]
        at com.google.inject.servlet.ServletDefinition.doServiceImpl(ServletDefinition.java:278) [guice-servlet-4.0-beta.jar:?]
        at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:268) [guice-servlet-4.0-beta.jar:?]
        at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:180) [guice-servlet-4.0-beta.jar:?]
        at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:93) [guice-servlet-4.0-beta.jar:?]
        at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:85) [guice-servlet-4.0-beta.jar:?]
        at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:120) [guice-servlet-4.0-beta.jar:?]
        at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:132) [guice-servlet-4.0-beta.jar:?]
        at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:129) [guice-servlet-4.0-beta.jar:?]
        at com.google.inject.servlet.GuiceFilter$Context.call(GuiceFilter.java:206) [guice-servlet-4.0-beta.jar:?]
        at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:129) [guice-servlet-4.0-beta.jar:?]
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1668) [jetty-servlet-9.3.6.v20151106.jar:9.3.6.v20151106]
        at org.eclipse.jetty.servlets.GzipFilter.doFilter(GzipFilter.java:45) [jetty-servlets-9.3.6.v20151106.jar:9.3.6.v20151106]
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1668) [jetty-servlet-9.3.6.v20151106.jar:9.3.6.v20151106]
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581) [jetty-servlet-9.3.6.v20151106.jar:9.3.6.v20151106]
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:224) [jetty-server-9.3.6.v20151106.jar:9.3.6.v20151106]
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1158) [jetty-server-9.3.6.v20151106.jar:9.3.6.v20151106]
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511) [jetty-servlet-9.3.6.v20151106.jar:9.3.6.v20151106]
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185) [jetty-server-9.3.6.v20151106.jar:9.3.6.v20151106]
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1090) [jetty-server-9.3.6.v20151106.jar:9.3.6.v20151106]
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) [jetty-server-9.3.6.v20151106.jar:9.3.6.v20151106]
        at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52) [jetty-server-9.3.6.v20151106.jar:9.3.6.v20151106]
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:119) [jetty-server-9.3.6.v20151106.jar:9.3.6.v20151106]
        at org.eclipse.jetty.server.Server.handle(Server.java:517) [jetty-server-9.3.6.v20151106.jar:9.3.6.v20151106]
        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:308) [jetty-server-9.3.6.v20151106.jar:9.3.6.v20151106]
        at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:242) [jetty-server-9.3.6.v20151106.jar:9.3.6.v20151106]
        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceAndRun(ExecuteProduceConsume.java:213) [jetty-util-9.3.6.v20151106.jar:9.3.6.v20151106]
        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:147) [jetty-util-9.3.6.v20151106.jar:9.3.6.v20151106]
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:654) [jetty-util-9.3.6.v20151106.jar:9.3.6.v20151106]
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572) [jetty-util-9.3.6.v20151106.jar:9.3.6.v20151106]
        at java.lang.Thread.run(Thread.java:756) [?:1.8.0_66]
Caused by: java.util.concurrent.ExecutionException: org.jboss.netty.channel.ChannelException: Channel disconnected
        at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299) ~[guava-16.0.1.jar:?]
        at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286) ~[guava-16.0.1.jar:?]
        at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) ~[guava-16.0.1.jar:?]
        at io.druid.client.DirectDruidClient$JsonParserIterator.init(DirectDruidClient.java:479) ~[druid-server-0.9.0.jar:0.9.0]
        ... 69 more
Caused by: org.jboss.netty.channel.ChannelException: Channel disconnected
        at com.metamx.http.client.NettyHttpClient$1.channelDisconnected(NettyHttpClient.java:311) ~[http-client-1.0.4.jar:?]
        at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:102) ~[netty-3.10.4.Final.jar:?]
        at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564) ~[netty-3.10.4.Final.jar:?]
        at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791) ~[netty-3.10.4.Final.jar:?]
        at org.jboss.netty.channel.SimpleChannelUpstreamHandler.channelDisconnected(SimpleChannelUpstreamHandler.java:208) ~[netty-3.10.4.Final.jar:?]
        at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:102) ~[netty-3.10.4.Final.jar:?]
        at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564) ~[netty-3.10.4.Final.jar:?]
        at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791) ~[netty-3.10.4.Final.jar:?]
        at org.jboss.netty.channel.SimpleChannelUpstreamHandler.channelDisconnected(SimpleChannelUpstreamHandler.java:208) ~[netty-3.10.4.Final.jar:?]
        at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:102) ~[netty-3.10.4.Final.jar:?]
        at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564) ~[netty-3.10.4.Final.jar:?]
        at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791) ~[netty-3.10.4.Final.jar:?]
        at org.jboss.netty.handler.codec.replay.ReplayingDecoder.cleanup(ReplayingDecoder.java:570) ~[netty-3.10.4.Final.jar:?]
        at org.jboss.netty.handler.codec.frame.FrameDecoder.channelDisconnected(FrameDecoder.java:365) ~[netty-3.10.4.Final.jar:?]
        at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:102) ~[netty-3.10.4.Final.jar:?]
        at org.jboss.netty.handler.codec.http.HttpClientCodec.handleUpstream(HttpClientCodec.java:92) ~[netty-3.10.4.Final.jar:?]
        at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564) ~[netty-3.10.4.Final.jar:?]
        at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559) ~[netty-3.10.4.Final.jar:?]
        at org.jboss.netty.channel.Channels.fireChannelDisconnected(Channels.java:396) ~[netty-3.10.4.Final.jar:?]
        at org.jboss.netty.channel.socket.nio.AbstractNioWorker.close(AbstractNioWorker.java:360) ~[netty-3.10.4.Final.jar:?]
        at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:93) ~[netty-3.10.4.Final.jar:?]
        at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108) ~[netty-3.10.4.Final.jar:?]
        at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337) ~[netty-3.10.4.Final.jar:?]
        at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89) ~[netty-3.10.4.Final.jar:?]
        at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178) ~[netty-3.10.4.Final.jar:?]
        at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) ~[netty-3.10.4.Final.jar:?]
        at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42) ~[netty-3.10.4.Final.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_66]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_66]
@drcrallen drcrallen added the Bug label Aug 24, 2016
@chainani
Copy link

chainani commented Oct 6, 2016

any resolution for this?

@villuminati
Copy link

facing this same problem

@b-slim
Copy link
Contributor

b-slim commented Nov 23, 2016

i am wondering what are the values of druid.broker.http.numConnections and druid.broker.http.readTimeout did you try to tweak it ?

@b-slim
Copy link
Contributor

b-slim commented Nov 23, 2016

@kaijianding by tweak i mean increase the values.

@kaijianding
Copy link
Contributor Author

druid.broker.http.numConnections=20
druid.broker.http.readTimeout=PT5M
druid.server.http.maxIdleTime=PT5m at historical side, and this value is druid's default value

I increase druid.server.http.maxIdleTime to P1D and this issue is rarely happen
@chainani @RishiMi @b-slim

@b-slim
Copy link
Contributor

b-slim commented Nov 23, 2016

@kaijianding i guess that solved this issue right ?

@kaijianding
Copy link
Contributor Author

yes, but it just avoid to trigger this issue happening frequently, it doesn't resolve it at the root, I think it's better to fix it at code level.

@sharks222
Copy link

sharks222 commented Sep 27, 2017

yes ,
I have set :druid.broker.http.readTimeout=PT20M and druid.server.http.maxIdleTime=PT20m,but it does not work , why ?

@himanshug
Copy link
Contributor

@kaijianding for this specific problem, i think better solution might be to optionally (configured via ResourcePoolConfig ) add an expiry on unused connection in https://github.com/metamx/http-client/blob/master/src/main/java/com/metamx/http/client/pool/ResourcePool.java#L46

@himanshug
Copy link
Contributor

we also saw it on some clusters, though it happens very very infrequently . @akashdw is working to implement #3393 (comment) . So, we'll set 1 min(configurable of course) pooling timeout (if no-one uses that connection in a min then its removed from the pool and discarded) so that near expiry connections are not given to DirectDruidClient.

@kaijianding
Copy link
Contributor Author

This solution sounds good @himanshug

@gianm
Copy link
Contributor

gianm commented Feb 23, 2018

This should be fixed with the java-util upgrade in http://github.com/druid-io/druid/pull/5239 (and was preserved in the ensuing merge of java-util into Druid in #5289).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

8 participants