Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prior PB operation timeout can trigger error log spam later [JIRA: RIAK-2111] #16

Open
slfritchie opened this issue Oct 23, 2012 · 11 comments
Milestone

Comments

@slfritchie
Copy link

If a prior PB operation times out, a late-arriving answer can trigger default clause handle_info() error log spam, e.g.

2012-10-21 01:44:14.250 [error] <0.31976.2002>@riak_api_pb_server:handle_info:170 Unrecognized message {51911707,{ok,{r_object,<<"bucket">>, ....

We've a customer who's had intra-cluster communication problems (including busy_dist_port warnings) where 40 or more of these messages would be logged per second.

@seancribbs
Copy link
Contributor

@slfritchie To me this would indicate that the timeout is not terminating the request FSM.

@beerriot
Copy link

Just noting that this is likely to eventually cause someone the same problems as seen by @joedevivo on basho/riak_test#79. Namely, this unhandled message may be received during an operation being handled by riak_kv_pb_bucket, which does not have a catch-all ignore case in its process_stream function.

@evanmcc
Copy link
Contributor

evanmcc commented Dec 10, 2012

The timeout doesn't terminate the request FSM, no.

It seems to me that you have to handle it anyway, though, since even if you're terminating it, there's a race there.

PR#20 doesn't address Bryan's comment.

@ixmatus
Copy link

ixmatus commented Jan 7, 2013

This same exact issue is happening to me; sample message:

[error] <0.20657.797>@riak_api_pb_server:handle_info:170 Unrecognized message {pipe_result,#Ref<0.0.842.47868>,0,[<<"ranktracker_summary_1499">>,<<"2013-01-02_248152">>]}

Followed by hundreds of thousands of lines of:

[error] <0.534.0>@riak_pipe_vnode:new_worker:766 Pipe worker startup failed:fitting was gone before startup

@evanmcc
Copy link
Contributor

evanmcc commented Jan 7, 2013

updating to 1.2.1 should help with the last one, but it's good to know there's another path there. Note that that's typically a symptom of sending too many MR jobs for your cluster to handle.

@ixmatus
Copy link

ixmatus commented Jan 7, 2013

My whole cluster has been upgraded to 1.2.1 as of three weeks ago...

@evanmcc
Copy link
Contributor

evanmcc commented Jan 7, 2013

My bad; the patch that I was thinking of will land in 1.3.

@ixmatus
Copy link

ixmatus commented Jan 8, 2013

On 1/7/13 5:13 PM, Evan Vigil-McClanahan wrote:

My bad; the patch that I was thinking of will land in 1.3.


Reply to this email directly or view it on GitHub
#16 (comment).

Okay, thanks for the info.

@seancribbs
Copy link
Contributor

This can still occur -- we have not addressed the root problem which is incomplete termination (and race-conditions around there) of request FSMs.

@seancribbs seancribbs added this to the 2.1 milestone Mar 24, 2014
@nerophon
Copy link

This is still occurring. The customer is using Riak 1.4.12 and Java client 1.4.4. Please see Zendesk ticket #11687 for more details. Example log output:

2015-08-18 02:22:01.882 UTC [error] <0.21975.339>@riak_api_pb_server:handle_info:141 Unrecognized message {8131652,ok}

@Basho-JIRA Basho-JIRA changed the title Prior PB operation timeout can trigger error log spam later Prior PB operation timeout can trigger error log spam later [JIRA: RIAK-2111] Aug 19, 2015
@point9repeating
Copy link

I'm not positive but I may have just experienced this with riak TS 1.3.1:

2016-08-05 21:33:01.492 [error] <0.7481.54>@riak_api_pb_server:handle_info:331 Unrecognized message {ok,{[<<"creation_time">>,<<"start_time">>,<<"sample_period">>,<<"length">>],[ti mestamp,timestamp,sint64,sint64],[[1467746383099,1467752400000,3600,195],[1467746744137,1467752400000,3600,195],[1467747603697,1467756000000,3600,194],[1467747891827,1467756000000,...]}}

We are getting a bunch of these while load testing our cluster to tune our configuration (specifically the value for "timeseries_max_concurrent_queries").

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants