Prior PB operation timeout can trigger error log spam later [JIRA: RIAK-2111] #16

slfritchie · 2012-10-23T17:56:31Z

If a prior PB operation times out, a late-arriving answer can trigger default clause handle_info() error log spam, e.g.

2012-10-21 01:44:14.250 [error] <0.31976.2002>@riak_api_pb_server:handle_info:170 Unrecognized message {51911707,{ok,{r_object,<<"bucket">>, ....

We've a customer who's had intra-cluster communication problems (including busy_dist_port warnings) where 40 or more of these messages would be logged per second.

The text was updated successfully, but these errors were encountered:

seancribbs · 2012-10-23T20:43:30Z

@slfritchie To me this would indicate that the timeout is not terminating the request FSM.

beerriot · 2012-10-24T18:06:26Z

Just noting that this is likely to eventually cause someone the same problems as seen by @joedevivo on basho/riak_test#79. Namely, this unhandled message may be received during an operation being handled by riak_kv_pb_bucket, which does not have a catch-all ignore case in its process_stream function.

evanmcc · 2012-12-10T23:03:42Z

The timeout doesn't terminate the request FSM, no.

It seems to me that you have to handle it anyway, though, since even if you're terminating it, there's a race there.

PR#20 doesn't address Bryan's comment.

ixmatus · 2013-01-07T22:52:59Z

This same exact issue is happening to me; sample message:

[error] <0.20657.797>@riak_api_pb_server:handle_info:170 Unrecognized message {pipe_result,#Ref<0.0.842.47868>,0,[<<"ranktracker_summary_1499">>,<<"2013-01-02_248152">>]}

Followed by hundreds of thousands of lines of:

[error] <0.534.0>@riak_pipe_vnode:new_worker:766 Pipe worker startup failed:fitting was gone before startup

evanmcc · 2013-01-07T22:59:46Z

updating to 1.2.1 should help with the last one, but it's good to know there's another path there. Note that that's typically a symptom of sending too many MR jobs for your cluster to handle.

ixmatus · 2013-01-07T23:12:32Z

My whole cluster has been upgraded to 1.2.1 as of three weeks ago...

evanmcc · 2013-01-07T23:13:25Z

My bad; the patch that I was thinking of will land in 1.3.

ixmatus · 2013-01-08T19:22:33Z

On 1/7/13 5:13 PM, Evan Vigil-McClanahan wrote:

My bad; the patch that I was thinking of will land in 1.3.

—
Reply to this email directly or view it on GitHub
#16 (comment).

Okay, thanks for the info.

seancribbs · 2014-03-24T18:56:57Z

This can still occur -- we have not addressed the root problem which is incomplete termination (and race-conditions around there) of request FSMs.

nerophon · 2015-08-19T11:51:20Z

This is still occurring. The customer is using Riak 1.4.12 and Java client 1.4.4. Please see Zendesk ticket #11687 for more details. Example log output:

2015-08-18 02:22:01.882 UTC [error] <0.21975.339>@riak_api_pb_server:handle_info:141 Unrecognized message {8131652,ok}

point9repeating · 2016-08-05T22:01:17Z

I'm not positive but I may have just experienced this with riak TS 1.3.1:

2016-08-05 21:33:01.492 [error] <0.7481.54>@riak_api_pb_server:handle_info:331 Unrecognized message {ok,{[<<"creation_time">>,<<"start_time">>,<<"sample_period">>,<<"length">>],[ti mestamp,timestamp,sint64,sint64],[[1467746383099,1467752400000,3600,195],[1467746744137,1467752400000,3600,195],[1467747603697,1467756000000,3600,194],[1467747891827,1467756000000,...]}}

We are getting a bunch of these while load testing our cluster to tune our configuration (specifically the value for "timeseries_max_concurrent_queries").

evanmcc mentioned this issue Dec 10, 2012

Late vnode replies show up as errors in the logs #20

Closed

seancribbs added this to the 2.1 milestone Mar 24, 2014

Basho-JIRA changed the title ~~Prior PB operation timeout can trigger error log spam later~~ Prior PB operation timeout can trigger error log spam later [JIRA: RIAK-2111] Aug 19, 2015

Basho-JIRA added the JIRA: To Do label Aug 19, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Prior PB operation timeout can trigger error log spam later [JIRA: RIAK-2111] #16

Prior PB operation timeout can trigger error log spam later [JIRA: RIAK-2111] #16

slfritchie commented Oct 23, 2012

seancribbs commented Oct 23, 2012

beerriot commented Oct 24, 2012

evanmcc commented Dec 10, 2012

ixmatus commented Jan 7, 2013

evanmcc commented Jan 7, 2013

ixmatus commented Jan 7, 2013

evanmcc commented Jan 7, 2013

ixmatus commented Jan 8, 2013

seancribbs commented Mar 24, 2014

nerophon commented Aug 19, 2015

point9repeating commented Aug 5, 2016

Prior PB operation timeout can trigger error log spam later [JIRA: RIAK-2111] #16

Prior PB operation timeout can trigger error log spam later [JIRA: RIAK-2111] #16

Comments

slfritchie commented Oct 23, 2012

seancribbs commented Oct 23, 2012

beerriot commented Oct 24, 2012

evanmcc commented Dec 10, 2012

ixmatus commented Jan 7, 2013

evanmcc commented Jan 7, 2013

ixmatus commented Jan 7, 2013

evanmcc commented Jan 7, 2013

ixmatus commented Jan 8, 2013

seancribbs commented Mar 24, 2014

nerophon commented Aug 19, 2015

point9repeating commented Aug 5, 2016