Stack error reasons on block retrieving failure [JIRA: RCS-223] #1177

kuenishi · 2015-07-01T08:27:59Z

No description provided.

kuenishi · 2015-07-01T08:29:42Z

src/riak_cs_block_server.erl

-    Sorry = {error, notfound},
+             UUID, BlockNumber, _RcPid, MaxRetries, ErrorReasons)
+  when is_list(ErrorReasons) andalso length(ErrorReasons) > MaxRetries ->
+    Sorry = {error, hd(ErrorReasons)},


I wonder which is better: returning the whole list of reasons or just its head. riak_cs_get_fsm handles this Reason only for logging, so maybe this should be the list of errors rather than just the head.

kuenishi · 2015-07-01T08:31:24Z

I also wonder about the right way to stack all the errors so that each one records its source, such as a local get or a remote get.

shino · 2015-07-09T07:16:53Z

TL;DR: In my humble opinion, returning all stacked errors to the caller is
the simple answer, unless a clever triage-and-prioritization filter
function is implemented.

It's a difficult question: which error is the most significant? ;)

There are three kinds of "get" in the block server (excluding the legacy
n_val_one=false branch):

[Lo] local get with N=one
[La] local get with N=all
[Ra] remote get with N=all (proxy get)

Possible usual error cases and their triage:

insufficient_vnodes in Lo: not informative (e.g. just a single node down)
notfound in Lo: not very informative (e.g. replica deficit in the first primary)
notfound in La when proxy-get enabled: probably not informative (e.g. not replicated yet)
notfound in La when proxy-get disabled: significant (e.g. block data loss)
notfound in Ra: significant (e.g. block data loss)
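
The triage above could be written down as a small lookup function. This is only a sketch of the idea, not code from this PR: the module name, the function name and the return atoms are hypothetical, and the tags local_one, local_all and remote_all are assumed to stand for Lo, La and Ra.

```erlang
-module(block_error_triage).
-export([significance/3]).

%% significance(Kind, Reason, ProxyGetEnabled) ->
%%     significant | not_informative | unknown
%% Kind is an assumed tag: local_one (Lo), local_all (La), remote_all (Ra).
%% Hypothetical helper; it only encodes the cases listed above.
significance(local_one, {insufficient_vnodes, _, need, _}, _Proxy) ->
    not_informative;            % e.g. just a single node down
significance(local_one, notfound, _Proxy) ->
    not_informative;            % e.g. replica deficit in the first primary
significance(local_all, notfound, true) ->
    not_informative;            % e.g. not replicated yet
significance(local_all, notfound, false) ->
    significant;                % e.g. block data loss
significance(remote_all, notfound, _Proxy) ->
    significant;                % e.g. block data loss
significance(_Kind, _Reason, _Proxy) ->
    unknown.                    % anything not covered by the list above
```

Cases that the list does not mention (for example timeout versus disconnected) deliberately fall through to unknown rather than guessing a rank.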

Some other considerations:

timeout is more significant than disconnected.
If insufficient_vnodes errors are resolved within the retry intervals,
they can be ignored because they are only temporary.

> I also wonder right way to stack all errors, to add source of errors
> like from local get or remote.

Adding classification sounds nice :) e.g.

{error, [{local_one, {error, {insufficient_vnodes,0,need,1}}},
         {local_all, {error, timeout}},
         {local_all, {error, disconnected}},
         {local_all, {error, disconnected}}]}

{error, [{local_one, {error, notfound}},
         {local_all, {error, notfound}},
         {remote_all, {error, timeout}}]}.
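
A minimal sketch of how such a classified stack could be built, assuming the tags from the examples above. The module and function names are hypothetical and are not part of riak_cs_block_server.

```erlang
-module(block_error_stack).
-export([push/3, to_reply/1]).

%% Hypothetical helper: prepend one tagged error to the stack.
%% Source is assumed to be local_one | local_all | remote_all.
push(Source, Reason, Stack) when is_atom(Source), is_list(Stack) ->
    [{Source, {error, Reason}} | Stack].

%% Errors are pushed newest-first, so reverse the stack to report
%% them in the order in which they occurred.
to_reply(Stack) when is_list(Stack) ->
    {error, lists:reverse(Stack)}.
```

For example, pushing {local_one, notfound}, then {local_all, notfound}, then {remote_all, timeout} and calling to_reply/1 gives the second term shown above.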

kuenishi · 2015-07-13T04:39:59Z

Added three indicators: local_get, local_quorum and remote_quorum. Ready again ^^;

shino · 2015-07-13T06:07:59Z

src/riak_cs_block_server.erl

        {error, notfound} ->
-            RetryFun(failure);
+            RetryFun([{local_quorum, notfound}|ErrorReasons]);


This branch indicates that

the get failed with notfound using the default N (usually 3), and

proxy get is disabled or not used (because the local cluster is the origin).

There is no strong reason to expect a retry to succeed after this error.
The original "immediately fail" behaviour seems reasonable.

My intention was not to change the original behaviour. If we do change it, shouldn't that be another pull request?

The behaviour is changed.

The original calls RetryFun with the atom failure, and the original
do_get_block/11 was implemented as follows:

do_get_block(ReplyPid, _Bucket, _Key, _ClusterID, _UseProxyGet, _ProxyActive,
             UUID, BlockNumber, _RcPid, MaxRetries, NumRetries)
  when is_atom(NumRetries) orelse NumRetries > MaxRetries ->
    Sorry = {error, notfound},

so there is no retry, just an immediate failure.

kuenishi · 2015-07-13T07:32:35Z

Updated. It looks like some riak_test cases are passing.

shino · 2015-07-13T08:03:11Z

src/riak_cs_block_server.erl

-                {error, _} ->
-                    RetryFun(NumRetries + 1)
+                {error, Reason} ->
+                    RetryFun([{remote_quorum, Reason}|ErrorReasons])


{local_quorum, notfound} is missing in the stack.

kuenishi · 2015-07-13T08:07:26Z

"May the force-push be with you." "Use the force-push, Luke!"

shino · 2015-07-13T08:27:29Z

This PR improves visibility into block object fetch failures 👍 🎢 📇

Stack error reasons on block retrieving failure [JIRA: RCS-223] Reviewed-by: shino

shino · 2015-07-13T08:50:41Z

@borshop merge

Basho-JIRA · 2015-07-14T00:56:45Z

For release note:

Retrieving blocks involves complex logic to resolve each block when a client requests a GET. First, CS tries to retrieve the block with n_val=1; if that fails, it retries with n_val=3. If the block cannot be resolved locally, and proxy_get is enabled and the system is configured with datacenter replication, Riak CS tries to perform a proxied get to the remote site. This fallback and retry logic is complex and hard to trace, especially in a faulty or unstable situation. This change improves error tracing of the whole sequence described above and will help diagnose issues. Specifically, for each block, the block server stacks all errors returned from the Riak client and reports every error reason as well as the type of call in which the error occurred.
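
The sequence described in this note can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual riak_cs_block_server code: the module name and fetch/2 are hypothetical, retries and intervals are left out, and Get is a caller-supplied fun standing in for the real Riak client calls. Only the tags local_get, local_quorum and remote_quorum come from this pull request.

```erlang
-module(block_fetch_sketch).
-export([fetch/2]).

%% Get :: fun((Kind) -> {ok, Block} | {error, Reason}), a stand-in for
%% the Riak client (an assumption of this sketch).
%% ProxyGet :: boolean(), true when proxy_get is enabled.
fetch(Get, ProxyGet) when is_function(Get, 1), is_boolean(ProxyGet) ->
    try_steps(Get, steps(ProxyGet), []).

%% n_val=1 first, then the default n_val, then the remote site if allowed.
steps(true)  -> [local_get, local_quorum, remote_quorum];
steps(false) -> [local_get, local_quorum].

%% Every step failed: report all reasons, oldest first.
try_steps(_Get, [], Errors) ->
    {error, lists:reverse(Errors)};
try_steps(Get, [Step | Rest], Errors) ->
    case Get(Step) of
        {ok, Block} ->
            {ok, Block};
        {error, Reason} ->
            %% Stack the reason together with the type of call that failed.
            try_steps(Get, Rest, [{Step, Reason} | Errors])
    end.
```

With a Get fun that always fails, fetch/2 returns every reason tagged with the call type in which it occurred, which is the kind of report this change is meant to provide.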
PR: #1177

_[posted via JIRA by Kota Uenishi]_

Stack error reasons on block retrieving failure

0b93323

kuenishi reviewed Jul 1, 2015
Basho-JIRA changed the title ~~Stack error reasons on block retrieving failure~~ Stack error reasons on block retrieving failure [JIRA: RCS-223] Jul 10, 2015

Basho-JIRA added the JIRA: To Do label Jul 10, 2015

kuenishi added this to the 2.1.0 milestone Jul 10, 2015

Add indicator on where getting a block failed

24fbdc1

shino reviewed Jul 13, 2015
shino mentioned this pull request Jul 13, 2015

Improve/fix Block server retry logic [JIRA: RCS-235] #1183

Open

shino reviewed Jul 13, 2015
kuenishi force-pushed the feature/logging-block-server-errors branch from b50b8ce to 51374e9 Compare July 13, 2015 08:06

Keep original block server behaviour

51374e9

borshop added a commit that referenced this pull request Jul 13, 2015

Merge pull request #1177 from basho/feature/logging-block-server-errors

24eba52

Stack error reasons on block retrieving failure [JIRA: RCS-223] Reviewed-by: shino

borshop merged commit 51374e9 into develop Jul 13, 2015

kuenishi deleted the feature/logging-block-server-errors branch July 13, 2015 09:01

Basho-JIRA added JIRA: In Progress JIRA: Done and removed JIRA: To Do JIRA: In Progress labels Jul 14, 2015

Basho-JIRA assigned kuenishi Jul 14, 2015

Basho-JIRA added the JIRA: Closed label Jul 14, 2015
