Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stack error reasons on block retrieving failure [JIRA: RCS-223] #1177

Merged
merged 3 commits into from
Jul 13, 2015

Conversation

kuenishi
Copy link
Contributor

@kuenishi kuenishi commented Jul 1, 2015

No description provided.

Sorry = {error, notfound},
UUID, BlockNumber, _RcPid, MaxRetries, ErrorReasons)
when is_list(ErrorReasons) andalso length(ErrorReasons) > MaxRetries ->
Sorry = {error, hd(ErrorReasons)},
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder returning whole reasons or just head of this, which is better ... riak_cs_get_fsm handles this Reason just for logging. So maybe this should be rather a list of errors than the head of errors.

@kuenishi
Copy link
Contributor Author

kuenishi commented Jul 1, 2015

I also wonder right way to stack all errors, to add source of errors like from local get or remote.

@shino
Copy link
Contributor

shino commented Jul 9, 2015

TL;DR; In my humble opinion, return all stacked errors to caller is
simple answer unless clever triage-and-prioritization filter
function is implemented.


It's difficult question: which error is most significant ;)

There are three kinds of "get" in block server (excluding legacy
n_val_one=false branch.)

  • [Lo] local get with N=one
  • [La] local get with N=all
  • [Ra] remote get with N=all (proxy get)

Possible usual error cases and triage of them:

  1. insufficent_vnodes in Lo: not informative (e.g just single node down)
  2. notfound in Lo: not much informative (e.g. replica dificit in first primary)
  3. notfound in La when proxy-get enabled: probably not informative
    (e.g. not replicated yet)
  4. notfound in La when proxy-get disabled: significant (e.g. block data loss)
  5. notfound in Ra: significant (e.g. block data loss)

Some other considerations:

  • timeout is more significant than disconnected.
  • If insufficient_vnodes errors are resolved in retry intervals,
    they can be ignored because they are just temporary.

I also wonder right way to stack all errors, to add source of errors
like from local get or remote.

Adding classification sounds nice :) e.g.

{error, [{local_one, {error, {insufficient_vnodes,0,need,1}}},
         {local_all, {error, timeout}},
         {local_all, {error, disconnected}},
         {local_all, {error, disconnected}}]}
{error, [{local_one, {error, notfound}},
         {local_all, {error, notfound}},
         {remote_all, {error, timeout}}]}.

@Basho-JIRA Basho-JIRA changed the title Stack error reasons on block retrieving failure Stack error reasons on block retrieving failure [JIRA: RCS-223] Jul 10, 2015
@kuenishi kuenishi added this to the 2.1.0 milestone Jul 10, 2015
@kuenishi
Copy link
Contributor Author

Added three indicator: local_get, local_quorum and remote_quorum . Ready again ^^;

{error, notfound} ->
RetryFun(failure);
RetryFun([{local_quorum, notfound}|ErrorReasons]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This branch indicates

  • get object failed by notfound with default N (usually 3) and
  • proxy get is disabled or not-used (because local cluster is origin).

There is no strong reason that retry make the operation succeed after this error.
Original "immediately fail" seems reasonable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intention was not to change original behaviour. If we change original behaviour, it'd be another pull request?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behaivior is changed.

Original will call RetryFun with the atom failure and original
do_get_block/11 was implemented as follows:

do_get_block(ReplyPid, _Bucket, _Key, _ClusterID, _UseProxyGet, _ProxyActive,
             UUID, BlockNumber, _RcPid, MaxRetries, NumRetries)
  when is_atom(NumRetries) orelse NumRetries > MaxRetries ->
    Sorry = {error, notfound},

then there is no retry, immediate failure.

@kuenishi
Copy link
Contributor Author

Updated. Looks like some riak_tests passing.

{error, _} ->
RetryFun(NumRetries + 1)
{error, Reason} ->
RetryFun([{remote_quorum, Reason}|ErrorReasons])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{local_quorum, notfound} is missing in the stack.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops

@kuenishi kuenishi force-pushed the feature/logging-block-server-errors branch from b50b8ce to 51374e9 Compare July 13, 2015 08:06
@kuenishi
Copy link
Contributor Author

"May the force-push be with you." "Use the force-push, Luke!"

@shino
Copy link
Contributor

shino commented Jul 13, 2015

This PR improves visibility inside block object fetch failures 👍 🎢 📇

borshop added a commit that referenced this pull request Jul 13, 2015
Stack error reasons on block retrieving failure [JIRA: RCS-223]

Reviewed-by: shino
@shino
Copy link
Contributor

shino commented Jul 13, 2015

@borshop merge

@borshop borshop merged commit 51374e9 into develop Jul 13, 2015
@kuenishi kuenishi deleted the feature/logging-block-server-errors branch July 13, 2015 09:01
@Basho-JIRA
Copy link

For release note:

On retrieving blocks there is a complex logic to resolve blocks when GET is requested from client. First, CS tries to retrieve a block with n_val=1 and if it fails, retry will be done in n_val=3. If the block cannot be resolved locally and proxy_get is enabled, the system is configured with datacenter replication. Thus Riak CS tries to perform proxied get to remote site. These fallback and retry logic and complex and hard to trace, especially in a faulty or unstable situation. This change improves error tracing of the whole sequence described above and will help diagnose issues. Specifically, for each block, block server stacks all errors returned from Riak client and reports every error reason as well as the type of call in which the error occurred.
PR: #1177

_[posted via JIRA by Kota Uenishi]_

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants