Timeout in get user with strong option makes subsequent weak get fail [JIRA: RCS-250] #1201

Closed
shino opened this issue Jul 29, 2015 · 2 comments

shino commented Jul 29, 2015

Symptom: a single slow node may cause the user fetch to fail.

  • Getting a CS user takes two steps. The first step uses PR=all,
    so a single slow Riak node can cause a timeout error at the client.
  • When a timeout occurs in riakc_pb_socket, it disconnects the TCP connection
    and goes into a wait-and-retry loop.
  • The second step of the CS user get then runs with the weak option, but the
    reconnect has likely not happened yet, so it fails with a disconnected error.

If the "slow" node is completely frozen (it produces no response at all),
then after the health check timeout the strong get fails with "insufficient vnodes"
and the weak get should work. In this case, certain users cannot
access Riak CS for a finite period, 60 seconds by default.


Reproduction (or simulation)

  • Create a 4-node cluster (setting {get_user_timeout, 3000} in advanced.config may help)

  • Note the pid of dev2

    DEV2=`ps aux | grep riak_ee | grep dev2 | grep beam.smp | awk '{print $2;}'`
    
  • Freeze it: kill -s SIGSTOP $DEV2 (keep your fingers crossed; if you are unlucky, freeze another node 🙉)

  • Do any access,
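
The freeze step above relies on SIGSTOP, which suspends a process without killing it; SIGCONT resumes it afterwards. As a minimal sketch of that mechanism, the following uses a stand-in `sleep` process instead of a Riak node (the stand-in, the `state` helper, and the variable names are illustrative assumptions, not part of the original report), and assumes a `ps` that supports `-o stat=` and `-p`:

```shell
#!/bin/sh
# Stand-in for a Riak node: any long-running process works for this demo.
sleep 60 &
PID=$!

# Print the first letter of the process state (S = sleeping, T = stopped).
state() { ps -o stat= -p "$1" | tr -d ' ' | cut -c1; }

kill -s STOP "$PID"   # freeze: the process stays alive but does no work
sleep 1
FROZEN=$(state "$PID")

kill -s CONT "$PID"   # thaw: the process resumes where it left off
sleep 1
RESUMED=$(state "$PID")

echo "frozen=$FROZEN resumed=$RESUMED"

# Clean up the stand-in process.
kill "$PID"
wait "$PID" 2>/dev/null || true
```

After the experiment, `kill -s SIGCONT $DEV2` should let the frozen node continue in the same way.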

@Basho-JIRA changed the title from "Timeout in get user with strong option makes subsequent weak get fail" to "Timeout in get user with strong option makes subsequent weak get fail [JIRA: RCS-250]" Jul 29, 2015
@shino shino added this to the 2.1.0 milestone Sep 7, 2015

shino commented Sep 7, 2015

For the release notes, short version: Improve user object fetch logic when some nodes are slow or have silently failed.
For the longer version, please refer to the description of this issue.

shino commented Sep 7, 2015

Addressed by #1230
