Eas fix wm 2i timeout #644

Merged
merged 2 commits into 1.4 from eas-fix-wm-2i-timeout on Aug 29, 2013

Conversation

@engelsanchez
Contributor

engelsanchez commented Aug 28, 2013

When the webmachine process handling a 2i request timed out, it returned an
error message, but messages from the index FSM could still arrive at that
process later, in a different context (inside mochiweb), causing it to send
error data to the client, which by that point could be sending another request
down the same connection. That unrelated request would then fail with a 400
"unexpected data" response. This fix uses a big hammer: on a timeout in the wm
process, it tries to kill the index FSM and consume any messages sent from it
before returning a timeout error message to the client.
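
Roughly, the cleanup on timeout looks like this (a simplified sketch with placeholder names, not the exact resource code in this PR):

    %% Illustrative sketch of the "big hammer": kill the index FSM, wait
    %% until it is gone, then drain any of its messages from our mailbox.
    %% FSMPid and ReqId are placeholders for the FSM pid and the request id
    %% its result messages are tagged with.
    kill_and_drain(FSMPid, ReqId) ->
        exit(FSMPid, kill),
        wait_for_death(FSMPid),
        drain_messages(ReqId).

    wait_for_death(Pid) ->
        case is_process_alive(Pid) of
            true  -> timer:sleep(1), wait_for_death(Pid);
            false -> ok
        end.

    drain_messages(ReqId) ->
        receive
            {ReqId, _} -> drain_messages(ReqId)
        after 0 -> ok
        end.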

Fix 2i timeout messing with next request

@engelsanchez engelsanchez referenced this pull request in basho/riak_test Aug 28, 2013

Merged

Eas 1.4.2 2i timeout fix #369

@engelsanchez
Contributor

engelsanchez commented Aug 28, 2013

Testing notes: it would be good to make sure all 2i tests run without problems (verify_2i_timeout, secondary_index_tests, verify_2i_returnterms, verify_2i_limit... I think that's it). Be warned that the problem occurred only once every few tries.

Possible things to watch out for:

  • Killing the FSM should be OK if all processes talking to it, or depending on it to exit, are monitoring it, which we should be as far as I know.
  • Killing a process will probably generate crash messages in the log. At this point I think that's an acceptable level of noise.
  • I am not sure whether killing a process, checking that it is no longer alive, and then consuming all of its messages is free of race conditions: could a message still land in the queue after erlang:is_process_alive returns false? I'm not 100% sure, but this should be better than no protection at all.
@engelsanchez
Contributor

engelsanchez commented Aug 28, 2013

Also, you'll need riak_test#369 to test. Without that PR you may see different error messages, because the request sometimes returns a 500 and sometimes a 503 depending on the type of timeout, and the 2i returnterms test could fail when generating an index request URL.

@seancribbs
Contributor

seancribbs commented Aug 28, 2013

The timeout-related 400 can also be reproduced reliably with basho/riak-python-client#272. To run just that test, use python -m unittest riak.tests.test_all.RiakHttpTransportTestCase.test_index_timeout.

@engelsanchez
Contributor

engelsanchez commented Aug 28, 2013

Thanks @seancribbs, that should be useful. I'll try it.

@seancribbs
Contributor

seancribbs commented Aug 28, 2013

@engelsanchez Let me know if you need help setting up the tests.

@engelsanchez
Contributor

engelsanchez commented Aug 28, 2013

While running the test_index_timeout python test in the sdc-2i-timeout branch, I'm getting something unexpected. Is this what you were seeing? The client is not raising an exception with timeout=1.

FAIL: test_index_timeout (riak.tests.test_all.RiakHttpTransportTestCase)

Traceback (most recent call last):
  File "riak/tests/test_2i.py", line 422, in test_index_timeout
    bucket.get_index('field1_bin', 'val1', timeout=1)
AssertionError: RiakError not raised

@seancribbs
Contributor

seancribbs commented Aug 28, 2013

@engelsanchez Yes, and when running wireshark, you can see the 400 go onto the wire between the original response and the client's next request.

@engelsanchez
Contributor

engelsanchez commented Aug 28, 2013

After debugging the test with my branch, my conclusion is that the test is not working: with a 1ms timeout, results can still be returned quite easily. And for the record, the Python test issues a non-streaming request, which is a different code path from the one touched in this PR.

@beerriot
Contributor

beerriot commented Aug 29, 2013

The method here looks sound to me. I spent a while verifying that it's a good idea by creating a minimal PULSE test:

https://gist.github.com/beerriot/02d6ddf725a9cc7db720

One thing I wanted to check specifically was whether the wait for is_process_alive/1 to return false was necessary. I believe the test shows that it is. As written in the gist, the test without the wait fails, but the test with the wait passes. Uncommenting the extra documented receive line causes the non-waiting test to pass. I'm not sure why that's the case, but it seems safest to leave the wait in for this "hammer".

The safe alternative to this busy-wait loop would be to attach a monitor to FSMPid and wait for the 'DOWN' message.
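
Roughly, the monitor-based version I have in mind (just a sketch, with FSMPid standing in for the index FSM pid, not code from this PR):

    %% Kill the FSM and block until its 'DOWN' notification arrives. Any
    %% messages the FSM sent before dying are delivered ahead of the
    %% 'DOWN', so draining the mailbox afterwards should not race with it.
    kill_and_wait(FSMPid) ->
        MRef = erlang:monitor(process, FSMPid),
        exit(FSMPid, kill),
        receive
            {'DOWN', MRef, process, FSMPid, _Reason} -> ok
        end.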

@engelsanchez
Contributor

engelsanchez commented Aug 29, 2013

Oh FFS! Why didn't I think of that! Of course that's better :(. Shame...

@engelsanchez
Contributor

engelsanchez commented Aug 29, 2013

I feel tempted to retry with the monitor (duh!), but given that it's late in the day I wanted to check with @jaredmorrow and @jonmeredith to see if anybody is around to take it as is and do another run. If not, I'd like to try the elegant fix, but not if it holds back the release.

Replace busy wait with monitor check
My dumbass coded the kill + busy wait loop for no good reason. Setting
up a monitor, killing the process and waiting for the 'DOWN' message is
the obvious elegant Erlangy way to kill and wait for a process to die.
@engelsanchez
Contributor

engelsanchez commented Aug 29, 2013

Well, I've replaced my busy wait with the proper monitor + 'DOWN' message detection technique that Bryan pointed out. Maybe tomorrow you or I can run a proper PULSE test on it, @beerriot. I've run the riak test 10 times to verify; I need a bit more time to set up PULSE again.

@beerriot
Contributor

beerriot commented Aug 29, 2013

I've updated my pulse gist with a version that uses monitor like your latest commit. It passes perfectly.

So, +1 to this PR, then.

engelsanchez added a commit that referenced this pull request Aug 29, 2013

Merge pull request #644 from basho/eas-fix-wm-2i-timeout
Fix wm 2i timeout causing next request to fail

@engelsanchez engelsanchez merged commit 61ac9d8 into 1.4 Aug 29, 2013

1 check failed (default): The Travis CI build failed

@engelsanchez engelsanchez deleted the eas-fix-wm-2i-timeout branch Aug 29, 2013
