
Handle socket error by catching badmatch #1280

Merged 6 commits on Jan 14, 2016

Conversation

@kuenishi (Contributor):

When a client cancelled or stopped uploading an object midway, the
TCP socket returned an error when mochiweb tried to read the payload.
This had been throwing a badmatch, and the request simply failed,
leaving behind the already uploaded chunks and the written manifests.

This commit adds a try-catch clause that catches the badmatch error
thrown by mochiweb and calls amendment code that moves the affected
manifest to the GC bucket and marks it as scheduled_delete.

A test is included in regression_tests.erl, where a socket error is
emulated by closing the client-side socket in the middle of an upload.

Partial solution for #770.
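
In outline, the handling looks like the sketch below. This is not the exact riak_cs code; the read call and the cleanup helper name are assumptions, but it shows the shape of catching the badmatch from a failed socket read and turning it into manifest cleanup.

read_body_or_cleanup(Req, ManiPid, UUID) ->
    try
        %% Assumed mochiweb read call; a client-side close makes the
        %% underlying recv return {error, closed}, which surfaces as a
        %% badmatch inside mochiweb.
        mochiweb_request:recv_body(Req)
    catch
        error:{badmatch, {error, _} = SockError} ->
            %% Assumed helper: move the half-written manifest to the
            %% GC bucket and mark it scheduled_delete.
            ok = move_manifest_to_gc_bucket(ManiPid, UUID),
            SockError
    end.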

@kuenishi kuenishi added this to the 2.1.1 milestone Dec 15, 2015
@kuenishi kuenishi force-pushed the bugfix/cleanup-manifests branch 2 times, most recently from bf494c9 to 3294b01 on December 22, 2015 10:04
@kuenishi kuenishi force-pushed the bugfix/cleanup-manifests branch 2 times, most recently from 90f1346 to 788c6b8 on January 6, 2016 05:14
@kuenishi kuenishi changed the title from "[WIP] Handle socket error by catching badmatch" to "Handle socket error by catching badmatch" on Jan 6, 2016

The updated commit message adds a note: there is another possibility
of a cancelled upload in multipart upload, where a part upload can be
cancelled and left as it was; such parts are collected when the
multipart upload is aborted or completed.

{ok, Sock} = gen_tcp:connect("127.0.0.1", 15018, [{active, false}]),
FirstLine = io_lib:format("PUT /~s HTTP/1.1", [K]),
Binary = binary:copy(<<"*">>, Actual),
ReqHdr = [FirstLine, $\n, "Host: ", Host, $\n, Auth, $\n,

Contributor:

CR+LF is more common for line break in HTTP?

kuenishi (Contributor, Author):

As this is test code with a dumb client, I didn't care much about that.
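
For reference, a CRLF-terminated version would only change the separators. A sketch reusing the bindings K, Host, and Auth from the excerpt above:

CRLF = "\r\n",
FirstLine = io_lib:format("PUT /~s HTTP/1.1", [K]),
ReqHdr = [FirstLine, CRLF,
          "Host: ", Host, CRLF,
          Auth, CRLF],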

@@ -145,6 +147,85 @@ verify_cs512(UserConfig, BucketName) ->
assert_notfound(UserConfig,BucketName),
ok.

verify_cs770({UserConfig, {RiakNodes, _, _}}, BucketName) ->

Contributor:

Nice test case 👍

@shino (Contributor) commented Jan 8, 2016:

Interruptions for normal upload and copy upload work fine 👍
For copy upload of an MP part, a client-side socket close produced an error log, which does no harm:

On the client side, execute the command below and stop it with Ctrl-C:

S3CURL=.s3curl.15018.alice s3curl.pl --id cs -- -X PUT -x 127.0.0.1:15018 -v -H 'Content-length: 0' -H 'x-amz-copy-source: /test/rand.100MB' -s  http://test.s3.amazonaws.com/mp'?uploadId=XelIvUl3T5uPP_Gnnj2b3A%3D%3D&partNumber=3'

Error log:

15:07:15.769 [error] gen_fsm <0.4528.0> in state not_full terminated with reason: no such process or port in call to gen_fsm:sync_send_event(undefined, {gc_specific_manifest,<<199,130,21,21,106,2,74,79,178,198,229,149,37,33,59,250>>}, infinity)
15:07:15.771 [error] CRASH REPORT Process <0.4528.0> with 1 neighbours exited with reason: no such process or port in call to gen_fsm:sync_send_event(undefined, {gc_specific_manifest,<<199,130,21,21,106,2,74,79,178,198,229,149,37,33,59,250>>}, infinity) in gen_fsm:terminate/7 line 622

@shino (Contributor) commented Jan 8, 2016:

Maybe the above error is caused when the put fsm does not have a manifest fsm in its state.
For multipart-related requests, the put fsm does not spawn a manifest fsm, as in:

maybe_riak_cs_manifest_fsm_start_link(false, _Bucket, _Key, _RcPid) ->
    {ok, undefined};
maybe_riak_cs_manifest_fsm_start_link(true, Bucket, Key, RcPid) ->
    riak_cs_manifest_fsm:start_link(Bucket, Key, RcPid).
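
If that is the cause, one defensive shape (a hypothetical sketch, not necessarily the fix applied here) is to skip the GC call when no manifest fsm was started:

%% Hypothetical guard: multipart requests leave the manifest fsm pid
%% undefined, and gen_fsm:sync_send_event(undefined, ...) crashes, so
%% send gc_specific_manifest only when a pid actually exists.
maybe_gc_specific_manifest(undefined, _UUID) ->
    ok;
maybe_gc_specific_manifest(ManiPid, UUID) when is_pid(ManiPid) ->
    gen_fsm:sync_send_event(ManiPid, {gc_specific_manifest, UUID}, infinity).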

@shino (Contributor) commented Jan 8, 2016:

Now running riak_test.

@shino (Contributor) commented Jan 13, 2016:

Two cases failed in the multibag flavor; a local fix for them:

diff --git a/riak_test/tests/regression_tests.erl b/riak_test/tests/regression_tests.erl
index 9b1dbce..b8a320b 100644
--- a/riak_test/tests/regression_tests.erl
+++ b/riak_test/tests/regression_tests.erl
@@ -174,7 +174,7 @@ verify_cs770({UserConfig, {RiakNodes, _, _}}, BucketName) ->
                           scheduled_delete =:= Mx?MANIFEST.state
                   end, 8, 4096),

-    Pbc = rt:pbc('dev1@127.0.0.1'),
+    Pbc = rtcs:pbc(RiakNodes, objects, BucketName),

     %% verify that object is also stored in latest GC bucket
     Ms = all_manifests_in_gc_bucket(Pbc),
diff --git a/riak_test/tests/upgrade_downgrade_test.erl b/riak_test/tests/upgrade_downgrade_test.erl
index 5e82cfa..cff44d5 100644
--- a/riak_test/tests/upgrade_downgrade_test.erl
+++ b/riak_test/tests/upgrade_downgrade_test.erl
@@ -51,6 +51,9 @@ confirm() ->
          ok = rt:upgrade(RiakNode, RiakCurrentVsn),
          rt:wait_for_service(RiakNode, riak_kv),
          ok = rtcs_config:upgrade_cs(N, AdminCreds),
+         rtcs:set_advanced_conf({cs, current, N},
+                                [{riak_cs,
+                                  [{riak_host, {"127.0.0.1", rtcs_config:pb_port(1)}}]}]),
          rtcs_exec:start_cs(N, current)
      end
      || RiakNode <- RiakNodes],

@shino (Contributor) commented Jan 13, 2016:

regression_tests is still failing... It is not reproducible in a single test run with riak_test's -t option, but it fails with the -d riak_test/ebin option... 😿

@shino (Contributor) commented Jan 13, 2016:

During lunch, a test run with the -d option succeeded with the diff below:

diff --git a/riak_test/tests/regression_tests.erl b/riak_test/tests/regression_tests.erl
index 9b1dbce..aea3d94 100644
--- a/riak_test/tests/regression_tests.erl
+++ b/riak_test/tests/regression_tests.erl
@@ -156,7 +156,7 @@ verify_cs770({UserConfig, {RiakNodes, _, _}}, BucketName) ->
     {ok, Socket} = rtcs_object:upload(UserConfig,
                                       {normal_partial, 3*1024*1024, 1024*1024},
                                       BucketName, Key),
-
+timer:sleep(3*1000),
     [[{UUID, M}]] = get_manifests(RiakNodes, BucketName, Key),

But I'm not sure whether this is actually a race or not.
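
An alternative to the fixed sleep would be to poll for the manifest with bounded retries, like the wait loop used earlier in verify_cs770. A sketch assuming get_manifests/3 as used in the test:

%% Hypothetical polling helper: retry get_manifests/3 instead of a
%% fixed timer:sleep/1, so the test does not depend on timing.
wait_for_manifest(_RiakNodes, _Bucket, _Key, 0, _DelayMs) ->
    error(manifest_not_found);
wait_for_manifest(RiakNodes, Bucket, Key, Retries, DelayMs) ->
    case get_manifests(RiakNodes, Bucket, Key) of
        [[{_UUID, _M}]] = Manifests ->
            Manifests;
        _Other ->
            timer:sleep(DelayMs),
            wait_for_manifest(RiakNodes, Bucket, Key, Retries - 1, DelayMs)
    end.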

@kuenishi (Contributor, Author):

Great catch! It was a race between the socket close, writing the manifest, and getting it back from Riak. I pushed a fix.

@shino (Contributor) commented Jan 14, 2016:

The diff looks nice, and all riak_test runs passed. Really nice fix and a good example of "don't let it crash" 😄

borshop added a commit that referenced this pull request Jan 14, 2016
Handle socket error by catching badmatch

Reviewed-by: shino
@kuenishi (Contributor, Author):

@borshop merge


@borshop borshop merged commit 6f5e135 into 2.1 Jan 14, 2016
@shino shino deleted the bugfix/cleanup-manifests branch January 14, 2016 06:32