This repository has been archived by the owner on Sep 19, 2019. It is now read-only.

63012 fix filtered replications #105

Merged
merged 3 commits into from Jan 11, 2017

Conversation

nickva

@nickva nickva commented Dec 8, 2016

  • When the replication filter changes, the replication id record in the doc
    processor ETS table was not updated. This led to the new replication job not
    showing up in the _scheduler/docs output (see the sketch after this list).

  • Make sure doc processor workers do not re-add deleted replication jobs.
    Previously, especially in the case of filtered replications, doc processor
    workers could inadvertently re-add a replication job after it was deleted.
    After fetching the filter code and computing the replication id, workers
    would try to add the replication job to the scheduler without checking
    whether the replication document had already been deleted or another
    worker had been spawned.

  • Fix add_job/1 spec for scheduler
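
A minimal sketch of the first point (not the actual patch; the rid field name in #rdoc is an assumption): when a worker recomputes the replication id for a filtered replication, the row in the doc processor ETS table has to be refreshed, otherwise _scheduler/docs keeps showing the stale id.

% Sketch only: refresh the replication id stored for a doc once a worker has
% recomputed it. The rid field name is assumed for illustration.
update_rep_id(DbName, DocId, NewRepId) ->
    case ets:lookup(?MODULE, {DbName, DocId}) of
        [#rdoc{} = Row] ->
            true = ets:insert(?MODULE, Row#rdoc{rid = NewRepId}),
            ok;
        [] ->
            ok  % row already removed (doc deleted), nothing to update
    end.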

@nickva nickva force-pushed the 63012-fix-filtered-replications branch 2 times, most recently from 24aeae3 to 4e79f47 Compare December 8, 2016 02:12
case ets:lookup(?MODULE, {DbName, DocId}) of
[#rdoc{worker = WRef}] when is_reference(WRef) ->
WRef;
_Other ->

let's be clear what we expect here. presumably [] is the only other case?


would we expect an #rdoc without a worker ref, etc? We should avoid this kind of defensive programming in general.

@nickva nickva Dec 8, 2016

Good point. You're right. _Other in this case should be [] or a worker with a nil reference. I will fix it

{ok, RepId};
% Before adding a the job check that this worker is still the current
% worker. This is to handle a race condition where a worker which was
% sleeping and then checking a replication filter my inadvertently re-add

s/my/may/

@nickva nickva force-pushed the 63012-fix-filtered-replications branch from 4e79f47 to 88bf066 Compare December 8, 2016 17:45
@nickva

nickva commented Dec 8, 2016

@rnewson Fixed typo. Added explicit cases for nil and for a worker with a nil reference in the get_worker_ref function. Wrote a test for get_worker_ref as well.
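
For reference, the resulting get_worker_ref probably looks roughly like this (a sketch, not the exact patch; the spec and the nil clauses are inferred from the discussion above):

% Sketch: return the live worker reference, or nil when the row has no worker
% or the row is gone, instead of a catch-all _Other clause.
-spec get_worker_ref(db_doc_id()) -> reference() | nil.
get_worker_ref({DbName, DocId}) ->
    case ets:lookup(?MODULE, {DbName, DocId}) of
        [#rdoc{worker = WRef}] when is_reference(WRef) ->
            WRef;
        [#rdoc{worker = nil}] ->
            nil;
        [] ->
            nil
    end.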

@@ -194,8 +207,8 @@ handle_cast(Msg, State) ->
{stop, {error, unexpected_message, Msg}, State}.


handle_info({'DOWN', Ref, _, _, #doc_worker_result{id = Id, result = Res}},
State) ->
handle_info({'DOWN', _, _, _, #doc_worker_result{id = Id, wref = Ref,


Accidentally added a space there.

Author

Oops. Good catch. Fixing it

couch_log:warning("replicator scheduler: ~p was already added", [Rep])
end,
{ok, RepId};
% Before adding a the job check that this worker is still the current


Little comment typo here. adding a the job

Author

Good catch. Fixing it

@sagelywizard

Looks like it'd work, but it'd be nice if we could simplify things a bit. Would it make sense for the doc processor to kill the worker and demonitor(WorkerRef, [flush]) the mailbox when the document was removed? That would avoid passing the ref all around.
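
Concretely, the suggestion appears to be something like this (a sketch; the function and variable names are illustrative):

% Sketch of the alternative: on doc deletion, stop the worker and flush any
% pending 'DOWN' message so no stale result reaches handle_info/2.
stop_worker(WPid, WRef) when is_pid(WPid), is_reference(WRef) ->
    exit(WPid, kill),
    true = erlang:demonitor(WRef, [flush]),
    ok.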

@nickva

nickva commented Dec 11, 2016

Interesting idea. That might be a bit tricky because the worker is actually 2 processes: a top-level wrapper and the actual worker process. The wrapper is used to guarantee that no matter what happens (blocked network request, a redirect loop, etc.) workers always return and don't block indefinitely. Killing the wrapper process would still leave the process which starts the replication running.

Linking the wrapper and the executor might appear to work, but then if the executor dies the wrapper will exit with an unspecified exit signal (right now it is expected to exit only with a well-known result record). Linking and also trapping exits in the worker might work, but I'm not sure whether that would actually simplify the logic overall or just shift some of the tricky bits.
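
A rough sketch of the two-process shape described here, built around the #doc_worker_result{} record visible in the diff (the inner compute_rep_id helper and the timeout value are assumptions):

% Wrapper process (the one the doc processor monitors). It always terminates
% with a #doc_worker_result{}, however the inner executor ends.
worker_fun(Id, Rep, WaitSec, WRef) ->
    timer:sleep(WaitSec * 1000),
    % executor: does the actual filter fetch / replication id computation
    {Pid, Ref} = spawn_monitor(fun() -> exit(compute_rep_id(Rep)) end),
    Result = receive
        {'DOWN', Ref, process, Pid, {ok, RepId}} ->
            {ok, RepId};
        {'DOWN', Ref, process, Pid, Reason} ->
            {temporary_error, Reason}
    after 30000 ->
        exit(Pid, kill),  % bound how long the executor may run
        {temporary_error, timeout}
    end,
    exit(#doc_worker_result{id = Id, wref = WRef, result = Result}).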

{_Pid, WRef} = spawn_monitor(fun() -> worker_fun(Id, Rep, WaitSec) end),
WRef.
-spec spawn_worker(db_doc_id(), #rep{}, seconds(), reference()) -> pid().
spawn_worker(Id, Rep, WaitSec, WRef) ->


Why do you need to pass in the ref? Why can't you just add the ref returned by spawn_monitor to the worker field in ets?

@nickva nickva Jan 10, 2017

That is done to avoid a race condition.

A reference for the worker is created first and inserted into the ETS table. Then the worker with that ref is started. The ordering matters because the worker, after it starts, checks whether it is still the current worker (for the latest document update):

https://github.com/apache/couchdb-fabric/blob/master/src/fabric_doc_open_revs.erl#L461-L468

There is a (probably mostly theoretical) race condition: after the worker is spawned, the doc processor main process could be put to sleep before it gets a chance to add the reference to ETS; in the meantime the worker process keeps going, sees that it is not the current worker, and exits.
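
In other words, the ordering is roughly this (a sketch; the start_worker name and the ets:update_element call are illustrative):

% Sketch: the ref that identifies the worker is created and stored in ETS
% before the worker process exists, so the worker can never observe a table
% that does not yet point at it.
start_worker(Id, Rep, WaitSec) ->
    WRef = make_ref(),
    true = ets:update_element(?MODULE, Id, {#rdoc.worker, WRef}),
    spawn_worker(Id, Rep, WaitSec, WRef),
    WRef.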


Makes sense.


@sagelywizard sagelywizard left a comment


LGTM after a squash and rebase.

add_job/1 doesn't return `{error, already_added}` anymore so fix spec to
conform.
…able

When the replication filter changes, the replication id record in the doc
processor ETS table was not updated. This led to the new replication job not
showing up in the _scheduler/docs output.
Previously, especially in the case of filtered replications, doc processor
workers could inadvertently re-add a replication job after it was deleted.

After fetching the filter code and computing the replication id, workers would
try to add the replication job to the scheduler. They did that without checking
whether the replication document had already been deleted or another worker
had been spawned.

The fix is to create a unique worker reference and pass it to the worker. The
worker then confirms it is still the current worker and that the document was
not deleted before adding the job; otherwise it exits with an `ignore` result.
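
A minimal sketch of that check (function names other than the scheduler's add_job/1 are illustrative):

% Sketch: the worker only adds the job if the doc processor still lists its
% own reference as the current worker for this document.
maybe_add_job(Id, RepId, Rep, WRef) ->
    case get_worker_ref(Id) of
        WRef ->
            ok = add_job(Rep),  % scheduler add_job/1, spec fixed in this PR
            {ok, RepId};
        _NilOrNewerWorker ->
            ignore  % doc deleted or a newer worker took over
    end.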
@nickva nickva force-pushed the 63012-fix-filtered-replications branch from 8db2107 to 700a929 Compare January 11, 2017 20:00
@nickva nickva merged commit 146b700 into 63012-scheduler Jan 11, 2017
@nickva nickva deleted the 63012-fix-filtered-replications branch January 11, 2017 20:46
@nickva

nickva commented Jan 11, 2017

@sagelywizard Thank you!

🐎
