Remove replication job supervisor #5036

Merged
nickva merged 1 commit into main from prevent-local-duplicate-jobs-error on Apr 24, 2024

@nickva (Contributor) commented Apr 24, 2024

Use the scheduler as the job supervisor, since the scheduler is already a fancy supervisor, with its own backoff logic, process monitoring, etc.

This simplifies the job starting/stopping logic and fixes a bug where the simple_one_for_one supervisor could restart a job while the scheduler still considered it not running, so the scheduler would try to start another job with the same replication ID on the same node. Since jobs register themselves in pg, the second job would keep crashing with a duplicate_job error the first time it tried to checkpoint.
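
The duplicate-job failure described above can be illustrated with a minimal sketch. It is not CouchDB code: the module name, the pg scope, and the helper functions are hypothetical, and the duplicate check is done at job start for brevity, whereas the description says the real job crashes the first time it tries to checkpoint. The sketch only assumes OTP's `pg` module (OTP 23 or later) and a single owner process that starts and monitors the jobs itself.

```erlang
%% Hypothetical sketch, not CouchDB code. Module, scope, and function
%% names are made up for illustration.
-module(dup_job_sketch).
-export([demo/0]).

-define(SCOPE, dup_job_sketch_scope).

demo() ->
    %% Start a pg scope for this sketch (tolerate an already running one).
    case pg:start_link(?SCOPE) of
        {ok, _} -> ok;
        {error, {already_started, _}} -> ok
    end,
    RepId = <<"rep-1">>,
    %% The first job registers under its replication ID and keeps running.
    {ok, Pid1, Ref1} = start_job(RepId),
    %% A second job with the same ID on the same node exits with
    %% duplicate_job, which the owner sees through its monitor.
    {error, duplicate_job} = start_job(RepId),
    %% Stop the first job and wait for its 'DOWN' message.
    Pid1 ! stop,
    receive
        {'DOWN', Ref1, process, Pid1, normal} -> ok
    end.

%% The owner (standing in for the scheduler) spawns and monitors the job
%% itself, so its own view of what is running stays accurate.
start_job(RepId) ->
    Parent = self(),
    {Pid, Ref} = spawn_monitor(fun() -> job(Parent, RepId) end),
    receive
        {Pid, registered} -> {ok, Pid, Ref};
        {'DOWN', Ref, process, Pid, Reason} -> {error, Reason}
    end.

job(Parent, RepId) ->
    ok = pg:join(?SCOPE, RepId, self()),
    case pg:get_local_members(?SCOPE, RepId) of
        [Self] when Self =:= self() ->
            %% Only member on this node: report success and run until stopped.
            Parent ! {self(), registered},
            receive
                stop -> pg:leave(?SCOPE, RepId, self())
            end;
        _Others ->
            %% Another local job already holds this replication ID.
            pg:leave(?SCOPE, RepId, self()),
            exit(duplicate_job)
    end.
```

Calling `dup_job_sketch:demo()` from an Erlang shell should return `ok`. Because the owner is the only process that starts jobs and it monitors each one directly, there is no second supervisor that could restart a job behind its back.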

@@ -1,34 +0,0 @@
% Licensed under the Apache License, Version 2.0 (the "License"); you may not
@nickva (Contributor, Author) commented:

This is a left-over supervisor from many years ago when we switched to the scheduling replicator and forgot to remove it. Since we're cleaning up supervisors, removing the extra junk as well.

@rnewson (Member) left a comment

A nice reduction in complexity.

Comment thread src/couch_replicator/src/couch_replicator_scheduler.erl Outdated
@nickva force-pushed the prevent-local-duplicate-jobs-error branch from aa5c5c8 to b90a591 on April 24, 2024 13:59
@nickva force-pushed the prevent-local-duplicate-jobs-error branch from b90a591 to 770faed on April 24, 2024 14:26
@nickva merged commit 7388b52 into main on Apr 24, 2024
@nickva deleted the prevent-local-duplicate-jobs-error branch on April 24, 2024 15:02