
Poller recovery starts multiple processes and fails to recover properly #4215

@bernisys


Describe the bug

This probably appears only in very large setups, where recovery takes longer than one polling cycle.

When connectivity returns after an outage in a remote poller setup, poller recovery does not work properly.
It spawns multiple processes and puts heavy load on the DB, because it does not check whether a recovery process is already running.

I also found the actual flaw in the code, and I think I was able to fix it.

To Reproduce

Let a very large (in terms of data sources) remote poller go into "Heartbeat" mode, then let it recover.
After a while you will see multiple processes running, all doing the same thing: starting the data sync from scratch over and over again.

Expected behavior

Poller recovery should run only once.
Every subsequent call should detect that a process is running already.

Additional context

The problem is that "recovery_pid" is never inserted into the corresponding table.
It is only deleted at one point in the code.
As a result, every subsequent process sees an empty value (because there is no "recovery_pid" entry in the "settings" table) and starts the recovery all over again.
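
To illustrate the pattern that seems to be missing, here is a minimal sketch in Python. It is not the Cacti code and not the actual fix; it only shows the general idea of recording a PID before starting and checking it on every later call. The helper names (`pid_is_alive`, `try_acquire_recovery`, `release_recovery`), the use of SQLite, and the two-column `settings` layout (`name`, `value`) are assumptions made for the example. The liveness check assumes a POSIX system.

```python
import os
import sqlite3


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0: existence check, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but is owned by another user
    return True


def try_acquire_recovery(db, my_pid, is_alive=pid_is_alive) -> bool:
    """Record my_pid as the recovery process unless another one is still alive."""
    row = db.execute(
        "SELECT value FROM settings WHERE name = 'recovery_pid'"
    ).fetchone()
    if row is not None and str(row[0]).isdigit():
        other = int(row[0])
        if other != my_pid and is_alive(other):
            return False  # a recovery is already running: do not start another
    # This is the step the report says is missing: actually store the PID.
    db.execute(
        "REPLACE INTO settings (name, value) VALUES ('recovery_pid', ?)",
        (str(my_pid),),
    )
    db.commit()
    return True


def release_recovery(db) -> None:
    """Remove the marker once recovery has finished."""
    db.execute("DELETE FROM settings WHERE name = 'recovery_pid'")
    db.commit()


if __name__ == "__main__":
    # Hypothetical stand-in for the real settings table.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE settings (name TEXT PRIMARY KEY, value TEXT)")
    if try_acquire_recovery(db, os.getpid()):
        try:
            pass  # the long-running data sync would go here
        finally:
            release_recovery(db)
```

Note that the read followed by the write is not atomic in this sketch, so two processes starting at the same instant could both pass the check; a real implementation would need a transaction or a lock around it. A stale PID left behind by a crashed process is treated as free because the liveness check fails for it.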


    Labels

    bug (Undesired behaviour), resolved (A fixed issue)
