Urgent: Job duplication during server crash and restart in a distributed environment #5045

Open · dwevedivaibhav opened this issue Mar 7, 2024 · 10 comments

@dwevedivaibhav

Hi @sfmskywalker,

I hope you're doing well.

We've encountered a critical issue with job duplication in a distributed environment when a server crashes and a new server comes up. Despite implementing distributed locks, jobs that were running on the original server also run on the new server after the crash and restart. This is causing duplicates and impacting our system performance.

Your urgent attention to this matter is greatly appreciated. We need to resolve this issue as soon as possible to prevent any further disruptions.

Thank you for your assistance.


glime-ai bot commented Mar 7, 2024

Hi,

Based on the context provided, it seems you are experiencing issues with job duplication in a distributed environment using Elsa Workflows when a server crashes and a new server comes up. This is indeed a critical issue that can lead to performance problems and inconsistent states within your workflow system.

To address this issue, it's important to ensure that your workflow instances are correctly managed across server restarts and crashes. Elsa Workflow's design allows for distributed execution, but it requires careful configuration, especially when dealing with long-running workflows and background jobs.

One approach to mitigate such issues is to use a persistent storage provider that supports distributed locks, such as Redis. This can help prevent multiple instances of the same job from being executed simultaneously. Additionally, you should verify that your Hangfire server (or whichever background job system you are using) is configured correctly to handle server crashes and restarts without duplicating jobs.
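For the Hangfire side specifically, one general-purpose guard is Hangfire's DisableConcurrentExecution filter, which takes a distributed lock for the duration of a job so that two servers cannot execute the same job at the same time. A minimal sketch (the job class and its body are hypothetical, not taken from your setup):

using System;
using Hangfire;

// The filter acquires a distributed lock named after the job for the
// duration of the run; a second server that picks up the same job waits
// up to the timeout instead of executing a duplicate.
public class SyncJob
{
    [DisableConcurrentExecution(timeoutInSeconds: 300)]
    public void Run()
    {
        Console.WriteLine("Running exclusively across all servers.");
    }
}

Note that this protects individual Hangfire jobs; it does not by itself prevent two servers from resuming the same workflow instance.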

Since you mentioned that you have already implemented distributed locks but are still facing issues, it might be necessary to review the current implementation for potential misconfigurations or to consider alternative strategies for managing workflow instances and background jobs.

Given the complexity of the issue, it would be beneficial to engage directly with the Elsa Workflow community or the maintainers for more in-depth support. You can create an issue on the Elsa Workflow GitHub repository detailing your problem, environment setup, and any logs or error messages you have. This will allow the maintainers and community to provide more targeted assistance.

For immediate troubleshooting, you can refer to the Elsa Workflow documentation, GitHub issues, and Stack Overflow questions related to similar problems for guidance. If there is a known bug or limitation, it might be documented in the GitHub issues or in the release notes of the project.

Please ensure that you are using the latest stable version of Elsa Workflows, as updates often include bug fixes and improvements that could resolve your issue.

If you need further assistance, please provide more details about your current setup, including the version of Elsa Workflows you are using, the storage provider, the background job system configuration, and any relevant logs or error messages.

@dwevedivaibhav (Author)

Please find below the configuration I have added in Startup:

services.Configure<TaskSettings>(Configuration.GetSection("TaskSettings"));
services.Configure<ConnectionsMongoDatabaseOptions>(Configuration.GetSection("ConnectionsMongoDatabaseOptions"));

// Configure Redis.
services.AddRedis($"{distribiutedCacheRedis?.ConnectionString}");

// Configure Hangfire with MongoDB storage.
var migrationOptions = new MongoMigrationOptions
{
    MigrationStrategy = new MigrateMongoMigrationStrategy(),
    BackupStrategy = new CollectionMongoBackupStrategy()
};
var storageOptions = new MongoStorageOptions
{
    MigrationOptions = migrationOptions,
    CheckConnection = false
};
services.AddHangfire(configuration => configuration
    .SetDataCompatibilityLevel(CompatibilityLevel.Version_170)
    .UseSimpleAssemblyNameTypeSerializer()
    .UseRecommendedSerializerSettings(settings => settings.ConfigureForNodaTime(DateTimeZoneProviders.Tzdb))
    .UseMongoStorage(mongoDatabaseOptions?.ConnectionString + "/" + mongoDatabaseOptions?.DatabaseName, storageOptions));
services.AddHangfireServer((sp, options) =>
{
    options.HeartbeatInterval = TimeSpan.FromSeconds(2);
    options.ConfigureForElsaDispatchers(sp);
});
services.ConfigureCustomLogger();

services
    .AddElsa(elsa =>
    {
        elsa.UseMongoDbPersistence(ef => ef.ConnectionString = mongoDatabaseOptions?.ConnectionString + "/" + mongoDatabaseOptions?.DatabaseName);

        // Use Redis as the distributed lock provider.
        elsa.ConfigureDistributedLockProvider(options => options.UseProviderFactory(sp => name =>
        {
            var connection = sp.GetRequiredService<IConnectionMultiplexer>();
            return new RedisDistributedLock(name, connection.GetDatabase());
        }));
        elsa.UseRedisCacheSignal();
        elsa.AddQuartzTemporalActivities();
        elsa.UseHangfireDispatchers();
    });
services.AddElsaApiEndpoints();

@dwevedivaibhav (Author)

Hi @sfmskywalker
I'd like to address an issue we've encountered with our workflow execution across multiple servers. Allow me to illustrate with a simple example:

We have two servers, Server A and Server B. On Server A, we have Workflow A running, and on Server B, Workflow B is running. In the event of Server B crashing, a new server, Server C, takes over and resumes Workflow B from where it left off. However, we've observed an unexpected behavior where Workflow A from Server A also starts running on Server C, causing duplicate calls and inconsistencies in our system.

Despite implementing distributed locks, we're puzzled as to why Workflow A from Server A is being executed on Server C. This issue does not occur under normal circumstances when servers are not crashing.

We would greatly appreciate any insights or suggestions you may have on resolving this issue and ensuring the proper execution of workflows across servers.

Thank you for your attention to this matter.

@dwevedivaibhav (Author)

@sfmskywalker Do you have an estimated timeline for fixing this issue?

@sfmskywalker (Member)

Hello @dwevedivaibhav,

Thank you for bringing this issue to our attention. I really wish I could offer a specific timeline for addressing it, but currently, my schedule is quite packed due to other projects and commitments, especially those related to my paying clients.

It appears from your description that Server C is unexpectedly handling workflow instances that are in a 'Running' state. Ideally, this scenario shouldn't occur since Server A is supposed to have an exclusive lock on Workflow A, ensuring that no other servers interfere with its operations. To properly investigate and resolve this issue, we would need a detailed set of steps that can reliably reproduce the problem. This kind of in-depth troubleshooting demands focused time and attention, which, regrettably, I'm unable to dedicate at this moment.

I understand this might not be the response you were hoping for, and I appreciate your patience and understanding. Your issue is important to us, and I assure you it’s on our radar. As soon as my current obligations have been met, I will take a closer look at your situation. Meanwhile, if you're able to provide any additional information or steps to reproduce the issue, it would be incredibly helpful for when we are able to address this.

Thanks for your understanding and for being a part of our community. Your contributions help us improve, and we're looking forward to resolving this together as soon as possible.

@dwevedivaibhav (Author)

Hi @sfmskywalker ,

It sounds like the problem lies in the resuming workflow method, particularly in the handling of server information in the workflow instance. Without including server information in the workflow instance, the system may mistakenly acquire locks for workflows that are already running on other servers, leading to duplicates.

To address this issue, we should ensure that server information is properly incorporated into the workflow instance. By doing so, we can accurately identify which server a workflow belongs to and prevent duplicates during job distribution and lock acquisition.

Once we have updated the workflow instances to include server information, we can refine the resuming workflow method to consider this information when acquiring locks. This should help resolve the issue of duplicates caused by incorrectly distributing and acquiring locks for workflows running on different servers.
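To make the idea concrete, here is a minimal sketch of the filtering I have in mind (the type and property names are my own illustrations, not Elsa's actual model; OwnerServerId is the proposed new field):

using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical projection of a running workflow instance.
public record RunningInstance(string Id, string? OwnerServerId);

public static class ResumeFilter
{
    // On startup, resume only instances this server owns, instances that
    // were never stamped, or instances whose owner is known to be dead.
    public static IEnumerable<RunningInstance> SelectResumable(
        IEnumerable<RunningInstance> running,
        string thisServerId,
        Func<string, bool> isServerDead) =>
        running.Where(i =>
            i.OwnerServerId is null ||
            i.OwnerServerId == thisServerId ||
            isServerDead(i.OwnerServerId));
}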

Please take a look at the exact file below, which should help identify the issue:

https://github.com/elsa-workflows/elsa-core/blob/2.10.2.2/src/core/Elsa.Core/StartupTasks/ContinueRunningWorkflows.cs

Thanks

@dwevedivaibhav (Author)

Hi @sfmskywalker,

Any thoughts on this?

@sfmskywalker (Member)

Makes sense. We probably also need to consider that at some point a server will go down, e.g. when hosting in a Kubernetes cluster. If a workflow instance is still associated with that server, and that workflow instance is still in the Running state, then it will not be picked up by new servers. Perhaps we would need a heartbeat system where servers update a table record every minute, for example. If a given server hasn't reported a heartbeat for, say, 5 minutes, it is considered dead, in which case new servers are free to pick up its Running workflow instances.
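As a rough illustration of the idea (everything below is hypothetical; none of these types exist in Elsa today):

using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;

// Hypothetical heartbeat store; in practice this would be a small table or
// collection keyed by server ID.
public interface IServerHeartbeatStore
{
    Task ReportAsync(string serverId, DateTimeOffset seenAt, CancellationToken ct);
    Task<DateTimeOffset?> GetLastSeenAsync(string serverId, CancellationToken ct);
}

// Each server reports a heartbeat once per minute.
public class HeartbeatService : BackgroundService
{
    private readonly IServerHeartbeatStore _store;
    private readonly string _serverId = Environment.MachineName; // or pod name

    public HeartbeatService(IServerHeartbeatStore store) => _store = store;

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            await _store.ReportAsync(_serverId, DateTimeOffset.UtcNow, stoppingToken);
            await Task.Delay(TimeSpan.FromMinutes(1), stoppingToken);
        }
    }
}

// A server that hasn't reported for 5 minutes is considered dead; only then
// may another node take over its Running workflow instances.
public static class ServerLiveness
{
    public static async Task<bool> IsDeadAsync(
        IServerHeartbeatStore store, string serverId, CancellationToken ct)
    {
        var lastSeen = await store.GetLastSeenAsync(serverId, ct);
        return lastSeen is null ||
               DateTimeOffset.UtcNow - lastSeen.Value > TimeSpan.FromMinutes(5);
    }
}

Registered via services.AddHostedService<HeartbeatService>(), each node keeps its own record fresh, and the resume logic would consult IsDeadAsync before taking over another server's instances.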

@dwevedivaibhav (Author)

Hi @sfmskywalker
I agree with your thoughts.

Thanks

@dwevedivaibhav (Author)

Hi @sfmskywalker

I'm currently working on some custom changes and trying to register a class as a startup hook in my project. However, the approach I'm using isn't working as expected. Do you have any insights or ideas on how to proceed?

Below is the code snippet showing how I'm attempting to implement it:

services
    .AddElsa(elsa =>
    {
        elsa.UseMongoDbPersistence(ef => ef.ConnectionString = mongoDatabaseOptions?.ConnectionString + "/" + mongoDatabaseOptions?.DatabaseName);
        elsa.UseRedisCacheSignal();
        elsa.ConfigureDistributedLockProvider(options => options.UseRedisLockProvider());
        elsa.AddQuartzTemporalActivities();
        elsa.UseHangfireDispatchers();

        // This is where we register a custom hook in place of ContinueRunningWorkflows:
        // https://github.com/elsa-workflows/elsa-core/blob/2.10.2.2/src/core/Elsa.Core/StartupTasks/ContinueRunningWorkflows.cs
        elsa.Services.AddStartupTask<CustomContinueRunningWorkflowsStartupTask>();
    });
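To clarify the intent: I want my custom task to replace the built-in ContinueRunningWorkflows rather than run alongside it. My assumption (not yet verified against the Elsa 2 source; the exact namespaces may differ by version) is that the stock task stays registered as an IStartupTask unless it is removed first, along these lines inside the AddElsa callback:

// Requires: using System.Linq; using Microsoft.Extensions.DependencyInjection;
// plus the Elsa namespaces that define IStartupTask and ContinueRunningWorkflows.
var builtIn = elsa.Services.FirstOrDefault(d =>
    d.ServiceType == typeof(IStartupTask) &&
    d.ImplementationType == typeof(ContinueRunningWorkflows));

if (builtIn != null)
    elsa.Services.Remove(builtIn); // keep only the custom task

elsa.Services.AddStartupTask<CustomContinueRunningWorkflowsStartupTask>();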

Your assistance and guidance on this matter would be greatly appreciated.

Thanks
