Urgent: Job duplication during server crash and restart in a distributed environment #5045

Open · dwevedivaibhav opened this issue Mar 7, 2024 · 10 comments

@dwevedivaibhav

Hi @sfmskywalker,

I hope you're doing well.

We've encountered a critical issue with job duplication in a distributed environment when a server crashes and a new server comes up. Despite implementing distributed locks, jobs that were running on the original server also run on the new server after the crash and restart. This is causing duplicates and impacting our system performance.

Your urgent attention to this matter is greatly appreciated. We need to resolve this issue as soon as possible to prevent any further disruptions.

Thank you for your assistance.


glime-ai bot commented Mar 7, 2024

Hi,

Based on the context provided, it seems you are experiencing issues with job duplication in a distributed environment using Elsa Workflows when a server crashes and a new server comes up. This is indeed a critical issue that can lead to performance problems and inconsistent states within your workflow system.

To address this issue, it's important to ensure that your workflow instances are correctly managed across server restarts and crashes. Elsa Workflow's design allows for distributed execution, but it requires careful configuration, especially when dealing with long-running workflows and background jobs.

One approach to mitigate such issues is to use a persistent storage provider that supports distributed locks, such as Redis. This can help prevent multiple instances of the same job from being executed simultaneously. Additionally, you should verify that your Hangfire server (or whichever background job system you are using) is configured correctly to handle server crashes and restarts without duplicating jobs.
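For the Hangfire side specifically, one general-purpose guard is Hangfire's DisableConcurrentExecution filter, which takes a distributed lock for the duration of a job so that two servers cannot execute the same job at the same time. A minimal sketch (the job class and its body are hypothetical, not taken from your setup):

using System;
using Hangfire;

// The filter acquires a distributed lock named after the job for the
// duration of the run; a second server that picks up the same job waits
// up to the timeout instead of executing a duplicate.
public class SyncJob
{
    [DisableConcurrentExecution(timeoutInSeconds: 300)]
    public void Run()
    {
        Console.WriteLine("Running exclusively across all servers.");
    }
}

Note that this protects individual Hangfire jobs; it does not by itself prevent two servers from resuming the same workflow instance.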

Since you mentioned that you have already implemented distributed locks but are still facing issues, it might be necessary to review the current implementation for potential misconfigurations or to consider alternative strategies for managing workflow instances and background jobs.

Given the complexity of the issue, it would be beneficial to engage directly with the Elsa Workflow community or the maintainers for more in-depth support. You can create an issue on the Elsa Workflow GitHub repository detailing your problem, environment setup, and any logs or error messages you have. This will allow the maintainers and community to provide more targeted assistance.

For immediate troubleshooting, you can refer to the Elsa Workflow documentation, GitHub issues, and Stack Overflow questions related to similar problems for guidance. If there is a known bug or limitation, it might be documented in the GitHub issues or in the release notes of the project.

Please ensure that you are using the latest stable version of Elsa Workflows, as updates often include bug fixes and improvements that could resolve your issue.

If you need further assistance, please provide more details about your current setup, including the version of Elsa Workflows you are using, the storage provider, the background job system configuration, and any relevant logs or error messages.

@dwevedivaibhav (Author)

Please find below the configuration I have added in Startup:

services.Configure<TaskSettings>(Configuration.GetSection("TaskSettings"));
services.Configure<ConnectionsMongoDatabaseOptions>(Configuration.GetSection("ConnectionsMongoDatabaseOptions"));

// Configure Redis.
services.AddRedis($"{distribiutedCacheRedis?.ConnectionString}");

// Configure Hangfire with MongoDB storage.
var migrationOptions = new MongoMigrationOptions
{
    MigrationStrategy = new MigrateMongoMigrationStrategy(),
    BackupStrategy = new CollectionMongoBackupStrategy()
};
var storageOptions = new MongoStorageOptions
{
    MigrationOptions = migrationOptions,
    CheckConnection = false
};
services.AddHangfire(configuration => configuration
    .SetDataCompatibilityLevel(CompatibilityLevel.Version_170)
    .UseSimpleAssemblyNameTypeSerializer()
    .UseRecommendedSerializerSettings(settings => settings.ConfigureForNodaTime(DateTimeZoneProviders.Tzdb))
    .UseMongoStorage(mongoDatabaseOptions?.ConnectionString + "/" + mongoDatabaseOptions?.DatabaseName, storageOptions));
services.AddHangfireServer((sp, options) =>
{
    options.HeartbeatInterval = TimeSpan.FromSeconds(2);
    options.ConfigureForElsaDispatchers(sp);
});
services.ConfigureCustomLogger();

services
    .AddElsa(elsa =>
    {
        elsa.UseMongoDbPersistence(ef => ef.ConnectionString = mongoDatabaseOptions?.ConnectionString + "/" + mongoDatabaseOptions?.DatabaseName);

        // Use Redis as the distributed lock provider.
        elsa.ConfigureDistributedLockProvider(options => options.UseProviderFactory(sp => name =>
        {
            var connection = sp.GetRequiredService<IConnectionMultiplexer>();
            return new RedisDistributedLock(name, connection.GetDatabase());
        }));
        elsa.UseRedisCacheSignal();
        elsa.AddQuartzTemporalActivities();
        elsa.UseHangfireDispatchers();
    });
services.AddElsaApiEndpoints();

@dwevedivaibhav (Author)

Hi @sfmskywalker
I'd like to address an issue we've encountered with our workflow execution across multiple servers. Allow me to illustrate with a simple example:

We have two servers, Server A and Server B. On Server A, we have Workflow A running, and on Server B, Workflow B is running. In the event of Server B crashing, a new server, Server C, takes over and resumes Workflow B from where it left off. However, we've observed an unexpected behavior where Workflow A from Server A also starts running on Server C, causing duplicate calls and inconsistencies in our system.

Despite implementing distributed locks, we're puzzled as to why Workflow A from Server A is being executed on Server C. This issue does not occur under normal circumstances when servers are not crashing.

We would greatly appreciate any insights or suggestions you may have on resolving this issue and ensuring the proper execution of workflows across servers.

Thank you for your attention to this matter.

@dwevedivaibhav (Author)

@sfmskywalker Do you have an estimated timeline for fixing this issue?

@sfmskywalker (Member)

Hello @dwevedivaibhav,

Thank you for bringing this issue to our attention. I really wish I could offer a specific timeline for addressing it, but currently, my schedule is quite packed due to other projects and commitments, especially those related to my paying clients.

It appears from your description that Server C is unexpectedly handling workflow instances that are in a 'Running' state. Ideally, this scenario shouldn't occur since Server A is supposed to have an exclusive lock on Workflow A, ensuring that no other servers interfere with its operations. To properly investigate and resolve this issue, we would need a detailed set of steps that can reliably reproduce the problem. This kind of in-depth troubleshooting demands focused time and attention, which, regrettably, I'm unable to dedicate at this moment.

I understand this might not be the response you were hoping for, and I appreciate your patience and understanding. Your issue is important to us, and I assure you it’s on our radar. As soon as my current obligations have been met, I will take a closer look at your situation. Meanwhile, if you're able to provide any additional information or steps to reproduce the issue, it would be incredibly helpful for when we are able to address this.

Thanks for your understanding and for being a part of our community. Your contributions help us improve, and we're looking forward to resolving this together as soon as possible.

@dwevedivaibhav (Author)

Hi @sfmskywalker ,

It sounds like the problem lies in the resuming workflow method, particularly in the handling of server information in the workflow instance. Without including server information in the workflow instance, the system may mistakenly acquire locks for workflows that are already running on other servers, leading to duplicates.

To address this issue, we should ensure that server information is properly incorporated into the workflow instance. By doing so, we can accurately identify which server a workflow belongs to and prevent duplicates during job distribution and lock acquisition.

Once we have updated the workflow instances to include server information, we can refine the resuming workflow method to consider this information when acquiring locks. This should help resolve the issue of duplicates caused by incorrectly distributing and acquiring locks for workflows running on different servers.
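To make the idea concrete, here is a minimal sketch of the filtering I have in mind (the type and property names are my own illustrations, not Elsa's actual model; OwnerServerId is the proposed new field):

using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical projection of a running workflow instance.
public record RunningInstance(string Id, string? OwnerServerId);

public static class ResumeFilter
{
    // On startup, resume only instances this server owns, instances that
    // were never stamped, or instances whose owner is known to be dead.
    public static IEnumerable<RunningInstance> SelectResumable(
        IEnumerable<RunningInstance> running,
        string thisServerId,
        Func<string, bool> isServerDead) =>
        running.Where(i =>
            i.OwnerServerId is null ||
            i.OwnerServerId == thisServerId ||
            isServerDead(i.OwnerServerId));
}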

Please take a look at the exact file below, which should help identify the issue:

https://github.com/elsa-workflows/elsa-core/blob/2.10.2.2/src/core/Elsa.Core/StartupTasks/ContinueRunningWorkflows.cs

Thanks

@dwevedivaibhav (Author)

Hi @sfmskywalker,

Any thoughts on this?

@sfmskywalker (Member)

Makes sense. We probably also need to consider that at some point a server will go down, e.g. when hosting in a Kubernetes cluster. If a workflow instance is still associated with that server, and that workflow instance is still in the Running state, then it will not be picked up by new servers. Perhaps we would need a heartbeat system where servers update a table record every minute, for example. If a given server hasn't reported a heartbeat for, say, 5 minutes, it is considered dead, in which case new servers are free to pick up its Running workflow instances.
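As a rough illustration of the idea (everything below is hypothetical; none of these types exist in Elsa today):

using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;

// Hypothetical heartbeat store; in practice this would be a small table or
// collection keyed by server ID.
public interface IServerHeartbeatStore
{
    Task ReportAsync(string serverId, DateTimeOffset seenAt, CancellationToken ct);
    Task<DateTimeOffset?> GetLastSeenAsync(string serverId, CancellationToken ct);
}

// Each server reports a heartbeat once per minute.
public class HeartbeatService : BackgroundService
{
    private readonly IServerHeartbeatStore _store;
    private readonly string _serverId = Environment.MachineName; // or pod name

    public HeartbeatService(IServerHeartbeatStore store) => _store = store;

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            await _store.ReportAsync(_serverId, DateTimeOffset.UtcNow, stoppingToken);
            await Task.Delay(TimeSpan.FromMinutes(1), stoppingToken);
        }
    }
}

// A server that hasn't reported for 5 minutes is considered dead; only then
// may another node take over its Running workflow instances.
public static class ServerLiveness
{
    public static async Task<bool> IsDeadAsync(
        IServerHeartbeatStore store, string serverId, CancellationToken ct)
    {
        var lastSeen = await store.GetLastSeenAsync(serverId, ct);
        return lastSeen is null ||
               DateTimeOffset.UtcNow - lastSeen.Value > TimeSpan.FromMinutes(5);
    }
}

Registered via services.AddHostedService<HeartbeatService>(), each node keeps its own record fresh, and the resume logic would consult IsDeadAsync before taking over another server's instances.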

@dwevedivaibhav (Author)

Hi @sfmskywalker
I agree with your thoughts.

Thanks

@dwevedivaibhav (Author)

Hi @sfmskywalker

I'm currently working on some custom changes and trying to register a class as a startup hook in my project. However, the approach I'm using isn't working as expected. Do you have any insights or ideas on how to proceed?

Below is the code snippet showing how I'm attempting to implement it:

services
    .AddElsa(elsa =>
    {
        elsa.UseMongoDbPersistence(ef => ef.ConnectionString = mongoDatabaseOptions?.ConnectionString + "/" + mongoDatabaseOptions?.DatabaseName);
        elsa.UseRedisCacheSignal();
        elsa.ConfigureDistributedLockProvider(options => options.UseRedisLockProvider());
        elsa.AddQuartzTemporalActivities();
        elsa.UseHangfireDispatchers();

        // This is where we register a custom hook in place of ContinueRunningWorkflows:
        // https://github.com/elsa-workflows/elsa-core/blob/2.10.2.2/src/core/Elsa.Core/StartupTasks/ContinueRunningWorkflows.cs
        elsa.Services.AddStartupTask<CustomContinueRunningWorkflowsStartupTask>();
    });
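To clarify the intent: I want my custom task to replace the built-in ContinueRunningWorkflows rather than run alongside it. My assumption (not yet verified against the Elsa 2 source; the exact namespaces may differ by version) is that the stock task stays registered as an IStartupTask unless it is removed first, along these lines inside the AddElsa callback:

// Requires: using System.Linq; using Microsoft.Extensions.DependencyInjection;
// plus the Elsa namespaces that define IStartupTask and ContinueRunningWorkflows.
var builtIn = elsa.Services.FirstOrDefault(d =>
    d.ServiceType == typeof(IStartupTask) &&
    d.ImplementationType == typeof(ContinueRunningWorkflows));

if (builtIn != null)
    elsa.Services.Remove(builtIn); // keep only the custom task

elsa.Services.AddStartupTask<CustomContinueRunningWorkflowsStartupTask>();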

Your assistance and guidance on this matter would be greatly appreciated.

Thanks
