Urgent: Issue with Job Duplication During Server Crash and Restart in distribution server #5045
Comments
Hi,

Based on the context provided, it seems you are experiencing issues with job duplication in a distributed environment using Elsa Workflows when a server crashes and a new server comes up. This is indeed a critical issue that can lead to performance problems and inconsistent state within your workflow system.

To address it, it's important to ensure that your workflow instances are correctly managed across server restarts and crashes. Elsa Workflows is designed for distributed execution, but it requires careful configuration, especially when dealing with long-running workflows and background jobs. One approach to mitigate such issues is to use a persistent storage provider that supports distributed locks, such as Redis, which can help prevent multiple instances of the same job from being executed simultaneously. Additionally, verify that your Hangfire server (or whichever background job system you are using) is configured to handle server crashes and restarts without duplicating jobs.

Since you mentioned that you have already implemented distributed locks but are still facing issues, it may be necessary to review the current implementation for misconfigurations, or to consider alternative strategies for managing workflow instances and background jobs. Given the complexity of the issue, it would be beneficial to engage directly with the Elsa Workflows community or the maintainers for more in-depth support. You can create an issue on the Elsa Workflows GitHub repository detailing your problem, environment setup, and any logs or error messages you have; this will allow the maintainers and community to provide more targeted assistance. For immediate troubleshooting, you can refer to the Elsa Workflows documentation, GitHub issues, and Stack Overflow questions related to similar problems. If there is a known bug or limitation, it may be documented in the GitHub issues or in the project's release notes.

Please ensure that you are using the latest stable version of Elsa Workflows, as updates often include bug fixes and improvements that could resolve your issue. If you need further assistance, please provide more details about your current setup, including the version of Elsa Workflows you are using, the storage provider, the background job system configuration, and any relevant logs or error messages.
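The distributed-lock approach suggested above can be illustrated language-agnostically. Below is a minimal Python sketch, not Elsa's actual implementation: `LockStore` is an in-memory stand-in for a shared store such as Redis (conceptually `SET key value NX PX ttl`), and all names and the TTL value are hypothetical.

```python
import time

class LockStore:
    """In-memory stand-in for a shared lock store such as Redis
    (conceptually: SET key value NX PX ttl)."""

    def __init__(self):
        self._locks = {}  # key -> (owner, expires_at)

    def try_acquire(self, key, owner, ttl_seconds):
        now = time.monotonic()
        held = self._locks.get(key)
        if held is None or held[1] <= now:
            # Lock is free, or the previous holder's lease expired
            # (e.g. that server crashed and never released it).
            self._locks[key] = (owner, now + ttl_seconds)
            return True
        return False

    def release(self, key, owner):
        # Only the current owner may release; a crashed server's stale
        # lock is never stolen, it simply expires after its TTL.
        held = self._locks.get(key)
        if held is not None and held[0] == owner:
            del self._locks[key]
```

The key property for the scenario described in this thread: as long as Server A's lease on Workflow A has not expired, no other server's `try_acquire` for that instance can succeed.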
Please find below the configuration which I have added in startup: `services.Configure<TaskSettings>(Configuration.GetSection("TaskSettings"));`
Hi @sfmskywalker We have two servers, Server A and Server B. On Server A, we have Workflow A running, and on Server B, Workflow B is running. In the event of Server B crashing, a new server, Server C, takes over and resumes Workflow B from where it left off. However, we've observed an unexpected behavior where Workflow A from Server A also starts running on Server C, causing duplicate calls and inconsistencies in our system. Despite implementing distributed locks, we're puzzled as to why Workflow A from Server A is being executed on Server C. This issue does not occur under normal circumstances when servers are not crashing. We would greatly appreciate any insights or suggestions you may have on resolving this issue and ensuring the proper execution of workflows across servers. Thank you for your attention to this matter.
@sfmskywalker Is there any estimated timeline for fixing this issue?
Hello @dwevedivaibhav,

Thank you for bringing this issue to our attention. I really wish I could offer a specific timeline for addressing it, but currently my schedule is quite packed due to other projects and commitments, especially those related to my paying clients.

It appears from your description that Server C is unexpectedly handling workflow instances that are in a Running state. Ideally, this scenario shouldn't occur, since Server A is supposed to hold an exclusive lock on Workflow A, ensuring that no other servers interfere with its operations. To properly investigate and resolve this issue, we would need a detailed set of steps that can reliably reproduce the problem. This kind of in-depth troubleshooting demands focused time and attention, which, regrettably, I'm unable to dedicate at this moment.

I understand this might not be the response you were hoping for, and I appreciate your patience and understanding. Your issue is important to us, and I assure you it's on our radar. As soon as my current obligations have been met, I will take a closer look at your situation. Meanwhile, if you're able to provide any additional information or steps to reproduce the issue, it would be incredibly helpful for when we are able to address this.

Thanks for your understanding and for being a part of our community. Your contributions help us improve, and we're looking forward to resolving this together as soon as possible.
Hi @sfmskywalker,

It sounds like the problem lies in the workflow-resuming logic, particularly in how server information is handled on the workflow instance. Without server information stored on the workflow instance, the system may mistakenly acquire locks for workflows that are already running on other servers, leading to duplicates.

To address this, we should ensure that server information is properly incorporated into the workflow instance. By doing so, we can accurately identify which server a workflow belongs to and prevent duplicates during job distribution and lock acquisition. Once workflow instances include server information, we can refine the resuming logic to consider it when acquiring locks. This should resolve the duplicates caused by incorrectly distributing and acquiring locks for workflows running on different servers.

Please find the exact file info below to take a look at and identify the issue. Thanks
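The server-affinity idea above can be sketched as a filter a recovering server applies before acquiring any locks. This is a hypothetical Python illustration of the concept (the field names `status` and `server_id` are assumptions, not Elsa's schema):

```python
def claimable_instances(instances, live_servers):
    """Return the Running workflow instances a recovering server may
    safely resume: those with no recorded owner, or whose recorded
    owner server is no longer alive (e.g. it crashed)."""
    return [
        inst for inst in instances
        if inst["status"] == "Running"
        and (inst.get("server_id") is None
             or inst["server_id"] not in live_servers)
    ]
```

Applied to the scenario in this thread: when Server C starts up after Server B crashes, Workflow A still belongs to the live Server A and is skipped, while Workflow B's owner is dead and is therefore eligible for resumption.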
Hi @sfmskywalker, any thoughts on this?
Makes sense. Probably, we also need to consider that at some point a server will go down, e.g. when hosting in a Kubernetes cluster. If a workflow instance is still associated with that server, and that workflow instance is still in the Running state, then it will not be picked up by new servers. Perhaps we would need a heartbeat system where servers update a table record every minute, for example. If a given server hasn't reported a heartbeat for e.g. 5 minutes, it is considered dead, in which case new servers are free to pick up the Running workflow instance.
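The heartbeat scheme proposed above (beat every minute, dead after 5 minutes without one) could be sketched as follows. This is a minimal Python illustration under those assumptions, with a plain dict standing in for the shared table; none of these names are Elsa APIs.

```python
import time

HEARTBEAT_INTERVAL = 60  # each server writes a heartbeat every minute
DEAD_AFTER = 300         # no heartbeat for 5 minutes => server considered dead

def beat(heartbeats, server_id, now=None):
    """Record a heartbeat for server_id in a shared table (a dict here)."""
    heartbeats[server_id] = time.time() if now is None else now

def live_servers(heartbeats, now=None):
    """Servers whose last heartbeat is recent enough to count as alive.
    Running instances owned by any other server are fair game for pickup."""
    now = time.time() if now is None else now
    return {sid for sid, last in heartbeats.items() if now - last < DEAD_AFTER}
```

One design note: the liveness threshold should be a comfortable multiple of the heartbeat interval (here 5x), so that a single delayed write does not cause a healthy server's workflows to be claimed by another node.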
Hi @sfmskywalker, thanks.
I'm currently working on some custom changes and trying to call a class as a hook in my project. However, the approach I'm using isn't working as expected. Do you have any insights or ideas on how to proceed? Below is the code snippet showing how I'm attempting to implement it: `services`

Your assistance and guidance on this matter would be greatly appreciated. Thanks
Hi @sfmskywalker,
I hope you're doing well.
We've encountered a critical issue with job duplication in a distributed environment when a server crashes and a new server comes up. Despite implementing distributed locks, we're seeing jobs that were running on the original server also start running on the new server after the crash and restart. This is causing duplicates and impacting our system's performance.
Your urgent attention to this matter is greatly appreciated. We need to resolve this issue as soon as possible to prevent any further disruptions.
Thank you for your assistance.