-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fault array, fix retry, new parameter in Join activity #3182
Fault array, fix retry, new parameter in Join activity #3182
Conversation
…to support wait faulted activities.
Willing to add more changes to this PR asap about the "Run workflow" activity because I just found that If the secondary workflow is faulted, the Run workflow activity is faulted, but if you try to retry the parent workflow, it generates a new secondary workflow instance instead of retrying the faulted one. |
This looks very interesting and we definitely need more options for fault handling. @sfmskywalker can you please take a look and advise as it has breaking changes. |
Great findings and solutions!
Exactly. Your fix for this with the RunWorkflow activity is clever. The "real" fix might be for each activity execution to have its own instance ID - but that might require some significant refactorings, perhaps via #1059 - obviously out of scope for this PR As an FYI: Elsa 3 already handles this today where each activity execution acquires its ow unique instance ID. |
@@ -188,7 +188,7 @@ public void Fault(Exception? exception, string message, string? activityId, obje | |||
var clock = ServiceProvider.GetRequiredService<IClock>(); | |||
WorkflowInstance.WorkflowStatus = WorkflowStatus.Faulted; | |||
WorkflowInstance.FaultedAt = clock.GetCurrentInstant(); | |||
WorkflowInstance.Fault = new WorkflowFault(SimpleException.FromException(exception), message, activityId, activityInput, resuming); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of deleting this Fault
property, could we keep it and mark it as Obsolete
and change its implementation to get Faults.FirstOrDefault()
as to try and limit the number of breaking changes for users with existing workflow instances?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, OK, I'll mark it as Obsolete but that wouldn't prevent "breaking" existing workflows. If we do that, we will lost the Fault original value anyway because it will never be accessible. Maybe I can think something about moving the Fault value to the new Faults property so everything will be aligned. I can try to find somewhere in the code where that movement could fit without breaking, and thus, removing it in the future, like a "patch". Maybe in the workflow instance constructor. Will wait for your response for this review
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I can think something about moving the Fault value to the new Faults property so everything will be aligned
Exactly, that would be sweet.
Thanks man, appreciate your efforts!
@@ -47,6 +47,10 @@ public enum JoinMode | |||
)] | |||
public JoinMode Mode { get; set; } | |||
|
|||
|
|||
[ActivityInput(Hint = "True if incoming activities executed should not be Faulted to continue.", UIHint = ActivityInputUIHints.SingleLine, SupportedSyntaxes = new[] { SyntaxNames.JavaScript, SyntaxNames.Liquid })] | |||
public bool WaitFaulted { get; set; } = default!; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if it wouldn't be better to always stop the workflow when it faulted (and thereby removing this property)? It seems to me that as soon as the workflow faults, no activity should continue. The fact that the Join
activity still continues after a previous activity faulted is a bug IMHO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that stopping when faulted should be the correct way to do the Join Activity. Changing it to be the default behaviour and removing the property in next commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect.
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed a few excessive blank lines here and there - let's tidy it up 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad 😉 my brain was struggling hard with this Activity. Cleaning it next commit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can relate 😄
@@ -242,7 +243,7 @@ export class ElsaWorkflowInstanceJournal { | |||
<div class="elsa-relative elsa-flex elsa-space-x-3"> | |||
<div> | |||
<span | |||
class="elsa-h-8 elsa-w-8 elsa-rounded-full elsa-bg-green-500 elsa-flex elsa-items-center elsa-justify-center elsa-ring-8 elsa-ring-white elsa-mr-1" | |||
class={`elsa-h-8 elsa-w-8 elsa-rounded-full elsa-bg-${eventColor}-500 elsa-flex elsa-items-center elsa-justify-center elsa-ring-8 elsa-ring-white elsa-mr-1`} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am guilty of this as well, but we shouldn't dynamically construct TailwindCSS classes like this, because they will be filtered out by the CSS cleanup phase in the build action.
Instead, it's better to assign the complete class value to the eventColor
variable. Example (pseudo code):
const eventColor = faulted ? 'elsa-bg-red-500' : 'elsa-bg-green-500'
This way, the complete TailwindCSS class strings will be recognized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, changing it.
When I started to design some testing workflows to learn about Elsa I was a bit surprised when the execution log in a loop was showing the same ActivityId for one looped activity and also just storing the last executed activity value. Moreover, that means that the parallelism is limited. I thought that, using hangfire, it would create a hangfire job for each branch, at least in the parallel foreach. That issue you are talking about would be very nice but indeed, requires a lot of refactoring. Can you point me how have you done this in Elsa 3? Maybe I can think about it! 😄 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome job 👍🏻
Thanks, I saw you merged! Will try to finish the Fault/Faults change in a new PR today. |
Sure! In Elsa 3, each activity execution context has its own ID, which could be interpreted as a unique activity instance ID. |
This is really exciting and I want to test it soon. @smartwaystudio is it possible to share the json of the test cases you created? |
What I've done to test was creating some workflow (Fork, foreach, pforeach, while, fork-join, runWorkflow, foreach-runworkflow) and temporary modifying the Fault activity. The first run, the workflow fails and then, I change the Fault activity to return Done() (also adding a outcome to connect an activity to it) to test the retry. I've done this way because there is no way to edit the current version of workflow with dashboard. Do you want a export of those workflow with the modified fault? |
Yes please share the export. I'm not sure how you made the modification but I think if you unpublish then republish the workflow it will modify the current version. |
Here you have, some json workflows and |
@smartwaystudio thanks for the examples! I have tested with the scenario of having dependent activities in the foreach and they executed once the workflow is retired and the error is cleared. This is really awesome! |
This PR has breaking changes
Hi, I will try to set up my needs to explain what is this PR about. We are migrating our app core from custom Hangfire jobs to Elsa workflows with custom activities to enrich our app and also to have the ability to modify the workflows without making changes on the code. So, the first thing I did was to copy my hangfire flow to custom activities on Elsa. Everything was OK and the workflow was being executing with the same results. Then, I started do dig more, so, I purpossely failed some activities (custom and fault activity) to try the retry functionality and it wasn't working. Already opened an issue with the problem . If I couldn't retry my failed activities, I would need to give up elsa, so I started to checking the core code. I saw that Workflowinstance.CurrentActivity was always null, so the reviver was always using the first workflow activity as retry. But also found another problem (at least in my scenario). Elsa was only saving to database the last fault activity, so, in a loop or fork, just the last fault activity was being tagged as faulted, the others were OK. I changed the fault attribute to faults as SimpleStack. Now, I can save all the faulted activities. And, then, changed the reviver to check if there are faulted activities to schedule on retry. Now it's working good, even the retry. Now you can retry all the faulted activities inside a loop.
(Be careful with how you design your loop. If the activities input are dependant on other activities, for example, the output of the loop, it won't work as expected, because it will use the last output of the loop as input for every activity that uses "return foreach.Output()". The way to solve this is to transmit the input from the first activity in the loop to the input of the next activity, just returning it as output.)
Also, I found another problem. In a fork scenario, the final join continues the execution (with both configuration modes) even if the inbound activities has failed (tagged as faulted). I added a new parameter (bool) to the Join activity that prevents that. It checks if there are fault or scheduled activities (this change is chained with the previous) and stop the execution. Just changed IsDone method. Now, it will add to the journal the activities that have faulted, like the inbounds journal data.
There are 2 activity list checked on isDone. Faults is used to know if there are some faults in the current workflowinstance execution. The Scheduled activity list is used on retry. When you retry a faulted workflow, it moves the activities from Faults to scheduled list.
Finally, I updated the designer with the new Faults parameter (was Fault) and now, all the faulted activities in the designer are tagged red.