Fault array, fix retry, new parameter in Join activity #3182

smartwaystudio · 2022-07-04T11:45:25Z

This PR has breaking changes

Hi, I will try to set up my needs to explain what is this PR about. We are migrating our app core from custom Hangfire jobs to Elsa workflows with custom activities to enrich our app and also to have the ability to modify the workflows without making changes on the code. So, the first thing I did was to copy my hangfire flow to custom activities on Elsa. Everything was OK and the workflow was being executing with the same results. Then, I started do dig more, so, I purpossely failed some activities (custom and fault activity) to try the retry functionality and it wasn't working. Already opened an issue with the problem . If I couldn't retry my failed activities, I would need to give up elsa, so I started to checking the core code. I saw that Workflowinstance.CurrentActivity was always null, so the reviver was always using the first workflow activity as retry. But also found another problem (at least in my scenario). Elsa was only saving to database the last fault activity, so, in a loop or fork, just the last fault activity was being tagged as faulted, the others were OK. I changed the fault attribute to faults as SimpleStack. Now, I can save all the faulted activities. And, then, changed the reviver to check if there are faulted activities to schedule on retry. Now it's working good, even the retry. Now you can retry all the faulted activities inside a loop.

(Be careful with how you design your loop. If the activities input are dependant on other activities, for example, the output of the loop, it won't work as expected, because it will use the last output of the loop as input for every activity that uses "return foreach.Output()". The way to solve this is to transmit the input from the first activity in the loop to the input of the next activity, just returning it as output.)

Also, I found another problem. In a fork scenario, the final join continues the execution (with both configuration modes) even if the inbound activities has failed (tagged as faulted). I added a new parameter (bool) to the Join activity that prevents that. It checks if there are fault or scheduled activities (this change is chained with the previous) and stop the execution. Just changed IsDone method. Now, it will add to the journal the activities that have faulted, like the inbounds journal data.

There are 2 activity list checked on isDone. Faults is used to know if there are some faults in the current workflowinstance execution. The Scheduled activity list is used on retry. When you retry a faulted workflow, it moves the activities from Faults to scheduled list.

Finally, I updated the designer with the new Faults parameter (was Fault) and now, all the faulted activities in the designer are tagged red.

…to support wait faulted activities.

dnfadmin · 2022-07-04T11:45:37Z

All CLA requirements met.

smartwaystudio · 2022-07-05T12:05:36Z

Willing to add more changes to this PR asap about the "Run workflow" activity because I just found that If the secondary workflow is faulted, the Run workflow activity is faulted, but if you try to retry the parent workflow, it generates a new secondary workflow instance instead of retrying the faulted one.

mohdali · 2022-07-06T22:37:24Z

This looks very interesting and we definitely need more options for fault handling.

@sfmskywalker can you please take a look and advise as it has breaking changes.

…ng inside a foreach too.

smartwaystudio · 2022-07-08T09:40:13Z

Hi, as I told, new changes have been pushed. The RunWorkFlow activity was always creating a new ChildWorkflow instance when retrying. There could be some scenario that you don't want that to happen, so I added a new Advanced parameter in the activity to, instead of creating new instances, retry the faulted when retrying.

There is a general "problem" in all activities. The problem is that they are not unique in a branched workflow. Inside a loop, the same activity instance is used for all the iterations, so is hard to determine "who am I". That was happening also inside runworkflow. If not in a branched workflow, the retry was easy, but inside a loop, you can't determine what child workflow needs to be revived, because the activity just store the last iteration state inside, so ChildWorkflowInstanceID was pointing to the last. So, what I've done is to generate a hash for the values of the Input and store it. When retrying, using the input hash I can get the real ChildWorkflowInstance id and retry it.

Here is a test without and with that parameter active. I just have a 4 iterations loop that fault. If I retry without that option, I get another 4 instances of the ChildWorkflow:

If the option is active, it retry the faulted (I changed the Fault activity code to fast testing, returning done):

sfmskywalker · 2022-07-08T18:45:32Z

Great findings and solutions!

There is a general "problem" in all activities. The problem is that they are not unique in a branched workflow. Inside a loop, the same activity instance is used for all the iterations, so is hard to determine "who am I"

Exactly. Your fix for this with the RunWorkflow activity is clever. The "real" fix might be for each activity execution to have its own instance ID - but that might require some significant refactorings, perhaps via #1059 - obviously out of scope for this PR

As an FYI: Elsa 3 already handles this today where each activity execution acquires its ow unique instance ID.

sfmskywalker · 2022-07-08T18:53:42Z

src/core/Elsa.Abstractions/Services/Models/WorkflowExecutionContext.cs

@@ -188,7 +188,7 @@ public void Fault(Exception? exception, string message, string? activityId, obje
            var clock = ServiceProvider.GetRequiredService<IClock>();
            WorkflowInstance.WorkflowStatus = WorkflowStatus.Faulted;
            WorkflowInstance.FaultedAt = clock.GetCurrentInstant();
-            WorkflowInstance.Fault = new WorkflowFault(SimpleException.FromException(exception), message, activityId, activityInput, resuming);


Instead of deleting this Fault property, could we keep it and mark it as Obsolete and change its implementation to get Faults.FirstOrDefault() as to try and limit the number of breaking changes for users with existing workflow instances?

Hi, OK, I'll mark it as Obsolete but that wouldn't prevent "breaking" existing workflows. If we do that, we will lost the Fault original value anyway because it will never be accessible. Maybe I can think something about moving the Fault value to the new Faults property so everything will be aligned. I can try to find somewhere in the code where that movement could fit without breaking, and thus, removing it in the future, like a "patch". Maybe in the workflow instance constructor. Will wait for your response for this review

Maybe I can think something about moving the Fault value to the new Faults property so everything will be aligned

Exactly, that would be sweet.

Thanks man, appreciate your efforts!

sfmskywalker · 2022-07-08T18:56:53Z

src/core/Elsa.Core/Activities/ControlFlow/Join/Join.cs

@@ -47,6 +47,10 @@ public enum JoinMode
        )]
        public JoinMode Mode { get; set; }

+
+        [ActivityInput(Hint = "True if incoming activities executed should not be Faulted to continue.", UIHint = ActivityInputUIHints.SingleLine, SupportedSyntaxes = new[] { SyntaxNames.JavaScript, SyntaxNames.Liquid })]
+        public bool WaitFaulted { get; set; } = default!;


I'm wondering if it wouldn't be better to always stop the workflow when it faulted (and thereby removing this property)? It seems to me that as soon as the workflow faults, no activity should continue. The fact that the Join activity still continues after a previous activity faulted is a bug IMHO.

I agree that stopping when faulted should be the correct way to do the Join Activity. Changing it to be the default behaviour and removing the property in next commit.

sfmskywalker · 2022-07-08T19:00:01Z

src/core/Elsa.Core/Activities/Workflows/RunWorkflow/RunWorkflow.cs

+
+

I noticed a few excessive blank lines here and there - let's tidy it up 😄

My bad 😉 my brain was struggling hard with this Activity. Cleaning it next commit

I can relate 😄

sfmskywalker · 2022-07-08T19:04:22Z

...s/workflow-instance-viewer/elsa-workflow-instance-journal/elsa-workflow-instance-journal.tsx

@@ -242,7 +243,7 @@ export class ElsaWorkflowInstanceJournal {
              <div class="elsa-relative elsa-flex elsa-space-x-3">
                <div>
                  <span
-                    class="elsa-h-8 elsa-w-8 elsa-rounded-full elsa-bg-green-500 elsa-flex elsa-items-center elsa-justify-center elsa-ring-8 elsa-ring-white elsa-mr-1"
+                    class={`elsa-h-8 elsa-w-8 elsa-rounded-full elsa-bg-${eventColor}-500 elsa-flex elsa-items-center elsa-justify-center elsa-ring-8 elsa-ring-white elsa-mr-1`}


I am guilty of this as well, but we shouldn't dynamically construct TailwindCSS classes like this, because they will be filtered out by the CSS cleanup phase in the build action.

Instead, it's better to assign the complete class value to the eventColor variable. Example (pseudo code):

const eventColor = faulted ? 'elsa-bg-red-500' : 'elsa-bg-green-500'

This way, the complete TailwindCSS class strings will be recognized.

Nice, changing it.

smartwaystudio · 2022-07-11T10:50:00Z

Great findings and solutions!

There is a general "problem" in all activities. The problem is that they are not unique in a branched workflow. Inside a loop, the same activity instance is used for all the iterations, so is hard to determine "who am I"

Exactly. Your fix for this with the RunWorkflow activity is clever. The "real" fix might be for each activity execution to have its own instance ID - but that might require some significant refactorings, perhaps via #1059 - obviously out of scope for this PR

As an FYI: Elsa 3 already handles this today where each activity execution acquires its ow unique instance ID.

When I started to design some testing workflows to learn about Elsa I was a bit surprised when the execution log in a loop was showing the same ActivityId for one looped activity and also just storing the last executed activity value. Moreover, that means that the parallelism is limited. I thought that, using hangfire, it would create a hangfire job for each branch, at least in the parallel foreach. That issue you are talking about would be very nice but indeed, requires a lot of refactoring. Can you point me how have you done this in Elsa 3? Maybe I can think about it! 😄

sfmskywalker

Awesome job 👍🏻

smartwaystudio · 2022-07-11T14:51:43Z

Awesome job 👍🏻

Thanks, I saw you merged! Will try to finish the Fault/Faults change in a new PR today.

sfmskywalker · 2022-07-11T14:51:57Z

Can you point me how have you done this in Elsa 3? Maybe I can think about it! 😄

Sure! In Elsa 3, each activity execution context has its own ID, which could be interpreted as a unique activity instance ID.
This way, even when executing in a loop, every time an activity executes, it has its own activity execution context and ID. The activity execution context can store information that gets persisted when the workflow instance gets persisted.

mohdali · 2022-07-11T15:16:44Z

This is really exciting and I want to test it soon. @smartwaystudio is it possible to share the json of the test cases you created?

smartwaystudio · 2022-07-11T17:36:15Z

This is really exciting and I want to test it soon. @smartwaystudio is it possible to share the json of the test cases you created?

What I've done to test was creating some workflow (Fork, foreach, pforeach, while, fork-join, runWorkflow, foreach-runworkflow) and temporary modifying the Fault activity. The first run, the workflow fails and then, I change the Fault activity to return Done() (also adding a outcome to connect an activity to it) to test the retry. I've done this way because there is no way to edit the current version of workflow with dashboard.

Do you want a export of those workflow with the modified fault?

mohdali · 2022-07-11T17:47:40Z

This is really exciting and I want to test it soon. @smartwaystudio is it possible to share the json of the test cases you created?

What I've done to test was creating some workflow (Fork, foreach, pforeach, while, fork-join, runWorkflow, foreach-runworkflow) and temporary modifying the Fault activity. The first run, the workflow fails and then, I change the Fault activity to return Done() (also adding a outcome to connect an activity to it) to test the retry. I've done this way because there is no way to edit the current version of workflow with dashboard.

Do you want a export of those workflow with the modified fault?

Yes please share the export. I'm not sure how you made the modification but I think if you unpublish then republish the workflow it will modify the current version.

smartwaystudio · 2022-07-12T16:48:45Z

This is really exciting and I want to test it soon. @smartwaystudio is it possible to share the json of the test cases you created?

What I've done to test was creating some workflow (Fork, foreach, pforeach, while, fork-join, runWorkflow, foreach-runworkflow) and temporary modifying the Fault activity. The first run, the workflow fails and then, I change the Fault activity to return Done() (also adding a outcome to connect an activity to it) to test the retry. I've done this way because there is no way to edit the current version of workflow with dashboard.
Do you want a export of those workflow with the modified fault?

Yes please share the export. I'm not sure how you made the modification but I think if you unpublish then republish the workflow it will modify the current version.

test_workflows.zip

Here you have, some json workflows and Fault.cs. What I do is to force force Fault activity to sometimes return Done(). First execution of the workflow I return fault or exception and then, I change the code to return Done() to verify that the workflow ends properly. Is like emulating a scenario when an activity suddenly fails and you just need to retry it (ex: because it was a network error)

mohdali · 2022-07-15T16:26:59Z

@smartwaystudio thanks for the examples!

I have tested with the scenario of having dependent activities in the foreach and they executed once the workflow is retired and the error is cleared. This is really awesome!

Rubén M added 2 commits July 4, 2022 12:52

Faulted activity as array, updating designer, updating join activity …

456f3fd

…to support wait faulted activities.

Update test

8458ebf

smartwaystudio changed the title ~~Feature/fault array~~ Fault array, fix retry, new parameter in Join activity Jul 4, 2022

mohdali requested a review from sfmskywalker July 6, 2022 22:35

Rubén M added 3 commits July 7, 2022 14:47

Temp

ec80d56

Run faulted subworkflow instance instead of creating a new one. Worki…

fe2834e

…ng inside a foreach too.

RunWorkflow Activity retry done

61b74cf

sfmskywalker reviewed Jul 8, 2022

View reviewed changes

Rubén M added 3 commits July 11, 2022 12:53

Removing WaitFaulted property to become the default behaviour

5d4c4d3

RunWorkflow tidy up

69ba864

EventColor class value to a variable instead of dynamically constructing

e650798

sfmskywalker approved these changes Jul 11, 2022

View reviewed changes

sfmskywalker merged commit 4d95f7d into elsa-workflows:master Jul 11, 2022

smartwaystudio mentioned this pull request Jul 11, 2022

Restore Fault property and mark it as obsolete. Patched it for less impact (ref PR 3182) #3194

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fault array, fix retry, new parameter in Join activity #3182

Fault array, fix retry, new parameter in Join activity #3182

smartwaystudio commented Jul 4, 2022

dnfadmin commented Jul 4, 2022 •

edited

smartwaystudio commented Jul 5, 2022

mohdali commented Jul 6, 2022

smartwaystudio commented Jul 8, 2022

sfmskywalker commented Jul 8, 2022

sfmskywalker Jul 8, 2022

smartwaystudio Jul 11, 2022 •

edited

sfmskywalker Jul 11, 2022

sfmskywalker Jul 8, 2022

smartwaystudio Jul 11, 2022

sfmskywalker Jul 11, 2022

sfmskywalker Jul 8, 2022

smartwaystudio Jul 11, 2022

sfmskywalker Jul 11, 2022

sfmskywalker Jul 8, 2022

smartwaystudio Jul 11, 2022

smartwaystudio commented Jul 11, 2022

sfmskywalker left a comment

smartwaystudio commented Jul 11, 2022

sfmskywalker commented Jul 11, 2022

mohdali commented Jul 11, 2022

smartwaystudio commented Jul 11, 2022 •

edited

mohdali commented Jul 11, 2022

smartwaystudio commented Jul 12, 2022

mohdali commented Jul 15, 2022

Fault array, fix retry, new parameter in Join activity #3182

Fault array, fix retry, new parameter in Join activity #3182

Conversation

smartwaystudio commented Jul 4, 2022

dnfadmin commented Jul 4, 2022 • edited

smartwaystudio commented Jul 5, 2022

mohdali commented Jul 6, 2022

smartwaystudio commented Jul 8, 2022

sfmskywalker commented Jul 8, 2022

Choose a reason for hiding this comment

smartwaystudio Jul 11, 2022 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

smartwaystudio commented Jul 11, 2022

sfmskywalker left a comment

Choose a reason for hiding this comment

smartwaystudio commented Jul 11, 2022

sfmskywalker commented Jul 11, 2022

mohdali commented Jul 11, 2022

smartwaystudio commented Jul 11, 2022 • edited

mohdali commented Jul 11, 2022

smartwaystudio commented Jul 12, 2022

mohdali commented Jul 15, 2022

dnfadmin commented Jul 4, 2022 •

edited

smartwaystudio Jul 11, 2022 •

edited

smartwaystudio commented Jul 11, 2022 •

edited