Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fault array, fix retry, new parameter in Join activity #3182

Merged

Conversation

smartwaystudio
Copy link

This PR has breaking changes

Hi, I will try to set up my needs to explain what is this PR about. We are migrating our app core from custom Hangfire jobs to Elsa workflows with custom activities to enrich our app and also to have the ability to modify the workflows without making changes on the code. So, the first thing I did was to copy my hangfire flow to custom activities on Elsa. Everything was OK and the workflow was being executing with the same results. Then, I started do dig more, so, I purpossely failed some activities (custom and fault activity) to try the retry functionality and it wasn't working. Already opened an issue with the problem . If I couldn't retry my failed activities, I would need to give up elsa, so I started to checking the core code. I saw that Workflowinstance.CurrentActivity was always null, so the reviver was always using the first workflow activity as retry. But also found another problem (at least in my scenario). Elsa was only saving to database the last fault activity, so, in a loop or fork, just the last fault activity was being tagged as faulted, the others were OK. I changed the fault attribute to faults as SimpleStack. Now, I can save all the faulted activities. And, then, changed the reviver to check if there are faulted activities to schedule on retry. Now it's working good, even the retry. Now you can retry all the faulted activities inside a loop.
imagen
imagen

(Be careful with how you design your loop. If the activities input are dependant on other activities, for example, the output of the loop, it won't work as expected, because it will use the last output of the loop as input for every activity that uses "return foreach.Output()". The way to solve this is to transmit the input from the first activity in the loop to the input of the next activity, just returning it as output.)

Also, I found another problem. In a fork scenario, the final join continues the execution (with both configuration modes) even if the inbound activities has failed (tagged as faulted). I added a new parameter (bool) to the Join activity that prevents that. It checks if there are fault or scheduled activities (this change is chained with the previous) and stop the execution. Just changed IsDone method. Now, it will add to the journal the activities that have faulted, like the inbounds journal data.
imagen

There are 2 activity list checked on isDone. Faults is used to know if there are some faults in the current workflowinstance execution. The Scheduled activity list is used on retry. When you retry a faulted workflow, it moves the activities from Faults to scheduled list.

Finally, I updated the designer with the new Faults parameter (was Fault) and now, all the faulted activities in the designer are tagged red.
imagen

@dnfadmin
Copy link

dnfadmin commented Jul 4, 2022

CLA assistant check
All CLA requirements met.

@smartwaystudio smartwaystudio changed the title Feature/fault array Fault array, fix retry, new parameter in Join activity Jul 4, 2022
@smartwaystudio
Copy link
Author

Willing to add more changes to this PR asap about the "Run workflow" activity because I just found that If the secondary workflow is faulted, the Run workflow activity is faulted, but if you try to retry the parent workflow, it generates a new secondary workflow instance instead of retrying the faulted one.

@mohdali mohdali requested a review from sfmskywalker July 6, 2022 22:35
@mohdali
Copy link
Member

mohdali commented Jul 6, 2022

This looks very interesting and we definitely need more options for fault handling.

@sfmskywalker can you please take a look and advise as it has breaking changes.

@smartwaystudio
Copy link
Author

Hi, as I told, new changes have been pushed. The RunWorkFlow activity was always creating a new ChildWorkflow instance when retrying. There could be some scenario that you don't want that to happen, so I added a new Advanced parameter in the activity to, instead of creating new instances, retry the faulted when retrying.
image

There is a general "problem" in all activities. The problem is that they are not unique in a branched workflow. Inside a loop, the same activity instance is used for all the iterations, so is hard to determine "who am I". That was happening also inside runworkflow. If not in a branched workflow, the retry was easy, but inside a loop, you can't determine what child workflow needs to be revived, because the activity just store the last iteration state inside, so ChildWorkflowInstanceID was pointing to the last. So, what I've done is to generate a hash for the values of the Input and store it. When retrying, using the input hash I can get the real ChildWorkflowInstance id and retry it.

image
image

Here is a test without and with that parameter active. I just have a 4 iterations loop that fault. If I retry without that option, I get another 4 instances of the ChildWorkflow:
image

If the option is active, it retry the faulted (I changed the Fault activity code to fast testing, returning done):
image

@sfmskywalker
Copy link
Member

Great findings and solutions!

There is a general "problem" in all activities. The problem is that they are not unique in a branched workflow. Inside a loop, the same activity instance is used for all the iterations, so is hard to determine "who am I"

Exactly. Your fix for this with the RunWorkflow activity is clever. The "real" fix might be for each activity execution to have its own instance ID - but that might require some significant refactorings, perhaps via #1059 - obviously out of scope for this PR

As an FYI: Elsa 3 already handles this today where each activity execution acquires its ow unique instance ID.

@@ -188,7 +188,7 @@ public void Fault(Exception? exception, string message, string? activityId, obje
var clock = ServiceProvider.GetRequiredService<IClock>();
WorkflowInstance.WorkflowStatus = WorkflowStatus.Faulted;
WorkflowInstance.FaultedAt = clock.GetCurrentInstant();
WorkflowInstance.Fault = new WorkflowFault(SimpleException.FromException(exception), message, activityId, activityInput, resuming);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of deleting this Fault property, could we keep it and mark it as Obsolete and change its implementation to get Faults.FirstOrDefault() as to try and limit the number of breaking changes for users with existing workflow instances?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, OK, I'll mark it as Obsolete but that wouldn't prevent "breaking" existing workflows. If we do that, we will lost the Fault original value anyway because it will never be accessible. Maybe I can think something about moving the Fault value to the new Faults property so everything will be aligned. I can try to find somewhere in the code where that movement could fit without breaking, and thus, removing it in the future, like a "patch". Maybe in the workflow instance constructor. Will wait for your response for this review

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I can think something about moving the Fault value to the new Faults property so everything will be aligned

Exactly, that would be sweet.

Thanks man, appreciate your efforts!

@@ -47,6 +47,10 @@ public enum JoinMode
)]
public JoinMode Mode { get; set; }


[ActivityInput(Hint = "True if incoming activities executed should not be Faulted to continue.", UIHint = ActivityInputUIHints.SingleLine, SupportedSyntaxes = new[] { SyntaxNames.JavaScript, SyntaxNames.Liquid })]
public bool WaitFaulted { get; set; } = default!;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if it wouldn't be better to always stop the workflow when it faulted (and thereby removing this property)? It seems to me that as soon as the workflow faults, no activity should continue. The fact that the Join activity still continues after a previous activity faulted is a bug IMHO.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that stopping when faulted should be the correct way to do the Join Activity. Changing it to be the default behaviour and removing the property in next commit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect.

Comment on lines 168 to 169


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed a few excessive blank lines here and there - let's tidy it up 😄

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad 😉 my brain was struggling hard with this Activity. Cleaning it next commit

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can relate 😄

@@ -242,7 +243,7 @@ export class ElsaWorkflowInstanceJournal {
<div class="elsa-relative elsa-flex elsa-space-x-3">
<div>
<span
class="elsa-h-8 elsa-w-8 elsa-rounded-full elsa-bg-green-500 elsa-flex elsa-items-center elsa-justify-center elsa-ring-8 elsa-ring-white elsa-mr-1"
class={`elsa-h-8 elsa-w-8 elsa-rounded-full elsa-bg-${eventColor}-500 elsa-flex elsa-items-center elsa-justify-center elsa-ring-8 elsa-ring-white elsa-mr-1`}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am guilty of this as well, but we shouldn't dynamically construct TailwindCSS classes like this, because they will be filtered out by the CSS cleanup phase in the build action.

Instead, it's better to assign the complete class value to the eventColor variable. Example (pseudo code):

const eventColor = faulted ? 'elsa-bg-red-500' : 'elsa-bg-green-500'

This way, the complete TailwindCSS class strings will be recognized.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, changing it.

@smartwaystudio
Copy link
Author

Great findings and solutions!

There is a general "problem" in all activities. The problem is that they are not unique in a branched workflow. Inside a loop, the same activity instance is used for all the iterations, so is hard to determine "who am I"

Exactly. Your fix for this with the RunWorkflow activity is clever. The "real" fix might be for each activity execution to have its own instance ID - but that might require some significant refactorings, perhaps via #1059 - obviously out of scope for this PR

As an FYI: Elsa 3 already handles this today where each activity execution acquires its ow unique instance ID.

When I started to design some testing workflows to learn about Elsa I was a bit surprised when the execution log in a loop was showing the same ActivityId for one looped activity and also just storing the last executed activity value. Moreover, that means that the parallelism is limited. I thought that, using hangfire, it would create a hangfire job for each branch, at least in the parallel foreach. That issue you are talking about would be very nice but indeed, requires a lot of refactoring. Can you point me how have you done this in Elsa 3? Maybe I can think about it! 😄

Copy link
Member

@sfmskywalker sfmskywalker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome job 👍🏻

@sfmskywalker sfmskywalker merged commit 4d95f7d into elsa-workflows:master Jul 11, 2022
@smartwaystudio
Copy link
Author

Awesome job 👍🏻

Thanks, I saw you merged! Will try to finish the Fault/Faults change in a new PR today.

@sfmskywalker
Copy link
Member

Can you point me how have you done this in Elsa 3? Maybe I can think about it! 😄

Sure! In Elsa 3, each activity execution context has its own ID, which could be interpreted as a unique activity instance ID.
This way, even when executing in a loop, every time an activity executes, it has its own activity execution context and ID. The activity execution context can store information that gets persisted when the workflow instance gets persisted.

@mohdali
Copy link
Member

mohdali commented Jul 11, 2022

This is really exciting and I want to test it soon. @smartwaystudio is it possible to share the json of the test cases you created?

@smartwaystudio
Copy link
Author

smartwaystudio commented Jul 11, 2022

This is really exciting and I want to test it soon. @smartwaystudio is it possible to share the json of the test cases you created?

What I've done to test was creating some workflow (Fork, foreach, pforeach, while, fork-join, runWorkflow, foreach-runworkflow) and temporary modifying the Fault activity. The first run, the workflow fails and then, I change the Fault activity to return Done() (also adding a outcome to connect an activity to it) to test the retry. I've done this way because there is no way to edit the current version of workflow with dashboard.

Do you want a export of those workflow with the modified fault?

@mohdali
Copy link
Member

mohdali commented Jul 11, 2022

This is really exciting and I want to test it soon. @smartwaystudio is it possible to share the json of the test cases you created?

What I've done to test was creating some workflow (Fork, foreach, pforeach, while, fork-join, runWorkflow, foreach-runworkflow) and temporary modifying the Fault activity. The first run, the workflow fails and then, I change the Fault activity to return Done() (also adding a outcome to connect an activity to it) to test the retry. I've done this way because there is no way to edit the current version of workflow with dashboard.

Do you want a export of those workflow with the modified fault?

Yes please share the export. I'm not sure how you made the modification but I think if you unpublish then republish the workflow it will modify the current version.

@smartwaystudio
Copy link
Author

This is really exciting and I want to test it soon. @smartwaystudio is it possible to share the json of the test cases you created?

What I've done to test was creating some workflow (Fork, foreach, pforeach, while, fork-join, runWorkflow, foreach-runworkflow) and temporary modifying the Fault activity. The first run, the workflow fails and then, I change the Fault activity to return Done() (also adding a outcome to connect an activity to it) to test the retry. I've done this way because there is no way to edit the current version of workflow with dashboard.
Do you want a export of those workflow with the modified fault?

Yes please share the export. I'm not sure how you made the modification but I think if you unpublish then republish the workflow it will modify the current version.

test_workflows.zip

Here you have, some json workflows and Fault.cs. What I do is to force force Fault activity to sometimes return Done(). First execution of the workflow I return fault or exception and then, I change the code to return Done() to verify that the workflow ends properly. Is like emulating a scenario when an activity suddenly fails and you just need to retry it (ex: because it was a network error)

@mohdali
Copy link
Member

mohdali commented Jul 15, 2022

@smartwaystudio thanks for the examples!

I have tested with the scenario of having dependent activities in the foreach and they executed once the workflow is retired and the error is cleared. This is really awesome!

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants