
Cache dataflow errors #7193

Merged
merged 6 commits into develop from wip/hubert/7140-cache-dataflow-errors on Jul 9, 2023

Conversation

@hubertp (Contributor) commented on Jul 3, 2023

Pull Request Description

The logic in instrumentation specifically prohibited caching of panics (added in #1611).
That restriction turned out to be too aggressive, however, because it also prevented dataflow errors from being cached.
This indirectly fixes the problem with dataflow errors in the context of a (temporary) change in execution context, which should keep the old value around.

Closes #7140.

After the fix:
fix_7140.webm
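For context, the change boils down to treating dataflow errors as cacheable results while still refusing to cache panics. Below is a minimal sketch of the resulting cache-write guard, assuming DataflowError is an AbstractTruffleException (which is what the diff later in this PR relies on); the class and method names are illustrative, not the actual instrumentation code:

```java
import com.oracle.truffle.api.exception.AbstractTruffleException;
import org.enso.interpreter.runtime.error.DataflowError;

// Illustrative sketch only: the real logic lives in the instrumentation's
// onExpressionReturn handler, not in a standalone class like this one.
final class CachePolicySketch {

  /** Returns true when a computed result may be written to the expression cache. */
  static boolean isCacheable(Object result) {
    // A result is a panic if it is a Truffle exception that is not a dataflow error.
    boolean isPanic =
        result instanceof AbstractTruffleException && !(result instanceof DataflowError);
    // Panics stay uncached (see #1611); plain values and dataflow errors are cached.
    return !isPanic;
  }
}
```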

Checklist

Please ensure that the following checklist has been satisfied before submitting the PR:

- [ ] The documentation has been updated, if necessary.
- [ ] Screenshots/screencasts have been attached, if there are any visual changes. For interactive or animated visual changes, a screencast is preferred.
- [ ] All code follows the Scala, Java, and Rust style guides. In case you are using a language not listed above, follow the Rust style guide.
- [ ] All code has been tested:
  - [ ] Unit tests have been written where possible.
  - [ ] If GUI codebase was changed, the GUI was tested when built using ./run ide build.

@radeusgd (Member) commented on Jul 3, 2023

Any reason we are not caching panics then?

It appears that if we continue not caching them, the same issue as described in #7140 will persist whenever a function panics.

@radeusgd (Member) commented on Jul 3, 2023

I see now 02f801b (#1611).

So it seems we have a conflict:

  • caching Panics will cause issues with rerunning problematic nodes, especially when adding an import to resolve the problem;
  • not caching them will cause problems with the workflow being recomputed if a dry run panics.

@hubertp (Contributor, Author) commented on Jul 4, 2023

> I see now 02f801b (#1611).
>
> So it seems we have a conflict:
>
> • caching Panics will cause issues with rerunning problematic nodes, especially when adding an import to resolve the problem;
> • not caching them will cause problems with the workflow being recomputed if a dry run panics.

Yes, that's the reason. I knew that Panics can still cause problems, but this is a much trickier and less common issue than regular dataflow errors. I would suggest that we file a separate ticket instead, just for tracking panics.

@hubertp force-pushed the wip/hubert/7140-cache-dataflow-errors branch from 903f21d to 4a4a538 on July 4, 2023 09:32
@radeusgd (Member) commented on Jul 4, 2023

> I would suggest that we file a separate ticket instead, just for tracking panics.

I started a discussion then so we can decide how we want to handle this: https://github.com/orgs/enso-org/discussions/7203


@JaroslavTulach (Member) left a comment

Sane & tested.

```diff
@@ -250,7 +251,7 @@ public void onReturnExceptional(VirtualFrame frame, Throwable exception) {
   }
 
   private void onExpressionReturn(Object result, Node node, EventContext context) throws ThreadDeath {
-    boolean isPanic = result instanceof AbstractTruffleException;
+    boolean isPanic = result instanceof AbstractTruffleException && !(result instanceof DataflowError);
```
Review comment (Member):

Is there any system in these checks at all (another example at https://github.com/enso-org/enso/pull/7205/files#diff-bfe95bc1ca4b016dccbab4fc97f23809a6c83d040de6ca8a6a615f02378e1a59R357)?

I am afraid we are just patching random parts of the Enso system without having a consistent classification of the objects and their roles. Everything seems to hold together just with some unit tests. At least we have them, but still...

Reply (Contributor, Author):

I don't see a connection between this and the mentioned PR other than that both deal with DataflowError. Yes, it's patching (not so random) parts of the Enso system - and what is wrong with that?
Alternatively, we could avoid matching on AbstractTruffleException at all and instead wrap it into something we can control. Is that a better solution? That's debatable.
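As a rough illustration of that wrapping idea - all names below are hypothetical and not part of this PR - the classification could be done once, in a single place:

```java
import com.oracle.truffle.api.exception.AbstractTruffleException;
import org.enso.interpreter.runtime.error.DataflowError;

// Hypothetical sketch: classify a raw interpreter result once, so callers such as
// the instrumentation's caching logic never repeat instanceof checks themselves.
final class ResultClassifier {

  enum Kind { VALUE, DATAFLOW_ERROR, PANIC }

  static Kind classify(Object result) {
    if (result instanceof DataflowError) {
      return Kind.DATAFLOW_ERROR;
    }
    if (result instanceof AbstractTruffleException) {
      return Kind.PANIC;
    }
    return Kind.VALUE;
  }
}
```

The caching decision (and the dry-run handling from #7140) could then be phrased in terms of Kind rather than scattered type checks.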

Reply (Member):

> what is wrong with that?

It is perfectly pragmatic behavior. I am a huge fan of pragmatism. However, sometimes I am asking myself - shouldn't we be more [rationalistic](http://wiki.apidesign.org/wiki/Rationalism) when designing a language?

Possibly more consistency than patching would make the language better...

The logic in instrumentation specifically prohibited caching of panics.
That restriction turned out to be too aggressive, however, because it also
prevented dataflow errors from being cached.
This indirectly fixes the problem with dataflow errors in the context of a
(temporary) change in execution context, which should keep around the old
value.
@hubertp force-pushed the wip/hubert/7140-cache-dataflow-errors branch from 4a4a538 to a79f751 on July 5, 2023 09:13
@hubertp (Contributor, Author) commented on Jul 5, 2023

@radeusgd could I get a 👍 while the discussion considers the much less frequent case of panics?

@radeusgd (Member) left a comment

Sure.

The fix looks OK, but I agree with the notion brought up by @JaroslavTulach - the complexity of the system keeps growing, there are a lot of special cases, and it does not seem like we have a systematic way of handling them. That feels a bit worrying for extending the system in the future, but that is a topic for another PR.

@JaroslavTulach shall we start a discussion about this? You are probably better suited to express the issue clearly (I have more of a vague intuition here).

@hubertp (Contributor, Author) commented on Jul 6, 2023

I will see if I can extract some of the special logic for panics and other errors so that it is defined in a single place, if possible - but as a follow-up PR.

@hubertp added the CI: Ready to merge label (this PR is eligible for automatic merge) on Jul 6, 2023
This reverts commit a79f751.
@mergify bot merged commit 9dcf48a into develop on Jul 9, 2023
27 checks passed
@mergify bot deleted the wip/hubert/7140-cache-dataflow-errors branch on July 9, 2023 14:45
Labels: CI: Ready to merge
Successfully merging this pull request may close these issues:

If a node fails with dataflow error in batch run, it immediately falls back to a dry run afterwards (#7140)
5 participants