Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose error due to failure of local python pipeline node execution #1411

Merged
merged 14 commits into from Mar 18, 2021

Conversation

kiersten-stokes
Copy link
Member

@kiersten-stokes kiersten-stokes commented Mar 10, 2021

Resolves #1382 by adding an except block to trap CalledProcessError's that occur specifically as a result of python subprocess execution failure.

New look

new look

Developer's Certificate of Origin 1.1

   By making a contribution to this project, I certify that:

   (a) The contribution was created in whole or in part by me and I
       have the right to submit it under the Apache License 2.0; or

   (b) The contribution is based upon previous work that, to the best
       of my knowledge, is covered under an appropriate open source
       license and I have the right under that license to submit that
       work with modifications, whether created in whole or in part
       by me, under the same open source license (unless I am
       permitted to submit under a different license), as indicated
       in the file; or

   (c) The contribution was provided directly to me by some other
       person who certified (a), (b) or (c) and I have not modified
       it.

   (d) I understand and agree that this project and the contribution
       are public and that a record of the contribution (including all
       personal information I submit with it, including my sign-off) is
       maintained indefinitely and may be redistributed consistent with
       this project or the open source license(s) involved.

@elyra-bot
Copy link

elyra-bot bot commented Mar 10, 2021

Thanks for making a pull request to Elyra!

To try out this branch on binder, follow this link: Binder

@kiersten-stokes kiersten-stokes added the component:pipeline-runtime issues related to pipeline runtimes e.g. kubeflow pipelines label Mar 10, 2021
@kevin-bates
Copy link
Member

Hi @kiersten-stokes - thanks for working on this.

I've pulled your branch and run a pipeline with a known error. Prior to this change, the error message produced from a local run was:
Screen Shot 2021-03-11 at 2 55 06 PM

After the changes, it seems that most of the traceback now appears in the title. Also, note the double scrollbars in the details window - along with the '.' - which leads me to believe something additional was expected.
Screen Shot 2021-03-11 at 3 01 11 PM

While the actual error (File 'node1a.out' already exists!) appears a tad more prominently at the bottom of the details box, I'm not sure it's an improvement.

Have you stepped through with a debugger to see what pieces of CalledProcessError might be usable? If we can identify a succinct portion of the actual error message, we could log the full error and raise another error with the more succinct portion. At least that's my thoughts on this.

I would also recommend adding an additional assertion in the test that ensures this (TBD) succinct portion is in the raised exception.

@kevin-bates
Copy link
Member

@kiersten-stokes - please ignore my last comment! I just realized, when looking into debugging this, that your changes only apply to python script execution, not notebook node execution - which is what my response was about. I apologize.

Why the different output must be a function of the PR branch and I wonder if there are changes in rc3 that are not in your PR. At any rate, I will switch to looking at this from the perspective of a python node.

In general, when fixing issues that change what is visible to the user, I try to include the updated message or dialog for the reviewer. If you could update your opening description with a screenshot of the new message, that would be helpful - thank you.

@kevin-bates
Copy link
Member

OK - here's the error dialog - now that I know I'm running the right stuff!!!
image

This looks good. The header is detailed yet the details are rich with traceback.

It would be awesome if the header portion could read more like:
Error processing operation node1py: TypeError: '<' not supported between instances of 'str' and 'int' .

Have you explored gaining access to only the TypeError portion? Not sure this is possible w/o impacting the traceback info.

@kiersten-stokes
Copy link
Member Author

@kevin-bates No double scrollbars in my case.

It would be awesome if the header portion could read more like:
Error processing operation node1py: TypeError: '<' not supported between instances of 'str' and 'int' .

^^ I completely agree. I tried to get something like this, but unfortunately I don't think CalledProcessError returns anything that can be used to get nicer formatting. Let me triple check on that front, though!

In general, when fixing issues that change what is visible to the user, I try to include the updated message or dialog for the reviewer. If you could update your opening description with a screenshot of the new message, that would be helpful - thank you.

^^ Will do! Thank for another good tip!

@kevin-bates
Copy link
Member

kevin-bates commented Mar 12, 2021

I completely agree. I tried to get something like this, but unfortunately I don't think CalledProcessError returns anything that can be used to get nicer formatting. Let me triple check on that front, though!

You're right - CPE is very limited (kinda frustrating actually). I think trying to log something useful and trimming the error message string to essentially the file_name might be the best we can do.

I think we have some better options with notebook node issues by trapping PapermillExecutionError (and its evalue field) similar to what you've done for (the limited) CalledProcessError might be something we should add as well (in the notebook processing method).

In your experience with CPE.stderr is it always the case that the interesting error is at the end of the stream (TypeError: in this case)? (We might want to experiment with a few.) If so, perhaps a reverse search of 'Error:' - followed by further capture of the full error name, might prove a useful way to grab the interesting portion.

@kiersten-stokes kiersten-stokes added the status:Work in Progress Development in progress. A PR tagged with this label is not review ready unless stated otherwise. label Mar 12, 2021
@kiersten-stokes
Copy link
Member Author

kiersten-stokes commented Mar 12, 2021

Local notebook handling errors now look like the below, matching the general style of the log and popup box messages for python script execution errors. Note that the evalue attribute of the PapermillExecutionError is sometimes empty, as can be seen in the second screenshot when compared to the corresponding lines of code (line 211 in processor_local).

Screen Shot 2021-03-12 at 4 52 37 PM

Screen Shot 2021-03-12 at 4 53 06 PM

Copy link
Member

@kevin-bates kevin-bates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good - just had a few comments.

except papermill.PapermillExecutionError as pmee:
self.log.error(f'Internal error executing {file_name}: {str(pmee.ename)}' +
f'{str(pmee.evalue)} in [{str(pmee.exec_count)}]')
raise RuntimeError(f'{str(pmee.ename)} {str(pmee.evalue)} in [{str(pmee.exec_count)}]') from pmee
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the in [2] portion of this confusing and regular users won't know what this means. I suggest we drop this from the RuntimeError and even the log message.

The log will contain a traceback for this somewhere - I suspect when the web request throws - so that kind of information should be in the log.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do! Makes sense if the normal user wouldn't get any useful info from it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps if we use notebook cell X?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ptitzler - good idea, but I'd rather convey the cell index slightly differently...

f'{str(pmee.ename)} {str(pmee.evalue)} executing cell {str(pmee.cell_index}]'

which would yield...

Error processing operation load_data: FileNotFoundError [Errno 2] No such file or directory: 'data/file1.csv' executing cell 3

or

Error processing operation load_data: AssertionError executing cell 3

per examples above. Does that sound okay?

(Note: I'm assuming the attribute name is cell_index.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, amending my previous comment, cell_index is not the correct attribute, exec_count still gives the correct cell value.

except Exception as ex:
raise RuntimeError(f'Internal error executing {filepath}: {ex}') from ex
self.log.error(f'Internal error executing {file_name}: {str(ex)}')
raise RuntimeError('Internal error executing notebook') from ex
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should include the notebook name in the error.

Suggested change
raise RuntimeError('Internal error executing notebook') from ex
raise RuntimeError(f'Internal error executing {file_name}') from ex

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry, I must have misunderstood this piece in our earlier conversation. I removed the file name in the popup message because the exception message one up the chain already includes the node name, which already is the file name (sans extension).

Error processing operation load_data: Internal error executing load_data.ipynb: ...

Do you think we should keep both in?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure it will always be the case that the node name will always be derived from the file name and I feel including the actual file name is an extra level of detail that could be useful. We might also find other "internal execution errors" that are based on other things - not necessarily related to the node/file name.

That said, everyone has opinions and mine aren't necessarily correct. 😄 I'm going to as for @ptitzler to review as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. Unlike in earlier releases the node name and the file name can be entirely different.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, that's helpful! I was wondering if there were cases where node name != file name. I'll add the file name to each of these errors, and remove the word "internal" since it's unnecessary.

if error_trim_index != -1:
raise RuntimeError(error_msg[error_trim_index:]) from cpe
else:
raise RuntimeError('Internal error executing Python script') from cpe
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should include the actual script name.

Suggested change
raise RuntimeError('Internal error executing Python script') from cpe
raise RuntimeError(f'Internal error executing {file_name}') from cpe

except Exception as ex:
raise RuntimeError(f'Internal error executing {filepath}: {ex}') from ex
self.log.error(f'Internal error executing {file_name}: {str(ex)}')
raise RuntimeError('Internal error executing Python script') from ex
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto for script name.

@ptitzler
Copy link
Member

General comment: there's probably no need to classify these as internal errors in the message text.

Copy link
Member

@kevin-bates kevin-bates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes look good. Thanks for tackling a similar issue for the notebook node as well!

@kiersten-stokes
Copy link
Member Author

Thanks, @kevin-bates! I'll just clean up those tests and hopefully we'll be all set!

@kiersten-stokes kiersten-stokes removed the status:Work in Progress Development in progress. A PR tagged with this label is not review ready unless stated otherwise. label Mar 16, 2021
@akchinSTC akchinSTC added this to the 2.2.0 milestone Mar 16, 2021
@kevin-bates
Copy link
Member

This looks awesome now - thank you @kiersten-stokes!
image

@kevin-bates
Copy link
Member

Thanks @kiersten-stokes

Here's a python script error:
image

And a notebook error:
image

@lresende lresende merged this pull request into elyra-ai:master Mar 18, 2021
lresende pushed a commit that referenced this pull request Mar 29, 2021
…1485)

It introduces a new base class named ScriptOperationProcessor
from which the existingPythonScriptOperationProcessor and
RScriptOperationProcessor classes now derive. This base class
contains 90% of the applicable code with the subclasses providing
their name and argument vectors.

It introduces a log_and_raise() method on the base 
FIleOperationProcessor class that is available to all file-based
operations. Building on the work done in #1411, this method
checks the length of the error message and truncates it
to around the max (80), replacing overflow with ellipses (...).

Adds a test that removes the kernel metadata from a notebook
node and ensures the appropriate error is raised.
@kiersten-stokes kiersten-stokes deleted the local-failure-error-handling branch August 20, 2021 18:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component:pipeline-runtime issues related to pipeline runtimes e.g. kubeflow pipelines
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pipeline execution hides actual issue when local failed execution
5 participants