New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] More visible warnings of transfer and checksum errors #27
Conversation
Thanks. I see that various "error" messages have been upgraded to "critical", but it is unclear to me what happens to them beyond that, or whether they are just changing name but still hidden within logfiles waiting for humans to grep them each morning. Or are critical errors promoted to emails alerts in a way that errors are not?
I suggest checking after the early-morning catch-up transfer, and again after the noontime-ish final sync. In an ideal future world we won't have checksum errors, but in our current world that double check might allow for some problems to be caught and address prior to the final sync verifying that they are ok and backing everything up to tape.
The current mood of the pipeline team has swung all the way to "post everything you can get that had a transfer link created at KPNO". i.e. if ICS posts a link at KPNO and you can transfer the directory, go ahead and immediately post it regardless of checksum errors. We certainly know of cases where we wish the data was posted immediately despite checksum errors; we don't yet know any cases where we were glad that it wasn't posted. If we ever encounter that latter case (perhaps in a future world where the provided checksums are a reliable marker of file corruption), we may adjust the policy again. |
Changing the messages to 'critical' triggers an email sent to I will add checksum verification after the catch-up, etc. I will make the adjustment to the staging to In compensation, would it be reasonable at this time to remove the code involved in allowing the data transfer to trigger the pipeline? Since the pipeline team wants to make that decision entirely on its own, there is no point in maintaining separate logic to make that decision. |
Sounds good. Good idea to remove the unused pipeline triggering logic. Go for it. |
Another question I just thought of: since we will be receiving emails warning of checksum errors, and since the pipeline wants every exposure, regardless of error, what role should the transfer status display have? I confess, I almost never look at it. |
I had forgotten about the transfers status page, but I think it is useful as-is, in particular for easy flagging which exposures have checksum errors to point to Klaus. i.e. go ahead and post the exposures, but still list them in the transfer status page like you are now. |
@sbailey, I want to do operational tests on this prior to merging, but the code should be ready for review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks good, but +1 for operations testing this in addition to unit tests. I added a minor comment about noting in docstrings when a function modifies its inputs.
checksum_file : :class:`str` | ||
The checksum file. | ||
status : :class:`desitransfer.status.TransferStatus` | ||
The associated status object. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When a function modifies one of its inputs, it's nice to comment on that in the docstring. In this case, it could clarify that this function modified status
rather than returning a checksum succes/error/status.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what you mean by "modified" in this case. The method uses the status object, but it doesn't really modify it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess you just mean that the result of the check is reported via log messages and the status object, rather than via return value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps I misunderstood. I thought the status
object was tracking state that was updated via status.update(night, exposure, ...)
when this function was called. If the only effect of calling this function is printing log messages, with no changes to the status
object itself and no return value, then this docstring is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, then we definitely don't agree on the definition of "modify". The status object does not have state, in the object-oriented sense, that can be modified. It is a fancy way to write the state to a file. That's why I say the status object is used but not modified.
This PR:
There are still some open questions from the issues listed above which I consolidate here:
$DESI_SPECTRO_REDUX
(see realtime staging of exposures with extra files #25)?