Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] More visible warnings of transfer and checksum errors #27

Merged
merged 16 commits into from Jan 18, 2021

Conversation

weaverba137
Copy link
Member

@weaverba137 weaverba137 commented Jan 7, 2021

This PR:

There are still some open questions from the issues listed above which I consolidate here:

@weaverba137 weaverba137 added the WIP Work in Progress label Jan 7, 2021
@weaverba137 weaverba137 self-assigned this Jan 7, 2021
@sbailey
Copy link
Contributor

sbailey commented Jan 8, 2021

Thanks. I see that various "error" messages have been upgraded to "critical", but it is unclear to me what happens to them beyond that, or whether they are just changing name but still hidden within logfiles waiting for humans to grep them each morning. Or are critical errors promoted to emails alerts in a way that errors are not?

How rigorously to we re-verify checksums, for example after an early-morning catch-up transfer

I suggest checking after the early-morning catch-up transfer, and again after the noontime-ish final sync. In an ideal future world we won't have checksum errors, but in our current world that double check might allow for some problems to be caught and address prior to the final sync verifying that they are ok and backing everything up to tape.

What are the exact criteria for allowing raw data to be transferred from the staging area to $DESI_SPECTRO_REDUX

The current mood of the pipeline team has swung all the way to "post everything you can get that had a transfer link created at KPNO". i.e. if ICS posts a link at KPNO and you can transfer the directory, go ahead and immediately post it regardless of checksum errors. We certainly know of cases where we wish the data was posted immediately despite checksum errors; we don't yet know any cases where we were glad that it wasn't posted. If we ever encounter that latter case (perhaps in a future world where the provided checksums are a reliable marker of file corruption), we may adjust the policy again.

@weaverba137
Copy link
Member Author

Changing the messages to 'critical' triggers an email sent to desi-alarms-transfer.

I will add checksum verification after the catch-up, etc.

I will make the adjustment to the staging to $DESI_SPECTRO_DATA posting.

In compensation, would it be reasonable at this time to remove the code involved in allowing the data transfer to trigger the pipeline? Since the pipeline team wants to make that decision entirely on its own, there is no point in maintaining separate logic to make that decision.

@sbailey
Copy link
Contributor

sbailey commented Jan 8, 2021

Sounds good. Good idea to remove the unused pipeline triggering logic. Go for it.

@weaverba137
Copy link
Member Author

Another question I just thought of: since we will be receiving emails warning of checksum errors, and since the pipeline wants every exposure, regardless of error, what role should the transfer status display have? I confess, I almost never look at it.

@sbailey
Copy link
Contributor

sbailey commented Jan 8, 2021

I had forgotten about the transfers status page, but I think it is useful as-is, in particular for easy flagging which exposures have checksum errors to point to Klaus. i.e. go ahead and post the exposures, but still list them in the transfer status page like you are now.

@weaverba137 weaverba137 removed the WIP Work in Progress label Jan 12, 2021
@weaverba137
Copy link
Member Author

@sbailey, I want to do operational tests on this prior to merging, but the code should be ready for review.

Copy link
Contributor

@sbailey sbailey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good, but +1 for operations testing this in addition to unit tests. I added a minor comment about noting in docstrings when a function modifies its inputs.

checksum_file : :class:`str`
The checksum file.
status : :class:`desitransfer.status.TransferStatus`
The associated status object.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When a function modifies one of its inputs, it's nice to comment on that in the docstring. In this case, it could clarify that this function modified status rather than returning a checksum succes/error/status.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what you mean by "modified" in this case. The method uses the status object, but it doesn't really modify it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you just mean that the result of the check is reported via log messages and the status object, rather than via return value?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps I misunderstood. I thought the status object was tracking state that was updated via status.update(night, exposure, ...) when this function was called. If the only effect of calling this function is printing log messages, with no changes to the status object itself and no return value, then this docstring is fine.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, then we definitely don't agree on the definition of "modify". The status object does not have state, in the object-oriented sense, that can be modified. It is a fancy way to write the state to a file. That's why I say the status object is used but not modified.

@weaverba137 weaverba137 merged commit e75432b into master Jan 18, 2021
@weaverba137 weaverba137 deleted the checksum-warnings branch January 18, 2021 23:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Alert experts for checksum failures realtime staging of exposures with extra files
2 participants