[WIP] More visible warnings of transfer and checksum errors #27

weaverba137 · 2021-01-07T17:14:19Z

This PR:

There are still some open questions from the issues listed above which I consolidate here:

How rigorously to we re-verify checksums, for example after an early-morning catch-up transfer (see Alert experts for checksum failures #26)?
What are the exact criteria for allowing raw data to be transferred from the staging area to $DESI_SPECTRO_REDUX (see realtime staging of exposures with extra files #25)?

sbailey · 2021-01-08T05:10:54Z

Thanks. I see that various "error" messages have been upgraded to "critical", but it is unclear to me what happens to them beyond that, or whether they are just changing name but still hidden within logfiles waiting for humans to grep them each morning. Or are critical errors promoted to emails alerts in a way that errors are not?

How rigorously to we re-verify checksums, for example after an early-morning catch-up transfer

I suggest checking after the early-morning catch-up transfer, and again after the noontime-ish final sync. In an ideal future world we won't have checksum errors, but in our current world that double check might allow for some problems to be caught and address prior to the final sync verifying that they are ok and backing everything up to tape.

What are the exact criteria for allowing raw data to be transferred from the staging area to $DESI_SPECTRO_REDUX

The current mood of the pipeline team has swung all the way to "post everything you can get that had a transfer link created at KPNO". i.e. if ICS posts a link at KPNO and you can transfer the directory, go ahead and immediately post it regardless of checksum errors. We certainly know of cases where we wish the data was posted immediately despite checksum errors; we don't yet know any cases where we were glad that it wasn't posted. If we ever encounter that latter case (perhaps in a future world where the provided checksums are a reliable marker of file corruption), we may adjust the policy again.

weaverba137 · 2021-01-08T15:55:02Z

Changing the messages to 'critical' triggers an email sent to desi-alarms-transfer.

I will add checksum verification after the catch-up, etc.

I will make the adjustment to the staging to $DESI_SPECTRO_DATA posting.

In compensation, would it be reasonable at this time to remove the code involved in allowing the data transfer to trigger the pipeline? Since the pipeline team wants to make that decision entirely on its own, there is no point in maintaining separate logic to make that decision.

sbailey · 2021-01-08T17:19:42Z

Sounds good. Good idea to remove the unused pipeline triggering logic. Go for it.

weaverba137 · 2021-01-08T18:01:23Z

Another question I just thought of: since we will be receiving emails warning of checksum errors, and since the pipeline wants every exposure, regardless of error, what role should the transfer status display have? I confess, I almost never look at it.

sbailey · 2021-01-08T19:27:00Z

I had forgotten about the transfers status page, but I think it is useful as-is, in particular for easy flagging which exposures have checksum errors to point to Klaus. i.e. go ahead and post the exposures, but still list them in the transfer status page like you are now.

…osures

weaverba137 · 2021-01-12T19:08:18Z

@sbailey, I want to do operational tests on this prior to merging, but the code should be ready for review.

sbailey

Code looks good, but +1 for operations testing this in addition to unit tests. I added a minor comment about noting in docstrings when a function modifies its inputs.

sbailey · 2021-01-12T19:18:32Z

py/desitransfer/daemon.py

+        checksum_file : :class:`str`
+            The checksum file.
+        status : :class:`desitransfer.status.TransferStatus`
+            The associated status object.


When a function modifies one of its inputs, it's nice to comment on that in the docstring. In this case, it could clarify that this function modified status rather than returning a checksum succes/error/status.

I'm not sure what you mean by "modified" in this case. The method uses the status object, but it doesn't really modify it.

I guess you just mean that the result of the check is reported via log messages and the status object, rather than via return value?

Perhaps I misunderstood. I thought the status object was tracking state that was updated via status.update(night, exposure, ...) when this function was called. If the only effect of calling this function is printing log messages, with no changes to the status object itself and no return value, then this docstring is fine.

OK, then we definitely don't agree on the definition of "modify". The status object does not have state, in the object-oriented sense, that can be modified. It is a fancy way to write the state to a file. That's why I say the status object is used but not modified.

weaverba137 added 2 commits January 4, 2021 11:31

modify warnings for files not listed in checksum files

6d5ed8f

email some error messages

2b9cbd4

weaverba137 added the WIP Work in Progress label Jan 7, 2021

weaverba137 requested a review from sbailey January 7, 2021 17:14

weaverba137 self-assigned this Jan 7, 2021

add checksum audit

c881915

allow for files in subdirectories

0beec3d

weaverba137 added 9 commits January 8, 2021 13:59

remove pipeline activation code; unconditionally move transferred exp…

d03368d

…osures

remove pipeline from stages

5fcf646

working on checksums for catch-up transfers

90d3a47

hide comment column.

529c4cd

refactor tests of catchup stage.

5e55751

allow any order

f97b4f3

refactor checksum validation

91ebec7

add coveralls token

a49601f

remove debug

7f32a94

weaverba137 removed the WIP Work in Progress label Jan 12, 2021

sbailey approved these changes Jan 12, 2021

View reviewed changes

weaverba137 added 3 commits January 12, 2021 14:54

clarify doc string

4bbf806

pass status in directory method

305db5f

update change log

20e85d9

weaverba137 merged commit e75432b into master Jan 18, 2021

weaverba137 deleted the checksum-warnings branch January 18, 2021 23:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] More visible warnings of transfer and checksum errors #27

[WIP] More visible warnings of transfer and checksum errors #27

weaverba137 commented Jan 7, 2021 •

edited

sbailey commented Jan 8, 2021

weaverba137 commented Jan 8, 2021

sbailey commented Jan 8, 2021

weaverba137 commented Jan 8, 2021

sbailey commented Jan 8, 2021

weaverba137 commented Jan 12, 2021

sbailey left a comment

sbailey Jan 12, 2021

weaverba137 Jan 12, 2021

weaverba137 Jan 12, 2021

sbailey Jan 12, 2021

weaverba137 Jan 12, 2021

[WIP] More visible warnings of transfer and checksum errors #27

[WIP] More visible warnings of transfer and checksum errors #27

Conversation

weaverba137 commented Jan 7, 2021 • edited

sbailey commented Jan 8, 2021

weaverba137 commented Jan 8, 2021

sbailey commented Jan 8, 2021

weaverba137 commented Jan 8, 2021

sbailey commented Jan 8, 2021

weaverba137 commented Jan 12, 2021

sbailey left a comment

Choose a reason for hiding this comment

sbailey Jan 12, 2021

Choose a reason for hiding this comment

weaverba137 Jan 12, 2021

Choose a reason for hiding this comment

weaverba137 Jan 12, 2021

Choose a reason for hiding this comment

sbailey Jan 12, 2021

Choose a reason for hiding this comment

weaverba137 Jan 12, 2021

Choose a reason for hiding this comment

weaverba137 commented Jan 7, 2021 •

edited