
lsf job finished successfully... workflow still running #239

Open
davidlmorton opened this issue Jan 12, 2016 · 5 comments

Comments

mkiwala (Contributor) commented Jan 12, 2016

Searching kibana-logstash for callback for Job method (execute:715340) returns only this log message:

Got "execute" callback for Job method (execute:715340) in workflow "Perl SDK Integration Test (lsf_single_operation) 6db920b4-2f27-43ca-a668-daf28cbebee5"

This is consistent with the lsf job succeeding, but the workflow is still running.

Unfortunately, this bug didn't manifest itself in staging, where we have logging of when webhooks are scheduled on the http_worker queue. All I can tell is that the status of the lsf job changed in the lsf service, but the succeeded webhook was never sent.

The running webhook was never sent or received either. This is OK, because the job status changed directly from submitted to succeeded over a span of 13 seconds. I would note, though, that it is odd that there was no status update during those 13 seconds, since the polling interval is set to 3 seconds. EDIT: Even though the polling interval for this job was 3 seconds, the poller was firing at 10-second intervals.

Perhaps an unknown entity is connecting to production rabbitmq and stealing messages again?

mkiwala (Contributor) commented Jan 12, 2016

@davidlmorton -- we talked a bit in person about increasing logging around sending these webhooks. I'd like to document our plan here.

We currently log when ptero-workflow receives webhooks sent by ptero-lsf. In the case of this bug, we are not seeing any webhooks received by ptero-workflow from ptero-lsf, even though we expected the succeeded webhook. We suspect a problem around the http_worker queue, but our current logging is not sufficient to determine where in the handling of messages the problem lies.

We could increase the logging in job.py, where the messages are put on the queue, and in http.py, where the messages are received from the queue. I think we should increase the logging level in both places.

What do you think about increasing the log level in these two places?
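
To make the plan concrete, here is a rough sketch of the kind of logging we have in mind (Python, since that is what the services are written in). The function names, message fields, and queue interface below are illustrative only; they are not the actual names in job.py or http.py.

```python
import logging

LOG = logging.getLogger(__name__)


# Sketch of the producer side (job.py): log just before the webhook
# message is put on the http_worker queue.
def enqueue_webhook(queue, job_id, status, url):
    LOG.info("Enqueuing '%s' webhook for job %s, to be sent to %s",
             status, job_id, url)
    queue.put({'job_id': job_id, 'status': status, 'url': url})


# Sketch of the consumer side (http.py): log as soon as the message is
# taken off the queue, before the HTTP request is attempted.
def handle_webhook_message(message):
    LOG.info("Received '%s' webhook message for job %s, posting to %s",
             message['status'], message['job_id'], message['url'])
    # ... send the HTTP request here ...
```

With a log line on both sides of the queue we could tell whether a missing webhook was never enqueued, was enqueued but never consumed, or was consumed but failed when the HTTP request was made.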

davidlmorton (Contributor, Author)

I agree with the plan. It gets tricky to increase the logging in http.py, since that is common library code. Perhaps we could introduce a separate logging level for http altogether and let each service set its main log level separately from its http log level? (See the sketch below.)
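
Something along these lines could work; this is only a sketch, and the environment variable names and the `ptero_common.http` logger name are assumptions rather than the actual configuration the services use today.

```python
import logging
import os

# Each service keeps setting its overall log level as before...
main_level = os.environ.get('PTERO_LOG_LEVEL', 'INFO')
# ...and can optionally override the level used by the shared http code.
http_level = os.environ.get('PTERO_HTTP_LOG_LEVEL', main_level)

logging.basicConfig(level=getattr(logging, main_level))
logging.getLogger('ptero_common.http').setLevel(getattr(logging, http_level))
```

That way a service could run its own code at WARNING while still emitting INFO- or DEBUG-level messages from http.py while we chase this bug.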

mkiwala (Contributor) commented Jan 19, 2016

I opened genome/ptero-common#48 to introduce a separate logging level for http.py altogether.

mkiwala (Contributor) commented Jan 19, 2016

And genome/ptero-lsf#102
