
lsf job finished successfully... workflow still running #239

Open
davidlmorton opened this issue Jan 12, 2016 · 5 comments

Comments

mkiwala (Contributor) commented Jan 12, 2016

Searching kibana-logstash for callback for Job method (execute:715340) returns only this log message:

Got "execute" callback for Job method (execute:715340) in workflow "Perl SDK Integration Test (lsf_single_operation) 6db920b4-2f27-43ca-a668-daf28cbebee5"

This is consistent with the lsf job succeeding, but the workflow is still running.

Unfortunately, this bug didn't manifest itself in staging, where we have logging of when webhooks are scheduled on the http_worker queue. All I can tell is that the status of the lsf job changed in the lsf service, but the succeeded webhook was never sent.

The running webhook was never sent or received either. This is OK, because the job status changed directly from submitted to succeeded over a span of 13 seconds. I would note, though, that it is odd that there was no status update during those 13 seconds, since the polling interval is set to 3 seconds. EDIT: Even though the polling interval for this job was 3 seconds, the poller was firing at 10-second intervals.

Perhaps an unknown entity is connecting to production rabbitmq and stealing messages again?

mkiwala (Contributor) commented Jan 12, 2016

@davidlmorton -- we talked a bit in person about increasing logging around sending these webhooks. I'd like to document our plan here.

We currently log when ptero-workflow receives webhooks sent by ptero-lsf. In the case of this bug, we are not seeing any webhooks received by ptero-workflow from ptero-lsf, even though we expected the succeeded webhook. We suspect a problem around the http_worker queue, but our current logging is not sufficient to determine where in the handling of messages the problem lies.

We could increase the logging in job.py, where the messages are put on the queue, and in http.py, where the messages are received from the queue. I think we should increase the logging level in both places.

What do you think about increasing the log level in these two places?
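
To make the plan concrete, here is a rough sketch of the kind of logging we have in mind (Python, since that is what the services are written in). The function names, message fields, and queue interface below are illustrative only; they are not the actual names in job.py or http.py.

```python
import logging

LOG = logging.getLogger(__name__)


# Sketch of the producer side (job.py): log just before the webhook
# message is put on the http_worker queue.
def enqueue_webhook(queue, job_id, status, url):
    LOG.info("Enqueuing '%s' webhook for job %s, to be sent to %s",
             status, job_id, url)
    queue.put({'job_id': job_id, 'status': status, 'url': url})


# Sketch of the consumer side (http.py): log as soon as the message is
# taken off the queue, before the HTTP request is attempted.
def handle_webhook_message(message):
    LOG.info("Received '%s' webhook message for job %s, posting to %s",
             message['status'], message['job_id'], message['url'])
    # ... send the HTTP request here ...
```

With a log line on both sides of the queue we could tell whether a missing webhook was never enqueued, was enqueued but never consumed, or was consumed but failed when the HTTP request was made.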

davidlmorton (Contributor, Author)

I agree with the plan. It gets tricky to increase the logging in http.py, since that is common library code. Perhaps we could introduce a separate logging level for http altogether and let each service set its main log level separately from its http log level? (See the sketch below.)
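
Something along these lines could work; this is only a sketch, and the environment variable names and the `ptero_common.http` logger name are assumptions rather than the actual configuration the services use today.

```python
import logging
import os

# Each service keeps setting its overall log level as before...
main_level = os.environ.get('PTERO_LOG_LEVEL', 'INFO')
# ...and can optionally override the level used by the shared http code.
http_level = os.environ.get('PTERO_HTTP_LOG_LEVEL', main_level)

logging.basicConfig(level=getattr(logging, main_level))
logging.getLogger('ptero_common.http').setLevel(getattr(logging, http_level))
```

That way a service could run its own code at WARNING while still emitting INFO- or DEBUG-level messages from http.py while we chase this bug.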

mkiwala (Contributor) commented Jan 19, 2016

I opened genome/ptero-common#48 to introduce a separate logging level for http.py altogether.

mkiwala (Contributor) commented Jan 19, 2016

And genome/ptero-lsf#102
