
iPython times out indefinitely... #298

Closed
caddymob opened this issue Feb 10, 2014 · 8 comments

@caddymob

Hi Brad,

Before I had jobs finished to update re #258, bcbio stopped working. We would start a bcbio run, it would hang for however long the timeout was set to, and then iPython would throw an error. Here is the output:

~jcorneveaux/GALAXY_BC_TEST/work_pnapLogin2/

/packages/bcbio/0.7.4/bin/bcbio_nextgen.py ../config/bc_pnap2.yaml -t ipython -s torque -q batch -n 16 --timeout 4  --retries 1 --tag pnap2

And this is the output as it hangs:
[2014-02-10 12:56] pnap-login2: Using input YAML configuration: /home/jcorneveaux/GALAXY_BC_TEST/config/bc_pnap2.yaml
[2014-02-10 12:56] pnap-login2: Checking sample YAML configuration: /home/jcorneveaux/GALAXY_BC_TEST/config/bc_pnap2.yaml
[2014-02-10 12:56] pnap-login2: Testing minimum versions of installed programs
[2014-02-10 12:56] pnap-login2: Resource requests: bwa, gatk; memory: 2.5; cores: 16, 1
[2014-02-10 12:56] pnap-login2: Configuring 1 jobs to run, using 16 cores each with 40.2g of memory reserved for each job
[ProfileCreate] Generating default config file: u'/home/jcorneveaux/GALAXY_BC_TEST/work_pnapLogin2/log/ipython/ipython_config.py'
[ProfileCreate] Generating default config file: u'/home/jcorneveaux/GALAXY_BC_TEST/work_pnapLogin2/log/ipython/ipython_notebook_config.py'
[ProfileCreate] Generating default config file: u'/home/jcorneveaux/GALAXY_BC_TEST/work_pnapLogin2/log/ipython/ipcontroller_config.py'
[ProfileCreate] Generating default config file: u'/home/jcorneveaux/GALAXY_BC_TEST/work_pnapLogin2/log/ipython/ipengine_config.py'
[ProfileCreate] Generating default config file: u'/home/jcorneveaux/GALAXY_BC_TEST/work_pnapLogin2/log/ipython/ipcluster_config.py'
[ProfileCreate] Generating default config file: u'/home/jcorneveaux/GALAXY_BC_TEST/work_pnapLogin2/log/ipython/iplogger_config.py'
2014-02-10 12:56:41.730 [IPClusterStart] Config changed:
2014-02-10 12:56:41.730 [IPClusterStart] {'BcbioTORQUEEngineSetLauncher': {'mem': '40.2', 'cores': 16, 'tag': 'pnap2', 'resources': ''}, 'IPClusterEngines': {'early_shutdown': 240}, 'Application': {'log_level': 10}, 'ProfileDir': {'location': u'/home/jcorneveaux/GALAXY_BC_TEST/work_pnapLogin2/log/ipython'}, 'BaseParallelApplication': {'log_to_file': True, 'cluster_id': u'1506ed99-9d69-41e4-ba67-96f3fe657ab5'}, 'TORQUELauncher': {'queue': 'batch'}, 'BcbioTORQUEControllerLauncher': {'mem': '40.2', 'tag': 'pnap2', 'resources': ''}, 'IPClusterStart': {'delay': 10, 'controller_launcher_class': u'cluster_helper.cluster.BcbioTORQUEControllerLauncher', 'daemonize': True, 'engine_launcher_class': u'cluster_helper.cluster.BcbioTORQUEEngineSetLauncher', 'n': 1}}
2014-02-10 12:56:41.732 [IPClusterStart] Using existing profile dir: u'/home/jcorneveaux/GALAXY_BC_TEST/work_pnapLogin2/log/ipython'
2014-02-10 12:56:41.733 [IPClusterStart] Searching path [u'/home/jcorneveaux/GALAXY_BC_TEST/work_pnapLogin2', u'/home/jcorneveaux/GALAXY_BC_TEST/work_pnapLogin2/log/ipython'] for config files
2014-02-10 12:56:41.733 [IPClusterStart] Attempting to load config file: ipython_config.py
2014-02-10 12:56:41.734 [IPClusterStart] Loaded config file: /home/jcorneveaux/GALAXY_BC_TEST/work_pnapLogin2/log/ipython/ipython_config.py
2014-02-10 12:56:41.735 [IPClusterStart] Attempting to load config file: ipcluster_1506ed99_9d69_41e4_ba67_96f3fe657ab5_config.py
2014-02-10 12:56:41.736 [IPClusterStart] Loaded config file: /home/jcorneveaux/GALAXY_BC_TEST/work_pnapLogin2/log/ipython/ipcontroller_config.py
2014-02-10 12:56:41.737 [IPClusterStart] Attempting to load config file: ipcluster_1506ed99_9d69_41e4_ba67_96f3fe657ab5_config.py
2014-02-10 12:56:41.738 [IPClusterStart] Loaded config file: /home/jcorneveaux/GALAXY_BC_TEST/work_pnapLogin2/log/ipython/ipengine_config.py
2014-02-10 12:56:41.739 [IPClusterStart] Attempting to load config file: ipcluster_1506ed99_9d69_41e4_ba67_96f3fe657ab5_config.py
2014-02-10 12:56:41.741 [IPClusterStart] Loaded config file: /home/jcorneveaux/GALAXY_BC_TEST/work_pnapLogin2/log/ipython/ipcluster_config.py
Traceback (most recent call last):
 File "/packages/bcbio/0.7.4/bin/bcbio_nextgen.py", line 59, in <module>
   main(**kwargs)
 File "/packages/bcbio/0.7.4/bin/bcbio_nextgen.py", line 39, in main
   run_main(**kwargs)
 File "/packages/bcbio/0.7.4/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 46, in run_main
   fc_dir, run_info_yaml)
 File "/packages/bcbio/0.7.4/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 88, in _run_toplevel
   for xs in pipeline.run(config, config_file, parallel, dirs, pipeline_items):
 File "/packages/bcbio/0.7.4/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 290, in run
   multiplier=alignprep.parallel_multiplier(samples)) as run_parallel:
 File "/packages/bcbio/0.7.4/anaconda/lib/python2.7/contextlib.py", line 17, in __enter__
   return self.gen.next()
 File "/packages/bcbio/0.7.4/anaconda/lib/python2.7/site-packages/bcbio/distributed/prun.py", line 45, in start
   with ipython.create(parallel, dirs, config) as view:
 File "/packages/bcbio/0.7.4/anaconda/lib/python2.7/contextlib.py", line 17, in __enter__
   return self.gen.next()
 File "/packages/bcbio/0.7.4/anaconda/lib/python2.7/site-packages/cluster_helper/cluster.py", line 760, in cluster_view
   raise IOError("Cluster startup timed out.")
IOError: Cluster startup timed out

iPython is not even getting as far as producing the PBS scripts. The error above is the only error we are seeing. bcbio WORKS if I check out a compute node (using -t torque just to be clear, not just running local on said node) and start the job there. We have checked that all nodes are talking across the network.

Any idea what might be causing this? Second, is there a way to make, I am guessing, iPython more verbose?

@chapmanb
Member

Jason;
Unfortunately, IPython can be hard to coax error messages from sometimes. We've tried to turn on and log as much information from it as we know how. There might be useful error messages in log/ipython/log/ipcluster*, although it is strange that it's not even writing batch scripts.

The other issue I've run into is old IPython ghost scripts running on the machine from previous runs. You could check ps ax | grep IPy to see if you have old controllers/engines running from previous work that are somehow confusing the process.
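For what it's worth, here is a minimal way to script that same check on a node (just the ps output filtered; a sketch, not part of bcbio):

# Hypothetical helper: list leftover IPython controllers/engines on this node.
# Equivalent to `ps ax | grep IPy`, just scripted so it can be run on each node.
import subprocess

def stale_ipython_processes():
    out = subprocess.check_output(["ps", "ax", "-o", "pid,command"]).decode()
    return [line for line in out.splitlines()
            if "ipcontroller" in line or "ipengine" in line]

for proc in stale_ipython_processes():
    print(proc)  # kill these by hand if they belong to finished runs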

If none of this helps: were you able to submit jobs/start IPython from this problem node previously? If not, it could be that this node is blocking ZeroMQ ports or something else IPython needs to start up, which the compute nodes allow.

Hope this helps some. Thanks for the patience debugging this.

@caddymob
Author

Yes, we have basically 2 submit nodes - one is the standard login1 node and one is a new one we built for galaxy. Both have worked great thus far.

I did check every single node for dangling chads with this - https://gist.github.com/caddymob/8927861

While there were a few old ones out there, I killed them and wiped my work dirs, and yet all submissions still hang -

Very strange indeed. I haven't tried submitting from every node (nor do I want to - I want/need to submit from galaxy), but I did, and still have, several jobs from this weekend running when I do my bcbio_nextgen.py Rad.yaml -t ipython -s torque -q batch -n 192 from a node I am running interactively (i.e. check out a node via galaxy or the login-1 node with qsub -I -V -l nodes=1:ppn=2 -l walltime=200:00:00 -j oe) - then run the exact same command, and zoom, away the jobs go.

Had me head-scratching all weekend; now my IT guys are looking into it and none of us have a clue what happened... For the record, qsub'ing non-bcbio jobs works no problemo. It's something with iPy that has lost its once glorious ability to herd trillions of tiny little DNA sequences.

I see you're up to 0.2.11 on ipython-cluster-helper at this point, I'm at 0.2.9... I don't see how this will magically fix things, but have you got any other ideas?

@chapmanb
Member

Jason;
Ugh, I'm not sure. I don't think 0.2.9 to 0.2.11 will make any difference here. I'd start a fresh directory and then look in the log/ipython/log/ipcluster*.log file to see if there are any clues. It should have some message early on if it's not even writing out the batch files for submission. I'll keep thinking on it but don't have any bright ideas right now about what could have gone funny. Sorry again about the issue and the lack of a clear error message.

@caddymob
Author

OK. I'm still so confused ATM, but I think I have an idea of where the issue may lie - I just submitted the same test job, all with just -n 16 (so a single node for me), from a) a compute node, b) a login node, and c) galaxy.

All failed to launch/get in the queue or create the 2 torque-PBS scripts. The work dir looks like this:

total 4.0K
-rw-r--r-- 1 jcorneveaux domainuser 2.9K 20140211.00.28 bcbio_2014-02-11at12.28AM.log
drwxr-xr-x 3 jcorneveaux domainuser   30 20140211.00.29 tmp
drwxr-xr-x 2 jcorneveaux domainuser   69 20140211.00.30 provenance
drwxr-xr-x 2 jcorneveaux domainuser   10 20140211.00.30 checkpoints_parallel
drwxr-xr-x 3 jcorneveaux domainuser  127 20140211.00.30 log

the iPython log is pretty uninformative -

cat log/ipython/log/ipcluster-64a50167-bb58-4d7c-b3f4-20a25b505678-27442.log
2014-02-11 00:52:44.979 [IPClusterStart] Starting ipcluster with [daemon=True]
2014-02-11 00:52:44.991 [IPClusterStart] Creating pid file: /home/jcorneveaux/GALAXY_BC_TEST/work_pnapLogin2/log/ipython/pid/ipcluster-64a50167-bb58-4d7c-b3f4-20a25b505678.pid
2014-02-11 00:52:44.992 [IPClusterStart] Starting Controller with cluster_helper.cluster.BcbioTORQUEControllerLauncher
2014-02-11 00:52:54.998 [IPClusterStart] Starting 1 Engines with cluster_helper.cluster.BcbioTORQUEEngineSetLauncher
2014-02-11 01:53:20.430 [IPClusterStart] SIGINT received, stopping launchers...
2014-02-11 01:53:20.430 [IPClusterStart] ERROR | IPython cluster: stopping
2014-02-11 01:53:23.433 [IPClusterStart] Removing pid file: /home/jcorneveaux/GALAXY_BC_TEST/work_pnapLogin2/log/ipython/pid/ipcluster-64a50167-bb58-4d7c-b3f4-20a25b505678.pid

(Note the log file is my own tee'd log.) But there is no torque job and nothing is queued in the cluster. Even if the queue were a long way out, I should be submitting the job, and assuming I've set --timeout to something much longer than the default 15 mins (I've been doing 600), the job should queue and hopefully/typically start computing within the 10 hours I've given it to go.

This all, however, is using the lab default CPU-hour debiting account (managed by Gold). I used the 'red-button' queue, reserved for a single specific priority project, which puts said jobs at the top of the queue; they get queued and presumably will run. I first used this red-button account with bcbio thanks to #258 last Thursday for a red-button job, and my regular jobs stopped working Friday. Gold accounting binaries aren't seen by the compute nodes, but they do pass on the -A youracctnotmine if you qsub from a compute node. Not sure if that means anything.

So, what did we learn? I'm not totally sure yet, but I did a full fresh test again with the same data using the default acct from login1, galaxy and a compute node. No jobs queued or running. But! I then tried the red-button queue, from galaxy -- and sure enough they wrote the torque PBS scripts and are queued to run.

I think it's on our end... Not sure why first using the red-button stopped me from running against default or other accounts specifically for bcbio (except if submitted from a compute node..??). Will be digging in more tomorrow as I'll need my gurus to help, and I have no intention of just red-buttoning every job or using a compute node to launch - I need to run from galaxy. To reiterate, before I hit the red-button with bcbio, my jobs would at least queue. Post red-button, only red-button jobs, or submitting (suboptimally) from a compute node, get the buzz saw firing up.

Final test: I manually did a few echo "sleep 100" | qsub -A some-acct submissions - my default and other typically used debit accts, and the red-button. They are in the queue (not yet running), but I looked at checkjob on them and noted these diffs:

red-button:
  Bypass: 24  StartCount: 0
  PE: 1.00    StartPriority: 100954
  SubmitTime: Tue Feb 11 00:13:05
  (Time Queued  Total: 1:35:27  Eligible: 1:35:27)

my-default:
  Bypass: 1   StartCount: 0
  PE: 1.00    StartPriority: 27
  SubmitTime: Tue Feb 11 01:45:17
  (Time Queued  Total: 00:02:44  Eligible: 00:02:44)

More testing to come - I'm bogarting the cluster right now with 1410 CPUs bcbio'ing all sorts of awesome things (launched on a compute node to get pushed forward). Hard to get in line and test completely right now, but the red-button submission at least getting queued is a good sign, or a sign of some sort. I'm hoping this is helpful - kinda just taking notes here as I poke around in the dark.

Strange one, ain't it? Bet it's going to be like the famous 'can't send email farther than 500 miles'.... we're close - appreciate your glucose metabolism on this.

chapmanb added a commit to roryk/ipython-cluster-helper that referenced this issue Feb 11, 2014
…e submission for help debugging bcbio/bcbio-nextgen#298. Make start commands forward compatible with IPython 2.0dev
chapmanb added a commit to chapmanb/ipython that referenced this issue Feb 11, 2014
@chapmanb
Member

Jason;
I'm not sure what to guess now so the next step is to try and improve logging so you can see the error message. If you install the latest development version of ipython-cluster-helper:

/packages/bcbio/0.7.4/anaconda/bin/pip install git+https://github.com/roryk/ipython-cluster-helper.git

It should now provide error tracebacks in log/ipython/log/ipcluster*.log. I hope this will provide some more insight into the underlying cause. This is a torque specific fix but hopefully will help us through this case. I also pushed a more general fix upstream to IPython but, if accepted, that'll only be available when the next IPython release lands:

ipython/ipython#5091
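
(Roughly, the idea in both changes is to stop submission failures from being swallowed silently - something along these lines, as a loose illustration rather than the actual patch:)

# Hypothetical illustration only: log the traceback when a batch submission
# fails instead of letting it vanish; not the actual commit or IPython code.
import logging
import subprocess
import traceback

def submit_batch_script(path):
    try:
        return subprocess.check_output(["qsub", path])
    except Exception:
        logging.error("qsub submission failed:\n%s", traceback.format_exc())
        raise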

Sorry about all the pain debugging this and thanks for all your patience as always. Fingers crossed this will give us some useful information to sort it out.

@caddymob
Author

Hi Brad,

OK - pretty sure I have this nailed down. Simply put, I MUST provide a -r account=someAcct, be it my default, another debit account, or the red-button. The queue opened up a bit today, and after a fresh look I noted that when I launch bcbio like this:

~> bcbio_nextgen.py ../config/gal_qsub_bc_back2gal.yaml -t ipython -s torque -q batch -n 16 --retries 2 --timeout 15 --tag gal_qsub_bc_back2gal_noRedButton

The issue, it seems, is the empty 'resources': '' string --

2014-02-11 11:32:52.637 [IPClusterStart] {'BcbioTORQUEEngineSetLauncher': {'mem': '40.2', 'cores': 16, 'tag': 'gal_qsub_bc_back2gal_noRedButton', 'resources': ''}, 'IPClusterEngines': {'early_shutdown': 240}, 'Application': {'log_level': 10}, 'ProfileDir': {'location': u'/home/jcorneveaux/GALAXY_BC_TEST/work_from_galaxy_feb11_notRedButton/log/ipython'}, 'BaseParallelApplication': {'log_to_file': True, 'cluster_id': u'dfd0c3a1-7a92-45d9-baa9-6e6ffa8a44c5'}, 'TORQUELauncher': {'queue': 'batch'}, 'BcbioTORQUEControllerLauncher': {'mem': '40.2', 'tag': 'gal_qsub_bc_back2gal_noRedButton', 'resources': ''}, 'IPClusterStart': {'delay': 10, 'controller_launcher_class': u'cluster_helper.cluster.BcbioTORQUEControllerLauncher', 'daemonize': True, 'engine_launcher_class': u'cluster_helper.cluster.BcbioTORQUEEngineSetLauncher', 'n': 1}}

As soon as I add -r account=validacct, the IPClusterStart has a non-empty string 'resources': 'account=validacct' in the log and the PBS scripts are written and submitted!

2014-02-11 11:41:22.270 [IPClusterStart] {'BcbioTORQUEEngineSetLauncher': {'mem': '37.8', 'cores': 15, 'tag': 'gal_qsub_bc_back2gal_204000', 'resources': 'account=youracctnotmine'}, 'IPClusterEngines': {'early_shutdown': 240}, 'Application': {'log_level': 10}, 'ProfileDir': {'location': u'/home/jcorneveaux/GALAXY_BC_TEST/work_from_galaxy_feb11_notRedButton/log/ipython'}, 'BaseParallelApplication': {'log_to_file': True, 'cluster_id': u'1a6ea856-dcc5-4348-8588-0801095a3c5e'}, 'TORQUELauncher': {'queue': 'batch'}, 'BcbioTORQUEControllerLauncher': {'mem': '37.8', 'tag': 'gal_qsub_bc_back2gal_204000', 'resources': 'account=youracctnotmine'}, 'IPClusterStart': {'delay': 10, 'controller_launcher_class': u'cluster_helper.cluster.BcbioTORQUEControllerLauncher', 'daemonize': True, 'engine_launcher_class': u'cluster_helper.cluster.BcbioTORQUEEngineSetLauncher', 'n': 1}}
2014-02-11 11:41:22.271 [IPClusterStart] Using existing profile dir: u'/home/jcorneveaux/GALAXY_BC_TEST/work_from_galaxy_feb11_notRedButton/log/ipython'

So this was part confusion and part missing that little detail - my MUST FIRE NOW jobs were for projects where I needed to debit specific accounts, so I was specifying one.

I can handle this on my end, but to further tweak my feature request: it seems that if a user doesn't give a -r account=string, we should just leave out the #PBS -A and let torque, Gold and whoever figure out the default like they always have.
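
For illustration, what I'm asking for amounts to guarding the accounting directive when the batch header gets generated - a hypothetical sketch of the idea, not the actual ipython-cluster-helper template:

# Hypothetical sketch of the requested behaviour: only emit #PBS -A when an
# account resource was actually supplied; otherwise leave it out so Torque/Gold
# fall back to the default debit account as they always have.
def pbs_header(queue, cores, account=None):
    lines = ["#PBS -q %s" % queue,
             "#PBS -l nodes=1:ppn=%d" % cores]
    if account:  # skip the -A directive entirely for an empty or missing account
        lines.append("#PBS -A %s" % account)
    return "\n".join(lines)

print(pbs_header("batch", 16))               # no -A line at all
print(pbs_header("batch", 16, "validacct"))  # explicit debit account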

I have a successful run log pre-dating the addition of torque accounting where the resources string is empty - 2014-01-13 12:48:48.024 [IPClusterStart] {'BcbioTORQUEEngineSetLauncher': {'mem': '2.2', 'cores': 1, 'resources': ''} - and the job ran (was on 8b06634 then)... The job didn't add a -A acct to the PBS script as this wasn't available yet, but an empty resources string was OK and my default debit account was used. Since updating last Friday to e91123c I've gone down this rabbit hole, and a mixture of timing, deadline panic, a new bug thanks to #258 / my feature request, and a bursty queue had us going in circles...

@chapmanb
Member

Jason;
Thanks for digging into this more. Looking at the ipython-cluster-helper 0.2.9 code this makes sense now. That was not correctly handling an empty resource string passed to Torque.

If you upgrade to the latest ipython-cluster-helper:

/packages/bcbio/0.7.4/anaconda/bin/pip install --upgrade ipython-cluster-helper

it should work as you expect now and be good without a resource string. Thanks again for all the patience tracking this down and hope things work cleanly for you now.

@caddymob
Author

Thanks Brad - glad we got it sorted. The perfect storm of confusing weirdness, but I will upgrade and rock out!

OT: re the cmd above, the aforementioned subtle differences & tweet - @roryk & #282 (comment) -- 'twas a green anaconda, not a python... Cladistics aside, ready to tame another beautiful snake ;)

mattvonrocketstein pushed a commit to mattvonrocketstein/ipython that referenced this issue Nov 3, 2014
minrk pushed a commit to ipython/ipyparallel that referenced this issue Apr 20, 2015