torque is hanging indefinitely #416

Closed
caddymob opened this issue Apr 29, 2014 · 67 comments
@caddymob

Hey there -

In trying the updates for #386 we have killed our development install with 756be0a - any job we try to run, be it human, rat, mouse, or the broken dogs, hangs indefinitely with torque. The nodes get checked out and the engine and clients look to be running via qstat or showq - however nothing is happening on the nodes when I look at top or ps aux. There are plenty of free nodes, so this doesn't seem to be a queue issue. The jobs all hang until they hit the timeout and that's all I get. I don't see anything in the logs/ipython logs - engines appear to have started successfully... I've rubbed my eyes and wiped my work dirs a few times to no avail. I checked, and indeed running -t local works. Any suggestions or additional info I can provide?

Thanks!

@chapmanb
Member

Jason;
Sorry about the issues. That's unexpected; I don't think we've done anything on Torque recently -- at least on purpose. If you manually install an older version of ipython-cluster-helper, does that resolve it:

/usr/local/share/bcbio-nextgen/anaconda/bin/pip install ipython-cluster-helper==0.2.17

Hopefully that will at least get you running. If so, we can try to trace back to see what might have broken it.

One other thing to try is to look for old ipython controller processes from previous runs. I haven't seen that recently but know it solved the issue for a couple of people. Sorry again about the issue.
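
For reference, a quick way to look for leftover controller or engine processes on a node is a sketch like the one below (illustrative only; it assumes a Unix-style ps and that the stale processes show up as ipcontroller/ipengine):

# stale_ipython_check.py -- illustrative sketch only
import subprocess

# list all processes and flag anything that looks like a leftover IPython controller or engine
output = subprocess.check_output(["ps", "aux"])
for line in output.splitlines():
    if "ipcontroller" in line or "ipengine" in line:
        print(line)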

@caddymob
Author

Thanks Brad - The version in our 0.7.9 development build is ./bcbio/anaconda/lib/python2.7/site-packages/ipython_cluster_helper-0.2.19-py2.7.egg-info

In our stable old build (that still works) it is ./anaconda/lib/python2.7/site-packages/ipython_cluster_helper-0.2.15-py2.7.egg-info

Working on rolling back versions now but wanted to report. There are no dangling ipys out on the nodes - been there!

@chapmanb
Member

Jason;
Thanks much for helping with debugging. To help isolate the issue, it would be helpful to know whether this is related to an ipython-cluster-helper version or to bcbio-nextgen itself. If you can fix it by downgrading just ipython-cluster-helper, that would help us focus there. Otherwise I'll look at the bcbio-nextgen changes. I already have some improvements to make there to reduce memory usage, but if there are other issues I'm happy to shake them out. We're working on finalizing before the next release so this is super helpful. Sorry again about the problems.

@mjafin
Contributor

mjafin commented Apr 30, 2014

Hi Guys,
I'm getting the same with SGE. qstat reports the slots being allocated correctly, but bcbio hangs at:

2014-04-30 20:02:30.698 [IPClusterStart] Config changed:
2014-04-30 20:02:30.698 [IPClusterStart] {'ProfileDir': {'location': u'/ngs/PHB/analysis/BGI_BRCA_test_3samples_Apr2014/ensemble-variant/work/log/ipython'}, 'SGELauncher': {'queue': 'highmem.q'}, 'BcbioSGEEngineSetLauncher': {'mem': '16.2', 'cores': 8, 'pename': u'smp', 'tag': '', 'resources': ''}, 'IPClusterEngines': {'early_shutdown': 240}, 'Application': {'log_level': 10}, 'BcbioSGEControllerLauncher': {'mem': '16.2', 'tag': '', 'resources': ''}, 'BaseParallelApplication': {'log_to_file': True, 'cluster_id': u'4f9b3d0e-e8b5-4efc-a3ab-0b375278d533'}, 'IPClusterStart': {'delay': 10, 'n': 3, 'daemonize': True, 'engine_launcher_class': u'cluster_helper.cluster.BcbioSGEEngineSetLauncher', 'controller_launcher_class': u'cluster_helper.cluster.BcbioSGEControllerLauncher'}}
2014-04-30 20:02:30.704 [IPClusterStart] Using existing profile dir: u'/ngs/PHB/analysis/BGI_BRCA_test_3samples_Apr2014/ensemble-variant/work/log/ipython'
2014-04-30 20:02:30.704 [IPClusterStart] Searching path [u'/ngs/PHB/analysis/BGI_BRCA_test_3samples_Apr2014/ensemble-variant/work', u'/ngs/PHB/analysis/BGI_BRCA_test_3samples_Apr2014/ensemble-variant/work/log/ipython'] for config files
2014-04-30 20:02:30.704 [IPClusterStart] Attempting to load config file: ipython_config.py
2014-04-30 20:02:30.704 [IPClusterStart] Loaded config file: /ngs/PHB/analysis/BGI_BRCA_test_3samples_Apr2014/ensemble-variant/work/log/ipython/ipython_config.py
2014-04-30 20:02:30.705 [IPClusterStart] Attempting to load config file: ipcluster_4f9b3d0e_e8b5_4efc_a3ab_0b375278d533_config.py
2014-04-30 20:02:30.705 [IPClusterStart] Loaded config file: /ngs/PHB/analysis/BGI_BRCA_test_3samples_Apr2014/ensemble-variant/work/log/ipython/ipcontroller_config.py
2014-04-30 20:02:30.705 [IPClusterStart] Attempting to load config file: ipcluster_4f9b3d0e_e8b5_4efc_a3ab_0b375278d533_config.py
2014-04-30 20:02:30.706 [IPClusterStart] Loaded config file: /ngs/PHB/analysis/BGI_BRCA_test_3samples_Apr2014/ensemble-variant/work/log/ipython/ipengine_config.py
2014-04-30 20:02:30.706 [IPClusterStart] Attempting to load config file: ipcluster_4f9b3d0e_e8b5_4efc_a3ab_0b375278d533_config.py
2014-04-30 20:02:30.707 [IPClusterStart] Loaded config file: /ngs/PHB/analysis/BGI_BRCA_test_3samples_Apr2014/ensemble-variant/work/log/ipython/ipcluster_config.py

@roryk
Collaborator

roryk commented Apr 30, 2014

Hi guys,

Hmm, we're not having trouble running on SLURM or LSF, and we haven't touched the SGE and Torque bits of ipython-cluster-helper, so I'm not sure where to start debugging. If you install the newest version of ipython-cluster-helper from https://github.com/roryk/ipython-cluster-helper and run the example.py script in ipython-cluster-helper, does it work? That would help narrow down whether it is something with ipython-cluster-helper or something in bcbio-nextgen. Thanks a lot for the troubleshooting, sorry I can't be of more help.
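
For a quick standalone check outside of bcbio, a minimal script along the lines of example.py might look like the sketch below (a rough sketch only; the scheduler and queue values are placeholders, and the cluster_view call is assumed from ipython-cluster-helper's documented usage and the traceback quoted later in this thread):

# minimal_cluster_test.py -- a rough sketch, not the bundled example.py
from cluster_helper.cluster import cluster_view

def where_am_i(arg):
    # runs on an engine and reports which node picked up the work
    import socket
    return (arg, socket.gethostname())

if __name__ == "__main__":
    # placeholders: substitute your scheduler ("sge", "torque", ...) and queue name
    with cluster_view(scheduler="sge", queue="highmem.q", num_jobs=3) as view:
        print(list(view.map(where_am_i, range(3))))

If this hangs the same way bcbio does, the problem is in the cluster/IPython layer rather than in the pipeline itself.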

@caddymob
Author

Hey guys - super sorry for this one - turns out the DNS entry changed somewhere somehow on our galaxy machine that is doing the bcbio submissions. #facepalm

@roryk
Collaborator

roryk commented Apr 30, 2014

Hi Jason. Thanks for following up-- why can't all bugs solve themselves? I reopened this so we can tackle Miika's issue.

roryk reopened this Apr 30, 2014
@chapmanb
Member

Jason -- thanks for this. Glad to have one problem solved.

Miika -- is your problem hanging or timeouts? If hanging, could you ssh into the controller and engine machines and check memory usage? I've realized that with larger runs memory usage is getting to be a problem and I'm working on tactics to avoid that now. That's the only potential problem I know of right now, so it's my best guess.

@mjafin
Contributor

mjafin commented Apr 30, 2014

Brad, it's a problem with hanging. There's plenty of memory available on the controller/engine node (256GB) and things work fine when I use 0.7.8 stable.

Rory, how did you want me to run the example.py script?

@mjafin
Contributor

mjafin commented Apr 30, 2014

Hmm, now that I submitted it to another queue, off it goes.. false alarm I guess, and beats me what the reason might be!

@roryk
Collaborator

roryk commented Apr 30, 2014

Two for two, this must be a dream. highmem sounds like a good queue for bcbio-nextgen to work on. Has it worked before on the highmem queue? If you run the example.py script like this:

python example/example.py --scheduler SGE --num_jobs 3 --queue highmem.q

Does it work? Is the smp parallel environment that is being selected correct?

@mjafin
Contributor

mjafin commented Apr 30, 2014

So, 0.7.8 worked with highmem.q on this same data. smp is the only parallel environment we use, so it's correct.

With the other queue, ngs.q, the example.py script produces more output. With highmem.q it doesn't. I'll try to figure out why.

@roryk
Collaborator

roryk commented Apr 30, 2014

Hi Miika,

ipython-cluster-helper not working correctly is a good clue. It seems like sometimes this behavior happens because the engines and controller can't talk to each other; we've seen it when the node the controller is running on is walled off from the engines. I'm interested in knowing what you find out.

@caddymob
Author

Me again...

I take it back - it wasn't a DNS thing; I was reading the IPs wrong while chasing this down. So, Rory, we pulled down your example.py script and gave it a spin - same deal, though with a slight deviation for our queue as follows:

python example.py --scheduler torque --num_jobs 3 --queue batch

This does the same thing: it works fine in our stable version, but in our development clone we hang until the timeout - even though the nodes are indeed checked out.

This is where it sits:

[ProfileCreate] Generating default config file: u'/home/jcorneveaux/.ipython/profile_bb7ea384-d0b0-11e3-854e-d4ae52679fa3/ipython_config.py'
[ProfileCreate] Generating default config file: u'/home/jcorneveaux/.ipython/profile_bb7ea384-d0b0-11e3-854e-d4ae52679fa3/ipython_notebook_config.py'
[ProfileCreate] Generating default config file: u'/home/jcorneveaux/.ipython/profile_bb7ea384-d0b0-11e3-854e-d4ae52679fa3/ipython_nbconvert_config.py'
[ProfileCreate] Generating default config file: u'/home/jcorneveaux/.ipython/profile_bb7ea384-d0b0-11e3-854e-d4ae52679fa3/ipcontroller_config.py'
[ProfileCreate] Generating default config file: u'/home/jcorneveaux/.ipython/profile_bb7ea384-d0b0-11e3-854e-d4ae52679fa3/ipengine_config.py'
[ProfileCreate] Generating default config file: u'/home/jcorneveaux/.ipython/profile_bb7ea384-d0b0-11e3-854e-d4ae52679fa3/ipcluster_config.py'
[ProfileCreate] Generating default config file: u'/home/jcorneveaux/.ipython/profile_bb7ea384-d0b0-11e3-854e-d4ae52679fa3/iplogger_config.py'
2014-04-30 14:45:24.522 [IPClusterStart] Config changed:
2014-04-30 14:45:24.522 [IPClusterStart] {'BaseIPythonApplication': {'profile': u'bb7ea384-d0b0-11e3-854e-d4ae52679fa3'}, 'BcbioTORQUEEngineSetLauncher': {'mem': '1', 'cores': 1, 'tag': '', 'resources': ''}, 'IPClusterEngines': {'early_shutdown': 240}, 'Application': {'log_level': 10}, 'BaseParallelApplication': {'log_to_file': True, 'cluster_id': u'f36d5a2c-9394-4af5-85d4-225ef9951f09'}, 'TORQUELauncher': {'queue': 'batch'}, 'BcbioTORQUEControllerLauncher': {'mem': '1', 'tag': '', 'resources': ''}, 'IPClusterStart': {'delay': 10, 'n': 3, 'daemonize': True, 'engine_launcher_class': u'cluster_helper.cluster.BcbioTORQUEEngineSetLauncher', 'controller_launcher_class': u'cluster_helper.cluster.BcbioTORQUEControllerLauncher'}}
2014-04-30 14:45:24.523 [IPClusterStart] IPYTHONDIR set to: /home/jcorneveaux/.ipython
2014-04-30 14:45:24.528 [IPClusterStart] Using existing profile dir: u'/home/jcorneveaux/.ipython/profile_bb7ea384-d0b0-11e3-854e-d4ae52679fa3'
2014-04-30 14:45:24.529 [IPClusterStart] Searching path [u'/home/jcorneveaux/example', u'/home/jcorneveaux/.ipython/profile_bb7ea384-d0b0-11e3-854e-d4ae52679fa3'] for config files
2014-04-30 14:45:24.529 [IPClusterStart] Attempting to load config file: ipython_config.py
2014-04-30 14:45:24.529 [IPClusterStart] Loaded config file: /home/jcorneveaux/.ipython/profile_bb7ea384-d0b0-11e3-854e-d4ae52679fa3/ipython_config.py
2014-04-30 14:45:24.530 [IPClusterStart] Attempting to load config file: ipcluster_f36d5a2c_9394_4af5_85d4_225ef9951f09_config.py
2014-04-30 14:45:24.530 [IPClusterStart] Loaded config file: /home/jcorneveaux/.ipython/profile_bb7ea384-d0b0-11e3-854e-d4ae52679fa3/ipcontroller_config.py
2014-04-30 14:45:24.530 [IPClusterStart] Attempting to load config file: ipcluster_f36d5a2c_9394_4af5_85d4_225ef9951f09_config.py
2014-04-30 14:45:24.531 [IPClusterStart] Loaded config file: /home/jcorneveaux/.ipython/profile_bb7ea384-d0b0-11e3-854e-d4ae52679fa3/ipengine_config.py
2014-04-30 14:45:24.531 [IPClusterStart] Attempting to load config file: ipcluster_f36d5a2c_9394_4af5_85d4_225ef9951f09_config.py
2014-04-30 14:45:24.532 [IPClusterStart] Loaded config file: /home/jcorneveaux/.ipython/profile_bb7ea384-d0b0-11e3-854e-d4ae52679fa3/ipcluster_config.py

This is specific to our 'galaxy' head node - if we try from the standard submission node or from a compute node running interactively, both branches work as expected. On the stable version, all submission hosts work fine with ipython-cluster-helper, but on the galaxy node, where we must submit bcbio jobs, the development branch hangs... These are all CentOS 6 machines. Is there additional debugging or logs to look at, or is there a place in the source where we can add some debugging?

@roryk
Collaborator

roryk commented May 1, 2014

Hi Jason,

Great -- thanks for looking into this. @chapmanb will hopefully have some more ideas, but taking a crack at it: are you by any chance running the bcbio_nextgen.py command from the head node, or are you also submitting that as a job? With a couple of cluster configurations it has come up that the compute nodes are allowed to open arbitrary ports on each other, but not on the head node, so the controller and engines can never connect if the job runs on the head node but are fine if the job is submitted to run on a compute node. It sounds to me like there is something specific about the galaxy node setup that is preventing the engines and controller from registering with each other.

In the .ipython/profile_uuid/ directory there might be some log files specific to IPython that have more information; in particular, there are JSON files with the connection details, which might help narrow down which hosts can't talk to each other or whether a port range is being blocked off.

Hope something in there is on the right path!
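
As a concrete way to read those files: the connection details are plain JSON, so a quick check like the sketch below (paths assumed; adjust to your profile or work directory) prints the address and registration port the client and engines will try to reach:

import glob
import json

# assumed location; for bcbio runs the files typically sit under log/ipython/security/
for path in glob.glob("log/ipython/security/ipcontroller-*.json"):
    with open(path) as handle:
        info = json.load(handle)
    # 'location' is the address the controller advertises; most other fields are ZMQ ports
    print("%s -> location=%s registration=%s" % (path, info.get("location"), info.get("registration")))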

@roryk
Collaborator

roryk commented May 7, 2014

Hi Jason,

Did this ever end up getting resolved? If so, what was the ultimate solution? If not, how can we help? :)

@caddymob
Author

caddymob commented May 7, 2014

Sorry I left this one hanging - no, we didn't get it solved; I've been punting and running on compute nodes. We actually just went down the rabbit hole of building a copy on the galaxy node and need to update SSL, reboot, and so on... Not sure where or what is different, but we decided to try building afresh and hope that clears up the blockage.

@mjafin
Contributor

mjafin commented May 8, 2014

Rory, no luck here! It gets the slots:

4368430 0.50500 bcbio-c    klrl262      r     05/08/2014 09:31:45 highmem.q@ukapdlnx119              1 1
4368431 0.60500 bcbio-e    klrl262      r     05/08/2014 09:32:00 highmem.q@ukapdlnx119              8 1
4368431 0.60500 bcbio-e    klrl262      r     05/08/2014 09:32:00 highmem.q@ukapdlnx119              8 2
4368431 0.60500 bcbio-e    klrl262      r     05/08/2014 09:32:00 highmem.q@ukapdlnx119              8 3
4368431 0.60500 bcbio-e    klrl262      r     05/08/2014 09:32:00 highmem.q@ukapdlnx119              8 4

but then is left hanging indefinitely here:

2014-05-08 09:31:35.085 [IPClusterStart] Config changed:
2014-05-08 09:31:35.086 [IPClusterStart] {'ProfileDir': {'location': u'/ngs/RDI/Analysis/NA12878_validation/work/log/ipython'}, 'SGELauncher': {'queue': 'highmem.q'}, 'BcbioSGEEngineSetLauncher': {'mem': '16.2', 'cores': 8, 'pename': u'smp', 'tag': '', 'resources': ''}, 'IPClusterEngines': {'early_shutdown': 240}, 'Application': {'log_level': 10}, 'BcbioSGEControllerLauncher': {'mem': '16.2', 'tag': '', 'resources': ''}, 'BaseParallelApplication': {'log_to_file': True, 'cluster_id': u'fced528b-5bd6-4cbc-87a2-f0701356db87'}, 'IPClusterStart': {'delay': 10, 'n': 4, 'daemonize': True, 'engine_launcher_class': u'cluster_helper.cluster.BcbioSGEEngineSetLauncher', 'controller_launcher_class': u'cluster_helper.cluster.BcbioSGEControllerLauncher'}}
2014-05-08 09:31:35.090 [IPClusterStart] Using existing profile dir: u'/ngs/RDI/Analysis/NA12878_validation/work/log/ipython'
2014-05-08 09:31:35.091 [IPClusterStart] Searching path [u'/ngs/RDI/Analysis/NA12878_validation/work', u'/ngs/RDI/Analysis/NA12878_validation/work/log/ipython'] for config files
2014-05-08 09:31:35.091 [IPClusterStart] Attempting to load config file: ipython_config.py
2014-05-08 09:31:35.091 [IPClusterStart] Loaded config file: /ngs/RDI/Analysis/NA12878_validation/work/log/ipython/ipython_config.py
2014-05-08 09:31:35.091 [IPClusterStart] Attempting to load config file: ipcluster_fced528b_5bd6_4cbc_87a2_f0701356db87_config.py
2014-05-08 09:31:35.092 [IPClusterStart] Loaded config file: /ngs/RDI/Analysis/NA12878_validation/work/log/ipython/ipcontroller_config.py
2014-05-08 09:31:35.092 [IPClusterStart] Attempting to load config file: ipcluster_fced528b_5bd6_4cbc_87a2_f0701356db87_config.py
2014-05-08 09:31:35.092 [IPClusterStart] Loaded config file: /ngs/RDI/Analysis/NA12878_validation/work/log/ipython/ipengine_config.py
2014-05-08 09:31:35.093 [IPClusterStart] Attempting to load config file: ipcluster_fced528b_5bd6_4cbc_87a2_f0701356db87_config.py
2014-05-08 09:31:35.093 [IPClusterStart] Loaded config file: /ngs/RDI/Analysis/NA12878_validation/work/log/ipython/ipcluster_config.py

@roryk
Collaborator

roryk commented May 8, 2014

Hi Jason,

Thanks for the update and sorry for the problems. Let us know if rebuilding the node doesn't fix it up.

Miika, hmm -- I see in the process list that the actual bcbio job isn't there; is that job running on the head node? There is a client, the controller and a bunch of engines. The client sends jobs to the controller, which distributes them to the engines, so the client needs to be able to connect to the ZMQ sockets the controller sets up. Sometimes clusters are set up such that the compute nodes can talk to each other but the head node can't directly connect to ports on the compute nodes. If the bcbio_nextgen job is submitted to highmem.q as well, does that work? If not, does running that example.py script from ipython-cluster-helper work? Does it work on any queue, or just that one?
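
To test that raw connectivity independently of IPython, a small pyzmq REQ/REP ping like the sketch below can help (host and port are placeholders; run the "serve" half on the node where the controller lands and the "ping" half from wherever the bcbio client runs):

# zmq_ping.py -- rough connectivity check only, not part of bcbio
import sys
import zmq

def serve(port):
    # run on the controller node: bind a REP socket and wait for one ping
    sock = zmq.Context().socket(zmq.REP)
    sock.bind("tcp://*:%s" % port)
    print(sock.recv())
    sock.send(b"pong")

def ping(host, port):
    # run on the client/submit node: send a ping and wait up to 10s for the reply
    sock = zmq.Context().socket(zmq.REQ)
    sock.connect("tcp://%s:%s" % (host, port))
    sock.send(b"ping")
    poller = zmq.Poller()
    poller.register(sock, zmq.POLLIN)
    print(sock.recv() if poller.poll(10000) else "TIMEOUT: host or port likely unreachable")

if __name__ == "__main__":
    if sys.argv[1] == "serve":
        serve(sys.argv[2])
    else:
        ping(sys.argv[2], sys.argv[3])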

@mjafin
Contributor

mjafin commented May 8, 2014

Rory, both ngs.q and highmem.q work when running 0.7.8. With 0.7.9a ngs.q works but highmem.q doesn't. Tried both actual jobs and example.py.

Was anything changed in the cluster-helper recently (post 0.7.8 release)?

I'm usually not submitting the actual bcbio job itself to the queue, just running it in a screen locally..

@mjafin
Contributor

mjafin commented May 8, 2014

OK, so after some further investigation, if I'm running example.py from one of the ngs.q hosts, only ngs.q works (0.7.9a). However, on the highmem.q host, both ngs.q and highmem.q work when running example.py.

So did something about the communication (ZMQ sockets) change in 0.7.9a?

@mjafin
Contributor

mjafin commented May 8, 2014

Sorry about the spam, but when I CTRL-C example.py after it hangs, I can see this:

  File "example.py", line 48, in <module>
    profile=args.profile, extra_params=args.resources) as view:
  File "/apps/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/apps/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/lib/python2.7/site-packages/cluster_helper/cluster.py", line 803, in cluster_view
    while not _is_up(url_file, num_jobs):
  File "/apps/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/lib/python2.7/site-packages/cluster_helper/cluster.py", line 745, in _is_up
    client = Client(url_file, timeout=60)
  File "/apps/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/lib/python2.7/site-packages/IPython/parallel/client/client.py", line 496, in __init__
    self._connect(sshserver, ssh_kwargs, timeout)
  File "/apps/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/lib/python2.7/site-packages/IPython/parallel/client/client.py", line 613, in _connect
    evts = poller.poll(timeout*1000)
  File "/apps/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/lib/python2.7/site-packages/zmq/sugar/poll.py", line 110, in poll
    return zmq_poll(self.sockets, timeout=timeout)
  File "_poll.pyx", line 115, in zmq.backend.cython._poll.zmq_poll (zmq/backend/cython/_poll.c:1678)
  File "checkrc.pxd", line 11, in zmq.backend.cython.checkrc._check_rc (zmq/backend/cython/_poll.c:1879)
KeyboardInterrupt

@roryk
Collaborator

roryk commented May 8, 2014

Hi Miika,

Is that trace from submitting to highmem.q from the ngs.q host? It definitely seems like when you are submitting to the highmem queue from the ngs.q host, the client (the main bcbio job) can't connect to the controller.

If you submit the example.py job as a job, does it work on both queues? That way there would be the example.py job, a controller job, and a set of engine jobs running on the scheduler, not just the controller and engines.

Thanks for helping debug what is going on, don't worry about spamming-- more info is awesome.

@chapmanb
Member

chapmanb commented May 8, 2014

Miika;
We didn't purposely change anything in communication along these lines, but I agree with Rory that it looks like some kind of connectivity issue. I wonder if it's an IPython difference. What versions are you running on the working and non-working installs:

/apps/bcbio-nextgen/latest-devel/rhel6-x64/anaconda/bin/conda list ipython

If they are different, does switching the broken install to the working IPython version fix it (or vice versa, although you might not want to break the one that works)? Just trying to isolate the root cause a bit. Hope this helps.

@mjafin
Contributor

mjafin commented May 8, 2014

Rory, qsubbing a wrapper script that calls example.py to highmem.q did indeed run just fine, so it must be something about communication between the ngs.q nodes and the highmem.q host. So the answer to "If you submit the example.py job as a job, does it work on both queues?" is yes.

Brad, here's the list for 0.7.9a:

# packages in environment at /apps/bcbio-nextgen/latest-devel/rhel6-x64/anaconda:
#
ipython                   2.0.0                    py27_0
ipython-cluster-helper    0.2.19                    <pip>

and here's the list for 0.7.8:

# packages in environment at /apps/bcbio-nextgen/0.7.8/rhel6-x64/anaconda:
#
ipython                   1.2.1                    py27_0
ipython-cluster-helper    0.2.17                    <pip>

So 1.2.1 vs. 2.0.0

Edit. Downgrading ipython-cluster-helper to 0.2.17 within 0.7.9a didn't help, so the culprit must be ipython 2.0.0

@roryk
Collaborator

roryk commented May 8, 2014

Hi Miika,

Score! This seems like it might be related to #138. There is a solution to one type of network configuration problem here: https://bcbio-nextgen.readthedocs.org/en/latest/search.html?q=network+troubleshooting&check_keywords=yes&area=default#troubleshooting

If you find out what needs to be changed in your machine's configuration to let the ngs and highmem nodes talk to each other, could you follow up here so we can add it to the docs?

@mjafin
Contributor

mjafin commented May 8, 2014

Thanks Rory, I'll see if I can figure out something. On both the ngs.q and highmem.q machines the /etc/hosts file is identical and lists all the hosts and their corresponding IP addresses. I also checked they're correct by running python -c 'import socket; print socket.gethostbyname(socket.gethostname())' on all the hosts and as mentioned, things worked with the old ipython version.

Even if I can't figure out the underlying reason, it's not greatly blocking me from getting the jobs run with the workarounds - obviously it would be nice to nail the real reason though.

@roryk
Collaborator

roryk commented May 8, 2014

It also might be that accessing the zeromq ports is blocked on the highmem.q machines when coming from the ngs.q machines. Maybe the zeromq port range is blocked? iptables --list on the highmem.q machines might have some clues.

@jpeden1

jpeden1 commented May 15, 2014

I see a .json file in my work directory under log/ipython. Not sure how to decode it. In our case eth5 is the adapter that talks to the cluster. Below are the contents of the json file:

{"next_id": 2, "engines": {"0": "8585ba26-e167-459a-8303-f19de845d7b4", "1": "a0d9243b-8980-44dd-b104-c52d6f79d956"}}

Jim

@chapmanb
Member

Jim;
Sorry it's still not working. As Miika mentioned, it could still be an adapter issue we didn't anticipate. The files with the IP addresses you want to look at are log/ipython/security/ipcontroller-*client.json and log/ipython/security/ipcontroller-*engine.json. The IP address in 'location' should match the IP of your adapter, eth5. If it doesn't, could you send along what it is and the output of ifconfig so we can see what adapter it does use? Thanks for the help debugging.

@jpeden1

jpeden1 commented May 15, 2014

The security directory is empty. I had let bcbio run for 5 - 10 minutes before I hit control-C.

@chapmanb
Member

Jim;
Sorry -- if that file is not produced, we'll probably need to understand your network configuration a bit more to offer additional advice. It seems like some configuration is making the new IPython version unhappy. What does this command report?

./anaconda/bin/python -c 'import netifaces; print [(iface, [x["addr"] for x in netifaces.ifaddresses(iface).get(netifaces.AF_INET, [])]) for iface in netifaces.interfaces()]'

Ideally we'd like to see it pointing to your eth5 adapter. Thanks.

@jpeden1

jpeden1 commented May 15, 2014

Here is what I get:

[('lo', ['127.0.0.1']), ('eth0', ['10.48.66.33']), ('eth1', []), ('eth2', []), ('eth3', []), ('eth4', []), ('eth5', ['172.17.1.33'])]

@chapmanb
Member

Jim;
Thanks, that looks right assuming that 172.17.1.33 is the machine address you'd like to have exported. If the log/ipython/security/ipcontroller* files aren't being produced, that is the next point to debug. Sorry I don't have anything to suggest right now, but I can read the IPython code to try to guess what might be causing it to fail to write these. If you have any thoughts about how this machine is configured differently from others that work, that would be super helpful. We're basically just guessing here, so anything to go on might help identify the underlying cause. Thanks again.

@mjafin
Contributor

mjafin commented May 15, 2014

@chapmanb I wonder if the ipcontroller files are removed as part of the temporary profile removal, i.e. when hitting CTRL-C? @jpeden1 maybe if you start a process that hangs, don't CTRL-C it, but rather check from your terminal the location of the ipython profile and see if the ipcontroller files are there?

Out of curiosity, is eth0 visible from the other machines? Maybe ipython is trying to use eth0 rather than eth5.

@jpeden1

jpeden1 commented May 15, 2014

@mjafin I reran bcbio and the json files are there while it is running. They must get deleted when I hit CTRL-C.

Here are the contents of the files:

[security]jpeden@galaxy-> more ipcontroller-d776da1d-16ec-4c85-9993-22f5b088bda5-client.json
{
"control": 39396,
"task": 57524,
"notification": 43872,
"task_scheme": "lru",
"mux": 33443,
"patch": "VMFixIPControllerApp",
"iopub": 48118,
"ssh": "",
"key": "2f424d43-2f6f-4088-b394-835845567dee",
"registration": 45187,
"interface": "tcp://*",
"signature_scheme": "hmac-sha256",
"pack": "json",
"unpack": "json",
"location": "172.19.1.134"
}

[security]jpeden@galaxy-> more ipcontroller-d776da1d-16ec-4c85-9993-22f5b088bda5-engine.json
{
"control": 34699,
"task": 51469,
"hb_ping": 45104,
"mux": 47048,
"patch": "VMFixIPControllerApp",
"pack": "json",
"hb_pong": 52224,
"ssh": "",
"key": "2f424d43-2f6f-4088-b394-835845567dee",
"registration": 45187,
"interface": "tcp://*",
"iopub": 51241,
"signature_scheme": "hmac-sha256",
"unpack": "json",
"location": "172.19.1.134"
}

Bcbio is still hanging. What is interesting is that the old version of 0.7.9a, which is running ipython 1.2.1 and ipython-cluster-helper 0.2.15, works when run from the same machine where the later bcbio hangs.

@jpeden1

jpeden1 commented May 15, 2014

@mjafin eth0 is not visible from the compute nodes.

@mjafin
Contributor

mjafin commented May 15, 2014

@jpeden1 What do the json files look like when you use the version that works?

@mjafin
Contributor

mjafin commented May 15, 2014

@jpeden1 In your previous post you mention 172.17.1.33 is the IP of eth5 on the machine that doesn't work, is that right? In the above json dump the IP is 172.19.1.134 - is this an IP of something else?

@jpeden1

jpeden1 commented May 15, 2014

So our submit node, where we start bcbio, is on a 172.17.X.X network. When the newer version of ipython starts, it is somehow selecting compute nodes that are on a 172.19.X.X network. This is an InfiniBand network, and our machine does not have access to it. The older version of ipython correctly chooses the 172.17.X.X network to start jobs on. What it appears we need is a way to tell bcbio to ONLY choose compute nodes that are on 172.17.X.X. How would we do this?

@jpeden1

jpeden1 commented May 15, 2014

The other finding is that all of our 172.19.X.X compute nodes also have a 172.17.X.X address. So if I get on a machine that has access to the 172.19.X.X network and get that machine's hostname, I can connect to it from our hanging submit node by using the hostname. In other words, if a machine has an address of 172.19.1.134, it also has a 172.17.1.134 address that is associated with a hostname in DNS. If bcbio would call the compute nodes by hostname and not by IP, bcbio would not hang.

@mjafin
Contributor

mjafin commented May 16, 2014

Seems like our problems are related, and tied to ipython 2.x somehow being more liberal in how it chooses IPs from the pool.

chapmanb added a commit to roryk/ipython-cluster-helper that referenced this issue May 16, 2014
@chapmanb
Member

Jim and Miika;
Thanks again for all the debugging. It sounds like we're at the root cause of the issue. To summarize, IPython 2.x uses a more thorough approach to discovering IP addresses and doesn't always choose the right one in complex cases. For Miika, this was due to VM-based interfaces. For Jim, it looks like it's due to machines being bound to 2 IPs with only one reachable globally on the network.

Jim, when you ran the interface debugging command earlier, did you run it on the problem submit node (172.19.1.134) or on a different machine? If not on the problem submit node, my guess is that:

./anaconda/bin/python -c 'import netifaces; print [(iface, [x["addr"] for x in netifaces.ifaddresses(iface).get(netifaces.AF_INET, [])]) for iface in netifaces.interfaces()]'

would give you something like:

[..., ('eth5', ['172.17.1.134', '172.19.1.134'])]

If that's right, then I think the right fix is to pick the first valid non-local address found for each interface. I pushed a new version which does this; you can upgrade with:

./bcbio/anaconda/bin/pip install --upgrade ipython-cluster-helper

You should get 0.2.22. Fingers crossed that will work. I don't think the other workarounds are generalizable since IPs are more likely to work than assuming clusters have correct DNS resolution everywhere. I also don't know another way to generalize that it should prefer the 172.17.xxx IPs over 172.19.xxx. From your side, it might also be worth making both network ranges fully visible over the network since IPython might not be the only software that would get tripped up by this.

Hope the new version of ipython-cluster-helper fixes it. Thanks again for all the help debugging.
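
For the curious, the selection approach described above boils down to something like the sketch below (an illustration using netifaces, not the actual ipython-cluster-helper code):

import netifaces

def first_address_per_interface():
    # keep only the first non-loopback IPv4 address found on each interface
    picks = []
    for iface in netifaces.interfaces():
        addrs = [a["addr"] for a in netifaces.ifaddresses(iface).get(netifaces.AF_INET, [])]
        addrs = [a for a in addrs if not a.startswith("127.")]
        if addrs:
            picks.append((iface, addrs[0]))
    return picks

print(first_address_per_interface())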

@jpeden1

jpeden1 commented May 16, 2014

Brad,

I did run the debugging on the problem submit node:

[('lo', ['127.0.0.1']), ('eth0', ['10.48.66.33']), ('eth1', []), ('eth2', []), ('eth3', []), ('eth4', []), ('eth5', ['172.17.1.33'])]

The problem submit node (172.17.1.33) cannot talk to anything on 172.19.X.X. That is on an InfiniBand network and the submit node does not have an InfiniBand NIC. I don't understand how the problem submit node is even getting a 172.19.X.X address?

I'll do the upgrade and let you know the result.

Thanks

@mjafin
Contributor

mjafin commented May 16, 2014

I might be adding to the confusion, but in my case it was the compute node that was causing the problem. The json files list compute node IPs, if I'm not mistaken (I might be!). If the compute node reports 172.19.x.x then obviously the submit node wouldn't be able to see it, right?

What do the compute nodes report for the eth interfaces?

@jpeden1

jpeden1 commented May 16, 2014

@mjafin You are correct that the json files list compute nodes (see above). In our case the json file shows the location as 172.19.1.134. That compute node has two interfaces; the other interface has an IP of 172.17.1.134.

@jpeden1

jpeden1 commented May 16, 2014

@chapmanb I did the upgrade of ipython-cluster-helper and reran. Same problem: bcbio hangs and the json files have compute nodes on the 172.19.X.X network. Is there a way to have it only select compute nodes on the 172.17.X.X network, or to have them called by name instead of by IP? Where does bcbio select compute nodes?

@chapmanb
Member

Jim;
Thanks for testing and sorry that we're still running into issues. It appears the problem is with resolving names on the compute node that the ipcontroller gets assigned to. The submit node may be a red herring -- or perhaps the problem submit node specifically schedules to different nodes than the working submit node. In response to your question, the cluster scheduler chooses this, not bcbio. You can look at the torque_controller* file that is created to see if it has any clues as to why things are scheduled in certain places.

My suggestion to debug would be to start a cluster, note the compute node that it gets assigned to, which is likely the 172.17.1.134 machine, and then ssh in there and run the interfaces command to give us a better idea of what is happening on that machine:

./anaconda/bin/python -c 'import netifaces; print [(iface, [x["addr"] for x in netifaces.ifaddresses(iface).get(netifaces.AF_INET, [])]) for iface in netifaces.interfaces()]'

Hopefully that will provide more insight. Sorry for any confusion from my side; I'm not totally sure about your setup so am making best guesses here but hope this helps.

@jpeden1

jpeden1 commented May 16, 2014

Brad,

As you requested, I sshed to the compute node (172.17.1.134) and ran your debug code:
[('lo', ['127.0.0.1']), ('eth0', ['172.17.1.134']), ('eth1', []), ('ib0', ['172.19.1.134']), ('ib1', [])]

Also, the older version of 0.7.9a runs fine from the problem submit node.

chapmanb added a commit to roryk/ipython-cluster-helper that referenced this issue May 16, 2014
@chapmanb
Member

Jim;
Thank you, this is super helpful. The issue is that you have two interfaces with different IPs, eth0 and ib0, and it's not clear to IPython which should be preferred. It arbitrarily picks the last one in the list, so it defaults to the ib0 one, which is not configured to be reachable on your network.

I pushed a new version of ipython-cluster-helper (0.2.23) that prioritizes eth interfaces and will hopefully resolve the issue:

./bcbio/anaconda/bin/pip install --upgrade ipython-cluster-helper

As a small aside, this is independent of the version of bcbio, and is related to the version of IPython (2.x is problematic, 1.x will work). The changes I'm pushing work around it by monkey patching IPython.

So, fingers crossed that this will get things working for you and let you update at will. Thanks again for all the patience debugging.
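
For reference, the spirit of that workaround -- prefer ethN addresses over InfiniBand ones when deciding what the controller should advertise -- could be sketched roughly as below (illustrative only, not the actual monkey patch shipped in 0.2.23):

import netifaces

def preferred_controller_ip():
    # gather (interface, ip) pairs, skipping loopback addresses
    candidates = []
    for iface in netifaces.interfaces():
        for addr in netifaces.ifaddresses(iface).get(netifaces.AF_INET, []):
            if not addr["addr"].startswith("127."):
                candidates.append((iface, addr["addr"]))
    # sort so eth* interfaces come before ib* and anything else
    candidates.sort(key=lambda pair: (not pair[0].startswith("eth"), pair[0]))
    return candidates[0][1] if candidates else "127.0.0.1"

print(preferred_controller_ip())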

@jpeden1

jpeden1 commented May 16, 2014

I'll give that a try.

Someone here pointed out that it might fix the issue if we could change the "--ip=*" setting to only allow bcbio to use our 10GbE interface. Is that possible, and where would I make that change?

Thanks again for all the help.

@jpeden1

jpeden1 commented May 16, 2014

Did the install --upgrade and reran bcbio. The .json's are CORRECT. :) It has gotten past the point where it was hanging. It will take a little while for it to finish this test job, but it looks promising!

@chapmanb
Member

Jim;
Awesome that this worked. Thanks again for all the patience working through the issues. Did everything finish up okay?

@jpeden1

jpeden1 commented May 20, 2014

Brad,
Have had a couple of different jobs finish without issue. Still wondering about the "--ip=" option and whether we could use that to have IPython select an eth interface? Thanks again for all the help and fixes!!

@chapmanb
Member

Jim;
Brilliant, glad to have this working. Sorry about forgetting to respond to the --ip suggestion. You could use it to specify specific IPs, but I don't know of a way to generalize this better in the ipython-cluster-helper code than how IPython does it. You'd need to be able to adjust the IP to point to the correct eth interface on every machine the ipcontroller starts on, which is essentially what IPython tries to do when we specify '--ip=*'. Thanks again for all the work on this.
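
As a footnote on the --ip question: for a single machine you could pin the controller's address in the profile's ipcontroller_config.py. A hedged example follows (the address is a placeholder, c.HubFactory.ip is the usual IPython.parallel setting as far as I know, and as noted above this does not generalize across nodes):

# log/ipython/ipcontroller_config.py -- illustrative only; the address is a placeholder
c = get_config()
# advertise and bind this specific address instead of '*'
c.HubFactory.ip = "172.17.1.134"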
