torque is hanging indefinitely #416
Comments
Jason; Hopefully that will at least get you running. If so we can try to trace back to see what might have broken it. One other thing to try is to look for old ipython controller processes from previous runs. I haven't seen that recently but know it solved the issue for a couple of people. Sorry again about the issue.
Thanks Brad - The version in our 0.7.9 development build is In our stable old build (that still works) is Working on rolling back versions now but wanted to report. There are no dangling ipys out on the nodes - been there!
Jason;
Hi Guys,
Hi guys, Hmm, we're not having trouble running on SLURM or LSF, and we haven't touched the SGE and Torque bits of ipython-cluster-helper, so I'm not sure where to start debugging. If you install the newest version of ipython-cluster-helper from https://github.com/roryk/ipython-cluster-helper and run the example.py script in ipython-cluster-helper, does it work? That would help narrow down whether it is something with ipython-cluster-helper or something in bcbio-nextgen. Thanks a lot for the troubleshooting, sorry I can't be of more help.
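The example.py script itself isn't quoted in this thread; a minimal sketch of the same kind of standalone test, using ipython-cluster-helper's cluster_view API, might look like the following (the scheduler, queue name and job count here are illustrative placeholders, not values taken from the thread):

```python
# Sketch of the kind of standalone test example.py performs, using the
# cluster_view API from ipython-cluster-helper. Scheduler, queue and
# num_jobs are placeholder values.
from cluster_helper.cluster import cluster_view

def square(x):
    return x * x

if __name__ == "__main__":
    # cluster_view starts an IPython controller plus engines through the
    # scheduler, yields a view, and tears everything down on exit.
    with cluster_view(scheduler="torque", queue="batch", num_jobs=2) as view:
        print(list(view.map(square, range(10))))
```

If this hangs the same way bcbio does, the problem is below bcbio-nextgen, in ipython-cluster-helper/IPython or the cluster network setup.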
Hey guys - super sorry for this one - turns out the DNS entry changed somewhere somehow on our galaxy machine that is doing the bcbio submissions. #facepalm
Hi Jason. Thanks for following up-- why can't all bugs solve themselves? I reopened this so we can tackle Miika's issue.
Jason -- thanks for this. Glad to have one problem solved. Miika -- is your problem with hanging or with timeouts? If hanging, could you ssh into the controller and engine machines and check memory usage? I've realized that with larger runs memory usage is becoming a problem, and I'm working on tactics to avoid that now. That's the only potential problem I know of right now, so it's my best guess.
Brad, it's a problem with hanging. There's plenty of memory available on the controller/engine node (256GB) and things work fine when I use 0.7.8 stable. Rory, how did you want me to run the example.py script?
Hmm, now that I submitted it to another queue, off it goes... false alarm I guess, and beats me what the reason might be!
Two for two, this must be a dream. highmem sounds like a good queue for bcbio-nextgen to work on. Has it worked before on the highmem queue? If you run the example.py script like this:
Does it work? Is the smp parallel environment that is being selected correct?
So, 0.7.8 worked with highmem.q on this same data. With the other queue, ngs.q, the example.py file produces more output. With highmem.q it doesn't. I'll try to figure out why.
Hi Miika, ipython-cluster-helper not working correctly is a good clue. It seems like sometimes this behavior happens because the engines and controller can't talk to each other; we've seen it when the node the controller is running on is walled off from the engines. I'm interested in knowing what you find out.
Me again... I take it back - it wasn't a DNS thing, I was reading the IPs wrong in chasing this down. So, Rory we pulled down your example.py script and gave that a spin - same deal, however a slight deviation to our queue as follows:
This does the same thing where it works fine in our stable version, but in our development clone, we're hanging until timeout - but the nodes are indeed checked out. This is where it sits:
This is specific to our 'galaxy' head node - if we try from the standard submission node or from a compute node running interactively, both branches work as expected. On the stable version, all submission hosts work fine with ipython-cluster, but on the galaxy node, where we must submit bcbio jobs, the development branch is hanging... These are all CentOS 6 machines. Is there additional debugging or logs to look at, or is there a place in the source where we can add some debugging?
Hi Jason, Great-- thanks for looking into this. @chapmanb will hopefully have some more ideas, but taking a crack at it: Are you by any chance running the bcbio_nextgen.py command from the head node, or are you also submitting that as a job? It has come up with a couple of cluster configurations that the compute nodes are all allowed to open random ports on each other, but not on the head node, so the controller and engines can never connect if the job runs on the head node, while everything is fine if the job is submitted to run on a compute node. It sounds to me like there is something specific about the galaxy node setup that is preventing the engines and controller from registering with each other. In the .ipython/profile_uuid/ directory there might be some log files specific to IPython that have more information; in particular there are JSON files with the connection details, which might help narrow down which hosts can't talk to each other or whether a port range is getting blocked off. Hope something in there is on the right path!
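Those connection files aren't shown anywhere in the thread; as a hedged illustration, a quick way to dump whatever the controller wrote would be something like the sketch below (the profile path is a placeholder for the profile_uuid directory created for the run):

```python
# Sketch: dump any ipcontroller-*.json connection files under an IPython
# profile directory, to see which IPs and ports the controller advertised.
# The profile path is a placeholder; substitute the profile directory
# created for your run.
import glob
import json
import os

profile_dir = os.path.expanduser("~/.ipython/profile_example")  # placeholder
pattern = os.path.join(profile_dir, "security", "ipcontroller-*.json")
for fname in glob.glob(pattern):
    print(fname)
    with open(fname) as handle:
        print(json.dumps(json.load(handle), indent=2, sort_keys=True))
```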
Hi Jason, Did this ever end up getting resolved? If so, what was the ultimate solution? If not, how can we help? :)
Sorry I left this one hanging - no, we didn't get it solved; I've been punting and running on compute nodes. We actually just went down the rabbit hole of building a copy on the galaxy node and need to update SSL and reboot and so on... Not sure where or what is different, but decided to try building afresh and hope that clears up the blockage.
Rory, no luck here! It gets the slots:
but then it is left hanging indefinitely here:
Hi Jason, Thanks for the update and sorry for the problems. Let us know if rebuilding the node doesn't fix it up. Miika, hmm-- I see on the process list that the actual bcbio job isn't there; is that job running on the head node? There is a client, the controller and a bunch of engines. The client sends jobs to the controller, which distributes them to the engines, so the client needs to be able to connect to the ZMQ sockets the controller sets up. Sometimes clusters are set up such that the compute nodes can talk to each other but the head node can't directly connect to ports on the compute nodes. If the bcbio_nextgen job is submitted to highmem.q as well, does that work? If not, does running that example.py script from ipython-cluster-helper work? Does it work on any queue, or just that one?
Rory, both ngs.q and highmem.q work when running 0.7.8. With 0.7.9a ngs.q works but highmem.q doesn't. Tried both actual jobs and example.py. Was anything changed in the cluster-helper recently (post 0.7.8 release)? I'm usually not submitting the actual bcbio job itself to the queue, just running it in a |
OK, so after some further investigation, if I'm running So something about the communication, ZMQ sockets, changed in 0.7.9a?
Sorry about the spam, but when I CTRL-C
Hi Miika, Is that trace from submitting to the highmem.q from the ngs.q host? It definitely seems like when you are submitting to the highmem queue from the ngs.q host, the client (the main bcbio job) can't connect to the controller. If you submit the example.py job itself as a job, does it work on both queues? That way there would be the example.py job, a controller job and a set of engine jobs running on the scheduler, not just the controller and engines. Thanks for helping debug what is going on; don't worry about spamming-- more info is awesome.
Miika; If they are different, does converting the broken IPython one to the working IPython version fix it (or vice versa, although you might not want to break the one that works)? Just trying to isolate the root cause a bit. Hope this helps.
Rory, so qsubbing a wrapper script calling example.py to highmem.q did indeed run just fine, so it must be something about communication between the ngs.q nodes and the highmem.q host. So to answer Brad, here's the list for 0.7.9a:
and here's the list for 0.7.8:
So 1.2.1 vs. 2.0.0. Edit: Downgrading ipython-cluster-helper to 0.2.17 within 0.7.9a didn't help, so the culprit must be IPython 2.0.0.
Hi Miika, Score! This seems like it might be related to #138. There is a solution to one type of network configuration problem here: https://bcbio-nextgen.readthedocs.org/en/latest/search.html?q=network+troubleshooting&check_keywords=yes&area=default#troubleshooting If you find out what needs to be changed in your machines' configuration to let the ngs and highmem nodes talk to each other, could you follow up here so we can add it to the docs?
Thanks Rory, I'll see if I can figure out something. On both the ngs.q and highmem.q machines the /etc/hosts file is identical and lists all the hosts and their corresponding IP addresses. I also checked they're correct by running Even if I can't figure out the underlying reason, it's not greatly blocking me from getting the jobs run with the workarounds - obviously it would be nice to nail the real reason though.
It also might be that access to the ZeroMQ ports on the highmem.q machines is blocked when coming from the ngs.q machines. Maybe the ZeroMQ port range is blocked? Running iptables --list on the highmem.q machines might have some clues.
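As a hedged aside, one quick way to check whether a particular controller port is reachable from another node, without involving IPython at all, is a bare TCP connect test; the host and port below are placeholders rather than values from this thread:

```python
# Sketch: check whether a TCP port on the controller host is reachable from
# the current node. Host and port are placeholders; for a real check use the
# values recorded in the ipcontroller-*-client.json connection file.
import socket

def port_open(host, port, timeout=5):
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect((host, port))
        return True
    except socket.error:
        return False
    finally:
        sock.close()

print(port_open("controller-host.example.org", 12345))
```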
I see a .json file in my work directory under log/ipython. Not sure how to decode it? In our case it is eth5 that is the adapter that talks to the cluster. Below are the contents of the json file:
{"next_id": 2, "engines": {"0": "8585ba26-e167-459a-8303-f19de845d7b4", "1": "a0d9243b-8980-44dd-b104-c52d6f79d956"}}
Jim
Jim;
The security directory is empty. I had let bcbio run for 5-10 minutes before I hit control-C.
Jim; Ideally we'd like to see it pointing to your eth5 adapter. Thanks.
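The exact interface-debugging command Brad asked for is not preserved in the thread. A sketch that produces output in the same shape as the lists quoted below, assuming the third-party netifaces package is installed, would be:

```python
# Sketch: list each network interface with its IPv4 addresses, producing
# output shaped like [('lo', ['127.0.0.1']), ('eth5', ['172.17.1.33'])].
# This is a guess at the elided command, not necessarily the one requested.
import netifaces

interfaces = []
for iface in netifaces.interfaces():
    addrs = netifaces.ifaddresses(iface).get(netifaces.AF_INET, [])
    interfaces.append((iface, [a["addr"] for a in addrs]))
print(interfaces)
```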
Here is what I get: [('lo', ['127.0.0.1']), ('eth0', ['10.48.66.33']), ('eth1', []), ('eth2', []), ('eth3', []), ('eth4', []), ('eth5', ['172.17.1.33'])]
Jim;
@chapmanb I wonder if the ipcontroller files are removed as part of the temporary profile removal, i.e. when hitting CTRL-C? @jpeden1 maybe if you start a process that hangs, don't CTRL-C it but rather check from your terminal screen the location of the ipython profile and see if the ipcontroller files are there? Out of curiosity, is eth0 visible from the other machines? Maybe ipython is trying to use eth0 rather than eth5.
@mjafin I reran bcbio and the json files are there while it is running. They must get deleted when I hit CTRL-C. Here are the contents of the files:
[security]jpeden@galaxy-> more ipcontroller-d776da1d-16ec-4c85-9993-22f5b088bda5-client.json
[security]jpeden@galaxy-> more ipcontroller-d776da1d-16ec-4c85-9993-22f5b088bda5-engine.json
Bcbio is still hanging. What is interesting is that the old version of 0.7.9a, which is running ipython 1.2.1 and ipython-cluster-helper 0.2.15, works when run from the same machine where the later bcbio hangs.
@mjafin eth0 is not visible from the compute nodes.
@jpeden1 What do the json files look like when you use the version that works?
@jpeden1 In your previous post you mention 172.17.1.33 is the IP of eth5 on the machine that doesn't work, is that right? In the above json dump the IP is 172.19.1.134 - is this an IP for something else?
So our submit node that we start bcbio on is on a 172.17.X.X network. When the newer version of ipython starts, it is somehow selecting compute nodes that are on a 172.19.X.X network. This is an InfiniBand network, and our machine does not have access to it. The older version of ipython correctly chooses the 172.17.X.X network to start jobs on. What it appears we need is a way to tell bcbio to ONLY choose compute nodes that are on 172.17.X.X. How would we do this??
The other finding is that all of our 172.19.X.X compute nodes also have a 172.17.X.X address. So if I get on a machine that has access to the 172.19.X.X network and get the hostname of that machine, I can connect to it from our hanging submit node by using the hostname. In other words, if a machine has an address of 172.19.1.134, it also has a 172.17.1.134 address that is associated with a hostname in DNS. If bcbio called the compute nodes by hostname and not by IP, bcbio would not hang.
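Neither bcbio nor ipython-cluster-helper is shown exposing a "restrict to this subnet" option in this thread; purely as an illustration of the selection logic Jim is asking for, picking the address that falls inside an allowed subnet from an interface listing could look like this (the 172.17.0.0/16 subnet and the example data are placeholders):

```python
# Sketch of subnet-based address selection: from a netifaces-style listing of
# (interface, [addresses]), return the first address inside an allowed subnet.
# Illustrative only; not an actual bcbio or ipython-cluster-helper option.
from ipaddress import ip_address, ip_network

ALLOWED = ip_network("172.17.0.0/16")

def pick_address(interfaces):
    for _iface, addrs in interfaces:
        for addr in addrs:
            if ip_address(addr) in ALLOWED:
                return addr
    return None

example = [("lo", ["127.0.0.1"]),
           ("eth0", ["172.19.1.134"]),
           ("eth5", ["172.17.1.134"])]
print(pick_address(example))  # -> 172.17.1.134
```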
Seems like our problems are related, and related to IPython 2.x being more liberal in how it chooses the IPs from the pool somehow.
…ce (bcbio/bcbio-nextgen#416). Add additional lxc created interface to avoid.
Jim and Miika; Jim, when you ran the interface debugging command earlier, did you run it on the problem submit node (
would give you something like:
If that's right, then I think the right fix is to pick the first valid non-local address found for each interface. I pushed a new version which does this if you can upgrade with:
You should get 0.2.22. Fingers crossed that will work. I don't think the other workarounds are generalizable since IPs are more likely to work than assuming clusters have correct DNS resolution everywhere. I also don't know another way to generalize that it should prefer the Hope the new version of
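The heuristic Brad describes above ("pick the first valid non-local address found for each interface") might look roughly like the sketch below; this is an illustration of the idea, not the actual ipython-cluster-helper code:

```python
# Rough sketch of the described heuristic: for each interface, keep only the
# first address that is not loopback or link-local, so a node with several
# addresses advertises one predictable IP per interface.
def first_nonlocal_addresses(interfaces):
    chosen = []
    for iface, addrs in interfaces:
        usable = [a for a in addrs
                  if not a.startswith("127.") and not a.startswith("169.254.")]
        if usable:
            chosen.append((iface, usable[0]))
    return chosen

example = [("lo", ["127.0.0.1"]),
           ("eth5", ["172.17.1.134", "172.19.1.134"])]
print(first_nonlocal_addresses(example))  # -> [('eth5', '172.17.1.134')]
```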
Brad, I did run the debugging on the problem submit node: [('lo', ['127.0.0.1']), ('eth0', ['10.48.66.33']), ('eth1', []), ('eth2', []), ('eth3', []), ('eth4', []), ('eth5', ['172.17.1.33'])] The problem submit node (172.17.1.33) cannot talk to anything on 172.19.X.X. That is an InfiniBand network and the submit node does not have an InfiniBand NIC. I don't understand how the problem submit node is even getting a 172.19.X.X address??? I'll do the upgrade and let you know the result. Thanks
I might be adding to the confusion, but in my case it was the compute node that was causing the problem. The json files list compute node IPs if I'm not mistaken (I might be!). If the compute node reports 172.19.x.x then obviously the submit node wouldn't be able to see it, right? What do the compute nodes report for the eth interfaces?
@mjafin You are correct that the json files list compute nodes (see above). In our case the json file is showing the location as 172.19.1.134. That compute node has two interfaces; the other interface has an IP of 172.17.1.134.
@chapmanb I did the upgrade of ipython-cluster-helper and reran. Same problem. Bcbio hangs and the json files have compute nodes on the 172.19.X.X network. Is there a way to have it only select compute nodes on the 172.17.X.X network, or to have them called by name instead of by IP?? Where is bcbio selecting compute nodes??
Jim; My suggestion to debug would be to start a cluster, note the compute node that it gets assigned to, which is likely the
Hopefully that will provide more insight. Sorry for any confusion from my side; I'm not totally sure about your setup so am making best guesses here but hope this helps.
Brad, As you requested, I ssh'd to the compute node (172.17.1.134) and then ran your debug code: Also, the older version of 0.7.9a runs fine from the problem submit node.
…make better choices on clusters with multiple IPs: bcbio/bcbio-nextgen#416
Jim; I pushed a new version of ipython-cluster-helper (0.2.23) that prioritizes
As a small aside, this is independent of the version of bcbio, and is related to the version of IPython (2.x is problematic, 1.x will work). The changes I'm pushing work around it by monkey patching IPython. So, fingers crossed that this will get things working for you and let you update at will. Thanks again for all the patience debugging.
I'll give that a try. Someone here pointed out that it might fix the issue if we could change the "--ip=*" setting to only allow bcbio to use our 10GbE network. Is that possible, and where would I make that change? Thanks again for all the help.
Did the install --upgrade and reran bcbio. The .json's are CORRECT. :) It has gotten past the point where it was hanging. It will take a little while for it to finish this test job, but it looks promising!
Jim;
Brad,
Jim;
Hey there -
In trying the updates for #386 we have killed our development install with 756be0a - any job we try to run, be it human, rat, mouse, or the broken dogs, all hang indefinitely with torque. The nodes get checked out and the engines and clients look to be running via qstat or showq - however nothing is happening on the nodes when I look at top or ps aux. There are plenty of free nodes so this doesn't seem to be a queue issue. The jobs all hang until they hit the timeout and that's all I get. I don't see anything in the logs/ipython logs - Engines appear to have started successfully... I've rubbed my eyes and wiped my work dirs a few times to no avail. I checked and indeed running -t local works.... Any suggestions or additional info I can provide? Thanks!