iPython times out indefinitely... #298
Jason;
The other issue I've run into is old IPython ghost scripts running on the machine from previous runs. You could check for those. If none of this helps: were you able to submit jobs/start IPython from this problem node previously? If not, something could be blocking ZeroMQ ports or whatever else IPython needs to start up that the compute nodes do allow. Hope this helps some. Thanks for the patience debugging this.
Yes, we have basically 2 submit nodes - one is the standard login1 node and one is new that we built for galaxy. Both have worked great thus far. I did check every single node for dangling chads with this - https://gist.github.com/caddymob/8927861 - and while there were a few old ones out there, I killed them and wiped my work dirs, yet all submissions still hang. Very strange indeed. I haven't tried submitting from every node (nor do I want to - I want/need to submit from galaxy), but I did try a couple, and I can still see several jobs from this weekend running when I check the queue. Had me head-scratching all weekend; now my IT guys are looking into it and none of us have a clue what happened... For the record, qsub'ing non-bcbio jobs works no problemo. It's something with IPython that has lost its once glorious ability to herd trillions of tiny little DNA sequences. I see you're up to 2.0.11 on iPy at this point, I'm at 2.0.9... I don't see how this will magically fix it, but have you got any other ideas?
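The sweep the linked gist does can be sketched roughly like this; the node names, ssh access, and the exact process-name pattern are assumptions for illustration, not Jason's actual script:

```shell
# Hypothetical stale-IPython sweep across nodes (in the spirit of the
# linked gist); the node list and ssh setup are assumptions.
PATTERN='ipcontroller|ipengine|ipcluster'

list_ghosts() {
  # Print PIDs of leftover IPython cluster processes on one node.
  ssh "$1" "pgrep -u \$USER -f '$PATTERN'" 2>/dev/null
}

# for node in node01 node02 node03; do
#   pids=$(list_ghosts "$node")
#   [ -n "$pids" ] && ssh "$node" "kill $pids" && echo "cleaned $node"
# done
```

The loop is left commented so nothing gets killed until the pattern is checked against a real `ps` listing first.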
Jason;
OK, I'm still so confused ATM, but I think I have an idea of where the issue may lie. I just submitted the same test job with just -n 16 (so a single node for me) from a) a compute node, b) the login node, and c) galaxy. All failed to launch: nothing gets in the queue and the 2 Torque/PBS scripts are never created. The work dir looks like this:
The IPython log is pretty uninformative -
(Note the log file is my own tee'd log.) There is no Torque job and nothing is queued in the cluster. Even if the queue was a long way out, I should still be submitting the job.

This all, however, is using the lab's default CPU-hour debiting account (managed by Gold). When I used the 'red-button' queue reserved for a single specific priority project - which puts said jobs at the top of the queue - they get queued and presumably will run. I first used this red-button account with bcbio thanks to #258 last Thursday for a red-button job, and my regular jobs stopped working Friday. Gold accounting binaries aren't seen by the compute nodes, but the account settings do get passed on.

So, what did we learn? I'm not totally sure yet, but I did a full fresh test again - same data, using the default account, from login1, galaxy, and a compute node. No jobs queued or running. But! I then tried the red-button queue from galaxy - and sure enough they wrote the Torque PBS scripts and are queued to run. I think it's on our end. Not sure why first using the red-button stopped me from running against the default or other accounts specifically for bcbio (except if submitted from a compute node..??). Will be digging in more tomorrow as I'll need my gurus to help, and I have no intention of just red-buttoning every job or using a compute node to launch - I need to run from galaxy.

To re-iterate: before I hit the red-button with bcbio, my jobs would at least queue. Post red-button, only the red-button queue (or submitting, suboptimally, from a compute node) gets the buzz saw firing. Final test: I manually did a few submissions by hand.
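A minimal hand-rolled Torque script is a quick way to separate scheduler/account behavior from bcbio itself; the account name and resource values below are placeholders, not the site's real settings:

```shell
# Write a tiny probe job; swap the -A account between the default and
# the red-button account to see which one actually queues (placeholders).
cat > probe.pbs <<'EOF'
#!/bin/bash
#PBS -N queue-probe
#PBS -l nodes=1:ppn=1,walltime=00:05:00
#PBS -A default_account
hostname
EOF
# qsub probe.pbs && qstat -u "$USER"
```

If the probe queues under one account but not the other, the problem is on the scheduler/accounting side rather than in bcbio or IPython.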
More testing to come - I'm bogarting the cluster right now with 1410 CPUs bcbio'ing all sorts of awesome things (launched on compute to get pushed forward). Hard to get in line and test completely right now, but the red-button submission at least getting queued is a good sign, or a sign of some sort. I'm hoping this is helpful - kinda just taking notes here as I poke around in the dark. Strange one, ain't it? Bet it's going to be like the famous 'can't send email more than 500 miles' story.... we're close - appreciate your glucose metabolism on this.
…e submission for help debugging bcbio/bcbio-nextgen#298. Make start commands forward compatible with IPython 2.0dev
Jason;
It should now provide error tracebacks in the log. Sorry about all the pain debugging this, and thanks for all your patience as always. Fingers crossed this will give us some useful information to sort it out.
Hi Brad, OK - pretty sure I have this nailed down. Simply put, I MUST provide a resource string.
The issue, it seems, is the empty resource string -
As soon as I add a resource string, the jobs write their PBS scripts and queue up.
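The difference Jason describes boils down to whether a resource string reaches the scheduler; the flag spelling, file names, and account name below are illustrative assumptions reconstructed from the thread, not a verified bcbio invocation:

```shell
# Hung with no resource string (illustrative command, not verbatim):
# bcbio_nextgen.py bcbio_system.yaml bcbio_sample.yaml \
#     -t ipython -n 16 -s torque -q batch
#
# Queued once an account resource was supplied:
# bcbio_nextgen.py bcbio_system.yaml bcbio_sample.yaml \
#     -t ipython -n 16 -s torque -q batch -r account=redbutton
```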
So this was part confusion and part missing that little detail - and my MUST FIRE NOW jobs were for projects where I needed to debit accounts, so I was specifying one. I can handle this on my end, but to further tweak my feature request: it seems that if a user doesn't give a resource string, bcbio should still submit cleanly. I have a successful run log pre-dating the addition of Torque accounting where the resources field is empty -
Jason;
If you upgrade to the latest ipython-cluster-helper, it should work as you expect now and be good without a resource string. Thanks again for all the patience tracking this down, and hope things work cleanly for you now.
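The upgrade itself is just a pip install; the anaconda path below is a common bcbio layout and an assumption, not verified for this site:

```shell
# Upgrade the helper in whichever Python environment bcbio uses
# (bcbio bundles its own Python; the path below is an assumed layout):
# /usr/local/share/bcbio/anaconda/bin/pip install --upgrade ipython-cluster-helper
#
# Or, for a plain pip-managed install:
# pip install --upgrade ipython-cluster-helper
```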
Thanks Brad - glad we got it sorted. The perfect storm of confused weirdness, but I will upgrade and rock out! OT: re the cmd above, aforementioned subtle differences & tweet - @roryk & #282 (comment) -- 'twas a green anaconda, not a python... Cladistics aside, ready to tame another beautiful snake ;)
Hi Brad,
Before I had jobs finished to update re #258, bcbio stopped working. We would start a bcbio run; it would hang for however long the timeout was set to, then IPython would throw an error. Here is the output:
IPython is not even getting as far as producing the PBS scripts. The error above is the only error we are seeing. bcbio WORKS if I check out a compute node (using -t torque, just to be clear, not just running locally on said node) and start the job there. We have checked that all nodes are talking across the network. Any idea what might be causing this? Second, is there a way to have - I am guessing - IPython be more verbose?
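Since ZeroMQ connectivity between the controller and engines is a usual suspect in hangs like this, a quick TCP reachability probe can rule out firewalled ports. The real host/port values would come from the controller's connection file (the ipcontroller-*.json under the profile directory); the values below are assumptions:

```shell
# Bash-only TCP probe (no nc required); prints "open" or "closed".
check_port() {
  (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null && echo open || echo closed
}

check_port localhost 22   # e.g. sshd; substitute the controller's host/ports
```

Run it from a compute node against the submit host's controller ports (and vice versa) to see whether traffic is allowed in both directions.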