Python submitted DRMAA jobs are not running on the worker nodes #246

Open

vipints opened this issue Apr 16, 2015 · 130 comments

@vipints

vipints commented Apr 16, 2015

Since yesterday I have been struggling to debug an error caused by running Python scripts via the drmaa module. My compute jobs are failing with Exit_status=127; here is one such event: tracejob -slm -n 2 3025874. The drmaa module is able to dispatch the job to the worker node with all the necessary PATH variables, but it fails immediately afterwards (using only a single second). The log file didn't give much information:

-bash: module: line 1: syntax error: unexpected end of file
-bash: error importing function definition for BASH_FUNC_module'
-bash: line 1: /var/spool/torque/mom_priv/jobs/3025874.mskcc-fe1.local.SC: No such file or directory

I am able to run this python script without Torque on the login machine and a worker node (with qlogin).

Has anybody used the drmaa/Python combination in cluster computing?

I checked the drmaa job environment; all env PATH variables are loaded correctly. I am not sure why the worker node is kicking my job out.

I am not quite sure how to proceed with the debugging or where to look. Any suggestions/help? :)
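For context, the failing workflow is plain drmaa-python job submission; a minimal sketch of the pattern (the script path, arguments, and resource string below are placeholders, not the actual job) looks like this:

```python
# Minimal sketch of the Python drmaa submission pattern in question.
# The script path, args, and resource string are placeholders.
import drmaa

with drmaa.Session() as s:
    jt = s.createJobTemplate()
    jt.remoteCommand = '/path/to/my_script.py'      # hypothetical script
    jt.args = ['--input', 'data.txt']               # hypothetical arguments
    jt.nativeSpecification = '-l nodes=1:ppn=4 -l mem=12gb -l walltime=40:00:00'
    job_id = s.runJob(jt)
    info = s.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)
    print('job %s exited with status %s' % (job_id, info.exitStatus))
    s.deleteJobTemplate(jt)
```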

@tatarsky
Contributor

That bash error looks suspiciously like what the bash patch for "shellshock" says when an improper function invocation is attempted.

@tatarsky
Contributor

BTW are you saying "this worked before yesterday" ???

@vipints
Author

vipints commented Apr 16, 2015

Yes, these scripts were running perfectly until yesterday.

@tatarsky
Contributor

Did you perhaps "add a module" yesterday? As far as I can tell, that error has to do with the "modules" package.

@vipints
Author

vipints commented Apr 16, 2015

No

@tatarsky
Contributor

Your .bashrc was modified as of this morning... what was changed?

@tatarsky
Contributor

Also attempting to reproduce....

@vipints
Author

vipints commented Apr 16, 2015

I just deleted an empty line, that is what I remember...

@tatarsky
Contributor

Well I'll have to look around. No changes I can think of on the cluster except the epilog script which only fires if the queue is "active".

@vipints
Author

vipints commented Apr 16, 2015

I didn't make any changes to the scripts. Thank you, @tatarsky.

@tatarsky
Contributor

BTW, do you know what "the worker node" was in that message? I can dig around, but if you already know, it would be appreciated.

@vipints
Author

vipints commented Apr 16, 2015

gpu-1-13

@vipints
Author

vipints commented Apr 16, 2015

I asked for nodes=1 and ppn=4 and it dispatched:

exec_host=gpu-1-13/6+gpu-1-17/19+gpu-1-15/10+gpu-3-8/14

@tatarsky
Contributor

Yeah, I saw that. Does your script contain an attempt to use "module"? Or could you perhaps provide me the location of the item you run? As far as I can tell, that error is coming from the modules package's /etc/profile.d/modules.sh, which is untouched, so I'm curious what's calling it.

@vipints
Author

vipints commented Apr 16, 2015

sending an email with details.

@tatarsky
Contributor

Thanks!

@tatarsky changed the title from "Jobs are not running on the worker nodes" to "Python submitted DRMAA jobs are not running on the worker nodes" on Apr 17, 2015
@tatarsky
Contributor

Made title of this more specific for my tracking purposes.

@tatarsky
Contributor

Under some condition, the method Python DRMAA uses to submit jobs appears to get blocked from submitting more data. I have @vipints running again, but I am chasing down what the resolution was for this Torque mailing list discussion:

http://www.supercluster.org/pipermail/torqueusers/2014-January/016732.html

I do not believe the hotfix "introduced" this problem, as the date of that discussion is old. Opening a ticket with Adaptive to enquire.
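No fix is identified at this point; as a narrowing-down aid (purely illustrative instrumentation, not part of the thread's setup), one can count successful submissions so the point at which the DRMAA layer starts refusing jobs is easy to report:

```python
# Illustrative instrumentation only: count successful drmaa submissions so
# the point at which the DRMAA layer starts refusing new jobs can be reported.
import logging
from drmaa.errors import DrmaaException

log = logging.getLogger('drmaa_submit')
submitted = 0

def submit_counted(session, job_template):
    """Submit one job, keeping a running count; re-raise on failure so the
    caller still sees the original drmaa error."""
    global submitted
    try:
        job_id = session.runJob(job_template)
    except DrmaaException as exc:
        log.error('submission attempt %d failed: %s', submitted + 1, exc)
        raise
    submitted += 1
    return job_id
```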

@vipints
Author

vipints commented Apr 22, 2015

Hi @tatarsky, this morning I noticed that Python drmaa-submitted jobs are not dispatching to the worker nodes. I am not able to see the start time; for example, showstart 3122268 returns:

INFO: cannot determine start time for job 3122268

I don't know what is happening here.

@tatarsky
Contributor

I don't see the same issue as before.

Looks to me like a simple case of your jobs being rejected due to resources.

checkjob -v 3122268
Node Availability for Partition MSKCC --------

gpu-3-9                  rejected: Features
gpu-1-4                  rejected: Features
gpu-1-5                  rejected: Features
gpu-1-6                  rejected: Features
gpu-1-7                  rejected: HostList
gpu-1-8                  rejected: HostList
gpu-1-9                  rejected: HostList
gpu-1-10                 rejected: HostList
gpu-1-11                 rejected: HostList
gpu-1-12                 rejected: Features
gpu-1-13                 rejected: Features
gpu-1-14                 rejected: Features
gpu-1-15                 rejected: Features
gpu-1-16                 rejected: Features
gpu-1-17                 rejected: Features
gpu-2-4                  rejected: HostList
gpu-2-5                  rejected: HostList
gpu-2-6                  rejected: Features
gpu-2-7                  rejected: HostList
gpu-2-8                  rejected: Features
gpu-2-9                  rejected: Features
gpu-2-10                 rejected: HostList
gpu-2-11                 rejected: Features
gpu-2-12                 rejected: Features
gpu-2-13                 rejected: HostList
gpu-2-14                 rejected: Features
gpu-2-15                 rejected: Features
gpu-2-16                 rejected: Features
gpu-2-17                 rejected: Features
gpu-3-8                  rejected: Features
cpu-6-1                  rejected: Features
cpu-6-2                  rejected: HostList
NOTE:  job req cannot run in partition MSKCC (available procs do not meet requirements : 0 of 1 procs found)
idle procs: 608  feasible procs:   0

Node Rejection Summary: [Features: 21][HostList: 11]

@vipints
Author

vipints commented Apr 22, 2015

Thanks @tatarsky, I saw this message but forgot to include it in my previous comment. I'm not sure why it got rejected, as I am requesting limited resources: 12gb mem and 40hrs cput_time.

@tatarsky
Contributor

This is a little weird, perhaps a syntax error?

Features: cpu-6-2

So it seems to be asking for a feature of a hostname....

@tatarsky
Contributor

It's weird: if you look at the "required hostlist", cpu-6-2 does not appear in it, yet I see you requesting it.

Opsys: ---  Arch: ---  Features: cpu-6-2
Required HostList: [gpu-1-12:1][gpu-1-13:1][gpu-1-16:1][gpu-1-17:1][gpu-1-14:1][gpu-1-15:1]
  [cpu-6-1:1][gpu-3-8:1][gpu-3-9:1][gpu-1-4:1][gpu-1-5:1][gpu-1-6:1]
  [gpu-2-17:1][gpu-2-16:1][gpu-2-15:1][gpu-2-14:1][gpu-2-12:1][gpu-2-11:1]
  [gpu-2-6:1][gpu-2-9:1][gpu-2-8:1]

@tatarsky
Contributor

From the queue file...

```
<submit_args flags="1"> -N pj_41d1c2f4-e8c0-11e4-97d2-5fd54d3e274e -l mem=12gb -l vmem=12gb -l pmem=12gb -l pvmem=12gb
-l nodes=1:ppn=1 -l walltime=40:00:00 -l host=gpu-1-12+gpu-1-13+gpu-1-16+gpu-1-17+gpu-1-14+gpu-1-15+cpu-6-2+cpu-6-1+gpu-3-8
+gpu-3-9+gpu-1-4+gpu-1-5+gpu-1-6+gpu-2-17+gpu-2-16+gpu-2-15+gpu-2-14+gpu-2-12+gpu-2-11+gpu-2-6+gpu-2-9+gpu-2-8</submit_args>
```

@vipints
Author

vipints commented Apr 22, 2015

Yes, correct, I am requesting specific hostnames in my submission arguments. Due to the OOM issue I have blacklisted the following nodes: ['gpu-1-10', 'gpu-1-9', 'gpu-1-8', 'gpu-1-11', 'gpu-1-7', 'gpu-2-5', 'gpu-2-13', 'gpu-2-7', 'gpu-2-4', 'gpu-2-10'].
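The submit_args shown above come from this kind of construction; here is a sketch (the blacklist node names are the ones mentioned in this thread, while the full cluster node list is only illustrative) of turning the blacklist into the -l host=... native specification:

```python
# Sketch: build a pbs-drmaa nativeSpecification that routes around
# blacklisted nodes. The blacklist is from this thread; the complete
# cluster node list below is illustrative, not authoritative.
blacklist = ['gpu-1-10', 'gpu-1-9', 'gpu-1-8', 'gpu-1-11', 'gpu-1-7',
             'gpu-2-5', 'gpu-2-13', 'gpu-2-7', 'gpu-2-4', 'gpu-2-10']
all_nodes = (['gpu-1-%d' % i for i in range(4, 18)] +
             ['gpu-2-%d' % i for i in range(4, 18)] +
             ['gpu-3-8', 'gpu-3-9', 'cpu-6-1', 'cpu-6-2'])
allowed = [n for n in all_nodes if n not in blacklist]

native_spec = ('-l mem=12gb -l nodes=1:ppn=1 -l walltime=40:00:00 '
               '-l host=' + '+'.join(allowed))
# jt.nativeSpecification = native_spec, then s.runJob(jt) as usual.
```

Judging by the checkjob output above, this '+'-joined host request appears to be what Moab turned into the odd Features/HostList rejections, which is why the next suggestion is to try submitting without the blacklist.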

@tatarsky
Contributor

Try the submit without the blacklist. The OOM issue is not node-related. I continue to work on the best solution to it.

@tatarsky
Contributor

That's exciting, because I believe from some threads the limit was 1024. Let's declare victory at 10K ;)

@tatarsky
Contributor

tatarsky commented Nov 9, 2015

So what do you think the count is at?

@vipints
Author

vipints commented Nov 9, 2015

so far I have reached 4287.

@tatarsky
Contributor

tatarsky commented Nov 9, 2015

Very cool. I'll ask again in 14 days, which is my guesstimate for reaching 10K. While it seems likely that was the fix, let's let it ride some more.

@vipints
Author

vipints commented Nov 12, 2015

@tatarsky: as of today I have reached a total of 5976 finished jobs, but now I am triggering the error message max_num_job_reached from drmaa. Seems like it is not happy with the patch...
There is a new version, pbs-drmaa-1.0.19, available; I am just comparing the changes from the previous one we are using.
@cganote: just checking, are your drmaa jobs OK with the patch?
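For reference, on the Python side the condition surfaces as a drmaa exception whose text contains max_num_job_reached. The sketch below (the string match and the helper are assumptions, not part of pbs-drmaa's documented API) just makes the failure explicit, since per this thread the only known remedy was an admin restart of pbs_server:

```python
# Sketch: make the pbs-drmaa "max_num_job_reached" condition explicit.
# Matching on the message text is an assumption about how the backend
# reports the error; adjust to whatever your build actually raises.
from drmaa.errors import DrmaaException

def run_with_limit_check(session, job_template):
    try:
        return session.runJob(job_template)
    except DrmaaException as exc:
        if 'max_num_job_reached' in str(exc):
            # Per this thread, clearing the condition required an admin
            # restart of pbs_server rather than anything client-side.
            raise RuntimeError('pbs-drmaa job limit reached; '
                               'ask the admins to restart pbs_server')
        raise
```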

@cganote

cganote commented Nov 12, 2015

I haven't seen any issues, but maybe I'm not getting enough jobs submitted through drmaa? I certainly haven't had 6000 yet.

-Carrie


@vipints
Author

vipints commented Nov 12, 2015

Thanks @cganote.

@tatarsky
Contributor

That's a rather odd number. I'd like to poke around for a bit before I restart pbs_server, to see if I learn anything new. Are you under time pressure to get more of these in?

@vipints
Author

vipints commented Nov 12, 2015

If you can find some time this evening, I would be happy. Thanks!

@tatarsky
Contributor

There is a similar code 15007 response in the logs claiming "unauthorized request".

@tatarsky
Contributor

No new information gained. Restarted pbs_server.

@vipints
Author

vipints commented Nov 25, 2015

@tatarsky, this time drmaa reached the max_num_jobs limit after just 366 job requests.

@vipints
Author

vipints commented Nov 25, 2015

Seems like odd behavior this time.

@vipints
Author

vipints commented Nov 25, 2015

Whenever you have a small time window, I will need a restart of pbs_server. Thank you.

@tatarsky
Contributor

Restarted. I have this slated for possible test attempts on the new scheduler head I've built. The current issue is that I need some nodes to test that system with, and we're working on a schedule. It seems that the patch does not solve the problem, but it's unclear whether it hurts overall or helps. It seems weird that this one didn't even get to the "normal" 1024 or so.

@vipints
Author

vipints commented Nov 25, 2015

It could be that someone else is also using drmaa to submit jobs to the cluster; the count of 366 jobs is just from my side.

Is there anybody else using the drmaa/Python combination to submit jobs on hal?

@tatarsky
Contributor

Not that I've ever heard of.

@tatarsky
Contributor

tatarsky commented Dec 4, 2015

This world's longest issue may be further attacked via #349. However, it's unclear how it would be attacked at this moment in time.

@raylim

raylim commented Jul 1, 2016

Has there been any progress on this issue? Just encountered it today.

$ ipython
In [1]: import drmaa
In [2]: s = drmaa.Session()
In [3]: s.initialize()
In [4]: jt = s.createJobTemplate()
In [5]: jt.remoteCommand = 'hostname'
In [6]: jobid = s.runJob(jt)
In [7]: retval = s.wait(jobid)
In [8]: retval
Out[8]: JobInfo(jobId=u'7551501.hal-sched1.local', hasExited=False, hasSignal=False, terminatedSignal=u'unknown signal?!', hasCoreDump=False, wasAborted=True, exitStatus=127, resourceUsage={u'mem': u'0', u'start_time': u'1467389618', u'queue': u'batch', u'vmem': u'0', u'hosts': u'gpu-1-4/4', u'end_time': u'1467389619', u'submission_time': u'1467389616', u'cpu': u'0', u'walltime': u'0'})
In [9]: retval.exitStatus
Out[9]: 127
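For anyone comparing their own output against this, a small check built only from the JobInfo fields shown above (the helper name is made up) flags the same aborted, exit-127, zero-walltime signature:

```python
# Hypothetical helper: flag the failure signature shown in the JobInfo above
# (aborted by the DRM, exit status 127, zero walltime).
def looks_like_this_issue(job_info):
    return (job_info.wasAborted
            and job_info.exitStatus == 127
            and job_info.resourceUsage.get(u'walltime') == u'0')

if looks_like_this_issue(retval):
    print('job %s never ran on its node; see issue #246' % retval.jobId)
```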

@tatarsky
Contributor

tatarsky commented Jul 1, 2016

No. No precise solution has ever been found. I will restart pbs_server and you can tell me if it works after that. Then the item will be moved to FogBugz.

@vipints
Author

vipints commented Jul 1, 2016

I am not sure whether we found a way to fix this. If you are getting the error, it means drmaa reached the max_num_of_jobs limit; to fix this you may need a pbs_server restart from the admins to clear the job IDs.

@tatarsky
Contributor

tatarsky commented Jul 1, 2016

Server restarted to confirm your example is a case of this. If so, open a ticket in FogBugz via the email address listed in the /etc/motd on hal. I won't be processing items here further.

@vipints
Author

vipints commented Jul 1, 2016

Sorry I meant to report via email to the cbio-admin group. Thanks @tatarsky!

@tatarsky
Contributor

tatarsky commented Jul 1, 2016

That's fine. This ticket has a long, long, gory history. But all further attempts to figure it out require involvement by the primary support, which as of today is MSKCC staff. I will assist them as needed, but I don't feel this is likely to be trivially fixed. As we both know, DRMAA is quite a hack for Torque.

@tatarsky
Contributor

tatarsky commented Jul 1, 2016

I do notice that since we last battled this there is another release of pbs-drmaa, 1.0.19.

Perhaps by some Friday miracle they are using the Torque 5.0 submit call instead of the crufty 4.0 one that seems to be buggy.

@vipints
Author

vipints commented Jul 1, 2016

Yeah, that is correct; it seems like they have support for v5. Maybe we can try it after the long weekend. I haven't checked the recent release version.

@tatarsky
Contributor

tatarsky commented Jul 1, 2016

I see we actually noticed it when it came out last year. I see nothing overly "Torque 5" in it yet.

I am unlikely to look at this further today. Confirm or deny that your example now works with pbs_server restarted, and open a ticket for some work next week.

@raylim

raylim commented Jul 1, 2016

Yes, Python drmaa job submission works now.

@tatarsky
Contributor

tatarsky commented Jul 1, 2016

Kick an email to the address listed for problem reports in /etc/motd (sorry I'm not placing it again in the public Git) to start tracking it there. We'll reference this Git thread but we no longer process bugs here.

@tatarsky
Contributor

tatarsky commented Jul 1, 2016

Not that this one is likely to be fixable anytime soon. We've tried for many years and DRMAA is basically not well supported by Adaptive.
