Request to copy stageout files failed (redux) #186

Open
rgejman opened this issue Nov 24, 2014 · 24 comments

rgejman commented Nov 24, 2014

Referencing issue #173, which I can't seem to re-open. I hope this is not a bad portent.

PBS Job Id: 2589028.mskcc-fe1.local
Job Name:   STDIN
Exec host:  gpu-1-12/7
An error has occurred processing your job, see below.
request to copy stageout files failed on node 'gpu-1-12' for job 2589028.mskcc-fe1.local

Unable to copy file /var/spool/torque/spool/2589028.mskcc-fe1.local.OU to /cbio/cslab/projects/weinhold/hlapred/ucec/STDIN.o2589028, error 1
*** error from copy
/bin/cp: cannot stat `/var/spool/torque/spool/2589028.mskcc-fe1.local.OU': No such file or directory
*** end error output

Unable to copy file /var/spool/torque/spool/2589028.mskcc-fe1.local.ER to /cbio/cslab/projects/weinhold/hlapred/ucec/STDIN.e2589028, error 1
*** error from copy
/bin/cp: cannot stat `/var/spool/torque/spool/2589028.mskcc-fe1.local.ER': No such file or directory
*** end error output
rgejman changed the title from "PBS -- An error has occurred processing your job, see below." to "Request to copy stageout files failed (redux)" on Nov 24, 2014

rgejman commented Nov 24, 2014

Another one. Could this be related to #182?

PBS Job Id: 2589518.mskcc-fe1.local
Job Name:   STDIN
Exec host:  gpu-1-12/23
An error has occurred processing your job, see below.
request to copy stageout files failed on node 'gpu-1-12' for job 2589518.mskcc-fe1.local

Unable to copy file /var/spool/torque/spool/2589518.mskcc-fe1.local.OU to /cbio/cslab/projects/weinhold/hlapred/lusc/STDIN.o2589518, error 1
*** error from copy
/bin/cp: cannot stat `/var/spool/torque/spool/2589518.mskcc-fe1.local.OU': No such file or directory
*** end error output

Unable to copy file /var/spool/torque/spool/2589518.mskcc-fe1.local.ER to /cbio/cslab/projects/weinhold/hlapred/lusc/STDIN.e2589518, error 1
*** error from copy
/bin/cp: cannot stat `/var/spool/torque/spool/2589518.mskcc-fe1.local.ER': No such file or directory
*** end error output

tatarsky (Contributor) commented:

Not us right now. We are not running migrations.

tatarsky (Contributor) commented:

Looking for other causes as I can.

tatarsky (Contributor) commented:

gpu-1-12 seems extremely loaded down... trying to determine why. It appears in three of the five such messages today; gpu-1-4 is the other node involved.


rgejman commented Nov 25, 2014

3 more in the past 20 minutes, all on gpu-1-12.

tatarsky (Contributor) commented:

You have the highest percentage of these in the logs. Anything interesting about the qsubs that I can report to Adaptive? I am looking at the section of the pbs_mom code that generates this.

I suspect it's migration-load related, judging from the relative times. I've backed the migration off for the morning and will update #182 as the day goes on if that changes.


rgejman commented Nov 25, 2014

Not really. The calls are all of the form:
echo "script.sh args1 args2 args3 file1 > fileout.txt" | qsub -V -d . -l nodes=1:ppn=1,mem=1gb,pmem=1gb,walltime=12:00:00

tatarsky (Contributor) commented:

Draining gpu-1-12 for inspection just in case it's node-related. I noted the recent ones from it.

But that will be a while.


ratsch commented Nov 25, 2014

the option "-d ." might be problematic. Try "-d pwd". Does it make a difference?


akahles commented Nov 25, 2014

I guess you meant -d `pwd` (with backticks for command substitution), but your backticks were swallowed by the Markdown rendering.
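
For illustration, here are the two forms side by side (the working directory and script name are made up for this sketch). The practical difference is that the shell expands `pwd` to an absolute path before qsub ever runs, so the job carries an explicit directory instead of a literal ".":

cd /cbio/cslab/projects/weinhold/hlapred/ucec    # hypothetical working directory
# literal dot: qsub receives "." as its -d argument
echo "script.sh in.txt > out.txt" | qsub -V -d . -l nodes=1:ppn=1,walltime=12:00:00
# backticks: the shell expands `pwd`, so -d gets the absolute path
echo "script.sh in.txt > out.txt" | qsub -V -d `pwd` -l nodes=1:ppn=1,walltime=12:00:00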


rgejman commented Nov 25, 2014

I can try this in a future batch of runs. However, I do not think this is the problem, for a few reasons:

  1. Only a very small percentage of jobs fail with this error (<1%)
  2. The directory is correctly set, as evidenced by the path in the error message.
  3. The error is a file access error when trying to copy "/var/spool/torque/spool/$pid.mskcc-fe1.local.OU" to the current directory
  4. If I interpret @tatarsky correctly, it sounds like some other people are having the same errors. Perhaps they are also using the -d . argument... but probably not.

This argues to me that the problem lies with disk access, not with the parameters to the job.

tatarsky (Contributor) commented:

Actually, if you don't mind, in about 4-5 hours there will be a period where the migration is at a step where there should be no disk I/O of the kind present when we see these. If you could NOT make any changes on your side of the fence during that window, it would be helpful.

I believe (though I haven't had time to dig heavily into the code surrounding this error) it's a form of overload manifestation. But I'm not clear why it would "give up" rather than simply wait. So I will look closer, but I am very involved in the migration right now, as it's a complex step coming up.


rgejman commented Nov 25, 2014

No problem. The jobs are continuously submitted to the server by a master script so as not to overload the scheduler... if that's even still a problem.
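
(For context, a throttled submission loop of the sort described might look roughly like the sketch below. The cap, the job-list file, and the polling interval are assumptions for illustration, not the actual master script.)

MAX_QUEUED=500                                   # assumed cap on queued jobs
while read -r cmd; do
    # wait until the count of this user's jobs drops below the cap
    while [ "$(qstat -u "$USER" 2>/dev/null | grep -c '^[0-9]')" -ge "$MAX_QUEUED" ]; do
        sleep 60
    done
    echo "$cmd" | qsub -V -d . -l nodes=1:ppn=1,mem=1gb,pmem=1gb,walltime=12:00:00
done < joblist.txt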

tatarsky (Contributor) commented:

We can test that another day as well ;)

tatarsky (Contributor) commented:

This is just to document that at 5:00 PM all migration tasks are idle for the moment.
I have some careful steps coming up and am pausing to review those steps and do my validations of them.

During this time, all I/O is "regular I/O". I expect to return to migration tasks later this evening.

tatarsky (Contributor) commented:

Please note I will re-start looking at this shortly, now that at least one very large variable is gone. Not today, but it's back on my run queue.

tatarsky self-assigned this on Dec 12, 2014
tatarsky (Contributor) commented:

So the primary hits I see on this error now, post-migration, come from the jobs of @nrosed, I believe. It would be interesting to compare notes on those submit scripts. Filesystem I/O appears quite light.

tatarsky (Contributor) commented:

The incidence of these is being looked at again, as I see no compelling reason they are filesystem-related. There are periodic bursts, however. I am looking more closely at the code that surrounds this error message.


tatarsky commented Mar 7, 2015

We are looking at this in light of the recent E_NOMEM token incident, to see if it is related.


akahles commented Mar 12, 2015

Since you are currently looking at this again: I still see it occurring in my logs once in a while:

PBS Job Id: 2910950.mskcc-fe1.local
Job Name:   validate_fastq
Exec host:  gpu-2-6/1
An error has occurred processing your job, see below.
request to copy stageout files failed on node 'gpu-2-6' for job 2910950.mskcc-fe1.local

Unable to copy file /var/spool/torque/spool/2910950.mskcc-fe1.local.OU to /cbio/grlab/projects/ICGC/orig_data/TCGA/validate_fastq/9291fb76-e606-4750-b6da-300108cbacc7/validation.log, error 1
*** error from copy
/bin/cp: cannot stat `/var/spool/torque/spool/2910950.mskcc-fe1.local.OU': No such file or directory
*** end error output


tatarsky commented Dec 4, 2015

And while I've never figured this out, I may attempt to determine whether there is a setting I can alter for this in #349.


tatarsky commented Dec 9, 2015

So, also in the ancient bug bin, I've been working on trying to figure this one out.

I've traced a single occurrence of this problem on the new Torque system to a Torque function called sys_copy, and what I suspect is that an unexpected exit code from /bin/cp is being returned in some edge case. Yes, it's an actual /bin/cp at the end, because of the usecp directive we have always had in place to move spool files from the local node spool to the location specified in the job script.
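
(For readers unfamiliar with usecp, here is a rough illustration; the mapping shown is hypothetical, not necessarily this cluster's actual mom_priv/config entry. A $usecp line tells pbs_mom that the destination path is mounted locally, so stageout is performed with a plain local copy rather than a remote copy:)

# mom_priv/config -- hypothetical usecp mapping:
#   $usecp *:/cbio /cbio
# With that in place, stageout of the job's spool files amounts to a local copy,
# essentially what the error messages above show failing:
jobid=2589028.mskcc-fe1.local
/bin/cp /var/spool/torque/spool/${jobid}.OU \
        /cbio/cslab/projects/weinhold/hlapred/ucec/STDIN.o2589028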

I've submitted > 100000 jobs to calculate Pi to 10000 places and got one of these failures today.

The code involved appears to have a hardwired retry count, and a rather silly sleep line to wait before trying again.

I should note that my jobs and their output areas are NOT on GPFS homedir locations, but on a separate filesystem served via NFS that is a convention of the ROCKS environment. So it's a good control case.

So I at least have a very specific location in the code to raise with Adaptive again, and I believe I can turn up a debug knob on the new test system (it would be too impacting on the real one) and perhaps get the precise exit status code if I issue another 100000 jobs ;)

Windmills. Tilting. Me.


tatarsky commented Dec 9, 2015

Oh, and in case I lose track of where I was: the items in question are in the pbs_mom code at:

src/resmom/requests.c, line 3088


tatarsky commented Dec 9, 2015

Here BTW is the loop around the attempts to copy:

for (loop = 1; loop < 4; ++loop)
  {
  /* ... do the cp stuff ... */

  if ((loop % 2) == 0)
    sleep(loop / 2 * 3 + 1);
  }  /* END for (loop) */

With loop running from 1 to 3 as quoted, I believe that results in sleeps of 0, 4, and 0 seconds around the attempts (only the even iteration waits). I wish that were explained; perhaps sleep(10 * loop); is really the cure ;)

I have no idea why you would use the mod to skip the sleep on the odd iterations of the loop.
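
(For what it's worth, a plain linear backoff of the kind hinted at above would look roughly like the shell sketch below. The paths, the retry count, and the delays are assumptions for illustration, not an actual Torque patch.)

src=/var/spool/torque/spool/2589028.mskcc-fe1.local.OU
dst=/cbio/cslab/projects/weinhold/hlapred/ucec/STDIN.o2589028
copied=no
for attempt in 1 2 3 4; do
    if /bin/cp "$src" "$dst"; then
        copied=yes
        break                                         # stop on the first successful copy
    fi
    [ "$attempt" -lt 4 ] && sleep $((10 * attempt))   # back off 10, 20, 30 seconds between tries
done
[ "$copied" = yes ] || echo "stageout copy still failing after 4 attempts" >&2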
