Request to copy stageout files failed (redux) #186

Open
rgejman opened this issue Nov 24, 2014 · 24 comments

rgejman commented Nov 24, 2014

Referencing issue #173, which I can't seem to re-open. I hope this is not a bad portent.

PBS Job Id: 2589028.mskcc-fe1.local
Job Name:   STDIN
Exec host:  gpu-1-12/7
An error has occurred processing your job, see below.
request to copy stageout files failed on node 'gpu-1-12' for job 2589028.mskcc-fe1.local

Unable to copy file /var/spool/torque/spool/2589028.mskcc-fe1.local.OU to /cbio/cslab/projects/weinhold/hlapred/ucec/STDIN.o2589028, error 1
*** error from copy
/bin/cp: cannot stat `/var/spool/torque/spool/2589028.mskcc-fe1.local.OU': No such file or directory
*** end error output

Unable to copy file /var/spool/torque/spool/2589028.mskcc-fe1.local.ER to /cbio/cslab/projects/weinhold/hlapred/ucec/STDIN.e2589028, error 1
*** error from copy
/bin/cp: cannot stat `/var/spool/torque/spool/2589028.mskcc-fe1.local.ER': No such file or directory
*** end error output
rgejman changed the title from "PBS -- An error has occurred processing your job, see below." to "Request to copy stageout files failed (redux)" on Nov 24, 2014

rgejman commented Nov 24, 2014

Another one. Could this be related to #182?

PBS Job Id: 2589518.mskcc-fe1.local
Job Name:   STDIN
Exec host:  gpu-1-12/23
An error has occurred processing your job, see below.
request to copy stageout files failed on node 'gpu-1-12' for job 2589518.mskcc-fe1.local

Unable to copy file /var/spool/torque/spool/2589518.mskcc-fe1.local.OU to /cbio/cslab/projects/weinhold/hlapred/lusc/STDIN.o2589518, error 1
*** error from copy
/bin/cp: cannot stat `/var/spool/torque/spool/2589518.mskcc-fe1.local.OU': No such file or directory
*** end error output

Unable to copy file /var/spool/torque/spool/2589518.mskcc-fe1.local.ER to /cbio/cslab/projects/weinhold/hlapred/lusc/STDIN.e2589518, error 1
*** error from copy
/bin/cp: cannot stat `/var/spool/torque/spool/2589518.mskcc-fe1.local.ER': No such file or directory
*** end error output

tatarsky (Contributor) commented:

Not us right now. We are not running migrations.

tatarsky (Contributor) commented:

Looking for other causes as I can.

tatarsky (Contributor) commented:

gpu-1-12 seems extremely loaded down... trying to determine why. It appears in three of the five such messages today; gpu-1-4 is the other node involved.


rgejman commented Nov 25, 2014

3 more in the past 20 minutes, all on gpu-1-12.

tatarsky (Contributor) commented:

You have the highest percentage of these in the logs. Anything interesting about the qsubs that I can report to Adaptive? I am looking at the section of the pbs_mom code that generates this.

I suspect it's migration-load related, judging from the relative times. I've backed the migration off for the morning and will update #182 as the day goes on if that changes.


rgejman commented Nov 25, 2014

Not really. The calls are all of the form:
echo "script.sh args1 args2 args3 file1 > fileout.txt" | qsub -V -d . -l nodes=1:ppn=1,mem=1gb,pmem=1gb,walltime=12:00:00

tatarsky (Contributor) commented:

Draining gpu-1-12 for inspection just in case it's node-related. I noted the recent ones from it.

But that will be a while.


ratsch commented Nov 25, 2014

the option "-d ." might be problematic. Try "-d pwd". Does it make a difference?


akahles commented Nov 25, 2014

I guess you meant -d `pwd` (with backticks for command substitution), but your backticks were swallowed by the Markdown rendering.
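
For illustration, here are the two forms side by side (the working directory and script name are made up for this sketch). The practical difference is that the shell expands `pwd` to an absolute path before qsub ever runs, so the job carries an explicit directory instead of a literal ".":

cd /cbio/cslab/projects/weinhold/hlapred/ucec    # hypothetical working directory
# literal dot: qsub receives "." as its -d argument
echo "script.sh in.txt > out.txt" | qsub -V -d . -l nodes=1:ppn=1,walltime=12:00:00
# backticks: the shell expands `pwd`, so -d gets the absolute path
echo "script.sh in.txt > out.txt" | qsub -V -d `pwd` -l nodes=1:ppn=1,walltime=12:00:00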


rgejman commented Nov 25, 2014

I can try this in a future batch of runs. However, I do not think this is the problem, for a few reasons:

  1. Only a very small percentage of jobs fail with this error (<1%)
  2. The directory is correctly set, as evidenced by the path in the error message.
  3. The error is a file access error when trying to copy "/var/spool/torque/spool/$pid.mskcc-fe1.local.OU" to the current directory
  4. If I interpret @tatarsky correctly, it sounds like some other people are having the same errors. Perhaps they are also using the -d . argument... but probably not.

This argues to me that the problem lies with disk access, not with the parameters to the job.

tatarsky (Contributor) commented:

Actually, if you don't mind, in about 4-5 hours there will be a period where the migration is at a step where there should be no disk I/O of the kind present when we see these. If you could NOT make any changes on your side of the fence during that window, it would be helpful.

I believe (though I haven't had time to dig heavily into the code surrounding this error) it's a form of overload manifestation. But I'm not clear why it would "give up" rather than simply wait. So I will look closer, but I am very involved in the migration right now, as it's a complex step coming up.


rgejman commented Nov 25, 2014

No problem. The jobs are continuously submitted to the server by a master script so as not to overload the scheduler... if that's even still a problem.
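
(For context, a throttled submission loop of the sort described might look roughly like the sketch below. The cap, the job-list file, and the polling interval are assumptions for illustration, not the actual master script.)

MAX_QUEUED=500                                   # assumed cap on queued jobs
while read -r cmd; do
    # wait until the count of this user's jobs drops below the cap
    while [ "$(qstat -u "$USER" 2>/dev/null | grep -c '^[0-9]')" -ge "$MAX_QUEUED" ]; do
        sleep 60
    done
    echo "$cmd" | qsub -V -d . -l nodes=1:ppn=1,mem=1gb,pmem=1gb,walltime=12:00:00
done < joblist.txt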

tatarsky (Contributor) commented:

We can test that another day as well ;)

tatarsky (Contributor) commented:

This is just to document that at 5:00 PM all migration tasks are idle for the moment.
I have some careful steps coming up and am pausing to review those steps and do my validations of them.

During this time, all I/O is "regular I/O". I expect to return to migration tasks later this evening.

tatarsky (Contributor) commented:

Please note I will re-start looking at this shortly, now that at least one very large variable is gone. Not today, but it's back on my run queue.

tatarsky self-assigned this on Dec 12, 2014
tatarsky (Contributor) commented:

So the primary hits I see on this error now, post-migration, come from the jobs of @nrosed, I believe. It would be interesting to compare notes on those submit scripts. Filesystem I/O appears quite light.

tatarsky (Contributor) commented:

The incidence of these is being looked at again, as I see no compelling reason they are filesystem-related. There are periodic bursts, however. I am looking more closely at the code that surrounds this error message.


tatarsky commented Mar 7, 2015

We are looking at this in light of the recent E_NOMEM token incident, to see if it is related.


akahles commented Mar 12, 2015

Since you are currently looking at this again: I still see it occurring in my logs once in a while:

PBS Job Id: 2910950.mskcc-fe1.local
Job Name:   validate_fastq
Exec host:  gpu-2-6/1
An error has occurred processing your job, see below.
request to copy stageout files failed on node 'gpu-2-6' for job 2910950.mskcc-fe1.local

Unable to copy file /var/spool/torque/spool/2910950.mskcc-fe1.local.OU to /cbio/grlab/projects/ICGC/orig_data/TCGA/validate_fastq/9291fb76-e606-4750-b6da-300108cbacc7/validation.log, error 1
*** error from copy
/bin/cp: cannot stat `/var/spool/torque/spool/2910950.mskcc-fe1.local.OU': No such file or directory
*** end error output


tatarsky commented Dec 4, 2015

And while I've never figured this out, I may attempt to determine whether there is a setting I can alter for this in #349.


tatarsky commented Dec 9, 2015

So, also in the ancient bug bin, I've been working on trying to figure this one out.

I've traced a single occurrence of this problem on the new Torque system to a Torque function called sys_copy, and what I suspect is that an unexpected exit code from /bin/cp is being returned in some edge case. Yes, it's an actual /bin/cp at the end, because of the usecp directive we have always had in place to move spool files from the local node spool to the location specified in the job script.
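
(For readers unfamiliar with usecp, here is a rough illustration; the mapping shown is hypothetical, not necessarily this cluster's actual mom_priv/config entry. A $usecp line tells pbs_mom that the destination path is mounted locally, so stageout is performed with a plain local copy rather than a remote copy:)

# mom_priv/config -- hypothetical usecp mapping:
#   $usecp *:/cbio /cbio
# With that in place, stageout of the job's spool files amounts to a local copy,
# essentially what the error messages above show failing:
jobid=2589028.mskcc-fe1.local
/bin/cp /var/spool/torque/spool/${jobid}.OU \
        /cbio/cslab/projects/weinhold/hlapred/ucec/STDIN.o2589028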

I've submitted > 100000 jobs to calculate Pi to 10000 places and got one of these failures today.

The code involved appears to have a hardwired retry count, and a rather silly sleep line to wait before trying again.

I should note that my jobs and their output areas are NOT on GPFS homedir locations, but on a separate filesystem served via NFS that is a convention of the ROCKS environment. So it's a good control case.

So I at least have a very specific location in the code to raise with Adaptive again, and I believe I can turn up a debug knob on the new test system (it would be too impacting on the real one) and perhaps get the precise exit status code if I issue another 100000 jobs ;)

Windmills. Tilting. Me.


tatarsky commented Dec 9, 2015

Oh, and in case I lose track of where I was: the items in question are in the pbs_mom code at:

src/resmom/requests.c, line 3088


tatarsky commented Dec 9, 2015

Here BTW is the loop around the attempts to copy:

for (loop = 1; loop < 4; ++loop)
  {
  /* ... do the cp stuff ... */

  if ((loop % 2) == 0)
    sleep(loop / 2 * 3 + 1);
  }  /* END for (loop) */

With loop running from 1 to 3 as quoted, I believe that results in sleeps of 0, 4, and 0 seconds around the attempts (only the even iteration waits). I wish that were explained; perhaps sleep(10 * loop); is really the cure ;)

I have no idea why you would use the mod to skip the sleep on the odd iterations of the loop.
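
(For what it's worth, a plain linear backoff of the kind hinted at above would look roughly like the shell sketch below. The paths, the retry count, and the delays are assumptions for illustration, not an actual Torque patch.)

src=/var/spool/torque/spool/2589028.mskcc-fe1.local.OU
dst=/cbio/cslab/projects/weinhold/hlapred/ucec/STDIN.o2589028
copied=no
for attempt in 1 2 3 4; do
    if /bin/cp "$src" "$dst"; then
        copied=yes
        break                                         # stop on the first successful copy
    fi
    [ "$attempt" -lt 4 ] && sleep $((10 * attempt))   # back off 10, 20, 30 seconds between tries
done
[ "$copied" = yes ] || echo "stageout copy still failing after 4 attempts" >&2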
