Cleaning /scratch of old files #249

Open
tatarsky opened this issue Apr 20, 2015 · 19 comments

@tatarsky
Contributor

There appears to be no removal of old files in /scratch on the nodes, and those directories are frankly littered with such files. It's not a space issue, but the large number of top-level files makes it a bit pokey to stat/ls.

But I don't know the policy that was adopted for /scratch on the nodes, so before I add a cron job to, say, remove any file more than six months old, please comment.

Alternately I can "not care" and let folks clean up themselves but the fact there are files as far back as 2013 suggests that isn't being done ;)

@jchodera
Member

Thanks for noticing this! We should definitely have a cleaning policy and announce it.

Deleting files older than six months with a cron job sounds great for now. We can then figure out if there is a better practical policy for the future.

@tatarsky
Contributor Author

Well, I was looking at something else, but I'll cron something up or extend tmpwatch (the one that handles /tmp and /var/tmp).

@akahles

akahles commented Apr 20, 2015

I would rather treat this generally as temporary space. In general, if somebody wants to write tmp data during a job, each job creates a job-specific temporary directory that is present in the TMPDIR environment variable:

qlogin 
qsub: waiting for job 3077568.mskcc-fe1.local to start
qsub: job 3077568.mskcc-fe1.local ready
...
echo $TMPDIR
/scratch/3077568.mskcc-fe1.local

This directory will be removed upon completion of the job. So the only use case I see for the remainder of /scratch is that data should persist locally on the node after job completion (e.g., big data for heavy local I/O of many successive jobs).
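
As a rough illustration of that pattern, here is a minimal sketch of a job script that keeps its intermediate files under $TMPDIR. The fallback to /tmp, the `work.XXXXXX` directory name, and the `step1.out` file are assumptions made up for this example, not anything the cluster defines.

```shell
#!/bin/bash
# Sketch: keep per-job temporary files under $TMPDIR so they go away with the job.
# Falls back to /tmp when run outside the batch system (assumption for illustration).
job_tmp="${TMPDIR:-/tmp}"

# Create a private working directory inside the job-specific area.
workdir="$(mktemp -d "${job_tmp%/}/work.XXXXXX")"

# Remove our working directory on exit as well, even though the batch system
# is expected to remove the job's $TMPDIR when the job completes.
trap 'rm -rf "$workdir"' EXIT

# Hypothetical intermediate file written by the job.
echo "intermediate data" > "$workdir/step1.out"
echo "wrote intermediate output under $workdir"
```

Anything that must survive the job would still need to be copied out before the script ends.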

Could we just have a designated directory for this (e.g., /scratch/share ...) that is cleaned up less frequently, and clean up the rest very frequently (e.g., daily) to prevent people from accidentally filling up /scratch?

@jchodera
Member

+1 on @akahles's suggestion for policy.

I had totally forgotten about the automatic setting of $TMPDIR. It may be worth an email announcement that people should use $TMPDIR (with a pointer to the wiki docs) or else the safety of their files cannot be guaranteed.

The addition of /scratch/shared that is cleaned of old files periodically is also a great idea.

@ratsch

ratsch commented Apr 21, 2015

+1

@tatarsky
Contributor Author

Noted the above. I will draft up some items here and prep the defined cleaning items. Nothing will happen for a while, so others have time to take note.

@tatarsky tatarsky self-assigned this Apr 21, 2015
@tatarsky
Contributor Author

/scratch/shared was created. Working on the logic for cleaning it and the dirs above it, and on the announcement. No cleaning will take place for a while. This is a back burner item but I'm moving it along.

@rgejman
Copy link

rgejman commented Aug 25, 2015

Is $TMPDIR supposed to refer to the /scratch/$jobid folder as specified in the wiki? Currently it seems to just point to /scratch/ despite a job-specific folder being created. e.g.:

$ qlogin
qsub: waiting for job 5301084.mskcc-fe1.local to start
qsub: job 5301084.mskcc-fe1.local ready
$ echo $TMPDIR
/scratch/
$ echo $PBS_JOBID
5301084.mskcc-fe1.local
$ cd /scratch/5301084.mskcc-fe1.local

@tatarsky
Contributor Author

I don't see the above here, so I would check your wrapper script in case it is resetting or clearing the environment.

$ qsub -I -q active
gpu-2-15$ echo $TMPDIR
/scratch/5301085.mskcc-fe1.local
gpu-2-15$ echo $PBS_JOBID
5301085.mskcc-fe1.local
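
For anyone who wants to check this from inside their own wrapper, a minimal sketch follows. `check_tmpdir` is a hypothetical helper (not part of the batch system), and it assumes job directories live at /scratch/$PBS_JOBID as in the transcripts above.

```shell
# Sketch: report whether $TMPDIR points at the job-specific scratch directory.
# check_tmpdir is a hypothetical helper name, not a Torque/PBS command.
check_tmpdir() {
    if [ -z "${PBS_JOBID:-}" ]; then
        echo "not inside a batch job (PBS_JOBID is unset)"
        return 2
    fi
    expected="/scratch/$PBS_JOBID"
    # Strip a trailing slash before comparing.
    actual="${TMPDIR:-}"
    actual="${actual%/}"
    if [ "$actual" = "$expected" ]; then
        echo "ok: TMPDIR=$expected"
        return 0
    fi
    echo "mismatch: TMPDIR=${TMPDIR:-unset}, expected $expected"
    return 1
}
```

Calling `check_tmpdir` at the top and bottom of a wrapper script should show whether something in between is resetting the variable.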

@rgejman

rgejman commented Aug 25, 2015

Right as usual, @tatarsky. Thanks.

@tatarsky
Contributor Author

Well, I don't know about that ;)

BTW...I've never gotten back to the original goal of this Git entry. I'll re-engage it shortly.

@tatarsky
Contributor Author

Please note that shortly (and I can announce this more widely) any items in the top level of /scratch will be deleted by an age-based cron job. My current proposal is to delete items whose mtime is more than 60 days old. I am in no hurry to implement this but figured I'd move it along a notch.

Items in /scratch/shared will not be subject to this cron job. So anything that needs to persist longer on /scratch can be placed there, hopefully under a logically named subdir (such as your username) to prevent conflicts ;)

Items correctly using PBS $TMPDIR will be deleted based on age if left after a job as by default those are rooted in /scratch itself.
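
To make the proposed rule concrete, here is a minimal sketch of what such an age-based pass could look like. This is not the actual script: `clean_scratch`, its arguments, and the `DELETE` switch are hypothetical names, and it defaults to a dry run that only prints candidates.

```shell
# Sketch of the proposed policy: act on top-level entries of a scratch directory
# whose mtime is older than a cutoff, leaving the "shared" subdirectory alone.
# clean_scratch, its arguments, and the DELETE switch are hypothetical.
clean_scratch() {
    root="${1:-/scratch}"
    days="${2:-60}"
    # -mindepth 1 -maxdepth 1 restricts matching to top-level entries only.
    # -mtime +N matches entries last modified more than N days ago
    # (find counts whole 24-hour periods).
    if [ "${DELETE:-0}" = "1" ]; then
        find "$root" -mindepth 1 -maxdepth 1 ! -name shared -mtime "+$days" \
            -exec rm -rf -- {} +
    else
        # Dry run (default): only print what would be removed.
        find "$root" -mindepth 1 -maxdepth 1 ! -name shared -mtime "+$days" -print
    fi
}
```

One caveat with a top-level mtime test: a directory's mtime only changes when entries directly inside it are created, removed, or renamed, so a directory whose nested files are still being written can look old.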

Items in Docker image areas will be handled by a method that is still being discussed in #288.

Questions/concerns can continue to go here. I will make a louder statement of implementation when necessary.

@akahles

akahles commented Oct 14, 2015

> Items correctly using PBS $TMPDIR will be deleted based on age if left after a job as by default those are rooted in /scratch itself.

Just for clarification. $TMPDIR defaults to /scratch/JOBID. This directory is normally removed when the job finishes. Are there exceptions to this rule?

@tatarsky
Contributor Author

Sorry, that was unclear in my comment above. I believe I've seen a few instances, perhaps due to some of our past "lost jobs", of dirs in /scratch that looked like orphaned TMPDIR items. I wanted to make sure people understood I wasn't going to delete their TMPDIR areas unless they were really, really old and obviously not in use.

May have been another cluster, but basically active TMPDIR areas won't match the cron mtime rule in this plan.

@akahles

akahles commented Oct 14, 2015

Thanks for clarifying.

@tatarsky
Contributor Author

Some examples on cpu-6-1 which I suspect were orphaned during some of the issues that machine had:

drwxr-xr-x 2 (removed)  (removed)  4096 Mar  6  2015 2900274[750].mskcc-fe1.local

Such obvious orphans would match the cron job and be deleted. If you feel such a dir is NOT an orphan, now would be the time to see what's out there.

For example, on the same system I can clearly see your TMPDIR areas for your running jobs, all with Oct 14 timestamps.

@tatarsky
Contributor Author

Helpful pattern to observe such animals on a node:

ls -ald /scratch/*mskcc-fe1.local
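
If it helps when reviewing a node, a variant of the pattern above can print each job directory's modification date and sort oldest first. This is only a sketch: `list_jobdirs_by_age` is a hypothetical helper, and the `-printf` option assumes GNU find.

```shell
# Sketch: list job-style directories under a scratch root with their
# modification date, oldest first, so likely orphans show up at the top.
# list_jobdirs_by_age is a hypothetical name; -printf requires GNU find.
list_jobdirs_by_age() {
    root="${1:-/scratch}"
    find "$root" -mindepth 1 -maxdepth 1 -type d -name '*mskcc-fe1.local' \
        -printf '%TY-%Tm-%Td %p\n' | sort
}
```

Run with no arguments it looks at /scratch; pass another path to point it elsewhere.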

@tatarsky
Contributor Author

I am going to document the concept of this in the Wiki and then announce a trial run. I've left this idle for too long, but the space out there isn't heavily used so I've ignored it. Still, we should have a policy and a cleaning script in case that changes.

@tatarsky
Contributor Author

I have attempted to define the above and will likely re-issue a Git request when the actual script is ready for running. This is not viewed as urgent, but it's probably something we will need someday.

https://github.com/cBio/cbio-cluster/wiki/MSKCC-cBio-Cluster-User-Guide#scratch-disk-space
