
Request for ntCard option to specify directory for temporary files #5

Closed
mahesh-panchal opened this issue Aug 8, 2017 · 5 comments

@mahesh-panchal

Hi,

Could I request an option to write temporary files to a user-specified location, please?
I was just testing ntCard on my data and the job gave no output. The cluster reported that the hard-disk quota had been reached where my input files live, even though the output was supposed to be written to a folder in my home directory. I take this to mean that temporary files were being written to the project folder, which is at its quota limit.
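
For what it's worth, a plain df on the project directory (the path from the script below) is one way to get a first look at how full that filesystem is; a minimal sketch with standard tools, though the cluster's own quota tools may give a more accurate picture:

# Check usage of the filesystem holding the input data
df -h /proj/b2010042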

My script:

#! /bin/bash

#SBATCH -J "ntCard test"
#SBATCH -A b2010042
#SBATCH -t 7-00:00:00
#SBATCH -n 16
#SBATCH -p node
#SBATCH -e log.ntcard_Spruce.2017-08-08_13.00-%j.out
#SBATCH -o log.ntcard_Spruce.2017-08-08_13.00-%j.out

# A tilde is not expanded inside double quotes, so use $HOME here.
export PATH="$HOME/bin/ntCard/bin:$PATH"
FQDIR=/proj/b2010042/nobackup/douglas/fosmid-pool-data/raw-data
time ntcard -t "$SLURM_NPROCS" -p spruce_freq "$FQDIR"/*.fq.gz

Thank you for telling me about ntCard. It was nice to meet you at ISMB.
Regards,
Mahesh.

@mohamadi
Collaborator

mohamadi commented Aug 8, 2017

@mahesh-panchal Hi Mahesh, ntCard does not generate intermediate files. The output histogram files from ntCard are only a few hundred bytes.

From your script I see you're using .gz files as inputs, so the issue may be related to the OS temp space used by the gzip processes. Can you change your TMPDIR to somewhere with enough space, such as /var/tmp?

export TMPDIR=/var/tmp

Another solution could be to reduce the number of processes in your script, i.e. lower $SLURM_NPROCS.
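
For example (a sketch only; -t 4 is just an arbitrary lower thread count, everything else as in your script):

time ntcard -t 4 -p spruce_freq "$FQDIR"/*.fq.gz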

@mohamadi mohamadi self-assigned this Aug 8, 2017
@mahesh-panchal
Author

Hmm, interesting. The problem can't be TMPDIR, since that is set to the node's scratch disk (/scratch/<job_id>), which has quite a bit of space.

What is the reasoning behind reducing the number of cores used?

@mahesh-panchal
Author

mahesh-panchal commented Aug 9, 2017

I've now tried some other things too (including setting TMPDIR), like cd'ing to the output directory and symlinking the input files into it, but I'm still not getting output. The cluster hasn't reported an error this time either; the odd thing now is simply the absence of output. All the input files definitely exist and are not broken symlinks.

Is there supposed to be more written to the screen than this:

Runtime(sec): 4638.3723

real    77m18.667s
user    262m28.935s
sys     14m13.065s


@mohamadi
Collaborator

mohamadi commented Aug 9, 2017

@mahesh-panchal

What is the reasoning behind reducing the number of cores used?

Every thread works on a separate .gz file in parallel. Each thread forks a gzip process to read its fq.gz file, so the higher the number of threads, the more temp space the gzip processes need in total.
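
If you want to see this for yourself while a job is running, something like the following should list one gzip child per active thread (a rough sketch using standard Linux tools; pgrep -n just picks the newest ntcard process, and the process name may differ if your binary is named differently):

# List the gzip processes forked by the running ntcard instance
ps --ppid "$(pgrep -x -n ntcard)" -o pid,cmd | grep gzip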

Is there supposed to be more written to the screen than this:

I just realized you haven't specified the value(s) of k in your script. Please include it with the -k option. For example, for k=64 use:

time ntcard -k 64 -t $SLURM_NPROCS -p spruce_freq $FQDIR/*.fq.gz

By default the output is written to freq_k$k.hist in the current working directory. In your script you have specified spruce_freq as the output prefix, so you should see spruce_freq_k64.hist in the current (or specified) working directory.
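
Once the job finishes you can confirm the histogram is there (assuming the spruce_freq prefix and k=64 from the command above):

# The histogram should be a small text file in the working directory
ls -lh spruce_freq_k64.hist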

@mahesh-panchal
Author

Thank you, Hamid,

After including the -k option, it works and the output is there.

Thanks again for puzzling this through with me.

Regards,
Mahesh.
