
Parallelize single-threaded read_converter step and move *.seq files out of --tmp-dir after every input library is converted #67

Open
mmokrejs opened this issue Jan 21, 2018 · 9 comments


@mmokrejs

Hi,
although I provided 19 input files, the code ran in a single thread. To scale further, could it also do the conversion of each file in multiple chunks?

spades.py \
--only-assembler \
--pe1-1 /scratch/mygenome/paired_end_497bp_201709/HT5V3BCXY.2.tt_16D1C3L12.trimmomatic.paired.prinseq.minlen20.3091.3091.pairs_1.fastq \
--pe1-2 /scratch/mygenome/paired_end_497bp_201709/HT5V3BCXY.2.tt_16D1C3L12.trimmomatic.paired.prinseq.minlen20.3091.3091.pairs_2.fastq \
--pe2-1 /scratch/mygenome/paired_end_619bp/HKMHTBCXX.1.tt_16D1C3L12.trimmomatic.paired.prinseq.minlen20.19552.19552.pairs_1.fastq \
--pe2-2 /scratch/mygenome/paired_end_619bp/HKMHTBCXX.1.tt_16D1C3L12.trimmomatic.paired.prinseq.minlen20.19552.19552.pairs_2.fastq \
--pe3-1 /scratch/mygenome/mate_pairs_201709/HWFNLBCXY.2.tt_16D1C3L12.trimmomatic.bbduk.splitnextera.fragments_1.fastq \
--pe3-2 /scratch/mygenome/mate_pairs_201709/HWFNLBCXY.2.tt_16D1C3L12.trimmomatic.bbduk.splitnextera.fragments_2.fastq \
--pe3-s /scratch/mygenome/mate_pairs_201709/HWFNLBCXY.2.tt_16D1C3L12.trimmomatic.bbduk.splitnextera.singletons.fq \
--pe4-1 /scratch/mygenome/mate_pairs_201609/HFYJ5AFXX.5kb.trimmomatic.bbduk.splitnextera.fragments_1.fastq \
--pe4-2 /scratch/mygenome/mate_pairs_201609/HFYJ5AFXX.5kb.trimmomatic.bbduk.splitnextera.fragments_2.fastq \
--pe4-s /scratch/mygenome/mate_pairs_201609/HFYJ5AFXX.5kb.trimmomatic.bbduk.splitnextera.singletons.fq \
--pe5-1 /scratch/mygenome/mate_pairs_201609/HFYJ5AFXX.8kb.trimmomatic.bbduk.splitnextera.fragments_1.fastq \
--pe5-2 /scratch/mygenome/mate_pairs_201609/HFYJ5AFXX.8kb.trimmomatic.bbduk.splitnextera.fragments_2.fastq \
--pe5-s /scratch/mygenome/mate_pairs_201609/HFYJ5AFXX.8kb.trimmomatic.bbduk.splitnextera.singletons.fq \
--mp1-1 /scratch/mygenome/mate_pairs_201709/HWFNLBCXY.2.tt_16D1C3L12.trimmomatic.bbduk.splitnextera.lmp_1.fastq \
--mp1-2 /scratch/mygenome/mate_pairs_201709/HWFNLBCXY.2.tt_16D1C3L12.trimmomatic.bbduk.splitnextera.lmp_2.fastq \
--mp2-1 /scratch/mygenome/mate_pairs_201609/HFYJ5AFXX.5kb.trimmomatic.bbduk.splitnextera.lmp_1.fastq \
--mp2-2 /scratch/mygenome/mate_pairs_201609/HFYJ5AFXX.5kb.trimmomatic.bbduk.splitnextera.lmp_2.fastq \
--mp3-1 /scratch/mygenome/mate_pairs_201609/HFYJ5AFXX.8kb.trimmomatic.bbduk.splitnextera.lmp_1.fastq \
--mp3-2 /scratch/mygenome/mate_pairs_201609/HFYJ5AFXX.8kb.trimmomatic.bbduk.splitnextera.lmp_2.fastq \
--trusted-contigs /scratch/work/project/bio/open-9-41/assemblies/tadpole_k165/tt_16D1C3L12.tadpole.contigs.k165.fa \
-t 104 --nanopore /scratch/mygenome/OxfordNanopore/tt_16D1C3L12.OxNano.fastq -m 3000 -k 55,77,99,127 -o tt_16D1C3L12__SPAdes3.11.1_noecc

This probably won't happen soon, but let me open a feature request for it anyway. The current version is SPAdes 3.11.1. Thank you.
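For illustration, the requested parallelization could look roughly like the sketch below, with one worker per input library. `convert_library()` and the toy data are hypothetical stand-ins, not SPAdes code; since the step is I/O bound, any real win depends on the filesystem serving several streams at once.

```python
# A minimal sketch (NOT SPAdes code) of converting several libraries concurrently.
# convert_library() is a hypothetical stand-in for the FASTQ -> .seq conversion.
from concurrent.futures import ThreadPoolExecutor

def convert_library(lib):
    """Pretend to convert one library; returns (name, converted reads)."""
    name, reads = lib
    return name, [r.upper() for r in reads]  # trivial stand-in transform

libraries = [("pe1", ["acgt", "ttga"]), ("pe2", ["ggcc"]), ("mp1", ["atat"])]

# One worker per library; threads suffice here because the real work
# would be I/O bound, not CPU bound.
with ThreadPoolExecutor(max_workers=len(libraries)) as pool:
    converted = dict(pool.map(convert_library, libraries))
```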

@asl
Member

asl commented Jan 21, 2018

This part is normally I/O bound, so multiple threads would make the situation even worse.

@mmokrejs
Author

mmokrejs commented Jan 21, 2018

We have a parallel filesystem (LustreFS) served by, I think, 54 working slave machines, with InfiniBand in between. How the data are laid out over the many hosts and drives is user-configurable per directory or even per file. The stripe size is currently 1 MB, I think.

And if I could be sure the data fit into memory, I would use a ramdisk for the actual processing and then move the resulting files onto the storage filesystem. Oh yes, they do fit:

$ du -sh mygenome__SPAdes3.11.1_noecc/.bin_reads/
56G	mygenome__SPAdes3.11.1_noecc/.bin_reads/
$

The uncompressed input FASTQ files occupied 435.86 GB.

@mmokrejs
Author

mmokrejs commented Jan 21, 2018

Here you can see the "disc" traffic is 102 MB/s on average, with more reading than writing.

112 x86_64 Intel(R) Xeon(R) E5-4627 v2 @ 3.30GHz cores are available, with 3.2 TB of physical, local RAM.

[Figures: memory usage, CPU load, and filesystem usage during binary conversion]

@asl
Member

asl commented Jan 21, 2018

Here you can see the "disc" traffic is 102 MB/s on average, with more reading than writing.

This is how it should be: we read FASTQ (a text format) and convert it to the internal binary format. The read:write ratio of ~9:1 is very close to the size ratio between text FASTQ and the SPAdes binary format.
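For intuition on why the binary format is so much smaller: a FASTQ record spends roughly two bytes per base (sequence plus quality, not counting headers), while a 2-bit packed encoding needs only a quarter byte per base. A minimal sketch of such packing (illustrative only, NOT the actual SPAdes `.seq` layout):

```python
# Pack a DNA string into 2 bits per base (A=0, C=1, G=2, T=3).
# This mirrors the idea of a compact binary read format; it is NOT
# the real SPAdes on-disk format.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack(seq):
    out = bytearray()
    acc, nbits = 0, 0
    for base in seq:
        acc = (acc << 2) | CODE[base]  # append 2 bits for this base
        nbits += 2
        if nbits == 8:                 # a full byte accumulated
            out.append(acc)
            acc, nbits = 0, 0
    if nbits:
        out.append(acc << (8 - nbits))  # left-align the last partial byte
    return bytes(out)

packed = pack("ACGTACGT")  # 8 bases fit in 2 bytes
```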

@mmokrejs
Author

mmokrejs commented Jan 21, 2018

Here is what the filesystem can handle when applications are properly written to read/write in large chunks; a very efficient alternative. bamsort comes from https://github.com/gt1/biobambam2

# samtools sort of a 149GB BAM file takes 1.2TB RAM and uses only a single thread despite '-@ 15' argument
# samtools sort -@ $xthreads -m "$gb_mem_per_thread"G -O bam -T "$1" -o "$2".sorted.bam "$2".bam || exit 255
# 
# bamsort comes from https://github.com/gt1/biobambam2
LIBMAUS2_POSIXFDINPUT_BLOCKSIZE_OVERRIDE=1m
export LIBMAUS2_POSIXFDINPUT_BLOCKSIZE_OVERRIDE
bamsort SO=coordinate blockmb="$take_memory" inputthreads="$input_threads" outputthreads="$output_threads" level=9 index=1 I="$2".bam O="$2".sorted.bam

[Figure: bamsort LustreFS usage (stripe 54, 8 CPUs, no hugepage defrag, job 897845, isrv5)]

The currently running SPAdes process, executing read_converter.hpp/binary_converter.hpp, supposedly overloaded the LustreFS metadata servers, and after 40 minutes of attempts to flush buffers the kernel gave up (see the high system CPU load in red in the figures below). I see similar issues when applications append many too-small chunks to existing files. Running truss, strace, or a similar profiling tool should reveal the actual write sizes of the SPAdes binaries.

[Figures: CPU load and CPU usage while the SPAdes binary read conversion supposedly overloaded the LustreFS metadata servers]
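The small-append problem described above can often be mitigated in application code by coalescing writes into large blocks before they hit the filesystem. A generic sketch (the 1 MiB block size is an assumption here, unrelated to SPAdes internals):

```python
import os
import tempfile

BLOCK = 1 << 20  # 1 MiB: issue a few large write() syscalls instead of many tiny ones

def write_records(path, records, buffer_size=BLOCK):
    # BufferedWriter coalesces the small record-sized writes into
    # buffer_size-sized chunks before handing them to the OS.
    with open(path, "wb", buffering=buffer_size) as fh:
        for rec in records:
            fh.write(rec)

records = [b"x" * 100 for _ in range(10000)]  # 10k tiny records, ~1 MB total
path = os.path.join(tempfile.mkdtemp(), "out.bin")
write_records(path, records)
```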

@mmokrejs
Author

mmokrejs commented Jan 24, 2018

I cannot log in to the cluster node to verify this, but although I am running spades.py with --tmp-dir /ramdisk/$PBS_JOBID, it seems it is still reading and writing to LustreFS at the same pace (~100 kBps). I also do not see any improvement in how quickly spades.py moves through processing the many input FASTQ files.

And, while the log says now:

0:46:19.694 12M / 700M INFO General (read_converter.hpp : 84) Converting reads to binary format for library #6 (takes a while)

I should not see the paired_6_*.seq files on the networked filesystem until this step is complete, right? They should still be in --tmp-dir.

@asl
Member

asl commented Jan 24, 2018

These files will be in the output dir since they are reused across iterations (i.e., they are long-lived). Everything else will be on scratch.

@mmokrejs
Author

mmokrejs commented Jan 24, 2018

I don't understand. The paired_6_*.seq files have the same modification timestamp because they were continually updated for a while during processing of input library #6. This should have happened in --tmp-dir, and only then should the paired_6_*.seq files have been moved to tt_16D1C3L12__SPAdes3.11.1_noecc_ramdisk/.bin_reads/. But these files should not have existed in tt_16D1C3L12__SPAdes3.11.1_noecc_ramdisk/.bin_reads/ until library #7 processing started, so what am I missing?
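The workflow being asked for, do the write-heavy conversion in the fast --tmp-dir and only move the finished files to the network filesystem once a library is done, can be sketched generically as below. `convert_one()` and the file names are hypothetical stand-ins, not SPAdes code:

```python
import os
import shutil
import tempfile

def convert_one(tmp_path):
    # Hypothetical stand-in for converting one library into tmp_path.
    with open(tmp_path, "wb") as fh:
        fh.write(b"\x00" * 1024)

def convert_then_move(final_dir, name, tmp_dir):
    """Do all the I/O-heavy work in tmp_dir, then relocate the finished
    file to final_dir in one step (shutil.move copies across filesystems)."""
    tmp_path = os.path.join(tmp_dir, name)
    convert_one(tmp_path)
    os.makedirs(final_dir, exist_ok=True)
    return shutil.move(tmp_path, os.path.join(final_dir, name))

tmp_dir = tempfile.mkdtemp()    # stands in for --tmp-dir on the ramdisk
final_dir = tempfile.mkdtemp()  # stands in for .bin_reads/ on LustreFS
dest = convert_then_move(final_dir, "paired_6_0.seq", tmp_dir)
```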

@asl
Member

asl commented Jan 24, 2018

This is not how it is done currently. We may consider doing this in a future SPAdes version. Patches are always welcome, though.

@mmokrejs mmokrejs changed the title Parallelize single-threaded read_converter step Parallelize single-threaded read_converter step and move *.seq files out of --tmp-dir after every input library is converted Jan 24, 2018