Parallelize single-threaded read_converter step and move *.seq files out of --tmp-dir after every input library is converted #67
This part is normally I/O-bound, so multiple threads would make the situation even worse.
We have a parallel filesystem (LustreFS) served by, I think, 54 working slave machines, with InfiniBand in between. How the data are laid out over the many hosts and drives is user-configurable per directory or even per file. The stripe size is currently 1 MB, I believe. And if I were sure the data fit into memory, I would use a ramdisk for the actual processing and then move the resulting files into the storage filesystem. Oh yes, it does:
The input uncompressed FASTQ files occupied 435.86 GB.
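Incidentally, striping can also be requested programmatically rather than per directory with `lfs setstripe`. A minimal sketch, assuming liblustreapi is installed and that `llapi_file_create()` keeps its classic signature; the path and stripe parameters below are illustrative only:

```cpp
// Sketch: pre-create an output file with an explicit Lustre striping layout
// before writing large sequential data to it. Assumes liblustreapi
// (lustre/lustreapi.h) is available; link with -llustreapi.
#include <lustre/lustreapi.h>
#include <cstdio>

int main() {
    const char *path = "/lustre/scratch/reads.seq";   // hypothetical path
    // 1 MB stripes spread across 8 OSTs; offset -1 lets Lustre pick the start.
    unsigned long long stripe_size = 1ULL << 20;
    int rc = llapi_file_create(path, stripe_size,
                               /*stripe_offset=*/-1,
                               /*stripe_count=*/8,
                               /*stripe_pattern=*/0);
    if (rc != 0) {
        std::perror("llapi_file_create");
        return 1;
    }
    // The file now exists with the requested layout; open and write as usual.
    return 0;
}
```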
This is how it should be. We're reading FASTQ (a text format) and converting it to the internal binary format. The 9:1 read:write ratio is very close to the file-size ratio between text FASTQ and the SPAdes binary format.
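As a back-of-the-envelope check: a 100 bp FASTQ record is roughly 240+ bytes of text (header, sequence, `+` line, qualities), while a 2-bit-packed sequence needs about 25 bytes, which lands near 9:1. A minimal sketch of generic 2-bit packing; this is not necessarily the actual SPAdes on-disk layout:

```cpp
// Sketch: pack nucleotides at 2 bits each (4 bases per byte). Dropping the
// header and quality lines plus this packing is where the ~9:1 text:binary
// size ratio comes from.
#include <cstdint>
#include <string>
#include <vector>

std::vector<uint8_t> pack_2bit(const std::string &seq) {
    std::vector<uint8_t> out((seq.size() + 3) / 4, 0);
    for (size_t i = 0; i < seq.size(); ++i) {
        uint8_t code;
        switch (seq[i]) {
            case 'A': code = 0; break;
            case 'C': code = 1; break;
            case 'G': code = 2; break;
            case 'T': code = 3; break;
            default:  code = 0; break;  // real converters handle N separately
        }
        out[i / 4] |= code << (2 * (i % 4));
    }
    return out;
}
```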
Here is what the filesystem can handle when applications are properly written to read and write in large chunks; a very efficient alternative for comparison. bamsort comes from https://github.com/gt1/biobambam2
The currently running SPAdes process, executing read_converter.hpp/binary_converter.hpp, supposedly overloaded the LustreFS metadata servers, and after 40 minutes of attempts to flush buffers the kernel gave up (see the high system CPU load in red in the figures below). I see similar issues when applications append many too-small chunks to existing files.
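To illustrate the "large chunks" point above: a writer can buffer small appends in memory and flush in multi-megabyte chunks aligned with the stripe size. A sketch only; `ChunkedWriter` and the 8 MB flush threshold are hypothetical, not SPAdes code:

```cpp
// Sketch: accumulate small appends in a memory buffer and flush to disk in
// multi-megabyte writes, so the I/O pattern matches a ~1 MB Lustre stripe
// size instead of hammering the servers with tiny appends.
#include <fstream>
#include <string>

class ChunkedWriter {
    std::ofstream out_;
    std::string buf_;
    static constexpr size_t kFlushAt = 8u << 20;  // 8 MB, assumed chunk size

public:
    explicit ChunkedWriter(const std::string &path)
        : out_(path, std::ios::binary | std::ios::app) {
        buf_.reserve(kFlushAt);
    }
    void append(const char *data, size_t n) {
        buf_.append(data, n);
        if (buf_.size() >= kFlushAt) flush();
    }
    void flush() {
        out_.write(buf_.data(), static_cast<std::streamsize>(buf_.size()));
        buf_.clear();
    }
    ~ChunkedWriter() { flush(); }
};
```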
I cannot log in to the cluster node to verify this, but although I am running with --tmp-dir pointing to scratch, the log now says:
I should not see the *.seq files there.
These files will be in the output dir since they are reused across iterations (i.e., long-lived). Everything else will be on scratch.
I don't understand. The *.seq files could be moved out of --tmp-dir after each input library is converted.
This is not how it is done currently. We may consider doing this in a future SPAdes version. Patches are always welcome, though.
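For anyone interested in such a patch, the requested behaviour might look roughly like the sketch below, using std::filesystem; `move_converted_library`, `tmp_dir`, and `output_dir` are hypothetical names, not the SPAdes API:

```cpp
// Sketch: after a library finishes converting, move its *.seq files from
// --tmp-dir to the output dir. std::filesystem::rename fails across
// filesystems (EXDEV), so fall back to copy + remove in that case.
#include <filesystem>
namespace fs = std::filesystem;

void move_converted_library(const fs::path &tmp_dir, const fs::path &output_dir) {
    for (const auto &entry : fs::directory_iterator(tmp_dir)) {
        if (entry.path().extension() != ".seq") continue;
        fs::path dst = output_dir / entry.path().filename();
        std::error_code ec;
        fs::rename(entry.path(), dst, ec);  // cheap if same filesystem
        if (ec) {                           // cross-device: copy, then remove
            fs::copy_file(entry.path(), dst, fs::copy_options::overwrite_existing);
            fs::remove(entry.path());
        }
    }
}
```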
Hi,
although I provided 19 input files, the code ran in a single thread. To scale further, could it also do the conversion in multiple chunks within each file?
This probably won't happen soon, but let me open a feature request for it. The current version is SPAdes 3.11.1. Thank you.
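For reference, the per-library parallelism requested here could look roughly like the sketch below: one worker per library, capped at a small thread count since the step is largely I/O-bound. `convert_library()` is a hypothetical stand-in for the logic in read_converter.hpp, not the real function:

```cpp
// Sketch: convert the N input libraries with a small fixed pool of workers
// pulling library indices from a shared atomic counter.
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// Hypothetical stand-in for the per-library conversion logic.
void convert_library(size_t lib_id) {
    std::printf("converting library %zu\n", lib_id);
}

void convert_all(size_t n_libs, size_t n_workers = 4) {
    std::atomic<size_t> next{0};
    std::vector<std::thread> pool;
    for (size_t w = 0; w < n_workers; ++w)
        pool.emplace_back([&] {
            // Each worker grabs the next unconverted library until none remain.
            for (size_t i = next++; i < n_libs; i = next++)
                convert_library(i);
        });
    for (auto &t : pool) t.join();
}
```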