Can Spades be made deterministic? #111

Closed
andreyto opened this issue May 8, 2018 · 3 comments

andreyto commented May 8, 2018

I am using Spades in a clinical strain surveillance application. It would be highly desirable to generate exactly the same outputs from the same inputs and parameters every time I run Spades; I need complete repeatability.
In my tests, this does not seem to be the case.
Would you have any pointers on how much work it would take to add a --deterministic switch?
If I implemented this, would you be interested in a pull request?
A quick search for srand in the v3.11.0 code suggests this:

  • Spades code itself tries to fix the seed in various places (srand(48))
  • bwa and samtools from the ext directory: not clear yet, probably not deterministic. Both the way they are called (the seed argument) and the code that randomly places identical reads might have to change.
  • nlopt is definitely not deterministic. I can see this line: nlopt_srand_time_default(); /* default is non-deterministic */. They want different seeds in different threads, and that function generates the seed by combining the time with the thread ID.
  • Is multithreading in Spades likely to be a separate physical source of randomness, regardless of the seed being used in the generator? If so, I am willing to run everything in a single thread in the --deterministic mode. Otherwise it might take too much work to implement the fixed ordering of work among the threads.
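
The difference between a fixed seed such as srand(48) and a time-based seed can be illustrated outside of SPAdes. The sketch below is an illustration only: it uses awk's srand/rand as a stand-in for the C calls, and the helper name draw is hypothetical. It does not reproduce anything SPAdes, bwa, samtools, or nlopt actually do.

```shell
#!/bin/bash
# Illustration only: awk's srand/rand stand in for the C srand/rand calls.
# draw SEED prints five pseudo-random integers generated from a fixed seed.
draw() {
    awk -v seed="$1" 'BEGIN {
        srand(seed)
        for (i = 0; i < 5; i++) printf "%d ", int(rand() * 1000)
        print ""
    }'
}

draw 48
draw 48   # same seed, so the same five numbers are printed again
```

Calling srand() with no argument in awk seeds from the time of day instead, which is the kind of behavior that makes repeated runs diverge.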
asl (Member) commented May 8, 2018

Could you please report the cases where SPAdes output is non-deterministic (e.g. input / parameters and the observed behavior)? SPAdes is designed to be deterministic modulo the number-of-threads option (so the output for 8 threads could differ from the output for 16 threads), and our tests show that this is indeed the case.

andreyto (Author) commented May 9, 2018

@asl You are correct. Spades indeed appears to be deterministic. I must have incorrectly attributed to Spades some other source of randomness in my overall workflow. I have now run the script below and got back identical md5 sums from all 40 invocations.
Thanks!

$ cat run_spades.sh
#!/bin/bash
#PBS -t 1-40%20
#PBS -l nodes=1:ppn=16
echo "Host" $(hostname) "Time Begin" $(date) "Uptime" $(uptime)
cd $PBS_O_WORKDIR
set -ex
#TEST_DATA=$HOME/work/spades/assembler/test_dataset
#TEST_FQ_1=$TEST_DATA/ecoli_1K_1.fq.gz
#TEST_FQ_2=$TEST_DATA/ecoli_1K_2.fq.gz
TEST_DATA=my_data_dir
TEST_FQ_1=$TEST_DATA/my_reads_1.fastq.gz
TEST_FQ_2=$TEST_DATA/my_reads_2.fastq.gz
OUT_DIR=out.$PBS_ARRAYID
spades.py \
    -1 $TEST_FQ_1 \
    -2 $TEST_FQ_2 \
    -t 16 \
    --sc -k 33,55,77,99,127 --careful \
    -o $OUT_DIR
seqkit stats -a $OUT_DIR/contigs.fasta > $OUT_DIR/contigs.stats
## sort by sequence and renumber contigs to make the file independent of contig output order
seqkit sort -s $OUT_DIR/contigs.fasta | perl -pe 's/>.*/>$./g' > $OUT_DIR/contigs.sorted.fasta
md5sum $OUT_DIR/contigs.sorted.fasta > $OUT_DIR/contigs.md5

echo "Host" $(hostname) "Time End" $(date) "Uptime" $(uptime)
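
Once all array tasks have finished, the per-task checksum files still need to be compared with each other. A minimal sketch of that step, assuming the out.N/contigs.md5 layout written by the script above; the helper name count_distinct_md5 is hypothetical and is not part of SPAdes or seqkit:

```shell
#!/bin/bash
# Hypothetical helper: print the number of distinct checksums found in the
# given md5sum output files. Only the first field (the hash) is compared,
# because the second field is the file path, which differs per run directory.
count_distinct_md5() {
    awk '{ print $1 }' "$@" | sort -u | wc -l | tr -d ' '
}

# Example (after all array tasks have finished):
#   n=$(count_distinct_md5 out.*/contigs.md5)
#   [ "$n" -eq 1 ] && echo "all runs identical" || echo "$n distinct outputs"
```

A result of 1 means every run produced the same sorted contig set; any larger number is the count of distinct outputs.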

andreyto closed this as completed May 9, 2018

asl (Member) commented May 9, 2018

Great! Note that there is no need to reorder the contigs. The output is deterministic as well :)
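
If the contig order is itself deterministic, the unsorted contigs.fasta files can be compared byte for byte without the sort-and-renumber step. A minimal sketch under that assumption, again using the out.N directory layout from the script in this thread; the helper name all_identical is hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: succeed only if every listed file is byte-identical
# to the first one. cmp -s is silent and reports the result via exit status.
all_identical() {
    ref=$1
    shift
    for f in "$@"; do
        cmp -s "$ref" "$f" || { echo "differs: $f" >&2; return 1; }
    done
    return 0
}

# Example: all_identical out.*/contigs.fasta && echo "byte-identical"
```

This only makes sense for runs that used the same number of threads, since the output may differ between thread counts.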
