Does NanoSim works for large genomes? #18

haghshenas · 2017-04-21T00:25:21Z

Hello,

I tried to simulate reads using NanoSim for the human genome.
The read_analysis.py script worked and generated the required models.
However, simulator.py script is running for 3 days now and is not generating any output or simulated reads.
Is there anything special I need to do for large genomes?
Can you please comment on how NanoSim scales for human size genome?

cheny19 · 2017-04-21T00:33:29Z

Yes, NanoSim works fine with large genomes, like human. The run time should range from minutes to hours, depending on the desired number of reads. If it has been running for 3 days already, you can stop it now, because it means something is going wrong.

Please make sure you got version 1.1.0, as it fixed several bugs. If you still run into the problem, let me know which dataset you used to generate the models, and you can also email me the log file so I have a better understanding of the problem.

Thanks!

haghshenas · 2017-04-21T18:31:35Z

Please make sure you got version 1.1.0, as it fixed several bugs.

I'm using the latest commit. And other tools meet the requirements:

$ lastal --version
lastal 847

$ R --version
R version 3.3.3 (2017-03-06) -- "Another Canoe"...

$ python --version
Python 2.7.5

$ python -c "import numpy; print numpy.version.version;"
1.12.1

let me know which dataset you used to generate the models

I obtained the dataset from Nanopore WGS Consortium repository. Here is the link to the fastq file:
http://s3.amazonaws.com/nanopore-human-wgs/rel3-nanopore-wgs-3306352129-FAB42798.fastq.gz

Here are the steps I followed:

Make a set of training reads from top 5000 reads in the fastq file:

$ cat rel3-nanopore-wgs-3306352129-FAB42798.fastq | awk 'NR % 4 == 1 {print ">s" (++x)} NR % 4 == 2' | head -n 10000 > real.fasta

Run the read_analysis.py script:

$ read_analysis.py -i real.fasta -r hg38.fa
2017-04-20 18:00:03: Read pre-process and unaligned reads analysis
2017-04-20 18:00:03: Alignment with LAST
2017-04-20 23:27:25: Aligned reads analysis
2017-04-20 23:27:25: match and error models
2017-04-20 23:27:33: Model fitting
2017-04-20 23:35:42: Finished!
$ ls
hg38.fa                                      training_besthit.maf
model_fitting.Rout                           training_del.hist
real.fasta                                   training_error_markov_model
ref_genome.bck                               training.fasta
ref_genome.des                               training_first_match.hist
ref_genome.prj                               training_ht_ratio
ref_genome.sds                               training_ins.hist
ref_genome.ssp                               training.maf
ref_genome.suf                               training_match.hist
ref_genome.tis                               training_match_markov_model
rel3-nanopore-wgs-3306352129-FAB42798.fastq  training_mis.hist
training_aligned_length_ecdf                 training_model_profile
training_aligned_reads_ecdf                  training_unaligned_length_ecdf
training_align_ratio

Run the simulator.py script:

$ simulator.py linear -r hg38.fa

But the simulator.py script keeps running without generating any simulated reads. Also simulated.log is empty.

Please let me know if you need any of the files.
Thanks.

haghshenas · 2017-04-25T01:55:06Z

Hi there,

Did you have a chance to look into this issue?

Thanks

cheny19 · 2017-04-25T17:26:48Z

I repeated your steps and didn't notice any error. One thing to notice is it takes long to read in the reference genome given the size. I haven't finished reading in yet, but the simulated.log is not empty. If you force quit the program, It will show information like this:

2017-04-25 10:16:15: ../simulator/src/simulator.py linear -r ~/genome.fa -n 10
2017-04-25 10:16:15: Read error profile
2017-04-25 10:16:15: Read ECDF of unaligned reads
2017-04-25 10:16:15: Read ECDF of aligned reads
2017-04-25 10:16:15: Read in reference genome

If you do have such log file, it means it's trying to read in the reference genome, and probably better to choose a faster machine, or just wait. If your log file is empty, unfortunately I cannot reproduce this problem.

haghshenas · 2017-04-25T18:31:28Z

I get the same log when I kill the process. But I don't think reading the reference genome should take more than 3 days as I am using a powerful workstation.
I start to suspect that there is a problem with handling large genome. Please let me know if waiting helps you with generating reads.

cheny19 · 2017-04-25T22:27:09Z

NanoSim reads in the genome and converts all bases to upper case. I think this is the most time consuming part for large genome right now. I'll fix the code and let you know. You don't have to wait for it to run. Thanks for letting me know about this!

haghshenas · 2017-04-25T22:58:44Z

I actually fixed the code and will send a pull request. There were two very slow parts in the code.

reading the reference fasta file as you mentioned. It can be fixed with a generator function (I used the one from here)
fixing ambiguous cases in the reference. I used a method that avoids slicing the string.

Note that this code is still not optimized in using memory which can be fixed.

cheny19 closed this as completed Mar 9, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Does NanoSim works for large genomes? #18

Does NanoSim works for large genomes? #18

haghshenas commented Apr 21, 2017 •

edited

Loading

cheny19 commented Apr 21, 2017

haghshenas commented Apr 21, 2017

haghshenas commented Apr 25, 2017

cheny19 commented Apr 25, 2017

haghshenas commented Apr 25, 2017

cheny19 commented Apr 25, 2017

haghshenas commented Apr 25, 2017

Does NanoSim works for large genomes? #18

Does NanoSim works for large genomes? #18

Comments

haghshenas commented Apr 21, 2017 • edited Loading

cheny19 commented Apr 21, 2017

haghshenas commented Apr 21, 2017

haghshenas commented Apr 25, 2017

cheny19 commented Apr 25, 2017

haghshenas commented Apr 25, 2017

cheny19 commented Apr 25, 2017

haghshenas commented Apr 25, 2017

haghshenas commented Apr 21, 2017 •

edited

Loading