Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Does NanoSim works for large genomes? #18

Closed
haghshenas opened this issue Apr 21, 2017 · 7 comments
Closed

Does NanoSim works for large genomes? #18

haghshenas opened this issue Apr 21, 2017 · 7 comments

Comments

@haghshenas
Copy link
Contributor

haghshenas commented Apr 21, 2017

Hello,

I tried to simulate reads using NanoSim for the human genome.
The read_analysis.py script worked and generated the required models.
However, simulator.py script is running for 3 days now and is not generating any output or simulated reads.
Is there anything special I need to do for large genomes?
Can you please comment on how NanoSim scales for human size genome?

@cheny19
Copy link
Collaborator

cheny19 commented Apr 21, 2017

Yes, NanoSim works fine with large genomes, like human. The run time should range from minutes to hours, depending on the desired number of reads. If it has been running for 3 days already, you can stop it now, because it means something is going wrong.

Please make sure you got version 1.1.0, as it fixed several bugs. If you still run into the problem, let me know which dataset you used to generate the models, and you can also email me the log file so I have a better understanding of the problem.

Thanks!

@haghshenas
Copy link
Contributor Author

Please make sure you got version 1.1.0, as it fixed several bugs.

I'm using the latest commit. And other tools meet the requirements:

$ lastal --version
lastal 847

$ R --version
R version 3.3.3 (2017-03-06) -- "Another Canoe"...

$ python --version
Python 2.7.5

$ python -c "import numpy; print numpy.version.version;"
1.12.1

let me know which dataset you used to generate the models

I obtained the dataset from Nanopore WGS Consortium repository. Here is the link to the fastq file:
http://s3.amazonaws.com/nanopore-human-wgs/rel3-nanopore-wgs-3306352129-FAB42798.fastq.gz

Here are the steps I followed:

  1. Make a set of training reads from top 5000 reads in the fastq file:
$ cat rel3-nanopore-wgs-3306352129-FAB42798.fastq | awk 'NR % 4 == 1 {print ">s" (++x)} NR % 4 == 2' | head -n 10000 > real.fasta
  1. Run the read_analysis.py script:
$ read_analysis.py -i real.fasta -r hg38.fa
2017-04-20 18:00:03: Read pre-process and unaligned reads analysis
2017-04-20 18:00:03: Alignment with LAST
2017-04-20 23:27:25: Aligned reads analysis
2017-04-20 23:27:25: match and error models
2017-04-20 23:27:33: Model fitting
2017-04-20 23:35:42: Finished!
$ ls
hg38.fa                                      training_besthit.maf
model_fitting.Rout                           training_del.hist
real.fasta                                   training_error_markov_model
ref_genome.bck                               training.fasta
ref_genome.des                               training_first_match.hist
ref_genome.prj                               training_ht_ratio
ref_genome.sds                               training_ins.hist
ref_genome.ssp                               training.maf
ref_genome.suf                               training_match.hist
ref_genome.tis                               training_match_markov_model
rel3-nanopore-wgs-3306352129-FAB42798.fastq  training_mis.hist
training_aligned_length_ecdf                 training_model_profile
training_aligned_reads_ecdf                  training_unaligned_length_ecdf
training_align_ratio
  1. Run the simulator.py script:
$ simulator.py linear -r hg38.fa

But the simulator.py script keeps running without generating any simulated reads. Also simulated.log is empty.

Please let me know if you need any of the files.
Thanks.

@haghshenas
Copy link
Contributor Author

Hi there,

Did you have a chance to look into this issue?

Thanks

@cheny19
Copy link
Collaborator

cheny19 commented Apr 25, 2017

I repeated your steps and didn't notice any error. One thing to notice is it takes long to read in the reference genome given the size. I haven't finished reading in yet, but the simulated.log is not empty. If you force quit the program, It will show information like this:

2017-04-25 10:16:15: ../simulator/src/simulator.py linear -r ~/genome.fa -n 10
2017-04-25 10:16:15: Read error profile
2017-04-25 10:16:15: Read ECDF of unaligned reads
2017-04-25 10:16:15: Read ECDF of aligned reads
2017-04-25 10:16:15: Read in reference genome

If you do have such log file, it means it's trying to read in the reference genome, and probably better to choose a faster machine, or just wait. If your log file is empty, unfortunately I cannot reproduce this problem.

@haghshenas
Copy link
Contributor Author

I get the same log when I kill the process. But I don't think reading the reference genome should take more than 3 days as I am using a powerful workstation.
I start to suspect that there is a problem with handling large genome. Please let me know if waiting helps you with generating reads.

@cheny19
Copy link
Collaborator

cheny19 commented Apr 25, 2017

NanoSim reads in the genome and converts all bases to upper case. I think this is the most time consuming part for large genome right now. I'll fix the code and let you know. You don't have to wait for it to run. Thanks for letting me know about this!

@haghshenas
Copy link
Contributor Author

I actually fixed the code and will send a pull request. There were two very slow parts in the code.

  1. reading the reference fasta file as you mentioned. It can be fixed with a generator function (I used the one from here)
  2. fixing ambiguous cases in the reference. I used a method that avoids slicing the string.

Note that this code is still not optimized in using memory which can be fixed.

@cheny19 cheny19 closed this as completed Mar 9, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants