Read Simulation: reference length limitation? #30

ampodu · 2018-07-09T18:25:25Z

Hi,
I was hoping to use this tool to simulate some reads for a set of amplicons. However, it gives me an error when I try it:

src/simulator.py linear -r ~/R/projects/umi.sim/nanosim.input.fasta -c training.juplasmid -o sim.pcrprod
Traceback (most recent call last):
  File "src/simulator.py", line 716, in <module>
    main()
  File "src/simulator.py", line 710, in main
    simulation(ref, out, dna_type, perfect, kmer_bias, max_readlength, min_readlength)
  File "src/simulator.py", line 284, in simulation
    read_mutated = mutate_read(new_read, new_read_name, out_error, error_dict, kmer_bias, False)
  File "src/simulator.py", line 577, in mutate_read
    tmp_bases.remove(read[key + i])
IndexError: string index out of range

The reference file contains 10,000 entries of 1.3 kb each, so I thought maybe it cannot deal with so many entries. That is why I also tried the simulation with a 7.4 kb plasmid, but that produces the same error.

With a 49 kb lambda genome it works, so I assume there is a limitation on the reference size?

Is it possible to adjust NanoSim for shorter references or am I missing something?

cheny19 · 2018-07-09T18:37:42Z

NanoSim can handle multiple entries in a FASTA file, and it should work fine for small reference genomes; it just takes longer to generate reads of a proper length. Could you send me your reference plasmid FASTA file so I can test it? Thanks!

ampodu · 2018-07-10T07:09:50Z

I sent the file to your bcgsc.ca email address.

I did some more testing, and it seems that every reference over ~40 kb works, but with ~20 kb or less it always crashes for me. However, if I use --perfect, it works just fine. The problem probably lies within mutate_read.

ampodu · 2018-07-10T15:04:27Z

I found a solution: setting "max_len" to the length of the reference contigs.
It seems it was trying to create reads longer than the reference?
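
For anyone who needs to look up that value, here is a minimal Python sketch that reports contig lengths from a FASTA file. The function name `contig_lengths` is hypothetical and not part of NanoSim; with contigs of different sizes, using the shortest length as the cap is a conservative assumption on my part.

```python
def contig_lengths(fasta_lines):
    """Return {record_name: sequence_length} for FASTA text given as an iterable of lines."""
    lengths = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            lengths[name] = 0
        elif name is not None:
            # Sequence may be wrapped over several lines; sum them up.
            lengths[name] += len(line)
    return lengths


# Small in-memory example; for a real file, pass an open file handle instead.
example = [">amp1 test amplicon", "ACGT", "ACG", ">amp2", "AC"]
lengths = contig_lengths(example)
print(min(lengths.values()), max(lengths.values()))  # prints "2 7"
```

The shortest length printed here is the value I would pass as "max_len".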

cheny19 · 2018-07-10T17:13:43Z

Oh, I see. NanoSim tries to mimic the length distribution of the training profile. So if your reference genome is shorter than the read lengths in the empirical distribution, it may not be able to find the right length for each read. Specifying max_len is the solution, and you probably don't mind that the simulated read lengths don't match the profile, given the reference size.
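
To illustrate the idea, here is a toy Python sketch. It is not NanoSim's actual code; `sample_read` and its parameters are hypothetical names. It restricts the empirical lengths to those that fit within the reference (or an explicit cap), so the read that comes back always has the length that was requested.

```python
import random


def sample_read(reference, profile_lengths, max_len=None, rng=random):
    """Cut a random read whose length is drawn from profile_lengths.

    Only lengths that fit within the reference (and max_len, if given)
    are considered, so the returned read always has the drawn length.
    """
    cap = len(reference) if max_len is None else min(max_len, len(reference))
    candidates = [n for n in profile_lengths if n <= cap]
    if not candidates:
        raise ValueError("no profile length fits within the reference")
    length = rng.choice(candidates)
    start = rng.randrange(len(reference) - length + 1)
    return reference[start:start + length]


reference = "ACGT" * 325  # a 1300-base toy reference

# Without a cap, Python slicing silently truncates to the reference length,
# so code that later indexes up to the requested length (e.g. position 7999)
# would fail with "IndexError: string index out of range".
too_long = reference[0:8000]
print(len(too_long))  # prints 1300

# With the cap, only the 500-base profile length is eligible here.
read = sample_read(reference, [500, 8000, 20000], rng=random.Random(0))
print(len(read))  # prints 500
```

The trade-off is the one described above: the simulated lengths no longer follow the full training profile, only the part of it that fits the reference.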

ampodu closed this as completed Jul 13, 2018
