IndexError: list index out of range #13

stsmall · 2016-07-06T20:29:28Z

Hi,
Thanks for the useful software. I am having an issue where redundans returns an index error.
Python 2.7.11; I also installed all programs manually from versions/links listed in INSTALL.sh. The test case ran without error.

redundans.py -v -i ../../Anf.1.fastq.gz ../../Anf.2.fastq.gz -f redundans.in.fasta -t 5 -o run1 --sspacebin ~/SSPACE-STANDARD-3.0_linux-x86_64/SSPACE_Standard_v3.0.pl --noscaffolding --nogapclosing

Options: Namespace(fasta='redundans.in.fasta', fastq=['../../Anf.1.fastq.gz', '../../Anf.2.fastq.gz'], identity=0.51, iters=2, joins=5, limit=0.2, linkratio=0.7, log=<open file '', mode 'w' at 0x7f47e31fe1e0>, mapq=10, minLength=200, nocleaning=True, nogapclosing=False, noreduction=True, noscaffolding=False, outdir='run1', overlap=0.66, sspacebin='~/SSPACE-STANDARD-3.0_linux-x86_64/SSPACE_Standard_v3.0.pl', threads=5, verbose=True)
Aligning 69476742 mates per library...
                                           Insert size statistics  Mates orientation stats
FastQ files                                median  mean    stdev   FF  FR    RF  RR
../../Anf.1.fastq.gz ../../Anf.2.fastq.gz  403     399.70  135.09  4   9974  22  0

[Wed Jul 6 15:45:19 2016] Reduction...

file name        genome size  contigs  heterozygous size  [%]    heterozygous contigs  [%]    identity [%]  possible joins  homozygous size  [%]    homozygous contigs  [%]
run1/contigs.fa  347383710    121440   88802556           25.56  49234                 40.54  79.471        0               260148921        74.89  72206               59.46
Traceback (most recent call last):
  File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/redundans/redundans.py", line 403, in <module>
    main()
  File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/redundans/redundans.py", line 398, in main
    o.verbose, o.log)
  File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/redundans/redundans.py", line 263, in redundans
    limit = get_read_limit(reducedFname, readLimit, verbose, log)
  File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/redundans/redundans.py", line 99, in get_read_limit
    stats = fasta_stats(open(fasta))
  File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/redundans/fasta_stats.py", line 18, in fasta_stats
    faidx = FastaIndex(handle)
  File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/redundans/FastaIndex.py", line 37, in __init__
    self._generate_index()
  File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/redundans/FastaIndex.py", line 70, in _generate_index
    stats = self.get_stats(header, seq, offset)
  File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/redundans/FastaIndex.py", line 186, in get_stats
    linebases, linebytes = len(seq[0].strip()), len(seq[0])
IndexError: list index out of range

stsmall · 2016-07-07T16:32:06Z

I also noticed that contigs.reduced.fa has a doubled ">" in the header and that the final output is incomplete.
$ tail contigs.reduced.fa
'>>flattened_line_165493:0-1609'
CTGTGCCAGAAGTTATTCGAGGAACCAAGTTCTTTACCCTGTTTGAGAAAATGTAAAATTTTCCTCATTTTTGAGCAATGTAGCAAAAATTTTAAATGTGATATATTTTGATGGGAATGTGTTGCAATAAGTTTCTTTTCACTTAAATAGTGCTTTAAAGTAAAACAAGAATAAAAAATCAGAAAAAAAATTTTAAAAAATTTTCAACTCGAAAATTTCATAAGGGGTACCCCTTGCCATTTTTTTGAGAAATTTTGCAAAAAAATAAAATTGTTTTATTTTAACGCCAATGGGTTGCAATGAGTTTCTTTTCACTCTAATAATGCTTGAAATAAAAAAAAGCGTGAAAAATAATAAAAAAATCAAAAAATTCAAAAAAAAAAATGAAAATTTTCCTCATTTTTGCACCAACTTTCGCCTATAACTCGGTCGGTACCCAACCGATCGCCAATCTTTAACCTGTGGTCGATAGATGGCACCAATGGCTACATTTTCTTCTCGGACAGTCATGCCCTCAGATGTCTGTGCCAGAAGTTATTCGAGGAACCAAGTTCCTTACCCTGTTTGAGAAAATGTAAAATTTTCCTCAGTTTTGAGCAATGCAGCAAAAATTTTAAATGTGATATATTTTGATGGGAATTTGTTGCAATAAGTTTCTTTTCACTTAAATAGTGCTTTAAATTAAAAAAAGAATAAAAAATCAGAAAAAAATTTCAAAAAAATTTTCCACTCCAAAATTTCATAAGGCTTACCCCTTACGATTTTTTTAGGAAATTTTGCAAAAAAAAATAAAATTGTTTTATTTTAATCCCAATTGGTTGCAATGAGTTTCTTTTCACTCTAATAATGCTTGAAATAAAAAAGAGAGTGAAAAATAATAAAAAAATCAAAAATTTCAAAAGGTTTGAAAATTTCATGAGGTTATAGCCTTACTTTTTTCTTGGGAAATTTAGCAAAATTTTATGAATATACGTATTGGTCGTATTGACGTAAATTGACGAGTTGGTTTCCTGTCTGGGAACATCTGGAAACAGGGTCCTTATTGATCTCGTCATGTTCTACCAAAATGGAAACTTTTGTTAGTTCACGTGTTTCTTCCTAGGTGGCGCTTTGCCACATGTTCGTGTCCTAGCTAACTAGGATGCTCTGATCACCACAGTACTCCTCTCCTAGACCGGTCACCGCTGTTACCGATAGCGCAGAGGGATTATAACTCTGCGTTGTGATCGCAGTCGGTCGATATGTTATCGTCCTCCTTTTCTCCGGTCTCCGGTGATGCTGGACTGGACGTCGGCTCTTGGGTTCGAGATGGATACTCGGCCGATGGCGGGTTGTCCGTTGGTGTTTTTCGTCTTCGCGTCGGGTCGGGTGGATCGTCTGTTTTGTAGGCCGATTTCAGGCAATGTTGGCGGTTTCAGGCGGTTGTCCATTTATCAGAATCTTGAAAAACTTAAGGCTTCGCTGAAGTATCCAGTATGATCCTTGGTAGGGTGCTTCGGGCCAACGAGAAAAGTGGTCGATGATTGTCCGGCAGTAGCGGTTGCCGTTGCTGCCTGCAAAGGTCCGATGATGTCTTCGTTGACGTGAGAGAACTTTCGCTCATTGACAC
'>% '

lpryszcz · 2016-07-10T16:34:49Z

Hi, I suppose it's a wrongly formatted FASTA file. Can you send me the file?
Or at least run samtools faidx file.fasta or grep -n '^>' file.fasta and make sure all header lines are properly formatted?
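
The header check suggested above could also be sketched in Python. This is only an illustration: the function name `malformed_headers` is hypothetical, and the assumption that a "properly formatted" header means a single ">" followed by a non-empty name is mine, not something redundans or samtools defines.

```python
def malformed_headers(lines):
    """Yield (line number, header) for headers with no name or a doubled '>'.

    Assumption for this sketch: a well-formed header is a single '>'
    followed by a non-empty name.
    """
    for lineno, line in enumerate(lines, 1):
        if not line.startswith(">"):
            continue  # sequence line, not a header
        name = line[1:].strip()
        if not name or name.startswith(">"):
            yield lineno, line.rstrip("\n")


example = [">ctg1\n", "ACGT\n", ">>ctg2\n", "GG\n", ">\n"]
print(list(malformed_headers(example)))  # [(3, '>>ctg2'), (5, '>')]
```

To check a real file, an open file handle can be passed instead of the list, for example `with open("file.fasta") as handle: print(list(malformed_headers(handle)))`.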

stsmall · 2016-07-10T18:27:33Z

Hi,
You are right. Samtools faidx complains that there are duplicate
sequence IDs. However when I do 'grep -c ">" FOO.fa | wc -l | uniq' the
number of headers is the same as just 'grep -c ">"'. So it seems to be
an issue with my fasta files. I shared the fasta file anyway but I will
test samtools1.3 (running 1.2) and shortening header names tomorrow.

https://www.dropbox.com/s/1seutpqnbiikxzn/Anfunestus_MW.redundans.in.fasta.gz?dl=0

thanks,
scott

Every gun that is made, every warship launched,
every rocket fired signifies,in the final sense,
a theft from those who hunger and are
not fed, those who are cold and are
not clothed. This world in arms is
not spending money alone. It is
spending the sweat of its
laborers, genius of its
scientists, the hopes
of its children.
--Dwight D. Eisenhower

scapella · 2016-07-10T21:01:52Z

Hi there,

Just to make your life easier when counting duplicated entries, you should
do something like "grep ">" INPUT.fa | sort -u | wc -l" vs "grep -c ">"
INPUT.fa". uniq only collapses adjacent identical lines, so without sorting first you will not remove duplicated headers.
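
For reference, the same count can be sketched in Python, which also reports which IDs are duplicated. The function name `duplicate_fasta_ids` is hypothetical. One difference from the grep pipeline is deliberate: this sketch compares only the first whitespace-delimited token of each header, since that is what samtools faidx uses as the sequence name, so two headers with the same ID but different descriptions still count as duplicates.

```python
from collections import Counter


def duplicate_fasta_ids(lines):
    """Return {sequence ID: count} for IDs that occur more than once.

    The ID is taken as the first whitespace-delimited token after '>'.
    Headers with no name at all are skipped.
    """
    ids = Counter(
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].split()
    )
    return {name: n for name, n in ids.items() if n > 1}


example = [">ctg1 len=4\n", "ACGT\n", ">ctg2\n", "GG\n", ">ctg1 len=2\n", "TT\n"]
print(duplicate_fasta_ids(example))  # {'ctg1': 2}
```

On a real file this would be called with an open handle, for example `with open("INPUT.fa") as handle: print(duplicate_fasta_ids(handle))`.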

Hope it helps to clarify.

Cheers,

S

Salva


stsmall · 2016-07-10T21:15:58Z

Ah yes, of course, you are right. I always forget to sort when using uniq; sort -u is a better choice.


lpryszcz · 2016-07-11T11:41:01Z

Thanks for the feedback. I'll try to add input checking and an exception for malformed FASTA entries. Stay tuned!

stsmall · 2016-07-12T19:08:28Z

Hi,
So I cleaned up the duplicates and samtools faidx completes without errors. When I run redundans.py I am still getting an index error. Same one as before:

Traceback (most recent call last):
  File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/redundans/redundans.py", line 403, in <module>
    main()
  File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/redundans/redundans.py", line 398, in main
    o.verbose, o.log)
  File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/redundans/redundans.py", line 263, in redundans
    limit = get_read_limit(reducedFname, readLimit, verbose, log)
  File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/redundans/redundans.py", line 99, in get_read_limit
    stats = fasta_stats(open(fasta))
  File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/redundans/fasta_stats.py", line 18, in fasta_stats
    faidx = FastaIndex(handle)
  File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/redundans/FastaIndex.py", line 37, in __init__
    self._generate_index()
  File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/redundans/FastaIndex.py", line 70, in _generate_index
    stats = self.get_stats(header, seq, offset)
  File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/redundans/FastaIndex.py", line 186, in get_stats
    linebases, linebytes = len(seq[0].strip()), len(seq[0])
IndexError: list index out of range

lpryszcz · 2016-07-13T06:47:47Z

Hi, I have reproduced your issue. You simply have empty sequences (headers with no sequence lines after them) in your FASTA, as in the example below:

>NODE_243_length_56_cov_106_ID_485
GTTGTCCAGTTGGTTGTCCAGTTGGTTGTCCAGTTGGTTGTCCAGTTGGTTGTCCT
>NODE_244_length_56_cov_213_ID_487
>NODE_245_length_56_cov_149_ID_489
GGACAACCAACTGGACAACCAACTGGACAACCAACTGGACAACCAACAGGACAACC

I have added a piece of code to handle it. Just pull the latest version:

git pull origin master

Nevertheless, you should remove the empty entries from your FASTA.
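
One way to find and drop such entries is sketched below. This is not the code added to redundans; the helper names `fasta_records` and `drop_empty_records` are hypothetical, and the sketch assumes an "empty" record is a header followed by no non-blank sequence lines. It also shows why the traceback ends where it does: a record like this leaves an empty list of sequence lines, so indexing `seq[0]` raises IndexError.

```python
def fasta_records(lines):
    """Yield (header, [sequence lines]) for each record in an iterable of FASTA lines.

    Blank lines are skipped, so a header followed only by blank lines
    yields an empty list.
    """
    header, seq = None, []
    for line in lines:
        if line.startswith(">"):
            if header is not None:
                yield header, seq
            header, seq = line, []
        elif header is not None and line.strip():
            seq.append(line)
    if header is not None:
        yield header, seq


def drop_empty_records(lines):
    """Return (kept_lines, empty_headers); empty records have no sequence lines."""
    kept, empty = [], []
    for header, seq in fasta_records(lines):
        if seq:
            kept.append(header)
            kept.extend(seq)
        else:
            empty.append(header.strip())
    return kept, empty


example = [
    ">NODE_243_length_56_cov_106_ID_485\n",
    "GTTGTCCAGTTGGTTGTCCAGTTGGTTGTCCAGTTGGTTGTCCAGTTGGTTGTCCT\n",
    ">NODE_244_length_56_cov_213_ID_487\n",
    ">NODE_245_length_56_cov_149_ID_489\n",
    "GGACAACCAACTGGACAACCAACTGGACAACCAACTGGACAACCAACAGGACAACC\n",
]
kept, empty = drop_empty_records(example)
print(empty)  # ['>NODE_244_length_56_cov_213_ID_487']
```

For a real file, `kept` can be written back out with `writelines`. Holding every kept line in memory is a simplification that suits a sketch; for a large assembly, streaming records straight to the output file would be the safer choice.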

lpryszcz self-assigned this Jul 11, 2016

lpryszcz added the enhancement label Jul 11, 2016

lpryszcz closed this as completed Jul 13, 2016
