IndexError: list index out of range #13

Closed
stsmall opened this issue Jul 6, 2016 · 8 comments
@stsmall

stsmall commented Jul 6, 2016

Hi,
Thanks for the useful software. I am having an issue where redundans fails with an IndexError.
I am using Python 2.7.11 and installed all programs manually from the versions/links listed in INSTALL.sh. The test case ran without error.

redundans.py -v -i ../../Anf.1.fastq.gz ../../Anf.2.fastq.gz -f redundans.in.fasta -t 5 -o run1 --sspacebin ~/SSPACE-STANDARD-3.0_linux-x86_64/SSPACE_Standard_v3.0.pl --noscaffolding --nogapclosing

Options: Namespace(fasta='redundans.in.fasta', fastq=['../../Anf.1.fastq.gz', '../../Anf.2.fastq.gz'], identity=0.51, iters=2, joins=5, limit=0.2, linkratio=0.7, log=<open file '', mode 'w' at 0x7f47e31fe1e0>, mapq=10, minLength=200, nocleaning=True, nogapclosing=False, noreduction=True, noscaffolding=False, outdir='run1', overlap=0.66, sspacebin='~/SSPACE-STANDARD-3.0_linux-x86_64/SSPACE_Standard_v3.0.pl', threads=5, verbose=True)
Aligning 69476742 mates per library...
Insert size statistics                                             Mates orientation stats
FastQ files                                median  mean    stdev   FF  FR    RF  RR
../../Anf.1.fastq.gz ../../Anf.2.fastq.gz  403     399.70  135.09  4   9974  22  0

[Wed Jul 6 15:45:19 2016] Reduction...

file name        genome size  contigs  heterozygous size  [%]    heterozygous contigs  [%]    identity [%]  possible joins  homozygous size  [%]    homozygous contigs  [%]
run1/contigs.fa  347383710    121440   88802556           25.56  49234                 40.54  79.471        0               260148921        74.89  72206               59.46
Traceback (most recent call last):
  File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/redundans/redundans.py", line 403, in <module>
    main()
  File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/redundans/redundans.py", line 398, in main
    o.verbose, o.log)
  File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/redundans/redundans.py", line 263, in redundans
    limit = get_read_limit(reducedFname, readLimit, verbose, log)
  File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/redundans/redundans.py", line 99, in get_read_limit
    stats = fasta_stats(open(fasta))
  File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/redundans/fasta_stats.py", line 18, in fasta_stats
    faidx = FastaIndex(handle)
  File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/redundans/FastaIndex.py", line 37, in __init__
    self._generate_index()
  File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/redundans/FastaIndex.py", line 70, in _generate_index
    stats = self.get_stats(header, seq, offset)
  File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/redundans/FastaIndex.py", line 186, in get_stats
    linebases, linebytes = len(seq[0].strip()), len(seq[0])
IndexError: list index out of range

@stsmall
Author

stsmall commented Jul 7, 2016

I also noticed that contigs.reduced.fa has a double ">" in a header and that the final output is incomplete:
$ tail contigs.reduced.fa
'>>flattened_line_165493:0-1609'
CTGTGCCAGAAGTTATTCGAGGAACCAAGTTCTTTACCCTGTTTGAGAAAATGTAAAATTTTCCTCATTTTTGAGCAATGTAGCAAAAATTTTAAATGTGATATATTTTGATGGGAATGTGTTGCAATAAGTTTCTTTTCACTTAAATAGTGCTTTAAAGTAAAACAAGAATAAAAAATCAGAAAAAAAATTTTAAAAAATTTTCAACTCGAAAATTTCATAAGGGGTACCCCTTGCCATTTTTTTGAGAAATTTTGCAAAAAAATAAAATTGTTTTATTTTAACGCCAATGGGTTGCAATGAGTTTCTTTTCACTCTAATAATGCTTGAAATAAAAAAAAGCGTGAAAAATAATAAAAAAATCAAAAAATTCAAAAAAAAAAATGAAAATTTTCCTCATTTTTGCACCAACTTTCGCCTATAACTCGGTCGGTACCCAACCGATCGCCAATCTTTAACCTGTGGTCGATAGATGGCACCAATGGCTACATTTTCTTCTCGGACAGTCATGCCCTCAGATGTCTGTGCCAGAAGTTATTCGAGGAACCAAGTTCCTTACCCTGTTTGAGAAAATGTAAAATTTTCCTCAGTTTTGAGCAATGCAGCAAAAATTTTAAATGTGATATATTTTGATGGGAATTTGTTGCAATAAGTTTCTTTTCACTTAAATAGTGCTTTAAATTAAAAAAAGAATAAAAAATCAGAAAAAAATTTCAAAAAAATTTTCCACTCCAAAATTTCATAAGGCTTACCCCTTACGATTTTTTTAGGAAATTTTGCAAAAAAAAATAAAATTGTTTTATTTTAATCCCAATTGGTTGCAATGAGTTTCTTTTCACTCTAATAATGCTTGAAATAAAAAAGAGAGTGAAAAATAATAAAAAAATCAAAAATTTCAAAAGGTTTGAAAATTTCATGAGGTTATAGCCTTACTTTTTTCTTGGGAAATTTAGCAAAATTTTATGAATATACGTATTGGTCGTATTGACGTAAATTGACGAGTTGGTTTCCTGTCTGGGAACATCTGGAAACAGGGTCCTTATTGATCTCGTCATGTTCTACCAAAATGGAAACTTTTGTTAGTTCACGTGTTTCTTCCTAGGTGGCGCTTTGCCACATGTTCGTGTCCTAGCTAACTAGGATGCTCTGATCACCACAGTACTCCTCTCCTAGACCGGTCACCGCTGTTACCGATAGCGCAGAGGGATTATAACTCTGCGTTGTGATCGCAGTCGGTCGATATGTTATCGTCCTCCTTTTCTCCGGTCTCCGGTGATGCTGGACTGGACGTCGGCTCTTGGGTTCGAGATGGATACTCGGCCGATGGCGGGTTGTCCGTTGGTGTTTTTCGTCTTCGCGTCGGGTCGGGTGGATCGTCTGTTTTGTAGGCCGATTTCAGGCAATGTTGGCGGTTTCAGGCGGTTGTCCATTTATCAGAATCTTGAAAAACTTAAGGCTTCGCTGAAGTATCCAGTATGATCCTTGGTAGGGTGCTTCGGGCCAACGAGAAAAGTGGTCGATGATTGTCCGGCAGTAGCGGTTGCCGTTGCTGCCTGCAAAGGTCCGATGATGTCTTCGTTGACGTGAGAGAACTTTCGCTCATTGACAC
'>% '

@lpryszcz
Collaborator

lpryszcz commented Jul 10, 2016

Hi, I suppose it's a wrongly formatted FASTA file. Can you send me the file?
Or at least run samtools faidx file.fasta or grep -n '^>' file.fasta and make sure all header lines are properly formatted?
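
For readers without grep at hand, the header check suggested above can be roughly sketched in Python. This is only an illustration, not part of redundans: the function name `check_fasta_headers` and the two rules it applies (empty name, extra ">" at the start of the name) are assumptions chosen to match the symptoms reported in this thread, and the sketch targets Python 3 rather than the Python 2.7 used above.

```python
import io


def check_fasta_headers(handle):
    """Return (line_number, header, problem) for each suspicious header line.

    Hypothetical helper: it only flags empty names and names that start
    with an extra '>', which are the symptoms seen in this thread.
    """
    problems = []
    for lineno, line in enumerate(handle, start=1):
        if not line.startswith(">"):
            continue  # sequence line, not a header
        header = line.rstrip("\r\n")
        name = header[1:].strip()
        if not name:
            problems.append((lineno, header, "empty name"))
        elif name.startswith(">"):
            problems.append((lineno, header, "extra '>' in header"))
    return problems


# Small made-up example; replace with open("file.fasta") for a real file.
example = io.StringIO(">contig1\nACGT\n>>flattened_line_1:0-4\nACGT\n>\n")
for lineno, header, problem in check_fasta_headers(example):
    print(lineno, repr(header), problem)
```

Like grep -n, it reports line numbers, so the offending headers can be inspected directly in the file.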

@stsmall
Author

stsmall commented Jul 10, 2016

Hi,
You are right: samtools faidx complains that there are duplicate
sequence IDs. However, when I run 'grep -c ">" FOO.fa | wc -l | uniq', the
number of headers is the same as with just 'grep -c ">"'. So it seems to be
an issue with my FASTA files. I shared the FASTA file anyway, but I will
test samtools 1.3 (currently running 1.2) and shortening header names tomorrow.

https://www.dropbox.com/s/1seutpqnbiikxzn/Anfunestus_MW.redundans.in.fasta.gz?dl=0

thanks,
scott

Every gun that is made, every warship launched,
every rocket fired signifies,in the final sense,
a theft from those who hunger and are
not fed, those who are cold and are
not clothed. This world in arms is
not spending money alone. It is
spending the sweat of its
laborers, genius of its
scientists, the hopes
of its children.
--Dwight D. Eisenhower

@scapella

Hi there,

Just to make your life easier when counting duplicated entries, you should
do something like "grep ">" INPUT.fa | sort -u | wc -l" vs "grep -c ">"
INPUT.fa". Without sorting first, duplicated headers that are not adjacent will not be removed.

Hope it helps to clarify.

Cheers,

S

Salva
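
The same comparison can be sketched in Python, which also shows which IDs are duplicated rather than only how many. This is an illustrative sketch for Python 3, not code from redundans; the name `duplicated_ids` is made up here. One assumption worth stating: it compares only the sequence ID (the text between ">" and the first whitespace), which is what samtools faidx uses as the sequence name, whereas the grep/sort pipeline above compares whole header lines.

```python
import io
from collections import Counter


def duplicated_ids(handle):
    """Map each sequence ID that occurs more than once to its count.

    Hypothetical helper; the ID is taken as the first whitespace-delimited
    token after '>' (an assumption, see the note above).
    """
    counts = Counter(
        line[1:].split()[0]
        for line in handle
        if line.startswith(">") and line[1:].strip()
    )
    return {seq_id: n for seq_id, n in counts.items() if n > 1}


# Made-up example: "a" appears twice with different descriptions.
example = io.StringIO(">a first copy\nAC\n>b\nGT\n>a second copy\nTT\n")
print(duplicated_ids(example))
```

Because a Counter does not depend on the order of the input, no sorting step is needed here.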

@stsmall
Author

stsmall commented Jul 10, 2016

Ah yes, of course, you are right. I always forget to sort before using
uniq. sort -u is a better choice.

@lpryszcz lpryszcz self-assigned this Jul 11, 2016
@lpryszcz
Collaborator

Thanks for the feedback. I'll try to add input checking and an exception for malformed FASTA entries. Stay tuned!

@stsmall
Author

stsmall commented Jul 12, 2016

Hi,
I cleaned up the duplicates, and samtools faidx now completes without errors. However, when I run redundans.py I still get the same index error as before:

Traceback (most recent call last):
  File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/redundans/redundans.py", line 403, in <module>
    main()
  File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/redundans/redundans.py", line 398, in main
    o.verbose, o.log)
  File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/redundans/redundans.py", line 263, in redundans
    limit = get_read_limit(reducedFname, readLimit, verbose, log)
  File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/redundans/redundans.py", line 99, in get_read_limit
    stats = fasta_stats(open(fasta))
  File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/redundans/fasta_stats.py", line 18, in fasta_stats
    faidx = FastaIndex(handle)
  File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/redundans/FastaIndex.py", line 37, in __init__
    self._generate_index()
  File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/redundans/FastaIndex.py", line 70, in _generate_index
    stats = self.get_stats(header, seq, offset)
  File "/afs/crc.nd.edu/user/s/ssmall2/programs_that_work/redundans/FastaIndex.py", line 186, in get_stats
    linebases, linebytes = len(seq[0].strip()), len(seq[0])
IndexError: list index out of range

@lpryszcz
Collaborator

lpryszcz commented Jul 13, 2016

Hi, I have reproduced your issue. You simply have empty sequences in your FASTA, as in the example below:

>NODE_243_length_56_cov_106_ID_485
GTTGTCCAGTTGGTTGTCCAGTTGGTTGTCCAGTTGGTTGTCCAGTTGGTTGTCCT
>NODE_244_length_56_cov_213_ID_487
>NODE_245_length_56_cov_149_ID_489
GGACAACCAACTGGACAACCAACTGGACAACCAACTGGACAACCAACAGGACAACC
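
To see why an entry like NODE_244 triggers the error, here is a minimal sketch. It assumes that `seq` in the traceback is the list of sequence lines collected for one record, which the thread does not show, so for a header with no sequence lines the list would be empty and `seq[0]` would raise IndexError. The function `line_stats` is a hypothetical stand-in for that one statement, not the fix that was actually committed.

```python
def line_stats(seq_lines):
    """Return (linebases, linebytes) measured on the first sequence line.

    Hypothetical stand-in for the failing statement in the traceback.
    Returns (0, 0) for a record without sequence lines instead of raising.
    """
    if not seq_lines:
        return 0, 0
    first = seq_lines[0]
    return len(first.strip()), len(first)


# The unguarded access fails on an empty list, as in the traceback.
try:
    [][0]
except IndexError as err:
    print("unguarded access fails:", err)

print(line_stats(["ACGT\n", "AC\n"]))  # record with sequence lines
print(line_stats([]))                  # empty record, e.g. NODE_244 above
```

Whether to return zeros, skip the record, or raise a clearer error is a design choice; the sketch picks the first only to keep the example short.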

I have added a piece of code to handle it. Just pull the latest version:

git pull origin master

Nevertheless, you should remove them from your FASTA.
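
One possible way to remove such entries is sketched below in Python 3. It is an illustration under stated assumptions, not a redundans utility: the name `drop_empty_records` is made up, a record counts as empty when it has no non-blank sequence lines, and any text before the first header is discarded.

```python
import io


def drop_empty_records(in_handle, out_handle):
    """Copy FASTA records to out_handle, skipping records with no sequence.

    Hypothetical helper. Returns the headers of the skipped records.
    """
    skipped = []
    header, seq_lines = None, []

    def flush():
        # Write the current record if it has at least one non-blank line.
        if header is None:
            return
        if any(s.strip() for s in seq_lines):
            out_handle.write(header)
            out_handle.writelines(seq_lines)
        else:
            skipped.append(header.rstrip("\r\n"))

    for line in in_handle:
        if line.startswith(">"):
            flush()
            header, seq_lines = line, []
        elif header is not None:
            seq_lines.append(line)
    flush()  # do not forget the last record
    return skipped


# Shortened version of the example from this comment.
src = io.StringIO(">NODE_243\nGTTG\n>NODE_244\n>NODE_245\nGGAC\n")
out = io.StringIO()
print(drop_empty_records(src, out))
print(out.getvalue(), end="")
```

For a real file, the two StringIO objects would be replaced by open("in.fasta") and open("out.fasta", "w"); both file names are placeholders.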
