Out of bounds error during normalization: freebayes #22

Open
stsmall opened this issue Jan 5, 2015 · 7 comments


stsmall commented Jan 5, 2015

Hi Brad,
I reran the same files with 0.2.1 and received a different set of errors. It again halted during the processing/normalization of the freebayes joint calling file.
If I am only going to use SNPs and not indels, is it OK to remove indels/MNPs from the input files (HaplotypeCaller, UnifiedGenotyper, FreeBayes) prior to running ensemble? This set of files w/ SNPs only runs to completion. Since you closed the other issue, I opened this as a new issue. Let me know if the offending input file is needed; it is 150 MB, but I can upload it to Dropbox and provide a link.
thanks,
scott

Progress.log:
2015-01-05T00:16:50 :: State :begin :: {:desc "Starting variation analysis"}
2015-01-05T00:16:50 :: State :clean :: {:desc "Cleaning input VCF: combo"}
2015-01-05T02:03:45 :: State :merge :: {:desc "Merging multiple input files: combo"}
2015-01-05T02:03:45 :: State :prep :: {:desc "Prepare VCF, resorting to genome build: combo"}
2015-01-05T02:10:14 :: State :normalize :: {:desc "Normalize MNP and indel variants: combo"}
2015-01-05T02:10:14 :: State :clean :: {:desc "Cleaning input VCF: gatk-hc"}
2015-01-05T03:51:01 :: State :merge :: {:desc "Merging multiple input files: gatk-hc"}
2015-01-05T03:51:01 :: State :prep :: {:desc "Prepare VCF, resorting to genome build: gatk-hc"}
2015-01-05T03:56:43 :: State :normalize :: {:desc "Normalize MNP and indel variants: gatk-hc"}
2015-01-05T03:56:43 :: State :clean :: {:desc "Cleaning input VCF: gatk-ug"}
2015-01-05T06:05:23 :: State :merge :: {:desc "Merging multiple input files: gatk-ug"}
2015-01-05T06:05:23 :: State :prep :: {:desc "Prepare VCF, resorting to genome build: gatk-ug"}
2015-01-05T06:11:33 :: State :normalize :: {:desc "Normalize MNP and indel variants: gatk-ug"}
2015-01-05T06:11:33 :: State :clean :: {:desc "Cleaning input VCF: freebayes"}
2015-01-05T09:06:19 :: State :merge :: {:desc "Merging multiple input files: freebayes"}
2015-01-05T09:06:19 :: State :prep :: {:desc "Prepare VCF, resorting to genome build: freebayes"}

Stack:
Exception in thread "main" java.lang.IndexOutOfBoundsException
at clojure.lang.PersistentVector.arrayFor(PersistentVector.java:107)
at clojure.lang.PersistentVector.nth(PersistentVector.java:111)
at clojure.lang.RT.nth(RT.java:763)
at bcbio.variation.normalize$fix_vcf_line$fix_info__651.invoke(normalize.clj:273)
at bcbio.variation.normalize$fix_vcf_line.invoke(normalize.clj:280)
at bcbio.variation.normalize$vcf_by_chrom$write_by_chrom__662.invoke(normalize.clj:292)
at bcbio.variation.normalize$vcf_by_chrom$fn__665.invoke(normalize.clj:306)
at bcbio.variation.normalize$vcf_by_chrom.invoke(normalize.clj:300)
at bcbio.variation.normalize$write_prepped_vcf.invoke(normalize.clj:345)
at bcbio.variation.normalize$prep_vcf.doInvoke(normalize.clj:369)
at clojure.lang.RestFn.invoke(RestFn.java:969)
at bcbio.variation.combine$dirty_prep_work.invoke(combine.clj:147)
at bcbio.variation.combine$gatk_normalize.invoke(combine.clj:187)
at bcbio.variation.compare$prepare_vcf_calls$fn__7526.invoke(compare.clj:120)
at clojure.core$map$fn__4207.invoke(core.clj:2487)
at clojure.lang.LazySeq.sval(LazySeq.java:42)
at clojure.lang.LazySeq.seq(LazySeq.java:60)
at clojure.lang.Cons.next(Cons.java:39)
at clojure.lang.PersistentVector.create(PersistentVector.java:51)
at clojure.lang.LazilyPersistentVector.create(LazilyPersistentVector.java:31)
at clojure.core$vec.invoke(core.clj:354)
at bcbio.variation.compare$prepare_vcf_calls.invoke(compare.clj:121)
at bcbio.variation.compare$variant_comparison_from_config$iter__7582__7586$fn__7587.invoke(compare.clj:255)
at clojure.lang.LazySeq.sval(LazySeq.java:42)
at clojure.lang.LazySeq.seq(LazySeq.java:60)
at clojure.lang.RT.seq(RT.java:484)
at clojure.core$seq.invoke(core.clj:133)
at clojure.core$tree_seq$walk__4647$fn__4648.invoke(core.clj:4475)
at clojure.lang.LazySeq.sval(LazySeq.java:42)
at clojure.lang.LazySeq.seq(LazySeq.java:60)
at clojure.lang.LazySeq.more(LazySeq.java:96)
at clojure.lang.RT.more(RT.java:607)
at clojure.core$rest.invoke(core.clj:73)
at clojure.core$flatten.invoke(core.clj:6478)
at bcbio.variation.compare$variant_comparison_from_config.invoke(compare.clj:254)
at bcbio.variation.ensemble$consensus_calls.invoke(ensemble.clj:113)
at bcbio.variation.ensemble$_main.doInvoke(ensemble.clj:133)
at clojure.lang.RestFn.applyTo(RestFn.java:137)
at clojure.core$apply.invoke(core.clj:617)
at bcbio.variation.core$_main.doInvoke(core.clj:35)
at clojure.lang.RestFn.applyTo(RestFn.java:137)

Owner

chapmanb commented Jan 5, 2015

Scott;
Sorry about the continued problems. If you could make the FreeBayes file available, that would be a big help in diagnosing this further. This error indicates there is something wrong with one of the VCF lines in the file, since it does not have the INFO column, but I'm not sure from the traceback alone why that would happen. If the file is not obviously truncated or otherwise problematic, I can dig more to see what is happening. Thanks again for all the help debugging.
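
One quick way to look for such a line is to scan the file for records with fewer than the eight fixed VCF columns (CHROM through INFO). This is only a minimal sketch: the function name `find_short_records` is made up for illustration, and it assumes an uncompressed, tab-delimited VCF.

```python
import io


def find_short_records(handle, min_columns=8):
    """Yield (line_number, column_count) for non-header lines that have
    fewer than min_columns tab-separated fields (8 = CHROM..INFO)."""
    for line_number, line in enumerate(handle, start=1):
        if line.startswith("#"):
            continue  # meta-information and column header lines
        column_count = len(line.rstrip("\r\n").split("\t"))
        if column_count < min_columns:
            yield line_number, column_count


# Tiny in-memory example; the second record stops after QUAL.
example = io.StringIO(
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t10\t.\tA\tG\t50\tPASS\tDP=10\n"
    "chr1\t20\t.\tA\tG\t50\n"
)
for line_number, column_count in find_short_records(example):
    print(f"line {line_number}: only {column_count} columns")
```

To run it on a real file, pass an open file handle instead of the `StringIO` object. A blank line in the body of the file is reported as having one column.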

@stsmall
Author

stsmall commented Jan 5, 2015

Hi Brad,
I am working on posting the file. Could you please comment on my question above:
"If I was only going to use SNPs and not indels is it OK to remove indels/MNPs from the input files of (HaplotypeCaller, UnifiedGenotyper, Freebayes) prior to running ensemble or will this cause discordance errors? This set of files w/ SNPs only runs to completion."
thanks,
scott

@chapmanb
Owner

chapmanb commented Jan 5, 2015

Scott;
Sorry about forgetting to respond to that. That's fine, with the caveat that MNPs also contain SNPs, just compressed together in haplotypes, so you'll miss those. If you've previously pre-processed the files with bcbio or manually using vcfallelicprimitives, this shouldn't be a problem, as they'll already be split into individual SNPs.
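
As a rough illustration of the SNP-only filtering being discussed, here is a minimal sketch that keeps header lines plus records where REF and every ALT allele is a single base. The helper names `is_snp` and `keep_snps` are hypothetical, and the approach is deliberately strict: a multi-allelic site that mixes a SNP with an indel is dropped entirely.

```python
def is_snp(ref, alt):
    """True if REF is one base and every comma-separated ALT allele is one base."""
    bases = {"A", "C", "G", "T"}
    return ref.upper() in bases and all(a.upper() in bases for a in alt.split(","))


def keep_snps(lines):
    """Yield header lines unchanged and only those records that are pure SNPs."""
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\r\n").split("\t")
        # Column 4 is REF and column 5 is ALT in the VCF layout.
        if len(fields) >= 5 and is_snp(fields[3], fields[4]):
            yield line


records = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t10\t.\tA\tG\t50\tPASS\tDP=10\n",    # SNP, kept
    "chr1\t20\t.\tAT\tA\t50\tPASS\tDP=10\n",   # deletion, dropped
    "chr1\t30\t.\tAC\tGT\t50\tPASS\tDP=10\n",  # MNP, dropped
]
for line in keep_snps(records):
    print(line, end="")
```

Note that the MNP at position 30 is discarded even though it contains two SNPs, which is the caveat above.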

Author

stsmall commented Jan 5, 2015

Owner

chapmanb commented Jan 7, 2015

Scott;
I've been fruitlessly trying to dig for an error in this file and haven't been able to find anything yet after running it through a few different validation tools. So apologies, I'm still not sure what is going on. To reproduce it in the context of bcbio.variation I'd also need the reference file, since it looks like a custom assembly you might have locally. You can send it to me off list via Dropbox if it's not too large.

Practically, these inputs do need to be run through some post-processing to normalize and split MNPs into individual SNPs/indels. Here are the post-calling commands we use in bcbio for this:

https://github.com/chapmanb/bcbio-nextgen/blob/cf66cea237037a6d2d98851ce46d821abc965fd8/bcbio/variation/freebayes.py#L109

Doing that might help resolve your issue if it's somehow related to MNPs, but if not happy to dig more with the reference file.
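
To make the idea of splitting an MNP concrete, here is a minimal sketch of the coordinate arithmetic only. It is not the bcbio pipeline or vcfallelicprimitives itself, which also deal with genotypes, INFO fields and complex events; the function name `split_mnp` is hypothetical.

```python
def split_mnp(pos, ref, alt):
    """Split an equal-length REF/ALT pair into per-base SNPs.

    Returns a list of (position, ref_base, alt_base) tuples, one for each
    offset where the two alleles differ. Positions are 1-based like VCF POS.
    """
    if len(ref) != len(alt):
        raise ValueError("not an MNP: REF and ALT differ in length")
    return [
        (pos + offset, ref_base, alt_base)
        for offset, (ref_base, alt_base) in enumerate(zip(ref, alt))
        if ref_base != alt_base
    ]


# An MNP at position 100 where the first and third bases change.
for snp in split_mnp(100, "ACG", "TCA"):
    print(snp)
```

The middle base is identical in both alleles, so only positions 100 and 102 come out as SNPs.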

@stsmall
Author

stsmall commented Jan 7, 2015

thanks Brad, I will give this a try.
One quick question: how do I use the {fix_ambig} pipe? I see it defined as "vcfutils" and imported into Python, but couldn't find the module.


Owner

chapmanb commented Jan 7, 2015

Scott;
I don't think that step is essential, unless your reference genome has non-N ambiguous bases, so you could skip it. If you want to include it, it's only a big nasty awk command defined here:

https://github.com/chapmanb/bcbio-nextgen/blob/cf66cea237037a6d2d98851ce46d821abc965fd8/bcbio/variation/vcfutils.py#L81

Hope this helps fix the underlying issues.
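
For anyone who would rather not use awk, the same idea can be sketched in Python: replace non-N ambiguity codes in the REF column with N and leave everything else alone. This is an approximation under stated assumptions, not a copy of the linked command; the set of IUPAC codes and the function name `mask_ambiguous_ref` are assumptions made for illustration, and only uppercase codes are handled.

```python
import re

# Assumed set of non-N IUPAC ambiguity codes (plus X) to mask in REF.
AMBIGUOUS = re.compile(r"[KMRYSWBVHDX]")


def mask_ambiguous_ref(line):
    """Return a VCF line with ambiguous REF bases replaced by N.

    Header lines and lines with fewer than four columns pass through unchanged.
    """
    if line.startswith("#"):
        return line
    body = line.rstrip("\r\n")
    ending = line[len(body):]  # preserve the original line ending
    fields = body.split("\t")
    if len(fields) < 4:
        return line
    fields[3] = AMBIGUOUS.sub("N", fields[3])  # column 4 is REF
    return "\t".join(fields) + ending


print(mask_ambiguous_ref("chr1\t5\t.\tAR\tG\t50\tPASS\tDP=3\n"), end="")
```

As noted above, this only matters if the reference genome contains ambiguous bases other than N.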
