canfam3 dbSNP - ensembl 75 #386
Thanks Jason, sorry for the delay in getting back to you. It looks like that file doesn't have any scaffolds or anything in it, so if you just add 'chr' to the beginning of every line, gzip it back up, and add the dbsnp line pointing to it in the resources file in the genome directory, it should be all good. Could you try that and see if it ends up working as expected?
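The manual "add 'chr' and re-gzip" step described above can be sketched in a few lines of Python (a hedged sketch, not the cloudbiolinux script; the function name and file paths are illustrative):

```python
import gzip

def add_chr_prefix(src, dest):
    """Prefix each data line's contig name with 'chr', leaving
    '#' header lines untouched, and re-gzip the result."""
    with gzip.open(src, "rt") as fin, gzip.open(dest, "wt") as fout:
        for line in fin:
            fout.write(line if line.startswith("#") else "chr" + line)
```

After this, the new file can be pointed to by a dbsnp entry in the genome's resources file.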
It turns out @chapmanb already had a script to do something similar in the utilities section of cloudbiolinux for mouse. I generalized it a bit to work for other genomes and will point you to it when it is working.
Hi Jason, I fixed up that dbSNP preparer here: chapmanb/cloudbiolinux@ed4c5b3. It will require a reinstallation of the canFam3 genome; the original genome preparation had a bug where it was not karyotype-sorting the multiFASTA genome file. I fixed that bug here: chapmanb/cloudbiolinux@1496390. The last bit is to stick that canFam dbSNP file up so bcbio-nextgen can grab it; fix up that resources file and test it, and then you should be all set.
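The karyotype sorting mentioned above boils down to ordering contigs chr1..chrN numerically, then X/Y/M, then scaffolds. A rough sort key (my illustration, not the actual cloudbiolinux fix):

```python
def karyotype_key(name):
    """Sort key: numbered chromosomes in numeric order, then X/Y/M(T),
    then everything else (scaffolds) alphabetically."""
    base = name[3:] if name.startswith("chr") else name
    if base.isdigit():
        return (0, int(base), name)
    special = {"X": 0, "Y": 1, "M": 2, "MT": 2}
    if base in special:
        return (1, special[base], name)
    return (2, 0, name)

# e.g. sorted(["chr10", "chr2", "chrX", "chr1"], key=karyotype_key)
# gives ["chr1", "chr2", "chr10", "chrX"]
```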
Rory -- thanks for doing all the work of preparing this and updating the genome. I finalized the last bits of automating the download and sticking the dbSNP file on S3, so we should be good to go. Jason, please let us know if you run into any problems. We haven't used this, so if anything looks off please yell and we can reopen and fix this.
Hey guys, sorry for the delay - we just tried to pull down the latest, and before we even get to the canine dbSNP, we are borking on STAR. Seems the rnaseq.gtf is not on S3. Any suggestions?
Hi Jason, Sorry about that-- the upgrade system is a little brittle when it comes to the genomes. Is the canFam genome installed already? If you do it in two parts, install the canFam3 genome with:
and then install the STAR index with:
Does that work?
Unfortunately no, same error. Yes, the canFam3.fa is in there - here is everything in the canine genome dir:
Gotcha. Thanks Jason. Sorry, I am a tool-- I never made the RNA-seq bit for canFam3. I added support for it with the prepper script in cloudbiolinux here: chapmanb/cloudbiolinux@581c92e99f514fb6 When it is done I'll have Brad stick it up on S3. If you don't want to wait, you can prep it yourself: grab the cloudbiolinux repo, use the python that is installed with bcbio-nextgen, and run:
inside the /packages/bcbio/0.7.9/bcbio/genomes/Cfamiliaris_Dog directory.
Hey Rory - We ran the prepper script - thanks for this. Once we had the rnaseq gtfs and so forth built, we had to
Hi Jason, Thanks a lot for testing; it is good to know that it works on someone else's machine. The pieces are there to automatically prep and install arbitrary genomes from Ensembl; we just need to glue them together, and knowing that the rnaseq prep part is good to go is helpful.
…ffutils database fixes. Add canFam3 resources: bcbio/bcbio-nextgen#386
Hey gents - so we finally got RNA-seq working with STAR on the canines but neglected to check the dbSNP for canine. Turns out we don't have a
Did we miss something?
Jason;
Not quite - dying like this:
Actually, the upgrade broke the whole build - running a pipeline, not an upgrade, also dies with the same error on
Jason;
Hope this gets things working again.
Thanks Brad - that worked getting the install back up. Unfortunately, darn dogs are still an issue with the dbSNP file... 10 hours later with 512 cores I get the error below with gatk-haplotype. What do you think, should I just replace INFO white space with underscores or give freebayes a shot? Is there a module in bcbio to clean the dbSNP file that I've missed? Thanks for all the help!
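Jason's idea of replacing whitespace in the INFO column could be sketched like this (an assumed cleanup, not something bcbio ships; GATK rejects VCF INFO values that contain spaces):

```python
def clean_info_whitespace(line):
    """Replace spaces in the INFO column (8th tab-separated field) with
    underscores, leaving '#' header lines untouched."""
    if line.startswith("#"):
        return line
    fields = line.rstrip("\n").split("\t")
    if len(fields) > 7:
        fields[7] = fields[7].replace(" ", "_")
    return "\t".join(fields) + "\n"
```

Applied line by line over the dbSNP VCF, this would produce a file GATK can read.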
Jason;
Hope this works for you. Thanks again for all the patience getting this going.
Hey Brad! So, it took a couple tries - first the run was still referencing the old
I then re-launched from a single node running
Any ideas? |
…efore passing to samtools commandline #386
Jason; For the command line error, I pushed a fix that writes the regions to a file. This normally allows us to work around the command line length issues, so hopefully it'll resolve this as well. This doesn't have anything specifically to do with canFam3, but is likely a result of having a lot of chromosomes and thus a lot of regions. Hopefully this will keep things moving along. Thanks again for all the patience getting this running.
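The workaround described above, writing regions to a file instead of listing them on the command line, might look roughly like this (hypothetical helper name; the real bcbio code differs):

```python
import tempfile

def regions_to_bed(regions):
    """Write (chrom, start, end) tuples to a BED file so a tool can be
    passed e.g. '-L regions.bed' instead of thousands of region
    arguments, sidestepping OS command-line length limits."""
    with tempfile.NamedTemporaryFile("w", suffix=".bed", delete=False) as f:
        for chrom, start, end in regions:
            f.write("%s\t%d\t%d\n" % (chrom, start, end))
        return f.name
```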
Bummer - no. Still dies. Again, fresh job from the beginning:
…avoid command line length issues. #386
Jason; If it still fails, could you pass along the last command line in
Hmmm, better, sorta - 39966 parallel analysis blocks, however it still hangs at
As for the re-run - yes, it is actually starting back at BWA on the single node, not just stepping through. I've ssh'd in to the node to check - strange, right? I will try the
…ce chromosomes to avoid excessive numbers of split regions. #386
Jason; So I pushed yet another fix which I hope will help, if you're able to retry that. I don't think you'll need
The re-run of bwa steps is also totally wrong. I'm at a loss for why that is. Would you be able to post a
Thanks again for the help debugging.
Hi Brad,
Miika;
Hey Brad - So, I wiped the
Jason;
No I don't... but I'll send you a genome or two over aspera... standby
Jason;
Is it possible you were running a different install than intended, or one not fully up to date with the latest development version? I also couldn't replicate the issue you saw with restarting bwa from scratch. All of my runs skip past that quickly and restart right at the last end point. So I'm officially confused as to what is going on, but I hope if you can get the same code as the latest it will work better for you. I did identify a bug in preparing the canFam3 genome, where it could pick up fasta files from the transcripts directory and generate a reference genome with ~35k contigs since it included the transcripts. That is now fixed, but I don't think it is your issue, since my install with that problem died immediately during alignment (some tools don't like 35k contigs, I guess). However, you might want to double check that your genome install looks right (
I'll do more testing on memory usage and variant call hanging, but wanted to report back about the region and re-run alignment discrepancy and see if we could resolve those before debugging more. Hope this helps some and thanks again for all the patience debugging this.
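The "group smaller reference chromosomes" fix from the commit above amounts to binning small contigs into shared analysis blocks instead of giving each its own region. A greedy sketch (my illustration, not bcbio's actual code):

```python
def group_contigs(contigs, max_size):
    """Greedily bin (name, size) contigs into analysis blocks whose total
    size stays under max_size, drastically cutting the number of
    parallel regions for genomes with thousands of small scaffolds."""
    blocks, cur, cur_size = [], [], 0
    for name, size in contigs:
        if cur and cur_size + size > max_size:
            blocks.append(cur)
            cur, cur_size = [], 0
        cur.append(name)
        cur_size += size
    if cur:
        blocks.append(cur)
    return blocks
```

With a sensible `max_size` (e.g. around the size of a large chromosome), thousands of scaffolds collapse into a handful of blocks.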
Hey Brad, Thanks so much for testing - not sure why you get drastically fewer blocks - my run has 6 dogs, 3 different breeds. Did you test just one dog or both of the dogs I sent? I sent 2 different breeds in hopes of maximizing variation diversity (if this even matters in this case?). As for the canFam3 ref, I have the following - looks correct to me, ~3,500 contigs, not ~35,000:
Jason;
Hey Brad - well, after all that we're down to 5008 analysis blocks! My job is stuck in the queue ATM, but I think this 10X reduction in regions was the problem, so we should be set. Will let you know if I hit any more snags but wanted to update you for now. Thanks for all the work and apologies for the confusion. That was intense!
… algorithm argument in bcbio_system.yaml. Experimental feature to test parallelization memory usage on genomes with larger number of regions #386
Jason;
I'm testing this from my side as well, so haven't made it automatically enabled. If it helps with scaling we could also add the same idea to function outputs (right now it is only inputs), which would give us a 10x reduction in message sizes. Miika, if you have time to test these with your genome as well, I'd welcome feedback on whether they help, hurt, or do nothing. Thanks again to all of you for the testing and patience getting this running.
Thanks Brad, will throw 25 monkey exomes at it next week! |
OK so here's a quick update:
Brilliant -- thank you for the update. Glad to hear it's generally working better and the improvements in splitting short contigs help reduce the total block count. I'm also stress testing this and have a few more small fixes pending, but mostly for GATK HaplotypeCaller scaling and avoiding pyzmq errors, which it looks like you're not seeing. For the disk space issue, is it possible your temporary file space could fill up? pybedtools uses the system temporary directory, but a useful fix would be to use
I'll work on integrating this as well. Thanks for the pointers and testing.
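One way to point temporary files at roomier storage is via TMPDIR, which Python's tempfile module (and libraries built on it) honors; pybedtools also exposes its own pybedtools.set_tempdir(). A sketch with an illustrative helper name:

```python
import os
import tempfile

def set_shared_tmpdir(path):
    """Point Python's tempfile module at a directory with enough free
    space (e.g. on shared storage) instead of a small node-local /tmp,
    by setting TMPDIR and clearing tempfile's cached choice."""
    os.makedirs(path, exist_ok=True)
    os.environ["TMPDIR"] = path
    tempfile.tempdir = None  # clear the cache so TMPDIR is re-read
    return tempfile.gettempdir()
```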
@chapmanb I've set
Edit: I've got this already in
(pointing to our gpfs)
Miika;
Well, in my testing of the larger data set, I got this:
Sorry, not very informative! :D
…ping of return messages through IPython, increase available GC sockets to IPython, and perform better downsampling of GATK calls. #386
Miika; If you still run into issues, practically the best thing to do is to skip the extra chromosomes if you don't actually need variant calls there. Hope this helps fix things up and get your processing done.
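Skipping the extra chromosomes can be as simple as filtering the variant regions BED down to the standard contigs (a sketch; the pattern assumes UCSC-style 'chr' names):

```python
import re

def standard_chroms_only(bed_lines):
    """Keep only BED lines whose contig is a standard chromosome
    (chr1..chrN, chrX, chrY, chrM/chrMT), dropping scaffolds and
    haplotype contigs."""
    pat = re.compile(r"^chr(\d+|X|Y|M|MT)\t")
    return [line for line in bed_lines if pat.match(line)]
```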
@chapmanb I tried running the analysis excluding the extra bits from the bed file, but I guess the underlying problem is that all regions still get processed prior to variant calling (the 'bamprep' stage).
I think I can run one sample at a time, but 48 samples is clearly too many. Any suggestions?
Miika;
Those should get excluded from bamprep and variant calling, but it sounds like I may need to push some fixes if that's not happening correctly. It would be useful to know if the analysis blocks have the regions or not, and I can try to hone in on potential issues. More generally, how much memory do you have on the box that is running the main script? It looks like that is the one being killed, so it would be helpful to get a sense of how badly we're doing relative to 48 samples times n blocks. Thanks again.
Thanks again Brad, here's the output:
(It goes on.) The box has 192GB of RAM.
Miika;
For the example files Jason sent for dog, this reduces from 4952 regions with all the extra bits, to 1763 after only keeping standard chr* contigs. This is with the latest development version and no changes other than adding a
Thanks Brad, rather than continuing to hijack this thread, I opened a new one: #445
Looks like this was continued on and fixed in #445, so closing. |
Greetings! Can we add the canine dbSNP vcf to the variation resources in 9dcb447, please? I realize recalibration will not be available, but getting rsIDs sure would be nice :)
The vcf can be obtained here: ftp://ftp.ensembl.org/pub/release-75/variation/vcf/canis_familiaris/Canis_familiaris.vcf.gz
Only thing is the canine genome for bcbio has "chr" prefixes on contigs where the dbSNP does not... I seem to recall you have an Ensembl <--> UCSC conversion method from when we added the rn5 genome, so hoping this is easy without just awk'ing on a 'chr' :)
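For the standard canFam3 chromosomes, the Ensembl-to-UCSC renaming mentioned above is mostly a 'chr' prefix plus the MT special case (a rough sketch; real conversions, such as the one in cloudbiolinux, use an explicit mapping table that also covers scaffolds):

```python
def ensembl_to_ucsc(name):
    """Map an Ensembl contig name to UCSC style for standard chromosomes:
    numbered chromosomes and X/Y get a 'chr' prefix, MT becomes chrM.
    Scaffolds need a real mapping table and are not handled here."""
    if name == "MT":
        return "chrM"
    return "chr" + name
```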
Thanks!