Trio pipeline #2961
Hi Konstantinos @kokyriakidis! I used gatk4, gatk3.8, samtools, freebayes, platypus and ensemble for Trio Exome analysis. SV/CNV analysis is different. While for small variants you might want to call them in a batch, you may also try to call CNVs using XHMM across many samples with similar coverage. Sergey |
Hi @naumenko-sa ! Do you recommend the same tools for Trio analysis? Or should I opt for GATK, deepvariant, strelka2, vardict with the ensemble method? |
Hi Konstantinos @kokyriakidis! More callers + ensemble does not necessarily mean better calling. Below is my simple validation from 2018. You can see that samtools is enough for SNVs, while gatk gives better results for indels. You can find more extensive validation methods and results among publications by Justin Zook: I'd suggest running validations with NA12878 and trios (Ashkenazim trio, Chinese trio) using GIAB and their article to see how the tools work in your environment. Bcbio would benefit from any updated validation like this: Using many tools and combining them with ensemble would allow you to pick up variants. However, this approach has many downsides:
Briefly, for germline WES variant analysis I'd suggest using gatk4 (or gatk3.8, whichever works better for you), and focusing more on validation, variant filtration, annotation, and interpretation. Small variant calling in germline as a bioinformatics problem seems to be (almost) solved at 99% precision and sensitivity (worse for indels, but again 99% if you have WGS data). Sergey |
@naumenko-sa Thank you for your wonderful explanation! I want to ask something that bothers me: |
https://bcbio-nextgen.readthedocs.io/en/latest/contents/configuration.html#sample-information describes how to do it. https://gatkforums.broadinstitute.org/gatk/discussion/7696/pedigree-ped-files is the file to create. |
@roryk Let's say I have a TRIO:
The columns are:
How does it match with the samples? Do I have to change their name? My template is:
|
The 'description' metadata is what should be the individual ID in the PED file. bcbio will call all of the samples in the same batch together, and they will get annotated in the GEMINI database according to the family structure in the PED file. It will also check to make sure the family structure is correct; people are often unaware of what their actual family structure is, so it is good to check. |
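For anyone who wants to check this before starting a run, here is a minimal sketch that compares the `description` values from a sample config against the individual IDs (second column) of a PED file. The trio, the family ID and the helper names are made up for illustration, and the `#` header line is an assumption about how the PED file is written; this is not part of bcbio.

```python
def ped_individual_ids(ped_text):
    """Return the individual IDs (second column) from PED-formatted text."""
    ids = []
    for line in ped_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        ids.append(line.split()[1])
    return ids


def unmatched_samples(descriptions, ped_text):
    """Return (descriptions missing from the PED, PED IDs with no sample)."""
    ped_ids = set(ped_individual_ids(ped_text))
    descs = set(descriptions)
    return sorted(descs - ped_ids), sorted(ped_ids - descs)


# Hypothetical trio: family ID, individual ID, father, mother, sex, phenotype
example_ped = """#family_id individual_id paternal_id maternal_id sex phenotype
FAM1 child father mother 1 2
FAM1 father 0 0 1 1
FAM1 mother 0 0 2 1
"""

# "mom" is deliberately misspelled relative to the PED file
missing_in_ped, missing_in_config = unmatched_samples(
    ["child", "father", "mom"], example_ped
)
print("descriptions not in PED:", missing_in_ped)
print("PED IDs without a sample:", missing_in_config)
```

Both lists should be empty when the descriptions and the PED file agree.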
Thank you so much for the clarification! |
Any thought why octopus stalls in trio analysis?
Certain chromosomes finished correctly (19/25). After that it pops this error. I tried to rerun but had the same response |
Thanks much for the report and apologies about the problem. This was a bug in finalizing the octopus VCF files, which is now fixed in the latest development. If you update with ( |
@chapmanb @naumenko-sa When I run the trio pipeline I got this error. Any thoughts?
My template is:
The vcf annotation file I used is the following from the @naumenko-sa cre project:
|
@chapmanb I get another error from OCTOPUS during its run:
Then octopus continues to process chromosomes and after a lot of minutes it ends with this:
|
Konstantinos; The latest development has two fixes for the two issues:
Please let us know if you hit any other problems and thanks again for all the patience debugging. |
@chapmanb Using just gatk-haplotype I now get these error messages:
My template is:
My vcf annotation file is:
|
I'm guessing your PED file is malformed somehow, can you pass on the PED file you are using? |
My PED file:
|
Got the same error when I did not specify a vcfanno annotation file. So this file does not cause the problem |
I tried running it WITHOUT a PED file and I got these errors:
|
Thanks for investigating, sorry for being slow about getting back to you. I think the original problem is with the format of your PED file. I'm guessing that the samplenames don't match what is in the PED file. Do your samplenames get swapped to having an X in front of them during the bcbio run? We try to make everything compatible downstream with R, since lots of folks use it for analysis, and R won't allow you to have column names that start with a number. I'm guessing that is the problem with the PED file. |
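A rough way to see which names would be affected is sketched below. It only imitates the one rule mentioned above (a name starting with a digit gets an `X` prefix, as R's `make.names` does) and ignores R's other renaming rules; the sample names are hypothetical, and whether bcbio renames samples in exactly this way is an assumption.

```python
def r_safe_name(name):
    """Prefix 'X' when a name starts with a digit (simplified R-style rule)."""
    return "X" + name if name[:1].isdigit() else name


def names_changed_by_r(names):
    """Map each name that would be altered to its R-safe form."""
    return {n: r_safe_name(n) for n in names if r_safe_name(n) != n}


# Hypothetical sample names; only the first starts with a digit
print(names_changed_by_r(["1234_child", "father", "mother"]))
```

Any name that shows up in the result would also need the same spelling in the PED file, or it could simply be renamed so that it does not start with a number.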
Is the last run from your vcfanno file without a PED file? I think your vcfanno file is probably not doing the right thing, if so. Do you need to use a custom file? Can you just use the annotations bcbio provides? |
The previous run was WITH VCFANNO file and WITHOUT PED file. I ran it again WITHOUT PED OR VCFANNO FILE with the following template
I get these errors:
|
Thanks, sorry for all the back and forth. Could you send me a snippet of the VCF file that has this error so I can take a look? It looks like you might be able to subset it down to just rs1437329596 to get the error to be reproducible. It would be helpful if you could include a few variants that don't fail, if this wasn't the first variant as well.
is the command that is failing. If you replace
with your snippet, we should be able to see if the snippet fails, which would work as a test case. |
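One possible way to cut such a snippet, sketched in plain Python under the assumption that the VCF is uncompressed and already read into a list of lines: keep the header, the record whose ID column contains the failing rsID, and a few neighbouring records that do not fail. The function name and the tiny example VCF are made up for illustration.

```python
def vcf_snippet(vcf_lines, variant_id, context=2):
    """Return header lines plus the records around the one carrying variant_id."""
    header = [l for l in vcf_lines if l.startswith("#")]
    records = [l for l in vcf_lines if l.strip() and not l.startswith("#")]
    # The ID is the third VCF column and may hold several IDs separated by ';'
    hits = [
        i for i, l in enumerate(records)
        if variant_id in l.split("\t")[2].split(";")
    ]
    if not hits:
        return header  # ID not found: return only the header
    start = max(hits[0] - context, 0)
    end = min(hits[-1] + context + 1, len(records))
    return header + records[start:end]


# Made-up records, not taken from the failing file
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\trs1\tA\tG\t.\t.\t.",
    "1\t200\trs2\tC\tT\t.\t.\t.",
    "1\t300\trs3\tG\tA\t.\t.\t.",
    "1\t400\trs4\tT\tC\t.\t.\t.",
]
snippet = vcf_snippet(example_vcf, "rs3", context=1)
print("\n".join(snippet))
```

For a bgzipped file the lines would first have to be read with `gzip.open(path, "rt")`.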
First I will try to run it again with no vcfanno file or ped file BUT with sample names and description not starting with number and I will update you |
Ok! |
I got the same errors again when running with NO VCFANNO AND NO PED FILE AND SAMPLE AND DESCRIPTION NAMES NOT STARTING WITH NUMBERS:
I am very frustrated. Neither the PED nor the VCFANNO file cause the problem. Do you want me to upload the fastq files so you can have a look? |
I have uploaded the whole project folder after the failed run so you can have a look!
|
I had to specify
When the process ended, the last command stated:
Is this normal?
|
@naumenko-sa Do we have any news? |
Hi @kokyriakidis !
after removing duplicates with
the duplicate record is gone, but the allele A is gone as well
I've submitted a bug to
this time the allele A is kept:
SN |
Thanks so much for all your help Sergey! So in order to get it to work, do I just update bcbio after the cloudbiolinux PR is merged, or do I have to do the steps you did? |
Yes, you'd need to |
Ok! I am waiting for your confirmation! Thanks again for helping and testing! |
It seems that I get this error now when running with GEMINI:
Without GEMINI everything worked fine |
Hi Sergey and the rest! Will sticking to gnomad 2.1 help with this issue? |
This is still due to a PED file problem-- could you send along the PED file you are using? |
Thanks, could you also pass along the YAML file you are using? |
Thanks, how about this file? |
@roryk The PED file is fine right? |
Hi @kokyriakidis, Yes, it looks fine to me. Could you also send along the VCF file or a snippet from it so I can test locally? |
The whole project is here:
|
Thanks, the PED file and the VCF file don't have the same names. The PED file has the names as |
@roryk Sorry this was from another run. Please check again the link above. I have uploaded the updated run (which also fails) |
Thanks, it looks like the PED file has the ethnicity column but no entry for it. If you add an ethnicity column to the PED file and populate it with -9 it should fix this problem. |
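A small sketch of that workaround, assuming the PED file has a `#` header line that names the columns and that the data rows are simply shorter than the header: every short row is padded with -9. The column names and the example row are made up for illustration.

```python
def pad_ped_rows(ped_text, fill="-9"):
    """Pad data rows with `fill` so each has as many fields as the header."""
    lines = ped_text.splitlines()
    header = next((l for l in lines if l.startswith("#")), None)
    if header is None:
        return ped_text  # no header: nothing to compare row widths against
    width = len(header.lstrip("#").split())
    out = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            out.append(line)  # keep header and blank lines unchanged
            continue
        fields = line.split()
        fields += [fill] * (width - len(fields))  # no-op if already wide enough
        out.append("\t".join(fields))
    return "\n".join(out) + "\n"


# Hypothetical PED: the header lists ethnicity but the row has no value for it
example = (
    "#family_id\tname\tpaternal_id\tmaternal_id\tsex\tphenotype\tethnicity\n"
    "FAM1\tchild\tfather\tmother\t1\t2\n"
)
fixed = pad_ped_rows(example)
print(fixed)
```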
hmm thanks! Maybe update the readme, so others know that |
Thanks, I think we shouldn't be adding that column unless it is there, so it is a bug. |
Running the command again caused the same error after I changed the PED file. Should I delete the work folder and run it again?
Heya, it will still be using the broken PED file if it is on the disk, I think if you just delete the PED file in the GEMINI directory it will populate a new one. |
I'll update the documentation to mention we require the ethnicity as well, sorry about that. |
It finally ran successfully! Thanks a lot to all of you for the help! |
Great! Sorry for the problems! |
Ethnicity is not required in the PED file description but we were treating it as if it was, leading to errors with standard PED files. Addresses issue brought up in #2961.
@chapmanb
I would like to run a trio analysis on whole exome samples. Can I use all callers (strelka2, deepvariant, vardict, gatk etc.) for a trio analysis with samples having the same batch name? Can I use the ensemble method?
I am also trying to do CNV analysis in this trio. Can I add all svcallers? Do they all work with a single germline sample?
It would also be nice to specify in the documentation:
Which callers can be used for Germline Variant Calling
Which callers can only be used for Somatic (Tumor-Normal) Variant Calling
Which callers can be used for Germline SV Calling
Which callers can only be used for Somatic (Tumor-Normal) SV Calling
Which callers can be used for Trio analysis