Read in multi-sample bcftools stats file #1087
Comments
Hi @WimSpee,
Sounds sensible. I only have one example output file from bcftools currently and it doesn't have
In terms of implementation, data in MultiQC is typically sample-level but we can plot whatever we want really. I guess there could be a flag where if
Do we have data corresponding to the current table columns?
Phil
Example data available in duplicate issue #1272
Done in #1271 - if you get a chance please give it a test!
Is your feature request related to a problem? Please describe.
For large variant calling files (e.g. 500GB+, 500M variants, 150+ samples) it takes a very long time to create a bcftools stats file for each sample, because the whole 500GB file has to be parsed once per sample.
bcbio/bcbio-nextgen#2336
An alternative is to create a single multi-sample bcftools stats file, for which the 500GB VCF is only read once.
-s, --samples <list> list of samples for sample stats, "-" to include all samples
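As a rough illustration of the single-pass approach, the sketch below builds a `bcftools stats -s -` command from Python. The helper names `build_stats_command` and `run_stats` and the file paths are hypothetical; only the `-s` option quoted above is taken from bcftools itself.

```python
import subprocess


def build_stats_command(vcf_path, samples="-"):
    """Return the argv list for one multi-sample `bcftools stats` run.

    samples="-" asks bcftools to include all samples (see the -s help text).
    """
    return ["bcftools", "stats", "-s", samples, vcf_path]


def run_stats(vcf_path, out_path, samples="-"):
    """Run bcftools once over the VCF and write the stats text to out_path.

    Assumes `bcftools` is installed and on PATH.
    """
    with open(out_path, "w") as out:
        subprocess.run(build_stats_command(vcf_path, samples), stdout=out, check=True)
```

For example, `run_stats("calls.vcf.gz", "calls.stats.txt")` would read the (hypothetical) VCF once and produce a single stats file covering every sample.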
The resulting file has PSC and PSI blocks with per-sample information.
The rest of the information in the stats file is not per sample QC information, but aggregated QC information for the whole variant calling.
Can the information outside of the PSC and PSI blocks be discarded, given that MultiQC reports have no place for QC information that is not at the level of a single sample?
MultiQC currently does not recognize (the samples in) the multi-sample VCF.
It takes the name of the file as the sample name, and I think it also does not use the PSC or PSI information, but reads in the analysis level information.
Describe the solution you'd like
Read in the PSC and PSI information from multi-sample BCFTools stats files. Don't use the information outside of the PSC and PSI blocks, or show it as analysis level information.
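To make the request concrete, here is a minimal sketch of how the PSC and PSI rows could be collected per sample. It is not MultiQC's actual implementation. It assumes the stats file is tab-separated, that each block has a header line such as `# PSC	[2]id	[3]sample	...`, and that one of the columns is named `sample`; column names are read from that header rather than hard-coded. The function names are hypothetical.

```python
import re
from collections import defaultdict


def to_number(value):
    """Convert a field to int or float where possible, else keep the string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value


def parse_per_sample_blocks(lines, sections=("PSC", "PSI")):
    """Collect per-sample values from the PSC/PSI blocks of a bcftools stats file.

    Everything outside the requested sections is ignored.
    Returns {sample_name: {"PSC_<column>": value, ...}}.
    """
    headers = {}
    samples = defaultdict(dict)
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        if line.startswith("#"):
            # Assumed header form: "# PSC<TAB>[2]id<TAB>[3]sample<TAB>..."
            tag = fields[0].lstrip("# ").strip()
            if tag in sections and len(fields) > 1:
                headers[tag] = [re.sub(r"^\[\d+\]", "", f).strip() for f in fields[1:]]
            continue
        tag = fields[0]
        if tag not in sections or tag not in headers:
            continue  # analysis-level rows, or data rows without a known header
        row = dict(zip(headers[tag], fields[1:]))
        sample = row.pop("sample", None)
        if sample is None:
            continue
        row.pop("id", None)  # file/set index, not a per-sample metric
        for key, value in row.items():
            samples[sample][f"{tag}_{key}"] = to_number(value)
    return dict(samples)
```

Called as `parse_per_sample_blocks(open("calls.stats.txt"))` (a hypothetical file name), this yields one entry per sample found in the file, so the file name no longer has to stand in for the sample name.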
Describe alternatives you've considered
Keep creating (in bcbio) and reading in (in MultiQC) single sample bcftools stats files.
But make this more efficient by using a multi-sample BCF file, or by first splitting the multi-sample VCF file into single-sample VCF files in one pass. See the linked bcbio ticket.
Thank you.