Read in multi-sample bcftools stats file #1087

Closed
WimSpee opened this issue Dec 24, 2019 · 3 comments
Labels
bug: module, module: change

Comments

@WimSpee

WimSpee commented Dec 24, 2019

Is your feature request related to a problem? Please describe.
For large variant-calling files (e.g. 500GB+, 500M variants, 150+ samples), it takes a very long time to create a bcftools stats file for each sample, because the whole 500GB file has to be parsed once per sample.

bcbio/bcbio-nextgen#2336

An alternative is to create a single multi-sample bcftools stats file, for which the 500GB VCF is only read once.

-s, --samples <list> list of samples for sample stats, "-" to include all samples

The resulting file has PSC and PSI blocks with per-sample information.

# PSC, Per-sample counts. Note that the ref/het/hom counts include only SNPs, for indels see PSI. The rest include both SNPs and indels.
# PSC   [2]id   [3]sample       [4]nRefHom      [5]nNonRefHom   [6]nHets        [7]nTransitions [8]nTransversions       [9]nIndels      [10]average depth       [11]nSingletons [12]nHapRef     [13]nHapAlt     [14]nMissing
# PSI, Per-Sample Indels
# PSI   [2]id   [3]sample       [4]in-frame     [5]out-frame    [6]not applicable       [7]out/(in+out) ratio   [8]nHets        [9]nAA
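A minimal sketch of how the PSC and PSI rows could be pulled out of such a file, assuming tab-separated data lines whose first field is the block tag and whose remaining fields follow the header order quoted above. The function name `parse_per_sample_blocks` is hypothetical and this is not MultiQC's actual parser; values are kept as strings, and other bcftools versions may emit different columns.

```python
from collections import defaultdict

# Column names copied from the PSC/PSI header lines quoted above.
PSC_COLUMNS = [
    "id", "sample", "nRefHom", "nNonRefHom", "nHets", "nTransitions",
    "nTransversions", "nIndels", "average depth", "nSingletons",
    "nHapRef", "nHapAlt", "nMissing",
]
PSI_COLUMNS = [
    "id", "sample", "in-frame", "out-frame", "not applicable",
    "out/(in+out) ratio", "nHets", "nAA",
]


def parse_per_sample_blocks(text):
    """Return {sample: {"PSC": {...}, "PSI": {...}}} from bcftools stats text.

    Hypothetical helper: assumes tab-separated rows and the column order above.
    """
    columns = {"PSC": PSC_COLUMNS, "PSI": PSI_COLUMNS}
    samples = defaultdict(dict)
    for line in text.splitlines():
        # Skip comment/header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.split("\t")
        tag = fields[0]
        if tag not in columns:
            continue  # aggregate (non per-sample) sections are ignored here
        # zip() stops at the shorter sequence, so extra columns are dropped.
        row = dict(zip(columns[tag], fields[1:]))
        samples[row["sample"]][tag] = row
    return dict(samples)
```

Calling `parse_per_sample_blocks(open("stats.txt").read())` (the file name is a placeholder) would give one entry per sample, which is the level at which MultiQC usually reports.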

The rest of the information in the stats file is not per sample QC information, but aggregated QC information for the whole variant calling.

Can the information outside of the PSC and PSI blocks be discarded, given that MultiQC reports have no place for QC information that is not at the level of a single sample?

MultiQC currently does not recognize (the samples in) the multi-sample VCF.
It takes the name of the file as the sample name, and I think it also does not use the PSC or PSI information, but reads in the analysis level information.

Describe the solution you'd like
Read in the PSC and PSI information from multi-sample BCFTools stats files. Don't use the information outside of the PSC and PSI blocks, or show it as analysis level information.

Describe alternatives you've considered
Keep creating (in bcbio) and reading in (in MultiQC) single-sample bcftools stats files, but make this more efficient by using a multi-sample BCF file, or by first splitting the multi-sample VCF file into single-sample VCF files in one pass. See the linked bcbio ticket.

Thank you.

@ewels
Member

ewels commented Apr 2, 2020

Hi @WimSpee,

Sounds sensible. I only have one example output file from bcftools currently and it doesn't have PSC or PSI blocks. Please could you submit a PR to the test data repo with one or more example output files?

In terms of implementation, data in MultiQC is typically sample-level, but we can plot whatever we want really. I guess there could be a flag: if PSC or PSI is found in any of the files parsed, a warning box is shown above the plots alerting the user that at least some of the data shown is at an aggregate level. If those sections are found, then we ditch everything that is currently put into the General Statistics table and instead use per-sample numbers from those sections.

Do we have data corresponding to the current table columns? Vars / SNP / Indel / Ts/Tv / MNP
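One possible mapping from a PSC row to those columns, as a rough sketch only. It assumes the SNP count can be taken as nNonRefHom + nHets (the PSC header says the ref/het/hom counts cover SNPs only), that Ts/Tv is nTransitions / nTransversions, and that "Vars" is SNPs plus indels; none of this is confirmed as what MultiQC does. The function name `summarise_psc_row` is hypothetical, and the PSC columns quoted in this issue have no per-sample MNP count, so that column is left out.

```python
def summarise_psc_row(row):
    """Derive table-style numbers from one PSC row (dict of column -> string).

    Hypothetical helper; the column mapping is an assumption, not MultiQC's.
    """
    transitions = int(row["nTransitions"])
    transversions = int(row["nTransversions"])
    # Assumption: non-reference SNP genotypes = hom-alt + het (SNP-only counts).
    snps = int(row["nNonRefHom"]) + int(row["nHets"])
    indels = int(row["nIndels"])
    return {
        "variants": snps + indels,
        "snps": snps,
        "indels": indels,
        # Avoid division by zero when a sample has no transversions.
        "tstv": transitions / transversions if transversions else None,
    }
```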

Phil

ewels added the "module: change" and "waiting: example data" (Needs example data before we can proceed) labels on Apr 2, 2020
ewels mentioned this issue on Jul 4, 2021
ewels removed the "waiting: example data" label on Jan 27, 2022
@ewels
Member

ewels commented Jan 27, 2022

Example data available in duplicate issue #1272

@ewels
Member

ewels commented Jan 28, 2022

Done in #1271 - if you get a chance please give it a test!

@ewels ewels closed this as completed Jan 28, 2022