Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial subset script for v3 #48

Closed
wants to merge 25 commits into from
Closed
Changes from 12 commits
Commits
Show all changes
25 commits
Select commit Hold shift + click to select a range
bfe6bda
Initial subset script for v3
mike-w-wilson Mar 12, 2020
7cc11b4
Updated to use general subset function
mike-w-wilson Mar 12, 2020
48655d8
Updated to add metadata
mike-w-wilson Mar 12, 2020
3dec95b
Added .bgz to vcf file path so shards block gzip
mike-w-wilson Mar 12, 2020
84dd46d
addressed feedback
mike-w-wilson Mar 18, 2020
7d90cc1
Added VCF info header
mike-w-wilson Mar 18, 2020
2b18a06
Formatted with Black
mike-w-wilson Mar 18, 2020
bfd58a6
no really formatted with black
mike-w-wilson Mar 18, 2020
2544004
Added Format to header and upped min partitions
mike-w-wilson Mar 19, 2020
887351b
small optimizations
mike-w-wilson Mar 24, 2020
43f388e
Add subset and gnomad callstats
mike-w-wilson Aug 18, 2020
b7e3c79
Fixed dict and AF callstat grab
mike-w-wilson Aug 18, 2020
c413b93
Fixed dict and AF callstat grab
mike-w-wilson Aug 18, 2020
ef72c96
Updated header, comments, and hardcoded sparse
mike-w-wilson Aug 19, 2020
d8124de
Removed AC and AC_raw
mike-w-wilson Aug 19, 2020
d9e48e7
Removed _adj from annotations
mike-w-wilson Aug 19, 2020
73ade62
drop adj
mike-w-wilson Aug 24, 2020
85f326f
Merge branches 'master' and 'mw/subset_v3' of https://github.com/broa…
jkgoodrich Oct 27, 2021
9e3aa37
Changes needed to make subset. Use release files instead of freq and …
jkgoodrich Oct 29, 2021
255ad04
Update script to use VDS and form options
mike-w-wilson Oct 28, 2022
5a38e05
Fix missing args
mike-w-wilson Oct 28, 2022
3681815
Fix missing args
mike-w-wilson Oct 28, 2022
399c682
Remove old arg
mike-w-wilson Oct 28, 2022
b0aa68a
Add basic VCF export
mike-w-wilson Nov 2, 2022
1e83259
Add variant qc field selection
mike-w-wilson Nov 3, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
292 changes: 292 additions & 0 deletions gnomad_qc/v3/subset/subset.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,292 @@
import argparse
import logging
import sys

import hail as hl

from gnomad_qc.v3.resources.meta import meta as metadata
from gnomad_qc.v3.resources.annotations import get_info
from gnomad_qc.v3.resources.raw import get_gnomad_v3_mt
from gnomad.resources.grch38.gnomad import GENOME_POPS
from gnomad.utils.filtering import subset_samples_and_variants


logging.basicConfig(
level=logging.INFO,
format="%(asctime)s: %(message)s",
datefmt="%m/%d/%Y %I:%M:%S %p",
)
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

HEADER_DICT = {
jkgoodrich marked this conversation as resolved.
Show resolved Hide resolved
"format": {
"GQ": {"Description": "Genotype Quality"},
"SB": {
"Description": "Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias"
},
"AD": {
"Description": "Allelic depths for the ref and alt alleles in the order listed"
},
"PID": {
"Description": "Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group"
},
"GT": {"Description": "Genotype"},
"MIN_DP": {"Description": "Minimum DP observed within the GVCF block"},
"PGT": {
"Description": "Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another"
},
"PL": {
"Description": "Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification"
},
"DP": {"Description": "Approximate read depth"},
"RGQ": {"Description": ""},
},
"info": {
"AC": {"Description": "Alternate Allele count in gnomAD post-filtering"},
mike-w-wilson marked this conversation as resolved.
Show resolved Hide resolved
"AC_raw": {"Description": "Raw alternate allele count in gnomAD"},
"AS_ReadPosRankSum": {
"Description": "allele specific Z-score from Wilcoxon rank sum test of each Alt vs. Ref read position bias"
mike-w-wilson marked this conversation as resolved.
Show resolved Hide resolved
},
"AS_MQRankSum": {
"Description": "Allele-specifc Z-score from Wilcoxon rank sum test of alternate vs. reference read mapping qualities"
mike-w-wilson marked this conversation as resolved.
Show resolved Hide resolved
},
"AS_RAW_MQ": {
"Description": "Allele-specifc raw root mean square of the mapping quality of reads across all samples"
mike-w-wilson marked this conversation as resolved.
Show resolved Hide resolved
},
"AS_QUALapprox": {
"Description": "Allele-specifc sum of PL[0] values; used to approximate the QUAL score"
mike-w-wilson marked this conversation as resolved.
Show resolved Hide resolved
},
"AS_MQ_DP": {
"Description": "Allele-specifc depth over variant samples for better MQ calculation"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Allele-specifc -> Allele-specific

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't say this anymore, but there are more below. Must be copy paste error

},
"AS_VarDP": {
"Description": "Allele-specific (informative) depth over variant genotypes -- including ref, RAW format"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is meant by including ref, RAW format? Might help to elaborate. Also what is meant by (informative)?

Copy link
Contributor Author

@mike-w-wilson mike-w-wilson Aug 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, definition was taken from https://github.com/broadinstitute/gatk/blob/master/src/test/resources/org/broadinstitute/hellbender/tools/walkers/GnarlyGenotyper/twoSampleASDB.vcf so I'm not entirely sure what informative actually means here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently @ch-kr has this in the common code "Description": "Depth over variant genotypes (does not include depth of reference samples)" I think it is fine to keep as is for now for this callset (gnomad v3.0 currently looks like there is no description), but we should decide on the best explanation and be consistent within our callsets.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree -- gnomad_methods should be the truth data.

},
"AS_MQ": {
"Description": "Allele-specifc root mean square of the mapping quality of reads across all samples"
},
"AS_QD": {
"Description": "Allele-specifc variant call confidence normalized by depth of sample reads supporting a variant"
},
"AS_FS": {
"Description": "Allele-specifc Phred-scaled p-value of Fisher's exact test for strand bias"
},
"AS_SB_TABLE": {
"Description": "Allele-specific forward/reverse read counts for strand bias tests"
},
"ReadPosRankSum": {
"Description": "Z-score from Wilcoxon rank sum test of alternate vs. reference read position bias"
},
"MQRankSum": {
"Description": "Z-score from Wilcoxon rank sum test of alternate vs. reference read mapping qualities"
},
"RAW_MQ": {
"Description": "Raw root mean square of the mapping quality of reads across all samples"
},
"QUALapprox": {
"Description": "Sum of PL[0] values; used to approximate the QUAL score"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll leave this based off of

I actually don't see QUALapprox or AS_QUALapprox in ether of the UKBB or gnomad v3.0 headers. I am fine leaving as is if this is the default then

},
"MQ_DP": {
"Description": "Depth over variant samples for better MQ calculation"
},
"VarDP": {"Description": "(informative) depth over variant genotypes"},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above about (informative)

"SB": {
"Description": "Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias."
},
"MQ": {
"Description": "Root mean square of the mapping quality of reads across all samples"
},
"QD": {
"Description": "Variant call confidence normalized by depth of sample reads supporting a variant"
},
"FS": {
"Description": "Phred-scaled p-value of Fisher's exact test for strand bias"
},
},
}


def format_info_for_vcf(ht) -> hl.Table:
"""Formats gnomAD MT fields with types that are not condusive to vcf export
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add newline after """ and another before :param

:param ht: Table to reformat
:return: Reformatted Table
"""
logger.info("Formatting the info hail table for VCF export")
for f, ft in ht.info.dtype.items():
if ft == hl.dtype("int64"):
ht = ht.annotate(
info=ht.info.annotate(**{f: hl.int32(hl.min(2 ** 31 - 1, ht.info[f]))}) # add warning that number being capped
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make this comment a TODO or add the warning in now

)
elif ft == hl.dtype("float64"):
ht = ht.annotate(
info=ht.info.annotate(
**{f: hl.float32(hl.min(2 ** 31 - 1, ht.info[f]))}
)
)
elif ft == hl.dtype("array<int64>"):
ht = ht.annotate(
info=ht.info.annotate(
**{f: ht.info[f].map(lambda x: hl.int32(hl.min(2 ** 31 - 1, x)))}
)
)
elif ft == hl.dtype("array<float64>"):
ht = ht.annotate(
info=ht.info.annotate(
**{f: ht.info[f].map(lambda x: hl.float32(hl.min(2 ** 31 - 1, x)))}
)
)
return ht


def compute_partitions(mt, entry_size=3.5, partition_size=128000000) -> int:
"""Computes a very rough estimate for the optimal number of partitions in a MT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add newlane after """

using a the hail recommended partition size of 128MB and the rough estimate
of 3.5B per entry in the gnomAD sparse MT.

:param mt: MatrixTable that will be resized
:param entry_size: Average size in bytes that a single entry requires , defaults to 3.5
:param partition_size: Amount of data per partition. Hail recommends 128MB, defaults to 128000000
:return: number of partitions for resizing MT
"""
rows, columns = mt.count()
mt_disk_est = rows * columns * entry_size
n_partitions = (
int(mt_disk_est / partition_size) if (mt_disk_est / partition_size) > 20 else 20 # make max
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make this 20 a parameter and add 20 as a default

)
return n_partitions # bring to gnomad_methods -> parameterize
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add this comment higher up and let's make it a TODO so we don't forget to do it



def main(args):
hl.init(log="/subset.log", default_reference="GRCh38")
pop = args.pop
subset = args.subset
hard_filter = args.hard_filter_samples
release = args.release
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename release_only?

sparse = args.sparse
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now I wonder if it is better to just hardcode sparse as True because we know that get_gnomad_v3_mt is sparse and this script isn't currently generalized to take other datasets. You can add a Note explaining that knowledge of sparse vs dense this is very important to the downstream steps and we can modify when we generalize this more.

header_dict = HEADER_DICT

mt = get_gnomad_v3_mt(
key_by_locus_and_alleles=True, remove_hard_filtered_samples=hard_filter, release_only=release
)
info_ht = get_info().ht()
info_ht = format_info_for_vcf(info_ht)

if "SB" in info_ht.info and not isinstance(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add comment for why this is needed

info_ht.info.SB, hl.expr.ArrayNumericExpression
):
info_ht = info_ht.annotate(
info=info_ht.info.annotate(SB=hl.flatten(info_ht.info.SB))
)

meta = metadata.ht()

if pop and pop in GENOME_POPS:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be a warning or error if pop exists, but is not in GENOME_POPS?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added choices to the pop arg which errors out when the specified pop is not in GENOME_POPS as soon as the script is first started.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is great, thank you!

logger.info(f"Subsetting samples to {pop} population")
mt = mt.filter_cols(meta[mt.col_key].pop == pop)

if subset:
gt_expr = "LGT" if sparse else "GT" #assumes matrix table is not split - infer
mt = subset_samples_and_variants(mt, subset, sparse=sparse, gt_expr=gt_expr)

if args.add_meta:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe change add_meta to export_meta because add_meta sounds like it will somehow add this to the VCF export

meta = meta.semi_join(mt.cols())
meta.export(f"{args.output_path}/metadata.tsv.bgz")

if sparse:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, I think you can remove the sparse argument because you already predefine using get_gnomad_v3_mt and this script is still specific to gnomAD v3.0 (though I do know for future subsets we will want to generalize this more)

mt = hl.experimental.sparse_split_multi(mt, filter_changed_loci=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this matters at all, but just a note that hl.experimental.sparse_split_multi doesn't filter out * alleles. I'm not sure if that will be a problem or not, but it was something we ran into in UKBB

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if these were originally filtered out but I havent had issues with the two previous subsets I ran

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, sounds good

mt = hl.experimental.densify(mt)
else:
mt = hl.split_multi_hts(mt)

mt = mt.filter_rows(hl.len(mt.alleles) > 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add comment about why this step is needed (just because it is specific to a sparse dataset)

mt = mt.filter_rows(hl.agg.any(mt.GT.is_non_ref()))

mt = mt.drop(mt.gvcf_info)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will only be for a sparse MT (or densified sparse) so if you decide to keep the sparse argument, this needs to be under the if sparse block

mt = mt.annotate_rows(info=info_ht[mt.row_key].info)

if args.subset_call_stats:
mt = mt.annotate_rows(subset_callstats=hl.agg.call_stats(mt.GT,mt.alleles))
mt = mt.annotate_rows(info=mt.info.annotate(subset_AC=mt.subset_callstats.AC[1],
subset_AN=mt.subset_callstats.AN,
subset_AF=hl.float32(mt.subset_callstats.AF[1])))
mt = mt.drop('subset_callstats')
header_dict['info'].update({"subset_AC": {"Description": "Alternate Allele count in subset post-filtering"} ,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above to specify what the post-filtering refers to

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also same comment as above, will you need to specify Number for any of these?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added in the adj and raw callstats for subset as well as updated the header dict to include number where necessary.

"subset_AN": {"Description": "Ref and Alternate allele number in subset post-filtering"},
"subset_AF": {"Description": "Alternate Allele frequency in subset post-filtering"}})

if args.add_gnomad_freqs:
from gnomad_qc.v3.resources.annotations import freq
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally just a question for my knowledge. Why import here rather than at the top of the script?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To not crowd the namespace with unnecessary variables/methods but I'm not sure it's really necessary since this is not yet generalized for wider use. Im going to move it up.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

freq = freq.ht()
mike-w-wilson marked this conversation as resolved.
Show resolved Hide resolved
mt = mt.annotate_rows(info=mt.info.annotate(gnomAD_AC_adj=freq[mt.row_key].freq[0].AC,
gnomAD_AN_adj=freq[mt.row_key].freq[0].AN,
gnomAD_AF_adj=hl.float32(freq[mt.row_key].freq[0].AF),
gnomAD_AC_raw=freq[mt.row_key].freq[1].AC,
gnomAD_AN_raw=freq[mt.row_key].freq[1].AN,
gnomAD_AF_raw=hl.float32(freq[mt.row_key].freq[1].AF)))
header_dict['info'].update({"gnomAD_AC_adj": {"Description": "High quality alternate allele count in gnomAD post-filtering"},
mike-w-wilson marked this conversation as resolved.
Show resolved Hide resolved
"gnomAD_AN_adj": {"Description": "High quality allele number. The total number of called alleles in gnomAD post-filtering"},
"gnomAD_AF_adj": {"Description": "High quality alternate allele frequency in gnomAD post-filtering"},
"gnomAD_AC_raw": {"Description": "Raw alternate allele count in gnomAD post-filtering"},
"gnomAD_AN_raw": {"Description": "Raw allele number. The total number of called alleles in gnomAD post-filtering"},
"gnomAD_AF_raw": {"Description": "Raw alternate allele frequency in gnomAD post-filtering"}})

partitions = (
compute_partitions(mt) if not args.num_vcf_shards else args.num_vcf_shards
)
logger.info(f"Naive coalescing to {partitions}")

mt = mt.naive_coalesce(partitions)
if args.checkpoint_path:
jkgoodrich marked this conversation as resolved.
Show resolved Hide resolved
mt = mt.checkpoint(args.checkpoint_path, overwrite=True)

logger.info("Exporting VCF")
hl.export_vcf(
mt,
f"{args.output_path}/sharded_vcf.bgz",
parallel="header_per_shard",
metadata=HEADER_DICT,
mike-w-wilson marked this conversation as resolved.
Show resolved Hide resolved
)


if __name__ == "__main__":

parser = argparse.ArgumentParser(
description="This script subsets gnomAD using a list of samples or population"
)

parser.add_argument(
"-s", "--subset", help="Path to subset table of sample IDs with header: s"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

specify this should be a text file (I am assuming). I got confused at first and expected it to be a path to a HT

)
parser.add_argument("--pop", help="Subset from specific population") # add choices for gnomad pops
mike-w-wilson marked this conversation as resolved.
Show resolved Hide resolved
parser.add_argument(
"--add-meta", help="Pull sample subset metadata", action="store_true"
)
parser.add_argument(
"--output-path", help="Output file path for subsetted VCF, do not include file", required=True,
)
parser.add_argument(
"--num-vcf-shards", help="Number of shards in output VCF", type=int # clarify docstring
)
parser.add_argument("--checkpoint-path", help="Path to final mt checkpoint")
parser.add_argument("--hard-filter-samples", help="Removes hard filtered samples", required=True, action="store_true")
parser.add_argument("--release", help="Keep release samples only", required=True, action="store_true")
parser.add_argument(
"--subset-call-stats", help="Adds subset callstats, AC, AN, AF"
)
parser.add_argument(
"--add-gnomad-freqs", help="Adds gnomAD's adj and raw callstats"
)
parser.add_argument("--sparse", help="Use if source MT is sparse", action="store_true")
parser.add_argument(
"-o",
"--overwrite",
help="Overwrite all data from this subset (default: False)",
action="store_true",
)
args = parser.parse_args()

if not (args.subset or args.pop):
sys.exit('Error: At least subset path or population must be specified')

main(args)