-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial subset script for v3 #48
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few small comments in the current code. Also, this will produce a VCF without Fields definitions -- I can see how this may not be crucial atm but since we generally want field definitions I think this should be added. Most of it is in gnomad_qc.v2.variant_qc.perpare_release_data.py
. I also have a v3 version but I realized it's not in yet -- I'll make sure to do that and happy to complement your code with these fields as well.
gnomad_qc/v3/subset/subset.py
Outdated
} | ||
) | ||
) | ||
elif ft == hl.dtype("array<array<int32>>"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is definitely not generic and won't be right in many cases. Often times, array<array<...>>
are converted to pipe-delimited arrays in VCF-land. If this is correct for a specific field, I think it'd be best to have it outside this function and apply it to this field directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, thanks for catching this. This was for the SB field and I realize you handle that separately in the ht_to_vcf_mt
. I've add that in.
gnomad_qc/v3/subset/subset.py
Outdated
mt = mt.cols().drop("pop") | ||
|
||
if subset: | ||
mt = subset_samples_and_variants(mt, subset, sparse=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add gt_expr = "LGT"
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right!
gnomad_qc/v3/subset/subset.py
Outdated
mt = mt.drop(mt.gvcf_info) | ||
mt = mt.annotate_rows(info=info_ht[mt.row_key].info) | ||
|
||
mt = mt.naive_coalesce(1000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this should be a param or a function of the resulting size of the data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I gave this a shot using the recommended 128MB per partition but there isnt currently a built in way to get the actual size of the MT so I also used an entry estimate I calculated from the size of vcf shard.
I've added in the vcf header. I used the info and format descriptions from v2 that overlapped and also what I found here: https://github.com/broadinstitute/gatk/blob/master/src/test/resources/org/broadinstitute/hellbender/tools/walkers/GnarlyGenotyper/twoSampleASDB.vcf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing major, but I think we need to roll much of this code into gnomad_methods
so that it's accessible for future uses. I think a gnomad_methods.utils.vcf
would be a great idea and a place to put all the headers for example. I realize this is probably more than you signed up for with this subsetting but I think it'd be worth it. Let me know what you think and you feel like starting this -- otherwise I guess we'd just add a comment in the script that this should be generalized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one more optimization point
@lfrancioli , I agree with moving much of this code to |
Changed my mind on this. I'm going to reorg gnomad_methods, move the header, compute partitions and format function to a vcf module in gnomad_method and then update this script. Stay tuned! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments and requested changes, but nothing big. I know we will need to make all of this use more of the code in gnomad_methods (and update it), but we can do that after this current subset that we need to make
Thanks for the quick feedback @jkgoodrich , back to you! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some small comments and a question. Very close. Thank you Mike!
gnomad_qc/v3/subset/subset.py
Outdated
gnomAD_AC_raw=freq_ht[mt.row_key].freq[1].AC, | ||
gnomAD_AN_raw=freq_ht[mt.row_key].freq[1].AN, | ||
gnomAD_AF_raw=hl.float32(freq_ht[mt.row_key].freq[1].AF))) | ||
header_dict['info'].update({ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the difference between these and the AC, AN, AF... in the HEADER_DICT
above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question, I've reached out to Laurent for help answering this. I'm thinking of dropping the above AC
and AC_raw
to avoid any confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After speaking with Laurent and investigating the AC
and AC_raw
in the info file compared to the frequencies file, AC
and AC_raw
are computed on all samples while the frequencies ACs are computed on release only. I am dropping AC
and AC_raw
from the info_ht before annotating.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes, because this is the raw table. For the sites table I think he resets these to only the release samples (and high quality unrelated). I am assuming AN and AF also should be dropped. The only question is for subsets will we want AC, AN, and AF to now be reset to the numbers in the subset rather than having a subset_AC,... Also I think we can drop the '_adj' on everything. The way it is in gnomAD and UKBB is that we specify if it is raw, but otherwise it is adj and it is described in the header what it is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There isn't a AN or AF in the raw table. I'm not sure I understand this sentence "The only question is for subsets will we want AC, AN, and AF to now be reset to the numbers in the subset rather than having a subset_AC" Are you suggesting renaming the subset_AC
annotation to AC
so on for the other callstat annotations? dropping _adj
from the annotations and updating headers makes sense to me, especially if it is consistent with gnomAD and UKBB. Thanks again!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, reread the comment and now I understand changing the subset_AC to AC. I think we'll just have to be clear in the email but that works for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I was wondering if you think we instead of subset_AC
we name it AC
, subset_AN
we name it AN
, and subset_AF
we name it AF
. Since they will be a reflection of the samples in the VCF.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense. I've updated the annotations so if you wouldnt mind one last look, I'd appreciate it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you Mike! I think this looks good. Now you can get that subset going!!
Clarifying note: this script is more focused on creating separate sub-callsets of gnomAD and not about creating frequency array annotations for subsets of gnomAD |
Mike says this needs to be looked at again for another round of review, and we should re-evaluate priority. |
May no longer be relevant |
…add a checkpoint before Naive coalescing and compute_partitions which has a count in it
Replaced by #566 |
This is an initial PR for the gnomAD subsetting script. @ch-kr told me she is working on a more general subset script that will be committed to the general repo using similar logic. Once that is in, this script will be updated. However, since the subsets requests are coming in I wanted to make sure this was reviewed before handing over more VCFs to collaborators.