Initial subset script for v3 #48

mike-w-wilson · 2020-03-12T16:24:32Z

This is an initial PR for the gnomAD subsetting script. @ch-kr told me she is working on a more general subset script, using similar logic, that will be committed to the general repo. Once that is in, this script will be updated. However, since subset requests are coming in, I wanted to make sure this was reviewed before handing over more VCFs to collaborators.

lfrancioli

A few small comments on the current code. Also, this will produce a VCF without field definitions. I can see how this may not be crucial atm, but since we generally want field definitions, I think they should be added. Most of it is in gnomad_qc.v2.variant_qc.perpare_release_data.py. I also have a v3 version, but I realized it's not in yet. I'll make sure to add it, and I'm happy to complement your code with these fields as well.

lfrancioli · 2020-03-16T13:47:07Z

gnomad_qc/v3/subset/subset.py

+                    }
+                )
+            )
+        elif ft == hl.dtype("array<array<int32>>"):


This is definitely not generic and won't be right in many cases. Oftentimes, array<array<...>> fields are converted to pipe-delimited arrays in VCF-land. If this is correct for a specific field, I think it'd be best to keep it outside this function and apply it to that field directly.

Yes, thanks for catching this. This was for the SB field, and I realize you handle that separately in ht_to_vcf_mt. I've added that in.
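For readers unfamiliar with the pipe-delimited convention mentioned in this thread, here is a minimal sketch in plain Python (not Hail, and not the code from this PR). It assumes inner arrays are joined with `|`, the outer array with `,`, and missing values are written as `.`. The helper name `format_nested_array` and its defaults are hypothetical; whether this layout is right still depends on the specific field, which is the point of the review comment.

```python
from typing import Optional, Sequence


def format_nested_array(
    values: Optional[Sequence[Optional[Sequence[Optional[int]]]]],
    inner_delim: str = "|",
    outer_delim: str = ",",
    missing: str = ".",
) -> str:
    """Flatten an array<array<int>> value into a single VCF-style string.

    Hypothetical helper: inner arrays become pipe-delimited strings and the
    outer array is comma-separated. Missing values are rendered as ".".
    """
    if values is None:
        return missing

    def format_inner(inner: Optional[Sequence[Optional[int]]]) -> str:
        # A missing inner array is written as a single missing marker.
        if inner is None:
            return missing
        return inner_delim.join(missing if v is None else str(v) for v in inner)

    return outer_delim.join(format_inner(inner) for inner in values)
```

Keeping a helper like this outside the generic formatting function, and applying it only to the fields that need it, matches the suggestion above.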

lfrancioli · 2020-03-16T14:00:27Z

gnomad_qc/v3/subset/subset.py

+        mt = mt.cols().drop("pop")
+
+    if subset:
+        mt = subset_samples_and_variants(mt, subset, sparse=True)


add gt_expr = "LGT" ?

lfrancioli · 2020-03-16T14:02:06Z

gnomad_qc/v3/subset/subset.py

+    mt = mt.drop(mt.gvcf_info)
+    mt = mt.annotate_rows(info=info_ht[mt.row_key].info)
+
+    mt = mt.naive_coalesce(1000)


I wonder if this should be a param or a function of the resulting size of the data?

I gave this a shot using the recommended 128MB per partition, but there isn't currently a built-in way to get the actual size of the MT, so I also used a per-entry size estimate that I calculated from the size of a VCF shard.

mike-w-wilson · 2020-03-19T12:14:15Z

I've added in the VCF header. I used the info and format descriptions from v2 that overlapped, along with what I found here: https://github.com/broadinstitute/gatk/blob/master/src/test/resources/org/broadinstitute/hellbender/tools/walkers/GnarlyGenotyper/twoSampleASDB.vcf
I took a shot at making the number of partitions for naive coalesce a function. I set a minimum of 20 because anything lower than that was taking a very long time. Back to you, @lfrancioli!
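As an illustration of the partition sizing discussed in this thread, here is a minimal sketch in plain Python, not the function from this PR. It assumes the data size is estimated as the entry count times a per-entry byte estimate, targets 128MB per partition, and applies the floor of 20 partitions. The name `estimate_partitions` and its parameters are hypothetical.

```python
import math

# Values taken from the discussion above; everything else is an assumption.
TARGET_PARTITION_BYTES = 128 * 1024 * 1024  # 128MB per partition
MIN_PARTITIONS = 20  # lower counts were reported to run very slowly


def estimate_partitions(
    n_entries: int,
    bytes_per_entry: float,
    target_bytes: int = TARGET_PARTITION_BYTES,
    min_partitions: int = MIN_PARTITIONS,
) -> int:
    """Estimate a partition count from an approximate data size.

    Hypothetical helper: the size is n_entries * bytes_per_entry, where
    bytes_per_entry would come from an empirical estimate such as the size
    of an exported VCF shard divided by the number of entries it holds.
    """
    if n_entries < 0 or bytes_per_entry <= 0 or target_bytes <= 0:
        raise ValueError("Sizes must be positive and n_entries non-negative.")
    estimated_bytes = n_entries * bytes_per_entry
    # Round up so no partition exceeds the target, then apply the floor.
    return max(min_partitions, math.ceil(estimated_bytes / target_bytes))
```

The result could then be passed to `mt.naive_coalesce(...)` in place of the hard-coded 1000. Note that the entry count itself requires a pass over the data.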

lfrancioli

Nothing major, but I think we need to roll much of this code into gnomad_methods so that it's accessible for future uses. I think a gnomad_methods.utils.vcf module would be a great idea and a place to put all the headers, for example. I realize this is probably more than you signed up for with this subsetting, but I think it'd be worth it. Let me know what you think and whether you feel like starting this. Otherwise, I guess we'd just add a comment in the script that this should be generalized.

gnomad_qc/v3/subset/subset.py

lfrancioli

one more optimization point

gnomad_qc/v3/subset/subset.py

mike-w-wilson · 2020-03-24T13:44:58Z

@lfrancioli, I agree with moving much of this code to gnomad_methods and I'm happy to take on the subsetting/vcf module. For now, I think it makes sense to merge this here. Once the gnomad_methods reorg is complete, which I'm working on now, I will move the sensible portions of this code to a gnomad_methods.utils.vcf submodule. Is there any downside to merging this to gnomad_qc now? If not, I can add the comment but will also work on getting this into gnomad_methods very soon.

mike-w-wilson · 2020-03-24T16:45:32Z

Changed my mind on this. I'm going to reorg gnomad_methods, move the header, compute partitions, and format functions to a vcf module in gnomad_methods, and then update this script. Stay tuned!

jkgoodrich

Some comments and requested changes, but nothing big. I know we will need to make all of this use more of the code in gnomad_methods (and update it), but we can do that after the current subset that we need to make.

gnomad_qc/v3/subset/subset.py

mike-w-wilson · 2020-08-19T00:24:08Z

Thanks for the quick feedback, @jkgoodrich. Back to you!

jkgoodrich

Just some small comments and a question. Very close. Thank you Mike!

gnomad_qc/v3/subset/subset.py

jkgoodrich · 2020-08-19T14:37:26Z

gnomad_qc/v3/subset/subset.py

+                                                    gnomAD_AC_raw=freq_ht[mt.row_key].freq[1].AC, 
+                                                    gnomAD_AN_raw=freq_ht[mt.row_key].freq[1].AN,
+                                                    gnomAD_AF_raw=hl.float32(freq_ht[mt.row_key].freq[1].AF)))
+        header_dict['info'].update({


What is the difference between these and the AC, AN, AF... in the HEADER_DICT above?

Good question, I've reached out to Laurent for help answering this. I'm thinking of dropping the above AC and AC_raw to avoid any confusion.

After speaking with Laurent and investigating the AC and AC_raw in the info file compared to the frequencies file, AC and AC_raw are computed on all samples while the frequencies ACs are computed on release only. I am dropping AC and AC_raw from the info_ht before annotating.

Ah, yes, because this is the raw table. For the sites table, I think he resets these to only the release samples (and high-quality unrelated). I am assuming AN and AF should also be dropped. The only question is whether, for subsets, we want AC, AN, and AF to be reset to the numbers in the subset rather than having a subset_AC, etc. Also, I think we can drop the '_adj' on everything. The way it is in gnomAD and UKBB is that we specify if it is raw, but otherwise it is adj, and the header describes what it is.

There isn't an AN or AF in the raw table. I'm not sure I understand this sentence: "The only question is for subsets will we want AC, AN, and AF to now be reset to the numbers in the subset rather than having a subset_AC". Are you suggesting renaming the subset_AC annotation to AC, and so on for the other callstat annotations? Dropping _adj from the annotations and updating the headers makes sense to me, especially if it is consistent with gnomAD and UKBB. Thanks again!

Sorry, I reread the comment and now I understand changing subset_AC to AC. I think we'll just have to be clear in the email, but that works for me.

Yeah, I was wondering what you think about naming them AC, AN, and AF instead of subset_AC, subset_AN, and subset_AF, since they will reflect the samples in the VCF.

This makes sense. I've updated the annotations, so if you wouldn't mind one last look, I'd appreciate it.
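To make the naming convention agreed on in this thread concrete, here is a minimal sketch in plain Python. It assumes callstat field names carry a `subset_` prefix and an optional `_adj` or `_raw` suffix. Apart from `subset_AC` and `gnomAD_AC_raw`, which appear in the discussion and diff above, the names used with it are hypothetical, as are the helpers `rename_callstat_field` and `rename_callstats`.

```python
from typing import Any, Dict


def rename_callstat_field(name: str) -> str:
    """Apply the convention from the thread to one field name.

    Hypothetical helper: the subset callstats become the plain AC/AN/AF
    (they describe the samples in the VCF), "_adj" is dropped because adj
    is the default, and "_raw" is kept explicit.
    """
    if name.startswith("subset_"):
        name = name[len("subset_"):]
    if name.endswith("_adj"):
        name = name[: -len("_adj")]
    return name


def rename_callstats(fields: Dict[str, Any]) -> Dict[str, Any]:
    """Rename every key, refusing to silently overwrite on a collision."""
    renamed: Dict[str, Any] = {}
    for old_name, value in fields.items():
        new_name = rename_callstat_field(old_name)
        if new_name in renamed:
            raise ValueError(f"Renaming {old_name!r} collides on {new_name!r}.")
        renamed[new_name] = value
    return renamed
```

The collision check matters here because the info table originally carried its own AC and AC_raw, which is why those were dropped before the subset callstats took over the plain names.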

gnomad_qc/v3/subset/subset.py

jkgoodrich

Thank you Mike! I think this looks good. Now you can get that subset going!!

gtiao · 2020-09-21T18:27:54Z

Clarifying note: this script is focused on creating separate sub-callsets of gnomAD, not on creating frequency array annotations for subsets of gnomAD.

gtiao · 2021-06-17T13:10:07Z

Mike says this needs to be looked at again for another round of review, and we should re-evaluate priority.

gtiao · 2021-07-08T19:34:44Z

May no longer be relevant

mike-w-wilson · 2024-06-10T12:55:21Z

Replaced by #566

Initial subset script for v3

bfe6bda

mike-w-wilson requested review from lfrancioli and jkgoodrich March 12, 2020 16:24

mike-w-wilson added 3 commits March 12, 2020 14:41

Updated to use general subset function

7cc11b4

Updated to add metadata

48655d8

Added .bgz to vcf file path so shards block gzip

3dec95b

lfrancioli reviewed Mar 16, 2020

mike-w-wilson added 5 commits March 18, 2020 15:36

addressed feedback

84dd46d

Added VCF info header

7d90cc1

Formatted with Black

2b18a06

no really formatted with black

bfd58a6

Added Format to header and upped min partitions

2544004

lfrancioli reviewed Mar 19, 2020

lfrancioli reviewed Mar 19, 2020

small optimizations

887351b

mike-w-wilson added 3 commits August 18, 2020 13:25

Add subset and gnomad callstats

43f388e

Fixed dict and AF callstat grab

b7e3c79

Fixed dict and AF callstat grab

c413b93

jkgoodrich requested changes Aug 18, 2020

Updated header, comments, and hardcoded sparse

ef72c96

jkgoodrich reviewed Aug 19, 2020

mike-w-wilson added 2 commits August 19, 2020 13:37

Removed AC and AC_raw

d8124de

Removed _adj from annotations

d9e48e7

jkgoodrich approved these changes Aug 19, 2020

drop adj

73ade62

jkgoodrich marked this pull request as draft September 8, 2021 18:31

jkgoodrich and others added 3 commits October 27, 2021 15:24

Merge branches 'master' and 'mw/subset_v3' of https://github.com/broadinstitute/gnomad_qc into mw/subset_v3

85f326f

Changes needed to make subset. Use release files instead of freq and add a checkpoint before Naive coalescing and compute_partitions which has a count in it

9e3aa37

Update script to use VDS and form options

255ad04

mike-w-wilson requested review from lfrancioli and removed request for lfrancioli October 28, 2022 14:51

mike-w-wilson marked this pull request as ready for review October 28, 2022 14:52

mike-w-wilson requested a review from jkgoodrich October 28, 2022 14:53

mike-w-wilson assigned mike-w-wilson and jkgoodrich Oct 28, 2022

mike-w-wilson added 5 commits October 28, 2022 11:00

Fix missing args

5a38e05

Fix missing args

3681815

Remove old arg

399c682

Add basic VCF export

b0aa68a

Add variant qc field selection

1e83259

mike-w-wilson assigned gtiao Nov 3, 2022

mike-w-wilson requested a review from gtiao November 3, 2022 17:09

jkgoodrich unassigned gtiao Nov 10, 2022

mike-w-wilson added the v3 label Apr 26, 2023

jkgoodrich marked this pull request as draft October 16, 2023 21:29

mike-w-wilson closed this Jun 10, 2024
