Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial subset script for v3 #48

Closed
wants to merge 25 commits into from
Closed

Initial subset script for v3 #48

wants to merge 25 commits into from

Conversation

mike-w-wilson
Copy link
Contributor

This is an initial PR for the gnomAD subsetting script. @ch-kr told me she is working on a more general subset script that will be committed to the general repo using similar logic. Once that is in, this script will be updated. However, since the subsets requests are coming in I wanted to make sure this was reviewed before handing over more VCFs to collaborators.

Copy link
Contributor

@lfrancioli lfrancioli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few small comments in the current code. Also, this will produce a VCF without Fields definitions -- I can see how this may not be crucial atm but since we generally want field definitions I think this should be added. Most of it is in gnomad_qc.v2.variant_qc.perpare_release_data.py. I also have a v3 version but I realized it's not in yet -- I'll make sure to do that and happy to complement your code with these fields as well.

}
)
)
elif ft == hl.dtype("array<array<int32>>"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is definitely not generic and won't be right in many cases. Often times, array<array<...>> are converted to pipe-delimited arrays in VCF-land. If this is correct for a specific field, I think it'd be best to have it outside this function and apply it to this field directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thanks for catching this. This was for the SB field and I realize you handle that separately in the ht_to_vcf_mt. I've add that in.

mt = mt.cols().drop("pop")

if subset:
mt = subset_samples_and_variants(mt, subset, sparse=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add gt_expr = "LGT" ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right!

mt = mt.drop(mt.gvcf_info)
mt = mt.annotate_rows(info=info_ht[mt.row_key].info)

mt = mt.naive_coalesce(1000)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this should be a param or a function of the resulting size of the data?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gave this a shot using the recommended 128MB per partition but there isnt currently a built in way to get the actual size of the MT so I also used an entry estimate I calculated from the size of vcf shard.

@mike-w-wilson
Copy link
Contributor Author

I've added in the vcf header. I used the info and format descriptions from v2 that overlapped and also what I found here: https://github.com/broadinstitute/gatk/blob/master/src/test/resources/org/broadinstitute/hellbender/tools/walkers/GnarlyGenotyper/twoSampleASDB.vcf
I took a shot at making the number of partitions for naive coalesce a function. I set a minimum to 20 because anything lower than that was taking a very long time. Back to you @lfrancioli !

Copy link
Contributor

@lfrancioli lfrancioli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing major, but I think we need to roll much of this code into gnomad_methods so that it's accessible for future uses. I think a gnomad_methods.utils.vcf would be a great idea and a place to put all the headers for example. I realize this is probably more than you signed up for with this subsetting but I think it'd be worth it. Let me know what you think and you feel like starting this -- otherwise I guess we'd just add a comment in the script that this should be generalized.

gnomad_qc/v3/subset/subset.py Outdated Show resolved Hide resolved
gnomad_qc/v3/subset/subset.py Outdated Show resolved Hide resolved
Copy link
Contributor

@lfrancioli lfrancioli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one more optimization point

gnomad_qc/v3/subset/subset.py Outdated Show resolved Hide resolved
@mike-w-wilson
Copy link
Contributor Author

@lfrancioli , I agree with moving much of this code to gnomad_methods and happy to take on the subsetting/vcf module. For now I think it makes sense to merge this here and once the gnomad_methods reorg is complete, which I'm working on now, I will move the sensible portions of this code to a gnomad_methods.utils.vcf submodule. Is there any downside to merging this to gnomad_qc now? If not, I can add the comment but will also work on getting this into gnomad_methods very soon.

@mike-w-wilson
Copy link
Contributor Author

Changed my mind on this. I'm going to reorg gnomad_methods, move the header, compute partitions and format function to a vcf module in gnomad_method and then update this script. Stay tuned!

Copy link
Contributor

@jkgoodrich jkgoodrich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments and requested changes, but nothing big. I know we will need to make all of this use more of the code in gnomad_methods (and update it), but we can do that after this current subset that we need to make

gnomad_qc/v3/subset/subset.py Outdated Show resolved Hide resolved
gnomad_qc/v3/subset/subset.py Outdated Show resolved Hide resolved
gnomad_qc/v3/subset/subset.py Outdated Show resolved Hide resolved
gnomad_qc/v3/subset/subset.py Outdated Show resolved Hide resolved
gnomad_qc/v3/subset/subset.py Outdated Show resolved Hide resolved
gnomad_qc/v3/subset/subset.py Outdated Show resolved Hide resolved
gnomad_qc/v3/subset/subset.py Outdated Show resolved Hide resolved
gnomad_qc/v3/subset/subset.py Outdated Show resolved Hide resolved
gnomad_qc/v3/subset/subset.py Outdated Show resolved Hide resolved
gnomad_qc/v3/subset/subset.py Outdated Show resolved Hide resolved
@mike-w-wilson
Copy link
Contributor Author

Thanks for the quick feedback @jkgoodrich , back to you!

Copy link
Contributor

@jkgoodrich jkgoodrich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some small comments and a question. Very close. Thank you Mike!

gnomad_qc/v3/subset/subset.py Outdated Show resolved Hide resolved
gnomad_qc/v3/subset/subset.py Outdated Show resolved Hide resolved
gnomad_qc/v3/subset/subset.py Outdated Show resolved Hide resolved
gnomad_qc/v3/subset/subset.py Outdated Show resolved Hide resolved
gnomAD_AC_raw=freq_ht[mt.row_key].freq[1].AC,
gnomAD_AN_raw=freq_ht[mt.row_key].freq[1].AN,
gnomAD_AF_raw=hl.float32(freq_ht[mt.row_key].freq[1].AF)))
header_dict['info'].update({
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference between these and the AC, AN, AF... in the HEADER_DICT above?

Copy link
Contributor Author

@mike-w-wilson mike-w-wilson Aug 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, I've reached out to Laurent for help answering this. I'm thinking of dropping the above AC and AC_raw to avoid any confusion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After speaking with Laurent and investigating the AC and AC_raw in the info file compared to the frequencies file, AC and AC_raw are computed on all samples while the frequencies ACs are computed on release only. I am dropping AC and AC_raw from the info_ht before annotating.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes, because this is the raw table. For the sites table I think he resets these to only the release samples (and high quality unrelated). I am assuming AN and AF also should be dropped. The only question is for subsets will we want AC, AN, and AF to now be reset to the numbers in the subset rather than having a subset_AC,... Also I think we can drop the '_adj' on everything. The way it is in gnomAD and UKBB is that we specify if it is raw, but otherwise it is adj and it is described in the header what it is

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There isn't a AN or AF in the raw table. I'm not sure I understand this sentence "The only question is for subsets will we want AC, AN, and AF to now be reset to the numbers in the subset rather than having a subset_AC" Are you suggesting renaming the subset_AC annotation to AC so on for the other callstat annotations? dropping _adj from the annotations and updating headers makes sense to me, especially if it is consistent with gnomAD and UKBB. Thanks again!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, reread the comment and now I understand changing the subset_AC to AC. I think we'll just have to be clear in the email but that works for me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I was wondering if you think we instead of subset_AC we name it AC, subset_AN we name it AN, and subset_AF we name it AF. Since they will be a reflection of the samples in the VCF.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense. I've updated the annotations so if you wouldnt mind one last look, I'd appreciate it

gnomad_qc/v3/subset/subset.py Outdated Show resolved Hide resolved
Copy link
Contributor

@jkgoodrich jkgoodrich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you Mike! I think this looks good. Now you can get that subset going!!

@gtiao
Copy link
Contributor

gtiao commented Sep 21, 2020

Clarifying note: this script is more focused on creating separate sub-callsets of gnomAD and not about creating frequency array annotations for subsets of gnomAD

@gtiao
Copy link
Contributor

gtiao commented Jun 17, 2021

Mike says this needs to be looked at again for another round of review, and we should re-evaluate priority.

@gtiao
Copy link
Contributor

gtiao commented Jul 8, 2021

May no longer be relevant

@jkgoodrich jkgoodrich marked this pull request as draft September 8, 2021 18:31
@mike-w-wilson mike-w-wilson requested review from lfrancioli and removed request for lfrancioli October 28, 2022 14:51
@mike-w-wilson mike-w-wilson marked this pull request as ready for review October 28, 2022 14:52
@mike-w-wilson
Copy link
Contributor Author

Replaced by #566

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants