Skip to content

Conversation

@mike-w-wilson
Copy link
Contributor

Summary

This PR is pretty massive but since it is a reorganization of the entire repo, I couldn't think of another way to do it. This organization was drafted and previously discussed here:

https://docs.google.com/spreadsheets/d/1pVpru0acCCr1AvSRyf89X-LxPKhwK-mYCBtZvICl9fQ/edit#gid=0

The changes include import statements, loggers, and where functions live. Six functions have also been removed as they are built into hail or unnecessary for other reasons. They are:

gnomad.utils.generic.reverse_complement_bases(bases)
gnomad.utils.generic.get_sample_data(mt, …)
gnomad.utils.generic.phase_by_transmission(…)
gnomad.utils.generic.phase_trio_matrix_by_transmission(tm)
gnomad.utils.generic.explode_trio_matrix(tm)
gnomad.utils.generic.sample_pcs_uniformly(…)
gnomad.utils.liftover.main(args)

Once this is approved, I can run Black.

Copy link
Contributor

@nawatts nawatts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 based on the spreadsheet.

For the __init__.py files, we might consider importing the submodules themselves instead of everything in them to avoid having so many things all in the same namespace. Or using __all__ to define what's exported.

https://docs.python.org/3/tutorial/modules.html#importing-from-a-package

mike-w-wilson and others added 4 commits March 26, 2020 09:37
- removed field_utils.py (functions were moved elsewhere)
- moved vcf-related functionalities to vcf.py (I envision this will become the place for all VCF needs)
Copy link
Contributor

@lfrancioli lfrancioli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a bunch @mike-w-wilson for getting this going, looks great!
A few things:

  1. I committed a few changes myself -- please take a look at that last commit
  2. I'd like to move pop colors to ancestry.py; thoughts? Also ancestry.py vs populations.py ?
  3. How about moving random_forest.pretty_print_runs to the gnomad_qc repo? Seems very specialized and possibly not used in other projects. @jkgoodrich, did you use it for UKBB ?
  4. should we rename rf.py to random_forest.py ? it'll break some code and I'm unsure it really makes a difference But happy to adapt if consensus is that it's desirable.

Copy link
Contributor Author

@mike-w-wilson mike-w-wilson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Just a couple comments but this looks good to me. The moves from field_utils to annotations make sense.
  2. Putting pop colors in ancestry is a good idea and seems like it would be used. I don't have strong feelings on ancestry.py vs. populations.py. I think we discussed this in one meeting but I'll let @jkgoodrich and @gtiao weigh in.
  3. Fine with me if it is okay with Julia.
  4. I went back and forth on this. I spelt it out for readability in the sphinx docs but went against my own reasoning in keeping ld.py the same....so again no strong feelings here. I guess most people using this package will know what rf stands for.

@mike-w-wilson
Copy link
Contributor Author

Regarding @nawatts 's comment on __init__.py, do we want to adopt __all__? If so, are there any submodules we would want to leave out of the imports? I can also make the change suggested and just import the submodules themselves from gnomad.sample_qc import ancestry I do like that this approach clears the namespace up. Any thoughts there @lfrancioli ?

@lfrancioli
Copy link
Contributor

I like importing the submodules too

Copy link
Contributor

@jkgoodrich jkgoodrich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, thank you Mike so much for doing this. I pretty much just confirmed that the functions were moved where expected, imports updated (thanks for redoing the import organization!) and looked at Laurent’s changes.

  • Laurent, thanks for moving/removing the field_utils I feel like we weren’t sure about them and this makes more sense.

  • For ancestry.py vs. populations.py. I still lean towards ancestry for some reason, but I am not sure why since we refer to everything in there with pops or populations so that might be better.

  • I do currently use random_forest.pretty_print_runs in UKBB actually.

  • I personally like random_forest.py better, but fine with either.

)


def bi_allelic_expr(t: Union[hl.Table, hl.MatrixTable]) -> hl.expr.BooleanExpression:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, this makes sense to move here from field utils, thanks for moving

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

ANNOTATIONS_HISTS = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels a little odd here since it ends up being more specific to the given QC pipeline (granted not for all) @lfrancioli what do you think?.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm fine with having this here as in my experience these values are good starting points at least and the different metrics are pretty standard. Totally open to suggestions though -- did you have another module in mind?
I think adding a comment about this is would be nice (but I can see this being outside the scope of this PR and part of another effort to document constants)

v.dtype in {hl.tstr, hl.tint32, hl.tfloat32, hl.tint64, hl.tfloat64, hl.tbool}})


def ht_to_vcf_mt(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I guess it does make since to move this to a utils.vcf because it is specific to writing to a vcf and since we are planning to add more to the vcf module at somepoint. Good catch


# Create an MT with no cols so that we acn export to VCF
info_mt = info_ht.to_matrix_table_row_major(columns=['s'], entry_field_name='s')
return info_mt.filter_cols(False) No newline at end of file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like github is complaining about no newline at end of file

@lfrancioli
Copy link
Contributor

Thanks for weighing in @jkgoodrich ! A couple of things based on your comments:
1 . Let's keep ancestry.py :)
2. I'm in for random_forest.py
3. Regarding random_forest.pretty_print_runs, let's keep it there then if you're using it

So I guess what I'm saying is that the current state of things is good with me ;)

@mike-w-wilson
Copy link
Contributor Author

@laurent, one last thing before I run Black and merge! I've moved pop_colors to ancestry. I left all pops and subpops in the dictionary and kept the lower case pop names. I thought someone had mentioned switching to uppercase, which would match the pop_names dictionary, but I could not find it in slack..I'm fine either way but do think matching the pop_names keys would make sense, though I know this goes against history.

@lfrancioli
Copy link
Contributor

This is awesome @mike-w-wilson

So I think we should go all lowercase. Also, could you update the names with the more complete ones (and double check the colors match) those here:
https://github.com/macarthur-lab/gnomad_lof/blob/master/R/constants.R

Note that the Amish isn't in the R code as it's new for v3, so needs to be added. If you see anything else that's interesting to have w.r.t. pops in these constants please transfer over too!

@mike-w-wilson
Copy link
Contributor Author

@laurent, added the complete pop names and pop colors but removed any unused pops with clashes.

Copy link
Contributor

@lfrancioli lfrancioli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well done!

@mike-w-wilson mike-w-wilson merged commit bf9d93d into master Apr 3, 2020
@mike-w-wilson mike-w-wilson deleted the mw/repo_reorg branch April 3, 2020 14:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants