Support multiple resource buckets and switch to Google Cloud Public Datasets bucket #265

Closed
wants to merge 11 commits into from

Contributor

@nawatts nawatts commented Oct 1, 2020

This adds support for multiple resource buckets in order to provide access to resources hosted on different cloud providers through our resources framework (#255, #263, #264). This also sets the default resource bucket to the Google Cloud Public Datasets bucket: gs://gcp-public-data--gnomad (resolves #264).

Some design goals:

  1. Do not change interfaces. Everyone has more pressing things to do right now than update pipelines with changes to resources.
  2. The resource "directory" structure is the same in each bucket. Because of this, we can define resource paths in one place and parameterize the URL scheme (gs://, s3://, etc.) and the bucket name instead of having different resource modules for each cloud provider.
  3. Easily select a resource bucket to use for all resources in a pipeline.

Because our resource modules export objects, my initial idea of adding a cloud provider parameter to Resource classes was unworkable since it would require changing the library's interface (against goal 1) or adding separate modules for each provider (against goals 2 and 3).

To meet goal 2, this changes the path attribute of Resource objects from an absolute URL to a relative (within a bucket) path and adds a resources_root parameter to Resource methods that read the resource (TableResource.ht, PedigreeResource.pedigree, etc.) and import_resource. This provides flexibility: users can mix and match resource buckets within a pipeline (for example, if a file in gnomad-public hasn't yet been copied to gcp-public-data--gnomad) or use resource buckets unknown to us (for example, if someone sets up a copy of gnomad-public in Australia). resources_root is optional (goal 1) and the expectation is that most users will leave it unspecified.
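A minimal sketch of the relative-path idea, for illustration only. `SketchTableResource`, `join_resource_url`, `DEFAULT_ROOT`, and the example path are hypothetical names invented here; in this PR the default is resolved through configuration rather than a constant.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical constant for this sketch; the PR resolves the default
# through a configuration object instead.
DEFAULT_ROOT = "gs://gcp-public-data--gnomad"


def join_resource_url(resources_root: str, path: str) -> str:
    """Join a bucket root and a bucket-relative path without doubling slashes."""
    return resources_root.rstrip("/") + "/" + path.lstrip("/")


@dataclass
class SketchTableResource:
    """Illustrative stand-in for a Resource class, not the library's implementation."""

    path: str  # relative to the bucket root, not an absolute URL

    def url(self, resources_root: Optional[str] = None) -> str:
        # resources_root is optional, so existing callers keep working unchanged.
        root = resources_root if resources_root is not None else DEFAULT_ROOT
        return join_resource_url(root, self.path)


# Hypothetical relative path, used only to show the behavior.
resource = SketchTableResource("release/example/exomes.ht")
default_url = resource.url()
override_url = resource.url(resources_root="gs://gnomad-public/")
```

Because only the root varies, the same resource definition can be read from any bucket that mirrors the directory structure.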

Setting a default behavior is not as simple as setting "gs://gnomad-public" as the default value for resources_root. This is partly because our resource files are split between two buckets (gnomad-public and gnomad-public-requester-pays) and partly because of goal 3: a pipeline using resources from AWS shouldn't have to pass resources_root every time a resource is read.

Thus, Resource methods call get_resource_url with the Resource's path attribute, the given resources_root, and a new Resource attribute: gnomad_bucket. The gnomad_bucket attribute is necessary to distinguish resources stored in gnomad-public vs gnomad-public-requester-pays.

In turn, get_resource_url looks at a new setting: the "default resource provider". This controls how resource URLs are constructed when no resources_root is provided (which is expected to usually be the case). With this, the resource bucket to use for all resources in a pipeline can be set by changing the default_resource_provider attribute on gnomad.resources.config.gnomad_resource_configuration. Support for other cloud providers can be added by adding to the GnomadResourceProvider enum and updating get_resource_url accordingly.
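Roughly, the resolution order described above could look like the sketch below. This is an assumption-laden illustration, not the code in this PR: the `GOOGLE_CLOUD_PUBLIC_DATASETS` member name, the enum values, the `ResourceConfiguration` class, the argument order of `get_resource_url`, and the example path are all hypothetical.

```python
from enum import Enum
from typing import Optional


class GnomadResourceProvider(Enum):
    GNOMAD = "gnomad"
    # Hypothetical member name; only GNOMAD appears in the usage examples.
    GOOGLE_CLOUD_PUBLIC_DATASETS = "google-cloud-public-datasets"


class ResourceConfiguration:
    """Hypothetical holder for the pipeline-wide default provider."""

    def __init__(self) -> None:
        self.default_resource_provider = (
            GnomadResourceProvider.GOOGLE_CLOUD_PUBLIC_DATASETS
        )


gnomad_resource_configuration = ResourceConfiguration()


def get_resource_url(
    path: str,
    resources_root: Optional[str] = None,
    gnomad_bucket: str = "gnomad-public",
) -> str:
    """Build an absolute URL from a bucket-relative path (assumed signature)."""
    relative = path.lstrip("/")
    # An explicit resources_root always wins over the configured default.
    if resources_root is not None:
        return f"{resources_root.rstrip('/')}/{relative}"
    provider = gnomad_resource_configuration.default_resource_provider
    if provider is GnomadResourceProvider.GNOMAD:
        # gnomad_bucket distinguishes gnomad-public from
        # gnomad-public-requester-pays.
        return f"gs://{gnomad_bucket}/{relative}"
    if provider is GnomadResourceProvider.GOOGLE_CLOUD_PUBLIC_DATASETS:
        return f"gs://gcp-public-data--gnomad/{relative}"
    raise ValueError(f"Unsupported resource provider: {provider}")


# Hypothetical path, used only to show how the default provider changes the URL.
public_datasets_url = get_resource_url("release/example/exomes.ht")
gnomad_resource_configuration.default_resource_provider = (
    GnomadResourceProvider.GNOMAD
)
gnomad_url = get_resource_url(
    "release/example/exomes.ht", gnomad_bucket="gnomad-public-requester-pays"
)
```

Under this shape, adding a provider means adding an enum member and one more branch in `get_resource_url`, which matches the extension path described above.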

Import paths

The gnomad/resources/import_resources.py script is used to convert VCF, TSV, etc. files into Hail formats. The import_args attribute on Resource is not aware of resources_root, so URLs there for files in gnomad-public or gnomad-public-requester-pays are not changed. This is not likely to be a problem: we will have limited access to the various cloud providers' buckets, so the workflow will probably be to import resources in our own buckets and then copy from there to other buckets. Likewise, in import_resources.py, the resources_root is hard-coded to gs://gnomad-public.

Usage

To load resources from the Google Cloud Public Datasets bucket:

from gnomad.resources.grch37 import gnomad

ds = gnomad.public_release("exomes").ht()

To load resources from gnomAD buckets:

from gnomad.resources.grch37 import gnomad
from gnomad.resources import GnomadResourceProvider, gnomad_resource_configuration

gnomad_resource_configuration.default_resource_provider = GnomadResourceProvider.GNOMAD

ds = gnomad.public_release("exomes").ht()

or

from gnomad.resources.grch37 import gnomad

ds = gnomad.public_release("exomes").ht(resources_root="gs://gnomad-public-requester-pays")

@nawatts nawatts changed the title Support multiple resource buckets Support multiple resource buckets and switch to Google Cloud Public Datasets bucket Oct 1, 2020
@nawatts nawatts force-pushed the resource-buckets branch 2 times, most recently from e8be3e2 to cdd35b6 Compare October 6, 2020 12:39
@nawatts nawatts mentioned this pull request Feb 22, 2021
@nawatts nawatts marked this pull request as draft March 18, 2021 18:54
@nawatts
Contributor Author

nawatts commented Mar 18, 2021

Realized that, as is, this would break uses of resource classes in gnomad_qc.

@jkgoodrich
Contributor

ooh, didn't know about marking PRs as draft, thanks Nick!

@gtiao
Contributor

gtiao commented Apr 23, 2021

What do we need to do to make it work?

@nawatts
Contributor Author

nawatts commented Apr 23, 2021

I have a different version of this working but it requires #359.

@nawatts
Contributor Author

nawatts commented Apr 26, 2021

Replaced by #373

@nawatts nawatts closed this Apr 26, 2021
@nawatts nawatts deleted the resource-buckets branch May 5, 2021 17:19
Development

Successfully merging this pull request may close these issues.

Update gnomAD resources to reflect free Google Cloud-hosted paths
3 participants