Support multiple resource buckets and switch to Google Cloud Public Datasets bucket #265
Closed
e8be3e2 to cdd35b6
58e738d to 3f89f92
Merged
* Change all resources paths to be relative to bucket
* Add resources_root argument to resource read and import methods
* Construct resource URL from resources_root argument and resource path
So that they use the same default resources root as Resource objects
Remove flags to allow requester pays access to gnomAD bucket
3f89f92 to 6ff233b
Realized that, as is, this would break uses of resource classes in gnomad_qc.
ooh, didn't know about marking PRs as draft, thanks Nick!
What do we need to do to make it work?
I have a different version of this working but it requires #359.
Replaced by #373
This adds support for multiple resource buckets in order to provide access to resources hosted on different cloud providers through our resources framework (#255, #263, #264). This also sets the default resource bucket to the Google Cloud Public Datasets bucket, `gs://gcp-public-data--gnomad` (resolves #264).

Some design goals:

... (`gs://`, `s3://`, etc.) and the bucket name instead of having different resource modules for each cloud provider.

Because our resource modules export objects, my initial idea of adding a cloud provider parameter to Resource classes was unworkable since it would require changing the library's interface (against goal 1) or adding separate modules for each provider (against goals 2 and 3).
To meet goal 2, this changes the `path` attribute of `Resource` objects from an absolute URL to a relative (within a bucket) path and adds a `resources_root` parameter to `Resource` methods that read the resource (`TableResource.ht`, `PedigreeResource.pedigree`, etc.) and to `import_resource`. This provides flexibility: users can mix and match resource buckets within a pipeline (for example, if a file in `gnomad-public` hasn't yet been copied to `gcp-public-data--gnomad`) or use resource buckets unknown to us (for example, if someone sets up a copy of `gnomad-public` in Australia). `resources_root`
is optional (goal 1), and the expectation is that most users will leave it unspecified.

Setting a default behavior is not as simple as setting `"gs://gnomad-public"` as the default value for `resources_root`. This is partly because our resource files are split between two buckets, `gnomad-public` and `gnomad-public-requester-pays`, and partly because of goal 3: a pipeline using resources from AWS shouldn't have to pass `resources_root` every time a resource is read.

Thus,
`Resource` methods call `get_resource_url` with the `Resource`'s `path` attribute, the given `resources_root`, and a new `Resource` attribute: `gnomad_bucket`. The `gnomad_bucket` attribute is necessary to distinguish resources stored in `gnomad-public` vs `gnomad-public-requester-pays`.

In turn,
`get_resource_url` looks at a new setting: the "default resource provider". This controls how resource URLs are constructed when no `resources_root` is provided (which is expected to usually be the case). With this, the resource bucket to use for all resources in a pipeline can be set by changing the `default_resource_provider` attribute on `gnomad.resources.config.gnomad_resource_configuration`. Support for other cloud providers can be added by adding to the `GnomadResourceProvider` enum and updating `get_resource_url` accordingly.

Import paths
The `gnomad/resources/import_resources.py` script is used to convert VCF, TSV, etc. files into Hail formats. The `import_args` attribute on `Resource` is not aware of `resources_root` and thus URLs there for files in `gnomad-public` or `gnomad-public-requester-pays` are not changed. This is not likely to be a problem, since we will have limited access to the various cloud providers' buckets and thus the workflow will probably be to import resources in our own buckets and then copy from there to other buckets. Likewise, in `import_resources.py`, the `resources_root` is hard coded to `gs://gnomad-public`.

Usage
To load resources from the Google Cloud Public Datasets bucket:
To load resources from gnomAD buckets:
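As a rough illustration of the default case and of switching the default provider, the sketch below models the URL resolution described in this PR with stand-in definitions. It does not import the real gnomad package. The names `GnomadResourceProvider`, `default_resource_provider`, `gnomad_resource_configuration`, `get_resource_url`, `resources_root`, and `gnomad_bucket` come from the description; the `ResourceConfiguration` class, the enum member names, the example path, and the function body are assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class GnomadResourceProvider(Enum):
    # Member names and values are assumptions; the PR text only names the enum.
    GOOGLE_CLOUD_PUBLIC_DATASETS = "google-cloud-public-datasets"
    GNOMAD = "gnomad"


@dataclass
class ResourceConfiguration:
    # Hypothetical container for the "default resource provider" setting.
    default_resource_provider: GnomadResourceProvider = (
        GnomadResourceProvider.GOOGLE_CLOUD_PUBLIC_DATASETS
    )


# Stand-in for gnomad.resources.config.gnomad_resource_configuration.
gnomad_resource_configuration = ResourceConfiguration()


def get_resource_url(
    path: str,
    resources_root: Optional[str] = None,
    gnomad_bucket: str = "gnomad-public",
) -> str:
    """Build a full URL from a bucket-relative path (assumed logic)."""
    relative_path = path.lstrip("/")
    if resources_root is not None:
        # An explicit root always wins over the configured default provider.
        return f"{resources_root.rstrip('/')}/{relative_path}"

    provider = gnomad_resource_configuration.default_resource_provider
    if provider is GnomadResourceProvider.GOOGLE_CLOUD_PUBLIC_DATASETS:
        return f"gs://gcp-public-data--gnomad/{relative_path}"
    if provider is GnomadResourceProvider.GNOMAD:
        # gnomAD resources are split between two buckets, so the resource
        # itself says which one it lives in.
        return f"gs://{gnomad_bucket}/{relative_path}"
    raise ValueError(f"Unsupported resource provider: {provider}")


example_path = "resources/example/example.ht"  # hypothetical relative path

# Default: Google Cloud Public Datasets bucket.
default_url = get_resource_url(example_path)

# Switch every resource in the pipeline to the gnomAD buckets.
gnomad_resource_configuration.default_resource_provider = (
    GnomadResourceProvider.GNOMAD
)
gnomad_url = get_resource_url(example_path)
requester_pays_url = get_resource_url(
    example_path, gnomad_bucket="gnomad-public-requester-pays"
)
```

In this sketch the provider setting is process-wide state, which is what lets a pipeline change buckets in one place instead of passing `resources_root` at every read.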
or
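The per-call alternative is to pass `resources_root` when reading a resource. The following is a minimal sketch of that idea only. `SketchTableResource`, its `url` method (standing in for read methods such as `TableResource.ht`), the example path, and the `s3://` mirror bucket name are all hypothetical; the default root is the Public Datasets bucket named in this PR.

```python
from dataclasses import dataclass
from typing import Optional

# Default root named in the PR description.
DEFAULT_RESOURCES_ROOT = "gs://gcp-public-data--gnomad"


@dataclass(frozen=True)
class SketchTableResource:
    """Hypothetical stand-in for a Resource whose path is bucket-relative."""

    path: str

    def url(self, resources_root: Optional[str] = None) -> str:
        # Stand-in for a read method: resolve the relative path against the
        # given root, or against the default when no root is passed.
        root = DEFAULT_RESOURCES_ROOT if resources_root is None else resources_root
        return f"{root.rstrip('/')}/{self.path.lstrip('/')}"


example = SketchTableResource(path="resources/example/example.ht")  # hypothetical

# Same resource object, three different buckets in one pipeline.
public_datasets_url = example.url()
gnomad_bucket_url = example.url(resources_root="gs://gnomad-public")
mirror_url = example.url(resources_root="s3://example-gnomad-mirror")
```

Because the resource object only stores a relative path, the same exported object can be pointed at any bucket that mirrors the layout, including ones the library does not know about.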