Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use local copy of VEP config #231

Merged
merged 1 commit into from
Jun 8, 2020
Merged

Use local copy of VEP config #231

merged 1 commit into from
Jun 8, 2020

Conversation

nawatts
Copy link
Contributor

@nawatts nawatts commented Jun 3, 2020

Currently, VEP configuration is downloaded from Hail's US bucket.

VEP_REFERENCE_DATA = {
"GRCh37": {
"vep_config": "gs://hail-us-vep/vep85-loftee-gcloud.json",
"all_possible": "gs://gnomad-public/papers/2019-flagship-lof/v1.0/context/Homo_sapiens_assembly19.fasta.snps_only.vep_20181129.ht",
},
"GRCh38": {
"vep_config": "gs://hail-us-vep/vep95-GRCh38-loftee-gcloud.json",
"all_possible": "gs://gnomad-public/resources/context/grch38_context_vep_annotated.ht",
},
}

This bucket is requestor pays, so reading from it requires that either --requester-pays-allow-all or --requester-pays-allow-buckets hail-us-vep is specified when starting the cluster. If someone isn't aware of that, it's easy to start a cluster (which takes a fair amount of time with VEP) that can't run the gnomAD VEP utilities.

hailctl dataproc's VEP init scripts download these configuration files and link them to /vep_data/vep-gcloud.json. If VEP configuration was loaded from the local copy downloaded by the init scripts instead of the hail-us-vep bucket, then the requestor pays arguments would not be necessary.

Resolves #226
Resolves #211

@nawatts nawatts requested a review from konradjk June 3, 2020 17:42
@konradjk
Copy link
Collaborator

konradjk commented Jun 3, 2020

wait, but the point of the requester pays flag there is because if you start a cluster with --vep it not only uses the config but also localizes tons of files that VEP uses (and these are also in a requester pays bucket). so if you can't access them, you won't get the files either?

@konradjk
Copy link
Collaborator

konradjk commented Jun 3, 2020

oh or is the issue if you do a bucket other than hail-us-vep (like eu)? anyway, i think its fine (though have you tested it? i vaguely recall using a local file was a minor issue in the past, but if this is what hail normally does now, it's probably fine)

@nawatts
Copy link
Contributor Author

nawatts commented Jun 3, 2020

The requestor pays arguments to hailctl dataproc start are used to configure the GCS connector.
https://github.com/hail-is/hail/blob/10e525a56787dd00b6aaed62d856c1eabd3c0f3a/hail/python/hailtop/hailctl/dataproc/start.py#L234-L254

The VEP data is downloaded using gsutil, which is not affected by the GCS Connector configuration.
https://github.com/hail-is/hail/blob/10e525a56787dd00b6aaed62d856c1eabd3c0f3a/hail/python/hailtop/hailctl/dataproc/resources/vep-GRCh37.sh#L26-L33

In the example, the requestor pays flag was necessary because our code loaded VEP config directly from gs://hail-us-vep (the example could have been --requester-pays-allow-buckets hail-us-vep).

I did start a cluster with this branch and was able to run VEP on an imported VCF.

hailctl dataproc start a-cluster --vep GRCh37 \
  --packages="git+https://github.com/broadinstitute/gnomad_methods.git@local-vep-config"

@nawatts
Copy link
Contributor Author

nawatts commented Jun 3, 2020

i vaguely recall using a local file was a minor issue in the past, but if this is what hail normally does now, it's probably fine)

This is why I wanted to get your review.

It looks like Hail always requires you to provide configuration. There is no default value for the config argument.

https://github.com/hail-is/hail/blob/10e525a56787dd00b6aaed62d856c1eabd3c0f3a/hail/python/hail/methods/qc.py#L485

And yet, some recently added documentation claims

If you are using hailctl dataproc as mentioned above, you can just use the default argument for config and everything will work.

https://github.com/hail-is/hail/blob/10e525a56787dd00b6aaed62d856c1eabd3c0f3a/hail/python/hail/methods/qc.py#L511-L516

@konradjk
Copy link
Collaborator

konradjk commented Jun 3, 2020

I did start a cluster with this branch and was able to run VEP on an imported VCF.

You were able to run it without a requester pays flag? I think that's probably not intentional behavior on Hail's part. @johnc1231?

@nawatts
Copy link
Contributor Author

nawatts commented Jun 3, 2020

You were able to run it without a requester pays flag?

Yes. The init script downloads VEP data using gsutil, which is not affected by the --requester-pays-allow-all or --requester-pays-allow-buckets arguments.

I think that's probably not intentional behavior on Hail's part.

I don't think it would make sense to always require an additional requester pays argument when starting a cluster with --vep. If you're using --vep, you definitely want to be able to download the VEP data. Also, requiring a requester pays argument along with --vep either forces users to learn about the region replicate buckets and figure out which one their cluster will be using or, if they don't want to deal with that, encourages them to always allow all requester pays buckets.

@nawatts
Copy link
Contributor Author

nawatts commented Jun 8, 2020

Hail is also moving to use file:///vep_data/vep-gcloud.json by default for VEP on Dataproc (hail-is/hail#8929).

@konradjk
Copy link
Collaborator

konradjk commented Jun 8, 2020

To this end, do we even need a config anymore? Or should we just drop it and go to the default None so that it's managed by Hail and not us?

@nawatts
Copy link
Contributor Author

nawatts commented Jun 8, 2020

I think we should do that at some point. But config=None is not supported in Hail versions before 0.2.45 (assuming that PR is included in the next release). Using the file:// path for config lets us remove our requirement for --requester-pays-allow-buckets hail-us-vep while continuing to work with current Hail versions.

@nawatts nawatts merged commit b75c62e into master Jun 8, 2020
@nawatts nawatts deleted the local-vep-config branch June 8, 2020 13:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use local copy of VEP configuration VEP configuration is tied to US region
2 participants