Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adding support for GenomicsDB as a variant input #1975

Merged
merged 1 commit into from
Jul 28, 2016
Merged

adding support for GenomicsDB as a variant input #1975

merged 1 commit into from
Jul 28, 2016

Conversation

lbergelson
Copy link
Member

this is enabled for the driving variants of VariantWalker as well as any auxiliary FeatureInput

a genomicsdb workspace is referenced by putting the loader.json that was used to create the arrays as well as a query.json into a directory
this is then specified with a url of the form gendb://path/to/directory
i.e
/myfiles/mygendbfiles/loader.json
/myfiles/mygendbfiles/query.json

SomeVariantWalker -V gendb:///myfiles/mygendbfiles

FeatureWalker isn't yet wired to support gendb urls
performance is untested

invalid input files are likely to result in Segfaults or non-helpful errors

resolves #1647

@droazen
Copy link
Contributor

droazen commented Jul 11, 2016

Ping me once you've rebased onto latest master and have tests passing in travis.

@kgururaj
Copy link
Collaborator

kgururaj commented Jul 11, 2016

@lbergelson - missing -y in the "apt-get install "command added in the last commit.

@lbergelson
Copy link
Member Author

@kgururaj Thanks. I saw that when it failed.

@davidbernick
Copy link
Contributor

Can one of the admins verify this patch?

@lbergelson
Copy link
Member Author

jenkins test please

@lbergelson
Copy link
Member Author

@droazen this is ready for review now. The tests failures are all due to accidentally getting htjsdk 2.5.1 and are fixed in the commit that's running now.

@davidbernick
Copy link
Contributor

jenkins test please

@davidbernick
Copy link
Contributor

@lbergelson i didn't have you on the whitelist for that command. now i do. we want to explicitly authorize who can run that because it's essentially allowing pull-requests to run on our servers so we need a human thumbs-up.

@coveralls
Copy link
Collaborator

Coverage Status

Coverage increased (+0.07%) to 81.95% when pulling 0d31d0f on lb_tiledb into bcb1808 on master.

@lbergelson
Copy link
Member Author

@davidbernick Can we inherit permissions to launch things from github like our jenkins edit permissions do?

@droazen
Copy link
Contributor

droazen commented Jul 19, 2016

Add a github wiki page with instructions on using TileDB with GATK4

@@ -19,6 +21,13 @@ env:
- GCLOUD=$HOME/gcloud/google-cloud-sdk/bin
- CLOUDSDK_PYTHON_SITEPACKAGES=1
- GOOGLE_APPLICATION_CREDENTIALS=$TRAVIS_BUILD_DIR/servicekey.json
#GenomicsDB Stuff
- GATK_GENOMICSDB_BIN=src/test/resources/large/tiledb_bin
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Create a ticket to replace checked-in binaries with binaries packaged in a jar pulled down via maven, like our other native dependencies.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also create a ticket to ask intel to statically link TileDB dependencies.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So all the travis stuff is in order to run tiledb as an executable from Runtime.exec. It's necessary to do so for the workspace / database creation. Intel has said they're working on a write api, so hopefully it will not be necessary to have a separate binary for anything when running gatk.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should still be getting the binaries via maven rather than checking them into our repo directly, don't you agree?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're working with 2 binaries here. One is properly statically linked and contained in the jar we pull from the maven, the others are ones we checked into the repo. Eventually, the jar will include the functionality of the ones we checked into the repo so we can remove those.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@droazen : As @lbergelson mentioned, there are two types of binaries being used here:

  1. Shared library packaged inside the GenomicsDB jar - this binary is portable (dependencies statically linked in as much as possible) across different versions of Linux and Macs and is distributed on Maven central. Users who build GATK will pull this jar file as part of the build process.
  2. Executables packaged inside the GATK repo - these executables are for GATK-GenomicsDB CI testing in Travis only. The GATK repo could pack pre-loaded TileDB test arrays; however, TileDB requires that the full absolute path of arrays be the same on the loader system and the query system. Hence, Louis had to distribute the GenomicsDB binaries and load TileDB test arrays from the test gVCFs in Travis.
    The second set of binaries are not portable (since they depend on MPI, the specific gcc compiler used etc) - we have no means to distribute portable versions of these executables. However, most users will/should NOT be using these binaries directly from the GATK repo - they should follow instructions from the GenomicsDB wiki (in the future, we might distribute platform specific rpms/dpkgs with the dependencies enumerated).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @lbergelson said, hopefully when we have a write API it will eliminate the need to check in these extra binaries.

}

@SuppressWarnings("unchecked")
private static <T extends Feature> FeatureReader<T> getFeatureReader(final FeatureInput<T> featureInput, final Class<? extends Feature> featureType) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is T always the same class as featureType here?

@droazen
Copy link
Contributor

droazen commented Jul 27, 2016

Final review complete, back to @lbergelson. Looks really good now, just minor comments + some additional unit tests needed for gendb FeatureInputs. Merge after addressing comments.

@droazen droazen assigned lbergelson and unassigned droazen Jul 27, 2016
@droazen
Copy link
Contributor

droazen commented Jul 28, 2016

Taking this over in @lbergelson's absence in the interests of getting it merged -- I'll address the remaining minor comments.

@droazen droazen assigned droazen and unassigned lbergelson Jul 28, 2016
@droazen
Copy link
Contributor

droazen commented Jul 28, 2016

All comments addressed -- will squash and merge once tests pass.

this is enabled for the driving variants of VariantWalker as well as any auxiliary FeatureInput

a genomicsdb workspace is referenced by putting the loader.json that was used to create the arrays as well as a query.json into a directory
  this is then specified with a url of the form gendb://path/to/directory
    i.e
      /myfiles/mygendbfiles/loader.json
      /myfiles/mygendbfiles/query.json

    SomeVariantWalker -V gendb:///myfiles/mygendbfiles

FeatureWalker isn't yet wired to support gendb urls
performance is untested

invalid input files are likely to result in Segfaults or non-helpful errors

Resolves #1647
@coveralls
Copy link
Collaborator

Coverage Status

Coverage increased (+0.1%) to 82.23% when pulling fa6e06e on lb_tiledb into 796497f on master.

@coveralls
Copy link
Collaborator

Coverage Status

Coverage increased (+0.1%) to 82.23% when pulling fa6e06e on lb_tiledb into 796497f on master.

@droazen droazen merged commit 4376cb9 into master Jul 28, 2016
@droazen droazen deleted the lb_tiledb branch July 28, 2016 20:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Prototype TileDB integration (for CombineGVCFs and GenotypeGVCFs)
5 participants