-
Notifications
You must be signed in to change notification settings - Fork 590
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
adding support for GenomicsDB as a variant input #1975
Conversation
Ping me once you've rebased onto latest master and have tests passing in travis. |
@lbergelson - missing -y in the "apt-get install "command added in the last commit. |
@kgururaj Thanks. I saw that when it failed. |
Can one of the admins verify this patch? |
jenkins test please |
@droazen this is ready for review now. The tests failures are all due to accidentally getting htjsdk 2.5.1 and are fixed in the commit that's running now. |
jenkins test please |
@lbergelson i didn't have you on the whitelist for that command. now i do. we want to explicitly authorize who can run that because it's essentially allowing pull-requests to run on our servers so we need a human thumbs-up. |
@davidbernick Can we inherit permissions to launch things from github like our jenkins edit permissions do? |
Add a github wiki page with instructions on using TileDB with GATK4 |
@@ -19,6 +21,13 @@ env: | |||
- GCLOUD=$HOME/gcloud/google-cloud-sdk/bin | |||
- CLOUDSDK_PYTHON_SITEPACKAGES=1 | |||
- GOOGLE_APPLICATION_CREDENTIALS=$TRAVIS_BUILD_DIR/servicekey.json | |||
#GenomicsDB Stuff | |||
- GATK_GENOMICSDB_BIN=src/test/resources/large/tiledb_bin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Create a ticket to replace checked-in binaries with binaries packaged in a jar pulled down via maven, like our other native dependencies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also create a ticket to ask intel to statically link TileDB dependencies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So all the travis stuff is in order to run tiledb as an executable from Runtime.exec. It's necessary to do so for the workspace / database creation. Intel has said they're working on a write api, so hopefully it will not be necessary to have a separate binary for anything when running gatk.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should still be getting the binaries via maven rather than checking them into our repo directly, don't you agree?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're working with 2 binaries here. One is properly statically linked and contained in the jar we pull from the maven, the others are ones we checked into the repo. Eventually, the jar will include the functionality of the ones we checked into the repo so we can remove those.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@droazen : As @lbergelson mentioned, there are two types of binaries being used here:
- Shared library packaged inside the GenomicsDB jar - this binary is portable (dependencies statically linked in as much as possible) across different versions of Linux and Macs and is distributed on Maven central. Users who build GATK will pull this jar file as part of the build process.
- Executables packaged inside the GATK repo - these executables are for GATK-GenomicsDB CI testing in Travis only. The GATK repo could pack pre-loaded TileDB test arrays; however, TileDB requires that the full absolute path of arrays be the same on the loader system and the query system. Hence, Louis had to distribute the GenomicsDB binaries and load TileDB test arrays from the test gVCFs in Travis.
The second set of binaries are not portable (since they depend on MPI, the specific gcc compiler used etc) - we have no means to distribute portable versions of these executables. However, most users will/should NOT be using these binaries directly from the GATK repo - they should follow instructions from the GenomicsDB wiki (in the future, we might distribute platform specific rpms/dpkgs with the dependencies enumerated).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As @lbergelson said, hopefully when we have a write API it will eliminate the need to check in these extra binaries.
} | ||
|
||
@SuppressWarnings("unchecked") | ||
private static <T extends Feature> FeatureReader<T> getFeatureReader(final FeatureInput<T> featureInput, final Class<? extends Feature> featureType) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is T
always the same class as featureType
here?
Final review complete, back to @lbergelson. Looks really good now, just minor comments + some additional unit tests needed for gendb FeatureInputs. Merge after addressing comments. |
Taking this over in @lbergelson's absence in the interests of getting it merged -- I'll address the remaining minor comments. |
All comments addressed -- will squash and merge once tests pass. |
this is enabled for the driving variants of VariantWalker as well as any auxiliary FeatureInput a genomicsdb workspace is referenced by putting the loader.json that was used to create the arrays as well as a query.json into a directory this is then specified with a url of the form gendb://path/to/directory i.e /myfiles/mygendbfiles/loader.json /myfiles/mygendbfiles/query.json SomeVariantWalker -V gendb:///myfiles/mygendbfiles FeatureWalker isn't yet wired to support gendb urls performance is untested invalid input files are likely to result in Segfaults or non-helpful errors Resolves #1647
this is enabled for the driving variants of VariantWalker as well as any auxiliary FeatureInput
a genomicsdb workspace is referenced by putting the loader.json that was used to create the arrays as well as a query.json into a directory
this is then specified with a url of the form gendb://path/to/directory
i.e
/myfiles/mygendbfiles/loader.json
/myfiles/mygendbfiles/query.json
FeatureWalker isn't yet wired to support gendb urls
performance is untested
invalid input files are likely to result in Segfaults or non-helpful errors
resolves #1647