This repository has been archived by the owner on Jan 31, 2020. It is now read-only.

Somatic variation fails because of inability to find GATK #47

Closed
malachig opened this issue Nov 27, 2013 · 12 comments

Comments

@malachig
Collaborator

The first test of somatic-variation hit an error after about 1 min. The top of the error log looks like this:

2013-11-27 16:41:10-0600 clia1: Executing detect variants step
2013-11-27 16:41:14-0600 clia1: 2013/11/27 16:41:14 Genome::Sys: Failed to find jar GenomeAnalysisTK at version 2.4
2013-11-27 16:41:14-0600 clia1: ERROR: Failed to find jar GenomeAnalysisTK at version 2.4
2013-11-27 16:41:14-0600 clia1: 2013/11/27 16:41:14 Genome::Model::Tools::DetectVariants2::Strategy id((gatk-somatic-indel 5336 filtered by false-indel v1 [--bam-readcount-version 0.4 --bam-readcount-min-base-quality 15]) unique union (pindel 0.5 filtered by pindel-somatic-calls v1 then pindel-vaf-filter v1 [--variant-freq-cutoff=0.08] then pindel-read-support v1) unique union (varscan-somatic 2.2.6 filtered by varscan-high-confidence-indel v1 then false-indel v1 [--bam-readcount-version 0.4 --bam-readcount-min-base-quality 15]) unique union (strelka 0.4.6.2 [isSkipDepthFilters = 1])): Could not call has_version on the class Genome::Model::Tools::DetectVariants2::GatkSomaticIndel
2013-11-27 16:41:14-0600 clia1: ERROR: Could not call has_version on the class Genome::Model::Tools::DetectVariants2::GatkSomaticIndel
2013-11-27 16:41:14-0600 clia1: Command module died or returned undef.

We need to determine how software versions are found in this part of DV2...

@gatoravi
Contributor

gatoravi commented Dec 1, 2013

Genome/Sys.pm tries to find GATK version 2.4 using the environment variable $ENV{GENOME_JAR_PATH} (at line 385):

my @dirs = split(':', $ENV{GENOME_JAR_PATH});

This variable is set to /usr/share/java on the standalone and within TGI.

Inside TGI, /usr/share/java contains the jar file GenomeAnalysisTK-2.4.jar and a symlink GenomeAnalysisTK.jar -> GenomeAnalysisTK-2.4.jar.
On the standalone, both the symlink and the jar file are missing.

It looks like this directory has to be replicated on the standalone install. Is it just a matter of pointing the environment variable to a different folder where these jars already exist, or does the whole directory need to be replicated?

Quite a few things are present in this directory on the TGI end but seem to be missing on the standalone install (picardtools, weka, etc.).
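The lookup described above can be sketched in shell. This is only an illustration, not the Genome::Sys implementation: the function name find_jar is hypothetical, and the `<name>-<version>.jar` naming is an assumption inferred from the GenomeAnalysisTK-2.4.jar file seen at TGI.

```shell
#!/bin/sh
# Hypothetical helper (not part of Genome::Sys): search each directory in the
# colon-separated GENOME_JAR_PATH for <name>-<version>.jar and print the
# first match. The file naming scheme is an assumption.
find_jar() {
    jar_name=$1
    jar_version=$2
    old_ifs=$IFS
    IFS=:   # split on ':' like split(':', $ENV{GENOME_JAR_PATH}) in Perl
    for dir in $GENOME_JAR_PATH; do
        candidate="$dir/$jar_name-$jar_version.jar"
        if [ -e "$candidate" ]; then
            IFS=$old_ifs
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    IFS=$old_ifs
    echo "Failed to find jar $jar_name at version $jar_version" >&2
    return 1
}

# Example call (commented out so that sourcing this file has no side effects):
# GENOME_JAR_PATH=/usr/share/java; find_jar GenomeAnalysisTK 2.4
```

Because the path is a colon-separated list, pointing GENOME_JAR_PATH at an additional directory that already holds the jars would be enough for a lookup of this shape; the whole of /usr/share/java would not need to be replicated.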

@malachig
Collaborator Author

malachig commented Dec 1, 2013

Note from Scott on this issue:

Most of those envs that are set to a global network path at TGI are set to something inside the sw directory for the standalone box. There should be some tgz with some Java stuff next to the apps*.tgz.

The java tgz may not have the latest stuff. If so, a new java tgz needs to be made, with the added stuff and a different date, and the makefile should be updated to download it too.

Some of the java stuff has been packaged as debs. I think Allison did this for gatk. If it is, a fresh genome-snapshot-deps package should solve the problem. The best people to talk with about the state of that are Matt Callaway and Nathan Nutter. Matt was trying to take it past the state in which I left it. There were directories for Ubuntu Lucid and Precise, and the precise directory should have equivalents of the same packages. Where this was not possible, the files listing those deps are broken out into a *.missing file.

@malachig
Collaborator Author

malachig commented Dec 1, 2013

Currently the only things in 'java-2013-08-27.tgz' are: rdp-classifier_*, samtools_*, VarScan*, and weka.jar

As far as I can tell, the only globally defined environment variable related to Java is:
GENOME_SW_LEGACY_JAVA=/opt/gms/4K8W670/sw/java

GENOME_JAR_PATH appears to be specified in:
Genome/Env/GENOME_JAR_PATH.pm

This does seem to work in the standalone GMS:
% perl -e 'use Genome; $test=$ENV{GENOME_JAR_PATH}; print "\n$test\n"'

/usr/share/java

Even if we add GATK to the JAVA archive, it is not obvious to me from the makefile how the contents are meant to be found... If the system looks for them in '/usr/share/java' ... I see no active attempt to place them there. Perhaps that is only where properly packaged JAVA stuff goes?

@malachig
Collaborator Author

malachig commented Dec 1, 2013

To see what is currently in genome-snapshot-deps within TGI, go to the top level of a 'genome' checkout and run:

% cd /gscuser/mgriffit/git/genome/
% git submodule update --init genome-snapshot-deps
% cd genome-snapshot-deps/precise/
% grep -i GATK *

All I see is this:
genome-snapshot-deps-apps-external.depends.missing:libgatk-protected-java (>= 2.4-1)

Debian packages get into the standalone GMS something like this:

  • Dependencies are set up in the genome-snapshot-deps repo
  • These are periodically used to create an apt repo
  • To get a copy of this apt-repo ... from within the TGI, someone runs the Makefile in the gms repo here: '/gms/setup/stage/Makefile'
  • This creates a static version of the packages needed at that time called something like this: apt-mirror-min-ubuntu-12.04-2013.10.19.tgz
  • This gets staged to the FTP server
  • During the installation of the gms (gms/Makefile) the version of this package specified by $APT_DUMP_VERSION is installed.

@malachig
Collaborator Author

malachig commented Dec 1, 2013

It looks like 'libgatk-protected-java' is marked as 'missing' in the precise (Ubuntu 12.04) version of genome-snapshot-deps but as 'depends' in the lucid (Ubuntu 10.04) version.

One thing we could try is to install it directly in the GMS. If that works, we could take it out of the missing list for precise, put it in the regular list, and rebuild the apt repo for the standalone box.
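A quick way to compare how a package is listed in each distribution directory is sketched below. The function name dep_status is hypothetical. The `*.missing` suffix matches the file shown in the grep output above; the assumption that regular dependency lists end in `.depends` is inferred from that file name and may not hold for every file in the repo.

```shell
#!/bin/sh
# Hypothetical helper: report whether a package is listed as a regular
# dependency or parked in a *.missing file for one distribution directory.
# Assumes regular lists end in ".depends" and unavailable ones in ".missing".
dep_status() {
    dist_dir=$1   # e.g. genome-snapshot-deps/precise
    package=$2    # e.g. libgatk-protected-java
    # -F: fixed string, -q: quiet, -s: ignore files that do not exist
    if grep -Fqs -- "$package" "$dist_dir"/*.missing; then
        echo missing
    elif grep -Fqs -- "$package" "$dist_dir"/*.depends; then
        echo depends
    else
        echo absent
    fi
}

# Example calls (commented out; run from the top of a 'genome' checkout):
# dep_status genome-snapshot-deps/precise libgatk-protected-java
# dep_status genome-snapshot-deps/lucid libgatk-protected-java
```

Given the listing described above, the first call would be expected to report "missing" and the second "depends".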

Another option is to use one of the 8 versions of GATK that are currently in the 'apps' repo from /gsc/pkg/bio in the TGI.

These seem to be very old versions of GATK, though, and the one expected in the test analysis is GATK version 2.4. It seems that Genome::Sys expects this tool to be installed as a package in a standard way so that the version can be resolved as well.

@malachig
Collaborator Author

malachig commented Dec 2, 2013

One thing about that package is that it should not actually be distributed outside of TGI. It contains a patch that disables the "phone home" behavior. Even though the modification is in the "public" source tree, the package includes code from the "protected" source tree, which is under a license that does not allow redistribution.

Our GATK wrappers depend on the phone home behavior being disabled because they always insert the "-et NO_ET" argument into the command line. If you try this against a jar file without our patch it will give an error because they require an additional argument with a key provided by the Broad to allow the phone home override. Our patch skips this check.

@malachig
Collaborator Author

malachig commented Dec 2, 2013

Since we cannot redistribute GATK, we will need to set up the standalone GMS in a way that allows the user to manually install GATK after obtaining the appropriate permissions from the Broad.

  • In the standalone GMS we will install GATK in /opt/gms/$sys_id/sw/apps/gatk/$version/
  • Can we download specific versions of GATK?
  • Create instructions for the user to do this during the installation. We will need a section in the install docs explaining that we are not allowed to distribute some software, which therefore needs to be installed manually
  • The genome code will also need to be updated to look for GATK in the 'apps' dir instead of expecting a debian package to be installed. This change is legitimately specific to installing the GMS outside TGI, so this change should happen only in the 'pub' branch of gms-core. This is in: Genome/Model/Tools/Gatk.pm
  • If deemed useful, improve the warnings generated by the genome code when 'gatk' is not found. That kind of change can be done in both master and pub branches
  • Since our wrappers insert the "-et NO_ET" argument and this argument is not valid in the version of GATK downloaded from the Broad, we will also need to change the code to not use this option in the standalone GMS. This is also in: Genome/Model/Tools/Gatk.pm
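The plan above could look roughly like the following shell sketch. It is not the code in Genome/Model/Tools/Gatk.pm. The function name gatk_command, the variables GMS_ROOT, GMS_SYS_ID and GATK_DISABLE_PHONE_HOME, and the jar file name GenomeAnalysisTK.jar inside the version directory are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical sketch: locate a manually installed GATK jar under the
# standalone 'apps' directory and print the java command line to run.
# GMS_ROOT, GMS_SYS_ID, GATK_DISABLE_PHONE_HOME and the jar file name are
# assumptions, not names taken from the genome code.
gatk_command() {
    version=$1
    shift
    jar="${GMS_ROOT:-/opt/gms}/$GMS_SYS_ID/sw/apps/gatk/$version/GenomeAnalysisTK.jar"
    if [ ! -e "$jar" ]; then
        # A clearer message than "Failed to find jar": say where we looked
        # and that the user has to install GATK manually.
        echo "GATK $version not found at $jar; it must be installed manually" >&2
        return 1
    fi
    set -- java -jar "$jar" "$@"
    # Add "-et NO_ET" only when the installed jar is known to accept it
    # (for example the patched TGI build); off by default on the standalone.
    if [ "${GATK_DISABLE_PHONE_HOME:-0}" = 1 ]; then
        set -- "$@" -et NO_ET
    fi
    printf '%s\n' "$*"
}

# Example call (commented out): gatk_command 2.4 -T <walker> <other args>
```

Printing the command instead of running it keeps the sketch easy to inspect; a real wrapper would execute it and check the exit status.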

@gatoravi
Contributor

gatoravi commented Dec 5, 2013

This is how you get earlier versions of GATK.
Note: older version binaries are not distributed by the Broad; they have to be compiled from source.

"Get the package for the version you want from this page: https://github.com/broadgsa/gatk-protected/tags

From your terminal/console, navigate to the directory containing the source code. There, you run the command:

ant clean dist

This will do everything for you. The compiled binary will be in the newly-created dist directory."
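The quoted build steps can be wrapped in a small shell function. This is a sketch under assumptions: the function name build_gatk and the ANT override variable are hypothetical, and the check for build.xml relies only on Ant's default behaviour of reading build.xml from the current directory. Downloading and unpacking the tagged source is left to the user.

```shell
#!/bin/sh
# Hypothetical helper: run "ant clean dist" inside an unpacked GATK source
# tree and list what ended up in dist/. The ANT variable is an assumption
# that lets a different ant executable be substituted.
build_gatk() {
    src_dir=$1
    ant_cmd=${ANT:-ant}
    if [ ! -f "$src_dir/build.xml" ]; then
        echo "No build.xml in $src_dir; is this an unpacked source tree?" >&2
        return 1
    fi
    # Run the build in a subshell so the caller's working directory is kept.
    ( cd "$src_dir" && "$ant_cmd" clean dist ) || return 1
    if [ ! -d "$src_dir/dist" ]; then
        echo "Build finished but $src_dir/dist is missing" >&2
        return 1
    fi
    ls "$src_dir/dist"
}

# Example call (commented out): build_gatk /path/to/unpacked/gatk-protected
```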

@ghost ghost assigned gatoravi Jan 3, 2014
@gatoravi gatoravi reopened this Mar 15, 2015
@gatoravi
Contributor

I'm seeing this same issue on a somatic-variation build. Looks like the gms-pub specific change was reverted in the recent merge/refactor in master. Compare genome/genome@1b8a40c and
https://github.com/genome/genome/blob/master/lib/perl/Genome/Model/Tools/Gatk/Base.pm#L64

Figure out if master can be modified to use consistent paths.

@gatoravi
Contributor

Not sure where to bring this up on master; Issues seem to be disabled for that repo. cc'ing @nnutter for ideas.

@gatoravi
Contributor

This was added to the SGMS branch here: genome/genome@b62e2b0
We are divergent from master on this, and it might come up again.

@gatoravi
Contributor

gatoravi commented Apr 3, 2015

On closer look, master has an improved way of looking for JAR paths. We are still using the old method, which we think is OK for now.

@gatoravi gatoravi closed this as completed Apr 3, 2015