HBase backend for Genotypes #1246

Closed · wants to merge 2 commits into base: master

@jpdna
Member

jpdna commented Nov 7, 2016

Requesting a first-pass review of the integration of HBase input and output functions into master.

The code in this PR excludes the "custom encoding" options for clarity, and thus sticks with saving full Avro-serialized Genotype records as the values stored in HBase.
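
For reviewers, a minimal sketch of what that value encoding amounts to (the row key layout and the column/family names here are illustrative assumptions, not this PR's actual schema):

import java.io.ByteArrayOutputStream
import org.apache.avro.io.EncoderFactory
import org.apache.avro.specific.SpecificDatumWriter
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.util.Bytes
import org.bdgenomics.formats.avro.Genotype

// Serialize the full Avro Genotype record and store the bytes as the cell value.
def genotypeToPut(g: Genotype): Put = {
  val baos = new ByteArrayOutputStream()
  val writer = new SpecificDatumWriter[Genotype](Genotype.getClassSchema)
  val encoder = EncoderFactory.get().binaryEncoder(baos, null)
  writer.write(g, encoder)
  encoder.flush()
  // Illustrative row key: contig, zero-padded position, sample.
  val rowKey = Bytes.toBytes("%s_%09d_%s".format(g.getContigName, g.getStart, g.getSampleId))
  new Put(rowKey).addColumn(Bytes.toBytes("g"), Bytes.toBytes(g.getSampleId), baos.toByteArray)
}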

Todo:

  • Add unit tests using an embedded mini-HBase cluster (see the sketch after this list)
  • Add more parameter checking and informative error messages, e.g. when the database table doesn't exist
  • Integrate and rebase against upcoming changes to Genotype schemas when available
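
For the mini-cluster tests in the first item, a sketch of the embedded setup (assuming HBaseTestingUtility from the hbase-testing-util artifact; the table and column family names are placeholders):

import org.apache.hadoop.hbase.{ HBaseTestingUtility, TableName }
import org.apache.hadoop.hbase.util.Bytes

// Spins up an in-process HBase (plus mini DFS/ZooKeeper) for the test run.
val util = new HBaseTestingUtility()
util.startMiniCluster()
try {
  // A real test would point HBaseFunctions at util.getConfiguration and
  // assert that a save/load round trip returns the input genotypes.
  util.createTable(TableName.valueOf("genotypes_test"), Bytes.toBytes("g"))
} finally {
  util.shutdownMiniCluster()
}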

Questions:

  • This PR adds a new hbase/HBaseFunctions.scala module parallel to rdd in the adam-core package structure. Is this OK, or are there other suggestions?

  • For the first round of HBase code integration, I suggest we target CDH HBase only, as that is where we are doing testing and deployment. In the future we can add a POM profile for Apache HBase 2.0, though this would seem to require using an HBase 2.0 snapshot, as Spark support doesn't exist in the last official non-Cloudera Apache release.

@AmplabJenkins

AmplabJenkins commented Nov 7, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1558/

Build result: FAILURE

[...truncated 3 lines...]
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
Wiping out workspace first.
Cloning the remote Git repository
Cloning repository https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git --version # timeout=10
 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/*:refs/remotes/origin/* # timeout=15
 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/*:refs/remotes/origin/pr/* # timeout=15
 > /home/jenkins/git2/bin/git rev-parse origin/pr/1246/merge^{commit} # timeout=10
 > /home/jenkins/git2/bin/git branch -a --contains f8aac4ff958d24d171e0a7edf3fe9bc92d6589b7 # timeout=10
 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1246/merge^{commit} # timeout=10
Checking out Revision f8aac4ff958d24d171e0a7edf3fe9bc92d6589b7 (origin/pr/1246/merge)
 > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
 > /home/jenkins/git2/bin/git checkout -f f8aac4ff958d24d171e0a7edf3fe9bc92d6589b7
First time build. Skipping changelog.
Triggering ADAM-prb » 2.6.0,2.11,1.5.2,centos
Triggering ADAM-prb » 2.6.0,2.10,1.5.2,centos
Touchstone configurations resulted in FAILURE, so aborting...
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'

Test FAILed.

@jpdna
Member

jpdna commented Nov 7, 2016

Can someone help point me to which log I should be looking at to diagnose this build failure? Thx!

@fnothaft
Member

fnothaft commented Nov 7, 2016

See https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/1558/HADOOP_VERSION=2.6.0,SCALAVER=2.10,SPARK_VERSION=1.5.2,label=centos/console. Specifically:

++ git status --porcelain
+ test -n ' M adam-core/src/main/scala/org/bdgenomics/adam/hbase/HBaseFunctions.scala'
+ echo 'Please run '\''./scripts/format-source'\'''
Please run './scripts/format-source'
+ exit 1
Build step 'Execute shell' marked build as failure

A nice easy one to fix ;).

I'll make a review pass later today.

Update HBase dependencies in POM
Added HBaseFunctions

Added VCF genotype save and load HBase functions

Added saveHBaseGenotypesSingleSample, packing multiple genotypes at the same position and same allele into the same HBase value

Added multisample-capable loadHBaseGenotypes()

Removed commented-out earlier versions of save and load genotypes

removed more dead code

Clean up formatting - limit line length

Added saveHbaseSampleMetadata function

Added save and load SequenceDictionary functions to HBaseFunctions

Added createHbaseGenotypeTable

Adding loadGenotypeRddFromHBase

in progress updates to multi-sample hbase save

multi sample VCF save and load now working

Added repartitioning parameter to HBase genotype load

Added comments identifying the public API vs helper functions

COB Aug 25

Added genomic start and stop parameters to loadGenotypesFromHBaseToGenotypeRDD

Added boolean saveSequenceDictionary toggle parameter to saveVariantContextRddToHBase

fixed start, stop null ptr exception

first steps in adding hbaseToVariantcontextRDD

Changed region query to use ADAM ReferenceRegion

Added custom HBaseEncoder1 save function

Added custom Encoder1 HBase loader

Added Encoder1 HBase variant context loader

Working - before rowkey int

Changed end in key to be size, added data block encoding in create table

Added create table splits

Removed dead code of encoder2

Added option to repartition vcrdd before saving to HBase

Added bulk load save to HBase option

changed to CDH HBase API dependencies in POM

allow sample name file list as input to load functions

made sample_ids list parameter in load optional

Added deleteSamplesFromHBase function

Fixed bulk delete and made loadVariantContext work even when requested sample IDs are missing

Removed code from failed version of sample delete function

Moved delete function up with test of genotype code

Fixed errors after rebase against master

small formatting cleanup

first pass hbase cli demo

second pass hbase cli demo

remove saveSeqDict check

add seq dict id to cli

import clean up

removed unneeded demo and temp.js code

Ran ./scripts/format-source due to build failure on Jenkins
@Args4jOption(required = false, name = "-use_existing_seqdict", usage = "Use an existing sequence dictionary, don't write a new one")
var useExistingSeqDict: Boolean = false
@Args4jOption(required = false, name = "-repartition_num", usage = "Reparition into N partitions prior to writing to HBase")

@heuermh

heuermh Nov 7, 2016
Member

typo in usage: Reparition → Repartition. Also, might -partitions be a better name for this argument?

@fnothaft

fnothaft Nov 7, 2016
Member

IIRC, we've used -repartition across other commands, e.g., Transform (which has both -coalesce and -repartition). I like -repartition because it implies the cost of the operation: we take a shuffle.
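
As a concrete sketch of how such a flag is usually applied before the write (argument names follow this PR's CLI; the repartition call is standard Spark and is what incurs the shuffle):

// 0 is treated as "leave the partitioning alone"; any positive value forces
// a shuffle into that many partitions before writing to HBase.
val rdd = vcRdd.rdd
val toSave = if (args.repartitionNum > 0) rdd.repartition(args.repartitionNum) else rdd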

@Args4jOption(required = false, name = "-repartition_num", usage = "Reparition into N partitions prior to writing to HBase")
var repartitionNum: Int = 0
@Args4jOption(required = true, name = "-staging_folder", usage = "location for temporary files during bulk load")

@heuermh

heuermh Nov 7, 2016
Member

typo in usage: location → Location

@@ -137,6 +137,10 @@
<artifactId>spark-core_${scala.version.prefix}</artifactId>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.10</artifactId>

@heuermh

heuermh Nov 7, 2016
Member

use ${scala.version.prefix}

@fnothaft

fnothaft Nov 7, 2016
Member

OOC, why do we depend on Spark streaming here?

@jpdna

jpdna Nov 22, 2016
Member

Not sure, but commenting it out of the POM causes the build to fail.
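
For reference, heuermh's suggested change would track the build's Scala version instead of hard-coding 2.10:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming_${scala.version.prefix}</artifactId>
</dependency>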

@@ -185,5 +189,37 @@
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
</dependency>

@heuermh

heuermh Nov 7, 2016
Member

remove extra blank lines

@@ -0,0 +1,508 @@
package org.bdgenomics.adam.hbase

@heuermh

heuermh Nov 7, 2016
Member

missing license header

@fnothaft

fnothaft Nov 7, 2016
Member

The license header shows up as there for me?

<repositories>
<repository>
<id>cloudera</id>
<url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>

@heuermh

heuermh Nov 7, 2016
Member

Oh, this could be a major problem. We can't* deploy to Maven Central with non-Maven Central repository dependencies.

* "can't" here may strictly be too strong a word; I'll have to investigate
@@ -1,5 +1,15 @@
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">

@heuermh

heuermh Nov 7, 2016
Member

remove extra blank lines

@@ -464,6 +474,22 @@
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.10</artifactId>

@heuermh

heuermh Nov 7, 2016
Member

use ${scala.version.prefix}

@@ -564,6 +590,74 @@
<artifactId>scala-guice_${scala.version.prefix}</artifactId>
<version>4.0.1</version>
</dependency>

@heuermh

heuermh Nov 7, 2016
Member

remove commented out code, create an issue to add these back in later

@fnothaft

fnothaft Nov 7, 2016
Member

+1
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-hadoop2-compat</artifactId>
<version>1.2.0-cdh5.8.0</version>

@heuermh

heuermh Nov 7, 2016
Member

refactor version to a variable ${hbase.version} here and below

@fnothaft

fnothaft Nov 7, 2016
Member

+1, and remove spaces between dependencies
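
Concretely, the suggestion is to hoist the CDH HBase version into a single property and reference it from each HBase artifact, e.g.:

<properties>
  <!-- one place to bump the (CDH) HBase version -->
  <hbase.version>1.2.0-cdh5.8.0</hbase.version>
</properties>

<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-hadoop2-compat</artifactId>
  <version>${hbase.version}</version>
</dependency>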

@heuermh
Member

heuermh commented Nov 7, 2016

Nice work! I made a review pass focused on API and style.

If the external Maven repository is a problem, we may have to stick with Apache HBase rather than CDH versions, even if that means waiting for them to release Spark support.

In order to test & validate the code changes, I'll need some doc on how to run and test it. :)

@AmplabJenkins

AmplabJenkins commented Nov 7, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1559/

Build result: FAILURE

[...truncated 38 lines...]
Triggering ADAM-prb » 2.6.0,2.11,1.3.1,centos
Triggering ADAM-prb » 2.3.0,2.11,1.4.1,centos
Triggering ADAM-prb » 2.6.0,2.11,2.0.0,centos
Triggering ADAM-prb » 2.3.0,2.10,1.4.1,centos
Triggering ADAM-prb » 2.6.0,2.11,1.6.1,centos
ADAM-prb » 2.3.0,2.11,1.5.2,centos completed with result SUCCESS
ADAM-prb » 2.3.0,2.10,1.5.2,centos completed with result SUCCESS
ADAM-prb » 2.3.0,2.10,2.0.0,centos completed with result FAILURE
ADAM-prb » 2.6.0,2.10,1.4.1,centos completed with result SUCCESS
ADAM-prb » 2.6.0,2.10,1.6.1,centos completed with result SUCCESS
ADAM-prb » 2.6.0,2.10,2.0.0,centos completed with result FAILURE
ADAM-prb » 2.6.0,2.11,1.4.1,centos completed with result SUCCESS
ADAM-prb » 2.3.0,2.10,1.3.1,centos completed with result SUCCESS
ADAM-prb » 2.6.0,2.10,1.3.1,centos completed with result SUCCESS
ADAM-prb » 2.3.0,2.11,1.6.1,centos completed with result SUCCESS
ADAM-prb » 2.3.0,2.11,1.3.1,centos completed with result SUCCESS
ADAM-prb » 2.3.0,2.11,2.0.0,centos completed with result FAILURE
ADAM-prb » 2.3.0,2.10,1.6.1,centos completed with result SUCCESS
ADAM-prb » 2.6.0,2.11,1.3.1,centos completed with result SUCCESS
ADAM-prb » 2.3.0,2.11,1.4.1,centos completed with result SUCCESS
ADAM-prb » 2.6.0,2.11,2.0.0,centos completed with result FAILURE
ADAM-prb » 2.3.0,2.10,1.4.1,centos completed with result SUCCESS
ADAM-prb » 2.6.0,2.11,1.6.1,centos completed with result SUCCESS
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'

Test FAILed.

@fnothaft

Thanks @jpdna! This is great work, and exciting to see out there. I need more inline docs in places to make a thorough review; when you make a cleanup pass, I'll make another review pass.

BTW, @heuermh and I both brought this up inline, but I would like to:

  • Strike all VariantContext methods and funnel through GenotypeRDD.toVariantContextRDD or VariantContextRDD.toGenotypeRDD before/after calling (see the sketch after this list).
  • Strike all code for deleting from HBase.
  • Choose one of bulk/non-bulk puts as our strategy.
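
A sketch of the funneling suggested in the first bullet, using the conversions named above (the loader call is this PR's API; its arguments here are placeholders, and its full signature may differ):

// Load genotypes from HBase once, then convert in memory, rather than
// maintaining parallel HBase load/save paths for VariantContexts.
val genotypes = HBaseFunctions.loadGenotypesFromHBaseToGenotypeRDD(sc, "genotype_table", "seqdict1")
val variantContexts = genotypes.toVariantContextRDD
// ...and symmetrically, convert back before a genotype-level save:
val genotypesAgain = variantContexts.toGenotypeRDD
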
package org.bdgenomics.adam.cli
import org.apache.spark.SparkContext

@fnothaft

fnothaft Nov 7, 2016
Member

Nit: remove whitespace.

@Argument(required = true, metaVar = "VCF", usage = "The VCF file to convert", index = 0)
var vcfPath: String = _
@Args4jOption(required = true, name = "-hbase_table", usage = "HBase table name in which to load VCF file")

@fnothaft

fnothaft Nov 7, 2016
Member

By convention, we typically make required = true options as @Arguments.
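
Following that convention, the required -hbase_table option above would become a positional argument, e.g. (a sketch, with the index chosen to follow the VCF argument):

@Argument(required = true, metaVar = "HBASE_TABLE", usage = "HBase table name in which to load VCF file", index = 1)
var hbaseTable: String = _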

@Args4jOption(required = false, name = "-use_existing_seqdict", usage = "Use an existing sequence dictionary, don't write a new one")
var useExistingSeqDict: Boolean = false
@Args4jOption(required = false, name = "-repartition_num", usage = "Reparition into N partitions prior to writing to HBase")

@fnothaft

fnothaft Nov 7, 2016
Member

IIRC, we've used -repartition across other commands, e.g., Transform (which has both -coalesce and -repartition). I like -repartition because it implies the cost of the operation: we take a shuffle.

@Args4jOption(required = true, name = "-seq_dict_id", usage = "User defined name to apply to the sequence dictionary create from this VCF, or name of existing sequence dictionary to be used")
var seqDictId: String = null
@Args4jOption(required = false, name = "-use_existing_seqdict", usage = "Use an existing sequence dictionary, don't write a new one")

@fnothaft

fnothaft Nov 7, 2016
Member

This argument seems to be unused?

HBaseFunctions.saveVariantContextRDDToHBaseBulk(sc,
vcRdd,
args.hbaseTable,
sequenceDictionaryId = args.seqDictId,

@fnothaft

fnothaft Nov 7, 2016
Member

I'd prefer to wrap all of the optional arguments in Options, as some of them will be null if not provided.
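
That is, a null-safe wrapping at the call boundary, sketched here with one of this PR's arguments:

// Option(x) is None when x is null (flag omitted) and Some(x) otherwise,
// so downstream code can pattern match instead of null-checking.
val sequenceDictionaryId: Option[String] = Option(args.seqDictId)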

}
def deleteGenotypeSamplesFromHBase(sc: SparkContext,

@fnothaft

fnothaft Nov 7, 2016
Member

Agreed; can't we just rely on the HBase CLI for this?

}
def loadGenotypesFromHBaseToGenotypeRDD(sc: SparkContext,

@fnothaft

fnothaft Nov 7, 2016
Member

+1
GenotypeRDD(genotypes, sequenceDictionary, sampleMetadata)
}
def loadGenotypesFromHBaseToVariantContextRDD(sc: SparkContext,

@fnothaft

fnothaft Nov 7, 2016
Member

+1, it is purely for going to/from htsjdk.variant.VariantContext.

@@ -564,6 +590,74 @@
<artifactId>scala-guice_${scala.version.prefix}</artifactId>
<version>4.0.1</version>
</dependency>

@fnothaft

fnothaft Nov 7, 2016
Member

+1
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-hadoop2-compat</artifactId>
<version>1.2.0-cdh5.8.0</version>

@fnothaft

fnothaft Nov 7, 2016
Member

+1, and remove spaces between dependencies

@jpdna
Member

jpdna commented Nov 8, 2016

If I add <scope>provided</scope> to the hbase dependencies in the project root pom.xml, it compiles fine, but at runtime on the bdg CDH cluster, when trying to connect to HBase, I get:

java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration

The needed jars do exist on the cluster, e.g. at:
/opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/lib/hbase/hbase-hadoop2-compat.jar

I am not sure of the correct way to get these onto the classpath so that provided scope would work, or how we would explain to users what is needed.

@fnothaft
Member

fnothaft commented Nov 8, 2016

I think you'd just provide a --conf spark.jars=/opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/lib/hbase/hbase-hadoop2-compat.jar before the -- in the adam-submit command line. If this works, we should roll this into the #493 docs.

@AmplabJenkins

AmplabJenkins commented Nov 9, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1572/

Build result: FAILURE

[...truncated 38 lines...]
Triggering ADAM-prb » 2.6.0,2.11,1.3.1,centos
Triggering ADAM-prb » 2.3.0,2.11,1.4.1,centos
Triggering ADAM-prb » 2.6.0,2.11,2.0.0,centos
Triggering ADAM-prb » 2.3.0,2.10,1.4.1,centos
Triggering ADAM-prb » 2.6.0,2.11,1.6.1,centos
ADAM-prb » 2.3.0,2.11,1.5.2,centos completed with result SUCCESS
ADAM-prb » 2.3.0,2.10,1.5.2,centos completed with result SUCCESS
ADAM-prb » 2.3.0,2.10,2.0.0,centos completed with result FAILURE
ADAM-prb » 2.6.0,2.10,1.4.1,centos completed with result SUCCESS
ADAM-prb » 2.6.0,2.10,1.6.1,centos completed with result SUCCESS
ADAM-prb » 2.6.0,2.10,2.0.0,centos completed with result FAILURE
ADAM-prb » 2.6.0,2.11,1.4.1,centos completed with result SUCCESS
ADAM-prb » 2.3.0,2.10,1.3.1,centos completed with result SUCCESS
ADAM-prb » 2.6.0,2.10,1.3.1,centos completed with result SUCCESS
ADAM-prb » 2.3.0,2.11,1.6.1,centos completed with result SUCCESS
ADAM-prb » 2.3.0,2.11,1.3.1,centos completed with result SUCCESS
ADAM-prb » 2.3.0,2.11,2.0.0,centos completed with result FAILURE
ADAM-prb » 2.3.0,2.10,1.6.1,centos completed with result SUCCESS
ADAM-prb » 2.6.0,2.11,1.3.1,centos completed with result SUCCESS
ADAM-prb » 2.3.0,2.11,1.4.1,centos completed with result SUCCESS
ADAM-prb » 2.6.0,2.11,2.0.0,centos completed with result FAILURE
ADAM-prb » 2.3.0,2.10,1.4.1,centos completed with result SUCCESS
ADAM-prb » 2.6.0,2.11,1.6.1,centos completed with result SUCCESS
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'

Test FAILed.

@jpdna
Member

jpdna commented Nov 9, 2016

For me, --conf spark.jars=... appeared not to actually change the spark.jars parameter when I checked the Spark GUI environment page.

I got closer with the following, but then ran into a problem I can't resolve.

I start the adam-shell with:

../adam/bin/adam-shell --master yarn-client --num-executors 1 \
  --executor-cores 2 --executor-memory 20g \
  --jars /opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/lib/hbase/hbase-hadoop2-compat-1.2.0-cdh5.8.0.jar,\
/opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/lib/hbase/hbase-server-1.2.0-cdh5.8.0.jar,\
/opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/lib/hbase/hbase-it-1.2.0-cdh5.8.0.jar,\
/opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/lib/hbase/hbase-client-1.2.0-cdh5.8.0.jar,\
/opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/lib/hbase/hbase-spark-1.2.0-cdh5.8.0.jar,\
/home/eecs/jpaschall/work/hbase/v12/adam/adam-assembly/target/adam_2.10-0.20.1-SNAPSHOT.jar,\
/opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/lib/hbase/lib/hbase-common-1.2.0-cdh5.8.0.jar,\
/opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/lib/hbase/lib/protobuf-java-2.5.0.jar

and looking in the UI, spark.jars does contain all these paths.

I am even able to run:

import org.bdgenomics.adam.rdd.ADAMContext
import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.hbase.HBaseFunctions
import org.apache.hadoop.hbase.HBaseConfiguration

the last of which wouldn't work until I added those HBase jars.

However, when I then run the first function that uses HBase:

HBaseFunctions.createHBaseGenotypeTable("nov8_2", "/home/eecs/jpaschall/work/hbase/v3/run1/splits.chr22")

I get the error below:

scala> HBaseFunctions.createHBaseGenotypeTable("nov8_2", "/home/eecs/jpaschall/work/hbase/v3/run1/splits.chr22")
java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/protobuf/generated/MasterProtos$MasterService$BlockingInterface
    at java.lang.Class.forName0(Native Method)

which, despite including the jar /opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/lib/hbase/lib/protobuf-java-2.5.0.jar, I can't seem to resolve. Seems especially hairy that it has to do with a native method and generated protobuf code.

For the moment I think I will work on other aspects of the PR and see if others have a suggestion, as it works fine when the dependencies are included without <scope>provided</scope>.
I know it is unfortunate to package these dependencies like this when they should be able to be provided, and it may motivate us to keep the HBase work in a separate module or repo if these dependencies must be included.

@fnothaft
Member

fnothaft commented Nov 9, 2016

@jpdna The class you are looking for is in /opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/lib/hbase/lib/hbase-protocol-1.2.0-cdh5.8.0.jar.

@fnothaft
Member

fnothaft commented Nov 9, 2016

Also the way I found the missing class was using the command:

find /opt/cloudera/parcels/CDH-5.8.0-1.cdh5.8.0.p0.42/lib/hbase/lib/ -name "*.jar" -print -exec jar Ptvf {} \; | grep -e /opt/ -e <classname>

This command is terrible, but if it finds the class in one of the jarfiles, the path to the jarfile will immediately precede the output of the jar command. Alas.

@jpdna
Member

jpdna commented Nov 10, 2016

I get closer with hbase-protocol-1.2.0-cdh5.8.0.jar and also adding htrace-core-3.2.0-incubating.jar, as:

../adam/bin/adam-shell --master yarn-client --num-executors 1 --executor-cores 2 --executor-memory 20g --jars /opt/cloudera/parcels/CDH/jars/hbase-hadoop2-compat-1.2.0-cdh5.8.0.jar,/opt/cloudera/parcels/CDH/jars/hbase-server-1.2.0-cdh5.8.0.jar,/opt/cloudera/parcels/CDH/jars/hbase-it-1.2.0-cdh5.8.0.jar,/opt/cloudera/parcels/CDH/jars/hbase-client-1.2.0-cdh5.8.0.jar,/opt/cloudera/parcels/CDH/jars/hbase-spark-1.2.0-cdh5.8.0.jar,/home/eecs/jpaschall/work/hbase/v12/adam/adam-assembly/target/adam_2.10-0.20.1-SNAPSHOT.jar,/opt/cloudera/parcels/CDH/jars/hbase-common-1.2.0-cdh5.8.0.jar,/opt/cloudera/parcels/CDH/jars/hbase-protocol-1.2.0-cdh5.8.0.jar,/opt/cloudera/parcels/CDH/jars/htrace-core-3.2.0-incubating.jar

However, now when running the create table (and presumably other HBase functions), Spark seems unable to connect to the HBase server and times out. It works properly when the HBase library dependencies are built into the project as before; it just doesn't work with scope provided.

As discussed, I won't block on this, as we have the workaround of not using scope provided for now.

@jpdna jpdna referenced this pull request Nov 27, 2016

Closed

HBase as a separate repo #1293

@jpdna jpdna closed this Jan 3, 2017

@jpdna
Member

jpdna commented Jan 3, 2017

superseded by #1335
