
[SPARK-3880] HBase as data source to SparkSQL #4084

Closed
wants to merge 106 commits

Conversation

yzhou2001
Contributor

No description provided.

bomeng and others added 30 commits January 6, 2015 18:04
…uster".

The SBT test framework failed to put the HBase test artifacts on its classpath because of an IVY bug: sbt/sbt#861.

SBT/IVY can fail to resolve transitive dependencies defined in a 'pom-only' module, and 'hbase-testing-util' is a pom-only module.  (Maven resolves them without any problem.)
The workaround is to replace the hbase-testing-util dependency definition in spark/sql/hbase/pom.xml with the dependencies listed in that pom-only module. The cost is that developers have to edit this pom file by hand to cover unprofiled combinations of HBase/Hadoop releases.

=================================

This works only for hadoop2-compat hbase distributions.  The spark/sql/hbase/pom.xml must be edited to make it work for hadoop1-compat hbase distributions (more on that later).

The fix has been tested against HBase 0.98.5-hadoop2 and 0.98.7-hadoop2, with hadoop 2.2.0, 2.3.0, and 2.4.0.
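
For reference, here is a minimal SBT-style sketch of what the expanded dependency list can look like for a hadoop2-compat build, mirroring the hadoop1 list given further below. The versions shown assume HBase 0.98.5-hadoop2 over Hadoop 2.4.0; the authoritative module list should always be taken from the hbase-testing-util pom of your chosen release.

libraryDependencies ++= Seq(
    // Illustrative only: hand-expanded hbase-testing-util dependencies for a hadoop2-compat build
    "org.apache.hbase" % "hbase-common" % "0.98.5-hadoop2",
    "org.apache.hbase" % "hbase-common" % "0.98.5-hadoop2" % "test" classifier "tests",
    "org.apache.hbase" % "hbase-protocol" % "0.98.5-hadoop2",
    "org.apache.hbase" % "hbase-client" % "0.98.5-hadoop2",
    "org.apache.hbase" % "hbase-server" % "0.98.5-hadoop2",
    "org.apache.hbase" % "hbase-server" % "0.98.5-hadoop2" % "test" classifier "tests",
    "org.apache.hbase" % "hbase-hadoop-compat" % "0.98.5-hadoop2",
    "org.apache.hbase" % "hbase-hadoop-compat" % "0.98.5-hadoop2" % "test" classifier "tests",
    "org.apache.hbase" % "hbase-hadoop2-compat" % "0.98.5-hadoop2",
    "org.apache.hbase" % "hbase-hadoop2-compat" % "0.98.5-hadoop2" % "test" classifier "tests",
    "org.apache.hadoop" % "hadoop-client" % "2.4.0",
    "org.apache.hadoop" % "hadoop-minicluster" % "2.4.0" % "test"
)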

=================
SBT test commands
=================

sbt/sbt -Phive,hbase,yarn,hadoop-2.2 -Dhadoop.version=2.2.0  "hbase/test-only org.apache.spark.sql.hbase.BasicQueriesSuite"
sbt/sbt -Phive,hbase,yarn,hadoop-2.3 -Dhadoop.version=2.3.0   "hbase/test-only org.apache.spark.sql.hbase.BasicQueriesSuite"
sbt/sbt -Phive,hbase,yarn,hadoop-2.4 -Dhadoop.version=2.4.0   "hbase/test-only org.apache.spark.sql.hbase.BasicQueriesSuite"

=================
MVN test commands
=================

mvn -e -Pyarn,hbase,hadoop-2.2  -Dhadoop.version=2.2.0  -pl sql/hbase clean  test -DwildcardSuites=org.apache.spark.sql.hbase.BasicQueriesSuite
mvn -e -Pyarn,hbase,hadoop-2.3  -Dhadoop.version=2.3.0  -pl sql/hbase clean  test -DwildcardSuites=org.apache.spark.sql.hbase.BasicQueriesSuite
mvn -e -Pyarn,hbase,hadoop-2.4  -Dhadoop.version=2.4.0  -pl sql/hbase clean  test -DwildcardSuites=org.apache.spark.sql.hbase.BasicQueriesSuite

===========================

To build against a hadoop1-compatible HBase release (for example, over Hadoop 1.2.1), you will need to edit the pom to your needs.
Add the appropriate hadoop profile to import the Hadoop 1.2.1 libraries, and redefine the hbase dependencies to match the module structure of your chosen HBase release.

The SBT libraryDependencies declaration below describes the mvn dependencies you would need to define in spark/sql/hbase/pom.xml to build
Spark against HBase v0.98.5-hadoop1 over Hadoop v1.2.1.

libraryDependencies ++= Seq(
    "org.scalatest" %% "scalatest" % "2.2.1" % "test",
    "com.novocode" % "junit-interface" % "0.9" % "test",
    "org.apache.hbase" % "hbase-common" % "0.98.5-hadoop1",
    "org.apache.hbase" % "hbase-common" % "0.98.5-hadoop1" % "test" classifier "tests",
    // no hbase-annotation module before v0.98.7
    //"org.apache.hbase" % "hbase-annotations" % "0.98.5-hadoop1",
    //"org.apache.hbase" % "hbase-annotations" % "0.98.5-hadoop1" % "test" classifier "tests",
    "org.apache.hbase" % "hbase-protocol" % "0.98.5-hadoop1",
    "org.apache.hbase" % "hbase-client" % "0.98.5-hadoop1",
    "org.apache.hbase" % "hbase-server" % "0.98.5-hadoop1",
    "org.apache.hbase" % "hbase-server" % "0.98.5-hadoop1" % "test" classifier "tests",
    "org.apache.hbase" % "hbase-hadoop-compat" % "0.98.5-hadoop1",
    "org.apache.hbase" % "hbase-hadoop-compat" % "0.98.5-hadoop1" % "test" classifier "tests",
    "org.apache.hbase" % "hbase-hadoop1-compat" % "0.98.5-hadoop1",
    "org.apache.hbase" % "hbase-hadoop1-compat" % "0.98.5-hadoop1" % "test" classifier "tests",
    "org.slf4j" % "slf4j-log4j12" % "1.6.4",
    "org.apache.hadoop" % "hadoop-core" % "1.2.1",
    "org.apache.hadoop" % "hadoop-client" % "1.2.1",
    "org.apache.hadoop" % "hadoop-minicluster" % "1.2.1",
    "org.apache.hadoop" % "hadoop-test" % "1.2.1",
    "org.apache.hadoop" % "hadoop-tools" % "1.2.1"
)

Note that there are no 'hbase-annotations' modules in HBase releases prior to v0.98.7.
sparksburnitt and others added 24 commits January 13, 2015 15:45
… same mapreduce hfile output dir during the bulkloader's text->hfile transform, resulting in broken HFile imports after the first test. The workaround is to make sure the physical HTable names are unique across test cases in the suite (a sketch follows the commit list).
…classloading problem and fix typo: ExludedDependencies -> ExcludedDependencies
…' to fix 3 test cases expecting parser to return ParallelizedBulkLoadIntoTableCommand.
…le on a nonexistent HBase table created an HBase table with only one presplit region
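
The sketch below illustrates the unique-table-name workaround described in the first bulkloader commit above. The TestTableNames helper and its method name are hypothetical and shown only to make the idea concrete; the actual test code may do this differently.

import java.util.concurrent.atomic.AtomicInteger

// Illustrative only -- not the actual code in this PR. Keeping physical
// HTable names unique per test case means each bulkload writes its HFiles
// under a distinct output directory, so imports no longer collide.
object TestTableNames {
  private val counter = new AtomicInteger(0)

  // uniqueName("bulkload_test") => "bulkload_test_0", "bulkload_test_1", ...
  def uniqueName(prefix: String): String =
    s"${prefix}_${counter.getAndIncrement()}"
}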
@AmplabJenkins

Can one of the admins verify this patch?

@scwf
Contributor

scwf commented Jan 17, 2015

@yzhou2001, we need to rebase our branch

@OopsOutOfMemory
Contributor

And we need to check the coding style.

@pwendell
Contributor

Hi - thanks for working on this... it looks interesting. I'd like to close this issue (i.e. the PR) and discuss more on the JIRA/dev list rather than having a big pull request like this. For very large features this is the way we do it. If you look at the wiki, it says: "If you are proposing a larger change, attach a design document to your JIRA first (example) and email the dev mailing list to discuss it."

https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

@pwendell
Contributor

Also one thing that would help is if you could create a standalone project for this on github (see spark-avro).

@asfgit closed this in e12b5b6 on Jan 18, 2015