
[SPARK-3880] HBase as data source to SparkSQL #4084

Closed
wants to merge 106 commits

Conversation

yzhou2001
Contributor

No description provided.

bomeng and others added 30 commits January 6, 2015 18:04
…uster".

The SBT test framework failed to put the HBase test artifacts on its classpath because of an IVY bug: sbt/sbt#861.

SBT/IVY can fail to resolve transitive dependencies defined in a 'pom-only' module, and 'hbase-testing-util' is a pom-only module.  (Maven resolves them without any problem.)
The workaround is to replace the hbase-testing-util dependency definition in spark/sql/hbase/pom.xml with the dependencies listed in that pom-only module. The cost is that developers have to edit this pom file by hand to cover unprofiled combinations of HBase/Hadoop releases.

=================================

This works only for hadoop2-compat hbase distributions.  The spark/sql/hbase/pom.xml must be edited to make it work for hadoop1-compat hbase distributions (more on that later).

The fix has been tested against HBase 0.98.5-hadoop2 and 0.98.7-hadoop2, with hadoop 2.2.0, 2.3.0, and 2.4.0.
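
For reference, here is a minimal SBT-style sketch of what the expanded dependency list can look like for a hadoop2-compat build, mirroring the hadoop1 list given further below. The versions shown assume HBase 0.98.5-hadoop2 over Hadoop 2.4.0; the authoritative module list should always be taken from the hbase-testing-util pom of your chosen release.

libraryDependencies ++= Seq(
    // Illustrative only: hand-expanded hbase-testing-util dependencies for a hadoop2-compat build
    "org.apache.hbase" % "hbase-common" % "0.98.5-hadoop2",
    "org.apache.hbase" % "hbase-common" % "0.98.5-hadoop2" % "test" classifier "tests",
    "org.apache.hbase" % "hbase-protocol" % "0.98.5-hadoop2",
    "org.apache.hbase" % "hbase-client" % "0.98.5-hadoop2",
    "org.apache.hbase" % "hbase-server" % "0.98.5-hadoop2",
    "org.apache.hbase" % "hbase-server" % "0.98.5-hadoop2" % "test" classifier "tests",
    "org.apache.hbase" % "hbase-hadoop-compat" % "0.98.5-hadoop2",
    "org.apache.hbase" % "hbase-hadoop-compat" % "0.98.5-hadoop2" % "test" classifier "tests",
    "org.apache.hbase" % "hbase-hadoop2-compat" % "0.98.5-hadoop2",
    "org.apache.hbase" % "hbase-hadoop2-compat" % "0.98.5-hadoop2" % "test" classifier "tests",
    "org.apache.hadoop" % "hadoop-client" % "2.4.0",
    "org.apache.hadoop" % "hadoop-minicluster" % "2.4.0" % "test"
)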

=================
SBT test commands
=================

sbt/sbt -Phive,hbase,yarn,hadoop-2.2 -Dhadoop.version=2.2.0  "hbase/test-only org.apache.spark.sql.hbase.BasicQueriesSuite"
sbt/sbt -Phive,hbase,yarn,hadoop-2.3 -Dhadoop.version=2.3.0   "hbase/test-only org.apache.spark.sql.hbase.BasicQueriesSuite"
sbt/sbt -Phive,hbase,yarn,hadoop-2.4 -Dhadoop.version=2.4.0   "hbase/test-only org.apache.spark.sql.hbase.BasicQueriesSuite"

=================
MVN test commands
=================

mvn -e -Pyarn,hbase,hadoop-2.2  -Dhadoop.version=2.2.0  -pl sql/hbase clean  test -DwildcardSuites=org.apache.spark.sql.hbase.BasicQueriesSuite
mvn -e -Pyarn,hbase,hadoop-2.3  -Dhadoop.version=2.3.0  -pl sql/hbase clean  test -DwildcardSuites=org.apache.spark.sql.hbase.BasicQueriesSuite
mvn -e -Pyarn,hbase,hadoop-2.4  -Dhadoop.version=2.4.0  -pl sql/hbase clean  test -DwildcardSuites=org.apache.spark.sql.hbase.BasicQueriesSuite

===========================

To build against a hadoop1-compatible HBase release (for example, over Hadoop 1.2.1), you will need to edit the pom to your needs.
Add the appropriate hadoop profile to import the Hadoop 1.2.1 libraries, and redefine the hbase dependencies to match the module structure of your chosen HBase release.

The SBT libraryDependencies declaration below describes the mvn dependencies you would need to define in spark/sql/hbase/pom.xml to build
Spark against HBase v0.98.5-hadoop1 over Hadoop v1.2.1.

libraryDependencies ++= Seq(
    "org.scalatest" %% "scalatest" % "2.2.1" % "test",
    "com.novocode" % "junit-interface" % "0.9" % "test",
    "org.apache.hbase" % "hbase-common" % "0.98.5-hadoop1",
    "org.apache.hbase" % "hbase-common" % "0.98.5-hadoop1" % "test" classifier "tests",
    // no hbase-annotation module before v0.98.7
    //"org.apache.hbase" % "hbase-annotations" % "0.98.5-hadoop1",
    //"org.apache.hbase" % "hbase-annotations" % "0.98.5-hadoop1" % "test" classifier "tests",
    "org.apache.hbase" % "hbase-protocol" % "0.98.5-hadoop1",
    "org.apache.hbase" % "hbase-client" % "0.98.5-hadoop1",
    "org.apache.hbase" % "hbase-server" % "0.98.5-hadoop1",
    "org.apache.hbase" % "hbase-server" % "0.98.5-hadoop1" % "test" classifier "tests",
    "org.apache.hbase" % "hbase-hadoop-compat" % "0.98.5-hadoop1",
    "org.apache.hbase" % "hbase-hadoop-compat" % "0.98.5-hadoop1" % "test" classifier "tests",
    "org.apache.hbase" % "hbase-hadoop1-compat" % "0.98.5-hadoop1",
    "org.apache.hbase" % "hbase-hadoop1-compat" % "0.98.5-hadoop1" % "test" classifier "tests",
    "org.slf4j" % "slf4j-log4j12" % "1.6.4",
    "org.apache.hadoop" % "hadoop-core" % "1.2.1",
    "org.apache.hadoop" % "hadoop-client" % "1.2.1",
    "org.apache.hadoop" % "hadoop-minicluster" % "1.2.1",
    "org.apache.hadoop" % "hadoop-test" % "1.2.1",
    "org.apache.hadoop" % "hadoop-tools" % "1.2.1"
)

Note that there are no 'hbase-annotations' modules in HBase releases prior to v0.98.7.
sparksburnitt and others added 24 commits January 13, 2015 15:45
… same mapreduce hfile output dir during the bulkloader's text->hfile transform, resulting in broken HFile imports after the first test. The workaround is to make sure the physical HTable names are unique across test cases in the suite (a sketch follows the commit list).
…classloading problem and fix typo: ExludedDependencies -> ExcludedDependencies
…' to fix 3 test cases expecting parser to return ParallelizedBulkLoadIntoTableCommand.
…le on a nonexistent HBase table created an HBase table with only one presplit region
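
The sketch below illustrates the unique-table-name workaround described in the first bulkloader commit above. The TestTableNames helper and its method name are hypothetical and shown only to make the idea concrete; the actual test code may do this differently.

import java.util.concurrent.atomic.AtomicInteger

// Illustrative only -- not the actual code in this PR. Keeping physical
// HTable names unique per test case means each bulkload writes its HFiles
// under a distinct output directory, so imports no longer collide.
object TestTableNames {
  private val counter = new AtomicInteger(0)

  // uniqueName("bulkload_test") => "bulkload_test_0", "bulkload_test_1", ...
  def uniqueName(prefix: String): String =
    s"${prefix}_${counter.getAndIncrement()}"
}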
@AmplabJenkins

Can one of the admins verify this patch?

@scwf
Contributor

scwf commented Jan 17, 2015

@yzhou2001, we need to rebase our branch

@OopsOutOfMemory
Contributor

And we need to check the coding style.

@pwendell
Contributor

Hi - thanks for working on this... it looks interesting. I'd like to close this issue (i.e. the PR) and discuss more on the JIRA/dev list rather than having a big pull request like this. For very large features this is the way we do it. If you look at the wiki, it says: "If you are proposing a larger change, attach a design document to your JIRA first (example) and email the dev mailing list to discuss it."

https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

@pwendell
Contributor

Also one thing that would help is if you could create a standalone project for this on github (see spark-avro).

@asfgit closed this in e12b5b6 on Jan 18, 2015