
into(Spark/HDFS, SQL DBs) #31

Open
cpcloud opened this issue Jan 18, 2015 · 4 comments
cpcloud commented Jan 18, 2015

migration from blaze

blaze/blaze#582

Reposting here for convenience:

from @chdoig

As Spark is becoming a popular backend, it's a common use case for people to want to transfer their datasets in DBs to a Spark/HDFS cluster.

It would be nice to have an easy interface for end-users to transfer their tables in DBs to a Cluster.

into(Spark/HDFS, SQL DBs)

A lot of people are talking about Tachyon these days; it may be worth a look:
http://tachyon-project.org/
http://ampcamp.berkeley.edu/big-data-mini-course/tachyon.html

This might be related to @quasiben's work on SparkSQL. Maybe one barrier to people starting to use SparkSQL is how they should make that transfer, since:

A SchemaRDD can be created from an existing RDD, Parquet file, a JSON dataset, or by running HiveQL against data stored in Apache Hive.

But I'm not able to find how to make that connection from an existing SQL DB:
http://spark.apache.org/docs/latest/sql-programming-guide.html

cc: @mrocklin

@cpcloud cpcloud self-assigned this Feb 7, 2015
@cpcloud cpcloud modified the milestones: 0.3.1, 0.3.2 Mar 6, 2015
@cpcloud cpcloud modified the milestone: 0.3.2 Mar 19, 2015
cpcloud commented Apr 3, 2015

FWIW, Spark can use the JDBC connector to do computation. It looks like this:

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(...)
sql = HiveContext(sc)
df = sql.load("jdbc", url="jdbc:postgresql:dbserver", dbtable="schema.tablename")

I don't think this would be that tricky. The function would look something like this:

@append.register(SQLContext, sa.Table)
def sql_to_sparksql(ctx, tb, **kwargs):
    url = connection_string_from_engine(tb.bind)  # <- not implemented, but I think easy
    dbtable = [tb.name]
    if tb.schema is not None:
        dbtable.insert(0, tb.schema)  # qualify the table name with its schema
    return ctx.load('jdbc', url=url, dbtable='.'.join(dbtable))
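The connection_string_from_engine piece above is unimplemented; a minimal sketch of the idea, assuming the input is a SQLAlchemy-style URL string (for an Engine you'd pass str(engine.url)) and using only the standard library. The helper name and shape are hypothetical, not odo API, and this only covers the PostgreSQL JDBC URL form:

```python
from urllib.parse import urlsplit

def jdbc_url_from_sqlalchemy_url(sa_url):
    """Translate e.g. 'postgresql://user:pw@host:5432/db' into a
    PostgreSQL-style JDBC URL:
    'jdbc:postgresql://host:5432/db?user=user&password=pw'."""
    parts = urlsplit(sa_url)
    dialect = parts.scheme.split('+')[0]  # drop a driver suffix like '+psycopg2'
    netloc = parts.hostname or 'localhost'
    if parts.port:
        netloc += ':%d' % parts.port
    url = 'jdbc:%s://%s%s' % (dialect, netloc, parts.path)
    # JDBC typically carries credentials as query parameters
    params = []
    if parts.username:
        params.append('user=%s' % parts.username)
    if parts.password:
        params.append('password=%s' % parts.password)
    if params:
        url += '?' + '&'.join(params)
    return url
```

e.g. jdbc_url_from_sqlalchemy_url('postgresql://u:p@dbserver:5432/mydb') gives 'jdbc:postgresql://dbserver:5432/mydb?user=u&password=p'.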

cpcloud commented Apr 3, 2015

If anyone is interested in implementing this, you'll have to sudo apt-get install libpostgresql-jdbc-java for testing on Travis-CI, then set SPARK_CLASSPATH to one of the following:

/usr/share/java/postgresql-jdbc3-9.1.jar
/usr/share/java/postgresql-jdbc4-9.1.jar

because, according to the Travis docs, the default install is Postgres 9.1:

By default, the build environment will have version 9.1 running already.
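From Python, that environment variable presumably has to be set before the SparkContext (and its JVM) is created, e.g. at the top of the test setup; a sketch, using the Ubuntu jar path mentioned above:

```python
import os

# Point Spark at the distro's PostgreSQL JDBC jar. This must happen
# before pyspark launches the JVM (i.e. before SparkContext is created),
# or the driver won't see the JDBC driver on its classpath.
os.environ['SPARK_CLASSPATH'] = '/usr/share/java/postgresql-jdbc4-9.1.jar'
```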

cpcloud commented Apr 3, 2015

The Postgres JDBC connection string syntax is documented here:

https://jdbc.postgresql.org/documentation/91/connect.html

One really annoying thing: JDBC connection strings aren't standardized and differ from vendor to vendor.

Here's Oracle's:

http://docs.oracle.com/cd/B28359_01/java.111/b31224/urls.htm#BEIJFHHB

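To illustrate the lack of standardization, a rough comparison of the two vendors' URL shapes (taken from the docs linked above; the field names are placeholders, and any per-vendor dispatch in odo would need a table roughly like this):

```python
# Per-vendor JDBC URL shapes -- there is no common standard.
JDBC_URL_TEMPLATES = {
    # PostgreSQL: jdbc:postgresql://host:port/database
    'postgresql': 'jdbc:postgresql://{host}:{port}/{database}',
    # Oracle "thin" driver: jdbc:oracle:thin:@host:port:SID
    'oracle': 'jdbc:oracle:thin:@{host}:{port}:{database}',
}

def jdbc_url(vendor, host, port, database):
    """Fill in the vendor-specific URL template."""
    template = JDBC_URL_TEMPLATES[vendor]
    return template.format(host=host, port=port, database=database)
```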

@cpcloud cpcloud added this to the 0.3.2 milestone Apr 3, 2015
@cpcloud cpcloud modified the milestones: 0.3.3, 0.3.2 Apr 17, 2015
@cpcloud cpcloud modified the milestones: 0.3.3, 0.3.4 Jul 1, 2015
@cpcloud cpcloud modified the milestones: 0.3.4, 0.4.0 Sep 15, 2015
@cpcloud cpcloud modified the milestones: 0.4.0, 0.3.5 Oct 5, 2015
@cpcloud cpcloud removed this from the 0.4.0 milestone Dec 4, 2015
@cpcloud cpcloud modified the milestones: 0.4.1, 0.4.0 Dec 4, 2015
@kwmsmith kwmsmith modified the milestones: 0.4.1, 0.4.2, 0.5.0 Feb 2, 2016
@kszucs kszucs mentioned this issue Feb 24, 2016