
into(Spark/HDFS, SQL DBs) #31

Open
cpcloud opened this issue Jan 18, 2015 · 4 comments
cpcloud commented Jan 18, 2015

migration from blaze

blaze/blaze#582

Reposting here for convenience:

from @chdoig

As Spark is becoming a popular backend, it's a common use case for people to want to transfer their datasets in DBs to a Spark/HDFS cluster.

It would be nice to have an easy interface for end-users to transfer their tables in DBs to a Cluster.

into(Spark/HDFS, SQL DBs)

A lot of people are talking about Tachyon these days; it may be worth a look:
http://tachyon-project.org/
http://ampcamp.berkeley.edu/big-data-mini-course/tachyon.html

This might be related to @quasiben's work on SparkSQL. Maybe one barrier to people starting to use SparkSQL is how they should make that transfer, since:

A SchemaRDD can be created from an existing RDD, Parquet file, a JSON dataset, or by running HiveQL against data stored in Apache Hive.

But I'm not able to find how to make that connection from an existing SQL DB:
http://spark.apache.org/docs/latest/sql-programming-guide.html

cc: @mrocklin

@cpcloud cpcloud self-assigned this Feb 7, 2015
@cpcloud cpcloud modified the milestones: 0.3.1, 0.3.2 Mar 6, 2015
@cpcloud cpcloud modified the milestone: 0.3.2 Mar 19, 2015
cpcloud commented Apr 3, 2015

FWIW, Spark can use the JDBC connector to do computation. It looks like this:

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(...)
sql = HiveContext(sc)
df = sql.load("jdbc", url="jdbc:postgresql:dbserver", dbtable="schema.tablename")

I don't think this would be that tricky. The function would look something like this:

@append.register(SQLContext, sa.Table)
def sql_to_sparksql(ctx, tb, **kwargs):
    url = connection_string_from_engine(tb.bind)  # <- not implemented, but I think easy
    dbtable = [tb.name]
    if tb.schema is not None:
        dbtable.insert(0, tb.schema)  # qualify the table name with its schema
    return ctx.load('jdbc', url=url, dbtable='.'.join(dbtable))
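The connection_string_from_engine piece above is unimplemented; a minimal sketch of the idea, assuming the input is a SQLAlchemy-style URL string (for an Engine you'd pass str(engine.url)) and using only the standard library. The helper name and shape are hypothetical, not odo API, and this only covers the PostgreSQL JDBC URL form:

```python
from urllib.parse import urlsplit

def jdbc_url_from_sqlalchemy_url(sa_url):
    """Translate e.g. 'postgresql://user:pw@host:5432/db' into a
    PostgreSQL-style JDBC URL:
    'jdbc:postgresql://host:5432/db?user=user&password=pw'."""
    parts = urlsplit(sa_url)
    dialect = parts.scheme.split('+')[0]  # drop a driver suffix like '+psycopg2'
    netloc = parts.hostname or 'localhost'
    if parts.port:
        netloc += ':%d' % parts.port
    url = 'jdbc:%s://%s%s' % (dialect, netloc, parts.path)
    # JDBC typically carries credentials as query parameters
    params = []
    if parts.username:
        params.append('user=%s' % parts.username)
    if parts.password:
        params.append('password=%s' % parts.password)
    if params:
        url += '?' + '&'.join(params)
    return url
```

e.g. jdbc_url_from_sqlalchemy_url('postgresql://u:p@dbserver:5432/mydb') gives 'jdbc:postgresql://dbserver:5432/mydb?user=u&password=p'.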

cpcloud commented Apr 3, 2015

If anyone is interested in implementing this, you'll have to sudo apt-get install libpostgresql-jdbc-java for testing on Travis-CI, then set SPARK_CLASSPATH to one of the following:

/usr/share/java/postgresql-jdbc3-9.1.jar
/usr/share/java/postgresql-jdbc4-9.1.jar

because, according to the Travis docs, the default install is Postgres 9.1:

By default, the build environment will have version 9.1 running already.
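From Python, that environment variable presumably has to be set before the SparkContext (and its JVM) is created, e.g. at the top of the test setup; a sketch, using the Ubuntu jar path mentioned above:

```python
import os

# Point Spark at the distro's PostgreSQL JDBC jar. This must happen
# before pyspark launches the JVM (i.e. before SparkContext is created),
# or the driver won't see the JDBC driver on its classpath.
os.environ['SPARK_CLASSPATH'] = '/usr/share/java/postgresql-jdbc4-9.1.jar'
```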

cpcloud commented Apr 3, 2015

The Postgres JDBC connection string syntax is documented here:

https://jdbc.postgresql.org/documentation/91/connect.html

One really annoying thing: JDBC connection strings aren't standardized and differ from vendor to vendor.

Here's Oracle's:

http://docs.oracle.com/cd/B28359_01/java.111/b31224/urls.htm#BEIJFHHB

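To illustrate the lack of standardization, a rough comparison of the two vendors' URL shapes (taken from the docs linked above; the field names are placeholders, and any per-vendor dispatch in odo would need a table roughly like this):

```python
# Per-vendor JDBC URL shapes -- there is no common standard.
JDBC_URL_TEMPLATES = {
    # PostgreSQL: jdbc:postgresql://host:port/database
    'postgresql': 'jdbc:postgresql://{host}:{port}/{database}',
    # Oracle "thin" driver: jdbc:oracle:thin:@host:port:SID
    'oracle': 'jdbc:oracle:thin:@{host}:{port}:{database}',
}

def jdbc_url(vendor, host, port, database):
    """Fill in the vendor-specific URL template."""
    template = JDBC_URL_TEMPLATES[vendor]
    return template.format(host=host, port=port, database=database)
```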

@cpcloud cpcloud added this to the 0.3.2 milestone Apr 3, 2015
@cpcloud cpcloud modified the milestones: 0.3.3, 0.3.2 Apr 17, 2015
@cpcloud cpcloud modified the milestones: 0.3.3, 0.3.4 Jul 1, 2015
@cpcloud cpcloud modified the milestones: 0.3.4, 0.4.0 Sep 15, 2015
@cpcloud cpcloud modified the milestones: 0.4.0, 0.3.5 Oct 5, 2015
@cpcloud cpcloud removed this from the 0.4.0 milestone Dec 4, 2015
@cpcloud cpcloud modified the milestones: 0.4.1, 0.4.0 Dec 4, 2015
@kwmsmith kwmsmith modified the milestones: 0.4.1, 0.4.2, 0.5.0 Feb 2, 2016
@kszucs kszucs mentioned this issue Feb 24, 2016