
How to connect to a local spark install #31

Closed
aaronsteers opened this issue Sep 24, 2019 · 14 comments
Labels: enhancement (New feature or request), good_first_issue (Good for newcomers), Stale

@aaronsteers (Contributor) commented Sep 24, 2019

Is there any way to connect using a locally installed Spark instance, rather than to a remote service via HTTP/Thrift?

The code I'm trying to migrate uses the following to run SQL-based transforms locally with the Spark/Hive install already on the container:

from pyspark import SparkConf
from pyspark.sql import SparkSession

SPARK_LOG_LEVEL = "WARN"  # placeholder; set to whatever level you need
conf = SparkConf()        # placeholder; built elsewhere in the real code

spark = (
    SparkSession.builder.config(conf=conf)
    .master("local")
    .appName("My Spark App")
    .enableHiveSupport()
    .getOrCreate()
)
spark.sparkContext.setLogLevel(SPARK_LOG_LEVEL)
sc = spark.sparkContext

# ...

my_source_table = "my_db.my_source"  # placeholder source table
# "my_new_table" is a placeholder; CREATE TABLE ... AS requires a target name,
# which the original snippet omitted.
df = spark.sql(f"CREATE TABLE my_new_table AS SELECT * FROM {my_source_table}")

And if this isn't currently supported, is there any chance we could add the feature? For CI/CD pipelines especially, it seems we'd want to be able to run dbt pipelines even without access to an external cluster.

@drewbanin (Contributor)

Hey @aaronsteers - really cool idea! This definitely isn't currently supported, but I can imagine adding a new method to the Spark target config, local, which would attach to a locally running Spark context.
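For illustration only, a hypothetical profiles.yml target using such a method - the local method and everything under it are assumptions about a feature that doesn't exist yet, not current dbt-spark config:

default:
  target: local
  outputs:
    local:
      type: spark
      method: local   # hypothetical: attach to an in-process SparkSession
      schema: my_schema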

In your experience, are there any major differences between the Spark SQL that would run over HTTP/Thrift and calling spark.sql() directly? It shouldn't be a problem if the SQL is identical.

@drewbanin added the enhancement (New feature or request) and good_first_issue (Good for newcomers) labels on Sep 25, 2019
@aaronsteers (Contributor, Author)

Hey, @drewbanin. As far as I'm aware, the SQL should be identical whether connecting via spark.sql() or via Thrift, but it would probably require some testing, tbh. The only difference I could see is if the Hive/Thrift adapter were sending to a slightly different interpreter.

Thanks!

@dmateusp (Contributor)

Hi there! I agree it would be great to have a local Spark dbt environment. I've been looking into this issue myself and wanted to discuss some implementation details/questions I have.

First, I think setting up a Thrift server locally is quite hard, and I'm not sure there are other options for connecting to Spark through a JDBC-like interface.

So I went down the route that @aaronsteers was suggesting, by looking at one of the possible shell interfaces to Spark (scala/python/java/R/spark-sql).

Ideally I would love to support all of them: send commands to a subprocess, wrap the SQL in the appropriate programming interface, and write the results to a temporary CSV file to fetch them back. This would not be performant, but it would have the advantage of being flexible and not introducing extra dependencies (the local environment is meant as a sandbox anyway!).
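A minimal sketch of that wrapping idea, assuming pyspark is on the PATH and accepts piped code on stdin - the function and file names here are illustrative, not from any existing implementation:

import subprocess
import tempfile

def run_sql_via_pyspark_shell(sql: str) -> str:
    """Wrap a dbt-generated SQL string in a spark.sql() call, feed it to the
    pyspark shell, and dump the result to a temporary CSV directory."""
    out_dir = tempfile.mkdtemp(prefix="dbt_spark_")
    # Triple-quoting assumes the SQL itself contains no triple quotes.
    code = f'spark.sql("""{sql}""").write.mode("overwrite").csv("{out_dir}", header=True)\n'
    subprocess.run(["pyspark"], input=code, text=True, check=True)
    return out_dir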

I wanted to try connecting to an existing REPL and sending code to it dynamically from the dbt process, but I haven't figured out how to do that yet! (The code sent would change depending on the "method" set in the profile, an enum of "spark", "java", "R", etc.) Some IDEs offer that functionality, and I like that we would be able to see the SQL queries appear in the terminal.

The other way I could go about it would be to launch the Spark shell as a subprocess, but then the user would have to configure the launch command, and they wouldn't be able to see the queries being sent to the REPL.

I'd appreciate any feedback on the approach so far; hopefully I'll have code to share soon. (So far I've done minor refactoring to put a structure in place for additional methods of running Spark.)

@aaronsteers (Contributor, Author) commented Feb 24, 2020

@dmateusp - I have gotten this working successfully in a Docker container, via two options:

  1. Run a Docker container locally that hosts Spark and Thrift; then run dbt locally against the container's Thrift port.
  2. Run dbt-spark from within a customized Spark container. The container launches Spark, then Thrift, and then runs dbt tasks connecting to its own Thrift endpoint.

I have not yet found a way around the Thrift requirement, but if you already have a spark context, the code sample here might be helpful: https://github.com/slalom-ggp/dataops-tools/blob/1e36e3d09b99211e4223e436f2da825c117a92e8/slalom/dataops/sparkutils.py#L349-L352

To skip the Thrift requirement, I think one would have to replace references to pyhive with pyspark and then connect to the cluster's Spark endpoint (port 7077) instead of the Thrift JDBC endpoint (port 10000). From a pyspark SparkSession object, I think we could just run spark.sql("SELECT * FROM FOOBAR") with basically the same result as with Thrift. (Honestly, I'm not sure whether the behavior would differ at all, but we could definitely send SQL statements directly in that manner without using Thrift.)
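A minimal sketch of that pyspark route, assuming a standalone Spark master on port 7077 - the host, app name, and query are placeholders:

from pyspark.sql import SparkSession

# Connect to the cluster's Spark endpoint rather than the Thrift JDBC endpoint.
spark = (
    SparkSession.builder
    .master("spark://localhost:7077")   # placeholder; or "local[*]" for in-process
    .appName("dbt-spark-local")         # placeholder app name
    .enableHiveSupport()
    .getOrCreate()
)

# SQL goes straight to the session, with no Thrift server in between.
rows = spark.sql("SELECT * FROM FOOBAR").collect()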

@dmateusp (Contributor)

Hey @aaronsteers, thanks for sharing your approach!

I think we could benefit from having the README updated with instructions for running it locally (using the Dockerized Thrift container). I did not find a Docker image that worked out of the box with Spark and Thrift; could you share yours? (Or maybe we could host it in the dbt-spark repo for future integration testing?)

The pyspark approach could be worth exploring!

I have a draft PR on my fork showing what I've been playing with: I use pexpect to wrap SQL produced by dbt into spark.sql() calls against a shell session. I've explored it enough to say that it has real problems with surfacing exception details and transmitting data, and that using pyspark would be better. Also, pexpect has compatibility issues on Windows. (https://github.com/dmateusp/dbt-spark/pull/1/files)
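A rough sketch of that pexpect wrapping (illustrative only; the actual draft is in the PR linked above), which also shows why exception details are hard to recover - all you get back is raw terminal text:

import pexpect

# Spawn an interactive pyspark shell and wrap a SQL statement in spark.sql().
child = pexpect.spawn("pyspark", encoding="utf-8", timeout=120)
child.expect(">>> ")                           # wait for the Python prompt
child.sendline('spark.sql("SELECT 1").show()')
child.expect(">>> ")                           # wait for the statement to finish
print(child.before)                            # raw terminal output, errors included
child.sendline("exit()")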

If, in your opinion, the Docker + Thrift approach is good enough for playing with dbt-spark locally, should we consider documenting that approach instead? As you said, pyspark should not behave differently, but it's still additional code that needs to be supported, tested, and documented.

@aaronsteers (Contributor, Author)

@dmateusp - I loved your idea of hosting a Dockerfile in dbt-spark and using it for containerized testing. I created the work-in-progress PR #55, which adds a Dockerfile based on my past work and exploration on this topic.

@drewbanin - I'm very interested in your thoughts on this. Would you be interested in merging a Dockerfile and perhaps including that docker image (once complete) as a part of the repo?

@jtcohen6 (Contributor)

@aaronsteers Now that we've merged #58, are you okay with closing this issue?

@aaronsteers (Contributor, Author)

@jtcohen6 - The updates in #58 should do the trick in theory. That said, if it's okay with you, I'd still like to keep this open a little longer to test usability and documentation around this use case. I can try to get to it this week so we're not keeping this outstanding too long.

@jtcohen6 (Contributor)

Sure! No rush on my end

@chinwobble

@aaronsteers
This PR looks really good:
https://github.com/dmateusp/dbt-spark/pull/1/files

Is there any way we can get this merged? I'm looking for similar functionality so I can register UDFs in a sane way.

@Data-drone

What is the status on this now?

@ninomllr

I'd really like to hear what the current state of this is. Basically, how do I start Thrift?

@ninomllr

I basically tried to start a Thrift server from my Jupyter notebook as described here: https://stackoverflow.com/a/54223260, and then added the following to my profiles.yml, without any luck.

I get the following error:

Could not connect to any of [('127.0.0.1', 443), ('::1', 443, 0, 0)]
07:10:29  Encountered an error:
Runtime Error
  Runtime Error
    Database Error
      failed to connect

profiles.yml

default:
  target: dev
  outputs:
    dev:
      type: spark
      method: thrift
      host: localhost
      schema: delta

Can someone tell me how to start a local Thrift server?
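For what it's worth, a Thrift server can be started with the script that ships with Spark, $SPARK_HOME/sbin/start-thriftserver.sh, which listens on port 10000 by default. The profile above sets no port (dbt-spark appears to default to 443, which would explain the error), so a sketch of a matching profile might look like this - all values are assumptions based on stock defaults:

default:
  target: dev
  outputs:
    dev:
      type: spark
      method: thrift
      host: localhost
      port: 10000   # Thrift server default; the profile above omitted this
      schema: delta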

@github-actions (bot)

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.
