How to connect to a local spark install #31
Comments
Hey @aaronsteers - really cool idea! This definitely is not currently supported, but I can imagine adding a new method to the Spark target config. In your experience, are there any major differences between the SparkSQL that should run over http/thrift vs. calling spark.sql() directly?
Hey, @drewbanin. As far as I'm aware, the SQL should be identical whether connecting via spark.sql() or via Thrift - but it would probably require some testing, tbh. The only difference I could see is if the hive/thrift adapter were sending to a slightly different interpreter. Thanks!
Hi there! I agree that it would be great if we had a way to provide a local Spark + dbt environment. I've been looking into this issue myself and I wanted to discuss some implementation details/questions I have.

First, I think setting up a Thrift server locally is quite hard, and I'm not sure there are other options to connect through a JDBC-like interface for Spark. So I went down the route that @aaronsteers was suggesting, by looking at one of the possible shell interfaces to Spark (scala/python/java/R/spark-sql). Ideally I would love to support all of them, possibly by sending commands to a subprocess, wrapping the SQL in the appropriate programming interface, and writing the results to a temporary CSV file to fetch them (see the sketch after this comment). This would not be performant, but it would have the advantage of being flexible and not introducing extra dependencies (the local environment's objective would be "sandbox" anyway!).

I wanted to try connecting to an existing REPL and sending the code there dynamically from the dbt process, but I haven't figured out how to do that yet! (The code sent would change depending on the "method" set in the profile - an enum of "spark", "java", "R", etc.) Some IDEs offer that functionality, and I like the fact that we would be able to see the SQL queries appear on the terminal. The other way I could go about it would be to launch the Spark shell as a subprocess, but then the user would have to configure the command used to launch it, and they wouldn't be able to see the queries being sent to the REPL.

I'd appreciate any feedback on the approach so far; hopefully I'll have code to share soon (so far I've done minor refactoring to put a structure in place for additional methods of running Spark).
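A rough illustration of the subprocess idea described above, purely as a sketch and not dbt-spark code: it assumes spark-submit is on the PATH and that local Hive support is available, and the helper name run_sql_via_subprocess is made up for this example.

```python
# Hypothetical sketch of the "wrap SQL, run it in a subprocess, read back a
# CSV" idea from the comment above. Illustrative only.
import csv
import glob
import subprocess
import tempfile
import textwrap
from pathlib import Path

DRIVER_TEMPLATE = textwrap.dedent("""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    # Run the wrapped SQL and dump the result to a single CSV file.
    spark.sql({sql!r}).coalesce(1).write.mode("overwrite").csv({out!r}, header=True)
""")


def run_sql_via_subprocess(sql):
    """Wrap `sql` in a tiny PySpark driver, run it with spark-submit,
    and read the result back from a temporary CSV directory."""
    with tempfile.TemporaryDirectory() as tmp:
        out_dir = str(Path(tmp) / "result")
        driver = Path(tmp) / "driver.py"
        driver.write_text(DRIVER_TEMPLATE.format(sql=sql, out=out_dir))

        subprocess.run(["spark-submit", str(driver)], check=True)

        # Spark writes part-*.csv files inside the output directory.
        rows = []
        for part in glob.glob(f"{out_dir}/part-*.csv"):
            with open(part, newline="") as fh:
                rows.extend(csv.DictReader(fh))
        return rows


if __name__ == "__main__":
    print(run_sql_via_subprocess("SELECT 1 AS id"))
```

The same pattern could in principle target the other shells (spark-shell, sparkR, spark-sql) by swapping the driver template for the matching language.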
@dmateusp - I have gotten this working successfully in a Docker container, and I've gotten these two options to work:
I have not yet found a way around the Thrift requirement, but if you already have a Spark context, the code sample here might be helpful: https://github.com/slalom-ggp/dataops-tools/blob/1e36e3d09b99211e4223e436f2da825c117a92e8/slalom/dataops/sparkutils.py#L349-L352 In order to skip the Thrift requirement, I think one would have to replace references to pyhive with pyspark and then connect to the cluster's Spark endpoint (7077) instead of the Thrift JDBC endpoint (10000).
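For reference, a pyspark-based connection that bypasses Thrift might look roughly like the sketch below. This is not how dbt-spark currently connects; the master URL and schema name are placeholders.

```python
# Illustrative only: connect pyspark directly to a Spark master instead of
# going through the Thrift JDBC endpoint on port 10000.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dbt-local-experiment")
    .master("spark://localhost:7077")  # or "local[*]" for a fully local run
    .enableHiveSupport()               # persistent schemas/tables via Hive
    .getOrCreate()
)

# dbt-generated SQL could then be executed with spark.sql() directly:
spark.sql("CREATE SCHEMA IF NOT EXISTS dbt_dev")
spark.sql("SELECT 1 AS id").show()
```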
hey @aaronsteers thanks for sharing your approach! I think we could benefit from having the README updated with some instructions to run it locally (using the Dockerized Thrift container). I did not find a Docker container that worked out of the box with Spark and Thrift - could you share that image? (Or maybe we could host it in the dbt-spark repo for future integration testing?)

The pyspark approach could be worth exploring! I have a draft PR on my fork to show what I've been playing with.

If the Docker + Thrift approach is, in your opinion, good enough to play with dbt-spark locally, should we consider documenting that approach instead? Because, as you said, pyspark should not behave differently, but it's still additional code that needs to be supported, tested and documented.
@dmateusp - I loved your idea of hosting a Dockerfile in dbt-spark and using it for containerized testing. I created the work-in-progress PR #55, which adds a Dockerfile based on my past work and exploration on this topic. @drewbanin - I'm very interested in your thoughts on this. Would you be interested in merging a Dockerfile and perhaps including that Docker image (once complete) as part of the repo?
@aaronsteers Now that we've merged #58, are you okay with closing this issue?
Sure! No rush on my end
@aaronsteers Is there any way we can get this merged? I am looking for similar functionality so I can register UDFs in a sane way.
What is the status on this now?
Would really like to hear what the current state is on this. Basically, how do I start Thrift?
I basically tried to start a Thrift server from my Jupyter notebook like this: https://stackoverflow.com/a/54223260 (sketched after this comment) and then added this to my profiles.yml, without any luck. I get the following error:
profiles.yml
Can someone tell me how to start a local Thrift server?
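For context, the Stack Overflow approach linked above amounts to starting HiveThriftServer2 from inside an existing PySpark session. A rough sketch, assuming a Spark 2.x build with Hive support (spark._jwrapped is a private handle and may not exist in other Spark versions):

```python
# Rough sketch: start a local Thrift server from an existing PySpark session
# (the idea behind the Stack Overflow answer linked above).
# Assumes Spark 2.x built with Hive support; _jwrapped is a private handle
# and may differ or be absent in other Spark versions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("local-thrift")
    .config("spark.sql.hive.thriftServer.singleSession", "true")
    .enableHiveSupport()
    .getOrCreate()
)

# Start HiveThriftServer2 against this session; it listens on port 10000 by
# default, matching the port dbt-spark's thrift method expects.
jvm = spark.sparkContext._gateway.jvm
jvm.org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.startWithContext(
    spark._jwrapped
)
```

Outside a notebook, the Spark distribution also ships sbin/start-thriftserver.sh, which starts the same server as a standalone process.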
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.
Is there any way to connect using a locally installed spark instance, rather than to a remote service via http/thrift?
The code I'm trying to migrate uses the following imports to run SQL-based transforms locally using spark/hive already on the container:
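The original import snippet isn't reproduced above; purely as an illustration, a local Spark-with-Hive setup of that kind typically looks something like:

```python
# Purely illustrative - not the issue author's actual code. A local Spark
# session with Hive support, running SQL transforms in-process:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")      # run on the local container, no external cluster
    .enableHiveSupport()     # local Hive metastore / warehouse directory
    .getOrCreate()
)

spark.sql("SELECT current_date() AS today").show()
```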
And if not supported currently, is there any chance we could build this and/or add the feature? For CI/CD pipelines especially, it seems we would want to be able to run dbt pipelines even without access to an external cluster.