
How to connect to a local spark install #31

Closed
aaronsteers opened this issue Sep 24, 2019 · 14 comments
Labels: enhancement (New feature or request), good_first_issue (Good for newcomers), Stale

@aaronsteers (Contributor) commented Sep 24, 2019

Is there any way to connect using a locally installed Spark instance, rather than to a remote service via HTTP/Thrift?

The code I'm trying to migrate uses the following to run SQL-based transforms locally with the Spark/Hive install already on the container:

from pyspark import SparkConf
from pyspark.sql import SparkSession

SPARK_LOG_LEVEL = "WARN"  # placeholder; set to whatever level you need
conf = SparkConf()        # placeholder; built elsewhere in the real code

spark = (
    SparkSession.builder.config(conf=conf)
    .master("local")
    .appName("My Spark App")
    .enableHiveSupport()
    .getOrCreate()
)
spark.sparkContext.setLogLevel(SPARK_LOG_LEVEL)
sc = spark.sparkContext

# ...

my_source_table = "my_db.my_source"  # placeholder source table
# "my_new_table" is a placeholder; CREATE TABLE ... AS requires a target name,
# which the original snippet omitted.
df = spark.sql(f"CREATE TABLE my_new_table AS SELECT * FROM {my_source_table}")

And if this isn't currently supported, is there any chance we could add the feature? For CI/CD pipelines especially, it seems we'd want to be able to run dbt pipelines even without access to an external cluster.

@drewbanin (Contributor)

Hey @aaronsteers - really cool idea! This definitely isn't currently supported, but I can imagine adding a new method to the Spark target config, local, which would attach to a locally running Spark context.
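For illustration only, a hypothetical profiles.yml target using such a method - the local method and everything under it are assumptions about a feature that doesn't exist yet, not current dbt-spark config:

default:
  target: local
  outputs:
    local:
      type: spark
      method: local   # hypothetical: attach to an in-process SparkSession
      schema: my_schema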

In your experience, are there any major differences between the Spark SQL that would run over HTTP/Thrift and calling spark.sql() directly? It shouldn't be a problem if the SQL is identical.

@drewbanin added the enhancement (New feature or request) and good_first_issue (Good for newcomers) labels on Sep 25, 2019
@aaronsteers (Contributor, Author)

Hey, @drewbanin. As far as I'm aware, the SQL should be identical whether connecting via spark.sql() or via Thrift, but it would probably require some testing, tbh. The only difference I could see is if the Hive/Thrift adapter were sending to a slightly different interpreter.

Thanks!

@dmateusp (Contributor)

Hi there! I agree it would be great to have a local Spark dbt environment. I've been looking into this issue myself and wanted to discuss some implementation details/questions I have.

First, I think setting up a Thrift server locally is quite hard, and I'm not sure there are other options for connecting to Spark through a JDBC-like interface.

So I went down the route that @aaronsteers was suggesting, by looking at one of the possible shell interfaces to Spark (scala/python/java/R/spark-sql).

Ideally I would love to support all of them: send commands to a subprocess, wrap the SQL in the appropriate programming interface, and write the results to a temporary CSV file to fetch them back. This would not be performant, but it would have the advantage of being flexible and not introducing extra dependencies (the local environment is meant as a sandbox anyway!).
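A minimal sketch of that wrapping idea, assuming pyspark is on the PATH and accepts piped code on stdin - the function and file names here are illustrative, not from any existing implementation:

import subprocess
import tempfile

def run_sql_via_pyspark_shell(sql: str) -> str:
    """Wrap a dbt-generated SQL string in a spark.sql() call, feed it to the
    pyspark shell, and dump the result to a temporary CSV directory."""
    out_dir = tempfile.mkdtemp(prefix="dbt_spark_")
    # Triple-quoting assumes the SQL itself contains no triple quotes.
    code = f'spark.sql("""{sql}""").write.mode("overwrite").csv("{out_dir}", header=True)\n'
    subprocess.run(["pyspark"], input=code, text=True, check=True)
    return out_dir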

I wanted to try connecting to an existing REPL and sending code to it dynamically from the dbt process, but I haven't figured out how to do that yet! (The code sent would change depending on the "method" set in the profile, an enum of "spark", "java", "R", etc.) Some IDEs offer that functionality, and I like that we would be able to see the SQL queries appear in the terminal.

The other way I could go about it would be to launch the Spark shell as a subprocess, but then the user would have to configure the launch command, and they wouldn't be able to see the queries being sent to the REPL.

I'd appreciate any feedback on the approach so far; hopefully I'll have code to share soon. (So far I've done minor refactoring to put a structure in place for additional methods of running Spark.)

@aaronsteers (Contributor, Author) commented Feb 24, 2020

@dmateusp - I have gotten this working successfully in a Docker container, via two options:

  1. Run a Docker container locally that hosts Spark and Thrift; then run dbt locally against the container's Thrift port.
  2. Run dbt-spark from within a customized Spark container. The container launches Spark, then Thrift, and then runs dbt tasks connecting to its own Thrift endpoint.

I have not yet found a way around the Thrift requirement, but if you already have a spark context, the code sample here might be helpful: https://github.com/slalom-ggp/dataops-tools/blob/1e36e3d09b99211e4223e436f2da825c117a92e8/slalom/dataops/sparkutils.py#L349-L352

To skip the Thrift requirement, I think one would have to replace references to pyhive with pyspark and then connect to the cluster's Spark endpoint (port 7077) instead of the Thrift JDBC endpoint (port 10000). From a pyspark SparkSession object, I think we could just run spark.sql("SELECT * FROM FOOBAR") with basically the same result as with Thrift. (Honestly, I'm not sure whether the behavior would differ at all, but we could definitely send SQL statements directly in that manner without using Thrift.)
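A minimal sketch of that pyspark route, assuming a standalone Spark master on port 7077 - the host, app name, and query are placeholders:

from pyspark.sql import SparkSession

# Connect to the cluster's Spark endpoint rather than the Thrift JDBC endpoint.
spark = (
    SparkSession.builder
    .master("spark://localhost:7077")   # placeholder; or "local[*]" for in-process
    .appName("dbt-spark-local")         # placeholder app name
    .enableHiveSupport()
    .getOrCreate()
)

# SQL goes straight to the session, with no Thrift server in between.
rows = spark.sql("SELECT * FROM FOOBAR").collect()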

@dmateusp (Contributor)

Hey @aaronsteers, thanks for sharing your approach!

I think we could benefit from having the README updated with instructions for running it locally (using the Dockerized Thrift container). I did not find a Docker image that worked out of the box with Spark and Thrift; could you share yours? (Or maybe we could host it in the dbt-spark repo for future integration testing?)

The pyspark approach could be worth exploring!

I have a draft PR on my fork showing what I've been playing with: I use pexpect to wrap SQL produced by dbt into spark.sql() calls against a shell session. I've explored it enough to say that it has real problems with surfacing exception details and transmitting data, and that using pyspark would be better. Also, pexpect has compatibility issues on Windows. (https://github.com/dmateusp/dbt-spark/pull/1/files)
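A rough sketch of that pexpect wrapping (illustrative only; the actual draft is in the PR linked above), which also shows why exception details are hard to recover - all you get back is raw terminal text:

import pexpect

# Spawn an interactive pyspark shell and wrap a SQL statement in spark.sql().
child = pexpect.spawn("pyspark", encoding="utf-8", timeout=120)
child.expect(">>> ")                           # wait for the Python prompt
child.sendline('spark.sql("SELECT 1").show()')
child.expect(">>> ")                           # wait for the statement to finish
print(child.before)                            # raw terminal output, errors included
child.sendline("exit()")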

If, in your opinion, the Docker + Thrift approach is good enough for playing with dbt-spark locally, should we consider documenting that approach instead? As you said, pyspark should not behave differently, but it's still additional code that needs to be supported, tested, and documented.

@aaronsteers (Contributor, Author)

@dmateusp - I loved your idea of hosting a Dockerfile in dbt-spark and using it for containerized testing. I created the work-in-progress PR #55, which adds a Dockerfile based on my past work and exploration on this topic.

@drewbanin - I'm very interested in your thoughts on this. Would you be interested in merging a Dockerfile and perhaps including that docker image (once complete) as a part of the repo?

@jtcohen6 (Contributor)

@aaronsteers Now that we've merged #58, are you okay with closing this issue?

@aaronsteers (Contributor, Author)

@jtcohen6 - The updates in #58 should do the trick in theory. That said, if it's okay with you, I'd still like to keep this open a little longer to test usability and documentation around this use case. I can try to get to it this week so we're not keeping this outstanding too long.

@jtcohen6 (Contributor)

Sure! No rush on my end

@chinwobble

@aaronsteers
This PR looks really good:
https://github.com/dmateusp/dbt-spark/pull/1/files

Is there any way we can get this merged? I'm looking for similar functionality so I can register UDFs in a sane way.

@Data-drone

What is the status on this now?

@ninomllr

I'd really like to hear what the current state of this is. Basically, how do I start Thrift?

@ninomllr

I basically tried to start a Thrift server from my Jupyter notebook as described here: https://stackoverflow.com/a/54223260, and then added the following to my profiles.yml, without any luck.

I get the following error:

Could not connect to any of [('127.0.0.1', 443), ('::1', 443, 0, 0)]
07:10:29  Encountered an error:
Runtime Error
  Runtime Error
    Database Error
      failed to connect

profiles.yml

default:
  target: dev
  outputs:
    dev:
      type: spark
      method: thrift
      host: localhost
      schema: delta

Can someone tell me how to start a local Thrift server?
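For what it's worth, a Thrift server can be started with the script that ships with Spark, $SPARK_HOME/sbin/start-thriftserver.sh, which listens on port 10000 by default. The profile above sets no port (dbt-spark appears to default to 443, which would explain the error), so a sketch of a matching profile might look like this - all values are assumptions based on stock defaults:

default:
  target: dev
  outputs:
    dev:
      type: spark
      method: thrift
      host: localhost
      port: 10000   # Thrift server default; the profile above omitted this
      schema: delta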

@github-actions (bot)

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.
