Experiment with Dataproc Serverless #259
Conversation
Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the dbt-core contributing guide, and the dbt-bigquery contributing guide.
Hi, great to see spark serverless support on dbt!
2. Is it possible to add a timeout for the operation as we have with BQ? Thanks
@velascoluis Thanks for catching this so quickly! This is definitely still "experimental" code while we're investigating. We're relatively new to this tool set, so very very happy to have advice from actual practitioners at GCP :)

Rather than add many many additional configurations to …, could we support passing the batch settings through the model config? Something like:

```yaml
version: 2
models:
  - name: my_python_model
    config:
      pyspark_batch:
        args:
          - ...
      runtime_config:
        spark.executor.instances: 2
      execution_config:
        subnetwork_uri: ...
```

Similar logic could apply for "traditional" Dataproc (managed clusters), too; we could have a default cluster configured in ….

Re: timeout/retry: We'd like to start using Google's built-in retry capabilities as much as possible (#230), rather than rolling this logic ourselves. Could this be as simple as adding the …?
```python
gcs_location = "gs://{}/{}".format(self.credential.gcs_bucket, filename)
batch.pyspark_batch.main_python_file_uri = gcs_location
batch.pyspark_batch.jar_file_uris = [
    "gs://spark-lib/bigquery/spark-3.1-bigquery-0.26.0-preview.jar"  # how to keep this up to date?
]
```
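On the "how to keep this up to date?" question, one option is to derive the jar URI from a pinned default version that users can override, rather than hardcoding the full path. This is purely a sketch; the `connector_jar_uri` helper and any `connector_version` config key are invented for illustration.

```python
# Hypothetical sketch: build the spark-bigquery connector jar URI from a
# pinned, user-overridable version instead of a hardcoded gs:// path.

DEFAULT_CONNECTOR_VERSION = "0.26.0-preview"

def connector_jar_uri(spark_version="3.1", connector_version=None):
    """Return the GCS URI of the connector jar for the given versions."""
    version = connector_version or DEFAULT_CONNECTOR_VERSION
    return (
        "gs://spark-lib/bigquery/"
        f"spark-{spark_version}-bigquery-{version}.jar"
    )

print(connector_jar_uri())
# gs://spark-lib/bigquery/spark-3.1-bigquery-0.26.0-preview.jar
```

With something like this, bumping the pinned default is a one-line change, and users who need a newer connector can override it without waiting for a release.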
Does this mean that if a user wants to add new python libraries, they will have to specify a new jar file or overwrite the existing one?
This makes a lot of sense, and will allow us, as you mentioned before, to use Dataproc Serverless for workloads such as development/iterative runs or exceptionally heavy one-shots (e.g. generating job templates with opinionated defaults), leaving standard Dataproc for daily/stable ELTs where cost/config is more predictable.
Yes, I was thinking of crafting Spark properties on the job itself to control specific timeouts (network, idle workers, ...), but delegating that responsibility seems a more sensible and general approach. By the way, since I'm aware of some deployments of dbt Core on GKE, I did some tests with Dataproc on GKE, and it works fine as well (same API).
Closed in favor of #303
resolves #248
Description
What's cool? No need to provision a Dataproc Cluster in advance! No need to pay to keep it on 24/7! Sensible defaults around cluster sizing, and auto-scaling to handle concurrency.
The only required setup is enabling Dataproc APIs for a region, and some networking setup to enable "Private Google Access."
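For reference, the required setup might look roughly like this with `gcloud`. Treat this as a sketch rather than verified setup instructions; the project, region, and subnet names are placeholders.

```shell
# One-time: enable the Dataproc API for the project.
gcloud services enable dataproc.googleapis.com --project=my-project

# Turn on Private Google Access for the subnet used by Serverless batches,
# so workers without external IPs can still reach Google APIs.
gcloud compute networks subnets update my-subnet \
    --region=us-central1 \
    --enable-private-ip-google-access
```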
Screenshot
What's less cool?
Maybe a dedicated Dataproc cluster in development, and Dataproc Serverless in production...?
Checklist
- [ ] I have run `changie new` to create a changelog entry