
Experiment with Dataproc Serverless #259

Closed

Conversation

jtcohen6 (Contributor) commented Aug 7, 2022

resolves #248

Description

What's cool? No need to provision a Dataproc Cluster in advance! No need to pay to keep it on 24/7! Sensible defaults around cluster sizing, and auto-scaling to handle concurrency.

The only required setup is enabling Dataproc APIs for a region, and some networking setup to enable "Private Google Access."
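For reference, a rough sketch of that setup (region and subnet names are illustrative, not taken from this PR): the API can be enabled with "gcloud services enable dataproc.googleapis.com", and Private Google Access can be turned on per subnet with "gcloud compute networks subnets update SUBNET --region=REGION --enable-private-ip-google-access".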

[Screenshot: 2022-08-07 at 17:32]

What's less cool?

  • SLOW. The Dataproc UI very clearly states that "Spark jobs take ~60 seconds to initialize resources." Additionally, because Dataproc Serverless submits batch jobs asynchronously, I noticed an additional delay between the job finishing and dbt finding that out. All in, the dbt model took 223.19s (3+ minutes) to run, the Dataproc elapsed time was 1 min 22 sec, and the "actual" run time was 34s.
  • Third-party packages: The default container only includes a small number of desirable packages, and there is no really good way to add more, short of having dbt try to run Docker for you (which I strongly feel we should avoid!). We could enable users to spin up their own custom containers, and pass the pointer to that container as a model configuration. It is also possible to pass the GCS locations of additional Python files, which could be good for first-party packages. (A rough sketch of both ideas follows this list.)
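Both ideas, sketched against the google-cloud-dataproc client (illustrative only; the container image and GCS paths are made up, and none of this is implemented in this PR):

from google.cloud import dataproc_v1

batch = dataproc_v1.Batch()
batch.pyspark_batch.main_python_file_uri = "gs://my-bucket/my_python_model.py"  # hypothetical path
# First-party code: extra .py files staged to GCS alongside the model
batch.pyspark_batch.python_file_uris = ["gs://my-bucket/helpers/shared_utils.py"]  # hypothetical path
# Third-party packages: a user-built custom container image instead of the default runtime image
batch.runtime_config.container_image = "gcr.io/my-project/dbt-dataproc-image:latest"  # hypothetical image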

Maybe a dedicated Dataproc cluster in development, and Dataproc Serverless in production...?

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have run changie new to create a changelog entry

cla-bot added the cla:yes label Aug 7, 2022
github-actions bot commented Aug 7, 2022

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the dbt-core contributing guide and the dbt-bigquery contributing guide.

velascoluis commented Aug 8, 2022

Hi, great to see Spark serverless support in dbt!
A few comments, if I may:
1 - Do you think the network should also be configurable at the profile level? As per this, it is very common to use a specific network for prod workloads:

batch.pyspark_batch.main_python_file_uri = gcs_location
batch.pyspark_batch.jar_file_uris = [
    "gs://spark-lib/bigquery/spark-3.1-bigquery-0.26.0-preview.jar"  # how to keep this up to date?
]

batch.runtime_config.properties = {
    "spark.executor.instances": "2",
}

batch.environment_config.execution_config.subnetwork_uri = self.credential.dataproc_subnet

2 - Is it possible to add a timeout for the operation, as we have with BQ?

Thanks

jtcohen6 (Contributor, Author) commented Aug 8, 2022

@velascoluis Thanks for catching this so quickly! This is definitely still "experimental" code while we're investigating. We're relatively new to this tool set, so very very happy to have advice from actual practitioners at GCP :)

Rather than add many many additional configurations to profiles.yml, I'm thinking that we'd probably want to make each of these configurations available per model. That would also take advantage of another value prop of Dataproc Serverless — each model is its own independent "batch job," and can declare the properties of the infrastructure it wants to run on. We'll hard-code some things (always PySpark file URI, always use latest spark-bigquery-connector JAR), and leave the rest up to the user:

version: 2

models:
  - name: my_python_model
    config:
      pyspark_batch:
        args:
          - ...
      runtime_config:
        spark.executor.instances: 2
      execution_config:
        subnetwork_uri: ...

Similar logic could apply for "traditional" Dataproc (managed clusters), too — we could have a default cluster configured in profiles.yml, but then also give users the ability to override the cluster used for one particular model. It seems like we may want to enable both patterns, given the relative advantages of each.
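Roughly what that might look like (a sketch only; the profile keys mirror the existing dbt-bigquery ones for Python models, and the model-level dataproc_cluster_name override is hypothetical, not something this PR implements):

# profiles.yml: default cluster for Python models
my-profile:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth
      project: my-project            # hypothetical
      dataset: my_dataset            # hypothetical
      gcs_bucket: my-staging-bucket  # hypothetical
      dataproc_region: us-central1
      dataproc_cluster_name: dbt-default-cluster

# schema.yml: override the cluster for one heavyweight model (hypothetical config key)
models:
  - name: my_heavy_python_model
    config:
      dataproc_cluster_name: dbt-big-cluster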

Re: timeout/retry: We'd like to start using Google's built-in retry capabilities as much as possible (#230), rather than rolling this logic ourselves. Could this be as simple as adding the Retry() decorator where we're submitting the Dataproc job/batch?
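Something like the following, perhaps (a sketch assuming google-api-core's Retry and the dataproc_v1 BatchControllerClient; the project, region, bucket, and batch id are placeholders, not code from this branch):

from google.api_core.retry import Retry
from google.cloud import dataproc_v1

batch = dataproc_v1.Batch()
batch.pyspark_batch.main_python_file_uri = "gs://my-bucket/my_python_model.py"  # hypothetical path

client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"}  # regional endpoint
)
operation = client.create_batch(
    parent="projects/my-project/locations/us-central1",  # hypothetical project/region
    batch=batch,
    batch_id="dbt-my-python-model",  # hypothetical id
    retry=Retry(initial=1.0, maximum=10.0, deadline=60.0),  # retries transient submission errors
)
# result() blocks until the batch finishes (or raises); timeout bounds how long dbt waits
result = operation.result(timeout=600)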

gcs_location = "gs://{}/{}".format(self.credential.gcs_bucket, filename)
batch.pyspark_batch.main_python_file_uri = gcs_location
batch.pyspark_batch.jar_file_uris = [
    "gs://spark-lib/bigquery/spark-3.1-bigquery-0.26.0-preview.jar"  # how to keep this up to date?
]

Does this mean that if a user wants to add new python libraries, they will have to specify a new jar file or overwrite the existing one?

@velascoluis

Rather than add many many additional configurations to profiles.yml, I'm thinking that we'd probably want to make each of these configurations available per model.

This makes a lot of sense, and will allow, as you mentioned before, using Dataproc Serverless for workloads such as development/iterative runs or exceptionally heavy one-shots (e.g. generating job templates with opinionated defaults), leaving standard Dataproc for daily/stable ELTs where cost/config is more predictable.

Re: timeout/retry: We'd like to start using Google's built-in retry capabilities as much as possible (#230), rather than rolling this logic ourselves. Could this be as simple as adding the Retry() decorator where we're submitting the Dataproc job/batch?

Yes, I was thinking of crafting Spark properties on the job itself to control specific timeouts (network, idle workers, ...), but delegating that responsibility seems like the more sensible and general approach.
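For reference, that per-job approach might have looked roughly like this (standard Spark properties set on the batch's runtime config; a sketch of the discarded alternative, not something adopted here):

# Hypothetical per-batch Spark properties to bound network and idle-executor timeouts
batch.runtime_config.properties = {
    "spark.network.timeout": "300s",
    "spark.dynamicAllocation.executorIdleTimeout": "60s",
}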

Btw, since I'm aware of some deployments of dbt Core on GKE, I did some tests with Dataproc on GKE, and it works fine as well (same API).

jtcohen6 (Contributor, Author)

Closed in favor of #303

@jtcohen6 jtcohen6 closed this Sep 16, 2022
@mikealfare mikealfare deleted the jerco/dataproc-serverless-experiment branch February 27, 2023 18:39
Successfully merging this pull request may close these issues.

[CT-980] [feature] Use dataproc serverless instead of dataproc cluster