
Experiment with Dataproc Serverless #259

Closed

Conversation

jtcohen6 (Contributor) commented Aug 7, 2022

resolves #248

Description

What's cool? No need to provision a Dataproc Cluster in advance! No need to pay to keep it on 24/7! Sensible defaults around cluster sizing, and auto-scaling to handle concurrency.

The only required setup is enabling Dataproc APIs for a region, and some networking setup to enable "Private Google Access."
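For reference, a rough sketch of that setup (region and subnet names are illustrative, not taken from this PR): the API can be enabled with "gcloud services enable dataproc.googleapis.com", and Private Google Access can be turned on per subnet with "gcloud compute networks subnets update SUBNET --region=REGION --enable-private-ip-google-access".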

[Screenshot: 2022-08-07 at 17:32]

What's less cool?

  • SLOW. The Dataproc UI very clearly states that "Spark jobs take ~60 seconds to initialize resources." Additionally, because Dataproc Serverless submits batch jobs asynchronously, I noticed an additional delay between the job finishing and dbt finding that out. All in, the dbt model took 223.19s (3+ minutes) to run, the Dataproc elapsed time was 1 min 22 sec, and the "actual" run time was 34s.
  • Third-party packages: The default container only includes a small number of desirable packages, and there is no really good way to add more, short of having dbt try to run Docker for you (which I strongly feel we should avoid!). We could enable users to spin up their own custom containers, and pass the pointer to that container as a model configuration. It is also possible to pass the GCS locations of additional Python files, which could be good for first-party packages. (A rough sketch of both ideas follows this list.)
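Both ideas, sketched against the google-cloud-dataproc client (illustrative only; the container image and GCS paths are made up, and none of this is implemented in this PR):

from google.cloud import dataproc_v1

batch = dataproc_v1.Batch()
batch.pyspark_batch.main_python_file_uri = "gs://my-bucket/my_python_model.py"  # hypothetical path
# First-party code: extra .py files staged to GCS alongside the model
batch.pyspark_batch.python_file_uris = ["gs://my-bucket/helpers/shared_utils.py"]  # hypothetical path
# Third-party packages: a user-built custom container image instead of the default runtime image
batch.runtime_config.container_image = "gcr.io/my-project/dbt-dataproc-image:latest"  # hypothetical image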

Maybe a dedicated Dataproc cluster in development, and Dataproc Serverless in production...?

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have run changie new to create a changelog entry

cla-bot added the cla:yes label Aug 7, 2022
github-actions bot commented Aug 7, 2022

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the dbt-core contributing guide and the dbt-bigquery contributing guide.

velascoluis commented Aug 8, 2022

Hi, great to see Spark serverless support in dbt!
A few comments, if I may:
1 - Do you think the network should also be configurable at the profile level? As per this, it is very common to use a specific network for prod workloads:

batch.pyspark_batch.main_python_file_uri = gcs_location
batch.pyspark_batch.jar_file_uris = [
    "gs://spark-lib/bigquery/spark-3.1-bigquery-0.26.0-preview.jar"  # how to keep this up to date?
]

batch.runtime_config.properties = {
    "spark.executor.instances": "2",
}

batch.environment_config.execution_config.subnetwork_uri = self.credential.dataproc_subnet

2 - Is it possible to add a timeout for the operation, as we have with BQ?

Thanks

jtcohen6 (Contributor, Author) commented Aug 8, 2022

@velascoluis Thanks for catching this so quickly! This is definitely still "experimental" code while we're investigating. We're relatively new to this tool set, so very very happy to have advice from actual practitioners at GCP :)

Rather than add many many additional configurations to profiles.yml, I'm thinking that we'd probably want to make each of these configurations available per model. That would also take advantage of another value prop of Dataproc Serverless — each model is its own independent "batch job," and can declare the properties of the infrastructure it wants to run on. We'll hard-code some things (always PySpark file URI, always use latest spark-bigquery-connector JAR), and leave the rest up to the user:

version: 2

models:
  - name: my_python_model
    config:
      pyspark_batch:
        args:
          - ...
      runtime_config:
        spark.executor.instances: 2
      execution_config:
        subnetwork_uri: ...

Similar logic could apply for "traditional" Dataproc (managed clusters), too — we could have a default cluster configured in profiles.yml, but then also give users the ability to override the cluster used for one particular model. It seems like we may want to enable both patterns, given the relative advantages of each.
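Roughly what that might look like (a sketch only; the profile keys mirror the existing dbt-bigquery ones for Python models, and the model-level dataproc_cluster_name override is hypothetical, not something this PR implements):

# profiles.yml: default cluster for Python models
my-profile:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth
      project: my-project            # hypothetical
      dataset: my_dataset            # hypothetical
      gcs_bucket: my-staging-bucket  # hypothetical
      dataproc_region: us-central1
      dataproc_cluster_name: dbt-default-cluster

# schema.yml: override the cluster for one heavyweight model (hypothetical config key)
models:
  - name: my_heavy_python_model
    config:
      dataproc_cluster_name: dbt-big-cluster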

Re: timeout/retry: We'd like to start using Google's built-in retry capabilities as much as possible (#230), rather than rolling this logic ourselves. Could this be as simple as adding the Retry() decorator where we're submitting the Dataproc job/batch?
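Something like the following, perhaps (a sketch assuming google-api-core's Retry and the dataproc_v1 BatchControllerClient; the project, region, bucket, and batch id are placeholders, not code from this branch):

from google.api_core.retry import Retry
from google.cloud import dataproc_v1

batch = dataproc_v1.Batch()
batch.pyspark_batch.main_python_file_uri = "gs://my-bucket/my_python_model.py"  # hypothetical path

client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"}  # regional endpoint
)
operation = client.create_batch(
    parent="projects/my-project/locations/us-central1",  # hypothetical project/region
    batch=batch,
    batch_id="dbt-my-python-model",  # hypothetical id
    retry=Retry(initial=1.0, maximum=10.0, deadline=60.0),  # retries transient submission errors
)
# result() blocks until the batch finishes (or raises); timeout bounds how long dbt waits
result = operation.result(timeout=600)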

gcs_location = "gs://{}/{}".format(self.credential.gcs_bucket, filename)
batch.pyspark_batch.main_python_file_uri = gcs_location
batch.pyspark_batch.jar_file_uris = [
    "gs://spark-lib/bigquery/spark-3.1-bigquery-0.26.0-preview.jar"  # how to keep this up to date?
]

Does this mean that if a user wants to add new python libraries, they will have to specify a new jar file or overwrite the existing one?

@velascoluis

Rather than add many many additional configurations to profiles.yml, I'm thinking that we'd probably want to make each of these configurations available per model.

This makes a lot of sense, and will allow, as you mentioned before, using Dataproc Serverless for workloads such as development/iterative runs or exceptionally heavy one-shots (e.g. generating job templates with opinionated defaults), leaving standard Dataproc for daily/stable ELTs where cost/config is more predictable.

Re: timeout/retry: We'd like to start using Google's built-in retry capabilities as much as possible (#230), rather than rolling this logic ourselves. Could this be as simple as adding the Retry() decorator where we're submitting the Dataproc job/batch?

Yes, I was thinking of crafting Spark properties on the job itself to control specific timeouts (network, idle workers, ...), but delegating that responsibility seems like the more sensible and general approach.
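For reference, that per-job approach might have looked roughly like this (standard Spark properties set on the batch's runtime config; a sketch of the discarded alternative, not something adopted here):

# Hypothetical per-batch Spark properties to bound network and idle-executor timeouts
batch.runtime_config.properties = {
    "spark.network.timeout": "300s",
    "spark.dynamicAllocation.executorIdleTimeout": "60s",
}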

Btw, since I'm aware of some deployments of dbt Core on GKE, I did some tests with Dataproc on GKE, and it works fine as well (same API).

jtcohen6 (Contributor, Author)

Closed in favor of #303

@jtcohen6 jtcohen6 closed this Sep 16, 2022
@mikealfare mikealfare deleted the jerco/dataproc-serverless-experiment branch February 27, 2023 18:39
Successfully merging this pull request may close these issues.

[CT-980] [feature] Use dataproc serverless instead of dataproc cluster