
[CT-980] [feature] Use dataproc serverless instead of dataproc cluster #248

Closed
jtcohen6 opened this issue Aug 3, 2022 · 7 comments · Fixed by #303
Labels
enhancement New feature or request python_models

Comments

@jtcohen6
Contributor

jtcohen6 commented Aug 3, 2022

Description

This issue has been updated from a spike to a normal issue; the original description is below.

Update the Python model submission methods for Dataproc to support both Dataproc cluster and Dataproc Serverless. When a cluster ID is provided, we will use the Dataproc cluster by default; users can override this by specifying a separate config in the model's YAML file. When no cluster ID is provided, we will submit using Serverless by default; users can override this by providing a cluster ID and a submission method in the model config.

  • Support running with a cluster and Serverless at the same time
  • Re-enable tests for Python models (running with Serverless)
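The dispatch rule described above could look like the following minimal sketch. The function name and config shapes are hypothetical, not the actual dbt-bigquery implementation:

```python
# Hypothetical sketch: pick a submission method from model config, falling
# back to the profile, with Serverless as the default when no cluster is set.
def resolve_submission_method(model_config: dict, profile: dict) -> str:
    """Model-level config wins over the profile; serverless is the default."""
    method = model_config.get("submission_method") or profile.get("submission_method")
    if method:
        return method
    # No explicit method: infer from whether a cluster name is configured
    return "cluster" if profile.get("dataproc_cluster_name") else "serverless"
```

With this rule, a profile that names a cluster runs on that cluster unless the model explicitly asks for `serverless`.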

Original description

Good context in this Slack thread

Docs: https://cloud.google.com/dataproc-serverless/docs/overview

Dataproc Serverless:

  • does not require an always-on cluster
  • does not require end users to provide a dataproc_cluster_name + dataproc_region in their connection profile
  • may have better handling for concurrency / auto-scaling

Risks to investigate:

  • Less straightforward to set up initially — we would need to clearly document any pitfalls
  • Potentially more costly per vCPU hour (but much more cost-effective overall)
  • Longer runtimes, because dbt would have to submit "batch" PySpark jobs
  • Installing third-party packages is less straightforward, since they cannot be configured on the cluster at creation time. The mechanisms here seem to be a custom image + conda-pack (docs). The default image for Dataproc Serverless (docs) seems to include a few popular packages (koalas, numpy, pandas, regex, scikit-learn, ...), but this is far from everything a user would need.
@jtcohen6 jtcohen6 added enhancement New feature or request python_models labels Aug 3, 2022
@github-actions github-actions bot changed the title [Spike] Use dataproc serverless instead of dataproc cluster [CT-980] [Spike] Use dataproc serverless instead of dataproc cluster Aug 3, 2022
@jtcohen6
Contributor Author

Additional ask in this Slack thread: could we avoid the need for the gcs_bucket configuration?

Currently, we write the compiled code to GCS as text:

def upload_to_gcs(self, filename: str, compiled_code: str):
    bucket = self.storage_client.get_bucket(self.credential.gcs_bucket)
    blob = bucket.blob(filename)
    blob.upload_from_string(compiled_code)

It seems possible to pass a pointer to a local file for Dataproc PySpark jobs instead. We'd need to write to a local file and upload that, which has its downsides (different OSes, execution environments, ...).
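One way to sketch that local-file alternative (helper names and wiring are hypothetical; `upload_from_filename` is the google-cloud-storage method for path-based uploads):

```python
# Sketch: write compiled model code to a local temp file first, then upload
# the file by path instead of uploading the string directly.
import tempfile
from pathlib import Path

def write_compiled_code_locally(filename: str, compiled_code: str) -> Path:
    """Persist compiled model code to a local temp file for later upload."""
    tmp_dir = Path(tempfile.mkdtemp(prefix="dbt-dataproc-"))
    path = tmp_dir / filename
    path.write_text(compiled_code)
    return path

def upload_via_file(storage_client, bucket_name: str, filename: str, compiled_code: str):
    """Upload by file path rather than from an in-memory string."""
    local_path = write_compiled_code_locally(filename, compiled_code)
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(filename)
    blob.upload_from_filename(str(local_path))  # instead of upload_from_string
```

Note this still requires a GCS bucket; it only changes how the code gets there, which is part of why the downsides above matter.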

@lostmygithubaccount

We should support both if possible, but default to serverless. Users would likely need to specify a cluster to use non-serverless.

@leahwicz
Contributor

@ChenyuLInx let's convert this from a spike to what we actually are planning to do

@jtcohen6
Contributor Author

jtcohen6 commented Sep 7, 2022

@ChenyuLInx @lostmygithubaccount

Sharing my opinions:

  1. If the user specifies a cluster_name, they want to use a specific Dataproc cluster. If they don't specify a dataproc_cluster_name, or set dataproc_cluster_name to a falsy value (None, ""), they want to use Dataproc serverless.
  2. It should be possible to configure dataproc_cluster_name + dataproc_region + gcs_bucket for specific models, in addition to configuring defaults in profiles.yml. (This enables the same behavior as snowflake_warehouse. It also removes dbt-cloud changes to connection profiles as a true dependency.) If it's configured for a specific model, it takes precedence over the profile config. This would require an implementation change to how DataprocHelper is being instantiated, since different models could use different configuration, and properties need to be pulled off of parsed_model in addition to self.config.credentials.
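The per-model precedence in point 2 could be sketched like this. Field names mirror the discussion; the `DataprocConfig` container and resolution helper are illustrative, not the real `DataprocHelper` code:

```python
# Hypothetical sketch: resolve Dataproc settings per model, with the parsed
# model's config taking precedence over the connection profile.
from dataclasses import dataclass
from typing import Optional

@dataclass
class DataprocConfig:
    cluster_name: Optional[str]
    region: Optional[str]
    gcs_bucket: Optional[str]

def resolve_dataproc_config(parsed_model: dict, credentials: dict) -> DataprocConfig:
    model_cfg = parsed_model.get("config", {})

    def pick(key: str):
        # Model-level config wins; fall back to the profile credentials
        return model_cfg.get(key) or credentials.get(key)

    return DataprocConfig(
        cluster_name=pick("dataproc_cluster_name"),
        region=pick("dataproc_region"),
        gcs_bucket=pick("gcs_bucket"),
    )
```

Because resolution depends on the parsed model, a helper would need to be constructed (or reconfigured) per model rather than built once and reused.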

Feel free to disagree — just want to make sure we have the alignment we need to move forward with this

@ChenyuLInx
Contributor

ChenyuLInx commented Sep 7, 2022

@jtcohen6 in dbt-spark I added submission_method to make things a bit more specific; I think we can do something similar here.

  • The default submission_method is serverless; users can also configure it to be cluster for the project/model.
  • This config will follow the current config-override rules: project, config in YAML, config in model.
  • We will raise an error if dataproc_cluster_name is not specified while the cluster submission method is selected.
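The error rule in the last bullet could look like this minimal sketch (function name and message are hypothetical):

```python
# Hypothetical sketch: fail fast when the cluster submission method is
# selected but no cluster name is configured.
from typing import Optional

def validate_submission(submission_method: str, cluster_name: Optional[str]) -> None:
    if submission_method == "cluster" and not cluster_name:
        raise ValueError(
            "submission_method is set to 'cluster' but no "
            "dataproc_cluster_name was provided"
        )
```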

Two questions:

  • Should we also allow users to override dataproc_cluster_name in the config for specific models?
  • Users can have dataproc_cluster_name specified in their profile, but if they don't set submission_method to cluster, dbt will still run things as serverless. This could feel odd; what do we want to do? (This is not a problem right now, as we require cluster_id for all submission methods.)

Tag @lostmygithubaccount also

@jtcohen6
Contributor Author

jtcohen6 commented Sep 8, 2022

@ChenyuLInx

added submission_method to make things a bit more specific, I think we can do similar things here.

Ok, I'm aligned! More explicit is good. I also prefer using cluster + serverless rather than GCP's language (job, batch), which feels less clear.

Should we also allow the user to overwrite the dataproc_cluster_name in config for specific models

Yes, I think so. I think that will just require us to pull the same configs off the parsed_model node if they're available. I know that means we can't build once + reuse DataprocHelper in the way we are currently, though.

User can have dataproc_cluster_name in their profile specified, but if they don't specify submission_method to be cluster, dbt will still run things as serverless. This could feel odd, what do we want to do?

I see: so we'll be asking users to define both:

dataproc_cluster_name: my-cluster
submission_method: cluster

If the first is missing, it will raise an error. If the second is missing, dbt will keep using serverless. I think that's fine, as long as we document it clearly.

Among acceptance criteria to closing this issue, could we also include re-enabling automated tests for dbt Python models on GCP? I'm hopeful that our use of Dataproc Serverless will make this more reliable (if also slow). If we need to open a separate ticket for retry / connection timeout, let's do it.

@ChenyuLInx
Contributor

Sounds great!! I think we are all aligned!

@ChenyuLInx ChenyuLInx changed the title [CT-980] [Spike] Use dataproc serverless instead of dataproc cluster [CT-980] [feature] Use dataproc serverless instead of dataproc cluster Sep 9, 2022