Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/python model v1 #209

Merged
merged 23 commits into from
Jul 28, 2022
Merged

Feature/python model v1 #209

merged 23 commits into from
Jul 28, 2022

Conversation

ChenyuLInx
Copy link
Contributor

@ChenyuLInx ChenyuLInx commented Jun 29, 2022

This change currently includes table materialization.

Also super happy to hear any feedback and what you think we missed.
TODO:

@cla-bot cla-bot bot added the cla:yes label Jun 29, 2022
@ChenyuLInx ChenyuLInx marked this pull request as draft June 29, 2022 04:38
models__simple_python_model_v2 = """
import pandas

def model(dbt):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What the model code looks like

spark = SparkSession.builder.appName('smallTest').getOrCreate()

spark.conf.set("viewsEnabled","true")
spark.conf.set("temporaryGcsBucket","python-model-test")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO in the comment above

Comment on lines 27 to 28
from google.cloud import dataproc_v1
from google.cloud import storage
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Let's add these as an extras_require — e.g. pip install dbt-bigquery[dataproc]
  • Let's raise a clearer error below if a user wants to use Python models, and hasn't installed the extra

Comment on lines 888 to 894
matches = re.match("gs://(.*?)/(.*)", response.driver_output_resource_uri)
output = (
storage.Client()
.get_bucket(matches.group(1))
.blob(f"{matches.group(2)}.000000000")
.download_as_string()
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no way to pass a full GCS resource URL into storage.Client()? That's sorta surprising to me. This isn't the worst regex, so I'm not strictly opposed to taking this approach if it's necessary

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure! I just followed a tutorial google provided for this, we can look into it more if needed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ope ok makes sense! :)

# Dataproc job output is saved to the Cloud Storage bucket
# allocated to the job. Use regex to obtain the bucket and blob info.
matches = re.match("gs://(.*?)/(.*)", response.driver_output_resource_uri)
output = (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I take it that these are Spark logs being streamed back to GCS: https://cloud.google.com/dataproc/docs/guides/dataproc-job-output

If we felt motivated, we could try parsing these logs to infer metadata for the result. Not a priority right now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I looked into it briefly, didn't found much things to surface up right now. We can take another look later on

jtcohen6 and others added 5 commits July 21, 2022 15:10
* Use same auth for Dataproc + GCS

* remove fetch result

Co-authored-by: Chenyu Li <chenyu.li@dbtlabs.com>
* fix test and add clear install instruction

* rename and fix format
@ChenyuLInx ChenyuLInx marked this pull request as ready for review July 26, 2022 04:34
Copy link
Contributor

@stu-k stu-k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing stands out 👍

@emmyoop emmyoop closed this Jul 27, 2022
@emmyoop emmyoop reopened this Jul 27, 2022
Copy link
Contributor

@gshank gshank left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see why the dataproc_region wasn't found. Once that is working, looks fine.

@ChenyuLInx ChenyuLInx merged commit 55bbd3d into main Jul 28, 2022
@ChenyuLInx ChenyuLInx deleted the feature/python-model-v1 branch July 28, 2022 21:12
@dataders dataders mentioned this pull request Apr 25, 2024
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants