
Add support for creating/dropping schemas #40

Merged

Conversation

@Fokko (Contributor) commented Dec 10, 2019

We needed support for creating and dropping schemas.

@drewbanin (Contributor):

I think you're going to want to delete the two methods shown here too:
https://github.com/fishtown-analytics/dbt-spark/blob/b4db0548266f260c3b75903ce08ad140805b355a/dbt/adapters/spark/impl.py#L36-L46

We disabled this functionality intentionally for two reasons:

  1. We wanted to be extra sure that dbt didn't errantly drop the wrong schemas while this plugin was in early development
  2. We imagined most applications of this plugin would create external schemas, i.e. supply a LOCATION like s3://path/to/folder when creating them

Are you running dbt-spark against an S3 datastore, or against something else? I bet there's a good way for us to configure the location in the create schema query, FWIW.
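
For illustration, here is roughly what such a create schema statement with an explicit external location looks like in Spark SQL, issued through PySpark. This is a minimal sketch; the schema name and bucket path are placeholders, not anything from this PR.

```python
from pyspark.sql import SparkSession

# On Databricks or EMR a Hive-enabled `spark` session already exists;
# this line only matters when running standalone.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Create an external schema: its data is stored at the given LOCATION
# instead of the metastore's default warehouse directory.
# `analytics` and the s3:// path are illustrative placeholders.
spark.sql("CREATE SCHEMA IF NOT EXISTS analytics LOCATION 's3://path/to/folder'")
```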

@Fokko (Contributor, Author) commented Dec 21, 2019

Thanks for the background information. We're using dbt with Databricks on the Azure cloud, and we were pleasantly surprised by how well dbt works out of the box (with a few patches, such as #34).

We use Azure Data Lake Storage (ADLS) as persistent storage. By setting hive.metastore.warehouse.dir to the ADLS storage bucket, all data is persisted to that location. We use temporary clusters to process the pipeline, and this property is simply part of the cluster configuration. For AWS this is very similar: you set the same property on EMR/Databricks and point it to an s3:// bucket.
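
As a rough sketch of the setup described above: the warehouse directory is set when the (temporary) cluster's Spark session is configured, so every schema and managed table lands in object storage. The abfss:// path below is a placeholder account/container; on EMR it would be an s3:// URI instead.

```python
from pyspark.sql import SparkSession

# Point the Hive warehouse at durable object storage instead of
# cluster-local HDFS; the cluster itself can then be thrown away.
# The ADLS path is a placeholder, for illustration only.
spark = (
    SparkSession.builder
    .config("hive.metastore.warehouse.dir",
            "abfss://container@account.dfs.core.windows.net/warehouse")
    .enableHiveSupport()
    .getOrCreate()
)
```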

@jtcohen6 (Contributor) left a comment:

@Fokko Thanks for this! For both reasons @drewbanin mentioned, we were hesitant to build database/schema creation and deletion as native dbt methods when we were just starting on Spark and had a testing sample size of 1. This is a totally sane addition, and it sounds like it hasn't caused any issues in testing.

I don't know enough about all the different Spark vendors/implementations to fully understand how a schema/database maps to a location in external storage when no file path is specified in create schema. It seems there are reasonable ways to set default values in the cluster config or the associated metastore. When I run this branch on Databricks (backed by AWS), it just works.

Outside of integration tests and CI builds (in dbt Cloud), I don't think dbt ever initiates the deletion of a schema. If we wanted to be extra safe, we could get away without drop_schema. Given my goal of getting dbt-spark to feature parity with the core adapters, though, we may as well define it all the same.
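
For reference, the statement behind a drop_schema implementation would look something like the following in Spark SQL (a sketch, not necessarily the macro this PR ships; the schema name is illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# IF EXISTS keeps the call idempotent; CASCADE also drops every table in
# the schema, which is exactly why this was disabled early on.
spark.sql("DROP SCHEMA IF EXISTS analytics CASCADE")
```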

I'm immensely appreciative of your contributions to this plugin over the past few months. I know you have a few PRs open, which I'm hoping to work through later this week.

@Fokko (Contributor, Author) commented Jan 28, 2020

Thanks @jtcohen6 for the extensive response. For now, I've reverted the drop schema change; I see why we need to be defensive here. In the cloud, almost every table is an external table, since you don't want tables living in Hive on the cluster itself: the clusters are temporary, so rather than keeping data on HDFS you store it on S3, GCS, or ADLS.

I've been testing dbt extensively with Azure/Databricks/ADLS, and @Dandandan is testing extensively on AWS using EMR and Glue. We're happy to share our experiences. This change helps us create databases for users.

@Dandandan (Contributor):

Thanks @jtcohen6 for testing this out and reviewing the open PRs!

For EMR, we will also set hive.metastore.warehouse.dir, though overriding this location via a config setting might also be useful for targeting different locations; we may look into adding support for that later.

@Dandandan (Contributor):

I saw that a configurable location is also being added by Niels in PR #43; together, these two would be perfect additions to the Spark plugin for us!
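
To sketch the configurable-location idea (a hypothetical helper for illustration only, not the actual implementation in #43): append a LOCATION clause only when a path is configured, and otherwise let the schema fall back to hive.metastore.warehouse.dir.

```python
from typing import Optional

def create_schema_sql(schema: str, location: Optional[str] = None) -> str:
    # Hypothetical helper: emit a LOCATION clause only when a path is
    # configured; otherwise Spark falls back to the metastore's
    # warehouse directory for the schema's storage location.
    sql = f"CREATE SCHEMA IF NOT EXISTS {schema}"
    if location is not None:
        sql += f" LOCATION '{location}'"
    return sql

# create_schema_sql("analytics", "s3://bucket/analytics")
# -> "CREATE SCHEMA IF NOT EXISTS analytics LOCATION 's3://bucket/analytics'"
```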

@jtcohen6 (Contributor):

@Fokko Sorry, I should have been more clear! I think it's fine to include drop_schema here. I was justifying (mostly to myself) why it's not as dangerous now as we felt it may have been several months ago.

I'm immensely grateful to both of you for testing so extensively, and so glad to have you as contributors!


@Fokko force-pushed the fd-support-for-dropping-and-creating-support branch from c5c4285 to 85b671a on January 29, 2020.
@Fokko (Contributor, Author) commented Jan 29, 2020

Thanks for the clarification @jtcohen6. I've reverted my latest commit. I think this is good to go :-)

@jtcohen6 merged commit 615c273 into dbt-labs:master on Jan 29, 2020.
@Fokko deleted the fd-support-for-dropping-and-creating-support branch on February 16, 2020.