Add support for creating/dropping schema's #40
Conversation
I think you're going to want to delete the two methods shown here too. We disabled this functionality intentionally for two reasons:

Are you running dbt-spark against an S3 datastore, or against something else? I bet there's a good way for us to configure the
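(For context, the diff under discussion isn't shown in this thread. A minimal sketch of what Spark schema-creation/deletion macros in a dbt adapter typically look like — macro names and signatures are assumptions, not copied from this PR:)

```sql
{# Hypothetical sketch, not the exact code from this PR. #}
{% macro spark__create_schema(relation) -%}
  {%- call statement('create_schema') -%}
    create schema if not exists {{ relation }}
  {%- endcall -%}
{% endmacro %}

{% macro spark__drop_schema(relation) -%}
  {%- call statement('drop_schema') -%}
    drop schema if exists {{ relation }} cascade
  {%- endcall -%}
{% endmacro %}
```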
Thanks for the background information. We're using dbt with Databricks on the Azure cloud. We're pleasantly surprised at how well dbt works out of the box (with a few patches, such as #34). We use Azure Data Lake (ADLS) as persistent storage. By setting
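(A hedged sketch of the kind of cluster configuration being described here — the property names are real Spark/Hive settings, but the storage account and paths are hypothetical:)

```
# spark-defaults style config (illustrative ADLS paths)
spark.sql.warehouse.dir                    abfss://warehouse@myaccount.dfs.core.windows.net/dbt
spark.hadoop.hive.metastore.warehouse.dir  abfss://warehouse@myaccount.dfs.core.windows.net/dbt
```

With a default like this in place, a schema created without an explicit location lands in external storage rather than on the cluster's local HDFS.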
@Fokko Thanks for this! For both reasons @drewbanin mentioned, we were hesitant to build database/schema creation and deletion as native dbt methods when we were just starting on Spark and had a testing sample size of 1. This is a totally sane addition, and it sounds like it hasn't caused any issues in testing.
I don't know enough about all the different Spark vendors/implementations to fully understand the possible mappings between a schema/database and a location in external storage when a file path `location` is not specified in `create schema`. It seems that there are reasonable ways to set default values in the cluster config or the associated metastore. When I run this branch in Databricks (backed by AWS), it just works.
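(For readers unfamiliar with the syntax being discussed, an illustration in Spark SQL — the bucket and path are hypothetical:)

```sql
-- With an explicit LOCATION, the schema's data lives at that path:
CREATE SCHEMA IF NOT EXISTS analytics LOCATION 's3://my-bucket/warehouse/analytics';

-- Without one, Spark falls back to the metastore default
-- (e.g. hive.metastore.warehouse.dir):
CREATE SCHEMA IF NOT EXISTS analytics;
```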
Outside of integration tests and CI builds (in dbt Cloud), I don't think dbt ever initiates the deletion of a schema. If we wanted to be extra safe, we could get away without `drop_schema`. Given my goal of getting `dbt-spark` to feature parity with the core adapters, though, we may as well define it all the same.
I'm immensely appreciative of your contributions to this plugin over the past few months. I know you have a few PRs open, which I'm hoping to work through later this week.
Thanks @jtcohen6 for the extensive response. For now, I've reverted the drop schema. I see how we need to be defensive here. In the cloud, almost every table is an external table, since you don't want to keep any tables on Hive itself, on the cluster. Cloud clusters are temporary, so rather than keeping data on HDFS, you store it on S3, GCS, or ADLS. I've been extensively testing dbt with Azure/Databricks/ADLS, and @Dandandan is testing extensively on AWS using EMR and Glue. We're happy to share our experiences. This change helps us to create databases for users.
Thanks @jtcohen6 for testing this out and reviewing the open PRs! For EMR, we will also use the `hive.metastore.warehouse.dir` setting, though overriding this location in a config setting might also be useful for targeting different locations; we may look into adding support for that at a later moment.
I saw the configurable location is also being added by Niels in PR #43; these two together would be perfect additions to the Spark plugin for us!
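(PR #43's actual implementation isn't shown in this thread. Purely as a hypothetical sketch of how a configurable location could combine with schema creation — the macro name follows dbt's adapter-dispatch convention, but the `schema_location_root` var and the path layout are illustrative, not the PR's API:)

```sql
{# Hypothetical combination of #40 and #43 -- not code from either PR. #}
{% macro spark__create_schema(relation) -%}
  {%- set location = var('schema_location_root', none) -%}
  {%- call statement('create_schema') -%}
    create schema if not exists {{ relation }}
    {%- if location is not none %} location '{{ location }}/{{ relation.schema }}'{% endif %}
  {%- endcall -%}
{% endmacro %}
```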
@Fokko Sorry, I should have been more clear! I think it's fine to include `drop_schema`. I'm immensely grateful to both of you for testing so extensively, and so glad to have you as contributors!

Related work:
Co-Authored-By: Jeremy Cohen <jtcohen6@gmail.com>

Force-pushed from c5c4285 to 85b671a
Thanks for the clarification @jtcohen6. I've reverted my latest commit. I think this is good to go :-)
We needed support for creating and dropping schemas.