Add support for creating/dropping schema's #40
Conversation
I think you're going to want to delete the two methods shown here too. We disabled this functionality intentionally for two reasons:

Are you running dbt-spark against an S3 datastore, or against something else? I bet there's a good way for us to configure the
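(For context, the diff under discussion isn't shown in this thread. A minimal sketch of what Spark schema-creation/deletion macros in a dbt adapter typically look like — macro names and signatures are assumptions, not copied from this PR:)

```sql
{# Hypothetical sketch, not the exact code from this PR. #}
{% macro spark__create_schema(relation) -%}
  {%- call statement('create_schema') -%}
    create schema if not exists {{ relation }}
  {%- endcall -%}
{% endmacro %}

{% macro spark__drop_schema(relation) -%}
  {%- call statement('drop_schema') -%}
    drop schema if exists {{ relation }} cascade
  {%- endcall -%}
{% endmacro %}
```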
Thanks for the background information. We're using dbt with Databricks on the Azure cloud. We're pleasantly surprised at how well dbt works out of the box (with a few patches, such as #34). We use Azure Data Lake (ADLS) as persistent storage. By setting
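(A hedged sketch of the kind of cluster configuration being described here — the property names are real Spark/Hive settings, but the storage account and paths are hypothetical:)

```
# spark-defaults style config (illustrative ADLS paths)
spark.sql.warehouse.dir                    abfss://warehouse@myaccount.dfs.core.windows.net/dbt
spark.hadoop.hive.metastore.warehouse.dir  abfss://warehouse@myaccount.dfs.core.windows.net/dbt
```

With a default like this in place, a schema created without an explicit location lands in external storage rather than on the cluster's local HDFS.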
@Fokko Thanks for this! For both reasons @drewbanin mentioned, we were hesitant to build database/schema creation and deletion as native dbt methods when we were just starting on Spark and had a testing sample size of 1. This is a totally sane addition, and it sounds like it hasn't caused any issues in testing.
I don't know enough about all the different Spark vendors/implementations to fully understand the possible mappings between a schema/database and a location in external storage when a file path `location` is not specified in `create schema`. It seems that there are reasonable ways to set default values in the cluster config or the associated metastore. When I run this branch in Databricks (backed by AWS), it just works.
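(For readers unfamiliar with the syntax being discussed, an illustration in Spark SQL — the bucket and path are hypothetical:)

```sql
-- With an explicit LOCATION, the schema's data lives at that path:
CREATE SCHEMA IF NOT EXISTS analytics LOCATION 's3://my-bucket/warehouse/analytics';

-- Without one, Spark falls back to the metastore default
-- (e.g. hive.metastore.warehouse.dir):
CREATE SCHEMA IF NOT EXISTS analytics;
```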
Outside of integration tests and CI builds (in dbt Cloud), I don't think dbt ever initiates the deletion of a schema. If we wanted to be extra safe, we could get away without `drop_schema`. Given my goal of getting `dbt-spark` to feature parity with the core adapters, though, we may as well define it all the same.
I'm immensely appreciative of your contributions to this plugin over the past few months. I know you have a few PRs open, which I'm hoping to work through later this week.
Thanks @jtcohen6 for the extensive response. For now, I've reverted the drop schema. I see how we need to be defensive here. In the cloud, almost every table is an external table, since you don't want to keep any tables on Hive itself, on the cluster. Cloud clusters are temporary, so rather than keeping data on HDFS, you store it on S3, GCS, or ADLS. I've been extensively testing dbt with Azure/Databricks/ADLS, and @Dandandan is testing extensively on AWS using EMR and Glue. We're happy to share our experiences. This change helps us to create databases for users.
Thanks @jtcohen6 for testing this out and reviewing the open PRs! For EMR, we will also use the `hive.metastore.warehouse.dir` setting, though overriding this location in a config setting might also be useful for targeting different locations; we may look into adding support for that at a later moment.
I saw the configurable location is also being added by Niels in PR #43; these two together would be perfect additions to the Spark plugin for us!
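(PR #43's actual implementation isn't shown in this thread. Purely as a hypothetical sketch of how a configurable location could combine with schema creation — the macro name follows dbt's adapter-dispatch convention, but the `schema_location_root` var and the path layout are illustrative, not the PR's API:)

```sql
{# Hypothetical combination of #40 and #43 -- not code from either PR. #}
{% macro spark__create_schema(relation) -%}
  {%- set location = var('schema_location_root', none) -%}
  {%- call statement('create_schema') -%}
    create schema if not exists {{ relation }}
    {%- if location is not none %} location '{{ location }}/{{ relation.schema }}'{% endif %}
  {%- endcall -%}
{% endmacro %}
```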
@Fokko Sorry, I should have been more clear! I think it's fine to include `drop_schema`. I'm immensely grateful to both of you for testing so extensively, and so glad to have you as contributors!

Related work:
Co-Authored-By: Jeremy Cohen <jtcohen6@gmail.com>

Force-pushed from c5c4285 to 85b671a
Thanks for the clarification @jtcohen6. I've reverted my latest commit. I think this is good to go :-)
We needed support for creating and dropping schemas.