Materialized tables creation fails on EMR #21
Comments
@rhousewright You can create a database in advance with
Thanks for the report @rhousewright and for some good insight @tromika. We've designed this plugin to require that the target database is created manually and is initialized with a
You'd be able to override part (or all?) of this path in config. I haven't done any testing, so I actually don't know if that's desirable. What do you guys think?
For me, an older version was not working; dbt-spark master + a location already provided in the Glue Data Catalog (or creating it from dbt with schema creation) does work.
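For context, a minimal sketch (in Spark SQL, assuming an EMR cluster backed by the Glue Data Catalog) of pre-creating the target database with an explicit S3 location; the database name and bucket path are placeholders:

```sql
-- Pre-create the target schema/database with an explicit S3 location so that
-- tables created inside it get a sensible default path (names are illustrative).
CREATE DATABASE IF NOT EXISTS analytics
LOCATION 's3://my-bucket/prod/warehouse/analytics.db/';
```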
Agreed @aaronsteers @Dandandan. I'm going to close this issue, given that those two features will ship in the next release.
When I attempt to have dbt run a simple job that results in a materialized Spark table using EMR, I get an error as follows:
If I run the compiled query directly in PySpark on the EMR cluster, I get the same error message (with the following more complete stack trace):
If I run the same query with the addition of a `location` statement, however, I do not get an error and the table is created successfully - e.g.:

I think that the root cause of this is that Databricks does some behind-the-scenes magic with default locations / managed tables / DBFS, which doesn't work on more vanilla Spark, at least in the context of EMR. It's possible that fiddling with some Spark configs could mitigate this, but in general I'd think that specifying an S3 path for a table would be a fairly normal thing to want to do.
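For illustration, a hedged sketch of the variant with an explicit location that succeeds (this is not the reporter's actual compiled query; the table name, column, and S3 path are placeholders):

```sql
-- CTAS with an explicit LOCATION, which avoids relying on a default managed-table
-- path that vanilla Spark on EMR does not provide (all names are illustrative).
CREATE TABLE analytics.model_1
USING parquet
LOCATION 's3://my-bucket/prod/models/model_1/'
AS
SELECT 1 AS id;
```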
There are a couple of approaches that occur to me for dealing with this, which could probably be combined into a default / override kind of situation:

- Specify a default base location at the project / profile level, e.g. `s3://my-bucket/prod/models/`, and have `model_1` get automatically put into `s3://my-bucket/prod/models/model_1/`, `model_2` automatically go into `s3://my-bucket/prod/models/model_2/`, and so on.
- Specify a location explicitly per-model, e.g. `model_1` at `s3://bucket-a/some-model` and `model_2` at `s3://bucket-b/some-model-also` (sketched below).