
Materialized tables creation fails on EMR #21

Closed
rhousewright opened this issue May 22, 2019 · 5 comments

@rhousewright (Contributor)

When I attempt to have dbt run a simple job that results in a materialized Spark table using EMR, I get an error as follows:

> dbt run
Running with dbt=0.13.0
Found 1 models, 0 tests, 0 archives, 0 analyses, 100 macros, 0 operations, 0 seed files, 1 sources

16:25:55 | Concurrency: 1 threads (target='dev')
16:25:55 |
16:25:55 | 1 of 1 START table model data_lake_intgn_test.test_incremental_dt.... [RUN]
16:25:58 | 1 of 1 ERROR creating table model data_lake_intgn_test.test_incremental_dt [ERROR in 2.37s]
16:25:58 |
16:25:58 | Finished running 1 table models in 6.23s.

Completed with 1 errors:

Runtime Error in model data_lake_intgn_current|test_incremental_dt|current (models/data_lake_intgn/current/data_lake_intgn_current|test_incremental_dt|current.sql)
  java.lang.IllegalArgumentException: Can not create a Path from an empty string

Done. PASS=0 ERROR=1 SKIP=0 TOTAL=1

If I run the compiled query directly in PySpark on the EMR cluster, I get the same error message (with the following more complete stack trace):

py4j.protocol.Py4JJavaError: An error occurred while calling o58.sql.
: java.lang.IllegalArgumentException: Can not create a Path from an empty string
	at org.apache.hadoop.fs.Path.checkPathArg(Path.java:163)
	at org.apache.hadoop.fs.Path.<init>(Path.java:175)
	at org.apache.spark.sql.catalyst.catalog.CatalogUtils$.stringToURI(ExternalCatalogUtils.scala:236)
	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getDatabase$1$$anonfun$apply$2.apply(HiveClientImpl.scala:343)
	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getDatabase$1$$anonfun$apply$2.apply(HiveClientImpl.scala:339)
	at scala.Option.map(Option.scala:146)
	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getDatabase$1.apply(HiveClientImpl.scala:339)
	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getDatabase$1.apply(HiveClientImpl.scala:345)
	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
	at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
	at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
	at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
	at org.apache.spark.sql.hive.client.HiveClientImpl.getDatabase(HiveClientImpl.scala:338)
	at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getDatabase$1.apply(HiveExternalCatalog.scala:211)
	at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getDatabase$1.apply(HiveExternalCatalog.scala:211)
	at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
	at org.apache.spark.sql.hive.HiveExternalCatalog.getDatabase(HiveExternalCatalog.scala:210)
	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.getDatabase(ExternalCatalogWithListener.scala:65)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getDatabaseMetadata(SessionCatalog.scala:233)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.defaultTablePath(SessionCatalog.scala:472)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$4.apply(SessionCatalog.scala:327)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$4.apply(SessionCatalog.scala:327)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.validateTableLocation(SessionCatalog.scala:327)
	at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:170)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:115)
	at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195)
	at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195)
	at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364)
	at org.apache.spark.sql.Dataset.<init>(Dataset.scala:195)
	at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:80)
	at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

If I run the same query with the addition of a location statement, however, I do not get an error and the table is created successfully - e.g.:

create table [name]
using parquet 
location 's3://[bucket]/[path]' 
as SELECT * FROM [blah]

I think that the root cause of this is that Databricks does some behind-the-scenes magic with default locations / managed tables / DBFS, which doesn't work on more vanilla Spark, at least in the context of EMR. It's possible that fiddling with some Spark configs could mitigate this, but in general I'd think that specifying an s3 path for a table would be a fairly normal thing to want to do.

There are a couple of approaches that occur to me for dealing with this, which could probably be combined into a default / override kind of setup:

  • Set a 'root' location in your dbt_project.yml and have dbt format model names into it, e.g. set the root to s3://my-bucket/prod/models/ and have model_1 automatically land in s3://my-bucket/prod/models/model_1/, model_2 in s3://my-bucket/prod/models/model_2/, and so on.
  • Set a specific table location at the table level via config, so you can arbitrarily place model_1 at s3://bucket-a/some-model and model_2 at s3://bucket-b/some-model-also (a rough sketch of this follows below).
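
To make the second option concrete, here's a rough sketch of what a per-model override could look like. The location config key is hypothetical and doesn't exist in the plugin today; model and bucket names are placeholders:

-- models/model_1.sql (hypothetical location config key)
{{ config(materialized='table', location='s3://bucket-a/some-model') }}

-- placeholder model body
select 1 as id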
@tromika (Contributor) commented May 23, 2019

@rhousewright You can create a database in advance with
CREATE DATABASE whatever LOCATION 's3://[bucket]/[path]'
and point dbt at this database via the database/schema option in the profile settings, so it automatically picks up the location. The only catch with EMR is that you'll need an external Hive metastore to persist this database.
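
For illustration, a minimal sketch of that setup (database name, bucket, and path are placeholders; profile connection fields are omitted since they depend on how you reach the cluster):

-- run once on the cluster, e.g. from spark-sql or a PySpark session
CREATE DATABASE analytics LOCATION 's3://my-bucket/analytics/';

# profiles.yml -- the schema key points dbt at the pre-created database
my_spark_profile:
  target: dev
  outputs:
    dev:
      type: spark
      schema: analytics
      # host / method / port as appropriate for your cluster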

@drewbanin (Contributor)

Thanks for the report @rhousewright and for some good insight @tromika.

We've designed this plugin to require that the target database is created manually and is initialized with a location. This is analogous to how dbt works on other types of databases - it doesn't create a target database for you, instead operating at the schema-level. It's unclear how this should work on Spark, given that database == schema!

> Set a 'root' location in your dbt_project.yml, and have dbt format model names into it

I like this idea! I imagine the location would be specified in the profiles.yml file, given that it will vary between environments, e.g.:

  • dev: location: s3://my-bucket/dbt-drew/models/
  • prod: location: s3://my-bucket/prod/models/
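
Roughly, something like this, where the per-target location key is the hypothetical new part and the rest is placeholder boilerplate:

# profiles.yml -- hypothetical per-target location key
spark_profile:
  target: dev
  outputs:
    dev:
      type: spark
      schema: dbt_drew
      location: 's3://my-bucket/dbt-drew/models/'
    prod:
      type: spark
      schema: analytics
      location: 's3://my-bucket/prod/models/'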

> Set a specific table location at the table level via config

I think this is a good idea too! Do you think there's any issue with creating every object with the same prefix? I imagine that list calls on S3 would be slow if there were very many models rendered into the same S3 path. Instead, we might be able to generate a prefix by default, like:

s3://my-bucket/env-name/project-name/path/to/model

You'd be able to override part (or all?) of this path in config.

I haven't done any testing, so I actually don't know if that's desirable. What do you guys think?

@aaronsteers (Contributor)

Just checking in on older issues. Should this still remain open? Seems like it might be resolved by #43 (ability to configure location_root) and/or #40 (support for creating schemas).

@Dandandan (Contributor) commented Feb 7, 2020

For me an older version was not working, but dbt-spark master, with a location already provided in the Glue Data Catalog (or created from dbt via schema creation), does work.
Adding location_root support makes it much more flexible where individual tables are stored on EMR.
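
For anyone landing here later, a minimal sketch of the location_root config from #43, set project-wide in dbt_project.yml (project name and bucket are placeholders); each table should then be created under that prefix, named after the model:

# dbt_project.yml
models:
  my_project:
    materialized: table
    location_root: 's3://my-bucket/prod/models'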

@jtcohen6 (Contributor) commented Feb 7, 2020

Agreed @aaronsteers @Dandandan. I'm going to close this issue, given that those two features will ship in the next release.

jtcohen6 closed this as completed Feb 7, 2020