
Materialized tables creation fails on EMR #21

Closed
rhousewright opened this issue May 22, 2019 · 5 comments

@rhousewright (Contributor)

When I attempt to have dbt run a simple job that results in a materialized Spark table using EMR, I get an error as follows:

> dbt run
Running with dbt=0.13.0
Found 1 models, 0 tests, 0 archives, 0 analyses, 100 macros, 0 operations, 0 seed files, 1 sources

16:25:55 | Concurrency: 1 threads (target='dev')
16:25:55 |
16:25:55 | 1 of 1 START table model data_lake_intgn_test.test_incremental_dt.... [RUN]
16:25:58 | 1 of 1 ERROR creating table model data_lake_intgn_test.test_incremental_dt [ERROR in 2.37s]
16:25:58 |
16:25:58 | Finished running 1 table models in 6.23s.

Completed with 1 errors:

Runtime Error in model data_lake_intgn_current|test_incremental_dt|current (models/data_lake_intgn/current/data_lake_intgn_current|test_incremental_dt|current.sql)
  java.lang.IllegalArgumentException: Can not create a Path from an empty string

Done. PASS=0 ERROR=1 SKIP=0 TOTAL=1

If I run the compiled query directly in PySpark on the EMR cluster, I get the same error message (with the following more complete stack trace):

py4j.protocol.Py4JJavaError: An error occurred while calling o58.sql.
: java.lang.IllegalArgumentException: Can not create a Path from an empty string
	at org.apache.hadoop.fs.Path.checkPathArg(Path.java:163)
	at org.apache.hadoop.fs.Path.<init>(Path.java:175)
	at org.apache.spark.sql.catalyst.catalog.CatalogUtils$.stringToURI(ExternalCatalogUtils.scala:236)
	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getDatabase$1$$anonfun$apply$2.apply(HiveClientImpl.scala:343)
	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getDatabase$1$$anonfun$apply$2.apply(HiveClientImpl.scala:339)
	at scala.Option.map(Option.scala:146)
	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getDatabase$1.apply(HiveClientImpl.scala:339)
	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getDatabase$1.apply(HiveClientImpl.scala:345)
	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
	at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
	at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
	at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
	at org.apache.spark.sql.hive.client.HiveClientImpl.getDatabase(HiveClientImpl.scala:338)
	at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getDatabase$1.apply(HiveExternalCatalog.scala:211)
	at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getDatabase$1.apply(HiveExternalCatalog.scala:211)
	at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
	at org.apache.spark.sql.hive.HiveExternalCatalog.getDatabase(HiveExternalCatalog.scala:210)
	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.getDatabase(ExternalCatalogWithListener.scala:65)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getDatabaseMetadata(SessionCatalog.scala:233)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.defaultTablePath(SessionCatalog.scala:472)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$4.apply(SessionCatalog.scala:327)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$4.apply(SessionCatalog.scala:327)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.validateTableLocation(SessionCatalog.scala:327)
	at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:170)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:115)
	at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195)
	at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195)
	at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364)
	at org.apache.spark.sql.Dataset.<init>(Dataset.scala:195)
	at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:80)
	at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

If I run the same query with the addition of a location statement, however, I do not get an error and the table is created successfully - e.g.:

create table [name]
using parquet 
location 's3://[bucket]/[path]' 
as SELECT * FROM [blah]

I think that the root cause of this is that Databricks does some behind-the-scenes magic with default locations / managed tables / DBFS, which doesn't work on more vanilla Spark, at least in the context of EMR. It's possible that fiddling with some Spark configs could mitigate this, but in general I'd think that specifying an s3 path for a table would be a fairly normal thing to want to do.

There are a couple of approaches that occur to me for dealing with this, which could probably be combined into a default / override kind of setup:

  • Set a 'root' location in your dbt_project.yml and have dbt format model names into it, e.g. set the root to s3://my-bucket/prod/models/ and have model_1 automatically land in s3://my-bucket/prod/models/model_1/, model_2 in s3://my-bucket/prod/models/model_2/, and so on.
  • Set a specific table location at the table level via config, so you can arbitrarily place model_1 at s3://bucket-a/some-model and model_2 at s3://bucket-b/some-model-also (a rough sketch of this follows below).
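
To make the second option concrete, here's a rough sketch of what a per-model override could look like. The location config key is hypothetical and doesn't exist in the plugin today; model and bucket names are placeholders:

-- models/model_1.sql (hypothetical location config key)
{{ config(materialized='table', location='s3://bucket-a/some-model') }}

-- placeholder model body
select 1 as id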
@tromika (Contributor) commented May 23, 2019

@rhousewright You can create a database in advance with
CREATE DATABASE whatever LOCATION 's3://[bucket]/[path]'
and point dbt at this database via the database/schema option in the profile settings, so it automatically picks up the location. The only catch with EMR is that you'll need an external Hive metastore to persist this database.
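
For illustration, a minimal sketch of that setup (database name, bucket, and path are placeholders; profile connection fields are omitted since they depend on how you reach the cluster):

-- run once on the cluster, e.g. from spark-sql or a PySpark session
CREATE DATABASE analytics LOCATION 's3://my-bucket/analytics/';

# profiles.yml -- the schema key points dbt at the pre-created database
my_spark_profile:
  target: dev
  outputs:
    dev:
      type: spark
      schema: analytics
      # host / method / port as appropriate for your cluster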

@drewbanin (Contributor)

Thanks for the report @rhousewright and for some good insight @tromika.

We've designed this plugin to require that the target database is created manually and is initialized with a location. This is analogous to how dbt works on other types of databases - it doesn't create a target database for you, instead operating at the schema-level. It's unclear how this should work on Spark, given that database == schema!

> Set a 'root' location in your dbt_project.yml, and have dbt format model names into it

I like this idea! I imagine the location would be specified in the profiles.yml file, given that it will vary between environments, e.g.:

  • dev: location: s3://my-bucket/dbt-drew/models/
  • prod: location: s3://my-bucket/prod/models/
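
Roughly, something like this, where the per-target location key is the hypothetical new part and the rest is placeholder boilerplate:

# profiles.yml -- hypothetical per-target location key
spark_profile:
  target: dev
  outputs:
    dev:
      type: spark
      schema: dbt_drew
      location: 's3://my-bucket/dbt-drew/models/'
    prod:
      type: spark
      schema: analytics
      location: 's3://my-bucket/prod/models/'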

> Set a specific table location at the table level via config

I think this is a good idea too! Do you think there's any issue with creating every object with the same prefix? I imagine that list calls on S3 would be slow if there were very many models rendered into the same S3 path. Instead, we might be able to generate a prefix by default, like:

s3://my-bucket/env-name/project-name/path/to/model

You'd be able to override part (or all?) of this path in config.

I haven't done any testing, so I actually don't know if that's desirable. What do you guys think?

@aaronsteers (Contributor)

Just checking in on older issues. Should this still remain open? Seems like it might be resolved by #43 (ability to configure location_root) and/or #40 (support for creating schemas).

@Dandandan (Contributor) commented Feb 7, 2020

For me an older version was not working, but dbt-spark master, with a location already provided in the Glue Data Catalog (or created from dbt via schema creation), does work.
Adding location_root support makes it much more flexible where individual tables are stored on EMR.
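
For anyone landing here later, a minimal sketch of the location_root config from #43, set project-wide in dbt_project.yml (project name and bucket are placeholders); each table should then be created under that prefix, named after the model:

# dbt_project.yml
models:
  my_project:
    materialized: table
    location_root: 's3://my-bucket/prod/models'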

@jtcohen6 (Contributor) commented Feb 7, 2020

Agreed @aaronsteers @Dandandan. I'm going to close this issue, given that those two features will ship in the next release.

jtcohen6 closed this as completed Feb 7, 2020