[SUPPORT] Can not create a Path from an empty string on unpartitioned table #2797

Closed
vansimonsen opened this issue Apr 9, 2021 · 6 comments

Comments

vansimonsen commented Apr 9, 2021

Describe the problem you faced

  • Issue when trying to create unpartitioned tables in the Hive metastore (AWS Glue Data Catalog) using Hudi (tested on 0.6.0, 0.7.0 and 0.8.0)

  • Using Hudi on AWS EMR with PySpark

  • A previous fix for this is included in the newer versions, but the error still occurs

  • Hudi config for unpartitioned tables:

hudiConfig = {
   "hoodie.datasource.write.precombine.field": <column>,
   "hoodie.datasource.write.recordkey.field": _PRIMARY_KEY_COLUMN,
   "hoodie.datasource.write.keygenerator.class": 'org.apache.hudi.keygen.NonpartitionedKeyGenerator',
   "hoodie.datasource.hive_sync.partition_extractor_class": 'org.apache.hudi.hive.NonPartitionedExtractor',
   "hoodie.datasource.write.hive_style_partitioning": "true",
   "className": "org.apache.hudi",
   "hoodie.datasource.hive_sync.use_jdbc": "false",
   "hoodie.consistency.check.enabled": "true",
   "hoodie.datasource.hive_sync.database": DB_NAME,
   "hoodie.datasource.hive_sync.enable": "true",
   "hoodie.datasource.hive_sync.support_timestamp": "true",
}
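
For context, a minimal PySpark write using the config above might look like the sketch below; the source DataFrame, table name, and S3 target path are placeholders rather than values from the original report:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://my-bucket/input/")  # placeholder source data

# Assumes hudiConfig from above, with its <column>, _PRIMARY_KEY_COLUMN and
# DB_NAME placeholders filled in.
(
    df.write.format("hudi")
    .options(**hudiConfig)
    .option("hoodie.table.name", "my_unpartitioned_table")                  # placeholder table name
    .option("hoodie.datasource.hive_sync.table", "my_unpartitioned_table")
    .mode("append")
    .save("s3://my-bucket/hudi/my_unpartitioned_table/")                    # placeholder base path
)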

To Reproduce

Steps to reproduce the behavior:

  1. Run hudi with hive integration
  2. Try to create an unpartitioned table with the config specified above

Expected behavior

The table should be created without throwing the exception, and without any partition or default partition path.

Environment Description

  • Hudi version : 0.6.0, 0.7.0 and 0.8.0

  • Spark version : 2.4.7

  • Hive version : Aws glue data catalog integration on EMR

  • Hadoop version : Amazon Hadoop distribution

  • Storage (HDFS/S3/GCS..) : S3

  • Running on Docker? (yes/no) : no

Stacktrace

org.apache.hudi.hive.HoodieHiveSyncException: Failed to get update last commit time synced to 20210407181606
   at org.apache.hudi.hive.HoodieHiveClient.updateLastCommitTimeSynced(HoodieHiveClient.java:496)
   at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:150)
   at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:94)
   at org.apache.hudi.HoodieSparkSqlWriter$.org$apache$hudi$HoodieSparkSqlWriter$$syncHive(HoodieSparkSqlWriter.scala:355)
   at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$metaSync$2.apply(HoodieSparkSqlWriter.scala:403)
   at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$metaSync$2.apply(HoodieSparkSqlWriter.scala:399)
   at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
   at org.apache.hudi.HoodieSparkSqlWriter$.metaSync(HoodieSparkSqlWriter.scala:399)
   at org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:460)
   at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:217)
   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:134)
   at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
   at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:173)
   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:169)
   at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:197)
   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
   at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:194)
   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:169)
   at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:114)
   at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:112)
   at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:696)
   at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:696)
   at org.apache.spark.sql.execution.SQLExecution$.org$apache$spark$sql$execution$SQLExecution$$executeQuery$1(SQLExecution.scala:83)
   at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1$$anonfun$apply$1.apply(SQLExecution.scala:94)
   at org.apache.spark.sql.execution.QueryExecutionMetrics$.withMetrics(QueryExecutionMetrics.scala:141)
   at org.apache.spark.sql.execution.SQLExecution$.org$apache$spark$sql$execution$SQLExecution$$withMetrics(SQLExecution.scala:178)
   at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:93)
   at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:200)
   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:92)
   at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:696)
   at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:305)
   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:291)
   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:249)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:498)
   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
   at py4j.Gateway.invoke(Gateway.java:282)
   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
   at py4j.commands.CallCommand.execute(CallCommand.java:79)
   at py4j.GatewayConnection.run(GatewayConnection.java:238)
   at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: Can not create a Path from an empty string
   at org.apache.hadoop.fs.Path.checkPathArg(Path.java:168)
   at org.apache.hadoop.fs.Path.<init>(Path.java:180)
   at org.apache.hadoop.hive.metastore.Warehouse.getDatabasePath(Warehouse.java:172)
   at org.apache.hadoop.hive.metastore.Warehouse.getTablePath(Warehouse.java:184)
   at org.apache.hadoop.hive.metastore.Warehouse.getFileStatusesForUnpartitionedTable(Warehouse.java:520)
   at org.apache.hadoop.hive.metastore.MetaStoreUtils.updateUnpartitionedTableStatsFast(MetaStoreUtils.java:180)
   at com.amazonaws.glue.shims.AwsGlueSparkHiveShims.updateTableStatsFast(AwsGlueSparkHiveShims.java:62)
   at com.amazonaws.glue.catalog.metastore.GlueMetastoreClientDelegate.alterTable(GlueMetastoreClientDelegate.java:552)
   at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.alter_table(AWSCatalogMetastoreClient.java:400)
   at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.alter_table(AWSCatalogMetastoreClient.java:385)
   at org.apache.hudi.hive.HoodieHiveClient.updateLastCommitTimeSynced(HoodieHiveClient.java:494)
   ... 46 more

aditiwari01 commented Apr 10, 2021

Issue (#2801) might be a duplicate.

However, while creating an unpartitioned table, my dataframe.write succeeds, but I am not able to query the data via Hive. Spark reads are working fine for me, though. (Testing via spark-shell, and I am using JDBC to connect to Hive.)

n3nash commented Apr 13, 2021

@vansimonsen Can you check the issue that @aditiwari01 is pointing to, and verify that you are using the correct KeyGenerator as well as PartitionValueExtractor (see https://hudi.apache.org/docs/configurations.html#HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY)?

Additionally, this looks like the basePath might not have been correctly registered with Glue. Let me know after you check these configs; if they don't work, this may be a legitimate bug.

@ismailsimsek

It might be related to a missing Glue database S3 path; the field is named "Amazon S3 path" (Lake Formation) or "Location" (Glue) in the AWS console.

As far as I can see, at one point the code tries to construct a path like getDatabasePath + tableName.
In my case it was creating s3://MyBucketMytable because of the missing / at the end of the database Location.
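
A standalone sketch of the failure mode described above (illustration only, not the actual Hive metastore code in Warehouse.getTablePath): an empty database Location produces the "empty string" Path error, and a Location without a trailing / yields a malformed table path when the table name is appended directly.

def table_path(database_location: str, table_name: str) -> str:
    # Mirrors the reported error when the Glue database has no Location set.
    if not database_location:
        raise ValueError("Can not create a Path from an empty string")
    # Naive concatenation, as described in the comment above.
    return database_location + table_name

print(table_path("s3://MyBucket", "Mytable"))   # s3://MyBucketMytable (missing separator)
print(table_path("s3://MyBucket/", "Mytable"))  # s3://MyBucket/Mytable (as intended)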

n3nash commented Apr 15, 2021

@ismailsimsek Are you saying it worked after you fixed the databasePath / location in your Glue metastore to include the trailing /? Is the / always expected at the end of the path? If yes, we can probably put that fix into the Hudi hive sync.

@vansimonsen Can you check if this is the root cause for you?
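
If the missing trailing slash turns out to be the root cause, the guard suggested above could amount to normalizing the database location before the table name is appended; a minimal sketch of that idea (a standalone helper, not actual Hudi hive-sync code):

def normalized_table_path(database_location: str, table_name: str) -> str:
    # Ensure exactly one "/" between the database location and the table name.
    return database_location.rstrip("/") + "/" + table_name

assert normalized_table_path("s3://MyBucket", "Mytable") == "s3://MyBucket/Mytable"
assert normalized_table_path("s3://MyBucket/", "Mytable") == "s3://MyBucket/Mytable"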

@n3nash n3nash added this to In progress in GI Tracker Board Apr 22, 2021
n3nash commented Jun 4, 2021

@ismailsimsek @vansimonsen Closing this due to inactivity, please re-open it or open a new one if you need further assistance.

@n3nash n3nash closed this as completed Jun 4, 2021
GI Tracker Board automation moved this from In progress to Done Jun 4, 2021
@pranotishanbhag

I am facing the same issue. Can you please share the fix? I am using Hudi version 0.8.
