[SUPPORT] "Failed to get update last commit time synced to 20200804071144" #1909

Closed
mingujotemp opened this issue Aug 4, 2020 · 5 comments

Comments


mingujotemp commented Aug 4, 2020

Describe the problem you faced

Hudi 0.5.0 (running on EMR)

I encounter org.apache.hudi.hive.HoodieHiveSyncException: Failed to get update last commit time synced to 20200804071144 when I try to write a non-partitioned table to Glue (S3) using Hudi.

To Reproduce

Steps to reproduce the behavior:

  1. create a pyspark dataframe
  2. Write a new df by running it with the following options:
hudi_options = {
  'hoodie.table.name': tableName,
  'hoodie.datasource.write.recordkey.field': 'id',
  'hoodie.index.type': 'BLOOM',
  'hoodie.datasource.write.partitionpath.field': '',
  'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.NonpartitionedKeyGenerator',
  'hoodie.datasource.write.table.name': tableName,
  'hoodie.datasource.write.operation': 'upsert',
  'hoodie.datasource.write.precombine.field': 'updated_at',
  'hoodie.upsert.shuffle.parallelism': 2, 
  'hoodie.insert.shuffle.parallelism': 2,
  'hoodie.bulkinsert.shuffle.parallelism': 10,
  'hoodie.datasource.hive_sync.database': databaseName,
  'hoodie.datasource.hive_sync.table': tableName,
  'hoodie.datasource.hive_sync.enable': 'true',
  'hoodie.datasource.hive_sync.assume_date_partitioning': 'false',
  'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.NonPartitionedExtractor',
  'hoodie.datasource.hive_sync.partition_fields': '',
}
df.write.format("org.apache.hudi"). \
  options(**hudi_options). \
  mode("overwrite"). \
  save(basePath)

Expected behavior

A clear and concise description of what you expected to happen.

Environment Description

  • Hudi version : 0.5.0

  • Spark version : 2.4.4

  • Hive version : 3.1.2 (Using Glue)

  • Hadoop version : 3.2.1-amzn-0

  • Storage (HDFS/S3/GCS..) : S3

  • Running on Docker? (yes/no) : no

Additional context

Using the following jars:
hudi-spark-bundle-0.5.0-incubating-amzn-1.jar
hudi-hive-bundle-0.5.0-incubating-amzn-1.jar
hudi-hadoop-mr-bundle-0.5.0-incubating-amzn-1.jar
spark-avro_2.12-2.4.4.jar
installed on EMR 6.0.0

Stacktrace

20/08/04 07:11:50 WARN HiveConf: HiveConf of name hive.server2.thrift.url does not exist
Traceback (most recent call last):
  File "<stdin>", line 5, in <module>
  File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 738, in save
    self._jwrite.save(path)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o273.save.
: org.apache.hudi.hive.HoodieHiveSyncException: Failed to get update last commit time synced to 20200804071144
	at org.apache.hudi.hive.HoodieHiveClient.updateLastCommitTimeSynced(HoodieHiveClient.java:667)
	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:109)
	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:67)
	at org.apache.hudi.HoodieSparkSqlWriter$.syncHive(HoodieSparkSqlWriter.scala:236)
	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:169)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:156)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:83)
	at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:676)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:84)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:165)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: Can not create a Path from an empty string
	at org.apache.hadoop.fs.Path.checkPathArg(Path.java:172)
	at org.apache.hadoop.fs.Path.<init>(Path.java:184)
	at org.apache.hadoop.hive.metastore.Warehouse.getDatabasePath(Warehouse.java:172)
	at org.apache.hadoop.hive.metastore.Warehouse.getTablePath(Warehouse.java:184)
	at org.apache.hadoop.hive.metastore.Warehouse.getFileStatusesForUnpartitionedTable(Warehouse.java:520)
	at org.apache.hadoop.hive.metastore.MetaStoreUtils.updateUnpartitionedTableStatsFast(MetaStoreUtils.java:180)
	at com.amazonaws.glue.shims.AwsGlueSparkHiveShims.updateTableStatsFast(AwsGlueSparkHiveShims.java:75)
	at com.amazonaws.glue.catalog.metastore.GlueMetastoreClientDelegate.alterTable(GlueMetastoreClientDelegate.java:538)
	at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.alter_table(AWSCatalogMetastoreClient.java:374)
	at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.alter_table(AWSCatalogMetastoreClient.java:359)
	at org.apache.hudi.hive.HoodieHiveClient.updateLastCommitTimeSynced(HoodieHiveClient.java:665)
	... 35 more

bvaradar commented Aug 4, 2020

Traceback (most recent call last):
  File "<stdin>", line 5, in <module>
  File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 738, in save
    self._jwrite.save(path)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o273.save.

It looks like hive-conf is not set correctly.

@mingujotemp (Author)

@bvaradar could you elaborate more? Which part of hive-conf are you describing? Is it hive-conf.xml on EMR or the Hive configuration for Hudi?


bvaradar commented Aug 6, 2020

It appears that hive-site.xml may not be set correctly. The Hive metastore client is not able to find hive.server2.thrift.url in the config.
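
For illustration, a minimal sketch of pointing the Hive sync at an explicit HiveServer2 endpoint instead of relying on hive-site.xml being picked up; the hoodie.datasource.hive_sync.jdbcurl / username / password option keys and the placeholder endpoint are assumptions about the 0.5.x datasource options, not settings confirmed in this thread:

# Sketch: pass the HiveServer2 JDBC URL to Hudi's Hive sync explicitly so it
# does not depend on hive.server2.thrift.url being resolvable from hive-site.xml.
# The endpoint is a placeholder; on EMR, HiveServer2 usually listens on port
# 10000 of the master node.
hudi_options.update({
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.jdbcurl': 'jdbc:hive2://<emr-master-dns>:10000',  # assumed option key
    'hoodie.datasource.hive_sync.username': 'hive',  # assumed credentials
    'hoodie.datasource.hive_sync.password': 'hive',
})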

@bvaradar (Contributor)

Closing this ticket. Please reopen if this issue is specific to Hudi.

@ismailsimsek

Probably related to #2797 (comment).
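
The Caused by frame above (Warehouse.getDatabasePath throwing "Can not create a Path from an empty string") suggests the Glue database may have no location set. A minimal sketch of checking and setting one with boto3, assuming that is the cause; the database name and S3 URI are placeholders, not values from this issue:

import boto3

# Sketch: give the Glue database a LocationUri so the Hive metastore Warehouse
# code has a non-empty base path when it resolves the table path during
# alter_table (the call that fails in the stack trace above).
glue = boto3.client('glue')
db = glue.get_database(Name='my_database')['Database']
if not db.get('LocationUri'):
    glue.update_database(
        Name='my_database',
        DatabaseInput={
            'Name': 'my_database',
            'LocationUri': 's3://my-bucket/warehouse/my_database/',
        },
    )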
