Hoodie 0.4.7: Error upserting bucketType UPDATE for partition #, No value present #764

Closed
jackwang2 opened this issue Jun 27, 2019 · 26 comments

jackwang2 commented Jun 27, 2019

19/06/27 08:38:26 WARN TaskSetManager: Lost task 5.0 in stage 26.0 (TID 13747, ip-172-19-101-41, executor 0): com.uber.hoodie.exception.HoodieUpsertException: Error upserting bucketType UPDATE for partition :5
	at com.uber.hoodie.table.HoodieCopyOnWriteTable.handleUpsertPartition(HoodieCopyOnWriteTable.java:271)
	at com.uber.hoodie.HoodieWriteClient.lambda$upsertRecordsInternal$7ef77fd$1(HoodieWriteClient.java:442)
	at org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
	at org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:847)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:847)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
	at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
	at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1109)
	at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1083)
	at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1018)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1083)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:809)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.NoSuchElementException: No value present
	at com.uber.hoodie.common.util.Option.get(Option.java:112)
	at com.uber.hoodie.io.HoodieMergeHandle.&lt;init&gt;(HoodieMergeHandle.java:71)
	at com.uber.hoodie.table.HoodieCopyOnWriteTable.getUpdateHandle(HoodieCopyOnWriteTable.java:226)
	at com.uber.hoodie.table.HoodieCopyOnWriteTable.handleUpdate(HoodieCopyOnWriteTable.java:180)
	at com.uber.hoodie.table.HoodieCopyOnWriteTable.handleUpsertPartition(HoodieCopyOnWriteTable.java:263)
	... 28 more

19/06/27 08:38:26 INFO TaskSetManager: Starting task 5.1 in stage 26.0 (TID 13749, ip-172-19-102-145, executor 4, partition 5, PROCESS_LOCAL, 7653 bytes)
19/06/27 08:38:26 WARN TaskSetManager: Lost task 4.0 in stage 26.0 (TID 13746, ip-172-19-102-145, executor 4): com.uber.hoodie.exception.HoodieUpsertException: Error upserting bucketType UPDATE for partition :4
	at com.uber.hoodie.table.HoodieCopyOnWriteTable.handleUpsertPartition(HoodieCopyOnWriteTable.java:271)
	at com.uber.hoodie.HoodieWriteClient.lambda$upsertRecordsInternal$7ef77fd$1(HoodieWriteClient.java:442)
	at org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
	at org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:847)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:847)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
	at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
	at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1109)
	at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1083)
	at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1018)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1083)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:809)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.NoSuchElementException: No value present
	at com.uber.hoodie.common.util.Option.get(Option.java:112)
	at com.uber.hoodie.io.HoodieMergeHandle.&lt;init&gt;(HoodieMergeHandle.java:71)
	at com.uber.hoodie.table.HoodieCopyOnWriteTable.getUpdateHandle(HoodieCopyOnWriteTable.java:226)
	at com.uber.hoodie.table.HoodieCopyOnWriteTable.handleUpdate(HoodieCopyOnWriteTable.java:180)
	at com.uber.hoodie.table.HoodieCopyOnWriteTable.handleUpsertPartition(HoodieCopyOnWriteTable.java:263)
	... 28 more

19/06/27 08:38:26 INFO TaskSetManager: Starting task 4.1 in stage 26.0 (TID 13750, ip-172-19-103-242, executor 2, partition 4, PROCESS_LOCAL, 7653 bytes)
19/06/27 08:38:26 WARN TaskSetManager: Lost task 6.0 in stage 26.0 (TID 13748, ip-172-19-102-162, executor 3): com.uber.hoodie.exception.HoodieUpsertException: Error upserting bucketType UPDATE for partition :6
	at com.uber.hoodie.table.HoodieCopyOnWriteTable.handleUpsertPartition(HoodieCopyOnWriteTable.java:271)
	at com.uber.hoodie.HoodieWriteClient.lambda$upsertRecordsInternal$7ef77fd$1(HoodieWriteClient.java:442)
	at org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
	at org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:847)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:847)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
	at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
	at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1109)
	at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1083)
	at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1018)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1083)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:809)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.NoSuchElementException: No value present
	at com.uber.hoodie.common.util.Option.get(Option.java:112)
	at com.uber.hoodie.io.HoodieMergeHandle.&lt;init&gt;(HoodieMergeHandle.java:71)
	at com.uber.hoodie.table.HoodieCopyOnWriteTable.getUpdateHandle(HoodieCopyOnWriteTable.java:226)
	at com.uber.hoodie.table.HoodieCopyOnWriteTable.handleUpdate(HoodieCopyOnWriteTable.java:180)
	at com.uber.hoodie.table.HoodieCopyOnWriteTable.handleUpsertPartition(HoodieCopyOnWriteTable.java:263)
	... 28 more

I use AWS S3 as storage, and the relevant piece of code looks like the below:

      df.write
        .format("com.uber.hoodie")
        .mode(SaveMode.Append)
        .option(HoodieWriteConfig.TABLE_NAME, tableName)
        .option(HoodieIndexConfig.INDEX_TYPE_PROP, HoodieIndex.IndexType.GLOBAL_BLOOM.name)
        .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, recordKey)
        .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, partitionCol)
        .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
        .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, storageType)
        .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, preCombineCol)
        .option("hoodie.consistency.check.enabled", "true")
        .option("hoodie.parquet.small.file.limit", 1024 * 1024 * 128)
        .save(tgtFilePath)

I would highly appreciate any ideas.

jackwang2 changed the title from "Hoodie 0.4.7: Error upserting bucketType UPDATE for partition, No value present" to "Hoodie 0.4.7: Error upserting bucketType UPDATE for partition #, No value present" on Jun 27, 2019
@vinothchandar
Member

Looks similar to https://issues.apache.org/jira/browse/HUDI-116, although 0.4.7 should already have the fix. Seems like a tagging issue again? Are you able to trim the data down and give me a reproducible case? (A sketch of the kind of repro that would help is below.)

And for context: you upgraded to 0.4.7 and it started failing right away? Were things working fine in the prior version (can you share which one)?

@bvaradar @n3nash as well..
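
For reference, a trimmed-down reproduction along these lines would be something like the sketch below (run from spark-shell). The target path and column names are made up, and the write options simply mirror the snippet earlier in the thread, so treat this as an outline rather than an exact recipe:

    import spark.implicits._
    import org.apache.spark.sql.SaveMode
    import com.uber.hoodie.DataSourceWriteOptions
    import com.uber.hoodie.config.{HoodieIndexConfig, HoodieWriteConfig}
    import com.uber.hoodie.index.HoodieIndex

    // Hypothetical target location; any empty S3/HDFS path works for a repro.
    val basePath = "s3://some-bucket/hoodie_repro"

    def writeBatch(df: org.apache.spark.sql.DataFrame): Unit =
      df.write
        .format("com.uber.hoodie")
        .mode(SaveMode.Append)
        .option(HoodieWriteConfig.TABLE_NAME, "hoodie_repro")
        .option(HoodieIndexConfig.INDEX_TYPE_PROP, HoodieIndex.IndexType.GLOBAL_BLOOM.name)
        .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "id")
        .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "dt")
        .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "ts")
        .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
        .save(basePath)

    // First batch inserts two keys; second batch upserts the same keys, with one record
    // moving to a different partition value (the case where tagging problems tend to show up).
    val batch1 = Seq((1, "2019-06-26", 1L, "a"),  (2, "2019-06-26", 1L, "b")).toDF("id", "dt", "ts", "v")
    val batch2 = Seq((1, "2019-06-27", 2L, "a2"), (2, "2019-06-26", 2L, "b2")).toDF("id", "dt", "ts", "v")
    writeBatch(batch1)
    writeBatch(batch2)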


jackwang2 commented Jun 28, 2019

@vinothchandar Thanks for looking into this. Yes, it seems version 0.4.7 is not compatible with 0.4.6; it throws the exception below. For now I am using 0.4.7 and writing data to a new location to work around the incompatibility.

19/06/25 12:37:51 INFO HoodieCommitArchiveLog: Archiving instants [[20190618193930_cleanCOMPLETED], [20190618201752cleanCOMPLETED], [20190618204851cleanCOMPLETED], [20190618212741cleanCOMPLETED], [20190618215856cleanCOMPLETED], [20190618223733cleanCOMPLETED], [20190618231034cleanCOMPLETED], [20190618234543cleanCOMPLETED], [20190619002337cleanCOMPLETED], [20190619005833cleanCOMPLETED], [20190619013959clean_COMPLETED]][WARNING] Avro: Invalid default for field hoodieCommitMetadata: "null" not a ["null",{"type":"record","name":"HoodieCommitMetadata","namespace":"com.uber.hoodie.avro.model","fields":[{"name":"partitionToWriteStats","type":["null",{"type":"map","values":{"type":"array","items":{"type":"record","name":"HoodieWriteStat","fields":[{"name":"fileId","type":["null",{"type":"string","avro.java.string":"String"}],"default":null},{"name":"path","type":["null",{"type":"string","avro.java.string":"String"}],"default":null},{"name":"prevCommit","type":["null",{"type":"string","avro.java.string":"String"}],"default":null},{"name":"numWrites","type":["null","long"],"default":null},{"name":"numDeletes","type":["null","long"],"default":null},{"name":"numUpdateWrites","type":["null","long"],"default":null},{"name":"totalWriteBytes","type":["null","long"],"default":null},{"name":"totalWriteErrors","type":["null","long"],"default":null},{"name":"partitionPath","type":["null",{"type":"string","avro.java.string":"String"}],"default":null},{"name":"totalLogRecords","type":["null","long"],"default":null},{"name":"totalLogFiles","type":["null","long"],"default":null},{"name":"totalUpdatedRecordsCompacted","type":["null","long"],"default":null},{"name":"numInserts","type":["null","long"],"default":null},{"name":"totalLogBlocks","type":["null","long"],"default":null},{"name":"totalCorruptLogBlock","type":["null","long"],"default":null},{"name":"totalRollbackBlocks","type":["null","long"],"default":null},{"name":"fileSizeInBytes","type":["null","long"],"default":null}]}},"avro.java.string":"String"}]},{"name":"extraMetadata","type":["null",{"type":"map","values":{"type":"string","avro.java.string":"String"},"avro.java.string":"String"}]}]}][WARNING] Avro: Invalid default for field hoodieCleanMetadata: "null" not a ["null",{"type":"record","name":"HoodieCleanMetadata","namespace":"com.uber.hoodie.avro.model","fields":[{"name":"startCleanTime","type":{"type":"string","avro.java.string":"String"}},{"name":"timeTakenInMillis","type":"long"},{"name":"totalFilesDeleted","type":"int"},{"name":"earliestCommitToRetain","type":{"type":"string","avro.java.string":"String"}},{"name":"partitionMetadata","type":{"type":"map","values":{"type":"record","name":"HoodieCleanPartitionMetadata","fields":[{"name":"partitionPath","type":{"type":"string","avro.java.string":"String"}},{"name":"policy","type":{"type":"string","avro.java.string":"String"}},{"name":"deletePathPatterns","type":{"type":"array","items":{"type":"string","avro.java.string":"String"}}},{"name":"successDeleteFiles","type":{"type":"array","items":{"type":"string","avro.java.string":"String"}}},{"name":"failedDeleteFiles","type":{"type":"array","items":{"type":"string","avro.java.string":"String"}}}]},"avro.java.string":"String"}}]}][WARNING] Avro: Invalid default for field hoodieCompactionMetadata: "null" not a 
["null",{"type":"record","name":"HoodieCompactionMetadata","namespace":"com.uber.hoodie.avro.model","fields":[{"name":"partitionToCompactionWriteStats","type":["null",{"type":"map","values":{"type":"array","items":{"type":"record","name":"HoodieCompactionWriteStat","fields":[{"name":"partitionPath","type":["null",{"type":"string","avro.java.string":"String"}]},{"name":"totalLogRecords","type":["null","long"]},{"name":"totalLogFiles","type":["null","long"]},{"name":"totalUpdatedRecordsCompacted","type":["null","long"]},{"name":"hoodieWriteStat","type":["null",{"type":"record","name":"HoodieWriteStat","fields":[{"name":"fileId","type":["null",{"type":"string","avro.java.string":"String"}],"default":null},{"name":"path","type":["null",{"type":"string","avro.java.string":"String"}],"default":null},{"name":"prevCommit","type":["null",{"type":"string","avro.java.string":"String"}],"default":null},{"name":"numWrites","type":["null","long"],"default":null},{"name":"numDeletes","type":["null","long"],"default":null},{"name":"numUpdateWrites","type":["null","long"],"default":null},{"name":"totalWriteBytes","type":["null","long"],"default":null},{"name":"totalWriteErrors","type":["null","long"],"default":null},{"name":"partitionPath","type":["null",{"type":"string","avro.java.string":"String"}],"default":null},{"name":"totalLogRecords","type":["null","long"],"default":null},{"name":"totalLogFiles","type":["null","long"],"default":null},{"name":"totalUpdatedRecordsCompacted","type":["null","long"],"default":null},{"name":"numInserts","type":["null","long"],"default":null},{"name":"totalLogBlocks","type":["null","long"],"default":null},{"name":"totalCorruptLogBlock","type":["null","long"],"default":null},{"name":"totalRollbackBlocks","type":["null","long"],"default":null},{"name":"fileSizeInBytes","type":["null","long"],"default":null}]}]}]}},"avro.java.string":"String"}]}]}][WARNING] Avro: Invalid default for field hoodieRollbackMetadata: "null" not a ["null",{"type":"record","name":"HoodieRollbackMetadata","namespace":"com.uber.hoodie.avro.model","fields":[{"name":"startRollbackTime","type":{"type":"string","avro.java.string":"String"}},{"name":"timeTakenInMillis","type":"long"},{"name":"totalFilesDeleted","type":"int"},{"name":"commitsRollback","type":{"type":"array","items":{"type":"string","avro.java.string":"String"}}},{"name":"partitionMetadata","type":{"type":"map","values":{"type":"record","name":"HoodieRollbackPartitionMetadata","fields":[{"name":"partitionPath","type":{"type":"string","avro.java.string":"String"}},{"name":"successDeleteFiles","type":{"type":"array","items":{"type":"string","avro.java.string":"String"}}},{"name":"failedDeleteFiles","type":{"type":"array","items":{"type":"string","avro.java.string":"String"}}}]},"avro.java.string":"String"}}]}][WARNING] Avro: Invalid default for field hoodieSavePointMetadata: "null" not a ["null",{"type":"record","name":"HoodieSavepointMetadata","namespace":"com.uber.hoodie.avro.model","fields":[{"name":"savepointedBy","type":{"type":"string","avro.java.string":"String"}},{"name":"savepointedAt","type":"long"},{"name":"comments","type":{"type":"string","avro.java.string":"String"}},{"name":"partitionMetadata","type":{"type":"map","values":{"type":"record","name":"HoodieSavepointPartitionMetadata","fields":[{"name":"partitionPath","type":{"type":"string","avro.java.string":"String"}},{"name":"savepointDataFile","type":{"type":"array","items":{"type":"string","avro.java.string":"String"}}}]},"avro.java.string":"String"}}]}]
19/06/25 12:37:51 INFO HoodieCommitArchiveLog: Wrapper schema {"type":"record","name":"HoodieArchivedMetaEntry","namespace":"com.uber.hoodie.avro.model","fields":[{"name":"hoodieCommitMetadata","type":["null",{"type":"record","name":"HoodieCommitMetadata","fields":[{"name":"partitionToWriteStats","type":["null",{"type":"map","values":{"type":"array","items":{"type":"record","name":"HoodieWriteStat","fields":[{"name":"fileId","type":["null",{"type":"string","avro.java.string":"String"}],"default":null},{"name":"path","type":["null",{"type":"string","avro.java.string":"String"}],"default":null},{"name":"prevCommit","type":["null",{"type":"string","avro.java.string":"String"}],"default":null},{"name":"numWrites","type":["null","long"],"default":null},{"name":"numDeletes","type":["null","long"],"default":null},{"name":"numUpdateWrites","type":["null","long"],"default":null},{"name":"totalWriteBytes","type":["null","long"],"default":null},{"name":"totalWriteErrors","type":["null","long"],"default":null},{"name":"partitionPath","type":["null",{"type":"string","avro.java.string":"String"}],"default":null},{"name":"totalLogRecords","type":["null","long"],"default":null},{"name":"totalLogFiles","type":["null","long"],"default":null},{"name":"totalUpdatedRecordsCompacted","type":["null","long"],"default":null},{"name":"numInserts","type":["null","long"],"default":null},{"name":"totalLogBlocks","type":["null","long"],"default":null},{"name":"totalCorruptLogBlock","type":["null","long"],"default":null},{"name":"totalRollbackBlocks","type":["null","long"],"default":null},{"name":"fileSizeInBytes","type":["null","long"],"default":null}]}},"avro.java.string":"String"}]},{"name":"extraMetadata","type":["null",{"type":"map","values":{"type":"string","avro.java.string":"String"},"avro.java.string":"String"}]}]}],"default":"null"},{"name":"hoodieCleanMetadata","type":["null",{"type":"record","name":"HoodieCleanMetadata","fields":[{"name":"startCleanTime","type":{"type":"string","avro.java.string":"String"}},{"name":"timeTakenInMillis","type":"long"},{"name":"totalFilesDeleted","type":"int"},{"name":"earliestCommitToRetain","type":{"type":"string","avro.java.string":"String"}},{"name":"partitionMetadata","type":{"type":"map","values":{"type":"record","name":"HoodieCleanPartitionMetadata","fields":[{"name":"partitionPath","type":{"type":"string","avro.java.string":"String"}},{"name":"policy","type":{"type":"string","avro.java.string":"String"}},{"name":"deletePathPatterns","type":{"type":"array","items":{"type":"string","avro.java.string":"String"}}},{"name":"successDeleteFiles","type":{"type":"array","items":{"type":"string","avro.java.string":"String"}}},{"name":"failedDeleteFiles","type":{"type":"array","items":{"type":"string","avro.java.string":"String"}}}]},"avro.java.string":"String"}}]}],"default":"null"},{"name":"hoodieCompactionMetadata","type":["null",{"type":"record","name":"HoodieCompactionMetadata","fields":[{"name":"partitionToCompactionWriteStats","type":["null",{"type":"map","values":{"type":"array","items":{"type":"record","name":"HoodieCompactionWriteStat","fields":[{"name":"partitionPath","type":["null",{"type":"string","avro.java.string":"String"}]},{"name":"totalLogRecords","type":["null","long"]},{"name":"totalLogFiles","type":["null","long"]},{"name":"totalUpdatedRecordsCompacted","type":["null","long"]},{"name":"hoodieWriteStat","type":["null","HoodieWriteStat"]}]}},"avro.java.string":"String"}]}]}],"default":"null"},{"name":"hoodieRollbackMetadata","type":["null",{"type":"record","name
":"HoodieRollbackMetadata","fields":[{"name":"startRollbackTime","type":{"type":"string","avro.java.string":"String"}},{"name":"timeTakenInMillis","type":"long"},{"name":"totalFilesDeleted","type":"int"},{"name":"commitsRollback","type":{"type":"array","items":{"type":"string","avro.java.string":"String"}}},{"name":"partitionMetadata","type":{"type":"map","values":{"type":"record","name":"HoodieRollbackPartitionMetadata","fields":[{"name":"partitionPath","type":{"type":"string","avro.java.string":"String"}},{"name":"successDeleteFiles","type":{"type":"array","items":{"type":"string","avro.java.string":"String"}}},{"name":"failedDeleteFiles","type":{"type":"array","items":{"type":"string","avro.java.string":"String"}}}]},"avro.java.string":"String"}}]}],"default":"null"},{"name":"hoodieSavePointMetadata","type":["null",{"type":"record","name":"HoodieSavepointMetadata","fields":[{"name":"savepointedBy","type":{"type":"string","avro.java.string":"String"}},{"name":"savepointedAt","type":"long"},{"name":"comments","type":{"type":"string","avro.java.string":"String"}},{"name":"partitionMetadata","type":{"type":"map","values":{"type":"record","name":"HoodieSavepointPartitionMetadata","fields":[{"name":"partitionPath","type":{"type":"string","avro.java.string":"String"}},{"name":"savepointDataFile","type":{"type":"array","items":{"type":"string","avro.java.string":"String"}}}]},"avro.java.string":"String"}}]}],"default":"null"},{"name":"commitTime","type":["null",{"type":"string","avro.java.string":"String"}]},{"name":"actionType","type":["null",{"type":"string","avro.java.string":"String"}]}]}
Exception in thread "main" java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:65)
	at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: com.uber.hoodie.exception.HoodieCommitException: Failed to archive commits
	at com.uber.hoodie.io.HoodieCommitArchiveLog.archive(HoodieCommitArchiveLog.java:254)
	at com.uber.hoodie.io.HoodieCommitArchiveLog.archiveIfRequired(HoodieCommitArchiveLog.java:117)
	at com.uber.hoodie.HoodieWriteClient.commit(HoodieWriteClient.java:531)
	at com.uber.hoodie.HoodieWriteClient.commit(HoodieWriteClient.java:491)
	at com.uber.hoodie.HoodieWriteClient.commit(HoodieWriteClient.java:482)
	at com.uber.hoodie.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:155)
	at com.uber.hoodie.DefaultSource.createRelation(DefaultSource.scala:91)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)
	at com.vungle.malygos.dedup.SparkMain$$anonfun$run$2.apply(SparkMain.scala:133)
	at com.vungle.malygos.dedup.SparkMain$$anonfun$run$2.apply(SparkMain.scala:99)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
	at com.vungle.malygos.dedup.SparkMain$.run(SparkMain.scala:99)
	at com.vungle.malygos.BoilerplateSparkMain$class.main(Boilerplate.scala:861)
	at com.vungle.malygos.dedup.SparkMain$.main(SparkMain.scala:22)
	at com.vungle.malygos.dedup.SparkMain.main(SparkMain.scala)
	... 6 more
Caused by: java.io.IOException: Not an Avro data file
	at org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50)
	at com.uber.hoodie.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:215)
	at com.uber.hoodie.io.HoodieCommitArchiveLog.convertToAvroRecord(HoodieCommitArchiveLog.java:279)
	at com.uber.hoodie.io.HoodieCommitArchiveLog.archive(HoodieCommitArchiveLog.java:247)
	... 41 more
19/06/25 12:37:52 INFO SparkContext: Invoking stop() from shutdown hook

@vinothchandar
Member

@jackwang2 there should not be any compatibility issues between 0.4.6 and 0.4.7.
So there are two issues you are facing?

  • Do you still run into the No value present issue?
  • I will look at this new stack trace later today.


amarnathv9 commented Jul 10, 2019

I am facing a similar issue while creating MOR tables. Please take a look.

ERROR Log :

 spark-submit --master yarn  --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer `ls /mapr/user/avenka23/hoodie/incubator-hudi/packaging/hoodie-utilities-bundle/target/hoodie-utilities-bundle*-SNAPSHOT.jar`   --props /user/avenka23/delta-streamer/config/dfs-source_no_partition.properties   --schemaprovider-class com.uber.hoodie.utilities.schema.FilebasedSchemaProvider   --source-class com.uber.hoodie.utilities.sources.JsonDFSSource   --source-ordering-field ts   --target-base-path /........../stock_ticks_cow_no_part_DEMO_MR --target-table stock_ticks_cow_no_part_DEMO_MR  --storage-type MERGE_ON_READ --key-generator-class com.uber.hoodie.NonpartitionedKeyGenerator
19/07/09 22:01:15 WARN SchedulerConfGenerator: Job Scheduling Configs will not be in effect as spark.scheduler.mode is not set to FAIR at instatiation time. Continuing without scheduling configs
19/07/09 22:01:20 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
ERROR StatusLogger No log4j2 configuration file found. Using default configuration: logging only errors to the console.
19/07/09 22:01:35 WARN SparkContext: Using an existing SparkContext; some configuration may not take effect.
19/07/09 22:01:38 WARN TaskSetManager: Lost task 1.0 in stage 1.0 (TID 2, dsfsdf.sdfsd.com, executor 2): java.lang.IllegalArgumentException: Can not create a Path from an empty string
        at org.apache.hadoop.fs.Path.checkPathArg(Path.java:130)
        at org.apache.hadoop.fs.Path.<init>(Path.java:138)
        at org.apache.hadoop.fs.Path.<init>(Path.java:92)
        at com.uber.hoodie.table.HoodieMergeOnReadTable.lambda$rollback$5(HoodieMergeOnReadTable.java:510)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
        at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
        at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
        at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
        at com.uber.hoodie.table.HoodieMergeOnReadTable.rollback(HoodieMergeOnReadTable.java:505)
        at com.uber.hoodie.table.HoodieMergeOnReadTable.lambda$rollback$328a965c$1(HoodieMergeOnReadTable.java:307)
        at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1040)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
        at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
        at scala.collection.AbstractIterator.to(Iterator.scala:1336)
        at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
        at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
        at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
        at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
        at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
        at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2069)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2069)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

[Stage 1:>                                                          (0 + 2) / 2]19/07/09 22:01:39 ERROR TaskSetManager: Task 1 in stage 1.0 failed 4 times; aborting job
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1.0 (TID 5, dbslt1835.uhc.com, executor 2): java.lang.IllegalArgumentException: Can not create a Path from an empty string
        at org.apache.hadoop.fs.Path.checkPathArg(Path.java:130)
        at org.apache.hadoop.fs.Path.<init>(Path.java:138)
        at org.apache.hadoop.fs.Path.<init>(Path.java:92)
        at com.uber.hoodie.table.HoodieMergeOnReadTable.lambda$rollback$5(HoodieMergeOnReadTable.java:510)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
        at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
        at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
        at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
        at com.uber.hoodie.table.HoodieMergeOnReadTable.rollback(HoodieMergeOnReadTable.java:505)
        at com.uber.hoodie.table.HoodieMergeOnReadTable.lambda$rollback$328a965c$1(HoodieMergeOnReadTable.java:307)
        at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1040)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
        at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
        at scala.collection.AbstractIterator.to(Iterator.scala:1336)
        at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
        at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
        at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
        at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
        at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
        at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2069)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2069)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
        at scala.Option.foreach(Option.scala:257)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2069)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2094)
        at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
        at org.apache.spark.api.java.JavaRDDLike$class.collect(JavaRDDLike.scala:361)
        at org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:45)
        at com.uber.hoodie.table.HoodieMergeOnReadTable.rollback(HoodieMergeOnReadTable.java:318)
        at com.uber.hoodie.HoodieWriteClient.doRollbackAndGetStats(HoodieWriteClient.java:887)
        at com.uber.hoodie.HoodieWriteClient.rollbackInternal(HoodieWriteClient.java:965)
        at com.uber.hoodie.HoodieWriteClient.rollback(HoodieWriteClient.java:776)
        at com.uber.hoodie.HoodieWriteClient.rollbackInflightCommits(HoodieWriteClient.java:1187)
        at com.uber.hoodie.HoodieWriteClient.startCommitWithTime(HoodieWriteClient.java:1053)
        at com.uber.hoodie.HoodieWriteClient.startCommit(HoodieWriteClient.java:1046)
        at com.uber.hoodie.utilities.deltastreamer.DeltaSync.startCommit(DeltaSync.java:404)
        at com.uber.hoodie.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:330)
        at com.uber.hoodie.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:227)
        at com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:125)
        at com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:289)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:780)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.IllegalArgumentException: Can not create a Path from an empty string
        at org.apache.hadoop.fs.Path.checkPathArg(Path.java:130)
        at org.apache.hadoop.fs.Path.<init>(Path.java:138)
        at org.apache.hadoop.fs.Path.<init>(Path.java:92)
        at com.uber.hoodie.table.HoodieMergeOnReadTable.lambda$rollback$5(HoodieMergeOnReadTable.java:510)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
        at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
        at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
        at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
        at com.uber.hoodie.table.HoodieMergeOnReadTable.rollback(HoodieMergeOnReadTable.java:505)
        at com.uber.hoodie.table.HoodieMergeOnReadTable.lambda$rollback$328a965c$1(HoodieMergeOnReadTable.java:307)
        at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1040)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
        at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
        at scala.collection.AbstractIterator.to(Iterator.scala:1336)
        at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
        at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
        at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
        at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
        at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
        at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2069)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2069)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

@vinothchandar
Member

@jackwang2 @amaranathv there are 3 issues here with different stack traces.

@jackwang2 For the first issue you reported: is that gone? Or are you waiting for the issue below to be fixed in 0.4.7 before trying again?

Caused by: java.util.NoSuchElementException: No value present
	at com.uber.hoodie.common.util.Option.get(Option.java:112)
	at com.uber.hoodie.io.HoodieMergeHandle.&lt;init&gt;(HoodieMergeHandle.java:71)
	at com.uber.hoodie.table.HoodieCopyOnWriteTable.getUpdateHandle(HoodieCopyOnWriteTable.java:226)
	at com.uber.hoodie.table.HoodieCopyOnWriteTable.handleUpdate(HoodieCopyOnWriteTable.java:180)
	at com.uber.hoodie.table.HoodieCopyOnWriteTable.handleUpsertPartition(HoodieCopyOnWriteTable.java:263)
	... 28 more

@jackwang2 Second issue (probably the same as what you mentioned on Slack?): are you able to reproduce it? https://github.com/apache/incubator-hudi/blob/hoodie-0.4.7/hoodie-client/src/main/java/com/uber/hoodie/io/HoodieCommitArchiveLog.java#L279 It seems related to archiving the CLEAN action. The schema for this has not changed in years, so it looks like some corruption. Are you able to print out new String(commitTimeline.getInstantDetails(hoodieInstant).get()) so we can see the bytes? (A sketch of how to do that follows the trace below.)

Caused by: java.io.IOException: Not an Avro data file
	at org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50)
	at com.uber.hoodie.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:215)
	at com.uber.hoodie.io.HoodieCommitArchiveLog.convertToAvroRecord(HoodieCommitArchiveLog.java:279)
	at com.uber.hoodie.io.HoodieCommitArchiveLog.archive(HoodieCommitArchiveLog.java:247)
	... 41 more
19/06/25 12:37:52 INFO SparkContext: Invoking stop() from shutdown hook
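
For reference, one way to print those bytes from spark-shell would be roughly the following (a sketch that assumes the 0.4.x com.uber.hoodie.common.table classes; the base path is a placeholder, and constructor details may differ slightly in your build):

    import scala.collection.JavaConverters._
    import com.uber.hoodie.common.table.HoodieTableMetaClient

    // Placeholder for the actual table base path on S3/HDFS.
    val basePath = "s3://your-bucket/your-hoodie-table"
    val metaClient = new HoodieTableMetaClient(spark.sparkContext.hadoopConfiguration, basePath)
    val timeline = metaClient.getActiveTimeline

    // Dump the raw bytes of every instant so a corrupted (non-Avro) metadata file stands out.
    timeline.getInstants.iterator().asScala.foreach { instant =>
      // Same expression as suggested above; it will throw if an instant has no details.
      println(s"$instant => " + new String(timeline.getInstantDetails(instant).get()))
    }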

@amaranathv your stack trace is different..

Caused by: java.lang.IllegalArgumentException: Can not create a Path from an empty string
        at org.apache.hadoop.fs.Path.checkPathArg(Path.java:130)
        at org.apache.hadoop.fs.Path.<init>(Path.java:138)
        at org.apache.hadoop.fs.Path.<init>(Path.java:92)
        at com.uber.hoodie.table.HoodieMergeOnReadTable.lambda$rollback$5(HoodieMergeOnReadTable.java:510)

Can you print out partitionPath below? It seems like it's null or empty. Please check that your input has valid, non-null partition paths (a sketch for that check follows the snippet).

          writer = HoodieLogFormat.newWriterBuilder().onParentPath(
                new Path(this.getMetaClient().getBasePath(), partitionPath))
                .withFileId(wStat.getFileId()).overBaseCommit(baseCommitTime)
                .withFs(this.metaClient.getFs())
                .withFileExtension(HoodieLogFile.DELTA_EXTENSION).build();
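
For reference, a quick sanity check on the input before writing could look like the sketch below. Here df stands for the DataFrame being written and partitionCol for whatever column is passed as PARTITIONPATH_FIELD_OPT_KEY; an empty or null value there is exactly what produces the "Can not create a Path from an empty string" error:

    import org.apache.spark.sql.functions._

    // Rows whose partition path column is null or blank; these would break the upsert/rollback.
    val badRows = df.filter(col(partitionCol).isNull || trim(col(partitionCol)) === "")
    println(s"rows with null/empty partition path: ${badRows.count()}")
    badRows.show(10, truncate = false)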

@jackwang2
Author

@vinothchandar For the java.util.NoSuchElementException: No value present issue: it no longer reproduced after I changed to another column for partitioning. Anyway, I will try to print the info you suggested and update you.


amarnathv9 commented Jul 11, 2019 via email


amarnathv9 commented Jul 11, 2019

I am getting the same error.

scala> .save("/datalake/888/888/888/hive/warehouse/test_hudi_spark_no_part_1_mor")
19/07/11 12:31:45 WARN TaskSetManager: Lost task 0.0 in stage 304.0 (TID 464, 88888.uhc.com, executor 2): com.uber.hoodie.exception.HoodieUpsertException: Error upserting bucketType UPDATE for partition :0
        at com.uber.hoodie.table.HoodieCopyOnWriteTable.handleUpsertPartition(HoodieCopyOnWriteTable.java:274)
        at com.uber.hoodie.HoodieWriteClient.lambda$upsertRecordsInternal$7ef77fd$1(HoodieWriteClient.java:451)
        at org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
        at org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336)
        at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1055)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029)
        at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969)
        at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029)
        at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760)
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: com.uber.hoodie.exception.HoodieUpsertException: Failed to initialize HoodieAppendHandle for FileId: 951d569b-188d-46e4-ad94-a32525fac797-0 on commit 20190711123144 on HDFS path /datalake/optum/optuminsight/udw/hive/warehouse/test_hudi_spark_no_part_1_mor
        at com.uber.hoodie.io.HoodieAppendHandle.init(HoodieAppendHandle.java:141)
        at com.uber.hoodie.io.HoodieAppendHandle.doAppend(HoodieAppendHandle.java:193)
        at com.uber.hoodie.table.HoodieMergeOnReadTable.handleUpdate(HoodieMergeOnReadTable.java:118)
        at com.uber.hoodie.table.HoodieCopyOnWriteTable.handleUpsertPartition(HoodieCopyOnWriteTable.java:266)
        ... 28 more
Caused by: java.lang.IllegalArgumentException: Can not create a Path from an empty string
        at org.apache.hadoop.fs.Path.checkPathArg(Path.java:130)
        at org.apache.hadoop.fs.Path.<init>(Path.java:138)
        at org.apache.hadoop.fs.Path.<init>(Path.java:92)
        at com.uber.hoodie.io.HoodieAppendHandle.createLogWriter(HoodieAppendHandle.java:277)
        at com.uber.hoodie.io.HoodieAppendHandle.init(HoodieAppendHandle.java:132)
        ... 31 more

19/07/11 12:31:45 ERROR TaskSetManager: Task 0 in stage 304.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 304.0 failed 4 times, most recent failure: Lost task 0.3 in stage 304.0 (TID 467, dbslt1829.uhc.com, executor 2): com.uber.hoodie.exception.HoodieUpsertException: Error upserting bucketType UPDATE for partition :0
        at com.uber.hoodie.table.HoodieCopyOnWriteTable.handleUpsertPartition(HoodieCopyOnWriteTable.java:274)
        at com.uber.hoodie.HoodieWriteClient.lambda$upsertRecordsInternal$7ef77fd$1(HoodieWriteClient.java:451)
        at org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
        at org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336)
        at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1055)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029)
        at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969)
        at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029)
        at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760)
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: com.uber.hoodie.exception.HoodieUpsertException: Failed to initialize HoodieAppendHandle for FileId: 951d569b-188d-46e4-ad94-a32525fac797-0 on commit 20190711123144 on HDFS path /datalake/888/99999/9999/hive/warehouse/test_hudi_spark_no_part_1_mor
        at com.uber.hoodie.io.HoodieAppendHandle.init(HoodieAppendHandle.java:141)
        at com.uber.hoodie.io.HoodieAppendHandle.doAppend(HoodieAppendHandle.java:193)
        at com.uber.hoodie.table.HoodieMergeOnReadTable.handleUpdate(HoodieMergeOnReadTable.java:118)
        at com.uber.hoodie.table.HoodieCopyOnWriteTable.handleUpsertPartition(HoodieCopyOnWriteTable.java:266)
        ... 28 more
Caused by: java.lang.IllegalArgumentException: Can not create a Path from an empty string
        at org.apache.hadoop.fs.Path.checkPathArg(Path.java:130)
        at org.apache.hadoop.fs.Path.<init>(Path.java:138)
        at org.apache.hadoop.fs.Path.<init>(Path.java:92)
        at com.uber.hoodie.io.HoodieAppendHandle.createLogWriter(HoodieAppendHandle.java:277)
        at com.uber.hoodie.io.HoodieAppendHandle.init(HoodieAppendHandle.java:132)
        ... 31 more

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2069)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2094)
  at org.apache.spark.rdd.RDD.count(RDD.scala:1158)
  at com.uber.hoodie.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:149)
  at com.uber.hoodie.DefaultSource.createRelation(DefaultSource.scala:90)
  at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:469)
  at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
  ... 54 elided
Caused by: com.uber.hoodie.exception.HoodieUpsertException: Error upserting bucketType UPDATE for partition :0
  at com.uber.hoodie.table.HoodieCopyOnWriteTable.handleUpsertPartition(HoodieCopyOnWriteTable.java:274)
  at com.uber.hoodie.HoodieWriteClient.lambda$upsertRecordsInternal$7ef77fd$1(HoodieWriteClient.java:451)
  at org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
  at org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
  at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336)
  at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334)
  at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1055)
  at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029)
  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969)
  at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029)
  at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760)
  at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:108)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
Caused by: com.uber.hoodie.exception.HoodieUpsertException: Failed to initialize HoodieAppendHandle for FileId: 951d569b-188d-46e4-ad94-a32525fac797-0 on commit 20190711123144 on HDFS path /datalake/9999/999/999/hive/warehouse/test_hudi_spark_no_part_1_mor
  at com.uber.hoodie.io.HoodieAppendHandle.init(HoodieAppendHandle.java:141)
  at com.uber.hoodie.io.HoodieAppendHandle.doAppend(HoodieAppendHandle.java:193)
  at com.uber.hoodie.table.HoodieMergeOnReadTable.handleUpdate(HoodieMergeOnReadTable.java:118)
  at com.uber.hoodie.table.HoodieCopyOnWriteTable.handleUpsertPartition(HoodieCopyOnWriteTable.java:266)
  ... 28 more
Caused by: java.lang.IllegalArgumentException: Can not create a Path from an empty string
  at org.apache.hadoop.fs.Path.checkPathArg(Path.java:130)
  at org.apache.hadoop.fs.Path.<init>(Path.java:138)
  at org.apache.hadoop.fs.Path.<init>(Path.java:92)
  at com.uber.hoodie.io.HoodieAppendHandle.createLogWriter(HoodieAppendHandle.java:277)
  at com.uber.hoodie.io.HoodieAppendHandle.init(HoodieAppendHandle.java:132)
  ... 31 more

@vinothchandar
Member

@amaranathv just to confirm, you are using NonpartitionedKeyGenerator as the key generator?

@amarnathv9

yes

@amarnathv9

DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY-> "com.uber.hoodie.NonpartitionedKeyGenerator"

@vinothchandar
Member

@amaranathv again, the issue is from this line: .onParentPath(new Path(hoodieTable.getMetaClient().getBasePath(), partitionPath)), where I suspect partitionPath is null. Can you please ensure your dataset contains non-null partition paths? (I am going to add an explicit exception to flag this case in the KeyGenerator subclasses or somewhere else.)
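A rough sketch (not the actual Hudi fix) of the kind of explicit guard described above: fail fast with a clear message when a record yields a null or empty partition path, instead of letting the empty string reach the Path constructor later. The helper name and placement are purely illustrative.

import org.apache.hadoop.fs.Path;

public class PartitionPathGuard {
  // Illustrative helper only; a real fix would live in the key generator or write handle.
  static Path parentPathFor(String basePath, String partitionPath) {
    if (partitionPath == null || partitionPath.trim().isEmpty()) {
      throw new IllegalArgumentException(
          "Record produced a null/empty partition path under base path " + basePath
              + "; check the configured key generator and partition path field");
    }
    return new Path(basePath, partitionPath);
  }
}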

@vinothchandar
Member

To summarize:

@n3nash is looking into the Avro issue, and
@bhasudha is going to try to reproduce the empty-path exception as a ramp-up task.

@bhasudha
Contributor

With PR 775 this issue seems to have been fixed. I was able to reproduce this error before the fix; after applying PR 775 I could not reproduce it anymore. @amaranathv, can you test this PR against the empty-path exception?

@bvaradar
Contributor

@jackwang2 @amaranathv there are three issues here with different stack traces.

@jackwang2 regarding the first issue you reported: is it gone, or are you waiting for the issue below to be fixed in 0.4.7 before trying again?

Caused by: java.util.NoSuchElementException: No value present
	at com.uber.hoodie.common.util.Option.get(Option.java:112)
	at com.uber.hoodie.io.HoodieMergeHandle.(HoodieMergeHandle.java:71)
	at com.uber.hoodie.table.HoodieCopyOnWriteTable.getUpdateHandle(HoodieCopyOnWriteTable.java:226)
	at com.uber.hoodie.table.HoodieCopyOnWriteTable.handleUpdate(HoodieCopyOnWriteTable.java:180)
	at com.uber.hoodie.table.HoodieCopyOnWriteTable.handleUpsertPartition(HoodieCopyOnWriteTable.java:263)
	... 28 more

@jackwang2 second issue (probably the same as what you mentioned on Slack?): are you able to reproduce it? https://github.com/apache/incubator-hudi/blob/hoodie-0.4.7/hoodie-client/src/main/java/com/uber/hoodie/io/HoodieCommitArchiveLog.java#L279 suggests it is related to archiving the CLEAN action. The schema for this has not changed in years, so it looks like some corruption. Are you able to print out new String(commitTimeline.getInstantDetails(hoodieInstant).get()) so we can see the bytes?

Caused by: java.io.IOException: Not an Avro data file
	at org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50)
	at com.uber.hoodie.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:215)
	at com.uber.hoodie.io.HoodieCommitArchiveLog.convertToAvroRecord(HoodieCommitArchiveLog.java:279)
	at com.uber.hoodie.io.HoodieCommitArchiveLog.archive(HoodieCommitArchiveLog.java:247)
	... 41 more
19/06/25 12:37:52 INFO SparkContext: Invoking stop() from shutdown hook
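A hedged sketch of the diagnostic suggested above: dump the size and raw bytes of each CLEAN instant on the active timeline, so a zero-byte or corrupted metadata file shows up immediately. Class and method names follow the com.uber.hoodie 0.4.x code line as I recall it (HoodieTableMetaClient, getActiveTimeline, getInstantDetails) and may differ slightly in your build.

import org.apache.hadoop.conf.Configuration;
import com.uber.hoodie.common.table.HoodieTableMetaClient;
import com.uber.hoodie.common.table.timeline.HoodieActiveTimeline;

public class DumpCleanInstants {
  public static void main(String[] args) {
    String basePath = args[0]; // table base path, e.g. the warehouse path from the trace above
    HoodieTableMetaClient metaClient = new HoodieTableMetaClient(new Configuration(), basePath);
    HoodieActiveTimeline timeline = metaClient.getActiveTimeline();
    timeline.getInstants()
        .filter(instant -> "clean".equals(instant.getAction()))
        .forEach(instant -> {
          byte[] bytes = timeline.getInstantDetails(instant).get();
          System.out.println(instant + " -> " + bytes.length + " bytes");
          System.out.println(new String(bytes));
        });
  }
}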

@amaranathv your stack trace is different:

Caused by: java.lang.IllegalArgumentException: Can not create a Path from an empty string
        at org.apache.hadoop.fs.Path.checkPathArg(Path.java:130)
        at org.apache.hadoop.fs.Path.<init>(Path.java:138)
        at org.apache.hadoop.fs.Path.<init>(Path.java:92)
        at com.uber.hoodie.table.HoodieMergeOnReadTable.lambda$rollback$5(HoodieMergeOnReadTable.java:510)

Can you print out partitionPath below? It seems like it's null. Please check if your input has valid non-null partition paths.

          writer = HoodieLogFormat.newWriterBuilder()
              .onParentPath(new Path(this.getMetaClient().getBasePath(), partitionPath))
              .withFileId(wStat.getFileId())
              .overBaseCommit(baseCommitTime)
              .withFs(this.metaClient.getFs())
              .withFileExtension(HoodieLogFile.DELTA_EXTENSION)
              .build();
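As a quick sanity check on the input side (an illustrative sketch, not part of Hudi): count the rows whose partition-path column is null or empty before writing, since those are what end up producing the empty Path above. "partition_col" is a placeholder for whatever hoodie.datasource.write.partitionpath.field is set to.

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.trim;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class PartitionPathCheck {
  // Returns how many input rows would yield a null/empty partition path.
  public static long badPartitionRows(Dataset<Row> input) {
    return input.filter(
        col("partition_col").isNull()
            .or(trim(col("partition_col")).equalTo(""))).count();
  }
}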

@jackwang2: Are you still seeing this issue?

@bvaradar bvaradar self-assigned this Jul 18, 2019
@amarnathv9

I am still working on the performance side of copy-on-write. I will do the testing again after the performance tests are complete.

@n3nash
Contributor

n3nash commented Aug 1, 2019

It looks like the "Not an Avro data file" exception is thrown when there is a 0 byte stream read into the datafilereader as can be seen here : https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/file/DataFileReader.java#L55 and here : https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/file/DataFileConstants.java#L29

From the stack trace (by tracing the line numbers), it looks like the CLEAN file is failing to be archived. I looked at the clean logic: we do create clean files even when there is nothing to clean, but that does not result in a 0-byte file; it still contains some valid Avro data. Although we should stop creating a clean file when there is nothing to clean, that still doesn't lead to this error. I'm wondering whether this is some sort of race condition where archiving runs while the clean file is still zero-sized.
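A hedged sketch of how one might confirm that hypothesis: list the table's .hoodie metadata folder and flag any zero-byte .clean files. The ".clean" suffix check is an assumption about how clean metadata files are named in this version.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FindEmptyCleanFiles {
  public static void main(String[] args) throws Exception {
    Path metaDir = new Path(args[0], ".hoodie"); // args[0] = table base path
    FileSystem fs = metaDir.getFileSystem(new Configuration());
    for (FileStatus status : fs.listStatus(metaDir)) {
      if (status.getPath().getName().endsWith(".clean") && status.getLen() == 0) {
        System.out.println("Zero-byte clean file: " + status.getPath());
      }
    }
  }
}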

@jackwang2 How are you running the cleaner and the archival process? Are you explicitly doing anything there?

@jackwang2
Author

jackwang2 commented Aug 1, 2019 via email

@vinothchandar vinothchandar assigned n3nash and unassigned bvaradar Aug 1, 2019
@vinothchandar
Member

@jackwang2 the biggest issue here is that we could not reproduce the error ourselves. Are you able to provide a reproducible case? Is it easy to repro?

@smdahmed

smdahmed commented Oct 17, 2019

I can confirm that I have been hit by this error. It occurs in two tables that have no similarity in their data-ingestion patterns or the sizes of the data they hold.

As @n3nash has mentioned, there are clean files of size 0 bytes. Deleting the file manually clears the problem for the table (albeit with the first attempt failing and things working from the second attempt onwards).

I have tried to reproduce the issue, without any luck so far, and I am still trying to replicate it. If anyone, especially @jackwang2, knows what caused this and has fixed it in their pipeline, I would be very grateful. Any insights are hugely welcome.

The stack trace is as below:

Exception in thread "main" com.uber.hoodie.exception.HoodieCommitException: Failed to archive commits
Caused by: java.io.IOException: Not an Avro data file

@bvaradar
Contributor

@smdahmed: I looked at the code to see how this could happen, but it is not clear. Assuming you are using S3, have you tried setting the consistency guard (https://hudi.apache.org/configurations.html#withConsistencyCheckEnabled)?
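A minimal sketch of what enabling that check on the write path might look like, using the hoodie.consistency.check.enabled key that also appears in the config posted later in this thread. The format name ("com.uber.hoodie" for the 0.4.x line, "org.apache.hudi" for later releases) and the omitted record-key/precombine options depend on your setup.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class WriteWithConsistencyCheck {
  // df and tableBasePath are supplied by the caller; the usual record key,
  // precombine, and table name options are omitted here for brevity.
  public static void write(Dataset<Row> df, String tableBasePath) {
    df.write()
        .format("com.uber.hoodie")
        .option("hoodie.consistency.check.enabled", "true")
        .mode(SaveMode.Append)
        .save(tableBasePath);
  }
}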

@smdahmed

@bvaradar

Balaji, sincere apologies for not getting back earlier. The fact is that this happened only once and affected only two tables. I have reset the tables and, since then, the issue has never resurfaced. I checked the S3 logs etc. to see if there was any abnormal activity at the platform/infra level, but there is nothing that we could find.

I shall wait and see, and in the next deployment I shall enable the consistency guard option that you recommend. Thanks for all the help.

@vinothchandar
Member

Closing due to inactivity.

@mingujotemp

Experiencing the exact same issue using Hudi + Glue on EMR.

@mingujotemp

mingujotemp commented Aug 3, 2020

Here's my config:

hudi_options = {
  'hoodie.table.name': tableName,
  'hoodie.datasource.write.recordkey.field': 'id',
#   'hoodie.index.class': '',
  'hoodie.index.type': 'GLOBAL_BLOOM',
  'hoodie.datasource.write.partitionpath.field': 'partition_test',
#   'hoodie.datasource.write.partitionpath.field': '',
#   'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.NonpartitionedKeyGenerator',
  'hoodie.datasource.write.table.name': tableName,
  'hoodie.datasource.write.operation': 'upsert',
  'hoodie.datasource.write.precombine.field': 'updated_at',
  'hoodie.upsert.shuffle.parallelism': 2, 
  'hoodie.insert.shuffle.parallelism': 2,
  'hoodie.bulkinsert.shuffle.parallelism': 10,
  'hoodie.datasource.hive_sync.database': 'raw_staging',
  'hoodie.datasource.hive_sync.table': tableName,
  'hoodie.datasource.hive_sync.enable': 'true',
  'hoodie.datasource.hive_sync.assume_date_partitioning': 'false',
#   'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.NonPartitionedExtractor',
  'hoodie.datasource.hive_sync.partition_fields': 'partition_test',
  'hoodie.combine.before.insert': 'true',
  'hoodie.combine.before.upsert': 'true',
  'hoodie.consistency.check.enabled': 'true',
  'hoodie.bloom.index.update.partition.path': 'true',
}

@bvaradar
Contributor

bvaradar commented Aug 3, 2020

@mingujotemp: This is an old ticket, possibly using a different version of Hudi. Can you kindly open a new ticket with the Hudi version, symptoms, and steps to reproduce, after looking at other open tickets?
