
[FIXED] Presto cannot query hudi table #1329

Closed
popart opened this issue Feb 12, 2020 · 6 comments
popart commented Feb 12, 2020

Describe the problem you faced

I made a non-partitioned Hudi table using Spark. I was able to query it with Spark & Hive, but when I tried querying it with Presto, I received the error Could not find partitionDepth in partition metafile.

To Reproduce

Steps to reproduce the behavior:

  1. Use an emr-5.28.0 cluster
  2. Run spark-shell:
spark-shell --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --deploy-mode client
  3. Run the Spark code:
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.hudi.hive._
import org.apache.hudi.keygen.NonpartitionedKeyGenerator

val inputPath = "s3://path/to/a/parquet/file"
val tableName = "my_test_table"
val basePath = "s3://test-bucket/my_test_table"

val inputDf = spark.read.parquet(inputPath)

val hudiOptions = Map[String,String](
    RECORDKEY_FIELD_OPT_KEY -> "dim_advertiser_id",
    PRECOMBINE_FIELD_OPT_KEY -> "update_time",
    TABLE_NAME -> tableName,
    KEYGENERATOR_CLASS_OPT_KEY -> classOf[NonpartitionedKeyGenerator].getCanonicalName, // needed for non-partitioned table
    HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> classOf[NonPartitionedExtractor].getCanonicalName, // needed for non-partitioned table
    OPERATION_OPT_KEY -> BULK_INSERT_OPERATION_OPT_VAL,
    HIVE_SYNC_ENABLED_OPT_KEY -> "true",
    HIVE_TABLE_OPT_KEY -> tableName,
    TABLE_TYPE_OPT_KEY -> COW_TABLE_TYPE_OPT_VAL,
    "hoodie.bulkinsert.shuffle.parallelism" -> "10")

inputDf.write.format("org.apache.hudi").
    options(hudiOptions).
    mode(Overwrite).
    save(basePath)
  4. Querying the table in Spark or Hive both works
  5. Querying the table in Presto fails
[hadoop@ip-172-31-128-118 ~]$ presto-cli --catalog hive --schema default
presto:default> select count(*) from my_test_table;

Query 20200211_185123_00018_pruwt, FAILED, 1 node
Splits: 17 total, 0 done (0.00%)
0:02 [0 rows, 0B] [0 rows/s, 0B/s]

Query 20200211_185123_00018_pruwt failed: Could not find partitionDepth in partition metafile
com.facebook.presto.spi.PrestoException: Could not find partitionDepth in partition metafile
  at com.facebook.presto.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:200)
  at com.facebook.presto.hive.util.ResumableTasks.safeProcessTask(ResumableTasks.java:47)
  at com.facebook.presto.hive.util.ResumableTasks.access$000(ResumableTasks.java:20)
  at com.facebook.presto.hive.util.ResumableTasks$1.run(ResumableTasks.java:35)
  at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:78)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hudi.exception.HoodieException: Could not find partitionDepth in partition metafile
  at org.apache.hudi.common.model.HoodiePartitionMetadata.getPartitionDepth(HoodiePartitionMetadata.java:75)
  at org.apache.hudi.hadoop.HoodieParquetInputFormat.getTableMetaClient(HoodieParquetInputFormat.java:209)
  at org.apache.hudi.hadoop.HoodieParquetInputFormat.groupFileStatus(HoodieParquetInputFormat.java:158)
  at org.apache.hudi.hadoop.HoodieParquetInputFormat.listStatus(HoodieParquetInputFormat.java:69)
  at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:288)
  at com.facebook.presto.hive.BackgroundHiveSplitLoader.loadPartition(BackgroundHiveSplitLoader.java:371)
  at com.facebook.presto.hive.BackgroundHiveSplitLoader.loadSplits(BackgroundHiveSplitLoader.java:264)
  at com.facebook.presto.hive.BackgroundHiveSplitLoader.access$300(BackgroundHiveSplitLoader.java:96)
  at com.facebook.presto.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:193)
  ... 7 more

Expected behavior

Presto should return a count of all the rows. Other Presto queries should succeed.

Environment Description

  • EMR version: emr-5.28.0

  • Hudi version : 0.5.1-incubating, 0.5.0-incubating

  • Spark version : 2.4.4

  • Hive version : 2.3.6

  • Hadoop version : 2.8.5

  • Presto version: 0.227

  • Storage (HDFS/S3/GCS..) : S3

  • Running on Docker? (yes/no) : no

Stacktrace

Included in "Steps to reproduce".

Additional Info
When I used one of the columns as a partition column, I was able to query the table in Spark using spark.read.format("org.apache.hudi").load(basePath + "/*"). However, querying it in Hive resulted in:

Status: Failed
Vertex failed, vertexName=Map 1, vertexId=vertex_1580774559033_0082_2_00, diagnostics=[Vertex vertex_1580774559033_0082_2_00 [Map 1] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: my_test_table initializer failed, vertex=vertex_1580774559033_0082_2_00 [Map 1], java.lang.NullPointerException
        at org.apache.hudi.hadoop.HoodieHiveUtil.getNthParent(HoodieHiveUtil.java:66)
        at org.apache.hudi.hadoop.HoodieParquetInputFormat.getTableMetaClient(HoodieParquetInputFormat.java:313)
        at org.apache.hudi.hadoop.InputPathHandler.parseInputPaths(InputPathHandler.java:98)
        at org.apache.hudi.hadoop.InputPathHandler.<init>(InputPathHandler.java:58)
        at org.apache.hudi.hadoop.HoodieParquetInputFormat.listStatus(HoodieParquetInputFormat.java:71)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:288)
        at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:442)
        at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:561)
        at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:196)
        at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
        at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
        at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
        at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
]
Vertex killed, vertexName=Reducer 2, vertexId=vertex_1580774559033_0082_2_01, diagnostics=[Vertex received Kill in INITED state., Vertex vertex_1580774559033_0082_2_01 [Reducer 2] killed/failed due to:OTHER_VERTEX_FAILURE]
DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:1
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask.
(The failure message repeats the same java.lang.NullPointerException stack trace and vertex summary shown above.)

Querying it in presto-cli returned 0 rows.

@vinothchandar
Member

@bhasudha can you please help out here

@bhasudha
Contributor

@popart The stack trace suggests that somehow a partition metafile (".hoodie_partition_metadata") was created in the table path. When this file is present, Hudi reads the partition depth from it by searching for the key "partitionDepth" in that metafile; the exception "Could not find partitionDepth in partition metafile" is thrown when that key cannot be found. Can you quickly check whether this partition metafile is present in your table base path? If it is, we need to dig into why it is created even though you chose a non-partitioned table.
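For reference, that metafile is a small Java-properties file. The sketch below shows a hedged way to eyeball it; the exact key/value layout is an assumption based on HoodiePartitionMetadata, and the S3 path in the last comment is the one from the repro above:

```shell
# Simulated .hoodie_partition_metadata contents (assumed layout: a Java
# properties file carrying a commitTime and a partitionDepth entry).
cat > metafile.properties <<'EOF'
#partition metadata
commitTime=20200211185123
partitionDepth=1
EOF

# "partitionDepth" is the key getPartitionDepth() looks up. If grep finds
# nothing here (e.g. the file is ciphertext because the reader lacks the
# table's client-side encryption config), Hudi raises
# "Could not find partitionDepth in partition metafile".
grep partitionDepth metafile.properties

# Against the real table you would fetch the file from S3 first, e.g.:
# aws s3 cp s3://test-bucket/my_test_table/.hoodie_partition_metadata - | grep partitionDepth
```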

Also, just wanted to check: did you mean Presto version 0.227? And from the stack trace you posted specific to Presto, it looks like this is coming from the earlier Hudi version, 0.5.0-incubating. Can you confirm whether that is true, for further debugging? Do you mind creating a Jira issue with these details?

@popart
Author

popart commented Feb 14, 2020

Hi Bhavani! Thank you for taking a look. I filed https://issues.apache.org/jira/browse/HUDI-614.

Correct, the Presto version is 0.227.

I tried running the spark-shell with both the Hudi 0.5.0 and the 0.5.1 jars, but got the same result. The EMR version has Hudi 0.5.0 installed, and I didn't specify anything different when running presto-cli, so I'd assume Presto is using the 0.5.0 version.

I do see the .hoodie_partition_metadata file in my S3 table path.

@vinothchandar
Member

@bhasudha let's look at this more closely and confirm what's going on here. This stack trace indicates just ipf.getSplits() being called, so it's general code, and we do have tests around querying non-partitioned tables. We need to reproduce this in the docker setup or similar and go from there.

@popart
Author

popart commented Feb 24, 2020

Update: This problem does not occur in the docker environment. In the docker demo env, I was able to create a non-partitioned table in Spark (saved to hdfs), use run_sync_tool.sh to sync it to hive, and then query it successfully from presto. (It still made the .hoodie_partition_metadata file though).

@popart
Author

popart commented Feb 26, 2020

I found the problem. We had client-side encryption configured for Spark & Hive using EMRFS, but not for Presto.
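For anyone else hitting this: EMRFS client-side encryption on EMR is set through the `emrfs-site` configuration classification, and every engine that reads the table needs it in effect. A minimal sketch, assuming CSE with a custom materials provider; the provider class name below is a placeholder, not something from this issue:

```json
[
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.cse.enabled": "true",
      "fs.s3.cse.encryptionMaterialsProvider": "com.example.MyEncryptionMaterialsProvider"
    }
  }
]
```

When an engine reads S3 without this configuration, it fetches the raw ciphertext of `.hoodie_partition_metadata`, which would explain why the partitionDepth key could not be found.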

@popart popart closed this as completed Feb 26, 2020
@popart popart changed the title [SUPPORT] Presto cannot query non-partitioned table [FIXED] Presto cannot query hudi table Feb 26, 2020