[SUPPORT] Unable to query Partitioned COW Hudi tables with metadata enabled using Trino-Hudi Connector #7583

Open
codope opened this issue Dec 30, 2022 · 2 comments
Assignees: codope
Labels: priority:major (degraded perf; unable to move forward; potential bugs), query-engine (trino, presto, athena, impala, etc)

codope commented Dec 30, 2022

Describe the problem you faced
Original issue: trinodb/trino#15368

Our team is testing the same on COPY_ON_WRITE Hudi (0.10.1) tables with metadata enabled, using Trino 400, and we are facing the following error while reading from partitioned tables:
Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex

The issue was resolved by placing some dependencies in the classpath. Interestingly, those dependencies are already included in the hudi-trino-bundle. This issue tracks any gap in packaging.
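
One way to look for a packaging gap is to scan the jars in the connector's plugin directory for the class named in the error. The sketch below is a hypothetical helper, not part of Trino or Hudi, and the directory you pass in is an assumption about your install layout. Note that "Could not initialize class" usually means the class itself was found but its static initializer failed earlier, so the class being present does not rule out a missing transitive dependency.

```python
import zipfile
from pathlib import Path

# Class named in the error, written as a jar entry path.
TARGET = "org/apache/hudi/common/bootstrap/index/HFileBootstrapIndex.class"


def jars_containing(plugin_dir, entry=TARGET):
    """Return the names of jars directly under plugin_dir that contain `entry`."""
    hits = []
    for jar in sorted(Path(plugin_dir).glob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as zf:
                if entry in zf.namelist():
                    hits.append(jar.name)
        except zipfile.BadZipFile:
            # Skip files that are not valid jar/zip archives.
            continue
    return hits
```

Calling `jars_containing("<trino_install_dir>/plugin/hudi")` lists the jars that ship the class. An empty list suggests the class is not packaged at all, while more than one hit may point to conflicting copies.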

To Reproduce

Steps to reproduce the behavior:

  1. Write a Hudi COW table with the below properties and metadata enabled.
  2. Query the same table using the trino-hudi connector (properties mentioned below) with hudi.metadata-enabled=true.

Trino Hudi Connector Properties:

connector.name=hudi
hive.metastore.uri={METASTORE_URI}
hive.s3.iam-role={S3_IAM_ROLE}
hive.metastore-refresh-interval=2m
hive.metastore-timeout=3m
hudi.max-outstanding-splits=1800
hive.s3.max-error-retries=50
hive.s3.connect-timeout=1m
hive.s3.socket-timeout=2m
hudi.parquet.use-column-names=true
hudi.metadata-enabled=true

Hudi Properties set while writing:

hoodie.datasource.write.partitionpath.field = "insert_ds_ist",
hoodie.datasource.write.recordkey.field = "id",
hoodie.datasource.write.precombine.field = "_hoodie_incremental_key",  (self-generated column)
hoodie.datasource.write.hive_style_partitioning = "true",
hoodie.datasource.hive_sync.auto_create_database = "true",
hoodie.parquet.compression.codec = "gzip",
hoodie.table.name = "<table_name>",
hoodie.datasource.write.keygenerator.class = "org.apache.hudi.keygen.SimpleKeyGenerator",
hoodie.datasource.write.table.type = "COPY_ON_WRITE",
hoodie.metadata.enable = "true",
hoodie.datasource.hive_sync.enable = "true",
hoodie.datasource.hive_sync.partition_fields = "insert_ds_ist",
hoodie.datasource.hive_sync.partition_extractor_class = "org.apache.hudi.hive.MultiPartKeysValueExtractor"
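
For anyone reproducing this, the write options above could be passed to a Spark DataFrame writer roughly as follows. This is only a sketch: `df`, `table_name`, and `base_path` are placeholders, the save mode is an assumption (the issue does not state it), and the Hudi Spark bundle is assumed to be on the Spark classpath.

```python
def hudi_write_options(table_name):
    """Build the Hudi write options listed in this issue as a dict."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
        "hoodie.datasource.write.recordkey.field": "id",
        "hoodie.datasource.write.partitionpath.field": "insert_ds_ist",
        "hoodie.datasource.write.precombine.field": "_hoodie_incremental_key",
        "hoodie.datasource.write.hive_style_partitioning": "true",
        "hoodie.datasource.write.keygenerator.class":
            "org.apache.hudi.keygen.SimpleKeyGenerator",
        "hoodie.parquet.compression.codec": "gzip",
        "hoodie.metadata.enable": "true",
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.auto_create_database": "true",
        "hoodie.datasource.hive_sync.partition_fields": "insert_ds_ist",
        "hoodie.datasource.hive_sync.partition_extractor_class":
            "org.apache.hudi.hive.MultiPartKeysValueExtractor",
    }


def write_hudi(df, table_name, base_path):
    """Write a Spark DataFrame `df` to `base_path` as a Hudi table."""
    (
        df.write.format("hudi")
        .options(**hudi_write_options(table_name))
        .mode("append")  # assumption: the save mode is not given in the issue
        .save(base_path)
    )
```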

General information of table:
Total rows = 1,213,959,199
Total Partitions = 2400+
Total file objects = 120,000
Total Size on S3 = 12~13 GB
The table was upgraded from 0.9.0 to 0.10.1

Coordinator Relevant Logs:

Expected behavior

The query should work out of the box without having to place jars in the classpath.

Environment Description

  • Hudi version : 0.10.1

  • Spark version : 2.4

  • Trino version : 400

  • Hadoop version :

  • Storage (HDFS/S3/GCS..) : S3

  • Running on Docker? (yes/no) : no


Stacktrace

Full stacktrace in Partitioned_COW_Hudi_Coordinator_logs.log

@codope codope added priority:major degraded perf; unable to move forward; potential bugs query-engine trino, presto, athena, impala, etc labels Dec 30, 2022
@codope codope self-assigned this Dec 30, 2022

codope commented Dec 30, 2022

The trino-hudi module adds hudi-common, hudi-hadoop-mr, and hudi-client-common individually. Instead, we should consider replacing these three dependencies with the hudi-trino-bundle.


codope commented Dec 30, 2022

The current workaround is to add the hudi-trino-bundle jar to the plugin path (<trino_install_dir>/plugin/hudi).
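
The workaround amounts to copying the bundle jar next to the connector's other jars. Here is a minimal sketch, where `install_bundle` is a hypothetical helper and both paths are placeholders you supply. Trino loads plugins at startup, so a restart is typically needed afterwards, and the copy likely needs to be repeated on every node.

```python
import shutil
from pathlib import Path


def install_bundle(bundle_jar, trino_install_dir):
    """Copy the bundle jar into <trino_install_dir>/plugin/hudi; return the new path."""
    plugin_dir = Path(trino_install_dir) / "plugin" / "hudi"
    if not plugin_dir.is_dir():
        raise FileNotFoundError(f"Hudi plugin directory not found: {plugin_dir}")
    target = plugin_dir / Path(bundle_jar).name
    shutil.copy2(bundle_jar, target)  # preserves file metadata such as mtime
    return target
```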

Projects
Status: 🚧 Needs Repro