Description
Problem
Parquet files of very small size (around 1 MB) are being generated after running DeltaStreamer in BULK_INSERT mode, which results in a large number of files under the same partition. We want larger files for optimal processing.
We have tried various configurations as well.
The following configuration worked when we ran DeltaStreamer in UPSERT mode: it started generating relatively larger parquet files of around 5-6 MB. We don't know why 5-6 MB; shouldn't the files have been sized according to the values given below? In BULK_INSERT mode, it does not work at all.
hoodie.parquet.max.file.size=2147483648
hoodie.parquet.small.file.limit=1073741824
I tried passing the following configs as well.
hoodie.parquet.max.file.size=2147483648
hoodie.parquet.small.file.limit=1073741824
hoodie.copyonwrite.record.size.estimate=150
hoodie.memory.merge.max.size=2004857600000
hoodie.insert.shuffle.parallelism=2000
hoodie.upsert.shuffle.parallelism=2000
hoodie.copyonwrite.insert.split.size=1000000
hoodie.cleaner.policy=KEEP_LATEST_COMMITS
hoodie.cleaner.commits.retained=4
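One likely reason the configs above have no effect in BULK_INSERT mode: to my understanding, Hudi's small-file handling (hoodie.parquet.small.file.limit) applies to upsert/insert paths, while bulk_insert file count is driven mainly by the write parallelism, with roughly one file per shuffle partition per Hudi partition. The sketch below is a hedged illustration of that arithmetic, not a value taken from this setup; the helper name and the 120 MB target are my own assumptions.

```python
# Hedged sketch: under BULK_INSERT, output file size is governed (roughly)
# by input size / hoodie.bulkinsert.shuffle.parallelism, not by
# hoodie.parquet.small.file.limit. Estimate a parallelism value so each
# task writes approximately one target-sized file.
import math

def bulk_insert_parallelism(total_input_bytes: int,
                            target_file_bytes: int = 120 * 1024 * 1024) -> int:
    """Rough shuffle parallelism so each task writes ~one target-sized file."""
    return max(1, math.ceil(total_input_bytes / target_file_bytes))

# e.g. ~60 GB of input aiming for ~120 MB files -> 512 tasks, ~512 files
print(bulk_insert_parallelism(60 * 1024**3))  # -> 512
```

With a large value like hoodie.insert.shuffle.parallelism=2000 on a modest input volume, ~1 MB files per partition would be the expected outcome.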
Expected behavior
It should work for the BULK_INSERT operation as well. And for upserts, it should honor the values provided in the config; instead it always produces 5-6 MB parquet files irrespective of the values given.
Environment Description
- AWS EMR: emr-6.8.0
- Hudi version: 0.11.1
- Spark version: 3.3.0
- Storage (HDFS/S3/GCS..): S3
- Running on Docker? (yes/no): no (EMR cluster)
Additional context
I tried clustering to merge the smaller files created by bulk insert.
Command:
spark-submit \
--master local[4] \
--class org.apache.hudi.utilities.HoodieClusteringJob \
s3://<my-bucket>/hudi-utilities-bundle_2.12-0.11.1-amzn-0.jar \
--base-path s3://<my-bucket>/<my-table-name> \
--instant-time 20210122190240 \
--table-name <my-table-name> \
--props s3://<my-bucket>/clusteringjob.properties \
--spark-memory 1g
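For reference, a minimal clusteringjob.properties controlling output file sizes might look like the sketch below. The keys are my assumption based on Hudi's clustering plan strategy configs; verify them against the config reference for your Hudi version, and the byte values are illustrative only:

```properties
# Hedged sketch (assumed keys, illustrative values):
# target size for files produced by clustering (~1 GB)
hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824
# files below this size (~300 MB) are candidates for clustering
hoodie.clustering.plan.strategy.small.file.limit=314572800
```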
Stacktrace
23/02/09 10:43:55 INFO Javalin: Stopping Javalin ...
23/02/09 10:43:55 INFO Javalin: Javalin has stopped
23/02/09 10:43:55 ERROR UtilHelpers: Cluster failed
org.apache.hudi.exception.HoodieIOException: Could not read commit details from s3://<my-bucket>/<db-name>/<my-table-name>/.hoodie/20210122190240.replacecommit.requested
at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.readDataFromPath(HoodieActiveTimeline.java:763) ~[hudi-utilities-bundle_2.12-0.11.1-amzn-0.jar:?]
at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.getInstantDetails(HoodieActiveTimeline.java:264) ~[hudi-utilities-bundle_2.12-0.11.1-amzn-0.jar:?]
at org.apache.hudi.common.util.ClusteringUtils.getRequestedReplaceMetadata(ClusteringUtils.java:90) ~[hudi-utilities-bundle_2.12-0.11.1-amzn-0.jar:?]
at org.apache.hudi.common.util.ClusteringUtils.getClusteringPlan(ClusteringUtils.java:106) ~[hudi-utilities-bundle_2.12-0.11.1-amzn-0.jar:?]
at org.apache.hudi.table.action.cluster.SparkExecuteClusteringCommitActionExecutor.<init>(SparkExecuteClusteringCommitActionExecutor.java:45) ~[hudi-utilities-bundle_2.12-0.11.1-amzn-0.jar:?]
Query: How does this work, and what is the ideal approach to handle it? Also, what am I missing in clustering, since it is not working for me?
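Regarding the stacktrace: the job fails because there is no 20210122190240.replacecommit.requested file on the table's timeline, i.e. no clustering plan was ever scheduled for the hard-coded --instant-time that was passed. A hedged sketch of an alternative invocation is below; it assumes the HoodieClusteringJob in this 0.11.1 bundle supports --mode scheduleAndExecute, which lets the job schedule its own clustering instant and then run it:

```shell
# Hedged sketch: schedule a clustering plan and execute it in one run,
# instead of passing an --instant-time that does not exist on the timeline.
# (--mode scheduleAndExecute is an assumption; check spark-submit --help
# output for HoodieClusteringJob in your bundle.)
spark-submit \
  --master local[4] \
  --class org.apache.hudi.utilities.HoodieClusteringJob \
  s3://<my-bucket>/hudi-utilities-bundle_2.12-0.11.1-amzn-0.jar \
  --mode scheduleAndExecute \
  --base-path s3://<my-bucket>/<my-table-name> \
  --table-name <my-table-name> \
  --props s3://<my-bucket>/clusteringjob.properties \
  --spark-memory 1g
```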