[SUPPORT] Parquet file size is small after running deltastreamer in BULK_INSERT which results in large number of files under same partitioning #8017

@ROOBALJINDAL

Description

Problem

Parquet files are being generated at a small size (roughly 1 MB) after running DeltaStreamer in BULK_INSERT mode, which results in a large number of files under the same partition. We want larger files for optimal processing.

We have tried various configurations as well.

The following configuration worked when we ran DeltaStreamer in UPSERT mode: it started generating relatively larger parquet files of around 5-6 MB. We don't know why only 5-6 MB; shouldn't it have sized the files according to the values given below? In BULK_INSERT mode, however, it doesn't work at all.

hoodie.parquet.max.file.size=2147483648
hoodie.parquet.small.file.limit=1073741824

I also tried passing the following configs.

hoodie.parquet.max.file.size=2147483648
hoodie.parquet.small.file.limit=1073741824
hoodie.copyonwrite.record.size.estimate=150
hoodie.memory.merge.max.size=2004857600000
hoodie.insert.shuffle.parallelism=2000
hoodie.upsert.shuffle.parallelism=2000
hoodie.copyonwrite.insert.split.size=1000000
hoodie.cleaner.policy=KEEP_LATEST_COMMITS
hoodie.cleaner.commits.retained=4
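
As I understand it (this is an assumption based on Hudi 0.11.x behavior, not something confirmed in this thread), BULK_INSERT skips small-file handling entirely, so hoodie.parquet.small.file.limit has no effect there and the output file count is driven mainly by the write parallelism. A minimal sketch of the configs that would matter in that case:

```properties
# Sketch, assuming Hudi 0.11.x: BULK_INSERT does not bin-pack into existing
# small files, so file size is roughly (input size / number of write tasks).
hoodie.bulkinsert.shuffle.parallelism=20   # fewer tasks -> fewer, larger files
hoodie.bulkinsert.sort.mode=GLOBAL_SORT    # default; NONE avoids the sort shuffle
```

With parallelism 2000 (as above) and a modest input volume, ~1 MB files would be the expected outcome, so lowering the bulk-insert parallelism is the first thing to try.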

Expected behavior
File sizing should work for the bulk insert operation as well. And for upserts, it should honor the configured values: we always get 5-6 MB parquet files regardless of the values given.

Environment Description

  • AWS EMR : emr-6.8.0

  • Hudi version : 0.11.1

  • Spark version : 3.3.0

  • Storage (HDFS/S3/GCS..) : S3

  • Running on Docker? (yes/no) : EMR cluster

Additional context

I tried clustering to merge the smaller files created by bulk insert.

Command:

spark-submit \
--master local[4] \
--class org.apache.hudi.utilities.HoodieClusteringJob \
s3://<my-bucket>/hudi-utilities-bundle_2.12-0.11.1-amzn-0.jar \
--base-path s3://<my-bucket>/<my-table-name> \
--instant-time 20210122190240 \
--table-name <my-table-name> \
--props s3://<my-bucket>/clusteringjob.properties \
--spark-memory 1g
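
For reference, a hypothetical clusteringjob.properties sketch (the keys exist in Hudi 0.11, but the values here are illustrative assumptions, not taken from this issue):

```properties
# Clustering plan strategy: which files qualify and how big the outputs should be.
hoodie.clustering.plan.strategy.small.file.limit=314572800        # files under ~300 MB are clustering candidates
hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824  # target ~1 GB output files
```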

Stacktrace

23/02/09 10:43:55 INFO Javalin: Stopping Javalin ...
23/02/09 10:43:55 INFO Javalin: Javalin has stopped
23/02/09 10:43:55 ERROR UtilHelpers: Cluster failed
org.apache.hudi.exception.HoodieIOException: Could not read commit details from s3://<my-bucket>/<db-name>/<my-table-name>/.hoodie/20210122190240.replacecommit.requested
        at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.readDataFromPath(HoodieActiveTimeline.java:763) ~[hudi-utilities-bundle_2.12-0.11.1-amzn-0.jar:?]
        at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.getInstantDetails(HoodieActiveTimeline.java:264) ~[hudi-utilities-bundle_2.12-0.11.1-amzn-0.jar:?]
        at org.apache.hudi.common.util.ClusteringUtils.getRequestedReplaceMetadata(ClusteringUtils.java:90) ~[hudi-utilities-bundle_2.12-0.11.1-amzn-0.jar:?]
        at org.apache.hudi.common.util.ClusteringUtils.getClusteringPlan(ClusteringUtils.java:106) ~[hudi-utilities-bundle_2.12-0.11.1-amzn-0.jar:?]
        at org.apache.hudi.table.action.cluster.SparkExecuteClusteringCommitActionExecutor.<init>(SparkExecuteClusteringCommitActionExecutor.java:45) ~[hudi-utilities-bundle_2.12-0.11.1-amzn-0.jar:?]
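
A likely reading of this error (an assumption on my part, not confirmed in the thread): the `.replacecommit.requested` file does not exist because no clustering plan was ever scheduled at instant `20210122190240`, so executing against that instant fails. A sketch of scheduling and executing in one run, assuming the HoodieClusteringJob `--mode` flag available in Hudi 0.11.x:

```sh
# Sketch: scheduling creates the <instant>.replacecommit.requested file on the
# timeline; "scheduleAndExecute" then runs the plan without needing to pass
# an instant time by hand.
spark-submit \
  --class org.apache.hudi.utilities.HoodieClusteringJob \
  s3://<my-bucket>/hudi-utilities-bundle_2.12-0.11.1-amzn-0.jar \
  --mode scheduleAndExecute \
  --base-path s3://<my-bucket>/<my-table-name> \
  --table-name <my-table-name> \
  --props s3://<my-bucket>/clusteringjob.properties \
  --spark-memory 1g
```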

Query: How does this work, and what is the ideal approach to handle it? And what am I missing in clustering, since it isn't working for me?
